Reverse-Bayes methods for evidence assessment and research synthesis
Reverse-Bayes methods: a review of recent technical advances
Leonhard Held ∗ , § , Robert Matthews † , § , Manuela Ott ∗ , ‡ , and Samuel Pawel ∗ , §

Abstract.
It is now widely accepted that the standard inferential toolkit used by the scientific research community – null-hypothesis significance testing (NHST) – is not fit for purpose. Yet despite the threat posed to the scientific enterprise, there is no agreement concerning alternative approaches. This lack of consensus reflects long-standing issues concerning Bayesian methods, the principal alternative to NHST. We report on recent work that builds on an approach to inference put forward over 70 years ago to address the well-known “Problem of Priors” in Bayesian analysis, by reversing the conventional prior-likelihood-posterior (“forward”) use of Bayes’s Theorem. Such Reverse-Bayes analysis allows priors to be deduced from the likelihood by requiring that the posterior achieve a specified level of credibility. We summarise the technical underpinning of this approach, and show how it opens up new approaches to common inferential challenges, such as assessing the credibility of scientific findings, setting them in appropriate context, estimating the probability of successful replications, and extracting more insight from NHST while reducing the risk of misinterpretation. We argue that Reverse-Bayes methods have a key role to play in making Bayesian methods more accessible and attractive to the scientific community. As a running example we consider a recently published meta-analysis from several randomized controlled clinical trials investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19.
Keywords: Reverse-Bayes, Analysis of Credibility, Bayes factor, false positive risk, prior-data conflict.

“We can make judgments of initial probabilities and infer final ones, or we can equally make judgments of final ones and infer initial ones by Bayes’s theorem in reverse.” Good (1983, p. 29)

∗ Department of Biostatistics, University of Zurich, [email protected], [email protected]
† Department of Mathematics, Aston University, [email protected]
‡ Data Team, Swiss National Science Foundation, [email protected]
§ Supported by the Swiss National Science Foundation (http://p3.snf.ch/Project-189295)

There is now a common consensus that the most widely-used methods of statistical inference have led to a crisis in both the interpretation of research findings and their replication (e. g. Gelman and Loken, 2014; Wasserstein and Lazar, 2016). At the same time, there is a lack of consensus on how to address the challenge, as highlighted by the plethora of alternative techniques to null-hypothesis significance testing now being put forward (see e. g. Wasserstein et al., 2019, and references therein). Especially striking is the relative dearth of alternatives based on Bayesian concepts. Given their intuitive inferential basis and output (see e. g.
Wagenmakers et al., 2008; McElreath, 2018, or some other textbook), these would seem obvious candidates to supplant the prevailing frequentist methodology. However, it is well-known that the adoption of Bayesian methods continues to be hampered by several factors, such as the belief that advanced computational tools are required to make Bayesian statistics practical (e. g. Green et al., 2015). The most persistent of these is that the full benefit of Bayesian methods demands specification of a prior level of belief, even in the absence of any appropriate insight. This “Problem of Priors” has cast a shadow over Bayesian methods since their emergence over 250 years ago (see e. g. McGrayne, 2011), and has led to a variety of approaches, such as prior elicitation, prior sensitivity analysis, and objective Bayesian methodology; all have their supporters and critics.

One of the least well-known was suggested over 70 years ago (Good, 1950) by one of the best-known proponents of Bayesian methods during the 20th century, I. J. Good. It involves reversing the conventional direction of Bayes’s Theorem and determining the level of prior belief required to reach a specified level of posterior belief, given the evidence observed. This reversal of Bayes’s Theorem allows the assessment of new findings on the basis of whether the resulting prior is reasonable in the light of existing knowledge. Whether a prior is plausible in the light of existing knowledge can be assessed informally, or more formally using techniques for comparing priors with existing data, as suggested by Box (1980) and further refined by Evans and Moshonov (2006). Good stressed that despite the routine use of the adjectives “prior” and “posterior” in applications of Bayes’s Theorem, the validity of any resulting inference does not require a specific temporal ordering, as the theorem is simply a constraint ensuring consistency with the axioms of probability. While reversing Bayes’s Theorem is still regarded as unacceptable by some on the grounds it allows “cheating” in the sense of choosing priors to achieve a desired posterior inference (e. g. O’Hagan and Forster, 2004, p. 143), others point out this is not an ineluctable consequence of the reversal (e. g. Cox, 2006, p. 78–79). As we shall show, recent technical advances further weaken this criticism.

Good’s belief in the value of Reverse-Bayes methods won support from E. T. Jaynes in his well-known treatise on probability. Explaining a specific manifestation of the approach (to be discussed shortly), Jaynes remarked: “We shall find it helpful in many cases where our prior information seems at first too vague to lead to any definite prior probabilities; it stimulates our thinking and tells us how to assign them after all” (Jaynes, 2003, p. 126). Yet despite the advocacy of two leading figures in the foundations of Bayesian methodology, the potential of Reverse-Bayes methods has remained largely unexplored. Most published work has focused on their use in putting new research claims in context, with Reverse-Bayes methods being used to assess whether the prior evidence needed to make a claim credible is consistent with existing insight (Carlin and Louis, 1996; Matthews, 2001a,b; Spiegelhalter, 2004; Greenland, 2006, 2011; Held, 2013; Colquhoun, 2017, 2019; Held, 2019a, 2020; Pawel and Held, 2020).

The purpose of this paper is to highlight recent technical developments of Good’s basic idea which lead to inferential tools of practical value in data analysis. Specifically, we show how Reverse-Bayes methods address the current concerns about the interpretation of new findings and their replication. We begin by illustrating the basics of the Reverse-Bayes approach for both hypothesis testing and parameter estimation. This is followed by a discussion of Reverse-Bayes methods for assessing effect estimates in Section 2. These allow the credibility of both new and existing research findings reported in terms of NHST to be evaluated in the context of existing knowledge. This enables researchers to go beyond the standard dichotomy of statistical significance/non-significance, extracting further insight from their findings.
We then discuss the use of the Reverse-Bayes approach in the most recalcitrant form of the Problem of Priors, involving the assessment of research findings which are unprecedented and thus lacking any clear source of prior support. We show how the concept of intrinsic credibility resolves this challenge, and puts recent calls to tighten p-value thresholds on a principled basis (Benjamin et al., 2017). In Section 3 we describe Reverse-Bayes methods with Bayes factors, the principled solution for Bayesian hypothesis testing. Finally, we describe in Section 4 Reverse-Bayes approaches to interpretational issues that arise in conventional statistical analysis based on p-values, and how they can be used to flag the risk of inferential fallacies. We close with some extensions and final conclusions.

The subjectivity involved in the specification of prior distributions is often seen as a weak point of Bayesian inference. The Reverse-Bayes approach can help to resolve this issue both in hypothesis testing and parameter estimation; we will start with the former. Consider a null hypothesis H with prior probability π = Pr(H), so Pr(H̄) = 1 − π is the prior probability of the alternative hypothesis H̄. Computation of the posterior probability of H is routine with Bayes’ theorem:

Pr(H | data) = Pr(data | H) Pr(H) / {Pr(data | H) Pr(H) + Pr(data | H̄) Pr(H̄)}.

Bayes’ theorem can be written in more compact form as

Pr(H | data) / Pr(H̄ | data) = {Pr(data | H) / Pr(data | H̄)} · {Pr(H) / Pr(H̄)},   (1)

i. e. the posterior odds are the likelihood ratio times the prior odds. The standard ‘forward-Bayes’ approach thus fixes the prior odds (or one of the underlying probabilities), determines the likelihood ratio for the available data, and takes the product to compute the posterior odds. Of course, the latter can be easily back-transformed to the posterior probability Pr(H | data), if required.
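As a minimal sketch (not code from the paper), the forward odds form of Bayes’ theorem in (1) can be written in a few lines; the illustrative numbers below are made up:

```python
def forward_bayes_odds(prior_odds, likelihood_ratio):
    """Posterior odds of H via the odds form of Bayes' theorem, eq. (1)."""
    return likelihood_ratio * prior_odds

def odds_to_prob(odds):
    """Back-transform odds Pr(H | data) / Pr(H-bar | data) to Pr(H | data)."""
    return odds / (1 + odds)

# Hypothetical example: prior odds 1:4 for H, likelihood ratio 8 in favour of H
post_odds = forward_bayes_odds(0.25, 8)
print(post_odds, odds_to_prob(post_odds))
```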
The Problem of Priors is now apparent: in order for us to update the odds in favour of H, we must first specify the prior odds. This can be problematic in situations where, for example, the evidence on which to base the prior odds is controversial or even non-existent.

However, as Good emphasised, it is entirely justifiable to “flip” Bayes’s theorem around, allowing us to ask the question: which prior, when combined with the data, leads to our specified posterior?

Pr(H) / Pr(H̄) = {Pr(H | data) / Pr(H̄ | data)} / {Pr(data | H) / Pr(data | H̄)}.   (2)

For illustration we re-visit an example put forward by Good (1950, p. 35), perhaps the first published Reverse-Bayes calculation. It centres on a question for which the setting of an initial prior is especially problematic: does an experiment provide convincing evidence for the existence of extra-sensory perception (ESP)? The substantive hypothesis H is that ESP exists, so that H̄ asserts it does not exist. Imagine an experiment in which a person has to make n consecutive guesses of random digits (between 0 and 9) and all are correct. The likelihood ratio is therefore

Pr(data | H) / Pr(data | H̄) = 1 / (1/10)^n = 10^n.

It is unlikely that sceptics and advocates of the existence of ESP would ever agree on what constitutes reasonable priors from which to start a standard Bayesian analysis of the evidence. However, Good argued that Reverse-Bayes offers a way forward by using it to set bounds on the prior probabilities for H and H̄. This is achieved via the outcome of an imaginary (Gedanken) experiment capable of demonstrating H is more likely than H̄, that is, of leading to posterior probabilities such that Pr(H | data) > Pr(H̄ | data). Using this approach, which Good termed the Device of Imaginary Results, we see that if the ESP experiment produced 20 correct consecutive guesses, (2) implies that ESP may be deemed more likely than not to exist by anyone whose priors satisfy Pr(H)/Pr(H̄) > 10^−20.
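Good’s Device of Imaginary Results is easy to sketch in code (a toy illustration, not from the paper): for n imagined correct guesses, reverse the odds form of Bayes’ theorem as in (2) to find the smallest prior odds for which ESP would come out more likely than not.

```python
def esp_likelihood_ratio(n):
    """LR = Pr(data | H) / Pr(data | H-bar) for n correct guesses of digits 0-9,
    assuming the ESP hypothesis H predicts every guess correctly."""
    return 10 ** n

def required_prior_odds(n, posterior_odds=1.0):
    """Reverse-Bayes, eq. (2): prior odds needed to reach the given posterior odds."""
    return posterior_odds / esp_likelihood_ratio(n)

# 20 correct guesses: anyone with prior odds above this must deem ESP more likely than not
print(required_prior_odds(20))  # 1e-20
```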
In contrast, if only n = 3 correct guesses emerged, then the existence of ESP could be rejected by anyone whose priors satisfy Pr(H)/Pr(H̄) < 10^−3. Using Bayes’s Theorem in reverse has thus led to a quantitative statement of the prior beliefs that either advocates or sceptics of ESP must be able to justify in the face of results from a real experiment. The practical value of Good’s approach was noted by Jaynes in his treatise: “[I]n the present state of development of probability theory, the device of imaginary results is usable and useful in a very wide variety of situations, where we might not at first think it applicable” (Jaynes, 2003, p. 125–126).

It is straightforward to extend (1) and (2) to hypotheses that involve unknown parameters θ. The likelihood ratio Pr(data | H) / Pr(data | H̄) is then called a Bayes factor (Jeffreys, 1961; Kass and Raftery, 1995), where

Pr(data | H_i) = ∫ Pr(data | θ, H_i) f(θ | H_i) dθ

is the marginal likelihood under hypothesis H_i, i = 0, 1, obtained by integration of the ordinary likelihood with respect to the prior distribution f(θ | H_i). We will apply the Reverse-Bayes approach to Bayes factors in Sections 3 and 4.

We can also apply the Reverse-Bayes idea to continuous prior and posterior distributions of a parameter of interest θ. Reversing Bayes’ theorem

f(θ | data) = f(data | θ) f(θ) / f(data)

then leads to

f(θ) = f(data) f(θ | data) / f(data | θ).   (3)

So the prior is proportional to the posterior divided by the likelihood, with proportionality constant f(data).

Consider Bayesian inference for the mean θ of a univariate normal distribution, assuming the variance σ² is known. Let x denote the observed value from that N(θ, σ²) distribution and suppose the prior for θ (and hence also the posterior) is normal. Each of them is determined by two parameters, usually the mean and the variance, but two distinct quantiles would also work. If we fix both parameters of the posterior, then the prior in (3) is – under a certain regularity condition – uniquely determined. For ease of presentation we work with the observational precision κ = 1/σ² and denote the prior and posterior precision by δ and δ′, respectively. Finally let μ and μ′ denote the prior and posterior mean, respectively.

Forward-Bayesian updating tells us how to compute the posterior precision and mean:

δ′ = δ + κ,
μ′ = (μδ + xκ)/δ′.

Reverse-Bayes simply inverts these equations, which leads to the following:

δ = δ′ − κ,   (4)
μ = (μ′δ′ − xκ)/δ,   (5)

provided δ′ > κ, i. e. the posterior precision must be larger than the observational precision.

We will illustrate the application of (4) and (5), as well as the methodology in the rest of this review, using a recent meta-analysis combining information from n = 7 randomized controlled clinical trials investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19 (WHO REACT Working Group, 2020); its results are reproduced in Figure 1 (here and henceforth, odds ratios (ORs) are expressed as log odds ratios to transform the range from (0, ∞) to (−∞, +∞), consistent with the assumption of normality). Let x_i = θ̂_i denote the maximum likelihood estimate (MLE) of the log odds ratio θ in the i-th study with standard error σ_i. The meta-analytic odds ratio estimate under the fixed-effects model is OR = 0.66 [95% CI, 0.53 to 0.82], respectively θ̂ = −0.42 [95% CI, −0.63 to −0.20] for the log odds ratio θ, indicating evidence for lower mortality of patients treated with corticosteroids compared to patients receiving usual care or placebo. The pooled effect estimate θ̂ represents a posterior mean μ′ with posterior precision δ′ ≈ 83.9.
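The forward and reverse updating equations (4) and (5) are easy to check numerically; the sketch below (with arbitrary made-up numbers) recovers the prior from a fixed posterior and the data:

```python
def forward_update(mu, delta, x, kappa):
    """Forward-Bayes normal updating: prior (mu, delta) plus datum x with precision kappa."""
    delta_post = delta + kappa
    mu_post = (mu * delta + x * kappa) / delta_post
    return mu_post, delta_post

def reverse_update(mu_post, delta_post, x, kappa):
    """Reverse-Bayes, eqs. (4)-(5): deduce the prior from a fixed posterior.
    Requires delta_post > kappa."""
    if delta_post <= kappa:
        raise ValueError("posterior precision must exceed observational precision")
    delta = delta_post - kappa
    mu = (mu_post * delta_post - x * kappa) / delta
    return mu, delta

# Round trip with illustrative numbers: the prior (0.0, 2.0) is recovered exactly
mu_post, delta_post = forward_update(mu=0.0, delta=2.0, x=1.5, kappa=4.0)
mu, delta = reverse_update(mu_post, delta_post, x=1.5, kappa=4.0)
print(mu, delta)
```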
Figure 1: Forest plot of fixed effects meta-analysis of randomized clinical trials investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19 (WHO REACT Working Group, 2020). Shown are the number of deaths among the total number of patients for treatment/control group, log odds ratio effect estimates with 95% confidence interval, two-sided p-values p, and prior-predictive tail probabilities p_Box with a meta-analytic estimate based on the remaining studies serving as the prior.

With a meta-analysis such as this, it is of interest to quantify potential conflict among the effect estimates from the different studies. To do this, we follow Presanis et al. (2013) and compute a prior-predictive tail probability (Box, 1980; Evans and Moshonov, 2006) for each study-specific estimate θ̂_i, with a meta-analytic estimate based on the remaining studies serving as the prior. Fixed effects (FE) meta-analysis is standard (forward-)Bayesian updating for normally distributed effect estimates with an initial flat prior, as considered here. Hence, instead of fitting a reduced meta-analysis for each study, we can simply use the Reverse-Bayes equations (4) and (5) together with the overall estimate to compute the parameters of the prior in the absence of the i-th study (denoted by the index −i):

δ_−i = δ′ − 1/σ_i²,
μ_−i = (μ′δ′ − θ̂_i/σ_i²)/δ_−i.

For example, through omitting the RECOVERY Collaborative Group (2020) trial result θ̂_i = −0.53 with standard error σ_i = 0.145, we obtain δ_−i ≈ 36.3 and μ_−i = −0.26. A prior-predictive tail probability using the approach from Box (1980) is then obtained by computing p_Box = Pr(χ²₁ ≥ t²_Box) with

t_Box = (θ̂_i − μ_−i) / √(σ_i² + 1/δ_−i) ≈ −1.2.

This leads to p_Box = 0.22 for the RECOVERY trial, indicating very little prior-data conflict; see Figure 1 for the tail probabilities p_Box for the other studies.

Instead of determining the prior completely based on the posterior, one may also want to fix one parameter of the posterior and one parameter of the prior. This is of particular interest in order to challenge “significant” or “non-significant” findings through the Analysis of Credibility, as we will see in the following section.
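The leave-one-out prior and Box’s prior-predictive check can be sketched as follows. Note this uses the rounded summary values quoted in the text, so the last digits differ slightly from the paper’s results (which rest on unrounded data):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def box_tail_probability(theta_i, sigma_i, mu_prime, delta_prime):
    """Leave-one-out prior via Reverse-Bayes, then Box's prior-predictive check.
    Returns (mu_minus_i, delta_minus_i, p_box)."""
    kappa_i = 1 / sigma_i**2
    delta_mi = delta_prime - kappa_i                                  # eq. (4)
    mu_mi = (mu_prime * delta_prime - theta_i * kappa_i) / delta_mi   # eq. (5)
    t_box = (theta_i - mu_mi) / sqrt(sigma_i**2 + 1 / delta_mi)
    p_box = 2 * (1 - phi(abs(t_box)))   # Pr(chi2_1 >= t^2) = two-sided normal tail
    return mu_mi, delta_mi, p_box

# RECOVERY trial, rounded inputs from the running example
mu_mi, delta_mi, p_box = box_tail_probability(
    theta_i=-0.53, sigma_i=0.145, mu_prime=-0.42, delta_prime=83.9)
print(round(delta_mi, 1), round(mu_mi, 2), round(p_box, 2))
```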
A more general question amenable to Reverse-Bayes methods is the assessment of effect estimates and their statistical significance or non-significance. This issue has recently attracted intense interest following the public statement of the American Statistical Association about the misuse and misinterpretation of the NHST concepts of statistical significance and non-significance (Wasserstein and Lazar, 2016). First investigated 20 years ago in Matthews (2001a), with subsequent discussion in Matthews (2001b), Reverse-Bayes methods for assessing both statistically significant and non-significant findings have been termed the Analysis of Credibility (or AnCred; Matthews, 2018), whose principles and practice we now briefly review.
Suppose the study gives rise to a conventional confidence interval for the unknown effect size θ at level 1 − α with lower limit L and upper limit U. Assume that L and U are symmetric around the point estimate θ̂ (assumed to be normally distributed with standard error σ). AnCred then takes this likelihood and uses a Reverse-Bayes approach to deduce the prior required in order to generate evidence for the existence of an effect, in the form of a posterior that excludes no effect. As such, AnCred allows evidence deemed statistically significant/non-significant in the NHST framework to be assessed for its credibility in the Bayesian framework. As the latter represents Pr(H | data), and thus a conditioning on the data rather than the null hypothesis, it is inferentially directly relevant to researchers. After a suitable transformation, AnCred can be applied to a large number of commonly used effect measures such as differences in means, odds ratios, relative risks and correlations (see the literature on meta-analysis for details about conversion among effect size scales, e. g. Cooper et al., 2019, Chapter 11.6). The inversion of Bayes’s Theorem needed to assess credibility requires the form and location of the prior distribution to be specified. This in turn depends on whether the claim being assessed is statistically significant or non-significant; we consider each below.
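Only L, U and the level α are needed in what follows. A small stdlib-only helper (an illustrative sketch, not from the paper) recovers the point estimate and standard error from a symmetric confidence interval:

```python
from math import erf, sqrt

def normal_quantile(p, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection (stdlib-only sketch)."""
    cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    for _ in range(100):
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def ci_to_estimate(L, U, alpha=0.05):
    """Point estimate and standard error from a symmetric (1 - alpha) CI."""
    z_half = normal_quantile(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    theta_hat = (L + U) / 2
    sigma = (U - L) / (2 * z_half)
    return theta_hat, sigma

# RECOVERY trial log odds ratio CI from the running example
theta_hat, sigma = ci_to_estimate(-0.82, -0.25)
print(round(theta_hat, 3), round(sigma, 3))
```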
Challenging statistically significant findings
A statistically significant finding at level α is characterized by both L and U being either positive or negative. Equivalently, z² > z²_{α/2} is required, where z = θ̂/σ denotes the corresponding test statistic and z_{α/2} the (1 − α/2) quantile of the standard normal distribution. The sufficiently sceptical prior is a mean-zero normal distribution whose (1 − α) credible interval ranges from −S to S, where S is derived such that the corresponding posterior credible interval just includes zero, the value of no effect. This critical prior interval can then be compared with internal or external evidence to assess if the finding is credible or not, despite being “statistically significant”.

More specifically, a Reverse-Bayes approach is applied to significant confidence intervals (at level α) based on a normally distributed effect estimate. The prior is a “sceptical” mean-zero normal distribution with variance τ² = g · σ², so the only free parameter is the relative prior variance g = τ²/σ². The posterior is hence also normal, and either its lower α/2 quantile (for positive θ̂) or its upper 1 − α/2 quantile (for negative θ̂) is fixed to zero, so the finding just represents “non-credible”. The sufficiently sceptical prior then has relative variance

g = 1/(z²/z²_{α/2} − 1) if z² > z²_{α/2}, undefined else,   (6)

see Held (2019a, Appendix) for a derivation. The corresponding scepticism limit is

S = (U − L)² / (4√(UL)),   (7)

which holds for any value of α provided the effect is significant at that level.

The left plot in Figure 2 illustrates the AnCred procedure for the finding from the RECOVERY trial (RECOVERY Collaborative Group, 2020). The trial found a decrease in COVID-19 mortality for patients treated with corticosteroids compared to usual care or placebo (θ̂ = −0.53 [95% CI, −0.82 to −0.25]). The sufficiently sceptical prior has relative variance g = 0.39, so the sufficiently sceptical prior variance needs to be roughly 2.5 times smaller than the variance of the estimate to make the result non-credible. The scepticism limit on the log odds ratio scale turns out to be −0.18, which is 0.84 on the odds ratio scale. Thus sceptics may still reject the RECOVERY trial finding as lacking credibility, despite its statistical significance, if external evidence suggests mortality reductions (in terms of odds) are unlikely to exceed around 1 − 0.84 ≈ 16%.

Challenging statistically non-significant findings
It is also possible to challenge “non-significant” findings (i. e. those for which the CI now includes zero, so z² < z²_{α/2}) using a prior that pushes the posterior towards being credible in the Bayesian sense, with posterior credible interval no longer including zero, corresponding to no effect.

Matthews (2018) proposed the “advocacy prior” for this purpose, a normal prior with positive mean μ and variance τ², chosen such that the α/2 quantile of the resulting posterior is fixed to zero, with lower prior credible limit zero and upper prior credible limit equal to the advocacy limit

AL = −(U + L)(U − L)²/(2UL),   (8)

to reach credibility of the corresponding posterior at level α. We show in Appendix A that the corresponding relative prior mean m = μ/θ̂ is

m = 2/(1 − z²/z²_{α/2}) if z² < z²_{α/2}, undefined else.   (9)

There are two important properties of the advocacy prior. First, the coefficient of variation is CV = τ/μ = z⁻¹_{α/2}. The advocacy prior θ ~ N(μ, τ² = μ²CV²) is hence characterized by a fixed coefficient of variation, so this prior has equal evidential weight (quantified in terms of μ/τ = z_{α/2}) as data which are “just significant” at level α. Second, the advocacy limit AL defines the family of normal priors capable of rendering a “non-significant” finding credible at the same level. Such priors are summarized by the credible interval (L₀, U₀) where L₀ ≥ 0 and U₀ ≤ AL. Thus, when confronted with a “non-significant” result – often, and wrongly, interpreted as indicating no effect – advocates of the existence of an effect may still claim the existence of the effect is credible to the same level if there exists prior evidence or insight compatible with the credible interval (L₀, U₀) where L₀ ≥ 0 and U₀ ≤ AL. If the evidence for an effect is weak (strong), the resulting advocacy prior will be broad (narrow), giving advocates of an effect more (less) latitude to make their case under terms of AnCred. Note that (8) and (9) also hold for negative effect estimates, where we fix the (1 − α/2) posterior quantile to zero instead of the α/2 quantile. For illustration, consider the finding from the REMAP-CAP trial (REMAP-CAP Investigators, 2020): θ̂ = −0.34 [95% CI, −0.96 to 0.29].
Figure 2: Two examples of the Analysis of Credibility. Shown are point estimates with 95% confidence/credible intervals. The left plot illustrates how a sceptical prior is used to challenge the significant finding from the RECOVERY trial (RECOVERY Collaborative Group, 2020). The right plot illustrates how an advocacy prior is used to challenge a non-significant finding from the REMAP-CAP trial (REMAP-CAP Investigators, 2020). In both scenarios the posterior is fixed to be just credible/non-credible.
15 on the odds ratio scale, see also theright plot in Figure 2. Thus advocates of the effectiveness of corticosteroids can regardthe trial as providing credible evidence of effectiveness despite its non-significance ifexternal evidence supports mortality reductions (in terms of odds) in the range 0%to 85%. So broad an advocacy range reflects the fact that this relatively small trialprovides only modest evidential weight, and thus little constraint on prior beliefs aboutthe effectiveness of corticosteroids.
Relationship between Analysis of Credibility and the fail-safe N method

There is an interesting connection between AnCred and the well-known “fail-safe N” method, sometimes also called “file-drawer analysis”. This method, first introduced by Rosenthal (1979) and later refined by Rosenberg (2005), is commonly applied to the results from a meta-analysis and answers the question: “How many unpublished negative studies do we need to make the meta-analytic effect estimate non-significant?” A relatively large N of such unpublished studies suggests that the estimate is robust to potential null-findings, for example due to publication bias. Calculations are made under the assumption that the unpublished studies have an average effect of zero and a precision equal to the average precision of the published ones.

While the method does not identify nor adjust for publication bias, it provides a quick way to assess how robust the meta-analytic effect estimate is. The method is available in common software packages such as metafor (Viechtbauer, 2010), and its simplicity and intuitive appeal have made it very popular among researchers.

AnCred and the fail-safe N are both based on the idea of challenging effect estimates such that they become “non-significant/not credible”, and it is easy to show that the methods are under some circumstances also technically equivalent. To illustrate this, we consider again the meta-analysis on the association between corticosteroids and COVID-19 mortality (WHO REACT Working Group, 2020), which gave the pooled log odds ratio estimate θ̂ = −0.42 with standard error σ = 0.11, posterior precision δ′ ≈ 83.9 and test statistic z = θ̂/σ ≈ −3.8. Using Rosenthal’s approach (for example with the fsn() function from the metafor package) we find that at least N = 20 additional but unpublished non-significant findings are needed to make the published meta-analysis effect non-significant. If instead we challenge the overall estimate with AnCred, we obtain the relative prior variance g = 0.36 using equation (6), so τ² = g · σ² ≈ 0.0044. Equating the precision 1/τ² of the sufficiently sceptical prior with the combined precision of N unpublished null studies, each with the average precision δ′/n ≈ 11.98 of the different effect estimates in the meta-analysis, leads to N = n/(δ′ · τ²) ≈ 19.3, which corresponds to the fail-safe N result after rounding to the next larger integer.

The Problem of Priors is at its most challenging in the context of entirely novel “out of the blue” effects for which no obviously relevant external evidence exists. By their nature, such findings often attract considerable interest both within and beyond the research community, making their reliability of particular importance. Given the absence of external sources of evidence, Matthews (2018) proposed the concept of intrinsic credibility. This requires that the evidential weight of an unprecedented finding is sufficient to put it in conflict with the sceptical prior rendering it non-credible. In the AnCred framework, this implies a finding possesses intrinsic credibility at level α if the estimate θ̂ is outside the corresponding sceptical prior interval [−S, S] extracted using Reverse-Bayes from the finding itself, i. e. |θ̂| > S with S given in (7). Matthews showed this implies an unprecedented finding is intrinsically credible at level α = 0.05 if its p-value does not exceed 0.013.

Held (2019a) refined the concept by suggesting the use of a prior-predictive check (Box, 1980; Evans and Moshonov, 2006) to assess potential prior-data conflict. With this approach the uncertainty of the estimate θ̂ is also taken into account, since it is based on the prior-predictive distribution, in this case θ̂ ~ N(0, σ² + τ² = σ²(1 + g)) with g as given in (6). Intrinsic credibility is declared if the (two-sided) tail probability

p_Box = Pr(χ²₁ ≥ θ̂²/(σ² + τ²)) = Pr(χ²₁ ≥ z²/(1 + g))

of θ̂ under the prior-predictive distribution is smaller than α. It turns out that the p-value associated with θ̂ needs to be at least as small as 0.0056 to obtain intrinsic credibility at level α = 0.05, providing another principled argument for the recent proposition to lower the p-value threshold for claims of new discoveries to 0.005 (Benjamin et al., 2017). A simple check for intrinsic credibility is based on the credibility ratio, the ratio of the upper to the lower limit (or vice versa) of a confidence interval for a credible effect size. If the credibility ratio is smaller than 5.8, then the result is intrinsically credible (Held, 2019a). This holds for confidence intervals at all possible values of α, not just for the 0.05 standard. For example, in the RECOVERY study the 95% confidence interval for the log odds ratio ranges from −0.82 to −0.25, so the credibility ratio is −0.82/−0.25 = 3.3 < 5.8.

Replication of effect direction
Whether intrinsic credibility is assessed based on the prior or the prior-predictive distribution, it depends on the level α in both cases. To remove this dependence, Held (2019a) proposed to consider the smallest level at which intrinsic credibility can be established, defining the p-value for intrinsic credibility

p_IC = 2{1 − Φ(|z|/√2)},

see Held (2019a, Section 4) for the derivation. Now z = θ̂/σ, so compared to the standard p-value p = 2{1 − Φ(|z|)}, the p-value for intrinsic credibility is based on twice the variance σ² of the estimate θ̂. Although motivated from a different perspective, inference based on intrinsic credibility thus mimics the doubling-the-variance rule advocated by Copas and Eguchi (2005) as a simple means of adjusting for model uncertainty.

Moreover, Held (2019a) showed that p_IC is connected to p_rep (Killeen, 2005), the probability that a replication will result in an effect estimate θ̂_r in the same direction as the observed effect estimate θ̂, by p_rep = 1 − p_IC/2. Hence, an intrinsically credible estimate at a small level α will have a high chance of replicating, since p_rep ≥ 1 − α/2. In general, p_rep lies between 0.5 and 1, with the extreme case p_rep = 0.5 for θ̂ = 0.

As an example, the p-value for intrinsic credibility for the RECOVERY trial finding (with p-value p = 0.0002) is p_IC = 0.01, and thus the probability of the replication effect going in the same direction (i. e. reduced mortality in this case) is 0.995. In contrast, the non-significant REMAP-CAP trial finding (with p-value p = 0.29) has p_IC = 0.46, and the probability of effect direction replication is hence only 0.77.

The AnCred procedure as described above uses posterior credible intervals as a means of quantifying evidence. However, quantification of evidence with Bayes factors is a more principled solution for hypothesis testing in the Bayesian framework (Jeffreys, 1961; Kass and Raftery, 1995). Bayes factors enable direct probability statements about the null and alternative hypothesis, and they can also quantify evidence for the null hypothesis; both are impossible with indirect measures of evidence such as p-values (Held and Ott, 2018). Reverse-Bayes approaches combined with Bayes factor methodology were pioneered in Carlin and Louis (1996), but then remained unexplored until Pawel and Held (2020) proposed an extension of AnCred where Bayes factors are used as a means of quantifying evidence. Rather than determining a prior such that a finding becomes “non-credible” in terms of a posterior credible interval, this approach determines a prior such that the finding becomes “non-compelling” in terms of a Bayes factor. In the second step of the procedure, the plausibility of this prior is quantified using external data from a replication study. Here, we will illustrate the methodology using only an original study; we mention extensions for replications in Section 5.

Sceptical priors
A standard hypothesis test compares the null hypothesis H0: θ = 0 to the alternative H1: θ ≠ 0. Bayesian hypothesis testing requires specification of a prior distribution of θ under H1. A typical choice is a local alternative, a unimodal symmetric prior distribution centred around the null value (Johnson and Rossell, 2010). We consider again the sceptical prior θ | H1 ∼ N(0, τ² = g · σ²) with relative prior variance g for this purpose. This leads to the Bayes factor comparing H0 to H1:

BF = √(1 + g) · exp{ −(z²/2) · g/(1 + g) }.

Yet again, the amount of evidence which the data provide against the null hypothesis depends on the prior parameter g: as g ↓ 0, the Bayes factor goes to 1, while as g → ∞, the null hypothesis will always prevail (BF → ∞) due to the Jeffreys-Lindley paradox (Robert, 2014). In between, the BF reaches a minimum at g = max{z² − 1, 0}, leading to

minBF = |z| · exp{−z²/2} · √e if |z| > 1, and minBF = 1 else, (10)

which is an instance of a minimum Bayes factor, the smallest possible Bayes factor within a class of alternative hypotheses, in this case zero-mean normal alternatives (Edwards et al., 1963; Berger and Sellke, 1987; Sellke et al., 2001; Held and Ott, 2018). Reporting of minimum Bayes factors is one attempt at solving the problem of priors in Bayesian inference. However, this bound may be rather small and the corresponding prior unrealistic. In contrast, the Reverse-Bayes approach makes the choice of the prior explicit by determining the relative prior variance parameter g such that the finding is no longer compelling, followed by an assessment of the plausibility of this prior. To do so, one first fixes BF = γ, where γ is a cut-off above which the result is no longer convincing, for example γ = 1/10, the level for strong evidence according to Jeffreys (1961). The sufficiently sceptical relative prior variance is then given by

g = −z²/q − 1, where q = W( −(z²/γ²) · exp{−z²} )

and W(·) is the Lambert W function (Corless et al., 1996); see Pawel and Held (2020, Appendix B) for a proof.

The sufficiently sceptical relative prior variance g exists for a cut-off γ only if minBF ≤ γ, similar to standard AnCred where it exists at level α only if the original finding was significant at the same level. In contrast to standard AnCred, however, if the sufficiently sceptical relative prior variance g exists, there are always two solutions, a consequence of the Jeffreys-Lindley paradox: if BF decreases in g below the chosen cut-off γ, after attaining its minimum it will monotonically increase and intersect a second time with γ, admitting a second solution for the sufficiently sceptical prior.

We revisit the meta-analysis example considered earlier: the left plot in Figure 3 shows the Bayes factor BF as a function of the relative prior variance g for each finding included in the meta-analysis. Most of them did not include a great number of participants and thus provide little evidence against the null for any value of the relative prior variance g. In contrast, the finding from the RECOVERY trial (RECOVERY Collaborative Group, 2020) provides more compelling evidence and can be challenged up to minBF ≈ 1/152. For example, we see in Figure 3 that the sceptical prior variance needs to be g = 0.59, so 1.69 times smaller than the variance of the effect estimate, such that the finding is no longer compelling at level γ = 1/10. This translates to a 95% prior credible interval from 0.80 to 1.24 for the OR. Hence, a sceptic might still consider the RECOVERY finding to be unconvincing, despite its minimum BF being very compelling, if external evidence supports ORs in that range. Note that also g′ = 8190 gives a Bayes factor of BF = 1/10; however, such a large relative prior variance represents ignorance rather than scepticism and is less useful for Reverse-Bayes inference.

The plausibility of the sufficiently sceptical prior can be evaluated in light of external evidence, but what should we do in the absence of such evidence? We could again use the Box (1980) prior-predictive check; however, the resulting tail probability is difficult to compare to the Bayes-factor cut-off γ. When a specific alternative model to the null is in mind, Box also suggested to use a Bayes factor contrasting the two models. Following this approach, Pawel and Held (2020) proposed to define a second Bayes factor contrasting the sufficiently sceptical prior to an optimistic prior, which they defined as θ | H1 ∼ N(θ̂, σ²), the posterior of θ based on the data and the reference prior f(θ) ∝ 1. The sufficiently sceptical prior is deemed implausible at level γ if the data favour the optimistic prior over the sufficiently sceptical prior at a higher level than 1/γ (i.e. if this second Bayes factor is ≤ γ), analogously to intrinsic credibility based on significance. For example, we obtain a Bayes factor of 1/64 for the finding from the RECOVERY trial, so its sufficiently sceptical prior is deemed implausible at level γ = 1/10.
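The sceptical-prior Bayes factor and the sufficiently sceptical relative prior variance can be sketched numerically. Instead of the Lambert W expression, the following minimal sketch (function names are ours) finds the smaller solution of BF = γ by bisection on (0, z² − 1], where the Bayes factor decreases monotonically from 1 to its minimum (10):

```python
from math import sqrt, exp

def bf01(z, g):
    """Bayes factor of H0: theta = 0 against a N(0, g * sigma^2) sceptical prior."""
    return sqrt(1 + g) * exp(-0.5 * z ** 2 * g / (1 + g))

def min_bf(z):
    """Minimum of bf01 over g, attained at g = z^2 - 1 when |z| > 1 (equation (10))."""
    if abs(z) <= 1:
        return 1.0
    return abs(z) * sqrt(exp(1)) * exp(-0.5 * z ** 2)

def sceptical_g(z, gamma, tol=1e-10):
    """Smaller root of bf01(z, g) = gamma, i.e. the sufficiently sceptical
    relative prior variance. Bisection on (0, z^2 - 1], where bf01 decreases
    from 1 to its minimum; returns None if min_bf(z) > gamma."""
    if min_bf(z) > gamma:
        return None
    lo, hi = 0.0, z ** 2 - 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bf01(z, mid) > gamma:
            lo = mid  # not yet below the cut-off, move towards the minimum
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With z = 3.7 (an assumption on our part: the z-value of the RECOVERY estimate is not stated explicitly in the text, and 3.7 is roughly what the reported g = 0.59 at γ = 1/10 implies), sceptical_g(3.7, 1/10) returns a value close to 0.59.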
[Figure 3 appears here: left panel "Sceptical prior" (BF against relative prior variance g), right panel "Advocacy prior" (BF against relative prior mean m); legend: DEXA-COVID 19, CoDEX, RECOVERY, CAPE COVID, COVID STEROID, REMAP-CAP, Steroids-SARI.]
Figure 3: Illustration of the AnCred with Bayes factors procedure using the findings from the meta-analysis on the association of COVID-19 mortality and corticosteroids. The left plot shows the Bayes factor BF as a function of the relative variance g of the sceptical prior. The result from the RECOVERY trial is challenged with a sceptical prior such that BF = 1/10; for the other trials such a prior does not exist. The right plot shows the Bayes factor BF as a function of the relative mean m = µ/θ̂ of the advocacy prior, where the coefficient of variation of the prior is fixed to CV = τ/µ = z(1/3)⁻¹ = 0.67.

To remove the dependence on the choice of γ, one can then determine the smallest cut-off γ at which intrinsic credibility can be established, defining a Bayes factor for intrinsic credibility in analogy to the p-value for intrinsic credibility. For the RECOVERY finding, this Bayes factor for intrinsic credibility turns out to be well below the strong-evidence threshold of 1/10.

Advocacy priors
A natural question is whether we can also define an advocacy prior, a prior which renders an uncompelling finding compelling, in the AnCred framework with Bayes factors. In traditional AnCred, advocacy priors always exist, since one can always find a prior that, when combined with the data, can overrule them. This is fundamentally different to inference based on Bayes factors, where the prior is not synthesized with the data, but rather used to predict them. A classical result due to Edwards et al. (1963) states that if we consider the class of all possible priors under H1, the minimum Bayes factor is given by

minBF = exp{−z²/2} (12)

which is obtained for H1: θ = θ̂. This implies that a non-compelling finding cannot be "rescued" further than to this bound. For example, for the finding from the REMAP-CAP trial (REMAP-CAP Investigators, 2020) the bound is an unsatisfactory minBF = 1/1.7, so at most "worth a bare mention" according to Jeffreys (1961).

Putting these considerations aside, we may still consider the class of N(µ, τ²) priors under the alternative H1. The Bayes factor contrasting H0 to H1 is then given by

BF = √(1 + τ²/σ²) · exp{ −(1/2) [ θ̂²/σ² − (θ̂ − µ)²/(σ² + τ²) ] }.

The Reverse-Bayes approach now determines the prior mean µ and variance τ² which lead to the Bayes factor BF being just at some cut-off γ. However, if both parameters are free, there are infinitely many solutions to BF = γ, if any exist at all. The traditional AnCred framework resolves this by restricting the class of possible priors to advocacy priors with fixed coefficient of variation CV = τ/µ = z_{α/2}⁻¹. We can translate this idea to the Bayes factor AnCred framework and fix the prior's coefficient of variation to CV = z(γ)⁻¹, where z(γ) is the z-value corresponding to minBF = γ. Inverting equation (12) leads to

z(γ) = √(−2 log γ).

Under this constraint, the prior carries the same evidential weight as data with minBF = γ. Moreover, the determination of the prior parameters becomes more feasible since there is only one free parameter left (either µ or τ²).

The right plot in Figure 3 illustrates application of the procedure to data from the meta-analysis on the association between COVID-19 mortality and corticosteroids. The coefficient of variation of the advocacy prior is fixed to CV = z(1/3)⁻¹ = 0.67 and thus the Bayes factor BF only depends on the relative mean parameter m = µ/θ̂. While under the sceptical prior only the RECOVERY finding could be challenged at γ = 1/3, with this advocacy prior it is now also possible to render the CAPE COVID finding (Dequin et al., 2020) compelling. We see that a prior with mean µ = m · θ̂ = 0.37 · (−0.79) = −0.29 and standard deviation τ = CV · |µ| = 0.19 renders the data just compelling at γ = 1/3. This corresponds to a 95% prior credible interval from 0.5 to 1.1 for the OR. Advocates may thus still consider the "non-compelling" finding as providing moderate evidence in favour of a benefit, if external evidence supports mortality reductions in that range. Note that the advocacy prior may not be unique: for the CAPE COVID finding, the prior with relative mean m′ = 1.26 and standard deviation τ′ = 0.67 also renders the data just compelling at γ = 1/3. We recommend choosing the prior with m closer to zero, as it is the more conservative choice.

Application of the Analysis of Credibility with Bayes factors as described in Section 3 assumes some familiarity with Bayes factors as measures of evidence. Colquhoun (2019) argued that very few nonprofessional users of statistics are familiar with the notion of Bayes factors or likelihood ratios. He proposes to quantify evidence with the false positive risk instead, "if only because that is what most users still think, mistakenly, that is what the p-value tells them". More specifically, Colquhoun (2019) defines the false positive risk (FPR) as the posterior probability that the point null hypothesis H0 of no effect is true given the observed p-value p, i.e. FPR = Pr(H0 | p). As before, H0 corresponds to the point null hypothesis H0: θ = 0. Note also that we take the exact (two-sided) p-value p as the observed "data", regardless of whether or not it is significant at some pre-specified level, the so-called "p-equals" interpretation of NHST (Colquhoun, 2017). The FPR can be calculated based on the Bayes factor associated with p. For ease of presentation we invert Bayes' theorem (1) and obtain

FPR/(1 − FPR) = Pr(H0 | p)/Pr(H1 | p) = BF · Pr(H0)/Pr(H1), (13)

where BF denotes the Bayes factor for H0 against H1 (the reciprocal of the Bayes factor for H1 against H0), computed directly from the observed p-value p. The common "forward-Bayes" approach is to compute the FPR from the prior probability Pr(H0) and the Bayes factor with (13). However, the prior probability Pr(H0) is usually unknown in practice and often hard to assess. This can be resolved via the Reverse-Bayes approach (Colquhoun, 2017, 2019): given a p-value and a false positive risk value, calculate the corresponding prior probability Pr(H0) that is needed to achieve that false positive risk. Of specific interest is the value FPR = 5%, because many scientists believe that a Type-I error rate of 5% is equivalent to an FPR of 5% (Greenland et al., 2016).
This is of course not true, and we follow Berger and Sellke (1987, Example 1) and use the reverse-Bayes approach to derive the necessary prior assumptions on Pr(H0) to achieve FPR = 5% with Equation (13):

Pr(H0) = [ 1 + (1 − FPR)/FPR · BF ]⁻¹. (14)

Colquhoun (2017, appendix A.2) uses a Bayes factor based on the t-test, but for compatibility with the previous sections we assume normality of the underlying test statistic. We consider Bayes factors under all simple alternatives, but also Bayes factors under local normal priors; see Held and Ott (2018) for a detailed comparison. Instead of working with a Bayes factor for a specific prior distribution, we prefer to work with the minimum Bayes factor minBF as introduced in Section 3. In what follows we will use the minimum Bayes factor based on the z-test (Held and Ott, 2018, Sections 2.1 and 2.2). The minimum Bayes factor based on the z-test among all possible priors can be computed using the function zCalibrate in the R package pCalibrate. The option alternative = "local" gives the minBF (10) under local normal priors.

Let minBF denote the minimum Bayes factor over a specific class of alternatives. From equation (14) we obtain the inequality

Pr(H0) ≤ [ 1 + (1 − FPR)/FPR · minBF ]⁻¹. (15)

The right-hand side is thus an upper bound on the prior probability Pr(H0) for a given p-value to achieve a pre-specified FPR value.

There are also minBFs not based on the z-test statistic, but directly on the (two-sided) p-value p: the so-called "−e p log p" calibration (Sellke et al., 2001)

minBF = −e p log p for p < 1/e, and minBF = 1 else, (16)

and the "−e q log q" calibration, where q = 1 − p (Held and Ott, 2018, Section 2.3):

minBF = −e (1 − p) log(1 − p) for p < 1 − 1/e, and minBF = 1 else. (17)

For small p, equation (17) can be simplified to minBF ≈ e p, which mimics the Good (1958) transformation of p-values to Bayes factors (Held, 2019b). The two p-based calibrations are also available in the package pCalibrate.
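The p-value calibrations and the bound (15) can be sketched as follows; this is a minimal Python sketch (the pCalibrate functionality mentioned above is an R package, and the function names here are ours):

```python
from math import exp, log

E = exp(1)

def minbf_eplogp(p):
    """'-e p log p' calibration, equation (16)."""
    return -E * p * log(p) if p < 1 / E else 1.0

def minbf_eqlogq(p):
    """'-e q log q' calibration with q = 1 - p, equation (17)."""
    q = 1 - p
    return -E * q * log(q) if p < 1 - 1 / E else 1.0

def prior_bound(p, fpr, minbf):
    """Upper bound (15) on Pr(H0) such that the observed two-sided
    p-value can still achieve the pre-specified false positive risk."""
    return 1 / (1 + (1 - fpr) / fpr * minbf(p))
```

For example, prior_bound(0.05, 0.05, minbf_eplogp) is roughly 0.11 and prior_bound(0.05, 0.05, minbf_eqlogq) roughly 0.28, matching the percentages quoted in the discussion of Figure 4.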
They carry fewer assumptions than the minimum Bayes factors based on the z-test under normality. The "−e p log p" calibration provides a general bound under all unimodal and symmetric local priors for p-values from z-tests (Sellke et al., 2001, Section 3.2). The "−e q log q" calibration is more conservative and gives a smaller bound on the Bayes factor than the "−e p log p" calibration. It can be viewed as a general lower bound under simple alternatives where the direction of the effect is taken into account; see Held and Ott (2018, Sections 2.1 and 2.3).

The left plot in Figure 4 shows the resulting upper bound on the prior probability Pr(H0) as a function of the two-sided p-value if the FPR is fixed at 5%. For p = 0.05, the "−e p log p" bound is around 11%, and 28% for the "−e q log q" calibration. The corresponding values based on the z-test are slightly smaller (10% and 15%, respectively). All the probabilities are below the 50% value of equipoise, illustrating that borderline significant results with p ≈ 0.05 do not provide sufficient evidence to justify an FPR value of 5%. In contrast, the p-value associated with the estimated treatment effect in the corticosteroids meta-analysis is p = 0.0002, and the corresponding upper bounds on Pr(H0) are all very large for such a small p-value.

Fixing FPR at the 5% level may be considered arbitrary. Another widespread misconception is the belief that the FPR is equal to the p-value. Held (2013) used a reverse-Bayes approach to investigate which prior assumptions are required such that FPR = p holds. Combining (14) with the "−e p log p" calibration (16) gives the explicit condition

Pr(H0) ≤ 1 / { 1 − e (1 − p) log p },

whereas the "−e q log q" calibration (17) leads to

Pr(H0) ≤ 1 / { 1 − e ((1 − p)²/p) log(1 − p) } ≈ 1 / { 1 + e (1 − p) },

which is approximately 1/(1 + e) = 26.9% for small p.

Figure 4: The left plot shows the upper bound on the prior probability Pr(H0) to achieve a false positive risk of 5% as a function of the p-value, calibrated with either a z-test calibration (simple and local alternatives) or with the "−e p log p" or "−e q log q" calibrations, respectively. The right plot shows the upper bound on Pr(H0) as a function of the p-value using the same calibrations but assuming the p-value equals the FPR.

The right plot in Figure 4 compares the bounds based on these two calibrations with the ones obtained from simple respectively local alternatives. We can see that strong assumptions on Pr(H0) are needed to justify the claim FPR = p: Pr(H0) cannot be larger than 15.2% if the p-value is conventionally significant (p < 0.05). With the "−e q log q" calibration, the upper bound on Pr(H0) is 26.9% for small p and increases only slightly for larger values of p. This illustrates that the misinterpretation FPR = p only holds if the prior probability of H0 is substantially smaller than 50%, an assumption which is questionable in the absence of strong prior knowledge.

The Reverse-Bayes methods described above have focused on the comparison of the prior needed for credibility with findings from other studies and/or more general insights. However, replication studies provide an obvious additional source of external evidence, as these are typically conducted to confirm original findings by repeating their experiments as closely as possible. The question is then whether the original findings have been successfully "replicated", currently a matter of considerable concern to the research community. To date, there remains no consensus on the precise meaning of replication in a statistical sense. The proposal of Held (2020) (see also Held et al., 2020) was to challenge the original finding using AnCred, as described in Section 2.1, and then evaluate the plausibility of the resulting prior using a prior-predictive check on the data from a replication study. A similar procedure, but using AnCred based on Bayes factors as in Section 3, was proposed in Pawel and Held (2020). Reverse-Bayes inference fits naturally into this setting as it provides a formal framework to challenge and substantiate scientific findings.

Apart from using data from a replication study, there are also other possible extensions of AnCred: we proposed either prior-predictive checks (Box, 1980; Evans and Moshonov, 2006) or Bayes factors (Jeffreys, 1961; Kass and Raftery, 1995) for the formal evaluation of the plausibility of the priors derived through Reverse-Bayes. Other methods could be used for this purpose, for example Bayesian measures of surprise (Bayarri and Morales, 2003). Furthermore, AnCred in its current state is derived assuming a normal likelihood for the effect estimate θ̂.
This is the same framework as in standard meta-analysis, and it provides a good approximation for studies with reasonable sample size (Carlin, 1992). Nevertheless, the normality assumption could be relaxed and more robust distributions could be considered, for example a t-distribution, which could lead to more accurate inferences for studies with small sample size.

The inferential advantages of Bayesian methods are increasingly recognised within the statistical community. However, among the majority of working researchers they have failed to make any serious headway, and they retain a reputation for being complex and "controversial".

In this review, we have outlined how an idea that began with Jack Good's proposal for resolving the "Problem of priors" over 70 years ago (Good, 1950) has experienced a renaissance over recent years. The basic idea is to invert Bayes' theorem: a specified posterior is combined with the data to obtain the Reverse-Bayes prior, which is then used for further inference. This approach is useful in situations where it is difficult to decide what constitutes a reasonable prior, but easy to specify the posterior which would lead to a particular decision. Starting with the work of Matthews (2001a,b), the Reverse-Bayes methodology has been shown capable of addressing many common inferential challenges, including assessing the credibility of scientific findings (Spiegelhalter, 2004; Greenland, 2006, 2011), making sense of "out of the blue" discoveries with no prior support (Matthews, 2018; Held, 2019a), estimating the probability of successful replications (Held, 2019a, 2020), and extracting more insight from standard p-values while reducing the risk of misinterpretation (Held, 2013; Colquhoun, 2017, 2019).
The appeal of Reverse-Bayes techniques has recently been widened by the development of inferential methods using both posterior probabilities and Bayes factors (Carlin and Louis, 1996; Pawel and Held, 2020).

These developments come at a crucial time for the role of statistical methods in research. Despite the many serious, and now well-publicised, inadequacies of NHST (Wasserstein and Lazar, 2016), the research community has shown itself to be remarkably reluctant to abandon NHST. Techniques based on the Reverse-Bayes methodology of the kind described in this review could encourage the wider use of Bayesian inference by researchers. As such, we believe they can play a key role in the scientific enterprise of the 21st century.

Software
All analyses were performed in the R programming language version 4.0.3 (R Core Team, 2017). The code to reproduce all analyses is available at https://gitlab.uzh.ch/samuel.pawel/Reverse-Bayes-Code.
Support by the Swiss National Science Foundation (Project
References
Bayarri, M. and Morales, J. (2003). "Bayesian measures of surprise for outlier detection." Journal of Statistical Planning and Inference, 111(1-2): 3–22. URL https://doi.org/10.1016/s0378-3758(02)00282-3
Benjamin, D. J., Berger, J. O., Johannesson, M., et al. (2018). "Redefine statistical significance." Nature Human Behaviour, 2(1): 6–10. URL https://doi.org/10.1038/s41562-017-0189-z
Berger, J. O. and Sellke, T. (1987). "Testing a point null hypothesis: Irreconcilability of P values and evidence (with discussion)." Journal of the American Statistical Association, 82: 112–139. URL https://doi.org/10.1080/01621459.1987.10478397
Box, G. E. P. (1980). "Sampling and Bayes' Inference in Scientific Modelling and Robustness (with discussion)." Journal of the Royal Statistical Society, Series A, 143: 383–430. URL https://doi.org/10.2307/2982063
Carlin, B. P. and Louis, T. A. (1996). "Identifying Prior Distributions That Produce Specific Decisions, With Application to Monitoring Clinical Trials." In Berry, D., Chaloner, K., and Geweke, J. (eds.), Bayesian Analysis in Statistics and Econometrics: Essays in Honor of Arnold Zellner, 493–503. New York: Wiley.
Carlin, J. B. (1992). "Meta-analysis for 2 × 2 tables: A Bayesian approach." Statistics in Medicine, 11(2): 141–158. URL https://doi.org/10.1002/sim.4780110202
Colquhoun, D. (2017). "The reproducibility of research and the misinterpretation of p-values." Royal Society Open Science, 4(12): 171085. URL https://dx.doi.org/10.1098/rsos.171085
— (2019). "The False Positive Risk: A Proposal Concerning What to Do About p-Values." The American Statistician, 73(sup1): 192–201. URL https://doi.org/10.1080/00031305.2018.1529622
Cooper, H., Hedges, L. V., and Valentine, J. C. (eds.) (2019). The Handbook of Research Synthesis and Meta-Analysis. Russell Sage Foundation. URL https://doi.org/10.7758/9781610448864
Copas, J. and Eguchi, S. (2005). "Local model uncertainty and incomplete-data bias (with discussion)." Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(4): 459–513. URL https://doi.org/10.1111/j.1467-9868.2005.00512.x
Corless, R. M., Gonnet, G. H., Hare, D. E. G., Jeffrey, D. J., and Knuth, D. E. (1996). "On the Lambert W function." Advances in Computational Mathematics, 5(1): 329–359. URL https://doi.org/10.1007/bf02124750
Cox, D. R. (2006). Principles of Statistical Inference. Cambridge: Cambridge University Press.
Dequin, P.-F., Heming, N., Meziani, F., Plantefève, G., Voiriot, G., Badié, J., François, B., Aubron, C., Ricard, J.-D., Ehrmann, S., Jouan, Y., Guillon, A., Leclerc, M., Coffre, C., Bourgoin, H., Lengellé, C., Caille-Fénérol, C., Tavernier, E., Zohar, S., Giraudeau, B., Annane, D., et al. (2020). "Effect of Hydrocortisone on 21-Day Mortality or Respiratory Support Among Critically Ill Patients With COVID-19." JAMA, 324(13): 1298. URL https://doi.org/10.1001/jama.2020.16761
Edwards, W., Lindman, H., and Savage, L. J. (1963). "Bayesian statistical inference for psychological research." Psychological Review, 70: 193–242. URL https://doi.org/10.1037/h0044139
Evans, M. and Moshonov, H. (2006). "Checking for prior-data conflict." Bayesian Analysis, 1(4): 893–914. URL https://doi.org/10.1214/06-ba129
Gelman, A. and Loken, E. (2014). "The statistical crisis in science." American Scientist, 102(6): 460–465. URL https://doi.org/10.1511/2014.111.460
Good, I. J. (1950). Probability and the Weighing of Evidence. London, UK: Griffin.
— (1958). "Significance Tests in Parallel and in Series." Journal of the American Statistical Association, 53(284): 799–813. URL https://doi.org/10.1080/01621459.1958.10501480
— (1983). Good Thinking: The Foundations of Probability and Its Applications. Minneapolis: University of Minnesota Press.
Green, P., Łatuszyński, K., Pereyra, M., and Robert, C. (2015). "Bayesian computation: a summary of the current state, and samples backwards and forwards." Statistics and Computing, 25(6): 835–862. URL https://doi.org/10.1007/s11222-015-9574-5
Greenland, S. (2006). "Bayesian perspectives for epidemiological research: I. Foundations and basic methods." International Journal of Epidemiology, 35: 765–775. URL https://doi.org/10.1093/ije/dyi312
— (2011). "Null misinterpretation in statistical testing and its impact on health risk assessment." Preventive Medicine, 53: 225–228. URL https://doi.org/10.1016/j.ypmed.2011.08.010
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and Altman, D. G. (2016). "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations." European Journal of Epidemiology, 31(4): 337–350. URL https://doi.org/10.1007/s10654-016-0149-3
Held, L. (2013). "Reverse-Bayes analysis of two common misinterpretations of significance tests." Clinical Trials, 10: 236–242. URL https://doi.org/10.1177/1740774512468807
— (2019a). "The assessment of intrinsic credibility and a new argument for p < 0.005." Royal Society Open Science. URL https://doi.org/10.1098/rsos.181534
— (2019b). "On the Bayesian interpretation of the harmonic mean p-value." Proceedings of the National Academy of Sciences, 116(13): 5855–5856. URL https://doi.org/10.1073/pnas.1900671116
— (2020). "A new standard for the analysis and design of replication studies (with discussion)." Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(2): 431–448. URL https://doi.org/10.1111/rssa.12493
Held, L., Micheloud, C., and Pawel, S. (2020). "The assessment of replication success based on relative effect size." Technical report. URL http://arxiv.org/abs/2009.07782
Held, L. and Ott, M. (2018). "On p-Values and Bayes Factors." Annual Review of Statistics and Its Application, 5(1). URL https://doi.org/10.1146/annurev-statistics-031017-100307
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge, UK; New York, NY: Cambridge University Press. URL https://doi.org/10.1017/cbo9780511790423
Jeffreys, H. (1961). Theory of Probability. Oxford: Oxford University Press, 3rd edition.
Johnson, V. E. and Rossell, D. (2010). "On the use of non-local prior densities in Bayesian hypothesis tests." Journal of the Royal Statistical Society, Series B, 72(2): 143–170. URL https://doi.org/10.1111/j.1467-9868.2009.00730.x
Kass, R. E. and Raftery, A. E. (1995). "Bayes Factors." Journal of the American Statistical Association, 90(430): 773–795.
Killeen, P. R. (2005). "An Alternative to Null-Hypothesis Significance Tests." Psychological Science, 16(5): 345–353. URL https://doi.org/10.1111/j.0956-7976.2005.01538.x
Matthews, R. A. J. (2001a). "Methods for assessing the credibility of clinical trial outcomes." Drug Information Journal, 35: 1469–1478. URL https://doi.org/10.1177/009286150103500442
— (2001b). "Why should clinicians care about Bayesian methods? (with discussion)." Journal of Statistical Planning and Inference, 94: 43–71. URL https://doi.org/10.1016/S0378-3758(00)00232-9
— (2018). "Beyond 'significance': principles and practice of the Analysis of Credibility." Royal Society Open Science, 5(1): 171047. URL https://doi.org/10.1098/rsos.171047
McElreath, R. (2018). Statistical Rethinking. Chapman and Hall/CRC. URL https://doi.org/10.1201/9781315372495
McGrayne, S. B. (2011). The Theory That Would Not Die. New Haven, CT: Yale University Press.
O'Hagan, A. and Forster, J. (2004). Kendall's Advanced Theory of Statistics 2B. Wiley, second edition.
Pawel, S. and Held, L. (2020). "The sceptical Bayes factor for the assessment of replication success." URL https://arxiv.org/abs/2009.01520
Presanis, A. M., Ohlssen, D., Spiegelhalter, D. J., and De Angelis, D. (2013). "Conflict Diagnostics in Directed Acyclic Graphs, with Applications in Bayesian Evidence Synthesis." Statistical Science, 28(3): 376–397. URL https://doi.org/10.1214/13-sts426
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
RECOVERY Collaborative Group (2020). "Dexamethasone in Hospitalized Patients with Covid-19 – Preliminary Report." New England Journal of Medicine. URL https://doi.org/10.1056/nejmoa2021436
REMAP-CAP Investigators (2020). "Effect of Hydrocortisone on Mortality and Organ Support in Patients With Severe COVID-19." JAMA, 324(13): 1317. URL https://doi.org/10.1001/jama.2020.17022
Robert, C. P. (2014). "On the Jeffreys-Lindley Paradox." Philosophy of Science, 81(2): 216–232. URL https://doi.org/10.1086/675729
Rosenberg, M. S. (2005). "The file-drawer problem revisited: A general weighted method for calculating fail-safe numbers in meta-analysis." Evolution, 59(2): 464–468. URL https://doi.org/10.1111/j.0014-3820.2005.tb01004.x
Rosenthal, R. (1979). "The file drawer problem and tolerance for null results." Psychological Bulletin, 86(3): 638–641. URL https://doi.org/10.1037/0033-2909.86.3.638
Sellke, T., Bayarri, M. J., and Berger, J. O. (2001). "Calibration of p Values for Testing Precise Null Hypotheses." The American Statistician, 55: 62–71. URL https://doi.org/10.1198/000313001300339950
Spiegelhalter, D. J. (2004). "Incorporating Bayesian Ideas into Health-Care Evaluation." Statistical Science, 19(1): 156–174. URL https://doi.org/10.1214/088342304000000080
Viechtbauer, W. (2010). "Conducting Meta-Analyses in R with the metafor Package." Journal of Statistical Software, 36(3). URL https://doi.org/10.18637/jss.v036.i03
Wagenmakers, E.-J., Lee, M., Lodewyckx, T., and Iverson, G. J. (2008). "Bayesian Versus Frequentist Inference." In Bayesian Evaluation of Informative Hypotheses, 181–207. New York, NY: Springer New York. URL https://doi.org/10.1007/978-0-387-09612-4_9
Wasserstein, R. L. and Lazar, N. A. (2016). "The ASA's Statement on p-Values: Context, Process, and Purpose." The American Statistician, 70(2): 129–133. URL https://doi.org/10.1080/00031305.2016.1154108
Wasserstein, R. L., Schirm, A. L., and Lazar, N. A. (2019). "Moving to a World Beyond 'p < 0.05'." The American Statistician, 73(sup1): 1–19. URL https://doi.org/10.1080/00031305.2019.1583913
WHO Rapid Evidence Appraisal for COVID-19 Therapies (REACT) Working Group (2020). "Association Between Administration of Systemic Corticosteroids and Mortality Among Critically Ill Patients With COVID-19: A Meta-analysis." JAMA, 324(13): 1330–1341. URL https://doi.org/10.1001/jama.2020.17023
Appendices
A Proof of equation (9)
Suppose that the estimate θ̂ is not significant at level α, so z²/z²_{α/2} < 1. With U, L = θ̂ ± z_{α/2} σ we have U + L = 2θ̂, U·L = θ̂² − z²_{α/2} σ² and U − L = 2 z_{α/2} σ. We therefore obtain with (8):

µ = AL/2 = −(U + L)(U − L)² / (4 U L) = −(2θ̂)(2 z_{α/2} σ)² / (4 (θ̂² − z²_{α/2} σ²)) = −2θ̂ z²_{α/2} σ² / (θ̂² − z²_{α/2} σ²) = 2θ̂ / (1 − z²/z²_{α/2}).

The advocacy standard deviation is τ = AL/(2 z_{α/2}) = µ/z_{α/2} and the coefficient of variation is therefore CV = τ/µ = z⁻¹_{α/2}.
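The identity in this proof can be checked numerically. A minimal sketch (the numerical inputs θ̂ = 0.3, σ = 0.2 and z_{α/2} = 1.96 are our own illustration, not from the text):

```python
def advocacy_mean(theta, sigma, z_alpha):
    """Advocacy prior mean from the credible-interval limits U and L:
    mu = -(U + L)(U - L)^2 / (4 U L)."""
    U = theta + z_alpha * sigma
    L = theta - z_alpha * sigma
    return -(U + L) * (U - L) ** 2 / (4 * U * L)

def advocacy_mean_closed(theta, sigma, z_alpha):
    """Closed-form expression mu = 2*theta / (1 - z^2 / z_alpha^2)."""
    z = theta / sigma
    return 2 * theta / (1 - z ** 2 / z_alpha ** 2)
```

For any non-significant estimate (|z| < z_{α/2}) the two expressions agree, and the advocacy mean exceeds 2θ̂ in magnitude; the advocacy standard deviation then follows as τ = µ/z_{α/2}, giving CV = 1/z_{α/2}.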