Frequentist tests for Bayesian models

Astronomy & Astrophysics manuscript no. lblucy                © ESO 2018
April 17, 2018

L.B. Lucy

Astrophysics Group, Blackett Laboratory, Imperial College London, Prince Consort Road, London SW7 2AZ, UK

Received ; Accepted
ABSTRACT
Analogues of the frequentist chi-square and F tests are proposed for testing goodness-of-fit and consistency for Bayesian models. Simple examples exhibit these tests' detection of inconsistency between consecutive experiments with identical parameters, when the first experiment provides the prior for the second. In a related analysis, a quantitative measure is derived for judging the degree of tension between two different experiments with partially overlapping parameter vectors.

Key words.
Methods: data analysis – Methods: statistical
1. Introduction
Bayesian statistical methods are now widely applied in astronomy. Of the new techniques thus introduced, model selection (or comparison) is especially notable in that it replaces the frequentist acceptance-rejection paradigm for testing hypotheses. Thus, given a data set D, there might be several hypotheses {H_k} that have the potential to explain D. From these, Bayesian model selection identifies the particular H_k that best explains D. The procedure is simple, though computationally demanding: starting with an arbitrary pair of the {H_k}, we apply the model selection machinery, discard the weaker hypothesis, replace it with one of the remaining H_k, and repeat.

This procedure usefully disposes of the weakest hypotheses, but there is no guarantee that the surviving H_k explains D. If the correct hypothesis is not included in the {H_k}, we are left with the 'best' hypothesis but are not made aware that the search for an explanation should continue. In the context of model selection, the next step would be a comparison of this 'best' H_k with the hypothesis that H_k is false. But model selection fails at this point because we cannot compute the required likelihood (Sivia & Skilling 2006, p.84). Clearly, what is needed is a goodness-of-fit criterion for Bayesian models.

This issue is discussed by Press et al. (2007, p.779). They note that "There are no good fully Bayesian methods for assessing goodness-of-fit ..." and go on to report that "Sensible Bayesians usually fall back to p-value tail statistics ... when they really need to know if a model is wrong."

On the assumption that astronomers do really need to know if their models are wrong, this paper adopts a frequentist approach to testing Bayesian models. Although this approach may be immediately abhorrent to committed Bayesians, the role of the tests proposed herein is merely to provide a quantitative measure according to which Bayesians decide whether their models are satisfactory.
When they are, the Bayesian inferences are presented - and with increased confidence. When not, flaws in the models or the data must be sought, with the aim of eventually achieving satisfactory Bayesian inferences.
2. Bayesian models
The term Bayesian model - subsequently denoted by M - must now be defined. The natural definition of M is that which must be specified in order that Bayesian inferences can be drawn from D - i.e., in order to compute posterior probabilities. This definition implies that, in addition to the hypothesis H, which introduces the parameter vector α, the prior probability distribution π(α) must also be included in M. Thus, symbolically, we write

M ≡ {π, H}    (1)

It follows that different Bayesian models can share a common H. For example, H may be the hypothesis that D is due to Keplerian motion. But for the motion of a star about the Galaxy's central black hole, the appropriate π will differ from that for the reflex orbit of a star due to a planetary companion.

A further consequence is that a failure of M to explain D is not necessarily due to H: an inappropriate π is also a possibility.

To astronomers accustomed to working only with uniform priors, the notion that a Bayesian model's poor fit to D could be due to π might be surprising. A specific circumstance where π could be at fault arises when Bayesian methods are used to repeatedly update our knowledge of some phenomenon - e.g., the value of a fundamental constant that over the years is the subject of numerous experiments (X₁, ..., X_i, ...), usually of increasing precision. With an orthodox Bayesian approach, X_{i+1} is analysed with the prior set equal to the posterior from X_i. Thus

π_{i+1}(α) = p(α | H, D_i)    (2)

This is the classical use of Bayes' theorem to update our opinion by incorporating past experience. However, if X_i is flawed - e.g., due to an unrecognized systematic error - then this choice of π impacts negatively on the analysis of X_{i+1}, leading perhaps to a poor fit to D_{i+1}.

Now, since subsequent flawless experiments result in the decay of the negative impact of X_i, this recursive procedure is self-correcting, so a Bayesian might argue that the problem can be ignored.
But scientists feel obliged to resolve such discrepancies before publishing or undertaking further experiments and therefore need a statistic that quantifies any failure of M to explain D.
3. A goodness-of-fit statistic for Bayesian models
The most widely used goodness-of-fit statistic in the frequentist approach to hypothesis testing is χ², whose value is determined by the residuals between the fitted model and the data, with no input from prior knowledge. Thus,

χ²_min = χ²(α̂)    (3)

is the goodness-of-fit statistic for the minimum-χ² solution α̂.

A simple analogue of χ²_min for Bayesian models is the posterior mean of χ²(α),

⟨χ²⟩_π = ∫ χ²(α) p(α | H, D) dV_α    (4)

where the posterior distribution

p(α | H, D) = π(α) L(α | H, D) / ∫ π(α) L(α | H, D) dV_α    (5)

Here L(α | H, D) is the likelihood of α given data D.

Note that since ⟨χ²⟩_π depends on both constituents of M, namely H and π, it has the potential to detect a poor fit due to either or both being at fault, as required by Sect. 2.

In Eq. (4) the subscript π is attached to ⟨χ²⟩ to stress that a non-trivial, informative prior is included in M. On the other hand, when a uniform prior is assumed, ⟨χ²⟩ is independent of the prior and is then denoted by ⟨χ²⟩_u.

The Bayesian goodness-of-fit statistic ⟨χ²⟩_u is used in Lucy (2014; L14) to illustrate the detection of a weak second orbit in simulations of Gaia scans of an astrometric binary. In that case, H states that the scan residuals are due to the reflex Keplerian orbit caused by one invisible companion. With increasing amplitude of the second orbit, the point comes when the investigator will surely abandon H - i.e., abandon the assumption of just one companion - see Fig. 12, L14.

3.1. P-values

With the classical frequentist acceptance-rejection paradigm, a null hypothesis H is rejected on the basis of a p-value tail statistic. Thus, with the χ² test, H is rejected if χ²_min > χ²_{ν,β}, where p(χ²_ν > χ²_{ν,β}) = β, and accepted otherwise.
Here ν = n − k is the number of degrees of freedom, where n is the number of measurements and k is the number of parameters introduced by H, and β is the designated p-threshold chosen by the investigator.

For a Bayesian model, a p-value can be computed from the ⟨χ²⟩_π statistic, whose approximate distribution is derived below in Sect. 5.1. However, a sharp transition from acceptance to rejection of the null hypothesis at some designated p-value is not recommended. First, p-values overstate the strength of the evidence against H (e.g., Sellke et al. 2001). In particular, the value p = 0.05 recommended in elementary texts does not imply strong evidence against H. Second, the p-value is best regarded (Sivia & Skilling 2006, p.85) as serving a qualitative purpose, with a small value prompting us to think about alternative hypotheses.

Thus, if ⟨χ²⟩_π exceeds the chosen threshold χ²_{ν,β}, this is a warning that something is amiss and should be investigated, with the degree of concern increasing as β decreases. If the β = 0.001 threshold is exceeded, then the investigator would be well-advised to suspect that M or D is at fault. Although statistics texts emphasize tests of H not D, astronomers know that D can be corrupted by biases or calibration errors. Departures from normally-distributed errors can also increase ⟨χ²⟩_π.

If D is not at fault, then M is the culprit, implying that either π or H is at fault. If the fault lies with π not H, then we expect that ⟨χ²⟩_u < χ²_{ν,β} even though ⟨χ²⟩_π > χ²_{ν,β}. If neither D nor π can be faulted, then the investigator must seek a refined or alternative H.

3.2. Decision errors

In the frequentist approach to hypothesis testing, decision errors are said to be of type I if H is true but the test says reject H, and of type II if H is false but the test says accept H. Since testing a Bayesian model is not concerned exclusively with H, these definitions must be revised, as follows:

A type I error arises when M and D are flawless but the statistic (e.g., ⟨χ²⟩_π) exceeds the designated threshold.

A type II error arises when M or D is flawed but the statistic does not exceed the designated threshold.

Here the words accept and reject are avoided. Moreover, no particular threshold is mandatory: it is at the discretion of the investigator and is chosen with regard to the consequences of making a decision error.
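To make the definition concrete, here is a minimal numerical sketch (illustrative Python, not the paper's code; the 1-D model anticipates the toy example of Sect. 5, and all names and grids are assumptions) that evaluates the posterior of Eq. (5) on a grid and then the statistic of Eq. (4):

```python
import numpy as np

# Illustrative 1-D model (cf. Sect. 5): H states u = mu, and D comprises
# n measurements u_i = mu + sigma*z_i.
rng = np.random.default_rng(1)
n, sigma = 100, 1.0
u = rng.normal(0.0, sigma, n)                 # simulated data D

mu = np.linspace(-1.0, 1.0, 4001)             # grid over the single parameter

# chi^2(mu): residuals between the fitted model and the data
c2 = ((u[:, None] - mu[None, :]) ** 2).sum(axis=0) / sigma**2

# Posterior of Eq. (5) with a uniform prior (so <chi^2> here is <chi^2>_u)
w = np.exp(-0.5 * (c2 - c2.min()))            # prior x likelihood, unnormalised
post = w / w.sum()                            # normalised on the grid

chi2_mean = (c2 * post).sum()                 # Eq. (4): posterior mean of chi^2
print(chi2_mean - c2.min())                   # ~ 1 for this k = 1 linear model
```

For this linear one-parameter model the printed difference is close to k = 1, anticipating the exact uniform-prior result quoted in Sect. 4.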
4. Statistics of ⟨χ²⟩

The intuitive understanding that scientists have regarding χ² derives from its simplicity and the rigorous theorems on its distribution that allow us to derive confidence regions for multi-dimensional linear models (e.g., Press et al. 2007, Sect. 15.6.4).

Rigorous statistics for χ² require two assumptions: 1) that the model fitted to D is linear in its parameters, and 2) that measurement errors obey the normal distribution. Nevertheless, even when these standard assumptions do not strictly hold, scientists still commonly rely on χ² to gauge goodness-of-fit, with perhaps Monte Carlo (MC) sampling to provide justification or calibration (e.g., Press et al. 2007, Sect. 15.6.1).

Rigorous results for the statistic ⟨χ²⟩ are therefore of interest. In fact, if we add the assumption of a uniform prior to the above standard assumptions, then we may prove (Appendix A) that

⟨χ²⟩_u = χ²_min + k    (6)

where k is the number of parameters.

Given that Eq. (6) is exact under the stated assumptions, it follows that the quantity ⟨χ²⟩_u − k is distributed as χ²_ν with ν = n − k degrees of freedom.

For minimum-χ² fitting of linear models, the solution is always a better fit to D than is the true solution. In particular, E(χ²_true) = n, whereas E(χ²_min) = n − k. Accordingly, the +k term in Eq. (6) 'corrects' χ²_min for overfitting, and so E(⟨χ²⟩_u) = E(χ²_true) - i.e., the expected value of χ² for the actual measurement errors.

4.1. Informative priors

When an informative prior is included in M, an analytic formula for ⟨χ²⟩_π is in general not available. However, its approximate statistical properties are readily found.

Consider again a linear model with normally-distributed errors and suppose further that the experiment (X₂) is without flaws. The χ² surfaces are then self-similar k-dimensional ellipsoids with minimum at α̂₂ ≈ α_true, the unknown exact solution. Let us now further suppose that the informative prior π(α) derives from a previous flawless experiment (X₁). The prior π will then be a convex (bell-shaped) function with maximum at α_max ≈ α̂₁. Now, consider a 1-D family of such functions all centred on α̂₁ and ranging from the very broad to the very narrow. For the former ⟨χ²⟩_π ≈ ⟨χ²⟩_u; for the latter ⟨χ²⟩_π ≈ χ²(α̂₁). Thus, ideally, when a Bayesian model M is applied to data D we expect that

⟨χ²⟩_u ≳ ⟨χ²⟩_π ≥ χ²_min    (7)

Now, a uniform prior and δ(α − α̂₁) are the limits of the above family of bell-shaped functions. Since the delta-function limit is not likely to be closely approached in practice, a first approximation to the distribution of ⟨χ²⟩_π is that of ⟨χ²⟩_u - i.e., that of χ²_{n−k} + k.

The above discussion assumes faultless X₁ and X₂. But now suppose that there is an inconsistency between π and X₂. The peak of π at α_max will then in general be offset from the minimum of χ² at α̂₂. Accordingly, in the calculation of ⟨χ²⟩_π from Eq. (4), the neighbourhood of the χ² minimum χ²_min at α̂₂ has reduced weight relative to χ²(α_max) at the peak of π. Evidently, in this circumstance, ⟨χ²⟩_π can greatly exceed ⟨χ²⟩_u, and the investigator is then alerted to the inconsistency.
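The uniform-prior result of Eq. (6) and its distributional consequence can be checked by direct simulation. The following sketch (illustrative Python; trial count and seed are assumptions) draws many realizations of the one-parameter model, forms ⟨χ²⟩_u = χ²_min + k, and compares the sample mean with the expected value E(χ²_true) = n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 100, 1, 20000

# Simulate many flawless experiments fitting u = mu by minimum chi^2;
# the minimum-chi^2 estimate of mu is the sample mean of each data set.
u = rng.normal(0.0, 1.0, (trials, n))
chi2_min = ((u - u.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

vals = chi2_min + k                      # Eq. (6): <chi^2>_u = chi^2_min + k

# chi2_min ~ chi^2_{n-k}, so E(<chi^2>_u) = (n - k) + k = n
print(vals.mean())                       # ~ 100.0
```

The sample variance of `vals` is likewise close to 2(n − k), the variance of a χ² variate with n − k degrees of freedom, consistent with ⟨χ²⟩_u − k being distributed as χ²_{n−k}.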
5. Numerical experiments
Given that rigorous results for ⟨χ²⟩ are not available for informative priors, numerical tests are essential to illustrate the discussion of Sect. 4.1.

5.1. Flawless experiments

A simple example with just one parameter µ is as follows: H states that u = µ, and D comprises n measurements u_i = µ + σz_i, where the z_i here and below are independent gaussian variates randomly sampling N(0,1). The parameters are µ = 0, σ = 1 and n = 100.

In the first numerical experiment, two independent data sets D₁ and D₂ are created comprising n₁ and n₂ measurements, respectively. On the assumption of a uniform prior, the posterior density of µ derived from D₁ is

p(µ | H, D₁) = L(µ | H, D₁) / ∫ L(µ | H, D₁) dµ    (8)

Fig. 1. Empirical PDF of ⟨χ²⟩_π derived from the analysis of 10⁶ data pairs (D₁, D₂) as described in Sect. 5.1. The dashed curve is the theoretical PDF for ⟨χ²⟩_u.

We now regard p(µ | H, D₁) as prior knowledge to be taken into account in analysing D₂. Thus

π(µ) = p(µ | H, D₁)    (9)

so that the posterior distribution derived from D₂ is

p(µ | H, D₂) = π(µ) L(µ | H, D₂) / ∫ π(µ) L(µ | H, D₂) dµ    (10)

The statistic quantifying the goodness-of-fit achieved when M = {π, H} is applied to data D₂ is then

⟨χ²⟩_π = ∫ χ²(µ) p(µ | H, D₂) dµ    (11)

From N independent data pairs (D₁, D₂), we obtain N independent values of ⟨χ²⟩_π, thus allowing us to test the expectation (Sect. 4.1) that this statistic is approximately distributed as ⟨χ²⟩_u. In Fig. 1, this empirical PDF is plotted together with the theoretical PDF for ⟨χ²⟩_u. The accuracy of the approximation at large values of χ² is of special importance. For n = 100 and k = 1, the 0.05, 0.01 and 0.001 critical points of ⟨χ²⟩_u are 124.2, 135.6 and 149.2, respectively. From a simulation with N = 10⁶, the numbers of ⟨χ²⟩_π values exceeding these thresholds are 50177, 10011 and 1025, respectively. Thus, the fractions of ⟨χ²⟩_π exceeding the critical values derived from the distribution of ⟨χ²⟩_u are close to their predicted values. Accordingly, the conventional interpretation of these critical values is valid.

In this experiment, the analysis of X₂ benefits from knowledge gained from X₁. We expect therefore that ⟨χ²⟩_π is less than ⟨χ²⟩_u, since replacing the uniform prior with the informative π obtained from X₁ should improve the fit. From 10⁶ repetitions, this proves to be so with probability 0.683. Sampling noise in D₁ and D₂ accounts for the shortfall.
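The calibration just described can be reproduced on a much smaller scale. This sketch (illustrative Python; 1000 trials here versus the paper's 10⁶ pairs, and the grid is an assumption) forms the prior from D₁ via Eq. (9), computes ⟨χ²⟩_π via Eqs. (10)-(11), and checks the 5% exceedance rate against the ⟨χ²⟩_u approximation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n1 = n2 = 100
sigma = 1.0
mu = np.linspace(-0.8, 0.8, 1601)

def chi2(d):
    # chi^2(mu) for data set d on the parameter grid
    return ((d[:, None] - mu[None, :]) ** 2).sum(axis=0) / sigma**2

def chi2_pi(d1, d2):
    c1 = chi2(d1)
    prior = np.exp(-0.5 * (c1 - c1.min()))      # Eq. (9), unnormalised
    c2 = chi2(d2)
    w = prior * np.exp(-0.5 * (c2 - c2.min()))  # Eq. (10)
    return (c2 * w).sum() / w.sum()             # Eq. (11)

trials = 1000
vals = np.array([chi2_pi(rng.normal(0, sigma, n1),
                         rng.normal(0, sigma, n2)) for _ in range(trials)])

# 5% critical point of <chi^2>_u: chi^2_{n-k} percentile plus k, with k = 1
crit = stats.chi2(df=n2 - 1).ppf(0.95) + 1.0    # = 124.2 for n = 100
frac = (vals > crit).mean()
print(frac)                                     # close to the nominal 0.05
```

The critical point 124.2 reproduces the value quoted in the text, and the exceedance fraction is close to 0.05, in line with the paper's 10⁶-pair result.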
Fig. 2. Detection of bias in X₁ with the ⟨χ²⟩_π statistic when X₂ is analysed with a prior derived from X₁. Values of ⟨χ²⟩_π (filled circles) and ⟨χ²⟩_u (open circles) are plotted against the bias parameter b/σ. The dashed lines are the 5 and 95% levels.

When X₁ and X₂ are flawless, the statistic ⟨χ²⟩_π indicates doubts - i.e., type I errors (Sect. 3.2) - about the mutual consistency of X₁ and X₂ with just the expected frequency. Thus, with the 5% threshold, doubts arise in 5.02% of the above 10⁶ trials. This encourages the use of ⟨χ²⟩_π to detect inconsistency.

5.2. Detection of bias

Accordingly, in a second test, X₁ is flawed due to biased measurements. Thus, now u_i = µ + σz_i + b for D₁. As a result, the prior for X₂ obtained from Eq. (9) is compromised, and this impacts on the statistic ⟨χ²⟩_π from Eq. (11).

In Fig. 2, the values of ⟨χ²⟩_π and ⟨χ²⟩_u are plotted against b/σ. Since the compromised prior is excluded from ⟨χ²⟩_u, its values depend only on the flawless data sets D₂, and so mostly fall within the (0.05, 0.95) interval. In contrast, as b/σ increases, the values of ⟨χ²⟩_π are increasingly affected by the compromised prior.

The mutual consistency of X₁ and X₂ is assessed on the basis of ⟨χ²⟩_π, with choice of critical level at our discretion. However, when ⟨χ²⟩_π exceeds the 0.1% level at 149.2, we would surely conclude that X₁ and X₂ are in conflict and seek to resolve the discrepancy. On the other hand, when inconsistency is not indicated, we may accept the Bayesian inferences derived from X₂ in the confident belief that incorporating prior knowledge from X₁ is justified and beneficial.

This test illustrates the important point that an inappropriate π can corrupt the Bayesian inferences drawn from a flawless experiment. Thus, in this case, the bias in D₁ propagates into the posterior p(µ | H, D₂) derived from X₂. This can be (and is) avoided by preferring the frequentist methodology. But to do so is to forgo the great merit of Bayesian inference, namely its ability to incorporate informative prior information (Feldman & Cousins 1998). If one does therefore prefer Bayesian inference, it is evident that a goodness-of-fit statistic such as ⟨χ²⟩_π is essential in order to detect errors propagating into the posterior distribution from an ill-chosen prior.

In the above test, the analysis of X₂ is preceded by that of X₁. This order can be reversed. Thus, with the same N data pairs (D₁, D₂), we now first analyse X₂ with a uniform prior to obtain p(µ | H, D₂). This becomes the prior for the analysis of X₁. This analysis then gives the posterior p(µ | H, D₁) from which a new value of ⟨χ²⟩_π is obtained. When the values of ⟨χ²⟩_π obtained with this reversed order of analysis are plotted against b/σ, the result is similar to Fig. 2, implying that the order is unimportant. Indeed, statistically, the same decision is reached independently of order. For example, for 10⁶ independent data pairs (D₁, D₂) with b/σ = 1, the number with ⟨χ²⟩_π > 124.2, the 5% threshold, is 50267 when X₁ precedes X₂ and 50149 when X₂ precedes X₁.

5.3. An alternative statistic

Noting that the Bayesian evidence is Z = L̄, the prior-weighted mean of the likelihood, we can, under standard assumptions, write

L̄ ∝ exp(−χ²_eff / 2)    (12)

where the effective χ² (Bridges et al. 2009) is

χ²_eff = −2 ln ∫ π(α) exp[−χ²(α)/2] dV_α    (13)

This is a possible alternative to ⟨χ²⟩_π defined in Eq. (4). However, in the test of Sect. 5.1, the two values are so nearly identical that it is immaterial which mean is used. Here ⟨χ²⟩_π is preferred because it remains well-defined for a uniform prior, for which an analytic result is available (Appendix A).

Because ⟨χ²⟩_π and χ²_eff are nearly identical, the distribution of χ²_eff should approximate that of ⟨χ²⟩_u (Sect. 4.1). To test this, the experiment of Sect. 5.1 is repeated with χ²_eff replacing ⟨χ²⟩_π. From a simulation with N = 10⁶, the numbers of χ²_eff values exceeding the 0.05, 0.01 and 0.001 thresholds are 50167, 9951 and 970, respectively. Thus, if χ²_eff is chosen as the goodness-of-fit statistic, accurate p-values can be derived on the assumption that χ²_eff − k is distributed as χ²_ν with ν = n − k degrees of freedom. From Sect. 4.1, we expect these p-values to be accurate if π(α) is not more sharply peaked than L(α | H, D).
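For a single data pair the near-equality of the two statistics is easy to exhibit numerically. In this sketch (illustrative Python; grid and seed are assumptions), χ²_eff is evaluated from Eq. (13) alongside ⟨χ²⟩_π from Eq. (4):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 100, 1.0
mu = np.linspace(-0.8, 0.8, 1601)
dmu = mu[1] - mu[0]

d1 = rng.normal(0, sigma, n)     # X1: supplies the informative prior
d2 = rng.normal(0, sigma, n)     # X2: the data being tested

def chi2(d):
    return ((d[:, None] - mu[None, :]) ** 2).sum(axis=0) / sigma**2

c1, c2 = chi2(d1), chi2(d2)
prior = np.exp(-0.5 * (c1 - c1.min()))
prior /= prior.sum() * dmu       # normalised prior density pi(mu), Eq. (9)

# <chi^2>_pi: posterior mean of chi^2 under the informative prior, Eq. (4)
w = prior * np.exp(-0.5 * (c2 - c2.min()))
chi2_pi = (c2 * w).sum() / w.sum()

# chi^2_eff from Eq. (13), written relative to min(chi^2) for numerical
# stability: chi2_eff = c2.min() - 2 ln Int pi exp[-(c2 - c2.min())/2] dmu
chi2_eff = c2.min() - 2.0 * np.log(w.sum() * dmu)

print(abs(chi2_eff - chi2_pi))   # small compared with the ~100 scale of chi^2
```

For flawless data pairs the two statistics differ by O(1) on values of order n, consistent with the text's remark that it is immaterial which mean is used.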
6. An F statistic for Bayesian models

Inspection of Fig. 2 shows that a more powerful test of inconsistency between X₁ and X₂ must exist. A systematic displacement of ⟨χ²⟩_π relative to ⟨χ²⟩_u is already evident even when ⟨χ²⟩_π is below the 5% threshold at 124.2. This suggests that a Bayesian analogue of the F statistic be constructed.

6.1. The frequentist F-test

In frequentist statistics, a standard result (e.g., Hamilton 1964, p.139) in the testing of linear hypotheses is the following: we define the statistic

F = [(n − i)/j] × (χ²_c − χ²_min) / χ²_min    (14)

where χ²_min is the minimum value of χ² when all i parameters are adjusted, and χ²_c is the minimum value when a linear constraint is imposed on j (≤ i) parameters, so that only the remaining i − j are adjusted. Then, on the null hypothesis H that the constraint is true, F is distributed as F_{ν₁,ν₂} with ν₁ = j and ν₂ = n − i, where n is the number of measurements. Accordingly, H is tested by comparing the value F given by Eq. (14) with critical values F_{ν₁,ν₂,β} derived from the distribution of F_{ν₁,ν₂}.

Note that when j = i, the constraint completely determines α. If this value is α*, then χ²_c = χ²(α*) and H states that α* = α_true.

A particular merit of the statistic F is that it is independent of σ. However, the resulting F-test does assume normally-distributed measurement errors.

6.2. The Bayesian statistic F_B

In a Bayesian context, the frequentist hypothesis that α_true = α* is replaced by the statement that α_true obeys the posterior distribution p(α | H, D). Thus an exact constraint is replaced by a fuzzy constraint.

Adopting the simplest approach, we define F_B, a Bayesian analogue of F, by taking χ²_c to be the value at the posterior mean of α,

⟨α⟩_π = ∫ α π(α) L(α | H, D) dV_α / ∫ π(α) L(α | H, D) dV_α    (15)

where π(α) is the informative prior. Considerations of accuracy when values of χ² are computed on a grid suggest we take χ²_0 to be the value at

⟨α⟩_u = ∫ α L(α | H, D) dV_α / ∫ L(α | H, D) dV_α    (16)

the posterior mean for a uniform prior.

With χ²_c and χ²_0 thus defined, the Bayesian F_B is

F_B = [(n − i)/j] × [χ²(⟨α⟩_π) − χ²(⟨α⟩_u)] / χ²(⟨α⟩_u)    (17)

and its value is to be compared with the chosen threshold F_{ν₁,ν₂,β} when testing the consistency of X₁ and X₂. Since ⟨α⟩_u is independent of π(α), the statistic F_B measures the effect of π(α) in displacing ⟨α⟩_π from ⟨α⟩_u.

In this Bayesian version of the F-test, the null hypothesis states that the posterior mean ⟨α⟩_π = α_true. This will be approximately true when a flawless Bayesian model M is applied to a flawless data set D. However, if the chosen threshold on F_B is exceeded, then one or more of π, H and D is suspect, as discussed in Sect. 3.1. If the threshold is not exceeded, then Bayesian inferences drawn from the posterior distribution p(α | H, D) are supported.

6.3. p-values from F_B

If Eq. (17) gives F_B = F*, the corresponding p-value is

p* = ∫_{F*}^∞ F_{ν₁,ν₂} dF    (18)

where ν₁ = j and ν₂ = n − i. The accuracy of these p-values can be tested with the 1-D toy model of Sect. 5.1 as follows:

(i) Independent data sets D₁, D₂ are created corresponding to X₁, X₂.
(ii) With π from X₁, the quantities ⟨µ⟩*_π and F* are calculated for X₂ with Eqs. (15)-(17).
(iii) M independent data sets D_m are now created with u_i = ⟨µ⟩*_π + σz_i.
(iv) For each D_m, the new value of χ²(⟨µ⟩*_π) is calculated with the ⟨µ⟩*_π derived at step (ii).
(v) For each D_m, the new value of χ²(⟨µ⟩_u) is calculated with ⟨µ⟩_u from Eq. (16).
(vi) With these new χ² values, the statistic F_m is obtained from Eq. (17).

The resulting M values of F_m give us an empirical estimate of p*, namely f*, the fraction of the F_m that exceed F*. In one example of this test, a data pair D₁, D₂ gives ⟨µ⟩*_π = 0.089 and a value F*; with ν₁ = 1 and ν₂ = 99, Eq. (18) then gives the corresponding p*. From M data sets D_m created on the assumption that µ_true = ⟨µ⟩*_π, the empirical fraction f* proves close to p*. From 100 independent pairs D₁, D₂, the mean value of |p* − f*| is small, confirming the accuracy of p-values derived from Eq. (18) and therefore of decision thresholds F_{ν₁,ν₂,β}.

6.4. Detection of inconsistency

To investigate this Bayesian F-test's ability to detect inconsistency between X₁ and X₂, the bias test of Sect. 5.2 is repeated, again with n₁ = n₂ = 100. The vector α in Sect. 6.1 now becomes the scalar µ, and i = 1.

In Fig. 3, the values of F_B from Eq. (17) with j = i = 1 are plotted against the bias parameter b/σ. Also plotted are critical values derived from the distribution F_{ν₁,ν₂} with ν₁ = 1 and ν₂ = 99. In contrast to Fig. 2 for the ⟨χ²⟩_π statistic, inconsistency between X₁ and X₂ is now detected down to appreciably smaller values of b/σ.

In these tests, the flaw in X₁ is the bias b. Now, if it were known for certain that this was the flaw, then a Bayesian analysis with H changed from u = µ to u = µ + b - i.e., with an extra parameter - is straightforward. The result is the posterior density for b, allowing for correction. In contrast, the detection of a flaw with ⟨χ²⟩_π or F_B is cause-independent. Although Figs. 2 and 3 have b/σ as the abscissa, for real experiments this quantity is not known and conclusions are drawn just from the ordinate.
Fig. 3. Detection of inconsistency between X₁ and X₂. Values of F_B are plotted against the bias parameter b/σ. The dashed lines are the 0.05, 0.01 and 0.001 critical levels.
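A sketch of the F_B computation for the biased 1-D case can be written as follows (illustrative Python; the bias value b/σ = 2, the grid and the seed are assumptions), following Eqs. (15)-(18):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, sigma, b = 100, 1.0, 2.0
mu = np.linspace(-3.0, 3.0, 6001)

d1 = rng.normal(0, sigma, n) + b    # biased X1 -> compromised prior
d2 = rng.normal(0, sigma, n)        # flawless X2

def chi2(d):
    return ((d[:, None] - mu[None, :]) ** 2).sum(axis=0) / sigma**2

c1, c2 = chi2(d1), chi2(d2)
prior = np.exp(-0.5 * (c1 - c1.min()))    # prior from X1, unnormalised
like2 = np.exp(-0.5 * (c2 - c2.min()))    # likelihood of X2, unnormalised

# Posterior means of mu: informative prior (Eq. 15) and uniform prior (Eq. 16)
w = prior * like2
mu_pi = (mu * w).sum() / w.sum()
mu_u = (mu * like2).sum() / like2.sum()

def chi2_at(m):
    return ((d2 - m) ** 2).sum() / sigma**2

i = j = 1
F_B = (n - i) / j * (chi2_at(mu_pi) - chi2_at(mu_u)) / chi2_at(mu_u)   # Eq. (17)
p = stats.f.sf(F_B, j, n - i)       # Eq. (18) with nu1 = 1, nu2 = 99
print(p < 0.001)                    # bias strongly flagged at b/sigma = 2
```

The compromised prior drags ⟨µ⟩_π away from ⟨µ⟩_u, F_B is large, and the p-value falls far below the 0.001 threshold, as Fig. 3 leads one to expect at this bias level.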
7. Tension between experiments
In the above, the goodness-of-fit of M to D is investigated taking into account the possibility that a poor fit might be due to the prior derived from a previous experiment. A related goodness-of-fit issue commonly arises in modern astrophysics, particularly cosmology, namely whether estimates of non-identical but partially overlapping parameter vectors from different experiments are mutually consistent. The term commonly used in this regard is tension, with investigators often reporting their subjective assessments - e.g., there is marginal tension between X₁ and X₂ - based on the two credibility domains (often multi-dimensional) for the parameters in common.

In a recent paper, Seehars et al. (2015) review the attempts in the cosmological literature to quantify the concept of tension, with emphasis on CMB experiments. Below, we develop a rather different approach based on the F_B statistic defined in Sect. 6.2.

Since detecting and resolving conflicts between experiments is essential for scientific progress, it is desirable to quantify the term tension and to optimize its detection. The conjecture here is that this optimum is achieved when inferences from X₁ are imposed on the analysis of X₂.

7.1. Identical parameter vectors

A special case of assessing tension between experiments is that where the parameter vectors are identical. But this is just the problem investigated in Sects. 5 and 6. When X₁ (with bias) and X₂ (without bias) are separately analysed, the limited overlap of the two credibility intervals for µ provides a qualitative indication of tension. However, if X₂ is analysed with a prior derived from X₁, then the statistic ⟨χ²⟩_π - see Fig. 2 - or, more powerfully, the statistic F_B - see Fig. 3 - provide a quantitative measure to inform statements about the degree of tension.

7.2. Partially overlapping parameter vectors

We now suppose that D₁, D₂ are data sets from different experiments X₁, X₂ designed to test the hypotheses H₁, H₂. However, though different, these hypotheses have parameters in common. Specifically, the parameter vectors of H₁ and H₂ are ξ = (α, β) and η = (β, γ), respectively, and k, l, m are the numbers of elements in α, β, γ, respectively.

If X₁ and X₂ are analysed independently, we may well find that M₁ and M₂ provide satisfactory fits to D₁ and D₂ and yet still be obliged to report tension between the experiments because of a perceived inadequate overlap of the two l-dimensional credibility domains for β.

A quantitative measure of tension between X₁ and X₂ can be derived via the priors, as follows: The analysis of X₁ gives p(ξ | H₁, D₁), the posterior distribution of ξ, from which we may derive the posterior distribution of β by integrating over α. Thus

p(β | H₁, D₁) = ∫ p(ξ | H₁, D₁) dV_α    (19)

Now, for a Bayesian analysis of X₂, we must specify π(η) throughout η-space, not just β-space. But

π(η) = π(β, γ) = π(γ | β) π(β)    (20)

Accordingly, what we infer from X₁ can be imposed on the analysis of X₂ by writing

π(η) = π(γ | β) p(β | H₁, D₁)    (21)

The conditional prior π(γ | β) must now be specified. This can be taken to be uniform unless we have prior knowledge from other sources - i.e., not from D₂.

With π(η) specified, the analysis of X₂ gives the posterior density p(η | H₂, D₂). As in Sect. 6.2, we now regard this as a fuzzy constraint on η from which we compute the sharp constraint ⟨η⟩_π given by Eq. (15). Now, in general, ⟨η⟩_π will be displaced from ⟨η⟩_u given by Eq. (16). The question then is: Is the increment in χ²(η | H₂, D₂) between ⟨η⟩_π and ⟨η⟩_u so large that we must acknowledge tension between X₁ and X₂?

Following Sect. 6, we answer this question by computing F_B from Eq. (17) with i = j = l + m, the total number of parameters in η. The result is then compared to selected critical values from the F_{ν₁,ν₂} distribution, where ν₁ = l + m and ν₂ = n₂ − l − m. With standard assumptions, F_B obeys this distribution if ⟨η⟩_π = η_true - i.e., if ⟨β⟩_π = β_true and ⟨γ⟩_π = γ_true - see Sect. 6.3.
7.3. A simple example

A simple example with one parameter (µ) for X₁ and two (µ, κ) for X₂ is as follows: H₁ states that u = µ and H₂ states that v = µ + κx. The data set D₁ comprises n₁ measurements u_i = µ + σz_i + b, where b is the bias. The data set D₂ comprises n₂ measurements v_j = µ + κx_j + σz_j, where the x_j are uniform in (−1, +1). The parameters are µ = 0, κ = 1, σ = 1 and n₁ = n₂ = 100. In the notation of Sect. 7.2, the vectors β, γ contract to the scalars µ, κ, whence l = m = 1, and α does not appear, whence k = 0.

Fig. 4. Detection of tension between different experiments. Values of F_B are plotted against the bias parameter b/σ. The dashed lines are the 0.05, 0.01 and 0.001 critical levels.

In the above, H₁ and H₂ are different hypotheses but have the parameter µ in common. If b = 0, the analyses of X₁ and X₂ should give similar credibility intervals for µ and therefore no tension. But with sufficiently large b, tension should arise.

This is investigated following Sect. 7.2. Applying M₁ to D₁, we derive p(µ | H₁, D₁). Then, taking the conditional prior π(κ | µ) to be constant, we obtain

π(µ, κ) ∝ p(µ | H₁, D₁)    (22)

as the prior for the analysis of X₂. This gives us the posterior distribution p(µ, κ | H₂, D₂), which is a fuzzy constraint in (µ, κ)-space. Replacing this by the sharp constraint (⟨µ⟩, ⟨κ⟩), the constrained χ² is

χ²_c = χ²(⟨µ⟩, ⟨κ⟩ | H₂, D₂)    (23)

Substitution in Eq. (17) with j = i = 2 then gives F_B as a measure of the tension between X₁ and X₂. Under standard assumptions, F_B is distributed as F_{ν₁,ν₂} with ν₁ = 2, ν₂ = n₂ − 2 = 98 if (⟨µ⟩, ⟨κ⟩) = (µ, κ)_true.

In Fig. 4, the values of F_B are plotted against b/σ together with critical values for F_{ν₁,ν₂} with ν₁ = 2, ν₂ = 98. This plot shows that tension is detected for b/σ ≳ 0.6. This is slightly inferior to Fig. 3, as is to be expected because of the more complicated X₂.

The importance of Fig. 4 is in showing that the F_B statistic has the desired characteristics of reliably informing the investigator of tension between different experiments with partially overlapping parameter vectors. When the inconsistency is slight, no tension is signalled; when it is pronounced, the investigator is reliably alerted.

7.4. Constraining the common parameters

If F_B calculated as in Sect. 7.2 indicates tension, the possible flaws include the conditional prior π(γ | β). Thus, tension could be indicated even if the prior π(β) inferred from X₁ is accurate and consistent with D₂.

Accordingly, we might prefer to focus just on β - i.e., on the parameters in common. If so, we compute

⟨β⟩_π = ∫ β p(η | H₂, D₂) dV_η    (24)

and then find the minimum of χ²(η | H₂, D₂) when β = ⟨β⟩_π. Thus, the constrained χ² is now

χ²_c = min_γ χ²(⟨β⟩_π, γ)    (25)

The F test also applies in this case - see Sect. 6.1. Thus this value of χ²_c is substituted in Eq. (17), where we now take j = l, the number of parameters in β. The resulting F_B is then compared to critical values derived from F_{ν₁,ν₂} with ν₁ = l, ν₂ = n₂ − l − m. With the standard assumptions, F_B obeys this distribution if ⟨β⟩_π = β_true.

For the simple model of Sect. 7.3, the resulting plot of F_B against b/σ is closely similar to Fig. 4 and so is omitted. Nevertheless, the option of constraining a subset of the parameters is likely to be a powerful diagnostic tool for complex, multi-dimensional problems, identifying the parameters contributing most to the tension revealed when the entire vector is constrained (cf. Seehars et al. 2015).
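The two-parameter tension test above can be sketched as follows (illustrative Python; grids, seed and the choice b/σ = 2 are assumptions). The prior in µ comes from the biased X₁ and the prior in κ is uniform (Eq. 22), and F_B is formed from the sharp constraints via Eq. (17) with ν₁ = 2, ν₂ = 98:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, sigma, b, kappa_true = 100, 1.0, 2.0, 1.0

d1 = rng.normal(0, sigma, n) + b               # X1: u_i = mu + sigma*z_i + b
x = rng.uniform(-1, 1, n)
d2 = rng.normal(0, sigma, n) + kappa_true * x  # X2: v_j = mu + kappa*x_j + sigma*z_j

mu = np.linspace(-3, 3, 601)
ka = np.linspace(-1, 3, 401)
MU, KA = np.meshgrid(mu, ka, indexing="ij")

# chi^2(mu, kappa | H2, D2) via the sufficient statistics of the linear model
Sd, Sdx = d2.sum(), (d2 * x).sum()
Sx, Sxx, Sdd = x.sum(), (x * x).sum(), (d2 * d2).sum()
c2 = (Sdd - 2 * MU * Sd - 2 * KA * Sdx
      + n * MU**2 + 2 * MU * KA * Sx + KA**2 * Sxx) / sigma**2

# Prior: p(mu | H1, D1) from a uniform-prior analysis of X1, uniform in kappa
c1 = ((d1[:, None] - mu[None, :]) ** 2).sum(axis=0) / sigma**2
prior = np.exp(-0.5 * (c1 - c1.min()))[:, None] * np.ones_like(KA)

def means(w):
    """Posterior means of (mu, kappa) for a 2-D weight array on the grid."""
    return (MU * w).sum() / w.sum(), (KA * w).sum() / w.sum()

like = np.exp(-0.5 * (c2 - c2.min()))
mu_pi, ka_pi = means(prior * like)   # sharp constraint from the fuzzy constraint
mu_u, ka_u = means(like)             # uniform-prior reference point

def chi2_at(m, k):
    return ((d2 - m - k * x) ** 2).sum() / sigma**2

i = j = 2
F_B = (n - i) / j * (chi2_at(mu_pi, ka_pi) - chi2_at(mu_u, ka_u)) / chi2_at(mu_u, ka_u)
p = stats.f.sf(F_B, j, n - i)        # compare with F_{2,98} critical values
print(p < 0.001)                     # strong tension for b/sigma = 2
```

At this bias level the biased prior in µ displaces the sharp constraint well away from the uniform-prior point, and the p-value falls far below 0.001, consistent with the behaviour shown in Fig. 4.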
8. Discussion and Conclusion
A legitimate question to ask about the statistical analysis ofdata acquired in a scientific experiment is: How well or howbadly does the model fit the data? Asking this question is,after all, just the last step in applying the scientific method.In a frequentist analysis, where the estimated parametervector α is typically the minimum- χ point, this questionis answered by reporting the goodness-of-fit statisic χ = χ ( α ) or, equivalently, the corresponding p -value. If a poorfit is thereby indicated, the investigator and the communityare aware that not only is the model called into questionbut so also is the estimate α and its confidence domain.If the same data is subject to a Bayesian analysis, thesame question must surely be asked: The Bayesian ap-proach does not exempt the investigator from the obliga-tion to report on the success or otherwise of the adoptedmodel. In this case, if the fit is poor, the Bayesian model iscalled into question and so also are inferences drawn fromthe posterior distribution p ( α | H, D ).As noted in Sect.1, the difficulty in testing Bayesianmodels is that there are no good fully Bayesian methodsfor assessing goodness-of-fit. Accordingly, in this paper, afrequentist approach is advocated. Specifically, h χ i π is pro-posed in Sect.3 as a suitable goodness-of-fit statisic forBayesian models. Under the null hypothesis that M and D are flawless and with the standard assumptions of lin-earity and normally-distributed errors, then, as argued inSect.4.1 and illustrated in Fig.1, h χ i π − k is approximatelydistributed as χ n − k , and so p -values can be computed.A p -value thus derived from h χ i π quantifies the averagegoodness-of-fit provided by the posterior distribution. 
In contrast, in a frequentist minimum-$\chi^2$ analysis, the $p$-value quantifies the goodness-of-fit provided by the point estimate $\hat{\alpha}$.

In the above, it is regarded as self-evident that astronomers want to adhere to the scientific method by always reporting the goodness-of-fit achieved in Bayesian analyses of observational data. However, Gelman & Shalizi (2013), in an essay on the philosophy and practice of Bayesian statistics, note that investigators who identify Bayesian inference with inductive inference regularly fit and compare models without checking them. They deplore this practice. Instead, these authors advocate the hypothetico-deductive approach, in which model checking is crucial. As in this paper, they discuss non-Bayesian checking of Bayesian models, specifically the derivation of $p$-values from posterior predictive distributions. Moreover, they also stress that the prior distribution is a testable part of a Bayesian model.

In the astronomical literature, the use of frequentist tests to validate Bayesian models is not unique to this paper. Recently, Harrison et al. (2015) have presented an ingenious procedure for validating multidimensional posterior distributions with the frequentist Kolmogorov-Smirnov (KS) test for one-dimensional data. Their aim, as here, is to test the entire Bayesian inference procedure.

Frequentist testing also arises in recent applications of the Kullback-Leibler divergence to quantify tension between cosmological probes (e.g. Seehars et al. 2015). For linear models, and with the assumption of Gaussian priors and likelihoods, a term in the relative entropy is a statistic that measures tension. With these assumptions, the statistic follows a generalized $\chi^2$ distribution, thus allowing a $p$-value to be computed.

Seehars et al. (2015) also investigate various purely Bayesian measures of tension.
They conclude that interpreting these measures on a fixed, problem-independent scale, e.g. the Jeffreys scale, can be misleading; see also Nesseris & García-Bellido (2013).

Acknowledgements. I thank D.J. Mortlock for comments on an early draft of this paper, and A.H. Jaffe, M.P. Hobson and the referee for useful references.
Appendix A: Evaluation of $\langle \chi^2 \rangle_{u}$

If $\hat{\alpha}$ denotes the minimum-$\chi^2$ solution, then

$\alpha = \hat{\alpha} + a$   (A.1)

where $a$ is the displacement from $\hat{\alpha}$. Then, on the assumption of linearity,

$\Delta \chi^2(a) \equiv \chi^2 - \hat{\chi}^2 = \sum_{i,j} A_{ij} \, a_i a_j$   (A.2)

where the $A_{ij}$ are the constant elements of the $k \times k$ curvature matrix (Press et al. 2007, p.680), where $k$ is the number of parameters. It follows that surfaces of constant $\chi^2$ are $k$-dimensional self-similar ellipsoids centered on $\hat{\alpha}$.

Now, given the second assumption of normally-distributed measurement errors, the likelihood

$L(\alpha) \propto \exp\left(-\frac{1}{2}\hat{\chi}^2\right) \times \exp\left(-\frac{1}{2}\Delta\chi^2\right)$   (A.3)

Thus, in the case of a uniform prior, the posterior mean of $\chi^2$ is

$\langle \chi^2 \rangle_{u} = \hat{\chi}^2 + \dfrac{\int \Delta\chi^2 \exp\left(-\frac{1}{2}\Delta\chi^2\right) dV_{\alpha}}{\int \exp\left(-\frac{1}{2}\Delta\chi^2\right) dV_{\alpha}}$   (A.4)

Because surfaces of constant $\Delta\chi^2$ are self-similar, the $k$-dimensional integrals in Eq. (A.4) reduce to 1-D integrals. Suppose $\Delta\chi^2 = \Delta\chi^2_{*}$ on the surface of the ellipsoid with volume $V_{*}$. If lengths are increased by the factor $\lambda$, then the new ellipsoid has

$\Delta\chi^2 = \Delta\chi^2_{*} \, \lambda^2 \quad \mbox{and} \quad V = V_{*} \, \lambda^{k}$   (A.5)

With these scaling relations, the integrals in Eq. (A.4) can be transformed into integrals over $\lambda$. The result is

$\langle \chi^2 \rangle_{u} = \hat{\chi}^2 + 2b \, \dfrac{\int_0^{\infty} \lambda^{k+1} \exp(-b\lambda^2) \, d\lambda}{\int_0^{\infty} \lambda^{k-1} \exp(-b\lambda^2) \, d\lambda}$   (A.6)

where $2b = \Delta\chi^2_{*}$. The integrals have now been transformed to a known definite integral,

$\int_0^{\infty} \lambda^{z} \exp(-b\lambda^2) \, d\lambda = \frac{1}{2} \, \Gamma(x) \, b^{-x}$   (A.7)

where $x = (z+1)/2$. Applying this formula, we obtain

$\langle \chi^2 \rangle_{u} = \hat{\chi}^2 + k$   (A.8)

an exact result under the stated assumptions.

References