When is tension just a fluctuation? How noisy data affect model comparison

A&A manuscript no. noisy_evidence, © ESO 2021 (February 19, 2021). Letter to the Editor.

B. Joachimi (1), F. Köhlinger (2), W. Handley (3, 4), and P. Lemos (1)
(1) Department of Physics and Astronomy, University College London, Gower Street, London WC1E 6BT, UK; e-mail: [email protected]
(2) Kavli IPMU (WPI), UTIAS, The University of Tokyo, Kashiwa, Chiba 277-8583, Japan
(3) Astrophysics Group, Cavendish Laboratory, J.J. Thomson Avenue, Cambridge CB3 0HE, UK
(4) Kavli Institute for Cosmology, Madingley Road, Cambridge CB3 0HA, UK

Received 2020 September 30; accepted 2021 February 10
ABSTRACT
Summary statistics of the likelihood, such as the Bayesian evidence, offer a principled way of comparing models and assessing tension between, or within, the results of physical experiments. Noisy realisations of the data induce scatter in these model comparison statistics. For a realistic case of cosmological inference from large-scale structure, we show that the logarithm of the Bayes factor attains scatter of order unity, increasing significantly with stronger tension between the models under comparison. We develop an approximate procedure that quantifies the sampling distribution of the evidence at small additional computational cost and apply it to real data to demonstrate the impact of the scatter, which acts to reduce the significance of any model discrepancies. Data compression is highlighted as a potential avenue to suppressing noise in the evidence to negligible levels, with a proof of concept on Planck cosmic microwave background data.
Key words.
Methods: data analysis – Methods: statistical – Cosmology: observations – cosmic background radiation – gravitational lensing: weak
1. Introduction
Binary decisions inevitably have to be made at the conclusion of a physical experiment. Has a feature been detected significantly? Which model describes the data better? Are data sets (or subsets thereof) consistent with each other, or are they in 'tension', a potential indicator of new physics not incorporated in the model? Traditionally, hypothesis tests were the statistical tools of choice to answer these questions. With the advent of high-performance computing, Bayesian techniques building on the Bayesian evidence have risen in popularity (e.g., Jaffe 1996; Kunz et al. 2006; Marshall et al. 2006; see Trotta 2008 for a review). In cosmology, discrepancies at the few-σ level between high-redshift observations (primarily the cosmic microwave background, CMB hereafter, as constrained most accurately by the Planck mission; Planck Collaboration et al. 2020a) and various probes of the low-redshift universe have recently emerged in measurements of the Hubble constant (e.g. Riess et al. 2016, 2018, 2019; Wong et al. 2020) and of the amplitude of matter density fluctuations (e.g. Joudaki et al. 2017, 2020; Abbott et al. 2018; Asgari et al. 2020; Heymans et al. 2020; cf. Verde et al. 2019 for an overview). This has spurred a flurry of work on approaches to quantifying tension and performing model comparison (e.g. Verde et al. 2013; Seehars et al. 2014; Lin & Ishak 2017; Charnock et al. 2017; Köhlinger et al. 2019; Handley & Lemos 2019b; Nicola et al. 2019; Adhikari & Huterer 2019; Raveri & Hu 2019).

These methods have in common that they infer a single scalar which is then compared against a pre-defined scale to judge significance. In Bayesian statistics the tension measure is conditioned on the observed data. The posterior probability of the parameters p of a model M_i, for measured data d, is

Pr(p | d, M_i) = Pr(d | p, M_i) Pr(p | M_i) / Z_i ,   (1)

where Pr(d | p, M_i) is the likelihood and Pr(p | M_i) the prior on the model parameters.
The Bayesian evidence, or marginal likelihood, is the normalisation

Z_i ≡ Pr(d | M_i) = ∫ d^m p Pr(d | p, M_i) Pr(p | M_i) ,   (2)

where m denotes the number of parameters. For a given data set, Z_i reduces to a non-stochastic scalar that attains larger values the more likely the realisation of the data is under model M_i, and the more predictive the model is (as a more flexible model could accommodate many possible forms of the data).

However, a physical experiment generally does not take the acquired data as a given, but rather interprets them as a stochastic realisation of an underlying truth that we wish to approximate by our model. A different realisation of the data leads to a different value of Z_i, which could alter our decision on tension or consistency. In this view, statistical uncertainty in the data turns the evidence (or any related tension and model comparison measure) into a noisy statistic. (Footnote: This viewpoint will require us to go beyond a purely Bayesian approach. However, hybrid Bayesian-frequentist methods are commonplace in statistics; see Good (1992) for an overview, and also Jenkins (2014).) Jenkins & Peacock (2011) argued, based on toy experiments and analytical arguments, that
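To make the marginal likelihood of Eq. (2) concrete, the following is a minimal numerical sketch for a hypothetical one-dimensional linear-Gaussian toy problem; the scales sigma_d, sigma_p and the datum d_obs are illustrative choices, not values from this paper. The grid integral reproduces the known analytic marginal likelihood.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1D toy (illustrative scales, not from the paper):
# likelihood d ~ N(p, sigma_d^2), prior p ~ N(0, sigma_p^2).
sigma_d, sigma_p, d_obs = 1.0, 2.0, 0.7

# Eq. (2) with m = 1: Z = integral dp Pr(d|p) Pr(p), here evaluated on a grid.
p_grid = np.linspace(-15.0, 15.0, 20001)
integrand = norm.pdf(d_obs, loc=p_grid, scale=sigma_d) * norm.pdf(p_grid, scale=sigma_p)
Z_numeric = np.trapz(integrand, p_grid)

# The linear-Gaussian case has an analytic marginal: d ~ N(0, sigma_d^2 + sigma_p^2).
Z_analytic = norm.pdf(d_obs, scale=np.hypot(sigma_d, sigma_p))
print(Z_numeric, Z_analytic)  # the two agree to high accuracy
```

In higher dimensions this integral is what nested sampling estimates; the toy merely illustrates that Z is a single number fixed by the observed data and the model.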
Fig. 1.
Illustration of model comparison via the evidence and of its associated scatter, for a one-dimensional data vector d and two models M_i, i = 1, 2, each with a single parameter p. The joint distributions Pr(d, p | M_i) are shown in blue/red shades. The projections of these distributions onto the parameter and data axes yield the prior Pr(p | M_i) and the evidence or marginal likelihood (shown in purple/orange for M_1 and M_2), respectively. An experiment produces the observation d shown as the black solid line. Conditioning on d yields the posterior Pr(p | d, M_i), shown in green and brown for M_1 and M_2. Evaluating the marginal likelihood at d yields Pr(d | M_i), which is used in model comparison. The dotted horizontal lines mark the 1σ interval of possible alternative realisations of the data given the best-fit parameter p_best (dashed vertical lines) of either model. The blue/red boxes show the resulting 1σ range of possible evidence values under M_1, which has higher evidence in this case (the corresponding ranges for M_2 are shown as hatched areas).

the thus-inherited statistical uncertainty in Z_i is substantial. Ignoring this scatter will therefore lead to over-confident or incorrect decisions in model comparison.

In this work we quantify the scatter in the Bayesian evidence and some of its derived tension/model comparison statistics, affirming the findings of Jenkins & Peacock (2011) in a realistic cosmological experiment. We devise a computationally efficient procedure to calculate statistical errors on the evidence, apply it to an analysis of internal consistency in Kilo Degree Survey (KiDS) weak lensing data, and explore the impact of data compression on evidence scatter for the example of Planck CMB data.
2. Noisy model comparison
Figure 1 illustrates the notion of the evidence and its associated scatter using a Gaussian toy model that is one-dimensional in both data and parameter space. It builds on Fig. 28.6 of MacKay (2003). While at the observed data Model 1 has higher evidence in this example, it is not unambiguously superior, because alternative realisations of the data under the more probable Model 1 could result in equal or reversed evidences of Models 1 and 2 instead (see the boxes in blue and red shading). We seek to quantify this statistical uncertainty of the evidence. See also Appendix A for a closed-form analytic calculation in the Gaussian case, analogous to Fig. 1. (Footnote: For ease of illustration, the toy model considers the likelihood of the data conditioned on the best-fit parameter p_best; in our implementation we take the full posterior into account when drawing new data realisations.)

The standard statistic to compare two models i and j is the Bayes factor (see Kass & Raftery 1995 for a review),

R_ij ≡ Pr(M_i | d) / Pr(M_j | d) = [Z_i Pr(M_i)] / [Z_j Pr(M_j)] = Z_i / Z_j ,   (3)

which, for equal prior probabilities of the models themselves, is given by the ratio of the model evidences. R_ij has the intuitive interpretation of betting odds in favour of model i over j. Let us assume that we know the true underlying model M_true, including its parameters p_true, that generates the data we observe; it need not coincide with either M_i or M_j.
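The scatter of the Bayes factor under repeated data realisations can be sketched with a toy pair of nested Gaussian models whose evidences are analytic. The setup below (sample size, noise level, prior width) is a hypothetical illustration, not the KiDS analysis discussed later; it shows ln R fluctuating by order unity even when the simpler model is true.

```python
import numpy as np
rng = np.random.default_rng(1)

# Hypothetical nested toy (not the KiDS setup): n data points with common mean mu
# and Gaussian noise sigma. Model 0 fixes mu = 0; Model 1 has prior mu ~ N(0, tau^2).
# Both evidences are analytic for this linear-Gaussian case.
n, sigma, tau = 20, 1.0, 1.0

def log_evidences(d):
    lnZ0 = -0.5 * np.sum(d**2) / sigma**2 - 0.5 * n * np.log(2 * np.pi * sigma**2)
    # Marginalising over mu leaves d ~ N(0, sigma^2 I + tau^2 11^T); only the
    # sufficient statistic s = sum(d) enters the correction to lnZ0.
    s = d.sum()
    lnZ1 = (lnZ0 + 0.5 * np.log(sigma**2 / (sigma**2 + n * tau**2))
            + tau**2 * s**2 / (2 * sigma**2 * (sigma**2 + n * tau**2)))
    return lnZ0, lnZ1

# Draw data from the true model (here mu = 0) and compute the Bayes factor,
# Eq. (3), for each realisation.
lnR = np.array([np.subtract(*log_evidences(rng.normal(0.0, sigma, n)))
                for _ in range(2000)])
print(lnR.std())  # scatter of order unity, even though Model 0 is true
```

The scatter arises solely because different noise realisations move the sufficient statistic, exactly the mechanism quantified for the realistic case below.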
Then the probability density of the Bayes factor is given by

Pr(R_ij | M_true) = ∫ d^n d′ Pr(R_ij | d′) Pr(d′ | M_true) ,   (4)

where n is the dimension of the data vector, Pr(d′ | M_true) is the true likelihood of the data, and Pr(R_ij | d′) the distribution of R_ij for a given data set, which we shall assume to be deterministic. Hence, if the true likelihood is known, we can proceed as follows to create a distribution of R_ij: (i) generate samples of the data from the true likelihood; (ii) for each sample calculate the Bayes factor according to Eqs. (2) and (3).

As a realistic example we choose a recent cosmological analysis of tomographic weak lensing measurements by the KiDS survey (KiDS-450; Kuijken et al. 2015; Hildebrandt et al. 2017). We work with a simulated data vector that, like the real data, has size n = 130 and depends in a highly non-linear way on 7 model parameters (5 cosmological parameters of a flat ΛCDM model, plus two parameters describing astrophysical effects on the observables). It is assumed that the data have a Gaussian likelihood
Fig. 2.
Joint distribution of the evidences calculated under two models (the mock KiDS cosmology analysis for a joint [Model 0] and a split [Model 1] data vector). The arrow indicates the direction of increasing Bayes factor, ln R, with the red line marking the true value of ln R. Black (blue) points correspond to the inferences from 100 noise realisations of the mock data vector, evaluated with the full nested sampling analysis (the χ² approximation of Sect. 3). Contours show the Gaussian kernel density approximation of the distribution based on the black points.

with a known and fixed covariance. To perform an internal consistency test, we create two copies of the parameter set and assign one copy to the elements of the data vector dependent on tomographic bin no. 3, and the other copy to the remaining elements. The model comparison is then between the analysis with the original set of model parameters (Model 0) and that with the doubled parameter set (Model 1). For details of the methodology and analysis, see Köhlinger et al. (2019) and Appendix B.

We generate 100 realisations of the data vector from the true likelihood, evaluated at a fiducial choice of the parameters. For each simulated data vector we repeat a full nested sampling analysis of both models (0 and 1) and infer the evidences (see Appendix B for an assessment of the robustness of the sampling algorithms). By default we do not introduce any systematic shift into our simulated data, so that strong concordance is expected as the outcome of the tension analysis.

The resulting distribution of evidences is shown in Fig. 2. We compute the true value of the Bayes factor by re-running the analyses for a noise-free data vector generated for the fiducial parameter values. The two evidence distributions are each consistent with being lognormal, each with a standard deviation in the log of 7.9.
The evidences are strongly correlated (Pearson correlation coefficient 0.99), which is plausible as the scatter derives from the same noisy data realisation, with both models yielding good fits. Due to the strong correlation, the distribution of the Bayes factor is narrower, with σ(ln R) ≈ 1.25; see Fig. 3 for its distribution. We also observe skewness in ln R (already visible in Fig. 2), which causes the mean to lie somewhat below the true value. We do not find evidence that the skewness is due to numerical issues, so ascribe it to the non-linearity of the models, which means this feature will be strongly dependent on the details of the analysis. A value of σ(ln R) of order unity agrees with the expectation of Jenkins & Peacock (2011), though they predicted a normal distribution for ln R (see also Appendix A). (Footnote: For a Gaussian likelihood, ln Z is expected to be χ²-distributed.)

Fig. 3. Probability density of the Bayes factor, ln R, shifted to have its true value at 0. The black (blue) curve is the distribution of Δln R extracted from the full nested sampling analysis (the χ² approximation of Sect. 3). The green curve corresponds to the highly discrepant case of one of the four tomographic redshift bins being shifted in mean redshift (see Sect. 2).

By design, the Bayes factor depends on the parameter prior, which can be a hindrance for tension assessment, as demonstrated by Handley & Lemos (2019b). They proposed a modified statistic called the suspiciousness, defined as ln S_ij ≡ ln R_ij + D_KL,i − D_KL,j, where

D_KL,i = ∫ d^m p Pr(p | d, M_i) ln [ Pr(p | d, M_i) / Pr(p | M_i) ]   (5)

is the Kullback-Leibler (KL) divergence (Kullback & Leibler 1951; see Lemos et al. 2020 for a generalisation to correlated data sets, which we consider here). This combination of evidence and KL divergence is independent of the prior widths, provided they do not impinge on the bulk of the posterior, and may be calibrated into a traditional 'σ tension' value.

Again assuming a Gaussian likelihood, ln S_ij is χ²-distributed, with the degrees of freedom given by the difference in the effective dimension of the parameter space of the two models. Handley & Lemos (2019a) propose to calculate this effective dimension as

m_eff,i = 2 { ⟨ [ln Pr(d | p, M_i)]² ⟩_p − ⟨ ln Pr(d | p, M_i) ⟩_p² } ,   (6)

i.e. twice the variance of the log-likelihood evaluated over the posterior distribution (indicated by the subscript 'p').

We extract ln S_ij from the output of our nested sampling analysis and determine the scatter of m_eff from the sub-sample variance computed on a posterior sample. The standard deviations of ln R and ln S agree to better than 10%; see the following section for an argument why the distributions of R and S are expected to be very similar.
The standard deviation of m_eff is of order 10% and can be treated as uncorrelated with S (correlation coefficient −0.17).

There are two obstacles to using the approach above on real data: (i) repeating full likelihood analyses, including evidence calculations, many times to build a sample is prohibitively expensive, and (ii) we do not know the true likelihood from which to generate samples of the real data. We will address both points in Sects. 3 and 4.
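The effective-dimension estimator of Eq. (6) can be sketched in a few lines. The example below assumes a Gaussian likelihood with flat priors, so that the posterior is itself Gaussian and m_eff should recover the parameter count m; the identity Fisher matrix is a hypothetical simplification, not the KiDS posterior.

```python
import numpy as np
rng = np.random.default_rng(2)

# Sketch of Eq. (6): m_eff = 2 * Var_posterior[ln likelihood].
# For a Gaussian likelihood with flat priors, the posterior is N(0, F^{-1})
# and ln L = const - chi2/2 with chi2 ~ chi2_m, so m_eff should recover m.
m = 7                     # matches the 7-parameter KiDS example
F = np.eye(m)             # hypothetical Fisher matrix (identity for simplicity)
samples = rng.multivariate_normal(np.zeros(m), np.linalg.inv(F), size=200_000)
lnL = -0.5 * np.einsum('ij,jk,ik->i', samples, F, samples)  # log-likelihood up to a constant
m_eff = 2.0 * lnL.var()
print(m_eff)  # ≈ 7
```

In a real analysis the posterior samples would come from the nested sampling chains, and the additive constant in ln L drops out of the variance.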
To investigate a case of strong tension, we insert a large shift in the mean redshift of tomographic bin no. 3. In this case the alternative Model 1 is clearly preferred (ln R is strongly negative), and the broadening of the distribution of ln R is dramatic, as can be seen from Fig. 3. While the skewness and the corresponding discrepancy between mean and true value persist, the standard deviation increases to 7.3, spanning more than three orders of magnitude in odds.

This result is driven by an increase in the scatter of the evidences (to 10.8 and 8.5 for Models 0 and 1, respectively) and in particular by their partial decorrelation (correlation coefficient 0.74). As shown by Jenkins & Peacock (2011), the scatter in ln R_ij is to good approximation proportional to the typical difference between the model predictions at the respective best-fit parameters of the models under comparison (as also shown in Appendix A). This difference is small in our concordant case with nested models, deviating from zero only through scatter in the data; in the shifted-redshift case it becomes large, driving the increased scatter.
3. A fast approximate algorithm
Let us consider the Laplace approximation of the log-likelihood (we drop the explicit dependence on the model for simplicity),

ln Pr(d | p) ≈ ln Pr(d | p₀) − (1/2) (p − p₀)^τ F(p₀) (p − p₀) ,   (7)

where we expanded around the maximum of the log-likelihood at p₀ and introduced the Fisher matrix

F_αβ = − ⟨ ∂² ln Pr(d | p) / (∂p_α ∂p_β) |_{p₀} ⟩ ,   (8)

the negative expectation of the Hessian of the log-likelihood at p₀. With this approximation the evidence reads

Z ≈ (2π)^{m/2} Pr(d | p₀) / [ √(det F(p₀)) V_prior ] ,   (9)

where we additionally assumed that the prior is uninformative, i.e. the bulk of the likelihood lies well within the volume covered by the prior, denoted by V_prior. Considering a Gaussian likelihood, so that Pr(d | p) ∝ e^{−χ²(p)/2}, one finds (cf. Handley & Lemos 2019b)

ln Z ≈ const. − ln V_prior − (1/2) ln det F(p₀) − (1/2) χ²(p₀) ;   (10)
D_KL ≈ ln V_prior − (m/2)(1 + ln 2π) + (1/2) ln det F(p₀) .   (11)

(Footnote: Note that we employ the fast approximate algorithm detailed in Sect. 3. Footnote: We thank our referee for pointing out that this assumption has a more principled grounding in that it maximises the entropy in the absence of further information on the form of the likelihood.)

We see that the only source of scatter is the best-fit parameter set p₀, which varies with the noise realisation of the data. If we further assume that the curvature of the likelihood does not vary strongly as the best-fit position moves, only the last term in Eq. (10) is relevant for the statistical uncertainty in ln Z, while D_KL is robust to the scatter. Since ln Z + D_KL ≈ const. − χ²(p₀)/2, S has identical noise properties to ln Z under these assumptions.

Equipped with these considerations, we propose the following algorithm: (i) perform a single full likelihood analysis and determine fiducial evidence values, Z_fid; (ii) generate samples of the data from the likelihood; (iii) for each sample determine the maximum of the likelihood or, equivalently, the minimum χ²; (iv) derive samples of the evidence via

ln Z_approx,i := ln Z_fid − (1/2) ( χ²₀,i − χ²₀,fid ) .   (12)

Following this procedure with 100 samples results in the blue points shown in Fig. 2 and the blue distribution in Fig. 3. Apart from sampling noise in the tail, we recover the true distribution well, with mean and variance in agreement to within ∼10%. The change from a full exploration of the posterior, which typically runs in hours to days, to a maximisation of the likelihood, which will usually take minutes to hours, makes exploring the noise properties of the Bayesian evidence and its derived quantities feasible.
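Steps (i)-(iv) can be sketched on a hypothetical non-linear toy model. The exponential model, noise level, and placeholder value of lnZ_fid below are illustrative assumptions, not the KiDS pipeline; the point is the mapping from minimum-χ² samples to evidence samples via Eq. (12).

```python
import numpy as np
from scipy.optimize import minimize
rng = np.random.default_rng(3)

# Hypothetical non-linear toy model (illustrative, not the KiDS pipeline):
# d_i = a * exp(b * x_i) + Gaussian noise with known sigma.
x = np.linspace(0.0, 1.0, 30)
sigma = 0.05
p_fid = np.array([1.0, 0.5])
model = lambda p: p[0] * np.exp(p[1] * x)
chi2 = lambda p, d: np.sum((d - model(p))**2) / sigma**2

# (i) A single full analysis would supply lnZ_fid; a placeholder value here.
lnZ_fid = 0.0
d_fid = model(p_fid)
chi2_fid = minimize(chi2, p_fid, args=(d_fid,)).fun  # ~0 for noise-free data

# (ii)-(iv): draw data realisations, minimise chi2, map to evidence samples, Eq. (12).
lnZ_samples = np.array([
    lnZ_fid - 0.5 * (minimize(chi2, p_fid,
                              args=(d_fid + rng.normal(0.0, sigma, x.size),)).fun
                     - chi2_fid)
    for _ in range(200)])
print(lnZ_samples.std())  # close to sqrt((n - m)/2) = sqrt(14) for this near-linear model
```

Each evidence sample costs one χ² minimisation instead of one full nested sampling run, which is the source of the speed-up quoted above.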
4. Evidence samples from real data
When analysing real data, the true likelihood Pr(d′ | M_true) in Eq. (4), from which to generate new copies of the data vector, is unavailable. Our best guess for this truth is the best-fitting model, which itself carries uncertainty as it is inferred from the data. In this case we can make use of the posterior predictive distribution (PPD; Gelman et al. 1996), Pr(d′ | d, M_k), which yields new samples of the data d′ for a given observation d, assuming model M_k (see Trotta 2007 for a very similar application of the PPD). Averaging over all models using the posterior model probabilities Pr(M_k | d) from Eq. (3) then yields

Pr(d′ | M_true) ≈ Pr(d′ | d) = Σ_k Pr(d′ | d, M_k) Pr(M_k | d) .   (13)

The algorithm presented in Sect. 3 therefore only needs to be adjusted in Step (ii): instead of generating data realisations from the true likelihood as in the mock scenario, these are now produced from the PPD by randomly selecting a subset of posterior samples and evaluating the likelihood at the parameter values corresponding to these samples.

In practice, we simplify the approach by choosing the model that yields the higher evidence to produce the PPD samples, rather than performing full model averaging. If both models have similar evidences, the choice should have little impact; if the evidence ratio is large, the model with higher evidence is more accurate and/or more predictive (cf. the solid and hatched regions in Fig. 1).
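The adjusted Step (ii) can be sketched as follows, assuming a Gaussian likelihood with covariance cov and a generic model function; posterior_samples, model, and cov are assumed inputs from a completed analysis, and the one-parameter demonstration at the end is purely illustrative.

```python
import numpy as np
rng = np.random.default_rng(4)

# Sketch of PPD draws, Eq. (13), for a Gaussian likelihood: pick random
# posterior samples and generate new data at their parameter values.
def ppd_draws(posterior_samples, model, cov, n_draws, rng):
    L = np.linalg.cholesky(cov)                 # for correlated Gaussian noise
    idx = rng.integers(0, len(posterior_samples), size=n_draws)
    draws = []
    for i in idx:
        mean = model(posterior_samples[i])      # model prediction at this sample
        draws.append(mean + L @ rng.normal(size=cov.shape[0]))
    return np.array(draws)

# Tiny demonstration with a one-parameter linear toy model:
post = rng.normal(1.0, 0.1, size=(1000, 1))     # stand-in posterior samples
cov = 0.01 * np.eye(5)                          # hypothetical data covariance
d_new = ppd_draws(post, lambda p: p[0] * np.ones(5), cov, 100, rng)
print(d_new.shape)  # (100, 5)
```

Marginalising over the posterior samples in this way propagates parameter uncertainty into the new data realisations, which a draw from the best-fit likelihood alone would miss.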
5. Application to KiDS-450 internal consistency
We now insert the real KiDS-450 data vector into our analysis and generate PPD samples from the joint Model 0, as this yields a significantly higher evidence than the split model. (Footnote: In practice, we obtain the minimum χ² within the wide prior ranges of the parameters.)

Fig. 4. Tension statistics for the case of KiDS-450 internal consistency with respect to tomographic bin 3. Shown are the Bayes factor ln R with odds ratios, and the suspiciousness ln S with tension significance in multiples of the width of an equivalent Gaussian, σ. The smaller, red/black error bars are the errors associated with the nested sampling, while the larger, orange/grey error bars are the statistical errors derived in this work. Red bands show the statistical uncertainty in determining the σ-levels for ln S.

The derived standard deviations of ln R and ln S are shown in Fig. 4. These statistical errors far exceed the typically quoted 'method' errors, which derive from the finite sampling of the posterior. The interpretation of the suspiciousness acquires an additional, albeit smaller, source of error through the effective model dimension m_eff that determines the σ-levels.

The noise in the tension statistics leads to a more conservative evaluation of discrepancies in the data. While the point estimate suggests mild 'tension' at just above the 1σ level, the significance is reduced if we require that all but 16% (i.e. the one-sided tail beyond 1σ of a normal distribution) of possible realisations of the data be discrepant by at least that level. Visually, this corresponds to the upper 1σ error of ln S almost touching the lower limit of the 1σ band in Fig. 4.
6. Benefits of data compression
Planck CMB data are at the centre of both current major tension controversies in cosmology. A practical obstacle to applying our formalism is the complexity of the Planck temperature likelihood, which is assumed to be Gaussian only for ℓ > 30 and builds on pixelised sky maps on larger scales (Planck Collaboration et al. 2020b). This makes drawing PPD samples challenging. However, Prince & Dunkley (2019) recently showed that the low-ℓ likelihood can be efficiently compressed into two Gaussian-distributed band powers. They proceeded to apply maximal, linear compression (using the MOPED scheme; Tegmark et al. 1997; Heavens et al. 2000) to the full temperature likelihood and demonstrated it to be nearly lossless. This is not unexpected, since the cosmological sampling parameters in CMB analyses are chosen to be close to linear and Gaussian-distributed (Kosowsky et al. 2002).

There is an additional motivation to apply data compression: it can suppress scatter in the Bayesian evidence. Under the assumptions of Sect. 3, the statistical properties of ln Z are driven by the distribution of χ²(p₀), i.e. the minimum χ² (cf. Eq. 10). If the data are approximately Gaussian and well fitted by a model whose parameters are close to linear, χ²(p₀) follows a χ²-distribution with N_dof = n − m degrees of freedom, so that Var(ln Z) = N_dof / 2. Data compression decreases n and can yield N_dof ≈ 0, as a model with n linear parameters perfectly fits n compressed data points. Appendix A demonstrates this explicitly for the Gaussian case.

This may seem paradoxical: compression can at best preserve information, so how can it facilitate a more precise determination of the evidence? In the context of Fig. 1, compression reduces the scatter between the model parameter and the data, so that for a given parameter the data vary little and thus the evidence is known precisely. Conversely, a broad likelihood and/or a high-dimensional data vector lead to large variations in possible realisations of the data. While this has no bearing on the posterior, and therefore on the information content, it increases the probability that a certain level of tension or model preference is owed to a particularly (un)lucky noise realisation of the data vector and does not reflect a physical trend.

As a proof of concept, we adopt the Prince & Dunkley (2019) approach using the provided software and compress the Planck temperature (TT) power spectra into the six cosmological parameters of a spatially flat ΛCDM model (nuisance parameters are marginalised over pre-compression). We then determine the χ² for the compressed real data, as well as for new data realisations generated from the compressed likelihood. We find an extremely small χ² for the real data and similarly small values for the noise realisations, with a correspondingly tiny standard deviation. Hence, practically noise-free evidence measurements from Planck are indeed possible.
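The scaling Var(ln Z) ≈ N_dof/2 = (n − m)/2, and its suppression by compression to n ≈ m, can be checked with a linear-Gaussian Monte Carlo sketch. The random design matrix and unit noise covariance below are hypothetical stand-ins, not the Planck likelihood.

```python
import numpy as np
rng = np.random.default_rng(5)

# Monte Carlo check of Var(ln Z) ≈ (n - m)/2 for a linear model with unit
# Gaussian noise; the random design matrix A is a hypothetical stand-in.
def lnZ_variance(n, m, n_real=3000):
    A = rng.normal(size=(n, m))
    hat = A @ np.linalg.solve(A.T @ A, A.T)    # least-squares projection matrix
    resid = np.eye(n) - hat
    # chi2_min of each noise realisation follows a chi2 with n - m dof
    chi2_min = np.array([np.sum((resid @ rng.normal(size=n))**2)
                         for _ in range(n_real)])
    return np.var(-0.5 * chi2_min)             # variance of the chi2 term in Eq. (10)

v_full = lnZ_variance(130, 7)   # ≈ (130 - 7)/2 = 61.5: large evidence scatter
v_comp = lnZ_variance(8, 7)     # ≈ 0.5: compression to n ≈ m suppresses it
print(v_full, v_comp)
```

The first call mimics the uncompressed KiDS-like dimensions; the second shows how shrinking the data vector towards the parameter count removes nearly all of the evidence scatter.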
7. Conclusions
We studied the impact on model comparison statistics of basing them on the ensemble of possible observations, rather than on a single observed realisation of the data. In this setting they become noisy quantities, which affects binary decisions on signal detection, model selection, or tension between experiments. Confirming earlier analytic arguments, we found standard deviations of order unity for the logarithm of the Bayes factor and for the suspiciousness statistic, with substantially broader distributions in the case of strong discrepancies between the models under comparison. We expect these conclusions to apply to most, possibly all, informative tension metrics available in the literature, as they typically depend on the maximum likelihood or χ²-like expressions.

We proposed a method to approximate the probability distribution of the evidence via repeated draws of mock data from the likelihood, obtaining the maximum likelihood for each mock data set; this adds negligible computation time to a full exploration of the posterior distribution. Conclusions drawn from noisy model comparison measures inevitably become more conservative; for example, the tension significance according to the suspiciousness for an internal consistency analysis of KiDS weak lensing data is reduced once scatter is accounted for. While in this application the two models under comparison were nested, our formalism and conclusions also hold for the more general case in which the parameter spaces differ.

Finally, we demonstrated that data compression suppresses the impact of noisy data on the evidence, in the case of Planck CMB constraints to negligible levels. In light of this, the following pre-processing steps are beneficial before any form of model comparison: (i) compress the data vector as much as possible, as long as the compression is essentially lossless; and (ii) choose a parametrisation such that the model is close to linear in the parameters (see e.g. Schuhmann et al. 2016), which increases the chances of achieving a near-perfect fit for any noise realisation of the data. (Footnote: https://github.com/heatherprince/planck-lite-py)
Acknowledgements.
We thank C. Jenkins, J. Peacock, and R. Trotta for insightful discussions. We are also grateful to J. Peacock for a thorough review of the manuscript, and to C. Heymans for alerting us to the work of Jenkins and Peacock. BJ acknowledges the kind hospitality of IPMU, where part of this work was carried out. FK acknowledges support from the World Premier International Research Center Initiative (WPI), MEXT, Japan. PL acknowledges STFC Consolidated Grant ST/R000476/

References
Abbott, T. M. C., Abdalla, F. B., Alarcon, A., et al. 2018, Phys. Rev. D, 98, 043526
Adhikari, S. & Huterer, D. 2019, J. Cosmology Astropart. Phys., 2019, 036
Asgari, M., Tröster, T., Heymans, C., et al. 2020, A&A, 634, A127
Audren, B., Lesgourgues, J., Benabed, K., & Prunet, S. 2013, J. Cosmology Astropart. Phys., 2013, 001
Brinckmann, T. & Lesgourgues, J. 2018, arXiv e-prints [arXiv:1804.07261]
Charnock, T., Battye, R. A., & Moss, A. 2017, Phys. Rev. D, 95, 123535
Feroz, F. & Hobson, M. P. 2008, MNRAS, 384, 449
Feroz, F., Hobson, M. P., & Bridges, M. 2009, MNRAS, 398, 1601
Feroz, F., Hobson, M. P., Cameron, E., & Pettitt, A. N. 2013, arXiv e-prints [arXiv:1306.2144]
Gelman, A., Meng, X.-L., & Stern, H. 1996, Statistica Sinica, 6, 733
Good, I. J. 1992, J. Am. Stat. Assoc., 87, 597
Handley, W. & Lemos, P. 2019a, Phys. Rev. D, 100, 023512
Handley, W. & Lemos, P. 2019b, Phys. Rev. D, 100, 043504
Handley, W. J., Hobson, M. P., & Lasenby, A. N. 2015a, MNRAS, 450, L61
Handley, W. J., Hobson, M. P., & Lasenby, A. N. 2015b, MNRAS, 453, 4384
Heavens, A. F., Jimenez, R., & Lahav, O. 2000, MNRAS, 317, 965
Heavens, A. F., Kitching, T. D., & Verde, L. 2007, MNRAS, 380, 1029
Heymans, C., Tröster, T., Asgari, M., et al. 2020, arXiv e-prints, arXiv:2007.15632
Hildebrandt, H., Viola, M., Heymans, C., et al. 2017, MNRAS, 465, 1454
Jaffe, A. 1996, ApJ, 471, 24
Jenkins, C. 2014, in AIP Conf. Ser., Vol. 1636, Bayesian Inference and Maximum Entropy Methods in Science and Engineering, 106-111
Jenkins, C. R. & Peacock, J. A. 2011, MNRAS, 413, 2895
Joudaki, S., Blake, C., Heymans, C., et al. 2017, MNRAS, 465, 2033
Joudaki, S., Hildebrandt, H., Traykova, D., et al. 2020, A&A, 638, L1
Kass, R. E. & Raftery, A. E. 1995, J. Am. Stat. Assoc., 90, 773
Köhlinger, F., Joachimi, B., Asgari, M., et al. 2019, MNRAS, 484, 3126
Kosowsky, A., Milosavljevic, M., & Jimenez, R. 2002, Phys. Rev. D, 66, 063007
Kuijken, K., Heymans, C., Hildebrandt, H., et al. 2015, MNRAS, 454, 3500
Kullback, S. & Leibler, R. A. 1951, Ann. Math. Statist., 22, 79
Kunz, M., Trotta, R., & Parkinson, D. R. 2006, Phys. Rev. D, 74, 023503
Lazarides, G., Ruiz de Austri, R., & Trotta, R. 2004, Phys. Rev. D, 70, 123527
Lemos, P., Köhlinger, F., Handley, W., et al. 2020, MNRAS, 496, 4647
Lin, W. & Ishak, M. 2017, Phys. Rev. D, 96, 023532
MacKay, D. J. C. 2003, Information Theory, Inference, and Learning Algorithms (Cambridge University Press)
Marshall, P., Rajguru, N., & Slosar, A. 2006, Phys. Rev. D, 73, 067302
Nicola, A., Amara, A., & Refregier, A. 2019, J. Cosmology Astropart. Phys., 2019, 011
Petersen, K. B. & Pedersen, M. S. 2012, The Matrix Cookbook, version 20121115
Planck Collaboration, Aghanim, N., Akrami, Y., et al. 2020a, A&A, 641, A6
Planck Collaboration, Aghanim, N., Akrami, Y., et al. 2020b, A&A, 641, A5
Prince, H. & Dunkley, J. 2019, Phys. Rev. D, 100, 083502
Raveri, M. & Hu, W. 2019, Phys. Rev. D, 99, 043506
Riess, A. G., Casertano, S., Yuan, W., et al. 2018, ApJ, 861, 126
Riess, A. G., Casertano, S., Yuan, W., Macri, L. M., & Scolnic, D. 2019, ApJ, 876, 85
Riess, A. G., Macri, L. M., Hoffmann, S. L., et al. 2016, ApJ, 826, 56
Schuhmann, R. L., Joachimi, B., & Peiris, H. V. 2016, MNRAS, 459, 1916
Seehars, S., Amara, A., Refregier, A., Paranjape, A., & Akeret, J. 2014, Phys. Rev. D, 90, 023533
Skilling, J. 2006, Bayesian Anal., 1, 833
Tegmark, M., Taylor, A. N., & Heavens, A. F. 1997, ApJ, 480, 22
Trotta, R. 2007, MNRAS, 378, 819
Trotta, R. 2008, Contemporary Physics, 49, 71
Verde, L., Protopapas, P., & Jimenez, R. 2013, Physics of the Dark Universe, 2, 166
Verde, L., Treu, T., & Riess, A. G. 2019, Nature Astronomy, 3, 891
Wong, K. C., Suyu, S. H., Chen, G. C. F., et al. 2020, MNRAS, 498, 1420
Appendix A: The complete Gaussian case
Assume that an experiment produces a single observation of $n$ data points drawn from a Gaussian distribution about some true mean $\bar{d}$ with covariance $C$ (equivalently, one could consider a data vector that has been averaged over multiple observations),
\[
\ln \Pr(d) = -\tfrac{1}{2} \ln |2\pi C| - \tfrac{1}{2} \, (d - \bar{d})^\tau C^{-1} (d - \bar{d}) \; . \tag{A.1}
\]
In general we do not know $\bar{d}$, but design a model $M$ which parameterises the data by some function $f(p)$, with $m \ll n$ parameters, in the hope that the true data are well approximated by our model. Assuming a given model, the likelihood becomes
\[
\ln \Pr(d \,|\, p, M) = -\tfrac{1}{2} \ln |2\pi C| - \tfrac{1}{2} \left[ d - f(p) \right]^\tau C^{-1} \left[ d - f(p) \right] \; . \tag{A.2}
\]
Often a likelihood may be approximated as a Gaussian in the parameter space,
\[
\ln \Pr(d \,|\, p, M) = \ln \mathcal{L}_{\rm max} - \tfrac{1}{2} \, (p - \mu)^\tau \Sigma^{-1} (p - \mu) \; , \tag{A.3}
\]
with mean $\mu$ and covariance $\Sigma$, and the corresponding log-evidence reads
\[
\ln Z \equiv \ln \Pr(d \,|\, M) = \ln \mathcal{L}_{\rm max} + \ln \frac{\sqrt{|2\pi \Sigma|}}{V_{\rm prior}} \; , \tag{A.4}
\]
where $V_{\rm prior}$ is the volume of a uniform prior fully encompassing the posterior. One can make the link between Eqs. (A.2) and (A.3) explicit by assuming that we can model our function $f$ as linear in the region of parameter space around $p_*$ where the likelihood is significantly non-zero,
\[
f(p) \approx f(p_*) + \nabla f(p_*) \, (p - p_*) =: \hat{d} + J \, (p - p_*) \; , \tag{A.5}
\]
from which one can identify
\[
\ln \mathcal{L}_{\rm max} = -\tfrac{1}{2} \ln |2\pi C| - \tfrac{1}{2} \, (d - \hat{d})^\tau \tilde{C}^{-1} (d - \hat{d}) \; ; \tag{A.6}
\]
\[
\Sigma^{-1} = J^\tau C^{-1} J \; ; \qquad \mu = p_* + \Sigma J^\tau C^{-1} (d - \hat{d}) \; , \tag{A.7}
\]
where we defined
\[
\tilde{C}^{-1} := C^{-1} - C^{-1} J \Sigma J^\tau C^{-1} \; . \tag{A.8}
\]
As an aside, Eq. (A.7) shows that noisy data realisations affect the posterior mean but not its covariance. In other words, while the posterior shape is unaffected, the distribution moves as a whole in parameter space with different realisations of the data.

From the above expressions we can immediately see that the evidence is quite a noisy statistic, driven by the second term in Eq. (A.6). Taking the variance of Eq. (A.4) after inserting Eq. (A.6), and assuming that $d$ follows the distribution of Eq. (A.1), yields
\[
{\rm Var}(\ln Z) = \tfrac{1}{2} \, {\rm Tr}\!\left[ \left( \tilde{C}^{-1} C \right)^2 \right] + (\bar{d} - \hat{d})^\tau \tilde{C}^{-1} C \tilde{C}^{-1} (\bar{d} - \hat{d}) \; . \tag{A.9}
\]
The first term here is equal to $(n - m)/2$, and hence the variance in the raw evidence is large for $n \gg m$, even in the event of a good fit to the data (i.e. $\hat{d} \approx \bar{d}$). We also see that in the case of heavily compressed data, $n \sim m$, the evidence scatter reduces considerably.

To derive the expression (A.9) above, and some of the following ones, it is helpful to note that for a Gaussian-distributed variable $x$ with covariance $C$, centered on zero (Petersen & Pedersen 2012),
\[
\langle (x - a)^\tau A \, (x - a) \rangle = {\rm Tr}[A C] + a^\tau A \, a \; ; \qquad
{\rm Cov}\!\left[ (x - a)^\tau A \, (x - a), \; (x - b)^\tau B \, (x - b) \right] = 2 \, {\rm Tr}[A C B C] + 4 \, b^\tau B C A \, a \; , \tag{A.10}
\]
where $A$ and $B$ are symmetric matrices, and $a$ and $b$ are arbitrary, non-stochastic vectors.

We are of course really interested in how model comparison (i.e. a difference in evidence) scatters with noisy data, so we introduce two models, with $\hat{d}_1$, $\hat{d}_2$, $J_1$, and $J_2$, respectively, and ask what the variance in their evidence difference is. Under the true distribution of Eq. (A.1), we find that the log Bayes factor (under the same assumptions as in Eq. 3), $\ln R = \ln Z_1 - \ln Z_2$, has mean
\[
\langle \ln R \rangle = \tfrac{1}{2} \, (\bar{d} - \hat{d}_2)^\tau \tilde{C}_2^{-1} (\bar{d} - \hat{d}_2)
- \tfrac{1}{2} \, (\bar{d} - \hat{d}_1)^\tau \tilde{C}_1^{-1} (\bar{d} - \hat{d}_1)
+ \tfrac{1}{2} \, {\rm Tr}[\Delta]
+ \ln \frac{\sqrt{|2\pi \Sigma_1|}}{V_{\rm prior,1}} - \ln \frac{\sqrt{|2\pi \Sigma_2|}}{V_{\rm prior,2}} \tag{A.11}
\]
(see also Lazarides et al. 2004; Heavens et al. 2007 for similar, less general expressions) and variance
\[
{\rm Var}(\ln R) = \tfrac{1}{2} \, {\rm Tr}\!\left[ \Delta^2 \right] + (\Delta \bar{d} - \delta)^\tau C^{-1} (\Delta \bar{d} - \delta) \; , \tag{A.12}
\]
where we defined
\[
\Delta := C \left( \tilde{C}_2^{-1} - \tilde{C}_1^{-1} \right) \; ; \qquad
\delta := C \left( \tilde{C}_2^{-1} \hat{d}_2 - \tilde{C}_1^{-1} \hat{d}_1 \right) \; . \tag{A.13}
\]
The mean in Eq. (A.11) has three portions: a set of misfit terms on the first line, a constant trace term equal to $(m_1 - m_2)/2$, and an Occam factor. The trace contribution can be understood as a typically small modification of the Occam factor.

In the variance (Eq. A.12) there is a trace term which is roughly the dimensionality of the parameter space(s), $\leq (m_1 + m_2)/2$, as well as a data misfit term. The trace term is always present, and represents the 'order unity' term for the general Gaussian case, but can reduce to zero (via a cross-term dependent on both models) the more similar the two model parametrisations (as quantified by $J$) are to each other. As opposed to the variance of the evidence (cf. Eq. A.9), the trace term in the variance of the Bayes factor does not depend on $n$, so if $n \gg m$, the scatter in $R$ is significantly smaller than the scatter in either $Z_1$ or $Z_2$ if the data are well fitted. This is the situation we encountered in Fig. 2.

The second term may be small if the models are good, but can become arbitrarily large, which corresponds to the scatter seen in Fig. 3. It should be noted that in the event of large misfits, the mean and variance are both of the same order, which gives a Poisson-type evidence error associated with measurement noise. This is reassuring, as it means the evidence in theory becomes relatively less noisy the larger it becomes.

We note that, if Eq. (A.5) is a reasonable approximation, provided one can compute (by numerical derivatives or otherwise) the Jacobian $J$, one may use Eq. (A.12) to evaluate the expected scatter (we note that in general the models under comparison do not need to share any part of their parameter space, in which case the pivot $p_*$ in Eq. A.5 could also differ), e.g.
by employing the observed data vector as an estimate of the true $\bar{d}$.

Finally, we offer some illustration for the use of information about the sampling distribution of the Bayes factor (such as the variance of Eq. A.12) by invoking the popular analogy with betting odds. Rather than placing one's bets based on a single measure of the odds conditioned on some observation, it is beneficial to take the scatter of these odds into account, even if the latter is built upon an imperfect model and noisy data. The scatter might indicate that a different outcome has substantial probability, and hence it would be wise to invest one's money more cautiously.

Appendix B: Details of the likelihood analysis
For most of our analyses we employ simulated and real data from the KiDS-450 analysis (Hildebrandt et al. 2017). We also adopt their five-parameter $\Lambda$CDM cosmological model with spatially flat geometry and use the same set of priors. The sampled parameters are the amplitude of the primordial power spectrum $\ln(10^{10} A_{\rm s})$, the value $h$ of the Hubble parameter today divided by $100\,{\rm km\,s^{-1}\,Mpc^{-1}}$, the cold dark matter density $\Omega_{\rm cdm} h^2$, the baryonic matter density $\Omega_{\rm b} h^2$, and the power-law exponent of the primordial power spectrum $n_{\rm s}$. In addition to these key cosmological parameters we vary the free amplitude parameters $A_{\rm IA}$ and $A_{\rm bary}$ of the intrinsic alignment and baryon feedback models, respectively. The implementation of the inference pipeline is that presented in Köhlinger et al. (2019), which is independent of, but in excellent agreement with, the analysis of Hildebrandt et al. (2017).

We opt for nested sampling (Skilling 2006) to explore the posterior distribution, as the most efficient way to evaluate high-dimensional likelihoods and calculate the Bayesian evidence simultaneously. To avoid significant algorithm-induced scatter in the evidence values, we check three variants of nested sampling algorithms for their consistency. We use MULTINEST (Feroz & Hobson 2008; Feroz et al. 2009, 2013) and POLYCHORD (Handley et al. 2015a,b), which primarily differ in the key step of how new 'live' sampling points are drawn at each likelihood contour. Moreover, we consider an importance-sampled determination of the evidence in MULTINEST that utilises the full set of generated sample points and can achieve higher accuracy (Feroz et al. 2013). For MULTINEST we use a sampling efficiency of 0.3 and a final error tolerance on the log-evidence of 0.1. Live points for POLYCHORD runs are set to 25 times the number of parameters (7 for Model 0; 14 for Model 1), with a final error tolerance on the log-evidence of 0.001.

Figure B.1 shows evidence values for a KiDS-like noise-free simulated data vector, as well as for ten realisations with noise included. It is evident that in all cases, and for both the joint and split cosmological models, the three nested sampling variants agree very well, with the residual scatter at a small fraction of the statistical errors. Our MULTINEST and POLYCHORD settings were optimised to yield accurate evidences. However, we note that evidence values are faithfully recovered as soon as the bulk of the posterior is explored, while credible regions of the parameters as well as the effective dimension (see Eq. 2) are sensitive to the tails of the distribution. Therefore, when these latter quantities are required in high-stakes real-data applications, we recommend increasing the accuracy settings of the nested sampling runs.

Fig. B.1. Comparison of sampler outputs. Black (grey) points correspond to the evidence of the joint (Model 0) and split (Model 1) analysis for 10 noise realisations, measured with the traditional/importance-sampled approaches of MULTINEST, as well as POLYCHORD. Red points display the mean and standard deviation over these realisations. Blue points show results for a noise-free data vector.

The $\chi^2$ minimisation for the approximate method is performed with the built-in MONTE PYTHON maximum-likelihood determination, with a precision tolerance of $10^{-}$ on the log-likelihood. With this setup a minimisation run consumes about 500 times less wall-clock time than full, parallelised sampling on high-performance computing infrastructure.

The data are publicly available at http://kids.strw.leidenuniv.nl/sciencedata.php. Likelihood pipelines are available in MONTE PYTHON, https://github.com/brinckmann/montepython_public (Audren et al. 2013; Brinckmann & Lesgourgues 2018), and from https://github.com/fkoehlin/montepython_2cosmos_public. We use MULTINEST version 3.8 from http://ccpforge.cse.rl.ac.uk/gf/project/multinest/ and POLYCHORD version 1.16 from https://github.com/polychord/polychordlite.
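The approximate procedure suggested at the end of Appendix A — computing the expected scatter of the evidence and the Bayes factor from the model Jacobians — can be sketched numerically. The toy example below is a minimal illustration, not the KiDS pipeline: all dimensions, covariances, Jacobians, and best-fitting predictions are invented for the purpose of the demonstration. It evaluates Eqs. (A.7)–(A.9) and (A.12)–(A.13) for two linear models and cross-checks the analytic variances against Monte Carlo realisations of the data.

```python
import numpy as np

# Toy linear-Gaussian setup (all numbers invented for illustration):
# n data points, two linear models with m1 and m2 parameters.
rng = np.random.default_rng(42)
n, m1, m2 = 40, 3, 5

C = np.diag(rng.uniform(0.5, 2.0, size=n))   # data covariance
Cinv = np.linalg.inv(C)
J1 = rng.normal(size=(n, m1))                # Jacobians of Eq. (A.5)
J2 = rng.normal(size=(n, m2))
dbar = rng.normal(size=n)                    # true mean of the data
dhat1 = dbar + 0.1 * rng.normal(size=n)      # best-fitting model predictions
dhat2 = dbar + 0.1 * rng.normal(size=n)

def ctilde_inv(J):
    """Eqs. (A.7)-(A.8): Sigma^-1 = J^t C^-1 J and
    C~^-1 = C^-1 - C^-1 J Sigma J^t C^-1."""
    Sigma = np.linalg.inv(J.T @ Cinv @ J)
    return Cinv - Cinv @ J @ Sigma @ J.T @ Cinv

Ct1, Ct2 = ctilde_inv(J1), ctilde_inv(J2)

# Eq. (A.9): predicted variance of a single log-evidence (model 1);
# the trace term equals (n - m1)/2 up to numerical round-off.
var_lnZ1 = 0.5 * np.trace((Ct1 @ C) @ (Ct1 @ C)) \
    + (dbar - dhat1) @ Ct1 @ C @ Ct1 @ (dbar - dhat1)

# Eqs. (A.12)-(A.13): predicted variance of the log Bayes factor.
Delta = C @ (Ct2 - Ct1)
delta = C @ (Ct2 @ dhat2 - Ct1 @ dhat1)
var_lnR = 0.5 * np.trace(Delta @ Delta) \
    + (Delta @ dbar - delta) @ Cinv @ (Delta @ dbar - delta)

# Monte Carlo cross-check: draw data realisations and evaluate the
# realisation-dependent part of ln Z (Eqs. A.4 and A.6); the constant
# terms drop out of the variance.
L = np.linalg.cholesky(C)
X = dbar + rng.normal(size=(20000, n)) @ L.T
r1, r2 = X - dhat1, X - dhat2
lnZ1 = -0.5 * np.einsum('ij,jk,ik->i', r1, Ct1, r1)
lnZ2 = -0.5 * np.einsum('ij,jk,ik->i', r2, Ct2, r2)

print('Var(ln Z1): predicted %.2f, MC %.2f' % (var_lnZ1, lnZ1.var()))
print('Var(ln R):  predicted %.2f, MC %.2f' % (var_lnR, (lnZ1 - lnZ2).var()))
```

In line with Eqs. (A.9) and (A.12), the scatter of the individual log-evidence is dominated by the $(n - m)/2$ trace term, while the log Bayes factor of the two well-fitting models is markedly less noisy, since its trace term is bounded by the parameter-space dimensionalities rather than by $n$.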