Interpreting Internal Consistency of DES Measurements


MNRAS 000, 000–000 (2020) Preprint 1 October 2020 Compiled using MNRAS LaTeX style file v3.0


V. Miranda,⋆ P. Rogozenski, and E. Krause
Steward Observatory, Department of Astronomy, University of Arizona, Tucson, Arizona, 85721, USA
Department of Physics, University of Arizona, Tucson, Arizona, 85721, USA

Accepted XXX. Received YYY; in original form ZZZ

ABSTRACT

Bayesian evidence ratios are widely used to quantify the statistical consistency between different experiments. However, since the evidence ratio is prior dependent, the precise translation between its value and the degree of concordance/discordance requires additional information. The most commonly adopted metric, the Jeffreys scale, can falsely suggest agreement between datasets when priors are chosen to be sufficiently wide (Raveri & Hu 2019; Handley & Lemos 2019). In this work, we examine evidence ratios in a DES-Y1 simulated analysis, focusing on the internal consistency between weak lensing and galaxy clustering. We study two scenarios using simulated data in controlled experiments. First, we calibrate the expected evidence ratio distribution given noise realizations around the best fit DES-Y1 ΛCDM cosmology. Second, we show the behavior of evidence ratios for noiseless fiducial data vectors simulated using a modified gravity model, which generates internal tension in the ΛCDM analysis. We show that the choice of prior could conceal the discrepancies between weak lensing and galaxy clustering induced by such models and that the evidence ratio in a DES-Y1 study is, indeed, biased towards agreement.

Key words: cosmological parameters – theory – large-scale structure of the Universe

Since the discovery of the accelerating expansion of the universe (Riess et al. 1998; Perlmutter et al. 1999), various surveys have been designed to measure the background expansion and structure formation of the Universe with increasing precision. The Dark Energy Task Force (DETF) (Albrecht et al. 2006) classifies these surveys from stage I to stage IV according to their ability to increase the figure-of-merit (Albrecht et al. 2009) of the $w_0$-$w_a$ parameterization for the dark energy equation of state (Linder 2003; Chevallier & Polarski 2001). The community is currently analyzing the stage III surveys, while stage IV surveys such as DESI (Levi et al. 2019), Nancy Grace Roman Space Telescope (Akeson et al. 2019), CMB-S4 (Abazajian et al. 2016) and Vera Rubin Telescope Legacy Survey of Space and Time (LSST) (The LSST Dark Energy Science Collaboration et al. 2018) will start collecting data in the next few years with the potential to significantly expand our knowledge about the early and late-time cosmos.

Ongoing stage III surveys, such as the Dark Energy Survey (DES) (Abbott et al. 2005), constrain the parameters of the standard model (ΛCDM) with unprecedented precision. These constraints encompass measurements of the Cosmic Microwave Background (CMB) (Planck Collaboration et al.

⋆ E-mail: [email protected]

$H_0$, (Riess et al. 2019) is a good example of a tension that may require new physics to be fully resolved (Knox & Millea 2019; Verde et al. 2019).

The Dark Energy Survey uses the combination of weak lensing and galaxy clustering to break degeneracies between dark energy and other parameters. For example, the DES year one (DES-Y1) error bars from the cosmic shear investigation on the dark energy equation of state are reduced by ∼ in the combined analysis (Abbott et al. 2018b; Troxel et al. 2018). The joint analysis is only permitted, however, if the datasets are statistically consistent. In Abbott et al. (2018b), consistency was ascertained by the Bayesian evidence ratio, $R$, utilizing the Jeffreys scale. However, analytical examples show that the Jeffreys scale should not be used as a universal scale (Nesseris & Garcia-Bellido 2013), given that priors can always be chosen to be wide enough to enable consistency (Marshall et al. 2006).

In order to make meaningful statements about consistency of datasets it is important to investigate how the Bayesian evidence ratio $R$ is affected by the priors under consideration. These investigations are particularly relevant when tension with modest statistical significance is detected, e.g., the disagreement between Planck data and weak lensing surveys over the value of the $S_8 \equiv \sigma_8 \Omega_m^{1/2}$ parameter (Abbott et al. 2018b; Heymans et al. 2020; Hikage et al. 2019).

Given the demanding computational costs associated with Bayesian evidence computation (Handley et al. 2015), calibrating survey data concordance with simulated data is not always feasible. Alternative metrics with reduced prior dependence have been suggested (Handley & Lemos 2019; Seehars et al. 2016). In simple cases (e.g. multivariate Gaussians), these alternatives can be prior independent.
However, in more general cases, the interpretation of alternative metrics still requires careful scale calibration using simulated data. Yet another approach to reduce prior dependencies is to adopt approximations, such as the validity of the Gaussian linear model (GLM), which allows Bayesian estimators to be computed either analytically or from Monte Carlo Markov Chains (Raveri & Hu 2019).

In this paper we examine the Bayesian evidence ratio in the context of quantifying consistency between cosmic shear, galaxy-galaxy lensing, and galaxy clustering in DES-Y1 data. In particular, we want to quantify whether cosmic shear and the combination of galaxy clustering and galaxy-galaxy lensing (so-called 2x2pt) can be combined into a so-called 3x2pt analysis. We test how this metric responds to noise drawn from the DES-Y1 covariance around the best-fit cosmology at varying confidence intervals in $\vec{\chi}^2$ space. This first test demonstrates how 'real' survey noise at known deviations from the best-fit cosmology propagates into Bayesian estimators. We then explore how the evidence ratio behaves when data vectors generated from an underlying modified gravity theory are fit with the standard model. When confined to the standard model, these modified gravity based data vectors naturally induce a tension between weak lensing and galaxy clustering.

This manuscript is structured as follows: In Sect. 2 we define the tension metrics studied in this paper. In Sect. 3 we explain the theoretical modeling and aspects of our simulated analyses. Section 4 describes our findings about Bayesian evidence ratios and other tension metrics when considering noisy ΛCDM data vectors that are analyzed with a ΛCDM model. This scenario corresponds to the case where realistic noise in a data vector might be misinterpreted as a physical tension. In Sect. 5 we consider a noise free modified gravity data vector that is analyzed with a ΛCDM model. This scenario mimics the case where an actual physical tension between the clustering and weak lensing parts of the data vector exists. Four appendices offer further explanation of the details that are only summarized in this section. We conclude in Sect. 6.

In this section we briefly review tension metrics and establish consistent notation. We start by defining the posterior probability for a set of parameters $\vec{\theta}$ in a given model $\mathcal{H}$ and observed dataset $d$ as $P(\vec{\theta}\,|\,d,\mathcal{H})$. The posterior is related to the likelihood, $P(d\,|\,\vec{\theta},\mathcal{H})$, via Bayes' theorem

$$P(\vec{\theta}\,|\,d,\mathcal{H}) = \frac{P(d\,|\,\vec{\theta},\mathcal{H})\,P(\vec{\theta}\,|\,\mathcal{H})}{P(d\,|\,\mathcal{H})} . \tag{1}$$

The prior, $P(\vec{\theta}\,|\,\mathcal{H})$, describes the a priori probability distribution of the parameters $\vec{\theta}$ within the assumed model $\mathcal{H}$. The normalization factor, $P(d\,|\,\mathcal{H})$, is called the Bayesian evidence (Marshall et al. 2006).

The Bayesian evidence of $M$ datasets $\vec{d} = (d_1, \ldots, d_M)$ given a model $\mathcal{H}$ of $N$ parameters $\vec{\theta} = (\theta_1, \ldots, \theta_N)$ is given by

$$P(\vec{d}\,|\,\mathcal{H}) = \int d\vec{\theta}\; P(\vec{d}\,|\,\vec{\theta},\mathcal{H})\,P(\vec{\theta}\,|\,\mathcal{H}) . \tag{2}$$

In order to evaluate the probability that experiments $d_1$ and $d_2$ are in agreement, we evaluate the odds of hypothesis $\mathcal{H}_0$, that we can model both datasets with a single set of parameters, against the alternative hypothesis $\mathcal{H}_1$, that modeling each dataset with a different set of parameters is preferable. These odds are defined as $P(\mathcal{H}_0\,|\,d_1,d_2)/P(\mathcal{H}_1\,|\,d_1,d_2)$ and their relation to the evidences $P(d_1,d_2\,|\,\mathcal{H}_0)$ and $P(d_1,d_2\,|\,\mathcal{H}_1)$ can be readily seen when applying Bayes' theorem

$$\frac{P(\mathcal{H}_0\,|\,d_1,d_2)}{P(\mathcal{H}_1\,|\,d_1,d_2)} = \frac{P(d_1,d_2\,|\,\mathcal{H}_0)}{P(d_1,d_2\,|\,\mathcal{H}_1)} \cdot \frac{P(\mathcal{H}_0)}{P(\mathcal{H}_1)} , \tag{3}$$

where $P(\mathcal{H}_{i=\{0,1\}})$ are the prior probabilities of models $\mathcal{H}_{i=\{0,1\}}$. The first ratio on the right-hand side of Eq. 3 is known as the Bayesian evidence ratio, $R$. If the datasets are independent, we may express it as

$$R = \frac{P(d_1,d_2\,|\,\mathcal{H}_0)}{P(d_1\,|\,\mathcal{H}_1)\,P(d_2\,|\,\mathcal{H}_1)} . \tag{4}$$

The Bayesian evidence ratio generally implies agreement between datasets when $R \gg 1$, while $R \ll 1$ flags the opposite. The ratio changes as a function of prior range, which can mimic consistency even in the presence of tension.
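The prior dependence of Eq. 4 can be illustrated with a minimal one-parameter sketch. It is a toy under stated assumptions, not the DES-Y1 pipeline: two independent Gaussian measurements of a single parameter, a flat prior of adjustable width, and hypothetical function names (`log_evidence_ratio`, `log_evidence_ratio_analytic`) introduced only for illustration. In this toy the evidence ratio is $R = W\,\mathcal{N}(d_1 - d_2;\,0,\,\sigma_1^2+\sigma_2^2)$ for prior width $W$, so widening the prior raises $R$ without changing the data.

```python
import numpy as np


def gaussian(x, mean, sigma):
    """Normalized 1D Gaussian density."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def trapezoid(values, step):
    """Trapezoidal rule on a uniform grid."""
    return step * (values.sum() - 0.5 * (values[0] + values[-1]))


def log_evidence_ratio(d1, d2, sigma1, sigma2, prior_width, n_grid=400001):
    """ln R of Eq. 4 for two measurements of one parameter theta.

    Toy assumption: flat prior on [-prior_width / 2, prior_width / 2].
    """
    theta = np.linspace(-0.5 * prior_width, 0.5 * prior_width, n_grid)
    step = theta[1] - theta[0]
    prior = 1.0 / prior_width
    like1 = gaussian(d1, theta, sigma1)
    like2 = gaussian(d2, theta, sigma2)
    z1 = trapezoid(like1 * prior, step)            # evidence of dataset 1 alone
    z2 = trapezoid(like2 * prior, step)            # evidence of dataset 2 alone
    z12 = trapezoid(like1 * like2 * prior, step)   # joint evidence, shared theta
    return np.log(z12) - np.log(z1) - np.log(z2)


def log_evidence_ratio_analytic(d1, d2, sigma1, sigma2, prior_width):
    """Closed form, valid when both likelihoods sit well inside the prior."""
    combined_sigma = np.sqrt(sigma1**2 + sigma2**2)
    return np.log(prior_width * gaussian(d1 - d2, 0.0, combined_sigma))


# Two measurements that differ by three combined standard deviations.
d1, d2 = 0.0, 3.0 * np.sqrt(2.0)
ln_r_narrow = log_evidence_ratio(d1, d2, 1.0, 1.0, prior_width=20.0)
ln_r_wide = log_evidence_ratio(d1, d2, 1.0, 1.0, prior_width=2000.0)
print(ln_r_narrow, ln_r_wide)
```

With the same three sigma offset between the two measurements, the narrow prior gives $\ln R < 0$ while the hundred times wider prior gives $\ln R > 0$, which is the behavior the text describes.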
$\Delta\bar{\chi}^2$ statistic

The $\bar{\chi}^2$ value is a statistic related to the average log-likelihood of a chain marginalized over the posterior. Given the weights of each sample $i$ of a chain of length $N$, we calculate the statistic directly as

$$\bar{\chi}^2_j = -2\left\langle \ln P(\vec{d}_j\,|\,\vec{\theta},\mathcal{H}) \right\rangle = -2\,\frac{\sum_i^N w_i \ln P_i(\vec{d}_j\,|\,\vec{\theta},\mathcal{H})}{\sum_i^N w_i} , \tag{5}$$

where the sample weights are defined as the ratio of the sample posterior over the maximum sampled posterior of the chain. We define a statistic similar to the delta chi-squared statistic of Marshall et al. (2006) as the difference between the $\bar{\chi}^2$ values of the joint and independent datasets:

$$\Delta\bar{\chi}^2 = \bar{\chi}^2_{12} - \left(\bar{\chi}^2_1 + \bar{\chi}^2_2\right) . \tag{6}$$

The Generalized Parameter Distance estimates the departure from the fiducial vector (in this case determined by the DES-Y1 best-fit cosmology). It is determined by calculating the covariance of a chain, $\hat{\Sigma}$, then taking the difference, in parameter space, of the fiducial data vector, $\vec{\mu}$, and the best-fit data vector of the samples, $\vec{\theta}$, as

$$\Delta \equiv \sqrt{\left(\vec{\theta} - \vec{\mu}\right)^t \hat{\Sigma}^{-1} \left(\vec{\theta} - \vec{\mu}\right)} . \tag{7}$$

As an alternative to the evidence ratio, the Kullback-Leibler (KL) Divergence, also known as the relative entropy, determines how parameters are constrained by the data compared to the prior constraints (Kullback & Leibler 1951). Defined as

$$D_i = \int d\vec{\theta}\; P(\vec{\theta}\,|\,d_i,\mathcal{H}) \ln\left[\frac{P(\vec{\theta}\,|\,d_i,\mathcal{H})}{P(\vec{\theta}\,|\,\mathcal{H})}\right] , \tag{8}$$

the KL Divergence is invariant under model reparameterization and can be interpreted as measuring the information gain when going from the prior distribution to the posterior. Similar to entropy, $D_i \geq 0$.
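Eq. 8 can be evaluated directly in a simple case. The sketch below is a one-dimensional illustration under stated assumptions, not taken from the paper's code: a Gaussian likelihood, a flat prior, and a hypothetical function name (`kl_posterior_vs_flat_prior`). When the posterior sits well inside a flat prior of width $W$, the integral reduces to $D = \ln W - \tfrac{1}{2}\ln(2\pi e \sigma^2)$, which the numerical result is compared against.

```python
import numpy as np


def kl_posterior_vs_flat_prior(mean, sigma, prior_lo, prior_hi, n_grid=400001):
    """Numerically evaluate Eq. 8 (in nats) for a Gaussian likelihood
    and a flat prior on [prior_lo, prior_hi]."""
    theta = np.linspace(prior_lo, prior_hi, n_grid)
    step = theta[1] - theta[0]
    log_prior = -np.log(prior_hi - prior_lo)
    log_like = -0.5 * ((theta - mean) / sigma) ** 2
    like = np.exp(log_like)
    # With a flat prior the posterior is the likelihood normalized over the prior range.
    like_integral = step * (like.sum() - 0.5 * (like[0] + like[-1]))
    log_post = log_like - np.log(like_integral)
    # Working with log_post keeps the far tails finite (0 * finite = 0, no NaN).
    integrand = np.exp(log_post) * (log_post - log_prior)
    return step * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))


def kl_analytic(sigma, prior_width):
    """Closed form for a Gaussian posterior well inside a flat prior."""
    return np.log(prior_width) - 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)


kl_narrow = kl_posterior_vs_flat_prior(0.0, 1.0, -50.0, 50.0)
kl_wide = kl_posterior_vs_flat_prior(0.0, 1.0, -500.0, 500.0)
print(kl_narrow, kl_wide)
```

The information gain grows logarithmically with the prior width, so in this toy the KL Divergence carries the same kind of prior-volume term that enters the evidence.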
The KL Divergence can also measure the information gain of augmented datasets by taking $P(\vec{\theta}\,|\,\mathcal{H}) \rightarrow P(\vec{\theta}\,|\,d_i,\mathcal{H})$ and $P(\vec{\theta}\,|\,d_i,\mathcal{H}) \rightarrow P(\vec{\theta}\,|\,d_i + d_{\rm new},\mathcal{H})$. The relative entropy between datasets is the basis of a tension metric called Surprise (Seehars et al. 2014, 2016). Computing either the KL Divergence or the Surprise is non-trivial outside the Gaussian case, which limits their applicability as a check for statistical consistency.

Suspiciousness is a tension metric that aims to alleviate the prior dependence exhibited in the evidence ratio (Handley & Lemos 2019). This metric is defined as

$$\ln S \equiv \ln R - \ln I , \tag{9}$$

where $\ln I$ is defined as the information ratio

$$\ln I \equiv D_1 + D_2 - D_{12} . \tag{10}$$

In restricted cases (e.g. the case of flat priors imposed on a multivariate Gaussian likelihood), the prior dependence in the metric is completely eliminated. For this particular case, a generalization to correlated datasets has been found (Lemos et al. 2019). Details on the numerical evaluation of suspiciousness, as well as the evidence, in a nested sampling run are shown in Appendix D.

Table 1. Table with priors for the cosmological and nuisance parameters, similar to the adopted priors in DES-Y1. In addition, we applied flat( . , . ) priors on $\Omega_b h^2$ for minimal compatibility with BBN constraints in CosmoLike (see Appendix A for further details).

Parameter                         Prior

Cosmology
  $\Omega_m$                      flat ( . , . )
  $A_s \times 10^{-9}$            flat ( . , . )
  $n_s$                           flat (0.87, 1.07)
  $\Omega_b$                      flat (0.03, 0.07)
  $H_0$                           flat (55.0, 91.0)
  $m_\nu$                         flat ( . , . )

Lens Galaxy Bias
  $b_i$ ($i = 1, \ldots, 5$)      flat (0.8, 3.0)

Intrinsic Alignment: $A_{\rm IA}(z) = A_{\rm IA}\,[(1+z)/\ .\ ]^{\eta_{\rm IA}}$
  $A_{\rm IA}$                    flat ( − , )
  $\eta_{\rm IA}$                 flat ( − , )

Lens photo-z shift
  $\Delta z^1$                    Gauss ( , . )
  $\Delta z^2$                    Gauss ( , . )
  $\Delta z^3$                    Gauss ( , . )
  $\Delta z^4$                    Gauss ( , . )
  $\Delta z^5$                    Gauss ( , . )

Source photo-z shift
  $\Delta z^1$                    Gauss ( , . )
  $\Delta z^2$                    Gauss ( , . )
  $\Delta z^3$                    Gauss ( , . )
  $\Delta z^4$                    Gauss ( , . )

Shear calibration
  $m_i$ ($i = 1, \ldots, 4$)      Gauss ( , . )

The theoretical modeling and covariance computation and validation for the DES-Y1 3x2pt analysis are described in detail in Krause et al. (2017). We summarize the main modeling details briefly below.

The DES 3x2pt data vector consists of the angular galaxy clustering statistic $w^i(\theta)$ of galaxies in redshift bin $i$, the galaxy–galaxy lensing statistic $\gamma_t^{ij}(\theta)$ for galaxies in redshift bin $i$ and shape measurements for source galaxies in redshift bin $j$, and cosmic shear two-point correlation functions $\xi_\pm^{ij}(\theta)$ of shape measurements for source galaxies in redshift bins $i$, $j$. The galaxy sample used in the clustering measurement, which also constitutes the "lens" sample for galaxy-galaxy lensing, is selected using the redMaGiC algorithm (Rozo et al. 2016). Details on the DES-Y1 sample selection and redshift calibration are described in Elvin-Poole et al. (2018); Cawthon et al. (2018). For the weak lensing galaxy sample, we adopt the DES-Y1 metacal source galaxy sample, for which the sample selection from the DES-Y1 gold catalog (Drlica-Wagner et al. 2018) and the shear catalog are described in Zuntz et al. (2018), and the source redshift estimates are described in Hoyle et al. (2018).

We denote the redshift distribution of the redMaGiC/metacal source galaxy sample in tomography bin $i$ as $n^i_{g/\kappa}(z)$, and the angular number densities of galaxies in this redshift bin as

$$\bar{n}^i_{g/\kappa} = \int dz\; n^i_{g/\kappa}(z) . \tag{11}$$

Assuming a flat ΛCDM universe, we write the radial weight function for clustering in terms of the comoving radial distance $\chi$ as

$$q^i_{\delta_g}(k,\chi) = b^i(k,z(\chi))\,\frac{n^i_g(z(\chi))}{\bar{n}^i_g}\,\frac{dz}{d\chi} , \tag{12}$$

with $b^i(k,z(\chi))$ the galaxy bias of the redMaGiC galaxies in tomography bin $i$, and the lensing efficiency

$$q^i_\kappa(\chi) = \frac{3 H_0^2 \Omega_m}{2 c^2}\,\frac{\chi}{a(\chi)} \int d\chi'\; \frac{n^i_\kappa(z(\chi'))\,dz/d\chi'}{\bar{n}^i_\kappa}\,\frac{\chi' - \chi}{\chi'} , \tag{13}$$

where $H_0$ is the Hubble constant, $c$ the speed of light, and $a$ the scale factor. The angular power spectra for cosmic shear, galaxy-galaxy lensing, and galaxy clustering are calculated using the Limber approximation

$$C^{ij}_{\kappa\kappa}(l) = \int d\chi\; \frac{q^i_\kappa(\chi)\,q^j_\kappa(\chi)}{\chi^2}\, P_{\rm NL}\!\left(\frac{l+1/2}{\chi}, z(\chi)\right) ,$$

$$C^{ij}_{\delta_g\kappa}(l) = \int d\chi\; \frac{q^i_{\delta_g}\!\left(\frac{l+1/2}{\chi},\chi\right) q^j_\kappa(\chi)}{\chi^2}\, P_{\rm NL}\!\left(\frac{l+1/2}{\chi}, z(\chi)\right) ,$$

$$C^{ij}_{\delta_g\delta_g}(l) = \int d\chi\; \frac{q^i_{\delta_g}\!\left(\frac{l+1/2}{\chi},\chi\right) q^j_{\delta_g}\!\left(\frac{l+1/2}{\chi},\chi\right)}{\chi^2}\, P_{\rm NL}\!\left(\frac{l+1/2}{\chi}, z(\chi)\right) , \tag{14}$$

where $P_{\rm NL}(k,z)$ is the non-linear matter power spectrum at wave vector $k$ and redshift $z$ computed via Halofit (Takahashi et al. 2012).

The angular correlation functions are calculated from the angular power spectra as

$$\xi^{ij}_{+/-}(\theta) = \int \frac{dl\; l}{2\pi}\, J_{0/4}(l\theta)\, C^{ij}_{\kappa\kappa}(l) ,$$

$$\gamma^{ij}_t(\theta) = \int \frac{dl\; l}{2\pi}\, J_2(l\theta)\, C^{ij}_{\delta_g\kappa}(l) ,$$

$$w^i(\theta) = \sum_l \frac{2l+1}{4\pi}\, P_l(\cos(\theta))\, C^{ii}_{\delta_g\delta_g}(l) , \tag{15}$$

with $J_n(x)$ the $n$-th order Bessel function of the first kind, and $P_l(x)$ the Legendre polynomial of order $l$.

The DES-Y1 baseline model includes nuisance parameters to account for uncertainties in astrophysical and observational systematic effects, summarized below. Prior distributions of our parameters are given in Table 1, similar to those in DES-Y1 analyses. Parameters with Gaussian priors (i.e. the lens photo-z shifts, the source photo-z shifts, and the shear calibrations) are prior-dominated. A detailed validation of these parameterizations can be found in Elvin-Poole et al. (2018); Krause et al. (2017) and Troxel et al. (2018).

Photometric redshift uncertainties

The uncertainty in the redshift distribution $n$ is modeled through shift parameters $\Delta_z$,

$$n^i_x(z) = \hat{n}^i_x\!\left(z - \Delta^i_{z,x}\right) , \quad x \in \{g, \kappa\} , \tag{16}$$

where $\hat{n}$ denotes the estimated redshift distribution. We marginalize over one parameter for each source and lens redshift bin (nine parameters in total), using the priors derived in Hoyle et al. (2018); Cawthon et al. (2018).

Multiplicative shear calibration is marginalized using one parameter $m^i$ per redshift bin, which affects cosmic shear and galaxy–galaxy lensing correlation functions via

$$\xi^{ij}_\pm(\theta) \longrightarrow (1+m^i)(1+m^j)\,\xi^{ij}_\pm(\theta) , \qquad \gamma^{ij}_t(\theta) \longrightarrow (1+m^j)\,\gamma^{ij}_t(\theta) , \tag{17}$$

with Gaussian priors as determined in Troxel et al. (2018); Zuntz et al. (2018).

Galaxy bias

The DES-Y1 baseline model assumes an effective linear galaxy bias ($b_1$) using one parameter per galaxy redshift bin, $b^i(k,z) = b^i_1$, i.e. five parameters, which are marginalized over conservative flat priors.

Intrinsic galaxy alignments (IA) are modeled using a power spectrum shape and amplitude $A(z)$, assuming the non-linear linear alignment (NLA) model (Hirata & Seljak 2004; Bridle & King 2007) for the IA power spectrum. The impact of this specific IA power spectrum model can be written as

$$q^i_\kappa(\chi) \longrightarrow q^i_\kappa(\chi) - A(z(\chi))\,\frac{n^i_\kappa(z(\chi))}{\bar{n}^i_\kappa}\,\frac{dz}{d\chi} . \tag{18}$$

The IA amplitude is modeled as a power-law scaling in $(1+z)$ with normalization $A_{\rm IA}$ and power-law slope $\eta_{\rm IA}$ (c.f. Table 1), which are both marginalized using conservative priors.

ΛCDM DATA VECTORS

In this section, we analyze the distribution of Bayesian evidence ratios for a set of realistic noise realizations of the DES-Y1 data vectors around the DES-Y1 best-fit ΛCDM cosmology. We aim to examine which of these noise realizations of ΛCDM can be flagged as tension according to the Jeffreys scale. We also investigate whether noise realizations at the one σ level are more or less likely to be classified as tension by the Jeffreys scale compared to three and five sigma events.

In the following two sections we run multiple simulated DES-Y1 likelihood analyses to explore the distribution of Bayesian evidence ratios as a function of different input data vectors. The input data vectors computed in Sect. 4.2 resemble realistic noise realizations of the DES-Y1 survey assuming the DES-Y1 best-fit cosmology. The input data vectors in Sect. 5.1 are computed from a modified gravity model, thereby inducing a physical tension between the weak lensing and the galaxy clustering part of the data vector.

Throughout this paper we assume that the likelihood function ($L$) of our data vector ($D$) is well approximated by a multivariate Gaussian

$$L \propto \exp\left(-\frac{1}{2}\left[\left(D - M(\vec{\theta})\right)^t C^{-1} \left(D - M(\vec{\theta})\right)\right]\right) , \tag{19}$$

where $M$ denotes the theory prediction or model vector. As Lin et al. (2019) demonstrate, the Gaussian functional form is an acceptable approximation, at least for ongoing and future cosmic shear surveys.

Figure 1. The distribution of $\vec{\chi}^2$ for cosmic shear, $\chi^2_{\rm shear}$, and the 2x2pt (galaxy-galaxy lensing and galaxy clustering), $\chi^2_{\rm 2x2pt}$, generated using the DES-Y1 joint covariance matrix. We compute the one, three, and five σ confidence intervals from the generation of hundreds of millions of noise realizations, smooth the contours, and define confidence intervals using a KDE. The data vectors are chosen along these contours and are represented as colored points. The color-code (color bar label: Ln Evidence Ratio) denotes the log-evidence ratio of the 3x2pt evidence to the 2x2pt and shear evidences (c.f. Eq. 4). Our selected points sample the confidence limits in all radial directions, and we do not find radial or angular trends of the evidence ratio.

We use CosmoLike (Krause & Eifler 2017) with CLASS (Lesgourgues 2011a; Blas et al. 2011; Lesgourgues 2011b; Lesgourgues & Tram 2011) to compute the fiducial data vector and covariance. We sample the parameter space with the Polychord (Handley et al. 2015) nested sampling, with an interface implemented in the Cobaya framework (Torrado & Lewis 2020), assuming the CAMB (Lewis et al. 2000; Howlett et al. 2012) Boltzmann code. We performed extensive tests of our pipeline that merged CosmoLike and Cobaya, further described in Appendices A and C.
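The Gaussian likelihood of Eq. 19 can be sketched in a few lines. This is an illustrative stand-in, not the CosmoLike or Cobaya implementation: the function name `gaussian_loglike` and the three-element toy data vector and covariance are hypothetical. A Cholesky factorization is used instead of an explicit matrix inverse, which is a common choice for numerical stability.

```python
import numpy as np


def gaussian_loglike(data, model, cov):
    """ln L of Eq. 19 up to an additive constant: -0.5 * r^t C^{-1} r,
    with residual r = data - model."""
    residual = np.asarray(data, dtype=float) - np.asarray(model, dtype=float)
    chol = np.linalg.cholesky(cov)               # cov = chol @ chol.T
    whitened = np.linalg.solve(chol, residual)   # chol^{-1} r
    return -0.5 * float(whitened @ whitened)


# Hypothetical toy data vector, model vector, and covariance.
data = np.array([1.0, 2.0, 0.5])
model = np.zeros(3)
cov = np.array([[1.0, 0.2, 0.0],
                [0.2, 4.0, 0.1],
                [0.0, 0.1, 0.25]])
loglike = gaussian_loglike(data, model, cov)
print(loglike)
```

In a sampler, the model vector would be recomputed at every point in parameter space while the data vector and covariance stay fixed, so the Cholesky factor could be cached once.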

The DES-Y1 covariance matrix for cosmic shear, galaxy-galaxy lensing, and galaxy clustering and the noiseless fiducial data vector are evaluated at the DES-Y1 best-fit cosmology using CosmoLike. We use the DES-Y1 covariance matrix to generate hundreds of millions of (Gaussian) noise realizations around the noiseless fiducial DES-Y1 ΛCDM best-fit data vector. The generation of a large sample of noise realizations densely populates the $\vec{\chi}^2 = (\chi^2_{\rm shear}, \chi^2_{\rm 2x2pt})$ space around our fiducial data vector. We then applied a Kernel Density Estimator (KDE) to define, from the samples, confidence intervals of agreement. Based on these confidence regions we select 68 data vectors that lie at the one σ, three σ, and five σ confidence intervals with approximate angular uniformity in $\vec{\chi}^2$ space.

Figure 2. Histogram of evaluated log-evidence at the one, three, and five σ confidence intervals. For comparison, we include the log-evidence ratios of our noiseless fiducial data vector and the official DES-Y1 analysis. The mean log-evidence ratio of each confidence interval is represented as a dotted line, with the mean and scatter explicitly given for each interval in the top-right key. The histogram reveals that the points on each contour all have similar log-evidence ratio distributions. The histogram also shows that the observed DES-Y1 evidence ratio is rather typical and does not point to an unusual level of agreement between the datasets, where the Jeffreys scale declares the DES-Y1 log-evidence ratio to be decisive agreement. [Figure key: DES Y1: 6.39; Noiseless: 6.61; 5σ: 5.8 ± 3.16; 3σ: 5.9 ± 4.52; 1σ: 4.74 ± 2.59]

The KDE method, implemented with help of GetDist (Lewis 2019) routines, approximates the probability distribution of a continuum of values for $\vec{\chi}^2$ from $N$ generated samples $\vec{\chi}^2_{i=1,\cdots,N}$ as follows

$$P(\vec{\chi}^2) = \frac{1}{N}\sum_{i=1}^{N} K_f\!\left(\vec{\chi}^2 - \vec{\chi}^2_i\right) , \tag{20}$$

where $K_f$ is a multivariate Gaussian kernel with zero mean and covariance $f \times \hat{C}$, where $\hat{C}$ is the sample covariance of the $\vec{\chi}^2$. We found that given our large sample of computed data vectors $f \sim$ . is a good choice to balance smoothing and noise features in the $P(\vec{\chi}^2)$ contours. Figure 1 shows the final selection of data vectors as seen in $\vec{\chi}^2$ space and displays the 1-5 σ confidence intervals as determined by our selected KDE. The angular distribution of the selected noise realizations nicely covers all quadrants. Figure 1 also illustrates the evidence ratios of the selected data vector realizations; specifically, the color bar shows the natural-log ratio of the data vector's 3x2pt evidence to its 2x2pt and shear evidences as defined in Eq. 4.
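The noise-realization and KDE steps above can be sketched as follows. This is a small-scale illustration under several assumptions that are not from the paper: a random 30-element toy covariance stands in for the DES-Y1 covariance, each probe's $\chi^2$ is computed with the corresponding diagonal block of the joint covariance, the kernel is evaluated by brute force rather than with GetDist, and the bandwidth factor `f=0.05` is an arbitrary placeholder. The function names `draw_chi2_pairs` and `kde_density` are hypothetical.

```python
import numpy as np


def draw_chi2_pairs(cov, n_shear, n_draws, rng):
    """Draw zero-mean Gaussian noise from `cov` and return one
    (chi2_shear, chi2_2x2pt) pair per realization.

    Assumption: each chi2 uses the matching diagonal block of `cov`.
    """
    noise = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=n_draws)
    inv_shear = np.linalg.inv(cov[:n_shear, :n_shear])
    inv_2x2pt = np.linalg.inv(cov[n_shear:, n_shear:])
    shear, two_pt = noise[:, :n_shear], noise[:, n_shear:]
    chi2_shear = np.einsum("ni,ij,nj->n", shear, inv_shear, shear)
    chi2_2x2pt = np.einsum("ni,ij,nj->n", two_pt, inv_2x2pt, two_pt)
    return np.column_stack([chi2_shear, chi2_2x2pt])


def kde_density(points, samples, f):
    """Eq. 20: average of Gaussian kernels with covariance f * sample covariance."""
    dim = samples.shape[1]
    kernel_cov = f * np.cov(samples, rowvar=False)
    kernel_inv = np.linalg.inv(kernel_cov)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** dim * np.linalg.det(kernel_cov))
    diff = points[:, None, :] - samples[None, :, :]
    exponent = -0.5 * np.einsum("pni,ij,pnj->pn", diff, kernel_inv, diff)
    return norm * np.exp(exponent).mean(axis=1)


rng = np.random.default_rng(0)
n_shear, n_2x2pt = 10, 20                       # toy sizes, not the DES-Y1 ones
n_total = n_shear + n_2x2pt
mixing = rng.normal(size=(n_total, n_total))
cov = mixing @ mixing.T + n_total * np.eye(n_total)   # positive-definite toy covariance

chi2 = draw_chi2_pairs(cov, n_shear, n_draws=20000, rng=rng)
points = np.array([[10.0, 20.0], [60.0, 90.0]])       # typical point, far-tail point
density = kde_density(points, chi2, f=0.05)
print(chi2.mean(axis=0), density)
```

For pure Gaussian noise each $\chi^2$ follows a chi-squared distribution with as many degrees of freedom as the probe has data points, so the sample means should sit near 10 and 20 in this toy. Confidence contours would then be level sets of the KDE density that enclose the desired probability mass.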

Figure 3. Correlation between Bayesian evidence ratios and $\Delta\bar{\chi}^2$ (left panel), and between Bayesian evidence ratios and suspiciousness (right panel). In both cases, the fit parameters of the slope are similar for one, three, and five σ noise realizations. For $\Delta\bar{\chi}^2$, the slope of the fit is close to the one predicted for multivariate Gaussian posteriors. [Left panel key: Combined: -1.95 lnR + 10.43; 1σ: -1.82 lnR + 9.9; 3σ: -1.96 lnR + 10.68; 5σ: -2.0 lnR + 10.49. Right panel (ln S) key: Combined: 1.0 lnR - 12.52; 1σ: 0.94 lnR - 12.35; 3σ: 1.0 lnR - 12.6; 5σ: 1.01 lnR - 12.5]

Using the data vectors as generated in Sect. 4.2, we now investigate whether statistical fluctuations in the DES-Y1 data vector have a high probability of causing tension (as defined by the Jeffreys scale).

Figure 1 shows that there is no radial or angular dependency in the value of the evidence ratio as a function of $\chi^2$ values in cosmic shear and 2x2pt. Similarly, Fig. 2 shows no differences in the evidence ratio distribution associated with one, three, and five σ noise realizations; the histograms of evidence ratios are all centered on large positive values as predicted by Raveri & Hu (2019) and Handley & Lemos (2019) for wide uninformative priors.

The comparison between the evidence ratio and suspiciousness (c.f. Fig. 3) shows that broad priors significantly increase the number of noise fluctuations that are not flagged as internal tension by evidence ratios but would be flagged by suspiciousness. It is, however, not clear that a prior independent metric, such as suspiciousness, is necessarily more objective. While Bayesian evidence tends to hide tensions if broad priors are chosen, it is important to note that tensions in data are inevitably connected to our prior understanding of the situation. Handley & Lemos (2019) argue that some known tensions in cosmology would have been interpreted differently had they been observed decades ago, when our prior beliefs encompassed a broader range.

It is difficult to estimate which tension estimator is a better choice. In Fig. 3 (right panel), we present a comparison and relative calibration between evidence ratios and suspiciousness (for the specific DES-Y1 case considered in this paper).
Our results show how metrics that rely, at least for Gaussian likelihoods, solely on the likelihood of the data differ from tension estimators that take the DES-Y1 prior beliefs into account.

Figure 2 shows that the observed DES-Y1 evidence ratio does not point towards an exceptional level of agreement between the datasets as would be inferred by the Jeffreys scale. Generally speaking we do not find a significant difference in the evidence ratio's mean or variance of data vectors drawn from the 1-σ, 3-σ, 5-σ noise level (also c.f. Fig. 3, left panel). In addition, we also find that noisy DES-Y1 data realizations from the 1-σ confidence region of the parameter covariance matrix can have a negative log-evidence ratio, which would point towards a significant discrepancy. These findings make it difficult to motivate the DES-Y1 Bayesian evidence ratio as a strong indicator for significant agreement between cosmic shear and 2x2pt.

Figure 4. Comparison between the evidence ratio, $R$, for models with $\Sigma_0 = \{\ .\ ,\ .\ ,\ .\ \}$ and the evidence ratio for the ΛCDM model ($\Sigma_0 = 0$), shown as the logarithm of their ratio. Black diamonds are chains with DES-Y1 covariance (key: Noiseless), while blue squares and red triangles are chains with covariances that were divided by 20 (key: Noiseless w/ cov/20) and 50 (key: Noiseless w/ cov/50), respectively. For DES-Y1 chains, the posteriors for many parameters are being pressed against the prior boundaries before inconsistencies between cosmic shear and 2x2pt become important, which explains the unexpected behavior of the evidence ratio going up as a function of $\Sigma_0$.

In the case of correlated Gaussians, the evidence ratio and $\Delta\chi^2 = \chi^2_{12} - \chi^2_1 - \chi^2_2$ (i.e. the maximum log-likelihoods) are linearly correlated. In our DES-Y1 posteriors, however, we find that a linear combination of the log-likelihoods, defined as $\Delta\bar{\chi}^2$ (Eq. 6), is correlated with the evidence ratio. No correlation was found when comparing evidence ratios against generalized parameter distances.

Figure 5. The posterior distribution of selected parameters for cosmic shear (dashed) and 2x2pt (solid) analyses, and for the default DES-Y1 covariance (yellow) against the case where the covariance was reduced by a factor 50 (blue). While it is true that $\Sigma_0 \neq 0$ predicts inconsistencies between the cosmological parameters in ΛCDM, it is difficult to see them in DES-Y1 chains. Not only are the error bars larger in DES-Y1, but the posteriors are also being squeezed against the prior boundaries.

In this section we investigate the evidence ratio's behavior when assuming a μ-Σ modified gravity scenario (as studied in Abbott et al. (2019b), Ade et al. (2016), Aghanim et al. (2018), and Simpson et al. (2012)) that induces tension between the weak lensing and the galaxy clustering parts of the 3x2pt data vector. Recall that $\Sigma_0 \neq 0$ only affects cosmic shear and galaxy-galaxy lensing.

Following the definitions in Ferreira & Skordis (2010), the Poisson and lensing equations in Newtonian gauge are altered in the μ-Σ model as:

$$k^2 \Psi = -4\pi G a^2 \left(1 + \mu(a)\right) \rho\delta , \tag{21}$$

$$k^2 \left(\Psi + \Phi\right) = -8\pi G a^2 \left(1 + \Sigma(a)\right) \rho\delta . \tag{22}$$

Similar to the ΛCDM case (c.f. Sect. 4.2), we compute the μ-Σ data vector at the DES-Y1 best fit parameter values. Specifically, we set $\mu(a) = \mu_0\,\Omega_\Lambda(z)/\Omega_\Lambda$ and $\Sigma(a) = \Sigma_0\,\Omega_\Lambda(z)/\Omega_\Lambda$, with $\Omega_\Lambda(z)$ being the redshift dependent dark energy density over the critical density. No noise is added to the modified gravity data vectors. Similar to the ΛCDM cases, we apply Halofit (Takahashi et al. 2012) to compute the nonlinear matter power spectrum in the μ-Σ case. The fact that Halofit does not correctly describe the nonlinear physics of μ-Σ gravity is not a significant concern for this paper since it is not our goal to analyze actual data. Instead our goal is to examine changes in the evidence ratio when the data vector is computed from a different underlying physics than the model that is assumed in the analysis.

We now investigate induced internal tensions in the case where a data vector originating from μ-Σ gravity (see Sect. 5.1 for definitions) is evaluated in the DES-Y1 pipeline for a ΛCDM cosmology. We have generated fiducial data vectors with fixed $\mu_0 =$ and $\Sigma_0$ ranging from $\leq \Sigma_0 \leq$ . We have not added noise realizations from DES-Y1 covariances; the modified gravity data vector is noise free. Figure 4 presents a surprising behavior of evidence ratios: the log-evidence ratio of the noiseless modified gravity data vector and our fiducial noiseless ΛCDM data vector increases as a function of $\Sigma_0$ (black diamonds). This means that the physical tension introduced by the modified gravity parameters in the galaxy clustering, galaxy-galaxy lensing, and cosmic shear parts of the data vector is not identified as such by the Bayesian evidence ratio.

Such unexpected behavior of the evidence ratio can be better understood by looking at Fig. 5. We see that several parameters are pushing against the prior boundaries. This boundary effect reduces differences between the cosmological parameters that fit cosmic shear and 2x2pt at the expense of making the goodness of fit between theory and data worse. To check that prior boundaries are indeed responsible for the unusual behavior of the evidence ratio, we re-examine the log-evidence ratio of the noiseless modified gravity data vector and our fiducial noiseless ΛCDM data vector; this time, however, we rescale the covariance matrices by factors of twenty (c.f. Fig. 4 blue squares) and fifty (c.f. Fig. 4 red triangles).
This rescaling procedure significantly reduces the posterior volume, which reduces or even removes the prior boundary effects. Indeed, the evidence ratio now decreases as a function of Σ, as expected. This type of behavior exemplifies the difficulties in interpreting tension metrics in realistic examples without extensive validation via simulated analyses.

Tension metrics are an important aspect of multi-probe analyses; they will be used increasingly to determine whether probes can be combined or whether tensions across probes need to be further explored. However, tension metrics themselves need to be calibrated by simulated analyses for each dataset in order to define levels of discordance.

In this work we study the properties of several tension metrics for the specific case of the DES-Y1 3x2pt analysis. In Abbott et al. (2018b) the individual analyses of 1) cosmic shear and 2) galaxy-galaxy lensing plus galaxy clustering (so-called 2x2pt) were compared and ultimately combined into a so-called 3x2pt analysis. Both data vectors, cosmic shear and 2x2pt, were deemed consistent under an assumed ΛCDM model. Consistency was demonstrated by computing the Bayesian evidence ratio, with the result of 6.39, and interpreted using the Jeffreys scale. Bayesian evidence ratios, however, are known to be prior dependent, and it is important to calibrate the computed numbers through a large suite of simulated analyses.

In this paper we calibrate the distribution of evidence ratios for a large set of noise realizations around the DES-Y1 best-fit ΛCDM cosmology. The noisy data vectors are drawn from the DES-Y1 data covariance, not from the parameter covariance. While the data covariance and parameter covariance are closely related, noise realizations drawn from the low-dimensional parameter covariance map onto smooth modulations in the 457-dimensional data space with little scatter from the fiducial data vector.
Our data covariance includes Gaussian cosmic variance, shot/shape noise (for clustering/weak lensing, respectively), and non-Gaussian contributions to the covariance from the connected four-point function of the matter density field as well as super-sample covariance (SSC) (Takada & Hu 2013). As the Gaussian cosmic variance terms and shape/shot noise are caused, respectively, by the limited number of independent Fourier modes sampled in each angular bin and the limited number of galaxies sampled in the power spectrum measurement, noise realizations drawn from the data covariance are nearly uncorrelated between different Fourier modes and provide "noisy" scatter with little noticeable bias from the fiducial data vector.

We run multiple simulated likelihood analyses for a DES-Y1 cosmic shear, 2x2pt, and 3x2pt data vector and find that the Bayesian evidence ratio obtained by DES-Y1 (6.39) is rather typical. We then explore evidence ratios where noiseless data vectors computed from a µ−Σ modified gravity model are analyzed with a pipeline that assumes a ΛCDM model. Under these assumptions, a physical tension is induced between the weak lensing and galaxy clustering parts of the 3x2pt data vector, and we explore the behavior of the Bayesian evidence ratio as the strength of the modified gravity model increases (increasing Σ). We demonstrate that prior boundary effects can efficiently hide tensions between the weak lensing and galaxy clustering parts of the 3x2pt data vector. When significantly increasing the constraining power, by dividing the covariance by factors of 20 and 50, we show that such boundary effects are significantly reduced and the expected tension appears.

Our findings confirm that the evidence ratio, as interpreted with the Jeffreys scale, is biased towards compatibility between the datasets due to DES-Y1's adopted priors. These wide priors were intentionally chosen conservatively and did not take into account prior knowledge from other experiments.
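The prior dependence of the evidence ratio can be illustrated with a one-parameter toy problem that is not part of the DES-Y1 pipeline. The sketch below assumes two datasets with Gaussian likelihoods in a single parameter and a flat prior of adjustable width; all names (`log_evidence_ratio`, `mass_in_prior`, the chosen means and widths) are hypothetical. For a prior much wider than the likelihoods, ln R grows by ln 10 for every tenfold widening of the prior, so a fixed 3-unit offset between the two measurements can be read as tension or as agreement depending only on the prior.

```python
import math


def norm_pdf(x, mean, sigma):
    """Gaussian probability density."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def norm_cdf(x, mean, sigma):
    """Gaussian cumulative distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def mass_in_prior(mean, sigma, width):
    """Fraction of a Gaussian that lies inside the flat prior [-width/2, width/2]."""
    return norm_cdf(0.5 * width, mean, sigma) - norm_cdf(-0.5 * width, mean, sigma)


def log_evidence_ratio(a, b, sigma_a, sigma_b, width):
    """ln R = ln E_AB - ln E_A - ln E_B for two Gaussian measurements a and b."""
    e_a = mass_in_prior(a, sigma_a, width) / width
    e_b = mass_in_prior(b, sigma_b, width) / width
    # The product of the two likelihoods is a Gaussian in theta times an overlap factor.
    var_joint = 1.0 / (1.0 / sigma_a**2 + 1.0 / sigma_b**2)
    mean_joint = var_joint * (a / sigma_a**2 + b / sigma_b**2)
    overlap = norm_pdf(a, b, math.sqrt(sigma_a**2 + sigma_b**2))
    e_ab = overlap * mass_in_prior(mean_joint, math.sqrt(var_joint), width) / width
    return math.log(e_ab) - math.log(e_a) - math.log(e_b)


# Same data, three prior widths: only the prior changes.
for width in (10.0, 100.0, 1000.0):
    print(width, log_evidence_ratio(-1.5, 1.5, 1.0, 1.0, width))
```

In this toy setup the sign of ln R flips from negative to positive between the narrowest and the wider priors, which is the qualitative effect described above; the numbers carry no meaning for the DES-Y1 analysis itself.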
Such wide priors have the potential to hide tensions between probes. In the near future DES data quality will be superseded by Stage IV experiments, in particular Rubin Observatory's LSST (Ivezić et al. 2019), SPHEREx (Bock & SPHEREx Science Team 2018), Euclid (Masters et al. 2017), and the Roman Space Telescope (Spergel et al. 2015; Eifler et al. 2020). These experiments will provide an unprecedented amount of high-quality data that will enable not just 3x2pt analyses, as considered in this paper, but a large variety of other cosmological probes as well. Exploring tensions between probes of the same dataset and (even more interesting) between datasets will be a critical part of the data analysis of these missions, throughout which simulated analyses to calibrate tension metrics should become a standard tool in precision cosmology.

ACKNOWLEDGMENTS

We want to thank T. Eifler for the thorough review and extensive suggestions that improved our results' presentation. We would also like to thank P. Lemos and M. Raveri for fruitful discussions. VM is supported by NASA ROSES 16-ADAP16-0116 and NASA ROSES ATP 16-ATP16-0084. PR and EK are supported by Department of Energy grant DE-SC0020247.

APPENDIX A: PIPELINE VALIDATION

This appendix focuses on the technical aspects of the pipeline calibration. As shown in the main manuscript, the DES posteriors are non-Gaussian in some dimensions, while the DES priors are partially informative in several directions where the likelihood is weakly constraining. Such properties affect the required calibration of sampler hyperparameters, such as Multinest's efficiency (Feroz et al. 2013), given that the entire volume of the parameter space needs to be well sampled. Indeed, regions in parameter space with low but non-negligible likelihood can contribute to the Bayesian evidence as long as there is enough prior volume where the likelihood has similar values.

The default Multinest configuration on DES-Y1 is: number of live points n live = , tolerance = . and efficiency = . . Figure A1 reveals biases in the evidence values with such settings. For other hyperparameters, such as the number of live points, changes in the reported evidence are compatible with the quoted error bars. These statements are valid for both the shear-only and the 3x2pt analyses. One prominent feature of Figure A1 is the constant slope of the evidence bias as a function of Multinest's efficiency in the case of the 3x2pt analysis. There is no guarantee, therefore, that even efficiencies of the order of − would provide reliable results, and such settings raise the computational cost of the evidence by one order of magnitude in comparison to the hyperparameter values adopted on DES-Y1. We emphasize that no conclusions on the general applicability of Multinest can be drawn from our analysis; results are specific to DES-Y1. Figure A1 also does not imply that there are no settings where Multinest provides unbiased evidence ratios.

We also checked whether the detected biases in the evidences reported by Multinest could have been identified through features in the posterior by-product, something that would have called attention as being flagrantly corrupted. Figure A2 shows no substantial deviations in the posterior as a function of the efficiency parameter, except for a slight enlargement of the two-sigma contours, and we have run similar chains using the Emcee (Foreman-Mackey et al. 2013) sampler to confirm this statement. Comparisons between Multinest and Emcee require robust calibration of both samplers, as one could argue that a direct comparison could point to problems in Emcee.

To double-check that convergence of Emcee has been achieved, we have run extremely long chains to check the consistency of our results. Also, in Figure A3 we have compared Emcee against a third sampler, Metropolis-Hastings, where the well-established and reliable Gelman-Rubin criterion (Gelman & Rubin 1992) for convergence can be applied. Such a comparison also cross-checks our code development, which unites the Cosmolike and Cobaya pipelines. In our new code, Cosmolike receives distances, parameter values and the matter power spectrum as a function of redshift and wavenumber and returns the DES-Y1 data vector. This merging allowed us to use both the Polychord and Metropolis-Hastings samplers with the fast-slow decomposition commonly adopted in CMB analyses (Neal 2005; Lewis 2013), while the Emcee and Multinest chains employ the original standalone Cosmolike.

It is unclear how much Multinest's biases might have affected the official DES-Y1 results, and it is beyond the scope of this article to make such an in-depth analysis of the official DES-Y1 chains. We do, however, believe that the Cobaya-Cosmolike code combines the pipeline validation effort that has been performed on Cosmolike with samplers that are more robust than Multinest in evaluating Bayesian evidence ratios. Cobaya-Cosmolike also provides Metropolis-Hastings with fast-slow decomposition, which possesses robust convergence criteria; convergence is hard to assess in Emcee. Indeed, the posterior comparison between Metropolis-Hastings and Polychord shows perfect agreement, as seen in Figure A4. Moreover, Figures A5 and A6 show that Polychord's evidence and posterior are robust against variations in the adopted values of its hyperparameters.

One additional issue emerged from the comparison between the CAMB and CLASS Boltzmann codes. While the original Cosmolike is directly integrated with CLASS, the Cobaya framework provided, at the time we ran our simulations, full support only for CAMB. Differences between CAMB and CLASS should have been negligible, but we did detect an extra factor in the Halofit formula implemented by CLASS. We then modified CAMB to match the CLASS choices, and we discuss this issue in greater depth in Appendix C. In addition, CLASS has limitations on the $\Omega_b h^2$ range when dealing with BBN constraints, and because of that Cosmolike assumes the prior . < $\Omega_b h^2$ < . . We therefore applied the same prior choice in the Cobaya-Cosmolike joint pipeline. We do not expect such minor choices to affect the qualitative conclusions of this work.
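For readers unfamiliar with the Gelman-Rubin criterion mentioned in this appendix, the following is a minimal sketch of its textbook single-parameter form (Gelman & Rubin 1992), expressed as R − 1. It is an illustration only: the function name `gelman_rubin_minus_one` and the toy chains are hypothetical, and the convergence test actually implemented in Cobaya may differ in detail (for example by working with the full parameter covariance).

```python
import math
import random
import statistics


def gelman_rubin_minus_one(chains):
    """R - 1 for one parameter, given several chains of equal length."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average of the within-chain variances.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    # B / n: variance of the chain means.
    between_over_n = statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between_over_n
    return math.sqrt(pooled / within) - 1.0


random.seed(0)
# Four well-mixed chains drawn from the same unit Gaussian.
mixed = [[random.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
# The same chains, with two of them shifted to mimic chains stuck in another region.
stuck = [[x + shift for x in chain] for chain, shift in zip(mixed, (0.0, 0.0, 3.0, 3.0))]

print(gelman_rubin_minus_one(mixed))  # close to zero
print(gelman_rubin_minus_one(stuck))  # of order one
```

Values of R − 1 near zero indicate that the chains agree on the mean to well within the posterior width, which is the sense in which the criterion provides a convergence diagnostic for Metropolis-Hastings runs.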

APPENDIX B: GAUSSIAN APPROXIMATION

There is a significant difference in computational costs between running MCMC for parameter estimation and evaluating Bayesian evidence with nested sampling algorithms. The possibility of assessing evidence ratios using MCMC samples could, therefore, incentivize a more widespread use of such a metric as well as make the recalibration of the Jeffreys scale a lot simpler. Such inference is, however, generically challenging in high-dimensional spaces (see Heavens et al. (2017) and references therein). Recently, Raveri & Hu (2019) proposed a Gaussian approximation to the posterior that can provide an estimate for the evidence ratio. For DES-only chains, some partially constrained parameters are prior limited, which is an indication that the Gaussian approximation may fail. Nevertheless, we tested this approximation on a few data vectors given the potential reward such a method could have brought to the ongoing DES-Y3 analysis and this work.

We have followed Raveri & Hu (2019) closely, implementing the Gaussian approximation around either the best fit or the median of the MCMC chain. Initially, we tested this scheme on two noise realizations generated using an approximate DES-Y3 covariance (see Table A1). The use of a DES-Y3 covariance matrix represents a best-case scenario, given that more constraining data should make the Gaussian expansion work better. For shear only, the approximation does not provide accurate Bayesian evidence ratios. Results were more encouraging for the 2x2pt and 3x2pt analyses, and we further examined such cases in eight additional noise realizations. Results are shown in Figure B1. Unfortunately, there are order-unity biases that make the adoption of this approximation in our work unfeasible even for the most constraining 3x2pt analysis.

https://github.com/CosmoLike/cocoa
https://github.com/CobayaSampler/cobaya/issues/46

Figure A1. MultiNest evidence bias as a function of the sampling efficiency (left panel), number of live points $n_{\rm live}$ (middle panel) and evidence tolerance factor (right panel). As a simplifying assumption, the evidence evaluated from the chain with either the lowest efficiency or the highest number of live points or the tolerance factor has zero bias by construction. The error bars reflect MultiNest's claimed uncertainties, and no error propagation was applied to take into account the error bars in the value of the unbiased evidence.

Sampler          3×2pt DV0  3×2pt DV1  2×2pt DV0  2×2pt DV1  cosmic shear DV0  cosmic shear DV1  R DV0  R DV1
GLM - Mean       -306.4     -204.0     -172.4     -116.3     -154.5            -110.89           20.5   23.2
GLM - Chain BF   -307.5     -204.6     -176.4     -117.7     -142.1            -91.7             11     4.8
GLM - MKL        -306.4     -204.6     -176.4     -117.7     -154.5            -110.89           24.5   23.9
Polychord        − . ± .    − . ± .    − . ± .    − . ± .    − . ± .           − . ± .

Table A1. Comparison between predicted Bayesian evidences evaluated using MultiNest, PolyChord and Gaussian Linear Modeling (GLM) of Metropolis-Hastings chains around either the median of the parameters or the chain best fit. MKL stands for Minimum Kullback-Leibler divergence (Kullback & Leibler 1951), and in that row we select the Gaussian approximation from the two previous cases by minimizing the KL divergence against the full posterior (Raveri & Hu 2019). In all cases, the additional constraint . < $\Omega_b h^2$ < . was applied as an additional top-hat likelihood. DV0 and DV1 represent distinct noise realizations of the best-fit data vector.

Sampler          n_live  Efficiency  Tolerance  n_repeats
Multinest (MN)   256 0 . . –
Polychord        – . 05 3 × dim

Table A2. Default values assumed for the internal parameters employed in the multiple sampler codes we analyzed in our appendix. For MultiNest, tolerance corresponds to the evidence tolerance factor; efficiency is the sampling efficiency (the variable efr), and $n_{\rm live}$ matches the number of live points. In addition, we set to False the boolean variable that sets up the constant efficiency mode. For PolyChord, clustering was turned off by default, and $n_{\rm repeats}$ matches the variable num_repeats. Emcee runs consume a fixed amount of computer resources to ensure that chains contain no less than 5 million samples. On the other hand, Metropolis-Hastings samples were run until reaching convergence according to the Gelman-Rubin criterion, where we find the mean and standard deviation of the Gelman-Rubin criterion to be 0.02 and 0.2, respectively.
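The R columns of Table A1 can be cross-checked from the evidence columns. The sketch below assumes that the tabulated entries are natural-log evidences and that the ratio follows the usual definition R = E(3x2pt) / [E(2x2pt) E(cosmic shear)], so that ln R is a simple difference; this column assignment is inferred from the fact that it reproduces the tabulated R values to rounding. The function name is hypothetical.

```python
def log_evidence_ratio(ln_e_3x2pt, ln_e_2x2pt, ln_e_shear):
    """ln R for splitting the 3x2pt data vector into 2x2pt and cosmic shear."""
    return ln_e_3x2pt - ln_e_2x2pt - ln_e_shear


# (ln E 3x2pt, ln E 2x2pt, ln E cosmic shear, tabulated R) from the
# "GLM - Mean" row of Table A1.
glm_mean = {
    "DV0": (-306.4, -172.4, -154.5, 20.5),
    "DV1": (-204.0, -116.3, -110.89, 23.2),
}

for name, (ln_e3, ln_e2, ln_es, tabulated) in glm_mean.items():
    print(name, round(log_evidence_ratio(ln_e3, ln_e2, ln_es), 2), tabulated)
```

The same arithmetic applied to the "GLM - Chain BF" row gives approximately 11.0 and 4.8, again matching the table.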

Figure A2. The panel presents the posterior predicted by Multinest as a function of the adopted efficiency hyperparameter (legend: MN, Efficiency: 0.3; MN, Efficiency: 0.0005; EMCEE). Table A2 shows the values of additional Multinest settings. The comparison against the Emcee sampler confirms that chains with high efficiency do predict posteriors that are quite close to the truth. Indeed, no posterior feature stands out as an outlier that would signal the need for a lower efficiency, even though the high-efficiency runs show an order-unity bias in the evidence (see Figure A1).

Figure A3. The figure compares the predicted posterior for the cosmological parameters given by the Emcee and Metropolis-Hastings samplers (legend: EMCEE: 3x2 DV0; EMCEE: 3x2 DV1; COBAYA-MH: 3x2 DV0; COBAYA-MH: 3x2 DV1). Blue shades on the two-dimensional panels correspond to dashed blue lines on the 1D posterior plots. The two 3x2pt data vectors, DV0 and DV1, were data vectors with noise generated using a simulated DES-Y3 covariance. The agreement between the two samplers is a good cross-check, considering that the pipelines are somewhat different: the linear power spectrum for Emcee was evaluated within CLASS (default CosmoLike pipeline), while for Metropolis-Hastings we merged Cobaya and CosmoLike and used CAMB to calculate the matter power spectrum.

Figure A4. The figure compares the predicted posterior for the cosmological parameters given by Polychord against Metropolis-Hastings (legend: COBAYA-PC: 3x2 DV0; COBAYA-PC: 3x2 DV1; COBAYA-MH: 3x2 DV0; COBAYA-MH: 3x2 DV1). Shades on the 2D panels correspond to dashed lines on the 1D posterior plots. The two 3x2pt data vectors, DV0 and DV1, were data vectors with noise generated using a simulated DES-Y3 covariance. In both cases, the matter power spectrum was evaluated using CAMB (without removing the extra Halofit factor shown in Eq. C2).

Figure A5. The figure compares the predicted posterior for the cosmological parameters given by Polychord as a function of the hyperparameter $n_{\rm repeats}$, written in units of the number of parameters in the chain ($n_{\rm DIM}$) (legend: Shear, $n_{\rm repeats}$: $n_{\rm DIM}$; $3 \times n_{\rm DIM}$; $20 \times n_{\rm DIM}$). Blue shades on the two-dimensional panels correspond to dashed blue lines on the 1D posterior plots. For shear-shear, the posterior shows uncertain behavior in the case $n_{\rm repeats} = n_{\rm DIM}$, while no appreciable changes were seen in the range < $n_{\rm repeats}/n_{\rm DIM}$ < . This is not necessarily the case for 3x2pt data vectors, where setting $n_{\rm repeats} = n_{\rm DIM}$ is acceptable for posteriors.

Figure A6. Polychord evidence bias as a function of the $n_{\rm repeats}$ parameter (left panel), number of live points $n_{\rm live}$ (middle panel) and precision criterion (right panel). As a simplifying assumption, the evidence evaluated from the chain with the highest $n_{\rm repeats}$ (left panel), the highest number of live points (middle panel), or the lowest precision criterion factor (right panel) has zero bias by construction. The parameter $n_{\rm repeats}$ on the left panel is shown in units of the parameter dimension, $n_{\rm DIM}$. The error bars reflect Polychord's claimed uncertainties, and no error propagation was applied to take into account the error bars in the value of the unbiased evidence. Computational costs scale as $O(n_{\rm repeats})$ (Handley et al. 2015), the main bottleneck of our chains, so we have adopted $n_{\rm repeats}$ = × $n_{\rm DIM}$ as a middle ground between accuracy and computational costs.

Figure B1.

The panels present the comparison between the Bayesian evidence calculated using the Gaussian approximation and Polychord's results. Bias is defined as the difference in the natural logarithm of the Bayesian evidence. The left panel assumes the 3x2pt data vector, while the right panel restricts the analysis to galaxy-galaxy lensing and galaxy clustering. The data vectors were randomly generated using a simulated DES-Y3 covariance. Blue triangles with thick error bars show the results when the Gaussian approximation is made around the median of the chain (legend: Median), while black round points show the results for the Gaussian approximation around the chain sample with the best likelihood (legend: Max Like). The error bars reflect Polychord's claimed uncertainties.
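To make the idea of estimating an evidence from MCMC samples concrete, here is a minimal sketch of the textbook Laplace (Gaussian) approximation, ln E ≈ ln[L(θ*) π(θ*)] + (d/2) ln 2π + (1/2) ln det C, with C estimated from the chain. This is a generic illustration and not the Gaussian linear model of Raveri & Hu (2019); all function names and the toy likelihood are hypothetical. The toy uses a flat prior far wider than the likelihood, which is exactly the regime where the approximation works and the opposite of the prior-limited DES case discussed in this appendix.

```python
import math
import random


def sample_covariance(samples):
    """Unbiased sample covariance matrix of a list of d-dimensional points."""
    n, d = len(samples), len(samples[0])
    means = [sum(s[k] for s in samples) / n for k in range(d)]
    return [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / (n - 1)
             for j in range(d)] for i in range(d)]


def log_det_spd(matrix):
    """ln det of a symmetric positive-definite matrix via Cholesky factorization."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = math.sqrt(s) if i == j else s / lower[j][j]
    return 2.0 * sum(math.log(lower[i][i]) for i in range(d))


def laplace_log_evidence(samples, log_like_times_prior_at_peak):
    """Laplace estimate of ln E from posterior samples and ln[L * prior] at the peak."""
    d = len(samples[0])
    return (log_like_times_prior_at_peak
            + 0.5 * d * math.log(2.0 * math.pi)
            + 0.5 * log_det_spd(sample_covariance(samples)))


# Toy posterior: correlated 2D Gaussian likelihood with peak value 1,
# flat prior on the box [-5, 5]^2 (prior density 1/100).
random.seed(3)
sigma_1, sigma_2, rho = 0.3, 0.5, 0.6
chain = []
for _ in range(20000):
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    chain.append((sigma_1 * z1, sigma_2 * (rho * z1 + math.sqrt(1.0 - rho**2) * z2)))

log_prior = -math.log(10.0 * 10.0)
ln_e_laplace = laplace_log_evidence(chain, 0.0 + log_prior)  # ln L at the peak is 0
ln_e_exact = math.log(2.0 * math.pi * sigma_1 * sigma_2 * math.sqrt(1.0 - rho**2)) + log_prior
print(ln_e_laplace, ln_e_exact)
```

When the posterior is truncated by the prior or is non-Gaussian, the covariance term no longer captures the posterior volume, which is one way order-unity biases such as those in Figure B1 can arise.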

APPENDIX C: HALOFIT

One practical issue that emerged in our sampler comparison is related to implementation differences between the CAMB and CLASS codes. The Cobaya pipeline version adopted in this work had only partial support for CLASS, while CosmoLike is incompatible with CAMB. Therefore, the Metropolis-Hastings and Polychord chains employed CAMB to evaluate the background comoving distances and the nonlinear matter power spectrum, while the Multinest and Emcee chains used CLASS. We consequently tested the compatibility between these Boltzmann codes, and discrepancies in the Halofit formula were spotted.

The original Takahashi Halofit formula for the nonlinear matter power spectrum $\Delta^2(k) = k^3 P(k)/(2\pi^2)$ is given by

$$\Delta^2(k) = \Delta^2_Q(k) + \Delta^2_H(k). \qquad \text{(C1)}$$

The specific expressions for $\Delta^2_Q(k)$ and $\Delta^2_H(k)$ can be found in Takahashi et al. (2012). Both CLASS and CAMB have updates to the Takahashi formula that aim to provide better agreement for cosmologies with massive neutrinos. We were unable to find references in peer-reviewed journals for such updates. One of the new terms in CLASS is the following:

$$\Delta^2_Q(k) \rightarrow \Delta^2_Q(k)\,\{\; + f_\nu\,[\; . - . \times (\Omega_m - . )\;]\}, \qquad \text{(C2)}$$

with $f_\nu \equiv \Omega_\nu/\Omega_m$. In CAMB, on the other hand, the term proportional to $(\Omega_m - . )$ does not exist; the impact of this factor is shown in Figure C1.

APPENDIX D: NESTED SAMPLING

Evaluation of the Bayesian evidence is possible with nested sampling algorithms (Skilling 2006), and we briefly review them in this appendix. Let $P(\vec{\theta}|\mathcal{H})$ be the prior distribution of the parameters $\vec{\theta}$ within a model $\mathcal{H}$, $\mathcal{L}$ be the likelihood $P(\vec{d}|\vec{\theta},\mathcal{H})$, and $E$ be the evidence $P(\vec{d}|\mathcal{H})$. We define $X(\lambda)$ to be the fraction of the prior volume contained within the isolikelihood contour given by $P(\vec{d}|\vec{\theta},\mathcal{H}) = \lambda$:

$$X(\lambda) = \int_{\mathcal{L}>\lambda} d\vec{\theta}\, P(\vec{\theta}|\mathcal{H}). \qquad \text{(D1)}$$

Nested sampling algorithms evaluate evidences via the one-dimensional integral

$$E = \int_0^1 \mathcal{L}(X)\, dX. \qquad \text{(D2)}$$

This integration is performed by maintaining a set of $n_{\rm live}$ live points that samples a sequence of exponentially contracting volumes respecting the hard boundary $\mathcal{L} > \mathcal{L}_i$ at iteration $i+1$. The value $\mathcal{L}_i$ corresponds to the worst likelihood of all live points at iteration $i$; that point is subsequently discarded and replaced by another point with $\mathcal{L} > \mathcal{L}_i$. Making this replacement efficient is the technically challenging part of the algorithm (see Feroz et al. (2013) and Handley et al. (2015) for specific implementations). The discarded points are named dead points, and the discretization of the one-dimensional evidence integral above is given by

$$E \approx \sum_{i \in {\rm dead}} \left(X_{i-1} - X_i\right) \times \mathcal{L}_i. \qquad \text{(D3)}$$

The precise $X_i$ volumes are unknown but can be estimated probabilistically. To reconstruct the prior volume at the $i$th iteration, the algorithm samples $n_{\rm live}$ times the uniform distribution spanning from 0 to $X_{i-1}$ and retrieves the maximum prior volume (Skilling 2006).

The same procedure can also be used to calculate the KL divergence

$$\mathcal{D} \approx \sum_{i \in {\rm dead}} \left(X_{i-1} - X_i\right) \times \mathcal{L}_i \ln\left(\frac{\mathcal{L}_i}{E}\right). \qquad \text{(D4)}$$

This expression allows us to evaluate suspiciousness using the same nested sampling runs used to calculate the evidence, and we have cross-checked our numerical results for the KL divergence against the anesthetic package (run on the same chains) (Handley 2019). Finally, this section also shows why the evaluation of the Surprise metric is challenging. The calculation of the relative entropy between datasets would require additional nested sampling runs where the "prior" would be one of the dataset's posteriors.

CAMB commit on the official GitHub repository https://github.com/cmbant/CAMB . CLASS commit on the official GitHub repository https://github.com/lesgourg/class_public

Figure C1. This figure compares the impact of the additional term that CLASS implements in Halofit in comparison to the expression that CAMB assumes for the nonlinear completion of the matter power spectrum (legend: CAMB Halofit: 3x2 DV0; CAMB Halofit: 3x2 DV1; Class Halofit: 3x2 DV0; Class Halofit: 3x2 DV1). All MCMC chains adopted the Metropolis-Hastings sampler and the CAMB code. Shades on the two-dimensional panels correspond to dashed lines on the one-dimensional posterior plots. The two 3x2pt data vectors, DV0 and DV1, were randomly generated around the default cosmology using a simulated DES-Y3 covariance. As expected, the posteriors differ the most in the volume of parameter space associated with high values for the sum of neutrino masses. Such discrepancy is also non-negligible in the one-dimensional $\Omega_m$ and $H_0$ marginalized posteriors.

REFERENCES

Abazajian K. N., et al., 2016. (arXiv:1610.02743)
Abbott T., et al., 2005. (arXiv:astro-ph/0510346)
Abbott T. M. C., et al., 2018a, Mon. Not. Roy. Astron. Soc., 480, 3879
Abbott T. M. C., et al., 2018b, Phys. Rev., D98, 043526
Abbott T. M. C., et al., 2019a, ApJ, 872, L30
Abbott T. M. C., et al., 2019b, Phys. Rev., D99, 123505
Ade P. A. R., et al., 2016, Astronomy & Astrophysics, 594, A14
Aghanim N., et al., 2018. (arXiv:1807.06209)
Akeson R., et al., 2019. (arXiv:1902.05569)
Alam S., et al., 2017, Mon. Not. Roy. Astron. Soc., 470, 2617
Albrecht A., et al., 2006. (arXiv:astro-ph/0609591)
Albrecht A., et al., 2009. (arXiv:0901.0721)
Asgari M., et al., 2020. (arXiv:2007.15633)
Austermann J. E., et al., 2012, SPTpol: an instrument for CMB polarization measurements with the South Pole Telescope. p. 84521E, doi:10.1117/12.927286
Blas D., Lesgourgues J., Tram T., 2011, J. Cosmology Astropart. Phys., 2011, 034
Bock J., SPHEREx Science Team 2018, in American Astronomical Society Meeting Abstracts
(arXiv:1306.2144)
Ferreira P. G., Skordis C., 2010, Physical Review D, 81
Foreman-Mackey D., Hogg D. W., Lang D., Goodman J., 2013, PASP, 125, 306
Gelman A., Rubin D. B., 1992, Statist. Sci., 7, 457
Handley W., 2019. (arXiv:1905.04768), doi:10.21105/joss.01414
Handley W., Lemos P., 2019, Phys. Rev., D100, 043504
Handley W. J., Hobson M. P., Lasenby A. N., 2015, Mon. Not. R. Astron. Soc., 453, 4384
Heavens A., Fantaye Y., Mootoovaloo A., Eggers H., Hosenie Z., Kroon S., Sellentin E., 2017. (arXiv:1704.03472)
Heymans C., et al., 2020, arXiv e-prints, p. arXiv:2007.15632
Hikage C., et al., 2019, Publ. Astron. Soc. Jap., 71, 43, https://doi.org/10.1093/pasj/psz010
Hildebrandt H., et al., 2018. (arXiv:1812.06076)
Hirata C. M., Seljak U., 2004, Phys. Rev. D, 70, 063526
Howlett C., Lewis A., Hall A., Challinor A., 2012, J. Cosmology Astropart. Phys., 1204, 027
Hoyle B., et al., 2018, Mon. Not. R. Astron. Soc., 478, 592
Ivanov M. M., Simonović M., Zaldarriaga M., 2020, Phys. Rev. D, 101, 083504
Ivezić Ž., et al., 2019, ApJ, 873, 111
Knox L., Millea M., 2019. (arXiv:1908.03663)
Krause E., Eifler T., 2017, Mon. Not. R. Astron. Soc., 470, 2100
Krause E., et al., 2017. (arXiv:1706.09359)
Kullback S., Leibler R. A., 1951, Ann. Math. Statist., 22, 79
Lemos P., Köhlinger F., Handley W., Joachimi B., Whiteway L., Lahav O., 2019. (arXiv:1910.07820)
Lesgourgues J., 2011a, arXiv e-prints, p. arXiv:1104.2932
Lesgourgues J., 2011b, arXiv e-prints, p. arXiv:1104.2934
Lesgourgues J., Tram T., 2011, J. Cosmology Astropart. Phys., 2011, 032
Levi M. E., et al., 2019. (arXiv:1907.10688)
Lewis A., 2013, Phys. Rev., D87, 103529
Lewis A., 2019. (arXiv:1910.13970)
Lewis A., Challinor A., Lasenby A., 2000, ApJ, 538, 473
Lin C.-H., Harnois-Déraps J., Eifler T., Pospisil T., Mandelbaum R., Lee A. B., Singh S., 2019, arXiv e-prints, p. arXiv:1905.03779
Linder E. V., 2003, Phys. Rev. Lett., 90, 091301
Liske J., et al., 2015, Mon. Not. R. Astron. Soc., 452, 2087
Marshall P., Rajguru N., Slosar A., 2006, Phys. Rev., D73, 067302
Masters D. C., Stern D. K., Cohen J. G., Capak P. L., Rhodes J. D., Castander F. J., Paltani S., 2017, ApJ, 841, 111
Neal R. M., 2005. (arXiv:math/0502099)
Nesseris S., Garcia-Bellido J., 2013, JCAP, 1308, 036
Perlmutter S., et al., 1999, Astrophys. J., 517, 565
Planck Collaboration et al., 2018. (arXiv:1807.06205)
Prakash A., et al., 2016, Astrophys. J. Suppl., 224, 34
Raveri M., Hu W., 2019, Phys. Rev., D99, 043506
Riess A. G., et al., 1998, Astron. J., 116, 1009
Riess A. G., Casertano S., Yuan W., Macri L. M., Scolnic D., 2019, Astrophys. J., 876, 85
Rozo E., et al., 2016, Mon. Not. R. Astron. Soc., 461, 1431
Scolnic D. M., et al., 2018, Astrophys. J., 859, 101
Seehars S., Amara A., Refregier A., Paranjape A., Akeret J., 2014, Phys. Rev., D90, 023533
Seehars S., Grandis S., Amara A., Refregier A., 2016, Phys. Rev., D93, 103507
Simpson F., et al., 2012, Monthly Notices of the Royal Astronomical Society, 429, 2249–2263
Skilling J., 2006, Bayesian Anal., 1, 833
Spergel D., et al., 2015, arXiv e-prints, p. arXiv:1503.03757
Takada M., Hu W., 2013, Phys. Rev. D, 87, 123504
Takahashi R., Sato M., Nishimichi T., Taruya A., Oguri M., 2012, Astrophys. J., 761, 152
Tegmark M., Eisenstein D. J., Hu W., Kron R. G., 1998a
Tegmark M., Eisenstein D. J., Hu W., 1998b, in 33rd Rencontres de Moriond: Fundamental Parameters in Cosmology. pp 355–358 (arXiv:astro-ph/9804168)
The LSST Dark Energy Science Collaboration et al., 2018. (arXiv:1809.01669)
Thornton R. J., et al., 2016, ApJS, 227, 21
Torrado J., Lewis A., 2020
Troxel M. A., et al., 2018, Phys. Rev., D98, 043528
Verde L., Treu T., Riess A. G., 2019, in Nature Astronomy 2019. (arXiv:1907.10625), doi:10.1038/s41550-019-0902-0
Zuntz J., et al., 2018, Mon. Not. R. Astron. Soc., 481, 1149