Diagnostics for Conditional Density Models and Bayesian Inference Algorithms
David Zhao, Niccolò Dalmasso, Rafael Izbicki, Ann B. Lee
Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Department of Statistics, Federal University of São Carlos (UFSCar), São Carlos, Brazil
Abstract
Conditional density models $f(y \mid x)$, where $x$ represents a potentially high-dimensional feature vector, are an integral part of uncertainty quantification in prediction and Bayesian inference. However, such models can be difficult to calibrate. While existing validation techniques can determine whether an approximated conditional density is compatible overall with a data sample, they lack practical procedures for identifying, localizing, and interpreting the nature of (statistically significant) discrepancies over the entire feature space. In this paper, we present more discerning diagnostics such as (i) the "Local Coverage Test" (LCT), which is able to distinguish an arbitrarily misspecified model from the true conditional density of the sample, and (ii) "Amortized Local P-P plots" (ALP), which can quickly provide interpretable graphical summaries of distributional differences at any location $x$ in the feature space. Our validation procedures scale to high dimensions and can potentially adapt to any type of data at hand. We demonstrate the effectiveness of LCT and ALP through a simulated experiment and a realistic application to parameter inference for galaxy images.

Conditional density models play a key role in uncertainty quantification. For instance, the conditional density $f(y \mid x)$ of the response variable $y$ given features $x$ can be used to build predictive regions for $y$, which are more informative than point predictions. Indeed, in prediction problems, $f$ provides a full account of the uncertainty in the outcome $y$ given new observations $x$. Conditional densities are also central to Bayesian inference. A key component of Bayesian parameter inference is the posterior distribution $f(\theta \mid x)$, which quantifies uncertainty about the parameters $\theta$ of interest after observing data $x$. Often the posterior $f$ is not tractable, so computational methods such as Markov chain Monte Carlo [Brooks et al., 2011] are used to produce samples from an approximation of $f$ at the observed value of $x$. Recently, a large body of work in machine learning has been developed for estimating conditional densities $f$ for all possible values of $x$ (see Uria et al. 2014, Sohn et al. 2015, Papamakarios et al. 2017, Dutordoir et al. 2018 and references therein). With the advent of high-precision data and simulations, simulation-based inference (SBI; Cranmer et al. [2020]) has also played a growing role in disciplines ranging from physics, chemistry and engineering to the biological and social sciences. The latter SBI category includes both classical MCMC-based methods for approximating posteriors (e.g., ABC; Sisson et al. [2018]), as well as machine-learning-based methods that learn an explicit surrogate model of the posterior [Marin et al., 2016, Papamakarios and Murray, 2016, Lueckmann et al., 2017, Chen and Gutmann, 2019, Izbicki et al., 2019, Greenberg et al., 2019].

Inevitably, any downstream analysis in Bayesian inference or predictive modeling will depend on the trustworthiness of the assumed conditional density model. Validating such models can be challenging, especially for a high-dimensional feature vector or mixed-type data $x$.
To date, we do not have a comprehensive and rigorous process of testing, for all possible values of $x$, whether a conditional density model $\hat{f}(y \mid x)$ fits actually observed data, or equivalently (in the Bayesian framework) whether an approximation $\hat{f}(\theta \mid x)$ of the posterior fits the true posterior given by the model.

In machine learning, validation and model selection typically rely on a global loss function like the Kullback-Leibler divergence or the $L^2$ loss [Izbicki et al., 2017a, Rothfuss et al., 2019]. Loss functions are useful for training models, but they only provide relative comparisons of overall model fit. Hence, a practitioner may not know whether he or she should keep looking for better models (using larger training samples, training times, etc.), or if the current estimate is "close enough" to the true density. Current goodness-of-fit diagnostics for conditional density models $\hat{f}$ only test for a form of overall coherence between a data-averaged conditional (posterior) distribution and its marginal (prior) distribution, by computing so-called probability integral transform (PIT) values [Cook et al., 2006, Freeman et al., 2017, Talts et al., 2018, D'Isanto and Polsterer, 2018]. Although these diagnostic methods provide valuable information to practitioners, they were originally developed for assessing unconditional density models [Gan and Koehler, 1990]; as such, they are unable to detect some clearly misspecified conditional models. Indeed, PIT diagnostics are insensitive to covariate transformations (Theorem 1). As an example, models trained on only a subset of the features can pass these checks (see Section 4).

In this paper, we present a new validation framework specifically developed for assessing the quality of fit of a conditional density model or Bayesian inference algorithm. To enrich our vocabulary for desired properties of conditional density estimators (CDEs), we begin by defining global and local consistency (see Definitions 1 and 3, respectively). We then propose a validation approach that efficiently estimates the coverage of a conditional density model $\hat{f}(\cdot \mid x)$ at every location $x$ in the feature space. Our proposed machinery has three main components:

• [GCT - Global Coverage Test] A statistical hypothesis test that is able to distinguish any misspecified density model from the true conditional density. (This is a test of global consistency.)

• [LCT - Local Coverage Test]
A statistical hypothesis test that is able to tell where in the feature space the model needs to be improved. (This is a test of local consistency.)

• [ALP - Amortized Local P-P plots]
A visualization tool that quickly provides interpretable graphical summaries of the fitted model by showing how it deviates from the true density at any location in feature space (see Figure 1 for examples).

Our proposed diagnostics are easy to compute, and can quickly identify, locate, and interpret the nature of (statistically significant) discrepancies over the entire feature space. At the heart of our approach is the realization that the local coverage of a CDE model is itself a conditional probability (see Equation 5) that often varies smoothly with $x$. Hence, we can estimate the local coverage at any given $x$ by leveraging a suitable regression method using sample points in a neighborhood of $x$. Thanks to the impressive arsenal of existing regression methods, we can adapt to different types of potentially high-dimensional data to obtain computationally and statistically efficient validation. Finally, because we specifically evaluate local coverage (rather than other types of discrepancies), the practitioner can, when the LCT flags statistically significant local discrepancies, "zoom in" on those locations and identify common modes of failure in the fitted conditional density (see Figures 4 and 5 for examples).

This paper is organized as follows: in Section 2, we define properties that goodness-of-fit tests would ideally have, and show how existing diagnostics may not satisfy them. In Section 3, we introduce our new diagnostic framework with GCT, LCT, and ALP. In Section 4, we show that our framework can identify and diagnose omitted variable bias of CDEs in a prediction setting. In Section 5, we consider a realistic application of validation of Bayesian inference algorithms fit to high-dimensional simulated galaxy images.

Notation.
Let $\mathcal{D} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ denote an i.i.d. sample from $F_{X,Y}$, the joint distribution of $(X, Y)$ for a random variable $Y \in \mathcal{Y} \subset \mathbb{R}$ (in Section 3.3, $Y$ is multivariate), and a random vector $X \in \mathcal{X} \subset \mathbb{R}^d$. In a prediction setting, $\mathcal{D}$ represents a hold-out set not used to train $\hat{f}$. In a Bayesian setting, $Y$ represents the parameter of interest (sometimes also denoted by $\theta$), and each element of $\mathcal{D}$ is obtained by first drawing $Y_i$ from the prior distribution, and then drawing $X_i$ from the statistical model of $X \mid Y_i$.

Our goal is to check whether the conditional density model $\hat{f}(y \mid x)$ is a good estimate of the true density $f(y \mid x)$. Ideally, such a test should be able to distinguish any model $\hat{f}(y \mid x)$ from the true density $f(y \mid x)$, as well as locate discrepancies in the feature space $\mathcal{X}$. More precisely, a test should be able to identify what we in this section define as global and local consistency.

Definition 1 (Global Consistency). An estimate $\hat{f}(y \mid x)$ is globally consistent with the density $f(y \mid x)$ if the following null hypothesis holds:

$H_0: \hat{f}(y \mid x) = f(y \mid x)$ for almost every $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. (1)

Existing diagnostics typically calibrate density models by computing PIT values on independent data, which were not used to estimate $\hat{f}(y \mid x)$:

Definition 2 (PIT). Fix $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. The probability integral transform of $y$ at $x$, as modeled by the conditional density estimate $\hat{f}(y \mid x)$, is

$\mathrm{PIT}(y; x) = \int_{-\infty}^{y} \hat{f}(y' \mid x) \, dy'. \quad (2)$

See Figure 2, top panel, for an illustration of this calculation.
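To make Definition 2 concrete, the following Python sketch (ours, not part of the original paper) computes PIT values for a toy Gaussian CDE; the callable cde_pdf is an assumed interface for the density estimate, and the empirical-CDF variant anticipates Remark 1 below:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def pit_explicit(cde_pdf, y, x):
    """PIT(y; x): numerically integrate the density estimate f_hat(.|x) up to y."""
    value, _ = quad(lambda y_prime: cde_pdf(y_prime, x), -np.inf, y)
    return value

def pit_from_samples(y, y_draws):
    """PIT(y; x) from draws Y_1, ..., Y_L ~ f_hat(.|x) (see Remark 1 below)."""
    return np.mean(np.asarray(y_draws) <= y)

# Toy example: a Gaussian CDE f_hat(y | x) = N(y; x, 1).
cde_pdf = lambda y_prime, x: norm.pdf(y_prime, loc=x, scale=1.0)
print(pit_explicit(cde_pdf, y=0.5, x=0.0))                   # ~0.691, i.e. Phi(0.5)
print(pit_from_samples(0.5, norm.rvs(0.0, 1.0, size=5000)))  # Monte Carlo version
```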
Remark 1. For implicit models of $\hat{f}(y \mid x)$ (that is, generative models from which we can sample, e.g. via MCMC, but whose density $\hat{f}(y \mid x)$ we cannot evaluate directly), we can approximate the PIT values by forward-simulating data: for fixed $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, draw $Y_1, \ldots, Y_L \sim \hat{f}(\cdot \mid x)$. Then, approximate $\mathrm{PIT}(y; x)$ via the empirical CDF $\frac{1}{L} \sum_{l=1}^{L} \mathbb{I}(Y_l \leq y)$.

If the conditional density model $\hat{f}(y \mid x)$ is globally consistent, then the PIT values are uniformly distributed. More precisely, if $H_0$ (Equation 1) is true, then the random variables $\mathrm{PIT}(Y_1; X_1), \ldots, \mathrm{PIT}(Y_n; X_n) \overset{i.i.d.}{\sim} \mathrm{Unif}(0, 1)$. This result is often used to test goodness-of-fit of conditional density models in practice [Cook et al., 2006, Bordoloi et al., 2010, Tanaka et al., 2018].

Our first point is that, unfortunately, such random variables can be uniformly distributed even if global consistency does not hold. This is shown in the following theorem.

Theorem 1 (Insensitivity to covariate transformations). Suppose there exists a function $g: \mathcal{X} \to \mathcal{Z}$ that satisfies

$\hat{f}(y \mid x) = f(y \mid g(x)). \quad (3)$

Let $(X, Y) \sim F_{X,Y}$. Then $\mathrm{PIT}(Y; X) \sim \mathrm{Unif}(0, 1)$.

Many models naturally lead to estimates that could satisfy the condition in Equation 3, even without being globally consistent. In fact, clearly misspecified models $\hat{f}$ can yield uniform PIT values and "pass" an associated goodness-of-fit test regardless of the sample size. For example: if $\hat{f}(y \mid x)$ is based on a linear model, then the estimate $\hat{f}(y \mid x)$ will by construction depend on $x$ only through $g(x) := \beta^T x$ for some $\beta \in \mathbb{R}^d$. As a result, we could have $\hat{f}(y \mid x) = f(y \mid g(x))$ even when $\hat{f}(y \mid x)$ is potentially very different from $f(y \mid x)$. As another example, a conditional density estimator that performs variable selection [Shiga et al., 2015, Izbicki et al., 2017b, Dalmasso et al., 2020] could satisfy $\hat{f}(y \mid x) = f(y \mid g(x))$ for $g(x) := x_S$, where $S \subset \{1, \ldots, d\}$ is a subset of the covariates. A test of the overall uniformity of PIT values is no guarantee that we are correctly modeling the relationship between $y$ and the predictors $x$; see Figure 3 for an illustration.

Our second point is that current diagnostics also do not pinpoint the locations in feature space $\mathcal{X}$ where the estimates of $f$ should be improved. Hence, in addition to global consistency, we need diagnostics that test the following property:

Definition 3 (Local Consistency). Fix $x \in \mathcal{X}$. An estimate $\hat{f}(y \mid x)$ is locally consistent with the density $f(y \mid x)$ at fixed $x$ if the following null hypothesis holds:

$H_0(x): \hat{f}(y \mid x) = f(y \mid x)$ for every $y \in \mathcal{Y}$. (4)

In the next section, we introduce new diagnostics that are able to test whether a conditional density model $\hat{f}$ is both globally and locally consistent with the underlying conditional distribution $f$ of the data. Our diagnostics are still based on PIT, and hence retain the properties (e.g., interpretability, ability to provide graphical summaries, and so on) that have made PIT a popular choice in model validation. Our new diagnostics rely on the following key result:
Theorem 2 (Local Consistency and Pointwise Uniformity). For any $x \in \mathcal{X}$, the local null hypothesis $H_0(x): \hat{f}(\cdot \mid x) = f(\cdot \mid x)$ holds if, and only if, the distribution of $\mathrm{PIT}(Y; x)$ given $x$ is uniform over $(0, 1)$.

Theorem 2 implies that if we had a sample of $Y$'s at the fixed location $x$, we could test the local consistency (Definition 3) of $\hat{f}$ by determining whether the sample's PIT values come from a uniform distribution. In addition, for global consistency we need local consistency at every $x \in \mathcal{X}$. Clearly, such a testing procedure would not be practical: typically, we have data of the form $(X_1, Y_1), \ldots, (X_n, Y_n)$ with at most one observation at any given $x \in \mathcal{X}$.

Our solution is to instead address this problem as a regression: for fixed $\alpha \in (0, 1)$, we consider the cumulative distribution function (CDF) of PIT at $x$,

$r_\alpha(x) := \mathbb{P}(\mathrm{PIT}(Y; x) < \alpha \mid x), \quad (5)$

which, interestingly, is the regression of the random variable $Z_\alpha := \mathbb{I}(\mathrm{PIT}(Y; X) < \alpha)$ on $X$.

From Theorem 2, it follows that the estimated density is locally consistent at $x$ if and only if $r_\alpha(x) = \alpha$ for every $\alpha$:

Corollary 1.
Fix $x \in \mathcal{X}$. Then $r_\alpha(x) = \alpha$ for every $\alpha \in (0, 1)$ if, and only if, $\hat{f}(y \mid x) = f(y \mid x)$ for every $y \in \mathcal{Y}$.

Our new diagnostics are able to test for both local and global consistency. They rely on the simple idea of estimating $r_\alpha(x)$ and then evaluating how much it deviates from $\alpha$ (see Section 3.1). Note that

$\mathrm{PIT}(Y; x) < \alpha \iff Y \in (-\infty, \hat{q}_\alpha(x)),$

where $\hat{q}_\alpha(x)$ is the $\alpha$-quantile of $\hat{f}$. That is, $r_\alpha(x)$ evaluates the local level-$\alpha$ coverage of $\hat{f}$ at $x$. In Section 3.2, we explore the connection between test statistics and coverage, for interpretable descriptions of how conditional density models $\hat{f}$ may fail to approximate the true conditional density $f$.

Our procedure for testing local and global consistency is very simple and can be adapted to different types of data. For an i.i.d. test sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ from $F_{X,Y}$ (which was not used to construct $\hat{f}$), we compute $Z_{\alpha,i} := \mathbb{I}(\mathrm{PIT}(Y_i; X_i) < \alpha)$. To estimate the coverage $r_\alpha(x)$ (Equation 5) for any $x \in \mathcal{X}$, we then simply regress $Z_\alpha$ on $X$ using the transformed data $(X_1, Z_{\alpha,1}), \ldots, (X_n, Z_{\alpha,n})$. Numerous classes of regression estimators can be used, from kernel smoothers to random forests to neural networks.

To test local consistency (Definition 3), we introduce the Local Coverage Test (LCT) with the test statistic

$T(x) := \frac{1}{|G|} \sum_{\alpha \in G} \left( \hat{r}_\alpha(x) - \alpha \right)^2,$

where $\hat{r}_\alpha$ denotes the regression estimator and $G$ is a grid of $\alpha$ values. Large values of $T(x)$ indicate a large discrepancy between $\hat{f}$ and $f$ at $x$ in terms of coverage, and Corollary 1 links coverage to consistency. To decide on the correct cutoff for rejecting $H_0(x)$, we use a Monte Carlo technique that simulates $T(x)$ under $H_0$. Algorithm 1 details our procedure.

Similarly, we can also test global consistency (Definition 1) with a Monte Carlo strategy. We introduce the Global Coverage Test (GCT) based on the following test statistic:

$S := \frac{1}{n} \sum_{i=1}^{n} T(X_i).$

We recommend performing the global test first and, if the global null is rejected, investigating further with local tests.
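Before the formal procedure (Algorithm 1 below), here is a minimal Python sketch of the regression step; this is our own illustration, with a random forest standing in for the generic regression estimator $\hat{r}$, and the variable names (pit_values, alpha_grid) are not from the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_local_coverage(X, pit_values, alpha_grid):
    """For each alpha in the grid, regress Z_alpha = I(PIT < alpha) on X."""
    regressors = {}
    for alpha in alpha_grid:
        z_alpha = (pit_values < alpha).astype(float)  # indicator responses
        reg = RandomForestRegressor(n_estimators=100, random_state=0)
        regressors[alpha] = reg.fit(X, z_alpha)
    return regressors

def lct_statistic(regressors, x):
    """T(x): mean squared deviation of estimated local coverage from alpha."""
    x = np.atleast_2d(x)
    devs = [(reg.predict(x)[0] - alpha) ** 2 for alpha, reg in regressors.items()]
    return np.mean(devs)

def gct_statistic(regressors, X_test):
    """S: average of T(X_i) over the test sample."""
    return np.mean([lct_statistic(regressors, x) for x in X_test])
```

The null distributions of $T(x)$ and $S$ are then obtained by refitting the same regressions on Unif(0, 1) draws in place of the PIT values, exactly as in the Monte Carlo loops of Algorithms 1 and 2.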
Algorithm 1
P-values for Local Coverage Test
Require: conditional density model $\hat{f}$; test data $\{X_i, Y_i\}_{i=1}^n$; test point $x \in \mathcal{X}$; regression estimator $\hat{r}$; number of null training samples $B$
Ensure: estimated p-value $\hat{p}(x)$ for any $x \in \mathcal{X}$
// Compute test statistic at x:
Compute values $\mathrm{PIT}(Y_1; X_1), \ldots, \mathrm{PIT}(Y_n; X_n)$
$G \leftarrow$ grid of $\alpha$ values in $(0, 1)$
for $\alpha$ in $G$ do
    Compute indicators $Z_{\alpha,1}, \ldots, Z_{\alpha,n}$
    Train regression method $\hat{r}_\alpha$ on $\{X_i, Z_{\alpha,i}\}_{i=1}^n$
end for
Compute test statistic $T(x)$
// Recompute test statistic under null distribution:
for $b$ in $1, \ldots, B$ do
    Draw $U_1^{(b)}, \ldots, U_n^{(b)} \sim \mathrm{Unif}(0, 1)$
    for $\alpha$ in $G$ do
        Compute indicators $\{Z_{\alpha,i}^{(b)} = \mathbb{I}(U_i^{(b)} < \alpha)\}_{i=1}^n$
        Train regression method $\hat{r}_\alpha^{(b)}$ on $\{X_i, Z_{\alpha,i}^{(b)}\}_{i=1}^n$
    end for
    Compute $T^{(b)}(x) := \frac{1}{|G|} \sum_{\alpha \in G} (\hat{r}_\alpha^{(b)}(x) - \alpha)^2$
end for
return $\hat{p}(x) := \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}\left(T(x) < T^{(b)}(x)\right)$

Our diagnostic framework does not just give us the ability to identify deviations from local consistency in different parts of the feature space $\mathcal{X}$. It also provides us with insight into the nature of such deviations at any given location $x$.

For unconditional density models, data scientists have long favored using P-P plots (which plot two cumulative distribution functions against each other) to assess how closely a density model agrees with actual observed data. What makes our work unique is that we are able to construct "amortized local P-P plots" (ALPs) with similar interpretations to assess conditional density models over the entire feature space. Figure 1 illustrates how a local P-P plot of $\hat{r}_\alpha(x)$ against $\alpha$ (that is, the estimated CDF against the true CDF at $x$) can identify different types of deviations in a conditional density model. For example, positive or negative bias in the estimated density $\hat{f}$ relative to $f$ leads to P-P plot values that are too high or too low, respectively. We can also easily identify overdispersion or underdispersion of $\hat{f}$ from an "S"-shaped P-P plot.

Of particular note is that our local P-P plots are "amortized", in the sense that computationally expensive steps do not have to be repeated with e.g. Monte Carlo sampling at each $x$ of interest. Both the consistency tests in Section 3.1 and the local P-P plots only require initially training $\hat{r}_\alpha$ on the observed data; the regression estimator can then be used to compute $\hat{r}_\alpha(x_{\mathrm{val}})$ at any new evaluation point $x_{\mathrm{val}}$. Because of the flexibility in the choice of regression method, our construction also potentially scales to high-dimensional or different types of data $x$. Algorithm 2 details the construction of ALPs, including how to compute confidence bands by a Monte Carlo algorithm.
Algorithm 2
Confidence bands for local P-P plot
Require: test data $\{X_i\}_{i=1}^n$; test point $x \in \mathcal{X}$; regression estimator $\hat{r}$; number of null training samples $B$; confidence level $\eta$
Ensure: estimated upper and lower confidence bands $U(x), L(x)$ at level $1 - \eta$ for any $x \in \mathcal{X}$
// Recompute regression under null distribution:
$G \leftarrow$ grid of $\alpha$ values in $(0, 1)$
for $b$ in $1, \ldots, B$ do
    Draw $U_1^{(b)}, \ldots, U_n^{(b)} \sim \mathrm{Unif}(0, 1)$
    for $\alpha$ in $G$ do
        Compute indicators $\{Z_{\alpha,i}^{(b)} = \mathbb{I}(U_i^{(b)} < \alpha)\}_{i=1}^n$
        Train regression method $\hat{r}_\alpha^{(b)}$ on $\{X_i, Z_{\alpha,i}^{(b)}\}_{i=1}^n$
    end for
    Compute $\hat{r}_\alpha^{(b)}(x)$
end for
// Compute $(1 - \eta)$ confidence bands for $\hat{r}_\alpha(x)$:
$U(x), L(x) \leftarrow \emptyset$
for $\alpha$ in $G$ do
    $U(x) \leftarrow U(x) \cup (1 - \eta)$-quantile of $\{\hat{r}_\alpha^{(b)}(x)\}_{b=1}^B$
    $L(x) \leftarrow L(x) \cup \eta$-quantile of $\{\hat{r}_\alpha^{(b)}(x)\}_{b=1}^B$
end for
return $U(x), L(x)$
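The following sketch (ours, not the paper's reference implementation) mirrors Algorithm 2; a fitting routine such as fit_local_coverage from the earlier snippet is passed in via the assumed fit_fn argument:

```python
import numpy as np

def alp_confidence_bands(X, alpha_grid, x, fit_fn, B=100, eta=0.05, rng=None):
    """Null confidence bands for the local P-P plot at a point x.

    fit_fn(X, pit_values, alpha_grid) -> dict of fitted regressors, e.g. the
    fit_local_coverage sketch above. Under H_0(x), PIT values are Unif(0, 1)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    null_curves = np.empty((B, len(alpha_grid)))
    for b in range(B):
        u = rng.uniform(size=n)                       # null PIT values
        regs = fit_fn(X, u, alpha_grid)               # refit r_hat_alpha^(b)
        null_curves[b] = [regs[a].predict(np.atleast_2d(x))[0] for a in alpha_grid]
    lower = np.quantile(null_curves, eta, axis=0)     # eta-quantile per alpha
    upper = np.quantile(null_curves, 1 - eta, axis=0) # (1 - eta)-quantile per alpha
    return lower, upper
```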
Figure 1: P-P plots are commonly used to assess how well a density model fits actual data. Such plots display, in a clear and interpretable way, effects like bias (left panel) and dispersion (right panel) in an estimated distribution $\hat{f}$ vis-à-vis the true data-generating distribution $f$. Our framework yields a computationally efficient way to construct "amortized local P-P plots" for validating posterior distributions $\hat{f}(\theta \mid x)$ and conditional density models $\hat{f}(y \mid x)$, at any location $x$ of the feature space $\mathcal{X}$, thus providing insight into how such estimates can be improved. See text for details and Sections 4 and 5 for examples.

If the response $Y$ is multivariate, then the random variable $F_{Y|X}(Y \mid X)$ is not uniformly distributed [Genest and Rivest, 2001], so PIT values cannot be trivially generalized to higher dimensions. One way to overcome this is to evaluate the PIT statistic of univariate projections of $Y$, as done by Talts et al. [2018] for Bayesian consistency checks and Mucesh et al. [2020] for the prediction setting. That is, the PIT values can be computed using the estimate $\hat{f}(h(Y) \mid x)$ induced by $\hat{f}(Y \mid x)$ for some chosen $h: \mathbb{R}^p \to \mathbb{R}$. Different projections can be used depending on the context. For instance, in Bayesian applications, posterior distributions are often used to compute credible regions for univariate projections of the parameters $\theta$. Thus, it is natural to evaluate PIT values of $h(\theta) = \theta_i$ for each parameter of interest. Another useful projection is copPIT [Ziegel et al., 2014], which creates a unidimensional projection that carries information about the joint distribution of $Y$. Even though our diagnostic techniques applied to these projections are not enough to consistently assess the fit to $f(Y \mid x)$, they do consistently evaluate the fit to $f(h(Y) \mid x)$, which is often good enough in practice.

An alternative approach to assessing $\hat{f}$ is through highest predictive density values (HPD values; Harrison et al. 2015, Dalmasso et al. 2020), which are defined by

$\mathrm{HPD}(y; x) = \int_{y': \hat{f}(y'|x) \geq \hat{f}(y|x)} \hat{f}(y' \mid x) \, dy'$

(see Figure 2, bottom, for an illustration). $\mathrm{HPD}(y; x)$ is a measure of how plausible $y$ is according to $\hat{f}(y \mid x)$ (in the Bayesian context, this is the complement of the e-value [de Bragança Pereira and Stern, 1999]; small values indicate high plausibility). As with PIT values, HPD values are uniform under the global null hypothesis [Dalmasso et al., 2020]. However, standard goodness-of-fit tests based on HPD values share the same problem as those based on PIT: they are insensitive to covariate transformations (see Theorem 4, Supp. Mat. A). Fortunately, HPD values are uniform under the local consistency hypothesis:

Theorem 3.
For any $x \in \mathcal{X}$, if the local null hypothesis $H_0(x): \hat{f}(\cdot \mid x) = f(\cdot \mid x)$ holds, then the distribution of $\mathrm{HPD}(Y; x)$ given $x$ is uniform over $(0, 1)$.

It follows that the same techniques developed in Sections 3.1 and 3.2 can be used with HPD values to test the global and local consistency hypotheses for multivariate responses, as well as to construct amortized local P-P plots. This statistic is especially appealing if one wishes to construct predictive regions with $\hat{f}$, as HPD values are intrinsically related to highest predictive density sets [Hyndman, 1996]. HPD sets are region estimates of $y$ that contain all $y$'s for which $\hat{f}(y \mid x)$ is larger than a certain threshold (in the Bayesian case, these are the high posterior credible regions). More precisely, if $\mathrm{HPD}_{1-\alpha}(x)$ is the $(1 - \alpha)$-level HPD set for $y$, then

$\mathrm{HPD}(y; x) > \alpha \iff Y \in \mathrm{HPD}_{1-\alpha}(x).$

Thus, by testing local consistency of $\hat{f}$ via HPD values, we are automatically assessing the coverage of HPD sets. It should be noted, however, that even if the HPD values are uniform (conditional on $x$), it may be the case that $\hat{f} \neq f$.
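As with PIT, HPD values can be approximated by Monte Carlo when $\hat{f}$ can be both evaluated and sampled. This is a minimal sketch of ours, exploiting the identity $\mathrm{HPD}(y; x) = P_{Y' \sim \hat{f}(\cdot|x)}(\hat{f}(Y' \mid x) \geq \hat{f}(y \mid x))$; cde_pdf and sample_fn are assumed interfaces, not a fixed API:

```python
import numpy as np

def hpd_value(cde_pdf, sample_fn, y, x, n_draws=5000, rng=None):
    """HPD(y; x): mass of f_hat(.|x) on the set {y': f_hat(y'|x) >= f_hat(y|x)}.

    cde_pdf(y, x) evaluates the density estimate (vectorized in y);
    sample_fn(x, size, rng) draws from f_hat(.|x)."""
    rng = np.random.default_rng(rng)
    draws = sample_fn(x, n_draws, rng)   # Y'_1, ..., Y'_L ~ f_hat(.|x)
    level = cde_pdf(y, x)                # density at the observed y
    return np.mean(cde_pdf(draws, x) >= level)
```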
Figure 2:
Schematic diagram of the construction of PIT (top panel, shaded blue region) and HPD values (bottom panel, shaded green region) for an estimated density $\hat{f}$ evaluated at $(y, x)$. The highlighted intervals in the bottom panel correspond to the highest density regions (HDR) of $y \mid x$.

Our first example shows how our tools are useful for diagnosing omitted but clearly relevant variables in a prediction setting. Inspired by Section 2.2.2 of Shalizi [2013], we generate two covariates according to $X = (X_1, X_2) \sim N(0, \Sigma) \in \mathbb{R}^2$, with $\Sigma_{1,1} = \Sigma_{2,2} = 1$ and $\Sigma_{1,2} = 0.8$,
and take the response to be $Y \mid X \sim N(X_1 + X_2, 1)$. In order to mimic the variable selection procedure common in many high-dimensional inference methods, we fit two conditional density models: $\hat{f}_1$, trained only on $X_1$, and $\hat{f}_2$, trained on $X$. Both models are here fitted using a nearest-neighbor kernel CDE [Dalmasso et al., 2020] with hyperparameters chosen by data-splitting. We use 10000 training and 5000 validation points, and a test sample of 200 points to assess the models.

This is a toy example where omitting one of the two variables might lead to unwanted bias when predicting the outcome $Y$ for new inputs $X$. As an indication of this bias, we have included a heat map (see panel (d) of Figure 4) of the difference in the true (unknown) conditional means, $E[Y \mid x_1] - E[Y \mid x_1, x_2]$, as a function of $x_1$ and $x_2$. (In this example, the omitted variable bias is approximately the same as the difference in the averages of the predictions of $Y$ when using the model $\hat{f}_1$ versus the model $\hat{f}_2$ at any given $x \in \mathcal{X}$; see Figure 4, panels (c) and (d).) Despite the fact that there is a clear relationship between $Y$ and $X_2$, both $\hat{f}_1$ (which omits $X_2$) and $\hat{f}_2$ pass existing goodness-of-fit tests based on PIT (Figure 3).
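A hedged sketch of this data-generating process and of the PIT insensitivity it exposes (Theorem 1); we substitute a simple Gaussian linear-regression CDE for the paper's nearest-neighbor kernel CDE, so the numbers are illustrative only:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 10000
cov = [[1.0, 0.8], [0.8, 1.0]]
X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
Y = X[:, 0] + X[:, 1] + rng.normal(size=n)

# "CDE" f_hat_1 trained only on X_1: model Y | X_1 ~ N(beta * X_1, sigma^2).
beta = np.sum(X[:, 0] * Y) / np.sum(X[:, 0] ** 2)  # least squares, no intercept
sigma = (Y - beta * X[:, 0]).std()

# PIT values of f_hat_1 on fresh test data are ~Unif(0, 1) despite omitting X_2,
# because f_hat_1(y | x) ~= f(y | g(x)) with g(x) = x_1 (Theorem 1).
X_test = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
Y_test = X_test[:, 0] + X_test[:, 1] + rng.normal(size=2000)
pit = norm.cdf(Y_test, loc=beta * X_test[:, 0], scale=sigma)
print(np.histogram(pit, bins=5, range=(0, 1))[0])  # roughly flat bin counts
```

A global PIT histogram therefore looks calibrated, while the local coverage $r_\alpha(x)$ deviates from $\alpha$ away from the line where the omitted variable bias vanishes, which is exactly what the LCT detects below.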
Figure 3: Standard diagnostics for Example 1 showing histograms of PIT values computed on 200 test points (with 95% confidence bands for a Unif[0,1] distribution). Top: Results for $\hat{f}_1$, which has only been fit to the first of two covariates. Bottom: Results for $\hat{f}_2$, which has been fit to both covariates. The top panel shows that standard PIT diagnostics cannot tell that $\hat{f}_1$ is a poor approximation to $f$. GCT, on the other hand, detects that $\hat{f}_1$ is misspecified (p = 0.004), while not rejecting the global null for $\hat{f}_2$ (p = 0.894).
This result can be explained by Theorem 1: because PIT is insensitive to covariate transformations and $\hat{f}_1(y \mid x) \approx f(y \mid x_1)$, PIT values are uniformly distributed, even though $\hat{f}_1$ omits a key variable. The GCT, however, detects that $\hat{f}_1$ is misspecified (p = 0.004), while not rejecting the global null for $\hat{f}_2$ (p = 0.894). Panel (a) of Figure 4 shows p-values from LCTs for $\hat{f}_1$ across the entire feature space of $X$. The patterns in these p-values are largely explained by panel (d), which shows the difference between the conditional means of $Y$ given $x_1$ and given $x_1, x_2$. The detected level of discrepancy between the estimate $\hat{f}_1$ and the true underlying conditional density $f$ at a point $x$ directly relates to the omitted variable bias $E[Y \mid x_1] - E[Y \mid x_1, x_2] = 0.8 x_1 - x_2$: the LCT p-values close to the line $x_2 = 0.8 x_1$ are large (indicating no statistically significant deviations from the true model), and p-values get smaller as we move away from this line.

As an illustration of how one can depict and interpret distributional deviations at specific points of interest $x$, panel (b) of Figure 4 zooms in on a few different locations, with local P-P plots that show how $\hat{f}_1$ could potentially be improved. At the blue point, $\hat{f}_1$ underestimates the true density: we reject the local null (Equation 4) at level $\alpha = 0.05$,
and the P-P plot indicates negative bias. Conversely, at the red point, $\hat{f}_1$ overestimates the true density; we reject the local null, and the P-P plot indicates positive bias. At the purple point, $\hat{f}_1$ is close to $f$, so the local null hypothesis is not rejected.
Figure 4: New diagnostics for Example 1. (a) P-values for LCTs for $\hat{f}_1$ indicate a poor fit across most of the feature space. (b) Amortized local P-P plots at selected points show the density $\hat{f}_1$ as negatively biased (blue), well estimated at significance level $\alpha = 0.05$ with barely perceived overdispersion (purple), and positively biased (red). (Gray regions represent 95% confidence bands under the null.) (c) $\hat{f}_1$ and $\hat{f}_2$ vs. the true (unknown) conditional density $f$ at the selected points. $\hat{f}_1$ is clearly negatively and positively biased at the blue and red points, respectively, while the model does not reject the local null at the purple point. $\hat{f}_2$ fits well at all three points. (d) The difference on average in the predictions of $Y$ from $\hat{f}_1(\cdot \mid x)$ vs. the true distribution $f(\cdot \mid x)$ for fixed $x$ indeed corresponds to the "omitted variable bias" $E[Y \mid x_1] - E[Y \mid x_1, x_2]$. (Note:
Panels (c) and (d) require knowledge of the true $f$, which would not be available to the practitioner.)

Finally, the right panel of Figure 6 (in Supp. Mat. B) shows p-values from LCTs for the model $\hat{f}_2$, which was fit to both variables. Here $\hat{f}_2$ passes all tests, and local P-P plots at various points indicate a good fit.

This toy example is a simple illustration of the general phenomenon of potentially unwanted omitted variable bias, which can be difficult to detect without testing for local and global consistency of models. Our proposed diagnostics identify this issue and provide insight into how the omitted variable distorts the fitted model relative to the true conditional density, across the entire feature space.

Our next example illustrates how our diagnostics can be used for simulator-based inference to test, for all possible values of image data $x \in \mathbb{R}^{400}$, whether a Bayesian inference algorithm or approximation $\hat{f}(\theta \mid x)$ of the posterior fits the true posterior given by the model.

In this example, $x$ represents an image of an elliptical galaxy as observed by a telescope, and the parameter of interest is the galaxy's rotation angle $\theta$ with respect to the x-axis. Our model for $x$ is implicitly given by GalSim, an open-source toolkit for simulating realistic images of astronomical objects that includes observational effects such as pixelization and blurring due to the atmosphere and telescope [Rowe et al., 2015]. Besides $\theta$, our model also contains another parameter, the axis ratio of the galaxy, $\lambda$, here defined as the ratio between the minor and major axes of the projection of the elliptical galaxy.

For illustration, we create a mixture of two populations of galaxies, one large population of round galaxies and one small population of elongated galaxies, drawing $\lambda$ and $\theta$ from a prior distribution that places probability 0.9 on the round population, with $\theta \sim \mathrm{Unif}(-\pi, \pi)$. Then we sample 20 ×
20 galaxy images $X$ according to the data model

$X \mid \lambda, \theta \sim \mathrm{GalSim}(a, \lambda),$

where, depending on the population, the observed angle $a$ is drawn either from a normal distribution centered at $\theta$ or from an even mixture of two Laplace distributions centered at $\theta$.

As an example of simulator-based inference, we fit a convolutional mixture density network (ConvMDN; D'Isanto and Polsterer [2018]) with 2 convolutional and 2 fully connected layers with ReLU activations [Glorot et al., 2011] to 7000 training images and 3000 validation images. (For training, we use the Adam optimizer [Kingma and Ba, 2014] with
$\beta_1 = 0.9$.) The ConvMDN models the posterior $f(\theta \mid x)$ as a mixture with $K$ components. According to the KL divergence loss computed on a separate test sample with 1000 images, the best fit of $f(\theta \mid x)$ is achieved by a ConvMDN model with $K = 7$ mixture components (see Table 1 in Supp. Mat. C). The GCT nevertheless rejects global consistency for this model, and the LCT p-values in Figure 5 localize the misfit to the smaller population of elongated galaxies.

Figure 5: New diagnostics for the simulation-based inference algorithm in Example 2. For visualization, we show the location of the test galaxy points in $\mathbb{R}^2$ using multidimensional scaling (see center panel "MDS with LCT p-values"). P-values for LCTs indicate that the ConvMDN generally fits well for the dominant 90% population of round galaxies but poorly for the smaller population of elongated galaxies.
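To make the simulation setup concrete, here is a schematic sketch of ours (not the paper's code): render_galaxy is a toy stand-in for the GalSim rendering step, and the axis ratios and noise scales below are illustrative placeholders, since only the 0.9 prior weight and the distribution families are stated above.

```python
import numpy as np

rng = np.random.default_rng(1)

def render_galaxy(angle, axis_ratio, size=20):
    """Toy stand-in for GalSim: a size x size elliptical Gaussian blob with
    the given rotation angle and minor-to-major axis ratio."""
    g = (np.arange(size) - size / 2 + 0.5) / (size / 4)
    xx, yy = np.meshgrid(g, g)
    u = np.cos(angle) * xx + np.sin(angle) * yy    # major-axis coordinate
    v = -np.sin(angle) * xx + np.cos(angle) * yy   # minor-axis coordinate
    return np.exp(-0.5 * (u ** 2 + (v / axis_ratio) ** 2))

def sample_prior_and_image():
    """Draw (lambda, theta) from the two-population prior, then an image."""
    round_pop = rng.uniform() < 0.9                # P(round population) = 0.9
    lam = 0.9 if round_pop else 0.3                # placeholder axis ratios
    theta = rng.uniform(-np.pi, np.pi)
    if round_pop:
        a = rng.normal(theta, 0.1)                 # Gaussian angle noise
    else:
        scale = rng.choice([0.1, 0.5])             # even Laplace mixture
        a = rng.laplace(theta, scale)
    return lam, theta, render_galaxy(a, lam)

lam, theta, image = sample_prior_and_image()
print(image.shape)  # (20, 20), i.e. x lives in R^400
```

Repeating this draw yields the validation set $(\theta_i, X_i)$ on which the PIT values of the fitted posterior, and hence the GCT/LCT/ALP diagnostics, are computed.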
Conclusion. Conditional density models are widely used for uncertainty quantification in prediction and Bayesian inference. In this work, we offer practical procedures (GCT, LCT, ALP) for identifying, localizing, and interpreting modes of failure for an approximation of the true conditional density. Our tools can be used in conjunction with loss functions, which are useful for performing model selection, but not good at evaluating whether a practitioner should keep looking for better models, or at providing information as to how a model could be improved. Finally, because LCT localizes hard-to-train areas of the feature space, our framework can provide guidance for active learning schemes.

References
Rafael Izbicki, Ann B. Lee, and Taylor Pospisil. ABC–CDE: Toward approximate Bayesian computation with complex high-dimensional data and limited simulations. Journal of Computational and Graphical Statistics, pages 1–20, 2019. doi: 10.1080/10618600.2018.1546594.

R. Bordoloi, S. J. Lilly, and A. Amara. Photo-z performance for precision cosmology. Monthly Notices of the Royal Astronomical Society, 406(2):881–895, 2010. URL https://doi.org/10.1111/j.1365-2966.2010.16765.x.

Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.

Yanzhi Chen and Michael U. Gutmann. Adaptive Gaussian copula ABC. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 1584–1592. PMLR, 2019. URL http://proceedings.mlr.press/v89/chen19d.html.

Samantha R. Cook, Andrew Gelman, and Donald B. Rubin. Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics, 15(3):675–692, 2006.

Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062, 2020.

N. Dalmasso, T. Pospisil, A. B. Lee, R. Izbicki, P. E. Freeman, and A. I. Malz. Conditional density estimation tools in Python and R with applications to photometric redshifts and likelihood-free cosmological inference. Astronomy and Computing, 30:100362, 2020. doi: 10.1016/j.ascom.2019.100362.

Carlos Alberto de Bragança Pereira and Julio Michael Stern. Evidence and credibility: Full Bayesian significance test for precise hypotheses. Entropy, 1(4):99–110, 1999.

Antonio D'Isanto and Kai Lars Polsterer. Photometric redshift estimation via deep learning. Generalized and pre-classification-less, image based, fully probabilistic redshifts. Astronomy & Astrophysics, 609:A111, 2018.

Vincent Dutordoir, Hugh Salimbeni, Marc Peter Deisenroth, and James Hensman. Gaussian process conditional density estimation. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.

Peter Freeman, Rafael Izbicki, and Ann B. Lee. A unified framework for constructing, tuning and assessing photometric redshift density estimates in a selection bias setting. Monthly Notices of the Royal Astronomical Society, 468(4):4556–4565, 2017. doi: 10.1093/mnras/stx764.

F. F. Gan and K. J. Koehler. Goodness-of-fit tests based on P-P probability plots. Technometrics, 32(3):289–303, 1990. doi: 10.1080/00401706.1990.10484682.

Christian Genest and Louis-Paul Rivest. On the multivariate probability integral transformation. Statistics & Probability Letters, 53(4):391–399, 2001.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323. JMLR Workshop and Conference Proceedings, 2011. URL http://proceedings.mlr.press/v15/glorot11a.html.

David Greenberg, Marcel Nonnenmacher, and Jakob Macke. Automatic posterior transformation for likelihood-free inference. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2404–2414. PMLR, 2019. URL http://proceedings.mlr.press/v97/greenberg19a.html.

Diana Harrison, David Sutton, Pedro Carvalho, and Michael Hobson. Validation of Bayesian posterior distributions using a multidimensional Kolmogorov–Smirnov test. Monthly Notices of the Royal Astronomical Society, 451(3):2610–2624, 2015. doi: 10.1093/mnras/stv1110.

Rob J. Hyndman. Computing and graphing highest density regions. The American Statistician, 50(2):120–126, 1996.

Rafael Izbicki, Ann B. Lee, and Peter E. Freeman. Photo-z estimation: An example of nonparametric conditional density estimation under selection bias. Annals of Applied Statistics, 11(2):698–724, 2017a.

Rafael Izbicki and Ann B. Lee. Converting high-dimensional regression to high-dimensional conditional density estimation. Electronic Journal of Statistics, 11(2):2800–2831, 2017b.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Jan-Matthis Lueckmann, Pedro J. Gonçalves, Giacomo Bassetto, Kaan Öcal, Marcel Nonnenmacher, and Jakob H. Macke. Flexible statistical inference for mechanistic models of neural dynamics. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 1289–1299. Curran Associates Inc., 2017.

Jean-Michel Marin, Louis Raynal, Pierre Pudlo, Mathieu Ribatet, and Christian Robert. ABC random forests for Bayesian parameter inference. Bioinformatics (Oxford, England), 35, 2016. doi: 10.1093/bioinformatics/bty867.

S. Mucesh, W. G. Hartley, A. Palmese, O. Lahav, L. Whiteway, A. Amon, K. Bechtol, G. M. Bernstein, A. Carnero Rosell, M. Carrasco Kind, et al. A machine learning approach to galaxy properties: Joint redshift-stellar mass probability distributions with random forest. arXiv preprint arXiv:2012.05928, 2020.

George Papamakarios and Iain Murray. Fast ε-free inference of simulation models with Bayesian conditional density estimation. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/6aca97005c68f1206823815f66102863-Paper.pdf.

George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. arXiv preprint arXiv:1705.07057, 2017.

George Papamakarios, David Sterratt, and Iain Murray. Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. In Proceedings of Machine Learning Research, pages 837–848. PMLR, 2019.

Jonas Rothfuss, Fabio Ferreira, Simon Walther, and Maxim Ulrich. Conditional density estimation with neural networks: Best practices and benchmarks, 2019.

B. T. P. Rowe, Mike Jarvis, Rachel Mandelbaum, Gary M. Bernstein, James Bosch, Melanie Simet, Joshua E. Meyers, Tomasz Kacprzak, Reiko Nakajima, Joe Zuntz, et al. GalSim: The modular galaxy image simulation toolkit. Astronomy and Computing, 10:121–150, 2015.

Cosma Shalizi. Advanced Data Analysis from an Elementary Point of View. 2013.

Motoki Shiga, Voot Tangkaratt, and Masashi Sugiyama. Direct conditional probability density estimation with sparse feature selection. Machine Learning, 100(2):161–182, 2015. doi: 10.1007/s10994-014-5472-x.

Scott A. Sisson, Yanan Fan, and Mark Beaumont. Handbook of Approximate Bayesian Computation. Chapman and Hall/CRC, 2018.

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/8d55a249e6baa5c06772297520da2051-Paper.pdf.

Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. Validating Bayesian inference algorithms with simulation-based calibration. arXiv preprint arXiv:1804.06788, 2018.

Masayuki Tanaka, Jean Coupon, Bau-Ching Hsieh, Sogo Mineo, Atsushi J. Nishizawa, Joshua Speagle, Hisanori Furusawa, Satoshi Miyazaki, and Hitoshi Murayama. Photometric redshifts for Hyper Suprime-Cam Subaru Strategic Program Data Release 1. Publications of the Astronomical Society of Japan, 70(SP1), 2018. URL https://doi.org/10.1093/pasj/psx077.

Benigno Uria, Iain Murray, and Hugo Larochelle. A deep and tractable density estimator. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research. JMLR, 2014. URL http://proceedings.mlr.press/v32/uria14.html.

Johanna F. Ziegel and Tilmann Gneiting. Copula calibration. Electronic Journal of Statistics, 8(2):2619–2638, 2014.
SUPPLEMENTARY MATERIALS
A: PROOFS
Proof of Theorem 1.
Let $z = g(x)$, and let $\hat{F}$ and $F$ denote the CDFs corresponding to $\hat{f}$ and $f$. Notice that Equation 3 implies that $\hat{F}(y \mid x) = F(y \mid g(x)) = F(y \mid z)$. Thus, if $(X, Y) \sim F_{X,Y}$, then, for every $0 \leq a \leq 1$,

$P(\mathrm{PIT}(Y; X) \leq a) = P(\hat{F}(Y \mid X) \leq a)$
$= \int_{\mathcal{Z}} P(\hat{F}(Y \mid X) \leq a \mid z) f(z) \, dz = \int_{\mathcal{Z}} P(F(Y \mid Z) \leq a \mid z) f(z) \, dz$
$= \int_{\mathcal{Z}} P(F(Y \mid z) \leq a \mid z) f(z) \, dz = \int_{\mathcal{Z}} P(Y \leq F^{-1}(a \mid z) \mid z) f(z) \, dz$
$= \int_{\mathcal{Z}} F(F^{-1}(a \mid z) \mid z) f(z) \, dz = \int_{\mathcal{Z}} a \, f(z) \, dz = a.$

Proof of Theorem 2.
Assume that $\hat{f}(y \mid x) = f(y \mid x)$. It follows that, for any $0 < \alpha < 1$,

$P(\mathrm{PIT}(Y; X) \leq \alpha \mid x) = P(F_{Y|x}(Y) \leq \alpha \mid x) = P(Y \leq F_{Y|x}^{-1}(\alpha) \mid x) = F_{Y|x}(F_{Y|x}^{-1}(\alpha)) = \alpha,$

which shows that the distribution of $\mathrm{PIT}(Y; X)$, conditional on $x$, is uniform. Now, assume that $P(\mathrm{PIT}(Y; X) < \alpha \mid x) = \alpha$ for every $0 < \alpha < 1$, where the PIT is computed from $\hat{F}_{Y|x}(y) = \int_{-\infty}^{y} \hat{f}(y' \mid x) \, dy'$. Then

$\alpha = P(\mathrm{PIT}(Y; X) < \alpha \mid x) = P(\hat{F}_{Y|x}(Y) \leq \alpha \mid x) = P(Y \leq \hat{F}_{Y|x}^{-1}(\alpha) \mid x) = F_{Y|x}(\hat{F}_{Y|x}^{-1}(\alpha)).$

It follows that $F_{Y|x}(\hat{F}_{Y|x}^{-1}(\alpha)) = \alpha$, and thus $\hat{F}_{Y|x}^{-1}(\alpha) = F_{Y|x}^{-1}(\alpha)$ for all $\alpha \in (0, 1)$. The conclusion follows from the fact that the CDF characterizes the distribution of a random variable.
Proof of Corollary 1.
Notice that $r_\alpha(x) = E[Z_\alpha \mid x] = P(\mathrm{PIT}(Y; X) < \alpha \mid x)$. It follows that $r_\alpha(x) = \alpha$ for every $\alpha \in (0, 1)$ if, and only if, the distribution of $\mathrm{PIT}(Y; X)$, conditional on $X = x$, is uniform over $(0, 1)$. The conclusion follows from Theorem 2.

Theorem 4 (HPD values are insensitive to covariate transformations). Let $(X, Y) \sim F_{X,Y}$. If there exists a function $g: \mathcal{X} \to \mathcal{Z}$ such that $\hat{f}(y \mid x) = f(y \mid g(x))$, then $\mathrm{HPD}(Y; X) \sim \mathrm{Unif}(0, 1)$.

Proof of Theorem 4. Under the assumption, we can rewrite the HPD value as

$\mathrm{HPD}(y; x) = \int_{y': f(y'|g(x)) > f(y|g(x))} f(y' \mid g(x)) \, dy' = \int_{y': f(y'|z) > f(y|z)} f(y' \mid z) \, dy' = \mathrm{HPD}(y; z),$

with $g(x) = z$. Following the proof structure of Harrison et al. [2015] closely, we define the random variable $\xi_{z,y} = \mathrm{HPD}(y; z)$, equipped with the probability density function $h: \mathcal{Z} \times \mathcal{Y} \to \mathbb{R}$. Dropping the subscripts for simplicity, let $\xi^* = \mathrm{HPD}(y^*; z^*)$ be the HPD value of a specific pair $(z^*, y^*)$; $\xi^*$ is the probability mass of $f$ above the level set $f(y^* \mid z^* = g(x^*))$. If we show that $h(\xi^*) = 1$ for all $\xi^*$, then $\xi_{z,y}$ is uniformly distributed over $[0, 1]$. Using the fundamental theorem of calculus, we can write

$h(\xi^*) = \frac{\partial}{\partial \xi^*} \int_{-\infty}^{\xi^*} h(\varepsilon) \, d\varepsilon = \frac{\partial}{\partial \xi^*} \int_{-\infty}^{\xi^*} \int_{\mathcal{Z} \times \mathcal{Y}} \delta(\xi_{z,y} - \varepsilon) \, dF(z, y) \, d\varepsilon$
$= \frac{\partial}{\partial \xi^*} \int_{\mathcal{Z} \times \mathcal{Y}} \Phi(\xi^* - \xi_{z,y}) \, dF(z, y) = \frac{\partial}{\partial \xi^*} \int_{\mathcal{Z}} \left[ \int_{\mathcal{Y}} \Phi(\xi^* - \xi_{z,y}) f(y \mid z) \, dy \right] f(z) \, dz$
$= \frac{\partial}{\partial \xi^*} \int_{\mathcal{Z}} \xi^* f(z) \, dz = \frac{\partial}{\partial \xi^*} \xi^* = 1,$

where $\Phi$ is the Heaviside function, which is 1 when the argument is positive and 0 otherwise.

Proof of Theorem 3.
Under the null hypothesis $H_0(x)$, for any $x \in \mathcal{X}$ we have that

$\mathrm{HPD}(y; x) = \int_{y': \hat{f}(y'|x) \geq \hat{f}(y|x)} \hat{f}(y' \mid x) \, dy \quad (6)$
$= \int_{y': f(y'|x) \geq f(y|x)} f(y' \mid x) \, dy. \quad (7)$

Applying the results about uniformity of HPD values for $f(\cdot \mid x)$ from Harrison et al. [2015, Section A.2] (also reproduced in the proof of Theorem 4) proves the theorem.

K        2       3       4       5       6       7       8       9       10
KL loss  -0.729  -0.885  -0.915  -0.906  -0.897  -0.917  -0.906  -0.911  -0.905

Table 1:
The KL divergence loss indicates that the number of mixture components in the ConvMDN approximation of the posterior in Example 2 should be K = 7.

B: EXAMPLE 1: OMITTED VARIABLE BIAS IN CDE MODELS
Figure 6, right panel, shows the p-values from LCTs across the feature space for the model $\hat{f}_2$. Unlike model $\hat{f}_1$, which was fit on $X_1$ solely, $\hat{f}_2$ was fit on both $X_1$ and $X_2$. Hence, $\hat{f}_2$ is able to pass all tests, with local P-P plots indicating a good fit (two examples are shown in the left panel of Figure 6).

Figure 6: P-values for LCTs for $\hat{f}_2$ in Example 1 suggest an adequate fit everywhere in the feature space; local coverage plots at selected points also suggest a good fit.

C: EXAMPLE 2: POSTERIOR INFERENCE FOR SYNTHETIC GALAXY IMAGES
Table 1 reports the KL divergence loss over a test set of 1000 galaxy images for a ConvMDN model with $K$ components, for $K = 2, \ldots, 10$. The model which minimizes the KL loss is a ConvMDN with $K = 7$ components.