Informative Goodness-of-Fit for Multivariate Distributions
Sara Algeri
School of Statistics, University of Minnesota, 0461 Church St SE, Minneapolis, MN 55455, USA. Email: [email protected]
Abstract
This article introduces an informative goodness-of-fit (iGOF) approach to study multivariate distributions. Differently from standard goodness-of-fit tests, when the null model is rejected, iGOF allows us to identify the underlying sources of mismodelling and naturally equips practitioners with additional insights on the underlying data distribution. The informative character of the procedure proposed is achieved by introducing the joint comparison density. As a result, the methods presented here naturally extend the seminal work of Parzen (1979) on univariate comparison distributions to the multivariate setting. Simulation studies show that iGOF enjoys high power for different types of alternatives.

Keywords: Multivariate goodness-of-fit, informative inference, tests of independence, joint comparison density.
The problem.
In any experimental science, the knowledge available on a given phenomenon is typically formalized into a statistical model. The latter encapsulates our understanding of its nature, its properties, as well as our uncertainties. Experimental measurements are then collected and statistical tests of hypothesis are used to answer the important question: is our model valid?
When converting this scientific question into a statistical one, the validity check translates into the assessment of the properties characterizing the underlying data distribution. As a result, a variety of tests for goodness-of-fit (GOF) have been proposed in the literature to study multivariate distributions. For instance, let the measurements collected by the experiment be realizations of a random vector X = (X_1, ..., X_p). Hotelling's T^2 test (e.g., Mardia, 1975) allows us to test the location parameter characterizing the distribution of X. In scientific terms, this translates into assessing if what we expect our measurements to be, on average, is reasonable or not. Rank tests (e.g., Taskinen et al., 2005) can be used to assess independence, i.e., they help us understand which aspects of the phenomenon are connected. Finally, classical univariate tests such as Kolmogorov-Smirnov (Kolmogorov, 1933; Smirnov, 1939) can be extended to the multivariate setting (e.g., Khmaladze, 2016) to validate arbitrary distributional assumptions, that is, to assess if our overall understanding of the phenomenon of interest is correct.

Despite their usefulness, these and similar GOF methods are somewhat limited by their confirmatory nature. Specifically, when the respective null hypothesis is rejected, they do not allow us to identify the underlying causes which invalidate the model postulated by the scientists, nor do they give any indication on how the latter can be improved to obtain a closer representation of the true data distribution. In simple words, they do not provide any insights on what went wrong. As a result, this aspect is typically addressed by conducting a separate exploratory analysis followed by a (confirmatory) validity check for a newly postulated model.

Statistical goals.
This article aims to bridge confirmatory and exploratory data analysis by introducing an informative approach to goodness-of-fit (iGOF). Specifically, iGOF targets three important questions arising in the statistical analysis of multivariate data:

Q1.
Is the distribution of X correctly specified and, if not, what are the sources of mismodelling? For instance, are we misspecifying only some of the components of X or is the problem affecting the entire random vector?

Q2. Are all the p components of X independent and, if not, can we identify independent substructures? As discussed in Section 5, the ability to address this question will not only allow us to characterize the dependence structure of X, but it will also provide technical and computational advantages while investigating Q1.

Q3. How can we improve our postulated model?
Whatever the source of mismodelling is, either the dependence structure, the marginals or both, how can we provide a data-driven correction of our hypothesized model?

Finally, for the sake of generalizability, the underlying requirement of Q1-Q3 is to allow the components of X to be either continuous or discrete.

Scientific motivations.
While solutions for Q1-Q3 would enjoy a broad spectrum of applications across several scientific fields, the motivation of this work is rooted in the physical sciences. Interestingly, when converted into statistical terms, several issues arising in physics and astronomy naturally translate into Q1-Q3. For instance, a common difficulty arising in the searches of new astrophysical phenomena is the impossibility of correctly specifying the background distribution. In physics and astronomy, the "background" refers to all the astronomical objects or particles which are not those we aim to discover. Unfortunately, since many sources contribute to the background, its distribution is particularly difficult to model (e.g., Priel et al., 2017; Dauncey et al., 2015; Algeri et al., 2018). Investigating the validity of the hypothesized background models postulated by the scientists is a crucial step in many searches in physics and astronomy (Q1). Furthermore, when the hypothesized model is rejected, the possibility of identifying the source of mismodelling and providing adequate data-driven corrections (Q1, Q3) offers a notable advantage over the most popular methods proposed in the literature to address this issue (e.g., Yellin, 2002). When searching for new particles via colliders, another issue arising is the dependence among the kinematic variables involved. Unfortunately, their correlation is not always understood and therefore ignored (e.g., Balázs et al., 2017). Hence the need to assess the risk of ignoring the dependence structure (Q2) and, when not negligible, to provide guidelines on how to model it (Q3). In this context, the measurements of kinematic variables (e.g., energy emission, number of jets in an event, number of leptons, etc.) can have either continuous or discrete nature. Hence the need of providing a solution that could easily handle mixed data distributions.
The key elements of the solution.
The comparison distribution has been popularized by the seminal work of Parzen (1979) as a valuable tool to perform goodness-of-fit. Specifically, let g_X, G_X, Q_{G_X} be, respectively, the hypothesized probability function, cumulative distribution function (cdf) and quantile function of a random variable X, either continuous or discrete, and denote with f_X and F_X the true probability function and cdf. The comparison density between F_X and G_X (Parzen, 1979, 1983) can then be specified as

    d(u; G_X, F_X) = f_X(Q_{G_X}(u)) / g_X(Q_{G_X}(u))   with u = G_X(x),     (1)

and we assume f_X = 0 whenever g_X = 0. The comparison distribution is defined as D(u) = integral_0^u d(s; G_X, F_X) ds. From (1) it is easy to see that the comparison density transparently models the departure of the hypothesized model from the true distribution. Its usefulness, however, is limited to the study of univariate distributions (p = 1). In this work, we overcome this limitation by introducing the joint comparison density. As it will become clear in the sections to follow, the latter represents the key ingredient in addressing Q1-Q3.

Main results and organization.
The core of the theoretical framework is presented in Section 2. Here, we define the joint comparison density and we propose its representation in terms of LP score functions (e.g., Mukhopadhyay and Parzen, 2014; Mukhopadhyay and Wang, 2020). As shown in Sections 3 and 4, such representation substantially simplifies the subsequent stages of estimation, model selection and (post-selection) inference. In Section 5, we propose the dependence learning graph, a valuable graphical tool to visualize the dependence structure of X. As an important byproduct, the dependence learning graph also provides technical and computational advantages in the estimation of the joint comparison density. In Section 6, we discuss an ANOVA-like testing strategy, which we call iGOF-diagnostic analysis; the latter is what allows us to assess the validity of the hypothesized model and, when rejected, identify the sources of mismodelling. Moreover, in support of the usefulness of iGOF, power studies are conducted via simulations in both Sections 4 and 6. The main body of the article focuses on a simple null hypothesis, that is, the postulated model is assumed to be fully specified. Extensions to situations where the null model depends on unknown parameters are discussed in Section 7. In the LP acronym, the letter L typically denotes nonparametric methods based on quantiles, whereas P stands for polynomials (Mukhopadhyay and Wang, 2020, Supp S1).
As noted above, this work finds its main motivations in the context of astrophysical searches. Therefore, in Section 8 we illustrate how iGOF can be used to address the problem of mismodelling of the cosmic background considering a realistic simulation from the Fermi Large Area Telescope (Atwood et al., 2009). Section 9 collects a summary of the results and a discussion of the limitations of iGOF. Technical proofs are collected in the Supplementary Material.
The univariate comparison density in (1) is also known in the literature as connector density (e.g., Mukhopadhyay, 2017) by virtue of its ability to connect the true and the postulated model by means of the simple yet elegant decomposition

    f_X(x) = g_X(x) d(G_X(x); G_X, F_X).     (2)

From (2), it is easy to see that if G_X = F_X, then d(u; G_X, F_X) = 1 for all u in [0, 1]. Conversely, if G_X differs from F_X, d(u; G_X, F_X) allows us to identify the regions of the support of X where f_X deviates substantially from g_X. To exploit these useful properties in the multivariate setting, it is necessary to introduce a suitable definition of comparison density which extends to random vectors.

Definition 2.1 (The joint comparison density). Let f_X and F_X be, respectively, the probability function and the cdf of X in a support contained in R^p. Let g_X and G_X be the hypothesized probability function and cdf of X, and such that, for every x = (x_1, ..., x_p) in the support,

    g_X(x) = prod_{d=1}^p g_d(x_d | x_1, ..., x_{d-1}).

The attentive reader may have noticed that, in principle, one could choose each G_d equal to G_{X_d}, i.e., assuming independence among the components of X. In this setting, (4) coincides with the copula density (e.g., Nelsen, 2007) of X, whenever G_{X_d} = F_{X_d} for all d = 1, ..., p (see Section 5.1). The multivariate analogue of (2) follows directly from (4). While estimating (4) is an important goal of iGOF (see Q3 in Section 1), we also aim to provide a sufficiently detailed representation of the substructures characterizing the distribution of X (see Q1 and Q2 in Section 1). To satisfy both these requirements, it is convenient to express the joint comparison density by means of a suitable basis of orthonormal functions.
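To fix ideas on (1)-(2) in the univariate, continuous case, the comparison density can be evaluated numerically once g_X, G_X and f_X are available. The following Python sketch is purely illustrative; the Gaussian choices for G and F are hypothetical and not part of the paper's examples:

```python
import numpy as np
from scipy import stats

def comparison_density(u, g_dist, f_dist):
    """d(u; G, F) = f(Q_G(u)) / g(Q_G(u)) for u in (0, 1)."""
    x = g_dist.ppf(u)                 # quantile function of the hypothesized model
    return f_dist.pdf(x) / g_dist.pdf(x)

g = stats.norm(0.0, 1.0)    # hypothesized model G (assumed for illustration)
f = stats.norm(0.5, 1.0)    # "true" model F (assumed for illustration)

u = np.linspace(0.01, 0.99, 99)
d = comparison_density(u, g, f)        # > 1 where G understates F, < 1 elsewhere
d_null = comparison_density(u, g, g)   # identically 1 when G == F
```

When G coincides with F the curve is flat at one; otherwise its shape localizes where the hypothesized model over- or under-states the truth, which is precisely the informative character exploited by iGOF.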
For the sake of generalizability, a valuable requirement here is the choice of a basis which easily generalizes to both discrete and continuous distributions. To serve this purpose, we construct a tensor product basis of LP score functions (e.g., Mukhopadhyay and Parzen, 2014; Mukhopadhyay and Wang, 2020). When p = 1, a complete orthonormal basis of LP score functions in L^2(G_X) can be specified by letting the first component be T_0(G_X(x)) = 1. Subsequent components {T_j(G_X(x))}_{j>0} are obtained by Gram-Schmidt orthonormalization of powers of

    T_1(G_X(x)) = [G_X^mid(x) - 0.5] / sqrt([1 - <p_{G_X}, p^2_{G_X}>]/12),     (7)

where p_{G_X}(x) = P(X = x) under the assumption that X ~ G_X, and G_X^mid(x) = G_X(x) - 0.5 p_{G_X}(x) is the so-called mid-distribution function, with mean 0.5 and variance [1 - <p_{G_X}, p^2_{G_X}>]/12 (Parzen, 2004). When X is continuous, G^mid(x) = G(x) and <p_{G_X}, p^2_{G_X}> = 0; consequently, the LP score functions reduce to normalized shifted Legendre polynomials. Whereas, when X is discrete, <p_{G_X}, p^2_{G_X}> = sum over the distinct points x in the support of p^3_{G_X}(x).

For each G_d in Definition 2.1, we can then specify a basis of LP score functions, namely {T_{j_d}(G_d(x_d | x_1, ..., x_{d-1}))}. By construction (see (8), (12)), the LP score functions are such that S_{j_d}(u) = 1 whenever j_d = 0, for all u in [0, 1] and d = 1, ..., p. Therefore, S_{0...0}(u) = 1 for all u in [0, 1]^p and theta_{0...0} = 1.

Let x_1, ..., x_n be a sample of n i.i.d. observations from the random vector X and let U = G(X) be the respective Rosenblatt transformation (see Definition 2.1). Denote with u_1, ..., u_n the transformed sample with elements u_i = G(x_i) = (G_1(x_{1i}), ...
, G_p(x_{pi} | x_{1i}, ..., x_{(p-1)i})), and denote with theta_hat the vector of estimated LP coefficients, with elements theta_hat_{j_1...j_p} = (1/n) sum_{i=1}^n S_{j_1...j_p}(u_i).

Proposition 3.1. Let d(u; G_X, F_X) be the joint comparison density of the random vector U; then

    E[theta_hat] = theta   and   Cov(theta_hat) = Sigma,     (16)

where Sigma has diagonal elements sigma^2_{j_1...j_p}/n = (1/n) V[S_{j_1...j_p}(U)] and non-diagonal elements sigma_{j_1...j_p, h_1...h_p}/n = (1/n) Cov[S_{j_1...j_p}(U), S_{h_1...h_p}(U)]. Furthermore, if F_X = G_X, the equalities in (16) reduce to

    E[theta_hat] = 0   and   Cov(theta_hat) = (1/n) I_M,     (17)

where 0 is an M x 1 zero vector and I_M is an M x M identity matrix.

From Proposition 3.1, it follows that an estimate of (4) can be constructed as

    d_hat(u; G_X, F_X) = 1 + theta_hat' S(u),     (18)

and has variance

    V[d_hat(u; G_X, F_X)] = S(u)' Sigma S(u).     (19)

Combining (2), (4) and (18), an estimate of f_X is

    f_hat_X(x) = g_X(x) d_hat(G(x); G_X, F_X)     (20)
               = g_X(x) [1 + theta_hat' T(G(x))].     (21)

Interestingly, (20) implies that the comparison density allows one to obtain an estimate of f_X which incorporates the information carried by the hypothesized model g_X. Furthermore, from Proposition 3.2, it follows that the closer g_X is to f_X in terms of squared normalized distance, the lower the bias of d_hat(u; G_X, F_X).

Proposition 3.2. The integrated squared bias of (18) is

    integral over [0,1]^p of ( [f_X(Q_{G_X}(u)) - g_X(Q_{G_X}(u))] / f_X(Q_{G_X}(u)) )^2 du  -  theta' theta.     (22)

The estimate in (21) is essentially that of a smooth model (e.g., Rayner and Best, 1990), that is, a smoothed version of the true underlying probability function. Similarly to the smooth model proposed by Barton (1956) in the univariate setting, the estimator in (21) may lead to estimates that are not bona fide, i.e., they may be negative and/or they may not integrate/sum up to one.
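In the continuous case, the ingredients of (18) are easy to sketch: the LP scores reduce to normalized shifted Legendre polynomials, and theta_hat collects sample means of the tensor scores evaluated at the transformed data. A minimal Python illustration for p = 2 under the null (the helper names are ours, not the paper's software):

```python
import numpy as np
from numpy.polynomial import legendre

def lp_score(j, u):
    """Continuous-case LP score: normalized shifted Legendre polynomial on [0, 1]."""
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * u - 1, c)

def theta_hat(U, m):
    """Sample LP coefficients: th[j1, j2] = mean_i S_j1(u_1i) * S_j2(u_2i)."""
    th = np.array([[np.mean(lp_score(j1, U[:, 0]) * lp_score(j2, U[:, 1]))
                    for j2 in range(m + 1)] for j1 in range(m + 1)])
    th[0, 0] = 0.0               # the constant term is excluded from the estimate
    return th

def d_hat(u1, u2, th):
    """Estimated joint comparison density: 1 + sum of th times tensor scores."""
    m = th.shape[0] - 1
    return 1.0 + sum(th[j1, j2] * lp_score(j1, u1) * lp_score(j2, u2)
                     for j1 in range(m + 1) for j2 in range(m + 1))

rng = np.random.default_rng(0)
U = rng.uniform(size=(5000, 2))   # Rosenblatt-transformed sample when H0 is true
th = theta_hat(U, m=3)            # all entries close to 0 under H0, so d_hat is near 1
```

Under the null all estimated coefficients hover around zero at the 1/sqrt(n) scale, so the estimated comparison density stays close to one, in agreement with (17)-(18).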
In this manuscript we focus on (21) mostly for the sake of mathematical convenience in deriving the inferential results of Section 4 and for its simple implementation. Nonetheless, bona fide estimators can be constructed similarly to the univariate case as described in Algeri and Zhang (2020).

Example I. Let (X_1, X_2) be a random vector with rectangular support. Suppose that the hypothesized distribution, G_{X_1 X_2}, is that of a truncated bivariate normal with mean vector (12, ·) and given variances and covariance; whereas the true model, F_{X_1 X_2}, is a mixture involving also another truncated bivariate Gaussian with the same mean vector but its own variances and covariance, and a small mixture parameter. In order to estimate the comparison density, we set G_1 = G_{X_1} and G_2 = G_{X_2|X_1}. The estimated joint comparison density, obtained over a sample of n = 7000 observations, is shown in the right panel of Figure 1; the true comparison density is plotted in the left panel of the same figure. While the estimate obtained appears to recover the main departures from uniformity, the contours highlight that the estimator is rather noisy. Therefore, it is important to investigate the properties of (18) to assess the significance of the deviations observed.

Inference and model selection

Definition 2.1 implies that testing if F = G is equivalent to testing that d(u; G_X, F_X) = 1 with probability one, i.e.,

    H_0: d(u; G_X, F_X) = 1 for all u in [0, 1]^p   versus   H_1: d(u; G_X, F_X) != 1 for some u in [0, 1]^p.     (23)

Furthermore, from Proposition 2.3 it is easy to see that d(u; G_X, F_X) = 1 for all u in [0, 1]^p whenever all the theta_{j_1...j_p} coefficients, but theta_{0...0}, are identically equal to zero. Therefore, in practice, we test

    H_0: theta = 0   vs   H_1: theta != 0.     (24)

Notice that H_0 in (23) implies H_0 in (24), but the opposite is not true in general. Whereas, H_1 in (24) does imply H_1 in (23). With a little abuse of nomenclature, in this section and those to follow, we will refer to G_X as the "null model".
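In anticipation of Section 4, the test of (24) can be based on the scaled squared norm of theta_hat, whose null distribution is chi-squared (the deviance statistic introduced below). A hedged Python sketch for p = 2 with continuous margins (helper names are illustrative):

```python
import numpy as np
from itertools import product
from numpy.polynomial import legendre
from scipy import stats

def S(j, u):
    """Continuous-case LP score: normalized shifted Legendre polynomial."""
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * u - 1, c)

def deviance_test(U, m=3):
    """D = n * theta_hat' theta_hat, referred to its chi^2_M null limit."""
    n, p = U.shape
    # all tensor indices (j_1, ..., j_p) except the all-zero one
    idx = [j for j in product(range(m + 1), repeat=p) if any(j)]
    theta = np.array([np.mean(np.prod([S(jd, U[:, d])
                                       for d, jd in enumerate(j)], axis=0))
                      for j in idx])
    D = n * np.sum(theta ** 2)
    return D, stats.chi2.sf(D, df=len(idx))

rng = np.random.default_rng(1)
D0, p0 = deviance_test(rng.uniform(size=(2000, 2)))          # H0 true
D1, p1 = deviance_test(rng.beta(2.0, 2.0, size=(2000, 2)))   # H0 false: non-uniform margins
```

Under the null the statistic behaves like a chi-squared variable with M degrees of freedom, while even mild departures (here, Beta(2, 2) margins) inflate it sharply.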
Furthermore, we will refer to H_0 in (23) when generically saying that a result is valid "under H_0". Indeed, most of the results presented here only require validity of the "milder" H_0 in (24). However, more often than not, our derivations will involve model selection, and thus, to avoid complicating the notation further by specifying a different H_0 for each of the models under consideration, we will simply refer to H_0 in (23), which implies all the model-specific nulls.

To conduct our inference, we consider the so-called deviance test statistic

    D = n theta_hat' theta_hat.     (25)

The asymptotic null distribution of the deviance is given in Theorem 4.1.

Theorem 4.1. If H_0 is true, then

    sqrt(n) theta_hat -->_d N(0, I), as n --> infinity,     (26)

where N(0, I) denotes a standard multivariate normal distribution. Furthermore,

    D -->_d chi^2_M, as n --> infinity,     (27)

where M is the length of theta_hat.

Corollary 4.2 follows directly from Theorem 4.1.

Corollary 4.2. Denote with {d_hat(u; G_X, F_X)} the random field indexed by u in [0, 1]^p with components as in (18). Under H_0,

    [d_hat(u; G_X, F_X) - 1] / sqrt((1/n) S(u)' S(u)) -->_d Z(u), as n --> infinity,     (28)

where Z(u) denotes a Gaussian random field with mean zero, unit variance and covariance function

    Cov(Z(u), Z(u*)) = S(u)' S(u*) / sqrt(S(u)' S(u) S(u*)' S(u*)).     (29)

The proofs of both Theorem 4.1 and Corollary 4.2 are given in the Supplementary Material. At this stage, constructing inference on the basis of Theorem 4.1 and Corollary 4.2 would be tempting. However, to guarantee the validity of our results we must, for the moment, refrain from computing p-values and confidence regions and consider how a very practical aspect of our estimation strategy could effectively invalidate a naive inferential approach.
Specifically, either via an automatic procedure or by visual inspection, chances are that, when estimating the joint comparison density in (18), a model selection procedure is implemented. Unfortunately, when a model is selected from a pool of possibilities, such a process introduces an additional source of variability and thus the resulting inference is automatically affected (e.g., Berk et al., 2013). Section 4.1 addresses this aspect directly.

The estimate of the joint comparison density considered so far involves up to M tensor basis functions S_{j_1...j_p}(u). Nonetheless, it is possible that not all of these terms are needed to capture the departures of G_X from F_X and, indeed, it is often convenient to remove some of them in order to avoid unnecessary sources of noise. Various criteria have been proposed in the literature for univariate comparison densities and smooth models (e.g., Mukhopadhyay, 2017; Algeri, 2020; see also Rayner et al. (2009, Ch. 10)), and these can be easily extended to the multivariate setting. Here, we focus on the approaches of Ledwina (1994) and Claeskens and Hjort (2004), which rely on the BIC and AIC criteria. Let theta_hat_(k) be the kth largest theta_hat_{j_1...j_p} estimate in terms of magnitude, i.e., |theta_hat_(1)| >= |theta_hat_(2)| >= ... >= |theta_hat_(M)|. Select the K largest coefficients which maximize either

    BIC(K) = sum_{(k)=1}^K theta_hat^2_(k) - K log(n)/n   or   AIC(K) = sum_{(k)=1}^K theta_hat^2_(k) - 2K/n,     (30)

and include only the respective terms in the estimate of the comparison density in (18). Clearly, the choice of BIC or AIC is arbitrary and, from a practical standpoint, the BIC tends to lead to smoother estimates than the AIC. The selection rules in (30) compare M possible models assuming that each m_d, for d = 1, ..., p, was fixed before the researcher looked at the data.
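Operationally, the rules in (30) amount to sorting the squared coefficient estimates in decreasing order and maximizing a penalized cumulative sum. A small Python sketch with hypothetical coefficient values (not taken from the paper's examples):

```python
import numpy as np

def select_K(theta_hat, n, penalty="BIC"):
    """Sort squared estimates in decreasing order and pick the K maximizing
    the penalized cumulative sum, mimicking the BIC/AIC rules in (30)."""
    lp2 = np.sort(np.asarray(theta_hat) ** 2)[::-1]   # largest squares first
    K = np.arange(1, len(lp2) + 1)
    pen = K * np.log(n) / n if penalty == "BIC" else 2.0 * K / n
    crit = np.cumsum(lp2) - pen
    return int(K[np.argmax(crit)])

# hypothetical estimates: two strong coefficients, the rest noise of order 1/sqrt(n)
n = 2000
theta_hat = [0.30, -0.25, 0.01, -0.02, 0.015, 0.005]
K_bic = select_K(theta_hat, n, "BIC")   # -> 2
K_aic = select_K(theta_hat, n, "AIC")   # -> 2
```

With these values both criteria retain only the two dominant terms; for larger noise coefficients the lighter AIC penalty would start retaining more terms than the BIC, matching the remark that BIC yields smoother estimates.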
However, if one was to repeat the selection process several times, considering different m_d values each time, the number of models under consideration would increase even further. Because of this, and as advocated in Berk et al. (2013), a post-selection inferential strategy should be general enough to cover situations where multiple attempts, either in terms of the selection procedure adopted and/or visual inspection, are made. Furthermore, Berk et al. (2013, Section 4.6) show that, under such a framework, the problem of post-selection inference can indeed be reduced to one of simultaneous inference. In a similar spirit, here we construct a simple post-selection correction, which allows us to perform valid post-selection inference (in terms of both tests of hypothesis and confidence regions) while imposing one main assumption, i.e., Assumption 4.3.

Assumption 4.3. Either by visual assessment or data-driven procedures, none of the estimators of (4) considered at any point during the analyses contained more than M* terms.

Notice that Assumption 4.3 is sufficiently general to allow researchers to "cherry-pick" their p-values, as far as they are honest in declaring their M*. Under Assumption 4.3, it follows from Theorem 4.1 that valid post-selection inference can be constructed as in Corollary 4.4. The respective proof is provided in the Supplementary Material.

Corollary 4.4. Let {d_hat_0, ..., d_hat_{M*}} be a pool of possible estimators of (4) from which d_hat is selected under Assumption 4.3. Let K, 0 <= K <= M*, be the number of LP tensor basis functions in each d_hat_K. Denote with D_K the deviance of d_hat_K, so that D = D_K whenever d_hat = d_hat_K. If

    D_K is stochastically smaller than or equal to D_{M*} for all K = 0, ..., M* - 1,     (31)

a valid post-selection bound for the p-value to test (23) is

    p-value_adj = P(chi^2_{M*} > D_obs) as n --> infinity,     (32)

where D_obs is the value of D observed.
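In practice, (32) simply refers the observed deviance to a chi-squared distribution with M* degrees of freedom, no matter how many terms were ultimately retained. A sketch with hypothetical numbers:

```python
from scipy import stats

def postselection_pvalue(D_obs, M_star):
    """Post-selection bound (32): refer the observed deviance to chi^2 with
    M* degrees of freedom, the largest model entertained at any point."""
    return stats.chi2.sf(D_obs, df=M_star)

# hypothetical deviance observed after selecting, say, K = 5 terms out of M* = 19
D_obs = 25.0
p_naive = stats.chi2.sf(D_obs, df=5)             # ignores the selection step
p_adj = postselection_pvalue(D_obs, M_star=19)   # valid, more conservative bound
```

Because the chi-squared survival function increases with the degrees of freedom at a fixed point, the adjusted p-value can never be smaller than the naive one; this is the price paid for validity after selection.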
In (31) the stochastic ordering indicates that the left-hand side is stochastically smaller than or equal to the right-hand side. Each D_K in Corollary 4.4 only involves the sum of the squares of the estimates theta_hat_{j_1...j_p} of the coefficients theta_{j_1...j_p} entering d_hat_K. Therefore, condition (31) is verified whenever the models under comparison are nested within d_hat_{M*}, as is the case when relying on the selection rules in (30) with M* = M.

The deviance test presented here is a generalization of classical smooth tests to the multivariate setting, also covering random vectors with mixed components from arbitrary (either continuous and/or discrete) distributions. It follows that classical results on smooth tests also apply to our setting. Among these, Ledwina (1994) has shown that, under H_0, the BIC rule in (30) always selects the first component (i.e., the BIC is maximized at K = 1) as n --> infinity. Hence, in this setting, the (post-selection) distribution of the deviance statistic converges to that of a chi^2_1, and (32) could, in principle, be replaced by its less conservative counterpart P(chi^2_1 > D_obs). For finite samples, however, any K > 1 may still be selected with non-zero probability under H_0, and thus applying the chi^2_1 approximation may lead to a larger probability of type I error than the nominal one.

In order to go beyond the binary nature of decisions based on the p-value and grasp further insights on the deviations of G_X from F_X, it is worth constructing adequate confidence bands for our estimate of the comparison density. This can be done, while accounting for post-selection adjustments, as in Corollary 4.5. The proof is given in the Supplementary Material.

Corollary 4.5.
Let SE(d_hat_K(u) | H_0) be the standard error of the estimator d_hat_K under H_0. If

    sup_u | [d_hat_K(u) - 1] / SE(d_hat_K(u) | H_0) |  is stochastically smaller than or equal to  sup_u | [d_hat_{M*}(u) - 1] / SE(d_hat_{M*}(u) | H_0) |   for all K = 1, ..., M* - 1,     (33)

then valid (post-selection adjusted) (1 - alpha)% confidence regions, under H_0 in (23), for d_hat(u) are

    [ 1 - c_alpha SE(d_hat(u) | H_0),  1 + c_alpha SE(d_hat(u) | H_0) ]   for all u in [0, 1]^p,     (34)

with c_alpha such that

    P( sup_u | [d_hat_{M*}(u; G_X, F_X) - 1] / SE(d_hat_{M*}(u) | H_0) | > c_alpha | H_0 ) = alpha.     (35)

Also in this case, we may expect the stochastic ordering condition in (33) to hold when considering estimators d_hat_K nested within d_hat_{M*}. In this setting, d_hat_{M*} has the same terms as any other d_hat_K plus, possibly, additional ones. Therefore, d_hat_{M*} is the least smooth among all the estimators considered. Consequently, we expect that the random field resulting from d_hat_{M*} has the largest probability of crossing the fixed level c_alpha. In Corollary 4.5, the choice of constructing confidence regions under the null hypothesis is justified by the fact that the estimator in (18) is, in general, a biased estimator of (4) (see Proposition 3.2). As a result, confidence regions constructed around d_hat(u) could be shifted away from the true joint comparison density.
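The quantile c_alpha in (35) involves the supremum of the limiting Gaussian field; under the expected-Euler-characteristic approximation reviewed next, it can be recovered by a one-dimensional root search. A Python sketch for p = 2, with hypothetical Lipschitz-Killing curvatures (in practice these are estimated numerically, e.g., via the TOHM package mentioned below):

```python
import numpy as np
from scipy import optimize, stats

def excursion_prob(c, L1, L2):
    """Two-sided Euler-characteristic approximation to P(sup |Z(u)| > c)
    for a unit-variance Gaussian field on a two-dimensional domain."""
    return (2.0 * stats.norm.sf(c)
            + L1 * np.exp(-c ** 2 / 2) / np.pi
            + L2 * c * np.exp(-c ** 2 / 2) / (np.sqrt(2) * np.pi ** 1.5))

# hypothetical Lipschitz-Killing curvatures of the index set (illustration only)
L1, L2, alpha = 2.0, 3.0, 0.05
c_alpha = optimize.brentq(lambda c: excursion_prob(c, L1, L2) - alpha, 1.0, 10.0)
```

Larger L1 and L2 (a rougher field or a larger domain) push c_alpha above the pointwise Gaussian quantile, which is how the bands in (34) achieve simultaneous rather than pointwise coverage.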
While the bias cannot be easily quantified in a general setting, it can be shown to be equal to zero when H_0 in (24) is true, hence the validity of (34). From a theoretical perspective, a non-trivial aspect in the construction of (34) is the estimation of the quantile c_alpha. Probabilities such as (35) are known in the literature as excursion probabilities (e.g., Adler, 2000). While excursion probabilities cannot be expressed in closed form, accurate approximations under mild conditions exist (e.g., Taylor and Worsley, 2008). For the case of p = 2, and letting X be continuous, the probability in (35) can be approximated by

    2(1 - Phi(c_alpha)) + L_1 e^{-c_alpha^2/2}/pi + L_2 c_alpha e^{-c_alpha^2/2}/(sqrt(2) pi^{3/2}) + O(exp(-gamma c_alpha^2/2)), as n --> infinity,     (36)

for some gamma > 0 (Taylor et al., 2005). In (36), L_1 and L_2 are constants known as Lipschitz-Killing curvatures and are typically estimated numerically (e.g., Algeri and van Dyk, 2020; Vitells and Gross, 2011). As one may expect, although Assumption 4.3, along with the stochastic ordering conditions in (31) and (33), guarantees wide applicability of the post-selection adjustments in (32) and (34), this advantage comes with a price: they can be rather conservative for increasing values of M*.

Figure 2: Simulated and approximated confidence regions for Example I. The left panel corresponds to the (post-selection) confidence regions and deviance p-value obtained via a simulation of size B = 10,000. The right panel shows the (post-selection adjusted) confidence regions and deviance p-value computed as in (32) and (34). Darker shades correspond to significant deviations of the estimated comparison density above one. Lighter shades correspond to significant deviations below one.

Specifically, the larger m_d, p and M* are, the
However, as shown below forExample I and in the sections to follow, in a relatively low-dimensional setting (32) stillleads to high power even if the sample size is only moderately large. Similarly, (34)can be quite accurate and match closely the confidence regions obtained by simulatingdirectly the distribution of (28), while repeating the selection process at each replicate. Example I (continued). The comparison density estimate in the right panel ofFigure 1 has been obtained by setting m = 4 and m = 3 and selecting the terms of therespective tensor basis via the AIC rule in (30). The resulting estimate is (cid:98) d (cid:0) G ( x ) ,G ( x | x ); G X X , F X X (cid:1) = 1 + 0 . T (cid:0) G ( x | x ) (cid:1) − . T (cid:0) G ( x ) (cid:1) − . T (cid:0) G ( x ) (cid:1) + 0 . T (cid:0) G ( x ) , G ( x | x ) (cid:1) +0 . T (cid:0) G ( x ) , G ( x | x ) (cid:1) − . T (cid:0) G ( x ) , G ( x | x ) (cid:1) − . T (cid:0) G ( x ) , G ( x | x ) (cid:1) − . T (cid:0) G ( x ) , G ( x | x ) (cid:1) − . T (cid:0) G ( x ) , G ( x | x ) (cid:1) where G ≡ G X and G ≡ G X | X . The AIC procedure selects terms out of M ∗ = M = 19 . The post-selection adjusted p-value and confidence regions are shown inthe right panel of Figure 2. The confidence contours are constructed by setting equal toone all the values of (cid:98) d contained within the bands in (34). Whereas the quantile c α hasbeen calculated by solving (cid:0) − Φ( c α ) (cid:1) + L e − c α π + L e − c α √ π / − α = 0 (37)13 = 500 n = 1000 n=2000 n = 5000 n = 7000 n = 10 , Type I error 0.0540 0.0500 0.0499 0.0482 0.0508 0.04930( ± SE) ( ± ± ± ± ± ± ± SE) ( ± ± ± ± ± ± Table 1: Simulated probability of type I error and power for Example I considering differentsample sizes. The nominal level is chosen to be α = 0 . . Each simulation involves B = 10 , replicates. and estimating L and L by means of the R package TOHM (Algeri, 2019) as describedin Algeri and van Dyk (2020). This approach led to c α = 3 . . 
The confidencecontours suggest that the most prominent deviations occur in correspondence of theregions [10 , × [0 , and [12 , × [12 , . Here, the estimate of the comparisondensity shows significant deviations above one and thus we conclude that the postulatedmodel underestimates the truth over these areas. The presence of significant departuresof G X X from F X X are confirmed by the deviance test (adjusted p-value ∼ . · − ). The left panel of Figure 2 shows the confidence regions and deviance p-valueobtained by means of a Monte Carlo simulation involving B = 10 , replicates. Theselection procedure has been implemented at each replicate. While more conservative,the confidence regions computed via (34) and (37), approximate reasonably well thoseobtained via simulation.Finally, we investigate the probability of type I error and the power of the deviancetest based on (32). Table 1 reports the results obtained considering a suite of five sim-ulations, each of size B = 10 , , conducted using five different sample sizes. For all n considered, the probability of type I error observed is approximately the same than thenominal level α = 0 . . Whereas, the power increases rapidly with n . For the smallestsamples sizes considered, i.e., n = 500 and n = 1000 , the power is rather low ( ∼ and ∼ , respectively). However, it has to be noted that, in our example, the mixtureparameter is . ; therefore the deviations from the postulated model effectively accountfor only ∼ and ∼ data points when n = 500 and n = 1000 , respectively. When the dimension of the problem increases, the specification of a postulated model G X , and the respective Rosenblatt transform (RT) in (2.1), may be challenging. Thedifficulty of this task is aggravated when limited knowledge on the overall dependencestructure of X is available. 14 ariable True ( F X ) Hypothesized ( G X ) Correct X | X , X , X Poi (cid:104) e . x +0 . x +0 . x +0 . x (cid:105) Poi (cid:104) e . x +0 . x +0 . 
x (cid:105) No X , X , X N (cid:34)(cid:32) (cid:33) , (cid:32) . . (cid:33)(cid:35) N (cid:34)(cid:32) (cid:33) , (cid:32) . . (cid:33)(cid:35) Yes X | X Exponential (cid:16) x (cid:17) Exponential (cid:16) x (cid:17) Yes X Exponential (1) Exponential (0 . No X T Cauchy (0 , No Table 2: True and postulated model for Example II. The rows corresponds to the true and pos-tulated models for each of the variables in the first column. The last columns highlights whenmismodelling occurs. To illustrate this aspect, let X be a random vector with p = 7 . A naive specification ofthe RT would be G ≡ G X | X ,...,X , G ≡ G X | X ,...,X , . . . , G ≡ G X | X , G ≡ G X . (38)The RT in (38) implies that all the conditional distributions listed are known. In practice,however, even when a joint probability function g X is available, deriving the conditionalsin (38) from g X may translate into a particularly burdensome computational task. Inseveral physics experiments, for instance, the joint pdf g X is not fully known analyticallyand some of its components can only be computed numerically (e.g., Balázs et al., 2017).In this setting, deriving the conditional distributions could involve expensive marginal-izations and, even when derived, their evaluations on the data could be time consumingas each u di = G d ( x di | x Remark 5.1. The deviance statistics D ld in (42) is the equivalent of the LPINFORdependence measure proposed by Mukhopadhyay and Parzen (2020). The latter offersthe rather unique advantage of quantify dependence (beyond linearity) among randomvariables, regardless of their discrete or continuous nature. In light of the results in Section 5.1, we can now construct the so-called dependence learn-ing graph , a graphical tool to visualize the dependence structure of X and, consequently,simplify the specification of the RT. Definition 5.2 ( The dependence learning graph ) . 
The dependence learning graph of X is an undirected weighted graph, G_X = (X_d, E_{dl}, w_{dl}), whose vertices are the components of X and whose edges E_{dl} are such that X_d and X_l are joined by an edge if and only if p_{ld} in (42) is smaller than a predetermined significance level α. Each edge has weight w_{dl} = D_{ld}, with D_{ld} as in (42).

Notice that the dependence learning graph is a purely nonparametric tool which does not require any assumption on the distribution of X. When the overall dependence structure is unknown, G_X allows us to identify possible substructures and guide practitioners in the specification of their postulated model G_X and of the Rosenblatt transform G(x).

Table 4: iGOF-diagnostic table. The third column reports the (post-selection adjusted) deviance p-values for the test in (44) with X_q specified as in the first column. The second column corresponds to the degrees of freedom used in the calculation of the p-value, i.e., the quantities M*_q. The tests are ordered from the largest to the smallest sub-structure. The p-value in the first row is for the overall test in (24).

X_q | df | (Adjusted) p-value
X (overall) | · | < 10^{−·}
Gaussian–Poisson sub-vector | · | < 10^{−·}
Gaussian components | · | (not significant)
Exponential pair | · | ·10^{−·}
X3 | · | ·10^{−·}
Heavy-tailed component | · | ·10^{−·}

Figure 4: Comparing the postulated and the true model of X3. The green solid line corresponds to the probability density function (pdf) of an exponential random variable with unit rate. The red dashed line corresponds to the pdf of an exponential random variable with rate 0.9.

Specifically, let G(k), k = 1, ..., K, be the kth largest disconnected sub-graph of G_X, so that each X_d ∈ G(k) is disconnected from any X_l ∈ G(k′) for all k ≠ k′. Denote with |E_{d·}| the degree of connectivity of each X_d, i.e., the total number of edges connecting X_d to any other variable; whereas w_{d·} is the sum of the weights of the edges of X_d, i.e., w_{d·}
= Σ_l w_{dl}. A parsimonious RT can then be specified according to the following criteria:

(i) if G_d is a conditional cdf, the conditioning is only with respect to the components of X belonging to the same sub-graph of X_d;

(ii) if G_d is a conditional cdf, the conditioning is only with respect to the components of X with equal or lower connectivity than X_d;

(iii) among variables in the same sub-graph and with the same degree of connectivity, variables with higher weight w_{d·} should be conditioned on variables with lower weight.

To illustrate these aspects we consider the following example.

Example II. We consider a sample of n = 5000 observations from a random vector X = (X_1, ..., X_7) with components distributed as summarized in the second column of Table 2. Specifically, three of the components are jointly normally distributed, and a fourth component is, conditionally on those three, Poisson distributed with a rate depending on them. The component X3 is exponentially distributed with mean one, and a further component is, conditionally on X3 = x3, exponentially distributed with mean x3; their joint pdf is f(x, x3) = (1/x3) exp(−x3 − x/x3). Finally, the remaining component follows a Student's t distribution. It follows that the Gaussian–Poisson sub-vector, the exponential pair, and the t-distributed component are all independent of each other.

Table 5: Performance of the iGOF-diagnostic analysis for different sample sizes. The table shows the probabilities of rejection for each component when repeating the analysis in Table 4 B = 10,· times, using sample sizes n = 500, 1000, 2000, 3000, 5000 and 10,·. For values different from zero and one, the Monte Carlo errors (±SE) are also reported. The significance level considered is α = 0.·.

Choosing m_1 = · · · = m_7 = 3, we estimate θ̃ as described in Section 5.1, and we select its components according to the BIC rule in (30). We then derive the pairwise p-values in (42) and construct the dependence learning graph G_X.
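The graph construction of Definition 5.2 can be sketched in a few lines of code. The sketch below is illustrative: the vertex labels, deviance statistics and p-values are hypothetical placeholders (not the values of Example II), and the pairwise tests themselves are assumed to have been computed beforehand.

```python
# Hypothetical sketch of Definition 5.2: build a dependence learning graph
# from pairwise deviance statistics D_ld and their p-values p_ld, then
# recover its disconnected sub-graphs via depth-first search.

def dependence_learning_graph(labels, stats, pvals, alpha=0.05):
    """Return (edges, components): (d, l) is an edge iff p_ld < alpha,
    with weight w_dl = D_ld; components are the disconnected sub-graphs."""
    edges = {}
    adj = {v: set() for v in labels}
    for (d, l), p in pvals.items():
        if p < alpha:                      # join X_d and X_l by an edge
            edges[(d, l)] = stats[(d, l)]  # edge weight w_dl = D_ld
            adj[d].add(l)
            adj[l].add(d)
    components, seen = [], set()
    for v in labels:                       # depth-first search over vertices
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        components.append(sorted(comp))
    return edges, components

# Toy input mimicking a 7-variable structure (illustrative numbers only):
labels = ["X1", "X2", "X3", "X4", "X5", "X6", "X7"]
stats = {("X1", "X4"): 31.2, ("X2", "X4"): 18.5, ("X3", "X4"): 12.9,
         ("X5", "X6"): 44.0, ("X1", "X2"): 0.3}
pvals = {("X1", "X4"): 1e-6, ("X2", "X4"): 1e-4, ("X3", "X4"): 1e-3,
         ("X5", "X6"): 1e-9, ("X1", "X2"): 0.71}
edges, comps = dependence_learning_graph(labels, stats, pvals)
print(comps)  # [['X1', 'X2', 'X3', 'X4'], ['X5', 'X6'], ['X7']]
```

Any pair whose p-value is not below α is simply left unconnected, so independent groups of variables emerge as the disconnected components of the graph.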
The resulting graph is shown in Figure 3.As expected, each of independent sub-group of variables forms a disconnected sub-graphsof G X . Furthermore, since X and X are uncorrelated Gaussians (and therefore inde-pendent), despite they both belong to the largest sub-graph G (1) , they are not connectedby an edge. Table 3 collects the RT obtained in accordance to (i)-(iii). Notice that since G (2) contains only two variables, the RT can be equivalently specified by first conditioning X to X or vice-versa. Finally, since G X in Figure 3 suggests independence between X and X , we have G = G X | X = G X . The constructs introduced so far allow us to investigate the dependence structure of X (Section 5), assess the validity of the postulated model G X (Section 4) and obtain anestimate of the joint comparison density to visualize where and how departures of g X from f X occur (Section 3). Unfortunately, however, a visual inspection based on thecomparison density is only possible when p ≤ . Nevertheless, when p > , more insightson the sources of mismodelling affecting G X can be obtained by conducting an ANOVA-like analysis where random sub-vectors of X are tested individually, starting from thelargest to the smallest. The underlying theory relies on the following theorem. Theorem 6.1. Without loss of generality, let m = · · · = m p = m and assume that thecomponents of (cid:98) θ have been selected under Assumption 4.3, with M ∗ = ( m + 1) d − , and ssume that condition (31) holds. Let X q and U q = G q ( X q ) be random vectors collecting q ≤ p components of X and U = G ( X ) , respectively, and g X q (cid:0) x q (cid:1) = q (cid:89) h =1 g h (cid:0) x h | x Because of condition (43) , Theorem 6.1 holds only for random sub-vectorsof X whose RT transform G q includes all the conditioning, from the higher to the lower,necessary to recover g X q (cid:0) x q (cid:1) . 
To some extent, this condition can be seen as the iGOF counterpart of the marginality principle advocated by Nelder (1977) in the context of ANOVA. Similarly to ANOVA, Theorem 6.1 allows us to construct an iGOF-diagnostic table to identify the sources of mismodelling for a given random vector X. Below we show how this can be done for Example II.

Example II (continued). Consider once again our 7-dimensional random vector with components distributed as in Table 2. We are interested in testing the validity of the postulated model in the third column of Table 2. The RT is specified as in the fifth column of Table 3, and a deviance test has been constructed for each sub-graph G(k) identified via the dependence learning graph. The results are collected in Table 4. The overall deviance test is reported in the first row and correctly rejects the null model. Similarly, the test in the second row rejects the hypothesis that the Gaussian–Poisson sub-vector is modelled correctly, while the test in the third row fails to reject the model for its Gaussian components. This aspect is particularly important, as it highlights that the mismodelling occurs only with respect to the conditional distribution of the Poisson component given the Gaussian ones. The tests in the fourth and fifth rows show that the exponential pair has been mismodelled and that one source of mismodelling is the marginal of X3. Ultimately, the test in the last row also correctly rejects the null hypothesis of a Cauchy distribution for the t-distributed component.

Table 5 collects the results obtained by repeating the diagnostic analysis in Table 4 over B = 10,· replicates, while considering different sample sizes. Even for the smallest sample size considered, the most prominent deviations are captured with probability one, whereas the model for the Gaussian components is never rejected. More issues arise in diagnosing the mismodelling of X3 and, consequently, of the exponential pair for smaller samples. For instance, even when n = 1000, the power of the procedure in detecting departures of G_{X3} from F_{X3}, and those affecting the exponential pair as a whole, remains low.
It has to be noted, however, that detecting the mismodelling of X3 is a particularly challenging task. As shown in Figure 4, the postulated and the true pdf of X3 are very close to one another; these minor differences are further "diluted" when considering the joint distribution of the exponential pair, since the conditional distribution of the paired component given X3 is correctly specified. Nevertheless, such minor deviations are detected with high power for larger sample sizes.

Remark 6.3. From Remark 5.1 it follows that, when setting G_d ≡ F̃_{X_d} for all d = 1, ..., p, our iGOF-diagnostic analysis reduces to an analysis of independence for groups of two or more variables.

The methods discussed so far assume that the postulated model G_X is fully specified and does not include free parameters to be estimated. For instance, G_X may have been specified on the basis of the results obtained from previous studies, may have been calibrated over multiple runs of the experiment, or may have been estimated on a training set, with iGOF used to assess its validity on a test set. In many practical situations, however, G_X does depend on unknown parameters, which we denote by β. Indeed, one of the main goals of many physics experiments is to test different models and set reliable limits on the so-called "systematic uncertainties", i.e., the components of β. Therefore, at this stage, a legitimate question is: can we still use iGOF in this setting? The answer is yes, provided that the procedure is adequately modified.

In order to discuss how this can be done in practice, here we focus on the case where the free parameter β is estimated via maximum likelihood; hence, we assume that the usual regularity conditions (e.g., Cramér, 1999, p. 500) hold. Specifically, let β̂ be the Maximum Likelihood Estimate (MLE) of β, and let b_β(x) and nI_β be the score vector and the Fisher information matrix, respectively, i.e.,

b_β(x) = (d/dβ) log g_X(x, β)   and   I_β = −E[(d²/(dβ dβ′)) log g_X(X, β)].
(46)

Moreover, let θ̂_β̂ be the equivalent of θ̂ in Section 3, obtained by replacing β with β̂ in G_X. While one may expect the asymptotic distribution of θ̂_β̂ to be the same as that of θ̂ under H0, this is not true in general. Specifically, classical results on smooth tests (e.g., Thas, 2010, Secs. 4.2.2.3 and 5.2.2.3) can be used to show that the null asymptotic distribution of √n θ̂_β̂ is that of a zero-mean multivariate normal with matrix of variances and covariances given by

Σ_β̂ = I_M − Σ_{bβ} I_β^{−1} Σ_{βb},   (47)

with M the length of θ̂_β̂, I_M the M × M identity matrix, Σ_{βb} the matrix whose elements are ∫ T_{j_1...j_p}(G(x)) b_{qβ}(x) dG(x), and b_{qβ} the qth component of b_β in (46).

Figure 5: Simulated and approximated confidence regions for the Fermi LAT simulation. The left panel corresponds to the (post-selection) confidence regions and deviance p-value (< 10^{−·}) obtained via a simulation of size B = 10,·. The right panel shows the (post-selection adjusted) confidence regions and deviance p-value (1.766·10^{−67}) computed as in (32) and (34). Darker shades correspond to significant deviations of the estimated comparison density above one; lighter shades correspond to significant deviations below one.

Furthermore, since (47) is non-diagonal, the deviance statistic must be modified accordingly, and thus we write

D_β̂ = n θ̂′_β̂ Σ_β̂^{−1} θ̂_β̂.   (48)

By exploiting the results of Kallenberg and Ledwina (1997), the distribution of (48) can be shown to be χ²_M when no model selection is performed, whereas corrections such as those discussed in Section 4.1 can be implemented as post-selection adjustments.
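As a minimal numeric sketch, the modified deviance in (48) is simply a quadratic form in the selected coefficients. The two-component θ̂ and the covariance matrix below are hypothetical placeholders, and the 2 × 2 inverse is written out explicitly to keep the example dependency-free.

```python
# Minimal sketch of the modified deviance statistic in (48):
# D = n * theta' Sigma^{-1} theta. Inputs are illustrative placeholders,
# not quantities taken from the paper.

def deviance(theta, sigma, n):
    """Quadratic form n * theta' sigma^{-1} theta for a 2x2 sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular covariance matrix")
    # inverse of a 2x2 matrix, written out explicitly
    inv = [[d / det, -b / det], [-c / det, a / det]]
    t0, t1 = theta
    quad = (t0 * (inv[0][0] * t0 + inv[0][1] * t1)
            + t1 * (inv[1][0] * t0 + inv[1][1] * t1))
    return n * quad

# Under H0, D is asymptotically chi-squared with M = len(theta) df.
theta_hat = [0.10, 0.20]              # hypothetical selected coefficients
sigma_hat = [[1.0, 0.0], [0.0, 1.0]]  # identity: no estimation effect
print(deviance(theta_hat, sigma_hat, n=100))  # equals 5 up to float error
```

With the identity covariance the statistic reduces to n·||θ̂||², which is the unadjusted deviance; a non-diagonal Σ, as in (47), shrinks or inflates the statistic accordingly.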
Unfortunately, however, the asymptotic approximations are known to set in rather slowly in the parametric case. Therefore, in practical applications, when G_X depends on unknown parameters β, it is recommended to perform inference by means of the parametric bootstrap, which has been shown by Babu and Rao (2004) to be consistent also in the multivariate setting.

When conducting searches for new phenomena, mismodelling of the background distribution can dramatically compromise the sensitivity of the experiment. Specifically, overestimating the background can increase the chances of false negatives, whereas underestimating the background may lead to claiming false discoveries. To illustrate how iGOF can be used to understand if and how the postulated background model has been misspecified, we consider a simulated observation by the Fermi Large Area Telescope (LAT) (Atwood et al., 2009) obtained with the gtobssim package. The simulation includes a realistic representation of the instrumental noise of the detector and of the backgrounds present.

The region of interest corresponds to a disc in the sky of ·° radius, centered at (RA, DEC) = (·, ·), where RA and DEC are the coordinates in the sky. Here we assume that, although the cosmic background is known to follow a uniform distribution over the search area, it is unclear whether the instrumental error is effectively negligible or whether it has a prominent effect on the underlying distribution. Therefore, we set G_{X1X2} to be the cdf of a uniform distribution with support X1 × X2 = [165, ·] × [28 − √(· − (x1 − ·)²), 28 + √(· − (x1 − ·)²)], and we proceed by estimating the joint comparison density over a sample of n = 68658 observations. Specifically, we set m_1 = m_2 = 4 and select the components of θ̂ via the BIC criterion in (30). The resulting estimate is

d̂(G1(x1), G2(x2|x1); G_{X1X2}, F_{X1X2}) = 1 + 0.· T(G1(x1)) − 0.· T(G2(x2|x1)) + 0.0412 T(G2(x2|x1)).
(49)

In order to assess the significance of the deviations captured by (49), we compute both the confidence regions and the deviance p-values via (34) and (32). The results are reported in the right panel of Figure 5, whereas the left panel shows the confidence regions and the deviance p-value obtained via simulation. Similarly to what we observed for Example I (see Figure 2), although the approximate confidence bands are more conservative, they still allow us to capture the main departures from uniformity. Indeed, in both cases, we can see that the most prominent deviations of the true underlying model from the postulated uniform distribution occur in proximity of low values of X1, whereas, at the center-left of the search area, the uniform model significantly underestimates the model inclusive of the instrumental error. Finally, it follows from (20) that an updated background distribution which accounts for these deviations can be constructed by simply multiplying the uniform pdf by the estimated comparison density in (49).

This work proposes an informative approach to goodness-of-fit which bridges exploratory and confirmatory data analysis to study multivariate distributions. The use of the joint comparison density allows practitioners to identify the sources of mismodelling, hence the informative character of the procedure. Specifically, confidence regions can be constructed as in Corollary 4.5 to identify regions of the parameter space where significant deviations occur. While this approach is practical only for problems in at most three dimensions, in higher dimensions a detailed diagnosis of mismodelling can be achieved by means of the iGOF-diagnostic analysis proposed in Section 6. It follows that confidence-region plots and iGOF-diagnostic tables can be used to directly address Q1 in Section 1.

http://fermi.gsfc.nasa.gov/ssc/data/analysis/software

An important step of the procedure proposed is the LP representation of the joint comparison density in (13).
Specifically, the use of LP score functions allows us to generalize our methods to both discrete and continuous data. Furthermore, as highlighted in Section 5, iGOF reduces to an analysis of independence when the Rosenblatt transform in Definition 2.1 is replaced with the empirical marginal distribution functions. This allows us to learn the underlying dependence structure of the data and visualize it through the dependence learning graph (see Section 5.2). Consequently, the latter can be used to directly address Q2 in Section 1.

As we aimed for when formulating Q3 in Section 1, the true probability function of the data can be estimated nonparametrically, when investigating the dependence structure, and semi-parametrically via (20), when assessing the validity of the model postulated by the scientists. Interestingly, in the latter case the resulting estimate incorporates the knowledge carried by the hypothesized model and thus provides a data-driven update of it in the direction of the true distribution of the data.

Despite the usefulness of the methods presented here in applied settings, and in the physical sciences in particular (e.g., Section 8), they are not exempt from limitations. For instance, several problems in physics and astronomy often involve no more than 8 or 10 dimensions and/or can be reduced to 2D planes (e.g., Aprile et al., 2017). In this context, choosing small values of m_d for all d = 1, ..., p is often sufficient to avoid overfitting and, eventually, lack of power, provided that adequate model selection strategies are implemented and the samples are sufficiently large (see Sections 4 and 6). In higher dimensions, however, the method suffers from the curse of dimensionality (e.g., Friedman et al., 2001), as the size of the LP tensor basis increases exponentially fast with p.
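The growth of the basis is easy to quantify. Assuming the full tensor basis with m_d = m for all d contains (m + 1)^p − 1 non-constant elements, its size for, say, m = 3 can be tabulated as follows.

```python
# Illustrative check of how fast the LP tensor basis grows: assuming the
# full tensor basis with m_d = m for all d has (m + 1)**p - 1 non-constant
# elements, its size increases exponentially with the dimension p.

def lp_tensor_basis_size(m, p):
    """Number of non-constant elements in the full LP tensor basis."""
    return (m + 1) ** p - 1

for p in (2, 7, 10):
    print(p, lp_tensor_basis_size(3, p))
# prints: "2 15", "7 16383", "10 1048575"
```

Even at the moderate dimensions typical of (astro)physics applications, the full basis already contains tens of thousands of elements, which is what motivates aggressive model selection or regularization.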
In this context, a regularized solution could be particularly valuable (see, for instance, Signoretto et al., 2014, in the context of tensor data analysis) when analyzing, for example, data coming from large astronomical surveys such as the Large Synoptic Survey Telescope (LSST) survey (e.g., Tyson, 2002).

Finally, the post-selection inference approach presented in Section 4.1 enjoys wide generalizability and simple implementation. However, the adjustments proposed are, by construction, conservative. Although this issue can, in principle, be overcome by simulation studies, in the most crucial (astro)physics discoveries the claim for a discovery can only be made at a significance level of ·10^{−·} or lower (e.g., Lyons, 2013). Therefore, a numerical solution may be computationally prohibitive, and an (asymptotically) exact post-selection inferential solution (see, for instance, Tibshirani et al., 2016, in the context of regression) could drastically reduce the computational complexity when dealing with such stringent significance requirements.

Acknowledgments

The author thanks sincerely Estate V. Khmaladze and G. Jogesh Babu for the enlightening discussions and comments. Their valuable feedback has led to a substantial improvement of the quality and clarity of the manuscript.

References

Adler, R. J. (2000). On excursion sets, tube formulas and maxima of random fields. Annals of Applied Probability, pages 1–74.
Algeri, S. (2019). TOHM: Testing One Hypothesis Multiple Times. R package version 1.3.
Algeri, S. (2020). Detecting new signals under background mismodeling. Phys. Rev. D, 101:015003.
Algeri, S. et al. (2018). Statistical challenges in the search for dark matter. arXiv:1807.09273.
Algeri, S. and van Dyk, D. A. (2020). Testing one hypothesis multiple times: the multidimensional case. Journal of Computational and Graphical Statistics, 29(2):358–371.
Algeri, S. and Zhang, X. (2020). Smoothed inference and graphics via LP modeling.
arXiv preprint arXiv:2005.13011.
Aprile, E., Aalbers, J., Agostini, F., Alfonsi, M., Amaro, F., Anthony, M., Arneodo, F., Barrow, P., Baudis, L., Bauermeister, B., et al. (2017). First dark matter search results from the XENON1T experiment. Physical Review Letters, 119(18):181301.
Atwood, W. B., et al. (2009). The Large Area Telescope on the Fermi gamma-ray space telescope mission. The Astrophysical Journal, 697(2):1071.
Babu, G. J. and Rao, C. R. (2004). Goodness-of-fit tests when parameters are estimated. Sankhya, 66(1):63–74.
Balázs, C. et al. (2017). ColliderBit: a GAMBIT module for the calculation of high-energy collider observables and likelihoods. The European Physical Journal C, 77(11):795.
Barton, D. E. (1956). Neyman's test of goodness of fit when the null hypothesis is composite. Scandinavian Actuarial Journal, 1956(2):216–245.
Berk, R., Brown, L., Buja, A., Zhang, K., Zhao, L., et al. (2013). Valid post-selection inference. The Annals of Statistics, 41(2):802–837.
Claeskens, G. and Hjort, N. L. (2004). Goodness of fit via non-parametric likelihood ratios. Scandinavian Journal of Statistics, 31(4):487–513.
Cramér, H. (1999). Mathematical Methods of Statistics, volume 43. Princeton University Press.
Dauncey, P., Kenzie, M., Wardle, N., and Davies, G. (2015). Handling uncertainties in background shapes: the discrete profiling method. Journal of Instrumentation, 10(04):P04015.
Friedman, J., Hastie, T., and Tibshirani, R. (2001). The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York.
Kallenberg, W. C. M. and Ledwina, T. (1997). Data-driven smooth tests when the hypothesis is composite. Journal of the American Statistical Association, 92(439):1094–1104.
Khmaladze, E. (2016). Unitary transformations, empirical processes and distribution free testing. Bernoulli, 22(1):563–588.
Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari, Giorn., 4:83–91.
Ledwina, T. (1994).
Data-driven version of Neyman's smooth test of fit. Journal of the American Statistical Association, 89(427):1000–1005.
Lyons, L. (2013). Discovering the significance of 5 sigma. arXiv:1310.1284.
Mardia, K. V. (1975). Assessment of multinormality and the robustness of Hotelling's T² test. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):163–171.
Mukhopadhyay, S. (2017). Large-scale mode identification and data-driven sciences. Electronic Journal of Statistics, 11(1):215–240.
Mukhopadhyay, S. and Parzen, E. (2014). LP approach to statistical modeling. arXiv:1405.2601.
Mukhopadhyay, S. and Parzen, E. (2020). Nonparametric universal copula modeling. Applied Stochastic Models in Business and Industry, 36(1):77–94.
Mukhopadhyay, S. and Wang, K. (2020). Nonparametric high-dimensional k-sample comparison. Biometrika (to appear).
Nelder, J. (1977). A reformulation of linear models. Journal of the Royal Statistical Society: Series A (General), 140(1):48–63.
Nelsen, R. B. (2007). An Introduction to Copulas. Springer Science & Business Media.
Parzen, E. (1979). Nonparametric statistical data modeling. Journal of the American Statistical Association, 74(365):105–121.
Parzen, E. (1983). FUN.STAT quantile approach to two sample statistical data analysis. Technical report, Texas A&M University, College Station, Institute of Statistics.
Parzen, E. (2004). Quantile probability and statistical data modeling. Statistical Science, 19(4):652–662.
Priel, N., Rauch, L., Landsman, H., Manfredini, A., and Budnik, R. (2017). A model independent safeguard against background mismodeling for statistical inference. Journal of Cosmology and Astroparticle Physics, 2017(05):013.
Rayner, J. C. W. and Best, D. J. (1990). Smooth tests of goodness of fit: an overview. International Statistical Review / Revue Internationale de Statistique, pages 9–17.
Rayner, J. C. W., Thas, O., and Best, D. J. (2009). Smooth Tests of Goodness of Fit: Using R. John Wiley & Sons.
Rosenblatt, M.
(1952). Remarks on a multivariate transformation. The Annals of Mathematical Statistics, 23(3):470–472.
Signoretto, M., Dinh, Q. T., De Lathauwer, L., and Suykens, J. A. (2014). Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303–351.
Smirnov, N. V. (1939). On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Math. Univ. Moscou, 2(2):3–14.
Taskinen, S., Oja, H., and Randles, R. H. (2005). Multivariate nonparametric tests of independence. Journal of the American Statistical Association, 100(471):916–925.
Taylor, J., Takemura, A., Adler, R. J., et al. (2005). Validity of the expected Euler characteristic heuristic. The Annals of Probability, 33(4):1362–1396.
Taylor, J. E. and Worsley, K. J. (2008). Random fields of multivariate test statistics, with applications to shape analysis. Ann. Statist., 36(1):1–27.
Thas, O. (2010). Comparing Distributions. Springer.
Tibshirani, R. J., Taylor, J., Lockhart, R., and Tibshirani, R. (2016). Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620.
Tyson, J. A. (2002). Large Synoptic Survey Telescope: overview. In Survey and Other Telescope Technologies and Discoveries, volume 4836, pages 10–20. International Society for Optics and Photonics.
Vitells, O. and Gross, E. (2011). Estimating the significance of a signal in a multi-dimensional search. Astroparticle Physics, 35(5):230–234.
Yellin, S. (2002). Finding an upper limit in the presence of an unknown background. Physical Review D, 66(3):032005.