Conditional Independence Testing in Hilbert Spaces with Applications to Functional Data Analysis
Anton Rask Lundborg ∗ University of Cambridge, UK [email protected]
Rajen D. Shah † University of Cambridge, UK [email protected]
Jonas Peters ‡ University of Copenhagen, Denmark [email protected]
January 19, 2021
Abstract
We study the problem of testing the null hypothesis that X and Y are conditionally independent given Z, where each of X, Y and Z may be functional random variables. This generalises, for example, testing the significance of X in a scalar-on-function linear regression model of response Y on functional regressors X and Z. We show, however, that even in the idealised setting where additionally (X, Y, Z) have a non-singular Gaussian distribution, the power of any test cannot exceed its size. Further modelling assumptions are needed to restrict the null, and we argue that a convenient way of specifying these is based on choosing methods for regressing each of X and Y on Z. We thus propose as a test statistic the Hilbert–Schmidt norm of the outer product of the resulting residuals, and prove that type I error control is guaranteed when the in-sample prediction errors are sufficiently small. We show this requirement is met by ridge regression in functional linear model settings without requiring any eigen-spacing conditions or lower bounds on the eigenvalues of the covariance of the functional regressor. We apply our test in constructing confidence intervals for truncation points in truncated functional linear models.

∗ Part of this work was done while ARL was at the University of Copenhagen. ARL was supported by the Cantab Capital Institute for the Mathematics of Information.
† RDS was supported by an EPSRC first grant and an EPSRC programme grant.
‡ JP was supported by the VILLUM Foundation and the Carlsberg Foundation.

Introduction

In a variety of application areas, such as meteorology, neuroscience, linguistics, and chemometrics, we observe samples containing random functions [43, 33]. The field of functional data analysis (FDA) has a rich toolbox of methods for the study of such data. For instance, there are a number of regression methods for different functional data types, including linear function-on-scalar [35], scalar-on-function [20, 15, 40, 34, 50, 10] and function-on-function [22, 38] regression; there are also nonlinear and nonparametric variants [13, 14, 11, 47], and versions able to handle potentially large numbers of functional predictors [12], to give a few examples; see Wang et al. [44] and Morris [26] for helpful reviews and a more extensive list of relevant references. The availability of software packages for functional regression methods, such as the R packages refund [16] and FDboost [3], allows practitioners to easily adopt the FDA framework for their particular data.

One area of FDA that has received less attention is that of conditional independence testing. Given random elements
X, Y, Z, the conditional independence X ⊥⊥ Y | Z formalises the idea that X contains no further information about Y beyond that already contained in Z; a precise definition is given in Section 1.3. Inferring conditional independence from observed data is of central importance in causal inference [28, 41, 31], graphical modelling [25, 23] and variable selection. For example, consider the linear scalar-on-function regression model

Y = ∫₀¹ θ_X(t) X(t) dt + ∫₀¹ θ_Z(t) Z(t) dt + ε,   (1)

where X, Z are random covariate functions taking values in L²([0, 1]), θ_X, θ_Z are unknown parameter functions, Y ∈ R is a scalar response and ε ∈ R, satisfying ε ⊥⊥ (X, Z), represents stochastic error. Then the conditional independence X ⊥⊥ Y | Z is equivalent to θ_X = 0, i.e., to whether the functional predictor X is significant.

For nonlinear regression models, the conditional independence X ⊥⊥ Y | Z still characterises whether or not X is useful for predicting Y given Z. Indeed, consider a more general setting where Y is a potentially infinite-dimensional response, and X_1, ..., X_p are predictors, some or all of which may be functional. Then a set of predictors S ⊆ {1, ..., p} that contains all useful information for predicting Y, that is, such that Y ⊥⊥ {X_j}_{j∉S} | {X_j}_{j∈S}, is known as a Markov blanket in the graphical modelling literature [29, Sec. 3.2.1]. If Y is conditionally dependent on X_j given {X_k}_{k≠j}, then j is contained in every Markov blanket, and under mild conditions (e.g., faithfulness) the Markov blanket is unique and coincides exactly with the set of variables j satisfying this conditional dependence. This set may thus be inferred by applying conditional independence tests.

Conditional independence testing is also used as a building block of several methods for causal discovery. For example, independence-based methods such as the PC or FCI algorithms [41] connect conditional independence statements and separation statements in the causal graph by means of the Markov condition and faithfulness. Other methods such as ICP [30] use the fundamental principle of invariance or autonomy [e.g. 18] to infer causal predictors of a response variable Y by testing whether a regression model of Y on a set X of potential causes is invariant over different values of E, that is, whether E ⊥⊥ Y | X. The variable E can be interpreted as encoding properties of different environments, or can be another functional variable if the latter is known to be a non-descendant of Y.

Recent work [39], however, has shown that in the setting where X, Y and Z are simple random vectors with Z continuous (i.e., having a density with respect to Lebesgue measure), testing the conditional independence X ⊥⊥ Y | Z is fundamentally hard in the sense that any test for conditional independence must have power at most its size. Intuitively, the reason for this is that given any test, there are potentially highly complex joint distributions for the triple (X, Y, Z) that maintain conditional independence but yield rejection rates as high as for any alternative distribution. Lipschitz constraints on the joint density, for example, preclude the presence of such distributions [27].

In the context of functional data, however, the problem can be more severe, and we show in this work that even in the idealised setting where (
X, Y, Z) are jointly Gaussian in the functional linear regression model (1), testing for X ⊥⊥ Y | Z is fundamentally impossible: any test must have power at most its size. In other words, any test with power β at some alternative cannot hope to control the type I error at level α < β across the entirety of the null hypothesis, even if we are willing to assume Gaussianity.

Consequently, there is no general-purpose conditional independence test even for Gaussian functional data, and we must necessarily make some further modelling assumptions to proceed. We argue that this calls for conditional independence tests whose suitability for any functional data setting can be judged more easily. Motivated by the Generalised Covariance Measure [39], we propose a simple test we call the Generalised Hilbertian Covariance Measure (GHCM) that involves regressing X on Z and Y on Z (each of which may be functional or indeed collections of functions), and then computing the Hilbert–Schmidt norm of the outer product between the resulting residuals. We show that the validity of this form of test relies primarily on the relatively weak requirement that the regression procedures are able to estimate the conditional means of X given Z and of Y given Z at a slow rate. We thus aim to convert the problem of conditional independence testing into the more familiar task of regression with functional data, for which well-developed methods are readily available. This marks out our test as different from existing approaches for assessing conditional independence in FDA, which we review in the following.

One approach to measuring conditional dependence with functional data is based on the Gaussian graphical model. Zhu et al. [53] propose a Bayesian approach for learning a graphical model for jointly Gaussian multivariate functional data. Qiao et al. [32] and Zapata et al. [52] study approaches based on generalisations of the graphical Lasso [51]. These latter methods do not aim to perform statistical tests for conditional independence, but rather provide a point estimate of the graph, for which the authors establish consistency results valid in potentially high-dimensional settings.

As discussed earlier, conditional independence testing is related to significance testing in regression models. There is, however, a paucity of literature on formal significance tests for the inclusion of a functional predictor. The R implementation [16] of the popular functional regression methodology of Greven and Scheipl [17] produces p-values for the inclusion of a functional predictor based on significance tests for generalised additive models developed in Wood [45]. These tests, whilst being computationally efficient, do not have formal uniform level control guarantees.

Among the literature on conditional independence testing for multivariate data, our test is most closely related to the Generalised Covariance Measure (GCM) introduced in Shah and Peters [39]. This is a family of tests based on regressing each of X and Y on Z, and computing a certain normalised covariance of the residuals. Our proposed Generalised Hilbertian Covariance Measure reduces to (the absolute value of) the GCM test statistic in the case where X and Y are scalar.
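To preview the recipe before its formal development in Section 3, the following R sketch computes the (uncalibrated) GHCM statistic from curves observed on a common grid. The simple operator ridge fit stands in for the user-chosen regression methods, the penalty value is arbitrary, and the calibration yielding p-values is deferred to Section 3; the sketch is illustrative only.

    ghcm_statistic <- function(X, Y, Z, gamma = 0.01) {
      # X, Y, Z: n x (grid size) matrices of discretised curves.
      n <- nrow(X)
      G <- tcrossprod(Z) / n                  # Gram matrix <z_i, z_j>/n (grid approximation)
      H <- G %*% solve(G + gamma * diag(n))   # ridge "hat" matrix for regressions on Z
      eps <- X - H %*% X                      # residuals from regressing X on Z
      xi  <- Y - H %*% Y                      # residuals from regressing Y on Z
      R_bar <- crossprod(eps, xi) / n         # averaged outer product of the residuals
      sqrt(n) * sqrt(sum(R_bar^2))            # Hilbert-Schmidt (Frobenius) norm of sqrt(n) * R_bar
    }

Any consistent functional regression method could replace the ridge fits here; it is only the residuals that enter the statistic.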
The statement of the hardness result we present in Section 2 bears some similarity to the hardness result of Shah and Peters [39]. However, the Gaussian and infinite-dimensional setting considered here is different from the setting considered in Shah and Peters [39], which is finite-dimensional and places no restrictions on the null beyond conditional independence and the existence of a density with respect to Lebesgue measure. As a consequence, the proof techniques are mostly unrelated.

The rest of the paper is organised as follows. In Section 2 we present our formal hardness result on conditional independence testing for Gaussian functional data. The proof rests on a new result on the maximum power attainable at any alternative when testing for conditional independence with multivariate Gaussian data. The full technical details are given in Section A of the supplementary material.

In Section 3 we describe our new GHCM testing framework for testing X ⊥⊥ Y | Z, where each of X, Y and Z may be collections of functional and scalar variables. In view of the negative result on conditional independence testing with Gaussian data, our guarantees on type I error control must necessarily hold only on a subset of the null hypothesis of conditional independence. In Section 4 we show that for the GHCM, this subset may be characterised as one where, in addition to some tightness and moment conditions, the conditional expectations E(X | Z) and E(Y | Z) can be estimated at sufficiently fast rates, such that the product of the corresponding in-sample mean squared prediction errors (MSPEs) decays faster than 1/n uniformly, where n is the sample size. In Section 4.2 we show that a version of the GHCM incorporating sample splitting has uniform power against alternatives where the expected conditional covariance operator E{Cov(
X, Y | Z)} has Hilbert–Schmidt norm of order n^{-1/2}, and is thus rate-optimal. Such uniform guarantees are especially important in the case of conditional independence testing since, in particular, they reveal the form of the null hypothesis actually being tested. Our proofs rely on new results on uniform convergence of Hilbertian and Banachian random variables which may be useful in other settings, too. These and related results we use are given in Section B of the supplementary material.

The fact that control of the type I error of the GHCM depends on an in-sample MSPE, rather than a more conventional out-of-sample MSPE, has important consequences. We demonstrate in Section 4.3 that bounds on the former are achievable under significantly weaker conditions than equivalent bounds on the latter, by considering ridge regression in the functional linear model. In particular, the required prediction error rates are satisfied over classes of functional linear models where the eigenvalues of the covariance operator of the functional regressor are dominated by a summable sequence; no additional eigen-spacing conditions, or lower bounds on the decay of the eigenvalues, are needed.

In Section 5 we present the results of numerical experiments on the GHCM relating to its size and power. We also demonstrate in Section 5.2 the use of the GHCM in the construction of a confidence interval for the truncation point in a truncated functional linear model, a problem which we show may be framed as one of testing certain conditional independencies. We conclude with a discussion in Section 6 outlining potential follow-on work and open problems. The supplementary material contains the proofs of all results presented in the main text and some additional numerical experiments, as well as the uniform convergence results mentioned above. An R package implementing the methodology is available from https://github.com/ARLundborg/ghcm.

Notation

For three random elements X, Y and Z defined on the same background space (Ω, F, P) with values in (𝒳, A), (𝒴, G) and (𝒵, K) respectively, we say that X is conditionally independent of Y given Z, and write X ⊥⊥ Y | Z, when

E(f(X) g(Y) | Z) = E(f(X) | Z) E(g(Y) | Z)

for all bounded and Borel measurable f : 𝒳 → R and g : 𝒴 → R. Several equivalent definitions are given in Constantinou and Dawid [8, Proposition 2.3].

Throughout the paper we consider families of probability distributions P of the triplet (X, Y, Z), which we partition into the null hypothesis P_0 of those P ∈ P satisfying X ⊥⊥ Y | Z, and the set of alternatives Q := P \ P_0 where the conditional independence relation is violated. We consider data (x_i, y_i, z_i), i = 1, ..., n, consisting of i.i.d. copies of (X, Y, Z), and write X^(n) := (x_i)_{i=1}^n, and similarly for Y^(n) and Z^(n). We apply to this data a test ψ_n : (𝒳 × 𝒴 × 𝒵)^n → {0, 1}, with a value of 1 indicating rejection. We will at times write E_P(·) for expectations of random elements whose distribution is determined by P, and similarly P_P(·) = E_P(1{·}). Thus the size of the test ψ_n may be written as sup_{P ∈ P_0} P_P(ψ_n = 1).

We always take 𝒳 = H_X and 𝒴 = H_Y for separable Hilbert spaces H_X and H_Y, which will in places be finite-dimensional (Euclidean). For f in a Banach space B, we write ‖f‖ for the norm of f, and for g and h in a Hilbert space H, we write ⟨g, h⟩ for the inner product of g and h.
The bounded linear operator on H given by x ↦ ⟨x, g⟩h is the outer product of g and h and is denoted by g ⊗ h. A bounded linear operator A on H is compact if it has a singular value decomposition, i.e., if there exist two orthonormal bases (e_{1,k})_{k∈N} and (e_{2,k})_{k∈N} of H and a non-increasing sequence (λ_k)_{k∈N} of non-negative singular values tending to zero such that

A h = Σ_{k=1}^∞ λ_k (e_{1,k} ⊗ e_{2,k}) h = Σ_{k=1}^∞ λ_k ⟨e_{1,k}, h⟩ e_{2,k}   for all h ∈ H.

For a compact linear operator A as above, we denote by ‖A‖_op, ‖A‖_HS and ‖A‖_TR the operator norm, Hilbert–Schmidt norm and trace norm, respectively, of A, which equal the ℓ∞, ℓ² and ℓ¹ norms, respectively, of the sequence of singular values (λ_k)_{k∈N}.

A random variable on a Banach space B is a mapping X : Ω → B defined on a background probability space (Ω, F, P) which is measurable with respect to the Borel σ-algebra B(B) on B. Integrals with values in Hilbert or Banach spaces, including expectations, are Bochner integrals throughout. For a random variable X on a Hilbert space H, we define the covariance operator of X by

Cov(X) := E[(X − E(X)) ⊗ (X − E(X))] = E(X ⊗ X) − E(X) ⊗ E(X)

whenever E‖X‖² < ∞. For another random variable Y with E‖Y‖² < ∞, we define the cross-covariance operator of X and Y by

Cov(X, Y) := E[(X − E(X)) ⊗ (Y − E(Y))] = E(X ⊗ Y) − E(X) ⊗ E(Y).

We define conditional variants of the covariance operator and cross-covariance operator by replacing expectations with conditional expectations given a σ-algebra or random variable.
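As a small numerical illustration of these definitions, the covariance operator of a discretised functional variable is simply a matrix (scaled by a quadrature weight), and its eigenvalues approximate those of Cov(X). For standard Brownian motion the true eigenvalues 4/((2k − 1)²π²) ≈ 0.405, 0.045, 0.016, ... are recovered; the sample size, grid and seed below are arbitrary.

    set.seed(1)
    n <- 200; p <- 101
    # n standard Brownian motion paths on a grid of p points in [0, 1].
    Z <- cbind(0, t(apply(matrix(rnorm(n * (p - 1), sd = sqrt(1 / (p - 1))), n), 1, cumsum)))
    Zc <- scale(Z, center = TRUE, scale = FALSE)
    C_hat <- crossprod(Zc) / (n * (p - 1))  # empirical covariance kernel times quadrature weight
    head(eigen(C_hat, symmetric = TRUE, only.values = TRUE)$values, 3)
    # approximately 0.405 0.045 0.016, the leading eigenvalues of Cov(Z) for Brownian motion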
Hardness of conditional independence testing

In this section we present a negative result on the possibility of testing for conditional independence with functional data in the idealised setting where all variables are Gaussian. We take P to consist of distributions of (X, Y, Z) that are jointly Gaussian with non-singular (injective) covariance operator, where X and Z take values in separable Hilbert spaces H_X and H_Z respectively, with H_Z infinite-dimensional, and Y ∈ R^{d_Y} for some d_Y ∈ N. We note that in the case where d_Y = 1 and H_X, H_Z are infinite-dimensional, each P ∈ P admits a representation as a Gaussian scalar-on-function linear model (1), where Y is the scalar response and the functional covariates X, Z and error ε are all jointly Gaussian with ε ⊥⊥ (X, Z); the settings with d_Y > 1 are analogous. Given Q in the set of alternatives Q := P \ P_0, we further define P_Q ⊂ P_0 by

P_Q := {P ∈ P_0 : the marginal distribution of Z under P and Q is the same}.

Theorem 1 below shows that not only is it fundamentally hard to test the null hypothesis of P_0 against Q for all dataset sizes n, but that restricting to the null P_Q presents an equally hard problem.

Theorem 1. Given an alternative Q ∈ Q and n ∈ N, let ψ_n be a test for the null hypothesis P_Q against Q. Then we have that the power is at most the size:

P_Q(ψ_n = 1) ≤ sup_{P ∈ P_Q} P_P(ψ_n = 1).

An interpretation of this statement in the context of the functional linear model is that regardless of the number of observations n, there is no non-trivial test for the significance of the functional predictor X, even if the marginal distribution of the additional predictor Z is known exactly. It is clear that the size of a test over P_0 is at least as large as that over the null P_Q, so testing the larger null is of course at least as hard.

It is known that testing conditional independence in simple multivariate (finite-dimensional) settings is hard in the sense of Theorem 1 when the conditioning variable is continuous. However, whilst restricting the null to include only distributions with Lipschitz densities, for example, allows for the existence of tests with power against large classes of the alternative, in the functional setting simply removing pathological distributions from the entire null of conditional independence does not make the problem testable. Even with the parametric restriction of Gaussianity, the null is still too large for the existence of non-trivial hypothesis tests. Indeed, the starting point of our proof is a result due to Kraft [24] showing that the hardness in the statement of Theorem 1 is equivalent to the n-fold product Q^{⊗n} lying in the convex closure, in total variation distance, of the set of n-fold products of distributions in P_Q.

A consequence of Theorem 1 is that we need to make strong modelling assumptions in order to test for conditional independence in the functional data setting. Given the plethora of regression methods for functional data, we argue that it can be convenient to frame these modelling assumptions in terms of regression models for each of X and Y on Z, or, more generally, in terms of the performances of methods for these regressions. The remainder of this paper is devoted to developing a family of conditional independence tests whose validity rests primarily on the prediction errors of these regressions.

The Generalised Hilbertian Covariance Measure

In this section we present the Generalised Hilbertian Covariance Measure (GHCM) for testing conditional independence with functional data. To motivate the approach we take, it will be helpful to first review the construction of the Generalised Covariance Measure (GCM) developed in Shah and Peters [39] for univariate X and Y.

Consider first, therefore, the case where X and Y are real-valued random variables, and Z is a random variable with values in some space 𝒵. We can always write X = f(Z) + ε where f(z) := E(X | Z = z), and similarly Y = g(Z) + ξ with g(z) := E(Y | Z = z). The conditional covariance of X and Y given Z,

Cov(X, Y | Z) := E[{X − E(X | Z)}{Y − E(Y | Z)} | Z] = E(εξ | Z),

has the property that Cov(X, Y | Z) = 0, and hence E(εξ) = 0, whenever X ⊥⊥ Y | Z. The GCM forms an empirical version of E(εξ) given data (x_i, y_i, z_i)_{i=1}^n by first regressing each of X^(n) and Y^(n) onto Z^(n) to give estimates f̂ and ĝ of f and g respectively. Using the corresponding residuals ε̂_i := x_i − f̂(z_i) and ξ̂_i := y_i − ĝ(z_i), the product R_i := ε̂_i ξ̂_i is computed for each i = 1, ..., n and then averaged to give R̄ := Σ_{i=1}^n R_i / n, an estimate of E(εξ). The standard deviation of R̄ under the null X ⊥⊥ Y | Z may also be estimated, and it can be shown [39, Thm 8] that under some conditions, R̄ divided by its estimated standard deviation converges uniformly to a standard Gaussian distribution.
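For concreteness, a minimal R sketch of the resulting univariate GCM test, given residuals from any two regressions on Z, might look as follows; the two-sided normal p-value reflects the asymptotic standard Gaussian limit just mentioned.

    gcm_p_value <- function(eps, xi) {
      # eps, xi: residuals from regressing X on Z and Y on Z, respectively.
      n <- length(eps)
      R <- eps * xi
      T_stat <- sqrt(n) * mean(R) / sd(R)   # studentised average of the residual products
      2 * pnorm(-abs(T_stat))               # two-sided p-value from the normal limit
    }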
This basic approach can be extended to the case where X and Y take values in R^{d_X} and R^{d_Y} respectively, for d_X, d_Y ∈ N, by considering a multivariate conditional covariance,

Cov(X, Y | Z) := E[{X − E(X | Z)}{Y − E(Y | Z)}^T | Z] = E(εξ^T | Z) ∈ R^{d_X × d_Y}.

This is a zero matrix when X ⊥⊥ Y | Z, and hence E(εξ^T) = 0 under this null. Thus R̄, defined as before but with R_i := ε̂_i ξ̂_i^T, can form the basis of a test of conditional independence. There are several ways to construct a final test statistic using R̄ ∈ R^{d_X × d_Y}. The approach taken in Shah and Peters [39] involves taking the maximum absolute value of a version of R̄ with each entry divided by its estimated standard deviation. This, however, does not generalise easily to the functional data setting we are interested in here; we now outline an alternative that can be extended to handle functional data.

To motivate our approach, consider multiplying R̄ by √n:

√n R̄ = (1/√n) Σ_{i=1}^n ε̂_i ξ̂_i^T = (1/√n) Σ_{i=1}^n (f(z_i) − f̂(z_i) + ε_i)(g(z_i) − ĝ(z_i) + ξ_i)^T
      = (1/√n) Σ_{i=1}^n ε_i ξ_i^T + (1/√n) Σ_{i=1}^n (f(z_i) − f̂(z_i))(g(z_i) − ĝ(z_i))^T
        + (1/√n) Σ_{i=1}^n (f(z_i) − f̂(z_i)) ξ_i^T + (1/√n) Σ_{i=1}^n ε_i (g(z_i) − ĝ(z_i))^T
      =: U_n + a_n + b_n + c_n.

Observe that U_n is a normalised sum of i.i.d. terms, and so the multivariate central limit theorem dictates that U_n converges to a (d_X × d_Y)-dimensional Gaussian distribution. Applying the Frobenius norm ‖·‖_F to the a_n term, we get, by submultiplicativity and the Cauchy–Schwarz inequality,

‖a_n‖_F ≤ (1/√n) Σ_{i=1}^n ‖f(z_i) − f̂(z_i)‖ ‖g(z_i) − ĝ(z_i)‖
        ≤ √n ((1/n) Σ_{i=1}^n ‖f(z_i) − f̂(z_i)‖²)^{1/2} ((1/n) Σ_{i=1}^n ‖g(z_i) − ĝ(z_i)‖²)^{1/2}.   (2)

The right-hand side here is √n times the product of the root in-sample mean squared prediction errors of the two regressions performed. Under the null of conditional independence, each term of b_n and c_n is mean zero conditional on (X^(n), Z^(n)) and (Y^(n), Z^(n)), respectively. Thus, so long as both regression functions are estimated at sufficiently fast rates, we can expect a_n, b_n and c_n to be small, so that the distribution of √n R̄ is well-approximated by the Gaussian limiting distribution of U_n. As in the univariate setting, it is crucially the product of the prediction errors in (2) that is required to be small, so each root mean squared prediction error term can decay at a relatively slow o(n^{-1/4}) rate.

Unlike in the univariate setting, however, √n R̄ is now a matrix, and hence we need to choose some sensible aggregator function t : R^{d_X × d_Y} → R such that we can threshold t(√n R̄) to yield a p-value. One option is as follows; we take a different approach as the basis of the GHCM, for reasons which will become clear in the sequel. If we vectorise R̄, i.e., view the matrix as a (d_X d_Y)-dimensional vector, then under the assumptions required for the above heuristic arguments to hold formally, √n Vec(R̄) converges to a Gaussian with mean zero and some covariance matrix C if X ⊥⊥ Y | Z. Provided C is invertible, √n C^{-1/2} Vec(R̄) therefore converges to a Gaussian with identity covariance under the null, and hence ‖C^{-1/2} √n Vec(R̄)‖² converges to a χ² distribution with d_X d_Y degrees of freedom.
Replacing C with an estimate Ĉ then yields a test statistic from which we may derive a p-value.

We now turn to the functional setting where X and Y take values in separable Hilbert spaces H_X and H_Y respectively; these could, for example, be L²([0, 1]). We regress X^(n) and Y^(n) onto Z^(n) as before to give residuals ε̂_i ∈ H_X and ξ̂_i ∈ H_Y for i = 1, ..., n. With these we proceed as in the multivariate case outlined above, but replacing matrix outer products with outer products in the Hilbertian sense; that is, we define, for i = 1, ..., n,

R_i := ε̂_i ⊗ ξ̂_i, and 𝒯_n := √n R̄,   (3)

where

R̄ := (1/n) Σ_{i=1}^n R_i.

We can show (see Theorem 2) that under the null, provided the analogous prediction error terms in (2) decay sufficiently fast and additional regularity conditions hold, 𝒯_n above converges uniformly to a Gaussian distribution in the space of Hilbert–Schmidt operators. The covariance operator of this limiting Gaussian distribution can be estimated by the empirical covariance operator

Ĉ := (1/(n−1)) Σ_{i=1}^n (R_i − R̄) ⊗_HS (R_i − R̄),   (4)

where ⊗_HS denotes the outer product in the space of Hilbert–Schmidt operators.

An analogous approach to that outlined above for the multivariate setting would involve attempting to whiten this limiting distribution using the square root of the inverse of Ĉ; however, this inverse typically does not exist, due to the compactness of covariance operators. Furthermore, even when testing whether a high-dimensional Gaussian vector W with unknown variance has mean zero (rather than in the infinite-dimensional setting here), it is not recommended to attempt to whiten the distribution, as the estimated inverse covariance may not approximate its population-level counterpart sufficiently well, as shown by Bai and Saranadasa [1]. Instead, Bai and Saranadasa [1] advocate using a test statistic based on the squared norm ‖W‖². We take an analogous approach here, and use as our test statistic

T_n := ‖𝒯_n‖_HS,   (5)

where ‖·‖_HS denotes the Hilbert–Schmidt norm. Using this, we may define an α-level test function ψ_n via

ψ_n := 1{T_n ≥ q̂_α},   (6)

where q̂_α is the 1 − α quantile of the square root of an infinite weighted sum

Σ_{k=1}^∞ λ̂_k W_k

of independent χ²_1-distributed random variables (W_k)_{k∈N}, with weights given by the eigenvalues (λ̂_k)_{k∈N} of Ĉ. In practice, we may approximate q̂_α using a Monte Carlo approach, for example.

In Section 4 we justify the GHCM test described above theoretically. Whilst the basic principles of the argument outlined above for the multivariate setting remain the same in the functional setting, the required analytic tools are substantially different. Before we present results on the size and power of the proposed test, we give in the next section a description of the method when applied not to fully observed functional data, as we have assumed here, but to vectors of function values, as would typically be encountered in practice.
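Before turning to the discretised implementation, we note that the quantile q̂_α appearing in (6) is straightforward to approximate by Monte Carlo once the leading eigenvalues of Ĉ have been computed; a minimal R sketch, truncating the infinite sum at the available eigenvalues:

    q_alpha_hat <- function(lambda_hat, alpha = 0.05, B = 10000) {
      # Square root of a truncated weighted sum of independent chi-squared_1 variables.
      draws <- sqrt(replicate(B, sum(lambda_hat * rchisq(length(lambda_hat), df = 1))))
      unname(quantile(draws, 1 - alpha))
    }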
Practical considerations

For ease of exposition, we restrict our discussion to the setting where Y ∈ R^{d_Y} and X is functional, but our functional data are only observed at discrete points. The case where both variables are functional can be handled analogously.

Regression methods in the FDA literature are typically well-equipped to handle this sort of missing data problem, either by first using a smoother to produce curves approximating the true functions [33], or by working directly with the discrete samples [17].

In either case, it is likely that the regression method employed to regress the functional X^(n) onto Z^(n) will return discretely sampled residual functions instead of the desired fully observed residuals. To accommodate this, the residuals must first be smoothed and then represented in a finite basis expansion such that their outer product can be computed. A convenient way of achieving this is to use functional principal components analysis (FPCA) [42, 48], which yields representations of the residuals ε̃_1, ..., ε̃_n ∈ R^{d_ε}, where d_ε can, for example, be chosen such that some proportion of the variance of the sample is retained (we use 95% as the default in the experiments presented in Section 5).

To calculate our test statistic, we set R_i := Vec(ε̃_i ξ̂_i^T) ∈ R^{d_ε · d_Y}. We then compute an approximate version T̂_n of the test statistic T_n in (5) and its estimated covariance matrix via

T̂_n := ( Σ_{j=1}^{d_ε d_Y} ( (1/√n) Σ_{i=1}^n R_{ij} )² )^{1/2},   (7)

Ĉ := (1/(n−1)) Σ_{i=1}^n (R_i − R̄)(R_i − R̄)^T,   (8)

where R̄ = Σ_{i=1}^n R_i / n. Note that T̂_n above is precisely equal to T_n in (5) when, in the construction of the latter, the true residuals (ε̂_i, ξ̂_i)_{i=1}^n are replaced with their FPCA representations above. The only source of error in the approximation when d_ε = n is the smoothing of the discrete observations performed in FPCA, and thus if the functions are sufficiently densely observed, we would recover the exact test statistic.

To calibrate the test, we first take B independent draws from a N(0, Ĉ) distribution and calculate the Euclidean norm of each (we use B = 10 000 as the default in the experiments presented in Section 5); these will (approximately) have the limiting null distribution of the test statistic T_n ≈ T̂_n. We then form a p-value by comparing the observed T̂_n with the simulated versions. We summarise this procedure in Algorithm 1. Note that the pseudocode presented and the discussion above are for the case where X and Z are functional and Y ∈ R^{d_Y}. If Y is also functional, an additional FPCA step should be included for the residuals (ξ̂_i)_{i=1}^n, similarly to step 3 of the algorithm for (ε̂_i)_{i=1}^n. An R package implementing the methodology is available from https://github.com/ARLundborg/ghcm.

Algorithm 1: Generalised Hilbertian Covariance Measure (GHCM)
input: n observations of (X_i, Y_i, Z_i), where, for each i, X_i consists of T_i^X observations X_i(t_{ij}) and Z_i consists of T_i^Z observations Z_i(s_{ij});
options: number of samples B for the computation of the Monte Carlo p-value, FPCA method to apply to the residuals, regression methods for each of the regressions;
1. regress X^(n) on Z^(n), producing residuals ε̂_i(t_{ij}) for j = 1, ..., T_i^X and i = 1, ..., n;
2. regress Y^(n) on Z^(n), producing residuals ξ̂_i ∈ R^{d_Y} for i = 1, ..., n;
3. find a functional principal component basis for ε̂_1, ..., ε̂_n, yielding ε̃_1, ..., ε̃_n ∈ R^{d_ε};
4. set R_i ← Vec(ε̃_i ξ̂_i^T) for each i = 1, ..., n, yielding n vectors in R^{d_ε · d_Y};
5. compute T̂_n and Ĉ via (7) and (8);
6. simulate N(0, Ĉ) random variables W_1, ..., W_B, and compute the p-value p ← ( Σ_{b=1}^B 1{‖W_b‖ > T̂_n} ) / (B + 1);
output: p-value p;
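The following R sketch implements steps 3-6 of Algorithm 1, given residual curves from the X-on-Z regression on a common grid (an n × p matrix eps) and residual vectors from the Y-on-Z regression (an n × d_Y matrix xi; for scalar Y, an n × 1 matrix). The FPCA step is approximated by an ordinary principal components analysis of the discretised residuals, and the MASS package supplies the Gaussian draws; this is a simplified sketch rather than the package implementation.

    ghcm_p_value <- function(eps, xi, var_explained = 0.95, B = 10000) {
      n <- nrow(eps)
      # Step 3: FPCA of the residual curves via PCA of the discretised values.
      pca <- prcomp(eps, center = FALSE)
      d_eps <- which(cumsum(pca$sdev^2) / sum(pca$sdev^2) >= var_explained)[1]
      scores <- pca$x[, seq_len(d_eps), drop = FALSE]
      # Step 4: R_i = Vec(eps_tilde_i xi_i^T), stored as the rows of R.
      R <- matrix(0, n, d_eps * ncol(xi))
      for (i in seq_len(n)) R[i, ] <- as.vector(tcrossprod(scores[i, ], xi[i, ]))
      # Step 5: test statistic (7) and covariance estimate (8).
      T_hat <- sqrt(sum(colSums(R)^2) / n)
      C_hat <- cov(R)
      # Step 6: Monte Carlo p-value from Gaussian draws with covariance C_hat.
      W <- MASS::mvrnorm(B, mu = rep(0, ncol(R)), Sigma = C_hat)
      sum(sqrt(rowSums(W^2)) > T_hat) / (B + 1)
    }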
We now turn to the theoretical analysis of the performance of the GHCM. We consider functional data consisting of fully observed functions here, and so our analysis is most relevant for the practical setting where the functions are observed densely. In Section 2 we saw that even when P consists of Gaussian distributions over H_X × R^{d_Y} × H_Z, we cannot ensure that our test has both the desired size α over P_0 and non-trivial power against alternative distributions in Q. We also have the following related result.

Proposition 1. Let H_Z be a separable Hilbert space with orthonormal basis (e_k)_{k∈N}. Let P be the family of Gaussian distributions for (X, Y, Z) ∈ R × R × H_Z with non-singular covariance operator and where (X, Y) ⊥⊥ (Z_{r+1}, Z_{r+2}, ...) | Z_1, ..., Z_r for some r ∈ N, with Z_k := ⟨e_k, Z⟩ for all k ∈ N. For Q ∈ Q := P \ P_0, write P_Q for the subset of the null hypothesis P_0 ⊂ P consisting of those distributions for which the marginal distribution of Z agrees with that under Q. Then, for any test ψ_n,

P_Q(ψ_n = 1) ≤ sup_{P ∈ P_Q} P_P(ψ_n = 1).

In other words, even if we know a basis (e_k)_{k∈N} such that, in particular, the conditional expectations E(X | Z) and E(Y | Z) are sparse, in that they depend only on finitely many components Z_1, ..., Z_r (with r ∈ N unknown), and the marginal distribution of Z is known exactly, there is still no non-trivial test of conditional independence.

In this specialised setting it is, however, possible to give a test of conditional independence that will, for each fixed null hypothesis P ∈ P_0, yield exact size control and power against all alternatives Q for n sufficiently large. These properties are, for example, satisfied by the significance test ψ_n^OLS for Y in a linear model of X on Y, Z_1, ..., Z_{a(n)} and an intercept term, for some sequence a(n) < n with a(n) → ∞ and n − a(n) → ∞ as n → ∞ (a simple implementation is sketched below). Indeed, if ψ_n^OLS is a nominal α-level test, we have

sup_{P ∈ P_0} lim_{n→∞} P_P(ψ_n^OLS = 1) = α and inf_{Q ∈ Q} lim_{n→∞} P_Q(ψ_n^OLS = 1) = 1;   (9)

see Section C.1 in the supplementary material for a derivation. This illustrates the difference between the pointwise asymptotic level control on the left-hand side of (9), and uniform asymptotic level control, given by interchanging the limit and the supremum.

Our analysis instead focuses on proving that the GHCM asymptotically maintains its level uniformly over a subset of the conditional independence null. In order to state our results, we first introduce some definitions and notation to do with uniform stochastic convergence. Throughout the remainder of this section we tacitly assume the existence of a background space (Ω, F) on which all random quantities are defined. The background space is equipped with a family of probability measures (P_P)_{P∈P} such that the distribution of (X, Y, Z) under P_P is P. For a subset A ⊆ P, we say that a sequence of random variables W_n converges uniformly in distribution to W over A, and write W_n ⇒_D W over A, if

lim_{n→∞} sup_{P∈A} d_BL(W_n, W) = 0,

where d_BL denotes the bounded Lipschitz metric. We say that W_n converges uniformly in probability to W over A, and write W_n ⇒_P W over A, if for any ε > 0,

lim_{n→∞} sup_{P∈A} P_P(‖W_n − W‖ ≥ ε) = 0.

We sometimes omit the specification of A when it is clear from the context. A full treatment of uniform stochastic convergence in a general setting is given in Section B of the supplementary material. Throughout this section we emphasise the dependence of many of the quantities in Section 3.1 on the distribution of (X, Y, Z) with a subscript P, e.g. f_P, ε_P, etc.

In Sections 4.1 and 4.2 we present general results on the size and power of the GHCM. We take P to be the set of all distributions over H_X × H_Y × 𝒵, and P_0 to be the corresponding conditional independence null. We show properties of the GHCM, however, under a smaller set of distributions P̃ ⊂ P with corresponding null distributions P̃_0 ⊂ P_0, where in particular certain conditions on the quality of the regression procedures on which the test is based are met. In Section 4.3 we consider the special case where the regressions of each of X and Y on Z are given by functional linear models, and we show that Tikhonov-regularised regression can satisfy these conditions.
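Returning briefly to the pointwise test ψ_n^OLS mentioned above, a minimal R sketch for scalar X and Y follows: it regresses X on Y and the first a(n) coordinates of Z and reads off the usual t-test p-value for the coefficient on Y. The choice a(n) = ⌊√n⌋ used here is one hypothetical sequence satisfying a(n) → ∞ and n − a(n) → ∞.

    ols_ci_test <- function(x, y, Z_scores) {
      # x, y: numeric vectors; Z_scores: n x K matrix of coordinates <e_k, Z>.
      n <- length(x)
      a_n <- floor(sqrt(n))                       # a(n) grows, but slower than n
      fit <- lm(x ~ y + Z_scores[, seq_len(a_n)])
      summary(fit)$coefficients["y", "Pr(>|t|)"]  # t-test for the coefficient on y
    }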
Size of the GHCM

In order to state our result on the size of the GHCM, we introduce the following quantities. Let

u_P(z) := E_P(‖ε_P‖² | Z = z), v_P(z) := E_P(‖ξ_P‖² | Z = z).

We further define the in-sample unweighted and weighted mean squared prediction errors of the regressions as follows:

M^f_{n,P} := (1/n) Σ_{i=1}^n ‖f_P(z_i) − f̂^(n)(z_i)‖², M^g_{n,P} := (1/n) Σ_{i=1}^n ‖g_P(z_i) − ĝ^(n)(z_i)‖²,   (10)

M̃^f_{n,P} := (1/n) Σ_{i=1}^n ‖f_P(z_i) − f̂^(n)(z_i)‖² v_P(z_i), M̃^g_{n,P} := (1/n) Σ_{i=1}^n ‖g_P(z_i) − ĝ^(n)(z_i)‖² u_P(z_i).   (11)

The result below shows that on a subset P̃_0 of the null, distinguished primarily by the product of the prediction errors in (10) being small, the operator-valued statistic 𝒯_n converges in distribution uniformly to a mean-zero Gaussian whose covariance can be estimated consistently. We remark that the prediction error quantities in (10) and (11) are in-sample prediction errors, only reflecting the quality of the estimates of the conditional expectations f_P and g_P at the observed values z_1, ..., z_n.

Theorem 2.
Let P̃_0 ⊆ P_0 be such that, uniformly over P̃_0,

(i) n M^f_{n,P} M^g_{n,P} ⇒_P 0,
(ii) M̃^f_{n,P} ⇒_P 0 and M̃^g_{n,P} ⇒_P 0,
(iii) inf_{P∈P̃_0} E_P(‖ε_P‖² ‖ξ_P‖²) > 0 and sup_{P∈P̃_0} E_P(‖ε_P‖^{2+η} ‖ξ_P‖^{2+η}) < ∞ for some η > 0, and
(iv) for some orthonormal bases (e_{X,k})_{k∈N} and (e_{Y,k})_{k∈N} of H_X and H_Y respectively, writing ε_{P,k} := ⟨e_{X,k}, ε_P⟩ and ξ_{P,k} := ⟨e_{Y,k}, ξ_P⟩, we have

lim_{K→∞} sup_{P∈P̃_0} Σ_{k=K}^∞ E(ε²_{P,k} ξ²_{P,k}) = 0.

Then, uniformly over P̃_0, we have

𝒯_n ⇒_D N(0, C_P) and ‖Ĉ − C_P‖_TR ⇒_P 0,

where C_P := E{(ε_P ⊗ ξ_P) ⊗_HS (ε_P ⊗ ξ_P)}.
Condition (i) is satisfied if √n M^f_{n,P} ⇒_P 0 and √n M^g_{n,P} ⇒_P 0, and so allows for relatively slow o(n^{-1/2}) rates for the mean squared prediction errors. Moreover, if one regression yields a faster rate, the other can go to zero more slowly. These properties are shared with the regular Generalised Covariance Measure and, more generally, with the doubly robust procedures popular in the literature on causal inference and semiparametric statistics [36, 37, 6]. We note that the regression methods are not required to extrapolate well beyond the observed data. We show in Section 4.3 that when the regression models are functional linear models, (i) and (ii) hold under weaker conditions than are typically required for out-of-sample prediction error guarantees in the literature.

Condition (iv), together with (ii), implies that the family {ε_P ⊗ ξ_P : P ∈ P̃_0} is uniformly tight. Similar tightness conditions occur in Chen and White [5, Lem. 3.1] in the context of functional central limit theorems.

The result below shows that the GHCM test ψ_n based on the test statistic T_n := ‖𝒯_n‖_HS has type I error control uniformly over the class P̃_0 given in Theorem 2, provided an additional assumption of non-degeneracy of the covariance operators is satisfied.

Theorem 3.
Let P̃_0 ⊆ P_0 satisfy the conditions stated in Theorem 2, and in addition suppose that

inf_{P∈P̃_0} ‖C_P‖_op > 0.   (12)

Then for each α ∈ (0, 1), the α-level GHCM test ψ_n of (6) satisfies

lim_{n→∞} sup_{P∈P̃_0} |P_P(ψ_n = 1) − α| = 0.   (13)

Power of the GHCM

We now study the power of the GHCM. It is not straightforward to analyse what happens to the test statistic T_n when the null hypothesis is false in the setup we have considered so far. However, if we modify the test such that the regression function estimates f̂ and ĝ are constructed using an auxiliary dataset independent of the main data (x_i, y_i, z_i)_{i=1}^n, the behaviour of T_n is more tractable. Given a single sample, this could be achieved through sample splitting, and cross-fitting [6] could be used to recover the loss in efficiency from the split into smaller datasets. However, we do not recommend such sample splitting in practice here, and view it as more of a technical device that facilitates our theoretical analysis. As we require f̂ and ĝ to satisfy (i) and (ii) of Theorem 2, these estimators would need to perform well out of sample rather than just on the observed data, which is typically a harder task.

Given that our test is based on an empirical version of E(Cov(X, Y | Z)) = E(ε ⊗ ξ), we can only hope to have power against alternatives where this quantity is non-zero. For such alternatives, however, we have positive power whenever the Hilbert–Schmidt norm of the expected conditional covariance operator is at least C/√n for a constant C >
0, as the following result shows.
Theorem 4.
Consider a version of the GHCM test ψ_n where f̂ and ĝ are constructed on independent auxiliary data. Let P̃ ⊂ P be the set of distributions for (X, Y, Z) satisfying (i)-(iv) of Theorem 2 and (12) with P̃ in place of P̃_0. Then, writing

K_P := E_P(ε_P ⊗ ξ_P) = E_P(Cov_P(X, Y | Z)),

we have, uniformly over P̃,

𝒯̃_n := (1/√n) Σ_{i=1}^n (R_i − K_P) ⇒_D N(0, C_P) and ‖Ĉ − C_P‖_TR ⇒_P 0.

Furthermore, an α-level GHCM test ψ_n (constructed using independent estimates f̂ and ĝ) satisfies the following two statements.

(i) Redefining P̃_0 := P̃ ∩ P_0, we have that (13) is satisfied, and so an α-level GHCM test has size converging to α uniformly over P̃_0.

(ii) For every 0 < α < β < 1 there exist C > 0 and N ∈ N such that for any n ≥ N,

inf_{P ∈ Q_{C,n}} P_P(ψ_n = 1) ≥ β, where Q_{C,n} := {P ∈ P̃ : ‖K_P‖_HS > C/√n}.

In-sample prediction error in functional linear models

Here we consider a special case of the general setup used in Sections 4.1 and 4.2, and assume that under the null of conditional independence, the functional variables X and Y are related to Z via functional linear models:

X = S_P^X Z + ε_P,   (14)
Y = S_P^Y Z + ξ_P.   (15)

Here S_P^X is a Hilbert–Schmidt operator such that S_P^X Z = f_P(Z) := E(X | Z), with analogous properties holding for S_P^Y. If X, Y and Z are elements of L²([0, 1]), then (14), for example, may be written as

X(t) = ∫₀¹ β_P(s, t) Z(s) ds + ε_P(t),

where β_P is a square-integrable function, and similarly for the relationship between Y and Z. Such functional response linear models have been discussed by Ramsay and Silverman [33, Chap. 16], and studied by Chiou et al. [7], Yao et al. [49] and Crambes and Mas [9], for example. Benatia et al. [2] propose a Tikhonov-regularised estimator analogous to ridge regression [21]; applied to the regression model (14), this estimator takes the form

Ŝ_γ := argmin_{S̃} ( Σ_{i=1}^n ‖x_i − S̃(z_i)‖² + γ ‖S̃‖²_HS ),   (16)

where γ > 0 is a regularisation parameter. We construct estimators Ŝ^X and Ŝ^Y of S_P^X and S_P^Y by solving the optimisation in (16) with regularisation parameter

γ̂ := argmin_{γ>0} ( (1/(γn)) Σ_{i=1}^n min(μ̂_i^{1/2}, γ) + γ ),

where μ̂_1 ≥ μ̂_2 ≥ ··· ≥ μ̂_n ≥ 0 are the eigenvalues of the n × n matrix K with entries K_ij = ⟨z_i, z_j⟩/n. This data-driven choice of γ̂ is motivated by an upper bound on the in-sample mean squared prediction error (MSPE) of the estimators Ŝ^X and Ŝ^Y (see Lemma 14), where we have omitted some distribution-dependent factors of ‖S_P^X‖_HS or ‖S_P^Y‖_HS and a variance factor; a similar strategy was used in an analysis of kernel ridge regression [39] which closely parallels ours here. This choice allows us to conduct the theoretical analysis that we present below. In practice, other choices of regularisation parameter, such as cross-validation-based approaches, may perform even better, and so may alternative methods that are not based on Tikhonov regularisation.

In the following result we take ψ_n to be the α-level GHCM test (6) with estimated regression functions f̂ and ĝ given by

f̂(z) = Ŝ^X z and ĝ(z) = Ŝ^Y z for all z ∈ H_Z.   (17)

Lemma 14 in Section C.4 of the supplementary material provides an in-sample MSPE bound that holds under mild conditions when using this estimator. This allows us to give conditions under which the GHCM using this estimator has uniform asymptotic level.

Proposition 2.
Let P̃_0 ⊂ P_0 be such that (14) and (15) are satisfied, and moreover (i),...,(iv) of Theorem 2 and (12) hold when f̂ and ĝ are as in (17). Suppose further that

(i) sup_{P∈P̃_0} max(‖S_P^X‖_HS, ‖S_P^Y‖_HS) < ∞,
(ii) sup_{P∈P̃_0} max(u_P(Z_P), v_P(Z_P)) < ∞ almost surely,
(iii) sup_{P∈P̃_0} E‖Z‖² < ∞ and lim_{γ↓0} sup_{P∈P̃_0} Σ_{k=1}^∞ min(μ_{k,P}, γ) = 0, where (μ_{k,P})_{k∈N} denote the ordered eigenvalues of the covariance operator of Z under P.

Then the α-level GHCM test ψ_n satisfies

lim_{n→∞} sup_{P∈P̃_0} |P_P(ψ_n = 1) − α| = 0.

Condition (iii) is satisfied, by the dominated convergence theorem, for any family P̃_0 for which the family of covariance operators of Z is finite, or for which the sequences of eigenvalues of the covariance operators are uniformly bounded above by a summable sequence. The proof of Proposition 2 relies on Lemma 14 in Section C.4 of the supplementary material, which gives a bound on the in-sample MSPE of ridge regression in terms of the decay of the eigenvalues μ_{k,P}. For example, if these are dominated by an exponentially decaying sequence, the in-sample MSPE is o(log n / n) as n → ∞. This matches the out-of-sample MSPE bound obtained in Crambes and Mas [9, Corollary 5] in the same setting, but the out-of-sample result additionally requires convexity of, and lower bounds on, the decay of the sequence of eigenvalues of the covariance operator, and stronger moment assumptions on the norm of the predictor. Similarly, other related results [e.g., 4, 20] require additional eigen-spacing conditions in place of convexity, and upper and lower bounds on the decay of the eigenvalues. This illustrates how in-sample and out-of-sample prediction are rather different in the functional data setting, and reliance on the former being small, as we have with the GHCM, is desirable due to the weaker conditions needed to guarantee this.
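A compact R sketch of this Tikhonov-regularised estimator computes the in-sample fits through the Gram matrix K, so the curves never need to be basis-expanded explicitly. The selection criterion for γ̂ follows our reading of the display above, and the scaling of the penalty relative to K is glossed over, so this should be read as illustrative rather than as the exact procedure analysed in Lemma 14.

    ridge_fit <- function(X, Z) {
      # X: n x p matrix of discretised response curves; Z: n x q matrix of covariate curves.
      n <- nrow(Z)
      K <- tcrossprod(Z) / n                          # K_ij = <z_i, z_j>/n (grid approximation)
      mu <- pmax(eigen(K, symmetric = TRUE, only.values = TRUE)$values, 0)
      crit <- function(g) sum(pmin(sqrt(mu), g)) / (g * n) + g
      gamma_hat <- optimize(crit, interval = c(1e-8, max(mu) + 1))$minimum
      fits <- K %*% solve(K + gamma_hat * diag(n), X) # in-sample fitted values S_hat(z_i)
      list(fitted = fits, residuals = X - fits, gamma = gamma_hat)
    }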
Numerical experiments

In this section we present the results of numerical experiments investigating the performance of our proposed GHCM methodology. All functional random variables we consider take values in L²([0, 1]). Scalar-on-function and function-on-function regressions are performed using the pfr and pffr functions, respectively, from the refund package [16]. These are functional linear regression methods which rely on fitting smoothers implemented in the mgcv package [46]; we choose the tuning parameters for these smoothers (the dimensions of the basis expansions of the smooth terms) as per the standard guidance, such that a further increase does not decrease the deviance.

Size and power

In this section we examine the size and power properties of the GHCM when testing the conditional independence X ⊥⊥ Y | Z. We take X, Z ∈ L²([0, 1]) and consider first the case where Y is scalar; in Section 5.1.2 we present experiments for the case where Y ∈ L²([0, 1]) as well.

Scalar Y, functional X and Z

Here we consider the setup where Z is standard Brownian motion and X and Y are related to Z through the functional linear models

X(t) = ∫₀¹ β_a(s, t) Z(s) ds + N_X(t),   (18)
Y = ∫₀¹ α_a(t) Z(t) dt + N_Y.   (19)

The variables N_X, N_Y and Z are independent, with N_X a Brownian motion with variance σ²_X, N_Y ∼ N(0,
1) and nonlinear coefficient functions β_a and α_a given by

β_a(s, t) = a exp(−(st)²/2) sin(ast), α_a(t) = ∫₀¹ β_a(s, t) ds.   (20)

Thus X ⊥⊥ Y | Z. We vary the parameters σ_X ∈ {0.1, 0.25, 0.5, 1} and a ∈ {2, 6, 12}, and generate n i.i.d. observations from each of the 4 × 3 = 12 settings for each n ∈ {100, 250, 500, 1000}. Increasing a or decreasing σ_X increases the difficulty of the testing problem: for large a, β_a oscillates more, making it harder to remove the dependence of X on Z; smaller σ_X makes Y closer to the integral of X, and so increases the marginal dependence of X and Y.

We apply the GHCM and compare the resulting tests to those corresponding to the significance test for X in a regression of Y on (X, Z) implemented in pfr. The rejection rates of the two tests at the 5% level, averaged over 100 simulation runs, can be seen in Figure 1. We see that the pfr test has size greatly exceeding its level in the more challenging large-a, small-σ_X settings, with large values of n exposing most clearly the mis-calibration of the test statistic. On the other hand, the GHCM tests maintain reasonable type I error control across the settings considered here, though it does exceed the expected maximum rejection rate across all settings of 0.
104 when the underlying p-values are uniform in two settings.
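For reference, a self-contained R sketch of one null setting from this design, with grid size, seed and parameter values chosen arbitrarily, and integrals approximated by Riemann sums:

    set.seed(1)
    n <- 500; p <- 101; tt <- seq(0, 1, length.out = p); a <- 6; sigma_X <- 0.5
    # Brownian motion paths on the grid, with variance sd^2 * t.
    bm <- function(n, p, sd = 1)
      cbind(0, t(apply(matrix(rnorm(n * (p - 1), sd = sd / sqrt(p - 1)), n), 1, cumsum)))
    Z <- bm(n, p)
    beta <- outer(tt, tt, function(s, t) a * exp(-(s * t)^2 / 2) * sin(a * s * t))
    X <- Z %*% beta / (p - 1) + bm(n, p, sd = sigma_X)  # model (18)
    alpha_a <- colMeans(beta)                           # alpha_a(t) = int_0^1 beta_a(s, t) ds
    Y <- Z %*% alpha_a / (p - 1) + rnorm(n)             # model (19): X indep. of Y given Z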
Figure 1: Rejection rates in the various null settings considered in Section 5.1.1 for the nominal 5%-level pfr test (top) and GHCM test (bottom).

To investigate the power properties of the test, we simulate Z as before, with X also generated according to (18). We replace the regression model (19) for Y with

Y = ∫₀¹ α_a(t) Z(t) dt + ∫₀¹ (α_a(t)/a) X(t) dt + N_Y,   (21)

where N_Y ∼ N(0,
1) as before. The rejection rates at the 5% level can be seen in Figure 2. While the two approaches perform similarly when a = 2, the pfr test has higher power in the more complex cases. However, as the results from the size analysis in Figure 1 show, null cases are also rejected in the analogous settings.

To illustrate the full distribution of p-values from the two methods under the null and the alternative, we plot false positive rates and true positive rates in each setting as a function of the chosen significance level α of the test. The full set of results can be seen in Section D of the supplementary material, and a plot for a subset of the simulation settings, with n = 500 and three of the four values of σ_X, is presented in Figure 3. We see that both tests distinguish null from alternative well in the cases with a small and σ_X large. The p-values of the GHCM are close to uniform in the settings considered, whereas the distribution of the pfr p-values is heavily dependent on the particular null setting, illustrating the difficulty with calibrating this test.

Functional X, Y and Z

In this section we modify the setup and consider functional Y ∈ L²([0, 1]), with X and Z as in Section 5.1.1, but in the null settings we let

Y(t) = ∫₀¹ β_a(s, t) Z(s) ds + N_Y(t),

where N_Y is a standard Brownian motion. In the alternative settings, we take

Y(t) = ∫₀¹ β_a(s, t) Z(s) ds + ∫₀¹ (β_a(s, t)/a) X(s) ds + N_Y(t),

with N_Y again being a standard Brownian motion.
100 250 500 1000 100 250 500 1000 100 250 500 10000.10.250.510.10.250.51 n s X Rejection rate
Figure 2: As Figure 1 but for the alternative settings, see (21). s X = s X = s X = a = = = Significance level R e j e c t i on r a t e method / setting pfr / null GHCM / nullpfr / alternative GHCM / alternative Figure 3: Rejection rates against significance level for the pfr (red) and GHCM (green) testsunder null (light) and alternative (dark) settings when n = 500.17 .080.060.070.04 0.050.070.040.04 0.040.030.080.06 0.050.10.030.040.10.170.570.99 0.110.460.961 0.140.70.991 0.250.9711 0.090.010.030.05 0.080.060.060.05 0.0700.020.02 0.060.040.080.040.040.210.771 0.090.4511 0.120.8911 0.23111 0.070.080.070.08 0.080.070.110.07 0.080.010.050.07 0.030.090.020.030.060.070.41 0.080.170.991 0.090.6711 0.14111 a: 2 a: 6 a: 12 s e tt i ng : nu ll s e tt i ng : a l t e r na t i v e
100 250 500 1000 100 250 500 1000 100 250 500 10000.10.250.510.10.250.51 n s X Rejection rate
Figure 4: Rejection rates in the various null (top) and alternative (bottom) settings consideredin Section 5.1.2 for the nominal 5%-level GHCM test.with N Y again being a standard Brownian motion.The rejection rates at the 5% level, averaged over 100 simulation runs, can be seen inFigure 4. We see that, as in the case where Y ∈ R , the GHCM maintains good type I errorcontrol in the settings considered, and has power increasing with n and σ X as expected. We notethat a comparison with the p -values from ff -terms in the pffr -function of the refund packagehere does not seem helpful. In our experiments the corresponding tests would consistently rejectin true null settings even for simple models. In this section we consider an application of the GHCM in constructing a confidence intervalfor the truncation point θ ∈ [0 ,
in a truncated functional linear model [19],

Y = ∫₀^θ α(t) X(t) dt + ε,   (22)

where the predictor X ∈ L²([0, 1]) is functional, Y ∈ R is a scalar response and ε ⊥⊥ X is stochastic noise. To frame this as a conditional independence testing problem, observe that (22) implies that, defining the null hypotheses

H_θ̃ : Y ⊥⊥ {X(t)}_{t>θ̃} | {X(t)}_{t≤θ̃}   (23)

for θ̃ ∈ (0, 1), H_θ̃ is true for all θ ≤ θ̃ ≤ 1. Given an α-level conditional independence test ψ, we may thus form a one-sided confidence interval for θ as
[ inf{θ̃ ∈ (0, 1) : ψ accepts the null H_θ̃}, 1 ].   (24)

Indeed, with probability 1 − α, ψ will not reject the true null H_θ, and so with probability 1 − α the infimum above will be at most θ.

To approximate (24), we initially consider the null hypothesis H_θ̃ at 5 equidistant values of θ̃, and then employ a bisection search between the smallest of these points θ̃ at which H_θ̃ is
275 (left) and θ = 0 .
675 (right), given by red vertical lines, in model (22) across 500simulations.accepted by a 5% level GHCM, and the point immediately before it or 0. We consider twoinstances of the model (22) with θ = 0 . , .
675 and with α ( t ) := 10( t + 1) − / , X a standardBrownian motion and ε ∼ N (0 , Testing the conditional independence X ⊥⊥ Y | Z has been shown to be a hard problem in thesetting where X, Y, Z are all real-valued and Z is absolutely continuous with respect to Lebesguemeasure [39]. This hardness persists in the functional setting, but takes a more extreme formin that even when ( X, Y, Z ) are jointly Gaussian with non-degenerate covariance and Z andat most one of X and Y are infinite-dimensional, there is no non-trivial test of conditionalindependence. This requires us to (i) understand the form of the ‘effective null hypothesis’for a given hypothesis test, and (ii) develop tests where these effective nulls are somewhatinterpretable so that domain knowledge can more easily inform the choice of a conditionalindependence test to use on any given dataset.In order to address these two needs, we introduce here a new family of tests for functionaldata and develop the necessary uniform convergence results to understand the form of the nullhypothesis that we can have type I error control over. We see that error control is guaranteedunder conditions largely determined by the in-sample prediction error rate of regressions uponwhich the test is based. Whilst in-sample and more common out-of-sample results share simi-larities in some settings, the lack of a need to extrapolate beyond the data in the former leadto important differences when regressing on functional data. In particular, no eigen-spacingconditions or lower bounds on the eigenvalues of the covariance of the regressor are required forthe in-sample error to be controlled when ridge regression is used. It would be interesting toinvestigate the in-sample MSPE properties of other regression methods and understand whethersuch conditions can be avoided more generally.Another direction which may be fruitful to pursue is to adapt the GHCM so that it haspower against alternatives where E Cov(
Discussion

Testing the conditional independence X ⊥⊥ Y | Z has been shown to be a hard problem in the setting where X, Y and Z are all real-valued and Z is absolutely continuous with respect to Lebesgue measure [39]. This hardness persists in the functional setting, but takes a more extreme form, in that even when (X, Y, Z) are jointly Gaussian with non-degenerate covariance, and Z and at most one of X and Y are infinite-dimensional, there is no non-trivial test of conditional independence. This requires us to (i) understand the form of the 'effective null hypothesis' of a given hypothesis test, and (ii) develop tests whose effective nulls are somewhat interpretable, so that domain knowledge can more easily inform the choice of a conditional independence test to use on any given dataset.

In order to address these two needs, we have introduced here a new family of tests for functional data and developed the necessary uniform convergence results to understand the form of the null hypothesis over which we can have type I error control. We have seen that error control is guaranteed under conditions largely determined by the in-sample prediction error rates of the regressions upon which the test is based. Whilst in-sample and the more common out-of-sample results share similarities in some settings, the lack of a need to extrapolate beyond the data in the former leads to important differences when regressing on functional data. In particular, no eigen-spacing conditions or lower bounds on the eigenvalues of the covariance of the regressor are required for the in-sample error to be controlled when ridge regression is used. It would be interesting to investigate the in-sample MSPE properties of other regression methods and to understand whether such conditions can be avoided more generally.

Another direction which may be fruitful to pursue is to adapt the GHCM so that it has power against alternatives where E{Cov(X, Y | Z)} = 0. It is likely that further conditions will be required of the regression methods beyond their prediction errors being small, and so some interpretability of the effective null hypothesis, and indeed its size compared to the full null of conditional independence, will need to be sacrificed. There are, however, settings where the severity of type I versus type II errors may be balanced such that this is an attractive option.

It would also be interesting to investigate the hardness of conditional independence testing in the setting where all of X, Y and Z are infinite-dimensional. For our hardness result here, at least one of X and Y must be finite-dimensional. It may be the case that requiring two infinite-dimensional variables to be conditionally independent is such a strong condition that the null is not prohibitively large compared to the entire space of Gaussian measures, and so genuine control of the type I error while maintaining power is in fact possible. Such a result, or indeed a proof that hardness persists, would certainly be of interest.

Acknowledgements
We thank Yoav Zemel, Alexander Aue and Sonja Greven for helpful discussions.
References

[1] Z. Bai and H. Saranadasa. Effect of high dimension: by an example of a two sample problem. Statistica Sinica, pages 311–329, 1996.

[2] D. Benatia, M. Carrasco, and J.-P. Florens. Functional linear regression with functional response. Journal of Econometrics, 201(2):269–291, 2017.

[3] S. Brockhaus and D. Rügamer. FDboost: Boosting Functional Regression Models, 2018. R package.

[4] T. T. Cai and P. Hall. Prediction in functional linear regression. Annals of Statistics, 34(5):2159–2179, 2006.

[5] X. Chen and H. White. Central limit and functional central limit theorems for Hilbert-valued dependent heterogeneous arrays with applications. Econometric Theory, pages 260–284, 1998.

[6] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters, 2018.

[7] J.-M. Chiou, H.-G. Müller, and J.-L. Wang. Functional response models. Statistica Sinica, pages 675–693, 2004.

[8] P. Constantinou and A. P. Dawid. Extended conditional independence and applications in causal inference. Annals of Statistics, 45(6):2618–2653, 2017.

[9] C. Crambes and A. Mas. Asymptotics of prediction in functional linear regression with functional outputs. Bernoulli, 19(5B):2627–2651, 2013.

[10] A. Delaigle and P. Hall. Methodology and theory for partial least squares applied to functional data. Annals of Statistics, 40(1):322–352, 2012.

[11] Y. Fan, G. M. James, and P. Radchenko. Functional additive regression. Annals of Statistics, 43(5):2296–2325, 2015.

[12] Y. Fan, G. M. James, and P. Radchenko. Functional additive regression. Annals of Statistics, 43(5):2296–2325, 2015.

[13] F. Ferraty and P. Vieu. Nonparametric Functional Data Analysis: Theory and Practice. Springer Series in Statistics. Springer, New York, 2006.

[14] F. Ferraty, A. Laksaci, A. Tadj, and P. Vieu. Kernel regression with functional response. Electronic Journal of Statistics, 5:159–171, 2011.

[15] J. Goldsmith, J. Bobb, C. M. Crainiceanu, B. Caffo, and D. Reich. Penalized functional regression. Journal of Computational and Graphical Statistics, 20(4):830–851, 2011.

[16] J. Goldsmith, F. Scheipl, L. Huang, J. Wrobel, C. Di, J. Gellar, J. Harezlak, M. W. McLean, B. Swihart, L. Xiao, C. Crainiceanu, and P. T. Reiss. refund: Regression with Functional Data, 2020. R package version 0.1-22.

[17] S. Greven and F. Scheipl. A general framework for functional regression modelling. Statistical Modelling, 17(1–2):1–35, 2017.

[18] T. Haavelmo. The probability approach in econometrics. Econometrica, 12:S1–S115 (supplement), 1944.

[19] P. Hall and G. Hooker. Truncated linear models for functional data. Journal of the Royal Statistical Society, Series B, 78(3):637–653, 2016.

[20] P. Hall and J. L. Horowitz. Methodology and convergence rates for functional linear regression. Annals of Statistics, 35(1):70–91, 2007.

[21] A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

[22] A. E. Ivanescu, A.-M. Staicu, F. Scheipl, and S. Greven. Penalized function-on-function regression. Computational Statistics, 30(2):539–568, 2015.

[23] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. The MIT Press, 2009.

[24] C. Kraft. Some Conditions for Consistency and Uniform Consistency of Statistical Procedures. University of California Press, 1955.

[25] S. Lauritzen. Graphical Models. Oxford Statistical Science Series. Clarendon Press, 1996.

[26] J. S. Morris. Functional regression. Annual Review of Statistics and Its Application, 2(1):321–359, 2015.

[27] M. Neykov, S. Balakrishnan, and L. Wasserman. Minimax optimal conditional independence testing. arXiv preprint arXiv:2001.03039, 2020.

[28] J. Pearl. Causality. Cambridge University Press, 2009.

[29] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Elsevier, 2014.

[30] J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (with discussion), 78(5):947–1012, 2016.

[31] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA, USA, 2017.

[32] X. Qiao, S. Guo, and G. M. James. Functional graphical models. Journal of the American Statistical Association, 114(525):211–222, 2019.

[33] J. O. Ramsay and B. W. Silverman. Functional Data Analysis. Springer, New York, 2005.

[34] P. T. Reiss and R. T. Ogden. Functional principal component regression and functional partial least squares. Journal of the American Statistical Association, 102(479):984–996, 2007.

[35] P. T. Reiss, L. Huang, and M. Mennes. Fast function-on-scalar regression with penalized basis expansions. The International Journal of Biostatistics, 6(1), 2010.

[36] J. M. Robins and A. Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.

[37] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120, 1999.

[38] F. Scheipl, A.-M. Staicu, and S. Greven. Functional additive mixed models. Journal of Computational and Graphical Statistics, 24(2):477–501, 2015.

[39] R. D. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. Annals of Statistics, 48(3):1514–1538, 2020.

[40] H. Shin. Partial functional linear regression. Journal of Statistical Planning and Inference, 139(10):3405–3418, 2009.

[41] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Adaptive Computation and Machine Learning. MIT Press, 2000.

[42] J. G. Staniswalis and J. J. Lee. Nonparametric regression analysis of longitudinal data. Journal of the American Statistical Association, 93(444):1403–1418, 1998.

[43] S. Ullah and C. F. Finch. Applications of functional data analysis: a systematic review. BMC Medical Research Methodology, 13(1):43, 2013.

[44] J.-L. Wang, J.-M. Chiou, and H.-G. Müller. Functional data analysis. Annual Review of Statistics and Its Application, 3(1):257–295, 2016.

[45] S. N. Wood. On p-values for smooth components of an extended generalized additive model. Biometrika, 100(1):221–228, 2013.

[46] S. N. Wood. Generalized Additive Models. Chapman and Hall/CRC, 2017.

[47] F. Yao and H.-G. Müller. Functional quadratic regression. Biometrika, 97(1):49–64, 2010.

[48] F. Yao, H.-G. Müller, and J.-L. Wang. Functional linear regression analysis for longitudinal data. Annals of Statistics, 33(6):2873–2903, 2005.

[49] F. Yao, H.-G. Müller, and J.-L. Wang. Functional linear regression analysis for longitudinal data. Annals of Statistics, 33(6):2873–2903, 2005.

[50] M. Yuan and T. T. Cai. A reproducing kernel Hilbert space approach to functional linear regression. Annals of Statistics, 38(6):3412–3444, 2010.

[51] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.

[52] J. Zapata, S.-Y. Oh, and A. Petersen. Partial separability and functional graphical models for multivariate Gaussian processes. arXiv preprint arXiv:1910.03134, 2019.

[53] H. Zhu, N. Strawn, and D. B. Dunson. Bayesian graphical models for multivariate functional data. The Journal of Machine Learning Research, 17(1):7157–7183, 2016.

Supplementary material for `Conditional Independence Testing in Hilbert Spaces with Applications to Functional Data Analysis'
Section A is a self-contained presentation of the theory and proofs of Section 2 in the paper. Section B contains much of the background on uniform stochastic convergence that is used for the technical results of the paper. This includes an account of previously established results for real-valued random variables and new results for Hilbertian and Banachian random variables. Section C contains the proofs of the results in Section 4 in the paper.
A Hardness of functional Gaussian independence testing
In this section we provide the necessary background and prove the hardness result in Section 2. We will use the notation and terminology described in the setup of Section 2 except that, for ease of reference, the distributions in $\mathcal{P}_0$, $\mathcal{P}$ and $\mathcal{Q}$ will consist of $n$ i.i.d. copies of jointly Gaussian $(X, Y, Z)$ rather than a single copy. For a bounded linear operator $A$ on a Hilbert space $\mathcal{H}$, we let $A^*$ denote the adjoint of $A$. For two orthogonal subspaces $\mathcal{A}$ and $\mathcal{B}$ of a Hilbert space $\mathcal{H}$, we write $\mathcal{A} \oplus \mathcal{B}$ for the orthogonal direct sum of $\mathcal{A}$ and $\mathcal{B}$.

In Section A.1 we consider regular conditional probabilities and conditional distributions of Hilbertian random variables and prove several Hilbertian analogues of well-known multivariate Gaussian results. In Section A.2 we consider the setup of Section 2 in the specific case where all the Hilbert spaces are finite-dimensional. Here we show that for any $Q \in \mathcal{Q}$, sample size $n$ and $\varepsilon > 0$, we can find a sufficiently large dimension of $\mathcal{H}_Z$ such that any test of size $\alpha$ over $\mathcal{P}_Q$ has power at most $\alpha + \varepsilon$ against any alternative. In Section A.3 we use the previous results to prove the main hardness result.

A.1 Conditional distributions on Hilbert spaces
Let us first recall how to formally define a conditional distribution. We follow Dudley [7, Chapter 10.2] and Rønn-Nielsen and Hansen [15].
Definition 1.
Let $(\Omega, \mathcal{F}, P)$ be a probability space, let $\mathcal{D}$ be a sub-$\sigma$-algebra of $\mathcal{F}$ and let $P|_{\mathcal{D}}$ denote the restriction of $P$ to $\mathcal{D}$. Let $X$ be a random variable defined on $(\Omega, \mathcal{F}, P)$ mapping into a measurable space $(\mathcal{X}, \mathcal{A})$. We say that a function $P_{X \mid \mathcal{D}} : \mathcal{A} \times \Omega \to [0, 1]$ is a conditional distribution for $X$ given $\mathcal{D}$ if the following two conditions hold.

(i) For each $A \in \mathcal{A}$, $P_{X \mid \mathcal{D}}(A, \cdot) = E(1_{\{X \in A\}} \mid \mathcal{D}) = P(X \in A \mid \mathcal{D})$, $P|_{\mathcal{D}}$-a.s.

(ii) For $P|_{\mathcal{D}}$-almost every $\omega \in \Omega$, $P_{X \mid \mathcal{D}}(\cdot, \omega)$ is a probability measure on $(\mathcal{X}, \mathcal{A})$.

We are mainly interested in conditioning on the value of some random variable, which leads to the following definition.
Definition 2.
Consider random variables $X$ and $Y$ defined on the probability space $(\Omega, \mathcal{F}, P)$ with values in the measurable spaces $(\mathcal{X}, \mathcal{A})$ and $(\mathcal{Y}, \mathcal{G})$, respectively. We say that a function $P_{Y \mid X} : \mathcal{G} \times \mathcal{X} \to [0, 1]$ is a conditional distribution for $Y$ given $X$ if the following conditions hold.

(i) For each $x \in \mathcal{X}$, $P_{Y \mid X}(\cdot, x)$ is a probability measure on $(\mathcal{Y}, \mathcal{G})$.

(ii) For each $G \in \mathcal{G}$, $P_{Y \mid X}(G, \cdot)$ is $\mathcal{A}$-$\mathcal{B}$ measurable, where $\mathcal{B}$ denotes the Borel $\sigma$-algebra on $\mathbb{R}$.

(iii) For each $A \in \mathcal{A}$,
$$P(X \in A, Y \in G) = \int 1_{(X \in A)} P_{Y \mid X}(G, X(\omega)) \, dP(\omega) = \int_A P_{Y \mid X}(G, x) \, dX(P)(x),$$
where $X(P)$ is the push-forward measure of $X$ under $P$.

Informally, we write $Y \mid X$ for the conditional distribution of $Y$ given $X$ and $Y \mid X = x$ for the measure $P_{Y \mid X}(\cdot, x)$. If a function $Q : \mathcal{G} \times \mathcal{X} \to [0, 1]$ only satisfies the first two conditions, we say that $Q$ is an $(\mathcal{X}, \mathcal{A})$-Markov kernel on $(\mathcal{Y}, \mathcal{G})$.

The connection between the previous two definitions can be seen by viewing $X$ and $Y$ as random variables on the background space $(\mathcal{X} \times \mathcal{Y}, \mathcal{A} \otimes \mathcal{G}, (X, Y)(P))$ where $(X, Y)(P)$ is the joint push-forward measure of $X$ and $Y$ under $P$. If we then let $\mathcal{D}$ be the smallest $\sigma$-algebra making the projection onto the $\mathcal{X}$-space measurable, we see by letting $P_{Y \mid \mathcal{D}}(G, (x, y)) = P_{Y \mid X}(G, x)$ that $P_{Y \mid X}$ also satisfies the conditions of the first definition. For more on this perspective, see Dudley [7, Theorem 10.2.1]. It is non-trivial to show the existence of conditional distributions; however, we do have the following result from Dudley [7, Theorem 10.2.2].

Lemma 1.
Consider random variables $X$ and $Y$ defined on the probability space $(\Omega, \mathcal{F}, P)$ with values in the measurable spaces $(\mathcal{X}, \mathcal{A})$ and $(\mathcal{Y}, \mathcal{G})$ respectively. If $\mathcal{X}$ and $\mathcal{Y}$ are Polish spaces and $\mathcal{A}$ and $\mathcal{G}$ are their respective Borel $\sigma$-algebras, then the conditional distribution for $Y$ given $X$ exists.

We will consider real-valued and Hilbertian random variables in the following, thus we are free to assume the existence of conditional distributions wherever needed. Before we delve into the main preliminary results about Hilbertian conditional distributions, we present some fundamental results from the theory of regular conditional distributions. For measurable spaces $(\mathcal{X}, \mathcal{A})$ and $(\mathcal{Y}, \mathcal{G})$, we let $i_x : \mathcal{Y} \to \mathcal{X} \times \mathcal{Y}$ denote the inclusion map, i.e. $i_x(y) = (x, y)$. This is a $\mathcal{G}$-$\mathcal{A} \otimes \mathcal{G}$ measurable mapping for each fixed $x$. The following four results are included for completeness and can be found in Rønn-Nielsen and Hansen [15, Lemma 1.1.4, Theorem 1.2.1, Theorem 2.1.1 & Theorem 3.5.5]. Unless otherwise specified, for these results $X$, $Y$ and $Z$ are random variables on measurable spaces $(\mathcal{X}, \mathcal{A})$, $(\mathcal{Y}, \mathcal{G})$ and $(\mathcal{Z}, \mathcal{K})$ respectively.

Lemma 2.
Let $Q$ be an $(\mathcal{X}, \mathcal{A})$-Markov kernel on $(\mathcal{Y}, \mathcal{G})$ and let $\mathcal{B}$ denote the Borel $\sigma$-algebra on $\mathbb{R}$. For each $C \in \mathcal{A} \otimes \mathcal{G}$ the map $x \mapsto Q(i_x^{-1}(C), x)$ is $\mathcal{A}$-$\mathcal{B}$ measurable.

Proof. Let $\mathcal{D} = \{C \in \mathcal{A} \otimes \mathcal{G} \mid x \mapsto Q(i_x^{-1}(C), x) \text{ is } \mathcal{A}\text{-}\mathcal{B} \text{ measurable}\}$ and consider a product set $A \times G \in \mathcal{A} \otimes \mathcal{G}$. Clearly,
$$i_x^{-1}(A \times G) = \begin{cases} \emptyset & \text{if } x \notin A \\ G & \text{if } x \in A \end{cases}$$
and therefore
$$Q(i_x^{-1}(A \times G), x) = \begin{cases} 0 & \text{if } x \notin A \\ Q(G, x) & \text{if } x \in A \end{cases} = 1_A(x) \, Q(G, x).$$
This is a product of two $\mathcal{A}$-$\mathcal{B}$ measurable functions and is thus also $\mathcal{A}$-$\mathcal{B}$ measurable. This shows that $\mathcal{D}$ contains all product sets, and since the product sets are an intersection-stable generator of $\mathcal{A} \otimes \mathcal{G}$, we are done if we can show that $\mathcal{D}$ is a Dynkin class, by Schilling [17, Theorem 5.5]. We have already shown that product sets are in $\mathcal{D}$, which includes $\mathcal{X} \times \mathcal{Y}$. If $C_1, C_2 \in \mathcal{D}$ where $C_1 \subseteq C_2$ then clearly also $i_x^{-1}(C_1) \subseteq i_x^{-1}(C_2)$ and further $i_x^{-1}(C_2 \setminus C_1) = i_x^{-1}(C_2) \setminus i_x^{-1}(C_1)$. This implies that
$$Q(i_x^{-1}(C_2 \setminus C_1), x) = Q(i_x^{-1}(C_2), x) - Q(i_x^{-1}(C_1), x),$$
which is the difference of two $\mathcal{A}$-$\mathcal{B}$ measurable functions and is thus also $\mathcal{A}$-$\mathcal{B}$ measurable. Hence $C_2 \setminus C_1 \in \mathcal{D}$. Finally, assume that $C_1 \subseteq C_2 \subseteq \cdots$ is an increasing sequence of $\mathcal{D}$-sets. Similarly to above we have $i_x^{-1}(C_1) \subseteq i_x^{-1}(C_2) \subseteq \cdots$ and
$$i_x^{-1}\Big(\bigcup_{n=1}^\infty C_n\Big) = \bigcup_{n=1}^\infty i_x^{-1}(C_n).$$
Then
$$Q\Big(i_x^{-1}\Big(\bigcup_{n=1}^\infty C_n\Big), x\Big) = Q\Big(\bigcup_{n=1}^\infty i_x^{-1}(C_n), x\Big) = \lim_{n \to \infty} Q(i_x^{-1}(C_n), x).$$
The limit is $\mathcal{A}$-$\mathcal{B}$ measurable since each of the functions $x \mapsto Q(i_x^{-1}(C_n), x)$ is measurable. Hence $\mathcal{D}$ is a Dynkin class and we have the desired result.

Proposition 3.
Let $\mu$ be a probability measure on $(\mathcal{X}, \mathcal{A})$ and let $Q$ be an $(\mathcal{X}, \mathcal{A})$-Markov kernel on $(\mathcal{Y}, \mathcal{G})$. There exists a uniquely determined probability measure $\lambda$ on $(\mathcal{X} \times \mathcal{Y}, \mathcal{A} \otimes \mathcal{G})$ satisfying
$$\lambda(A \times G) = \int_A Q(G, x) \, d\mu(x)$$
for all $A \in \mathcal{A}$ and $G \in \mathcal{G}$. Furthermore, for $C \in \mathcal{A} \otimes \mathcal{G}$,
$$\lambda(C) = \int Q(i_x^{-1}(C), x) \, d\mu(x).$$
Proof.
Uniqueness follows from Schilling [17, Theorem 5.7] since $\lambda$ is determined on the product sets, which form an intersection-stable generator of $\mathcal{A} \otimes \mathcal{G}$.

For existence, we show that $\lambda$ as defined for general $C \in \mathcal{A} \otimes \mathcal{G}$ is a measure. The integrand is measurable by Lemma 2 and since $Q$ is non-negative, the integral is well-defined with values in $[0, \infty]$. Let $C_1, C_2, \ldots$ be a sequence of disjoint sets in $\mathcal{A} \otimes \mathcal{G}$. Then for each $x \in \mathcal{X}$ the sets $i_x^{-1}(C_1), i_x^{-1}(C_2), \ldots$ are disjoint as well. Hence
$$\lambda\Big(\bigcup_{n=1}^\infty C_n\Big) = \int Q\Big(i_x^{-1}\Big(\bigcup_{n=1}^\infty C_n\Big), x\Big) \, d\mu(x) = \int \sum_{n=1}^\infty Q\big(i_x^{-1}(C_n), x\big) \, d\mu(x) = \sum_{n=1}^\infty \int Q\big(i_x^{-1}(C_n), x\big) \, d\mu(x) = \sum_{n=1}^\infty \lambda(C_n),$$
where the second equality uses that $Q(\cdot, x)$ is a measure and the third uses monotone convergence to interchange integration and summation. Since also
$$\lambda(\mathcal{X} \times \mathcal{Y}) = \int Q(i_x^{-1}(\mathcal{X} \times \mathcal{Y}), x) \, d\mu(x) = \int Q(\mathcal{Y}, x) \, d\mu(x) = \int 1 \, d\mu(x) = 1,$$
$\lambda$ is a probability measure and it follows that
$$\lambda(A \times G) = \int Q(i_x^{-1}(A \times G), x) \, d\mu(x) = \int_A Q(G, x) \, d\mu(x)$$
for all $A \in \mathcal{A}$ and $G \in \mathcal{G}$, as desired.

Proposition 4. Assume that $P_{Y \mid X}$ is the conditional distribution of $Y$ given $X$. Let $(\mathcal{Z}, \mathcal{K})$ be another measurable space and let $\varphi : \mathcal{X} \times \mathcal{Y} \to \mathcal{Z}$ be a measurable mapping. Define $Z = \varphi(X, Y)$. Then the conditional distribution of $Z$ given $X$ exists and for $K \in \mathcal{K}$ and $x \in \mathcal{X}$ is given by
$$P_{Z \mid X}(K, x) = P_{Y \mid X}((\varphi \circ i_x)^{-1}(K), x).$$
Proof.
Clearly $P_{Z \mid X}(\cdot, x)$ is a probability measure for every $x \in \mathcal{X}$ and Lemma 2 yields that $P_{Z \mid X}(K, \cdot)$ is $\mathcal{A}$-$\mathcal{B}$ measurable for every $K \in \mathcal{K}$. It remains to show that $P_{Z \mid X}$ satisfies the third condition required to be the conditional distribution of $Z$ given $X$. For $A \in \mathcal{A}$ and $K \in \mathcal{K}$ we get that
$$P(X \in A, Z \in K) = P\big((X, Y) \in (A \times \mathcal{Y}) \cap \varphi^{-1}(K)\big)$$
and hence by Proposition 3, we get that
$$P(X \in A, Z \in K) = \int P_{Y \mid X}\big(i_x^{-1}((A \times \mathcal{Y}) \cap \varphi^{-1}(K)), x\big) \, dX(P)(x).$$
Since
$$i_x^{-1}\big((A \times \mathcal{Y}) \cap \varphi^{-1}(K)\big) = \begin{cases} \emptyset & \text{if } x \notin A \\ i_x^{-1}(\varphi^{-1}(K)) & \text{if } x \in A, \end{cases}$$
we get
$$P(X \in A, Z \in K) = \int_A P_{Y \mid X}\big(i_x^{-1}(\varphi^{-1}(K)), x\big) \, dX(P)(x) = \int_A P_{Z \mid X}(K, x) \, dX(P)(x),$$
proving the desired result.

Proposition 5.
Suppose that the conditional distribution $P_{Y \mid (X,Z)}$ of $Y$ given $(X, Z)$ has the structure
$$P_{Y \mid (X,Z)}(G, (x, z)) = Q(G, z)$$
for some $Q : \mathcal{G} \times \mathcal{Z} \to [0, 1]$ where for every $z \in \mathcal{Z}$, $Q(\cdot, z)$ is a probability measure. Then $Q$ is a Markov kernel, $Q$ is the conditional distribution of $Y$ given $Z$ and $X \perp\!\!\!\perp Y \mid Z$.

Proof. That $Q$ is a Markov kernel follows immediately from the fact that $P_{Y \mid (X,Z)}$ is a Markov kernel. To see that $Q$ is the conditional distribution of $Y$ given $Z$, note that defining $\pi_Z : \mathcal{X} \times \mathcal{Z} \to \mathcal{Z}$ to be the projection onto $\mathcal{Z}$, we get
$$P(Z \in K, Y \in G) = P\big((X, Z) \in \pi_Z^{-1}(K), Y \in G\big) = \int_{\pi_Z^{-1}(K)} P_{Y \mid (X,Z)}(G, (x, z)) \, d(X, Z)(P)(x, z) = \int_{\pi_Z^{-1}(K)} Q(G, \pi_Z(x, z)) \, d(X, Z)(P)(x, z) = \int_K Q(G, z) \, dZ(P)(z),$$
by viewing $Z(P)$ as the image measure of $(X, Z)(P)$ under $\pi_Z$ and applying Schilling [17, Theorem 14.1]. For every $G \in \mathcal{G}$, $Q(G, Z)$ is a version of the conditional probability $P(Y \in G \mid Z) = E(1_{(Y \in G)} \mid Z)$ since $Q(G, Z)$ is clearly measurable with respect to $\sigma(Z)$ and
$$\int 1_{(Z \in K)} 1_{(Y \in G)} \, dP = P(Z \in K, Y \in G) = \int 1_{(Z \in K)} Q(G, Z) \, dP.$$
The same argument applies to show that $P_{Y \mid (X,Z)}(G, (X, Z))$ is a version of $P(Y \in G \mid X, Z)$. Hence for every $G \in \mathcal{G}$,
$$P(Y \in G \mid Z) = Q(G, Z) = P_{Y \mid (X,Z)}(G, (X, Z)) = P(Y \in G \mid X, Z)$$
and thus $X \perp\!\!\!\perp Y \mid Z$ as desired.

With these results we are ready to start considering Hilbertian conditional distributions.

Remark 1.
In the following we will repeatedly consider orthogonal decompositions of Hilbert spaces. We write $\mathcal{H} = \mathcal{H}_1 \oplus \mathcal{H}_2$ if every $h \in \mathcal{H}$ can be written as $h = h_1 + h_2$ where $h_1 \in \mathcal{H}_1$, $h_2 \in \mathcal{H}_2$ and $\mathcal{H}_1 \perp \mathcal{H}_2$. If an operator $A$ is defined on $\mathcal{H}$, the decomposition induces four operators: $A_{11}$ and $A_{21}$, the $\mathcal{H}_1$ and $\mathcal{H}_2$ components of the restriction of $A$ to $\mathcal{H}_1$, and similarly $A_{12}$ and $A_{22}$, the $\mathcal{H}_1$ and $\mathcal{H}_2$ components of the restriction of $A$ to $\mathcal{H}_2$. We can write $A$ as the sum of these four operators. If $X$ is a random variable on $\mathcal{H}$ and $\mathcal{H}_1$ and $\mathcal{H}_2$ are as above, we can similarly decompose $X$ into $(X_1, X_2)$ where $X_1 \in \mathcal{H}_1$ and $X_2 \in \mathcal{H}_2$. If $C$ is the covariance operator of $X$, we can decompose it as mentioned above and, in particular, we have $C_{11} = \mathrm{Cov}(X_1)$, $C_{22} = \mathrm{Cov}(X_2)$ and $C_{12} = C_{21}^* = \mathrm{Cov}(X_1, X_2)$, where $C_{21}^*$ denotes the adjoint of $C_{21}$. This is analogous to the usual block matrix decomposition of the covariance matrix of multivariate random variables.

We will need two results that are fundamental in the theory of the multivariate Gaussian distribution.
Proposition 6.
Let $X$ be Gaussian on $\mathcal{H}$ and assume that $\mathcal{H} = \mathcal{H}_1 \oplus \mathcal{H}_2$. Define $(X_1, X_2)$ to be the corresponding decomposition of $X$. Then $X_1 \perp\!\!\!\perp X_2$ if and only if $\mathrm{Cov}(X_1, X_2) = 0$.

Proof. We show that $\mathrm{Cov}(X_1, X_2) = 0$ implies independence, since the other direction is trivial. We will use the approach of characteristic functionals as described in detail in Vakhania et al. [21, Chapter IV]. The characteristic functional of a random variable (technically, of the distribution of the random variable) is the mapping defined on $\mathcal{H}$ by $h \mapsto E[\exp(i\langle X, h\rangle)]$. Vakhania et al. [21, Theorem IV.2.4] state that for Gaussian $X$ with mean $\mu$ and covariance operator $C$ the characteristic functional is
$$\varphi_X(h) = \exp\Big(i\langle \mu, h\rangle - \frac{1}{2}\langle Ch, h\rangle\Big).$$
Vakhania et al. [21, Chapter IV, Proposition 2.2 + Corollary] state that $X_1$ and $X_2$ are independent if the characteristic functional of $X$ factorises into the product of their respective characteristic functionals. By the assumption that $C_{12} = \mathrm{Cov}(X_1, X_2) = 0$, we can write the covariance as $C = C_1 + C_2$ where $C_i$ is the covariance of $X_i$. Writing $h = h_1 + h_2$ with $h_i \in \mathcal{H}_i$, we then have $\langle Ch, h\rangle = \langle C_1h_1, h_1\rangle + \langle C_2h_2, h_2\rangle$ and $\langle \mu, h\rangle = \langle \mu_1, h_1\rangle + \langle \mu_2, h_2\rangle$, so that $\varphi_X(h) = \varphi_{X_1}(h_1)\,\varphi_{X_2}(h_2)$, and the result follows.

Proposition 7.
Let $X$ be Gaussian on $\mathcal{H}_1$ with mean $\mu$ and covariance operator $C$, and let $A$ be a bounded linear operator from $\mathcal{H}_1$ to $\mathcal{H}_2$ and $z \in \mathcal{H}_2$. Then $Y = AX + z$ is Gaussian on $\mathcal{H}_2$ with mean $A\mu + z$ and covariance operator $ACA^*$ where $A^*$ is the adjoint of $A$.

Proof. Throughout, we let $\langle\cdot,\cdot\rangle_1$ and $\langle\cdot,\cdot\rangle_2$ denote the inner products of $\mathcal{H}_1$ and $\mathcal{H}_2$ respectively. By definition, for every $h \in \mathcal{H}_1$, $\langle X, h\rangle_1$ is Gaussian on $\mathbb{R}$. For every $h \in \mathcal{H}_2$ we have
$$\langle Y, h\rangle_2 = \langle AX, h\rangle_2 + \langle z, h\rangle_2 = \langle X, A^*h\rangle_1 + \langle z, h\rangle_2,$$
thus $Y$ is also Gaussian. Using the interchangeability of the Bochner integral and bounded linear operators (see Hsing and Eubank [9, Theorem 3.1.7]), we get the mean of $Y$ immediately. By noting that for any $h \in \mathcal{H}_1$ and $k \in \mathcal{H}_2$, we have
$$(Ah) \otimes k = \langle Ah, \cdot\rangle_2 \, k = \langle h, A^*\cdot\rangle_1 \, k = (h \otimes k)A^*,$$
the covariance result then follows by the same argument as for the mean.

With these results we can now show that conditioning on a non-singular part of a Gaussian distribution on a Hilbert space yields another Gaussian distribution, with mean and covariance given by the Hilbertian analogue of the well-known Gaussian conditioning formula.

Proposition 8.
Let $X$ be mean zero Gaussian on $\mathcal{H}$ with covariance operator $C$ and assume that $\mathcal{H} = \mathcal{H}_1 \oplus \mathcal{H}_2$. Let $(X_1, X_2)$ denote the corresponding decomposition of $X$. As discussed in Remark 1, we then set $C_{11} := \mathrm{Cov}(X_1)$, $C_{22} := \mathrm{Cov}(X_2)$ and $C_{12} = C_{21}^* := \mathrm{Cov}(X_1, X_2)$, where $C_{21}^*$ denotes the adjoint of $C_{21}$. If $C_{22}$ is injective, i.e.
$$\mathrm{Ker}(C_{22}) = \{h \in \mathcal{H}_2 \mid C_{22}h = 0\} = \{0\},$$
then the conditional distribution of $X_1$ given $X_2$ is Gaussian on $\mathcal{H}_1$ with
$$E(X_1 \mid X_2) = C_{12}C_{22}^\dagger X_2 \quad \text{and} \quad \mathrm{Cov}(X_1 \mid X_2) = C_{11} - C_{12}C_{22}^\dagger C_{21},$$
where $C_{22}^\dagger$ is the generalised inverse (or Moore–Penrose inverse) of $C_{22}$.

Proof. Define $Z := X_1 - C_{12}C_{22}^\dagger X_2$. Note that since $(Z, X_2)$ is a bounded linear transformation of $(X_1, X_2)$, $(Z, X_2)$ must be jointly Gaussian by Proposition 7. By Proposition 6, $Z$ and $X_2$ are independent if $\mathrm{Cov}(Z, X_2) = 0$. We calculate the covariance and get
$$\mathrm{Cov}(Z, X_2) = C_{12} - C_{12}C_{22}^\dagger C_{22} = 0$$
by Hsing and Eubank [9, Theorem 3.5.8 (3.18)] since $\mathrm{Ker}(C_{22}) = \{0\}$. This implies that the conditional distribution of $Z$ given $X_2$ is simply the distribution of $Z$. We can find the complete distribution of $Z$ by calculating the mean and covariance of $Z$, since $Z$ is Gaussian. We get by Proposition 7
$$E(Z) = E(X_1) - C_{12}C_{22}^\dagger E(X_2) = 0 \quad \text{and} \quad \mathrm{Cov}(Z) = C_{11} - C_{12}C_{22}^\dagger C_{21}.$$
By Proposition 4, since we can write $X_1 = Z + C_{12}C_{22}^\dagger X_2$, the conditional distribution of $X_1$ given $X_2$ is as desired.
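In finite dimensions, Proposition 8 reduces to the familiar Gaussian conditioning formula, and it is easy to check numerically. The following sketch (ours, not from the paper; it assumes only numpy) verifies on simulated data that the residual $X_1 - C_{12}C_{22}^\dagger X_2$ is uncorrelated with $X_2$ and has covariance $C_{11} - C_{12}C_{22}^\dagger C_{21}$:

```python
import numpy as np

rng = np.random.default_rng(0)
# A random positive-definite covariance on R^5, split as H1 = R^2, H2 = R^3.
A = rng.normal(size=(5, 5))
C = A @ A.T + 0.1 * np.eye(5)
C11, C12, C22 = C[:2, :2], C[:2, 2:], C[2:, 2:]

B = C12 @ np.linalg.pinv(C22)   # E(X1 | X2) = B X2
cond_cov = C11 - B @ C12.T      # Cov(X1 | X2) = C11 - C12 C22^+ C21

X = rng.multivariate_normal(np.zeros(5), C, size=200_000)
resid = X[:, :2] - X[:, 2:] @ B.T
print(np.round(resid.T @ X[:, 2:] / len(X), 3))  # ~ zero cross-covariance
print(np.round(np.cov(resid.T), 3))              # ~ cond_cov
print(np.round(cond_cov, 3))
```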
A.2 Power of finite-dimensional Gaussian conditional independence testing

Before we consider Gaussian conditional independence testing, we have the following general result from Kraft [13]. A summary is given in LeCam [14].
Lemma 3.
Let $\mathcal{A}$ and $\mathcal{B}$ denote two families of probability measures on some measurable space $(\mathcal{X}, \mathcal{A})$ and assume that both families are dominated by a $\sigma$-finite measure. Consider the problem of testing the null hypothesis that the given data is from a distribution in $\mathcal{A}$ against the alternative that the distribution is in $\mathcal{B}$. Let $d_{\mathrm{TV}}$ denote the total variation distance and $\tilde{\mathcal{A}}$ and $\tilde{\mathcal{B}}$ the closed convex hulls of $\mathcal{A}$ and $\mathcal{B}$. Then
$$\inf_{\psi : \mathcal{X} \to [0,1]} \; \sup_{P \in \mathcal{A},\, Q \in \mathcal{B}} \left[\int \psi \, dP + \int (1 - \psi) \, dQ\right] = 1 - \inf_{P \in \tilde{\mathcal{A}},\, Q \in \tilde{\mathcal{B}}} d_{\mathrm{TV}}(P, Q).$$

An immediate consequence of this is that for any test $\psi$ that has size $\alpha$ and power function $\beta : \mathcal{B} \to [0, 1]$, $\beta(Q) = \int \psi \, dQ$, we have
$$\inf_{Q \in \mathcal{B}} \beta(Q) \le \alpha + \inf_{P \in \tilde{\mathcal{A}},\, Q \in \tilde{\mathcal{B}}} d_{\mathrm{TV}}(P, Q) \le \alpha + \inf_{P \in \tilde{\mathcal{A}},\, Q \in \mathcal{B}} d_{\mathrm{TV}}(P, Q).$$
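To spell out how this consequence follows from the lemma (a short derivation added here for the reader; it uses only the definitions above): since $\psi$ has size $\alpha$, $\sup_{P \in \mathcal{A}} \int \psi \, dP \le \alpha$, so

$$\alpha + 1 - \inf_{Q \in \mathcal{B}} \beta(Q) \;\ge\; \sup_{P \in \mathcal{A}} \int \psi \, dP + \sup_{Q \in \mathcal{B}} \int (1 - \psi) \, dQ \;\ge\; \inf_{\psi'} \sup_{P \in \mathcal{A},\, Q \in \mathcal{B}} \left[\int \psi' \, dP + \int (1 - \psi') \, dQ\right] = 1 - \inf_{P \in \tilde{\mathcal{A}},\, Q \in \tilde{\mathcal{B}}} d_{\mathrm{TV}}(P, Q),$$

and rearranging gives the first inequality; the second holds since $\mathcal{B} \subseteq \tilde{\mathcal{B}}$, so the infimum over $\tilde{\mathcal{B}}$ is no larger than that over $\mathcal{B}$.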
In most practical situations both $\mathcal{A}$ and $\mathcal{B}$ will consist of product measures on a product space, corresponding to a situation where we observe a sample of $n$ i.i.d. observations of some random variable. The theorem states that a lower bound on the sum of the type I and type II error probabilities of testing the null that data is from a distribution in $\mathcal{A}$ against the alternative that the distribution is in $\mathcal{B}$ is given by 1 minus the total variation distance between the closed convex hulls of $\mathcal{A}$ and $\mathcal{B}$. As a consequence we see that the power of a test is upper bounded by its size plus the total variation distance between the closed convex hulls of $\mathcal{A}$ and $\mathcal{B}$.

In the remainder of this section we will consider the testing problem described in Section 2 with $\mathcal{H}_X = \mathbb{R}^{d_X}$ and $\mathcal{H}_Z = \mathbb{R}^{d_Z}$ for various choices of $d_X, d_Z \in \mathbb{N}$. To produce bounds on the power of a test in this setting, we will construct an explicit TV-approximation to a family of particularly simple distributions in $\mathcal{Q}$ using a distribution in the convex hull of the null distributions. We will need an upper bound on the total variation distance between measures.

Lemma 4.
Let $P$ and $Q$ be probability measures where $P$ has density $f$ with respect to $Q$. Then
$$d_{\mathrm{TV}}(P, Q) \le \frac{1}{2}\sqrt{\int f^2 \, dQ - 1}.$$
Proof.
Assume that the integral of $f^2$ with respect to $Q$ is finite, otherwise the inequality is trivially valid. By Jensen's inequality, we get
$$d_{\mathrm{TV}}(P, Q)^2 = \frac{1}{4}\left(\int |f - 1| \, dQ\right)^2 \le \frac{1}{4}\int (f - 1)^2 \, dQ = \frac{1}{4}\left(\int f^2 \, dQ - 1\right),$$
proving the desired result.
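As a quick numerical sanity check of Lemma 4 (ours, not from the paper; it assumes only numpy and scipy), for two unit-variance Gaussians the integral has the closed form $\int f^2 \, dQ = e^{(\mu_P - \mu_Q)^2}$, and the bound can be compared against the exact total variation distance:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu_p, mu_q = 0.3, 0.0  # P = N(mu_p, 1), Q = N(mu_q, 1)

# Exact TV distance: half the L1 distance between the densities.
tv = 0.5 * quad(lambda x: abs(norm.pdf(x, mu_p) - norm.pdf(x, mu_q)),
                -10, 10)[0]

# Lemma 4 bound: 0.5 * sqrt(int f^2 dQ - 1) with f = dP/dQ.
chi2_int = quad(lambda x: norm.pdf(x, mu_p) ** 2 / norm.pdf(x, mu_q),
                -10, 10)[0]
bound = 0.5 * np.sqrt(chi2_int - 1)

print(tv, bound)  # tv ~ 0.119 <= bound ~ 0.153
```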
Using this bound and Lemma 3, we can show the following result.

Theorem 5. Let $Q$ be a distribution consisting of $n$ i.i.d. copies of jointly Gaussian $(X, Y, Z)$ on $(\mathbb{R}, \mathbb{R}, \mathbb{R}^d)$ for some $d \in \mathbb{N}$, where $X$ and $Y$ are standard Gaussian, $Z$ is mean zero with identity covariance matrix, $\mathrm{Cov}(X, Z) = \mathrm{Cov}(Y, Z) = 0$ and $\mathrm{Cov}(X, Y) = \rho \in (0, 1)$. Consider the testing problem described in Section 2 with $\mathcal{H}_X = \mathbb{R}$ and $\mathcal{H}_Z = \mathbb{R}^d$ and let $\psi$ be the test function of a size $\alpha$ test over $\mathcal{P}_Q$. Letting $\beta$ denote the power of $\psi$ against $Q$, we have
$$\beta \le \alpha + \frac{1}{2}\sqrt{-1 + (1 + \rho)^n \sum_{k=0}^d \frac{\binom{d}{k}}{2^d}\big(1 + (3 - 4k/d)\rho\big)^{-n}},$$
and in particular for fixed $n$ the upper bound converges to $\alpha$ as $d$ increases.

Proof. Let $\tau \in \{-1, 1\}^d$ and let $P_\tau$ denote the Gaussian distribution consisting of $n$ i.i.d. copies of jointly Gaussian $(X, Y, Z)$ where $X$ and $Y$ are standard Gaussian, $Z$ is mean zero with identity covariance matrix, $\mathrm{Cov}(X, Y) = \rho$ and $\mathrm{Cov}(X, Z) = \mathrm{Cov}(Y, Z) = \sqrt{\rho/d}\,\tau^T$. For every $\tau \in \{-1, 1\}^d$, it is clear that $X \perp\!\!\!\perp Y \mid Z$ under $P_\tau$ and thus, forming
$$P := \frac{1}{2^d}\sum_{\tau \in \{-1,1\}^d} P_\tau,$$
we note that $P$ is in the closed convex hull of the set of null distributions. Let $\Gamma_\tau$ and $\Gamma_Q$ denote the $n(d+2)$-dimensional covariance matrices of the $n$ i.i.d. copies of $(X, Y, Z)$ under $P_\tau$ and $Q$ respectively. These are block-diagonal and we let $\Sigma_\tau$ and $\Sigma_Q$ respectively denote the matrices in the diagonal, corresponding to the covariance of a single observation of $(X, Y, Z)$ under $P_\tau$ and $Q$. By standard manipulations of densities, the density of $P$ with respect to $Q$ is simply the ratio of their respective densities with respect to the Lebesgue measure. We have
$$\Sigma_\tau = \begin{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} & \sqrt{\rho/d}\begin{pmatrix} \tau^T \\ \tau^T \end{pmatrix} \\ \sqrt{\rho/d}\begin{pmatrix} \tau & \tau \end{pmatrix} & I_d \end{pmatrix}$$
and, letting $I_d$ denote the $d$-dimensional identity matrix,
$$\Sigma_Q = \begin{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} & 0 \\ 0 & I_d \end{pmatrix}.$$
The determinant of $\Sigma_Q$ is $1 - \rho^2$ by Laplace expanding the first row. Letting $J$ denote the 2-dimensional matrix of ones,
$$\det(\Sigma_\tau) = \det(I_d)\det\left(\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} - \rho J\right) = (1 - \rho)^2$$
by Schur's formula. Defining $f$ to be the density of $P$ with respect to $Q$, we see that
$$f(v) = \frac{1}{2^d}\frac{(1 + \rho)^{n/2}}{(1 - \rho)^{n/2}}\sum_{\tau \in \{-1,1\}^d}\exp\left(-\frac{1}{2}v^T(\Gamma_\tau^{-1} - \Gamma_Q^{-1})v\right),$$
since the determinants of $\Gamma_\tau$ and $\Gamma_Q$ are the determinants of $\Sigma_\tau$ and $\Sigma_Q$ to the $n$'th power. From this we get that
$$\int f^2 \, dQ = \frac{1}{2^{2d}}\frac{(1 + \rho)^n}{(1 - \rho)^n}\sum_{\tau, \tau' \in \{-1,1\}^d}\int \exp\left(-\frac{1}{2}v^T(\Gamma_\tau^{-1} + \Gamma_{\tau'}^{-1} - 2\Gamma_Q^{-1})v\right) dQ(v)$$
$$= \frac{1}{2^{2d}}\frac{(1 + \rho)^n}{(1 - \rho)^n}\frac{1}{\sqrt{(2\pi)^{n(d+2)}(1 - \rho^2)^n}}\sum_{\tau, \tau' \in \{-1,1\}^d}\int \exp\left(-\frac{1}{2}v^T(\Gamma_\tau^{-1} + \Gamma_{\tau'}^{-1} - \Gamma_Q^{-1})v\right) d\lambda^{n(d+2)}(v),$$
where $\lambda^{n(d+2)}$ denotes the $n(d+2)$-dimensional Lebesgue measure. Each integral is the integral of an unnormalised Gaussian density in $\mathbb{R}^{n(d+2)}$ and thus we can simplify further to get
$$\int f^2 \, dQ = \frac{1}{2^{2d}}\frac{(1 + \rho)^n}{(1 - \rho)^n(1 - \rho^2)^{n/2}}\sum_{\tau, \tau'}\sqrt{\det\big[(\Gamma_\tau^{-1} + \Gamma_{\tau'}^{-1} - \Gamma_Q^{-1})^{-1}\big]} = \frac{1}{2^{2d}}\frac{(1 + \rho)^n}{(1 - \rho)^n(1 - \rho^2)^{n/2}}\sum_{\tau, \tau'}\det(\Sigma_\tau^{-1} + \Sigma_{\tau'}^{-1} - \Sigma_Q^{-1})^{-n/2},$$
by again using the block-diagonal structure of $\Gamma_Q$ and the $\Gamma_\tau$'s. Recall that for a symmetric block matrix
$$\begin{pmatrix} A & B^T \\ B & C \end{pmatrix}^{-1} = \begin{pmatrix} (A - B^TC^{-1}B)^{-1} & -(A - B^TC^{-1}B)^{-1}B^TC^{-1} \\ -C^{-1}B(A - B^TC^{-1}B)^{-1} & C^{-1} + C^{-1}B(A - B^TC^{-1}B)^{-1}B^TC^{-1} \end{pmatrix},$$
hence we get that
$$\Sigma_Q^{-1} = \begin{pmatrix} \frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix} & 0 \\ 0 & I_d \end{pmatrix}, \qquad \Sigma_\tau^{-1} = \begin{pmatrix} \frac{1}{1-\rho}I_2 & -\frac{1}{1-\rho}\sqrt{\rho/d}\begin{pmatrix} \tau^T \\ \tau^T \end{pmatrix} \\ -\frac{1}{1-\rho}\sqrt{\rho/d}\begin{pmatrix} \tau & \tau \end{pmatrix} & I_d + \frac{2\rho}{(1-\rho)d}\tau\tau^T \end{pmatrix}.$$
Further,
$$\Sigma_\tau^{-1} + \Sigma_{\tau'}^{-1} - \Sigma_Q^{-1} = \begin{pmatrix} A & B^T \\ B & C \end{pmatrix},$$
where
$$A = \frac{1}{1-\rho^2}\begin{pmatrix} 2\rho + 1 & \rho \\ \rho & 2\rho + 1 \end{pmatrix}, \qquad B = -\frac{1}{1-\rho}\sqrt{\rho/d}\begin{pmatrix} \tau + \tau' & \tau + \tau' \end{pmatrix}, \qquad C = I_d + \frac{2\rho}{(1-\rho)d}(\tau\tau^T + \tau'\tau'^T).$$
We can again use Schur's formula for the determinant of a block matrix to find that
$$\det(\Sigma_\tau^{-1} + \Sigma_{\tau'}^{-1} - \Sigma_Q^{-1}) = \det(C)\det(A - B^TC^{-1}B).$$
Defining $V = \begin{pmatrix} \tau & \tau' \end{pmatrix}$, we note that $C = I_d + \frac{2\rho}{(1-\rho)d}VV^T$ and, defining further
$$M = I_2 + \frac{2\rho}{(1-\rho)d}V^TV = \frac{1}{d(1-\rho)}\begin{pmatrix} d(1+\rho) & 2\rho\langle\tau,\tau'\rangle \\ 2\rho\langle\tau,\tau'\rangle & d(1+\rho) \end{pmatrix},$$
the Weinstein–Aronszajn identity yields that
$$\det(C) = \det(M) = \frac{(d(1+\rho) + 2\rho\langle\tau,\tau'\rangle)(d(1+\rho) - 2\rho\langle\tau,\tau'\rangle)}{d^2(1-\rho)^2}.$$
The Woodbury matrix identity yields that
$$C^{-1} = I_d - \frac{2\rho}{(1-\rho)d}VM^{-1}V^T.$$
Hence
$$\det(A - B^TC^{-1}B) = \det\left(A - B^TB + \frac{2\rho}{(1-\rho)d}B^TVM^{-1}V^TB\right).$$
Now
$$M^{-1} = \frac{(1-\rho)d}{(d(1+\rho) + 2\rho\langle\tau,\tau'\rangle)(d(1+\rho) - 2\rho\langle\tau,\tau'\rangle)}\begin{pmatrix} d(1+\rho) & -2\rho\langle\tau,\tau'\rangle \\ -2\rho\langle\tau,\tau'\rangle & d(1+\rho) \end{pmatrix}$$
and
$$B^TV = -\frac{1}{1-\rho}\sqrt{\rho/d}\,(d + \langle\tau,\tau'\rangle)J,$$
where $J$ is the 2-dimensional matrix of ones. Thus,
$$\frac{2\rho}{(1-\rho)d}B^TVM^{-1}V^TB = \frac{4\rho^2(d + \langle\tau,\tau'\rangle)^2}{(1-\rho)^2d(d(1+\rho) + 2\rho\langle\tau,\tau'\rangle)}J.$$
Since
$$B^TB = \frac{2\rho}{(1-\rho)^2d}(d + \langle\tau,\tau'\rangle)J,$$
we get that
$$\det(A - B^TC^{-1}B) = \det\left(A + \left(\frac{4\rho^2(d + \langle\tau,\tau'\rangle)^2}{(1-\rho)^2d(d(1+\rho) + 2\rho\langle\tau,\tau'\rangle)} - \frac{2\rho(d + \langle\tau,\tau'\rangle)}{(1-\rho)^2d}\right)J\right) = \det\left(A - \frac{2\rho(d + \langle\tau,\tau'\rangle)}{(1-\rho)(d(1+\rho) + 2\rho\langle\tau,\tau'\rangle)}J\right) = \frac{d(1+\rho) - 2\rho\langle\tau,\tau'\rangle}{(1-\rho)(1+\rho)(d(1+\rho) + 2\rho\langle\tau,\tau'\rangle)}$$
and thus
$$\det(\Sigma_\tau^{-1} + \Sigma_{\tau'}^{-1} - \Sigma_Q^{-1}) = \frac{(d(1+\rho) - 2\rho\langle\tau,\tau'\rangle)^2}{d^2(1-\rho)^3(1+\rho)}.$$
Returning to the squared integral of $f$ with respect to $Q$, we get that
$$\int f^2 \, dQ = \frac{1}{2^{2d}}\frac{(1 + \rho)^n}{(1 - \rho)^n(1 - \rho^2)^{n/2}}\sum_{\tau,\tau'}\frac{d^n(1-\rho)^{3n/2}(1+\rho)^{n/2}}{|d(1+\rho) - 2\rho\langle\tau,\tau'\rangle|^n} = \frac{(1 + \rho)^n}{2^{2d}}\sum_{\tau,\tau'}\frac{d^n}{|d(1+\rho) - 2\rho\langle\tau,\tau'\rangle|^n}.$$
For $\tau, \tau' \in \{-1,1\}^d$, $\langle\tau,\tau'\rangle = 2k - d$ where $k$ is the number of indices where $\tau_i = \tau'_i$. Thus instead of summing over $\tau, \tau' \in \{-1,1\}^d$, we can count the number of $(\tau, \tau')$-pairs where $\tau$ and $\tau'$ agree in exactly $k$ positions. For each $\tau$, there are $\binom{d}{k}$ other elements in $\{-1,1\}^d$ agreeing in exactly $k$ positions and there are $2^d$ different $\tau$'s, hence
$$\int f^2 \, dQ = \frac{(1 + \rho)^n}{2^d}\sum_{k=0}^d\frac{\binom{d}{k}d^n}{|d(1+\rho) - 2\rho(2k - d)|^n} = (1 + \rho)^n\sum_{k=0}^d\frac{\binom{d}{k}}{2^d}\frac{d^n}{(d + \rho(3d - 4k))^n} = (1 + \rho)^n\sum_{k=0}^d\frac{\binom{d}{k}}{2^d}\big(1 + \rho(3 - 4k/d)\big)^{-n}.$$
The result now follows from Lemma 4 and Lemma 3.

To see that for each $n$ the bound converges to $\alpha$ as $d$ increases, let $W_d$ be a random variable with a binomial distribution with probability parameter $1/2$ and $d$ trials and note that
$$\sum_{k=0}^d\frac{\binom{d}{k}}{2^d}\big(1 + \rho(3 - 4k/d)\big)^{-n} = E\big((1 + \rho(3 - 4W_d/d))^{-n}\big).$$
By the Strong Law of Large Numbers, $W_d/d \overset{a.s.}{\to} 1/2$ and thus $(1 + \rho(3 - 4W_d/d))^{-n} \overset{a.s.}{\to} (1 + \rho)^{-n}$. Since $(1 + \rho(3 - 4W_d/d))^{-n} \le (1 - \rho)^{-n}$, we get by the bounded convergence theorem that
$$\lim_{d \to \infty} E\big((1 + \rho(3 - 4W_d/d))^{-n}\big) = (1 + \rho)^{-n}$$
and hence the upper bound on the power converges to $\alpha$.
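The rate at which the Theorem 5 bound approaches $\alpha$ can be inspected numerically. The following sketch (ours, not the authors'; it assumes only numpy and scipy) evaluates the bound via the binomial representation used in the proof; values above 1 are vacuous:

```python
import numpy as np
from scipy.stats import binom

def theorem5_bound(alpha, n, d, rho):
    """Upper bound on the power of any size-alpha test (Theorem 5)."""
    k = np.arange(d + 1)
    # E[(1 + rho(3 - 4 W/d))^{-n}] with W ~ Binomial(d, 1/2)
    expect = np.sum(binom.pmf(k, d, 0.5)
                    * (1.0 + rho * (3.0 - 4.0 * k / d)) ** (-n))
    return alpha + 0.5 * np.sqrt((1.0 + rho) ** n * expect - 1.0)

# For fixed n the bound decays to alpha as the dimension d of Z grows.
for d in (100, 1_000, 10_000, 100_000):
    print(d, theorem5_bound(alpha=0.05, n=10, d=d, rho=0.5))
```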
We can expand the previous theorem to the situation where $X$ and $Y$ are of arbitrary finite dimension.

Theorem 6. Let $Q$ be a distribution consisting of $n$ i.i.d. copies of jointly Gaussian $(X, Y, Z)$ on $(\mathbb{R}^{d_X}, \mathbb{R}^{d_Y}, \mathbb{R}^{d_Z})$ for some $d_X, d_Y, d_Z \in \mathbb{N}$, where $X$, $Y$ and $Z$ are all mean zero with identity covariance matrix, $\mathrm{Cov}(X, Z) = \mathrm{Cov}(Y, Z) = 0$ and $\mathrm{Cov}(X, Y) = R$ for some rectangular diagonal matrix $R$ with diagonal entries $\rho_1, \ldots, \rho_r \in (0, 1)$ where $r = \min(d_X, d_Y)$. Consider the testing problem described in Section 2 with $\mathcal{H}_X = \mathbb{R}^{d_X}$ and $\mathcal{H}_Z = \mathbb{R}^{d_Z}$ and let $\psi$ be the test function of a size $\alpha$ test over $\mathcal{P}_Q$. Assume that $d_Z \ge r$ and let $d = \lfloor d_Z/r \rfloor$. Letting $\beta$ denote the power of $\psi$ against $Q$, we have
$$\beta \le \alpha + \frac{1}{2}\sqrt{-1 + \prod_{i=1}^r\left((1 + \rho_i)^n \sum_{k=0}^d \frac{\binom{d}{k}}{2^d}\big(1 + (3 - 4k/d)\rho_i\big)^{-n}\right)}$$
and in particular for fixed $n$ the upper bound converges to $\alpha$ as $d_Z$ increases.

Proof. Assume without loss of generality that $d_X \ge d_Y$ and let $d = \lfloor d_Z/r \rfloor$. The proof follows a similar idea to the proof of Theorem 5. In what follows we will consistently consider a different ordering of the variables than the natural one given by $(X, Y, Z)$. We will consider $r + 1$ blocks, where the first $r$ blocks are $(X_i, Y_i, Z_{(i-1)d+1}, \ldots, Z_{id})$ for $i \in \{1, \ldots, r\}$ and the final block consists of the remaining $X$ and $Z$ variables. When we consider $n$ i.i.d. copies, we will again reorder the variables such that we consider each block separately. As a consequence of doing this, the covariance matrix of $n$ i.i.d. copies under $Q$, $\Xi_Q$, can be written as a block-diagonal matrix with $r$ $n(d+2) \times n(d+2)$ blocks $\Gamma_{Q,i}$ and a final identity matrix block. Each of the $\Gamma_{Q,i}$'s is again a block-diagonal matrix consisting of $n$ identical blocks $\Sigma_{Q,i}$ of the form
$$\Sigma_{Q,i} = \begin{pmatrix} \begin{pmatrix} 1 & \rho_i \\ \rho_i & 1 \end{pmatrix} & 0 \\ 0 & I_d \end{pmatrix}.$$
Let now $\mathcal{T} = (\{-1, 1\}^d)^r$ and for each $\tau = (\tau_1, \ldots, \tau_r) \in \mathcal{T}$ let $P_\tau$ denote the Gaussian distribution consisting of $n$ i.i.d. copies of jointly Gaussian $(X, Y, Z)$ where $X$, $Y$ and $Z$ are mean zero with identity covariance, $\mathrm{Cov}(X, Y) = R$ and $\mathrm{Cov}(X, Z) = \mathrm{Cov}(Y, Z) = 0$ except for
$$\mathrm{Cov}(X_i, (Z_{(i-1)d+1}, \ldots, Z_{id})) = \mathrm{Cov}(Y_i, (Z_{(i-1)d+1}, \ldots, Z_{id})) = \sqrt{\rho_i/d}\,\tau_i^T$$
for $i \in \{1, \ldots, r\}$. Arranging the random variables as before, the covariance matrix of $n$ i.i.d. copies under $P_\tau$, $\Xi_\tau$, is a block-diagonal matrix with $r$ $n(d+2) \times n(d+2)$ blocks $\Gamma_{\tau,i}$ and a final identity matrix block. Each of the $\Gamma_{\tau,i}$'s is again a block-diagonal matrix consisting of $n$ identical blocks $\Sigma_{\tau,i}$ of the form
$$\Sigma_{\tau,i} = \begin{pmatrix} \begin{pmatrix} 1 & \rho_i \\ \rho_i & 1 \end{pmatrix} & \sqrt{\rho_i/d}\begin{pmatrix} \tau_i^T \\ \tau_i^T \end{pmatrix} \\ \sqrt{\rho_i/d}\begin{pmatrix} \tau_i & \tau_i \end{pmatrix} & I_d \end{pmatrix}.$$
Clearly $X \perp\!\!\!\perp Y \mid Z$ under $P_\tau$ for every $\tau \in \mathcal{T}$ and thus, letting
$$P := \frac{1}{2^{dr}}\sum_{\tau \in \mathcal{T}} P_\tau,$$
we note that $P$ is in the closed convex hull of the null distributions. Letting $f$ be the density of $P$ with respect to $Q$, we see that
$$f(v) = \frac{1}{2^{dr}}\left(\prod_{i=1}^r \frac{1 + \rho_i}{1 - \rho_i}\right)^{n/2}\sum_{\tau \in \mathcal{T}}\exp\left(-\frac{1}{2}v^T(\Xi_\tau^{-1} - \Xi_Q^{-1})v\right),$$
since this is simply the ratio of their respective densities with respect to the Lebesgue measure. We can now repeat the argument of the proof of Theorem 5 to get
$$\int f^2 \, dQ = \frac{1}{2^{2dr}}\prod_{i=1}^r\left(\frac{1 + \rho_i}{(1 - \rho_i)\sqrt{1 - \rho_i^2}}\right)^n\sum_{\tau, \tau' \in \mathcal{T}}\left(\sqrt{\det(\Xi_\tau^{-1} + \Xi_{\tau'}^{-1} - \Xi_Q^{-1})}\right)^{-1}.$$
The determinant can be written as
$$\det(\Xi_\tau^{-1} + \Xi_{\tau'}^{-1} - \Xi_Q^{-1}) = \prod_{i=1}^r \det(\Gamma_{\tau,i}^{-1} + \Gamma_{\tau',i}^{-1} - \Gamma_{Q,i}^{-1})$$
by the block-diagonal structure of the $\Xi$'s. In the proof of Theorem 5, we derive that
$$\det(\Gamma_{\tau,i}^{-1} + \Gamma_{\tau',i}^{-1} - \Gamma_{Q,i}^{-1}) = \left(\frac{(d(1 + \rho_i) - 2\rho_i\langle\tau_i, \tau'_i\rangle)^2}{d^2(1 + \rho_i)(1 - \rho_i)^3}\right)^n.$$
Therefore,
$$\int f^2 \, dQ = \frac{1}{2^{2dr}}\prod_{j=1}^r (1 + \rho_j)^n \sum_{\tau, \tau' \in \mathcal{T}}\prod_{i=1}^r \frac{d^n}{|d(1 + \rho_i) - 2\rho_i\langle\tau_i, \tau'_i\rangle|^n}.$$
Since each factor of the second product only depends on the $i$'th component of $\tau$ and $\tau'$, we can interchange the product and sum and apply the same counting arguments as in Theorem 5 to get that
$$\int f^2 \, dQ = \prod_{i=1}^r\left((1 + \rho_i)^n \sum_{k=0}^d \frac{\binom{d}{k}}{2^d}\big(1 + (3 - 4k/d)\rho_i\big)^{-n}\right)$$
as desired. We can repeat the same SLLN-based limiting arguments as in Theorem 5 to show that as $d$ increases the integral will converge to 1, and hence the power is bounded by the size in the limit.

Having shown that for each $n$ and $d$ we have an upper bound on the power of a Gaussian conditional independence test against a simple alternative, we can now show this also holds for Gaussian conditional independence testing problems against other $Q$.

Lemma 5.
Let $Q \in \mathcal{Q}$ be a distribution consisting of $n$ i.i.d. copies of jointly Gaussian and non-singular $(X, Y, Z)$ on $(\mathbb{R}^{d_X}, \mathbb{R}^{d_Y}, \mathbb{R}^{d_Z})$ for some $d_X, d_Y, d_Z \in \mathbb{N}$. Consider the testing problem described in Section 2 with $\mathcal{H}_X = \mathbb{R}^{d_X}$ and $\mathcal{H}_Z = \mathbb{R}^{d_Z}$ and let $\psi$ be the test function of a size $\alpha$ test over $\mathcal{P}_Q$ with power $\beta$ against $Q$. Then there exist a $d_X \times d_Y$ rectangular diagonal matrix $R$ with diagonal entries $\rho_1, \ldots, \rho_r \in (0, 1)$, a distribution $\tilde{Q}$ consisting of $n$ i.i.d. copies of jointly Gaussian $(\tilde{X}, \tilde{Y}, \tilde{Z})$ where $\tilde{X}$, $\tilde{Y}$ and $\tilde{Z}$ are all mean zero with identity covariance matrix, $\mathrm{Cov}(\tilde{X}, \tilde{Z}) = \mathrm{Cov}(\tilde{Y}, \tilde{Z}) = 0$ and $\mathrm{Cov}(\tilde{X}, \tilde{Y}) = R$, and a size $\alpha$ test over $\mathcal{P}_{\tilde{Q}}$ with power $\beta$ against $\tilde{Q}$.

Proof. Let $\psi$ denote the test function of the test with power $\beta$ against $Q$ and let $\mu$ and $\Sigma$ denote the mean and covariance matrix of $(X, Y, Z)$ under $Q$. We construct a new test with test function $\tilde{\psi}$ performed by first applying a transformation $f$ to each sample of the data and then applying $\psi$. The transformation $f : \mathbb{R}^{d_X + d_Y + d_Z} \to \mathbb{R}^{d_X + d_Y + d_Z}$ is an affine transformation given by $f(v) = Av + \mu$ where
$$A = \begin{pmatrix} D & M \\ 0 & B \end{pmatrix}$$
for a block-diagonal matrix $D$ consisting of a $d_X \times d_X$ matrix $D_X$ and a $d_Y \times d_Y$ matrix $D_Y$, a $(d_X + d_Y) \times d_Z$ matrix $M$ and a full rank $d_Z \times d_Z$ matrix $B$.

Note first that such a transformation preserves conditional independence. Let $(X_0, Y_0, Z_0)$ be jointly Gaussian with $X_0 \perp\!\!\!\perp Y_0 \mid Z_0$, joint mean $\mu_0$ and covariance matrix $\Sigma_0$. The distribution of $(\check{X}_0, \check{Y}_0, \check{Z}_0) := f(X_0, Y_0, Z_0)$ is again Gaussian by the finite-dimensional version of Proposition 7 and has mean $A\mu_0 + \mu$ and covariance
$$A\Sigma_0 A^T = \begin{pmatrix} D\Sigma_{XY}D^T + M\Sigma_{Z,XY}D^T + D\Sigma_{XY,Z}M^T + M\Sigma_Z M^T & D\Sigma_{XY,Z}B^T + M\Sigma_Z B^T \\ B\Sigma_{Z,XY}D^T + B\Sigma_Z M^T & B\Sigma_Z B^T \end{pmatrix},$$
where $\Sigma_{XY} = \mathrm{Cov}((X_0, Y_0))$, $\Sigma_{XY,Z} = \Sigma_{Z,XY}^T = \mathrm{Cov}((X_0, Y_0), Z_0)$ and $\Sigma_Z = \mathrm{Cov}(Z_0)$. Using the finite-dimensional version of Proposition 8, we get that the conditional distribution of $(\check{X}_0, \check{Y}_0)$ given $\check{Z}_0$ is again Gaussian with covariance matrix
$$D\Sigma_{XY}D^T + M\Sigma_{Z,XY}D^T + D\Sigma_{XY,Z}M^T + M\Sigma_Z M^T - (D\Sigma_{XY,Z}B^T + M\Sigma_Z B^T)(B\Sigma_Z B^T)^{-1}(B\Sigma_{Z,XY}D^T + B\Sigma_Z M^T) = D(\Sigma_{XY} - \Sigma_{XY,Z}\Sigma_Z^{-1}\Sigma_{Z,XY})D^T.$$
The matrix $\Sigma_{XY} - \Sigma_{XY,Z}\Sigma_Z^{-1}\Sigma_{Z,XY}$ is the conditional covariance matrix of $(X_0, Y_0)$ given $Z_0$ and is block-diagonal since $X_0 \perp\!\!\!\perp Y_0 \mid Z_0$, by the multivariate analogue of Proposition 6. By the same proposition, since $D$ is block-diagonal, we see that the conditional covariance of $(\check{X}_0, \check{Y}_0)$ given $\check{Z}_0$ is block-diagonal and hence $\check{X}_0 \perp\!\!\!\perp \check{Y}_0 \mid \check{Z}_0$ as desired.

Let now
$$\Sigma_{X|Z}^{-1/2}\Sigma_{XY|Z}\Sigma_{Y|Z}^{-1/2} = USV^T$$
be the singular value decomposition of the normalised conditional covariance of $X$ and $Y$ given $Z$ under $Q$. The normalisation ensures that $S$ is a rectangular diagonal matrix with diagonal entries in the open unit interval. If we let
$$B := \Sigma_Z^{1/2}, \qquad M := \begin{pmatrix} \Sigma_{X,Z}\Sigma_Z^{-1/2} \\ \Sigma_{Y,Z}\Sigma_Z^{-1/2} \end{pmatrix}, \qquad D := \begin{pmatrix} \Sigma_{X|Z}^{1/2}U & 0 \\ 0 & \Sigma_{Y|Z}^{1/2}V \end{pmatrix}, \qquad R := S$$
and $(\check{X}, \check{Y}, \check{Z}) = f((\tilde{X}, \tilde{Y}, \tilde{Z}))$ where $(\tilde{X}, \tilde{Y}, \tilde{Z}) \sim \tilde{Q}$, then Proposition 7 yields that $(\check{X}, \check{Y}, \check{Z}) \sim Q$ and hence, when applying $\psi$, we have power $\beta$ by assumption. Since $A$ also transforms a null distribution with identity covariance into a null distribution where $Z$ has mean $\mu_Z$ and covariance $\Sigma_Z$, we have the desired result.

A.3 Hardness of infinite-dimensional Hilbertian Gaussian conditional independence testing

In this section we consider the testing problem described in Section 2 with $\mathcal{H}_X$ and $\mathcal{H}_Z$ infinite-dimensional and separable. We will show that the testing problem against $Q$ is hard for any $Q \in \mathcal{Q}$. In particular, this includes the typical functional data setting where $\mathcal{H}_Z = L^2[0, 1]$.

A.3.1 Preliminary results
In this section, we consider finite-dimensional $\mathcal{H}_X$ and infinite-dimensional $\mathcal{H}_Z$. We will need a lemma using the theory of conditional Hilbertian Gaussian distributions from Section A.1.

Lemma 6.
Let $(X, Y, Z)$ be jointly Gaussian on $\mathbb{R}^{d_X} \times \mathbb{R}^{d_Y} \times \mathcal{H}$ and assume that the covariance operator of $Z$ is injective. Then there exists a basis $(e_k)_{k \in \mathbb{N}}$ of $\mathcal{H}$ such that
$$(X, Y) \perp\!\!\!\perp Z_{d_X + d_Y + 1}, \ldots \mid Z_1, \ldots, Z_{d_X}, Z_{d_X + 1}, \ldots, Z_{d_X + d_Y},$$
where $Z_k := \langle Z, e_k \rangle$.

Proof. Note that $\mathbb{R}^{d_X} \times \mathbb{R}^{d_Y} \times \mathcal{H}$ is itself a Hilbert space and decompose it as $(\mathbb{R}^{d_X} \times \mathbb{R}^{d_Y}) \oplus \mathcal{H}$. Let $C_Z := \mathrm{Cov}(Z)$, $C_{(X,Y)} := \mathrm{Cov}((X, Y))$ (the covariance of the joint vector $(X, Y)$) and $C_{(X,Y),Z} := \mathrm{Cov}((X, Y), Z)$. We can apply Proposition 8 to see that $(X, Y)$ conditional on $Z$ is Gaussian with mean $C_{(X,Y),Z}C_Z^\dagger Z$ and covariance operator $C_{(X,Y)} - C_{(X,Y),Z}C_Z^\dagger C_{(X,Y),Z}^*$. The operator $A := C_{(X,Y),Z}C_Z^\dagger$ maps from $\mathcal{H}$ to $\mathbb{R}^{d_X} \times \mathbb{R}^{d_Y}$ and thus is at most a rank $d_X + d_Y$ operator. By Hsing and Eubank [9, Theorem 3.3.7 7.] this implies that the rank of $A^*$ is also at most $d_X + d_Y$. Furthermore, Hsing and Eubank [9, Theorem 3.3.7 6.] yields that $\mathcal{H} = \mathrm{Ker}(A) \oplus \mathrm{Im}(A^*)$. Using this decomposition we can write $Z = (Z_{\mathrm{Ker}(A)}, Z_{\mathrm{Im}(A^*)})$ and note that by construction $AZ = AZ_{\mathrm{Im}(A^*)}$, thus the conditional distribution of $(X, Y)$ given $Z$ only depends on $Z_{\mathrm{Im}(A^*)}$. In total, we have shown by Proposition 5 that $(X, Y) \perp\!\!\!\perp Z_{\mathrm{Ker}(A)} \mid Z_{\mathrm{Im}(A^*)}$. Letting $r$ denote the rank of $A^*$, if we start with a basis for $\mathrm{Im}(A^*)$ and append vectors to form a basis for $\mathcal{H}$ using the Gram–Schmidt procedure, we get a basis where
$$(X, Y) \perp\!\!\!\perp Z_{r+1}, \ldots \mid Z_1, \ldots, Z_r.$$
Since $r \le d_X + d_Y$, the weak union property of conditional independence yields
$$(X, Y) \perp\!\!\!\perp Z_{d_X + d_Y + 1}, \ldots \mid Z_1, \ldots, Z_{d_X}, Z_{d_X + 1}, \ldots, Z_{d_X + d_Y},$$
as desired.

Using the lemma and the power bound from the previous section, we can prove the hardness result for finite-dimensional $\mathcal{H}_X$ and $\mathcal{H}_Y$.

Theorem 7.
Let $Q \in \mathcal{Q}$ be a distribution consisting of $n$ i.i.d. copies of jointly Gaussian and non-singular $(X, Y, Z)$ on $(\mathbb{R}^{d_X}, \mathbb{R}^{d_Y}, \mathcal{H}_Z)$ for $d_X, d_Y \in \mathbb{N}$ and any infinite-dimensional and separable $\mathcal{H}_Z$. Consider the testing problem described in Section 2 with $\mathcal{H}_X = \mathbb{R}^{d_X}$ and $\mathcal{H}_Z$ as above and let $\psi$ be the test function of a size $\alpha$ test over $\mathcal{P}_Q$. Then $\psi$ has power at most $\alpha$ against $Q$.

Proof. Assume for contradiction that $\psi$ is a test of size $\alpha$ over $\mathcal{P}_Q$ with power $\alpha + \varepsilon$ for some $\varepsilon > 0$ against $Q$. Let $(X, Y, Z)$ be distributed as one of the $n$ i.i.d. copies constituting $Q$. By Lemma 6, we can express $Z$ in a basis $(e_k)_{k \in \mathbb{N}}$ such that, defining $Z_k = \langle Z, e_k \rangle$, we have
$$(X, Y) \perp\!\!\!\perp Z_{d_X + d_Y + 1}, \ldots \mid Z_1, \ldots, Z_{d_X}, Z_{d_X + 1}, \ldots, Z_{d_X + d_Y}.$$
By the weak union property of conditional independence, this implies that
$$(X, Y) \perp\!\!\!\perp Z_{d+1}, \ldots \mid Z_1, \ldots, Z_d$$
for any $d \ge d_X + d_Y$.

Choose now an arbitrary $d \ge d_X + d_Y$ and let $\tilde{Q}$ denote the distribution of $n$ i.i.d. copies of $(X, Y, Z_1, \ldots, Z_d)$ under $Q$. Consider the testing problem described in Section 2 with $\mathcal{H}_X = \mathbb{R}^{d_X}$ and $\mathcal{H}_Z = \mathbb{R}^d$. We can construct a test in this setting by defining new observations $(\check{X}, \check{Y}, \check{Z})$ with values in $(\mathbb{R}^{d_X}, \mathbb{R}^{d_Y}, \mathcal{H}_Z)$ and applying $\psi$. We form the new observations by setting $\check{X} := \tilde{X}$, $\check{Y} := \tilde{Y}$ and $\check{Z} := (\tilde{Z}_1, \ldots, \tilde{Z}_d, Z^\circ_{d+1}, Z^\circ_{d+2}, \ldots)$, where $Z^\circ_{d+1}, Z^\circ_{d+2}, \ldots$ are sampled from the conditional distribution $Z_{d+1}, Z_{d+2}, \ldots \mid Z_1 = \tilde{Z}_1, \ldots, Z_d = \tilde{Z}_d$. If the original sample is from a distribution in $\mathcal{P}_{\tilde{Q}}$ then the modified sample will be from a null distribution in $\mathcal{P}_Q$, thus the test has size $\alpha$ over $\mathcal{P}_{\tilde{Q}}$. Similarly, if $(\tilde{X}, \tilde{Y}, \tilde{Z}) \sim \tilde{Q}$, the modified sample will have distribution $Q$ and hence the test has power $\alpha + \varepsilon$ against $\tilde{Q}$.

By Lemma 5 this implies the existence of a $d_X \times d_Y$ rectangular diagonal matrix $R$ with diagonal entries in the open unit interval, a Gaussian distribution $Q'$ on $(\mathbb{R}^{d_X}, \mathbb{R}^{d_Y}, \mathbb{R}^d)$ where, if $(X', Y', Z') \sim Q'$, then $X'$, $Y'$ and $Z'$ are mean zero with identity covariance matrix, $\mathrm{Cov}(X', Z') = \mathrm{Cov}(Y', Z') = 0$ and $\mathrm{Cov}(X', Y') = R$, and a test of size $\alpha$ over $\mathcal{P}_{Q'}$ with power $\alpha + \varepsilon$ against $Q'$. Since $d$ was arbitrary, this contradicts Theorem 6.

A.3.2 Proofs of Theorem 1 and Proposition 1
In this section we prove Theorem 1 and Proposition 1. We do this by extending the results from the previous section to the situation where at most one of $X$ and $Y$ is infinite-dimensional.

Lemma 7.
Let $(X, Y, Z)$ be jointly Gaussian on $\mathbb{R}^{d_X} \times \mathcal{H}_Y \times \mathcal{H}_Z$ and assume that the covariance operator of $(Y, Z)$ is injective. Then there exists a basis $(e_k)_{k \in \mathbb{N}}$ of $\mathcal{H}_Y$ such that
$$X \perp\!\!\!\perp Y_{d_X + 1}, \ldots \mid Y_1, \ldots, Y_{d_X}, Z,$$
where $Y_k := \langle Y, e_k \rangle$.

Proof. Note that $\mathbb{R}^{d_X} \times \mathcal{H}_Y \times \mathcal{H}_Z$ is again a Hilbert space and decompose it as $\mathbb{R}^{d_X} \oplus (\mathcal{H}_Y \times \mathcal{H}_Z)$. Let $C_{(Y,Z)} := \mathrm{Cov}((Y, Z))$ (the covariance of the joint vector $(Y, Z)$), $C_X := \mathrm{Cov}(X)$ and $C_{X,(Y,Z)} := \mathrm{Cov}(X, (Y, Z))$. We can apply Proposition 8 to see that $X$ conditional on $(Y, Z)$ is Gaussian with mean $C_{X,(Y,Z)}C_{(Y,Z)}^\dagger (Y, Z)$ and covariance operator $C_X - C_{X,(Y,Z)}C_{(Y,Z)}^\dagger C_{X,(Y,Z)}^*$. The operator $A = C_{X,(Y,Z)}C_{(Y,Z)}^\dagger$ maps from $\mathcal{H}_Y \times \mathcal{H}_Z$ to $\mathbb{R}^{d_X}$ and thus is at most a rank $d_X$ operator. By Hsing and Eubank [9, Theorem 3.3.7 7.] this implies that the rank of $A^*$ is also at most $d_X$. Furthermore, Hsing and Eubank [9, Theorem 3.3.7 6.] yields that $\mathcal{H}_Y \times \mathcal{H}_Z = \mathrm{Ker}(A) \oplus \mathrm{Im}(A^*)$. Using this decomposition we can write $(Y, Z) = ((Y, Z)_{\mathrm{Ker}(A)}, (Y, Z)_{\mathrm{Im}(A^*)})$ and note that by construction $A(Y, Z) = A(Y, Z)_{\mathrm{Im}(A^*)}$, thus the conditional distribution of $X$ given $(Y, Z)$ only depends on $(Y, Z)_{\mathrm{Im}(A^*)}$. In total, we have shown by Proposition 5 that $X \perp\!\!\!\perp (Y, Z)_{\mathrm{Ker}(A)} \mid (Y, Z)_{\mathrm{Im}(A^*)}$, which implies by the weak union property of conditional independence that
$$X \perp\!\!\!\perp Y_{\mathrm{Ker}(A)} \mid Y_{\mathrm{Im}(A^*)}, Z.$$
Any basis of $\mathrm{Im}(A^*)$ will consist of at most $d_X$ elements. Forming the span of the $\mathcal{H}_Y$-components of the basis vectors will yield a subspace of $\mathcal{H}_Y$ that contains the projection onto $\mathcal{H}_Y$ of $\mathrm{Im}(A^*)$. Thus, letting $r$ denote the rank of $A^*$, we can append vectors and form a basis for $\mathcal{H}_Y$ using the Gram–Schmidt procedure to get a basis where
$$X \perp\!\!\!\perp Y_{r+1}, \ldots \mid Y_1, \ldots, Y_r, Z.$$
Since $r \le d_X$, the weak union property of conditional independence yields
$$X \perp\!\!\!\perp Y_{d_X + 1}, \ldots \mid Y_1, \ldots, Y_{d_X}, Z$$
as desired.

We are now ready to prove Theorem 1.

Proof.
Assume without loss of generality that $\mathcal{H}_X$ is finite-dimensional; thus $\mathcal{H}_X$ is isomorphic to a real vector space and we will instead denote $\mathcal{H}_X = \mathbb{R}^{d_X}$ where $d_X$ is the dimension of $\mathcal{H}_X$.

Assume for contradiction that $\psi$ is a test of size $\alpha$ with power $\alpha + \varepsilon$ for some $\varepsilon > 0$ against $Q$. Let $(X, Y, Z)$ be distributed as one of the $n$ i.i.d. copies constituting $Q$. By Lemma 7 we can express $Y$ in a basis $(e_k)_{k \in \mathbb{N}}$ such that, defining $Y_k = \langle Y, e_k \rangle$, we have
$$X \perp\!\!\!\perp Y_{d_X + 1}, \ldots \mid Y_1, \ldots, Y_{d_X}, Z.$$
By the weak union property of conditional independence, this implies that
$$X \perp\!\!\!\perp Y_{d+1}, \ldots \mid Y_1, \ldots, Y_d, Z$$
for any $d \ge d_X$.

Choose now an arbitrary $d \ge d_X$ and let $\tilde{Q}$ denote the distribution of $n$ i.i.d. copies of $(X, Y_1, \ldots, Y_d, Z)$ under $Q$. Consider the testing problem described in Section 2 with $\mathcal{H}_X = \mathbb{R}^{d_X}$ and $\mathcal{H}_Z$ as above. We can construct a test in this setting by defining new observations $(\check{X}, \check{Y}, \check{Z})$ with values in $(\mathbb{R}^{d_X}, \mathcal{H}_Y, \mathcal{H}_Z)$ and applying $\psi$. We form the new observations by setting $\check{X} := \tilde{X}$, $\check{Z} := \tilde{Z}$ and $\check{Y} := (\tilde{Y}_1, \ldots, \tilde{Y}_d, Y^\circ_{d+1}, Y^\circ_{d+2}, \ldots)$, where $Y^\circ_{d+1}, Y^\circ_{d+2}, \ldots$ are sampled from the conditional distribution $Y_{d+1}, Y_{d+2}, \ldots \mid Y_1 = \tilde{Y}_1, \ldots, Y_d = \tilde{Y}_d, Z = \tilde{Z}$. If the original sample is from a distribution in $\mathcal{P}_{\tilde{Q}}$, then the modified sample will be from a null distribution in $\mathcal{P}_Q$, thus the test has size $\alpha$ over $\mathcal{P}_{\tilde{Q}}$. Similarly, if $(\tilde{X}, \tilde{Y}, \tilde{Z}) \sim \tilde{Q}$, the modified sample will have distribution $Q$ and hence the test has power $\alpha + \varepsilon$ against the distribution of $(X, Y_1, \ldots, Y_d, Z)$. But this contradicts Theorem 7.

A similar strategy can be employed to prove Proposition 1.

Proof.
We can repeat the arguments of Theorem 7 and Theorem 1 without using Lemma 6 and Lemma 7, since we can use the basis $(e_k)_{k \in \mathbb{N}}$ instead.

B Uniform convergence of random variables
In this section we develop some background theory that will be useful when considering simultaneous convergence of sequences with varying distributions. In particular, we are interested in the convergence of a sequence of random variables $(X_n)_{n \in \mathbb{N}}$ defined on a measurable space $(\Omega, \mathcal{F})$ with a family of probability measures $(P_\theta)_{\theta \in \Theta}$. For each $\theta \in \Theta$ the distribution of $(X_n)_{n \in \mathbb{N}}$ will change as the background measure $P_\theta$ changes. We are also interested in the convergence of $\theta$-dependent functions of $X_n$, such as the conditional expectation with respect to $P_\theta$ of $X_n$ given a sub-$\sigma$-algebra $\mathcal{D}$ of $\mathcal{F}$. To allow for such considerations, the definitions given here will be more general than in Section 4 and will allow for a family of random variables $(X_{n,\theta})_{n \in \mathbb{N}, \theta \in \Theta}$ to converge to a family of random variables $(X_\theta)_{\theta \in \Theta}$.

In this section we extend previous work by Kasy [10] and Bengs and Holzmann [1] to Hilbertian and Banachian random variables and also add further characterisations of their central assumptions for families of real-valued random variables. Unless stated otherwise, we consider the following setup for the remainder of this section. Let $(\Omega, \mathcal{F})$ be a measurable space, $(P_\theta)_{\theta \in \Theta}$ a family of probability measures on $(\Omega, \mathcal{F})$ where $\Theta$ is any set, and $(\mathcal{B}, \mathcal{B}(\mathcal{B}))$ a separable Banach space with its Borel $\sigma$-algebra. Let $(X_{n,\theta})_{n \in \mathbb{N}, \theta \in \Theta}$ and $(X_\theta)_{\theta \in \Theta}$ be families of random variables defined on $(\Omega, \mathcal{F})$ with values in $\mathcal{B}$. All additional random variables are also defined on $(\Omega, \mathcal{F})$. We write $E_\theta$ for the expectation with respect to $P_\theta$.

Definition 3 (Uniform convergence of random variables). (i) We say that $X_{n,\theta}$ converges uniformly in distribution over $\Theta$ to $X_\theta$ and write $X_{n,\theta} \overset{\mathcal{D}}{\Rightarrow}_\Theta X_\theta$ if
$$\lim_{n \to \infty} \sup_{\theta \in \Theta} d_{\mathrm{BL}}^\theta(X_{n,\theta}, X_\theta) = 0,$$
where
$$d_{\mathrm{BL}}^\theta(X_{n,\theta}, X_\theta) := \sup_{f \in \mathrm{BL}} |E_\theta(f(X_{n,\theta})) - E_\theta(f(X_\theta))|,$$
and $\mathrm{BL}$ denotes the set of all functions $f : \mathcal{B} \to [-1, 1]$ that are Lipschitz with constant at most 1. We write $X_{n,\theta} \overset{\mathcal{D}}{\Rightarrow} X_\theta$ and simply say that $X_{n,\theta}$ converges uniformly in distribution to $X_\theta$ when $\Theta$ is clear from the context. When considering collections of random variables that do not depend on $\theta$ except through the measure on the background space, we simply write $X_n \overset{\mathcal{D}}{\Rightarrow} X$.

(ii) We say that $X_{n,\theta}$ converges uniformly in probability over $\Theta$ to $X_\theta$ and write $X_{n,\theta} \overset{P}{\Rightarrow}_\Theta X_\theta$ if, for any $\varepsilon > 0$,
$$\lim_{n \to \infty} \sup_{\theta \in \Theta} P_\theta(\|X_{n,\theta} - X_\theta\| \ge \varepsilon) = 0.$$
We write $X_{n,\theta} \overset{P}{\Rightarrow} X_\theta$ and simply say that $X_{n,\theta}$ converges uniformly in probability to $X_\theta$ when $\Theta$ is clear from the context. When considering collections of random variables that do not depend on $\theta$ except through the measure on the background space, we simply write $X_n \overset{P}{\Rightarrow} X$.

By slight abuse of notation we write $X_{n,\theta} \overset{\mathcal{D}}{\Rightarrow} 0$ or $X_{n,\theta} \overset{P}{\Rightarrow} 0$ if $X_{n,\theta}$ converges uniformly to the family of random variables $X_\theta$ that is equal to 0 for all $\omega \in \Omega$ and any $\theta \in \Theta$. Note that if $(P_\theta)_{\theta \in \Theta}$ contains a single element, we recover the standard definitions of convergence in distribution and probability. We have the following helpful characterisations of the two modes of uniform convergence.

Proposition 9. (i) $X_{n,\theta} \overset{\mathcal{D}}{\Rightarrow} X_\theta$ if and only if for any sequence $(\theta_n)_{n \in \mathbb{N}} \subset \Theta$,
$$\lim_{n \to \infty} d_{\mathrm{BL}}^{\theta_n}(X_{n,\theta_n}, X_{\theta_n}) = 0.$$

(ii) $X_{n,\theta} \overset{P}{\Rightarrow} X_\theta$ if and only if for any sequence $(\theta_n)_{n \in \mathbb{N}} \subset \Theta$ and any $\varepsilon > 0$,
$$\lim_{n \to \infty} P_{\theta_n}(\|X_{n,\theta_n} - X_{\theta_n}\| \ge \varepsilon) = 0.$$

Proof.
Proof. The proof given in Kasy [10, Lemma 1] also works in the Banachian case.

In the remainder of this section we derive various properties of uniform convergence in probability and distribution that are analogous to the well-known properties of non-uniform convergence. In particular, we first consider a uniform version of the continuous mapping theorem, which relies on stronger versions of continuity.
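Before turning to these results, a small numerical illustration of Definition 3(ii) may be helpful. The sketch below is our own (the Gaussian family, grids and sample sizes are arbitrary choices, and the tail probabilities are computed in closed form rather than simulated): under $P_\theta$, the sample mean of $n$ i.i.d. $N(0,\theta^2)$ variables converges to $0$ in $P_\theta$-probability for every fixed $\theta$, but the supremum over $\theta$ vanishes only when the variances are bounded.

```python
import numpy as np
from scipy.stats import norm

# Under P_theta, X_n = mean of n i.i.d. N(0, theta^2), so X_n ~ N(0, theta^2 / n)
# and P_theta(|X_n| >= eps) = 2 * (1 - Phi(eps * sqrt(n) / theta)) exactly.
def tail_prob(n, theta, eps=0.1):
    return 2 * norm.sf(eps * np.sqrt(n) / theta)

for n in [10, 100, 1000]:
    bounded = max(tail_prob(n, th) for th in np.linspace(0.01, 1.0, 200))
    unbounded = max(tail_prob(n, th) for th in np.geomspace(0.01, 1e4, 200))
    print(f"n={n:5d}  sup over Theta=(0,1]: {bounded:.4f}   "
          f"sup over unbounded Theta: {unbounded:.4f}")
# The first supremum tends to 0 (uniform convergence in probability over a
# bounded family); the second stays near 1, even though X_n -> 0 in
# P_theta-probability for each fixed theta.
```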
Proposition 10.
Let $\psi : \mathcal{B} \to \tilde{\mathcal{B}}$, where $\mathcal{B}$ and $\tilde{\mathcal{B}}$ are Banach spaces.
(i) If $X_{n,\theta} \overset{D}{\Rightarrow} X_\theta$ and $\psi$ is Lipschitz-continuous, then $\psi(X_{n,\theta}) \overset{D}{\Rightarrow} \psi(X_\theta)$.
(ii) If $X_{n,\theta} \overset{P}{\Rightarrow} X_\theta$ and $\psi$ is uniformly continuous, then $\psi(X_{n,\theta}) \overset{P}{\Rightarrow} \psi(X_\theta)$.

Proof. The proof in Kasy [10, Theorem 1] also works in the Banachian case.

An alternative assumption that allows mere continuity of $\psi$ to be sufficient is tightness of the family of pushforward measures $(X_\theta(P_\theta))_{\theta\in\Theta}$.

Definition 4.
Let $(\mu_\theta)_{\theta\in\Theta}$ be a family of probability measures on $\mathcal{B}$.
(i) $(\mu_\theta)_{\theta\in\Theta}$ is said to be tight if for any $\varepsilon > 0$ there exists a compact set $K$ such that $\sup_{\theta\in\Theta} \mu_\theta(K^c) < \varepsilon$. $(X_\theta)_{\theta\in\Theta}$ is said to be uniformly tight with respect to $\Theta$ if the family of pushforward measures $(X_\theta(P_\theta))_{\theta\in\Theta}$ is tight. If $\Theta$ is clear from the context, we simply say that $(X_\theta)_{\theta\in\Theta}$ is uniformly tight.
(ii) $(X_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$ is said to be sequentially tight with respect to $\Theta$ if for any sequence $(\theta_n)_{n\in\mathbb{N}} \subseteq \Theta$ the sequence of pushforward measures $(X_{n,\theta_n}(P_{\theta_n}))_{n\in\mathbb{N}}$ is tight. If $\Theta$ is clear from the context, we simply say that $(X_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$ is sequentially tight.
(iii) $(\mu_\theta)_{\theta\in\Theta}$ is said to be relatively compact if for any sequence $(\theta_n)_{n\in\mathbb{N}}$ there exists a subsequence $(\theta_{k(n)})_{n\in\mathbb{N}}$, where $k : \mathbb{N} \to \mathbb{N}$ is strictly increasing, such that $\mu_{\theta_{k(n)}}$ converges weakly to some measure $\mu$, which is not necessarily in the family $(\mu_\theta)_{\theta\in\Theta}$.

Prohorov's theorem states that tightness implies relative compactness and that the two notions are equivalent on separable and complete metric spaces; in this work we therefore use the terms interchangeably, since we only consider separable Banach and Hilbert spaces. With a uniform tightness assumption, we can perform continuous operations and preserve uniform convergence in probability just as in the non-uniform setting.
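As a concrete check of Definition 4(i), consider again a Gaussian scale family (this is our own illustration with arbitrarily chosen parameter ranges): the compact sets $K = [-M, M]$ witness tightness of $(N(0,\theta^2))_{\theta\in(0,1]}$, whereas no compact set works uniformly once the variances are unbounded.

```python
import numpy as np
from scipy.stats import norm

# sup over theta of mu_theta([-M, M]^c) for mu_theta = N(0, theta^2);
# for fixed M the supremum is attained at the largest theta in the grid.
def sup_outside(thetas, M):
    return max(2 * norm.sf(M / th) for th in thetas)

bounded = np.linspace(0.1, 1.0, 100)      # Theta = (0, 1]: tight family
unbounded = np.geomspace(0.1, 1e3, 100)   # unbounded variances: not tight
for M in [1, 3, 5, 10]:
    print(f"M={M:3d}  bounded family: {sup_outside(bounded, M):.2e}   "
          f"unbounded family: {sup_outside(unbounded, M):.3f}")
# For the bounded family the supremum can be made arbitrarily small by
# enlarging M; for the unbounded family it stays close to 1.
```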
Proposition 11.
Let $(X_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$ and $(X_\theta)_{\theta\in\Theta}$ be random variables taking values in a Banach space $\mathcal{B}_X$, and similarly let $(Y_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$ and $(Y_\theta)_{\theta\in\Theta}$ be random variables in a Banach space $\mathcal{B}_Y$. Assume that $X_{n,\theta} \overset{P}{\Rightarrow} X_\theta$, $Y_{n,\theta} \overset{P}{\Rightarrow} Y_\theta$, and that both $X_\theta$ and $Y_\theta$ are uniformly tight. Then, for any continuous function $\psi : \mathcal{B}_X \times \mathcal{B}_Y \to \mathcal{B}$, where $\mathcal{B}$ is another Banach space, we have $\psi(X_{n,\theta}, Y_{n,\theta}) \overset{P}{\Rightarrow} \psi(X_\theta, Y_\theta)$.

Proof. Let $\varepsilon > 0$; we must show that $\sup_{\theta\in\Theta} P_\theta(\|\psi(X_{n,\theta},Y_{n,\theta}) - \psi(X_\theta,Y_\theta)\| \geq \varepsilon) \to 0$. Since $X_\theta$ and $Y_\theta$ are uniformly tight, for any $\eta > 0$ there exist compact sets $K_X$ and $K_Y$ such that
\[ \sup_{\theta\in\Theta} P_\theta(X_\theta \notin K_X) < \eta/4 \quad \text{and} \quad \sup_{\theta\in\Theta} P_\theta(Y_\theta \notin K_Y) < \eta/4. \]
By the Heine–Cantor theorem, $\psi$ is uniformly continuous on $K_X \times K_Y$, so there exists $\delta > 0$ such that $\|x - x'\| + \|y - y'\| < \delta$ implies $\|\psi(x,y) - \psi(x',y')\| < \varepsilon$. We thus have
\begin{align*}
\sup_{\theta\in\Theta} P_\theta(\|\psi(X_{n,\theta},Y_{n,\theta}) - \psi(X_\theta,Y_\theta)\| \geq \varepsilon) \leq{}& \sup_{\theta\in\Theta} P_\theta(X_\theta \notin K_X) + \sup_{\theta\in\Theta} P_\theta(Y_\theta \notin K_Y) \\
&+ \sup_{\theta\in\Theta} P_\theta(\|X_{n,\theta}-X_\theta\| \geq \delta/2) + \sup_{\theta\in\Theta} P_\theta(\|Y_{n,\theta}-Y_\theta\| \geq \delta/2).
\end{align*}
By assumption we can choose $N$ sufficiently large such that for all $n \geq N$, both of the final terms are less than $\eta/4$, resulting in the whole expression being less than $\eta$. As $\eta$ was arbitrary, this proves the result.

Bengs and Holzmann [1] make repeated use of the following concept for many of their results for real-valued random variables.

Definition 5.
A family of probability measures $(\mu_\theta)_{\theta\in\Theta}$ is uniformly absolutely continuous with respect to the measure $\mu$ if for any $\varepsilon > 0$ there exists $\delta > 0$ such that for any Borel set $B$,
\[ \mu(B) < \delta \implies \sup_{\theta\in\Theta} \mu_\theta(B) < \varepsilon. \]
A family of random variables $(X_\theta)_{\theta\in\Theta}$ is uniformly absolutely continuous over $\Theta$ with respect to the measure $\mu$ if the family of pushforward measures $(X_\theta(P_\theta))_{\theta\in\Theta}$ is uniformly absolutely continuous with respect to $\mu$. When $\Theta$ is clear from the context, we simply say that $X_\theta$ is uniformly absolutely continuous with respect to $\mu$.

Uniform absolute continuity has previously been studied in other works, such as those by Bogachev [3, Section 5.6] and Doob [6, Chapter IX, Section 4]. An intuitive view of uniform absolute continuity can be given when $\mu$ is a finite measure. In this case, we can define a pseudometric $d_\mu$ on the Borel sets by $d_\mu(A, B) = \mu(A \triangle B)$, where $A \triangle B$ is the symmetric difference. Uniform absolute continuity is then uniform $d_\mu$-continuity over $\theta$ of the collection of pushforward measures $(X_\theta(P_\theta))_{\theta\in\Theta}$, viewed as mappings from the Borel sets into $\mathbb{R}$.

Another helpful perspective arises when, for each $\theta$, $X_\theta$ has a density $f_\theta$ with respect to a common measure $\mu$. The following proposition shows that $X_\theta$ is uniformly absolutely continuous with respect to $\mu$ if and only if for each $\theta$, $X_\theta$ has a density $f_\theta$ with respect to $\mu$ and the family of densities is uniformly integrable. A convenient sufficient condition for uniform integrability is the existence of $r > 1$ such that $\sup_{\theta\in\Theta} \int f_\theta^r \,\mathrm{d}\mu < \infty$.

Proposition 12. If $(X_\theta)_{\theta\in\Theta}$ is uniformly absolutely continuous with respect to $\mu$, then for each $\theta$, $X_\theta$ has a density $f_\theta$ with respect to $\mu$ and the family $(f_\theta)_{\theta\in\Theta}$ is uniformly integrable with respect to $\mu$. Conversely, if for each $\theta$, $X_\theta$ has a density $f_\theta$ with respect to $\mu$ and the family $(f_\theta)_{\theta\in\Theta}$ is uniformly integrable, then $(X_\theta)_{\theta\in\Theta}$ is uniformly absolutely continuous with respect to $\mu$.

Proof. For the first statement, note that by the Radon–Nikodym theorem, we need to show that for each $\theta$, $\mu(B) = 0$ implies $P_\theta(X_\theta \in B) = 0$ for every Borel measurable $B$. This is immediate from the assumption of uniform absolute continuity (by negation), and so is the uniform integrability of the family $(f_\theta)_{\theta\in\Theta}$. The second statement follows immediately from the definitions of uniform integrability and uniform absolute continuity.
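A quick numerical companion to Proposition 12 (our own illustration; the family and the choice $r=2$ are arbitrary): for the densities $f_\theta$ of $N(0,\theta^2)$ with $\theta\in[1/2,2]$, the sufficient condition $\sup_\theta\int f_\theta^r\,\mathrm{d}\lambda<\infty$ with $r>1$ holds, and the uniform-integrability tail $\sup_\theta\int_{\{f_\theta>M\}}f_\theta\,\mathrm{d}\lambda$ vanishes for large $M$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

thetas = np.linspace(0.5, 2.0, 40)  # a bounded scale family N(0, theta^2)

# Sufficient condition of Proposition 12 with r = 2:
# sup_theta int f_theta^2 dx = sup_theta 1/(2 theta sqrt(pi)) < infinity.
r_moment = max(quad(lambda x, th=th: norm.pdf(x, scale=th) ** 2,
                    -np.inf, np.inf)[0] for th in thetas)
print("sup_theta int f^2 dx =", r_moment)

# Uniform integrability directly: sup_theta int_{f_theta > M} f_theta dx -> 0.
def tail(th, M):
    # f_theta(x) > M holds on a symmetric interval (-x0, x0); empty if M
    # exceeds the density's peak 1/(theta*sqrt(2*pi))
    if M >= norm.pdf(0, scale=th):
        return 0.0
    x0 = th * np.sqrt(-2 * np.log(M * th * np.sqrt(2 * np.pi)))
    return norm.cdf(x0, scale=th) - norm.cdf(-x0, scale=th)

for M in [0.2, 0.5, 1.0]:
    print(f"M={M}: sup_theta tail mass = {max(tail(th, M) for th in thetas):.4f}")
```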
In Bengs and Holzmann [1], uniform absolute continuity is assumed with respect to a probability measure. For uniformly tight Banachian random variables that are uniformly absolutely continuous with respect to a $\sigma$-finite measure $\mu$, we can show that the family is also uniformly absolutely continuous with respect to any $\sigma$-finite measure $\nu$ such that $\mu$ has a continuous density with respect to $\nu$.

Proposition 13. Assume that $(X_\theta)_{\theta\in\Theta}$ is uniformly tight and uniformly absolutely continuous with respect to some $\sigma$-finite measure $\mu$. If $\nu$ is another $\sigma$-finite measure dominating $\mu$ and there exists a continuous Radon–Nikodym derivative of $\mu$ with respect to $\nu$, then $X$ is uniformly absolutely continuous with respect to $\nu$.

Proof. Let $\varepsilon > 0$. Since $(X_\theta)_{\theta\in\Theta}$ is uniformly tight, we can choose a compact set $K$ such that $\sup_{\theta\in\Theta} P_\theta(X_\theta \notin K) < \varepsilon/2$. Then note that for any Borel measurable set $B$,
\[ \sup_{\theta\in\Theta} P_\theta(X_\theta \in B) < \varepsilon/2 + \sup_{\theta\in\Theta} P_\theta(X_\theta \in B \cap K). \]
We thus need to find $\delta$ so that $\nu(B \cap K) < \delta$ implies $\sup_{\theta\in\Theta} P_\theta(X_\theta \in B \cap K) < \varepsilon/2$. Letting $g$ denote the continuous Radon–Nikodym derivative of $\mu$ with respect to $\nu$, we see that
\[ \mu(B \cap K) = \int_{B\cap K} g \,\mathrm{d}\nu \leq \Big(\sup_{x\in K} g(x)\Big)\, \nu(B \cap K). \]
The supremum is finite by the extreme value theorem for continuous functions, since $K$ is compact. If $\sup_{x\in K} g(x) > 0$, choose $\delta' > 0$ from the uniform absolute continuity of $X$ with respect to $\mu$ matching $\varepsilon/2$, and set $\delta = \delta' / \sup_{x\in K} g(x)$. Then for all $B$ with $\nu(B) < \delta$, we have
\[ \delta > \nu(B) \geq \nu(B\cap K) \geq \frac{\mu(B\cap K)}{\sup_{x\in K} g(x)} \implies \mu(B\cap K) < \delta', \]
and thus $\sup_{\theta\in\Theta} P_\theta(X_\theta \in B\cap K) < \varepsilon/2$. If $\sup_{x\in K} g(x) = 0$, any $\delta$ will work, since then $\mu(B\cap K) = 0$ implies $\sup_{\theta\in\Theta} P_\theta(X_\theta \in B\cap K) = 0$.

A consequence of the above result is that uniform absolute continuity with respect to the Lebesgue measure implies uniform absolute continuity with respect to the standard Gaussian measure. This lets us immediately apply many of the results of Bengs and Holzmann [1], such as their Theorem 4.1, when we consider a uniformly tight real-valued random variable that is uniformly absolutely continuous with respect to the Lebesgue measure.

Corollary 1.
A uniformly tight, real-valued family of random variables $(X_\theta)_{\theta\in\Theta}$ is uniformly absolutely continuous with respect to the Lebesgue measure if and only if it is uniformly absolutely continuous with respect to the standard Gaussian measure.

Proof. The statement follows immediately from the equivalence of the standard Gaussian measure and the Lebesgue measure, the continuity of the Gaussian density and of its reciprocal, and Proposition 13.

We will consider sums of real-valued random variables and thus need to determine when such sums are uniformly absolutely continuous with respect to a measure. It turns out that when the random variables are independent and one of the families is uniformly absolutely continuous with respect to the Lebesgue measure, the same is true for the family of sums.
Theorem 8.
Let $(X_\theta)_{\theta\in\Theta}$ and $(Y_\theta)_{\theta\in\Theta}$ be two real-valued families of random variables such that for any $\theta\in\Theta$, $X_\theta$ and $Y_\theta$ are independent under $P_\theta$. Assume that $(X_\theta)_{\theta\in\Theta}$ is uniformly absolutely continuous with respect to the Lebesgue measure. Then $(X_\theta + Y_\theta)_{\theta\in\Theta}$ is uniformly absolutely continuous with respect to the Lebesgue measure.

Proof. Let $\varepsilon > 0$ and let $\lambda$ denote the Lebesgue measure. We need to find $\delta > 0$ such that for any Borel set $B$ with $\lambda(B) < \delta$, we have $\sup_{\theta\in\Theta} P_\theta(X_\theta + Y_\theta \in B) < \varepsilon$. We can use the independence of $X_\theta$ and $Y_\theta$ to write the probability as a double integral with respect to the pushforward measures $X_\theta(P_\theta)$ and $Y_\theta(P_\theta)$ as follows:
\[ P_\theta(X_\theta + Y_\theta \in B) = \int \mathbf{1}_B(X_\theta(\omega) + Y_\theta(\omega)) \,\mathrm{d}P_\theta(\omega) = \int\!\!\int \mathbf{1}_B(x+y) \,\mathrm{d}X_\theta(P_\theta)(x) \,\mathrm{d}Y_\theta(P_\theta)(y). \]
Note that $\mathbf{1}_B(x+y) = \mathbf{1}_{B-y}(x)$, where $B - y := \{b - y : b \in B\}$, and that, by the translation invariance of the Lebesgue measure, $\lambda(B) = \lambda(B-y)$. As $X_\theta$ is uniformly absolutely continuous with respect to $\lambda$, there exists $\delta$ such that if $\lambda(B) < \delta$ we have
\[ \sup_{\theta\in\Theta} P_\theta(X_\theta + Y_\theta \in B) \leq \sup_{\theta\in\Theta} \int \Big( \sup_{\theta'\in\Theta} \int \mathbf{1}_{B-y}(x) \,\mathrm{d}X_{\theta'}(P_{\theta'})(x) \Big) \,\mathrm{d}Y_\theta(P_\theta)(y) < \sup_{\theta\in\Theta} \int \varepsilon \,\mathrm{d}Y_\theta(P_\theta)(y) = \varepsilon. \]

Thus far, we have not discussed when we can expect uniform convergence in distribution to imply uniform convergence of distribution functions. This is exactly where we need an assumption of uniform absolute continuity. The following result is a modified version of Bengs and Holzmann [1, Theorem 4.1], where our conclusion includes uniform convergence in $x$, rather than convergence for all $x$.

Proposition 14.
Let $(X_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$ and $(X_\theta)_{\theta\in\Theta}$ be real-valued random variables. Assume that $(X_\theta)_{\theta\in\Theta}$ is uniformly absolutely continuous with respect to a continuous probability measure $\mu$. Then $X_{n,\theta} \overset{D}{\Rightarrow} X_\theta$ if and only if
\[ \lim_{n\to\infty} \sup_{x\in\mathbb{R}} \sup_{\theta\in\Theta} \big|P_\theta(X_{n,\theta} \leq x) - P_\theta(X_\theta \leq x)\big| = 0. \tag{25} \]
Proof. See Bengs and Holzmann [1, Theorem 4.1] for a proof that $X_{n,\theta} \overset{D}{\Rightarrow} X_\theta$ if and only if
\[ \lim_{n\to\infty} \sup_{\theta\in\Theta} \big|P_\theta(X_{n,\theta} \leq x) - P_\theta(X_\theta \leq x)\big| = 0 \quad \text{for all } x \in \mathbb{R}. \]
To show that the convergence of the distribution functions is uniform in $x$, we proceed as follows. In view of the uniform absolute continuity of $(X_\theta)_{\theta\in\Theta}$ with respect to $\mu$, for all $\varepsilon > 0$ there exists $\delta > 0$ such that for all Borel sets $B$ with $\mu(B) < \delta$, we have $\sup_{\theta\in\Theta} P_\theta(X_\theta \in B) < \varepsilon$. Let $-\infty = x_0 < x_1 < \cdots < x_m = \infty$ be such that for all $i \in \{1,\ldots,m\}$, $0 < \mu((x_{i-1}, x_i]) < \delta$. We can find such a grid since $\mu$ is a continuous probability measure. For any $\theta$ and $i \in \{1,\ldots,m\}$, we thus have
\[ P_\theta(X_\theta \leq x_i) - P_\theta(X_\theta \leq x_{i-1}) = P_\theta(X_\theta \in (x_{i-1}, x_i]) < \varepsilon. \]
For $x \in (x_{i-1}, x_i]$,
\begin{align*}
\sup_{\theta\in\Theta} \{P_\theta(X_{n,\theta} \leq x) - P_\theta(X_\theta \leq x)\} &\leq \sup_{\theta\in\Theta} \{P_\theta(X_{n,\theta} \leq x_i) - P_\theta(X_\theta \leq x_{i-1})\} \\
&\leq \sup_{\theta\in\Theta} \{P_\theta(X_{n,\theta} \leq x_i) - P_\theta(X_\theta \leq x_i)\} + \varepsilon \\
&\leq \sup_{\theta\in\Theta} \big|P_\theta(X_{n,\theta} \leq x_i) - P_\theta(X_\theta \leq x_i)\big| + \varepsilon,
\end{align*}
and, similarly,
\begin{align*}
\sup_{\theta\in\Theta} \{P_\theta(X_\theta \leq x) - P_\theta(X_{n,\theta} \leq x)\} &\leq \sup_{\theta\in\Theta} \{P_\theta(X_\theta \leq x_i) - P_\theta(X_{n,\theta} \leq x_{i-1})\} \\
&\leq \sup_{\theta\in\Theta} \{P_\theta(X_\theta \leq x_{i-1}) - P_\theta(X_{n,\theta} \leq x_{i-1})\} + \varepsilon \\
&\leq \sup_{\theta\in\Theta} \big|P_\theta(X_\theta \leq x_{i-1}) - P_\theta(X_{n,\theta} \leq x_{i-1})\big| + \varepsilon.
\end{align*}
Thus,
\[ \sup_{x\in\mathbb{R}} \sup_{\theta\in\Theta} \big|P_\theta(X_{n,\theta} \leq x) - P_\theta(X_\theta \leq x)\big| \leq \sup_{i\in\{0,\ldots,m\}} \sup_{\theta\in\Theta} \big|P_\theta(X_{n,\theta} \leq x_i) - P_\theta(X_\theta \leq x_i)\big| + \varepsilon. \]
The first term on the right-hand side goes to $0$ by assumption and $\varepsilon$ was arbitrary, thus proving the uniform convergence.

The final results of this section are uniform versions of Slutsky's lemma, the Law of Large Numbers and the Central Limit Theorem. In the remaining results, uniform tightness will play a crucial role. It is a standard result that if $(X_n)_{n\in\mathbb{N}}$ converges in distribution to $X$, then $(X_n)_{n\in\mathbb{N}}$ is tight. We can show analogously that if $X_{n,\theta} \overset{D}{\Rightarrow} X_\theta$ and $(X_\theta)_{\theta\in\Theta}$ is uniformly tight, then $(X_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$ is sequentially tight.

Proposition 15.
Assume that $(X_\theta)_{\theta\in\Theta}$ is uniformly tight. If $X_{n,\theta} \overset{D}{\Rightarrow} X_\theta$, then $(X_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$ is sequentially tight.

Proof. We prove the contrapositive statement. Assume that there exists a sequence $(\theta_n)_{n\in\mathbb{N}} \subseteq \Theta$ such that $(X_{n,\theta_n}(P_{\theta_n}))_{n\in\mathbb{N}}$ is not tight. Let $Y_n$ be distributed as $X_{n,\theta_n}(P_{\theta_n})$ and $Z_n$ as $X_{\theta_n}(P_{\theta_n})$, both defined on a probability space $(\Omega_0, \mathcal{F}_0, P_0)$. Since $(Y_n)_{n\in\mathbb{N}}$ is not tight, there exists a subsequence $(k(n))_{n\in\mathbb{N}}$, with $k:\mathbb{N}\to\mathbb{N}$ strictly increasing, such that no further subsequence of $(Y_{k(n)})_{n\in\mathbb{N}}$ converges in distribution. Since $(Z_n)_{n\in\mathbb{N}}$ is tight, there exist a strictly increasing $k':\mathbb{N}\to\mathbb{N}$ and a random variable $Z$ such that, writing $m = k\circ k'$, we have $d_{\mathrm{BL}}(Z_{m(n)}, Z) \to 0$. However, since $Y_{k(n)}$ has no weakly convergent subsequence, $d_{\mathrm{BL}}(Y_{m(n)}, Z) \not\to 0$. Thus there exist $\varepsilon > 0$ and $k'':\mathbb{N}\to\mathbb{N}$ such that, writing $l = m\circ k''$, we have for all $n$,
\[ d_{\mathrm{BL}}(Y_{l(n)}, Z) \geq \varepsilon. \]
Next, choose $N$ such that for $n \geq N$ we have $d_{\mathrm{BL}}(Z_{l(n)}, Z) < \varepsilon/2$. Then
\[ d_{\mathrm{BL}}(Z_{l(n)}, Y_{l(n)}) \geq \big|d_{\mathrm{BL}}(Z_{l(n)}, Z) - d_{\mathrm{BL}}(Z, Y_{l(n)})\big| \geq \varepsilon/2 \]
for $n \geq N$. Since $d_{\mathrm{BL}}(Z_{l(n)}, Y_{l(n)}) = d^{\theta_{l(n)}}_{\mathrm{BL}}(X_{l(n),\theta_{l(n)}}, X_{\theta_{l(n)}})$, by Proposition 9 we cannot have $X_{n,\theta} \overset{D}{\Rightarrow} X_\theta$, proving the desired statement.

The previous result will be required when proving the second part of the upcoming uniform version of Slutsky's lemma.

Proposition 16 (Uniform Slutsky's lemma). Let $(X_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$, $(Y_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$ and $(X_\theta)_{\theta\in\Theta}$ be Banachian random variables. Assume that $X_{n,\theta} \overset{D}{\Rightarrow} X_\theta$ and $Y_{n,\theta} \overset{P}{\Rightarrow} 0$. Then the following two statements hold.
(i) $X_{n,\theta} + Y_{n,\theta} \overset{D}{\Rightarrow} X_\theta$.
(ii) If $(Y_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$ is a family of real-valued random variables and $(X_\theta)_{\theta\in\Theta}$ is uniformly tight, then $Y_{n,\theta} X_{n,\theta} \overset{P}{\Rightarrow} 0$.

Proof. We first prove (i), for which we need to show that
\[ \sup_{\theta\in\Theta} d^\theta_{\mathrm{BL}}(X_{n,\theta} + Y_{n,\theta}, X_\theta) \to 0 \quad \text{as } n\to\infty. \]
We have for any $\theta$,
\[ d^\theta_{\mathrm{BL}}(X_{n,\theta} + Y_{n,\theta}, X_\theta) \leq d^\theta_{\mathrm{BL}}(X_{n,\theta} + Y_{n,\theta}, X_{n,\theta}) + d^\theta_{\mathrm{BL}}(X_{n,\theta}, X_\theta), \]
where the second term goes to $0$ uniformly by assumption. It remains to show that the first term goes to $0$ uniformly. For $f \in \mathrm{BL}_1$ we have that for any $\varepsilon > 0$ and $x, y \in \mathcal{B}$, $\|y\| < \varepsilon$ implies $|f(x+y) - f(x)| \leq \varepsilon$. Hence, by using the triangle inequality for the expectation, partitioning the integral and using the uniform continuity above, we get
\[ d^\theta_{\mathrm{BL}}(X_{n,\theta} + Y_{n,\theta}, X_{n,\theta}) \leq \varepsilon + \sup_{f\in\mathrm{BL}_1} \mathbb{E}_\theta \big| [f(X_{n,\theta} + Y_{n,\theta}) - f(X_{n,\theta})] \mathbf{1}_{\{\|Y_{n,\theta}\| > \varepsilon\}} \big|. \]
We can again apply the triangle inequality and recall that $f$ is bounded by $1$, yielding
\[ \sup_{\theta\in\Theta} \sup_{f\in\mathrm{BL}_1} \mathbb{E}_\theta \big| [f(X_{n,\theta} + Y_{n,\theta}) - f(X_{n,\theta})] \mathbf{1}_{\{\|Y_{n,\theta}\| > \varepsilon\}} \big| \leq 2 \sup_{\theta\in\Theta} P_\theta(\|Y_{n,\theta}\| > \varepsilon), \]
which goes to $0$ by assumption. Since $\varepsilon > 0$ was arbitrary, this proves (i).

For (ii), by Proposition 9 it suffices to show that for any sequence $(\theta_n)_{n\in\mathbb{N}} \subseteq \Theta$ and any $\varepsilon > 0$,
\[ P_{\theta_n}(\|Y_{n,\theta_n} X_{n,\theta_n}\| \geq \varepsilon) \to 0 \quad \text{as } n\to\infty, \]
which implies the desired result. Let $\delta > 0$. By Proposition 15, $(X_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$ is sequentially tight, so there exists a compact set $K$ such that $\sup_{n\in\mathbb{N}} P_{\theta_n}(X_{n,\theta_n} \in K^c) \leq \delta/2$. Since $K$ is compact, it is bounded, and thus there exists $M > 0$ such that $x \in K$ implies $\|x\| < M$. By the uniform convergence in probability of $Y_{n,\theta}$ to zero, we can find $N$ such that for all $n \geq N$, $P_{\theta_n}(|Y_{n,\theta_n}| \geq \varepsilon/M) < \delta/2$. Putting things together, we get, for all $n \geq N$,
\begin{align*}
P_{\theta_n}(\|X_{n,\theta_n} Y_{n,\theta_n}\| \geq \varepsilon) &\leq P_{\theta_n}(\|X_{n,\theta_n} Y_{n,\theta_n}\| \geq \varepsilon,\ X_{n,\theta_n} \in K) + P_{\theta_n}(X_{n,\theta_n} \in K^c) \\
&\leq P_{\theta_n}(|Y_{n,\theta_n}| \geq \varepsilon/M) + \sup_{n\in\mathbb{N}} P_{\theta_n}(X_{n,\theta_n} \in K^c) < \delta,
\end{align*}
proving the result.

We will now consider the setting of uniform convergence of averages of i.i.d. random variables; that is, we assume that for each $\theta\in\Theta$ the sequence $(X_{n,\theta})_{n\in\mathbb{N}}$ is i.i.d. and consider the convergence of $n^{-1}\sum_{i=1}^n X_{i,\theta}$. We can prove an analogue of the Law of Large Numbers for uniform convergence in probability.

Proposition 17.
Let $(X_\theta)_{\theta\in\Theta}$ be a family of Banachian random variables with $\mathbb{E}_\theta(X_\theta) = 0$ for all $\theta\in\Theta$ and $\sup_{\theta\in\Theta} \mathbb{E}_\theta(\|X_\theta\|^{1+\eta}) < K$ for some $K, \eta > 0$. Let $(X_n)_{n\in\mathbb{N}}$ be a sequence of random variables such that for each $\theta\in\Theta$, $(X_n)_{n\in\mathbb{N}}$ is i.i.d. with the same distribution as $X_\theta$ under $P_\theta$. Then
\[ \frac{1}{n} \sum_{i=1}^n X_i \overset{P}{\Rightarrow} 0. \]
Proof.
We adapt the argument given in Shah and Peters [19, Lemma 19]. Defining $S_{n,\theta} := n^{-1}\sum_{i=1}^n X_{i,\theta}$, we need to show that for any $\varepsilon > 0$, $\sup_{\theta\in\Theta} P_\theta(\|S_{n,\theta}\| \geq \varepsilon) \to 0$ as $n\to\infty$. To that end, let $M > 0$ and define the truncations $X^{<}_\theta := \mathbf{1}_{\{\|X_\theta\|\leq M\}} X_\theta$ and $X^{>}_\theta := \mathbf{1}_{\{\|X_\theta\| > M\}} X_\theta$, and similarly $S^{<}_{n,\theta}$ and $S^{>}_{n,\theta}$ for the averages of the centred truncations $X^{<}_{i,\theta} - \mathbb{E}_\theta(X^{<}_\theta)$ and $X^{>}_{i,\theta} - \mathbb{E}_\theta(X^{>}_\theta)$; since $\mathbb{E}_\theta(X_\theta) = 0$, we have $S_{n,\theta} = S^{<}_{n,\theta} + S^{>}_{n,\theta}$. By Markov's inequality and the moment assumption, for any $t > 0$,
\[ \sup_{\theta\in\Theta} P_\theta(\|S^{>}_{n,\theta}\| \geq t) \leq \frac{2\sup_{\theta\in\Theta}\mathbb{E}_\theta\|X^{>}_\theta\|}{t} \leq \frac{2\sup_{\theta\in\Theta}\mathbb{E}_\theta\|X_\theta\|^{1+\eta}}{t M^{\eta}} < \frac{2K}{t M^{\eta}}. \]
Thus $\sup_{\theta\in\Theta} P_\theta(\|S^{>}_{n,\theta}\| \geq t)$ can be made arbitrarily small by making $M$ large. Let $\varepsilon, \delta > 0$ and choose $M$ large enough that $\sup_{\theta\in\Theta} P_\theta(\|S^{>}_{n,\theta}\| \geq \varepsilon/2) < \delta/2$. By independence of the summands and the bound $\|X^{<}_\theta\| \leq M$, we have $\mathbb{E}_\theta\|S^{<}_{n,\theta}\|^2 \leq M^2/n$. Choose $N$ large enough that $4M^2/(N\varepsilon^2) < \delta/2$. Then for all $n \geq N$, we have, by Chebyshev's inequality,
\[ \sup_{\theta\in\Theta} P_\theta(\|S_{n,\theta}\| \geq \varepsilon) \leq \sup_{\theta\in\Theta} P_\theta(\|S^{<}_{n,\theta}\| \geq \varepsilon/2) + \sup_{\theta\in\Theta} P_\theta(\|S^{>}_{n,\theta}\| \geq \varepsilon/2) < \frac{4M^2}{n\varepsilon^2} + \frac{\delta}{2} < \delta. \]

Proposition 18. Let $(X_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$ and $(X_\theta)_{\theta\in\Theta}$ be Hilbertian random variables. Assume that
(i) for all $h \in \mathcal{H}$, $\langle X_{n,\theta}, h\rangle \overset{D}{\Rightarrow} \langle X_\theta, h\rangle$,
(ii) $(X_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$ is sequentially tight, and
(iii) $(X_\theta)_{\theta\in\Theta}$ is uniformly tight.
Then $X_{n,\theta} \overset{D}{\Rightarrow} X_\theta$.

Proof. Let $(\theta_n)_{n\in\mathbb{N}} \subseteq \Theta$, and let $Y_n$ have distribution $X_{n,\theta_n}(P_{\theta_n})$ and $Z_n$ have distribution $X_{\theta_n}(P_{\theta_n})$, both defined on a probability space $(\Omega_0,\mathcal{F}_0,P_0)$. Suppose for contradiction that
\[ d^{\theta_n}_{\mathrm{BL}}(X_{n,\theta_n}, X_{\theta_n}) = d_{\mathrm{BL}}(Y_n, Z_n) \not\to 0 \quad \text{as } n\to\infty. \]
Then there exist subsequences of $Y_n$ and $Z_n$ and an $\varepsilon > 0$ such that for all $n$, $d_{\mathrm{BL}}(Y_{k(n)}, Z_{k(n)}) \geq \varepsilon$, where $k:\mathbb{N}\to\mathbb{N}$ is a strictly increasing function. By sequential tightness of $(X_{n,\theta})_{n\in\mathbb{N},\theta\in\Theta}$, there exists a subsequence of $(Y_{k(n)})_{n\in\mathbb{N}}$, represented by the index function $m = k\circ k'$ for a strictly increasing $k':\mathbb{N}\to\mathbb{N}$, such that $(Y_{m(n)})_{n\in\mathbb{N}}$ converges weakly to some random variable $Y$. By uniform tightness of $X_\theta$, there exists a further subsequence of $(Z_{m(n)})_{n\in\mathbb{N}}$, represented by the index function $l = m\circ k''$ for a strictly increasing $k'':\mathbb{N}\to\mathbb{N}$, such that $(Z_{l(n)})_{n\in\mathbb{N}}$ converges weakly to some random variable $Z$. Note that since the range of $l$ is a subset of the range of $m$, $(Y_{l(n)})_{n\in\mathbb{N}}$ also converges weakly to $Y$.

We intend to show that the distributions of $Z$ and $Y$ are equal. The distribution of a Hilbertian random variable is completely determined by the distributions of its linear functionals [9, Theorem 7.1.2], and for any $h\in\mathcal{H}$ and any $n$,
\[ d_{\mathrm{BL}}(\langle Y,h\rangle, \langle Z,h\rangle) \leq d_{\mathrm{BL}}(\langle Y,h\rangle, \langle Y_{l(n)},h\rangle) + d_{\mathrm{BL}}(\langle Y_{l(n)},h\rangle, \langle Z_{l(n)},h\rangle) + d_{\mathrm{BL}}(\langle Z_{l(n)},h\rangle, \langle Z,h\rangle). \]
The first and third terms on the right-hand side go to zero by definition, and the middle term goes to zero by assumption (i); hence $Y$ and $Z$ are equal in distribution. Now,
\[ d_{\mathrm{BL}}(Y_{l(n)}, Z_{l(n)}) \leq d_{\mathrm{BL}}(Y_{l(n)}, Z) + d_{\mathrm{BL}}(Z, Z_{l(n)}), \]
hence we can choose $N$ making $l(N)$ large enough that the right-hand side is smaller than $\varepsilon/2$. This is a contradiction, since we chose $k$ such that $d_{\mathrm{BL}}(Y_{k(n)}, Z_{k(n)}) \geq \varepsilon$ for all $n\in\mathbb{N}$, and $(l(n))_{n\in\mathbb{N}} \subseteq (k(n))_{n\in\mathbb{N}}$.

We can now prove a uniform central limit theorem in Hilbert spaces.

Proposition 19. Let $(X_\theta)_{\theta\in\Theta}$ be Hilbertian random variables with $\mathbb{E}_\theta(X_\theta) = 0$ for all $\theta$ and $\sup_{\theta\in\Theta}\mathbb{E}_\theta(\|X_\theta\|^{2+\eta}) < K$ for some $K,\eta > 0$. Denote by $(\mathcal{C}_\theta)_{\theta\in\Theta}$ the family of covariance operators of $X_\theta$ under each $P_\theta$, i.e. $\mathcal{C}_\theta = \mathbb{E}_\theta(X_\theta \otimes X_\theta)$. Let $(X_n)_{n\in\mathbb{N}}$ be a sequence of random variables such that for each $\theta\in\Theta$, $(X_n)_{n\in\mathbb{N}}$ is i.i.d. with the same distribution as $X_\theta$ under $P_\theta$. Assume further that for some orthonormal basis $(e_k)_{k=1}^\infty$ of $\mathcal{H}$,
\[ \lim_{K\to\infty} \sup_{\theta\in\Theta} \sum_{k=K}^\infty \langle \mathcal{C}_\theta e_k, e_k\rangle = 0. \tag{26} \]
Then
\[ \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i \overset{D}{\Rightarrow} Z, \]
where the distribution of $Z$ under $P_\theta$ is $N(0, \mathcal{C}_\theta)$.

Proof. We intend to apply Proposition 18 and thus check its conditions. For the first condition, let $h\in\mathcal{H}$ be given, let $Y_n = \langle X_n, h\rangle$, and let $Y$ be distributed as $\langle N(0,\mathcal{C}_\theta), h\rangle$ under $P_\theta$, i.e. as $N(0, \langle \mathcal{C}_\theta h, h\rangle)$.
Note that
\[ \Big\langle \frac{1}{\sqrt{n}}\sum_{i=1}^n X_{i,\theta},\, h \Big\rangle = \frac{1}{\sqrt{n}} \sum_{i=1}^n Y_{i,\theta}, \]
hence by Proposition 9 it is sufficient for the first condition that for any $(\theta_n)_{n\in\mathbb{N}} \subseteq \Theta$,
\[ \lim_{n\to\infty} d^{\theta_n}_{\mathrm{BL}}\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^n Y_i,\, Y\Big) = 0. \]
Suppose for contradiction that there exists a sequence $(\theta_n)_{n\in\mathbb{N}}$ such that the limit is not $0$; then there exist $\varepsilon > 0$ and a strictly increasing $m:\mathbb{N}\to\mathbb{N}$ such that
\[ d^{\theta_{m(n)}}_{\mathrm{BL}}\Big(\frac{1}{\sqrt{m(n)}}\sum_{i=1}^{m(n)} Y_i,\, Y\Big) \geq \varepsilon \quad \text{for all } n\in\mathbb{N}. \]
Denoting, for $n\in\mathbb{N}$, $\sigma^2_{\theta_{m(n)}} := \langle \mathcal{C}_{\theta_{m(n)}} h, h\rangle$, we note that the sequence $(\sigma^2_{\theta_{m(n)}})_{n\in\mathbb{N}}$ is bounded by assumption, and hence by the Bolzano–Weierstrass theorem it has a convergent subsequence; that is, there exist $\sigma^2 \geq 0$ and a strictly increasing $m':\mathbb{N}\to\mathbb{N}$ such that, letting $l = m\circ m'$, we have $\sigma^2_{\theta_{l(n)}} \to \sigma^2$. Letting $W$ denote a random variable with distribution $N(0,\sigma^2)$ under every $P_\theta$, Scheffé's lemma implies that
\[ \lim_{n\to\infty} d^{\theta_{l(n)}}_{\mathrm{BL}}(Y, W) = 0. \]
Further, the Lindeberg–Feller theorem [8, Theorem 3.4.10] yields that
\[ \lim_{n\to\infty} d^{\theta_{l(n)}}_{\mathrm{BL}}\Big(\frac{1}{\sqrt{l(n)}}\sum_{i=1}^{l(n)} Y_i,\, W\Big) = 0, \]
since Lyapunov's condition is fulfilled by the uniform bound on the $(2+\eta)$-th moment of $X_\theta$. Since the range of $l$ is contained in the range of $m$, this is a contradiction; hence the first condition is fulfilled.

The third condition follows immediately from assumption (26) by Bogachev [3, Proposition 2.5.2, Lemma 2.7.20]. The second condition follows from the same assumption and results by considering $S_{n,\theta} := n^{-1/2}\sum_{i=1}^n X_{i,\theta}$ for $n\in\mathbb{N}$, since $\mathbb{E}_\theta\|S_{n,\theta}\|^2$ is bounded by the same constant bounding $\mathbb{E}_\theta\|X_\theta\|^2$ and since
\[ \mathrm{Cov}_\theta(S_{n,\theta}) = \frac{1}{n}\sum_{i=1}^n \mathrm{Cov}_\theta(X_{i,\theta}) = \mathcal{C}_\theta. \]
This shows that the family of measures $(S_{n,\theta}(P_\theta))_{n\in\mathbb{N},\theta\in\Theta}$ is tight, which implies the second condition.
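Proposition 19 can be sanity-checked numerically by truncating the Hilbert space to finitely many basis coefficients. The sketch below is our own illustration (the eigenvalue sequence, sample sizes and the non-Gaussian coordinate distributions are arbitrary choices standing in for condition (26)); it compares the simulated distribution of $\|n^{-1/2}\sum_i X_i\|$ with that of $\|N(0,\mathcal{C})\|$.

```python
import numpy as np
rng = np.random.default_rng(0)

# Finite-dimensional sketch: represent X by its first d basis coefficients
# with summable variances (a stand-in for the tail condition (26)).
d, n, reps = 50, 100, 1000
lam = 1.0 / np.arange(1, d + 1) ** 2          # assumed eigenvalues of C

# Non-Gaussian coordinates (centred exponentials) with covariance diag(lam)
X = (rng.exponential(1.0, size=(reps, n, d)) - 1.0) * np.sqrt(lam)
S = X.sum(axis=1) / np.sqrt(n)                 # n^{-1/2} sum_i X_i, per repetition
norm_S = np.linalg.norm(S, axis=1)

# Reference distribution: ||N(0, C)|| with the same eigenvalues
G = rng.standard_normal((reps, d)) * np.sqrt(lam)
norm_G = np.linalg.norm(G, axis=1)

for q in [0.5, 0.9, 0.95]:
    print(f"q={q}: empirical {np.quantile(norm_S, q):.3f}  "
          f"Gaussian limit {np.quantile(norm_G, q):.3f}")
```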
C Proofs of results in Section 4

This section contains the proofs of all results in Section 4, with the exception of Proposition 1, which is proven in Section A.3.2. The proofs are self-contained, but readers new to the field may find the following references helpful. For general results about random variables on metric spaces (Slutsky's theorem, etc.), see Billingsley [2, Chapter 1]. For more specific results about Hilbertian random variables, Bochner integrals and operators on Hilbert spaces, see Hsing and Eubank [9, Chapters 2, 4, 7]. For existence and construction of conditional expectations on Hilbert spaces, see Scalora [16, Chapter 2]. In this section, we sometimes omit the subscript $P$ when it is clear from the context.

C.1 Derivation of (9)

Proof. Given $n$ and $a(n) < n - 1$, let $(\tilde{x}_i, \tilde{y}_i, \tilde{z}_i)_{i=1}^n$ denote mean-centred observations, so e.g. $\tilde{z}_i = z_i - \sum_{j=1}^n z_j / n$, and let $\tilde{X}^{(n)} = (\tilde{x}_1, \ldots, \tilde{x}_n)^T \in \mathbb{R}^n$. Let $W_n \in \mathbb{R}^{n\times(1+a(n))}$ be the design matrix with $i$th row given by $(\tilde{y}_i, \tilde{z}_{i,1}, \ldots, \tilde{z}_{i,a(n)})$, and let $\hat{\theta}_n \in \mathbb{R}$ be the first component of the coefficient vector from regressing $\tilde{X}^{(n)}$ onto $W_n$, so $\hat{\theta}_n := \{(W_n^T W_n)^{-1} W_n^T \tilde{X}^{(n)}\}_1$. Then for $n$ sufficiently large, $\psi^{\mathrm{OLS}}_n = \mathbf{1}_{\{|\hat{\theta}_n| \geq c_n \hat{\sigma}_{W,n}\}}$, where $c_n \to c$ for a constant $c > 0$ depending on the significance level, and $\hat{\sigma}^2_{W,n} := \{(W_n^T W_n)^{-1}\}_{11}$.

Fix $Q \in \mathcal{Q}$; in the following we suppress dependence on $Q$ for notational simplicity. Then there exists $r \in \mathbb{N}$ such that
\[ \theta := \mathbb{E}\,\mathrm{Cov}(Y, X \mid Z)\,/\,\mathbb{E}\,\mathrm{Var}(Y\mid Z) = \mathbb{E}\,\mathrm{Cov}(Y, X \mid Z_1,\ldots,Z_r)\,/\,\mathbb{E}\,\mathrm{Var}(Y\mid Z) > 0, \]
and so for $n$ such that $a(n) > r$, $\mathbb{E}(\hat{\theta}_n \mid W_n) = \theta$. To show that $P(\psi^{\mathrm{OLS}}_n = 1) \to 1$, it therefore suffices to show that $\mathrm{Var}(\hat{\theta}_n \mid W_n)$, which is proportional to $\hat{\sigma}^2_{W,n}$, converges to $0$ in probability. Setting $\Sigma_n = \mathrm{Cov}(Y, Z_1, \ldots, Z_{a(n)})$ and $p := 1 + a(n)$, we have that $W_n^T W_n$ has a Wishart distribution on $n-1$ degrees of freedom: $W_n^T W_n \sim W_p(\Sigma_n, n-1)$. Standard results for the Wishart distribution then give that $(\Sigma_n^{-1})_{11} / \hat{\sigma}^2_{W,n} \sim \chi^2_{n-p}$, and $\{(\Sigma_n^{-1})_{11}\}^{-1} = \mathrm{Var}(Y \mid Z_1, \ldots, Z_{a(n)}) = \mathrm{Var}(Y\mid Z) > 0$. We therefore see that as $n \to \infty$, and hence $n - p \to \infty$, we have $\hat{\sigma}^2_{W,n} \overset{P}{\to} 0$.

C.2 Proofs of results in Section 4.1

In this section we provide proofs of Theorems 2 and 3. The proofs rely heavily on the theory developed in Section B.

C.2.1 Auxiliary lemmas

We first prove some auxiliary lemmas that will be needed for the upcoming proofs.

Lemma 8. Let $(X_n)_{n\in\mathbb{N}}$ be a sequence of real-valued random variables defined on $(\Omega,\mathcal{F})$ equipped with a family of probability measures $(P_\theta)_{\theta\in\Theta}$, and let $(\mathcal{F}_n)_{n\in\mathbb{N}}$ be a sequence of sub-$\sigma$-algebras of $\mathcal{F}$. If $\mathbb{E}_\theta(|X_n| \mid \mathcal{F}_n) \overset{P}{\Rightarrow} 0$, then $X_n \overset{P}{\Rightarrow} 0$.

Proof. Let $\varepsilon > 0$. Then
\[ \sup_{\theta\in\Theta} P_\theta(|X_n| \geq \varepsilon) \leq \sup_{\theta\in\Theta} P_\theta(|X_n| \wedge \varepsilon \geq \varepsilon) \leq \frac{\sup_{\theta\in\Theta} \mathbb{E}_\theta(|X_n| \wedge \varepsilon)}{\varepsilon}. \]
We will be done if we can show that $\sup_{\theta\in\Theta} \mathbb{E}_\theta(|X_n| \wedge \varepsilon) \to 0$ as $n\to\infty$. Note that by monotonicity of conditional expectations, for each $\theta\in\Theta$ we have
\[ \mathbb{E}_\theta(|X_n|\wedge\varepsilon \mid \mathcal{F}_n) \leq \mathbb{E}_\theta(\varepsilon\mid\mathcal{F}_n) = \varepsilon \quad\text{and}\quad \mathbb{E}_\theta(|X_n|\wedge\varepsilon\mid\mathcal{F}_n) \leq \mathbb{E}_\theta(|X_n|\mid\mathcal{F}_n). \]
Combining both of the above expressions, we get $\mathbb{E}_\theta(|X_n|\wedge\varepsilon\mid\mathcal{F}_n) \leq \mathbb{E}_\theta(|X_n|\mid\mathcal{F}_n)\wedge\varepsilon$. This lets us write, by the tower property and monotonicity of integrals,
\[ \sup_{\theta\in\Theta}\mathbb{E}_\theta(|X_n|\wedge\varepsilon) = \sup_{\theta\in\Theta}\mathbb{E}_\theta\big[\mathbb{E}_\theta(|X_n|\wedge\varepsilon\mid\mathcal{F}_n)\big] \leq \sup_{\theta\in\Theta}\mathbb{E}_\theta\big[\mathbb{E}_\theta(|X_n|\mid\mathcal{F}_n)\wedge\varepsilon\big]. \]
Let $Y_{n,\theta} := \mathbb{E}_\theta(|X_n|\mid\mathcal{F}_n)\wedge\varepsilon$ and let $\delta > 0$. Then
\[ \sup_{\theta\in\Theta}\mathbb{E}_\theta(Y_{n,\theta}) \leq \sup_{\theta\in\Theta}\mathbb{E}_\theta\big(Y_{n,\theta}\mathbf{1}_{\{Y_{n,\theta}<\delta/2\}}\big) + \sup_{\theta\in\Theta}\mathbb{E}_\theta\big(Y_{n,\theta}\mathbf{1}_{\{Y_{n,\theta}\geq\delta/2\}}\big) \leq \frac{\delta}{2} + \varepsilon\,\sup_{\theta\in\Theta}P_\theta(Y_{n,\theta}\geq\delta/2). \]
By assumption, for any $\eta > 0$ we can choose $N\in\mathbb{N}$ so that for all $n\geq N$, $\sup_{\theta\in\Theta}P_\theta(\mathbb{E}_\theta(|X_n|\mid\mathcal{F}_n)\geq\delta/2) < \eta$. Thus, choosing $N$ to match $\eta = \delta/(2\varepsilon)$, we get $\sup_{\theta\in\Theta}\mathbb{E}_\theta(|X_n|\wedge\varepsilon) < \delta$, proving the desired result.

Lemma 9. Let $X$ and $Y$ be random variables defined on the probability space $(\Omega,\mathcal{F},P)$ with values in a Hilbert space $\mathcal{H}$. Let $\mathcal{D}$ be a sub-$\sigma$-algebra of $\mathcal{F}$ such that $X$ is $\mathcal{D}$-measurable. Assume that $\mathbb{E}(\|X\|)$, $\mathbb{E}(\|Y\|)$ and $\mathbb{E}(\|X\|\|Y\|)$ all exist. Then $\mathbb{E}(\langle X, Y\rangle \mid \mathcal{D}) = \langle X, \mathbb{E}(Y\mid\mathcal{D})\rangle$.

Proof. To show the result, we need to show that $\langle X, \mathbb{E}(Y\mid\mathcal{D})\rangle$ is $\mathcal{D}$-measurable and that integrals over $\mathcal{D}$-sets of $\langle X, Y\rangle$ and $\langle X, \mathbb{E}(Y\mid\mathcal{D})\rangle$ coincide. The former holds by continuity of the inner product and the fact that $X$ and $\mathbb{E}(Y\mid\mathcal{D})$ are $\mathcal{D}$-measurable, by assumption and by definition respectively. By expanding the inner product in an orthonormal basis $(e_k)_{k\in\mathbb{N}}$ of $\mathcal{H}$, we get, for every $D\in\mathcal{D}$,
\[ \int_D \langle X,Y\rangle \,\mathrm{d}P = \int_D \sum_{k=1}^\infty \langle X,e_k\rangle\langle Y,e_k\rangle \,\mathrm{d}P = \sum_{k=1}^\infty \int_D \mathbb{E}(\langle X,e_k\rangle\langle Y,e_k\rangle\mid\mathcal{D})\,\mathrm{d}P = \sum_{k=1}^\infty \int_D \langle X,e_k\rangle\langle\mathbb{E}(Y\mid\mathcal{D}),e_k\rangle \,\mathrm{d}P = \int_D \langle X,\mathbb{E}(Y\mid\mathcal{D})\rangle\,\mathrm{d}P, \]
by the interchangeability of sums and integrals (pulling the $\mathcal{D}$-measurable factor $\langle X,e_k\rangle$ out of the conditional expectation) and the property $\mathbb{E}(\langle Y,e_k\rangle\mid\mathcal{D}) = \langle\mathbb{E}(Y\mid\mathcal{D}),e_k\rangle$ of conditional expectations on Hilbert spaces.
Lemma 10. Let $q$ denote the function that maps a self-adjoint, positive semi-definite, trace-class operator $\mathcal{C}$ on a separable Hilbert space $\mathcal{H}$ to the $1-\alpha$ quantile of the $\|N(0,\mathcal{C})\|$ distribution. Then $q$ is continuous in trace norm, and the restriction of $q$ to a bounded subset $\mathcal{C}_0$ of covariance operators satisfying
\[ \lim_{K\to\infty} \sup_{\mathcal{C}\in\mathcal{C}_0} \sum_{k=K}^\infty \langle \mathcal{C}e_k, e_k\rangle = 0 \tag{27} \]
for some orthonormal basis $(e_k)_{k=1}^\infty$ of $\mathcal{H}$ is uniformly continuous in trace norm.

Proof. Let $\mathcal{C}_n$ be a sequence of self-adjoint, positive semi-definite, trace-class operators converging to $\mathcal{C}$ in trace norm. Then by Bogachev [3, Theorem 2.7.21], $N(0,\mathcal{C}_n) \overset{D}{\to} N(0,\mathcal{C})$, and by the continuous mapping theorem we have $\|N(0,\mathcal{C}_n)\| \overset{D}{\to} \|N(0,\mathcal{C})\|$. This implies the convergence of the quantile functions by the Portmanteau theorem and van der Vaart [20, Lemma 21.2], and hence $q$ is continuous.

By the Heine–Cantor theorem, the restriction of $q$ to the closure of $\mathcal{C}_0$ is uniformly continuous if $\mathcal{C}_0$ is relatively compact; restricting $q$ further to $\mathcal{C}_0$ preserves the uniform continuity. Bogachev [3, Proposition 2.5.2] states that equation (27) exactly characterizes the relatively compact sets of trace-class operators.
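In practice $q(\mathcal{C})$ has no closed form, but since $\|N(0,\mathcal{C})\|^2 \overset{D}{=} \sum_k \lambda_k V_k^2$, with $(\lambda_k)$ the eigenvalues of $\mathcal{C}$ and $V_k$ independent standard Gaussians, it is straightforward to approximate by Monte Carlo. The following sketch is our own illustration (the truncation level and eigenvalue sequence are arbitrary choices).

```python
import numpy as np

def q_alpha(eigenvalues, alpha=0.05, reps=100_000, seed=0):
    """Monte Carlo 1 - alpha quantile of ||N(0, C)|| given eigenvalues of C.

    Uses ||N(0, C)||^2 = sum_k lambda_k V_k^2 with V_k i.i.d. N(0, 1),
    truncating the sum at the length of `eigenvalues`.
    """
    rng = np.random.default_rng(seed)
    lam = np.asarray(eigenvalues)
    V2 = rng.standard_normal((reps, lam.size)) ** 2
    norms = np.sqrt(V2 @ lam)
    return np.quantile(norms, 1 - alpha)

lam = 1.0 / np.arange(1, 200) ** 2   # trace-class: summable eigenvalues
print(q_alpha(lam))                  # approximate q(C) at level alpha = 0.05
# Continuity of q (Lemma 10): a small trace-norm perturbation of the
# eigenvalues moves the quantile only slightly.
print(q_alpha(lam * 1.01))
```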
Lemma 11. Let $\Theta \subseteq \mathbb{R}_+$ and let $(\mu_\theta)_{\theta\in\Theta}$ be the family of probability distributions on $\mathbb{R}$ where, for each $\theta\in\Theta$, $\mu_\theta$ denotes the distribution of $\theta Z$ with $Z \sim \chi^2_1$. If $\Theta$ is bounded away from $0$, the family is uniformly absolutely continuous with respect to the Lebesgue measure $\lambda$.

Proof. Note that the density $f_\theta$ of $\mu_\theta$ with respect to the Lebesgue measure is
\[ f_\theta(x) = \frac{1}{\sqrt{2\pi\theta}}\,\frac{e^{-x/(2\theta)}}{\sqrt{x}}, \quad x > 0. \]
We will apply Proposition 12 by showing that $\sup_{\theta\in\Theta}\int f_\theta^{3/2}\,\mathrm{d}\lambda < \infty$, which is sufficient for uniform integrability by Bogachev [4, Example 4.5.10]. We see that
\[ \int f_\theta(x)^{3/2}\,\mathrm{d}\lambda = \frac{1}{(2\pi\theta)^{3/4}} \int \frac{e^{-3x/(4\theta)}}{x^{3/4}} \,\mathrm{d}\lambda, \]
and we recognise the final integral as the unnormalised density of a $\Gamma(1/4,\, 3/(4\theta))$ random variable. Thus
\[ \int f_\theta(x)^{3/2}\,\mathrm{d}\lambda = \frac{\Gamma(1/4)}{(2\pi\theta)^{3/4}} \Big(\frac{4\theta}{3}\Big)^{1/4}. \]
This is bounded over $\theta\in\Theta$ since $\Theta$ is bounded away from zero, proving the desired result.

Lemma 12. Let $(X_\theta)_{\theta\in\Theta}$ be a uniformly tight, real-valued and non-negative family of random variables that is uniformly absolutely continuous with respect to the Lebesgue measure. Then so is $(\sqrt{X_\theta})_{\theta\in\Theta}$.

Proof. Let $\varepsilon > 0$ and let $\lambda$ denote the Lebesgue measure. We need to find $\delta > 0$ such that for all Borel sets $B$,
\[ \lambda(B) < \delta \implies \sup_{\theta\in\Theta} P_\theta(\sqrt{X_\theta} \in B) < \varepsilon. \]
For each measurable $B$, define $B^2 := \{b^2 : b\in B\}$. Then $P_\theta(\sqrt{X_\theta}\in B) = P_\theta(X_\theta\in B^2)$, and by the uniform tightness of $X_\theta$ we can find $M > 0$ such that
\[ \sup_{\theta\in\Theta} P_\theta(X_\theta\in B^2) \leq \sup_{\theta\in\Theta} P_\theta(X_\theta\in B^2\cap[0,M]) + \varepsilon/2. \]
By the uniform absolute continuity of $X_\theta$ with respect to $\lambda$, we can find $\delta'$ such that $\lambda(B') < \delta'$ implies $\sup_{\theta\in\Theta}P_\theta(X_\theta\in B') < \varepsilon/2$. Let $\delta = \delta'/(2\sqrt{M})$ and let $B$ be a Borel set with $\lambda(B) < \delta$; since $X_\theta \geq 0$, we may assume $B \subseteq [0,\infty)$ without loss of generality. By the outer regularity of the Lebesgue measure, we can find an open set $U\supseteq B$ with $\lambda(U\setminus B) < \delta - \lambda(B)$, which implies $\lambda(U) < \delta$. By Carothers [5, Theorem 4.6], $U$ is a countable union of disjoint open intervals $(I_j)_{j=1}^\infty$ with $I_j = (a_j, b_j)$. Writing $I_j^2 := \{x^2 : x\in I_j\}$, the sets $(I_j^2)_j$ cover $B^2$, since if $x\in U$ then $x$ lies in at least one interval $I_j$, and thus $x^2\in I_j^2$. Combining these observations, we get
\[ \lambda(B^2\cap[0,M]) \leq \sum_{j=1}^\infty \lambda(I_j^2\cap[0,M]) = \sum_{j=1}^\infty \big(\min(\sqrt{M}, b_j) + a_j\big)\big(\min(\sqrt{M}, b_j) - a_j\big)_+ \leq 2\sqrt{M}\sum_{j=1}^\infty (b_j - a_j) < 2\sqrt{M}\,\delta = \delta'. \]
Hence $\sup_{\theta\in\Theta} P_\theta(X_\theta \in B^2\cap[0,M]) < \varepsilon/2$, proving the statement.

C.2.2 Proof of Theorem 2

Proof. Throughout the proof we omit the subscript $P$ from $\varepsilon_P$, $\xi_P$, $f_P$ and $g_P$.

Convergence of $\mathcal{T}_n$. We have that
\[ \mathcal{T}_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n \mathcal{R}_i = \underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n \varepsilon_i\otimes\xi_i}_{U_n} + \underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n (f(z_i)-\hat{f}(z_i))\otimes(g(z_i)-\hat{g}(z_i))}_{a_n} + \underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n (f(z_i)-\hat{f}(z_i))\otimes\xi_i}_{b_n} + \underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n \varepsilon_i\otimes(g(z_i)-\hat{g}(z_i))}_{c_n}. \]
Since
\[ \mathbb{E}(\varepsilon_i\otimes\xi_i) = \mathbb{E}\big((X-\mathbb{E}(X\mid Z))\otimes(Y-\mathbb{E}(Y\mid Z))\big) = \mathbb{E}(\mathrm{Cov}(X,Y\mid Z)) = 0 \]
because $X\perp\!\!\!\perp Y\mid Z$, Proposition 19 yields that $U_n$ converges uniformly in distribution to the desired Gaussian over $\tilde{\mathcal{P}}$. By Proposition 16, if $a_n$, $b_n$ and $c_n$ all converge to $0$ uniformly in probability, we will have shown the desired result. We establish this by studying the Hilbert–Schmidt norms of the sequences, since uniform convergence of the norms to $0$ implies uniform convergence of the sequences to $0$. For $a_n$, using properties of the Hilbert–Schmidt norm and the Cauchy–Schwarz inequality yields
\[ \|a_n\|_{\mathrm{HS}} \leq \frac{1}{\sqrt{n}}\sum_{i=1}^n \big\|(f(z_i)-\hat{f}(z_i))\otimes(g(z_i)-\hat{g}(z_i))\big\|_{\mathrm{HS}} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \|f(z_i)-\hat{f}(z_i)\|\,\|g(z_i)-\hat{g}(z_i)\| \leq \Big(\frac{1}{n}\sum_{i=1}^n \|f(z_i)-\hat{f}(z_i)\|^2 \sum_{i=1}^n \|g(z_i)-\hat{g}(z_i)\|^2\Big)^{1/2} = \sqrt{n M^f_{n,P} M^g_{n,P}}. \]
By assumption $n M^f_{n,P} M^g_{n,P} \overset{P}{\Rightarrow} 0$, and hence $\sqrt{n M^f_{n,P} M^g_{n,P}} \overset{P}{\Rightarrow} 0$ since $x\mapsto\sqrt{x}$ is uniformly continuous. This implies that $\|a_n\|_{\mathrm{HS}} \overset{P}{\Rightarrow} 0$.

For $\|b_n\|_{\mathrm{HS}} \overset{P}{\Rightarrow} 0$, we will instead show that the square of the Hilbert–Schmidt norm goes to $0$; this implies $\|b_n\|_{\mathrm{HS}} \overset{P}{\Rightarrow} 0$ by the uniform continuity of $x\mapsto\sqrt{x}$ as above. We will show that $\mathbb{E}_P(\|b_n\|^2_{\mathrm{HS}}\mid X^{(n)}, Z^{(n)}) \overset{P}{\Rightarrow} 0$, where $X^{(n)} = (X_1,\ldots,X_n)$ and $Z^{(n)} = (Z_1,\ldots,Z_n)$, which then implies the desired result by Lemma 8. For every $P\in\tilde{\mathcal{P}}$ we have
\begin{align*}
\mathbb{E}_P(\|b_n\|^2_{\mathrm{HS}}\mid X^{(n)},Z^{(n)}) &= \frac{1}{n}\,\mathbb{E}_P\Big( \Big\| \sum_{i=1}^n (f(z_i)-\hat{f}(z_i))\otimes\xi_i \Big\|^2_{\mathrm{HS}} \,\Big|\, X^{(n)},Z^{(n)}\Big) \\
&= \frac{1}{n}\sum_{j=1}^n\sum_{i=1}^n \mathbb{E}_P\big( \langle (f(z_i)-\hat{f}(z_i))\otimes\xi_i,\, (f(z_j)-\hat{f}(z_j))\otimes\xi_j\rangle_{\mathrm{HS}} \mid X^{(n)},Z^{(n)}\big) \\
&= \frac{1}{n}\sum_{j=1}^n\sum_{i=1}^n \langle f(z_i)-\hat{f}(z_i),\, f(z_j)-\hat{f}(z_j)\rangle\, \mathbb{E}_P\big(\langle\xi_i,\xi_j\rangle\mid X^{(n)},Z^{(n)}\big), \tag{28}
\end{align*}
where the penultimate equality uses the fact that for Hilbert–Schmidt operators, $\langle x_1\otimes y_1,\, x_2\otimes y_2\rangle_{\mathrm{HS}} = \langle x_1,x_2\rangle\langle y_1,y_2\rangle$.
The final equality holds since the terms involving $f(z_i)-\hat{f}(z_i)$ are measurable with respect to the $\sigma$-algebra generated by $X^{(n)}$ and $Z^{(n)}$. The term $\langle\xi_i,\xi_j\rangle$ only depends on $Z_i$ and $Z_j$ among the conditioning variables, so we can omit the remaining variables from the conditioning expression. Recall that $\xi_i = Y_i - \mathbb{E}_P(Y_i\mid Z_i)$. For $i\neq j$, by using that $\mathbb{E}_P(Y_i\mid Z_i) = \mathbb{E}_P(Y_i\mid Z_i, Z_j)$ (since $Z_j$ is independent of $(Y_i, Z_i)$) and Lemma 9, we get
\begin{align*}
\mathbb{E}_P\big[\langle\xi_i,\xi_j\rangle_{\mathcal{H}_Y}\mid X^{(n)},Z^{(n)}\big] &= \mathbb{E}_P\big[\langle Y_i,Y_j\rangle_{\mathcal{H}_Y} - \langle Y_i,\mathbb{E}_P(Y_j\mid Z_j)\rangle_{\mathcal{H}_Y} - \langle\mathbb{E}_P(Y_i\mid Z_i),Y_j\rangle_{\mathcal{H}_Y} + \langle\mathbb{E}_P(Y_i\mid Z_i),\mathbb{E}_P(Y_j\mid Z_j)\rangle_{\mathcal{H}_Y}\mid Z_i,Z_j\big] \\
&= \mathbb{E}_P(\langle Y_i,Y_j\rangle_{\mathcal{H}_Y}\mid Z_i,Z_j) - \langle\mathbb{E}_P(Y_i\mid Z_i,Z_j),\,\mathbb{E}_P(Y_j\mid Z_i,Z_j)\rangle_{\mathcal{H}_Y}.
\end{align*}
We will show that this is zero. By assumption $(Y_i,Z_i)\perp\!\!\!\perp(Y_j,Z_j)$, so applying the usual laws of conditional independence we get $Y_i\perp\!\!\!\perp Y_j\mid(Z_i,Z_j)$. Take now some orthonormal basis $(e_k)_{k\in\mathbb{N}}$ of $\mathcal{H}_Y$ and expand $\langle Y_i,Y_j\rangle_{\mathcal{H}_Y}$ to get
\[ \mathbb{E}_P(\langle Y_i,Y_j\rangle_{\mathcal{H}_Y}\mid Z_i,Z_j) = \mathbb{E}_P\Big(\sum_{k=1}^\infty \langle Y_i,e_k\rangle\langle Y_j,e_k\rangle \,\Big|\, Z_i,Z_j\Big) = \sum_{k=1}^\infty \mathbb{E}_P(\langle Y_i,e_k\rangle\langle Y_j,e_k\rangle\mid Z_i,Z_j). \]
For each $k$, $\langle Y_i,e_k\rangle_{\mathcal{H}_Y} \perp\!\!\!\perp \langle Y_j,e_k\rangle_{\mathcal{H}_Y} \mid (Z_i,Z_j)$, so $\mathbb{E}(\langle Y_i,e_k\rangle\langle Y_j,e_k\rangle\mid Z_i,Z_j)$ factorises and we get
\[ \sum_{k=1}^\infty \mathbb{E}_P(\langle Y_i,e_k\rangle\langle Y_j,e_k\rangle\mid Z_i,Z_j) = \sum_{k=1}^\infty \mathbb{E}_P(\langle Y_i,e_k\rangle\mid Z_i,Z_j)\,\mathbb{E}_P(\langle Y_j,e_k\rangle\mid Z_i,Z_j) = \sum_{k=1}^\infty \langle\mathbb{E}_P(Y_i\mid Z_i,Z_j),e_k\rangle\langle\mathbb{E}_P(Y_j\mid Z_i,Z_j),e_k\rangle = \langle\mathbb{E}_P(Y_i\mid Z_i,Z_j),\,\mathbb{E}_P(Y_j\mid Z_i,Z_j)\rangle, \]
where the second-to-last equality follows from $\mathbb{E}_P(\langle Y,e_k\rangle\mid Z_i,Z_j) = \langle\mathbb{E}_P(Y\mid Z_i,Z_j),e_k\rangle$ by Lemma 9. We can thus omit all terms from the sum in (28) where $i\neq j$ and get
\[ \mathbb{E}_P(\|b_n\|^2_{\mathrm{HS}}\mid X^{(n)},Z^{(n)}) = \frac{1}{n}\sum_{i=1}^n \|f(z_i)-\hat{f}(z_i)\|^2_{\mathcal{H}_X}\, \mathbb{E}_P\big(\|\xi_i\|^2_{\mathcal{H}_Y}\mid Z_i\big) = \tilde{M}^f_{n,P} \overset{P}{\Rightarrow} 0, \]
by assumption. An analogous argument can be repeated for $c_n$, thus proving the desired result.

Convergence of $\hat{\mathcal{C}}$. For simplicity we prove convergence where $\hat{\mathcal{C}}$ is instead defined as the estimate where we divide by $n$ instead of $n-1$. Since $(N(0,\mathcal{C}_P))_{P\in\tilde{\mathcal{P}}}$ is uniformly tight by Bogachev [3, Proposition 2.5.2, Lemma 2.7.20], we have
\[ \frac{1}{n}\sum_{i=1}^n \mathcal{R}_i = \frac{1}{\sqrt{n}}\,\mathcal{T}_n \overset{P}{\Rightarrow} 0. \]
By Proposition 11, this implies that the second term in the definition of $\hat{\mathcal{C}}$ converges to $0$ uniformly in probability, since the mapping $(\mathcal{A},\mathcal{B})\mapsto\mathcal{A}\otimes_{\mathrm{HS}}\mathcal{B}$ is continuous. It remains to show that the first term in the definition of $\hat{\mathcal{C}}$ converges to $\mathcal{C}_P$. The proof is similar to the proof of Theorem 6 in [19] and relies on expanding the first term $\frac{1}{n}\sum_{i=1}^n \mathcal{R}_i\otimes_{\mathrm{HS}}\mathcal{R}_i$ to yield
\[ \frac{1}{n}\sum_{i=1}^n \big[ (f(z_i)-\hat{f}(z_i))\otimes(g(z_i)-\hat{g}(z_i)) + (f(z_i)-\hat{f}(z_i))\otimes\xi_i + \varepsilon_i\otimes(g(z_i)-\hat{g}(z_i)) + \varepsilon_i\otimes\xi_i \big]^{\otimes_{\mathrm{HS}}2}, \]
where $\mathcal{A}^{\otimes_{\mathrm{HS}}2} = \mathcal{A}\otimes_{\mathrm{HS}}\mathcal{A}$.
Expanding this even further yields 16 terms, of which 15 go to zero. The non-zero term is
\[ \mathrm{I}_n = \frac{1}{n}\sum_{i=1}^n (\varepsilon_i\otimes\xi_i)^{\otimes_{\mathrm{HS}}2} \overset{P}{\Rightarrow} \mathbb{E}_P\big((\varepsilon_i\otimes\xi_i)^{\otimes_{\mathrm{HS}}2}\big) = \mathcal{C}_P, \]
by Proposition 17. For the remaining 15 terms, we will argue by taking trace norms and applying the triangle inequality to reduce the number of cases. This leaves us with 8 terms and 5 cases (by the symmetry between $f$ and $\varepsilon$, and between $g$ and $\xi$) that we need to show converge to $0$ uniformly in probability. The first case is
\begin{align*}
\mathrm{II}_n &= \Big\| \frac{1}{n}\sum_{i=1}^n \big[(f(z_i)-\hat{f}(z_i))\otimes(g(z_i)-\hat{g}(z_i))\big]^{\otimes_{\mathrm{HS}}2} \Big\|_{\mathrm{TR}} \leq \frac{1}{n}\sum_{i=1}^n \Big\| \big[(f(z_i)-\hat{f}(z_i))\otimes(g(z_i)-\hat{g}(z_i))\big]^{\otimes_{\mathrm{HS}}2} \Big\|_{\mathrm{TR}} \\
&= \frac{1}{n}\sum_{i=1}^n \big\|(f(z_i)-\hat{f}(z_i))\otimes(g(z_i)-\hat{g}(z_i))\big\|^2_{\mathrm{HS}} = \frac{1}{n}\sum_{i=1}^n \|f(z_i)-\hat{f}(z_i)\|^2\|g(z_i)-\hat{g}(z_i)\|^2 \leq n M^f_{n,P}M^g_{n,P} \overset{P}{\Rightarrow} 0,
\end{align*}
where the final inequality uses that $\sum_i a_i b_i \leq \sum_i a_i \sum_i b_i$ for non-negative $(a_i)$ and $(b_i)$, which can be seen by noting that every term on the left-hand side also appears on the right-hand side. For the second case we have, by applying the Cauchy–Schwarz inequality,
\begin{align*}
\mathrm{III}_n &= \Big\|\frac{1}{n}\sum_{i=1}^n \big[(f(z_i)-\hat{f}(z_i))\otimes\xi_i\big]\otimes_{\mathrm{HS}}\big[(g(z_i)-\hat{g}(z_i))\otimes\varepsilon_i\big]\Big\|_{\mathrm{TR}} \leq \frac{1}{n}\sum_{i=1}^n \|f(z_i)-\hat{f}(z_i)\|\|g(z_i)-\hat{g}(z_i)\|\|\varepsilon_i\|\|\xi_i\| \\
&\leq \sqrt{\underbrace{\Big(\frac{1}{n}\sum_{i=1}^n \|f(z_i)-\hat{f}(z_i)\|^2\|g(z_i)-\hat{g}(z_i)\|^2\Big)}_{a_n}\underbrace{\Big(\frac{1}{n}\sum_{i=1}^n \|\varepsilon_i\|^2\|\xi_i\|^2\Big)}_{b_n}}.
\end{align*}
By the argument for $\mathrm{II}_n$, we have $a_n \leq n M^f_{n,P}M^g_{n,P}\overset{P}{\Rightarrow}0$. We have $b_n\overset{P}{\Rightarrow}\|\mathcal{C}_P\|_{\mathrm{TR}}$ by Proposition 17. The family $(\|\mathcal{C}_P\|_{\mathrm{TR}})_{P\in\tilde{\mathcal{P}}}$ is uniformly tight by the assumption that $\mathbb{E}(\|\varepsilon_P\|^{2+\eta}\|\xi_P\|^{2+\eta})$ is uniformly bounded, since this also yields a bound on $\mathbb{E}(\|\varepsilon_P\|^2\|\xi_P\|^2) = \|\mathcal{C}_P\|_{\mathrm{TR}}$. By the uniform tightness of $(\|\mathcal{C}_P\|_{\mathrm{TR}})_{P\in\tilde{\mathcal{P}}}$ and the uniform continuity of $x\mapsto\sqrt{x}$, Propositions 11 and 10 yield that $\sqrt{a_n b_n}\overset{P}{\Rightarrow}0$.

Each of the remaining three cases comes in an $f$-variant and a $g$-variant, where the roles of $f$ and $g$, and of $\varepsilon$ and $\xi$, are swapped. We only show one variant of each, since the arguments are identical. The $f$-variant of the third case is
\[ \mathrm{IV}^f_n = \Big\|\frac{1}{n}\sum_{i=1}^n \big[(f(z_i)-\hat{f}(z_i))\otimes\xi_i\big]^{\otimes_{\mathrm{HS}}2}\Big\|_{\mathrm{TR}} \leq \frac{1}{n}\sum_{i=1}^n \|f(z_i)-\hat{f}(z_i)\|^2\|\xi_i\|^2 =: c_n. \]
If we can show that $\mathbb{E}_P(c_n\mid X^{(n)},Z^{(n)})\overset{P}{\Rightarrow}0$, then $c_n\overset{P}{\Rightarrow}0$ by Lemma 8, and thus $\mathrm{IV}^f_n\overset{P}{\Rightarrow}0$. Arguing as for $b_n$ above,
\[ \mathbb{E}_P(c_n\mid X^{(n)},Z^{(n)}) = \frac{1}{n}\sum_{i=1}^n \|f(z_i)-\hat{f}(z_i)\|^2\,\mathbb{E}_P\big(\|\xi_i\|^2\mid X^{(n)},Z^{(n)}\big) = \tilde{M}^f_{n,P}\overset{P}{\Rightarrow}0, \]
by assumption. The $f$-variant of the fourth case is, by applying the Cauchy–Schwarz inequality,
\begin{align*}
\mathrm{V}^f_n &= \Big\|\frac{1}{n}\sum_{i=1}^n \big[(f(z_i)-\hat{f}(z_i))\otimes(g(z_i)-\hat{g}(z_i))\big]\otimes_{\mathrm{HS}}\big[(f(z_i)-\hat{f}(z_i))\otimes\xi_i\big]\Big\|_{\mathrm{TR}} \\
&\leq \frac{1}{n}\sum_{i=1}^n \|f(z_i)-\hat{f}(z_i)\|^2\|g(z_i)-\hat{g}(z_i)\|\|\xi_i\| \leq \sqrt{a_n c_n}.
\end{align*}
We saw above that $a_n\overset{P}{\Rightarrow}0$ and $c_n\overset{P}{\Rightarrow}0$, hence $\sqrt{a_n c_n}\overset{P}{\Rightarrow}0$ since $x\mapsto\sqrt{x}$ is uniformly continuous. For the $f$-variant of the fifth and final case, we get, by applying the Cauchy–Schwarz inequality again,
\[ \mathrm{VI}^f_n = \Big\|\frac{1}{n}\sum_{i=1}^n \big[(f(z_i)-\hat{f}(z_i))\otimes\xi_i\big]\otimes_{\mathrm{HS}}\big[\varepsilon_i\otimes\xi_i\big]\Big\|_{\mathrm{TR}} \leq \frac{1}{n}\sum_{i=1}^n \|f(z_i)-\hat{f}(z_i)\|\|\varepsilon_i\|\|\xi_i\|^2 \leq \sqrt{c_n b_n}. \]
Since $c_n\overset{P}{\Rightarrow}0$ and $b_n\overset{P}{\Rightarrow}\|\mathcal{C}_P\|_{\mathrm{TR}}$, with $(\|\mathcal{C}_P\|_{\mathrm{TR}})_{P\in\tilde{\mathcal{P}}}$ uniformly tight, we can repeat the arguments used for $\mathrm{III}_n$, yielding $\sqrt{c_n b_n}\overset{P}{\Rightarrow}0$ and hence $\mathrm{VI}^f_n\overset{P}{\Rightarrow}0$. This completes the proof.
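The proof above mirrors how the test is computed in practice: form residuals of $X$ and $Y$ regressed on $Z$, aggregate the outer products into $T_n = \|n^{-1/2}\sum_i \hat\varepsilon_i\otimes\hat\xi_i\|_{\mathrm{HS}}$, and compare with the quantile $q(\hat{\mathcal{C}})$ used in the next proof. The sketch below is our own minimal illustration, not the authors' implementation: the data-generating step, grid size and placeholder ridge regression are arbitrary, and $q(\hat{\mathcal{C}})$ is approximated by Monte Carlo via a Gaussian-multiplier representation with covariance $\hat{\mathcal{C}}$.

```python
import numpy as np
rng = np.random.default_rng(1)

# Toy functional data on a grid: rows of X, Y, Z in R^d stand in for
# elements of H_X, H_Y, H_Z.
n, d = 300, 20
t = np.linspace(0, 1, d)
Z = rng.standard_normal((n, d))
X = Z * np.sin(2 * np.pi * t) + 0.5 * rng.standard_normal((n, d))
Y = Z * np.cos(2 * np.pi * t) + 0.5 * rng.standard_normal((n, d))  # null model

def ridge_residuals(target, Z, gam=1e-1):
    # placeholder regression-on-Z step; any estimator with small in-sample
    # prediction error (in the sense of Theorem 2) could be used instead
    coef = np.linalg.solve(Z.T @ Z + gam * np.eye(d), Z.T @ target)
    return target - Z @ coef

eps, xi = ridge_residuals(X, Z), ridge_residuals(Y, Z)
R = np.einsum('ij,ik->ijk', eps, xi).reshape(n, -1)   # R_i = eps_i (x) xi_i
T_n = np.linalg.norm(R.mean(axis=0)) * np.sqrt(n)     # ||T_n||_HS

# Quantile of ||N(0, C_hat)||_HS: conditional on the data,
# n^{-1/2} sum_i g_i (R_i - R_bar) with g_i ~ N(0,1) has covariance C_hat.
Rc = R - R.mean(axis=0)
G = rng.standard_normal((2000, n))
draws = np.linalg.norm(G @ Rc, axis=1) / np.sqrt(n)
print("T_n =", T_n, " q_0.95 =", np.quantile(draws, 0.95))
```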
C.2.3 Proof of Theorem 3

Proof. By Proposition 10 and Theorem 2, since $\|\cdot\|_{\mathrm{HS}}$ is Lipschitz continuous, $T_n := \|\mathcal{T}_n\|_{\mathrm{HS}} \overset{D}{\Rightarrow} \|N(0,\mathcal{C}_P)\|_{\mathrm{HS}}$. Let $W$ be distributed as $\|N(0,\mathcal{C}_P)\|_{\mathrm{HS}}$ when the background measure is $P_P$. Since
\[ P_P(\psi_n = 1) = P_P(T_n > q(\hat{\mathcal{C}})), \]
we need to show that
\[ \lim_{n\to\infty}\sup_{P\in\tilde{\mathcal{P}}} \big| P_P(T_n > q(\hat{\mathcal{C}})) - \alpha \big| = 0, \]
which amounts to finding, for all $\varepsilon > 0$, an $N\in\mathbb{N}$ such that for all $n\geq N$,
\[ \sup_{P\in\tilde{\mathcal{P}}} P_P(T_n > q(\hat{\mathcal{C}})) < \alpha + \varepsilon \tag{29} \]
and
\[ \inf_{P\in\tilde{\mathcal{P}}} P_P(T_n > q(\hat{\mathcal{C}})) > \alpha - \varepsilon. \tag{30} \]
To show (29), let $\delta > 0$. If $|q(\hat{\mathcal{C}}) - q(\mathcal{C}_P)| < \delta$ and $T_n > q(\hat{\mathcal{C}})$, then $T_n > q(\mathcal{C}_P) - \delta$, and thus
\[ P_P(T_n > q(\hat{\mathcal{C}})) \leq P_P(T_n > q(\mathcal{C}_P) - \delta) + P_P(|q(\hat{\mathcal{C}}) - q(\mathcal{C}_P)| \geq \delta). \]
Taking suprema and rewriting, we get
\[ \sup_{P\in\tilde{\mathcal{P}}} P_P(T_n > q(\hat{\mathcal{C}})) \leq \underbrace{\sup_{P\in\tilde{\mathcal{P}}}\big[P_P(T_n > q(\mathcal{C}_P)-\delta) - P_P(W > q(\mathcal{C}_P)-\delta)\big]}_{\mathrm{I}_n} + \underbrace{\sup_{P\in\tilde{\mathcal{P}}}\big[P_P(W > q(\mathcal{C}_P)-\delta) - \alpha\big]}_{\mathrm{II}_n} + \underbrace{\sup_{P\in\tilde{\mathcal{P}}} P_P(|q(\hat{\mathcal{C}})-q(\mathcal{C}_P)|\geq\delta)}_{\mathrm{III}_n} + \alpha. \]
We seek to show that, for $n$ sufficiently large, we can make each of the terms $\mathrm{I}_n$, $\mathrm{II}_n$ and $\mathrm{III}_n$ less than $\varepsilon/3$, so that $\sup_{P\in\tilde{\mathcal{P}}} P_P(T_n > q(\hat{\mathcal{C}})) < \alpha + \varepsilon$, as desired.

For the $\mathrm{I}_n$ term, we note first that for each $P\in\tilde{\mathcal{P}}$,
\[ W^2 \overset{D}{=} \sum_{k=1}^\infty \lambda^P_k V_k^2, \]
where $\lambda^P_k$ is the $k$th eigenvalue of $\mathcal{C}_P$ and $(V_k)_{k\in\mathbb{N}}$ is a sequence of independent standard Gaussian random variables. We have assumed that the operator norm of $\mathcal{C}_P$ is bounded away from zero over $P\in\tilde{\mathcal{P}}$, which implies that $\lambda^P_1$ is bounded away from zero. Thus the family $(\lambda^P_1 V_1^2)_{P\in\tilde{\mathcal{P}}}$ is uniformly absolutely continuous with respect to the Lebesgue measure by Lemma 11. Theorem 8 yields that $W^2$ is also uniformly absolutely continuous with respect to the Lebesgue measure, and Lemma 12 yields that the same is true for $W$, since $W^2$ is uniformly tight by the assumed uniform bound on $\mathbb{E}_P(\|\varepsilon_P\|^2\|\xi_P\|^2)$. Further, Corollary 1 yields that $W$ is also uniformly absolutely continuous with respect to the standard Gaussian measure on $\mathbb{R}$. Finally, since $T_n \overset{D}{\Rightarrow} \|N(0,\mathcal{C}_P)\|_{\mathrm{HS}}$, Proposition 14 yields uniform convergence of the distribution function of $T_n$ to the distribution function of $W$. Thus we can choose $n$ large enough to ensure $\mathrm{I}_n < \varepsilon/3$.

For the $\mathrm{II}_n$ term, recall that $\alpha = P_P(W > q(\mathcal{C}_P))$, and thus
\[ P_P(W > q(\mathcal{C}_P) - \delta) - \alpha = P_P(W\in[q(\mathcal{C}_P)-\delta,\, q(\mathcal{C}_P)]). \]
By the uniform absolute continuity of $W$ with respect to $\lambda$, we may fix $\delta > 0$ small enough that $\sup_{P\in\tilde{\mathcal{P}}} P_P(W\in B) < \varepsilon/3$ for all Borel $B$ with $\lambda(B) \leq \delta$. This implies that $\mathrm{II}_n < \varepsilon/3$.

For the $\mathrm{III}_n$ term, Theorem 2 yields $\hat{\mathcal{C}} \overset{P}{\Rightarrow} \mathcal{C}_P$, and since Lemma 10 yields that $q$ is uniformly continuous, Proposition 10 yields $q(\hat{\mathcal{C}}) \overset{P}{\Rightarrow} q(\mathcal{C}_P)$; thus the third term is less than $\varepsilon/3$ when $n$ is large enough.

To show (30), note first that, as before, if $|q(\hat{\mathcal{C}}) - q(\mathcal{C}_P)| < \delta$ and $T_n > q(\mathcal{C}_P) + \delta$, then $T_n > q(\hat{\mathcal{C}})$, and hence
\[ P_P(T_n > q(\hat{\mathcal{C}})) \geq P_P\big(\{T_n > q(\mathcal{C}_P)+\delta\}\cap\{|q(\hat{\mathcal{C}})-q(\mathcal{C}_P)|<\delta\}\big) \geq P_P(T_n > q(\mathcal{C}_P)+\delta) - P_P(|q(\hat{\mathcal{C}})-q(\mathcal{C}_P)|\geq\delta). \]
The final step uses that for any measurable sets $A$ and $B$,
\[ P(A\cap B) = P(A) + P(B) - P(A\cup B) = P(A) - P(B^c) + 1 - P(A\cup B) \geq P(A) - P(B^c). \]
This lets us conclude using similar arguments as for (29), proving the statement.

C.3 Proof of Theorem 4

Proof. To argue that the modified GHCM satisfies (13), we can repeat the arguments of Theorem 2 and Theorem 3, replacing conditioning on $X^{(n)}$ and $Z^{(n)}$ with conditioning on $Z^{(n)}$ and $A$, and conditioning on $Y^{(n)}$ and $Z^{(n)}$ with conditioning on $Z^{(n)}$ and $A$.

For the first claim, that $\tilde{\mathcal{T}}_n \overset{D}{\Rightarrow} N(0,\mathcal{C}_P)$, we can repeat the decomposition of the proof of Theorem 2 and write
\[ \frac{1}{\sqrt{n}}\sum_{i=1}^n (\mathcal{R}_i - \mathcal{K}_P) = \underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n (\varepsilon_i\otimes\xi_i - \mathcal{K}_P)}_{U_n} + a_n + b_n + c_n, \]
where $a_n$, $b_n$ and $c_n$ are as in the proof of Theorem 2. We have $U_n \overset{D}{\Rightarrow} N(0,\mathcal{C}_P)$ over $\mathcal{Q}$ by Proposition 19, and $a_n \overset{P}{\Rightarrow} 0$ over $\mathcal{Q}$ by the same argument as in the proof of Theorem 2. The argument of the proof of Theorem 2 showing that $b_n \overset{P}{\Rightarrow} 0$ and $c_n \overset{P}{\Rightarrow} 0$ applies after replacing the conditioning as described above.

To show $\|\hat{\mathcal{C}} - \mathcal{C}_P\|_{\mathrm{TR}} \overset{P}{\Rightarrow} 0$, note that by the $\tilde{\mathcal{T}}_n$ result, Proposition 16 and Proposition 11,
\[ \frac{1}{n}\sum_{i=1}^n \mathcal{R}_i = \frac{1}{\sqrt{n}}\cdot\frac{1}{\sqrt{n}}\sum_{i=1}^n (\mathcal{R}_i - \mathcal{K}_P) + \mathcal{K}_P \overset{P}{\Rightarrow}_{\mathcal{Q}} \mathcal{K}_P. \]
Hence, by Proposition 11,
\[ \Big(\frac{1}{n}\sum_{i=1}^n \mathcal{R}_i\Big)\otimes_{\mathrm{HS}}\Big(\frac{1}{n}\sum_{i=1}^n \mathcal{R}_i\Big) \overset{P}{\Rightarrow}_{\mathcal{Q}} \mathcal{K}_P\otimes_{\mathrm{HS}}\mathcal{K}_P, \]
since the mapping $(\mathcal{A},\mathcal{B})\mapsto\mathcal{A}\otimes_{\mathrm{HS}}\mathcal{B}$ is continuous. We can now repeat the remaining arguments of the proof of Theorem 2, again replacing the conditioning as we did in the proof of the first claim, to yield the desired result.

For the final claim, that for large enough $n$ the GHCM has power greater than $\beta$ over alternatives where $\|\sqrt{n}\,\mathcal{K}_P\|_{\mathrm{HS}} > C$, let $W$ be distributed as $\|N(0,\mathcal{C}_P)\|_{\mathrm{HS}}$ when the background measure is $P_P$, for $P \in \mathcal{Q}$.
Let $q$ denote the mapping that sends a covariance operator $\mathcal{C}$ to the $1-\alpha$ quantile of the distribution of $\|N(0,\mathcal{C})\|_{\mathrm{HS}}$. By similar arguments as in the proof of Theorem 3, we get, for any $\delta > 0$, $C > 0$ and $n\in\mathbb{N}$,
\[ \inf_{P\in\mathcal{Q}_{C,n}} P_P(T_n > q(\hat{\mathcal{C}})) \geq \inf_{P\in\mathcal{Q}_{C,n}} P_P(T_n > q(\mathcal{C}_P) + \delta) - \sup_{P\in\mathcal{Q}_{C,n}} P_P(|q(\hat{\mathcal{C}}) - q(\mathcal{C}_P)| \geq \delta). \]
Defining $\tilde{T}_n := \|\tilde{\mathcal{T}}_n\|_{\mathrm{HS}}$, by the reverse triangle inequality,
\[ T_n = \big\|\tilde{\mathcal{T}}_n + \sqrt{n}\,\mathcal{K}_P\big\|_{\mathrm{HS}} \geq \big| \tilde{T}_n - \sqrt{n}\,\|\mathcal{K}_P\|_{\mathrm{HS}} \big| \geq \sqrt{n}\,\|\mathcal{K}_P\|_{\mathrm{HS}} - \tilde{T}_n, \]
and hence
\[ \inf_{P\in\mathcal{Q}_{C,n}} P_P(T_n > q(\mathcal{C}_P)+\delta) \geq \inf_{P\in\mathcal{Q}_{C,n}} P_P(\sqrt{n}\,\|\mathcal{K}_P\|_{\mathrm{HS}} - \tilde{T}_n > q(\mathcal{C}_P)+\delta). \]
Now, since we are taking an infimum over a set where $\sqrt{n}\,\|\mathcal{K}_P\|_{\mathrm{HS}} > C$, we have
\[ \inf_{P\in\mathcal{Q}_{C,n}} P_P(\sqrt{n}\,\|\mathcal{K}_P\|_{\mathrm{HS}} - \tilde{T}_n > q(\mathcal{C}_P)+\delta) \geq \inf_{P\in\mathcal{Q}_{C,n}} P_P(C - \tilde{T}_n > q(\mathcal{C}_P)+\delta), \]
and thus combining all of the above yields
\[ \inf_{P\in\mathcal{Q}_{C,n}} P_P(T_n > q(\hat{\mathcal{C}})) \geq \underbrace{\inf_{P\in\mathcal{Q}_{C,n}}\big[P_P(C - \tilde{T}_n > q(\mathcal{C}_P)+\delta) - P_P(C - W > q(\mathcal{C}_P)+\delta)\big]}_{\mathrm{I}_n} + \underbrace{\inf_{P\in\mathcal{Q}_{C,n}} P_P(C - W > q(\mathcal{C}_P)+\delta)}_{\mathrm{II}_n} - \underbrace{\sup_{P\in\mathcal{Q}_{C,n}} P_P(|q(\hat{\mathcal{C}}) - q(\mathcal{C}_P)|\geq\delta)}_{\mathrm{III}_n}. \]
If we can show that for $n$ sufficiently large we can make $\mathrm{I}_n + \mathrm{II}_n - \mathrm{III}_n \geq \beta$, we will be done.

For the $\mathrm{I}_n$ term: by the first claim proven above and Proposition 10, $\tilde{T}_n \overset{D}{\Rightarrow} W$. Using the same arguments as for the $\mathrm{I}_n$ term in the proof of Theorem 3, our assumptions ensure that $W$ is uniformly absolutely continuous with respect to the standard Gaussian measure on $\mathbb{R}$, hence Proposition 14 yields that we can choose $n$ sufficiently large such that $\mathrm{I}_n \geq -(1-\beta)/3$.

For the $\mathrm{II}_n$ term, we can write
\[ \mathrm{II}_n = 1 - \sup_{P\in\mathcal{Q}_{C,n}} P_P(C - \delta \leq W + q(\mathcal{C}_P)). \]
Hence, by uniform tightness of $(W + q(\mathcal{C}_P))_{P\in\mathcal{Q}}$, we can find $C$ such that $\sup_{P\in\mathcal{Q}_{C,n}} P_P(C-\delta \leq W + q(\mathcal{C}_P)) < (1-\beta)/3$, whence $\mathrm{II}_n > 1 - (1-\beta)/3$.

For the $\mathrm{III}_n$ term, we can repeat the arguments for the $\mathrm{III}_n$ term in the proof of Theorem 3 to show that $\mathrm{III}_n \to 0$ uniformly; thus, for large $n$, we have $\mathrm{III}_n < (1-\beta)/3$. Combining the three bounds, we conclude for $n$ sufficiently large that
\[ \inf_{P\in\mathcal{Q}_{C,n}} P_P(T_n > q(\hat{\mathcal{C}})) \geq -(1-\beta)/3 + 1 - (1-\beta)/3 - (1-\beta)/3 = \beta. \]

C.4 Proof of Proposition 2

We first prove a representer theorem [11, 18] for scalar-on-function ridge regression, which we then use to bound the in-sample error of the Hilbertian linear model in Lemma 14.

Lemma 13. Let $\mathcal{H}$ denote a Hilbert space with norm $\|\cdot\|$, let $x_1,\ldots,x_n\in\mathbb{R}$, $z_1,\ldots,z_n\in\mathcal{H}$ and $\gamma > 0$. Let $K$ be the $n\times n$ matrix with $K_{i,j} := \langle z_i, z_j\rangle$ and let $X = (x_1,\ldots,x_n)^T\in\mathbb{R}^n$. Then $\hat{\beta}$ minimises
\[ L_1(\beta) = \sum_{i=1}^n (x_i - \langle\beta, z_i\rangle)^2 + \gamma\|\beta\|^2 \]
over $\beta\in\mathcal{H}$ if and only if $\hat{\beta} = \sum_{i=1}^n \hat{\alpha}_i z_i$ and $\hat{\alpha}\in\mathbb{R}^n$ minimises
\[ L_2(\alpha) = \|X - K\alpha\|_2^2 + \gamma\,\alpha^T K\alpha \]
over $\mathbb{R}^n$, where $\|\cdot\|_2$ denotes the standard Euclidean norm on $\mathbb{R}^n$.

Proof. Assume that $\hat{\beta}$ minimises $L_1$. Write $\hat{\beta} = u + v$, where $u\in\mathrm{span}(z_1,\ldots,z_n)$ and $v$ is orthogonal to $\mathrm{span}(z_1,\ldots,z_n)$. Since $\langle\hat{\beta}, z_i\rangle = \langle u, z_i\rangle$, the first term of $L_1$ only depends on the value of $u$, and $\|\hat{\beta}\|^2 \geq \|u\|^2$; hence $v = 0$ by optimality and $\hat{\beta}$ can be written as
\[ \hat{\beta} = \sum_{i=1}^n \hat{\alpha}_i z_i \]
for some $\hat{\alpha}\in\mathbb{R}^n$.
But now that $\hat{\beta}$ is known to have this form, it can be seen that $\hat{\alpha}^T K\hat{\alpha} = \|\hat{\beta}\|^2$ and
\[ \sum_{i=1}^n (x_i - \langle\hat{\beta}, z_i\rangle)^2 = \sum_{i=1}^n \Big(x_i - \sum_{j=1}^n \hat{\alpha}_j\langle z_j, z_i\rangle\Big)^2 = \|X - K\hat{\alpha}\|_2^2, \]
hence $\hat{\alpha}$ minimises $L_2$.

Assume now that $\hat{\alpha}$ minimises $L_2$ and $\hat{\beta} = \sum_{i=1}^n\hat{\alpha}_i z_i$. Clearly, $L_2(\hat{\alpha}) = L_1(\hat{\beta})$. For any $\tilde{\beta}\in\mathcal{H}$, we can write $\tilde{\beta} = \tilde{u} + \tilde{v}$ with $\tilde{u}\in\mathrm{span}(z_1,\ldots,z_n)$ and $\tilde{v}$ orthogonal to the span, as before. By similar arguments as above, $L_1(\tilde{\beta}) \geq L_1(\tilde{u})$. However, $\tilde{u} = \sum_{i=1}^n \tilde{\alpha}_i z_i$ for some $\tilde{\alpha}\in\mathbb{R}^n$; hence, by optimality of $\hat{\alpha}$, we have
\[ L_1(\tilde{u}) = L_2(\tilde{\alpha}) \geq L_2(\hat{\alpha}) = L_1(\hat{\beta}), \]
proving that $\hat{\beta}$ minimises $L_1$, as desired.

Lemma 14. Consider the estimator $\hat{S}_\gamma$ in equation (16) in the Hilbertian linear model, which is a function of $X_1,\ldots,X_n, Z_1,\ldots,Z_n$, and let $\sigma^2 > 0$ be such that $\mathbb{E}(\|\varepsilon\|^2\mid Z) \leq \sigma^2$ almost surely. Let $K$ be the $n\times n$ matrix with $K_{ij} := \langle Z_i, Z_j\rangle$, and let $(d_i)_{i=1}^n$ denote the eigenvalues of $K$. Then, letting $Z^{(n)} := (Z_1,\ldots,Z_n)$,
\[ \frac{1}{n}\,\mathbb{E}\Big(\sum_{i=1}^n \|S(Z_i) - \hat{S}_\gamma(Z_i)\|^2 \,\Big|\, Z^{(n)}\Big) \leq \frac{\sigma^2}{\gamma n}\sum_{i=1}^n \min(d_i/4,\, \gamma) + \frac{\|S\|^2_{\mathrm{HS}}\,\gamma}{4n} \tag{31} \]
almost surely. Further, there exists $C > 0$ such that
\[ \frac{1}{n}\,\mathbb{E}\Big(\sum_{i=1}^n \|S(Z_i) - \hat{S}_\gamma(Z_i)\|^2\Big) \leq \frac{\sigma^2 C}{\gamma' n}\sum_{k=1}^\infty \min(\mu_k/4,\, \gamma') + \frac{\|S\|^2_{\mathrm{HS}}\,\gamma'}{4}, \tag{32} \]
where $\gamma' := \gamma/n$ and $(\mu_k)_{k\in\mathbb{N}}$ are the eigenvalues of the covariance operator of $Z$.

Proof. Let $(e_k)_{k\in\mathbb{N}}$ denote a basis of $\mathcal{H}_X$. Then
\[ \sum_{i=1}^n \|S(Z_i) - \hat{S}_\gamma(Z_i)\|^2 = \sum_{k=1}^\infty \sum_{i=1}^n \big(\langle S(Z_i), e_k\rangle - \langle\hat{S}_\gamma(Z_i), e_k\rangle\big)^2 = \sum_{k=1}^\infty \sum_{i=1}^n \big(\langle Z_i, S^*(e_k)\rangle - \langle Z_i, \hat{S}^*_\gamma(e_k)\rangle\big)^2, \]
and similarly we can rewrite the penalised likelihood criterion in (16) as
\[ \sum_{i=1}^n \|X_i - \tilde{S}(Z_i)\|^2 + \gamma\|\tilde{S}\|^2_{\mathrm{HS}} = \sum_{k=1}^\infty \Big[ \sum_{i=1}^n \big(\langle X_i, e_k\rangle - \langle Z_i, \tilde{S}^*(e_k)\rangle\big)^2 + \gamma\|\tilde{S}^*(e_k)\|^2 \Big]. \tag{33} \]
Since each of the terms in square brackets can be chosen independently of the others, we have
\[ \hat{\beta}_k := \hat{S}^*_\gamma(e_k) = \operatorname*{argmin}_{\beta\in\mathcal{H}} \sum_{i=1}^n \big(\langle X_i, e_k\rangle - \langle Z_i, \beta\rangle\big)^2 + \gamma\|\beta\|^2. \]
A bit of matrix calculus combined with Lemma 13 yields that
\[ \big(\langle Z_1, \hat{\beta}_k\rangle, \ldots, \langle Z_n, \hat{\beta}_k\rangle\big)^T = K(K + \gamma I)^{-1} X^{(k)}, \]
where $I$ is the $n\times n$ identity matrix and $X^{(k)} := (\langle X_1, e_k\rangle, \ldots, \langle X_n, e_k\rangle)^T$. Defining $\beta_k := S^*(e_k)$, we can write $\beta_k = u_k + v_k$, where $u_k \in\mathrm{span}(Z_1,\ldots,Z_n)$ and $v_k$ is orthogonal to the span; hence, for $i\in\{1,\ldots,n\}$,
\[ \langle Z_i, \beta_k\rangle = \langle Z_i, u_k\rangle = \Big\langle Z_i, \sum_{j=1}^n \alpha_{k,j} Z_j\Big\rangle = \sum_{j=1}^n \alpha_{k,j}\langle Z_i, Z_j\rangle \]
for some $\alpha_k \in\mathbb{R}^n$, i.e., $\big(\langle Z_1,\beta_k\rangle,\ldots,\langle Z_n,\beta_k\rangle\big)^T = K\alpha_k$. Let $K = UDU^T$ be the eigendecomposition of $K$, where $D_{ii} = d_i$, and let $\theta_k = U^T K\alpha_k$. Let $\varepsilon^{(k)} := (\langle\varepsilon_1, e_k\rangle, \ldots, \langle\varepsilon_n, e_k\rangle)^T$ and note that $X^{(k)} = K\alpha_k + \varepsilon^{(k)}$.
Then $n$ times the left-hand side of equation (31) can now be written (using equation (33)) as
\begin{align*}
\mathbb{E}\Big[\sum_{k=1}^\infty \big\|K(K+\gamma I)^{-1}(U\theta_k + \varepsilon^{(k)}) - U\theta_k\big\|_2^2 \,\Big|\, Z^{(n)}\Big] &= \mathbb{E}\Big[\sum_{k=1}^\infty \big\|D(D+\gamma I)^{-1}(\theta_k + U^T\varepsilon^{(k)}) - \theta_k\big\|_2^2 \,\Big|\, Z^{(n)}\Big] \\
&= \sum_{k=1}^\infty \big\|(D(D+\gamma I)^{-1} - I)\theta_k\big\|_2^2 + \mathbb{E}\Big[\sum_{k=1}^\infty \big\|D(D+\gamma I)^{-1}U^T\varepsilon^{(k)}\big\|_2^2 \,\Big|\, Z^{(n)}\Big],
\end{align*}
where the final equality uses that the first term is a function of $Z^{(n)}$ and that the conditional expectation of the cross term in the sum of squares is $0$, since $\mathbb{E}(\varepsilon^{(k)}\mid Z^{(n)}) = 0$.

The second term may be simplified as follows:
\begin{align*}
\mathbb{E}\Big[\sum_{k=1}^\infty \|D(D+\gamma I)^{-1}U^T\varepsilon^{(k)}\|_2^2 \,\Big|\, Z^{(n)}\Big] &= \mathbb{E}\Big[\sum_{k=1}^\infty \mathrm{tr}\big(D(D+\gamma I)^{-1}U^T\varepsilon^{(k)}(\varepsilon^{(k)})^T U D(D+\gamma I)^{-1}\big) \,\Big|\, Z^{(n)}\Big] \\
&= \mathrm{tr}\Big(D(D+\gamma I)^{-1}U^T \underbrace{\mathbb{E}\Big[\sum_{k=1}^\infty \varepsilon^{(k)}(\varepsilon^{(k)})^T \,\Big|\, Z^{(n)}\Big]}_{\Sigma_{\varepsilon|Z}} U D(D+\gamma I)^{-1}\Big),
\end{align*}
where we have used that only $\varepsilon^{(k)}$ is not a function of $Z^{(n)}$, together with linearity of conditional expectations and of the trace. Note that $\Sigma_{\varepsilon|Z}$ is a diagonal matrix with diagonal entries equal to
\[ \mathbb{E}\Big[\sum_{k=1}^\infty \langle\varepsilon_i, e_k\rangle^2 \,\Big|\, Z^{(n)}\Big] = \mathbb{E}\big[\|\varepsilon_i\|^2\mid Z_i\big], \]
hence we can bound each diagonal term by $\sigma^2$ by assumption. This implies that
\[ \mathrm{tr}\big(D(D+\gamma I)^{-1}U^T\Sigma_{\varepsilon|Z}U D(D+\gamma I)^{-1}\big) \leq \sigma^2\,\mathrm{tr}\big(D(D+\gamma I)^{-1}D(D+\gamma I)^{-1}\big) = \sigma^2\sum_{i=1}^n \frac{d_i^2}{(d_i+\gamma)^2}. \]
The first term can be dealt with immediately by noting that
\[ \sum_{k=1}^\infty \|(D(D+\gamma I)^{-1} - I)\theta_k\|_2^2 = \sum_{k=1}^\infty \sum_{i=1}^n \frac{\gamma^2\theta^2_{k,i}}{(d_i+\gamma)^2} = \sum_{k=1}^\infty \sum_{i:d_i>0} \frac{\theta^2_{k,i}}{d_i}\cdot\frac{d_i\gamma^2}{(d_i+\gamma)^2} \leq \Big(\max_{i\in\{1,\ldots,n\}} \frac{d_i\gamma^2}{(d_i+\gamma)^2}\Big) \sum_{k=1}^\infty \sum_{i:d_i>0} \frac{\theta^2_{k,i}}{d_i} \leq \frac{\gamma}{4}\sum_{k=1}^\infty \sum_{i:d_i>0} \frac{\theta^2_{k,i}}{d_i}. \]
Here the second equality uses that $\theta_k = U^T K\alpha_k = DU^T\alpha_k$, so $\theta_{k,i} = 0$ whenever $d_i = 0$, and the final inequality uses that $ab^2/(a+b)^2 \leq b/4$ for $a, b \geq 0$. Let $D^+$ denote the generalised inverse of $D$, i.e. $D^+_{ii} := d_i^{-1}\mathbf{1}_{\{d_i>0\}}$. Then
\[ \sum_{i:d_i>0} \frac{\theta^2_{k,i}}{d_i} = \|\sqrt{D^+}\theta_k\|_2^2 = \alpha_k^T K U D^+ U^T K\alpha_k = \alpha_k^T U D D^+ D U^T\alpha_k = \alpha_k^T K\alpha_k = \|u_k\|^2 \leq \|u_k\|^2 + \|v_k\|^2 = \|\beta_k\|^2. \]
Putting things together, we have
\[ \sum_{k=1}^\infty \|(D(D+\gamma I)^{-1}-I)\theta_k\|_2^2 \leq \frac{\gamma}{4}\sum_{k=1}^\infty \|\beta_k\|^2 = \frac{\gamma}{4}\|S\|^2_{\mathrm{HS}}. \]
Hence
\[ \frac{1}{n}\,\mathbb{E}\Big(\sum_{i=1}^n \|S(Z_i)-\hat{S}_\gamma(Z_i)\|^2 \,\Big|\, Z^{(n)}\Big) \leq \frac{\sigma^2}{n}\sum_{i=1}^n \frac{d_i^2}{(d_i+\gamma)^2} + \frac{\gamma}{4n}\|S\|^2_{\mathrm{HS}}, \]
and using that
\[ \frac{d_i^2}{(d_i+\gamma)^2} \leq \min\Big(1,\, \frac{d_i^2}{4d_i\gamma}\Big) = \min(d_i/4,\,\gamma)/\gamma, \]
we have shown equation (31). For equation (32), setting $\gamma' := \gamma/n$, we can note that the upper bound from before can be written as
\[ \frac{\sigma^2}{\gamma' n}\sum_{i=1}^n \min\big((d_i/n)/4,\, \gamma'\big) + \frac{\|S\|^2_{\mathrm{HS}}\,\gamma'}{4}. \]
Since $(d_i/n)_{i=1}^n$ are the eigenvalues of $K/n$, we obtain the desired bound by Koltchinskii [12, Propositions 3.3 and 3.4].

Using Lemma 14, we can now prove Proposition 2.

Proof.
Using Lemma 14, we can now prove Proposition 2.

Proof. By Theorem 3 and the assumptions of the proposition, it is sufficient to show that
\[
\sup_{P \in \tilde{\mathcal{P}}} \sqrt{n}\,\mathbb{E}_P\Big(\frac{1}{n}\sum_{i=1}^n \|S^X_P(Z_i) - \hat S^X(Z_i)\|^2\Big) \to 0,
\]
together with the analogous statement for the regression of $Y$ on $Z$. This can be seen by noting that an application of the Cauchy--Schwarz and Markov inequalities then yields that $nM^f_{n,P}M^g_{n,P}$ converges to zero in probability uniformly over $\tilde{\mathcal{P}}$ and, combined with the conditions on $u_P$ and $v_P$ in assumption (ii), that $\tilde M^f_{n,P}$ and $\tilde M^g_{n,P}$ do so as well. Let $\sigma > 0$ be such that $\sigma^2$ is an almost sure upper bound on $u_P(Z_P)$ and $v_P(Z_P)$ for all $P \in \tilde{\mathcal{P}}$ (existing by assumption (ii)), and let $C > 0$ be an upper bound on $\|S^X_P\|^2_{\mathrm{HS}}$ and $\|S^Y_P\|^2_{\mathrm{HS}}$ over $\tilde{\mathcal{P}}$ (existing by assumption (i)). Conditioning on $Z_1, \dots, Z_n$ and applying equation (31) in Lemma 14, in the rescaled form derived in its proof with $(\hat\mu_i)_{i=1}^n$ denoting the eigenvalues of $K/n$, we get that
\begin{align*}
\sup_{P\in\tilde{\mathcal{P}}} \mathbb{E}_P\Big(\frac{1}{n}\sum_{i=1}^n \|S^X_P(Z_i) - \hat S^X(Z_i)\|^2\Big) &\le \sup_{P\in\tilde{\mathcal{P}}} \mathbb{E}_P\Big(\frac{\sigma^2}{\hat\gamma n}\sum_{i=1}^n \min(\hat\mu_i/4,\, \hat\gamma) + \frac{\|S^X_P\|^2_{\mathrm{HS}}\,\hat\gamma}{4}\Big) \\
&\le \max(\sigma^2, C)\, \sup_{P\in\tilde{\mathcal{P}}} \mathbb{E}_P\bigg[\min_{\gamma>0}\Big(\frac{1}{\gamma n}\sum_{i=1}^n \min(\hat\mu_i/4,\, \gamma) + \frac{\gamma}{4}\Big)\bigg].
\end{align*}
Further, there exists $C' > 0$ such that
\begin{align*}
\sup_{P\in\tilde{\mathcal{P}}} \mathbb{E}_P\bigg[\min_{\gamma>0}\Big(\frac{1}{\gamma n}\sum_{i=1}^n \min(\hat\mu_i/4,\, \gamma) + \frac{\gamma}{4}\Big)\bigg] &\le \sup_{P\in\tilde{\mathcal{P}}} \inf_{\gamma>0}\, \mathbb{E}_P\Big(\frac{1}{\gamma n}\sum_{i=1}^n \min(\hat\mu_i/4,\, \gamma) + \frac{\gamma}{4}\Big) \\
&\le \sup_{P\in\tilde{\mathcal{P}}} \inf_{\gamma>0}\Big(\frac{C'}{\gamma n}\sum_{i=1}^\infty \min(\mu_{i,P},\, \gamma) + \frac{\gamma}{4}\Big),
\end{align*}
where the second inequality is due to Koltchinskii [12, Propositions 3.3 and 3.4]. The above derivations imply that it is sufficient to show that
\[
\sqrt{n}\, \sup_{P\in\tilde{\mathcal{P}}} \inf_{\gamma>0}\Big(\frac{1}{\gamma n}\sum_{i=1}^\infty \min(\mu_{i,P},\, \gamma) + \frac{\gamma}{4}\Big) \to 0
\]
as $n \to \infty$. For each $P \in \tilde{\mathcal{P}}$, we let $\phi_P \colon \mathbb{R}_+ \to \mathbb{R}_+$ be given by
\[
\phi_P(\gamma) = \sum_{i=1}^\infty \min(\mu_{i,P},\, \gamma).
\]
By assumption (iii), $\lim_{\gamma\downarrow 0} \sup_{P\in\tilde{\mathcal{P}}} \phi_P(\gamma) = 0$, hence for any $\varepsilon > 0$ there exists $N \in \mathbb{N}$ such that for any $n \ge N$, $\sup_{P\in\tilde{\mathcal{P}}} \sqrt{\phi_P(n^{-1/2})} < \varepsilon/2$. Let $\gamma_{n,P} = n^{-1/2}\sqrt{\phi_P(n^{-1/2})}$. Then
\begin{align*}
\sqrt{n}\, \sup_{P\in\tilde{\mathcal{P}}} \inf_{\gamma>0}\Big(\frac{1}{\gamma n}\sum_{i=1}^\infty \min(\mu_{i,P},\, \gamma) + \frac{\gamma}{4}\Big) &= \sup_{P\in\tilde{\mathcal{P}}} \inf_{\gamma>0}\Big(\frac{\phi_P(\gamma)}{\gamma\sqrt{n}} + \frac{\sqrt{n}\,\gamma}{4}\Big) \le \sup_{P\in\tilde{\mathcal{P}}}\Big(\frac{\phi_P(\gamma_{n,P})}{\gamma_{n,P}\sqrt{n}} + \frac{\sqrt{n}\,\gamma_{n,P}}{4}\Big) \\
&= \sup_{P\in\tilde{\mathcal{P}}}\Bigg(\frac{\phi_P\big(n^{-1/2}\sqrt{\phi_P(n^{-1/2})}\big)}{\sqrt{\phi_P(n^{-1/2})}} + \frac{\sqrt{\phi_P(n^{-1/2})}}{4}\Bigg).
\end{align*}
Assuming without loss of generality that $\varepsilon \le 2$ and using that each $\phi_P$ is increasing, we get that for $n \ge N$,
\[
\sup_{P\in\tilde{\mathcal{P}}}\Bigg(\frac{\phi_P\big(n^{-1/2}\sqrt{\phi_P(n^{-1/2})}\big)}{\sqrt{\phi_P(n^{-1/2})}} + \frac{\sqrt{\phi_P(n^{-1/2})}}{4}\Bigg) \le \sup_{P\in\tilde{\mathcal{P}}}\Bigg(\frac{\phi_P\big(n^{-1/2}\,\varepsilon/2\big)}{\sqrt{\phi_P(n^{-1/2})}} + \frac{\sqrt{\phi_P(n^{-1/2})}}{4}\Bigg) \le \frac{5}{4}\sup_{P\in\tilde{\mathcal{P}}} \sqrt{\phi_P(n^{-1/2})} < \varepsilon,
\]
proving the result.
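The final rate argument can also be illustrated numerically. The Python sketch below is not from the paper and assumes, purely for illustration, a single $P$ with polynomial eigenvalue decay $\mu_{k,P} = k^{-2}$. It evaluates $\sqrt{n}\inf_{\gamma>0}\big(\phi_P(\gamma)/(\gamma n) + \gamma/4\big)$, approximating the infimum over a grid, together with the value attained at the specific choice $\gamma_{n,P} = n^{-1/2}\sqrt{\phi_P(n^{-1/2})}$ used in the proof; both printed sequences decay to zero as $n$ grows, in line with the conclusion of the proof under assumption (iii).

\begin{verbatim}
import numpy as np

mu = 1.0 / np.arange(1, 10**5 + 1) ** 2    # illustrative eigenvalues mu_k = k^{-2}

def phi(gamma):
    # phi_P(gamma) = sum_k min(mu_k, gamma), truncated to 10^5 terms
    return np.sum(np.minimum(mu, gamma))

for n in [10**2, 10**4, 10**6]:
    grid = np.logspace(-8, 1, 300)         # grid stand-in for the infimum over gamma > 0
    inf_val = min(phi(g) / (g * n) + g / 4 for g in grid)
    g_np = n**-0.5 * np.sqrt(phi(n**-0.5)) # the choice gamma_{n,P} from the proof
    proof_val = phi(g_np) / (g_np * n) + g_np / 4
    print(n, np.sqrt(n) * inf_val, np.sqrt(n) * proof_val)
\end{verbatim}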
D Full results from Section 5.1

The full results from the simulation study in Section 5.1 can be seen below.

[Figure 6: Rejection rates against significance level $\alpha$ for the pfr (red) and GHCM (green) tests under null (light) and alternative (dark) settings when $a = 2$; panels vary the covariate process $X$ and the sample size $n$.]

[Figure 7: Rejection rates against significance level $\alpha$ for the pfr (red) and GHCM (green) tests under null (light) and alternative (dark) settings when $a = 6$; panels vary the covariate process $X$ and the sample size $n$.]

[Figure 8: Rejection rates against significance level $\alpha$ for the pfr (red) and GHCM (green) tests under null (light) and alternative (dark) settings when $a = 12$; panels vary the covariate process $X$ and the sample size $n$.]

References

[1] V. Bengs and H. Holzmann. Uniform approximation in classical weak convergence theory, 2019.
[2] P. Billingsley. Convergence of Probability Measures. John Wiley & Sons, Inc., 1999. doi: 10.1002/9780470316962.
[3] V. Bogachev. Weak Convergence of Measures. Mathematical Surveys and Monographs. American Mathematical Society, 2018. ISBN 9781470447380. URL https://books.google.dk/books?id=fchwDwAAQBAJ.
[4] V. I. Bogachev. Measure Theory. Springer Berlin Heidelberg, 2007. doi: 10.1007/978-3-540-34514-5.
[5] N. L. Carothers. Real Analysis. Cambridge University Press, 2000. doi: 10.1017/CBO9780511814228.
[6] J. L. Doob. Measure Theory. Springer New York, 1994. doi: 10.1007/978-1-4612-0877-8.
[7] R. M. Dudley. Real Analysis and Probability. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2nd edition, 2002. doi: 10.1017/CBO9780511755347.
[8] R. Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 5th edition, 2019. doi: 10.1017/9781108591034.
[9] T. Hsing and R. Eubank. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. John Wiley & Sons, Ltd, 2015. doi: 10.1002/9781118762547.
[10] M. Kasy. Uniformity and the delta method. Journal of Econometric Methods, 8(1):1–19, 2019. URL https://ideas.repec.org/a/bpj/jecome/v8y2019i1p19n9.html.
[11] G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495–502, 1970.
[12] V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer Berlin Heidelberg, 2011. doi: 10.1007/978-3-642-22147-7.
[13] C. Kraft. Some Conditions for Consistency and Uniform Consistency of Statistical Procedures. University of California Press, 1955.
[14] L. LeCam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, 1(1):38–53, 1973. doi: 10.1214/aos/1193342380.
[15] A. Rønn-Nielsen and E. Hansen. Conditioning and Markov properties. 2014.
[16] F. S. Scalora. Abstract martingale convergence theorems. Pacific Journal of Mathematics, 11(1):347–374, 1961. URL https://projecteuclid.org:443/euclid.pjm/1103037558.
[17] R. L. Schilling. Measures, Integrals and Martingales. Cambridge University Press, 2017.
[18] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.
[19] R. D. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. Annals of Statistics, 48(3):1514–1538, 2020.
[20] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998. doi: 10.1017/CBO9780511802256.
[21] N. N. Vakhania, V. I. Tarieladze, and S. A. Chobanyan. Probability Distributions on Banach Spaces. Springer Netherlands, 1987. doi: 10.1007/978-94-009-3873-1.