Wasserstein F-tests and Confidence Bands for the Fréchet Regression of Density Response Curves
By Alexander Petersen‡,∗, Xi Liu∗ and Afshin A. Divani†

University of California, Santa Barbara∗ and University of New Mexico†

Data consisting of samples of probability density functions are increasingly prevalent, necessitating the development of methodologies for their analysis that respect the inherent nonlinearities associated with densities. In many applications, density curves appear as functional response objects in a regression model with vector predictors. For such models, inference is key to understanding the importance of density-predictor relationships, and the uncertainty associated with the estimated conditional mean densities, defined as conditional Fréchet means under a suitable metric. Using the Wasserstein geometry of optimal transport, we consider the Fréchet regression of density curve responses and develop tests for global and partial effects, as well as simultaneous confidence bands for estimated conditional mean densities. The asymptotic behavior of these objects is based on underlying functional central limit theorems within Wasserstein space, and we demonstrate that they are asymptotically of the correct size and coverage, with uniformly strong consistency of the proposed tests under sequences of contiguous alternatives. The accuracy of these methods, including nominal size, power, and coverage, is assessed through simulations, and their utility is illustrated through a regression analysis of post-intracerebral hemorrhage hematoma densities and their associations with a set of clinical and radiological covariates.
1. Introduction.
Samples of probability density functions arise naturally in many modern data analysis settings, including population age and mortality distributions across different countries or regions (Hron et al., 2016; Bigot et al., 2017; Petersen and Müller, 2019), and distributions of functional connectivity patterns in the brain (Petersen and Müller, 2016). Methods for the analysis of density data began with the work of Kneip and Utikal (2001), who applied standard functional principal components analysis (FPCA) in order to quantify a mean density and prominent modes of variability about that mean.

‡ Corresponding Author, supported by National Science Foundation grant DMS-1811888
MSC 2010 subject classifications: Primary 62J99, 62F03, 62F25; secondary 62F05, 62F12
Keywords and phrases: Random Densities, Least Squares Regression, Wasserstein Metric, Tests for Regression Effects, Simultaneous Confidence Band

However, due to inherent constraints of density functions, which must be nonnegative and integrate to one, nonlinear methods are steadily replacing such standard procedures. For example, Hron et al. (2016) and Petersen and Müller (2016) both proposed to apply a preliminary transformation to the densities, mapping them into a Hilbert space, after which linear methods such as FPCA can be suitably applied. The transformation of Hron et al. (2016) was specifically motivated by the extension of the Aitchison geometry (Aitchison, 1986) to infinite-dimensional compositional data due to the work of Egozcue, Diaz-Barrero and Pawlowsky-Glahn (2006), while those in Petersen and Müller (2016) were generic and not motivated by any particular geometry.

Parallel developments in the analysis of nonlinear data have been made in the broader field of object-oriented data analysis (Marron and Alonso, 2014), where a complex data space is endowed with a chosen metric that, in turn, defines the parameters of the model. Particular attention has been paid to manifold-valued data (e.g., Fletcher et al. (2004); Panaretos, Pham and Yao (2014)), with the main objects of interest being the Fréchet mean and variance, as well as dimension reduction tools designed to optimally retain variability in the data (Patrangenaru and Ellingson, 2015). Within this context, Srivastava, Jermyn and Joshi (2007) and Bigot et al. (2017) developed manifold-based dimension reduction techniques specifically designed for samples of density functions, the former utilizing the Fisher–Rao geometry and the latter the Wasserstein geometry of optimal transport. Of these two, the Wasserstein metric has proved in recent years to have greater appeal both theoretically, given its clear interpretation as an optimal transport cost (Villani, 2003; Ambrosio, Gigli and Savaré, 2008), as well as in applied settings (Bolstad et al., 2003; Broadhurst et al., 2006; Zhang and Müller, 2011; Panaretos and Zemel, 2016).

In this paper, we study a regression model with density functions as response variables under the Wasserstein geometry, with predictors being Euclidean vectors. Such data are frequently encountered in practice (e.g., Nerini and Ghattas (2007); Talská et al. (2018)). The goal of the model is to perform inference, specifically tests for covariate effects and confidence bands for the fitted conditional mean densities. For this reason, we assume a global regression model that does not require any smoothing or other tuning parameter to fit. Standard functional response models, such as the linear model (Faraway, 1997), are not suitable for the nonlinear space of density functions unless the densities are first transformed into a linear space, as demonstrated recently in Talská et al. (2018) using the compositional approach.
Fig 1: Observed hematoma density distributions for a randomly selected subset of 40 post-intracerebral hemorrhage patients.

However, there is no such transformation that is suitable for the Wasserstein geometry. Global regression models for Riemannian manifolds have been developed (Fletcher, 2013; Niethammer, Huang and Vialard, 2011; Cornea et al., 2017), but are also not directly applicable to the Wasserstein geometry. Instead, we will develop our inferential techniques under the Fréchet regression model proposed in Petersen and Müller (2019), which defines a global regression function between response data in an arbitrary metric space and vector predictors. Although the theory of estimation under this model was well studied in the general metric space framework, this is not the case for other forms of inference.

In Section 2, we briefly describe the necessary components of the Wasserstein geometry and its implications when applied to the Fréchet regression model. An additional component not considered in Petersen and Müller (2019) that is available through this particular formulation is the random optimal transport map between the conditional Wasserstein mean density and the observed one. These maps serve the purpose of the error term in the regression model, although they do not act additively or even linearly. In Section 3, we develop intuitive test statistics for covariate effects and derive their asymptotic distributions, leading to root-$n$ consistent testing procedures. We also describe different methods for implementing these tests, including a bootstrap procedure involving residual optimal transport maps obtained from the fitted model. Section 4 demonstrates how to compute asymptotic confidence bands about the estimated conditional mean distributions. Finally, these methods are illustrated through extensive simulations (Section 5) and the analysis of distributions of hematoma density for stroke patients (Section 6), as captured by computed tomography scans, and their dependency on patient-specific covariates (Figure 1).
2. Preliminaries.
We begin with a brief definition of the Wasserstein metric in the language of optimal transport. This Wasserstein metric has been known under other names, including "Mallows" and "earth-mover's" (Levina and Bickel, 2001), and its use in statistics is rapidly expanding (Panaretos and Zemel, 2019). Let $\mathcal{D}$ be the class of univariate probability density functions $f$ that satisfy $\int_{\mathbb{R}} u^2 f(u)\,du < \infty$, i.e., absolutely continuous distributions on $\mathbb{R}$ with a finite second moment. In fact, the Wasserstein metric is well-defined for such distributions without requiring a density, but for simplicity of presentation we deal specifically with this subclass. For a comprehensive treatment in the more general case, see Villani (2003) or Ambrosio, Gigli and Savaré (2008). For $f, g \in \mathcal{D}$, consider the collection of maps $\mathcal{M}_{f,g}$ such that, if $U \sim f$ and $M \in \mathcal{M}_{f,g}$, then $M(U) \sim g$. The squared Wasserstein distance between these two distributions is
$$d_W^2(f, g) = \inf_{M \in \mathcal{M}_{f,g}} \int_{\mathbb{R}} (M(u) - u)^2 f(u)\,du.$$
It is well known that the infimum above is attained by the optimal transport map $M_{f,g}^{\mathrm{opt}} = G^{-1} \circ F$, where $F$ and $G$ are the cumulative distribution functions of $f$ and $g$, respectively, leading to the closed forms
(2.1) $$d_W^2(f, g) = \int_{\mathbb{R}} \left(M_{f,g}^{\mathrm{opt}}(u) - u\right)^2 f(u)\,du = \int_0^1 \left(F^{-1}(t) - G^{-1}(t)\right)^2 dt,$$
where the last equality follows by the change of variables $t = F(u)$. A more proper term for this metric is the Wasserstein-2 distance, since it is just one among an entire class of Wasserstein metrics.

Within this larger class of metrics is also the Wasserstein-$\infty$ metric, which will be useful in the formation of confidence bands. For two densities $f, g \in \mathcal{D}$, their Wasserstein-$\infty$ distance is
(2.2) $$d_\infty(f, g) = f\text{-}\operatorname{ess\,sup}_{u \in \mathbb{R}} |M_{f,g}^{\mathrm{opt}}(u) - u|,$$
where $f$-$\operatorname{ess\,sup}$ refers to the essential supremum with respect to $f$.
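To make the quantile-function form of (2.1) concrete, the following minimal Python sketch approximates $d_W^2$ by a Riemann sum over a grid in $(0,1)$; the function names and the Gaussian example are illustrative assumptions rather than the paper's implementation.

    import numpy as np
    from scipy.stats import norm

    def wasserstein2_sq(quantile_f, quantile_g, n_grid=1000):
        """Squared Wasserstein-2 distance via the quantile form of (2.1):
        d_W^2(f, g) = int_0^1 (F^{-1}(t) - G^{-1}(t))^2 dt, approximated by a
        midpoint Riemann sum on (0, 1)."""
        t = (np.arange(n_grid) + 0.5) / n_grid
        diff = quantile_f(t) - quantile_g(t)
        return np.mean(diff ** 2)

    # For N(m1, s1^2) and N(m2, s2^2), d_W^2 = (m1 - m2)^2 + (s1 - s2)^2.
    d2 = wasserstein2_sq(lambda t: norm.ppf(t, 0, 1), lambda t: norm.ppf(t, 1, 2))
    print(d2)  # approximately (0 - 1)^2 + (1 - 2)^2 = 2

For two Gaussian distributions the closed form $(m_1 - m_2)^2 + (s_1 - s_2)^2$ provides a convenient check of the discretization.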
2.1. Random Densities. A random density $F$ is a random variable taking its values almost surely in $\mathcal{D}$. It will also be useful to refer to the corresponding random cdf $F$, quantile function $Q = F^{-1}$, and quantile density $q = Q' = 1/(F \circ Q)$ (Parzen, 1979). For clarity, $u, v \in \mathbb{R}$ will consistently be used as density and cdf arguments throughout, whereas $s, t \in [0,
1]$ will be used as arguments for the quantile and quantile density functions. Following the ideas of Fréchet (1948), the Wasserstein–Fréchet (or simply Wasserstein) mean and variance of $F$ are
(2.3) $$f_\oplus^* := \operatorname*{argmin}_{f \in \mathcal{D}} E\left(d_W^2(F, f)\right), \qquad \operatorname{Var}_\oplus(F) := E\left(d_W^2(F, f_\oplus^*)\right).$$
In the regression setting, we model the distribution of $F$ conditional on a vector $X \in \mathbb{R}^p$ of predictors, where the pair $(X, F)$ is distributed according to a probability measure $\mathcal{G}$ on the product space $\mathbb{R}^p \times \mathcal{D}$. In this sense, the objects in (2.3) are the marginal Fréchet mean and variance of $F$. Let $S_X$ denote the support of the marginal distribution of $X$. Our interest is in the Fréchet regression function, or function of conditional Fréchet means,
(2.4) $$f_\oplus(x) := \operatorname*{argmin}_{f \in \mathcal{D}} E\left[d_W^2(F, f) \mid X = x\right], \quad x \in S_X.$$
Let $F_\oplus(x)$, $Q_\oplus(x)$, and $q_\oplus(x)$ denote, respectively, the cdf, quantile, and quantile density functions corresponding to $f_\oplus(x)$. We will use the notation $f_\oplus(x, u)$ to denote the value of the conditional mean density $f_\oplus(x)$ at argument $u \in \mathbb{R}$, and similarly for $F_\oplus(x, u)$, $Q_\oplus(x, t)$, and $q_\oplus(x, t)$, $t \in [0, 1]$. For a pair $(X, F)$, define $T = Q \circ F_\oplus(X)$ to be the optimal transport map from the conditional mean $f_\oplus(X)$ to the random density $F$. By (2.1), it must be that $E(Q(t) \mid X = x) = Q_\oplus(x, t)$, so that $E(T(u) \mid X = x) = u$ for $u$ such that $f_\oplus(x, u) > 0$. Then the conditional Fréchet variance is
(2.5) $$\operatorname{Var}_\oplus(F \mid X = x) = E\left[d_W^2(F, f_\oplus(x)) \mid X = x\right] = \int_{\mathbb{R}} E\left[(T(u) - u)^2 \mid X = x\right] f_\oplus(x, u)\,du = \int_{\mathbb{R}} \operatorname{Var}(T(u) \mid X = x)\, f_\oplus(x, u)\,du.$$
In these developments, we have assumed that the marginal and conditional Wasserstein mean densities exist and are unique. However, this is not automatic and some conditions are needed. Previous work on existence, uniqueness, and regularity of Wasserstein means, or barycenters, for a finite collection of probability measures on $\mathbb{R}^d$ was done by Agueh and Carlier (2011), and extended to continuously-indexed measures by Pass (2013). For random probability measures with support on $[0, 1]$, Panaretos and Zemel (2016) (see Proposition 2 therein) gave sufficient conditions for the existence and uniqueness of a Wasserstein mean measure, although it was not guaranteed to have a density. However, none of these are sufficiently strong for the
purposes of this paper, where existence, uniqueness, and regularity of both marginal (2.3) and conditional (2.4) Wasserstein means are needed. To this end, consider the following assumptions on the joint distribution $\mathcal{G}$.

(A1) $F(u) \in (0, \infty)$ for $u \in (Q(0), Q(1))$ almost surely, $\operatorname{Var}(Q(t)) < \infty$ for all $t \in (0, 1)$, and $\int_0^1 \operatorname{Var}(Q(t))\,dt < \infty$.

(A2) For any $t \in (0, 1)$, there exists $\delta$, $0 < \delta < \min\{t, 1 - t\}$, such that $E\left(\sup_{|s - t| < \delta} q(s)\right) < \infty$.

(A3) For all $x \in S_X$, $P\left(\sup_{u \in (Q(0), Q(1))} F(u) < \infty \mid X = x\right) > 0$.

Assumption (A1) is essentially the same as that made in Proposition 2 of Panaretos and Zemel (2016), with additional moment assumptions on $Q$ since $F$ is not assumed to be supported on any bounded interval, and implies existence and uniqueness of the Wasserstein mean measures. Assumption (A2) is a regularity condition to ensure these Wasserstein means have densities in $\mathcal{D}$, and (A3) implies that these mean densities are bounded; see Pass (2013) for a similar assumption. The proof of the following and all other theoretical results can be found in the Appendix.

Proposition 1. Suppose conditions (A1)–(A3) hold. Then the marginal Wasserstein mean $f_\oplus^*$ in (2.3) exists, is a unique element of $\mathcal{D}$, and satisfies $\sup_u f_\oplus^*(u) < \infty$ and $\operatorname{Var}_\oplus(F) < \infty$. For all $x \in S_X$, the conditional Wasserstein mean $f_\oplus(x)$ in (2.4) exists, is a unique element of $\mathcal{D}$, and satisfies $\sup_u f_\oplus(x, u) < \infty$ and $\operatorname{Var}_\oplus(F \mid X = x) < \infty$.

2.2. Global Wasserstein–Fréchet Regression.
In order to facilitate inference, specifically tests for no or partial effects of the covariates $X$ and confidence bands for the conditional Wasserstein means, we consider a particular global regression model for the conditional Wasserstein means $f_\oplus(x)$ defined in (2.4). This model, proposed by Petersen and Müller (2019), is termed Fréchet regression, and takes the form of a weighted Fréchet mean
(2.6) $$f_\oplus(x) = \operatorname*{argmin}_{f \in \mathcal{D}} E\left[s(X, x)\, d_W^2(F, f)\right],$$
where the weight function is
$$s(X, x) = 1 + (X - \mu)^\top \Sigma^{-1} (x - \mu), \quad \mu = E(X), \quad \Sigma = \operatorname{Var}(X),$$
and $\Sigma$ is assumed to be positive definite. The model is motivated by multiple linear regression, and is its direct generalization to the case of a response variable in a metric space. Specifically, if a scalar response $Y$ is jointly distributed with $X$ and $E(Y \mid X = x)$ is linear in $x$, Petersen and Müller (2019) showed that an alternative characterization of linear regression is
$$E(Y \mid X = x) = \operatorname*{argmin}_{y \in \mathbb{R}} E\left[s(X, x)(Y - y)^2\right].$$
Thus, model (2.6) generalizes linear regression to the case of density response by substituting $F$ for $Y$ and $(\mathcal{D}, d_W)$ in place of the usual metric space $(\mathbb{R}, |\cdot|)$ implicitly used in multiple linear regression. Although $\mathcal{D}$ is not a linear space, (2.6) provides a sensible regression model for the conditional Wasserstein means that retains some properties of linear regression. For example, since $s(z, \mu) \equiv 1$, we have $f_\oplus(\mu) = f_\oplus^*$, so that the regression function passes through the point $(\mu, f_\oplus^*)$.

For the remainder of the paper, and implicit in the statement of all theoretical results, we assume that the distribution $\mathcal{G}$ satisfies model (2.6), with $f_\oplus^*$, $f_\oplus(x)$ being unique elements of $\mathcal{D}$. We now give a very basic example of this model, motivated by the well-known connection between the Wasserstein metric and location-scale families (e.g., Bigot et al. (2017)).
Example 1. Suppose $X \in \mathbb{R}$ and fix $f_0 \in \mathcal{D}$. Letting $a_j, b_j \in \mathbb{R}$, define $\nu(x) = a_0 + a_1 x$, $\tau(x) = b_0 + b_1 x$, and suppose $\tau(X) > 0$ almost surely. Let $V_1, V_2$ satisfy $E(V_1 \mid X) = 0$, $E(V_2 \mid X) = 1$, and $V_2 > 0$ almost surely. Then the location-scale model
$$F(u) = \frac{1}{V_2 \tau(X)}\, f_0\left(\frac{u - V_1 - V_2 \nu(X)}{V_2 \tau(X)}\right)$$
corresponds to (2.6) with
$$f_\oplus(x, u) = \frac{1}{\tau(x)}\, f_0\left(\frac{u - \nu(x)}{\tau(x)}\right).$$
To show the above, it is sufficient to show that $E(Q(t) \mid X) = Q_\oplus(X, t)$ almost surely. Let $F_0$ be the cdf corresponding to $f_0$, so that $Q_\oplus(x, t) = \nu(x) + \tau(x) Q_0(t)$. Since $Q(t) = V_1 + V_2 \nu(X) + V_2 \tau(X) Q_0(t)$, it is easily verified that $E(Q(t) \mid X) = Q_\oplus(X, t)$ by the properties of $V_1, V_2$. Moreover, note that the optimal transport map from $f_\oplus(X)$ to $F$ is $T(u) = Q \circ F_\oplus(X, u) = V_1 + V_2 u$, which satisfies $E(T(u) \mid X) = u$ almost surely.
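As a sketch of how observations could be generated from the location-scale model of Example 1 (a construction reused in the simulations of Section 5), the following Python snippet draws a sample from one random density by inverse-cdf sampling of $Q(t) = V_1 + V_2\nu(X) + V_2\tau(X)Q_0(t)$; the uniform laws for $V_1$ and $V_2$ are one convenient choice satisfying $E(V_1 \mid X) = 0$, $E(V_2 \mid X) = 1$, and $V_2 > 0$, and all names are hypothetical.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def sample_location_scale(x, n_obs, f0_quantile, nu, tau):
        """Draw n_obs observations from one random density F of Example 1,
        whose quantile function is Q(t) = V1 + V2*nu(x) + V2*tau(x)*Q0(t)."""
        V1 = rng.uniform(-0.5, 0.5)   # mean-zero location perturbation
        V2 = rng.uniform(0.5, 1.5)    # mean-one, positive scale perturbation
        t = rng.uniform(size=n_obs)
        return V1 + V2 * nu(x) + V2 * tau(x) * f0_quantile(t)

    # Hypothetical choices: Q0 standard normal, nu(x) = 2x, tau(x) = 2 + x.
    draws = sample_location_scale(x=0.3, n_obs=300, f0_quantile=norm.ppf,
                                  nu=lambda x: 2 * x, tau=lambda x: 2 + x)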
Thus far, the regression model (2.6) provides us with a formula for the conditional Wasserstein mean of $F$, whereas one also needs information on the conditional variance in order to conduct inference. To this end, observe that $Q = T \circ Q_\oplus(X)$, so that $T$ acts as a residual transport, although it acts on the quantile function, and not the density, and does so through composition and not additively. As observed in Section 2.1, the first order behavior of the residual transport $T$ is completely specified by the model as $E(T(u) \mid X) = u$ almost surely. We also impose the following assumption on the covariance.

(T1) The covariance function $C_T(u, v) = \operatorname{Cov}(T(u), T(v))$ is continuous, $\sup_{u \in \mathbb{R}} \operatorname{Var}(T(u)) < \infty$, and $\operatorname{Cov}(T(u), T(v) \mid X) = C_T(u, v)$ almost surely.

This corresponds to the classical constant variance or exogeneity assumption, requiring that the second order behavior of the residual transport be independent of the predictors. Define $S = Q_\oplus(X) \circ F_\oplus^*$, which is the optimal transport map from the marginal Wasserstein mean to the conditional one. Again, it is easily verified that $E(T \circ S(u)) = u$ for $u$ such that $f_\oplus^*(u) > 0$. These observations lead to the following decomposition of the Wasserstein variance.
Proposition 2. Suppose that assumption (T1) is satisfied. Then
(2.7) $$\operatorname{Var}_\oplus(F) = E[\operatorname{Var}_\oplus(F \mid X)] + \operatorname{Var}_\oplus(f_\oplus(X)) = \int_{\mathbb{R}} E[C_T(S(u), S(u))]\, f_\oplus^*(u)\,du + \int_{\mathbb{R}} (S(u) - u)^2 f_\oplus^*(u)\,du.$$

While a standard result in Euclidean spaces, the above variance decomposition does not generally hold in metric spaces such as $\mathcal{D}$. However, Proposition 2 demonstrates that this decomposition does indeed hold for random densities under the Wasserstein metric whenever model (2.6) holds. This finding motivates the specific choices of test statistics developed in Section 3.
2.3. Estimation.
In order to estimate the regression function $f_\oplus(x)$, we utilize an empirical version of the least-squares Wasserstein criterion in (2.6). First, set $\bar{X} = n^{-1}\sum_{i=1}^n X_i$, $\hat{\Sigma} = n^{-1}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\top$, and compute empirical weights $s_{in}(x) = 1 + (X_i - \bar{X})^\top \hat{\Sigma}^{-1}(x - \bar{X})$. Let $\mathcal{Q}$ be the set of quantile functions in $L^2[0,1]$. With $\|\cdot\|_{L^2}$ denoting the standard Hilbert norm on $L^2[0,1]$, the estimate of $Q_\oplus(x)$ is
(2.8) $$\hat{Q}_\oplus(x) = \operatorname*{argmin}_{Q \in \mathcal{Q}} \frac{1}{n}\sum_{i=1}^n s_{in}(x) \|Q - Q_i\|_{L^2}^2.$$
Implementation of this estimator is given in Algorithm 1 in the Appendix. In finite samples, $\hat{Q}_\oplus(x)$ will not necessarily admit a density. For almost all procedures described, this quantile estimate is sufficient, since it can be used to compute Wasserstein distances as in (2.1). Nevertheless, we will establish (see Lemma 2 in the Appendix) that $\hat{Q}_\oplus(x)$ admits a density $\hat{f}_\oplus(x) \in \mathcal{D}$ for large samples with high probability. When this holds, the estimate
(2.9) $$\hat{f}_\oplus(x) = \operatorname*{argmin}_{f \in \mathcal{D}} \frac{1}{n}\sum_{i=1}^n s_{in}(x)\, d_W^2(F_i, f)$$
is well-defined, and $\hat{f}_\oplus(x)$ is the density corresponding to the quantile estimate $\hat{Q}_\oplus(x)$ above. It can be computed in practice using Algorithm 2 in the Appendix.

Since we will also consider hypothesis tests of partial effects, write $x \in \mathbb{R}^p$ as $x = (y, z)$, where $y$ corresponds to the first $q$ entries of $x$, $q < p$. Similarly, take $X = (Y, Z)$, $\mu = (\mu_Y, \mu_Z)$, and
$$\Sigma = \begin{pmatrix} \Sigma_{YY} & \Sigma_{YZ} \\ \Sigma_{ZY} & \Sigma_{ZZ} \end{pmatrix},$$
with corresponding notations for the partitions of $\bar{X}$ and $\hat{\Sigma}$. Consider the null hypothesis
$$H_P: f_\oplus(x) = f_{0,\oplus}(y), \quad f_{0,\oplus}(y) := \operatorname*{argmin}_{f \in \mathcal{D}} E\left[s_0(Y, y)\, d_W^2(F, f)\right], \quad s_0(Y, y) = 1 + (Y - \mu_Y)^\top \Sigma_{YY}^{-1}(y - \mu_Y).$$
Then the restricted estimators $\hat{Q}_{0,\oplus}(y)$ and $\hat{f}_{0,\oplus}(y)$ are defined analogously to (2.8) and (2.9), only using submodel weights $s_{in,0}(y) = 1 + (Y_i - \bar{Y})^\top \hat{\Sigma}_{YY}^{-1}(y - \bar{Y})$.
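The estimator (2.8) admits a simple two-step implementation on a common quantile grid: since the weights $s_{in}(x)$ sum to $n$, the unconstrained minimizer is the weighted average of the $Q_i$, which is then projected in $L^2$ onto the convex set of nondecreasing functions. The following Python sketch uses isotonic regression for the projection step; it is written in the spirit of Algorithm 1, whose exact form appears in the Appendix, and all array names are assumptions.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def frechet_quantile_fit(X, Qmat, x0):
        """Sketch of (2.8). Rows of Qmat hold the quantile functions Q_i on a
        common grid in (0, 1); X is the n x p predictor matrix. Because the
        weights sum to n, the weighted criterion equals n*||Q - Qbar_w||^2 up
        to a constant, so the monotone projection of the weighted mean Qbar_w
        solves (2.8) on the grid."""
        n = X.shape[0]
        Xbar = X.mean(axis=0)
        Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))
        s = 1 + (X - Xbar) @ Sigma_inv @ (x0 - Xbar)   # weights s_in(x0)
        Qbar_w = s @ Qmat / n                          # pointwise weighted mean
        t = np.linspace(0, 1, Qmat.shape[1])
        return IsotonicRegression().fit_transform(t, Qbar_w)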
3. Hypothesis Testing.
Once estimation is under control, the first goal of any global regression model is to test for effects of the predictors. In the more abstract setting of a response variable in an arbitrary metric space, Petersen and Müller (2019) suggested a permutation approach based on a Fréchet generalization of the coefficient of determination $R^2$ in multiple linear regression, though the theoretical properties of this test were not investigated. In a recent preprint, Dubey and Müller (2019) developed a test statistic and its asymptotic distribution for the case of a random object response and categorical predictors, giving a Fréchet extension of analysis of variance. Given that we are considering the more specific case of density-valued response variables under the Wasserstein geometry, we are able to expand on these results in order to develop asymptotically justified tests for both global and partial null hypotheses, where predictors can be of any type.
3.1. Test of No Effects.
We begin with the global null hypothesis of no effects, $H_G: f_\oplus(x) \equiv f_\oplus^*$. Given the Wasserstein variance decomposition in (2.7), under $H_G$ we have $\operatorname{Var}_\oplus(f_\oplus(X)) = E\left(d_W^2(f_\oplus(X), f_\oplus^*)\right) = 0$. This motivates
(3.1) $$F_G^* = \sum_{i=1}^n d_W^2(\hat{F}_i, \hat{f}_\oplus^*)$$
as a test statistic, where $\hat{F}_i = \hat{f}_\oplus(X_i)$ are the fitted densities and $\hat{f}_\oplus^* = \hat{f}_\oplus(\bar{X})$ is the sample Wasserstein mean. This can be viewed as a generalization of the numerator of the global $F$-test in multiple linear regression, and we thus refer to $F_G^*$ in (3.1) as the (global) Wasserstein $F$-statistic.

In order to establish the asymptotic null distribution of $F_G^*$, we require the following assumptions. Define $C_T^{(l,m)} = \frac{\partial^{l+m}}{\partial u^l \partial v^m} C_T$ and, for any $x \in \mathbb{R}^p$, set $\tilde{x} = (1, x^\top)^\top$. Let $R = T \circ S = Q \circ F_\oplus^*$ and $\mathcal{I}_x = (Q_\oplus(x, 0), Q_\oplus(x, 1))$.

(T2) $T$ is differentiable almost surely, with $T'$ having random Lipschitz constant $L$. The covariance function $C_T$ has continuous partial derivatives $C_T^{(l,m)}$, for $l, m = 0, 1$, with $\sup_{u \in \mathbb{R}} C_T^{(1,1)}(u, u) < \infty$.

(T3) $E(\|X\|_E^4)$, $E(\|\tilde{X}\|_E^2 \sup_{u \in \mathbb{R}} |T'(u)|^2)$, and $E(\|\tilde{X}\|_E^2 L^2)$ are all finite.

(T4) $\mathcal{I}^* = \mathcal{I}_\mu$ is a bounded interval, and $\gamma(u) = \operatorname{Cov}(X, R(u))$ has bounded second derivative on $\mathcal{I}^*$. Also, $\inf_{u \in \mathcal{I}^*} f_\oplus^*(u) > M^{-1} > 0$, and $\sup_{u \in \mathcal{I}_x} f_\oplus(x, u) < M$ for almost all $x$.

Assumptions (T2) and (T3) impose conditions on the joint distribution of
$(X, T)$. Condition (T2) is a smoothness condition on the optimal transport process $T$, while the moment requirements in (T3) are tightness conditions that allow for the necessary asymptotic Gaussianity to hold; see Lemma 1 in the Appendix. Assumption (T4) involves the regression relationship between $X$ and $F$; importantly, it ensures that the conditional mean densities are sufficiently and uniformly separated from the boundary of $\mathcal{D}$ within the space of distributions with finite second moments. In the spirit of regression, we will consider the asymptotic behavior of $F_G^*$ conditional on the observed predictors. Define the covariance kernel
(3.2) $$K(u, v) = E[C_T(S(u), S(v))] = \sum_{j=1}^\infty \lambda_j \phi_j(u) \phi_j(v), \quad u, v \in \bar{\mathcal{I}}^*,$$
where $\bar{\mathcal{I}}^*$ is the closure of $\mathcal{I}^*$. The right-hand side is the Mercer decomposition of $K$ (Hsing and Eubank, 2015), so that $\{\phi_j\}_{j=1}^\infty$ forms an orthonormal set in $L^2(\bar{\mathcal{I}}^*, f_\oplus^*)$ and the $\lambda_j$ are positive, nonincreasing in $j$, and satisfy $\sum_{j=1}^\infty \lambda_j < \infty$. Since $K$ can be associated with a linear integral operator on $L^2(\bar{\mathcal{I}}^*, f_\oplus^*)$, we will refer to the $\lambda_j$ as the eigenvalues of $K$.

Theorem 1. Suppose assumptions (T1)–(T4) hold. Then
$$F_G^* \mid X_1, \ldots, X_n \overset{D}{\to} \sum_{j=1}^\infty \lambda_j \omega_j \quad \text{almost surely},$$
where the $\omega_j$ are i.i.d. $\chi_p^2$ random variables and the $\lambda_j$ are the eigenvalues in (3.2).

While this limiting distribution may appear surprising at first sight given the non-Euclidean setting, it is a result of the fact that central limit theorems can still be derived for data on manifolds (e.g., Barden, Le and Owen (2013)), including the space of distributions under the Wasserstein metric (Panaretos and Zemel, 2016).

The limiting distribution obtained in Theorem 1 depends on unknown parameters, namely the eigenvalues $\lambda_j$, that must be approximated to formulate a rejection region. A natural approach would be to estimate the kernel $K$ in (3.2) directly, followed by applying a modified Mercer decomposition incorporating the estimated marginal Wasserstein mean. Thankfully, this complicated approach is not necessary. Define the quantile conditional covariance kernel $C_Q(s, t) = E\{\operatorname{Cov}(Q(s), Q(t) \mid X)\}$. A simple calculation reveals that $C_Q(s, t) = K(Q_\oplus^*(s), Q_\oplus^*(t))$, so that the $\lambda_j$ are also the eigenvalues of $C_Q$, which can be estimated by
(3.3) $$\hat{C}_Q(s, t) = \frac{1}{n}\sum_{i=1}^n (Q_i(s) - \hat{Q}_i(s))(Q_i(t) - \hat{Q}_i(t)),$$
where the $\hat{Q}_i$ are the fitted quantile functions corresponding to the densities $\hat{F}_i$. Let $\hat{\lambda}_j$ be the corresponding eigenvalues of $\hat{C}_Q$.

One can consistently approximate the conditional null distribution of $F_G^*$ as follows. For $\alpha \in (0,
$1)$ and eigenvalue estimates $\hat{\lambda}_j$, let $\hat{b}_\alpha^G$ be the $1 - \alpha$ quantile of $\sum_{j=1}^\infty \hat{\lambda}_j \omega_j$, where the $\omega_j$ are as in Theorem 1. Computation of this critical value is outlined in Algorithm 3 of the Appendix. The following result on the conditional power $\beta_n^G = P_X(F_G^* > \hat{b}_\alpha^G)$ follows from Theorem 1. Let $\mathbb{G}$ denote the collection of distributions on $\mathbb{R}^p \times \mathcal{D}$ such that model (2.6) holds.

Corollary 1. If $\mathcal{G} \in \mathbb{G}$ satisfies $H_G$ and (T1)–(T4) hold, then $\beta_n^G \to \alpha$ almost surely.
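A minimal numerical sketch of this calibration (in the spirit of Algorithm 3, whose exact form is in the Appendix) is given below: the eigenvalues of $\hat{C}_Q$ in (3.3) are approximated by those of the discretized kernel on a grid, the statistic (3.1) is computed from the same quantile grids, and the critical value is obtained by Monte Carlo; the grid discretization and Monte Carlo size are implementation choices, not prescriptions from the paper.

    import numpy as np

    def global_wasserstein_F(Qhat_mat, Qstar):
        """Global Wasserstein F-statistic (3.1): sum of squared Wasserstein
        distances between fitted quantile functions (rows of Qhat_mat) and
        the estimated marginal mean quantile function Qstar."""
        return np.sum(np.mean((Qhat_mat - Qstar) ** 2, axis=1))

    def global_critical_value(Qmat, Qhat_mat, p, alpha=0.05, n_mc=5000, seed=1):
        """1 - alpha quantile of sum_j lambda_j * omega_j, omega_j iid
        chi-square(p), using eigenvalues of the discretized C_Q-hat."""
        n, m = Qmat.shape
        R = Qmat - Qhat_mat                    # residual quantile functions
        C = R.T @ R / n                        # m x m discretization of (3.3)
        lam = np.linalg.eigvalsh(C)[::-1] / m  # kernel eigenvalue approximation
        lam = lam[lam > 1e-12]
        rng = np.random.default_rng(seed)
        draws = rng.chisquare(p, size=(n_mc, lam.size)) @ lam
        return np.quantile(draws, 1 - alpha)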
In addition to having the correct asymptotic size for any null model, we demonstrate the power performance of the above test under a sequence of contiguous alternatives. To do so, consider a subclass $\mathbb{G}^* \subset \mathbb{G}$ of Wasserstein regression models, with the marginal distribution of $X$ being fixed and satisfying $E(\|X\|_E^4) < \infty$, for which the criteria below are satisfied. Recall that $\gamma(u) = \operatorname{Cov}(X, R(u))$, where $R = T \circ S$.

(G1) For all $\mathcal{G} \in \mathbb{G}^*$, (T1) and (T2) hold, with $\sup_{u \in \mathbb{R}} C_T(u, u)$ and $\sup_{u \in \mathbb{R}} C_T^{(1,1)}(u, u)$ uniformly bounded in $\mathbb{G}^*$. In addition, $\sup_{u \in \mathcal{I}^*} \|\gamma'(u)\|_E$ and $\sup_{u \in \mathcal{I}^*} \|\gamma''(u)\|_E$ are both uniformly bounded in $\mathbb{G}^*$.

(G2) With $L$ being the random Lipschitz constant in (T2), the following uniform moment conditions are satisfied:
$$\limsup_{M \to \infty} \sup_{\mathcal{G} \in \mathbb{G}^*} E\left[\|\tilde{X}\|_E^2 \sup_{u \in \mathbb{R}} |T'(u)|^2\, \mathbf{1}\left(\|\tilde{X}\|_E \sup_{u \in \mathbb{R}} |T'(u)| > M\right)\right] = 0,$$
$$\limsup_{M \to \infty} \sup_{\mathcal{G} \in \mathbb{G}^*} E\left[\|\tilde{X}\|_E^2 L^2\, \mathbf{1}\left(\|\tilde{X}\|_E L > M\right)\right] = 0.$$

(G3) There exists $M > 0$ such that $\inf_{\mathcal{G} \in \mathbb{G}^*} \inf_{u \in \mathcal{I}^*} f_\oplus^*(u) \geq M^{-1}$.

(G4) There exists $M < \infty$ such that $\sup_{\mathcal{G} \in \mathbb{G}^*} \sup_{x \in S_X} \sup_{u \in \mathcal{I}_x} f_\oplus(x, u) \leq M$.

Theorem 2. Let $\mathbb{G}^*$ satisfy (G1)–(G4), and let $a_n$ be a sequence such that $a_n \to 0$ and $\sqrt{n}\, a_n \to \infty$. Consider the sequence of alternative global hypotheses
$$H_{A,n}^G: \mathcal{G} \in \mathbb{G}_{A,n}, \quad \mathbb{G}_{A,n} = \left\{\mathcal{G} \in \mathbb{G}^*: E[d_W^2(f_\oplus(X), f_\oplus^*)] \geq a_n\right\}.$$
Then the worst case power converges strongly and uniformly to 1; that is, for any $\epsilon > 0$,
$$\inf_{\mathcal{G} \in \mathbb{G}_{A,n}} P\left(\inf_{m \geq n} \beta_m^G \geq 1 - \epsilon\right) \to 1.$$

3.2. Test of Partial Effects.
Beyond a test for no effect, it is often necessary to test the effect of just a single predictor or a subset of them. With $X = (Y, Z)$ as in Section 2.3, under the partial null hypothesis $H_P: f_\oplus(x) = f_{0,\oplus}(y)$,
$$E\left(d_W^2(f_\oplus(X), f_\oplus^*)\right) = E\left(d_W^2(f_{0,\oplus}(Y), f_\oplus^*)\right),$$
motivating the partial Wasserstein $F$-statistic
(3.4) $$F_P^* = \sum_{i=1}^n \left[d_W^2(\hat{F}_i, \hat{f}_\oplus^*) - d_W^2(\hat{F}_{0,i}, \hat{f}_\oplus^*)\right], \quad \hat{F}_{0,i} = \hat{f}_{0,\oplus}(Y_i),$$
corresponding to the numerator of the partial $F$-statistic in the multiple linear regression setting.

Setting $J^\top = [-\Sigma_{ZY}\Sigma_{YY}^{-1} \;\; I_{p-q}]$ and $\Sigma_{Z|Y} = \Sigma_{ZZ} - \Sigma_{ZY}\Sigma_{YY}^{-1}\Sigma_{YZ}$, define the covariance matrix kernel
(3.5) $$K^*(u, v) = \Sigma_{Z|Y}^{-1/2} J^\top E\left[(X - \mu)(X - \mu)^\top C_T(S(u), S(v))\right] J \Sigma_{Z|Y}^{-1/2} = \sum_{l=1}^\infty \tau_l \varphi_l(u) \varphi_l^\top(v).$$
Here, the functions $\varphi_l$ form an orthonormal set in $\left(L^2(\bar{\mathcal{I}}^*, f_\oplus^*)\right)^{p-q}$ and the positive, nonincreasing eigenvalues $\tau_l$ of $K^*$ satisfy $\sum_{l=1}^\infty \tau_l < \infty$.

Theorem 3. Under (T1)–(T4),
$$F_P^* \mid X_1, \ldots, X_n \overset{D}{\to} \sum_{l=1}^\infty \tau_l \xi_l^2 \quad \text{almost surely},$$
where the $\xi_l$ are independent standard normal random variables. If, in addition, $E(Z \mid Y)$ is linear in $Y$ and $\operatorname{Var}(Z \mid Y) = \Sigma_{Z|Y}$ almost surely, then
$$\{\tau_l\}_{l=1}^\infty = \{\underbrace{\lambda_1, \ldots, \lambda_1}_{p-q}, \underbrace{\lambda_2, \ldots, \lambda_2}_{p-q}, \ldots\},$$
so that $\sum_{l=1}^\infty \tau_l \xi_l^2 \overset{D}{=} \sum_{j=1}^\infty \lambda_j \omega_j'$, where the $\omega_j'$ are i.i.d. $\chi_{p-q}^2$ random variables.

Similar to the global case, the $\tau_l$ in (3.5) also correspond to the eigenvalues of the kernel $C_Q^*(s, t) = K^*(Q_\oplus^*(s), Q_\oplus^*(t))$. Setting $X_{i,c} = X_i - \bar{X}$, a natural estimator is
(3.6) $$\hat{C}_Q^*(s, t) = \hat{\Sigma}_{Z|Y}^{-1/2} \hat{J}^\top \left\{\frac{1}{n}\sum_{i=1}^n X_{i,c} X_{i,c}^\top (Q_i(s) - \hat{Q}_i(s))(Q_i(t) - \hat{Q}_i(t))\right\} \hat{J} \hat{\Sigma}_{Z|Y}^{-1/2},$$
where $\hat{\Sigma}_{Z|Y}$ and $\hat{J}$ are plug-in estimates. Let $\hat{\tau}_l$ be the corresponding eigenvalue estimates from $\hat{C}_Q^*(s, t)$, and let $\hat{b}_\alpha^P$ be the $1 - \alpha$ quantile of $\sum_{l=1}^\infty \hat{\tau}_l \xi_l^2$, with the $\xi_l$ as in Theorem 3. If the additional assumptions in the second part of Theorem 3 hold, then one may instead let $\hat{b}_\alpha^P$ be the $1 - \alpha$ quantile of $\sum_{j=1}^\infty \hat{\lambda}_j \omega_j'$. Computation of these critical values can be done in the same way as in the global case. We have the following size and power results for the partial Wasserstein $F$-test, with $\beta_n^P = P_X(F_P^* > \hat{b}_\alpha^P)$ denoting the conditional power as a function of the underlying model $\mathcal{G}$.
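On a common quantile grid, the statistic (3.4) reduces to a difference of discretized $L^2$ distances under the full and restricted fits, as in the following short sketch (array names are assumptions):

    import numpy as np

    def partial_wasserstein_F(Qhat_full, Qhat_sub, Qstar):
        """Partial Wasserstein F-statistic (3.4): rows of Qhat_full and
        Qhat_sub are fitted quantile functions under the full model and the
        submodel using only Y; Qstar is the marginal mean quantile estimate."""
        full = np.sum(np.mean((Qhat_full - Qstar) ** 2, axis=1))
        sub = np.sum(np.mean((Qhat_sub - Qstar) ** 2, axis=1))
        return full - sub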
Corollary 2. If $\mathcal{G} \in \mathbb{G}$ satisfies $H_P$ and (T1)–(T4) hold, then $\beta_n^P \to \alpha$ almost surely.

Theorem 4. Let $\mathbb{G}^*$ satisfy (G1)–(G4), and let $a_n$ be as in Theorem 2. Consider the sequence of alternative partial hypotheses
$$H_{A,n}^P: \mathcal{G} \in \mathbb{G}_{A,n}', \quad \mathbb{G}_{A,n}' = \left\{\mathcal{G} \in \mathbb{G}^*: E[d_W^2(f_\oplus(X), f_{0,\oplus}(Y))] \geq a_n\right\}.$$
Then the worst case power converges strongly and uniformly to 1; that is, for any $\epsilon > 0$,
$$\inf_{\mathcal{G} \in \mathbb{G}_{A,n}'} P\left(\inf_{m \geq n} \beta_m^P \geq 1 - \epsilon\right) \to 1.$$

3.3. Alternative Testing Approximations.
As an alternative to estimating the eigenvalues in the limiting distributions of $F_G^*$ and $F_P^*$, Satterthwaite's approximation (Satterthwaite, 1941; Shen and Faraway, 2004) can also be employed. Using the global test as an example, we approximate the null distribution of the test statistic by $a\chi_m^2$, where $a, m > 0$ are chosen to satisfy $am = p\sum_{j=1}^\infty \lambda_j$ and $a^2 m = p\sum_{j=1}^\infty \lambda_j^2$. Using this approximation, one does not need to estimate the individual eigenvalues $\lambda_j$, given the equalities (Hsing and Eubank, 2015)
$$\sum_{j=1}^\infty \lambda_j = \int_0^1 C_Q(t, t)\,dt, \qquad \sum_{j=1}^\infty \lambda_j^2 = \int_0^1\!\int_0^1 C_Q(s, t)^2\,ds\,dt.$$
Hence, one can compute the approximate values $a$ and $m$ using the corresponding norms of the estimate $\hat{C}_Q$. This alternative approach is outlined in Algorithm 4 of the Appendix.

Finally, as the limiting distributions in Theorems 1 and 3 are the result of underlying central limit theorems, one may employ a bootstrap approach to testing these hypotheses. Since the inference is conditional on the observed predictors $X_i$, a natural approach is to perform a residual transport bootstrap. Using the global test as an example, let $\hat{T}_{i,0} = Q_i \circ \hat{F}_\oplus^*$ be the approximate versions of the residual transports $T_i$ under $H_G$. Obtain $B$ independent bootstrap samples $\{\tilde{T}_i^b\}_{i=1}^n$, $b = 1, \ldots, B$, by sampling with replacement from the $\hat{T}_{i,0}$, and form the bootstrapped quantile functions $\tilde{Q}_i^b = \tilde{T}_i^b \circ \hat{Q}_\oplus^*$. Then, compute bootstrap estimates $\hat{f}_\oplus^b(x)$ and $\hat{f}_\oplus^{*,b}$ using the data $(X_i, \tilde{Q}_i^b)$, $i = 1, \ldots, n$. Finally, compute the bootstrap statistics $\tilde{F}_G^{*,b} = \sum_{i=1}^n d_W^2(\hat{f}_\oplus^b(X_i), \hat{f}_\oplus^{*,b})$. The bootstrap $p$-value then becomes
$$\frac{\sum_{b=1}^B \mathbf{1}\left\{\tilde{F}_G^{*,b} > F_G^*\right\} + 1}{B + 1}.$$
This global residual bootstrap test is outlined in Algorithm 5 of the Appendix. For the partial test, this residual bootstrap can only be employed if the support of the densities is fixed, since otherwise the null residual transports $\hat{T}_{i,0} = Q_i \circ \hat{F}_{0,\oplus}(Y_i)$ will have different supports.
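The Satterthwaite calibration requires only the two functionals of $\hat{C}_Q$ displayed above, so no eigendecomposition is needed; the following minimal sketch (a discretized stand-in for Algorithm 4, with Riemann approximations as an implementation choice) returns the approximate $p$-value for the global test.

    import numpy as np
    from scipy.stats import chi2

    def satterthwaite_pvalue(stat, Qmat, Qhat_mat, p):
        """Approximate the null law of the global statistic by a*chi2(m),
        matching am = p*sum(lambda_j) and a^2*m = p*sum(lambda_j^2), where
        sum(lambda_j) = int C_Q(t,t) dt and sum(lambda_j^2) is the squared
        L^2 norm of C_Q, both computed from the discretized (3.3)."""
        n, _ = Qmat.shape
        R = Qmat - Qhat_mat
        C = R.T @ R / n                # discretized C_Q-hat on the grid
        tr1 = np.mean(np.diag(C))      # approximates int C_Q(t, t) dt
        tr2 = np.mean(C ** 2)          # approximates int int C_Q(s, t)^2 ds dt
        a, m = tr2 / tr1, p * tr1 ** 2 / tr2
        return chi2.sf(stat / a, df=m)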
4. Confidence Bands.
We now develop methodology for producing a confidence set for $f_\oplus(x)$, where $x$ is considered to be fixed. In similar settings where one desires to make a confidence statement for a functional parameter, such as nonparametric regression (Eubank and Speckman, 1993; Claeskens and Van Keilegom, 2003), mean and covariance estimation in functional data analysis (Degras, 2011; Wang and Yang, 2009; Cao, Yang and Todem, 2012), and the varying coefficient model (Fan and Zhang, 2008), one can build either pointwise or simultaneous bands. In the current setting of Wasserstein regression for density response data, the constraints inherent to the density targets $f_\oplus(x)$ render pointwise confidence bands of little use. Hence, the approach we take will be motivated by simultaneous confidence bands. Here, the descriptor "simultaneous" refers to the argument $u$ of the functional parameter $f_\oplus(x, u)$, and not to the specific regressor value $x$ under consideration. Let $g$ be a generic functional parameter of interest, where we assume that $g$ is bounded. Given an estimator $\hat{g}$, a typical approach to formulating a simultaneous confidence band is to show that $b_n(\hat{g}(u) - g(u))/a(u)$ converges weakly to a limiting process (usually Gaussian) in the space of bounded functions under the uniform metric (Van der Vaart and Wellner, 1996), where $a > 0$ is a scale function and $b_n^{-1}$ is the rate of convergence. By an application of the continuous mapping theorem, one can then obtain a confidence band for $g$ of the form
$$\left\{g^*: \hat{g}(u) - c\,a(u) b_n^{-1} \leq g^*(u) \leq \hat{g}(u) + c\,a(u) b_n^{-1} \text{ for all } u\right\}.$$
This band corresponds to all functions that are almost everywhere between the lower and upper bounds, and is the closest one can get to a confidence interval in function space. This partial ordering is induced explicitly by the uniform metric, and indicates why simultaneous confidence bands are so useful for functional parameters, in that one can visualize the entire set graphically. We will explore two different approaches to formulating simultaneous confidence bands. The first arises naturally from the Wasserstein geometry, and provides a distributional band for either $Q_\oplus(x)$ or $F_\oplus(x)$, but not the density parameter. In the second approach, we strengthen the convergence results and utilize the delta method to construct a simultaneous confidence band for $f_\oplus(x)$.
4.1. Intrinsic Wasserstein-$\infty$ Bands.
The first method is directly related to the geometry imposed by the Wasserstein-$\infty$ metric; see (2.2). Specifically, if $\hat{V}_x = \hat{Q}_\oplus(x) \circ F_\oplus(x)$ is the optimal transport from the target to the estimate, then
$$d_\infty(f_\oplus(x), \hat{f}_\oplus(x)) = f_\oplus(x)\text{-}\operatorname{ess\,sup}_u |\hat{V}_x(u) - u| = \sup_{u \in \mathcal{I}_x} |\hat{V}_x(u) - u|.$$
It is then natural to establish weak convergence (denoted by $\rightsquigarrow$) of the process $\hat{V}_x(u)$ within the space of bounded functions on $\mathcal{I}_x$, denoted $L^\infty(\mathcal{I}_x)$. Define $\Lambda = E(\tilde{X}\tilde{X}^\top)$ and the covariance kernel
(4.1) $$\tilde{K}(u, v) = \Lambda^{-1} E\left[\tilde{X}\tilde{X}^\top C_T(S(u), S(v))\right] \Lambda^{-1}, \quad u, v \in \bar{\mathcal{I}}^*.$$

Theorem 5. Suppose that (T1)–(T4) hold. Then there exists a zero-mean Gaussian process $\mathcal{M}_x$ on $L^\infty(\mathcal{I}_x)$ such that
$$\sqrt{n}(\hat{V}_x - \mathrm{id}) \mid X_1, \ldots, X_n \rightsquigarrow \mathcal{M}_x \quad \text{almost surely}.$$
With $u_x = Q_\oplus^* \circ F_\oplus(x, u)$ for any $u \in \mathcal{I}_x$, the covariance of $\mathcal{M}_x$ is
(4.2) $$C_x(u, v) = \tilde{x}^\top \tilde{K}(u_x, v_x)\, \tilde{x}.$$

As in the testing procedures, estimation of the covariance kernel $\tilde{K}$ is simplified by moving to quantile functions. Set $D_Q(s, t) = \tilde{K}(Q_\oplus^*(s), Q_\oplus^*(t))$, with estimate
(4.3) $$\hat{D}_Q(s, t) = \hat{\Lambda}^{-1} \left\{\frac{1}{n}\sum_{i=1}^n \tilde{X}_i \tilde{X}_i^\top (Q_i(s) - \hat{Q}_i(s))(Q_i(t) - \hat{Q}_i(t))\right\} \hat{\Lambda}^{-1},$$
where $\hat{\Lambda} = n^{-1}\sum_{i=1}^n \tilde{X}_i \tilde{X}_i^\top$. This leads to
(4.4) $$\hat{C}_x(u, v) = \tilde{x}^\top \hat{D}_Q(\hat{F}_\oplus(x, u), \hat{F}_\oplus(x, v))\, \tilde{x}, \quad u, v \in [\hat{Q}_\oplus(x, 0), \hat{Q}_\oplus(x, 1)].$$
Let $m_\alpha$ be the $1 - \alpha$ quantile of the distribution of $\zeta_x := \sup_{u \in \mathcal{I}_x} C_x(u, u)^{-1/2} |\mathcal{M}_x(u)|$. Since $m_\alpha$ is unknown, we estimate it as follows. Observe that $N_x = \mathcal{M}_x \circ Q_\oplus(x)$ is a Gaussian process on $L^\infty[0,
$1]$ with covariance $\tilde{x}^\top D_Q(s, t)\, \tilde{x}$. Conditional on the data, let $\hat{N}_x$ be a zero-mean Gaussian process with covariance $\tilde{x}^\top \hat{D}_Q(s, t)\, \tilde{x}$. Define
$$\hat{\zeta}_x = \sup_{t \in [0, 1]} \left[\tilde{x}^\top \hat{D}_Q(t, t)\, \tilde{x}\right]^{-1/2} \left|\hat{N}_x(t)\right|,$$
and set $\hat{m}_\alpha$ as the $1 - \alpha$ quantile of $\hat{\zeta}_x$. Then the Wasserstein-$\infty$ confidence band for $f_\oplus(x)$ is
(4.5) $$C_{\alpha,n}(x) = \left\{g \in \mathcal{D}: \sup_{u \in \hat{\mathcal{I}}_x} \frac{|T_{g,x}(u) - u|}{\hat{C}_x(u, u)^{1/2}} \leq \frac{\hat{m}_\alpha}{\sqrt{n}}\right\}, \quad T_{g,x} = G^{-1} \circ \hat{F}_\oplus(x).$$
We have the following corollary of Theorem 5.

Corollary 3. Suppose (T1)–(T4) hold. If $\inf_{u \in \mathcal{I}^*} C_T(u, u) > 0$, then $P_X(f_\oplus(x) \in C_{\alpha,n}(x)) \to 1 - \alpha$ almost surely.

An important case arising in practice that is ruled out by the requirement of strictly positive covariance is when the support of $F$ is some fixed interval $\mathcal{I}$, so that the random transport $T$ is necessarily fixed at the boundaries. In this case, a slight adjustment can be made, as outlined in Corollary 5 of the Appendix.

Next, we demonstrate a connection between these simultaneous Wasserstein-$\infty$ bands and the usual partial stochastic ordering of distributions. Recall that a cdf $F_1$ is said to be stochastically greater than another $F_2$, written $F_1 \succ F_2$, if $F_1(u) \leq F_2(u)$ for all $u$. For $F_U \succ F_L$, the bracket $[F_L, F_U] = \{g \in \mathcal{D}: F_U \succ G \succ F_L\}$ consists of all distributions in $\mathcal{D}$ which lie between $F_L$ and $F_U$ in the stochastic ordering. From (4.5), we deduce that the simultaneous confidence band consists of all densities $g \in \mathcal{D}$ such that
$$M_L(u) \leq G^{-1} \circ \hat{F}_\oplus(x, u) \leq M_U(u), \quad M_L(u) = u - n^{-1/2}\hat{m}_\alpha \hat{C}_x(u, u)^{1/2}, \quad M_U(u) = u + n^{-1/2}\hat{m}_\alpha \hat{C}_x(u, u)^{1/2}.$$
Define $T_L$ to be the unique projection (in $L^2[0,1]$) of $M_L$ onto the closed and convex set of nondecreasing functions $M$ for which $M(u) \geq M_L(u)$; this is the smallest nondecreasing function above $M_L$. Similarly, $T_U$ is the unique largest nondecreasing function below $M_U$. Note that $T_L(u) \leq T_{g,x}(u) \leq T_U(u)$ for all $g \in C_{\alpha,n}(x)$. Then the cdf bounds $F_L = \hat{F}_\oplus(x) \circ T_L^{-1}$ and $F_U = \hat{F}_\oplus(x) \circ T_U^{-1}$ represent the Wasserstein-$\infty$ band, i.e., $C_{\alpha,n}(x) = [F_L, F_U]$. Algorithm 6 in the Appendix outlines the steps for computing $C_{\alpha,n}(x)$ in practice.
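The envelope construction just described is straightforward on a grid: the smallest nondecreasing function above $M_L$ is a running maximum, the largest nondecreasing function below $M_U$ is a reversed running minimum, and the cdf bounds follow by inverting the envelopes. The Python sketch below is in the spirit of Algorithm 6 (given in the Appendix); it assumes all inputs are precomputed on a grid and that the envelopes are invertible by interpolation.

    import numpy as np

    def winf_band_cdf_bounds(u, Fhat_x, Chat_diag, m_alpha, n):
        """Stochastic-ordering bracket [F_L, F_U] of the Wasserstein-infinity
        band on an increasing grid u spanning the estimated support. Fhat_x
        holds the fitted cdf values, Chat_diag the variance estimates
        C_x-hat(u, u), and m_alpha the simulated critical value."""
        half = m_alpha * np.sqrt(Chat_diag) / np.sqrt(n)
        M_L, M_U = u - half, u + half
        T_L = np.maximum.accumulate(M_L)              # smallest monotone majorant
        T_U = np.minimum.accumulate(M_U[::-1])[::-1]  # largest monotone minorant
        # F_L = Fhat o T_L^{-1}, F_U = Fhat o T_U^{-1}, inverted by linear
        # interpolation (assumes the envelopes are increasing on the grid)
        F_L = np.interp(u, T_L, Fhat_x)
        F_U = np.interp(u, T_U, Fhat_x)
        return F_L, F_U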
4.2. Wasserstein Density Bands. One drawback of the above simultaneous confidence bands is that they do not readily translate to the space of densities, since, if $F_1 \prec F_2$ in the stochastic ordering, their derivatives need not satisfy $f_1 \geq f_2$. Thus, we take a second approach to form a confidence band in density space, based on the direct difference $\hat{f}_\oplus(x) - f_\oplus(x)$ rather than the optimal transport map between these distributions.

An interesting challenge associated with this approach is that the supports of the target $f_\oplus(x)$ and its estimate $\hat{f}_\oplus(x)$ may differ, so that convergence may be ill-behaved near the boundaries. To resolve this, for any $\delta \in (0, 1/2)$, define $\mathcal{I}_x^\delta = [Q_\oplus(x, \delta), Q_\oplus(x, 1 - \delta)]$. We consider conditional weak convergence of the process $\sqrt{n}(\hat{f}_\oplus(x) - f_\oplus(x))$ in the space $L^\infty(\mathcal{I}_x^\delta)$. For $l, m \in \{0, 1\}$ and $\tilde{K}$ as in (4.1), define $\tilde{K}^{(l,m)} = \frac{\partial^{l+m}}{\partial u^l \partial v^m} \tilde{K}$. Lastly, set
$$\mathbf{K}(u, v) = \begin{pmatrix} \tilde{K}(u, v) & \left[f_\oplus^*(v)\right]^{-1} \tilde{K}^{(0,1)}(u, v) \\ \left[f_\oplus^*(u)\right]^{-1} \tilde{K}^{(1,0)}(u, v) & \left[f_\oplus^*(u) f_\oplus^*(v)\right]^{-1} \tilde{K}^{(1,1)}(u, v) \end{pmatrix}, \quad u, v \in \mathcal{I}^*.$$

Theorem 6. Suppose (T1)–(T4) hold, and that $f_\oplus^*$ is continuously differentiable. Then, for almost all $x$, there exists a zero-mean Gaussian process $\mathcal{F}_x$ on $L^\infty(\mathcal{I}_x^\delta)$ such that
$$\sqrt{n}\left(\hat{f}_\oplus(x) - f_\oplus(x)\right) \mid X_1, \ldots, X_n \rightsquigarrow \mathcal{F}_x \quad \text{almost surely}.$$
With $u_x = Q_\oplus^* \circ F_\oplus(x, u)$ and $c_u^\top = \left(\frac{\partial}{\partial u} f_\oplus(x, u), -f_\oplus(x, u)\right)$ for any $u \in \mathcal{I}_x^\delta$, and $\otimes$ denoting the Kronecker product, the covariance of $\mathcal{F}_x$ is
(4.6) $$R_x(u, v) = c_u^\top \left(\tilde{x}^\top \otimes \mathbf{K}(u_x, v_x) \otimes \tilde{x}\right) c_v.$$

We now describe the method for estimating $R_x$. With $\hat{D}_Q$ as in (4.3), let $\hat{D}_Q^{(l,m)} = \frac{\partial^{l+m}}{\partial s^l \partial t^m} \hat{D}_Q$. Then, define $\mathbf{D}(s, t) = \mathbf{K}(Q_\oplus^*(s), Q_\oplus^*(t))$ and its estimate
(4.7) $$\hat{\mathbf{D}}(s, t) = \begin{pmatrix} \hat{D}_Q(s, t) & \hat{D}_Q^{(0,1)}(s, t) \\ \hat{D}_Q^{(1,0)}(s, t) & \hat{D}_Q^{(1,1)}(s, t) \end{pmatrix}, \quad s, t \in [\delta, 1 - \delta],$$
and set
$$\hat{\mathbf{K}}(u, v) = \hat{\mathbf{D}}(\hat{F}_\oplus(x, u), \hat{F}_\oplus(x, v)), \quad u, v \in \hat{\mathcal{I}}_x^\delta = [\hat{Q}_\oplus(x, \delta), \hat{Q}_\oplus(x, 1 - \delta)].$$
Next, the chain rule implies that $\frac{\partial}{\partial u} f_\oplus(x, u) = -f_\oplus(x, u)^3 \frac{\partial}{\partial t} q_\oplus(x, F_\oplus(x, u))$. Hence, define $\frac{\partial}{\partial t} \hat{q}_\oplus(x, t) = n^{-1}\sum_{i=1}^n s_{in}(x) q_i'(t)$ for $t \in [\delta, 1 - \delta]$, yielding
$$\frac{\partial}{\partial u} \hat{f}_\oplus(x, u) = -\hat{f}_\oplus(x, u)^3 \frac{\partial}{\partial t} \hat{q}_\oplus(x, \hat{F}_\oplus(x, u)), \quad u \in \hat{\mathcal{I}}_x^\delta.$$
Finally, for $u, v \in \hat{\mathcal{I}}_x^\delta$, set
(4.8) $$\hat{R}_x(u, v) = \hat{c}_u^\top \left(\tilde{x}^\top \otimes \hat{\mathbf{K}}(u, v) \otimes \tilde{x}\right) \hat{c}_v, \quad \hat{c}_u^\top = \left(\frac{\partial}{\partial u} \hat{f}_\oplus(x, u), -\hat{f}_\oplus(x, u)\right).$$
Let $l_\alpha$ be the unknown $1 - \alpha$ quantile of $\xi_x = \sup_{u \in \mathcal{I}_x^\delta} R_x(u, u)^{-1/2} |\mathcal{F}_x(u)|$. Similar to the case of the Wasserstein-$\infty$ band, conditional on the data, let $\hat{J}_x(t)$ be a zero-mean Gaussian process on $(0, 1)$ with covariance $\hat{R}_x(\hat{Q}_\oplus(x, s), \hat{Q}_\oplus(x, t))$. Define $\hat{l}_\alpha$ as the $1 - \alpha$ quantile of
$$\hat{\xi}_x = \sup_{t \in [\delta, 1 - \delta]} \left[\hat{R}_x(\hat{Q}_\oplus(x, t), \hat{Q}_\oplus(x, t))\right]^{-1/2} \left|\hat{J}_x(t)\right|.$$
Then the nearly simultaneous Wasserstein density confidence band is
(4.9) $$B_{\alpha,n}(x) = \left\{g \in \mathcal{D}: \sup_{u \in \hat{\mathcal{I}}_x^\delta} \frac{|g(u) - \hat{f}_\oplus(x, u)|}{\hat{R}_x(u, u)^{1/2}} \leq \frac{\hat{l}_\alpha}{\sqrt{n}}\right\}.$$
See Algorithm 7 in the Appendix for an implementation of this confidence band.

The final corollary demonstrates almost sure convergence of the coverage rate of the Wasserstein density confidence band.
Since the estimation of the covariance $R_x$ is considerably more complex than for the Wasserstein-$\infty$ band, additional smoothness assumptions are required on the random transport map $T$.

Corollary 4. Suppose the assumptions of Theorem 6 hold, and that $\inf_{u \in \mathcal{I}_x^\delta} R_x(u, u) > 0$. Furthermore, assume that $T$ is twice differentiable almost surely, with $T''$ having Lipschitz constant $\tilde{L}$ and satisfying
$$E\left(\|\tilde{X}\|_E^2 \sup_{u \in \mathbb{R}} |T''(u)|^2\right) < \infty, \qquad E(\tilde{L}^2) < \infty.$$
Then $P_X(f_\oplus(x) \in B_{\alpha,n}(x)) \to 1 - \alpha$ almost surely.
5. Simulations.
Extensive simulations were conducted to investigate the empirical performance of the $F$-tests and confidence bands developed in Sections 3 and 4. Data were generated according to model (2.6) for increasing sample sizes and different random optimal transport processes $T$. Furthermore, to assess the robustness of our procedures when densities are only indirectly observed through samples, simulations were also conducted using only raw data generated from the random densities. For brevity, results for the indirectly observed case are reported in the Appendix.
We first describe the simulation settings for the Wasserstein $F$-tests. Following Example 1, set $f_0$ as the standard normal density, truncated to a compact interval symmetric about zero and renormalized to have integral one. Bivariate predictors $X_i = (X_{i1}, X_{i2})$ were generated as independent uniform random variables on $[-0.5, 0.5]$. Let $\nu(x) = \alpha_1 x_1 + \alpha_2 x_2$ and $\tau(x) = 2 + \beta_1 x_1 + \beta_2 x_2$, where the parameters $\alpha_j, \beta_j$ were varied to increase signal strength for assessing power performance. The conditional Wasserstein mean density was
(5.1) $$f_\oplus(x, u) = \frac{1}{\tau(x)}\, f_0\left(\frac{u - \nu(x)}{\tau(x)}\right).$$
To produce the random densities $F_i$, random optimal transport maps $T_i$ were generated in two different ways. In the first setting, $T_i(u) = V_{i1} + V_{i2} u$, where $V_{i1} \sim U(-0.5, 0.5)$ and $V_{i2} \sim U(0.5, 1.5)$ were generated independently, yielding linear optimal transport maps. In the second setting, following the method described in Section 8.1 of Panaretos and Zemel (2016), nonlinear optimal transport maps were generated. Specifically, define template transport maps $M_0(u) = u$ and $M_k(u) = u - \sin(ku)/|k|$, $k \neq 0$. For $J = 10$, $(W_{i1}, \ldots, W_{iJ})$ were sampled from an order-$J$ Dirichlet distribution, so that $W_{ij} > 0$ and $\sum_{j=1}^J W_{ij} = 1$. Then $K_{ij}$, $j = 1, \ldots, J$, were sampled independently and uniformly from a fixed set of five values symmetric about zero and including zero, and the random transport map was $T_i(u) = \sum_{j=1}^J W_{ij} M_{K_{ij}}(u)$. Finally, the random density was generated as
$$F_i(u) = f_\oplus\left(X_i, T_i^{-1}(u)\right) (T_i^{-1})'(u).$$
In the case when the densities were not directly observed, secondary samples of size 300 from each $F_i$ were generated, and local linear smoothing of the empirical quantile function was used to estimate the quantile functions $\hat{Q}_i$ for use in the testing algorithms. The Matlab function 'smooth' was used with the default choice of smoothing parameter.

In both the global and partial $F$-test simulations, the empirical size approached the nominal size $\alpha = 0.05$ as the sample size grew, and the empirical power function increased with increasing signal strength.
To create models satisfying global null and alternative hypotheses, $\alpha_1 = \alpha_2 = \beta_1 = \beta_2$ were set to be equal, with the common value running through $[0, 0.5]$. For the partial $F$-test, $\alpha_1$ and $\beta_1$ were fixed as $\alpha_1 = 2$ and $\beta_1 = 1$, respectively, while $\alpha_2 = \beta_2$ varied across $[0, 0.5]$. The null hypothesis for the partial $F$-test was $H_P: f_\oplus(x) = f_{0,\oplus}(x_1)$, i.e., no effect of the second predictor.
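A sketch of the nonlinear transport generation is given below. The Dirichlet parameters and the grid of $k$ values are illustrative placeholders (the exact grid used in the simulations is not reproduced here); note that each template map satisfies $M_k'(u) = 1 - \operatorname{sign}(k)\cos(ku) \geq 0$, so the convex combination is a valid nondecreasing transport map.

    import numpy as np

    rng = np.random.default_rng(2024)

    def template_map(u, k):
        """Template transport maps: M_0(u) = u, M_k(u) = u - sin(ku)/|k|."""
        return u if k == 0 else u - np.sin(k * u) / abs(k)

    def random_transport(u, J=10, k_grid=(-2.0, -1.0, 0.0, 1.0, 2.0)):
        """One random transport T(u) = sum_j W_j M_{K_j}(u), with Dirichlet
        weights W and K_j drawn uniformly from the (placeholder) k_grid."""
        W = rng.dirichlet(np.ones(J))
        K = rng.choice(k_grid, size=J)
        return sum(w * template_map(u, k) for w, k in zip(W, K))

    u = np.linspace(-3, 3, 201)
    Tu = random_transport(u)   # one realization on a grid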
Fig 2: Power curves for global (top) and partial (bottom) Wasserstein $F$-tests, for sample sizes $n = 100, 200, 500$. Curves are shown for the linear and nonlinear transport settings and for each calibration method ($\chi^2$ mixture, Satterthwaite, and, for the global test, bootstrap). The dotted horizontal line represents the nominal level $\alpha = 0.05$.

For each of the three sample sizes $n = 100, 200, 500$, the tests were calibrated using the $\chi^2$ mixture approximation as well as the alternative Satterthwaite and bootstrap procedures outlined in Section 3.3. The empirical sizes converged to the nominal level, and power converged to one with increasing values of the common parameter value. Power was generally higher for the linear transport map setting, as expected. The same pattern is observed for the partial test, although the Satterthwaite alternative more accurately maintained the nominal level for low sample sizes. The bootstrap test could not be implemented for the partial test due to the fact that the support of the density varies with $x$. Similar results for indirectly observed densities are given in Figure 4 in the Appendix. For both global and partial tests, the difference in performance is more prominent at low sample sizes (in $n$, the number of densities), with the power converging at a slower rate and slightly larger deviations from the nominal level compared to the case of fully observed densities.
Table 1
Confidence band error rates for fully observed densities over 500 simulation runs

Wasserstein-∞ band
            linear transport map        nonlinear transport map
            n = 100  n = 200  n = 500   n = 100  n = 200  n = 500
x = -0.30    0.050    0.050    0.046     0.044    0.074    0.048
x = -0.24    0.056    0.040    0.058     0.044    0.068    0.044
x = -0.18    0.058    0.042    0.058     0.052    0.064    0.046
x = -0.12    0.056    0.036    0.062     0.058    0.064    0.054
x = -0.06    0.056    0.040    0.048     0.058    0.070    0.058
x =  0       0.056    0.042    0.042     0.052    0.074    0.056
x =  0.06    0.062    0.052    0.036     0.052    0.072    0.056
x =  0.12    0.060    0.050    0.036     0.054    0.072    0.058
x =  0.18    0.062    0.052    0.040     0.054    0.066    0.052
x =  0.24    0.068    0.052    0.042     0.058    0.064    0.058
x =  0.30    0.076    0.044    0.050     0.058    0.062    0.056

Wasserstein density band*
            linear transport map        nonlinear transport map
            n = 100  n = 200  n = 500   n = 100  n = 200  n = 500
x = -0.30    0.062    0.056    0.054     0.066    0.074    0.064
x = -0.24    0.056    0.056    0.052     0.062    0.072    0.058
x = -0.18    0.052    0.052    0.054     0.054    0.058    0.050
x = -0.12    0.034    0.042    0.052     0.048    0.052    0.046
x = -0.06    0.038    0.038    0.038     0.046    0.050    0.044
x =  0       0.030    0.026    0.040     0.058    0.050    0.052
x =  0.06    0.028    0.030    0.036     0.066    0.054    0.058
x =  0.12    0.048    0.046    0.032     0.068    0.046    0.062
x =  0.18    0.052    0.048    0.034     0.074    0.060    0.062
x =  0.24    0.062    0.046    0.036     0.074    0.066    0.062
x =  0.30    0.064    0.048    0.036     0.086    0.076    0.064

* computed over the trimmed interval $\hat{\mathcal{I}}_x^\delta$ of Section 4.2

For simplicity, in the simulations for confidence bands, a single predictor $X_i \sim U(-0.5, 0.5)$ was used.
With $\nu(x) = 2x$ and $\tau(x) = 2 + x$, the mean was again given by (5.1), where $f_0$ was the same as in the testing simulations. The random optimal transports and the case of indirectly observed densities were handled in the same way as in the testing simulations. Both types of confidence bands were computed for predictor values $x \in \{-0.30, -0.24, \ldots, 0.24, 0.30\}$. The density bands were computed using a trimming parameter $\delta$, and coverage of the Wasserstein-$\infty$ bands was also measured over the same trimmed interval, since the quantile functions are very steep near the boundary. The error rates in Table 1 were computed using 500 runs for each setting. Overall, error rates improved as sample size increased, with error rates near $\alpha = 0.05$
at all $x$ values when $n = 500$.

Table 3 in the Appendix contains the corresponding results for indirectly observed densities. Note that Algorithm 7 for computing the Wasserstein density band also requires estimation of $q_i$ and $q_i'$ in addition to the quantile function $Q_i$. In our experiments, $\hat{q}_i$ and $\hat{q}_i'$ were computed by numerical differentiation of $\hat{Q}_i$. Clearly there are alternative ways in which one might estimate these functions, but numerical differentiation was chosen because it represents a worst-case scenario in order to reveal sensitivities of the density bands to errors induced by this preprocessing step. Convergence to the nominal coverage rate was slower for indirectly observed densities. In particular, the Wasserstein-$\infty$ band suffered from the aforementioned boundary issues associated with local linear estimation of quantile functions that are steep near the boundary. However, with $n = 500$, all but one of the error rates were below 0.1, compared to the nominal rate $\alpha = 0.05$.
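A minimal sketch of this preprocessing step is shown below, using empirical quantiles and numerical differentiation in place of the local linear smoother used in the reported simulations; function names are hypothetical.

    import numpy as np

    def quantile_preprocessing(sample, t_grid):
        """Estimate Q_i on t_grid from a raw sample, then q_i = Q_i' and q_i'
        by numerical differentiation (a worst-case alternative to smoothing)."""
        Q = np.quantile(sample, t_grid)
        q = np.gradient(Q, t_grid)
        dq = np.gradient(q, t_grid)
        return Q, q, dq

    t_grid = np.linspace(0.01, 0.99, 99)
    sample = np.random.default_rng(3).normal(size=300)
    Q, q, dq = quantile_preprocessing(sample, t_grid)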
6. Application to Stroke Data.
Intracerebral hemorrhage (ICH), caused by small blood vessel ruptures inside the brain, is the second most common stroke subtype (Morgenstern et al., 2010). Computed tomography (CT) is the most utilized imaging modality to diagnose and study ICH in clinical settings. Various studies of ICH have revealed the importance of the density of the hematoma, in addition to important factors such as the size and location of the hematoma inside the brain parenchyma (Barras et al., 2009; Delcourt et al., 2016; Salazar et al., 2019). The density of the hematoma is not homogeneous, and common practice is to summarize this important feature by some set of subjective scalar measures that are often obtained by visual inspection and are thus user-dependent. For example, Boulouis et al. (2016) demonstrated the importance of CT hypodensity, a binary variable indicating the presence of low-density regions within the hematoma. Instead of using any particular summary, one can study the distribution of hematoma density (i.e., a probability density function of hematoma density) throughout the entire hematoma as a functional object. Specifically, the hematoma density is measured on the Hounsfield scale (0–100 HU) for each voxel within the hematoma, and these measurements can be used to produce a probability density function in a variety of ways. In this application, the hematoma densities were first combined into a histogram, followed by smoothing to obtain a probability density function; see Fig. 1. Other methods, such as local linear smoothing to obtain quantile function estimates as illustrated in the simulations, could also be used.

To illustrate the utility of the proposed inferential methods for Wasserstein regression, we considered a study of $n = 393$ anonymized ICH subjects and analyzed the associations between 5 clinical and 4 radiological variables as predictors and the head CT hematoma densities as distributional responses. The clinical predictors were age, weight, history of diabetes (DM), and two variables indicating history of coagulopathy (Warfarin and AntiPt). Radiological predictors included the logarithm of hematoma volume, a continuous index of hematoma shape (Shape), presence of a shift in the midline of the brain, and the length of the interval between the stroke event and the CT scan (TimetoCT). These predictors were selected based on the natural history and published clinical studies (Salman, Labovitz and Stapf, 2009; Al-Mufti et al., 2018). A complete description of the data source can be found in Hevesi et al. (2018). As seen in Figure 1, important features that vary from subject to subject are the location and mode of the hematoma density distribution, as well as the spread, where some hematoma densities are homogeneous and concentrated around a single mode, while others are more heterogeneous or even bimodal. Thus, while the mean is an important aspect of the hematoma density, this natural summary fails to capture other forms of variability that are automatically taken into account by the Wasserstein regression model.

To assess the goodness of fit, the Wasserstein coefficient of determination (Petersen and Müller, 2019) was computed as
$$\hat{R}_\oplus^2 = 1 - \frac{\sum_{i=1}^n d_W^2(F_i, \hat{F}_i)}{\sum_{i=1}^n d_W^2(F_i, \hat{f}_\oplus^*)},$$
representing the fraction of Wasserstein variability explained by the model.
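On a common quantile grid, $\hat{R}_\oplus^2$ can be computed from discretized Wasserstein distances as in this short sketch (array names are assumptions):

    import numpy as np

    def wasserstein_r2(Qmat, Qhat_mat, Qstar):
        """Wasserstein coefficient of determination: one minus the ratio of
        residual to total Wasserstein variability, with observed (Qmat),
        fitted (Qhat_mat), and marginal mean (Qstar) quantile functions all
        evaluated on a common grid in (0, 1)."""
        resid = np.sum(np.mean((Qmat - Qhat_mat) ** 2, axis=1))
        total = np.sum(np.mean((Qmat - Qstar) ** 2, axis=1))
        return 1 - resid / total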
As a comparison, a standard linear functional response model (Faraway, 1997) was also fit, after transforming the densities into a linear function space using the log quantile density (LQD) transformation of Petersen and Müller (2016). Because the hematoma densities fail to have common supports, as required by the LQD transformation, the densities were regularized to be strictly positive on the common support. This was done by appropriately mixing each density with the uniform distribution on that support. This alternative LQD model yielded fitted densities by means of the inverse LQD transformation. The LQD method suffered from quite poor recovery of the quantile functions near the boundaries $t \in \{0, 1\}$, due to the fact that the observed densities decay to zero at the boundary of their supports. The Wasserstein distance in (2.1) was sensitive to these errors, and resulted in a much lower Wasserstein coefficient of determination, so that the Wasserstein regression model provided a substantial improvement on the LQD approach.
Table 2
Wasserstein regression and LQD transformation method p-values

Category       Predictor     Wasserstein   LQD
Clinical       Age           0.862         0.761
               Weight        <0.001        0.479
               DM            0.034         0.742
               Warfarin      0.298         0.640
               AntiPt        0.078         0.900
Radiological   log(Volume)   <0.001        <0.001
               Shape         <0.001        <0.001
               Shift         <0.001        0.039
               TimetoCT      0.902         0.657

The $p$-value for the global Wasserstein $F$-test was approximately zero using the mixture $\chi^2$ test, giving strong evidence of a significant regression relationship between the hematoma densities and the candidate predictors. The alternative testing procedures of Section 3.3 gave similar results. The significance of the effects of individual predictors is reported in Table 2, where these $p$-values were computed via partial Wasserstein $F$-tests. As a baseline for comparison, the $F$-tests of Shen and Faraway (2004) were performed on the LQD functional linear model. The $p$-value of the global $F$-test on the LQD model was also approximately zero, with partial $p$-values given in the last column of Table 2. Both models identified hematoma size and shape, as well as presence of midline shift, as important factors in determining the distribution of density within the hematoma. However, the Wasserstein regression model identified two clinical variables, related to weight and diabetes, as also affecting the hematoma density distribution.

Finally, we demonstrate the use of the two types of Wasserstein confidence bands developed in Section 4. Figure 3a shows two fitted Wasserstein mean cdfs. The first corresponds to a hematoma volume equivalent to the first quartile (Q1) of the observed values, with all other predictors set at their mean (for continuous variables) or mode (for binary variables). The second is similar, but for the third quartile (Q3) of hematoma volume. Each fitted cdf is also accompanied by a Wasserstein-$\infty$ band, represented by the stochastic ordering bracket $[F_L, F_U]$. These bands may be interpreted as bounds on the "horizontal" sampling variability of the fitted distributions, i.e., the sampling variability of the fitted quantile functions. Figure 3b shows the corresponding fitted Wasserstein mean densities, along with density bands. These density bands reflect sampling variability at the density level and aid in inferring the relevance of local features seen in the fitted densities.

Fig 3: 95% confidence bands for the conditional Wasserstein mean cdf (left, Wasserstein-$\infty$ band) and density (right, Wasserstein density band) when hematoma volume is equal to the first (Q1) and third (Q3) quartile in the sample, with other variables set to their mean value for continuous variables and mode for binary variables.
Fig 3: (a) Wasserstein-$\infty$ band; (b) Wasserstein density band. 95% confidence bands for the conditional Wasserstein mean cdf (left) and density (right) when hematoma volume is equal to the first (Q1) and third (Q3) quartiles in the sample, with other variables set to their mean value for continuous variables and mode for binary variables.

As indicated by Figure 1, the magnitude of the horizontal variability is the smaller of the two, and this is reflected in the confidence bands. From a neurological point of view, the physiology dictates that larger hematomas tend to be more dense and homogeneous, since the pressure put on them by the surrounding tissue restricts growth and causes voids within the hematoma to be filled, confirming the observed differences between the fitted cdfs/densities in Figure 3.
7. Discussion.
We have studied the regression of density response curves on vector predictors under the Fréchet regression model and the Wasserstein geometry of optimal transport. The targets are the conditional Wasserstein mean densities, which can be estimated without the need for a tuning parameter, akin to a parametric model. By replacing the additive error term in ordinary linear regression with a random optimal transport map, intuitive test statistics are proposed for testing null hypotheses of no predictor effects and of partial predictor effects. In the spirit of regression, asymptotic distributions are derived conditional on the observed predictors. The covariance of the random transport map is a nuisance parameter that can be consistently estimated and thus used to form rejection regions that have correct asymptotic size and are uniformly consistent against classes of contiguous alternatives.
Confidence bands are also derived for the fitted Wasserstein mean densities in two forms. Due to the intimate connection between the Wasserstein metric and quantile functions for univariate distributions, the first type of confidence band is formed in terms of the fitted Wasserstein mean quantile function (equivalently, the Wasserstein mean cdf), and constitutes a bracket in the usual stochastic ordering of distributions on the real line. These intrinsic confidence bands are complemented by extrinsic bands in density space, allowing the user to simultaneously quantify the sampling variability of all quantiles of the distribution, as well as the uncertainty of local features seen in the conditional Wasserstein mean densities.

As for any regression model, it will be necessary in future work to develop diagnostic tools to assess the validity of the Wasserstein regression model for a given data set. It is likely that tools for functional regression models can be similarly adapted to the setting of density response curves (Chiou and Müller, 2007). Likewise, model selection or regularized estimation procedures are clearly desirable, especially in cases where the number of predictors is large (Barber et al., 2017).

While our theoretical developments have assumed the density response curves are known, in the majority of practical situations these will need to be estimated from a collection of univariate samples, each generated by one of the random densities. In the simulations, we have demonstrated how this can be done using local linear smoothing of empirical quantile functions, while smoothed histograms could also be used as in the application to head CT densities. Some previous theoretical work in the analysis of density samples has accounted for this preprocessing step, similar to the effects of pre-smoothing in functional data analysis when curves are measured sparsely in time and often contaminated with noise (Kneip and Utikal, 2001; Petersen and Müller, 2016; Panaretos and Zemel, 2016). An important direction for future research in this area will be to identify a division of regimes between dense and sparse samples for density functions, similar to Zhang and Wang (2016) for classical functional data, and its implications for inferential procedures such as those proposed in this paper.
Acknowledgements.
The authors would like to thank Mostafa Jafari and Pascal Salazar for providing the hematoma density curves for our analysis.
APPENDIX
The Appendix is organized as follows. Section A.1 gives algorithms for the various testing procedures and confidence band computations described in Sections 3 and 4. Section A.2 illustrates how the confidence bands developed in Section 4 can be adapted to the case of random densities $F$ with a fixed support $\mathcal{I}$. Section A.3 gives simulation results, corresponding to the settings of Section 5, for the case when the random densities are only observed through a random sample. Section A.4 gives proofs of Propositions 1 and 2. Sections A.5 and A.6 provide proofs of the testing and confidence band results in Sections 3 and 4, respectively. Finally, Section A.7 gives statements and proofs of auxiliary lemmas.
A.1. Computational Algorithms.
A.1.1. Estimation.
This section gives two algorithms, one for computing $\hat{Q}_\oplus(x)$ in (2.8), and the other for the density estimate $\hat{f}_\oplus(x)$ in (2.9). Let $\langle\cdot,\cdot\rangle_{L^2}$ and $\|\cdot\|_{L^2}$ denote the standard Hilbert inner product and norm on $L^2[0,1]$. For any $Q \in \mathcal{Q}$, since $n^{-1}\sum_{i=1}^n s_{in}(x) = 1$, observe that
\[
\frac{1}{n}\sum_{i=1}^n s_{in}(x)\,\|Q_i - Q\|_{L^2}^2 = \|Q\|_{L^2}^2 - 2\langle \tilde{Q}_\oplus(x), Q\rangle_{L^2} + \frac{1}{n}\sum_{i=1}^n s_{in}(x)\,\|Q_i\|_{L^2}^2,
\]
where $\tilde{Q}_\oplus(x) = n^{-1}\sum_{i=1}^n s_{in}(x)\,Q_i$. For any $g, h \in L^2[0,1]$, let $0 = t_1 < \cdots < t_m = 1$ be a grid of points, take $\tilde{g} = (g(t_1), \ldots, g(t_m))^\top$ and similarly $\tilde{h}$, and let $A \in \mathbb{R}^{m \times m}$ be the matrix representing the trapezoidal approximation $\int_0^1 g(t)h(t)\,\mathrm{d}t \approx \tilde{g}^\top A \tilde{h}$.

Algorithm 1: Estimating $\hat{Q}_\oplus(x)$
    input : Predictor vector $x \in \mathbb{R}^p$, quantile functions $Q_i$, and grid $0 = t_1 < \cdots < t_m = 1$
    output: Estimates $\hat{Q}_\oplus(x, t_l)$, $l = 1, \ldots, m$
    for $l \leftarrow 1$ to $m$ do
        $Q_l \leftarrow n^{-1}\sum_{i=1}^n s_{in}(x)\, Q_i(t_l)$;
    end
    if $Q_{l+1} \ge Q_l$ for all $l \in \{1, \ldots, m-1\}$ then
        $\hat{Q}_\oplus(x, t_l) \leftarrow Q_l$;
    else
        $b^* \leftarrow \operatorname{argmin}_{b \in \mathbb{R}^m}\; b^\top A b - 2Q^\top A b$, subject to $b_1 \le \cdots \le b_m$;
        $\hat{Q}_\oplus(x, t_l) \leftarrow b^*_l$;
    end

Next, in light of Section 4.2, it is necessary to obtain a density estimate $\hat{f}_\oplus(x)$. One possibility is to compute $\hat{Q}_\oplus(x)$ using Algorithm 1, followed by numerical inversion to obtain $\hat{F}_\oplus(x, u)$, and finally computing $\hat{f}_\oplus(x)$ by numerical differentiation. However, a more direct route is available. Suppose that $g, h \in L^2[0,1]$ are differentiable, and take $\tilde{g} = (g(0), g'(t_1), \ldots, g'(t_m))^\top$, and similarly $\tilde{h}$. Then let $B \in \mathbb{R}^{(m+1)\times(m+1)}$ be the matrix representing the trapezoidal approximation $\int_0^1 g(t)h(t)\,\mathrm{d}t \approx \tilde{g}^\top B \tilde{h}$.

Algorithm 2: Estimating $\hat{f}_\oplus(x)$ and $\hat{q}_\oplus(x)$
    input : Predictor vector $x \in \mathbb{R}^p$, quantile function values $Q_i(0)$, quantile densities $q_i$, grid $0 = t_1 < \cdots < t_m = 1$, and $\epsilon > 0$
    output: Estimates $\hat{q}_\oplus(x, t_l)$ and $\hat{f}_\oplus(x, \hat{Q}_\oplus(x, t_l))$, $l = 1, \ldots, m$
    $\tilde{Q}_0 \leftarrow n^{-1}\sum_{i=1}^n Q_i(0)$;
    for $l \leftarrow 1$ to $m$ do
        $Q_l \leftarrow n^{-1}\sum_{i=1}^n s_{in}(x)\, q_i(t_l)$;
    end
    if $Q_l < \epsilon$ for any $l \in \{1, \ldots, m\}$ then
        $b^* \leftarrow \operatorname{argmin}_{b \in \mathbb{R}^{m+1}}\; b^\top B b - 2Q^\top B b$, such that $b_2, \ldots, b_{m+1} \ge \epsilon$;
        $(Q_0, \ldots, Q_m) \leftarrow b^*$;
    end
    $u_1 \leftarrow \tilde{Q}_0$;
    for $l \leftarrow 2$ to $m$ do
        $u_l \leftarrow u_{l-1} + (Q_{l-1} + Q_l)(t_l - t_{l-1})/2$;
    end
    for $l \leftarrow 1$ to $m$ do
        $\hat{Q}_\oplus(x, t_l) \leftarrow u_l$; $\quad\hat{q}_\oplus(x, t_l) \leftarrow Q_l$; $\quad\hat{f}_\oplus(x, u_l) \leftarrow 1/Q_l$;
    end
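To make the estimation step concrete, the following is a minimal Python sketch of Algorithm 1 under two stated simplifications: the empirical global Fréchet regression weights $s_{in}(x) = 1 + (X_i - \bar{X})^\top \hat{\Sigma}^{-1}(x - \bar{X})$ are computed directly, and the monotone projection uses a diagonal (trapezoidal-weight) stand-in for the matrix $A$, so the quadratic program reduces to a weighted isotonic regression rather than the exact projection above.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def frechet_weights(X, x):
    """Empirical global Frechet regression weights
    s_in(x) = 1 + (X_i - Xbar)' SigmaHat^{-1} (x - Xbar)."""
    Xbar = X.mean(axis=0)
    Sigma = np.atleast_2d(np.cov(X, rowvar=False, bias=True))
    return 1.0 + (X - Xbar) @ np.linalg.solve(Sigma, x - Xbar)

def fit_quantile_curve(x, X, Q, t):
    """Sketch of Algorithm 1: weighted average of the sample quantile
    functions, projected onto nondecreasing curves if monotonicity fails.
    Q is (n, m) with Q[i, l] = Q_i(t_l); t is the grid on [0, 1]."""
    s = frechet_weights(X, x)
    Qbar = (s[:, None] * Q).mean(axis=0)   # n^{-1} sum_i s_in(x) Q_i(t_l)
    if np.all(np.diff(Qbar) >= 0):         # already a valid quantile curve
        return Qbar
    # Weighted-L2 projection onto monotone vectors; with a diagonal
    # (trapezoidal-weight) stand-in for A this is isotonic regression.
    iso = IsotonicRegression(increasing=True)
    return iso.fit_transform(t, Qbar, sample_weight=np.gradient(t))
```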
A.1.2. Hypothesis Testing. We first give the algorithm for computing the critical value $\hat{b}^G_\alpha$ in the global test using eigenvalue estimates $\hat{\lambda}_j$. Assume the fitted quantile functions $\hat{Q}_i = \hat{Q}_\oplus(X_i)$ and $\hat{Q}^*_\oplus = \hat{Q}_\oplus(\bar{X})$ have already been computed using Algorithm 1, and that $\hat{C}_Q$ in (3.3) has also been calculated. Then let $\hat{\lambda}_j$ be the eigenvalues of $\hat{C}_Q$, which can be computed using a singular value decomposition of the discretized covariance. Let $J$ be the number of $\hat{\lambda}_j$ that are positive. Algorithm 3 can be replaced by either of Algorithms 4 or 5, which correspond to the Satterthwaite approximation and residual bootstrap mentioned in Section 3.3. Similar algorithms can be followed for the partial tests of Section 3.2. However, the residual bootstrap algorithm can only be executed if the support of the random density $F$ is fixed.

Algorithm 3: Critical Value for Global Test
    input : Eigenvalue estimates $\hat{\lambda}_j$, level $\alpha \in (0,1)$, integer $R$
    output: Critical value $\hat{b}^G_\alpha$
    for $r \leftarrow 1$ to $R$ do
        Generate $\omega_j \overset{\text{i.i.d.}}{\sim} \chi^2_p$, $j = 1, \ldots, J$;
        $z_r \leftarrow \sum_{j=1}^J \hat{\lambda}_j \omega_j$;
    end
    $\hat{b}^G_\alpha \leftarrow 1-\alpha$ empirical quantile of $z_1, \ldots, z_R$;

Algorithm 4: Satterthwaite Critical Value for Global Test
    input : Covariance estimate $\hat{C}_Q$
    output: Critical value $\hat{b}^G_\alpha$
    $a \leftarrow \left(\int_0^1\int_0^1 \hat{C}_Q(s,t)^2\,\mathrm{d}s\,\mathrm{d}t\right) \Big/ \left(\int_0^1 \hat{C}_Q(t,t)\,\mathrm{d}t\right)$;
    $m \leftarrow p\left(\int_0^1 \hat{C}_Q(t,t)\,\mathrm{d}t\right)^2 \Big/ \left(\int_0^1\int_0^1 \hat{C}_Q(s,t)^2\,\mathrm{d}s\,\mathrm{d}t\right)$;
    $\hat{b}^G_\alpha \leftarrow a\,\chi^2_{m,1-\alpha}$;

Algorithm 5: Bootstrap Critical Value for Global Test
    input : Data $(X_i, Q_i)$, fitted quantile functions $\hat{Q}_i$ and $\hat{Q}^*_\oplus$, level $\alpha \in (0,1)$, integer $R$
    output: Critical value $\hat{b}^G_\alpha$
    for $i \leftarrow 1$ to $n$ do
        $\hat{T}_i \leftarrow \hat{Q}_i \circ \hat{F}^*_\oplus$;
    end
    for $r \leftarrow 1$ to $R$ do
        Sample $T^*_1, \ldots, T^*_n$ uniformly and with replacement from $(\hat{T}_1, \ldots, \hat{T}_n)$;
        $(Q^*_1, \ldots, Q^*_n) \leftarrow (T^*_1 \circ \hat{Q}^*_\oplus, \ldots, T^*_n \circ \hat{Q}^*_\oplus)$;
        Compute fitted values $\hat{Q}_{\oplus,r}(X_i)$ from data $(X_i, Q^*_i)$ using Algorithm 1;
        $z_r \leftarrow \sum_{i=1}^n \|\hat{Q}_{\oplus,r}(X_i) - \hat{Q}^*_\oplus\|^2_{L^2}$;
    end
    $\hat{b}^G_\alpha \leftarrow 1-\alpha$ empirical quantile of $z_1, \ldots, z_R$;

A.1.3. Confidence Bands. We first consider the Wasserstein-$\infty$ bands of Section 4.1. The goal is to compute the upper and lower cdf bounds $(F_L, F_U)$ such that $C_{\alpha,n}(x) = [F_L, F_U]$. The initial step in the algorithm is to approximate the critical value $\hat{m}_\alpha$. This requires generating zero-mean Gaussian processes (GPs) with a specified covariance function. This step can be done in a number of ways, for instance by discretizing the process onto a grid and generating a multivariate Gaussian vector with the given covariance structure, although other methods can be used. The key step is to project the resulting functions $M_L$ and $M_U$ to their nearest monotonic majorant and minorant, respectively, which becomes a quadratic program when discretized. For simplicity, we do not outline that step here, but the same approach used in Algorithm 1 applies. In addition, the algorithm produces quantile bounds $(Q_L, Q_U)$ instead of the cdf bounds since it is easier to present this way, but one can easily compute the latter through numerical inversion as a post-processing step. Suppose that $\hat{D}_Q(s,t)$ has already been computed as in (4.3), and define $\hat{D}_{0,x}(s,t) = \tilde{x}^\top \hat{D}_Q(s,t)\tilde{x}$.

Algorithm 6: Wasserstein-$\infty$ Band
    input : Predictor vector $x \in \mathbb{R}^p$, estimates $\hat{Q}_\oplus(x)$ and $\hat{D}_{0,x}(s,t)$, level $\alpha \in (0,1)$, integer $R$
    output: Lower/upper bound quantile functions $(Q_L, Q_U)$ of the simultaneous Wasserstein-$\infty$ band $C_{\alpha,n}(x)$
    for $r \leftarrow 1$ to $R$ do
        Generate a GP $\hat{N}_x(t)$ with mean zero and covariance $\hat{D}_{0,x}(s,t)$;
        $z_r \leftarrow \sup_{t \in [0,1]} \hat{D}_{0,x}(t,t)^{-1/2}\,|\hat{N}_x(t)|$;
    end
    $\hat{m}_\alpha \leftarrow 1-\alpha$ empirical quantile of $z_1, \ldots, z_R$;
    $M_L(t) \leftarrow \hat{Q}_\oplus(x,t) - n^{-1/2}\,\hat{m}_\alpha\,\hat{D}_{0,x}(t,t)^{1/2}$;
    $M_U(t) \leftarrow \hat{Q}_\oplus(x,t) + n^{-1/2}\,\hat{m}_\alpha\,\hat{D}_{0,x}(t,t)^{1/2}$;
    $Q_L \leftarrow \operatorname{argmin}_{Q \in \mathcal{Q}} \|Q - M_L\|^2_{L^2}$ subject to $Q$ nondecreasing, $Q \ge M_L$;
    $Q_U \leftarrow \operatorname{argmin}_{Q \in \mathcal{Q}} \|Q - M_U\|^2_{L^2}$ subject to $Q$ nondecreasing, $Q \le M_U$;
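The simulation step in Algorithm 6 (and, with the obvious substitutions, in Algorithms 3 and 7) can be sketched in a few lines of Python. This is a discretized stand-in, assuming the covariance $\hat{D}_{0,x}$ has already been evaluated on a grid; the helper names are illustrative.

```python
import numpy as np

def band_critical_value(D0x, alpha=0.05, R=2000, seed=None):
    """Monte Carlo approximation of the critical value in Algorithm 6:
    simulate a discretized zero-mean Gaussian process with covariance
    matrix D0x = D_{0,x}(t_k, t_l) and take the (1 - alpha) quantile of
    the supremum of the standardized absolute process over the grid."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(np.diag(D0x))
    N = rng.multivariate_normal(np.zeros(D0x.shape[0]), D0x, size=R)
    z = np.max(np.abs(N) / sd, axis=1)
    return np.quantile(z, 1.0 - alpha)

def band_envelopes(Q_hat, D0x, n, alpha=0.05):
    """Unconstrained envelopes M_L, M_U of Algorithm 6; projecting them
    onto monotone curves (as in Algorithm 1) then yields (Q_L, Q_U)."""
    m_alpha = band_critical_value(D0x, alpha)
    half = m_alpha * np.sqrt(np.diag(D0x) / n)
    return Q_hat - half, Q_hat + half
```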
Lastly, we present the algorithm for computing the Wasserstein density band $B_{\alpha,n}(x)$ as in (4.9). Assume the covariance $\hat{D}(s,t)$ in (4.7) has been computed, using quantile estimates $\hat{Q}_i$, $\hat{Q}_\oplus(x)$ and quantile density estimates $\hat{q}_i$, $\hat{q}_\oplus(x)$ from Algorithm 2. Furthermore, define
\[
\hat{b}_t = \frac{1}{\hat{q}_\oplus(x,t)^3}\left(\frac{\partial}{\partial t}\hat{q}_\oplus(x,t),\; \hat{q}_\oplus(x,t)\right)^\top, \qquad \frac{\partial}{\partial t}\hat{q}_\oplus(x,t) = \frac{1}{n}\sum_{i=1}^n s_{in}(x)\,q_i'(t).
\]
Then set $\hat{D}_{1,x}(s,t) = \hat{b}_s^\top\left(\tilde{x}^\top \otimes \hat{D}(s,t) \otimes \tilde{x}\right)\hat{b}_t$.

Algorithm 7: Wasserstein Density Band
    input : Predictor vector $x \in \mathbb{R}^p$, estimates $\hat{f}_\oplus(x)$ and $\hat{D}_{1,x}(s,t)$, level $\alpha \in (0,1)$, $\delta \in (0,1/2)$, and integer $R$
    output: Lower/upper bound functions $(f_L, f_U)$ of the nearly simultaneous Wasserstein density band $B_{\alpha,n}(x)$
    for $r \leftarrow 1$ to $R$ do
        Generate a GP $\hat{J}_x(t)$ on $(\delta, 1-\delta)$ with mean zero and covariance $\hat{D}_{1,x}(s,t)$;
        $z_r \leftarrow \sup_{t \in (\delta, 1-\delta)} \hat{D}_{1,x}(t,t)^{-1/2}\,|\hat{J}_x(t)|$;
    end
    $\hat{l}_\alpha \leftarrow 1-\alpha$ empirical quantile of $z_1, \ldots, z_R$;
    $f_L(u) \leftarrow \hat{f}_\oplus(x,u) - n^{-1/2}\,\hat{l}_\alpha\,\hat{D}_{1,x}(\hat{F}_\oplus(x,u), \hat{F}_\oplus(x,u))^{1/2}$;
    $f_U(u) \leftarrow \hat{f}_\oplus(x,u) + n^{-1/2}\,\hat{l}_\alpha\,\hat{D}_{1,x}(\hat{F}_\oplus(x,u), \hat{F}_\oplus(x,u))^{1/2}$;

A.2. Densities with Fixed Support. In many common scenarios, the sample of densities all have a known finite support $\mathcal{I} = (a, b)$. Without loss of generality, suppose $\mathcal{I} = (0,1)$. The key difficulty associated with the Wasserstein-$\infty$ band in (4.5) is the standardization obtained by dividing by the square root of the pointwise variance estimate. When the densities have fixed support, the random optimal transports $T$ necessarily satisfy $C_T(0,0) = C_T(1,1) = 0$, so that standardization does not preserve tightness in the weak convergence statements. On the other hand, having a fixed support resolves the issue encountered in the development of the Wasserstein density bands, since the true conditional Wasserstein mean $f_\oplus(x)$ and its estimate $\hat{f}_\oplus(x)$ will both have support $(0,1)$.

Using the same notation as in Section 4, for any $\delta \in (0,1/2)$, define
\[
\hat{\zeta}'_x = \sup_{t \in [\delta, 1-\delta]} \left[\tilde{x}^\top \hat{D}_Q(t,t)\tilde{x}\right]^{-1/2}\left|\hat{N}_x(t)\right|, \qquad
\hat{\xi}'_x = \sup_{t \in (0,1)} \left[\hat{R}_x(\hat{Q}_\oplus(x,t), \hat{Q}_\oplus(x,t))\right]^{-1/2}\left|\hat{J}_x(t)\right|,
\]
and let $\hat{m}'_\alpha$ and $\hat{l}'_\alpha$ be, respectively, the $1-\alpha$ quantiles of $\hat{\zeta}'_x$ and $\hat{\xi}'_x$. Then define the confidence bands
\[
C'_{\alpha,n}(x) = \left\{ g \in \mathcal{D} : \sup_{u \in [\delta, 1-\delta]} \frac{|T_{g,x}(u) - u|}{\hat{C}_x(u,u)^{1/2}} \le \frac{\hat{m}'_\alpha}{n^{1/2}} \right\}, \quad T_{g,x} = G^{-1}\circ\hat{F}_\oplus(x), \tag{A.1}
\]
\[
B'_{\alpha,n}(x) = \left\{ g \in \mathcal{D} : \sup_{u \in \mathcal{I}} \frac{|g(u) - \hat{f}_\oplus(x,u)|}{\hat{R}_x(u,u)^{1/2}} \le \frac{\hat{l}'_\alpha}{n^{1/2}} \right\}. \tag{A.2}
\]
The same computational procedures outlined in Algorithms 6 and 7 apply to computing the above confidence bands. We also have strongly consistent coverage.

Corollary 5. Suppose that (T1)-(T4) hold and $C_T(u,u) > 0$ for all $u \in (0,1)$. Then $P_X\left(f_\oplus(x) \in C'_{\alpha,n}(x)\right) \to 1-\alpha$ almost surely.

Corollary 6. Suppose the assumptions of Corollary 4 hold. Then, if $\inf_{u \in (0,1)} R_x(u,u) > 0$, $P_X\left(f_\oplus(x) \in B'_{\alpha,n}(x)\right) \to 1-\alpha$ almost surely.

[Figure 4: two rows of three panels each for $n = 100, 200, 500$; x-axis from 0 to 0.5, y-axis empirical power from 0 to 1; legends distinguish the transport map (Lin./Non-Lin.) and the method (Boot./Satt./Chi Sq. in the top row, Satt./Chi Sq. in the bottom row).]

Fig 4: Power curves for global (top) and partial (bottom) Wasserstein F-tests, for sample sizes $n = 100, 200, 500$, when densities are only indirectly observed through a random sample. The dotted horizontal line represents the nominal level $\alpha = 0.05$.

A.3. Simulations with Indirectly Observed Densities.
To understand the possible errors that may be introduced when densities are only observed through a random sample, an additional step was added to the density simulations outlined in Section 5. After generating the random densities $F_i$, $i = 1, \ldots, n$, a random sample of scalar variables was generated from each density, each of size 300. The preliminary estimation of $Q_i$, $q_i = Q_i'$ and $q_i'$ was performed as described in Section 5.

Table 3
Error rates of intrinsic Wasserstein-$\infty$ and Wasserstein density bands for indirectly observed densities over 500 simulation runs

Wasserstein-$\infty$ band*
            linear transport map        nonlinear transport map
            n=100   n=200   n=500       n=100   n=200   n=500
x = -0.30   0.072   0.086   0.072       0.066   0.076   0.078
x = -0.24   0.072   0.080   0.070       0.078   0.070   0.084
x = -0.18   0.072   0.078   0.088       0.092   0.064   0.082
x = -0.12   0.076   0.074   0.094       0.084   0.062   0.084
x = -0.06   0.072   0.070   0.088       0.088   0.056   0.088
x = 0       0.066   0.082   0.092       0.082   0.058   0.086
x = 0.06    0.062   0.076   0.076       0.084   0.062   0.082
x = 0.12    0.062   0.068   0.082       0.098   0.072   0.088
x = 0.18    0.062   0.064   0.068       0.090   0.078   0.092
x = 0.24    0.070   0.060   0.066       0.094   0.088   0.090
x = 0.30    0.064   0.064   0.072       0.096   0.072   0.096

Wasserstein density band*
            linear transport map        nonlinear transport map
            n=100   n=200   n=500       n=100   n=200   n=500
x = -0.30   0.272   0.188   0.114       0.072   0.048   0.044
x = -0.24   0.228   0.148   0.100       0.068   0.046   0.052
x = -0.18   0.206   0.146   0.096       0.050   0.048   0.054
x = -0.12   0.182   0.120   0.098       0.032   0.042   0.042
x = -0.06   0.166   0.112   0.088       0.030   0.034   0.038
x = 0       0.150   0.106   0.084       0.024   0.026   0.032
x = 0.06    0.140   0.098   0.074       0.030   0.028   0.034
x = 0.12    0.134   0.112   0.068       0.030   0.030   0.030
x = 0.18    0.152   0.118   0.064       0.036   0.032   0.034
x = 0.24    0.160   0.116   0.060       0.048   0.030   0.042
x = 0.30    0.176   0.116   0.062       0.044   0.040   0.040
* Computed with a fixed truncation parameter $\delta$.
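As a rough illustration of the preprocessing step described above, the sketch below estimates a quantile function on a grid from one such random sample. It substitutes a simple Gaussian-kernel smooth of the empirical quantiles for the local linear smoother used in Section 5, so it is a stand-in rather than the paper's exact procedure; the bandwidth `bw` is an assumed tuning value.

```python
import numpy as np

def estimate_quantile_curve(sample, t, bw=0.05):
    """Estimate a quantile function on grid t in [0, 1] from a random
    sample drawn from one density: sort the sample to get the empirical
    quantile function, then smooth it with a Gaussian kernel. A final
    isotonic projection (as in Algorithm 1) can restore monotonicity
    if the smoothing step disturbs it near the boundaries."""
    probs = (np.arange(len(sample)) + 0.5) / len(sample)
    emp_q = np.sort(np.asarray(sample, dtype=float))
    W = np.exp(-0.5 * ((t[:, None] - probs[None, :]) / bw) ** 2)
    return (W @ emp_q) / W.sum(axis=1)
```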
A.4. Proofs of Propositions 1 and 2.

Proof of Proposition 1. By (A1), $Q^*_\oplus(t) := E(Q(t))$ is well-defined and finite for any $t \in (0,1)$. Since $Q$ is continuous almost surely by (A2), for any $t \in (0,1)$ and sequence $t_k \to t$, $Q(t_k) \to Q(t)$ almost surely. Then, by dominated convergence, $Q^*_\oplus(t_k) \to Q^*_\oplus(t)$. By monotone convergence, we can extend $Q^*_\oplus$ to $[0,1]$, with $Q^*_\oplus(t) \uparrow Q^*_\oplus(1)$ as $t \uparrow 1$ and $Q^*_\oplus(t) \downarrow Q^*_\oplus(0)$ as $t \downarrow 0$. Clearly, $Q^*_\oplus$ is increasing by monotonicity of expectation, so that $Q^*_\oplus$ is a valid quantile function. Furthermore, $Q^*_\oplus \in L^2[0,1]$ by (A1). In fact, for any measure $M$ on $\mathbb{R}$ with finite second moment, its quantile function must be in $L^2[0,1]$. Denote the standard Hilbert norm on this space by $\|\cdot\|_{L^2}$. Thus, the measure corresponding to $Q^*_\oplus$ is the unique minimizer of $q \mapsto E\left[\|Q - q\|^2_{L^2}\right]$ among all quantile functions $q \in L^2[0,1]$, so that $Q^*_\oplus$ represents the quantile function of the Wasserstein mean measure.

Let $t \in (0,1)$ and $\delta$ be as in (A2), and recall that $q(t) = Q'(t) = 1/[f \circ Q(t)] \in (0,\infty)$ by (A2). If $t_k \to t$ with $|t_k - t| < \delta$, (A2) implies that
\[
\left|\frac{Q(t) - Q(t_k)}{t_k - t}\right| \le \sup_{|s-t|<\delta} q(s).
\]
So, by dominated convergence, $q^*_\oplus(t) := E(q(t)) = Q^{*\prime}_\oplus(t)$. As $0 < q^*_\oplus(t) < \infty$ for all $t \in (0,1)$, $F^*_\oplus(u) = Q^{*-1}_\oplus(u)$ is a bona fide cdf. Furthermore, $F^*_\oplus$ is differentiable with density $f^*_\oplus(u) = 1/\left[q^*_\oplus \circ F^*_\oplus(u)\right]$ for $u \in (Q^*_\oplus(0), Q^*_\oplus(1))$.

That $f^*_\oplus$ is the unique minimizer of (2.3) follows from the fact that, for any $g \in \mathcal{D}$, $E\left[d_W^2(F, g)\right] = E\left[\|Q - G^{-1}\|^2_{L^2}\right]$, so that $\mathrm{Var}_\oplus(F) < \infty$. Lastly, by (A3), there exists $M > 0$ such that $P\left(\sup_{u \in (Q(0), Q(1))} f(u) < M\right) > 0$. Thus, for $t \in (0,1)$,
\[
q^*_\oplus(t) \ge E\left(\inf_{t \in (0,1)} q(t)\right) \ge \frac{P\left(\sup_{u \in (Q(0), Q(1))} f(u) < M\right)}{M} > 0,
\]
so that $\sup_{u \in (Q^*_\oplus(0), Q^*_\oplus(1))} f^*_\oplus(u) < \infty$. The results for conditional means are obtained similarly and the details are omitted.

Proof of Proposition 2. Under the Wasserstein regression model, $Q_\oplus(x) = \Pi_{\mathcal{Q}}\left(E[s(X,x)Q(\cdot)]\right)$, where, for any continuous function $h$ on $[0,1]$, $\Pi_{\mathcal{Q}}(h)$ is the projection of $h$ onto
\[
\mathcal{Q} := \left\{Q \in L^2[0,1] : Q \text{ is a quantile function}\right\}.
\]
As $\mathcal{Q}$ is closed and convex, this projection exists and is unique (Rychlik, 2012); in fact, it is continuous (Groeneboom and Jongbloed, 2010; Lin and Dunson, 2014). Under the assumptions, $Q_\oplus(x,t)$ is continuous and strictly increasing for almost all $x$, so that the projection operator is redundant and, by the form of $s(X,x)$,
\[
Q_\oplus(x,t) = Q^*_\oplus(t) + E\left[Q(t)(X-\mu)^\top\right]\Sigma^{-1}(x-\mu). \tag{A.1}
\]
This implies that $E(Q_\oplus(X,t)) = Q^*_\oplus(t)$ for all $t \in [0,1]$, so that $E(S(u)) = E(Q_\oplus(X)\circ F^*_\oplus(u)) = u$ for almost all $u \in \bar{\mathcal{I}}^* = [Q^*_\oplus(0), Q^*_\oplus(1)]$.

For the variance decomposition, since $Q = T \circ Q_\oplus(X) = T \circ S \circ Q^*_\oplus$,
\begin{align*}
\mathrm{Var}_\oplus(F) &= E\left[\int_{\mathcal{I}^*} (T\circ S(u) - u)^2 f^*_\oplus(u)\,\mathrm{d}u\right] \\
&= \int_{\mathcal{I}^*} E\left[(T\circ S(u) - S(u))^2 + (S(u)-u)^2 + 2(T\circ S(u) - S(u))(S(u)-u)\right] f^*_\oplus(u)\,\mathrm{d}u \\
&= \int_{\mathcal{I}^*} E\left[(T\circ S(u) - S(u))^2\right] f^*_\oplus(u)\,\mathrm{d}u + \int_{\mathcal{I}^*} E\left[(S(u)-u)^2\right] f^*_\oplus(u)\,\mathrm{d}u, \tag{A.2}
\end{align*}
and the cross product term vanishes since $E(T\circ S(u) \mid X) = S(u)$ a.s. Because $E(S(u)) = u$, $f^*_\oplus$ is the Wasserstein mean of the random density $f_\oplus(X)$, and the second term in the last line of (A.2) is $E(d_W^2(f_\oplus(X), f^*_\oplus)) = \mathrm{Var}_\oplus(f_\oplus(X))$ as claimed. Furthermore, by (T1), $E\left[(T\circ S(u) - S(u))^2 \mid X\right] = C_T(S(u), S(u))$ almost surely for all $u \in \bar{\mathcal{I}}^*$. The result follows.

A.5. Proofs of Theorems 1-4 and Corollaries 1 and 2. We first introduce some notation. For $x \in \mathbb{R}^p$, set $\tilde{x} = (1, x^\top)^\top \in \mathbb{R}^{p+1}$, and let $\Lambda = E(\tilde{X}\tilde{X}^\top)$, $\hat{\Lambda} = n^{-1}\sum_{i=1}^n \tilde{X}_i\tilde{X}_i^\top$. Define $\mathcal{I}^* = (Q^*_\oplus(0), Q^*_\oplus(1))$ and $\bar{\mathcal{I}}^*$ its closure. For $u \in \bar{\mathcal{I}}^*$, let $R = Q \circ F^*_\oplus$, $R_i = Q_i \circ F^*_\oplus$, and define
\[
\gamma(u) = \mathrm{Cov}(X, R(u)), \quad \tilde{\gamma}(u) = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})R_i(u), \quad
\pi(u) = \Lambda^{-1}E\left[\tilde{X}R(u)\right], \quad \tilde{\pi}(u) = \hat{\Lambda}^{-1}\left[\frac{1}{n}\sum_{i=1}^n \tilde{X}_i R_i(u)\right]. \tag{A.1}
\]
For $u' \in \mathcal{I}^*$, let $\pi'(u')$ and $\tilde{\pi}'(u')$ denote the derivatives of $\pi$ and $\tilde{\pi}$, respectively. The symbol $\otimes$ will denote the Kronecker product of matrices. For any set $T$ and $k \in \mathbb{N}$, define $L^\infty_k(T)$ as the $k$-fold Cartesian product of all bounded functions on $T$. A statistic $W$ computed from the data $(X_i, F_i)$ is said to be $o_{P_X}(1)$ if, for all $\epsilon > 0$, $P_X(|W| > \epsilon)$ converges to zero almost surely. Likewise, for a nonnegative sequence $a_n$, $W = O_{P_X}(a_n)$ if
\[
\limsup_{C \to \infty} \lim_{n \to \infty} P_X(|W| > Ca_n) = 0 \quad \text{almost surely}.
\]

Proof of Theorem 1.
Define $\tilde{\pi}$ as in (A.1), and set
\begin{align*}
\tilde{Q}_\oplus(x,t) &= \frac{1}{n}\sum_{i=1}^n s_{in}(x)Q_i(t) = \tilde{x}^\top\tilde{\pi}\circ Q^*_\oplus(t), \quad t \in [0,1], \\
\tilde{q}_\oplus(x,t) &= \frac{\partial}{\partial t}\tilde{Q}_\oplus(x,t) = \frac{1}{n}\sum_{i=1}^n s_{in}(x)q_i(t) = q^*_\oplus(t)\,\tilde{x}^\top\tilde{\pi}'\circ Q^*_\oplus(t), \quad t \in (0,1), \tag{A.2}
\end{align*}
as empirical versions of $Q_\oplus(x,t) = \tilde{x}^\top\pi\circ Q^*_\oplus(t)$ and $q_\oplus(x,t) = q^*_\oplus(t)\,\tilde{x}^\top\pi'\circ Q^*_\oplus(t)$. Let $\Pi_{\mathcal{Q}}$ be as defined in the proof of Proposition 2. Then, for each $i$,
\[
d_W(\hat{F}_i, \hat{f}^*_\oplus) = \left\|\Pi_{\mathcal{Q}}\left(\tilde{Q}_\oplus(X_i,\cdot)\right) - \hat{Q}^*_\oplus\right\|_{L^2}, \qquad \hat{Q}^*_\oplus(t) = \frac{1}{n}\sum_{i=1}^n Q_i(t).
\]
By Lemma 2, the event
\[
A_n = \left\{\min_{1\le i\le n}\inf_{t\in(0,1)}\tilde{q}_\oplus(X_i,t) > 0\right\} \tag{A.3}
\]
satisfies $P_X(A_n) \to 1$ almost surely. Whenever $A_n$ holds, $\tilde{Q}_\oplus(X_i,\cdot)$ is strictly increasing for all $i$ and $\Pi_{\mathcal{Q}}$ is redundant.

Let $\gamma$ and $\tilde{\gamma}$ be as in (A.1). Since $\hat{Q}^*_\oplus(t) = \tilde{Q}_\oplus(\bar{X},t)$, whenever $A_n$ holds, the test statistic becomes
\begin{align*}
F^*_G &= \sum_{i=1}^n d_W^2(\hat{F}_i, \hat{f}^*_\oplus) = \sum_{i=1}^n \int_0^1 \left[\tilde{Q}_\oplus(X_i,t) - \tilde{Q}_\oplus(\bar{X},t)\right]^2 \mathrm{d}t \\
&= \sum_{i=1}^n \int_{\mathcal{I}^*} \left[(X_i-\bar{X})^\top\hat{\Sigma}^{-1}\tilde{\gamma}(u)\right]^2 f^*_\oplus(u)\,\mathrm{d}u = n\int_{\mathcal{I}^*} \tilde{\gamma}^\top(u)\hat{\Sigma}^{-1}\tilde{\gamma}(u)f^*_\oplus(u)\,\mathrm{d}u,
\end{align*}
using the change of variables $t = F^*_\oplus(u)$. From (A.1), we deduce that $S_i(u) = \tilde{X}_i^\top\pi(u)$. Hence, when $H_G$ holds, $S = \mathrm{id}$ almost surely, so that $\gamma(u) \equiv 0$. Then, almost surely, by Lemma 1, $\sqrt{n}\hat{\Sigma}^{-1}\tilde{\gamma}\mid X_1,\ldots,X_n$ converges weakly to a zero-mean Gaussian process $\Xi$ on $L^\infty_p(\bar{\mathcal{I}}^*)$ with covariance $\mathrm{Cov}(\Xi(u),\Xi(v)) = \Sigma^{-1}C_T(u,v)$. Applying the continuous mapping theorem,
\[
F^*_G \mid X_1,\ldots,X_n \overset{D}{\to} \int_{\bar{\mathcal{I}}^*} \Xi^\top(u)\,\Sigma\,\Xi(u)f^*_\oplus(u)\,\mathrm{d}u
\]
almost surely. The result follows since $K(u,v) = C_T(u,v)$ under $H_G$, so that with $(\lambda_k,\phi_k)$ as in (3.2) and $\tilde{\Xi} = \Sigma^{1/2}\Xi$,
\[
\int_{\bar{\mathcal{I}}^*} \|\tilde{\Xi}(u)\|_E^2 f^*_\oplus(u)\,\mathrm{d}u = \sum_{k=1}^\infty \lambda_k \sum_{j=1}^p Z_{jk}^2,
\]
where $Z_{jk} = \lambda_k^{-1/2}\int_{\bar{\mathcal{I}}^*} \tilde{\Xi}_j(u)\phi_k(u)f^*_\oplus(u)\,\mathrm{d}u$ are standard Gaussian random variables, independent across both $j$ and $k$.

Proof of Corollary 1. Define $\tilde{Q}_\oplus(x,t)$ as in (A.2), and set
\[
\tilde{C}_Q(s,t) = \frac{1}{n}\sum_{i=1}^n (Q_i(s) - \tilde{Q}_\oplus(X_i,s))(Q_i(t) - \tilde{Q}_\oplus(X_i,t)).
\]
Let $Z_n$ and $Z$ be (conditional on the data) zero-mean Gaussian processes on $L^\infty_p[0,1]$ with covariance functions $I_p\tilde{C}_Q(s,t)$ and $I_pC_Q(s,t)$, respectively, where $I_p$ is the $p \times p$ identity matrix. That $\tilde{C}_Q(s,t)$ converges almost surely to $C_Q(s,t)$ for any $s,t \in [0,1]$ follows by standard arguments, using assumptions (T1) and (T3). Furthermore, using (T2)-(T4), it follows that $|\tilde{C}_Q(s,t) - \tilde{C}_Q(s,t')| = O(1)|t - t'|$ almost surely, where the $O(1)$ term is uniform over $s,t,t' \in [0,1]$. Thus, Lemma 3 implies that $Z_n \rightsquigarrow Z$ in $L^\infty_p[0,1]$ almost surely. Let $\omega_j$ be as in the statement of Theorem 1. If $\tilde{\lambda}_j$ are the eigenvalues of $\tilde{C}_Q$, observe that $\sum_{j=1}^\infty \lambda_j\omega_j \overset{D}{=} \int_0^1 \|Z(t)\|_E^2\,\mathrm{d}t$ and $\sum_{j=1}^\infty \tilde{\lambda}_j\omega_j \overset{D}{=} \int_0^1 \|Z_n(t)\|_E^2\,\mathrm{d}t$ almost surely. Let $\tilde{b}^G_\alpha$ be the $1-\alpha$ quantile of $\sum_{j=1}^\infty \tilde{\lambda}_j\omega_j$. Then, by the continuous mapping theorem, $\tilde{b}^G_\alpha \to b^G_\alpha$ almost surely.

Let $A_n$ be as in (A.3), so that $\hat{C}_Q = \tilde{C}_Q$ when $A_n$ holds.
Then, by Lemma 2, for any $\epsilon > 0$,
\begin{align*}
\beta^G_n = P_X\left(F^*_G > \hat{b}^G_\alpha\right) &\le P_X\left(F^*_G > b^G_\alpha - \epsilon\right) + P_X(A_n^c) + P_X\left(|\tilde{b}^G_\alpha - b^G_\alpha| > \epsilon\right) \\
&= P\left(\sum_{j=1}^\infty \lambda_j\omega_j > b^G_\alpha - \epsilon\right) + o(1) \quad \text{almost surely}.
\end{align*}
Since $\epsilon$ was arbitrary, $\limsup_{n\to\infty}\beta^G_n \le \alpha$ almost surely. A similar lower bound shows $\liminf_{n\to\infty}\beta^G_n \ge \alpha$ almost surely, proving the result.

Proof of Theorem 2. Let $\epsilon > 0$, let $A_n$ be as in (A.3), and set
\[
\rho(u) = \Sigma^{-1}\gamma(u), \qquad \hat{\rho}(u) = \hat{\Sigma}^{-1}\tilde{\gamma}(u). \tag{A.4}
\]
For any model $\mathcal{G} \in G_{A,n}$, define
\[
W = W_{\mathcal{G}} = E\left[d_W^2(f_\oplus(X), f^*_\oplus)\right] = \int_{\mathcal{I}^*} \rho^\top(u)\Sigma\rho(u)f^*_\oplus(u)\,\mathrm{d}u, \qquad
\hat{W}_n = \int_{\mathcal{I}^*} \rho^\top(u)\hat{\Sigma}\rho(u)f^*_\oplus(u)\,\mathrm{d}u, \tag{A.5}
\]
and set
\[
\Delta_n = \left|2\int_{\mathcal{I}^*} [\hat{\rho}(u)-\rho(u)]^\top\hat{\Sigma}\rho(u)f^*_\oplus(u)\,\mathrm{d}u + \hat{W}_n - W\right|. \tag{A.6}
\]
Lastly, define
\begin{align*}
r_{1n} &= \sup_{\mathcal{G}\in G_{A,n}} P\left(\sup_{m\ge n} P_X\left(\Delta_m > \frac{W}{2}\right) \ge \epsilon\right), \\
r_{2n} &= \sup_{\mathcal{G}\in G_{A,n}} P\left(\sup_{m\ge n} P_X(A_m^c) \ge \epsilon\right), \\
r_{3n} &= \sup_{\mathcal{G}\in G_{A,n}} P\left(\sup_{m\ge n} P_X\left(\hat{b}^G_\alpha > R_0\right) \ge \epsilon\right)
\end{align*}
for some $R_0 > 0$. It is shown in Lemma 4 that, for any $\epsilon > 0$, $r_{1n}$ and $r_{2n}$ are both $o(1)$ and that there exists $R_0 > 0$ such that $r_{3n} = o(1)$.

Next, on the set $A_n$, one can write the test statistic $F^*_G$ as
\[
F^*_G = n\int_{\mathcal{I}^*} (\hat{\rho}(u)-\rho(u))^\top\hat{\Sigma}(\hat{\rho}(u)-\rho(u))f^*_\oplus(u)\,\mathrm{d}u + 2n\int_{\mathcal{I}^*} (\hat{\rho}(u)-\rho(u))^\top\hat{\Sigma}\rho(u)f^*_\oplus(u)\,\mathrm{d}u + n\hat{W}_n.
\]
Then, for $n_0$ large enough that $nW \ge na_n > 2R_0$ for all $n \ge n_0$ and $\mathcal{G} \in G_{A,n}$, for such $n$ we immediately obtain
\begin{align*}
\beta^G_n = P_X(F^*_G > \hat{b}^G_\alpha) &\ge P_X(F^*_G > R_0) - P_X(\hat{b}^G_\alpha > R_0) \\
&\ge P_X\left(n\int_{\mathcal{I}^*} (\hat{\rho}(u)-\rho(u))^\top\hat{\Sigma}(\hat{\rho}(u)-\rho(u))f^*_\oplus(u)\,\mathrm{d}u > R_0 - n[W - \Delta_n]\right) - P_X(A_n^c) - P_X(\hat{b}^G_\alpha > R_0) \\
&\ge 1 - P_X\left(\Delta_n > \frac{W}{2}\right) - P_X(A_n^c) - P_X(\hat{b}^G_\alpha > R_0).
\end{align*}
As a result, for any $\epsilon > 0$, there exists $R_0 > 0$ such that, for large $n$,
\[
\inf_{\mathcal{G}\in G_{A,n}} P\left(\inf_{m\ge n}\beta^G_m \ge 1 - 3\epsilon\right) \ge 1 - r_{1n} - r_{2n} - r_{3n} = 1 - o(1).
\]

Proof of Theorem 3. In analogy to (A.2), define
\begin{align*}
\tilde{Q}_{0,\oplus}(y,t) &= \frac{1}{n}\sum_{i=1}^n s_{in,0}(y)Q_i(t) = \tilde{y}^\top\tilde{\pi}_Y\circ Q^*_\oplus(t), \quad t \in [0,1], \\
\tilde{q}_{0,\oplus}(y,t) &= \frac{\partial}{\partial t}\tilde{Q}_{0,\oplus}(y,t) = \frac{1}{n}\sum_{i=1}^n s_{in,0}(y)q_i(t) = q^*_\oplus(t)\,\tilde{y}^\top\tilde{\pi}'_Y\circ Q^*_\oplus(t), \quad t \in (0,1), \tag{A.7}
\end{align*}
as empirical versions of $Q_{0,\oplus}(y,t) = \tilde{y}^\top\pi_Y\circ Q^*_\oplus(t)$ and $q_{0,\oplus}(y,t) = q^*_\oplus(t)\,\tilde{y}^\top\pi'_Y\circ Q^*_\oplus(t)$. Note that $\hat{Q}^*_\oplus(t) = \tilde{\bar{Y}}^\top\tilde{\pi}_Y\circ Q^*_\oplus(t) = \tilde{\bar{X}}^\top\tilde{\pi}\circ Q^*_\oplus(t)$. By a similar argument as in the proof of Theorem 1, letting
\[
A_{n,0} = \left\{\min_{1\le i\le n}\inf_{t\in(0,1)}\tilde{q}_{0,\oplus}(Y_i,t) > 0\right\}, \tag{A.8}
\]
then $P_X(A_{n,0}) \to 1$ almost surely. With $\tilde{\gamma}$ as in (A.1), partition it as $(\tilde{\gamma}_Y, \tilde{\gamma}_Z)$. When both $A_n$ and $A_{n,0}$ hold,
\[
d_W^2(\hat{F}_i, \hat{f}^*_\oplus) = \int_{\mathcal{I}^*} \left[(X_i-\bar{X})^\top\hat{\Sigma}^{-1}\tilde{\gamma}(u)\right]^2 f^*_\oplus(u)\,\mathrm{d}u, \qquad
d_W^2(\hat{F}_{0,i}, \hat{f}^*_\oplus) = \int_{\mathcal{I}^*} \left[(Y_i-\bar{Y})^\top\hat{\Sigma}_{YY}^{-1}\tilde{\gamma}_Y(u)\right]^2 f^*_\oplus(u)\,\mathrm{d}u.
\]
Letting
\[
\hat{\Gamma} = \begin{pmatrix} I_q & 0 \\ \hat{\Sigma}_{ZY}\hat{\Sigma}_{YY}^{-1} & 0 \end{pmatrix},
\]
it can be verified that
\[
\hat{\Sigma}^{-1}\hat{\Gamma} = \begin{pmatrix} \hat{\Sigma}_{YY}^{-1} & 0 \\ 0 & 0 \end{pmatrix}, \qquad (I - \hat{\Gamma})\hat{\Sigma} = \begin{pmatrix} 0 & 0 \\ 0 & \hat{\Sigma}_{Z|Y} \end{pmatrix}.
\]
Next, define $\rho, \hat{\rho}$ as in (A.4), and partition them as $\rho = (\rho_Y, \rho_Z)$, $\hat{\rho} = (\hat{\rho}_Y, \hat{\rho}_Z)$. Then, under $H_P$, $\rho_Z(u) \equiv 0$.
By straightforward algebra, one can verify that, under $H_P$ and on the set $A_n \cap A_{n,0}$, the test statistic becomes
\begin{align*}
F^*_P &= \sum_{i=1}^n \int_{\mathcal{I}^*} \left\{\left[(X_i-\bar{X})^\top\hat{\Sigma}^{-1}\tilde{\gamma}(u)\right]^2 - \left[(Y_i-\bar{Y})^\top\hat{\Sigma}_{YY}^{-1}\tilde{\gamma}_Y(u)\right]^2\right\} f^*_\oplus(u)\,\mathrm{d}u \\
&= n\int_{\mathcal{I}^*} \left[\tilde{\gamma}^\top(u)\hat{\Sigma}^{-1}\tilde{\gamma}(u) - \tilde{\gamma}^\top(u)\hat{\Sigma}^{-1}\hat{\Gamma}\tilde{\gamma}(u)\right] f^*_\oplus(u)\,\mathrm{d}u \\
&= n\int_{\mathcal{I}^*} \hat{\rho}^\top(u)(I - \hat{\Gamma})\hat{\Sigma}\hat{\rho}(u)f^*_\oplus(u)\,\mathrm{d}u = n\int_{\mathcal{I}^*} \hat{\rho}_Z(u)^\top\hat{\Sigma}_{Z|Y}\hat{\rho}_Z(u)f^*_\oplus(u)\,\mathrm{d}u.
\end{align*}
Thus, under $H_P$, by applying Lemma 1, the continuous mapping theorem, and the law of large numbers, we see that
\[
F^*_P \mid X_1,\ldots,X_n \overset{D}{\to} \int_{\mathcal{I}^*} \|\tilde{\Xi}(u)\|_E^2 f^*_\oplus(u)\,\mathrm{d}u
\]
almost surely, where $\tilde{\Xi}(u)$ is a zero-mean Gaussian process on $L^\infty_{p-q}(\mathcal{I}^*)$ with covariance
\[
\Sigma_{Z|Y}^{-1/2}J^\top E\left[(X-\mu)(X-\mu)^\top C_T(S(u), S(v))\right]J\Sigma_{Z|Y}^{-1/2}, \tag{A.9}
\]
where $J^\top = (-\Sigma_{ZY}\Sigma_{YY}^{-1}\;\; I_{p-q})$. By our assumptions, this is the kernel of a self-adjoint, trace-class operator, and so has nonnegative eigenvalues $\tau_j$ with $\sum_{j=1}^\infty \tau_j < \infty$. By the Karhunen-Loève representation,
\[
\int_{\mathcal{I}^*} \|\tilde{\Xi}(u)\|_E^2 f^*_\oplus(u)\,\mathrm{d}u \overset{D}{=} \sum_{j=1}^\infty \tau_j\xi_j^2,
\]
where $\xi_j$ are i.i.d. standard normal random variables.

Under the additional assumptions that $E(Z \mid Y)$ is linear in $Y$ and $\mathrm{Var}(Z \mid Y)$ is constant, it follows that $J^\top(X-\mu) = Z - E(Z \mid Y)$ and $\mathrm{Var}(J^\top(X-\mu) \mid Y) = \Sigma_{Z|Y}$. Consequently, the covariance of $\tilde{\Xi}(u)$ becomes
\[
\Sigma_{Z|Y}^{-1/2}J^\top E\left[(X-\mu)(X-\mu)^\top C_T(S(u), S(v))\right]J\Sigma_{Z|Y}^{-1/2} = I_{p-q}K(u,v)
\]
and the result follows.

Proof of Corollary 2. The proof is similar to that of Corollary 1 and is omitted.

Proof of Theorem 4. The proof is similar to that of Theorem 2 and is omitted.

A.6. Proofs of Theorems 5 and 6 and Corollaries 3-6.

Proof of Theorem 5. First, note that $Q_\oplus(x)\circ F^*_\oplus = \tilde{x}^\top\pi$. With $\tilde{q}_\oplus$ defined in (A.2), whenever $\inf_{t\in(0,1)}\tilde{q}_\oplus(x,t) > 0$, we have $\hat{Q}_\oplus(x)\circ F^*_\oplus = \tilde{x}^\top\tilde{\pi}$, implying that
\[
\sqrt{n}\left(\hat{V}_x(u) - u\right) = \sqrt{n}\left(\hat{Q}_\oplus(x) - Q_\oplus(x)\right)\circ F_\oplus(x) = \sqrt{n}\,\tilde{x}^\top(\tilde{\pi} - \pi)\circ Q^*_\oplus\circ F_\oplus(x).
\]
Let $S_1$ be the first $p+1$ coordinates of the multivariate Gaussian process $S$ defined in Lemma 1, so that
\[
\mathrm{Cov}(S_1(u), S_1(v)) = \Lambda^{-1}E\left(\tilde{X}\tilde{X}^\top C_T(S(u), S(v))\right)\Lambda^{-1} = \tilde{K}(u,v), \quad u,v \in \bar{\mathcal{I}}^*.
\]
Then, by the continuous mapping theorem and Lemma 2,
\[
\sqrt{n}(\hat{V}_x - \mathrm{id}) \mid X_1,\ldots,X_n \rightsquigarrow \tilde{x}^\top S_1\circ Q^*_\oplus\circ F_\oplus(x) =: M_x \quad \text{almost surely},
\]
where $M_x$ is a zero-mean Gaussian process on $L^\infty(\mathcal{I}_x)$. For $u,v \in \mathcal{I}_x$, set $u_x, v_x$ as in the statement of the theorem. Then
\[
\mathrm{Cov}(M_x(u), M_x(v)) = \tilde{x}^\top\tilde{K}(Q^*_\oplus\circ F_\oplus(x,u), Q^*_\oplus\circ F_\oplus(x,v))\tilde{x} = \tilde{x}^\top\tilde{K}(u_x, v_x)\tilde{x}. \tag{A.1}
\]

Proof of Corollary 3. Define $\tilde{Q}_\oplus(x,t)$ as in (A.2), and set
\begin{align*}
B_Q(s,t) &= E\left[\tilde{X}\tilde{X}^\top C_T(Q_\oplus(x,s), Q_\oplus(x,t))\right], \\
\tilde{B}_Q(s,t) &= \frac{1}{n}\sum_{i=1}^n \tilde{X}_i\tilde{X}_i^\top(Q_i(s) - \tilde{Q}_\oplus(X_i,s))(Q_i(t) - \tilde{Q}_\oplus(X_i,t)), \\
\tilde{D}_Q(s,t) &= \hat{\Lambda}^{-1}\tilde{B}_Q(s,t)\hat{\Lambda}^{-1}.
\end{align*}
In a similar way to the proof of Corollary 1, one can show that $\|\tilde{B}_Q(s,t) - B_Q(s,t)\|_F$ converges to zero almost surely, for any $s,t \in [0,1]$, and that $\|\tilde{B}_Q(s,t) - \tilde{B}_Q(s,t')\|_F = O\left(|t - t'|\right)$ almost surely, where the $O(1)$ term is uniform over $s,t,t' \in [0,1]$. Let $\tilde{N}_x$, $N_x$ be zero-mean Gaussian processes on $L^\infty[0,1]$ with covariances $\tilde{x}^\top\tilde{D}_Q(s,t)\tilde{x}$ and $\tilde{x}^\top D_Q(s,t)\tilde{x}$, respectively. Then, by Lemma 3 and Slutsky's lemma, $\tilde{N}_x \rightsquigarrow N_x$ almost surely in $L^\infty[0,1]$. Hence, if $\tilde{m}_\alpha$ is the $1-\alpha$ quantile of $\sup_{t\in[0,1]}\left[\tilde{x}^\top\tilde{D}_Q(t,t)\tilde{x}\right]^{-1/2}|\tilde{N}_x(t)|$, we have $\tilde{m}_\alpha \to m_\alpha$ almost surely. However, with $A_n$ as in (A.3), $\hat{m}_\alpha = \tilde{m}_\alpha$ whenever $A_n$ holds, so that $\hat{m}_\alpha - m_\alpha = o_{P_X}(1)$ by Lemma 2.

Note that the above argument also shows that $\sup_{t\in[0,1]}\|\tilde{D}_Q(t,t) - D_Q(t,t)\|_F$ converges to zero almost surely. By the assumption that $\inf_{u\in\mathcal{I}^*} C_T(u,u) > 0$, it is clear that $\inf_{t\in[0,1]}\tilde{x}^\top D_Q(t,t)\tilde{x} > 0$. Hence, since $\tilde{D}_Q = \hat{D}_Q$ on $A_n$, we have
\[
\sup_{u\in\mathcal{I}_x}\left|\frac{\hat{C}_x(\hat{V}_x(u), \hat{V}_x(u))}{C_x(u,u)} - 1\right| = \sup_{t\in[0,1]}\left|\frac{\tilde{x}^\top\hat{D}_Q(t,t)\tilde{x}}{\tilde{x}^\top D_Q(t,t)\tilde{x}} - 1\right| = o_{P_X}(1).
\]
Then, for any $\epsilon, \eta > 0$,
\begin{align*}
\liminf_n P_X(f_\oplus(x) \in C_{\alpha,n}(x)) &= \liminf_n P_X\left(\sup_{v\in\hat{\mathcal{I}}_x}\frac{\sqrt{n}\left|\hat{V}_x^{-1}(v) - v\right|}{\hat{C}_x(v,v)^{1/2}} \le \hat{m}_\alpha\right) \\
&\ge \liminf_n P_X\left(\sup_{u\in\mathcal{I}_x}\frac{\sqrt{n}|\hat{V}_x(u) - u|}{C_x(u,u)^{1/2}} \le \hat{m}_\alpha(1-\epsilon)\right) - \limsup_n P_X\left(\inf_{u\in\mathcal{I}_x}\frac{\hat{C}_x(\hat{V}_x(u), \hat{V}_x(u))}{C_x(u,u)} < (1-\epsilon)^2\right) \\
&\ge \liminf_n P_X\left(\sup_{u\in\mathcal{I}_x}\frac{\sqrt{n}|\hat{V}_x(u) - u|}{C_x(u,u)^{1/2}} \le (m_\alpha - \eta)(1-\epsilon)\right) - \limsup_n P_X(|\hat{m}_\alpha - m_\alpha| > \eta) + o(1) \\
&= P(\zeta_x \le (m_\alpha - \eta)(1-\epsilon)) + o(1),
\end{align*}
almost surely. Letting $\epsilon, \eta \to 0$, $\liminf_n P_X(f_\oplus(x) \in C_{\alpha,n}(x)) \ge 1 - \alpha$ almost surely. A similar upper bound yields the result.

Proof of Theorem 6. Let $S = (S_1, S_2)$ be as in Lemma 1. Now, define the map $\psi_x : L^\infty_{p+1}(\bar{\mathcal{I}}^*) \times L^\infty_{p+1}(\mathcal{I}^*) \to L^\infty[0,1] \times L^\infty(0,1)$ by
\[
\psi_x(h_1, h_2) = \left(\tilde{x}^\top h_1\circ Q^*_\oplus,\;\; \tilde{x}^\top(h_2\circ Q^*_\oplus)\,q^*_\oplus\right).
\]
With $\tilde{Q}_\oplus(x)$ and $\tilde{q}_\oplus(x)$ as defined in (A.2), and $\pi$ and $\tilde{\pi}$ as in (A.1), we have $\psi_x(\tilde{\pi}, \tilde{\pi}') = (\tilde{Q}_\oplus(x), \tilde{q}_\oplus(x))$ and $\psi_x(\pi, \pi') = (Q_\oplus(x), q_\oplus(x))$. Then, by the continuous mapping theorem,
\[
\sqrt{n}\left((\tilde{Q}_\oplus(x), \tilde{q}_\oplus(x)) - (Q_\oplus(x), q_\oplus(x))\right) \mid X_1,\ldots,X_n \rightsquigarrow \psi_x(S_1, S_2) =: T_x
\]
in $L^\infty[0,1] \times L^\infty(0,1)$ almost surely. Here, $T_x(t,t') = (T_{1x}(t), T_{2x}(t'))$ is a zero-mean Gaussian process with covariance
\[
\mathrm{Cov}(T_x(s,s'), T_x(t,t')) = E\left[\left(\tilde{x}^\top\Lambda^{-1}\tilde{X}\right)\tilde{L}(s,s',t,t')\left(\tilde{x}^\top\Lambda^{-1}\tilde{X}\right)\right], \tag{A.2}
\]
where, letting $U_s = Q_\oplus(X,s)$,
\[
\tilde{L}(s,s',t,t') = \begin{pmatrix} C_T(U_s, U_t) & C_T^{(0,1)}(U_s, U_{t'})\,q_\oplus(X,t') \\ q_\oplus(X,s')\,C_T^{(1,0)}(U_{s'}, U_t) & q_\oplus(X,s')\,C_T^{(1,1)}(U_{s'}, U_{t'})\,q_\oplus(X,t') \end{pmatrix}.
\]
In particular, if $K$ and $D$ are defined as in Section 4.2, then for $s,t \in (0,1)$,
\[
\mathrm{Cov}(T_x(s,s), T_x(t,t)) = \tilde{x}^\top \otimes K(Q^*_\oplus(s), Q^*_\oplus(t)) \otimes \tilde{x} = \tilde{x}^\top \otimes D(s,t) \otimes \tilde{x}.
\]
Observe that $f_\oplus(x) = 1/[q_\oplus(x)\circ F_\oplus(x)]$ and that, whenever $\inf_{t\in(0,1)}\tilde{q}_\oplus(x,t) > 0$, $\hat{f}_\oplus(x) = 1/\left[\tilde{q}_\oplus(x)\circ\tilde{F}_\oplus(x)\right]$ is well-defined, where $\tilde{F}_\oplus(x) = \tilde{Q}_\oplus(x)^{-1}$. Moreover, by the assumption that $f^*_\oplus$ is continuously differentiable, the same holds for $f_\oplus(x)$. Now, by Lemmas 2 and 5, the delta method applies, so that
\[
\sqrt{n}(\hat{f}_\oplus(x) - f_\oplus(x)) \mid X_1,\ldots,X_n \rightsquigarrow \left[\frac{\partial}{\partial u}f_\oplus(x)\right]T_{1x}\circ F_\oplus(x) - \left[f_\oplus(x)\right]^2 T_{2x}\circ F_\oplus(x) =: F_x
\]
in $L^\infty(\mathcal{I}^\delta_x)$ almost surely. With $c_u = ((\partial/\partial u)f_\oplus(x,u), -f_\oplus(x,u)^2)^\top$ and $u_x = Q^*_\oplus\circ F_\oplus(x,u)$, the covariance of $F_x$ is
\[
\mathrm{Cov}(F_x(u), F_x(v)) = c_u^\top\,\mathrm{Cov}(T_x(F_\oplus(x,u), F_\oplus(x,u)), T_x(F_\oplus(x,v), F_\oplus(x,v)))\,c_v = c_u^\top\left[\tilde{x}^\top \otimes K(u_x, v_x) \otimes \tilde{x}\right]c_v.
\]

Proof of Corollary 4. The proof is similar to that of Corollary 3, using the additional assumptions in Theorem 5, and is omitted.

Proof of Corollary 5. By (T1), for any $\delta > 0$, $\inf_{u\in[\delta,1-\delta]} C_T(u,u) > 0$. The remainder of the proof follows in the same way as that of Corollary 3.

Proof of Corollary 6. Under the assumption of fixed support, Theorem 6 can be adapted to show that $\sqrt{n}(\hat{f}_\oplus(x) - f_\oplus(x)) \mid X_1,\ldots,X_n \rightsquigarrow V'_x$ in $L^\infty(0,1)$ almost surely. Then, since $\inf_{u\in(0,1)} R_x(u,u) > 0$, one can follow the steps of the proof of Corollary 3 to show the coverage result.

A.7. Auxiliary Lemmas.

Lemma 1. Assume (T1)-(T4) hold. Then there exists a zero-mean Gaussian process $S$ on $L^\infty_{p+1}(\bar{\mathcal{I}}^*) \times L^\infty_{p+1}(\mathcal{I}^*)$ such that
\[
\sqrt{n}\left(\begin{pmatrix}\tilde{\pi} \\ \tilde{\pi}'\end{pmatrix} - \begin{pmatrix}\pi \\ \pi'\end{pmatrix}\right) \Big| X_1,\ldots,X_n \rightsquigarrow S \quad \text{almost surely}.
\]
Furthermore, for $u,v \in \bar{\mathcal{I}}^*$ and $u',v' \in \mathcal{I}^*$,
\[
\mathrm{Cov}(S(u,u'), S(v,v')) = E\left[\left(\Lambda^{-1}\tilde{X}\right) \otimes L(u,u',v,v') \otimes \left(\Lambda^{-1}\tilde{X}\right)^\top\right],
\]
where
\[
L(u,u',v,v') = \begin{pmatrix} C_T(S(u), S(v)) & C_T^{(0,1)}(S(u), S(v'))\,S'(v') \\ S'(u')\,C_T^{(1,0)}(S(u'), S(v)) & S'(u')\,C_T^{(1,1)}(S(u'), S(v'))\,S'(v') \end{pmatrix}.
\]

Proof. The proof follows from a multivariate extension (see Problem 1.5.3 of Van der Vaart and Wellner (1996)) of Example 2.11.13 in Van der Vaart and Wellner (1996), which is itself a generalization of a result by Jain and Marcus (1975). Letting $\hat{\Lambda} = n^{-1}\sum_{i=1}^n \tilde{X}_i\tilde{X}_i^\top$, for $(u,u') \in \bar{\mathcal{I}}^* \times \mathcal{I}^*$, set
\[
Z_{ni}(u,u') = n^{-1/2}\left(\hat{\Lambda}^{-1}\tilde{X}_i\right) \otimes (R_i(u), R_i'(u'))^\top.
\]
Then
\[
\sqrt{n}\begin{pmatrix}\tilde{\pi}(u) \\ \tilde{\pi}'(u')\end{pmatrix} = \sum_{i=1}^n Z_{ni}(u,u'), \qquad \sqrt{n}\begin{pmatrix}\pi(u) \\ \pi'(u')\end{pmatrix} = \sum_{i=1}^n E_X\left(Z_{ni}(u,u')\right).
\]
Using assumptions (T2) and (T4), one can derive the bounds
\begin{align*}
|R_i(u) - R_i(v)| &\le C_1\left(\sup_{u\in\mathbb{R}}|T_i'(u)|\right)\|\tilde{X}_i\|_E\,|u - v|, \\
|R_i'(u') - R_i'(v')| &\le C_2\left[L_i\|\tilde{X}_i\|_E + \sup_{u\in\mathbb{R}}|T_i'(u)|\right]\|\tilde{X}_i\|_E\,|u' - v'|, \tag{A.1}
\end{align*}
where $L_i$ is the Lipschitz constant of $T_i'$ and $C_1, C_2$ are constants. Hence, setting
\[
M_{ni} = n^{-1/2}\|\hat{\Lambda}^{-1}\|_F\|\tilde{X}_i\|_E\left[\|\tilde{X}_i\|_E L_i + \sup_{u\in\mathbb{R}}|T_i'(u)|\right],
\]
we obtain $\|Z_{ni}(u,u') - Z_{ni}(v,v')\|_E \le CM_{ni}\|(u,u') - (v,v')\|_E$ for some $C > 0$. By (T3),
\[
\sum_{i=1}^n E_X(M_{ni}^2) = \frac{1}{n}\|\hat{\Lambda}^{-1}\|_F^2\sum_{i=1}^n \|\tilde{X}_i\|_E^2\,E\left\{\left[\|\tilde{X}_i\|_E L_i + \sup_{u\in\mathbb{R}}|T_i'(u)|\right]^2 \,\Big|\, X_i\right\} \le O(1)\cdot\frac{1}{n}\sum_{i=1}^n E\left\{\left[\|\tilde{X}_i\|_E^2 L_i + \|\tilde{X}_i\|_E\sup_{u\in\mathbb{R}}|T_i'(u)|\right]^2 \,\Big|\, X_i\right\} = O(1)
\]
almost surely.

Next, we verify the uniform Lindeberg condition. By monotonicity of $T$ and assumptions (T1) and (T4),
\[
\sup_{u\in\bar{\mathcal{I}}^*}|R_i(u)| \le \sup_{u\in\mathbb{R}} C_T(u,u)^{1/2} + C\|\tilde{X}_i\|_E, \qquad \sup_{u'\in\mathcal{I}^*}|R_i'(u')| \le C\|\tilde{X}_i\|_E\sup_{u'\in\mathcal{I}^*}|T_i'(u')|.
\]
Then, by (T4),
\[
E\left(\|\tilde{X}\|_E^2\left[\sup_{u\in\bar{\mathcal{I}}^*}|R(u)| + \sup_{u'\in\mathcal{I}^*}|R'(u')|\right]^2\right) < \infty. \tag{A.2}
\]
By a Borel-Cantelli argument, $\max_{1\le i\le n}\|\tilde{X}_i\|_E = o(n^{1/2})$ almost surely. Hence, for any $\epsilon, M > 0$, almost surely for large $n$ we will have
\[
\max_{1\le i\le n}\|\hat{\Lambda}^{-1}\tilde{X}_i\|_E < \frac{n^{1/2}\epsilon}{M}.
\]
Hence, by the law of large numbers,Cov X (cid:32) n (cid:88) i =1 Z ni ( u, u (cid:48) ) , n (cid:88) k =1 Z nk ( v, v (cid:48) ) (cid:33) = (cid:34) n n (cid:88) i =1 (cid:16) ˆΛ − ˜ X i (cid:17) ⊗ (cid:18) Cov( R i ( u ) , R i ( v ) | X i ) Cov( R i ( u ) , R (cid:48) i ( v (cid:48) ) | X i )Cov( R (cid:48) i ( u (cid:48) ) , R i ( v ) | X i ) Cov( R (cid:48) i ( u (cid:48) ) , R (cid:48) i ( v (cid:48) ) | X i ) (cid:19) ⊗ (cid:16) ˆΛ − ˜ X i (cid:17) (cid:62) (cid:35) → E (cid:34)(cid:16) Λ − ˜ X (cid:17) ⊗ (cid:32) C T ( S ( u ) , S ( v )) C (0 , T ( S ( u ) , S ( v (cid:48) )) S (cid:48) ( v (cid:48) ) S (cid:48) ( u (cid:48) ) C (1 , T ( S ( u (cid:48) ) , S ( v )) S (cid:48) ( u (cid:48) ) C (1 , T ( S (cid:48) ( u (cid:48) ) , S (cid:48) ( v (cid:48) )) S (cid:48) ( v (cid:48) ) (cid:33) ⊗ (cid:16) Λ − ˜ X (cid:17) (cid:62) (cid:35) almost surely. ASSERSTEIN INFERENCE Lemma . For any x ∈ R p , define ˜ q ⊕ ( x, t ) = n − (cid:80) ni =1 s in ( x ) q i ( t ) . Then, under (T1)–(T4), P X (cid:18) inf t ∈ (0 , ˜ q ⊕ ( x, t ) > (cid:19) , P X (cid:18) min ≤ i ≤ n inf t ∈ (0 , ˜ q ⊕ ( X i , t ) > (cid:19) both converge to 1 almost surely. Proof. Observe that ˜ q ⊕ ( x, t ) = ˜ x (cid:62) ˜ π (cid:48) ( Q ∗⊕ ( t )) q ∗⊕ ( t ) and q ⊕ ( x, t ) = ˜ x (cid:62) π (cid:48) ( Q ∗⊕ ( t )) q ∗⊕ ( t ) . With M as in (T4), inf t ∈ (0 , ˜ q ⊕ ( x, t ) > M − − sup t ∈ (0 , | ˜ q ⊕ ( x, t ) − q ⊕ ( x, t ) | . But, by (T4) and Lemma 1,sup t ∈ (0 , | ˜ q ⊕ ( x, t ) − q ⊕ ( x, t ) | ≤ (cid:107) ˜ x (cid:107) E (cid:34) sup t ∈ (0 , q ∗⊕ ( t ) (cid:35) sup u ∈I ∗ (cid:107) ˜ π (cid:48) ( u ) − π (cid:48) ( u ) (cid:107) E = O P X ( n − / ) = o P X (1) , proving the first claim. Since E ( (cid:107) X (cid:107) E ) < ∞ , a Borel-Cantelli argumentshows that max ≤ i ≤ n (cid:107) ˜ X i (cid:107) E = o ( n / ) = o P X ( n / ) almost surely. Hence,min ≤ i ≤ n inf t ∈ (0 , ˜ q ⊕ ( X i , t ) > M − − max ≤ i ≤ n (cid:107) ˜ X i (cid:107) E (cid:34) sup t ∈ (0 , q ∗⊕ ( t ) (cid:35) sup u ∈I ∗ (cid:107) ˜ π (cid:48) ( u ) − π (cid:48) ( u ) (cid:107) E = M − − o P X ( n / ) O P X ( n − / ) = M − − o P X (1) , proving the second claim. Lemma . Let T ⊂ R be a bounded set, and suppose M , M n , arezero-mean (possibly multivariate) Gaussian process on L ∞ ( T ) , with non-degenerate covariances functions C and C n , respectively. Suppose thati) C is uniformly continuous,ii) C n → C pointwise, andiii) | C n ( s, t ) − C n ( s, t (cid:48) ) | = O ( | t − t (cid:48) | ) , uniformly in s, t, t (cid:48) . Then M n → M in L ∞ ( T ) . Proof. We show the result for the univariate case. The extension tomultivariate processes is straightforward. Clearly, M n converges marginallyto M , so it remains to show tightness.For any η > , choose (cid:15) > | C ( s, t ) − C ( s, t (cid:48) ) | < η whenever | t − t (cid:48) | < (cid:15). Let t , . . . , t L ∈ T be such that, for any t ∈ T , | t − t l | < (cid:15) forsome l = 1 , . . . , L = O ( (cid:15) − ) . For any l = 1 , . . . , L and t ∈ T , let t m satisfy | t − t m | < (cid:15). Then | C n ( t l , t ) − C ( t l , t ) | ≤ O ( (cid:15) ) + max m =1 ,...,L | C n ( t l , t m ) − C ( t l , t m ) | + η. PETERSEN, LIU AND DIVANI It follows that max l =1 ,...,L sup t ∈T | C n ( t l , t ) − C ( t l , t ) | = o (1) . A similar boundyields sup s,t | C n ( s, t ) − C n ( s, t ) | = o (1) . 
For $s,t \in T$, let
\[
\rho(s,t) = [C(s,s) + C(t,t) - 2C(s,t)]^{1/2}, \qquad \rho_n(s,t) = [C_n(s,s) + C_n(t,t) - 2C_n(s,t)]^{1/2}
\]
be the standard deviation metrics of $\mathcal{M}$ and $\mathcal{M}_n$, respectively, so that $\sup_{s,t}|\rho_n(s,t) - \rho(s,t)| = o(1)$. For $\delta > 0$, let $n$ be large enough that $\sup_{s,t\in T}|\rho_n(s,t) - \rho(s,t)| < \delta/2$ and $\sup_{|s-t|<\epsilon}\rho_n(s,t) < D\epsilon^{1/2}$ for some $D > 0$. Then, using Corollary 2.2.8 of Van der Vaart and Wellner (1996),
\[
E\left(\sup_{\rho(s,t)<\delta/2}|\mathcal{M}_n(s) - \mathcal{M}_n(t)|\right) \le E\left(\sup_{\rho_n(s,t)<\delta}|\mathcal{M}_n(s) - \mathcal{M}_n(t)|\right) = O\left(\int_0^\delta \sqrt{-\log(\tau)}\,\mathrm{d}\tau\right) = o(1)
\]
as $\delta \to 0$. Thus, $\mathcal{M}_n$ is asymptotically $\rho$-equicontinuous in probability, and thus asymptotically tight.

Lemma 4. For any $\epsilon > 0$ and some $R_0 > 0$, let $r_{jn}$, $j = 1,2,3$, be as defined in the proof of Theorem 2. Then, under the assumptions of Theorem 2, there exists $R_0 > 0$ such that $r_{jn} = o(1)$, $j = 1,2,3$.

Proof. Begin with $j = 1$, and define $W$, $\hat{W}_n$ as in (A.5). Then
\[
\left|\int_{\mathcal{I}^*}(\hat{\rho}(u)-\rho(u))^\top\hat{\Sigma}\rho(u)f^*_\oplus(u)\,\mathrm{d}u\right| \le \hat{W}_n^{1/2}\left[\int_{\mathcal{I}^*}(\hat{\rho}(u)-\rho(u))^\top\hat{\Sigma}(\hat{\rho}(u)-\rho(u))f^*_\oplus(u)\,\mathrm{d}u\right]^{1/2}.
\]
Next, a direct calculation gives $E_X(\hat{\rho}(u)) = \rho(u)$ and
\[
\mathrm{Var}_X(\hat{\rho}(u)) = \hat{\Sigma}^{-1}\left(\frac{1}{n^2}\sum_{i=1}^n (X_i-\bar{X})(X_i-\bar{X})^\top C_T(S_i(u), S_i(u))\right)\hat{\Sigma}^{-1},
\]
so that, with $C = \sup_{\mathcal{G}\in G^*}\sup_{u\in\mathcal{I}^*} C_T(u,u) < \infty$,
\begin{align*}
E_X\left[\left\{\int_{\mathcal{I}^*}(\hat{\rho}(u)-\rho(u))^\top\hat{\Sigma}\rho(u)f^*_\oplus(u)\,\mathrm{d}u\right\}^2\right] &\le \hat{W}_n\,\mathrm{tr}\left(\hat{\Sigma}\int_{\mathcal{I}^*}\mathrm{Var}_X(\hat{\rho}(u))f^*_\oplus(u)\,\mathrm{d}u\right) \\
&= \frac{\hat{W}_n}{n^2}\sum_{i=1}^n \left\{\mathrm{tr}\left[(X_i-\bar{X})(X_i-\bar{X})^\top\hat{\Sigma}^{-1}\right]\int_{\mathcal{I}^*} C_T(S_i(u), S_i(u))f^*_\oplus(u)\,\mathrm{d}u\right\} \\
&\le \frac{C\hat{W}_n}{n^2}\,\mathrm{tr}\left(\hat{\Sigma}^{-1}\sum_{i=1}^n (X_i-\bar{X})(X_i-\bar{X})^\top\right) = \frac{pC\hat{W}_n}{n}.
\end{align*}
Thus, by Chebyshev's inequality,
\[
P_X\left(\left|\int_{\mathcal{I}^*}(\hat{\rho}(u)-\rho(u))^\top\hat{\Sigma}\rho(u)f^*_\oplus(u)\,\mathrm{d}u\right| > \frac{W}{8}\right) \le O(1)\cdot\frac{\hat{W}_n}{nW^2} = o\left(\frac{\hat{W}_n}{W}\right),
\]
where the $o(\cdot)$ term is uniform over $\mathcal{G} \in G_{A,n}$ as $nW \to \infty$ uniformly. Hence, choosing $\eta > 0$ such that $\eta < \epsilon/6$, for large $n$ we will have
\[
P\left(\sup_{m\ge n} P_X\left(\left|\int_{\mathcal{I}^*}(\hat{\rho}(u)-\rho(u))^\top\hat{\Sigma}\rho(u)f^*_\oplus(u)\,\mathrm{d}u\right| > \frac{W}{8}\right) > \frac{\epsilon}{3}\right) \le P\left(\sup_{m\ge n}\hat{W}_m > \frac{W\epsilon}{3\eta}\right) \le P\left(\sup_{m\ge n}|\hat{W}_m - W| > W\right). \tag{A.3}
\]
Next, write
\[
\hat{W}_n - W = \int_{u:\,\rho(u)\ne 0} \rho(u)^\top\Sigma\rho(u)\left[\frac{\rho(u)^\top\hat{\Sigma}\rho(u)}{\rho(u)^\top\Sigma\rho(u)} - 1\right]f^*_\oplus(u)\,\mathrm{d}u.
\]
An elementary calculation combined with Lemma 4.3 of Bosq (2000) gives
\[
\sup_{u:\,\rho(u)\ne 0}\left|\frac{\rho(u)^\top\hat{\Sigma}\rho(u)}{\rho(u)^\top\Sigma\rho(u)} - 1\right| \le C\|\hat{\Sigma} - \Sigma\|_F,
\]
for some $C$ depending only on the distribution of $X$. As a consequence, $|\hat{W}_n - W| \le CW\|\hat{\Sigma} - \Sigma\|_F$. Then, for any $\epsilon > 0$, choose $\eta$ and $n$ so that (A.3) holds.
For such $n$,
\begin{align*}
P\left(\sup_{m\ge n} P_X\left(\Delta_m > \frac{W}{2}\right) > \epsilon\right) &\le P\left(\sup_{m\ge n} P_X\left(\left|\int_{\mathcal{I}^*}(\hat{\rho}(u)-\rho(u))^\top\hat{\Sigma}\rho(u)f^*_\oplus(u)\,\mathrm{d}u\right| > \frac{W}{8}\right) > \frac{\epsilon}{3}\right) + P\left(\sup_{m\ge n}|\hat{W}_m - W| > \frac{W}{4}\right) \\
&\le 2P\left(\sup_{m\ge n}|\hat{W}_m - W| > \frac{W}{4}\right) \le 2P\left(\sup_{m\ge n}\|\hat{\Sigma} - \Sigma\|_F > \frac{1}{4C}\right) = o(1)
\end{align*}
uniformly in $G_{A,n}$, since the distribution of $X$ is fixed and $\|\hat{\Sigma} - \Sigma\|_F \to 0$ almost surely. Hence $r_{1n} = o(1)$.

Next, consider $j = 2$. For any $\delta > 0$, let $\{u_k\}_{k=1}^K$ be a $\delta$-covering of $\mathcal{I}^*$, where $K = O(\delta^{-1})$. Observe that, by (G3) and (G4),
\[
\tilde{x}^\top\pi'\circ Q^*_\oplus(t) = q_\oplus(x,t)/q^*_\oplus(t) \ge M_1M_2^{-1} =: M_0 > 0
\]
for all $x$ and all $t \in (0,1)$, so that whenever
\[
\max_{1\le i\le n}\sup_{u\in\mathcal{I}^*}\left|\tilde{X}_i^\top(\tilde{\pi}'(u) - \pi'(u))\right| < M_0,
\]
the event $A_n$ in (A.3) will hold. Thus, we obtain the bound
\[
P_X(A_n^c) \le P_X\left(\max_{1\le i\le n}\max_k\left|\tilde{X}_i^\top(\tilde{\pi}'(u_k) - \pi'(u_k))\right| \ge \frac{M_0}{2}\right) + P_X\left(\max_{1\le i\le n}\sup_{|u-v|<\delta}\left|\tilde{X}_i^\top(\tilde{\pi}'(u) - \tilde{\pi}'(v))\right| \ge \frac{M_0}{2}\right).
\]
Let $C_1 = \sup_{\mathcal{G}\in G^*}\sup_{u\in\mathbb{R}} C_T^{(1,1)}(u,u) < \infty$. By (G1), there is a constant $C_2 < \infty$ such that $\sup_{u\in\mathcal{I}^*}\|\pi'(u)\|_E < C_2$ uniformly in $G^*$. Since $S_i(u) = \tilde{X}_i^\top\pi(u)$, $|S_i'(u)| \le C_2\|\tilde{X}_i\|_E$. Furthermore, $\max_{1\le i\le n}\|\tilde{X}_i\|_E = o(n^{1/2})$ and $\|\hat{\Lambda}^{-1}\|_F = O(1)$ almost surely. Then, for any $\eta > 0$ and $B > 0$, for large enough $n$ we will have
\begin{align*}
P_X\left(\max_{1\le i\le n}\max_k\left|\tilde{X}_i^\top(\tilde{\pi}'(u_k) - \pi'(u_k))\right| \ge \frac{M_0}{2}\right) &\le \left(\frac{2}{M_0}\right)^2 K\max_k E_X\left[\max_{1\le i\le n}\left|\tilde{X}_i^\top(\tilde{\pi}'(u_k) - \pi'(u_k))\right|^2\right] \\
&\le \left(\frac{2}{M_0}\right)^2 K\max_k\mathrm{tr}\left[E_X\left(\hat{\Lambda}(\tilde{\pi}'(u_k) - \pi'(u_k))(\tilde{\pi}'(u_k) - \pi'(u_k))^\top\hat{\Lambda}\right)\right]\left(\max_{1\le i\le n}\|\hat{\Lambda}^{-1}\tilde{X}_i\|_E^2\right) \\
&= O\left(KB^2\eta\, n^{1/2}\right)\max_k\left[\frac{1}{n^2}\sum_{i=1}^n \|\tilde{X}_i\|_E^2\,C_T^{(1,1)}(S_i(u_k), S_i(u_k))(S_i'(u_k))^2\right] \\
&= O\left(C_1C_2^2K\eta\, n^{-1/2}\right)\left(\frac{1}{n}\sum_{i=1}^n \|\tilde{X}_i\|_E^4\right) = O\left(\eta\,\delta^{-1}n^{-1/2}\right) \tag{A.4}
\end{align*}
almost surely, where the $O(\cdot)$ term depends only on the distribution of $X$, and is thus uniform over $G^*$. Next, define
\[
\nu_n = \frac{1}{n}\sum_{i=1}^n \|\tilde{X}_i\|_E^2\left\{E(L_i \mid X_i)\|\tilde{X}_i\|_E + E\left(\sup_{u\in\mathbb{R}}|T_i'(u)| \,\Big|\, X_i\right)\right\}.
\]
By (G1), there exists $C_3 < \infty$ such that $\sup_{u\in\mathcal{I}^*}\|\pi''(u)\|_E < C_3$ uniformly in $G^*$. Then, using (A.1),
\[
\|\tilde{\pi}'(u) - \tilde{\pi}'(v)\|_E \le C(1 + C_3)\|\hat{\Lambda}^{-1}\|_F\,\nu_n\,|u - v|.
\]
Hence, for large enough $n$,
\[
P_X\left(\max_{1\le i\le n}\sup_{|u-v|<\delta}\left|\tilde{X}_i^\top(\tilde{\pi}'(u) - \tilde{\pi}'(v))\right| \ge \frac{M_0}{2}\right) \le C(1 + C_3)M_0^{-1}\delta\|\hat{\Lambda}^{-1}\|_F\left(\max_{1\le i\le n}\|\tilde{X}_i\|_E\right)\nu_n = O\left(\delta\eta\, n^{1/2}\nu_n\right), \tag{A.5}
\]
where the $O(\cdot)$ term is again uniform over $G^*$. Finally, by (G2) and Proposition A.5.1 of Van der Vaart and Wellner (1996),
\[
\limsup_{M\to\infty}\limsup_{n\to\infty}\sup_{\mathcal{G}\in G^*} P\left(\sup_{m\ge n}\nu_m > M\right) \to 0.
\]
Setting $\delta = n^{-1/2}$, (A.4) and (A.5) imply that, for any $M > 0$ and large $n$,
\[
P\left(\sup_{m\ge n} P_X(A_m^c) > \epsilon\right) \le P\left(\sup_{m\ge n} O\left(\eta\, m^{-1/2}\right) > \epsilon\right) + P\left(\sup_{m\ge n} O(M\eta) > \epsilon\right) + P\left(\sup_{m\ge n}\nu_m > M\right).
\]
The last term can be made arbitrarily small uniformly in $G^*$ by choosing $M$ large. Then, one can choose $\eta$ small and $n$ large so that the first and second terms vanish for any $\mathcal{G} \in G^*$. Hence $r_{2n} = o(1)$.

Lastly, consider $j = 3$ and let $R_0 > 0$. Let $\hat{H}$ be the cdf of $W_n = \sum_{j=1}^\infty \hat{\lambda}_j\omega_j$ (recall that $\omega_j \overset{\text{i.i.d.}}{\sim} \chi^2_p$) conditional on the data, so that $\hat{H}(\hat{b}^G_\alpha) = 1 - \alpha$. Then $\hat{b}^G_\alpha > R_0$ implies that $\hat{H}(R_0) < 1 - \alpha$. Denote by $P_\omega$ the probability measure of the $\omega_j$, treating the observed data as constant. Then
\[
1 - \hat{H}(R_0) = P_\omega\left(\sum_{j=1}^\infty \hat{\lambda}_j\omega_j > R_0\right) \le pR_0^{-1}\sum_{j=1}^\infty \hat{\lambda}_j \le pR_0^{-1}\int_0^1 \hat{C}_Q(t,t)\,\mathrm{d}t,
\]
where we have used the identity $\sum_{j=1}^\infty \hat{\lambda}_j = \int_0^1 \hat{C}_Q(t,t)\,\mathrm{d}t$. Set $\tilde{C}_Q(t,t) = n^{-1}\sum_{i=1}^n (Q_i(t) - \tilde{Q}_\oplus(X_i,t))^2$, so that $\hat{C}_Q(t,t) = \tilde{C}_Q(t,t)$ whenever $A_n$ holds. Whenever $\|\hat{\Lambda}^{-1}\|_F \le B$ and $n^{-1}\sum_{i=1}^n \|\tilde{X}_i\|_E^4 \le B$, using (G1) one can derive the bound
\[
E_X\left[\int_0^1 \tilde{C}_Q(t,t)\,\mathrm{d}t\right] \le D\left[\frac{n-1}{n} + \frac{B(1+B)}{n}\right] = O(1) \tag{A.6}
\]
uniformly over $G^*$. Thus, for large $n$, the above is bounded by some constant $C_0$, so that
\begin{align*}
P\left(\sup_{m\ge n} P_X(\hat{b}^G_\alpha > R_0) > \epsilon\right) &\le P\left(\sup_{m\ge n} P_X\left(\int_0^1 \tilde{C}_Q(t,t)\,\mathrm{d}t > R_0\,p^{-1}\alpha\right) > \frac{\epsilon}{2}\right) + P\left(\sup_{m\ge n} P_X(A_m^c) > \frac{\epsilon}{2}\right) \\
&\le \mathbf{1}\left(C_0 > R_0\,p^{-1}\alpha\,\frac{\epsilon}{2}\right) + P\left(\sup_{m\ge n}\|\hat{\Lambda}^{-1}\|_F > B\right) + P\left(\sup_{m\ge n}\frac{1}{n}\sum_{i=1}^n \|\tilde{X}_i\|_E^4 > B\right) + r_{2n}, \tag{A.7}
\end{align*}
which is uniform over $G^*$ since the distribution of $X$ is fixed. By choosing $R_0$ large, the first term vanishes. The second and third terms can also be made small, uniformly in $G^*$, since $\|\hat{\Lambda}^{-1}\|_F$ and $n^{-1}\sum_{i=1}^n \|\tilde{X}_i\|_E^4$ are both $O(1)$ almost surely. Thus, $r_{3n} \to 0$.

Lemma 5. Let $E = L^\infty[0,1] \times L^\infty(0,1)$, $E_0 = C[0,1] \times C_b(0,1)$, and
\[
E_\phi = \left\{(h_1, h_2) \in E : \inf_{t\in(0,1)} h_2(t) > 0,\; h_1 \text{ strictly increasing},\; \mathcal{I}^\delta_x \subset (h_1(0), h_1(1))\right\}.
\]
Define the map $\phi : E_\phi \to L^\infty(\mathcal{I}^\delta_x)$ as $\phi(h_1, h_2) = 1/(h_2\circ h_1^{-1})$. Then, provided $\sup_{u\in\mathcal{I}^*}|f^{*\prime}_\oplus(u)| < \infty$, $\phi$ is Hadamard differentiable at $(Q_\oplus(x), q_\oplus(x)) \in E_\phi$, tangentially to $E_0$, with
\[
\phi'_{(Q_\oplus(x), q_\oplus(x))}(h_1, h_2) = \left[\frac{\partial}{\partial u}f_\oplus(x)\right]h_1\circ F_\oplus(x) - f_\oplus(x)^2\,h_2\circ F_\oplus(x). \tag{A.8}
\]

Proof. First, define $F_\phi = \{h \in L^\infty(\mathcal{I}^\delta_x) : \inf_{u\in\mathcal{I}^\delta_x} h(u) > 0\}$ and $F_0 = C(\mathcal{I}^\delta_x)$. Define $\phi_1 : E_\phi \to F_\phi$ by $\phi_1(h_1, h_2) = h_2\circ h_1^{-1}$, and $\phi_2 : F_\phi \to L^\infty(\mathcal{I}^\delta_x)$ by $\phi_2(h) = 1/h$. We will show that each map is Hadamard differentiable, then apply the chain rule.

For $(h_1, h_2) \in E_0$, let $r_k$ be a real-valued sequence converging to zero. Furthermore, define a sequence $(h_{1k}, h_{2k}) \in E$ such that $h_{jk}$ converges uniformly to $h_j$, and
\[
(g_{1k}, g_{2k}) = (Q_\oplus(x), q_\oplus(x)) + r_k(h_{1k}, h_{2k}) \in E_\phi
\]
for all $k$. Then, for any $u \in \mathcal{I}^\delta_x$, there exists $t_k(u)$ between $g_{1k}^{-1}(u)$ and $F_\oplus(x,u)$ such that
\[
\phi_1(g_{1k}, g_{2k})(u) - \phi_1(Q_\oplus(x), q_\oplus(x))(u) = r_k\,h_{2k}\circ g_{1k}^{-1}(u) + \frac{\partial}{\partial t}q_\oplus(x, t_k(u))\left(g_{1k}^{-1}(u) - F_\oplus(x,u)\right). \tag{A.9}
\]
Define $t_k = g_{1k}^{-1}(u)$.
Then there exists $u_k$ between $Q_\oplus(x, t_k)$ and $u$ such that
\[
g_{1k}^{-1}(u) - F_\oplus(x,u) = -\frac{r_k}{q_\oplus(x)\circ F_\oplus(x,u_k)}\,h_{1k}(t_k).
\]
Since $\inf_{t\in(0,1)} q_\oplus(x,t) > 0$, we have $\sup_{u\in\mathcal{I}^\delta_x}|g_{1k}^{-1}(u) - F_\oplus(x,u)| \to 0$. As a consequence,
\[
\sup_{u\in\mathcal{I}^\delta_x}\left|h_{2k}\circ g_{1k}^{-1}(u) - h_2\circ F_\oplus(x,u)\right| \le \sup_{t\in[0,1]}|h_{2k}(t) - h_2(t)| + \sup_{u\in\mathcal{I}^\delta_x}\left|h_2\circ g_{1k}^{-1}(u) - h_2\circ F_\oplus(x,u)\right|. \tag{A.10}
\]
The first term goes to zero by assumption on $h_{2k}$, while the second converges to zero since $h_2$ is uniformly continuous on any compact subset of $[0,1]$. Next, since $q^*_\oplus$ is continuously differentiable, it follows that $(\partial/\partial t)q_\oplus(x,\cdot)$ is continuous on $(0,1)$. Then, since $|t_k(u) - F_\oplus(x,u)| \le |g_{1k}^{-1}(u) - F_\oplus(x,u)|$, we have that
\[
\frac{\partial}{\partial t}q_\oplus(x, t_k(u)) \to \frac{\partial}{\partial t}q_\oplus(x, F_\oplus(x,u)) = -\frac{(\partial/\partial u)f_\oplus(x,u)}{f_\oplus(x,u)^3}
\]
uniformly. Finally, by continuity of $h_1$ and $q_\oplus(x)$, one can show that
\[
r_k^{-1}\left(g_{1k}^{-1}(u) - F_\oplus(x,u)\right) \to \frac{h_1\circ F_\oplus(x,u)}{q_\oplus(x)\circ F_\oplus(x,u)} = f_\oplus(x,u)\,h_1\circ F_\oplus(x,u)
\]
uniformly. Plugging these results into (A.9), we find that $\phi_1$ is differentiable at $(Q_\oplus(x), q_\oplus(x)) \in E_\phi$, tangentially to $E_0$, with
\begin{align*}
\phi'_{1,(Q_\oplus(x), q_\oplus(x))}(h_1, h_2) &= \lim_{k\to\infty}\frac{\phi_1(g_{1k}, g_{2k}) - \phi_1(Q_\oplus(x), q_\oplus(x))}{r_k} = \left[h_2 + \frac{(\partial/\partial t)q_\oplus(x)}{q_\oplus(x)}h_1\right]\circ F_\oplus(x) \\
&= h_2\circ F_\oplus(x) - \frac{(\partial/\partial u)f_\oplus(x)}{f_\oplus(x)^2}h_1\circ F_\oplus(x). \tag{A.11}
\end{align*}
Using similar methods, one can show that $\phi_2$ is Hadamard differentiable at $\phi_1(Q_\oplus(x), q_\oplus(x)) = q_\oplus(x)\circ F_\oplus(x) \in F_\phi$, tangentially to $F_0$, with
\[
\phi'_{2,\,q_\oplus(x)\circ F_\oplus(x)}(h) = -h\,f_\oplus(x)^2. \tag{A.12}
\]
Since $\phi_1(E_0) \subset F_0$, the result holds by the chain rule.

References.

Agueh, M. and Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis.
Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall, Ltd.
Al-Mufti, F., Thabet, A. M., Singh, T., El-Ghanem, M., Amuluru, K. and Gandhi, C. D. (2018). Clinical and radiographic predictors of intracerebral hemorrhage outcome. Interventional Neurology.
Ambrosio, L., Gigli, N. and Savaré, G. (2008). Gradient Flows in Metric Spaces and in the Space of Probability Measures. Springer.
Barber, R. F., Reimherr, M., Schill, T. et al. (2017). The function-on-scalar LASSO with applications to longitudinal GWAS. Electronic Journal of Statistics.
Barden, D., Le, H. and Owen, M. (2013). Central limit theorems for Fréchet means in the space of phylogenetic trees. Electronic Journal of Probability.
Barras, C. D., Tress, B. M., Christensen, S., MacGregor, L., Collins, M., Desmond, P. M., Skolnick, B. E., Mayer, S. A., Broderick, J. P., Diringer, M. N. et al. (2009). Density and shape as CT predictors of intracerebral hemorrhage growth. Stroke.
Bigot, J., Gouet, R., Klein, T. and López, A. (2017). Geodesic PCA in the Wasserstein space by convex PCA. Annales de l'Institut Henri Poincaré B: Probability and Statistics.
Bolstad, B. M., Irizarry, R. A., Åstrand, M. and Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics.
Bosq, D. (2000). Linear Processes in Function Spaces: Theory and Applications. Springer-Verlag, New York.
Boulouis, G., Morotti, A., Brouwers, H. B., Charidimou, A., Jessel, M. J., Auriel, E., Pontes-Neto, O., Ayres, A., Vashkevich, A., Schwab, K. M. et al. (2016).
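The derivative formula (A.8) can also be checked numerically. The following Python sketch is an illustration only: the standard normal reference law, the directions h1 and h2, the evaluation window, and the step size eps are arbitrary choices made for this example, not objects from the paper. It compares a finite-difference approximation of $\varphi$ along the path $(Q_\oplus(x) + \varepsilon h_1, q_\oplus(x) + \varepsilon h_2)$ with the right-hand side of (A.8).

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Illustrative reference law: standard normal, so that
# phi(Q, q)(u) = 1 / (q o Q^{-1})(u) recovers the density f(u).
f, F, Q = norm.pdf, norm.cdf, norm.ppf
fprime = lambda u: -u * norm.pdf(u)       # f'(u) for the standard normal
q = lambda t: 1.0 / f(Q(t))               # quantile density q = 1 / (f o Q)

# Arbitrary smooth perturbation directions (hypothetical choices).
h1 = lambda t: np.sin(np.pi * t)          # perturbs the quantile function
h2 = lambda t: t * (1.0 - t)              # perturbs the quantile density

def phi(u, eps):
    # phi(Q + eps*h1, q + eps*h2)(u) = 1 / ((q + eps*h2)((Q + eps*h1)^{-1}(u)))
    t = brentq(lambda s: Q(s) + eps * h1(s) - u, 1e-12, 1.0 - 1e-12)
    return 1.0 / (q(t) + eps * h2(t))

eps = 1e-5
for u in np.linspace(-1.5, 1.5, 7):       # a compact window, as in I_x^delta
    fd = (phi(u, eps) - phi(u, 0.0)) / eps
    exact = -fprime(u) * h1(F(u)) - f(u) ** 2 * h2(F(u))   # formula (A.8)
    print(f"u = {u:+.2f}   finite diff: {fd:+.6f}   (A.8): {exact:+.6f}")
```

Up to the $O(\varepsilon)$ finite-difference error, the two columns coincide; flipping either sign in (A.8) destroys the agreement.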
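Similarly, the Markov-type bound $1 - \hat H(R) \le pR^{-1} \sum_{j=1}^\infty \hat\lambda_j$ used in the case $j = 3$ can be visualized by simulation. In the sketch below, the eigenvalue sequence lam and the degrees of freedom p are hypothetical stand-ins for $\hat\lambda_j$ and the predictor dimension; neither is taken from the paper.

```python
import numpy as np

# Hypothetical stand-ins: a summable eigenvalue sequence and chi-square df.
lam = 1.0 / np.arange(1, 51) ** 2          # plays the role of lambda_hat_j
p = 3                                      # omega_j ~ chi^2 with p df

rng = np.random.default_rng(0)
omega = rng.chisquare(df=p, size=(100_000, lam.size))
W = omega @ lam                            # draws of W_n = sum_j lam_j * omega_j

for R in (5.0, 10.0, 20.0):
    tail = np.mean(W > R)                  # empirical 1 - H_hat(R)
    bound = p * lam.sum() / R              # p * R^{-1} * sum_j lam_j (Markov)
    print(f"R = {R:5.1f}   tail = {tail:.4f}   bound = {bound:.4f}")
```

The bound is loose, but it vanishes as $R$ grows, which is all the argument requires.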
Address of Alexander Petersen and Xi Liu
Department of Statistics and Applied Probability
University of California, Santa Barbara
Santa Barbara, CA 93106-3110
E-mail: [email protected]@pstat.ucsb.edu