WWasserstein Regression ∗ Yaqing Chen , Zhenhua Lin , and Hans-Georg Müller Department of Statistics, University of California, Davis Department of Statistics and Applied Probability, National University of Singapore
Abstract
The analysis of samples of random objects that do not lie in a vector space has foundincreasing attention in statistics in recent years. An important class of such object data isunivariate probability measures defined on the real line. Adopting the Wasserstein metric,we develop a class of regression models for such data, where random distributions serve aspredictors and the responses are either also distributions or scalars. To define this regressionmodel, we utilize the geometry of tangent bundles of the metric space of random measures withthe Wasserstein metric. The proposed distribution-to-distribution regression model provides anextension of multivariate linear regression for Euclidean data and function-to-function regressionfor Hilbert space valued data in functional data analysis. In simulations, it performs betterthan an alternative approach where one first transforms the distributions to functions in aHilbert space and then applies traditional functional regression. We derive asymptotic rates ofconvergence for the estimator of the regression coefficient function and for predicted distributionsand also study an extension to autoregressive models for distribution-valued time series. Theproposed methods are illustrated with data on human mortality and distributions of houseprices.
Keywords:
Distribution regression; Wasserstein geometry; random measures; parallel transport;tangent bundles; functional data analysis; density transformation. ∗ Research supported by NIH Echo and NSF DMS1712862. a r X i v : . [ s t a t . M E ] J un hen, Lin & Müller Regression analysis is one of the foundational tools of statistics to quantify the relationship betweena response variable and predictors and there have been many extensions of simple models such asthe multiple linear regression model to more complex data scenarios. These include linear modelsfor function-to-function regression, where predictors and responses are both considered randomelements in Hilbert space, with a variant where responses are scalars (Grenander 1950; Ramsayand Dalzell 1991). Such linear functional regression models and their properties have been wellstudied (Cardot et al. 1999, 2003; Yao et al. 2005; Cai and Hall 2006; Hall and Horowitz 2007) andreviewed (Morris 2015; Wang et al. 2016).Samples that include random objects, which are random elements in general metric spaces thatby default do not have a vector space structure, are increasingly common. Such data cannot beanalyzed with methods devised for Euclidean or functional data, which are usually viewed as randomelements of a Hilbert space (Marron and Alonso 2014; Huckemann 2015). We focus here on thecase where the random objects are random probability measures on the real line that satisfy certainregularity conditions. Specifically, at this time there are no in-depth studies with detailed statisticalanalysis of regression models that feature such random measures as predictors, in contrast to thesituation where vector predictors are coupled with random distributions as responses (Petersen andMüller 2019a).Related work also includes a variety of methods that specifically target the case where Euclideanpredictors are paired with responses that reside on a finite-dimensional Riemannian manifold (Daviset al. 2007; Shi et al. 2009; Hinkle et al. 2012; Yuan et al. 2012; Cornea et al. 2017; Lin et al. 2017).Kernel and spline type methods have been proposed for the case where both predictors and responsesare elements of finite-dimensional Riemannian manifolds (Steinke and Hein 2009; Steinke et al. 2010;Banerjee et al. 2016). However, these methods do not cover spaces of probability measures underthe Wasserstein metric, where the tangent spaces are subspaces of infinite-dimensional Hilbertspaces. Additionally, no comprehensive investigation of the statistical properties and asymptoticbehavior of distribution-to-distribution regression models seems to exist. To develop the proposedmodel, we utilize tangent bundles in the space of probability distributions with the Wassersteinmetric and parallel transport to obtain asymptotic results for regression coefficient functions andpredicted measures.A recent approach to including random distributions as predictors in complex regression modelsis to transform the densities of these distributions to unconstrained functions in the Hilbert space L , e.g., by the log quantile density (lqd) transformation (Petersen and Müller 2016) and then toemploy functional regression models where the transformed functions serve as predictors and theresponses are either also the transformed functions or scalars (Chen et al. 2019; Kokoszka et al.2019; Petersen et al. 2019), whence established methods for functional regression become applicable.In contrast to this transformation approach, we develop here an alternative method that is closely2 asserstein Regression adapted to the underlying geometry, where we utilize the geometric properties of the metric spaceof random measures equipped with the Wasserstein distance. Indeed, we found in implementationsand simulations that this adapted geometric method works better for regression modeling whencomparing it to the “unadapted” transformation approaches based on Petersen and Müller (2016).Other alternatives have been considered for regressing scalar responses on distribution-valued pre-dictors (Póczos et al. 2013; Oliva et al. 2014; Szabó et al. 2016; Bachoc et al. 2017; Thi Thien Tranget al. 2019), but these are either Nadaraya-Watson type kernel estimators that suffer from a severecurse of dimensionality, kernel-based embedding, or moment-based methods, which are generic andnot adapted to the specific features of distributions as predictors.In this paper, we develop a distribution-to-distribution regression model that is analogous totraditional linear regression models for Euclidean and functional data, with the decisive differ-ence that both predictors and responses are univariate probability measures. We also discuss anextension of our approach to an autoregressive model for distribution-valued time series. In ourestimation procedures and theoretical analysis we cover the commonly encountered but more com-plex situation where neither predictor nor response distributions are directly observed and insteadthe available data consist of i.i.d. samples that are generated by each of these distributions.The paper is organized as follows. We first propose a distribution-to-distribution regressionmodel based on the tangent bundle of the Wasserstein space of probability distributions in Section2, with estimation and asymptotic theory in Section 3, and then describe an extension of the modelto an autoregressive model for time series of distributions in Section 4. Implementation detailsare discussed in Section 5, followed by a simulation study in Section 6 to assess the finite-sampleperformance of the proposed estimators and a competing approach. The wide applicability of theproposed methods is demonstrated with applications to human mortality data and US house pricedata in Section 7. Let D be R or a closed interval in R , and B ( D ) be the Borel σ -algebra on D . We focus onthe Wasserstein space W = W ( D ) of probability distributions on ( D, B ( D )) with finite secondmoments, endowed with the L -Wasserstein distance d W ( µ , µ ) = (cid:26)Z [ F − ( p ) − F − ( p )] d p (cid:27) / , (1)for µ , µ ∈ W , where F − and F − denote the quantile functions of µ and µ , respectively, whichwe assume to be well defined; specifically, for any distribution µ = µ ( F ) ∈ W with cumulativedistribution function (cdf) F , we consider the quantile function F − to be the left continuous3 hen, Lin & Müller inverse of F , i.e., F − ( p ) = inf { r ∈ D : F ( r ) ≥ p } , for p ∈ (0 , . (2)As demonstrated for example in Ambrosio et al. (2008); Bigot et al. (2017); Zemel and Panaretos(2019), basic concepts of Riemannian manifolds can be generalized to the Wasserstein space W . Weassume in the following µ ⊕ is a reference probability measure in W that has a continuous cdf F ⊕ .Then the tangent space at µ ⊕ is defined as the closed span of functions of the form F − ◦ F ⊕ − id,where F − is the quantile function of an arbitrary distribution µ ∈ W (Equation (8.5.1), Ambrosioet al. 2008): T µ ⊕ = { t ( F − ◦ F ⊕ − id) : µ ∈ W , t > } L µ ⊕ , where L µ ⊕ = L µ ⊕ ( D ) is the Hilbert space of µ ⊕ -square-integrable functions on D ⊂ R , with innerproduct h· , ·i µ ⊕ and norm k·k µ ⊕ . By the definition of the Wasserstein space W and the atomlessnessof the reference measure µ ⊕ , both F − ◦ F ⊕ and id belong to L µ ⊕ . Thus, the tangent space T µ ⊕ isa subspace of L µ ⊕ equipped with the same inner product and induced norm.The exponential map Exp µ ⊕ : T µ ⊕ → W is then defined by the push-forward map via thefunction g + id for g ∈ T µ ⊕ : Exp µ ⊕ g = ( g + id) µ ⊕ , where for a measurable function h : D → D , h µ ⊕ is a push-forward measure such that h µ ⊕ ( A ) = µ ⊕ ( { r ∈ D : h ( r ) ∈ A } ) for any set A ∈ B ( D ). Unlike Riemannian manifolds, the exponentialmap here is not a local homeomorphism (Ambrosio et al. 2004). However, any µ ∈ W withquantile function F − can be recovered from µ ⊕ by Exp µ ⊕ ( F − ◦ F ⊕ − id), in the sense that d W (Exp µ ⊕ ( F − ◦ F ⊕ − id) , µ ) = d W (( F − ◦ F ⊕ ) µ ⊕ , µ ) = 0 due to the continuity of F ⊕ . Thus, thelogarithmic map Log µ ⊕ : W → T µ ⊕ , as the inverse of the exponential map, can be defined as:Log µ ⊕ µ = F − ◦ F ⊕ − id , (3)for every µ ∈ W .Although not injective on the entire tangent space T µ ⊕ the exponential map restricted to thelog image Exp µ ⊕ | Log µ ⊕ ( W ) is an isometric homeomorphism (e.g., see Lemma 2.1, Bigot et al. 2017).Thus, geodesics can be defined for the Wasserstein space W as the image of Exp µ ⊕ of line segmentsin the tangent space T µ ⊕ , (1 − t )Log µ ⊕ µ + t Log µ ⊕ µ , with µ , µ ∈ W . Specifically, the geodesic γ µ ⊕ [ µ , µ ] : [0 , → W from µ to µ is given by γ µ ⊕ [ µ , µ ]( t ) = Exp µ ⊕ [(1 − t )Log µ ⊕ µ + t Log µ ⊕ µ ] , (4)4 asserstein Regression for t ∈ [0 , γ µ ⊕ [ µ , µ ] is invariant with respect to the choice of atomless reference measure µ ⊕ and isunique (e.g., Villani 2003). Let ( ν , ν ) be a pair of random elements in W with joint distribution F on W × W , assumed to besquare integrable in the sense that E d W ( µ, ν ) < ∞ and E d W ( µ, ν ) < ∞ for some (and thus forall) µ ∈ W . Any element in W that minimizes E d W ( · , ν ) is called a Fréchet mean of ν (Fréchet1948). Since the Wasserstein space W is a Hadamard space (Kloeckner 2010), there exists a uniqueminimizer of E d W ( · , ν ) (Sturm 2003), i.e., the Fréchet mean of ν , ν ⊕ = argmin µ ∈W E d W ( µ, ν ).Analogously, we denote the Fréchet mean of ν by ν ⊕ = argmin µ ∈W E d W ( µ, ν ). It is well-knownthat for univariate distributions as we consider here, the quantile function of the Fréchet mean issimply the expected quantile function of the random distribution: F − ⊕ ( · ) = E F − ( · ) and F − ⊕ ( · ) = E F − ( · ) , where F − ⊕ , F − ⊕ , F − and F − are the quantile functions of ν ⊕ , ν ⊕ , ν and ν , respectively.As we aim to develop a regression model where the predictors and responses are both distri-butions in W , a good starting point is linear regression in Euclidean spaces, where for a pair ofrandom elements ( X, Y ) ∈ ( R p , R ), E ( Y | X ) = Γ E ( X ) = E Y + β > ( X − E X ). The regression func-tion Γ E can be characterized by the following two properties: First, it maps the expectation of X to expectation of Y ; second, conditioning on X , it transports the line segment between E X and X to that between E Y and E [ Y | X ]. Specifically,(E1) E Y = Γ E ( E X );(E2) E [ Y + t ( Y − E Y ) | X ] = Γ E [ E X + t ( X − E X )], for all t ∈ [0 , E [ Y + t ( Y − E Y ) | X ], we define its counterpart in the Wasserstein spaceby E ⊕ { γ ν ⊕ [ ν ⊕ , ν ]( t ) | ν } := Exp ν ⊕ [ E ( t Log ν ⊕ ν | Log ν ⊕ ν )]. Accordingly, a regression operatorΓ W : W → W for the Wasserstein space would be expected to satisfy:(W1) d W ( ν ⊕ , Γ W ( ν ⊕ )) = 0;(W2) d W ( E ⊕ { γ ν ⊕ [ ν ⊕ , ν ]( t ) | ν } , Γ W { γ ν ⊕ [ ν ⊕ , ν ]( t ) } ) = 0, for all t ∈ [0 , ν ⊕ ν ⊕ = 0, ν ⊕ -a.e., and Log ν ⊕ ν ⊕ = 0, ν ⊕ -a.e.(A1) The Fréchet means ν ⊕ and ν ⊕ are atomless.5 hen, Lin & Müller Note that Exp µ (0) = µ for any µ ∈ W . Furthermore, since exponential maps restricted to the cor-responding log image spaces are isometric homeomorphisms, d W ( ν ⊕ , Γ W ( ν ⊕ )) = k Log ν ⊕ ◦ Γ W ◦ Exp ν ⊕ (0) k ν ⊕ , and d W ( E ⊕ { γ ν ⊕ [ ν ⊕ , ν ]( t ) | ν } , Γ W { γ ν ⊕ [ ν ⊕ , ν ]( t ) } ) = k E ( t Log ν ⊕ ν | Log ν ⊕ ν ) − Log ν ⊕ ◦ Γ W ◦ Exp ν ⊕ ( t Log ν ⊕ ν ) k ν ⊕ .Then we consider a regression operator between tangent spaces, Γ : T ν ⊕ → T ν ⊕ (Γ = Log ν ⊕ ◦ Γ W ◦ Exp ν ⊕ | T ν ⊕ ) such that(T1) k Γ(0) k ν ⊕ = 0;(T2) k E ( t Log ν ⊕ ν | Log ν ⊕ ν ) − Γ( t Log ν ⊕ ν ) k ν ⊕ = 0, for all t ∈ [0 , T ν ⊕ and T ν ⊕ are subspaces of L ν ⊕ and L ν ⊕ , respec-tively. Distribution-to-distribution regression can then be viewed as function-to-function regression,which has been well-studied in functional data analysis (see, e.g., Ferraty and Vieu 2003; Yao et al.2005; He et al. 2010; Wang et al. 2016). Specifically, the proposed distribution regression model is E (Log ν ⊕ ν | Log ν ⊕ ν ) = Γ(Log ν ⊕ ν ) , (5)where Γ : T ν ⊕ → T ν ⊕ is a linear operator defined asΓ g ( t ) = h β ( · , t ) , g i ν ⊕ , for t ∈ D and g ∈ T ν ⊕ . (6)Here, β : D → R is a coefficient function (i.e., the kernel of Γ) lying in L ν ⊕ × ν ⊕ , and ν ⊕ × ν ⊕ is a product probability measure on the product measurable space ( D , B ( D )) generated by ν ⊕ and ν ⊕ .It is clear that (T1) holds for the proposed model. However, (T2) holds only if Γ(Log ν ⊕ W ) ⊆ Log ν ⊕ W . This may not be the case, especially when D is compact, which is common in realdata, due to the fact that Log ν ⊕ W is a closed and convex subset of the tangent space T ν ⊕ (e.g.,Proposition 2.1 of Bigot et al. 2017). For our theoretical analysis, we therefore make the followingassumption for the regression operator Γ.(A2) For the regression operator Γ, the image Γ(Log ν ⊕ ν ) is in Log ν ⊕ W with probability 1.In practice, we propose a projection method to resolve the issue that the estimate of Γ(Log ν ⊕ ν )may not lie in the log image Log ν ⊕ W . We will discuss further details in Section 5. Since the cdfs F ⊕ and F ⊕ are continuous by (A1), it follows that E (Log ν ⊕ ν ) = 0, ν ⊕ -a.e., and E (Log ν ⊕ ν ) = 0, ν ⊕ -a.e. Denote the covariance operators of Log ν ⊕ ν and Log ν ⊕ ν by C ν = E (Log ν ⊕ ν ⊗ Log ν ⊕ ν ) and C ν = E (Log ν ⊕ ν ⊗ Log ν ⊕ ν ), respectively, and the cross-covarianceoperator by C ν ν = E (Log ν ⊕ ν ⊗ Log ν ⊕ ν ). Since C ν and C ν are trace-class operators, the two6 asserstein Regression covariance operators and hence the cross-covariance operator have the following eigendecomposition(Theorem 7.2.6, Hsing and Eubank 2015) C ν = ∞ X j =1 λ j φ j ⊗ φ j , C ν = ∞ X k =1 ς k ψ k ⊗ ψ k , C ν ν = ∞ X k =1 ∞ X j =1 ξ jk ψ k ⊗ φ j , (7)with eigenvalues λ j = E [ h Log ν ⊕ ν , φ j i ν ⊕ ], ς k = E [ h Log ν ⊕ ν , ψ k i ν ⊕ ] such that λ ≥ λ ≥ · · · ≥ ς ≥ ς ≥ · · · ≥
0, eigenfunctions { φ j } ∞ j =1 and { ψ k } ∞ k =1 that form orthonormal systems in thetangent spaces T ν ⊕ and T ν ⊕ , respectively, and ξ jk = E [ h Log ν ⊕ ν , φ j i ν ⊕ h Log ν ⊕ ν , ψ k i ν ⊕ ]. Withprobability 1, the log transformations Log ν ⊕ ν and Log ν ⊕ ν admit the Karhunen-Loève expansionLog ν ⊕ ν = ∞ X j =1 h Log ν ⊕ ν , φ j i ν ⊕ φ j and Log ν ⊕ ν = ∞ X k =1 h Log ν ⊕ ν , ψ k i ν ⊕ ψ k . Then as in the classical functional regression (e.g., Bosq 2000; Yao et al. 2005), the regressioncoefficient function β can be expressed as β = ∞ X k =1 ∞ X j =1 b jk ψ k ⊗ φ j , (8)with b jk = λ − j ξ jk . In order to guarantee that the right hand side of (8) converges in the sense thatlim J,K →∞ Z D Z D K X k =1 J X j =1 b jk φ j ( s ) ψ k ( t ) − β ( s, t ) d ν ⊕ ( s )d ν ⊕ ( t ) = 0 , we assume (Lemma A.2, Yao et al. 2005) ∞ X k =1 ∞ X j =1 λ − j ξ jk < ∞ . (9)For economy, we use the same notation g ⊗ g for the operator and its kernel throughoutthis paper. Namely, for g ∈ L µ and g ∈ L µ , g ⊗ g can represent either an operator on L µ such that ( g ⊗ g )( g ) = h g , g i µ g for g ∈ L µ or its kernel, i.e., a bivariate function such that( g ⊗ g )( s, t ) = g ( t ) g ( s ) for all s, t ∈ D .A variant of the proposed distribution-to-distribution regression in (5) is the pairing of distri-butions as predictors with scalar responses. For a pair of random elements ( ν , Y ) ∈ W × R withjoint distribution G , a distribution-to-scalar regression model is E ( Y | Log ν ⊕ ν ) = E ( Y ) + h β , Log ν ⊕ ν i ν ⊕ . hen, Lin & Müller Here, ν ⊕ is the Fréchet mean of ν and β : D → R is a regression coefficient function in L ν ⊕ given by β = P ∞ j =1 c j φ j , where λ j and φ j are the eigenvalues and eigenfunctions of the covarianceoperator C ν of Log ν ⊕ ν as in (7), and c j = λ − j h E ( Y Log ν ⊕ ν ) , φ j i ν ⊕ , for which we assumethat P ∞ j =1 λ − j h E ( Y Log ν ⊕ ν ) , φ j i ν ⊕ < ∞ . This model can also be viewed as function-to-scalarregression, which has been well studied in functional data analysis (Cardot et al. 1999, 2003; Caiand Hall 2006; Hall and Horowitz 2007; Yuan and Cai 2010; Lin and Yao 2017). While Bigot et al. (2017) assume distributions are fully observed, in reality this is usually not thecase, and this creates an additional challenge for the implementation of the proposed distribution-to-distribution regression model. Options to address this include estimating cdfs (e.g., Aggarwal 1955;Read 1972; Falk 1983; Leblanc 2012), or estimating quantile functions (e.g., Parzen 1979; Falk 1984;Yang 1985; Cheng and Parzen 1997) of the underlying distributions. Given an estimated quantilefunction b F − (resp. cdf b F ), we convert it to a cdf (resp. a quantile function) by right (resp. left)continuous inversion, b F ( r ) = sup { p ∈ [0 ,
1] : b F − ( p ) ≤ r } , for r ∈ R (10)(resp. (2)). Alternatively, one can start with a density estimator to estimate densities (Panaretosand Zemel 2016; Petersen and Müller 2016) and then compute the cdfs and quantile functions byintegration and inversion.Suppose { ( ν i , ν i ) } ni =1 are n independent realizations of ( ν , ν ). What we observe are collec-tions of independent measurements { X il } m ν i l =1 and { Y il } m ν i l =1 , sampled from ν i and ν i , respectively,where m ν i and m ν i are the sample sizes which may vary across distributions. Note that there aretwo independent layers of randomness in the data: The first generates independent pairs of dis-tributions ( ν i , ν i ); the second generates independent observations according to each distribution, X il ∼ ν i and Y il ∼ ν i .For a distribution µ ∈ W , denote by b µ = µ ( b F ) the distribution associated with some cdfestimate b F , based on a sample of measurements drawn according to µ . Using b ν i and b ν i assurrogates of ν i and ν i , the theoretical analysis on the estimation of the distribution-to-distributionregression coefficient function requires the following assumptions that quantify the discrepancy ofthe estimated and true probability measures.(A3) For any distribution µ ∈ W , with some nonnegative decreasing sequences τ m = o (1) and υ m = o (1) as m → ∞ such that υ m ≥ τ m , the corresponding estimate b µ based on a sample of8 asserstein Regression size m drawn according to µ satisfiessup µ ∈W E [ d W ( b µ, µ )] = O ( τ m ) , and sup µ ∈W var[ d W ( b µ, µ )] = O ( υ m ) . In order to obtain the uniform rates as in (A3), we also require (following Panaretos and Zemel2016; Petersen and Müller 2016)(A4) The domain D is a compact interval.For example, the distribution estimator proposed by Panaretos and Zemel (2016) satisfies (A3)with τ m = m − / and υ m = m − , while Petersen and Müller (2016) consider a subset W ac M of W containing distributions that are absolutely continuous with respect to Lebesgue measure on D such that sup µ ∈W ac M sup r ∈ D µ max { f µ ( r ) , /f µ ( r ) , | f µ ( r ) |} ≤ M, (11)where f µ is the density function of a distribution µ ∈ W ac M , D µ is the support of distribution µ and M > µ ∈W ac M E d W ( b µ, µ ) = O ( m − / ) andsup µ ∈W ac M var[ d W ( b µ, µ )] = O ( m − / ) in (A3) (Proposition 1, Petersen and Müller 2019b).Since 2 n distributions need to be simultaneously estimated, we also make the following assump-tion on the numbers of measurements per distribution m ν i and m ν i .(A5) There exists a sequence m = m ( n ) such that min { m ν i , m ν i : i = 1 , , . . . , n } ≥ m and m → ∞ as n → ∞ . Given the independent realizations { ( ν i , ν i ) : i = 1 , . . . , n } of ( ν , ν ), we first consider an oracleestimator for the regression coefficient function β , where we initially assume that { ( ν i , ν i ) } ni =1 arefully observed. First of all, the empirical Fréchet means are well-defined and unique due to the factthat we work in Hadamard spaces, and have explicit solutions e ν ⊕ = arg min µ ∈W n X i =1 d W ( ν i , µ ) and e ν ⊕ = arg min µ ∈W ∞ X i =1 d W ( ν i , µ ) . (12)Specifically, the quantile functions of the empirical Fréchet means are the empirical means of quan-tile functions across the sample: e F − ⊕ ( · ) = 1 n n X i =1 F − i ( · ) and e F − ⊕ ( · ) = 1 n n X i =1 F − i ( · ) , (13)9 hen, Lin & Müller and the corresponding distribution functions are given by right continuous inverse of the quan-tile functions as in (10). Then the log transforms Log ν ⊕ ν i and Log ν ⊕ ν i admit estimatesLog e ν ⊕ ν i and Log e ν ⊕ ν i . The covariance operators C ν and C ν can be estimated by e C ν = n − P ni =1 Log e ν ⊕ ν i ⊗ Log e ν ⊕ ν i and e C ν = n − P ni =1 Log e ν ⊕ ν i ⊗ Log e ν ⊕ ν i , which also admitthe eigendecompositions e C ν = P ∞ j =1 e λ j e φ j ⊗ e φ j and e C ν = P ∞ k =1 e ς k e ψ k ⊗ e ψ k , respectively, for e λ ≥ e λ ≥ · · · ≥ e ς ≥ e ς ≥ · · · ≥ C ν , its inverse is not bounded, leading to an ill-posed problem (e.g.,He et al. 2003; Wang et al. 2016). Regularization is thus needed and can be achieved throughtruncation. An oracle estimator for the regression coefficient function β is e β = K X k =1 J X j =1 e b jk e ψ k ⊗ e φ j , (14)where e b jk = e λ − j e ξ jk , with e ξ jk = n − P ni =1 h Log e ν ⊕ ν i , e φ j i e ν ⊕ h Log e ν ⊕ ν i , e ψ k i e ν ⊕ , and J and K are thetruncation bounds.Furthermore, we can construct an estimator based on the distribution estimation in Section 3.1which will be applicable in practical situations, where typically ν i and ν i are observed in the formof samples generated from ν i and ν i . Denote the estimated quantile functions by b F − i and b F − i ,respectively. Then the quantile functions of the empirical Fréchet means b ν ⊕ and b ν ⊕ of b ν i and b ν i for i = 1 , . . . , n are given by b F − ⊕ ( · ) = 1 n n X i =1 b F − i ( · ) and b F − ⊕ ( · ) = 1 n n X i =1 b F − i ( · ) , (15)and the corresponding distribution functions b F ⊕ and b F ⊕ can be obtained by right continuousinversion as per (10).Replacing ν i and ν i by the corresponding estimates b ν i and b ν i in (14), we can analogouslyobtain the estimates for the covariance operators with corresponding eigendecompositions b C ν = n − P ni =1 Log b ν ⊕ b ν i ⊗ Log b ν ⊕ b ν i = P ∞ j =1 b λ j b φ j ⊗ b φ j , and b C ν = n − P ni =1 Log b ν ⊕ b ν i ⊗ Log b ν ⊕ b ν i = P ∞ k =1 b ς k b ψ k ⊗ b ψ k . A realistic estimator of the regression coefficient function β is then b β = K X k =1 J X j =1 b b jk b ψ k ⊗ b φ j , (16)where b b jk = b λ − j b ξ jk , with b ξ jk = n − P ni =1 h Log b ν ⊕ b ν i , b φ j i b ν ⊕ h Log b ν ⊕ b ν i , b ψ k i b ν ⊕ , and the correspond-ing estimator b Γ for the regression operator Γ in (6) is b Γ g ( t ) = h g, b β ( · , t ) i b ν ⊕ , for g ∈ Log b ν ⊕ W and t ∈ D. (17)10 asserstein Regression We will discuss selection of the number of included functional principal components (FPCs) corre-sponding to J and K in implementations of the proposed method in Section 5.1. Note that the true and estimated regression operators are defined in different tangent spaces,Γ : T ν ⊕ → T ν ⊕ and b Γ : T b ν ⊕ → T b ν ⊕ . To compare them, we employ parallel transport, which isa commonly used tool for data on manifolds (Yuan et al. 2012; Petersen and Müller 2019b). Fortwo probability distributions µ , µ ∈ W , a parallel transport operator can be defined between theentire Hilbert spaces L µ and L µ , i.e., P µ ,µ : L µ → L µ byP µ ,µ g := g ◦ F − ◦ F , for g ∈ L µ , (18)where F − and F are the quantile function of µ and cdf of µ , respectively. Assuming that µ isatomless, restricted to the tangent space T µ , the parallel transport operator P µ ,µ | T µ defines theparallel transport from tangent space T µ to T µ .Denote by H µ ,µ the space of all Hilbert-Schmidt operators from T µ to T µ , for µ , µ ∈ W .With µ , µ , µ , µ ∈ W where µ and µ are atomless, we can define a parallel transport operator P ( µ ,µ ) , ( µ ,µ ) from H µ ,µ to H µ ,µ by( P ( µ ,µ ) , ( µ ,µ ) A ) g = P µ ,µ ( A (P µ ,µ g )) , (19)for g ∈ T µ and A ∈ H µ ,µ . Denoting the Hilbert-Schmidt norm on H µ ,µ by k · k H µ ,µ , for µ , µ ∈ W , properties of parallel transport operators P µ ,µ and P ( µ ,µ ) , ( µ ,µ ) that are relevantto the theory are as follows. Proposition 1.
With probability measures µ , µ , µ , µ ∈ W , the parallel transport operators P µ ,µ and P ( µ ,µ ) , ( µ ,µ ) , as defined in (18) and (19) , have the following properties:1. If µ is atomless, P µ ,µ is unitary, i.e., h P µ ,µ g, P µ ,µ h i µ = h g, h i µ , for g, h ∈ L µ .
2. If µ and µ are both atomless, then the parallel transport P µ ,µ from L µ to L µ is theadjoint operator of P µ ,µ , i.e., h P µ ,µ g , g i µ = h g , P µ ,µ g i µ , for g ∈ L µ and g ∈ L µ .
3. If µ , µ , and µ are atomless, given any positive integers N and N , for g j ∈ T µ , j =11 hen, Lin & Müller , · · · , N and g k ∈ T µ , k = 1 , · · · , N , P ( µ ,µ ) , ( µ ,µ ) N X j =1 N X k =1 c jk g k ⊗ g j = N X j =1 N X k =1 c jk (P µ ,µ g k ) ⊗ (P µ ,µ g j ) .
4. If µ , µ , µ , and µ are all atomless, for A ∈ H µ ,µ and A ∈ H µ ,µ , kP ( µ ,µ ) , ( µ ,µ ) A − A k H µ ,µ = kP ( µ ,µ ) , ( µ ,µ ) A − Ak H µ ,µ . Related results for finite-dimensional Riemannian manifolds are in Proposition 2 of Lin andYao (2018). These four statements can be easily checked by the definition of parallel transportoperators in (18) and (19); we omit the proof. Given atomless distributions µ , µ , µ , µ ∈ W ,applying Proposition 1, the discrepancy between operators A ∈ H µ ,µ and A ∈ H µ ,µ can bequantified in the space H µ ,µ by kP ( µ ,µ ) , ( µ ,µ ) A − Ak H µ ,µ . Our goal for the theory is to evaluate the performance of the estimated regression operator b Γ,as defined in (17). According to the discussion in Section 3.3, if (A1) holds and the estimatedFréchet means b ν ⊕ and b ν ⊕ are both atomless, then the discrepancy between the estimated andtrue regression operators b Γ and Γ can be gauged by kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b Γ − Γ k H ν ⊕ ,ν ⊕ , which isequal to the L ν ⊕ × ν ⊕ -norm between the corresponding kernels.To guarantee the atomlessness of the estimated Fréchet means b ν ⊕ and b ν ⊕ and hence to justifythe use of kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b Γ − Γ k H ν ⊕ ,ν ⊕ as a measure of the estimation error, we require thefollowing uniform Lipschitz condition on the estimated cdfs:(A6) For any distribution µ ∈ W , the corresponding estimate b µ based on a sample of size m drawnaccording to µ satisfies sup µ ∈W | b F ( r ) − b F ( r ) | ≤ L m | r − r | , for any r , r ∈ D , where b F isthe cdf of b µ and L m are a positive sequence only depending on the sample size m .For example, for the estimator proposed by Panaretos and Zemel (2016), the Lipschitz constant L m in (A6) can be taken as Ch − m + 1, where C is a constant and h m is a bandwidth satisfying h m → m → ∞ .Defining Q b β = K X k =1 J X j =1 b b jk (P b ν ⊕ ,ν ⊕ b ψ k ) ⊗ (P b ν ⊕ ,ν ⊕ b φ j ) , we note that under (A1) and (A6), the third statement in Proposition 1 indicates that the kernel12 asserstein Regression of the transported regression operator estimator P ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b Γ is Q b β . Hence, kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b Γ − Γ k H ν ⊕ ,ν ⊕ = Z D Z D [ Q b β ( s, t ) − β ( s, t )] d ν ⊕ ( s )d ν ⊕ ( t ) . (20)Thus, for the theory, we will focus on the convergence rate of the data-based estimator b β for theregression coefficient function by quantifying the asymptotic order of the right hand side of (20). Todo this, we require the following conditions regarding the spacing and decay rates of the eigenvalues λ j and ς k and the coefficients b jk . Conditions of this type are standard in traditional functionallinear regression (e.g., Hall and Horowitz 2007).(A7) For j ≥ λ j − λ j +1 ≥ Cj − θ − , where C and θ are positive constants.(A8) For k ≥ ς k − ς k +1 ≥ Ck − ϑ − , where C and ϑ are positive constants.(A9) For j, k ≥ | b jk | ≤ Cj − ρ k − % , where ρ, % > / C > ν i and ν i aroundtheir Fréchet means, we also assume that d W ( ν ⊕ , ν ) and d W ( ν ⊕ , ν ) have finite variance, i.e.,(A10) E [ d W ( ν ⊕ , ν )] < ∞ , E [ d W ( ν ⊕ , ν )] < ∞ .Then we obtain Theorem 1.
Assume (A1)–(A10). Furthermore, defining α = max { (4 θ +3) / (4 θ +4 ρ +2)+1 / (4 ϑ +4 % +2) , (2 θ +1) / (4 θ +4 ρ +2)+(2 ϑ +3) / (4 ϑ +4 % +2) , (2 θ +2) / (4 θ +4 ρ +2)+(2 ϑ +2) / (4 ϑ +4 % +2) }− ,assume α < and that τ m n α +1 → , υ / m n α +1 / → , as n → ∞ . If J ∼ n / (4 θ +4 ρ +2) , K ∼ n / (4 ϑ +4 % +2) , Z D Z D h Q b β ( s, t ) − β ( s, t ) i d ν ⊕ ( s )d ν ⊕ ( t ) = O ( τ m n α +1 ) + O p ( υ / m n α +1 / ) + O p ( n α ) . Here, τ m and υ m are as in (A3). If the estimator b β is based on the distribution estimation methodemployed by Panaretos and Zemel (2016), then the rate in Theorem 1 becomes O ( m − / n α +1 ) + O p ( m − / n α +1 / ) + O p ( n α ) and becomes O p ( n α ) if m ∼ n . Assuming ν i , ν i ∈ W ac M (as definedin (11)) in addition, the density estimator used by Petersen and Müller (2016, 2019b) entails theconvergence rate O ( m − / n α +1 ) + O p ( m − / n α +1 / ) + O p ( n α ), which becomes O p ( n α ) if m ∼ n / .If θ = ϑ and ρ = % , then α = (1 − ρ ) / (2 θ + 2 ρ + 1) = − ρ − / (2 θ + 2)] − . Thus, theregression coefficient function estimator b β has a rate of convergence closer to √ n if the eigenvalues λ j and ς k decay with a slower rate and the coefficients b jk decline faster. Furthermore, taking λ j = ς j = j − a with a > θ = ϑ = 2 a −
1. If ρ = % = 2 ca + 1 / c >
0, then α = − c/ (1 + c ), so that under certain conditions α may approach − hen, Lin & MüllerRemark . When the distributions { ν i , ν i } ni =1 are fully observed, we need fewer assumptions asdescribed in the following. First of all, we do not need to assume the compactness of the domain;among the assumptions of Theorem 1, it is enough to assume (A1)–(A2) and (A7)–(A10) in thiscase. To ensure the atomlessness of the empirical Fréchet means e ν ⊕ and e ν ⊕ , we assume thatthere exists a large constant M > W M := { µ ∈ W : | F ( r ) − F ( r ) | ≤ M | r − r | , for any r , r ∈ D } , where F is the cdfof µ . In particular, we assume that ν i , ν i ∈ W M for all i . Defining e Γ g ( t ) = h g, e β ( · , t ) i e ν ⊕ for g ∈ Log e ν ⊕ W and t ∈ D , and Q e β = P Kk =1 P Jj =1 e b jk (P e ν ⊕ ,ν ⊕ e ψ k ) ⊗ (P e ν ⊕ ,ν ⊕ e φ j ), by Proposition 1 wecan quantify the estimation error of the estimated regression operator e Γ by kP ( e ν ⊕ , e ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) e Γ − Γ k H ν ⊕ ,ν ⊕ = R D R D [ Q e β ( s, t ) − β ( s, t )] d ν ⊕ ( s )d ν ⊕ ( t ). We then find that with J ∼ n / (4 θ +4 ρ +2) , K ∼ n / (4 ϑ +4 % +2) , and α defined as in Theorem 1, Z D Z D h Q e β ( s, t ) − β ( s, t ) i d ν ⊕ ( s )d ν ⊕ ( t ) = O p ( n α ) . Theorem 1 entails the following corollary on the prediction of ν given ν = µ for a distribution µ ∈ W , where we are interested in the rate of convergence when estimating E ⊕ ( ν | ν = µ ) :=Exp ν ⊕ [ E (Log ν ⊕ ν | Log ν ⊕ ν = Log ν ⊕ µ )]. For the following result, the corresponding estimate b µ of µ is assumed to be based on a sample of m µ ≥ m observations drawn from µ satisfying (A3),where m is the lower bound of the number of observations per distribution as per (A5). We denotethe prediction of ν by b ν ( b µ ) := Exp b ν ⊕ [ b Γ(Log b ν ⊕ b µ )], where b Γ is defined in (17).
Corollary 1.
Under the assumptions of Theorem 1, d W ( b ν ( b µ ) , E ⊕ ( ν | ν = µ )) = O ( τ m n α +1 ) + O p ( υ / m n α +1 / ) + O p ( n α ) . Proofs and auxiliary lemmas for this section are in Appendix A.
Here we consider a dependent sequence µ , µ , . . . ∈ W , assumed to be square integrable in thesense that E d W ( µ, µ i ) < ∞ for some (and thus for all) µ ∈ W , i = 1 , . . . , n , and have a commonFréchet mean µ ⊕ , which satisfies(B1) The Fréchet mean µ ⊕ is atomless.Furthermore, assuming that (Log µ ⊕ µ i ) i ≥ is a stationary process (as per Definition 2.4 Bosq 2000),we extend the distribution-to-distribution regression model in (5) to a first order autoregressivemodel, Log µ ⊕ µ i +1 = Γ (cid:16) Log µ ⊕ µ i (cid:17) + ε i , for i = 1 , , . . . (21)14 asserstein Regression Here, Γ : T µ ⊕ → T µ ⊕ is a linear operator defined asΓ g ( t ) = h β ( · , t ) , g i µ ⊕ , for t ∈ D and g ∈ T µ ⊕ , (22)where β : D → R is the auto-regression coefficient kernel lying in L µ ⊕ × µ ⊕ , and ε i are i.i.d. randomelements in the tangent space T µ ⊕ . Similar models have been previously studied in the seminalwork of Bosq (2000).To ensure the existence of such an autoregressive process, we assume that there exists an integer q ≥ k Γ q k L µ ⊕ <
1, where k · k L µ ⊕ denotes the sup norm for linear operators on L µ ⊕ and we define Γ q by induction, Γ k ( · ) = Γ[Γ k − ( · )], for any integer k > µ ⊕ µ i ) falls in the closed convex set Log µ ⊕ W for all i withprobability 1.As in Section 3, we have E (Log µ ⊕ µ ) = 0, µ ⊕ -almost surely. Denote the covariance operatorand cross-covariance operator of order 1 by C = E (Log µ ⊕ µ ⊗ Log µ ⊕ µ ) and C = E (Log µ ⊕ µ ⊗ Log µ ⊕ µ ), respectively, where C has the eigendecomposition C = ∞ X j =1 λ j φ j ⊗ φ j , with eigenvalues λ ≥ λ ≥ · · · ≥ { φ j } ∞ j =1 . With probability 1,the logarithmic transforms Log µ ⊕ µ i admit the expansionLog µ ⊕ µ i = ∞ X j =1 h Log µ ⊕ µ i , φ j i µ ⊕ φ j , i = 1 , , . . . The cross-covariance operator C hence admits the decomposition C = P ∞ l =1 P ∞ j =1 ξ jl φ l ⊗ φ j , where ξ jl = E ( h Log µ ⊕ µ , φ j i µ ⊕ h Log µ ⊕ µ , φ l i µ ⊕ ). With b jl = λ − j ξ jl , the regression coefficient function canthen be expressed as β = ∞ X l =1 ∞ X j =1 b jl φ l ⊗ φ j . For estimation of the regression coefficient function β , first considering a fully observed sequenceof length n , µ , µ , . . . , µ n , with the oracle estimator of the Fréchet mean e µ ⊕ defined analogouslyto (12), the autocovariance operators C and C can be estimated by their empirical counterparts e C = n − P ni =1 Log e µ ⊕ µ i ⊗ Log e µ ⊕ µ i and e C = ( n − − P n − i =1 Log e µ ⊕ µ i +1 ⊗ Log e µ ⊕ µ i , with the decom-positions e C = P ∞ j =1 e λ j e φ j ⊗ e φ j and e C = P ∞ l =1 P ∞ j =1 e ξ jl e φ l ⊗ e φ j , respectively, for e λ ≥ e λ ≥ · · · ≥ e ξ jl = ( n − − P n − i =1 h Log e µ ⊕ µ i , e φ j i e µ ⊕ · h Log e µ ⊕ µ i +1 , e φ l i e µ ⊕ . Then an oracle estimator for the15 hen, Lin & Müller regression coefficient function β is e β = J X l =1 J X j =1 e b jl e φ l ⊗ e φ j , where e b jl = e λ − j e ξ jl , and J is the truncation bound.As discussed for the independent case in Section 3.2, a realistic estimator b β for β based onthe distribution estimation discussed in Section 3.1 can be obtained by replacing µ i and µ ⊕ withthe corresponding estimates b µ i and b µ ⊕ , the latter analogous to (15). Specifically, estimates for theautocovariance operators with corresponding decompositions are given by b C = n − P ni =1 Log b µ ⊕ b µ i ⊗ Log b µ ⊕ b µ i = P ∞ j =1 b λ j b φ j ⊗ b φ j and b C = ( n − − P n − i =1 Log b µ ⊕ b µ i +1 ⊗ Log b µ ⊕ b µ i = P ∞ j =1 b ξ jl b φ l ⊗ b φ j , with b ξ jl = ( n − − P n − i =1 h Log b µ ⊕ b µ i , b φ j i b µ ⊕ h Log b µ ⊕ b µ i +1 , b φ l i b µ ⊕ . A data-based estimator of the regressioncoefficient function β is then given by b β = J X l =1 J X j =1 b b jl b φ l ⊗ b φ j , (23)with b b jl = b λ − j b ξ jl , and the corresponding estimator b Γ for the operator Γ in (22) is b Γ g ( t ) = h g, b β ( · , t ) i b µ ⊕ , for g ∈ Log b µ ⊕ W and t ∈ D. (24)To derive the convergence rate of the estimator b β , we require the following assumptions analo-gous to the independent case as in Section 3.(B3) There exists a sequence m = m ( n ) such that for the number of measurements per distribution m µ i , min { m µ i : i = 1 , , . . . , n } ≥ m and m → ∞ as n → ∞ .(B4) For j ≥ λ j − λ j +1 ≥ Cj − θ − , where C and θ are positive constants.(B5) For j, l ≥ | b jl | ≤ Cj − ρ l − % , where ρ, % > / C > E [ d W ( µ ⊕ , µ i )] ≤ C , for all i = 1 , , . . . , where C > n − P ni =1 d W ( µ i , b µ i ) (Rosenblatt 1956; Bosq 2000).(B7) The sequence (Log µ ⊕ µ i ) i ≥ is geometrically strongly mixing in the sense that for any integer k ≥
1, sup i ∈ N + sup A ∈ σ (Log µ ⊕ µ l : l ≤ i ) A ∈ σ (Log µ ⊕ µ l : l ≥ i + k ) | P ( A ∩ A ) − P ( A ) P ( A ) | ≤ Cu k , asserstein Regression where C > u ∈ (0 ,
1) are constants, and given a collection V of random elements, σ ( V )denotes the σ -field generated by V .Here, as in Section 3, we define Q b β := P Jl =1 P Jj =1 b b jl (P b µ ⊕ ,µ ⊕ b φ l ) ⊗ (P b µ ⊕ ,µ ⊕ b φ j ). Then we obtain Theorem 2.
Assume (A3), (A4), (A6) and (B1)–(B7). Furthermore, defining ζ = (1 − ρ − % ) / (2 θ + ρ + % + 1) , assume ζ < and that τ m n ζ +1 → , υ / m n ζ +1 / → , as n → ∞ . If J ∼ n / (4 θ +2 ρ +2 % +2) , Z D Z D h Q b β ( s, t ) − β ( s, t ) i d µ ⊕ ( s )d µ ⊕ ( t ) = O ( τ m n ζ +1 ) + O p ( υ / m n ζ +1 / ) + O p ( n ζ ) . The convergence rate obtained for the estimated coefficient function of the autoregressive modelis very similar to that in Theorem 1. This is due to (B7) which entails var[ n − P ni =1 d W ( b µ i , µ i )]is of the same order as in the independent case. A similar discussion as after Theorem 1 appliesregarding specific rates of convergence. Remark . Analogous to the independent case, if µ i are all fully observed, we can obtain theconvergence rate of the oracle estimator e Γ without assumption (A4). In addition, we do not needthe strong mixing condition (B7) regarding the variance of n − P ni =1 d W ( µ i , b µ i ). Define Q e β = P Jl =1 P Jj =1 e b jl (P e µ ⊕ ,µ ⊕ e φ l ) ⊗ (P e µ ⊕ ,µ ⊕ e φ j ). Assuming (B1)–(B6) and that µ i ∈ W M for all i , it can beshown that with J ∼ n / (4 θ +2 ρ +2 % +2) and ζ defined in Theorem 2, Z D Z D h Q e β ( s, t ) − β ( s, t ) i d µ ⊕ ( s )d µ ⊕ ( t ) = O p ( n ζ ) . As for the independent case, we derive the asymptotic results for the one-on-one prediction of µ n +1 given µ n as follows. Here, we denote E ⊕ ( µ n +1 | µ n ) := Exp µ ⊕ [ E (Log µ ⊕ µ n +1 | Log µ ⊕ µ n )], anddenote the prediction of µ n +1 by b µ n +1 ( b µ n ) := Exp b µ ⊕ [ b Γ(Log b µ ⊕ b µ n )] with b Γ as per (24).
Corollary 2.
Under the assumptions of Theorem 2, d W ( b µ n +1 ( b µ n ) , E ⊕ ( µ n +1 | µ n )) = O ( τ m n ζ +1 ) + O p ( υ / m n ζ +1 / ) + O p ( n ζ ) . Proofs and auxiliary lemmas for Theorem 2 and Corollary 2 are in Appendix B.
Choosing J and K for the Independent Case The following data-based method for choosing the numbers of predictor and response FPCs J and K , respectively, as defined in (17), when one has an i.i.d. sample of distributions { ν i , ν i } ni =1 ,was found to be adequate in practical applications. We first choose the number of FPCs K hen, Lin & Müller for the logarithms of the response distributions Log b ν ⊕ ν i , by applying either leave-one-curve-outcross-validation or thresholding according to cumulative fraction of variance explained (FVE). Forcross-validation, the objective function to be minimized is the discrepancy between predicted andobserved trajectories, i.e., K = argmin K n X i =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Log b ν ⊕ b ν i − K X k =1 h Log b ν ⊕ b ν i , b ψ k, − i i b ν ⊕ b ψ k, − i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) b ν ⊕ , where { b ψ k, − i } k ≥ are the estimated eigenfunctions from the functional principal component analysis(FPCA) of the i th-curve-left-out sample { Log b ν ⊕ b ν i } i = i . For the choice by FVE, K is chosen suchthat at least 100(1 − α )% of the variance is explained by the first K FPCs, where users need tospecify α , with the common choice α = 0 . J , we use leave-one-curve-out cross-validation by minimizing thedifference between the curves constructed with FPCs predicted by linear regression and the observedtrajectories: J = argmin J n X i =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Log b ν ⊕ b ν i − K X k =1 J X j =1 b b jk, − i h Log b ν ⊕ b ν i , b φ j, − i i b ν ⊕ b ψ k, − i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) b ν ⊕ , where b b jk, − i = b λ − j, − i b ξ jk, − i with b ξ jk, − i = ( n − − P i = i h Log b ν ⊕ b ν i , b φ j, − i i b ν ⊕ ·h Log b ν ⊕ b ν i , b ψ k, − i i b ν ⊕ ,and { b λ j, − i } j ≥ and { b φ j, − i } j ≥ are the estimated eigenvalues and eigenfunctions from the FPCA ofthe i th-curve-left-out sample { Log b ν ⊕ b ν i } i = i , respectively.We mention that in practical implementations we replace leave-one-curve-out cross validationby 5-fold cross-validation when n > Choosing J for Distribution-Valued Time Series Here, we use a cross-validation approach proposed by Bergmeir et al. (2018) to select the numberof FPCs J as defined in (23) for the time series case. Note that there is no second truncationparameter K . Following Bergmeir et al. (2018), we first divide the observed time series into atraining set and a testing set, and then apply k -fold cross validation on the training set to choosethe number of FPCs J ; secondly, the regression coefficient function β will be estimated on thewhole training set with the optimal J ; lastly, the estimate of β will be applied on the testing set toevaluate the performance of out-of-sample prediction. We discuss the independent scenario only; the time series case is analogous. Our goal here is todeal with the case when the fit of the logarithmic response does not fall in the logarithmic space18 asserstein Regression with base point b ν ⊕ , i.e., b Γ(Log b ν ⊕ b ν i ) / ∈ Log b ν ⊕ W , (25)with b Γ given in (17). This problem was already recognized by Bigot et al. (2017). If (25) happens,we update the fit by a projection onto the boundary of Log b ν ⊕ W along the line segment connect-ing the origin 0 and the original fit b Γ(Log b ν ⊕ b ν i ). Specifically, we multiply the original estimate b Γ(Log b ν ⊕ b ν i ) by a constant η i such that η i = max { η ∈ [0 ,
1] : η b Γ(Log b ν ⊕ b ν i ) + id is non-decreasing } . (26)Note that η i = 1 when b Γ(Log b ν ⊕ b ν i ) ∈ Log b ν ⊕ W . In our implementation, the fitted logarithmicresponse is then given by η i b Γ(Log b ν ⊕ b ν i ) for all i = 1 , . . . , n . We compared the performance of the proposed method implemented with boundary projection (re-ferred to as projection method) with two other approaches. The first of these comparison methodsis to employ an alternative to the proposed boundary projection for those situations where theevent (25) takes place, which was proposed by Cazelles et al. (2018) in the context of principalcomponent analysis (PCA). This alternative to handle the problem extends the domains of thedistributions. We use this method by fitting the proposed distribution-to-distribution regressionmodel with distributions on an extended domain, and then normalize the fitted distributions byrestricting them back to the original domain. We refer to this as the domain-extension method inthe following. A second possible approach for comparison is provided by the log quantile density(lqd) method, where we map distributions/densities to a Hilbert space via the lqd transformation(Petersen and Müller 2016), apply function-to-function regression to the lqd functions, and mapthe fitted functions back to the Wasserstein space W through the inverse lqd transformation (Chenet al. 2019).For our simulations, we generated data as follows:Step 1: Generate ν i = Beta ( a i , c i ) where a i iid ∼ Gamma(20 ,
10) and c i iid ∼ Gamma(40 , i = 1 , . . . , n . Then ν ⊕ is obtained by computing F − ν ⊕ ( · ) = E F − i ( · ) via numerical integration.Step 2: Perform FPCA on { Log ν ⊕ ν i } ni =1 ,Log ν ⊕ ν i = J X j =1 h Log ν ⊕ ν i , φ j i ν ⊕ φ j , such that the first J FPCs explain at least 99.99% of the variation.19 hen, Lin & Müller
Step 3: For ν ⊕ = Beta(2 ,
4) and a J × J matrix B = ( b jk ) which consists of zeros except for thetopleft 2 × . . − . . ! , let Z i := η i J X k =1 h Log ν ⊕ ν i , J X j =1 b jk φ j i ν ⊕ φ k , where η i is a projection constant to ensure Z i ∈ Log ν ⊕ W as discussed in Section 5.2, i.e, η i = max { η ∈ [0 ,
1] : η P Jk =1 h Log ν ⊕ ν i , P Jj =1 b jk φ j i ν ⊕ φ k + id is non-decreasing } .Step 4: Sample a i iid ∼ Unif {± π, ± π, ± π } . Let ν i = [ g a i ◦ ( Z i + id)] ν ⊕ , where g a ( r ) = r −| a | − sin( ar ) with a ∈ R \{ } and r ∈ R . This is a valid method to provide random distortionsfor distributions (Petersen and Müller 2019a).Step 5: Draw an i.i.d. sample of size m from each of the distributions { ν i } ni =1 and { ν i } ni =1 .Four cases were considered with n ∈ { , } and m ∈ { , } . We simulated 500 runs foreach ( n, m ) pair. The distribution domain is expanded from [0 ,
1] to [ − ,
2] for the domain-extensionmethod. To compare the three methods, we computed the out-of-sample average Wassersteindiscrepancy (AWD) based on observations for 200 new predictors { ν i : i = n + 1 , . . . , n + 200 } , foreach Monte Carlo run. Denoting the fitted response distributions by ν \ i , the out-of-sample AWDis given by AWD( n, m ) = 1200 n +200 X i = n +1 d W (Exp ν ⊕ Z i , ν \ i ) . For our method with boundary projection, ν \ i = Exp b ν ⊕ (cid:16) η i b Γ(Log b ν ⊕ b ν i ) (cid:17) . The results are sum-marized in the boxplots of Figure 1. Note that the domain-extension method often failed to forcethe fit b Γ(Log b ν ⊕ b ν i ) to fall in the log space Log b ν ⊕ W . The boxplots of the AWDs of the domain-extension method were only calculated based on the runs where it was computable and the countof these runs out of 500 are also included in Figure 1.The proposed method outperforms the lqd method in all four cases considered and also thedomain-extension method when there are sufficiently many observations for each distribution ( m =500). Although the domain-extension method seems slightly better than the proposed method whenthe number of observations per distribution is limited, it has a major downside — it is frequentlynot computable, especially when the number of distributions is small ( n = 20). Specifically when m = 500, we found that the domain-extension method is only computable for 72 runs for n = 20and 208 runs for n = 200, out of 500 runs. In addition, the median AWDs almost stay the same asthe number of distribution pairs n increases from 20 to 200 when m is small for all three approaches.This suggests that the fitting error seems to be dominated by the distribution estimation error when20 asserstein Regression lll llll llllllll l lllllll . . . . . m = 50lqd ext proj lqd ext proj (99) (429) n = 20 n = 200 llllllllllllll ll lllllllll llllllllllllllllllllllllllllllllllllll llllllll . . . . m = 500lqd ext proj lqd ext proj (72) (208) n = 20 n = 200 A W D Figure 1: Boxplots of the average Wasserstein discrepancies (AWDs) for the four simulation setupswith ( n, m ) ∈ { , } × { , } , where “lqd”: lqd method; “ext”: domain-extension method;“proj”: projection method. The integers in brackets under “ext” indicate for how many of the 500Monte Carlo runs the domain-extension method was computable.there are few or modest numbers of observations available per distribution. When m is large, theAWDs decrease when n increases for the lqd and the proposed method, but they do not decreasefor the domain-extension method, indicating non-vanishing bias. There has been continuing interest in the nature of human longevity and the analysis of mortalitydata across countries and calendar years has provided some of the key data to study it (e.g., Chiouand Müller 2009; Ouellette and Bourbeau 2011; Hyndman et al. 2013; Shang and Hyndman 2017).Of particular interest is how patterns of mortality of specific populations evolve over calendartime. Going beyond summary statistics such as life expectancy, viewing the entire age-at-deathdistributions as data objects is expected to lead to deeper insights into the secular evolution ofhuman longevity and its dynamics. The Human Mortality Database ( )provides yearly life tables for 38 countries, from which the distributions of age-at-death in terms ofhistograms can be extracted, whence densities can then be obtained by smoothing with local linearregression. We obtained these densities on the domain [0 , n = 32 countries for which data are available for the years1983 and 2013. We applied the proposed distribution-to-distribution regression model with mortal-ity distributions for an earlier year (1983) as the predictor and a later year (2013) as the response to21 hen, Lin & Müller compare the temporal evolution of age-at-death distributions among different countries. We showthe results for females in Figure 2 for Japan, Ukraine, Italy and USA, which showcase differentpatterns of mortality change between 1983 and 2013. In addition to the graphical comparison,Wasserstein discrepancies (WD) between the observed and fitted distributions are also listed. Forall four countries, the observed and fitted distributions for 2013 are seen to be shifted to the rightfrom the corresponding distributions in 1983, indicating increased longevity. Japan (WD = 1.37)age den E nd [ p l o t i nd [ i ], ] . . . Ukraine (WD = 3.2)age den E nd [ p l o t i nd [ i ], ] Italy (WD = 0.36)age den E nd [ p l o t i nd [ i ], ] . . . USA (WD = 1.64)age den E nd [ p l o t i nd [ i ], ] Age D en s i t y Predictor (1983) Response (2013) Fitted response
Figure 2: Distribution-to-distribution regression for the mortality distributions of females in Japan,Ukraine, Italy and USA. The predictors are the distributions for 1983 and the responses are thedistributions 30 years later. The Wasserstein discrepancies (WDs) between fitted and observeddistributions are indicated for each country.Specifically, for Japan this rightward mortality shift is seen to be more expressed than suggestedby the fitted model, so that longevity extension is more than is anticipated, while the mortalitydistribution for Ukraine seems to shift to the right at a slower pace than the fitted model wouldsuggest, leading to a relatively large WD with a value of 3.2 between the observed and fittedresponse. In contrast, the regression fit for Italy almost perfectly matches the observed distributionin 2013. While the evolution of the mortality distributions for Japan, Ukraine and Italy canbe viewed as mainly a rightward shift over calendar years, this is not the case for USA, wherecompared with the fitted response, the actual rightward shift of the mortality distribution seemsto be accelerated for those above age 75 and decelerated for those below age 70.The estimated regression coefficient function b β ( s, t ) is shown in Figure 3. It is seen to bestratified according to s , which is the predictor age—negative for s around 7 and 25 and above 75or 80 and positive elsewhere (referred to as negative and positive strata in the sequel) and depends22 asserstein Regression only weakly on t , which is the response age except for t above 75, where the negative stratum atlarge s ≥
75 gradually diminishes and the positive strata expand. Hence, if the log transformedpredictor Log b ν ⊕ b ν i is non-negative or non-positive throughout its domain, then the fit for the logtransformed response η i b Γ(Log b ν ⊕ b ν i ) is determined by the comparison of the absolute values of thelog transformed predictor over the positive and negative strata of the estimated coefficient b β ( · , t ).If in 1983 the age-at-death distribution of a country was located to the right/left from the Fréchetmeans across countries, similar shifts may also be observed in the fitted distributions for 2013. Forexample, the log mapped predictors for Japan and Ukraine as shown in Figure 4 both have a peakaround 20–25 and then the fitted responses are indeed closer to the Fréchet mean in 2013 but stillto the right/left of the mean. s t Figure 3: Estimated regression coefficient function b β ( s, t ) in (16) for distribution-to-distributionregression on mortality data for log-mapped distributions.Compared to 1983, the fitted mortality distribution for 2013 in Japan has a bigger rightwardshift for older and a smaller one for younger females relative to the Fréchet mean distribution acrosscountries, in contrast to Ukraine, where mortality moves in the opposite direction from the Fréchetmean. The log transformed predictor for Italy is negative before 15 and positive after, whencethe fitted log response for Italy becomes positive throughout and also expands in size, meaningthe relative standing of Italy in terms of longevity improved in 2013 and the fitted distributionof Italy in 2013 is situated to the right from the Fréchet mean for all ages. While USA also haszero crossings in 1983, in 2013 the fitted response is almost entirely negative and consequently themortality distribution moved to the left of the Fréchet mean, i.e., its relative standing in terms oflongevity became worse.To illustrate the proposed autoregressive model for distribution-valued time series, we consider23 hen, Lin & Müller the mortality data for Sweden. We chose Sweden as an example because demographic data areavailable for a longer time span and are of very high quality. Here, the model was trained on thetime series between 1961 and 2001, and the out-of-sample prediction was evaluated for the following15 years up to 2016, where the yearly distribution was predicted from the fitted distribution timeseries, using the predicted distributions from previous years. Japan ( h i = Log S t a [ p l o t i nd [ i ], ] − Ukraine ( h i = Log S t a [ p l o t i nd [ i ], ] Italy ( h i = Log S t a [ p l o t i nd [ i ], ] − USA ( h i = Log S t a [ p l o t i nd [ i ], ] Age
Log M ap Predictor (1983) Fitted response (2013)
Figure 4: Log mapped predictors Log b ν ⊕ b ν i and fitted log responses η i b Γ(Log b ν ⊕ b ν i ) for Japan,Ukraine, Italy and USA, where b Γ and η i are defined in (17) and (26).As can been seen from the fitted distributions for the training period as shown in the firstrow of Figure 5, they are all close to the observed distributions. For the prediction period, thepredicted densities displayed in the second row of the figure increasingly deviate from the observeddistributions going from 2002 to 2016, where the observed densities have a mode increasingly shiftingto the right compared to the prediction. This means that the rightward mortality distribution shiftoutpaces the model expectation and longevity extension is accelerating for Sweden. A question of continuing interest to economists is how house prices change over time (e.g., Oikarinenet al. 2018; Bogin et al. 2019). We fitted the temporal evolution of house price distributions via theautoregressive distribution time series model described in Section 4, where we downloaded houseprice data from . These data included bimonthly median house pricesafter inflation adjustment for m = 306 cities in the US from June 1996 to August 2015, for whichthe distribution of median house prices across the cities was constructed for every second month.24 asserstein Regression The autoregressive model was trained on data up to April 2007 and predictions were computed forthe remaining period, where we successively predicted the distribution of each month based on theprediction two months prior, i.e., by running the distribution time series model as estimated fromthe training period. m o r t[[ i c ]][, w h i c h ( y ea r [[ i c ]] == p l o t i nd [ i ] + y ea r S t a ) ] . . . m o r t[[ i c ]][, w h i c h ( y ea r [[ i c ]] == p l o t i nd [ i ] + y ea r S t a ) ] m o r t[[ i c ]][, w h i c h ( y ea r [[ i c ]] == p l o t i nd [ i ] + y ea r S t a ) ] m o r t[[ i c ]][, w h i c h ( y ea r [[ i c ]] == p l o t i nd P r ed [ i ] + y ea r E nd ) ] . . . m o r t[[ i c ]][, w h i c h ( y ea r [[ i c ]] == p l o t i nd P r ed [ i ] + y ea r E nd ) ] m o r t[[ i c ]][, w h i c h ( y ea r [[ i c ]] == p l o t i nd P r ed [ i ] + y ea r E nd ) ] Age D en s i t y P r ed i c t i on F i tt i ng Observed Fitted/Predicted
Figure 5: Implementation of the autoregressive distribution time series model for age-at-deathdistributions of females in Sweden. Training period: 1961–2001 (with fitting results for three yearsshown in the top row). Prediction period: 2002–2016 (with prediction results for three years shownin the bottom row). The fitting/prediction Wasserstein discrepancies (WDs) are as shown for eachpanel.Figure 6 shows the fitting and prediction results for training and prediction periods, whereselected months are ordered in time, while Figure 7 illustrates the results sorted by Wassersteindiscrepancies. The house price densities are found to be mostly uni-modal, and the peak shiftsgradually to the right over time. Within the training period, the fitted densities are initially veryclose to the observed densities and then gradually are situated to the left of the observed densities,which means that the house price evolution overall accelerates during this period.For the prediction period, the predicted densities fall behind the observed distributions in 2007,catch up with the actual distribution in 2008, and then move to the right of the observed distribu-tions. We find that the discrepancy between the predicted and observed house price distributionsincreases from 2008 to 2012 and then decreases afterwards. These findings are in line with theoverheating of the housing market before 2006, the crash in 2007–2008, and the lingering effects ofthe financial crisis, followed by a recovery after 2012.25 hen, Lin & Müller
Aug. 1996(WD = 0.015)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] . Apr. 1999(WD = 0.028)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Dec. 2001(WD = 0.026)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Aug. 2004(WD = 0.12)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Apr. 2007(WD = 0.12)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Jun. 2007(WD = 0.13)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] . Jun. 2008(WD = 0.057)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] Jun. 2009(WD = 0.1)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] Jun. 2011(WD = 0.23)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] Aug. 2015(WD = 0.1)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] log (Price) D en s i t y P r ed i c t i on F i tt i ng Observed Fitted/Predicted
Figure 6: Observed and fitted (top row) / predicted (bottom row) densities of the house pricedistributions. Training period: August 1996 to April 2007. Prediction period: June 2007 to August2015. Five representative months are depicted for each of the training and prediction periods, wherethe first month corresponds to the beginning of the respective period and the fifth month to theend. The Wasserstein discrepancies (WDs) are also listed for each month.
Jun. 1998(WD = 0.0042)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] . Aug. 2000(WD = 0.015)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Oct. 1999(WD = 0.029)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Dec. 2004(WD = 0.09)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Dec. 2005(WD = 0.24)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Feb. 2008(WD = 0.027)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] . Aug. 2009(WD = 0.11)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] Feb. 2010(WD = 0.14)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] Jun. 2013(WD = 0.23)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] Apr. 2012(WD = 0.29)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] log (Price) D en s i t y P r ed i c t i on F i tt i ng Observed Fitted/Predicted
Figure 7: Observed and fitted (top row) / predicted (bottom row) densities of the house pricedistributions. Training period: August 1996 to April 2007. Prediction period: June 2007 to August2015. The representative months correspond to the minimum, three quartiles and maximum of theWasserstein discrepancies (WDs) for training and prediction periods, respectively.26 asserstein Regression
Appendices: Proofs and Ancillary Results
Throughout the proof, given any µ ∈ W , we denote the space of all Hilbert-Schmidt operators from T µ to T µ by H µ := H µ,µ . Appendix A: Proofs for Section 3.4
To derive the convergence rate of the estimator b β , we first need to study the asymptotic propertiesof the empirical covariance operators b C ν and b C ν . For simplicity, we denote P b ν ⊕ ,ν ⊕ g and P b ν ⊕ ,ν ⊕ g by P g for g ∈ L b ν ⊕ and g ∈ L b ν ⊕ , respectively. Lemma 1.
Assume (A1), (A3)–(A6), and (A10). Furthermore, assume that the eigenvalues { λ j } and { ς k } are distinct, respectively. Then kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) , kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) . Furthermore, sup j ≥ | b λ j − λ j | ≤ kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ , sup k ≥ | b ς k − ς k | ≤ kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ , k P b φ j − φ j k ν ⊕ ≤ √ kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ / min ≤ j ≤ j { λ j − λ j +1 } , for all j ≥ , k P b ψ k − ψ k k ν ⊕ ≤ √ kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ / min ≤ k ≤ k { ς k − ς k +1 } , for all k ≥ . Proof.
We only prove the results for b C ν ; those for b C ν can be shown analogously. By the third27 hen, Lin & Müller statement in Proposition 1, under (A1) and (A6), P ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν = n − n X i =1 (PLog b ν ⊕ b ν i ) ⊗ (PLog b ν ⊕ b ν i ) − C ν = n − n X i =1 (Log ν ⊕ ν i ) ⊗ (Log ν ⊕ ν i ) − C ν + n − n X i =1 (PLog b ν ⊕ b ν i − Log ν ⊕ ν i ) ⊗ (Log ν ⊕ ν i )+ n − n X i =1 (Log ν ⊕ ν i ) ⊗ (PLog b ν ⊕ b ν i − Log ν ⊕ ν i )+ n − n X i =1 (PLog b ν ⊕ b ν i − Log ν ⊕ ν i ) ⊗ (PLog b ν ⊕ b ν i − Log ν ⊕ ν i )=: A + A + A + A . For A , it can be observed that kA k H ν ⊕ ≤ n − n X i =1 k Log ν ⊕ ν i k ν ⊕ ! n − n X i =1 k PLog b ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ ! . Under the assumption (A10), n − n X i =1 k Log ν ⊕ ν i k ν ⊕ = n − n X i =1 d W ( ν ⊕ , ν i ) E [ d W ( ν ⊕ , ν )] + O p ( n − / ) . (A.1)For the second term, by (3), (18) and (A6), PLog b ν ⊕ b ν i = Log ν ⊕ b ν i − Log ν ⊕ b ν ⊕ . Hence, k PLog b ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ ≤ k Log ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ + 2 k Log ν ⊕ b ν ⊕ − Log ν ⊕ e ν ⊕ k ν ⊕ + 2 k Log ν ⊕ e ν ⊕ k ν ⊕ . By (3), (13) and (15), k Log ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ = d W ( b ν i , ν i ), k Log ν ⊕ b ν ⊕ − Log ν ⊕ e ν ⊕ k ν ⊕ = d W ( b ν ⊕ , e ν ⊕ ) ≤ n − P ni =1 d W ( b ν i , ν i ), and Log ν ⊕ e ν ⊕ = n − P ni =1 Log ν ⊕ ν i . Furthermore, since n − / P ni =1 Log ν ⊕ ν i converges in distribution to a Gaussian measure on L ν ⊕ with mean 0 andcovariance operator C ν (Aldous 1976), k Log ν ⊕ e ν ⊕ k ν ⊕ = O p ( n − ) . (A.2)28 asserstein Regression Furthermore, by (A3)–(A5), n − n X i =1 k PLog b ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) . (A.3)Hence, kA k H ν ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ). Similarly, it can be shown that kA k H ν ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ), and kA k H ν ⊕ = O ( τ m ) + O p ( υ m n − ) + O p ( n − ). By Dauxoiset al. (1982), kA k H ν ⊕ = O p ( n − ) . (A.4)By (A6) and Proposition 1, { b λ j } and { P b φ j } are the eigenvalues and eigenfunctions of P b C ν , forwhich the results follow from Lemmas 4.2 and 4.3 of Bosq (2000). Proof of Theorem 1.
First we observe that (cid:13)(cid:13)(cid:13) Q b β − β (cid:13)(cid:13)(cid:13) ν ⊕ × ν ⊕ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Q b β − K X k =1 J X j =1 b jk ψ k ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ × ν ⊕ + 2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ X k = K +1 ∞ X j = J +1 b jk ψ k ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ × ν ⊕ , where k A k ν ⊕ × ν ⊕ := R D R D A ( s, t ) d ν ⊕ ( s )d ν ⊕ ( t ) for A ∈ L ν ⊕ × ν ⊕ . By (A9), (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ X k = K +1 ∞ X j = J +1 b jk ψ k ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ × ν ⊕ = ∞ X k = K +1 ∞ X j = J +1 b jk = O ( J − ρ +1 K − % +1 ) . Define A = K X k =1 J X j =1 ( b b jk − b jk ) ψ k ⊗ φ j ,A = K X k =1 J X j =1 b jk h P b ψ k ⊗ P b φ j − ψ k ⊗ φ j i ,A = K X k =1 J X j =1 ( b b jk − b jk ) h P b ψ k ⊗ P b φ j − ψ k ⊗ φ j i . Then kQ b β − P Kk =1 P Jj =1 b jk ψ k ⊗ φ j k ν ⊕ × ν ⊕ ≤ k A k ν ⊕ × ν ⊕ +2 k A k ν ⊕ × ν ⊕ +2 k A k ν ⊕ × ν ⊕ . Clearlythe term A is asymptotically dominated by A and A .Note that k A k ν ⊕ × ν ⊕ = P Kk =1 P Jj =1 ( b b jk − b jk ) = P Kk =1 P Jj =1 ( b λ − j b ξ jk − λ − j ξ jk ) . We claimthat ( b ξ jk − ξ jk ) = ( j θ +2 + k ϑ +2 )[ O ( τ m ) + O p ( υ / m n − / ) + O p ( n − )] , (A.5)29 hen, Lin & Müller where the O and O p terms are uniform across j, k . Using the same technique as in the proof ofHall and Horowitz (2007), we define an event E J = E J ( n ) = { λ J ≥ kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ } . On E J , by Lemma 1, b λ j ≥ λ j /
2, for all j = 1 , . . . , J . Note that λ j ≥ Cj − θ for some constant C > ρ, % > / P ∞ k =1 b jk ≤ Cj − ρ P ∞ k =1 k − % ≤ const . by (A9). Hence, on E J , k A k ν ⊕ × ν ⊕ ≤ K X k =1 J X j =1 λ j − b λ j b λ j b jk ! + 2 K X k =1 J X j =1 ( b ξ jk − ξ jk ) b λ j ≤ h O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) i J X j =1 K X k =1 b jk ! b λ − j + h O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) i J X j =1 K X k =1 b λ − j ( j θ +2 + k ϑ +2 ) ≤ h O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) i J X j =1 K X k =1 λ − j ( j θ +2 + k ϑ +2 ) ≤ h O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) i ( J θ +3 K + J θ +1 K ϑ +3 ) . Note that with the choice of J ∼ n / (4 θ +4 ρ +2) , τ m λ − J → υ / m n − / λ − J → n − λ − J →
0, as n → ∞ , under the assumptions of Theorem 1. Thus, Lemma 1 entails that P ( E J ) → n → ∞ .For A , define A = K X k =1 J X j =1 b jk ψ k ⊗ (P b φ j − φ j ) ,A = K X k =1 J X j =1 b jk (P b ψ k − ψ k ) ⊗ φ j ,A = K X k =1 J X j =1 b jk (P b ψ k − ψ k ) ⊗ (P b φ j − φ j ) . Then k A k ν ⊕ × ν ⊕ ≤ k A k ν ⊕ × ν ⊕ + 2 k A k ν ⊕ × ν ⊕ + 2 k A k ν ⊕ × ν ⊕ . It is clear that A is30 asserstein Regression asymptotically dominated by A and A . By (A7), (A9) and Lemma 1, k A k ν ⊕ × ν ⊕ ≤ K X k =1 J X j =1 b jk k P b φ j − φ j k ν ⊕ ≤ J X j =1 j − ρ − θ − K X k =1 k − % kP b C ν − C ν k ν ⊕ = h O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) i J X j =1 j − ρ − θ − . Here, J X j =1 j − ρ − θ − = O ( J − ρ +2 θ +3 ) if θ > ρ − O ( J ) if ρ − / < θ ≤ ρ − O (log J ) if θ = ρ − / O (1) if θ < ρ − / O ( J θ +2 ) . Hence, k A k ν ⊕ × ν ⊕ = O ( τ m J θ +2 ) + O p ( υ / m n − / J θ +2 ) + O p ( n − J θ +2 ) . Similarly, k A k ν ⊕ × ν ⊕ = O ( τ m K ϑ +2 ) + O p ( υ / m n − / K ϑ +2 ) + O p ( n − K ϑ +2 ) . Thus, with J ∼ n / (4 θ +4 ρ +2) and K ∼ n / (4 ϑ +4 % +2) , kQ b β − β k ν ⊕ × ν ⊕ = O ( τ m n α +1 ) + O p ( υ / m n α +1 / ) + O p ( n α ) . To complete the proof, it suffices to show (A.5). By (A6) and Proposition 1, b ξ jk − ξ jk = h n − n X i =1 h PLog b ν ⊕ b ν i , P b φ j i ν ⊕ PLog b ν ⊕ b ν i − E ( h Log ν ⊕ ν , φ j i ν ⊕ Log ν ⊕ ν ) , P b ψ k i ν ⊕ + h E ( h Log ν ⊕ ν , φ j i ν ⊕ Log ν ⊕ ν ) , P b ψ k − ψ k i ν ⊕ =: I + I . Note that E k Log ν ⊕ ν k ν ⊕ = E [ d W ( ν ⊕ , ν )] < ∞ , and by Fubini’s theorem k q E (Log ν ⊕ ν ) k ν ⊕ = E k Log ν ⊕ ν k ν ⊕ = E [ d W ( ν ⊕ , ν )] < ∞ . By Lemma 1, I ≤ E h Log ν ⊕ ν , φ j i ν ⊕ k q E (Log ν ⊕ ν ) k ν ⊕ ·k P b ψ k − ψ k k ν ⊕ ≤ const . k P b ψ k − ψ k k ν ⊕ = k ϑ +2 [ O ( τ m ) + O p ( υ / m n − / ) + O p ( n − )], where the O hen, Lin & Müller and O p terms are uniform across j, k . For I , I ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n − n X i =1 h PLog b ν ⊕ b ν i , P b φ j i ν ⊕ PLog b ν ⊕ b ν i − E ( h Log ν ⊕ ν , φ j i ν ⊕ Log ν ⊕ ν ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n − n X i =1 h PLog b ν ⊕ b ν i , P b φ j i ν ⊕ PLog b ν ⊕ b ν i − n − n X i =1 h Log ν ⊕ ν i , φ j i ν ⊕ Log ν ⊕ ν i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ + 2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n − n X i =1 h Log ν ⊕ ν i , φ j i ν ⊕ Log ν ⊕ ν i − E ( h Log ν ⊕ ν , φ j i ν ⊕ Log ν ⊕ ν ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ =: 2 I + 2 I . Furthermore, I ≤ const . (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n − n X i =1 (cid:16) h PLog b ν ⊕ b ν i , P b φ j i ν ⊕ − h Log ν ⊕ ν i , φ j i ν ⊕ (cid:17) PLog b ν ⊕ b ν i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n − n X i =1 h Log ν ⊕ ν i , φ j i ν ⊕ (cid:16) PLog b ν ⊕ b ν i − Log ν ⊕ ν i (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ ≤ const . " n − n X i =1 (cid:16) h PLog b ν ⊕ b ν i , P b φ j i ν ⊕ − h Log ν ⊕ ν i , φ j i ν ⊕ (cid:17) n − n X i =1 k PLog b ν ⊕ b ν i k ν ⊕ + n − n X i =1 k Log ν ⊕ ν i k ν ⊕ n − n X i =1 k PLog b ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ . Note that n − n X i =1 (cid:16) h PLog b ν ⊕ b ν i , P b φ j i ν ⊕ − h Log ν ⊕ ν i , φ j i ν ⊕ (cid:17) ≤ n − n X i =1 h PLog b ν ⊕ b ν i , P b φ j − φ j i ν ⊕ + 2 n − n X i =1 h PLog b ν ⊕ b ν i − Log ν ⊕ ν i , φ j i ν ⊕ ≤ n − n X i =1 k PLog b ν ⊕ b ν i k ν ⊕ k P b φ j − φ j k ν ⊕ + 2 n − n X i =1 k PLog b ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ . Under (A3)–(A6), and (A10), by (A.1) and (A.3), n − n X i =1 k PLog b ν ⊕ b ν i k ν ⊕ = E [ d W ( ν ⊕ , ν )] + O ( τ m ) + O p ( υ / m n − / ) + O p ( n − / ) . Analogously, we obtain n − P ni =1 k PLog b ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ),and n − P ni =1 k PLog b ν ⊕ b ν i k ν ⊕ = E [ d W ( ν ⊕ , ν )] + O ( τ m ) + O p ( υ / m n − / ) + O p ( n − / ). Further-32 asserstein Regression more, by Lemma 1 and (A7), I = j θ +2 [ O ( τ m ) + O p ( υ / m n − / ) + O p ( n − )] , where the O and O p terms are uniform across j, k . For I , defining an empirical cross-covarianceoperator ˇ C ν ν = n − P ni =1 Log ν ⊕ ν i ⊗ Log ν ⊕ ν i , I = k ˇ C ν ν φ j − C ν ν φ j k ν ⊕ ≤ k ˇ C ν ν − C ν ν k H ν ⊕ ,ν ⊕ = O p ( n − ) , where the O p term is uniform across j, k (Theorems 4.4.5 and 7.7.6, Hsing and Eubank 2015). Proof of Corollary 1.
First note that for two distributions µ , µ ∈ W , if µ is atomless, then d W (Exp µ g, Exp µ P µ ,µ g ) = d W ( µ , µ ). Thus, d W ( b ν ( b µ ) , E ⊕ ( ν | ν = µ ))= d W (cid:18) Exp b ν ⊕ Z D b β ( s, · )Log b ν ⊕ b µ ( s )d b ν ⊕ ( s ) , Exp ν ⊕ [ E (Log ν ⊕ ν | Log ν ⊕ ν = Log ν ⊕ µ )] (cid:19) ≤ d W (cid:18) Exp ν ⊕ P b ν ⊕ ,ν ⊕ Z D b β ( s, · )Log b ν ⊕ b µ ( s )d b ν ⊕ ( s ) , Exp ν ⊕ Z D β ( s, · )Log ν ⊕ µ ( s )d ν ⊕ ( s ) (cid:19) + 2 d W ( b ν ⊕ , ν ⊕ )= 2 (cid:13)(cid:13)(cid:13)(cid:13) P b ν ⊕ ,ν ⊕ Z D b β ( s, · )Log b ν ⊕ b µ ( s )d b ν ⊕ ( s ) − Z D β ( s, · )Log ν ⊕ µ ( s )d ν ⊕ ( s ) (cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ + 2 d W ( b ν ⊕ , ν ⊕ ) . Note that by (A6) and Proposition 1,P b ν ⊕ ,ν ⊕ Z D b β ( s, · )Log b ν ⊕ b µ ( s )d b ν ⊕ ( s ) = Z D Q b β ( s, · )PLog b ν ⊕ b µ ( s )d ν ⊕ ( s ) . Hence, d W ( b ν ( b µ ) , E ⊕ ( ν | ν = µ )) ≤ (cid:13)(cid:13)(cid:13)(cid:13)Z D Q b β ( s, · )PLog b ν ⊕ b µ ( s )d ν ⊕ ( s ) − Z D β ( s, · )Log ν ⊕ µ ( s )d ν ⊕ ( s ) (cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ + 2 d W ( b ν ⊕ , ν ⊕ ) ≤ (cid:13)(cid:13)(cid:13)(cid:13)Z D h Q b β ( s, · ) − β ( s, · ) i PLog b ν ⊕ b µ ( s )d ν ⊕ ( s ) (cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ + 4 (cid:13)(cid:13)(cid:13)(cid:13)Z D β ( s, · ) h PLog b ν ⊕ b µ ( s ) − Log ν ⊕ µ ( s ) i d ν ⊕ ( s ) (cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ + 2 d W ( b ν ⊕ , ν ⊕ ) ≤ k PLog b ν ⊕ b µ k ν ⊕ kQ b β − β k ν ⊕ × ν ⊕ + 4 k PLog b ν ⊕ b µ − Log ν ⊕ µ k ν ⊕ k β k ν ⊕ × ν ⊕ + 2 d W ( b ν ⊕ , ν ⊕ ) , hen, Lin & Müller where k · k ν ⊕ × ν ⊕ is defined as in the proof of Theorem 1. Here, k PLog b ν ⊕ b µ k ν ⊕ ≤ k PLog b ν ⊕ b µ − Log ν ⊕ µ k ν ⊕ + 2 k Log ν ⊕ µ k ν ⊕ = 2 d W ( ν ⊕ , µ ) + O ( τ m ) + O p ( υ / m ) + O p ( n − ) . By (A9), k β k ν ⊕ × ν ⊕ < ∞ . By analogy with the proof of (A.3), under (A3)–(A6), k PLog b ν ⊕ b µ − Log ν ⊕ µ k ν ⊕ = O ( τ m ) + O p ( υ / m ) + O p ( n − ). Lastly, d W ( b ν ⊕ , ν ⊕ ) = O ( τ m n − ) + O p ( υ / m n − / ) + O p ( n − ), which completes the proof by Theorem 1. Appendix B: Proofs for Section 4
In this section, we denote P b µ ⊕ ,µ ⊕ g by P g , for g ∈ L b µ ⊕ . Lemma 2.
Assume (A3), (A4), (A6), (B1), (B3), and (B6)–(B7). Furthermore, assume that theeigenvalues { λ j } are distinct, respectively. Then kP ( b µ ⊕ , b µ ⊕ ) , ( µ ⊕ ,µ ⊕ ) b C − C k H µ ⊕ = O ( τ m ) + O p ( n − / ) . Furthermore, sup j ≥ | b λ j − λ j | ≤ kP ( b µ ⊕ , b µ ⊕ ) , ( µ ⊕ ,µ ⊕ ) b C − C k H µ ⊕ , k P b φ j − φ j k µ ⊕ ≤ kP ( b µ ⊕ , b µ ⊕ ) , ( µ ⊕ ,µ ⊕ ) b C − C k H µ ⊕ / min ≤ j ≤ j { λ j − λ j +1 } , for all j ≥ . Proof.
The proof is analogous to the proof of Lemma 1. By the third statement in Proposition 1,under (B1) and (A6), P ( b µ ⊕ , b µ ⊕ ) , ( µ ⊕ ,µ ⊕ ) b C − C = n − n X i =1 (Log µ ⊕ µ i ) ⊗ (Log µ ⊕ µ i ) − C + n − n X i =1 (PLog b µ ⊕ b µ i − Log µ ⊕ µ i ) ⊗ (Log µ ⊕ µ i )+ n − n X i =1 (Log µ ⊕ µ i ) ⊗ (PLog b µ ⊕ b µ i − Log µ ⊕ µ i )+ n − n X i =1 (PLog b µ ⊕ b µ i − Log µ ⊕ µ i ) ⊗ (PLog b µ ⊕ b µ i − Log µ ⊕ µ i )=: A + A + A + A . asserstein Regression Note that kA k H µ ⊕ ≤ n − n X i =1 k Log µ ⊕ µ i k µ ⊕ ! n − n X i =1 k PLog b µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ ! . For the first term on the right hand side, under (B6), E [ d W ( µ ⊕ , µ i )] ≤ C / , var[ d W ( µ ⊕ , µ i )] ≤ C ,and | cov[ d W ( µ ⊕ , µ i ) , d W ( µ ⊕ , µ l )] | ≤ C . Hence, n − P ni =1 k Log µ ⊕ µ i k µ ⊕ = n − P ni =1 d W ( µ ⊕ , µ i ) = O p (1).For the second term, by (3) and (18) and (A6), PLog b µ ⊕ b µ i = Log µ ⊕ b µ i − Log µ ⊕ b µ ⊕ . Hence, k PLog b µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ ≤ k Log µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ +2 k Log µ ⊕ b µ ⊕ − Log µ ⊕ e µ ⊕ k µ ⊕ +2 k Log µ ⊕ e µ ⊕ k µ ⊕ .By (3), (13) and (15), k Log µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ = d W ( b µ i , µ i ), Log µ ⊕ e µ ⊕ = n − P ni =1 Log µ ⊕ µ i , and k Log µ ⊕ b µ ⊕ − Log µ ⊕ e µ ⊕ k µ ⊕ = d W ( b µ ⊕ , e µ ⊕ ) ≤ n − P ni =1 d W ( b µ i , µ i ). By Theorem 3.10 of Bosq (2000), k Log µ ⊕ e µ ⊕ k µ ⊕ = O p ( n − ). Hence, n − n X i =1 k PLog b µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ ≤ n − n X i =1 d W ( b µ i , µ i ) + 4 n − n X i =1 d W ( b µ i , µ i ) + O p ( n − ) . Note that E { cov[ d W ( µ i , b µ i ) , d W ( µ i + k , b µ i + k ) | µ i , µ i + k ] } = 0. Under (A3), E [ d W ( µ i , b µ i ) | µ i ] = O ( τ m ) uniformly in i , i.e., there exists a constant C > m > E [ d W ( µ i , b µ i ) | µ i ] ≤ C τ m , for all m ≥ m and i ≥
1. Furthermore by (B7), for all m ≥ m ,sup i ∈ N + (cid:12)(cid:12)(cid:12) cov { C − τ − m E [ d W ( µ i , b µ i ) | µ i ] , C − τ − m E [ d W ( µ i + k , b µ i + k ) | µ i + k ] } (cid:12)(cid:12)(cid:12) ≤ C u k , with some constants C > u ∈ (0 ,
1) (Volkonskii and Rozanov 1959), and hence n − n − X i =1 n − i X k =1 (cid:12)(cid:12)(cid:12) cov[ d W ( µ i , b µ i ) , d W ( µ i + k , b µ i + k )] (cid:12)(cid:12)(cid:12) ≤ C C τ m n − n − X k =1 ( n − k ) u k = O ( τ m n − ) . Again applying (A3), n − P ni =1 var[ d W ( b µ i , µ i )] = O ( υ m n − ) and hence var (cid:2) n − P ni =1 d W ( b µ i , µ i ) (cid:3) = O ( υ m n − ). Thus, n − n X i =1 k PLog b µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) , and hence kA k H µ ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ). Similarly, we find kA k H µ ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ), and A is asymptotically dominated by A and A . Lastly, kA k H µ ⊕ = O p ( n − ) follows from Theorem 4.9 of Bosq (2000).35 hen, Lin & MüllerProof of Theorem 2. The proof follows similar arguments to Theorem 1. First we observe (cid:13)(cid:13)(cid:13) Q b β − β (cid:13)(cid:13)(cid:13) µ ⊕ × µ ⊕ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Q b β − J X l =1 J X j =1 b jl φ l ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ ⊕ × µ ⊕ + 2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ X l = J +1 ∞ X j = J +1 b jl φ l ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ ⊕ × µ ⊕ , where k A k µ ⊕ × µ ⊕ := R D R D A ( s, t ) d µ ⊕ ( s )d µ ⊕ ( t ) for A ∈ L µ ⊕ × µ ⊕ . By (B5), (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ X l = J +1 ∞ X j = J +1 b jl φ l ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ ⊕ × µ ⊕ = ∞ X l = J +1 ∞ X j = J +1 b jl = O ( J − ρ − % +2 ) . Define A = P Jl =1 P Jj =1 ( b b jl − b jl ) φ l ⊗ φ j , A = P Jl =1 P Jj =1 b jl (P b φ l ⊗ P b φ j − φ l ⊗ φ j ), and A = P Jl =1 P Jj =1 ( b b jl − b jl )(P b φ l ⊗ P b φ j − φ l ⊗ φ j ). Then (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Q b β − J X l =1 J X j =1 b jl φ l ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ ⊕ × µ ⊕ ≤ k A k µ ⊕ × µ ⊕ + 2 k A k µ ⊕ × µ ⊕ + 2 k A k µ ⊕ × µ ⊕ , where clearly A is asymptotically dominated by A and A .For A , note that k A k µ ⊕ × µ ⊕ = J X l =1 J X j =1 ( b b jl − b jl ) = J X l =1 J X j =1 ( b λ − j b ξ jl − b jl ) . Analogous to the proof of (A.5), we derive the convergence rate of ( b ξ jl − ξ jl ) . Note that b ξ jl − ξ jl = h ( n − − n − X i =1 h PLog b µ ⊕ b µ i , P b φ j i µ ⊕ PLog b µ ⊕ b µ i +1 − E ( h Log µ ⊕ µ i , φ j i µ ⊕ Log µ ⊕ µ i +1 ) , P b φ l i µ ⊕ + h E ( h Log µ ⊕ µ i , φ j i µ ⊕ Log µ ⊕ µ i +1 ) , P b φ l − φ l i µ ⊕ =: I + I . By (B6) and Lemma 2, I ≤ const . k P b φ l − φ l k µ ⊕ . For I , I ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( n − − n − X i =1 h PLog b µ ⊕ b µ i , P b φ j i µ ⊕ PLog b µ ⊕ b µ i +1 − ( n − − n − X i =1 h Log µ ⊕ µ i , φ j i µ ⊕ Log µ ⊕ µ i +1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ ⊕ + 2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( n − − n − X i =1 h Log µ ⊕ µ i , φ j i µ ⊕ Log µ ⊕ µ i +1 − E ( h Log µ ⊕ µ i , φ j i µ ⊕ Log µ ⊕ µ i +1 ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ ⊕ =: 2 I + 2 I . asserstein Regression Here, analogous to the proof of Theorem 1, I ≤ const . " n − − n − X i =1 (cid:16) k PLog b µ ⊕ b µ i k µ ⊕ k P b φ j − φ j k µ ⊕ + k PLog b µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ (cid:17) · n − X i =1 k PLog b µ ⊕ b µ i +1 k µ ⊕ + 2( n − − n − X i =1 k Log µ ⊕ µ i k µ ⊕ n − X i =1 k PLog b µ ⊕ b µ i +1 − Log µ ⊕ µ i +1 k µ ⊕ ≤ O p (1) " k P b φ j − φ j k µ ⊕ + ( n − − n − X i =1 k PLog b µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ + ( n − − n − X i =1 k PLog b µ ⊕ b µ i +1 − Log µ ⊕ µ i +1 k µ ⊕ . For I , defining an empirical auto-covariance operatorˇ C = ( n − − n − X i =1 Log µ ⊕ µ i +1 ⊗ Log µ ⊕ µ i ,I = k ˇ C φ j − C φ j k µ ⊕ ≤ k ˇ C − C k H µ ⊕ = O p ( n − ), uniformly in j, l (Theorem 4.9, Bosq 2000).Thus by Lemma 2,( b ξ jl − ξ jl ) = ( j θ +2 + l θ +2 ) h O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) i , where the O and O p terms are uniform across j, l . Following arguments analogous to the proof ofTheorem 1, we find k A k µ ⊕ × µ ⊕ = O ( τ m J θ +4 ) + O p ( υ / m n − / J θ +4 ) + O p ( n − J θ +4 ).For A , define A = J X l =1 J X j =1 b jl φ l ⊗ (P b φ j − φ j ) ,A = J X l =1 J X j =1 b jl (P b φ l − φ l ) ⊗ φ j ,A = J X l =1 J X j =1 b jl (P b φ l − φ l ) ⊗ (P b φ j − φ j ) . Then k A k µ ⊕ × µ ⊕ ≤ k A k µ ⊕ × µ ⊕ + 2 k A k µ ⊕ × µ ⊕ + 2 k A k µ ⊕ × µ ⊕ , and A is asymptotically dom-inated by A and A . Analogous to the proof of Theorem 1, it can be verified that k A k µ ⊕ × µ ⊕ = O ( τ m J θ +2 ) + O p ( υ / m n − / J θ +2 ) + O p ( n − J θ +2 ) , k A k µ ⊕ × µ ⊕ = O ( τ m J θ +2 ) + O p ( υ / m n − / J θ +2 ) + O p ( n − J θ +2 ) . hen, Lin & Müller REFERENCES
We omit the proof of Corollary 2, since it is analogous to that of Corollary 1.
References
Aggarwal, O. P. (1955). Some minimax invariant procedures for estimating a cumulative distri-bution function.
Annals of Mathematical Statistics Aldous, D. J. (1976). A characterisation of Hilbert space using the central limit theorem.
Journalof the London Mathematical Society Ambrosio, L. , Gigli, N. and
Savaré, G. (2004). Gradient flows with metric and differentiablestructures, and applications to the Wasserstein space.
Atti Accad. Naz. Lincei Cl. Sci. Fis. Mat.Natur. Rend. Lincei (9) Mat. Appl Ambrosio, L. , Gigli, N. and
Savaré, G. (2008).
Gradient Flows: in Metric Spaces and in theSpace of Probability Measures . Springer.
Bachoc, F. , Gamboa, F. , Loubes, J.-M. and
Venet, N. (2017). A Gaussian process regressionmodel for distribution inputs.
IEEE Transactions on Information Theory Banerjee, M. , Chakraborty, R. , Ofori, E. , Okun, M. S. , Viallancourt, D. E. and
Ve-muri, B. C. (2016). A nonlinear regression technique for manifold valued data with applicationsto medical image analysis. In
Proceedings of the IEEE Conference on Computer Vision andPattern Recognition . 4424–4432.
Bergmeir, C. , Hyndman, R. J. and
Koo, B. (2018). A note on the validity of cross-validationfor evaluating autoregressive time series prediction.
Computational Statistics & Data Analysis
Bigot, J. , Gouet, R. , Klein, T. and
López, A. (2017). Geodesic PCA in the Wasserstein spaceby convex PCA.
Annales de l’Institut Henri Poincaré, Probabilités et Statistiques Bogin, A. , Doerner, W. and
Larson, W. (2019). Local house price dynamics: New indicesand stylized facts.
Real Estate Economics Bosq, D. (2000).
Linear Processes in Function Spaces: Theory and Applications . Springer-Verlag,New York.
Cai, T. and
Hall, P. (2006). Prediction in functional linear regression.
Annals of Statistics Cardot, H. , Ferraty, F. , Mas, A. and
Sarda, P. (2003). Testing hypotheses in the functionallinear model.
Scandinavian Journal of Statistics Cardot, H. , Ferraty, F. and
Sarda, P. (1999). Functional linear model.
Statistics & ProbabilityLetters Cazelles, E. , Seguy, V. , Bigot, J. , Cuturi, M. and
Papadakis, N. (2018). Geodesic PCA38
EFERENCES
Wasserstein Regression versus log-PCA of histograms in the Wasserstein space.
SIAM Journal on Scientific Computing B429–B456.
Chen, Z. , Bao, Y. , Li, H. and
Spencer Jr, B. F. (2019). LQD-RKHS-based distribution-to-distribution regression methodology for restoring the probability distributions of missing shmdata.
Mechanical Systems and Signal Processing
Cheng, C. and
Parzen, E. (1997). Unified estimators of smooth quantile and quantile densityfunctions.
Journal of Statistical Planning and Inference Chiou, J.-M. and
Müller, H.-G. (2009). Modeling hazard rates as functional data for the analy-sis of cohort lifetables and mortality forecasting.
Journal of the American Statistical Association
Cornea, E. , Zhu, H. , Kim, P. and
Ibrahim, J. G. (2017). Regression models on Riemanniansymmetric spaces.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) Dauxois, J. , Pousse, A. and
Romain, Y. (1982). Asymptotic theory for the principal componentanalysis of a vector random function: Some applications to statistical inference.
Journal ofMultivariate Analysis Davis, B. C. , Fletcher, P. T. , Bullitt, E. and
Joshi, S. (2007). Population shape regressionfrom random design data. In . Falk, M. (1983). Relative efficiency and deficiency of kernel type estimators of smooth distributionfunctions.
Statistica Neerlandica Falk, M. (1984). Relative deficiency of kernel type estimators of quantiles.
Annals of Statistics Ferraty, F. and
Vieu, P. (2003). Functional nonparametric statistics: A double infinite di-mensional framework. In
Recent Advances and Trends in Nonparametric Statistics . Elsevier,61–76.
Fréchet, M. (1948). Les éléments aléatoires de nature quelconque dans un espace distancié.
Annales de l’Institut Henri Poincaré Grenander, U. (1950). Stochastic processes and statistical inference.
Arkiv för Matematik Hall, P. and
Horowitz, J. L. (2007). Methodology and convergence rates for functional linearregression.
Annals of Statistics He, G. , Müller, H.-G. and
Wang, J.-L. (2003). Functional canonical analysis for squareintegrable stochastic processes.
Journal of Multivariate Analysis He, G. , Müller, H.-G. , Wang, J.-L. and
Yang, W. (2010). Functional linear regression viacanonical analysis.
Bernoulli Hinkle, J. , Muralidharan, P. , Fletcher, P. T. and
Joshi, S. (2012). Polynomial regressionon Riemannian manifolds. In
Computer Vision–ECCV 2012 . Springer, 1–14.39 hen, Lin & Müller
REFERENCES
Hsing, T. and
Eubank, R. (2015).
Theoretical Foundations of Functional Data Analysis, with anIntroduction to Linear Operators . John Wiley & Sons.
Huckemann, S. F. (2015). (Semi-)intrinsic statistical analysis on non-Euclidean spaces. In
Ad-vances in Complex Data Modeling and Computational Methods in Statistics . Springer, 103–118.
Hyndman, R. J. , Booth, H. and
Yasmeen, F. (2013). Coherent mortality forecasting: theproduct-ratio method with functional time series models.
Demography Kloeckner, B. (2010). A geometric study of Wasserstein spaces: Euclidean spaces.
Annali dellaScuola Normale Superiore di Pisa-Classe di Scienze Kokoszka, P. , Miao, H. , Petersen, A. and
Shang, H. L. (2019). Forecasting of densityfunctions with an application to cross-sectional and intraday returns.
International Journal ofForecasting Leblanc, A. (2012). On estimating distribution functions using Bernstein polynomials.
Annalsof the Institute of Statistical Mathematics Lin, L. , St. Thomas, B. , Zhu, H. and
Dunson, D. B. (2017). Extrinsic local regression onmanifold-valued data.
Journal of the American Statistical Association
Lin, Z. and
Yao, F. (2017). Functional regression on manifold with contamination. arXiv preprintarXiv:1704.03005 . Lin, Z. and
Yao, F. (2018). Intrinsic Riemannian functional data analysis. arXiv preprintarXiv:1812.01831 . Marron, J. S. and
Alonso, A. M. (2014). Overview of object oriented data analysis.
BiometricalJournal McCann, R. J. (1997). A convexity principle for interacting gases.
Advances in Mathematics
Morris, J. S. (2015). Functional regression.
Annual Review of Statistics and Its Application Oikarinen, E. , Bourassa, S. C. , Hoesli, M. and
Engblom, J. (2018). US metropolitan houseprice dynamics.
Journal of Urban Economics
Oliva, J. , Neiswanger, W. , Póczos, B. , Schneider, J. and
Xing, E. (2014). Fast distributionto real regression. In
Artificial Intelligence and Statistics . 706–714.
Ouellette, N. and
Bourbeau, R. (2011). Changes in the age-at-death distribution in four lowmortality countries: A nonparametric approach.
Demographic Research Panaretos, V. M. and
Zemel, Y. (2016). Amplitude and phase variation of point processes.
Annals of Statistics Parzen, E. (1979). Nonparametric statistical data modeling.
Journal of the American StatisticalAssociation Petersen, A. , Chen, C.-J. and
Müller, H.-G. (2019). Quantifying and visualizing intraregionalconnectivity in resting-state Functional Magnetic Resonance Imaging with correlation densities.40
EFERENCES
Wasserstein RegressionBrain Connectivity Petersen, A. and
Müller, H.-G. (2016). Functional data analysis for density functions bytransformation to a Hilbert space.
Annals of Statistics Petersen, A. and
Müller, H.-G. (2019a). Fréchet regression for random objects with Euclideanpredictors.
Annals of Statistics Petersen, A. and
Müller, H.-G. (2019b). Wasserstein covariance for multiple random densities.
Biometrika
Póczos, B. , Singh, A. , Rinaldo, A. and
Wasserman, L. A. (2013). Distribution-free distri-bution regression. In
AISTATS . 507–515.
Ramsay, J. O. and
Dalzell, C. J. (1991). Some tools for functional data analysis.
Journal ofthe Royal Statistical Society: Series B (Statistical Methodology) Read, R. (1972). The asymptotic inadmissibility of the sample distribution function.
Annals ofMathematical Statistics Rosenblatt, M. (1956). A central limit theorem and a strong mixing condition.
Proceedings ofthe National Academy of Sciences of the United States of America Shang, H. L. and
Hyndman, R. J. (2017). Grouped functional time series forecasting: Anapplication to age-specific mortality rates.
Journal of Computational and Graphical Statistics Shi, X. , Styner, M. , Lieberman, J. , Ibrahim, J. G. , Lin, W. and
Zhu, H. (2009). Intrinsicregression models for manifold-valued data. In
Medical Image Computing and Computer-AssistedIntervention–MICCAI 2009 . Springer, 192–199.
Steinke, F. and
Hein, M. (2009). Non-parametric regression between manifolds. In
Advances inNeural Information Processing Systems . 1561–1568.
Steinke, F. , Hein, M. and
Schölkopf, B. (2010). Nonparametric regression between generalRiemannian manifolds.
SIAM Journal on Imaging Sciences Sturm, K.-T. (2003). Probability measures on metric spaces of nonpositive curvature.
HeatKernels and Analysis on Manifolds, Graphs, and Metric Spaces (Paris, 2002)
Szabó, Z. , Sriperumbudur, B. K. , Póczos, B. and
Gretton, A. (2016). Learning theory fordistribution regression.
Journal of Machine Learning Research Thi Thien Trang, B. , Loubes, J.-M. , Risser, L. and
Balaresque, P. (2019). Distribu-tion regression model with a Reproducing Kernel Hilbert Space approach.
Communications inStatistics-Theory and Methods in Press.
Villani, C. (2003).
Topics in Optimal Transportation . American Mathematical Society.
Volkonskii, V. and
Rozanov, Y. A. (1959). Some limit theorems for random functions. I.
Theory of Probability & Its Applications Wang, J.-L. , Chiou, J.-M. and
Müller, H.-G. (2016). Review of functional data analysis.
Annual Review of Statistics and Its Application hen, Lin & Müller REFERENCES
Yang, S.-S. (1985). A smooth nonparametric estimator of a quantile function.
Journal of theAmerican Statistical Association Yao, F. , Müller, H.-G. and
Wang, J.-L. (2005). Functional linear regression analysis forlongitudinal data.
Annals of Statistics Yuan, M. and
Cai, T. T. (2010). A reproducing kernel Hilbert space approach to functionallinear regression.
Annals of Statistics Yuan, Y. , Zhu, H. , Lin, W. and
Marron, J. (2012). Local polynomial regression for sym-metric positive definite matrices.
Journal of the Royal Statistical Society: Series B (StatisticalMethodology) Zemel, Y. and
Panaretos, V. M. (2019). Fréchet means and Procrustes analysis in Wassersteinspace.
Bernoulli25