[PDF] Wasserstein Regression

Abstract

The analysis of samples of random objects that do not lie in a vector space has found increasing attention in statistics in recent years. An important class of such object data is univariate probability measures defined on the real line. Adopting the Wasserstein metric, we develop a class of regression models for such data, where random distributions serve as predictors and the responses are either also distributions or scalars. To define this regression model, we utilize the geometry of tangent bundles of the metric space of random measures with the Wasserstein metric. The proposed distribution-to-distribution regression model provides an extension of multivariate linear regression for Euclidean data and function-to-function regression for Hilbert space valued data in functional data analysis. In simulations, it performs better than an alternative approach where one first transforms the distributions to functions in a Hilbert space and then applies traditional functional regression. We derive asymptotic rates of convergence for the estimator of the regression coefficient function and for predicted distributions and also study an extension to autoregressive models for distribution-valued time series. The proposed methods are illustrated with data on human mortality and distributions of house prices.

Full PDF

WWasserstein Regression ∗ Yaqing Chen , Zhenhua Lin , and Hans-Georg Müller Department of Statistics, University of California, Davis Department of Statistics and Applied Probability, National University of Singapore

Abstract

The analysis of samples of random objects that do not lie in a vector space has foundincreasing attention in statistics in recent years. An important class of such object data isunivariate probability measures deﬁned on the real line. Adopting the Wasserstein metric,we develop a class of regression models for such data, where random distributions serve aspredictors and the responses are either also distributions or scalars. To deﬁne this regressionmodel, we utilize the geometry of tangent bundles of the metric space of random measures withthe Wasserstein metric. The proposed distribution-to-distribution regression model provides anextension of multivariate linear regression for Euclidean data and function-to-function regressionfor Hilbert space valued data in functional data analysis. In simulations, it performs betterthan an alternative approach where one ﬁrst transforms the distributions to functions in aHilbert space and then applies traditional functional regression. We derive asymptotic rates ofconvergence for the estimator of the regression coeﬃcient function and for predicted distributionsand also study an extension to autoregressive models for distribution-valued time series. Theproposed methods are illustrated with data on human mortality and distributions of houseprices.

Keywords:

Distribution regression; Wasserstein geometry; random measures; parallel transport;tangent bundles; functional data analysis; density transformation. ∗ Research supported by NIH Echo and NSF DMS1712862. a r X i v : . [ s t a t . M E ] J un hen, Lin & Müller Regression analysis is one of the foundational tools of statistics to quantify the relationship betweena response variable and predictors and there have been many extensions of simple models such asthe multiple linear regression model to more complex data scenarios. These include linear modelsfor function-to-function regression, where predictors and responses are both considered randomelements in Hilbert space, with a variant where responses are scalars (Grenander 1950; Ramsayand Dalzell 1991). Such linear functional regression models and their properties have been wellstudied (Cardot et al. 1999, 2003; Yao et al. 2005; Cai and Hall 2006; Hall and Horowitz 2007) andreviewed (Morris 2015; Wang et al. 2016).Samples that include random objects, which are random elements in general metric spaces thatby default do not have a vector space structure, are increasingly common. Such data cannot beanalyzed with methods devised for Euclidean or functional data, which are usually viewed as randomelements of a Hilbert space (Marron and Alonso 2014; Huckemann 2015). We focus here on thecase where the random objects are random probability measures on the real line that satisfy certainregularity conditions. Speciﬁcally, at this time there are no in-depth studies with detailed statisticalanalysis of regression models that feature such random measures as predictors, in contrast to thesituation where vector predictors are coupled with random distributions as responses (Petersen andMüller 2019a).Related work also includes a variety of methods that speciﬁcally target the case where Euclideanpredictors are paired with responses that reside on a ﬁnite-dimensional Riemannian manifold (Daviset al. 2007; Shi et al. 2009; Hinkle et al. 2012; Yuan et al. 2012; Cornea et al. 2017; Lin et al. 2017).Kernel and spline type methods have been proposed for the case where both predictors and responsesare elements of ﬁnite-dimensional Riemannian manifolds (Steinke and Hein 2009; Steinke et al. 2010;Banerjee et al. 2016). However, these methods do not cover spaces of probability measures underthe Wasserstein metric, where the tangent spaces are subspaces of inﬁnite-dimensional Hilbertspaces. Additionally, no comprehensive investigation of the statistical properties and asymptoticbehavior of distribution-to-distribution regression models seems to exist. To develop the proposedmodel, we utilize tangent bundles in the space of probability distributions with the Wassersteinmetric and parallel transport to obtain asymptotic results for regression coeﬃcient functions andpredicted measures.A recent approach to including random distributions as predictors in complex regression modelsis to transform the densities of these distributions to unconstrained functions in the Hilbert space L , e.g., by the log quantile density (lqd) transformation (Petersen and Müller 2016) and then toemploy functional regression models where the transformed functions serve as predictors and theresponses are either also the transformed functions or scalars (Chen et al. 2019; Kokoszka et al.2019; Petersen et al. 2019), whence established methods for functional regression become applicable.In contrast to this transformation approach, we develop here an alternative method that is closely2 asserstein Regression adapted to the underlying geometry, where we utilize the geometric properties of the metric spaceof random measures equipped with the Wasserstein distance. Indeed, we found in implementationsand simulations that this adapted geometric method works better for regression modeling whencomparing it to the “unadapted” transformation approaches based on Petersen and Müller (2016).Other alternatives have been considered for regressing scalar responses on distribution-valued pre-dictors (Póczos et al. 2013; Oliva et al. 2014; Szabó et al. 2016; Bachoc et al. 2017; Thi Thien Tranget al. 2019), but these are either Nadaraya-Watson type kernel estimators that suﬀer from a severecurse of dimensionality, kernel-based embedding, or moment-based methods, which are generic andnot adapted to the speciﬁc features of distributions as predictors.In this paper, we develop a distribution-to-distribution regression model that is analogous totraditional linear regression models for Euclidean and functional data, with the decisive diﬀer-ence that both predictors and responses are univariate probability measures. We also discuss anextension of our approach to an autoregressive model for distribution-valued time series. In ourestimation procedures and theoretical analysis we cover the commonly encountered but more com-plex situation where neither predictor nor response distributions are directly observed and insteadthe available data consist of i.i.d. samples that are generated by each of these distributions.The paper is organized as follows. We ﬁrst propose a distribution-to-distribution regressionmodel based on the tangent bundle of the Wasserstein space of probability distributions in Section2, with estimation and asymptotic theory in Section 3, and then describe an extension of the modelto an autoregressive model for time series of distributions in Section 4. Implementation detailsare discussed in Section 5, followed by a simulation study in Section 6 to assess the ﬁnite-sampleperformance of the proposed estimators and a competing approach. The wide applicability of theproposed methods is demonstrated with applications to human mortality data and US house pricedata in Section 7. Let D be R or a closed interval in R , and B ( D ) be the Borel σ -algebra on D . We focus onthe Wasserstein space W = W ( D ) of probability distributions on ( D, B ( D )) with ﬁnite secondmoments, endowed with the L -Wasserstein distance d W ( µ , µ ) = (cid:26)Z [ F − ( p ) − F − ( p )] d p (cid:27) / , (1)for µ , µ ∈ W , where F − and F − denote the quantile functions of µ and µ , respectively, whichwe assume to be well deﬁned; speciﬁcally, for any distribution µ = µ ( F ) ∈ W with cumulativedistribution function (cdf) F , we consider the quantile function F − to be the left continuous3 hen, Lin & Müller inverse of F , i.e., F − ( p ) = inf { r ∈ D : F ( r ) ≥ p } , for p ∈ (0 , . (2)As demonstrated for example in Ambrosio et al. (2008); Bigot et al. (2017); Zemel and Panaretos(2019), basic concepts of Riemannian manifolds can be generalized to the Wasserstein space W . Weassume in the following µ ⊕ is a reference probability measure in W that has a continuous cdf F ⊕ .Then the tangent space at µ ⊕ is deﬁned as the closed span of functions of the form F − ◦ F ⊕ − id,where F − is the quantile function of an arbitrary distribution µ ∈ W (Equation (8.5.1), Ambrosioet al. 2008): T µ ⊕ = { t ( F − ◦ F ⊕ − id) : µ ∈ W , t > } L µ ⊕ , where L µ ⊕ = L µ ⊕ ( D ) is the Hilbert space of µ ⊕ -square-integrable functions on D ⊂ R , with innerproduct h· , ·i µ ⊕ and norm k·k µ ⊕ . By the deﬁnition of the Wasserstein space W and the atomlessnessof the reference measure µ ⊕ , both F − ◦ F ⊕ and id belong to L µ ⊕ . Thus, the tangent space T µ ⊕ isa subspace of L µ ⊕ equipped with the same inner product and induced norm.The exponential map Exp µ ⊕ : T µ ⊕ → W is then deﬁned by the push-forward map via thefunction g + id for g ∈ T µ ⊕ : Exp µ ⊕ g = ( g + id) µ ⊕ , where for a measurable function h : D → D , h µ ⊕ is a push-forward measure such that h µ ⊕ ( A ) = µ ⊕ ( { r ∈ D : h ( r ) ∈ A } ) for any set A ∈ B ( D ). Unlike Riemannian manifolds, the exponentialmap here is not a local homeomorphism (Ambrosio et al. 2004). However, any µ ∈ W withquantile function F − can be recovered from µ ⊕ by Exp µ ⊕ ( F − ◦ F ⊕ − id), in the sense that d W (Exp µ ⊕ ( F − ◦ F ⊕ − id) , µ ) = d W (( F − ◦ F ⊕ ) µ ⊕ , µ ) = 0 due to the continuity of F ⊕ . Thus, thelogarithmic map Log µ ⊕ : W → T µ ⊕ , as the inverse of the exponential map, can be deﬁned as:Log µ ⊕ µ = F − ◦ F ⊕ − id , (3)for every µ ∈ W .Although not injective on the entire tangent space T µ ⊕ the exponential map restricted to thelog image Exp µ ⊕ | Log µ ⊕ ( W ) is an isometric homeomorphism (e.g., see Lemma 2.1, Bigot et al. 2017).Thus, geodesics can be deﬁned for the Wasserstein space W as the image of Exp µ ⊕ of line segmentsin the tangent space T µ ⊕ , (1 − t )Log µ ⊕ µ + t Log µ ⊕ µ , with µ , µ ∈ W . Speciﬁcally, the geodesic γ µ ⊕ [ µ , µ ] : [0 , → W from µ to µ is given by γ µ ⊕ [ µ , µ ]( t ) = Exp µ ⊕ [(1 − t )Log µ ⊕ µ + t Log µ ⊕ µ ] , (4)4 asserstein Regression for t ∈ [0 , γ µ ⊕ [ µ , µ ] is invariant with respect to the choice of atomless reference measure µ ⊕ and isunique (e.g., Villani 2003). Let ( ν , ν ) be a pair of random elements in W with joint distribution F on W × W , assumed to besquare integrable in the sense that E d W ( µ, ν ) < ∞ and E d W ( µ, ν ) < ∞ for some (and thus forall) µ ∈ W . Any element in W that minimizes E d W ( · , ν ) is called a Fréchet mean of ν (Fréchet1948). Since the Wasserstein space W is a Hadamard space (Kloeckner 2010), there exists a uniqueminimizer of E d W ( · , ν ) (Sturm 2003), i.e., the Fréchet mean of ν , ν ⊕ = argmin µ ∈W E d W ( µ, ν ).Analogously, we denote the Fréchet mean of ν by ν ⊕ = argmin µ ∈W E d W ( µ, ν ). It is well-knownthat for univariate distributions as we consider here, the quantile function of the Fréchet mean issimply the expected quantile function of the random distribution: F − ⊕ ( · ) = E F − ( · ) and F − ⊕ ( · ) = E F − ( · ) , where F − ⊕ , F − ⊕ , F − and F − are the quantile functions of ν ⊕ , ν ⊕ , ν and ν , respectively.As we aim to develop a regression model where the predictors and responses are both distri-butions in W , a good starting point is linear regression in Euclidean spaces, where for a pair ofrandom elements ( X, Y ) ∈ ( R p , R ), E ( Y | X ) = Γ E ( X ) = E Y + β > ( X − E X ). The regression func-tion Γ E can be characterized by the following two properties: First, it maps the expectation of X to expectation of Y ; second, conditioning on X , it transports the line segment between E X and X to that between E Y and E [ Y | X ]. Speciﬁcally,(E1) E Y = Γ E ( E X );(E2) E [ Y + t ( Y − E Y ) | X ] = Γ E [ E X + t ( X − E X )], for all t ∈ [0 , E [ Y + t ( Y − E Y ) | X ], we deﬁne its counterpart in the Wasserstein spaceby E ⊕ { γ ν ⊕ [ ν ⊕ , ν ]( t ) | ν } := Exp ν ⊕ [ E ( t Log ν ⊕ ν | Log ν ⊕ ν )]. Accordingly, a regression operatorΓ W : W → W for the Wasserstein space would be expected to satisfy:(W1) d W ( ν ⊕ , Γ W ( ν ⊕ )) = 0;(W2) d W ( E ⊕ { γ ν ⊕ [ ν ⊕ , ν ]( t ) | ν } , Γ W { γ ν ⊕ [ ν ⊕ , ν ]( t ) } ) = 0, for all t ∈ [0 , ν ⊕ ν ⊕ = 0, ν ⊕ -a.e., and Log ν ⊕ ν ⊕ = 0, ν ⊕ -a.e.(A1) The Fréchet means ν ⊕ and ν ⊕ are atomless.5 hen, Lin & Müller Note that Exp µ (0) = µ for any µ ∈ W . Furthermore, since exponential maps restricted to the cor-responding log image spaces are isometric homeomorphisms, d W ( ν ⊕ , Γ W ( ν ⊕ )) = k Log ν ⊕ ◦ Γ W ◦ Exp ν ⊕ (0) k ν ⊕ , and d W ( E ⊕ { γ ν ⊕ [ ν ⊕ , ν ]( t ) | ν } , Γ W { γ ν ⊕ [ ν ⊕ , ν ]( t ) } ) = k E ( t Log ν ⊕ ν | Log ν ⊕ ν ) − Log ν ⊕ ◦ Γ W ◦ Exp ν ⊕ ( t Log ν ⊕ ν ) k ν ⊕ .Then we consider a regression operator between tangent spaces, Γ : T ν ⊕ → T ν ⊕ (Γ = Log ν ⊕ ◦ Γ W ◦ Exp ν ⊕ | T ν ⊕ ) such that(T1) k Γ(0) k ν ⊕ = 0;(T2) k E ( t Log ν ⊕ ν | Log ν ⊕ ν ) − Γ( t Log ν ⊕ ν ) k ν ⊕ = 0, for all t ∈ [0 , T ν ⊕ and T ν ⊕ are subspaces of L ν ⊕ and L ν ⊕ , respec-tively. Distribution-to-distribution regression can then be viewed as function-to-function regression,which has been well-studied in functional data analysis (see, e.g., Ferraty and Vieu 2003; Yao et al.2005; He et al. 2010; Wang et al. 2016). Speciﬁcally, the proposed distribution regression model is E (Log ν ⊕ ν | Log ν ⊕ ν ) = Γ(Log ν ⊕ ν ) , (5)where Γ : T ν ⊕ → T ν ⊕ is a linear operator deﬁned asΓ g ( t ) = h β ( · , t ) , g i ν ⊕ , for t ∈ D and g ∈ T ν ⊕ . (6)Here, β : D → R is a coeﬃcient function (i.e., the kernel of Γ) lying in L ν ⊕ × ν ⊕ , and ν ⊕ × ν ⊕ is a product probability measure on the product measurable space ( D , B ( D )) generated by ν ⊕ and ν ⊕ .It is clear that (T1) holds for the proposed model. However, (T2) holds only if Γ(Log ν ⊕ W ) ⊆ Log ν ⊕ W . This may not be the case, especially when D is compact, which is common in realdata, due to the fact that Log ν ⊕ W is a closed and convex subset of the tangent space T ν ⊕ (e.g.,Proposition 2.1 of Bigot et al. 2017). For our theoretical analysis, we therefore make the followingassumption for the regression operator Γ.(A2) For the regression operator Γ, the image Γ(Log ν ⊕ ν ) is in Log ν ⊕ W with probability 1.In practice, we propose a projection method to resolve the issue that the estimate of Γ(Log ν ⊕ ν )may not lie in the log image Log ν ⊕ W . We will discuss further details in Section 5. Since the cdfs F ⊕ and F ⊕ are continuous by (A1), it follows that E (Log ν ⊕ ν ) = 0, ν ⊕ -a.e., and E (Log ν ⊕ ν ) = 0, ν ⊕ -a.e. Denote the covariance operators of Log ν ⊕ ν and Log ν ⊕ ν by C ν = E (Log ν ⊕ ν ⊗ Log ν ⊕ ν ) and C ν = E (Log ν ⊕ ν ⊗ Log ν ⊕ ν ), respectively, and the cross-covarianceoperator by C ν ν = E (Log ν ⊕ ν ⊗ Log ν ⊕ ν ). Since C ν and C ν are trace-class operators, the two6 asserstein Regression covariance operators and hence the cross-covariance operator have the following eigendecomposition(Theorem 7.2.6, Hsing and Eubank 2015) C ν = ∞ X j =1 λ j φ j ⊗ φ j , C ν = ∞ X k =1 ς k ψ k ⊗ ψ k , C ν ν = ∞ X k =1 ∞ X j =1 ξ jk ψ k ⊗ φ j , (7)with eigenvalues λ j = E [ h Log ν ⊕ ν , φ j i ν ⊕ ], ς k = E [ h Log ν ⊕ ν , ψ k i ν ⊕ ] such that λ ≥ λ ≥ · · · ≥ ς ≥ ς ≥ · · · ≥

0, eigenfunctions { φ j } ∞ j =1 and { ψ k } ∞ k =1 that form orthonormal systems in thetangent spaces T ν ⊕ and T ν ⊕ , respectively, and ξ jk = E [ h Log ν ⊕ ν , φ j i ν ⊕ h Log ν ⊕ ν , ψ k i ν ⊕ ]. Withprobability 1, the log transformations Log ν ⊕ ν and Log ν ⊕ ν admit the Karhunen-Loève expansionLog ν ⊕ ν = ∞ X j =1 h Log ν ⊕ ν , φ j i ν ⊕ φ j and Log ν ⊕ ν = ∞ X k =1 h Log ν ⊕ ν , ψ k i ν ⊕ ψ k . Then as in the classical functional regression (e.g., Bosq 2000; Yao et al. 2005), the regressioncoeﬃcient function β can be expressed as β = ∞ X k =1 ∞ X j =1 b jk ψ k ⊗ φ j , (8)with b jk = λ − j ξ jk . In order to guarantee that the right hand side of (8) converges in the sense thatlim J,K →∞ Z D Z D  K X k =1 J X j =1 b jk φ j ( s ) ψ k ( t ) − β ( s, t )  d ν ⊕ ( s )d ν ⊕ ( t ) = 0 , we assume (Lemma A.2, Yao et al. 2005) ∞ X k =1 ∞ X j =1 λ − j ξ jk < ∞ . (9)For economy, we use the same notation g ⊗ g for the operator and its kernel throughoutthis paper. Namely, for g ∈ L µ and g ∈ L µ , g ⊗ g can represent either an operator on L µ such that ( g ⊗ g )( g ) = h g , g i µ g for g ∈ L µ or its kernel, i.e., a bivariate function such that( g ⊗ g )( s, t ) = g ( t ) g ( s ) for all s, t ∈ D .A variant of the proposed distribution-to-distribution regression in (5) is the pairing of distri-butions as predictors with scalar responses. For a pair of random elements ( ν , Y ) ∈ W × R withjoint distribution G , a distribution-to-scalar regression model is E ( Y | Log ν ⊕ ν ) = E ( Y ) + h β , Log ν ⊕ ν i ν ⊕ . hen, Lin & Müller Here, ν ⊕ is the Fréchet mean of ν and β : D → R is a regression coeﬃcient function in L ν ⊕ given by β = P ∞ j =1 c j φ j , where λ j and φ j are the eigenvalues and eigenfunctions of the covarianceoperator C ν of Log ν ⊕ ν as in (7), and c j = λ − j h E ( Y Log ν ⊕ ν ) , φ j i ν ⊕ , for which we assumethat P ∞ j =1 λ − j h E ( Y Log ν ⊕ ν ) , φ j i ν ⊕ < ∞ . This model can also be viewed as function-to-scalarregression, which has been well studied in functional data analysis (Cardot et al. 1999, 2003; Caiand Hall 2006; Hall and Horowitz 2007; Yuan and Cai 2010; Lin and Yao 2017). While Bigot et al. (2017) assume distributions are fully observed, in reality this is usually not thecase, and this creates an additional challenge for the implementation of the proposed distribution-to-distribution regression model. Options to address this include estimating cdfs (e.g., Aggarwal 1955;Read 1972; Falk 1983; Leblanc 2012), or estimating quantile functions (e.g., Parzen 1979; Falk 1984;Yang 1985; Cheng and Parzen 1997) of the underlying distributions. Given an estimated quantilefunction b F − (resp. cdf b F ), we convert it to a cdf (resp. a quantile function) by right (resp. left)continuous inversion, b F ( r ) = sup { p ∈ [0 ,

1] : b F − ( p ) ≤ r } , for r ∈ R (10)(resp. (2)). Alternatively, one can start with a density estimator to estimate densities (Panaretosand Zemel 2016; Petersen and Müller 2016) and then compute the cdfs and quantile functions byintegration and inversion.Suppose { ( ν i , ν i ) } ni =1 are n independent realizations of ( ν , ν ). What we observe are collec-tions of independent measurements { X il } m ν i l =1 and { Y il } m ν i l =1 , sampled from ν i and ν i , respectively,where m ν i and m ν i are the sample sizes which may vary across distributions. Note that there aretwo independent layers of randomness in the data: The ﬁrst generates independent pairs of dis-tributions ( ν i , ν i ); the second generates independent observations according to each distribution, X il ∼ ν i and Y il ∼ ν i .For a distribution µ ∈ W , denote by b µ = µ ( b F ) the distribution associated with some cdfestimate b F , based on a sample of measurements drawn according to µ . Using b ν i and b ν i assurrogates of ν i and ν i , the theoretical analysis on the estimation of the distribution-to-distributionregression coeﬃcient function requires the following assumptions that quantify the discrepancy ofthe estimated and true probability measures.(A3) For any distribution µ ∈ W , with some nonnegative decreasing sequences τ m = o (1) and υ m = o (1) as m → ∞ such that υ m ≥ τ m , the corresponding estimate b µ based on a sample of8 asserstein Regression size m drawn according to µ satisﬁessup µ ∈W E [ d W ( b µ, µ )] = O ( τ m ) , and sup µ ∈W var[ d W ( b µ, µ )] = O ( υ m ) . In order to obtain the uniform rates as in (A3), we also require (following Panaretos and Zemel2016; Petersen and Müller 2016)(A4) The domain D is a compact interval.For example, the distribution estimator proposed by Panaretos and Zemel (2016) satisﬁes (A3)with τ m = m − / and υ m = m − , while Petersen and Müller (2016) consider a subset W ac M of W containing distributions that are absolutely continuous with respect to Lebesgue measure on D such that sup µ ∈W ac M sup r ∈ D µ max { f µ ( r ) , /f µ ( r ) , | f µ ( r ) |} ≤ M, (11)where f µ is the density function of a distribution µ ∈ W ac M , D µ is the support of distribution µ and M > µ ∈W ac M E d W ( b µ, µ ) = O ( m − / ) andsup µ ∈W ac M var[ d W ( b µ, µ )] = O ( m − / ) in (A3) (Proposition 1, Petersen and Müller 2019b).Since 2 n distributions need to be simultaneously estimated, we also make the following assump-tion on the numbers of measurements per distribution m ν i and m ν i .(A5) There exists a sequence m = m ( n ) such that min { m ν i , m ν i : i = 1 , , . . . , n } ≥ m and m → ∞ as n → ∞ . Given the independent realizations { ( ν i , ν i ) : i = 1 , . . . , n } of ( ν , ν ), we ﬁrst consider an oracleestimator for the regression coeﬃcient function β , where we initially assume that { ( ν i , ν i ) } ni =1 arefully observed. First of all, the empirical Fréchet means are well-deﬁned and unique due to the factthat we work in Hadamard spaces, and have explicit solutions e ν ⊕ = arg min µ ∈W n X i =1 d W ( ν i , µ ) and e ν ⊕ = arg min µ ∈W ∞ X i =1 d W ( ν i , µ ) . (12)Speciﬁcally, the quantile functions of the empirical Fréchet means are the empirical means of quan-tile functions across the sample: e F − ⊕ ( · ) = 1 n n X i =1 F − i ( · ) and e F − ⊕ ( · ) = 1 n n X i =1 F − i ( · ) , (13)9 hen, Lin & Müller and the corresponding distribution functions are given by right continuous inverse of the quan-tile functions as in (10). Then the log transforms Log ν ⊕ ν i and Log ν ⊕ ν i admit estimatesLog e ν ⊕ ν i and Log e ν ⊕ ν i . The covariance operators C ν and C ν can be estimated by e C ν = n − P ni =1 Log e ν ⊕ ν i ⊗ Log e ν ⊕ ν i and e C ν = n − P ni =1 Log e ν ⊕ ν i ⊗ Log e ν ⊕ ν i , which also admitthe eigendecompositions e C ν = P ∞ j =1 e λ j e φ j ⊗ e φ j and e C ν = P ∞ k =1 e ς k e ψ k ⊗ e ψ k , respectively, for e λ ≥ e λ ≥ · · · ≥ e ς ≥ e ς ≥ · · · ≥ C ν , its inverse is not bounded, leading to an ill-posed problem (e.g.,He et al. 2003; Wang et al. 2016). Regularization is thus needed and can be achieved throughtruncation. An oracle estimator for the regression coeﬃcient function β is e β = K X k =1 J X j =1 e b jk e ψ k ⊗ e φ j , (14)where e b jk = e λ − j e ξ jk , with e ξ jk = n − P ni =1 h Log e ν ⊕ ν i , e φ j i e ν ⊕ h Log e ν ⊕ ν i , e ψ k i e ν ⊕ , and J and K are thetruncation bounds.Furthermore, we can construct an estimator based on the distribution estimation in Section 3.1which will be applicable in practical situations, where typically ν i and ν i are observed in the formof samples generated from ν i and ν i . Denote the estimated quantile functions by b F − i and b F − i ,respectively. Then the quantile functions of the empirical Fréchet means b ν ⊕ and b ν ⊕ of b ν i and b ν i for i = 1 , . . . , n are given by b F − ⊕ ( · ) = 1 n n X i =1 b F − i ( · ) and b F − ⊕ ( · ) = 1 n n X i =1 b F − i ( · ) , (15)and the corresponding distribution functions b F ⊕ and b F ⊕ can be obtained by right continuousinversion as per (10).Replacing ν i and ν i by the corresponding estimates b ν i and b ν i in (14), we can analogouslyobtain the estimates for the covariance operators with corresponding eigendecompositions b C ν = n − P ni =1 Log b ν ⊕ b ν i ⊗ Log b ν ⊕ b ν i = P ∞ j =1 b λ j b φ j ⊗ b φ j , and b C ν = n − P ni =1 Log b ν ⊕ b ν i ⊗ Log b ν ⊕ b ν i = P ∞ k =1 b ς k b ψ k ⊗ b ψ k . A realistic estimator of the regression coeﬃcient function β is then b β = K X k =1 J X j =1 b b jk b ψ k ⊗ b φ j , (16)where b b jk = b λ − j b ξ jk , with b ξ jk = n − P ni =1 h Log b ν ⊕ b ν i , b φ j i b ν ⊕ h Log b ν ⊕ b ν i , b ψ k i b ν ⊕ , and the correspond-ing estimator b Γ for the regression operator Γ in (6) is b Γ g ( t ) = h g, b β ( · , t ) i b ν ⊕ , for g ∈ Log b ν ⊕ W and t ∈ D. (17)10 asserstein Regression We will discuss selection of the number of included functional principal components (FPCs) corre-sponding to J and K in implementations of the proposed method in Section 5.1. Note that the true and estimated regression operators are deﬁned in diﬀerent tangent spaces,Γ : T ν ⊕ → T ν ⊕ and b Γ : T b ν ⊕ → T b ν ⊕ . To compare them, we employ parallel transport, which isa commonly used tool for data on manifolds (Yuan et al. 2012; Petersen and Müller 2019b). Fortwo probability distributions µ , µ ∈ W , a parallel transport operator can be deﬁned between theentire Hilbert spaces L µ and L µ , i.e., P µ ,µ : L µ → L µ byP µ ,µ g := g ◦ F − ◦ F , for g ∈ L µ , (18)where F − and F are the quantile function of µ and cdf of µ , respectively. Assuming that µ isatomless, restricted to the tangent space T µ , the parallel transport operator P µ ,µ | T µ deﬁnes theparallel transport from tangent space T µ to T µ .Denote by H µ ,µ the space of all Hilbert-Schmidt operators from T µ to T µ , for µ , µ ∈ W .With µ , µ , µ , µ ∈ W where µ and µ are atomless, we can deﬁne a parallel transport operator P ( µ ,µ ) , ( µ ,µ ) from H µ ,µ to H µ ,µ by( P ( µ ,µ ) , ( µ ,µ ) A ) g = P µ ,µ ( A (P µ ,µ g )) , (19)for g ∈ T µ and A ∈ H µ ,µ . Denoting the Hilbert-Schmidt norm on H µ ,µ by k · k H µ ,µ , for µ , µ ∈ W , properties of parallel transport operators P µ ,µ and P ( µ ,µ ) , ( µ ,µ ) that are relevantto the theory are as follows. Proposition 1.

With probability measures µ , µ , µ , µ ∈ W , the parallel transport operators P µ ,µ and P ( µ ,µ ) , ( µ ,µ ) , as deﬁned in (18) and (19) , have the following properties:1. If µ is atomless, P µ ,µ is unitary, i.e., h P µ ,µ g, P µ ,µ h i µ = h g, h i µ , for g, h ∈ L µ .

2. If µ and µ are both atomless, then the parallel transport P µ ,µ from L µ to L µ is theadjoint operator of P µ ,µ , i.e., h P µ ,µ g , g i µ = h g , P µ ,µ g i µ , for g ∈ L µ and g ∈ L µ .

3. If µ , µ , and µ are atomless, given any positive integers N and N , for g j ∈ T µ , j =11 hen, Lin & Müller , · · · , N and g k ∈ T µ , k = 1 , · · · , N , P ( µ ,µ ) , ( µ ,µ )  N X j =1 N X k =1 c jk g k ⊗ g j  = N X j =1 N X k =1 c jk (P µ ,µ g k ) ⊗ (P µ ,µ g j ) .

4. If µ , µ , µ , and µ are all atomless, for A ∈ H µ ,µ and A ∈ H µ ,µ , kP ( µ ,µ ) , ( µ ,µ ) A − A k H µ ,µ = kP ( µ ,µ ) , ( µ ,µ ) A − Ak H µ ,µ . Related results for ﬁnite-dimensional Riemannian manifolds are in Proposition 2 of Lin andYao (2018). These four statements can be easily checked by the deﬁnition of parallel transportoperators in (18) and (19); we omit the proof. Given atomless distributions µ , µ , µ , µ ∈ W ,applying Proposition 1, the discrepancy between operators A ∈ H µ ,µ and A ∈ H µ ,µ can bequantiﬁed in the space H µ ,µ by kP ( µ ,µ ) , ( µ ,µ ) A − Ak H µ ,µ . Our goal for the theory is to evaluate the performance of the estimated regression operator b Γ,as deﬁned in (17). According to the discussion in Section 3.3, if (A1) holds and the estimatedFréchet means b ν ⊕ and b ν ⊕ are both atomless, then the discrepancy between the estimated andtrue regression operators b Γ and Γ can be gauged by kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b Γ − Γ k H ν ⊕ ,ν ⊕ , which isequal to the L ν ⊕ × ν ⊕ -norm between the corresponding kernels.To guarantee the atomlessness of the estimated Fréchet means b ν ⊕ and b ν ⊕ and hence to justifythe use of kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b Γ − Γ k H ν ⊕ ,ν ⊕ as a measure of the estimation error, we require thefollowing uniform Lipschitz condition on the estimated cdfs:(A6) For any distribution µ ∈ W , the corresponding estimate b µ based on a sample of size m drawnaccording to µ satisﬁes sup µ ∈W | b F ( r ) − b F ( r ) | ≤ L m | r − r | , for any r , r ∈ D , where b F isthe cdf of b µ and L m are a positive sequence only depending on the sample size m .For example, for the estimator proposed by Panaretos and Zemel (2016), the Lipschitz constant L m in (A6) can be taken as Ch − m + 1, where C is a constant and h m is a bandwidth satisfying h m → m → ∞ .Deﬁning Q b β = K X k =1 J X j =1 b b jk (P b ν ⊕ ,ν ⊕ b ψ k ) ⊗ (P b ν ⊕ ,ν ⊕ b φ j ) , we note that under (A1) and (A6), the third statement in Proposition 1 indicates that the kernel12 asserstein Regression of the transported regression operator estimator P ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b Γ is Q b β . Hence, kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b Γ − Γ k H ν ⊕ ,ν ⊕ = Z D Z D [ Q b β ( s, t ) − β ( s, t )] d ν ⊕ ( s )d ν ⊕ ( t ) . (20)Thus, for the theory, we will focus on the convergence rate of the data-based estimator b β for theregression coeﬃcient function by quantifying the asymptotic order of the right hand side of (20). Todo this, we require the following conditions regarding the spacing and decay rates of the eigenvalues λ j and ς k and the coeﬃcients b jk . Conditions of this type are standard in traditional functionallinear regression (e.g., Hall and Horowitz 2007).(A7) For j ≥ λ j − λ j +1 ≥ Cj − θ − , where C and θ are positive constants.(A8) For k ≥ ς k − ς k +1 ≥ Ck − ϑ − , where C and ϑ are positive constants.(A9) For j, k ≥ | b jk | ≤ Cj − ρ k − % , where ρ, % > / C > ν i and ν i aroundtheir Fréchet means, we also assume that d W ( ν ⊕ , ν ) and d W ( ν ⊕ , ν ) have ﬁnite variance, i.e.,(A10) E [ d W ( ν ⊕ , ν )] < ∞ , E [ d W ( ν ⊕ , ν )] < ∞ .Then we obtain Theorem 1.

Assume (A1)–(A10). Furthermore, deﬁning α = max { (4 θ +3) / (4 θ +4 ρ +2)+1 / (4 ϑ +4 % +2) , (2 θ +1) / (4 θ +4 ρ +2)+(2 ϑ +3) / (4 ϑ +4 % +2) , (2 θ +2) / (4 θ +4 ρ +2)+(2 ϑ +2) / (4 ϑ +4 % +2) }− ,assume α < and that τ m n α +1 → , υ / m n α +1 / → , as n → ∞ . If J ∼ n / (4 θ +4 ρ +2) , K ∼ n / (4 ϑ +4 % +2) , Z D Z D h Q b β ( s, t ) − β ( s, t ) i d ν ⊕ ( s )d ν ⊕ ( t ) = O ( τ m n α +1 ) + O p ( υ / m n α +1 / ) + O p ( n α ) . Here, τ m and υ m are as in (A3). If the estimator b β is based on the distribution estimation methodemployed by Panaretos and Zemel (2016), then the rate in Theorem 1 becomes O ( m − / n α +1 ) + O p ( m − / n α +1 / ) + O p ( n α ) and becomes O p ( n α ) if m ∼ n . Assuming ν i , ν i ∈ W ac M (as deﬁnedin (11)) in addition, the density estimator used by Petersen and Müller (2016, 2019b) entails theconvergence rate O ( m − / n α +1 ) + O p ( m − / n α +1 / ) + O p ( n α ), which becomes O p ( n α ) if m ∼ n / .If θ = ϑ and ρ = % , then α = (1 − ρ ) / (2 θ + 2 ρ + 1) = − ρ − / (2 θ + 2)] − . Thus, theregression coeﬃcient function estimator b β has a rate of convergence closer to √ n if the eigenvalues λ j and ς k decay with a slower rate and the coeﬃcients b jk decline faster. Furthermore, taking λ j = ς j = j − a with a > θ = ϑ = 2 a −

1. If ρ = % = 2 ca + 1 / c >

0, then α = − c/ (1 + c ), so that under certain conditions α may approach − hen, Lin & MüllerRemark . When the distributions { ν i , ν i } ni =1 are fully observed, we need fewer assumptions asdescribed in the following. First of all, we do not need to assume the compactness of the domain;among the assumptions of Theorem 1, it is enough to assume (A1)–(A2) and (A7)–(A10) in thiscase. To ensure the atomlessness of the empirical Fréchet means e ν ⊕ and e ν ⊕ , we assume thatthere exists a large constant M > W M := { µ ∈ W : | F ( r ) − F ( r ) | ≤ M | r − r | , for any r , r ∈ D } , where F is the cdfof µ . In particular, we assume that ν i , ν i ∈ W M for all i . Deﬁning e Γ g ( t ) = h g, e β ( · , t ) i e ν ⊕ for g ∈ Log e ν ⊕ W and t ∈ D , and Q e β = P Kk =1 P Jj =1 e b jk (P e ν ⊕ ,ν ⊕ e ψ k ) ⊗ (P e ν ⊕ ,ν ⊕ e φ j ), by Proposition 1 wecan quantify the estimation error of the estimated regression operator e Γ by kP ( e ν ⊕ , e ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) e Γ − Γ k H ν ⊕ ,ν ⊕ = R D R D [ Q e β ( s, t ) − β ( s, t )] d ν ⊕ ( s )d ν ⊕ ( t ). We then ﬁnd that with J ∼ n / (4 θ +4 ρ +2) , K ∼ n / (4 ϑ +4 % +2) , and α deﬁned as in Theorem 1, Z D Z D h Q e β ( s, t ) − β ( s, t ) i d ν ⊕ ( s )d ν ⊕ ( t ) = O p ( n α ) . Theorem 1 entails the following corollary on the prediction of ν given ν = µ for a distribution µ ∈ W , where we are interested in the rate of convergence when estimating E ⊕ ( ν | ν = µ ) :=Exp ν ⊕ [ E (Log ν ⊕ ν | Log ν ⊕ ν = Log ν ⊕ µ )]. For the following result, the corresponding estimate b µ of µ is assumed to be based on a sample of m µ ≥ m observations drawn from µ satisfying (A3),where m is the lower bound of the number of observations per distribution as per (A5). We denotethe prediction of ν by b ν ( b µ ) := Exp b ν ⊕ [ b Γ(Log b ν ⊕ b µ )], where b Γ is deﬁned in (17).

Corollary 1.

Under the assumptions of Theorem 1, d W ( b ν ( b µ ) , E ⊕ ( ν | ν = µ )) = O ( τ m n α +1 ) + O p ( υ / m n α +1 / ) + O p ( n α ) . Proofs and auxiliary lemmas for this section are in Appendix A.

Here we consider a dependent sequence µ , µ , . . . ∈ W , assumed to be square integrable in thesense that E d W ( µ, µ i ) < ∞ for some (and thus for all) µ ∈ W , i = 1 , . . . , n , and have a commonFréchet mean µ ⊕ , which satisﬁes(B1) The Fréchet mean µ ⊕ is atomless.Furthermore, assuming that (Log µ ⊕ µ i ) i ≥ is a stationary process (as per Deﬁnition 2.4 Bosq 2000),we extend the distribution-to-distribution regression model in (5) to a ﬁrst order autoregressivemodel, Log µ ⊕ µ i +1 = Γ (cid:16) Log µ ⊕ µ i (cid:17) + ε i , for i = 1 , , . . . (21)14 asserstein Regression Here, Γ : T µ ⊕ → T µ ⊕ is a linear operator deﬁned asΓ g ( t ) = h β ( · , t ) , g i µ ⊕ , for t ∈ D and g ∈ T µ ⊕ , (22)where β : D → R is the auto-regression coeﬃcient kernel lying in L µ ⊕ × µ ⊕ , and ε i are i.i.d. randomelements in the tangent space T µ ⊕ . Similar models have been previously studied in the seminalwork of Bosq (2000).To ensure the existence of such an autoregressive process, we assume that there exists an integer q ≥ k Γ q k L µ ⊕ <

1, where k · k L µ ⊕ denotes the sup norm for linear operators on L µ ⊕ and we deﬁne Γ q by induction, Γ k ( · ) = Γ[Γ k − ( · )], for any integer k > µ ⊕ µ i ) falls in the closed convex set Log µ ⊕ W for all i withprobability 1.As in Section 3, we have E (Log µ ⊕ µ ) = 0, µ ⊕ -almost surely. Denote the covariance operatorand cross-covariance operator of order 1 by C = E (Log µ ⊕ µ ⊗ Log µ ⊕ µ ) and C = E (Log µ ⊕ µ ⊗ Log µ ⊕ µ ), respectively, where C has the eigendecomposition C = ∞ X j =1 λ j φ j ⊗ φ j , with eigenvalues λ ≥ λ ≥ · · · ≥ { φ j } ∞ j =1 . With probability 1,the logarithmic transforms Log µ ⊕ µ i admit the expansionLog µ ⊕ µ i = ∞ X j =1 h Log µ ⊕ µ i , φ j i µ ⊕ φ j , i = 1 , , . . . The cross-covariance operator C hence admits the decomposition C = P ∞ l =1 P ∞ j =1 ξ jl φ l ⊗ φ j , where ξ jl = E ( h Log µ ⊕ µ , φ j i µ ⊕ h Log µ ⊕ µ , φ l i µ ⊕ ). With b jl = λ − j ξ jl , the regression coeﬃcient function canthen be expressed as β = ∞ X l =1 ∞ X j =1 b jl φ l ⊗ φ j . For estimation of the regression coeﬃcient function β , ﬁrst considering a fully observed sequenceof length n , µ , µ , . . . , µ n , with the oracle estimator of the Fréchet mean e µ ⊕ deﬁned analogouslyto (12), the autocovariance operators C and C can be estimated by their empirical counterparts e C = n − P ni =1 Log e µ ⊕ µ i ⊗ Log e µ ⊕ µ i and e C = ( n − − P n − i =1 Log e µ ⊕ µ i +1 ⊗ Log e µ ⊕ µ i , with the decom-positions e C = P ∞ j =1 e λ j e φ j ⊗ e φ j and e C = P ∞ l =1 P ∞ j =1 e ξ jl e φ l ⊗ e φ j , respectively, for e λ ≥ e λ ≥ · · · ≥ e ξ jl = ( n − − P n − i =1 h Log e µ ⊕ µ i , e φ j i e µ ⊕ · h Log e µ ⊕ µ i +1 , e φ l i e µ ⊕ . Then an oracle estimator for the15 hen, Lin & Müller regression coeﬃcient function β is e β = J X l =1 J X j =1 e b jl e φ l ⊗ e φ j , where e b jl = e λ − j e ξ jl , and J is the truncation bound.As discussed for the independent case in Section 3.2, a realistic estimator b β for β based onthe distribution estimation discussed in Section 3.1 can be obtained by replacing µ i and µ ⊕ withthe corresponding estimates b µ i and b µ ⊕ , the latter analogous to (15). Speciﬁcally, estimates for theautocovariance operators with corresponding decompositions are given by b C = n − P ni =1 Log b µ ⊕ b µ i ⊗ Log b µ ⊕ b µ i = P ∞ j =1 b λ j b φ j ⊗ b φ j and b C = ( n − − P n − i =1 Log b µ ⊕ b µ i +1 ⊗ Log b µ ⊕ b µ i = P ∞ j =1 b ξ jl b φ l ⊗ b φ j , with b ξ jl = ( n − − P n − i =1 h Log b µ ⊕ b µ i , b φ j i b µ ⊕ h Log b µ ⊕ b µ i +1 , b φ l i b µ ⊕ . A data-based estimator of the regressioncoeﬃcient function β is then given by b β = J X l =1 J X j =1 b b jl b φ l ⊗ b φ j , (23)with b b jl = b λ − j b ξ jl , and the corresponding estimator b Γ for the operator Γ in (22) is b Γ g ( t ) = h g, b β ( · , t ) i b µ ⊕ , for g ∈ Log b µ ⊕ W and t ∈ D. (24)To derive the convergence rate of the estimator b β , we require the following assumptions analo-gous to the independent case as in Section 3.(B3) There exists a sequence m = m ( n ) such that for the number of measurements per distribution m µ i , min { m µ i : i = 1 , , . . . , n } ≥ m and m → ∞ as n → ∞ .(B4) For j ≥ λ j − λ j +1 ≥ Cj − θ − , where C and θ are positive constants.(B5) For j, l ≥ | b jl | ≤ Cj − ρ l − % , where ρ, % > / C > E [ d W ( µ ⊕ , µ i )] ≤ C , for all i = 1 , , . . . , where C > n − P ni =1 d W ( µ i , b µ i ) (Rosenblatt 1956; Bosq 2000).(B7) The sequence (Log µ ⊕ µ i ) i ≥ is geometrically strongly mixing in the sense that for any integer k ≥

1, sup i ∈ N + sup A ∈ σ (Log µ ⊕ µ l : l ≤ i ) A ∈ σ (Log µ ⊕ µ l : l ≥ i + k ) | P ( A ∩ A ) − P ( A ) P ( A ) | ≤ Cu k , asserstein Regression where C > u ∈ (0 ,

1) are constants, and given a collection V of random elements, σ ( V )denotes the σ -ﬁeld generated by V .Here, as in Section 3, we deﬁne Q b β := P Jl =1 P Jj =1 b b jl (P b µ ⊕ ,µ ⊕ b φ l ) ⊗ (P b µ ⊕ ,µ ⊕ b φ j ). Then we obtain Theorem 2.

Assume (A3), (A4), (A6) and (B1)–(B7). Furthermore, deﬁning ζ = (1 − ρ − % ) / (2 θ + ρ + % + 1) , assume ζ < and that τ m n ζ +1 → , υ / m n ζ +1 / → , as n → ∞ . If J ∼ n / (4 θ +2 ρ +2 % +2) , Z D Z D h Q b β ( s, t ) − β ( s, t ) i d µ ⊕ ( s )d µ ⊕ ( t ) = O ( τ m n ζ +1 ) + O p ( υ / m n ζ +1 / ) + O p ( n ζ ) . The convergence rate obtained for the estimated coeﬃcient function of the autoregressive modelis very similar to that in Theorem 1. This is due to (B7) which entails var[ n − P ni =1 d W ( b µ i , µ i )]is of the same order as in the independent case. A similar discussion as after Theorem 1 appliesregarding speciﬁc rates of convergence. Remark . Analogous to the independent case, if µ i are all fully observed, we can obtain theconvergence rate of the oracle estimator e Γ without assumption (A4). In addition, we do not needthe strong mixing condition (B7) regarding the variance of n − P ni =1 d W ( µ i , b µ i ). Deﬁne Q e β = P Jl =1 P Jj =1 e b jl (P e µ ⊕ ,µ ⊕ e φ l ) ⊗ (P e µ ⊕ ,µ ⊕ e φ j ). Assuming (B1)–(B6) and that µ i ∈ W M for all i , it can beshown that with J ∼ n / (4 θ +2 ρ +2 % +2) and ζ deﬁned in Theorem 2, Z D Z D h Q e β ( s, t ) − β ( s, t ) i d µ ⊕ ( s )d µ ⊕ ( t ) = O p ( n ζ ) . As for the independent case, we derive the asymptotic results for the one-on-one prediction of µ n +1 given µ n as follows. Here, we denote E ⊕ ( µ n +1 | µ n ) := Exp µ ⊕ [ E (Log µ ⊕ µ n +1 | Log µ ⊕ µ n )], anddenote the prediction of µ n +1 by b µ n +1 ( b µ n ) := Exp b µ ⊕ [ b Γ(Log b µ ⊕ b µ n )] with b Γ as per (24).

Corollary 2.

Under the assumptions of Theorem 2, d W ( b µ n +1 ( b µ n ) , E ⊕ ( µ n +1 | µ n )) = O ( τ m n ζ +1 ) + O p ( υ / m n ζ +1 / ) + O p ( n ζ ) . Proofs and auxiliary lemmas for Theorem 2 and Corollary 2 are in Appendix B.

Choosing J and K for the Independent Case The following data-based method for choosing the numbers of predictor and response FPCs J and K , respectively, as deﬁned in (17), when one has an i.i.d. sample of distributions { ν i , ν i } ni =1 ,was found to be adequate in practical applications. We ﬁrst choose the number of FPCs K hen, Lin & Müller for the logarithms of the response distributions Log b ν ⊕ ν i , by applying either leave-one-curve-outcross-validation or thresholding according to cumulative fraction of variance explained (FVE). Forcross-validation, the objective function to be minimized is the discrepancy between predicted andobserved trajectories, i.e., K = argmin K n X i =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Log b ν ⊕ b ν i − K X k =1 h Log b ν ⊕ b ν i , b ψ k, − i i b ν ⊕ b ψ k, − i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) b ν ⊕ , where { b ψ k, − i } k ≥ are the estimated eigenfunctions from the functional principal component analysis(FPCA) of the i th-curve-left-out sample { Log b ν ⊕ b ν i } i = i . For the choice by FVE, K is chosen suchthat at least 100(1 − α )% of the variance is explained by the ﬁrst K FPCs, where users need tospecify α , with the common choice α = 0 . J , we use leave-one-curve-out cross-validation by minimizing thediﬀerence between the curves constructed with FPCs predicted by linear regression and the observedtrajectories: J = argmin J n X i =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Log b ν ⊕ b ν i − K X k =1 J X j =1 b b jk, − i h Log b ν ⊕ b ν i , b φ j, − i i b ν ⊕ b ψ k, − i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) b ν ⊕ , where b b jk, − i = b λ − j, − i b ξ jk, − i with b ξ jk, − i = ( n − − P i = i h Log b ν ⊕ b ν i , b φ j, − i i b ν ⊕ ·h Log b ν ⊕ b ν i , b ψ k, − i i b ν ⊕ ,and { b λ j, − i } j ≥ and { b φ j, − i } j ≥ are the estimated eigenvalues and eigenfunctions from the FPCA ofthe i th-curve-left-out sample { Log b ν ⊕ b ν i } i = i , respectively.We mention that in practical implementations we replace leave-one-curve-out cross validationby 5-fold cross-validation when n > Choosing J for Distribution-Valued Time Series Here, we use a cross-validation approach proposed by Bergmeir et al. (2018) to select the numberof FPCs J as deﬁned in (23) for the time series case. Note that there is no second truncationparameter K . Following Bergmeir et al. (2018), we ﬁrst divide the observed time series into atraining set and a testing set, and then apply k -fold cross validation on the training set to choosethe number of FPCs J ; secondly, the regression coeﬃcient function β will be estimated on thewhole training set with the optimal J ; lastly, the estimate of β will be applied on the testing set toevaluate the performance of out-of-sample prediction. We discuss the independent scenario only; the time series case is analogous. Our goal here is todeal with the case when the ﬁt of the logarithmic response does not fall in the logarithmic space18 asserstein Regression with base point b ν ⊕ , i.e., b Γ(Log b ν ⊕ b ν i ) / ∈ Log b ν ⊕ W , (25)with b Γ given in (17). This problem was already recognized by Bigot et al. (2017). If (25) happens,we update the ﬁt by a projection onto the boundary of Log b ν ⊕ W along the line segment connect-ing the origin 0 and the original ﬁt b Γ(Log b ν ⊕ b ν i ). Speciﬁcally, we multiply the original estimate b Γ(Log b ν ⊕ b ν i ) by a constant η i such that η i = max { η ∈ [0 ,

1] : η b Γ(Log b ν ⊕ b ν i ) + id is non-decreasing } . (26)Note that η i = 1 when b Γ(Log b ν ⊕ b ν i ) ∈ Log b ν ⊕ W . In our implementation, the ﬁtted logarithmicresponse is then given by η i b Γ(Log b ν ⊕ b ν i ) for all i = 1 , . . . , n . We compared the performance of the proposed method implemented with boundary projection (re-ferred to as projection method) with two other approaches. The ﬁrst of these comparison methodsis to employ an alternative to the proposed boundary projection for those situations where theevent (25) takes place, which was proposed by Cazelles et al. (2018) in the context of principalcomponent analysis (PCA). This alternative to handle the problem extends the domains of thedistributions. We use this method by ﬁtting the proposed distribution-to-distribution regressionmodel with distributions on an extended domain, and then normalize the ﬁtted distributions byrestricting them back to the original domain. We refer to this as the domain-extension method inthe following. A second possible approach for comparison is provided by the log quantile density(lqd) method, where we map distributions/densities to a Hilbert space via the lqd transformation(Petersen and Müller 2016), apply function-to-function regression to the lqd functions, and mapthe ﬁtted functions back to the Wasserstein space W through the inverse lqd transformation (Chenet al. 2019).For our simulations, we generated data as follows:Step 1: Generate ν i = Beta ( a i , c i ) where a i iid ∼ Gamma(20 ,

10) and c i iid ∼ Gamma(40 , i = 1 , . . . , n . Then ν ⊕ is obtained by computing F − ν ⊕ ( · ) = E F − i ( · ) via numerical integration.Step 2: Perform FPCA on { Log ν ⊕ ν i } ni =1 ,Log ν ⊕ ν i = J X j =1 h Log ν ⊕ ν i , φ j i ν ⊕ φ j , such that the ﬁrst J FPCs explain at least 99.99% of the variation.19 hen, Lin & Müller

Step 3: For ν ⊕ = Beta(2 ,

4) and a J × J matrix B = ( b jk ) which consists of zeros except for thetopleft 2 × . . − . . ! , let Z i := η i J X k =1 h Log ν ⊕ ν i , J X j =1 b jk φ j i ν ⊕ φ k , where η i is a projection constant to ensure Z i ∈ Log ν ⊕ W as discussed in Section 5.2, i.e, η i = max { η ∈ [0 ,

1] : η P Jk =1 h Log ν ⊕ ν i , P Jj =1 b jk φ j i ν ⊕ φ k + id is non-decreasing } .Step 4: Sample a i iid ∼ Unif {± π, ± π, ± π } . Let ν i = [ g a i ◦ ( Z i + id)] ν ⊕ , where g a ( r ) = r −| a | − sin( ar ) with a ∈ R \{ } and r ∈ R . This is a valid method to provide random distortionsfor distributions (Petersen and Müller 2019a).Step 5: Draw an i.i.d. sample of size m from each of the distributions { ν i } ni =1 and { ν i } ni =1 .Four cases were considered with n ∈ { , } and m ∈ { , } . We simulated 500 runs foreach ( n, m ) pair. The distribution domain is expanded from [0 ,

1] to [ − ,

2] for the domain-extensionmethod. To compare the three methods, we computed the out-of-sample average Wassersteindiscrepancy (AWD) based on observations for 200 new predictors { ν i : i = n + 1 , . . . , n + 200 } , foreach Monte Carlo run. Denoting the ﬁtted response distributions by ν \ i , the out-of-sample AWDis given by AWD( n, m ) = 1200 n +200 X i = n +1 d W (Exp ν ⊕ Z i , ν \ i ) . For our method with boundary projection, ν \ i = Exp b ν ⊕ (cid:16) η i b Γ(Log b ν ⊕ b ν i ) (cid:17) . The results are sum-marized in the boxplots of Figure 1. Note that the domain-extension method often failed to forcethe ﬁt b Γ(Log b ν ⊕ b ν i ) to fall in the log space Log b ν ⊕ W . The boxplots of the AWDs of the domain-extension method were only calculated based on the runs where it was computable and the countof these runs out of 500 are also included in Figure 1.The proposed method outperforms the lqd method in all four cases considered and also thedomain-extension method when there are suﬃciently many observations for each distribution ( m =500). Although the domain-extension method seems slightly better than the proposed method whenthe number of observations per distribution is limited, it has a major downside — it is frequentlynot computable, especially when the number of distributions is small ( n = 20). Speciﬁcally when m = 500, we found that the domain-extension method is only computable for 72 runs for n = 20and 208 runs for n = 200, out of 500 runs. In addition, the median AWDs almost stay the same asthe number of distribution pairs n increases from 20 to 200 when m is small for all three approaches.This suggests that the ﬁtting error seems to be dominated by the distribution estimation error when20 asserstein Regression lll llll llllllll l lllllll . . . . . m = 50lqd ext proj lqd ext proj (99) (429) n = 20 n = 200 llllllllllllll ll lllllllll llllllllllllllllllllllllllllllllllllll llllllll . . . . m = 500lqd ext proj lqd ext proj (72) (208) n = 20 n = 200 A W D Figure 1: Boxplots of the average Wasserstein discrepancies (AWDs) for the four simulation setupswith ( n, m ) ∈ { , } × { , } , where “lqd”: lqd method; “ext”: domain-extension method;“proj”: projection method. The integers in brackets under “ext” indicate for how many of the 500Monte Carlo runs the domain-extension method was computable.there are few or modest numbers of observations available per distribution. When m is large, theAWDs decrease when n increases for the lqd and the proposed method, but they do not decreasefor the domain-extension method, indicating non-vanishing bias. There has been continuing interest in the nature of human longevity and the analysis of mortalitydata across countries and calendar years has provided some of the key data to study it (e.g., Chiouand Müller 2009; Ouellette and Bourbeau 2011; Hyndman et al. 2013; Shang and Hyndman 2017).Of particular interest is how patterns of mortality of speciﬁc populations evolve over calendartime. Going beyond summary statistics such as life expectancy, viewing the entire age-at-deathdistributions as data objects is expected to lead to deeper insights into the secular evolution ofhuman longevity and its dynamics. The Human Mortality Database ( )provides yearly life tables for 38 countries, from which the distributions of age-at-death in terms ofhistograms can be extracted, whence densities can then be obtained by smoothing with local linearregression. We obtained these densities on the domain [0 , n = 32 countries for which data are available for the years1983 and 2013. We applied the proposed distribution-to-distribution regression model with mortal-ity distributions for an earlier year (1983) as the predictor and a later year (2013) as the response to21 hen, Lin & Müller compare the temporal evolution of age-at-death distributions among diﬀerent countries. We showthe results for females in Figure 2 for Japan, Ukraine, Italy and USA, which showcase diﬀerentpatterns of mortality change between 1983 and 2013. In addition to the graphical comparison,Wasserstein discrepancies (WD) between the observed and ﬁtted distributions are also listed. Forall four countries, the observed and ﬁtted distributions for 2013 are seen to be shifted to the rightfrom the corresponding distributions in 1983, indicating increased longevity. Japan (WD = 1.37)age den E nd [ p l o t i nd [ i ], ] . . . Ukraine (WD = 3.2)age den E nd [ p l o t i nd [ i ], ] Italy (WD = 0.36)age den E nd [ p l o t i nd [ i ], ] . . . USA (WD = 1.64)age den E nd [ p l o t i nd [ i ], ] Age D en s i t y Predictor (1983) Response (2013) Fitted response

Figure 2: Distribution-to-distribution regression for the mortality distributions of females in Japan,Ukraine, Italy and USA. The predictors are the distributions for 1983 and the responses are thedistributions 30 years later. The Wasserstein discrepancies (WDs) between ﬁtted and observeddistributions are indicated for each country.Speciﬁcally, for Japan this rightward mortality shift is seen to be more expressed than suggestedby the ﬁtted model, so that longevity extension is more than is anticipated, while the mortalitydistribution for Ukraine seems to shift to the right at a slower pace than the ﬁtted model wouldsuggest, leading to a relatively large WD with a value of 3.2 between the observed and ﬁttedresponse. In contrast, the regression ﬁt for Italy almost perfectly matches the observed distributionin 2013. While the evolution of the mortality distributions for Japan, Ukraine and Italy canbe viewed as mainly a rightward shift over calendar years, this is not the case for USA, wherecompared with the ﬁtted response, the actual rightward shift of the mortality distribution seemsto be accelerated for those above age 75 and decelerated for those below age 70.The estimated regression coeﬃcient function b β ( s, t ) is shown in Figure 3. It is seen to bestratiﬁed according to s , which is the predictor age—negative for s around 7 and 25 and above 75or 80 and positive elsewhere (referred to as negative and positive strata in the sequel) and depends22 asserstein Regression only weakly on t , which is the response age except for t above 75, where the negative stratum atlarge s ≥

75 gradually diminishes and the positive strata expand. Hence, if the log transformedpredictor Log b ν ⊕ b ν i is non-negative or non-positive throughout its domain, then the ﬁt for the logtransformed response η i b Γ(Log b ν ⊕ b ν i ) is determined by the comparison of the absolute values of thelog transformed predictor over the positive and negative strata of the estimated coeﬃcient b β ( · , t ).If in 1983 the age-at-death distribution of a country was located to the right/left from the Fréchetmeans across countries, similar shifts may also be observed in the ﬁtted distributions for 2013. Forexample, the log mapped predictors for Japan and Ukraine as shown in Figure 4 both have a peakaround 20–25 and then the ﬁtted responses are indeed closer to the Fréchet mean in 2013 but stillto the right/left of the mean. s t Figure 3: Estimated regression coeﬃcient function b β ( s, t ) in (16) for distribution-to-distributionregression on mortality data for log-mapped distributions.Compared to 1983, the ﬁtted mortality distribution for 2013 in Japan has a bigger rightwardshift for older and a smaller one for younger females relative to the Fréchet mean distribution acrosscountries, in contrast to Ukraine, where mortality moves in the opposite direction from the Fréchetmean. The log transformed predictor for Italy is negative before 15 and positive after, whencethe ﬁtted log response for Italy becomes positive throughout and also expands in size, meaningthe relative standing of Italy in terms of longevity improved in 2013 and the ﬁtted distributionof Italy in 2013 is situated to the right from the Fréchet mean for all ages. While USA also haszero crossings in 1983, in 2013 the ﬁtted response is almost entirely negative and consequently themortality distribution moved to the left of the Fréchet mean, i.e., its relative standing in terms oflongevity became worse.To illustrate the proposed autoregressive model for distribution-valued time series, we consider23 hen, Lin & Müller the mortality data for Sweden. We chose Sweden as an example because demographic data areavailable for a longer time span and are of very high quality. Here, the model was trained on thetime series between 1961 and 2001, and the out-of-sample prediction was evaluated for the following15 years up to 2016, where the yearly distribution was predicted from the ﬁtted distribution timeseries, using the predicted distributions from previous years. Japan ( h i = Log S t a [ p l o t i nd [ i ], ] − Ukraine ( h i = Log S t a [ p l o t i nd [ i ], ] Italy ( h i = Log S t a [ p l o t i nd [ i ], ] − USA ( h i = Log S t a [ p l o t i nd [ i ], ] Age

Log M ap Predictor (1983) Fitted response (2013)

Figure 4: Log mapped predictors Log b ν ⊕ b ν i and ﬁtted log responses η i b Γ(Log b ν ⊕ b ν i ) for Japan,Ukraine, Italy and USA, where b Γ and η i are deﬁned in (17) and (26).As can been seen from the ﬁtted distributions for the training period as shown in the ﬁrstrow of Figure 5, they are all close to the observed distributions. For the prediction period, thepredicted densities displayed in the second row of the ﬁgure increasingly deviate from the observeddistributions going from 2002 to 2016, where the observed densities have a mode increasingly shiftingto the right compared to the prediction. This means that the rightward mortality distribution shiftoutpaces the model expectation and longevity extension is accelerating for Sweden. A question of continuing interest to economists is how house prices change over time (e.g., Oikarinenet al. 2018; Bogin et al. 2019). We ﬁtted the temporal evolution of house price distributions via theautoregressive distribution time series model described in Section 4, where we downloaded houseprice data from . These data included bimonthly median house pricesafter inﬂation adjustment for m = 306 cities in the US from June 1996 to August 2015, for whichthe distribution of median house prices across the cities was constructed for every second month.24 asserstein Regression The autoregressive model was trained on data up to April 2007 and predictions were computed forthe remaining period, where we successively predicted the distribution of each month based on theprediction two months prior, i.e., by running the distribution time series model as estimated fromthe training period. m o r t[[ i c ]][, w h i c h ( y ea r [[ i c ]] == p l o t i nd [ i ] + y ea r S t a ) ] . . . m o r t[[ i c ]][, w h i c h ( y ea r [[ i c ]] == p l o t i nd [ i ] + y ea r S t a ) ] m o r t[[ i c ]][, w h i c h ( y ea r [[ i c ]] == p l o t i nd [ i ] + y ea r S t a ) ] m o r t[[ i c ]][, w h i c h ( y ea r [[ i c ]] == p l o t i nd P r ed [ i ] + y ea r E nd ) ] . . . m o r t[[ i c ]][, w h i c h ( y ea r [[ i c ]] == p l o t i nd P r ed [ i ] + y ea r E nd ) ] m o r t[[ i c ]][, w h i c h ( y ea r [[ i c ]] == p l o t i nd P r ed [ i ] + y ea r E nd ) ] Age D en s i t y P r ed i c t i on F i tt i ng Observed Fitted/Predicted

Figure 5: Implementation of the autoregressive distribution time series model for age-at-deathdistributions of females in Sweden. Training period: 1961–2001 (with ﬁtting results for three yearsshown in the top row). Prediction period: 2002–2016 (with prediction results for three years shownin the bottom row). The ﬁtting/prediction Wasserstein discrepancies (WDs) are as shown for eachpanel.Figure 6 shows the ﬁtting and prediction results for training and prediction periods, whereselected months are ordered in time, while Figure 7 illustrates the results sorted by Wassersteindiscrepancies. The house price densities are found to be mostly uni-modal, and the peak shiftsgradually to the right over time. Within the training period, the ﬁtted densities are initially veryclose to the observed densities and then gradually are situated to the left of the observed densities,which means that the house price evolution overall accelerates during this period.For the prediction period, the predicted densities fall behind the observed distributions in 2007,catch up with the actual distribution in 2008, and then move to the right of the observed distribu-tions. We ﬁnd that the discrepancy between the predicted and observed house price distributionsincreases from 2008 to 2012 and then decreases afterwards. These ﬁndings are in line with theoverheating of the housing market before 2006, the crash in 2007–2008, and the lingering eﬀects ofthe ﬁnancial crisis, followed by a recovery after 2012.25 hen, Lin & Müller

Aug. 1996(WD = 0.015)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] . Apr. 1999(WD = 0.028)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Dec. 2001(WD = 0.026)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Aug. 2004(WD = 0.12)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Apr. 2007(WD = 0.12)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Jun. 2007(WD = 0.13)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] . Jun. 2008(WD = 0.057)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] Jun. 2009(WD = 0.1)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] Jun. 2011(WD = 0.23)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] Aug. 2015(WD = 0.1)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] log (Price) D en s i t y P r ed i c t i on F i tt i ng Observed Fitted/Predicted

Figure 6: Observed and ﬁtted (top row) / predicted (bottom row) densities of the house pricedistributions. Training period: August 1996 to April 2007. Prediction period: June 2007 to August2015. Five representative months are depicted for each of the training and prediction periods, wherethe ﬁrst month corresponds to the beginning of the respective period and the ﬁfth month to theend. The Wasserstein discrepancies (WDs) are also listed for each month.

Jun. 1998(WD = 0.0042)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] . Aug. 2000(WD = 0.015)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Oct. 1999(WD = 0.029)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Dec. 2004(WD = 0.09)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Dec. 2005(WD = 0.24)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd [ i ] + da y S t a ] Feb. 2008(WD = 0.027)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] . Aug. 2009(WD = 0.11)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] Feb. 2010(WD = 0.14)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] Jun. 2013(WD = 0.23)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] Apr. 2012(WD = 0.29)denSup[cutoffIdx] t m p [ c u t o ffI d x , p l o t i nd P r ed [ i ] + da y E nd ] log (Price) D en s i t y P r ed i c t i on F i tt i ng Observed Fitted/Predicted

Figure 7: Observed and ﬁtted (top row) / predicted (bottom row) densities of the house pricedistributions. Training period: August 1996 to April 2007. Prediction period: June 2007 to August2015. The representative months correspond to the minimum, three quartiles and maximum of theWasserstein discrepancies (WDs) for training and prediction periods, respectively.26 asserstein Regression

Appendices: Proofs and Ancillary Results

Throughout the proof, given any µ ∈ W , we denote the space of all Hilbert-Schmidt operators from T µ to T µ by H µ := H µ,µ . Appendix A: Proofs for Section 3.4

To derive the convergence rate of the estimator b β , we ﬁrst need to study the asymptotic propertiesof the empirical covariance operators b C ν and b C ν . For simplicity, we denote P b ν ⊕ ,ν ⊕ g and P b ν ⊕ ,ν ⊕ g by P g for g ∈ L b ν ⊕ and g ∈ L b ν ⊕ , respectively. Lemma 1.

Assume (A1), (A3)–(A6), and (A10). Furthermore, assume that the eigenvalues { λ j } and { ς k } are distinct, respectively. Then kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) , kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) . Furthermore, sup j ≥ | b λ j − λ j | ≤ kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ , sup k ≥ | b ς k − ς k | ≤ kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ , k P b φ j − φ j k ν ⊕ ≤ √ kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ / min ≤ j ≤ j { λ j − λ j +1 } , for all j ≥ , k P b ψ k − ψ k k ν ⊕ ≤ √ kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ / min ≤ k ≤ k { ς k − ς k +1 } , for all k ≥ . Proof.

We only prove the results for b C ν ; those for b C ν can be shown analogously. By the third27 hen, Lin & Müller statement in Proposition 1, under (A1) and (A6), P ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν = n − n X i =1 (PLog b ν ⊕ b ν i ) ⊗ (PLog b ν ⊕ b ν i ) − C ν = n − n X i =1 (Log ν ⊕ ν i ) ⊗ (Log ν ⊕ ν i ) − C ν + n − n X i =1 (PLog b ν ⊕ b ν i − Log ν ⊕ ν i ) ⊗ (Log ν ⊕ ν i )+ n − n X i =1 (Log ν ⊕ ν i ) ⊗ (PLog b ν ⊕ b ν i − Log ν ⊕ ν i )+ n − n X i =1 (PLog b ν ⊕ b ν i − Log ν ⊕ ν i ) ⊗ (PLog b ν ⊕ b ν i − Log ν ⊕ ν i )=: A + A + A + A . For A , it can be observed that kA k H ν ⊕ ≤ n − n X i =1 k Log ν ⊕ ν i k ν ⊕ ! n − n X i =1 k PLog b ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ ! . Under the assumption (A10), n − n X i =1 k Log ν ⊕ ν i k ν ⊕ = n − n X i =1 d W ( ν ⊕ , ν i ) E [ d W ( ν ⊕ , ν )] + O p ( n − / ) . (A.1)For the second term, by (3), (18) and (A6), PLog b ν ⊕ b ν i = Log ν ⊕ b ν i − Log ν ⊕ b ν ⊕ . Hence, k PLog b ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ ≤ k Log ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ + 2 k Log ν ⊕ b ν ⊕ − Log ν ⊕ e ν ⊕ k ν ⊕ + 2 k Log ν ⊕ e ν ⊕ k ν ⊕ . By (3), (13) and (15), k Log ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ = d W ( b ν i , ν i ), k Log ν ⊕ b ν ⊕ − Log ν ⊕ e ν ⊕ k ν ⊕ = d W ( b ν ⊕ , e ν ⊕ ) ≤ n − P ni =1 d W ( b ν i , ν i ), and Log ν ⊕ e ν ⊕ = n − P ni =1 Log ν ⊕ ν i . Furthermore, since n − / P ni =1 Log ν ⊕ ν i converges in distribution to a Gaussian measure on L ν ⊕ with mean 0 andcovariance operator C ν (Aldous 1976), k Log ν ⊕ e ν ⊕ k ν ⊕ = O p ( n − ) . (A.2)28 asserstein Regression Furthermore, by (A3)–(A5), n − n X i =1 k PLog b ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) . (A.3)Hence, kA k H ν ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ). Similarly, it can be shown that kA k H ν ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ), and kA k H ν ⊕ = O ( τ m ) + O p ( υ m n − ) + O p ( n − ). By Dauxoiset al. (1982), kA k H ν ⊕ = O p ( n − ) . (A.4)By (A6) and Proposition 1, { b λ j } and { P b φ j } are the eigenvalues and eigenfunctions of P b C ν , forwhich the results follow from Lemmas 4.2 and 4.3 of Bosq (2000). Proof of Theorem 1.

First we observe that (cid:13)(cid:13)(cid:13) Q b β − β (cid:13)(cid:13)(cid:13) ν ⊕ × ν ⊕ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Q b β − K X k =1 J X j =1 b jk ψ k ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ × ν ⊕ + 2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ X k = K +1 ∞ X j = J +1 b jk ψ k ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ × ν ⊕ , where k A k ν ⊕ × ν ⊕ := R D R D A ( s, t ) d ν ⊕ ( s )d ν ⊕ ( t ) for A ∈ L ν ⊕ × ν ⊕ . By (A9), (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ X k = K +1 ∞ X j = J +1 b jk ψ k ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ × ν ⊕ = ∞ X k = K +1 ∞ X j = J +1 b jk = O ( J − ρ +1 K − % +1 ) . Deﬁne A = K X k =1 J X j =1 ( b b jk − b jk ) ψ k ⊗ φ j ,A = K X k =1 J X j =1 b jk h P b ψ k ⊗ P b φ j − ψ k ⊗ φ j i ,A = K X k =1 J X j =1 ( b b jk − b jk ) h P b ψ k ⊗ P b φ j − ψ k ⊗ φ j i . Then kQ b β − P Kk =1 P Jj =1 b jk ψ k ⊗ φ j k ν ⊕ × ν ⊕ ≤ k A k ν ⊕ × ν ⊕ +2 k A k ν ⊕ × ν ⊕ +2 k A k ν ⊕ × ν ⊕ . Clearlythe term A is asymptotically dominated by A and A .Note that k A k ν ⊕ × ν ⊕ = P Kk =1 P Jj =1 ( b b jk − b jk ) = P Kk =1 P Jj =1 ( b λ − j b ξ jk − λ − j ξ jk ) . We claimthat ( b ξ jk − ξ jk ) = ( j θ +2 + k ϑ +2 )[ O ( τ m ) + O p ( υ / m n − / ) + O p ( n − )] , (A.5)29 hen, Lin & Müller where the O and O p terms are uniform across j, k . Using the same technique as in the proof ofHall and Horowitz (2007), we deﬁne an event E J = E J ( n ) = { λ J ≥ kP ( b ν ⊕ , b ν ⊕ ) , ( ν ⊕ ,ν ⊕ ) b C ν − C ν k H ν ⊕ } . On E J , by Lemma 1, b λ j ≥ λ j /

2, for all j = 1 , . . . , J . Note that λ j ≥ Cj − θ for some constant C > ρ, % > / P ∞ k =1 b jk ≤ Cj − ρ P ∞ k =1 k − % ≤ const . by (A9). Hence, on E J , k A k ν ⊕ × ν ⊕ ≤ K X k =1 J X j =1 λ j − b λ j b λ j b jk ! + 2 K X k =1 J X j =1 ( b ξ jk − ξ jk ) b λ j ≤ h O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) i J X j =1 K X k =1 b jk ! b λ − j + h O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) i J X j =1 K X k =1 b λ − j ( j θ +2 + k ϑ +2 ) ≤ h O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) i J X j =1 K X k =1 λ − j ( j θ +2 + k ϑ +2 ) ≤ h O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) i ( J θ +3 K + J θ +1 K ϑ +3 ) . Note that with the choice of J ∼ n / (4 θ +4 ρ +2) , τ m λ − J → υ / m n − / λ − J → n − λ − J →

0, as n → ∞ , under the assumptions of Theorem 1. Thus, Lemma 1 entails that P ( E J ) → n → ∞ .For A , deﬁne A = K X k =1 J X j =1 b jk ψ k ⊗ (P b φ j − φ j ) ,A = K X k =1 J X j =1 b jk (P b ψ k − ψ k ) ⊗ φ j ,A = K X k =1 J X j =1 b jk (P b ψ k − ψ k ) ⊗ (P b φ j − φ j ) . Then k A k ν ⊕ × ν ⊕ ≤ k A k ν ⊕ × ν ⊕ + 2 k A k ν ⊕ × ν ⊕ + 2 k A k ν ⊕ × ν ⊕ . It is clear that A is30 asserstein Regression asymptotically dominated by A and A . By (A7), (A9) and Lemma 1, k A k ν ⊕ × ν ⊕ ≤ K X k =1 J X j =1 b jk k P b φ j − φ j k ν ⊕ ≤ J X j =1 j − ρ − θ − K X k =1 k − % kP b C ν − C ν k ν ⊕ = h O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) i J X j =1 j − ρ − θ − . Here, J X j =1 j − ρ − θ − =  O ( J − ρ +2 θ +3 ) if θ > ρ − O ( J ) if ρ − / < θ ≤ ρ − O (log J ) if θ = ρ − / O (1) if θ < ρ − / O ( J θ +2 ) . Hence, k A k ν ⊕ × ν ⊕ = O ( τ m J θ +2 ) + O p ( υ / m n − / J θ +2 ) + O p ( n − J θ +2 ) . Similarly, k A k ν ⊕ × ν ⊕ = O ( τ m K ϑ +2 ) + O p ( υ / m n − / K ϑ +2 ) + O p ( n − K ϑ +2 ) . Thus, with J ∼ n / (4 θ +4 ρ +2) and K ∼ n / (4 ϑ +4 % +2) , kQ b β − β k ν ⊕ × ν ⊕ = O ( τ m n α +1 ) + O p ( υ / m n α +1 / ) + O p ( n α ) . To complete the proof, it suﬃces to show (A.5). By (A6) and Proposition 1, b ξ jk − ξ jk = h n − n X i =1 h PLog b ν ⊕ b ν i , P b φ j i ν ⊕ PLog b ν ⊕ b ν i − E ( h Log ν ⊕ ν , φ j i ν ⊕ Log ν ⊕ ν ) , P b ψ k i ν ⊕ + h E ( h Log ν ⊕ ν , φ j i ν ⊕ Log ν ⊕ ν ) , P b ψ k − ψ k i ν ⊕ =: I + I . Note that E k Log ν ⊕ ν k ν ⊕ = E [ d W ( ν ⊕ , ν )] < ∞ , and by Fubini’s theorem k q E (Log ν ⊕ ν ) k ν ⊕ = E k Log ν ⊕ ν k ν ⊕ = E [ d W ( ν ⊕ , ν )] < ∞ . By Lemma 1, I ≤ E h Log ν ⊕ ν , φ j i ν ⊕ k q E (Log ν ⊕ ν ) k ν ⊕ ·k P b ψ k − ψ k k ν ⊕ ≤ const . k P b ψ k − ψ k k ν ⊕ = k ϑ +2 [ O ( τ m ) + O p ( υ / m n − / ) + O p ( n − )], where the O hen, Lin & Müller and O p terms are uniform across j, k . For I , I ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n − n X i =1 h PLog b ν ⊕ b ν i , P b φ j i ν ⊕ PLog b ν ⊕ b ν i − E ( h Log ν ⊕ ν , φ j i ν ⊕ Log ν ⊕ ν ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n − n X i =1 h PLog b ν ⊕ b ν i , P b φ j i ν ⊕ PLog b ν ⊕ b ν i − n − n X i =1 h Log ν ⊕ ν i , φ j i ν ⊕ Log ν ⊕ ν i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ + 2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n − n X i =1 h Log ν ⊕ ν i , φ j i ν ⊕ Log ν ⊕ ν i − E ( h Log ν ⊕ ν , φ j i ν ⊕ Log ν ⊕ ν ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ =: 2 I + 2 I . Furthermore, I ≤ const . (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n − n X i =1 (cid:16) h PLog b ν ⊕ b ν i , P b φ j i ν ⊕ − h Log ν ⊕ ν i , φ j i ν ⊕ (cid:17) PLog b ν ⊕ b ν i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n − n X i =1 h Log ν ⊕ ν i , φ j i ν ⊕ (cid:16) PLog b ν ⊕ b ν i − Log ν ⊕ ν i (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ν ⊕  ≤ const . " n − n X i =1 (cid:16) h PLog b ν ⊕ b ν i , P b φ j i ν ⊕ − h Log ν ⊕ ν i , φ j i ν ⊕ (cid:17) n − n X i =1 k PLog b ν ⊕ b ν i k ν ⊕ + n − n X i =1 k Log ν ⊕ ν i k ν ⊕ n − n X i =1 k PLog b ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ . Note that n − n X i =1 (cid:16) h PLog b ν ⊕ b ν i , P b φ j i ν ⊕ − h Log ν ⊕ ν i , φ j i ν ⊕ (cid:17) ≤ n − n X i =1 h PLog b ν ⊕ b ν i , P b φ j − φ j i ν ⊕ + 2 n − n X i =1 h PLog b ν ⊕ b ν i − Log ν ⊕ ν i , φ j i ν ⊕ ≤ n − n X i =1 k PLog b ν ⊕ b ν i k ν ⊕ k P b φ j − φ j k ν ⊕ + 2 n − n X i =1 k PLog b ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ . Under (A3)–(A6), and (A10), by (A.1) and (A.3), n − n X i =1 k PLog b ν ⊕ b ν i k ν ⊕ = E [ d W ( ν ⊕ , ν )] + O ( τ m ) + O p ( υ / m n − / ) + O p ( n − / ) . Analogously, we obtain n − P ni =1 k PLog b ν ⊕ b ν i − Log ν ⊕ ν i k ν ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ),and n − P ni =1 k PLog b ν ⊕ b ν i k ν ⊕ = E [ d W ( ν ⊕ , ν )] + O ( τ m ) + O p ( υ / m n − / ) + O p ( n − / ). Further-32 asserstein Regression more, by Lemma 1 and (A7), I = j θ +2 [ O ( τ m ) + O p ( υ / m n − / ) + O p ( n − )] , where the O and O p terms are uniform across j, k . For I , deﬁning an empirical cross-covarianceoperator ˇ C ν ν = n − P ni =1 Log ν ⊕ ν i ⊗ Log ν ⊕ ν i , I = k ˇ C ν ν φ j − C ν ν φ j k ν ⊕ ≤ k ˇ C ν ν − C ν ν k H ν ⊕ ,ν ⊕ = O p ( n − ) , where the O p term is uniform across j, k (Theorems 4.4.5 and 7.7.6, Hsing and Eubank 2015). Proof of Corollary 1.

First note that for two distributions µ , µ ∈ W , if µ is atomless, then d W (Exp µ g, Exp µ P µ ,µ g ) = d W ( µ , µ ). Thus, d W ( b ν ( b µ ) , E ⊕ ( ν | ν = µ ))= d W (cid:18) Exp b ν ⊕ Z D b β ( s, · )Log b ν ⊕ b µ ( s )d b ν ⊕ ( s ) , Exp ν ⊕ [ E (Log ν ⊕ ν | Log ν ⊕ ν = Log ν ⊕ µ )] (cid:19) ≤ d W (cid:18) Exp ν ⊕ P b ν ⊕ ,ν ⊕ Z D b β ( s, · )Log b ν ⊕ b µ ( s )d b ν ⊕ ( s ) , Exp ν ⊕ Z D β ( s, · )Log ν ⊕ µ ( s )d ν ⊕ ( s ) (cid:19) + 2 d W ( b ν ⊕ , ν ⊕ )= 2 (cid:13)(cid:13)(cid:13)(cid:13) P b ν ⊕ ,ν ⊕ Z D b β ( s, · )Log b ν ⊕ b µ ( s )d b ν ⊕ ( s ) − Z D β ( s, · )Log ν ⊕ µ ( s )d ν ⊕ ( s ) (cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ + 2 d W ( b ν ⊕ , ν ⊕ ) . Note that by (A6) and Proposition 1,P b ν ⊕ ,ν ⊕ Z D b β ( s, · )Log b ν ⊕ b µ ( s )d b ν ⊕ ( s ) = Z D Q b β ( s, · )PLog b ν ⊕ b µ ( s )d ν ⊕ ( s ) . Hence, d W ( b ν ( b µ ) , E ⊕ ( ν | ν = µ )) ≤ (cid:13)(cid:13)(cid:13)(cid:13)Z D Q b β ( s, · )PLog b ν ⊕ b µ ( s )d ν ⊕ ( s ) − Z D β ( s, · )Log ν ⊕ µ ( s )d ν ⊕ ( s ) (cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ + 2 d W ( b ν ⊕ , ν ⊕ ) ≤ (cid:13)(cid:13)(cid:13)(cid:13)Z D h Q b β ( s, · ) − β ( s, · ) i PLog b ν ⊕ b µ ( s )d ν ⊕ ( s ) (cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ + 4 (cid:13)(cid:13)(cid:13)(cid:13)Z D β ( s, · ) h PLog b ν ⊕ b µ ( s ) − Log ν ⊕ µ ( s ) i d ν ⊕ ( s ) (cid:13)(cid:13)(cid:13)(cid:13) ν ⊕ + 2 d W ( b ν ⊕ , ν ⊕ ) ≤ k PLog b ν ⊕ b µ k ν ⊕ kQ b β − β k ν ⊕ × ν ⊕ + 4 k PLog b ν ⊕ b µ − Log ν ⊕ µ k ν ⊕ k β k ν ⊕ × ν ⊕ + 2 d W ( b ν ⊕ , ν ⊕ ) , hen, Lin & Müller where k · k ν ⊕ × ν ⊕ is deﬁned as in the proof of Theorem 1. Here, k PLog b ν ⊕ b µ k ν ⊕ ≤ k PLog b ν ⊕ b µ − Log ν ⊕ µ k ν ⊕ + 2 k Log ν ⊕ µ k ν ⊕ = 2 d W ( ν ⊕ , µ ) + O ( τ m ) + O p ( υ / m ) + O p ( n − ) . By (A9), k β k ν ⊕ × ν ⊕ < ∞ . By analogy with the proof of (A.3), under (A3)–(A6), k PLog b ν ⊕ b µ − Log ν ⊕ µ k ν ⊕ = O ( τ m ) + O p ( υ / m ) + O p ( n − ). Lastly, d W ( b ν ⊕ , ν ⊕ ) = O ( τ m n − ) + O p ( υ / m n − / ) + O p ( n − ), which completes the proof by Theorem 1. Appendix B: Proofs for Section 4

In this section, we denote P b µ ⊕ ,µ ⊕ g by P g , for g ∈ L b µ ⊕ . Lemma 2.

Assume (A3), (A4), (A6), (B1), (B3), and (B6)–(B7). Furthermore, assume that theeigenvalues { λ j } are distinct, respectively. Then kP ( b µ ⊕ , b µ ⊕ ) , ( µ ⊕ ,µ ⊕ ) b C − C k H µ ⊕ = O ( τ m ) + O p ( n − / ) . Furthermore, sup j ≥ | b λ j − λ j | ≤ kP ( b µ ⊕ , b µ ⊕ ) , ( µ ⊕ ,µ ⊕ ) b C − C k H µ ⊕ , k P b φ j − φ j k µ ⊕ ≤ kP ( b µ ⊕ , b µ ⊕ ) , ( µ ⊕ ,µ ⊕ ) b C − C k H µ ⊕ / min ≤ j ≤ j { λ j − λ j +1 } , for all j ≥ . Proof.

The proof is analogous to the proof of Lemma 1. By the third statement in Proposition 1,under (B1) and (A6), P ( b µ ⊕ , b µ ⊕ ) , ( µ ⊕ ,µ ⊕ ) b C − C = n − n X i =1 (Log µ ⊕ µ i ) ⊗ (Log µ ⊕ µ i ) − C + n − n X i =1 (PLog b µ ⊕ b µ i − Log µ ⊕ µ i ) ⊗ (Log µ ⊕ µ i )+ n − n X i =1 (Log µ ⊕ µ i ) ⊗ (PLog b µ ⊕ b µ i − Log µ ⊕ µ i )+ n − n X i =1 (PLog b µ ⊕ b µ i − Log µ ⊕ µ i ) ⊗ (PLog b µ ⊕ b µ i − Log µ ⊕ µ i )=: A + A + A + A . asserstein Regression Note that kA k H µ ⊕ ≤ n − n X i =1 k Log µ ⊕ µ i k µ ⊕ ! n − n X i =1 k PLog b µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ ! . For the ﬁrst term on the right hand side, under (B6), E [ d W ( µ ⊕ , µ i )] ≤ C / , var[ d W ( µ ⊕ , µ i )] ≤ C ,and | cov[ d W ( µ ⊕ , µ i ) , d W ( µ ⊕ , µ l )] | ≤ C . Hence, n − P ni =1 k Log µ ⊕ µ i k µ ⊕ = n − P ni =1 d W ( µ ⊕ , µ i ) = O p (1).For the second term, by (3) and (18) and (A6), PLog b µ ⊕ b µ i = Log µ ⊕ b µ i − Log µ ⊕ b µ ⊕ . Hence, k PLog b µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ ≤ k Log µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ +2 k Log µ ⊕ b µ ⊕ − Log µ ⊕ e µ ⊕ k µ ⊕ +2 k Log µ ⊕ e µ ⊕ k µ ⊕ .By (3), (13) and (15), k Log µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ = d W ( b µ i , µ i ), Log µ ⊕ e µ ⊕ = n − P ni =1 Log µ ⊕ µ i , and k Log µ ⊕ b µ ⊕ − Log µ ⊕ e µ ⊕ k µ ⊕ = d W ( b µ ⊕ , e µ ⊕ ) ≤ n − P ni =1 d W ( b µ i , µ i ). By Theorem 3.10 of Bosq (2000), k Log µ ⊕ e µ ⊕ k µ ⊕ = O p ( n − ). Hence, n − n X i =1 k PLog b µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ ≤ n − n X i =1 d W ( b µ i , µ i ) + 4 n − n X i =1 d W ( b µ i , µ i ) + O p ( n − ) . Note that E { cov[ d W ( µ i , b µ i ) , d W ( µ i + k , b µ i + k ) | µ i , µ i + k ] } = 0. Under (A3), E [ d W ( µ i , b µ i ) | µ i ] = O ( τ m ) uniformly in i , i.e., there exists a constant C > m > E [ d W ( µ i , b µ i ) | µ i ] ≤ C τ m , for all m ≥ m and i ≥

1. Furthermore by (B7), for all m ≥ m ,sup i ∈ N + (cid:12)(cid:12)(cid:12) cov { C − τ − m E [ d W ( µ i , b µ i ) | µ i ] , C − τ − m E [ d W ( µ i + k , b µ i + k ) | µ i + k ] } (cid:12)(cid:12)(cid:12) ≤ C u k , with some constants C > u ∈ (0 ,

1) (Volkonskii and Rozanov 1959), and hence n − n − X i =1 n − i X k =1 (cid:12)(cid:12)(cid:12) cov[ d W ( µ i , b µ i ) , d W ( µ i + k , b µ i + k )] (cid:12)(cid:12)(cid:12) ≤ C C τ m n − n − X k =1 ( n − k ) u k = O ( τ m n − ) . Again applying (A3), n − P ni =1 var[ d W ( b µ i , µ i )] = O ( υ m n − ) and hence var (cid:2) n − P ni =1 d W ( b µ i , µ i ) (cid:3) = O ( υ m n − ). Thus, n − n X i =1 k PLog b µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) , and hence kA k H µ ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ). Similarly, we ﬁnd kA k H µ ⊕ = O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ), and A is asymptotically dominated by A and A . Lastly, kA k H µ ⊕ = O p ( n − ) follows from Theorem 4.9 of Bosq (2000).35 hen, Lin & MüllerProof of Theorem 2. The proof follows similar arguments to Theorem 1. First we observe (cid:13)(cid:13)(cid:13) Q b β − β (cid:13)(cid:13)(cid:13) µ ⊕ × µ ⊕ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Q b β − J X l =1 J X j =1 b jl φ l ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ ⊕ × µ ⊕ + 2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ X l = J +1 ∞ X j = J +1 b jl φ l ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ ⊕ × µ ⊕ , where k A k µ ⊕ × µ ⊕ := R D R D A ( s, t ) d µ ⊕ ( s )d µ ⊕ ( t ) for A ∈ L µ ⊕ × µ ⊕ . By (B5), (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ X l = J +1 ∞ X j = J +1 b jl φ l ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ ⊕ × µ ⊕ = ∞ X l = J +1 ∞ X j = J +1 b jl = O ( J − ρ − % +2 ) . Deﬁne A = P Jl =1 P Jj =1 ( b b jl − b jl ) φ l ⊗ φ j , A = P Jl =1 P Jj =1 b jl (P b φ l ⊗ P b φ j − φ l ⊗ φ j ), and A = P Jl =1 P Jj =1 ( b b jl − b jl )(P b φ l ⊗ P b φ j − φ l ⊗ φ j ). Then (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Q b β − J X l =1 J X j =1 b jl φ l ⊗ φ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ ⊕ × µ ⊕ ≤ k A k µ ⊕ × µ ⊕ + 2 k A k µ ⊕ × µ ⊕ + 2 k A k µ ⊕ × µ ⊕ , where clearly A is asymptotically dominated by A and A .For A , note that k A k µ ⊕ × µ ⊕ = J X l =1 J X j =1 ( b b jl − b jl ) = J X l =1 J X j =1 ( b λ − j b ξ jl − b jl ) . Analogous to the proof of (A.5), we derive the convergence rate of ( b ξ jl − ξ jl ) . Note that b ξ jl − ξ jl = h ( n − − n − X i =1 h PLog b µ ⊕ b µ i , P b φ j i µ ⊕ PLog b µ ⊕ b µ i +1 − E ( h Log µ ⊕ µ i , φ j i µ ⊕ Log µ ⊕ µ i +1 ) , P b φ l i µ ⊕ + h E ( h Log µ ⊕ µ i , φ j i µ ⊕ Log µ ⊕ µ i +1 ) , P b φ l − φ l i µ ⊕ =: I + I . By (B6) and Lemma 2, I ≤ const . k P b φ l − φ l k µ ⊕ . For I , I ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( n − − n − X i =1 h PLog b µ ⊕ b µ i , P b φ j i µ ⊕ PLog b µ ⊕ b µ i +1 − ( n − − n − X i =1 h Log µ ⊕ µ i , φ j i µ ⊕ Log µ ⊕ µ i +1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ ⊕ + 2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( n − − n − X i =1 h Log µ ⊕ µ i , φ j i µ ⊕ Log µ ⊕ µ i +1 − E ( h Log µ ⊕ µ i , φ j i µ ⊕ Log µ ⊕ µ i +1 ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ ⊕ =: 2 I + 2 I . asserstein Regression Here, analogous to the proof of Theorem 1, I ≤ const . " n − − n − X i =1 (cid:16) k PLog b µ ⊕ b µ i k µ ⊕ k P b φ j − φ j k µ ⊕ + k PLog b µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ (cid:17) · n − X i =1 k PLog b µ ⊕ b µ i +1 k µ ⊕ + 2( n − − n − X i =1 k Log µ ⊕ µ i k µ ⊕ n − X i =1 k PLog b µ ⊕ b µ i +1 − Log µ ⊕ µ i +1 k µ ⊕ ≤ O p (1) " k P b φ j − φ j k µ ⊕ + ( n − − n − X i =1 k PLog b µ ⊕ b µ i − Log µ ⊕ µ i k µ ⊕ + ( n − − n − X i =1 k PLog b µ ⊕ b µ i +1 − Log µ ⊕ µ i +1 k µ ⊕ . For I , deﬁning an empirical auto-covariance operatorˇ C = ( n − − n − X i =1 Log µ ⊕ µ i +1 ⊗ Log µ ⊕ µ i ,I = k ˇ C φ j − C φ j k µ ⊕ ≤ k ˇ C − C k H µ ⊕ = O p ( n − ), uniformly in j, l (Theorem 4.9, Bosq 2000).Thus by Lemma 2,( b ξ jl − ξ jl ) = ( j θ +2 + l θ +2 ) h O ( τ m ) + O p ( υ / m n − / ) + O p ( n − ) i , where the O and O p terms are uniform across j, l . Following arguments analogous to the proof ofTheorem 1, we ﬁnd k A k µ ⊕ × µ ⊕ = O ( τ m J θ +4 ) + O p ( υ / m n − / J θ +4 ) + O p ( n − J θ +4 ).For A , deﬁne A = J X l =1 J X j =1 b jl φ l ⊗ (P b φ j − φ j ) ,A = J X l =1 J X j =1 b jl (P b φ l − φ l ) ⊗ φ j ,A = J X l =1 J X j =1 b jl (P b φ l − φ l ) ⊗ (P b φ j − φ j ) . Then k A k µ ⊕ × µ ⊕ ≤ k A k µ ⊕ × µ ⊕ + 2 k A k µ ⊕ × µ ⊕ + 2 k A k µ ⊕ × µ ⊕ , and A is asymptotically dom-inated by A and A . Analogous to the proof of Theorem 1, it can be veriﬁed that k A k µ ⊕ × µ ⊕ = O ( τ m J θ +2 ) + O p ( υ / m n − / J θ +2 ) + O p ( n − J θ +2 ) , k A k µ ⊕ × µ ⊕ = O ( τ m J θ +2 ) + O p ( υ / m n − / J θ +2 ) + O p ( n − J θ +2 ) . hen, Lin & Müller REFERENCES

We omit the proof of Corollary 2, since it is analogous to that of Corollary 1.

References

Aggarwal, O. P. (1955). Some minimax invariant procedures for estimating a cumulative distri-bution function.

Annals of Mathematical Statistics Aldous, D. J. (1976). A characterisation of Hilbert space using the central limit theorem.

Journalof the London Mathematical Society Ambrosio, L. , Gigli, N. and

Savaré, G. (2004). Gradient ﬂows with metric and diﬀerentiablestructures, and applications to the Wasserstein space.

Atti Accad. Naz. Lincei Cl. Sci. Fis. Mat.Natur. Rend. Lincei (9) Mat. Appl Ambrosio, L. , Gigli, N. and

Savaré, G. (2008).

Gradient Flows: in Metric Spaces and in theSpace of Probability Measures . Springer.

Bachoc, F. , Gamboa, F. , Loubes, J.-M. and

Venet, N. (2017). A Gaussian process regressionmodel for distribution inputs.

IEEE Transactions on Information Theory Banerjee, M. , Chakraborty, R. , Ofori, E. , Okun, M. S. , Viallancourt, D. E. and

Ve-muri, B. C. (2016). A nonlinear regression technique for manifold valued data with applicationsto medical image analysis. In

Proceedings of the IEEE Conference on Computer Vision andPattern Recognition . 4424–4432.

Bergmeir, C. , Hyndman, R. J. and

Koo, B. (2018). A note on the validity of cross-validationfor evaluating autoregressive time series prediction.

Computational Statistics & Data Analysis

Bigot, J. , Gouet, R. , Klein, T. and

López, A. (2017). Geodesic PCA in the Wasserstein spaceby convex PCA.

Annales de l’Institut Henri Poincaré, Probabilités et Statistiques Bogin, A. , Doerner, W. and

Larson, W. (2019). Local house price dynamics: New indicesand stylized facts.

Real Estate Economics Bosq, D. (2000).

Linear Processes in Function Spaces: Theory and Applications . Springer-Verlag,New York.

Cai, T. and

Hall, P. (2006). Prediction in functional linear regression.

Annals of Statistics Cardot, H. , Ferraty, F. , Mas, A. and

Sarda, P. (2003). Testing hypotheses in the functionallinear model.

Scandinavian Journal of Statistics Cardot, H. , Ferraty, F. and

Sarda, P. (1999). Functional linear model.

Statistics & ProbabilityLetters Cazelles, E. , Seguy, V. , Bigot, J. , Cuturi, M. and

Papadakis, N. (2018). Geodesic PCA38

EFERENCES

Wasserstein Regression versus log-PCA of histograms in the Wasserstein space.

SIAM Journal on Scientiﬁc Computing B429–B456.

Chen, Z. , Bao, Y. , Li, H. and

Spencer Jr, B. F. (2019). LQD-RKHS-based distribution-to-distribution regression methodology for restoring the probability distributions of missing shmdata.

Mechanical Systems and Signal Processing

Cheng, C. and

Parzen, E. (1997). Uniﬁed estimators of smooth quantile and quantile densityfunctions.

Journal of Statistical Planning and Inference Chiou, J.-M. and

Müller, H.-G. (2009). Modeling hazard rates as functional data for the analy-sis of cohort lifetables and mortality forecasting.

Journal of the American Statistical Association

Cornea, E. , Zhu, H. , Kim, P. and

Ibrahim, J. G. (2017). Regression models on Riemanniansymmetric spaces.

Journal of the Royal Statistical Society: Series B (Statistical Methodology) Dauxois, J. , Pousse, A. and

Romain, Y. (1982). Asymptotic theory for the principal componentanalysis of a vector random function: Some applications to statistical inference.

Journal ofMultivariate Analysis Davis, B. C. , Fletcher, P. T. , Bullitt, E. and

Joshi, S. (2007). Population shape regressionfrom random design data. In . Falk, M. (1983). Relative eﬃciency and deﬁciency of kernel type estimators of smooth distributionfunctions.

Statistica Neerlandica Falk, M. (1984). Relative deﬁciency of kernel type estimators of quantiles.

Annals of Statistics Ferraty, F. and

Vieu, P. (2003). Functional nonparametric statistics: A double inﬁnite di-mensional framework. In

Recent Advances and Trends in Nonparametric Statistics . Elsevier,61–76.

Fréchet, M. (1948). Les éléments aléatoires de nature quelconque dans un espace distancié.

Annales de l’Institut Henri Poincaré Grenander, U. (1950). Stochastic processes and statistical inference.

Arkiv för Matematik Hall, P. and

Horowitz, J. L. (2007). Methodology and convergence rates for functional linearregression.

Annals of Statistics He, G. , Müller, H.-G. and

Wang, J.-L. (2003). Functional canonical analysis for squareintegrable stochastic processes.

Journal of Multivariate Analysis He, G. , Müller, H.-G. , Wang, J.-L. and

Yang, W. (2010). Functional linear regression viacanonical analysis.

Bernoulli Hinkle, J. , Muralidharan, P. , Fletcher, P. T. and

Joshi, S. (2012). Polynomial regressionon Riemannian manifolds. In

Computer Vision–ECCV 2012 . Springer, 1–14.39 hen, Lin & Müller

REFERENCES

Hsing, T. and

Eubank, R. (2015).

Theoretical Foundations of Functional Data Analysis, with anIntroduction to Linear Operators . John Wiley & Sons.

Huckemann, S. F. (2015). (Semi-)intrinsic statistical analysis on non-Euclidean spaces. In

Ad-vances in Complex Data Modeling and Computational Methods in Statistics . Springer, 103–118.

Hyndman, R. J. , Booth, H. and

Yasmeen, F. (2013). Coherent mortality forecasting: theproduct-ratio method with functional time series models.

Demography Kloeckner, B. (2010). A geometric study of Wasserstein spaces: Euclidean spaces.

Annali dellaScuola Normale Superiore di Pisa-Classe di Scienze Kokoszka, P. , Miao, H. , Petersen, A. and

Shang, H. L. (2019). Forecasting of densityfunctions with an application to cross-sectional and intraday returns.

International Journal ofForecasting Leblanc, A. (2012). On estimating distribution functions using Bernstein polynomials.

Annalsof the Institute of Statistical Mathematics Lin, L. , St. Thomas, B. , Zhu, H. and

Dunson, D. B. (2017). Extrinsic local regression onmanifold-valued data.

Journal of the American Statistical Association

Lin, Z. and

Yao, F. (2017). Functional regression on manifold with contamination. arXiv preprintarXiv:1704.03005 . Lin, Z. and

Yao, F. (2018). Intrinsic Riemannian functional data analysis. arXiv preprintarXiv:1812.01831 . Marron, J. S. and

Alonso, A. M. (2014). Overview of object oriented data analysis.

BiometricalJournal McCann, R. J. (1997). A convexity principle for interacting gases.

Advances in Mathematics

Morris, J. S. (2015). Functional regression.

Annual Review of Statistics and Its Application Oikarinen, E. , Bourassa, S. C. , Hoesli, M. and

Engblom, J. (2018). US metropolitan houseprice dynamics.

Journal of Urban Economics

Oliva, J. , Neiswanger, W. , Póczos, B. , Schneider, J. and

Xing, E. (2014). Fast distributionto real regression. In

Artiﬁcial Intelligence and Statistics . 706–714.

Ouellette, N. and

Bourbeau, R. (2011). Changes in the age-at-death distribution in four lowmortality countries: A nonparametric approach.

Demographic Research Panaretos, V. M. and

Zemel, Y. (2016). Amplitude and phase variation of point processes.

Annals of Statistics Parzen, E. (1979). Nonparametric statistical data modeling.

Journal of the American StatisticalAssociation Petersen, A. , Chen, C.-J. and

Müller, H.-G. (2019). Quantifying and visualizing intraregionalconnectivity in resting-state Functional Magnetic Resonance Imaging with correlation densities.40

EFERENCES

Wasserstein RegressionBrain Connectivity Petersen, A. and

Müller, H.-G. (2016). Functional data analysis for density functions bytransformation to a Hilbert space.

Annals of Statistics Petersen, A. and

Müller, H.-G. (2019a). Fréchet regression for random objects with Euclideanpredictors.

Annals of Statistics Petersen, A. and

Müller, H.-G. (2019b). Wasserstein covariance for multiple random densities.

Biometrika

Póczos, B. , Singh, A. , Rinaldo, A. and

Wasserman, L. A. (2013). Distribution-free distri-bution regression. In

AISTATS . 507–515.

Ramsay, J. O. and

Dalzell, C. J. (1991). Some tools for functional data analysis.

Journal ofthe Royal Statistical Society: Series B (Statistical Methodology) Read, R. (1972). The asymptotic inadmissibility of the sample distribution function.

Annals ofMathematical Statistics Rosenblatt, M. (1956). A central limit theorem and a strong mixing condition.

Proceedings ofthe National Academy of Sciences of the United States of America Shang, H. L. and

Hyndman, R. J. (2017). Grouped functional time series forecasting: Anapplication to age-speciﬁc mortality rates.

Journal of Computational and Graphical Statistics Shi, X. , Styner, M. , Lieberman, J. , Ibrahim, J. G. , Lin, W. and

Zhu, H. (2009). Intrinsicregression models for manifold-valued data. In

Medical Image Computing and Computer-AssistedIntervention–MICCAI 2009 . Springer, 192–199.

Steinke, F. and

Hein, M. (2009). Non-parametric regression between manifolds. In

Advances inNeural Information Processing Systems . 1561–1568.

Steinke, F. , Hein, M. and

Schölkopf, B. (2010). Nonparametric regression between generalRiemannian manifolds.

SIAM Journal on Imaging Sciences Sturm, K.-T. (2003). Probability measures on metric spaces of nonpositive curvature.

HeatKernels and Analysis on Manifolds, Graphs, and Metric Spaces (Paris, 2002)

Szabó, Z. , Sriperumbudur, B. K. , Póczos, B. and

Gretton, A. (2016). Learning theory fordistribution regression.

Journal of Machine Learning Research Thi Thien Trang, B. , Loubes, J.-M. , Risser, L. and

Balaresque, P. (2019). Distribu-tion regression model with a Reproducing Kernel Hilbert Space approach.

Communications inStatistics-Theory and Methods in Press.

Villani, C. (2003).

Topics in Optimal Transportation . American Mathematical Society.

Volkonskii, V. and

Rozanov, Y. A. (1959). Some limit theorems for random functions. I.

Theory of Probability & Its Applications Wang, J.-L. , Chiou, J.-M. and

Müller, H.-G. (2016). Review of functional data analysis.

Annual Review of Statistics and Its Application hen, Lin & Müller REFERENCES

Yang, S.-S. (1985). A smooth nonparametric estimator of a quantile function.

Journal of theAmerican Statistical Association Yao, F. , Müller, H.-G. and

Wang, J.-L. (2005). Functional linear regression analysis forlongitudinal data.

Annals of Statistics Yuan, M. and

Cai, T. T. (2010). A reproducing kernel Hilbert space approach to functionallinear regression.

Annals of Statistics Yuan, Y. , Zhu, H. , Lin, W. and

Marron, J. (2012). Local polynomial regression for sym-metric positive deﬁnite matrices.

Journal of the Royal Statistical Society: Series B (StatisticalMethodology) Zemel, Y. and

Panaretos, V. M. (2019). Fréchet means and Procrustes analysis in Wassersteinspace.

Bernoulli25