On Projection Robust Optimal Transport: Sample Complexity and Model Misspecification
Tianyi Lin, Zeyu Zheng, Elynn Y. Chen, Marco Cuturi, Michael I. Jordan
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
Department of Industrial Engineering and Operations Research, University of California, Berkeley
Department of Statistics, University of California, Berkeley
CREST - ENSAE
Google Brain
June 30, 2020
Abstract
Optimal transport (OT) distances are increasingly used as loss functions for statistical inference, notably in the learning of generative models or supervised learning. Yet the behavior of minimum Wasserstein estimators is poorly understood, notably in high-dimensional regimes or under model misspecification. In this work we adopt the viewpoint of projection robust (PR) OT, which seeks to maximize the OT cost between two measures by choosing a k-dimensional subspace onto which they can be projected. Our first contribution is to establish several fundamental statistical properties of PR Wasserstein distances, complementing and improving previous literature that has been restricted to one-dimensional and well-specified cases. Next, we propose the integral PR Wasserstein (IPRW) distance as an alternative to the PRW distance, obtained by averaging rather than optimizing over subspaces. Our complexity bounds can help explain why both PRW and IPRW distances outperform Wasserstein distances empirically in high-dimensional inference tasks. Finally, we consider parametric inference using the PRW distance. We provide asymptotic guarantees for two types of minimum PRW estimators and formulate a central limit theorem for the max-sliced Wasserstein estimator under model misspecification. To enable our analysis of PRW with projection dimension larger than one, we devise a novel combination of variational analysis and statistical theory.

Introduction

Recent years have witnessed an ever-increasing role for ideas from optimal transport (OT) [Villani, 2008] in machine learning. Combining OT distances with the general principles of minimum distance estimation (MDE) [Wolfowitz, 1957, Basu et al., 2011] yields a powerful basis for various statistical inference problems, such as density estimation [Bassetti et al., 2006], training of generative models [Arjovsky et al., 2017, Gulrajani et al., 2017, Montavon et al., 2016, Adler and Lunz, 2018, Cao et al., 2019], auto-encoders [Tolstikhin et al., 2018], clustering [Cuturi and Doucet, 2014, Bonneel et al., 2016, Ho et al., 2017, Ye et al., 2017], multitask regression [Janati et al., 2020], trajectory inference [Hashimoto et al., 2016, Schiebinger et al., 2017, Yang et al., 2020, Tong et al., 2020] and nonparametric testing [Ramdas et al., 2017]; see Peyré and Cuturi [2019] and Panaretos and Zemel [2019] for reviews on these topics.

For OT ideas to continue to bear fruit in machine learning, it will be necessary to tackle two characteristic challenges: (1) high dimensionality and (2) model misspecification. Initial progress has been made on the latter problem by Bernton et al. [2019], who showed that in the misspecified case the minimum Wasserstein estimator (MWE) outputs the Wasserstein projection of the data-generating distribution onto the fitted model class. These authors also obtained results on robustness and the asymptotic distribution of the projection, although these results only apply to the one-dimensional setting. High-dimensional settings are challenging; indeed, it is known that the sample complexity of estimating the Wasserstein distance can grow exponentially in dimension [Dudley, 1969, Fournier and Guillin, 2015, Weed and Bach, 2019, Lei, 2020].

We focus on a promising approach to treating high-dimensional problems: compute the OT distance between low-dimensional projections of high-dimensional input measures. The simplest and most representative example of this approach is the sliced Wasserstein distance [Rabin et al., 2011, Bonnotte, 2013, Bonneel et al., 2015, Deshpande et al., 2019, Kolouri et al., 2019a, Nadjahi et al., 2020], which is defined as the average OT distance obtained between random one-dimensional projections, and which has been shown to be practical in real applications [Deshpande et al., 2018, 2019, Kolouri et al., 2016, 2019b, Carriere et al., 2017, Wu et al., 2019, Liutkus et al., 2019]. In an important extension, Paty and Cuturi [2019] and Niles-Weed and Rigollet [2019] recently proposed to seek the k-dimensional subspace (k > 1) that maximizes the OT cost between the projected measures. The resulting projection robust Wasserstein (PRW) distance is conceptually simple and does overcome the curse of dimensionality in the so-called spiked model, as proved in [Niles-Weed and Rigollet, 2019, Theorem 1], recovering an optimal 1/√n rate under the Talagrand transport inequality. This result suggests that PRW can be significantly more useful than the OT distance for inference tasks when the dimension is large. From a computational point of view, PRW becomes the max-sliced Wasserstein distance when the projection dimension is k = 1 and admits an efficient implementation [Deshpande et al., 2019]. For general k ≥ 1, Lin et al. [2020] proposed to compute PRW using a Riemannian optimization toolbox and provided theoretical guarantees and encouraging empirical results. However, obtaining rigorous guarantees for the practical performance of PRW requires a more thorough understanding of its statistical behavior.
Contributions.
In this paper, we study the statistical properties of PRW and of the so-called integrated PRW (IPRW) distance, which replaces the maximum in the original PRW with an average of OT distances over k-dimensional projections. Our contributions can be summarized as follows.

1. We prove that the empirical measure µ̂_n converges to the true measure µ⋆ under both the PRW and IPRW distances, with different parametric rates. For example, when the order is p = 3/2 and k ≥ 3, the parametric rate is n^{−1/k} (IPRW). For PRW, the rate acquires two additional terms of order n^{−1/(2p)}√(dk log(n)) and n^{−1/p} dk log(n) when µ⋆ satisfies a projection Bernstein tail condition, and analogous terms when µ⋆ satisfies a projection Poincaré inequality. Concentration results are also presented under stronger conditions.

2. We establish asymptotic guarantees for the minimum PRW and expected PRW estimators under model misspecification. For the minimum PRW estimator with p = 1 and k = 1, we derive an asymptotic distribution for arbitrary dimension d with the parametric n^{−1/2} rate in the Hausdorff metric. Our assumptions are weaker than those used in Bernton et al. [2019], requiring neither the nonsingularity of the Jacobian nor the separability of the parameters.

3. We conduct extensive experiments on synthetic data and neural networks to validate our theory. As a byproduct, we present a simple optimization algorithm that can efficiently compute the PRW distance in practice even when k ≥ 2; see Appendix F or Lin et al. [2020, Appendix B].

This quantity is also named Wasserstein Projection Pursuit (WPP) [Niles-Weed and Rigollet, 2019]. For simplicity, we refer from now on to PRW/WPP as PRW.
In summary, our work provides an enhanced understanding of two PRW distances and the associated minimum distance estimators under model misspecification, complementing the existing literature [Niles-Weed and Rigollet, 2019, Bernton et al., 2019, Nadjahi et al., 2019, 2020]. Our proof techniques consist of a novel combination of classical results from variational analysis and statistical theory, which may be of independent interest.
In this section, we provide some technical background material on projection optimal transport. Throughout the paper, ||·|| denotes the Euclidean norm (in the corresponding vector space).
Wasserstein and sliced Wasserstein.
Let p ≥ 1. We denote by P(R^d) the set of all Borel probability measures on R^d, and by P_p(R^d) the subset of measures satisfying M_p(µ) := ∫_{R^d} ||x||^p dµ(x) < +∞. For two probability measures µ, ν ∈ P_p(R^d), their Wasserstein distance of order p is defined as follows:

W_p(µ, ν) := ( inf_{π ∈ Π(µ,ν)} ∫_{R^d × R^d} ||x − y||^p dπ(x, y) )^{1/p},   (2.1)

where the infimum is taken over Π(µ, ν) ⊆ P(R^d × R^d), the set of probability measures with marginals µ and ν. In the one-dimensional (1D) case, Rachev and Rüschendorf [1998, Theorem 3.1.2.(a)] have shown that W_p^p(µ, ν) = ∫_0^1 |F_µ^{−1}(t) − F_ν^{−1}(t)|^p dt, where F_µ^{−1} and F_ν^{−1} are the quantile functions of µ and ν. This 1D formula motivates the sliced Wasserstein (SW) and max-sliced Wasserstein (max-SW) distances [Bonnotte, 2013, Bonneel et al., 2015, Deshpande et al., 2019]. In particular, the idea is to use, as a proxy for (2.1), the average or the maximum of a set of 1D Wasserstein distances constructed by projecting the d-dimensional measures onto a random collection of 1D subspaces. Being computationally appealing, both the SW and max-SW distances are widely used in practice, especially in generative modeling [Kolouri et al., 2019b, Deshpande et al., 2019, Liutkus et al., 2019]. Practitioners observe that the SW distance only yields a good Monte-Carlo approximation with a large number of projection samples, while the max-SW distance can achieve similar results with fewer samples [Kolouri et al., 2019a, Nguyen et al., 2020].
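Since both SW and max-SW reduce to one-dimensional transport, the quantile formula gives an exact, sort-based evaluation for empirical measures with equally many atoms. A minimal sketch (our own illustration; the function name is not from the paper):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p between two 1D empirical measures with equally many atoms.

    By the quantile formula W_p(mu, nu)^p = int_0^1 |F_mu^{-1}(t) - F_nu^{-1}(t)|^p dt,
    and because the quantile functions of n equally weighted atoms are step
    functions, the integral reduces to an average over matched order statistics.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    assert x.shape == y.shape, "equal sample sizes keep the formula exact"
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

# The SW distance averages wasserstein_1d over random directions, while
# max-SW maximizes it; both reuse exactly this 1D routine.
```

For example, `wasserstein_1d([0, 1], [1, 2], p=1)` returns 1.0, since each order statistic is shifted by exactly one.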
Projection robust Wasserstein. Encouraged by the success of SW and max-SW, Paty and Cuturi [2019] ask whether we can gain more by using a subspace of dimension k ≥ 2. The resulting quantity is the projection robust Wasserstein (PRW) distance, which is well posed whenever the order is p ≥ 1. More specifically, let S_{d,k} = {E ∈ R^{d×k} : E^⊤E = I_k} be the set of d × k orthogonal matrices, and let E_⋆ be the linear transformation associated with E, defined for any x ∈ R^d by E_⋆(x) = E^⊤x. For any measurable function f and µ ∈ P(R^d), we write fµ for the push-forward of µ by f, so that fµ(A) = µ(f^{−1}(A)), where f^{−1}(A) = {x ∈ R^d : f(x) ∈ A} for any Borel set A; in particular, E_⋆µ is the projection of µ onto the k-dimensional subspace spanned by the columns of E. The PRW distance of order p between µ and ν is defined by

PW_{p,k}(µ, ν) := sup_{E ∈ S_{d,k}} W_p(E_⋆µ, E_⋆ν).   (2.2)

As an alternative, we define the IPRW distance, which replaces the supremum in Eq. (2.2) with an average. Formally, the IPRW distance of order p between µ and ν is defined by

IPW_{p,k}(µ, ν) := ( ∫_{S_{d,k}} W_p^p(E_⋆µ, E_⋆ν) dσ(E) )^{1/p},   (2.3)

where σ is the uniform distribution on S_{d,k}. The IPRW and PRW distances generalize the SW and max-SW distances, respectively, to the high-dimensional projection setting. Compared to the PRW distance, the IPRW distance can be shown to behave better statistically, but it remains unfavorable in a computational sense: a large number of samples from S_{d,k} is necessary to approximate the IPRW distance accurately. Improving this computational efficiency is interesting but beyond the scope of this paper.
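Both (2.2) and (2.3) can be approximated directly from samples: draw projections E uniformly from S_{d,k} (via QR factorization of a Gaussian matrix), evaluate the projected OT cost exactly by optimal assignment (valid for uniform empirical measures with equally many atoms), then maximize or average. The sketch below is our own illustration, not the optimization algorithm of Appendix F, and it assumes SciPy is available for the assignment solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_p(x, y, p=2):
    """Exact W_p between uniform empirical measures with equally many atoms,
    computed via the optimal assignment between the two point clouds."""
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p
    rows, cols = linear_sum_assignment(cost)
    return float(np.mean(cost[rows, cols]) ** (1.0 / p))

def random_stiefel(d, k, rng):
    """Draw E in S_{d,k} (so that E^T E = I_k) as the Q factor of a Gaussian
    matrix; this gives an approximately uniform sample of the manifold."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q

def prw_iprw_estimates(x, y, k=2, p=2, n_proj=50, seed=0):
    """Monte-Carlo sketch: the max over sampled projections lower-bounds the
    PRW distance (2.2); averaging W_p^p over projections and taking the p-th
    root approximates the IPRW distance (2.3)."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_proj):
        e = random_stiefel(x.shape[1], k, rng)
        vals.append(wasserstein_p(x @ e, y @ e, p))
    vals = np.asarray(vals)
    return float(vals.max()), float(np.mean(vals ** p) ** (1.0 / p))
```

Since the maximum dominates the p-th power mean over the same sampled projections, the first return value is never smaller than the second; replacing the random search over E by Riemannian ascent on S_{d,k} recovers the approach of Lin et al. [2020].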
Convergence of empirical measures. Let X_n = (X_1, ..., X_n) be independent and identically distributed samples from the true measure µ⋆ ∈ P_q(R^d). The empirical measure of X_n is defined by µ̂_n := (1/n) Σ_{i=1}^n δ_{X_i}. It is known that µ̂_n ⇒ µ⋆ almost surely, and that W_p(µ̂_n, µ⋆) → 0 with E[W_p(µ̂_n, µ⋆)] ≍ n^{−1/d} whenever µ⋆ is absolutely continuous with respect to the Lebesgue measure and d > 2p [Dudley, 1969, Fournier and Guillin, 2015, Weed and Bach, 2019]. The convergence is slow when the dimension is high, an instance of the well-known curse-of-dimensionality phenomenon.

Due to the low-dimensional structure of the IPRW and PRW distances, their parametric rate is expected to be of order n^{−1/k} in the large-n limit. Similar rates have been derived for E[|PW_{p,k}(µ̂_n, ν̂_n) − W_p(µ, ν)|] as a function of n under a spiked transport model; see Niles-Weed and Rigollet [2019, Theorem 8]. Their bound depends on the problem dimension d and requires µ and ν to satisfy the Talagrand transport inequality [Talagrand, 1996]. For the special case k = 1, the rate for the IPRW distance was studied in [Nadjahi et al., 2020]. To the best of our knowledge, there has been no other work on the statistical properties of the IPRW and PRW distances.
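The dimension-dependence discussed above is easy to probe numerically. The toy experiment below (ours, purely illustrative) tracks the average one-dimensional W_1 between two independent empirical measures of the uniform law on [0, 1], which decays at the parametric-type n^{−1/2} rate that projections aim to recover:

```python
import numpy as np

def w1_1d(x, y):
    # 1D W_1 between equal-size empirical measures: match order statistics.
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(0)
mean_err = {}
for n in (16, 256, 4096):
    runs = [w1_1d(rng.random(n), rng.random(n)) for _ in range(50)]
    mean_err[n] = float(np.mean(runs))
# mean_err decreases with n, roughly like n^{-1/2} in one dimension.
```

Repeating this with d-dimensional clouds and a full OT solver exhibits the much slower n^{−1/d} decay, which is precisely what motivates projecting onto k-dimensional subspaces.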
Parametric modeling and inference. A statistical model is a family of distributions, M = {µ_θ ∈ P(R^d) | θ ∈ Θ}, where Θ is the parameter space. A minimal set of conditions for a proper family of distributions is: (i) (Θ, ||·||_Θ) is a Polish space; (ii) Θ is σ-compact, i.e., it is the union of countably many compact subspaces; and (iii) the parameters are identifiable, i.e., µ_θ = µ_θ′ implies θ = θ′. Since the space P_p(R^d) endowed with the distance W_p is a Polish space, we estimate model coefficients using minimum distance estimation (MDE) [Wolfowitz, 1957, Basu et al., 2011], where the distance we consider here is PRW. The main reason why we do not choose IPRW in this setting is computational. The minimum projection robust Wasserstein (MPRW) estimator is defined as follows:

θ̂_n := argmin_{θ∈Θ} PW_{p,k}(µ̂_n, µ_θ).   (2.4)

Note that the probability density function of µ_θ can be difficult to evaluate in practice, especially when µ_θ is a generative model. Nevertheless, in various settings, even if the density is not available, one can generate samples Z_m from µ_θ and use them to approximate µ_θ. With this approximation, a natural alternative is the minimum expected projection robust Wasserstein (MEPRW) estimator, defined as follows [Bernton et al., 2019, Nadjahi et al., 2019]:

θ̂_{n,m} := argmin_{θ∈Θ} E[PW_{p,k}(µ̂_n, µ̂_{θ,m}) | X_n],   (2.5)

where n is the number of samples from the data distribution µ⋆, m is the number of samples from the parametric distribution µ_θ, and µ̂_{θ,m} is an empirical version of µ_θ based on the samples Z_m.
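To make the minimum-distance recipe concrete, here is a toy fit in the spirit of (2.4) with p = 1 and k = 1 (so that PRW is the max-sliced distance). The location model, the grid search, and all names are our own hypothetical choices; the paper's actual computation is deferred to Appendix F:

```python
import numpy as np

def max_sliced_w1(x, y, n_dir=64, seed=0):
    """Lower bound on the max-sliced (k = 1) W_1 distance by maximizing the
    1D order-statistics formula over finitely many random unit directions."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_dir):
        u = rng.standard_normal(x.shape[1])
        u /= np.linalg.norm(u)
        best = max(best, float(np.mean(np.abs(np.sort(x @ u) - np.sort(y @ u)))))
    return best

# Hypothetical model {N(theta * 1_d, I_d)}: fit theta by grid search on the
# MPRW-style objective, using m = n model samples shared across the grid.
rng = np.random.default_rng(1)
d, n = 5, 400
data = 2.0 + rng.standard_normal((n, d))   # observations, true theta* = 2
noise = rng.standard_normal((n, d))        # common randomness for a fair comparison
grid = np.linspace(0.0, 4.0, 21)
losses = [max_sliced_w1(data, theta + noise) for theta in grid]
theta_hat = float(grid[int(np.argmin(losses))])
```

With common randomness across the grid, the objective is minimized near the true θ⋆ = 2; replacing the grid by gradient-based search over θ and the random directions by an ascent over S_{d,k} yields practical versions of the MPRW and MEPRW estimators.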
Existing works have established asymptotic guarantees for minimum Wasserstein and minimum sliced Wasserstein estimators [Bernton et al., 2019, Nadjahi et al., 2019]. Despite the similar proof paths, we remark that our results for the MPRW and MEPRW estimators are new and are derived under weaker assumptions and in more general settings than previous work; see Sections 3.3 and 3.4.

Main Results on Projection Robust Optimal Transport Estimation
Throughout this section, we assume p ≥ 1 and k ∈ [d] unless stated otherwise. Focusing on the IPRW and PRW distances, we prove that they are lower semi-continuous and metrize weak convergence. Through a new sample complexity analysis, we derive the convergence rate of empirical measures under both distances, as well as an improved rate for the PRW distance when µ⋆ satisfies either a Bernstein tail condition or the Poincaré inequality. For generative models with the PRW distance, we study the misspecified setting, where the limit θ⋆ is not necessarily the limit of the maximum likelihood estimator. We establish the asymptotic properties of the MPRW and MEPRW estimators and formulate a central limit theorem when p = 1 and k = 1.

We begin with results on the relationship between the IPRW, PRW and Wasserstein distances. The following lemma demonstrates their equivalence in a topological sense.
Lemma 3.1
The IPRW, PRW and Wasserstein distances are equivalent. In other words, for any sequence of probability measures {µ_i}_{i∈N} and any probability measure µ in P_p(R^d), we have IPW_{p,k}(µ_i, µ) → 0 if and only if PW_{p,k}(µ_i, µ) → 0 if and only if W_p(µ_i, µ) → 0.

Lemma 3.1 is a generalization of Bayraktar and Guo [2019, Theorem 1], where the projection dimension is k = 1. By Lemma 3.1 and Villani [2008, Theorem 6.9], we obtain the following result regarding the topology induced by the PRW distance of order p.

Theorem 3.2
The IPRW and PRW distances both metrize weak convergence. In other words, for any sequence of probability measures {µ_i}_{i∈N} and any probability measure µ in P_p(R^d), we have IPW_{p,k}(µ_i, µ) → 0 if and only if PW_{p,k}(µ_i, µ) → 0 if and only if µ_i ⇒ µ.

Theorem 3.2 generalizes Villani [2008, Theorem 6.9], since the PRW distance coincides with the Wasserstein distance when the projection dimension is k = d. When k = 1, Theorem 3.2 implies that the SW and max-SW distances metrize weak convergence. Note that this implication is stronger than Nadjahi et al. [2019, Theorem 1], which only provides a one-sided argument.

Theorem 3.3
The IPRW and PRW distances are both lower semi-continuous in the usual weak topology. In other words, if the sequences of probability measures {µ_i}_{i∈N}, {ν_i}_{i∈N} ⊆ P(R^d) satisfy µ_i ⇒ µ and ν_i ⇒ ν for probability measures µ, ν ∈ P(R^d), then we have

IPW_{p,k}(µ, ν) ≤ lim inf_{i→+∞} IPW_{p,k}(µ_i, ν_i) and PW_{p,k}(µ, ν) ≤ lim inf_{i→+∞} PW_{p,k}(µ_i, ν_i).

Theorem 3.3 generalizes Nadjahi et al. [2019, Lemma S6] and is pivotal to our subsequent analysis of the asymptotic properties of the MPRW and MEPRW estimators.
We now provide the parametric rate of convergence of the empirical measure under the IPRW and PRW distances. In particular, we present our main result on convergence rates in the following theorem.
Theorem 3.4
Let µ⋆ ∈ P_q(R^d) with M_q := M_q(µ⋆) < +∞. Then we have

E[IPW_{p,k}(µ̂_n, µ⋆)] ≲_{p,q} n^{−[1/((2p)∨k)] ∧ [(1/p)−(1/q)]} (log(n))^{ζ_{p,q,k}/p}, for all n ≥ 2,   (3.1)

where ≲_{p,q} denotes an inequality up to a constant depending only on (p, q), and

ζ_{p,q,k} = 2 if k = q = 2p; 1 if (k ≠ 2p and q = kp/(k−p)) or (q > 2p and k = 2p); 0 otherwise.

Remark 3.1
Theorem 3.4 can be compared with Lei [2020, Theorem 3.1]. The key difference is that our bound does not depend on d, whereas all bounds for the Wasserstein distance grow exponentially worse in d when d ≥ 2p. This improvement makes a qualitative difference, showing that the projection robust distances do not suffer from the curse of dimensionality while retaining flexibility via the choice of k.

Definition 3.1: µ ∈ P(R^d) satisfies a projection Bernstein tail condition if there exist σ, V > 0 such that, for all E ∈ S_{d,k} and X ∼ E_⋆µ, E[|X_i|^r] ≤ (1/2) σ² r! V^{r−2} for all i ∈ [k] and all r ≥ 2.

Theorem 3.5
Suppose that µ⋆ ∈ P_q(R^d) satisfies a projection Bernstein tail condition, and assume the same setting as in Theorem 3.4. For all n ≥ 2, the following inequality holds true:

E[PW_{p,k}(µ̂_n, µ⋆)] ≲_{p,q} n^{−[1/((2p)∨k)] ∧ [(1/p)−(1/q)]} (log(n))^{ζ_{p,q,k}/p} + n^{−1/(2p)} √(dk log(n)) + n^{−1/p} dk log(n).

Definition 3.2: µ ∈ P(R^d) satisfies a projection Poincaré inequality if there exists M > 0 such that, for all E ∈ S_{d,k} and X ∼ E_⋆µ, Var(f(X)) ≤ M E[||∇f(X)||²] for any differentiable f satisfying E[f(X)²] < +∞ and E[||∇f(X)||²] < +∞.

Theorem 3.6
Suppose that µ⋆ ∈ P_q(R^d) satisfies a projection Poincaré inequality, and assume the same setting as in Theorem 3.4. For all n ≥ 2, the following inequality holds true:

E[PW_{p,k}(µ̂_n, µ⋆)] ≲_{p,q} n^{−[1/((2p)∨k)] ∧ [(1/p)−(1/q)]} (log(n))^{ζ_{p,q,k}/p} + n^{−1/(2∨p)} √(dk log(n)) + n^{−1/p} dk log(n).

Remark 3.2
Theorems 3.5 and 3.6 present the parametric rate under the PRW distance when µ⋆ satisfies certain conditions. While the first term matches that in Theorem 3.4, the extra two terms come from bounding the gap E[sup_{E∈S_{d,k}} (W_p(E_⋆µ̂_n, E_⋆µ⋆) − E[W_p(E_⋆µ̂_n, E_⋆µ⋆)])]. Compared with Niles-Weed and Rigollet [2019, Theorem 8], in which µ⋆ is assumed to satisfy the Talagrand transport inequality, our conditions in Definitions 3.1 and 3.2 are strictly weaker, yet our parametric rate matches their n^{−1/k} + n^{−1/2}√(dk log(n)) rate in the large-n limit when p = 1.

We next present concentration results when µ⋆ satisfies stronger conditions than those of Definitions 3.1 and 3.2.

Definition 3.3: µ ∈ P(R^d) satisfies a Bernstein tail condition if there exist σ, V > 0 such that, for Y ∼ µ, E[sup_{E∈S_{d,k}} |(E^⊤Y)_i|^r] ≤ (1/2) σ² r! V^{r−2} for all i ∈ [k] and all r ≥ 2.

Theorem 3.7: If µ⋆ ∈ P(R^d) satisfies a Bernstein tail condition, then the following statement holds true for both W = IPW_{p,k} and W = PW_{p,k}:

P(|W(µ̂_n, µ⋆) − E[W(µ̂_n, µ⋆)]| ≥ t) ≤ 2 exp( − t² / (2σ² n^{−2/p} + 4tV n^{−1/p}) ).   (3.3)

Definition 3.4: µ ∈ P(R^d) satisfies a Poincaré inequality if there exists M > 0 such that, for X ∼ µ, Var[f(X)] ≤ M E[||∇f(X)||²] for any differentiable f satisfying E[f(X)²] < +∞ and E[||∇f(X)||²] < +∞.

Theorem 3.8: If µ⋆ ∈ P(R^d) satisfies a Poincaré inequality, then the following statement holds true for both W = IPW_{p,k} and W = PW_{p,k}:

P(|W(µ̂_n, µ⋆) − E[W(µ̂_n, µ⋆)]| ≥ t) ≤ 2 exp( − K^{−1} min{ n^{1/p} t, n^{2/(2∨p)} t² } ),   (3.4)

where K > 0 depends only on the constant M defined in Definition 3.4.

Remark 3.3
Theorem 3.8 provides strictly better bounds than Theorem 3.7 when p > 2. Moreover, our tail condition in Definition 3.3 is stronger than that in Definition 3.1, yet weaker than the standard Bernstein tail condition, in which X ∼ µ appears inside the expectation without a supremum; see Wainwright [2019]. The Poincaré inequality is usually weaker than the log-Sobolev inequality and is satisfied by various exponential measures and by measures induced by structured Markov processes [Ledoux, 1999].

We now derive the asymptotic properties of the MPRW and MEPRW estimators under model misspecification and data dependence, both of which are common in practice. Our setting is more general than that considered in [Nadjahi et al., 2019], and our results provide theory to support applications in real-world scenarios.
Assumption 3.1
There exists a probability measure µ⋆ ∈ P(R^d) such that the data-generating process satisfies lim_{n→+∞} W_p(µ̂_n, µ⋆) = 0 almost surely.

Assumption 3.2
The map θ ↦ µ_θ is continuous: ||θ_n − θ||_Θ → 0 implies µ_{θ_n} ⇒ µ_θ.

Assumption 3.3
There exists a constant τ > 0 such that the set Θ⋆(τ) ⊆ Θ is bounded, where Θ⋆(τ) = {θ ∈ Θ : PW_{p,k}(µ⋆, µ_θ) ≤ inf_{θ′∈Θ} PW_{p,k}(µ⋆, µ_θ′) + τ}.

Theorem 3.9
Under Assumptions 3.1-3.3, there exists an event Ω with P(Ω) = 1 such that, for all ω ∈ Ω,

lim_{n→+∞} inf_{θ∈Θ} PW_{p,k}(µ̂_n(ω), µ_θ) = inf_{θ∈Θ} PW_{p,k}(µ⋆, µ_θ) and
lim sup_{n→+∞} argmin_{θ∈Θ} PW_{p,k}(µ̂_n(ω), µ_θ) ⊆ argmin_{θ∈Θ} PW_{p,k}(µ⋆, µ_θ).   (3.5)

There also exists n(ω) > 0 such that argmin_{θ∈Θ} PW_{p,k}(µ̂_n(ω), µ_θ) ≠ ∅ for all n ≥ n(ω).

Assumption 3.4: If ||θ_n − θ||_Θ → 0, then E[W_p(µ̂_{θ_n,n}, µ_{θ_n}) | X_n] → 0.

In the next result, we present an analogous version of Theorem 3.9 for the MEPRW estimator as min{n, m} → +∞. For simplicity, we set m := m(n) such that m(n) → +∞ as n → +∞.

Theorem 3.10
Under Assumptions 3.1-3.4, there exists an event Ω with P(Ω) = 1 such that, for all ω ∈ Ω,

lim_{n→+∞} inf_{θ∈Θ} E[PW_{p,k}(µ̂_n(ω), µ̂_{θ,m(n)}) | X_n] = inf_{θ∈Θ} PW_{p,k}(µ⋆, µ_θ) and
lim sup_{n→+∞} argmin_{θ∈Θ} E[PW_{p,k}(µ̂_n(ω), µ̂_{θ,m(n)}) | X_n] ⊆ argmin_{θ∈Θ} PW_{p,k}(µ⋆, µ_θ).   (3.6)

There also exists n(ω) > 0 such that argmin_{θ∈Θ} E[PW_{p,k}(µ̂_n(ω), µ̂_{θ,m(n)}) | X_n] ≠ ∅ for all n ≥ n(ω).

Assumption 3.5: There exists a constant τ > 0 such that the set Θ_n(τ) ⊆ Θ is bounded, where Θ_n(τ) = {θ ∈ Θ : PW_{p,k}(µ̂_n, µ_θ) ≤ inf_{θ′∈Θ} PW_{p,k}(µ̂_n, µ_θ′) + τ}.

Theorem 3.11
Under Assumptions 3.2, 3.4 and 3.5, the following statements hold true:

lim_{m→+∞} inf_{θ∈Θ} E[PW_{p,k}(µ̂_n, µ̂_{θ,m}) | X_n] = inf_{θ∈Θ} PW_{p,k}(µ̂_n, µ_θ),   (3.7)
lim sup_{m→+∞} argmin_{θ∈Θ} E[PW_{p,k}(µ̂_n, µ̂_{θ,m}) | X_n] ⊆ argmin_{θ∈Θ} PW_{p,k}(µ̂_n, µ_θ).   (3.8)

There also exists m_n > 0 such that argmin_{θ∈Θ} E[PW_{p,k}(µ̂_n, µ̂_{θ,m}) | X_n] ≠ ∅ for m ≥ m_n.

In summary, the MPRW and MEPRW estimators both asymptotically converge to θ⋆ ∈ Θ, a minimizer of θ ↦ PW_{p,k}(µ⋆, µ_θ), assuming such a minimizer exists. Moreover, θ⋆ is not necessarily the limit of the maximum likelihood estimator, and it satisfies µ_{θ⋆} = µ⋆ in the well-specified setting.

We now investigate the asymptotic distribution of the MPRW estimator under model misspecification and establish the rate of convergence when k = p = 1. For any u ∈ S^{d−1} and t ∈ R, we define

F_θ(u, t) = ∫_{R^d} 1_{(−∞,t]}(⟨u, x⟩) dµ_θ(x),   F̂_n(u, t) = (1/n) |{i ∈ [n] : ⟨u, X_i⟩ ≤ t}|.   (3.9)

The functions F_θ(u, ·) and F̂_n(u, ·) are the cumulative distribution functions of u_⋆µ_θ and u_⋆µ̂_n, where u ∈ S^{d−1} is a unit vector. Let L¹(S^{d−1} × R) be the class of functions f on S^{d−1} × R such that f(·, t) is continuous and f(u, ·) is absolutely integrable, equipped with the norm ||f||_{L¹} = sup_{u∈S^{d−1}} ∫_R |f(u, t)| dt.

Assumption 3.6 is strictly weaker than a norm-differentiation condition, in which D⋆ would have to be nonsingular. Assumption 3.7 permits model misspecification, in which there is no θ⋆ ∈ Θ such that F_{θ⋆} = F⋆, and is thus more general than Nadjahi et al. [2019, A8]. Assumption 3.8 encodes local strong identifiability of the model µ_θ around θ⋆ and is necessary for the fast n^{−1/2} rate under model misspecification. (Bernton et al. [2019] assume the analogous condition for the Wasserstein distance; however, their analysis depends on a much stronger version with N = Θ.) Thanks to Assumption 3.8, we do not require the parameters to be weakly separable in the PRW sense.

Assumption 3.6
There exists a measurable function D⋆ : S^{d−1} × R → R^{d_θ} such that ||F_θ(u, t) − F_{θ⋆}(u, t) − ⟨θ − θ⋆, D⋆(u, t)⟩||_{L¹} = o(||θ − θ⋆||_Θ).

Assumption 3.7
There exists a random element G⋆ : S^{d−1} × R → R such that the stochastic process √n(F̂_n − F⋆) converges weakly in L¹(S^{d−1} × R) to G⋆.

Assumption 3.8
There exists a neighborhood N of θ⋆ ∈ Θ and a positive constant c⋆ such that PW_{1,1}(µ_θ, µ⋆) ≥ PW_{1,1}(µ_{θ⋆}, µ⋆) + c⋆ ||θ − θ⋆||_Θ for all θ ∈ N.

As pointed out by Nadjahi et al. [2019], one can prove that Assumption 3.7 holds in general by extending [Dede, 2009, Proposition 3.5] and [del Barrio et al., 1999, Theorem 2.1(a)] under some mild conditions on the tails of u_⋆µ⋆.

Remark 3.4: In the well-specified setting, where there exists θ⋆ ∈ Θ such that F⋆ = F_{θ⋆}, it is straightforward to derive the norm-differentiation condition from Assumptions 3.6 and 3.8. This is not true, however, under model misspecification. Moreover, there are minor technical issues in the proof of Bernton et al. [2019, Theorem B.8]; see Appendix E.4. Fixing them is easy but requires additional assumptions; fortunately, we can overcome the issue using some new techniques. Thus, with some refinement, our results can be interpreted as an improvement over Bernton et al. [2019] with fewer assumptions.

To study the asymptotic distributions in the misspecified setting, we employ definitions from Pollard [1980, Section 7]. (Note, however, that our proof technique is different from that of Pollard [1980], which depends on the nonsingularity of D⋆ and requires µ⋆ = µ_{θ⋆} for some θ⋆ in the interior of Θ.)

Definition 3.5 (Hausdorff metric)
Let S be the class of convex and compact sets in L¹(S^{d−1} × R), equipped with ||·||_{L¹}. The Hausdorff metric on S is defined by d_H(S_1, S_2) = inf{δ > 0 : S_1 ⊆ S_2^δ and S_2 ⊆ S_1^δ}, where S^δ = ∪_{x∈S} {z ∈ L¹(S^{d−1} × R) : ||z − x||_{L¹} ≤ δ}.

Definition 3.6 (Approximate MPRW estimators)
The set of approximate MPRW estimators is defined by M_n = {θ ∈ Θ : PW_{1,1}(µ̂_n, µ_θ) ≤ inf_{θ′∈Θ} PW_{1,1}(µ̂_n, µ_θ′) + η_n/√n}, where η_n > 0 is any sequence such that P(η_n → 0) = 1 and M_n is nonempty.

Theorem 3.12
Suppose that Assumptions 3.1-3.3 and 3.6-3.8 hold for some θ⋆ in the interior of Θ, and let G_n = √n(F̂_n − F_{θ⋆}) and G⋆_n = G⋆ + √n(F⋆ − F_{θ⋆}). We also define K(x, β) = {θ ∈ N₀ : ||x − √n⟨θ − θ⋆, D⋆⟩||_{L¹} ≤ inf_{θ′∈N₀} ||x − √n⟨θ′ − θ⋆, D⋆⟩||_{L¹} + β}, where

N₀ = {θ ∈ N : ||F_θ − F_{θ⋆} − ⟨θ − θ⋆, D⋆⟩||_{L¹} / ||θ − θ⋆||_Θ ≤ c⋆}.   (3.10)

Then there exists a sequence satisfying lim_{n→+∞} β_n = 0 such that P⋆(M_n ⊆ K(G_n, β_n)) → 1 as n → +∞. For any ε > 0, we have P(d_H(K(G⋆_n, 0), K(G_n, β_n)) < ε) → 1 as n → +∞.

Remark 3.5
Since K(G⋆_n, 0) = argmin_{θ∈N₀} ||G⋆ + √n(F⋆ − F_{θ⋆} − ⟨θ − θ⋆, D⋆⟩)||_{L¹}, Theorem 3.12 says that the distributional limit of the approximate MPRW estimator set is close, in the Hausdorff metric, to the limit of the sets argmin_{θ∈N₀} ||G⋆ + √n(F⋆ − F_{θ⋆} − ⟨θ − θ⋆, D⋆⟩)||_{L¹}. This result provides a theoretical guarantee for generative modeling with the max-sliced Wasserstein distance.

Remark 3.6
In the well-specified setting, Assumption 3.8 can be replaced by Assumptions A.1-A.2. Under certain conditions, we derive a central limit theorem (cf. Theorem A.3) analogous to Nadjahi et al. [2019, Theorem 6] for minimum sliced Wasserstein estimators; see Appendix A for the details.
We empirically validate our theoretical findings through several experiments on synthetic and real data. Given the space limit, we present the experimental setup in Appendix G and explain an optimization algorithm for computing the PRW distance and the associated estimators in Appendix F. (Here P⋆ denotes the (inner) probability; see Pollard [1980] for details.)

Figure 1: Mean values (top) and mean computational times (bottom) of the IPRW and PRW distances of order 2 between the empirical measures µ̂_n and ν̂_n as the number of points n varies. Results are averaged over 100 runs.

Convergence and concentration.
We set µ = ν = U([−v, v]^d), the uniform distribution over a hypercube, and study the convergence and computation of IPW_{2,k}(µ̂_n, ν̂_n) and PW_{2,k}(µ̂_n, ν̂_n) for increasing values of n. Figure 1 presents the average distances and computational times for (d, v) with d ∈ {10, 30, 50}, where the shaded areas show the max-min values over 100 runs. First, the IPRW distance is significantly smaller than the PRW distance for small n, especially when d and v are large. This confirms Theorem 3.4, which shows that the IPRW rate is independent of d. Second, the PRW distance nearly matches the IPRW distance when n is large. This confirms Theorem 3.6, since the uniform distribution, having a bounded domain, satisfies the Poincaré inequality. Finally, the current computation of the PRW distance is faster than that of the IPRW distance.

Figure 3: Probability density of the estimates of the centered and rescaled σ̂_n on the Gaussian model.

Model misspecification.
We consider parametric inference using Gaussian models $\mathcal{M} = \{\mathcal{N}(m, \sigma^2 I) : m \in \mathbb{R}^d,\ \sigma > 0\}$ and a collection of i.i.d. observations generated from a mixture of 8 Gaussian distributions. This simple setting is useful since the closed-form expression of the Gaussian density makes the computation of the MPRW estimator of order 1 tractable in practice. Following the setup of Nadjahi et al. [2019, Section 4], we illustrate the consistency of the MPRW and MEPRW estimators of order 1 and the convergence of the MEPRW estimator of order 1 to the MPRW estimator of order 1. Results are shown in Figure 2, where $m^\star = \hat{m}$; they are consistent with Theorems 3.9, 3.10 and 3.11.

Figure 2: (a) MPRW vs. $n$; (b) MEPRW vs. $n = m$; (c) MEPRW with $n = 2000$ vs. $m$. Minimum PRW and expected PRW estimations using Gaussian models and $n$ samples from the mixture of 8 Gaussian distributions. Results are averaged over 100 runs and shaded areas represent standard deviations.

Despite the model misspecification, our estimators still converge as the number of observations increases, and the MEPRW estimator converges to the MPRW estimator as we generate more samples. We also verify our central limit theorem by estimating the density of $\hat{\sigma}_n$ with a kernel density estimator over 100 runs. Figure 3 shows the distribution centered and rescaled by $\sqrt{n}$ for each $n$, where $\sigma^\star = \hat{\sigma}$, and it confirms the convergence rate derived in Theorem 3.12. We refer interested readers to Appendix H for further results on i.i.d. observations generated from mixtures of 12 or 25 Gaussian distributions and from elliptically contoured stable models.

Figure 4: Mean test loss for different values of $(n, m)$ on
ImageNet200.

Generative modeling.
We conduct experiments on image generation using the PRW generator of order 2, as an alternative to the SW generator [Deshpande et al., 2018]. We train the neural networks (NNs) with $(n, m) \in \{(100, \cdot), (1000, \cdot), (5000, \cdot), (10000, \cdot)\}$, where $n$ is the number of training samples and $m$ is the number of generated samples. We compare their testing losses to that of a NN trained using the whole training dataset and $m = 200$. All the testing losses are evaluated using the trained models on the testing dataset with $m = 250$ generated samples. Figure 4 presents the mean testing loss on ImageNet200 over 10 runs, where the shaded areas show the max-min values over the runs. We defer the results on other datasets to Appendix G and H.
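The projected distances driving the experiments above are simple to approximate. The sketch below is an illustration under our own naming, not the paper's Riemannian-optimization algorithm of Appendix F: it estimates the order-2 IPRW distance with $k = 1$ by averaging one-dimensional projected Wasserstein distances over random directions, and a crude finite-direction surrogate (lower bound) for the PRW distance by maximizing over the same directions.

```python
import numpy as np

def w2_1d(x, y):
    # W_2 between two 1-D empirical measures with the same number of
    # points: sort both samples and match order statistics.
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))

def projected_w2(X, Y, n_proj=100, seed=0):
    # Project both point clouds onto n_proj random unit directions
    # (k = 1) and return the 1-D W_2 distance along each direction.
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n_proj, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    return np.array([w2_1d(X @ u, Y @ u) for u in U])

def iprw2(X, Y, **kw):
    # Monte Carlo IPRW of order 2: average W_2^2 over directions, then
    # take the 1/2 power, matching the (E[W_2^2])^{1/2} definition.
    return np.mean(projected_w2(X, Y, **kw) ** 2) ** 0.5

def prw2_lower_bound(X, Y, **kw):
    # Max over the sampled directions: a lower bound on the PRW
    # distance with k = 1 (the true PRW optimizes over all subspaces).
    return np.max(projected_w2(X, Y, **kw))
```

Since a maximum dominates a root-mean-square average over the same directions, `prw2_lower_bound` is always at least `iprw2`, mirroring the ordering of the PRW and IPRW curves in Figure 1.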
Acknowledgements
This work was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.
References
P-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009. (Cited on pages 37 and 38.)

J. Adler and S. Lunz. Banach Wasserstein GAN. In NIPS, pages 6754–6763, 2018. (Cited on page 1.)

C. D. Aliprantis and K. C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer Science & Business Media, 2006. (Cited on page 25.)

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, pages 214–223, 2017. (Cited on page 1.)

F. Bassetti, A. Bodini, and E. Regazzini. On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76(12):1298–1302, 2006. (Cited on page 1.)

A. Basu, H. Shioya, and C. Park. Statistical Inference: The Minimum Distance Approach. CRC Press, 2011. (Cited on pages 1 and 4.)

E. Bayraktar and G. Guo. Strong equivalence between metrics of Wasserstein type. ArXiv Preprint: 1912.08247, 2019. (Cited on page 5.)

E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. On parameter estimation with the Wasserstein distance. Information and Inference: A Journal of the IMA, 8(4):657–676, 2019. (Cited on pages 1, 2, 3, 4, 8, 9, 25, 33, and 37.)

P. Billingsley. Convergence of Probability Measures. John Wiley & Sons, 2013. (Cited on page 17.)

N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015. (Cited on pages 2, 3, and 39.)

N. Bonneel, G. Peyré, and M. Cuturi. Wasserstein barycentric coordinates: histogram regression using optimal transport. ACM Transactions on Graphics, 35(4):71:1–71:10, 2016. (Cited on page 1.)

N. Bonnotte. Unidimensional and Evolution Methods for Optimal Transportation. PhD thesis, Paris 11, 2013. (Cited on pages 2 and 3.)
L. D. Brown and R. Purves. Measurable selections of extrema. The Annals of Statistics, 1(5):902–912, 1973. (Cited on page 26.)

J. Cao, L. Mo, Y. Zhang, K. Jia, C. Shen, and M. Tan. Multi-marginal Wasserstein GAN. In NeurIPS, pages 1774–1784, 2019. (Cited on page 1.)

M. Carriere, M. Cuturi, and S. Oudot. Sliced Wasserstein kernel for persistence diagrams. In ICML, pages 664–673, 2017. (Cited on page 2.)

M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In ICML, pages 685–693, 2014. (Cited on page 1.)

M. Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. In NeurIPS, pages 2292–2300, 2013. (Cited on page 37.)

S. Dede. An empirical central limit theorem in L1 for stationary sequences. Stochastic Processes and Their Applications, 119(10):3494–3515, 2009. (Cited on page 8.)

E. del Barrio, E. Giné, and C. Matrán. Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Annals of Probability, pages 1009–1071, 1999. (Cited on page 8.)

I. Deshpande, Z. Zhang, and A. G. Schwing. Generative modeling using the sliced Wasserstein distance. In CVPR, pages 3483–3491, 2018. (Cited on pages 2, 11, and 43.)

I. Deshpande, Y-T. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, Z. Zhao, D. Forsyth, and A. G. Schwing. Max-sliced Wasserstein distance and its use for GANs. In CVPR, pages 10648–10656, 2019. (Cited on pages 2, 3, and 43.)

R. M. Dudley. The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50, 1969. (Cited on pages 2 and 4.)

R. Flamary and N. Courty. POT: Python Optimal Transport library, 2017. URL https://github.com/rflamary/POT. (Cited on pages 37 and 42.)

N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015. (Cited on pages 2, 4, and 20.)
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NeurIPS, pages 5767–5777, 2017. (Cited on page 1.)

T. Hashimoto, D. Gifford, and T. Jaakkola. Learning population-level diffusions with generative RNNs. In ICML, pages 2417–2426, 2016. (Cited on page 1.)

N. Ho, X. Nguyen, M. Yurochkin, H. H. Bui, V. Huynh, and D. Phung. Multilevel clustering via Wasserstein means. In ICML, pages 1501–1509, 2017. (Cited on page 1.)

H. Janati, T. Bazeille, B. Thirion, M. Cuturi, and A. Gramfort. Multi-subject MEG/EEG source imaging with sparse multi-task regression. NeuroImage, page 116847, 2020. (Cited on page 1.)

D. P. Kingma and J. Ba. ADAM: a method for stochastic optimization. In ICLR, 2015. (Cited on pages 40, 43, and 45.)

S. Kolouri, Y. Zou, and G. K. Rohde. Sliced Wasserstein kernels for probability distributions. In CVPR, pages 5258–5267, 2016. (Cited on page 2.)

S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. Rohde. Generalized sliced Wasserstein distances. In NeurIPS, pages 261–272, 2019a. (Cited on pages 2 and 3.)

S. Kolouri, P. E. Pope, C. E. Martin, and G. K. Rohde. Sliced Wasserstein auto-encoders. In ICLR, 2019b. (Cited on pages 2 and 3.)

M. Ledoux. Concentration of measure and logarithmic Sobolev inequalities. In Séminaire de Probabilités XXXIII, pages 120–216. Springer, 1999. (Cited on pages 7 and 21.)

J. Lei. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli, 26(1):767–798, 2020. (Cited on pages 2, 6, and 20.)

X. Li, S. Chen, Z. Deng, Q. Qu, Z. Zhu, and A. M-C. So. Nonsmooth optimization over Stiefel manifold: Riemannian subgradient methods. ArXiv Preprint: 1911.05047, 2019. (Cited on page 38.)

T. Lin, C. Fan, N. Ho, M. Cuturi, and M. I. Jordan. Projection robust Wasserstein distance and Riemannian optimization. ArXiv Preprint: 2006.07458, 2020. (Cited on pages 2 and 37.)

H. Liu, A. M-C. So, and W. Wu. Quadratic optimization with orthogonality constraint: explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods. Mathematical Programming, 178(1-2):215–262, 2019. (Cited on page 38.)

A. Liutkus, U. Simsekli, S. Majewski, A. Durmus, and F-R. Stöter. Sliced-Wasserstein flows: nonparametric generative modeling via optimal transport and diffusions. In ICML, pages 4104–4113, 2019. (Cited on pages 2 and 3.)
G. Montavon, K-R. Müller, and M. Cuturi. Wasserstein training of restricted Boltzmann machines. In NIPS, pages 3718–3726, 2016. (Cited on page 1.)

K. Nadjahi, A. Durmus, U. Simsekli, and R. Badeau. Asymptotic guarantees for learning generative models with the sliced-Wasserstein distance. In NeurIPS, pages 250–260, 2019. (Cited on pages 3, 4, 5, 7, 8, 9, 10, 25, 39, and 40.)

K. Nadjahi, A. Durmus, L. Chizat, S. Kolouri, S. Shahrampour, and U. Şimşekli. Statistical and topological properties of sliced probability divergences. ArXiv Preprint: 2003.05783, 2020. (Cited on pages 2, 3, and 4.)

K. Nguyen, N. Ho, T. Pham, and H. Bui. Distributional sliced-Wasserstein and applications to generative modeling. ArXiv Preprint: 2002.07367, 2020. (Cited on page 3.)

J. Niles-Weed and P. Rigollet. Estimation of Wasserstein distances in the spiked transport model. ArXiv Preprint: 1909.07513, 2019. (Cited on pages 2, 3, 4, 6, 20, and 37.)

J. P. Nolan. Multivariate elliptically contoured stable distributions: theory and estimation. Computational Statistics, 28(5):2067–2089, 2013. (Cited on page 40.)

V. M. Panaretos and Y. Zemel. Statistical aspects of Wasserstein distances. Annual Review of Statistics and its Application, 6:405–431, 2019. (Cited on page 1.)

F-P. Paty and M. Cuturi. Subspace robust Wasserstein distances. In ICML, pages 5072–5081, 2019. (Cited on pages 2, 3, 37, and 42.)

G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019. (Cited on page 1.)

D. Pollard. The minimum distance method of testing. Metrika, 27(1):43–70, 1980. (Cited on pages 9, 33, 34, and 37.)

J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011. (Cited on page 2.)

S. T. Rachev and L. Rüschendorf. Mass Transportation Problems: Volume I: Theory, volume 1. Springer Science & Business Media, 1998. (Cited on page 3.)

A. Ramdas, N. G. Trillos, and M. Cuturi. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017. (Cited on page 1.)

R. T. Rockafellar and R. J-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009. (Cited on pages 25 and 26.)

G. Samoradnitsky. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Routledge, 2017. (Cited on page 40.)

G. Schiebinger, J. Shu, M. Tabaka, B. Cleary, V. Subramanian, A. Solomon, S. Liu, S. Lin, P. Berube, L. Lee, et al. Reconstruction of developmental landscapes by optimal-transport analysis of single-cell gene expression sheds light on cellular reprogramming. bioRxiv, page 191056, 2017. (Cited on page 1.)
M. Talagrand. Transportation cost for Gaussian and other product measures. Geometric & Functional Analysis GAFA, 6(3):587–600, 1996. (Cited on page 4.)

I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. In ICLR, 2018. (Cited on page 1.)

A. Tong, J. Huang, G. Wolf, D. van Dijk, and S. Krishnaswamy. TrajectoryNet: a dynamic optimal transport network for modeling cellular dynamics. ArXiv Preprint: 2002.04461, 2020. (Cited on page 1.)

C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008. (Cited on pages 1, 4, 5, and 17.)

M. J. Wainwright. High-dimensional Statistics: A Non-asymptotic Viewpoint, volume 48. Cambridge University Press, 2019. (Cited on pages 7, 20, 22, and 23.)

J. Weed and F. Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019. (Cited on pages 2, 4, and 20.)

J. Wolfowitz. The minimum distance method. The Annals of Mathematical Statistics, pages 75–88, 1957. (Cited on pages 1 and 4.)

J. Wu, Z. Huang, D. Acharya, W. Li, J. Thoma, D. P. Paudel, and L. V. Gool. Sliced Wasserstein generative models. In CVPR, pages 3713–3722, 2019. (Cited on page 2.)

K. D. Yang, K. Damodaran, S. Venkatachalapathy, A. C. Soylemezoglu, G. V. Shivashankar, and C. Uhler. Predicting cell lineages using autoencoders and optimal transport. PLoS Computational Biology, 16(4):e1007828, 2020. (Cited on page 1.)

J. Ye, P. Wu, J. Z. Wang, and J. Li. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317–2332, 2017. (Cited on page 1.)

A Further Results on the MPRW and MEPRW Estimators
In this section, we discuss the measurability of the MPRW and MEPRW estimators. For a generic function $f$ on the domain $\mathcal{X}$, we define $\delta\text{-argmin}_{x \in \mathcal{X}} f = \{x \in \mathcal{X} : f(x) \leq \inf_{x \in \mathcal{X}} f + \delta\}$. Our results are summarized in the following two theorems.

Theorem A.1
Under Assumption 3.1, for any $n \geq 1$ and $\delta > 0$, there exists a Borel measurable function $\hat{\theta}_n : \Omega \to \Theta$ such that
$$\hat{\theta}_n(\omega) \in \begin{cases} \operatorname{argmin}_{\theta \in \Theta} PW_{p,k}(\hat{\mu}_n(\omega), \mu_\theta) & \text{if this set is nonempty}, \\ \delta\text{-}\operatorname{argmin}_{\theta \in \Theta} PW_{p,k}(\hat{\mu}_n(\omega), \mu_\theta) & \text{otherwise}. \end{cases}$$

Theorem A.2
Under Assumption 3.1, for any $n \geq 1$, $m \geq 1$ and $\delta > 0$, there exists a Borel measurable function $\hat{\theta}_{n,m} : \Omega \to \Theta$ such that
$$\hat{\theta}_{n,m}(\omega) \in \begin{cases} \operatorname{argmin}_{\theta \in \Theta} \mathbb{E}[PW_{p,k}(\hat{\mu}_n(\omega), \hat{\mu}_{\theta,m}) \mid X_n] & \text{if this set is nonempty}, \\ \delta\text{-}\operatorname{argmin}_{\theta \in \Theta} \mathbb{E}[PW_{p,k}(\hat{\mu}_n(\omega), \hat{\mu}_{\theta,m}) \mid X_n] & \text{otherwise}. \end{cases}$$

We also present the asymptotic distribution of the goodness-of-fit statistic as well as the MPRW estimator in the well-specified setting and establish the rate of convergence. For this we require the well-separability of the model in Assumption A.1 and the non-singularity of $D^\star$ in Assumption A.2 in place of the local strong identifiability in Assumption 3.8.

Assumption A.1
For any $\epsilon > 0$, there exists $\delta > 0$ such that $\inf_{\theta \in \Theta: \|\theta - \theta^\star\|_\Theta \geq \epsilon} PW_{1,1}(\mu_{\theta^\star}, \mu_\theta) > \delta$.
There exists a non-singular $D^\star$ such that Assumption 3.6 holds true.
Suppose that $\mu^\star = \mu_{\theta^\star}$ for some $\theta^\star$ in the interior of $\Theta$. Under Assumptions 3.1-3.3, 3.6-3.7 and A.1-A.2, the goodness-of-fit statistic satisfies
$$\sqrt{n}\, \inf_{\theta \in \Theta} PW_{1,1}(\hat{\mu}_n, \mu_\theta) \;\Rightarrow\; \inf_{\theta \in \Theta} \max_{u \in S^{d-1}} \int_{\mathbb{R}} \left| G^\star(u, t) - \langle \theta, D^\star(u, t) \rangle \right| dt, \quad \text{as } n \to +\infty.$$
Suppose also that the random map $\theta \mapsto \max_{u \in S^{d-1}} \int_{\mathbb{R}} |G^\star(u, t) - \langle \theta, D^\star(u, t) \rangle|\, dt$ has a unique infimum almost surely. Then the MPRW estimator of order 1 satisfies
$$\sqrt{n}\, (\hat{\theta}_n - \theta^\star) \;\Rightarrow\; \operatorname{argmin}_{\theta \in \Theta} \max_{u \in S^{d-1}} \int_{\mathbb{R}} \left| G^\star(u, t) - \langle \theta, D^\star(u, t) \rangle \right| dt, \quad \text{as } n \to +\infty.$$
Both weak convergence results are valid for the metric induced by the norm $\|\cdot\|_{L_1}$.

B Postponed Proofs in Subsection 3.1
This section lays out the detailed proofs of Lemma 3.1 and Theorems 3.2 and 3.3.

B.1 Preliminary technical results
For completeness, we collect several preliminary technical results which will be used in the proofs.

Theorem B.1 (Prokhorov's theorem)
Let $\mathcal{P}(\mathbb{R}^d)$ denote the collection of all probability measures defined on $\mathbb{R}^d$ with the Borel $\sigma$-algebra, and let $\{\mu_i\}_{i \in \mathbb{N}}$ be a tight sequence in $\mathcal{P}(\mathbb{R}^d)$. Then every subsequence of $\{\mu_i\}_{i \in \mathbb{N}}$ has a further subsequence that converges weakly in $\mathcal{P}(\mathbb{R}^d)$. Moreover, if every weakly convergent subsequence has the same limit, the whole sequence converges weakly to this limit.

Theorem B.2 (Theorem 4.1 in Villani [2008])
Let $(\mathcal{X}, \mu)$ and $(\mathcal{Y}, \nu)$ be two Polish probability spaces; let $a : \mathcal{X} \to \mathbb{R} \cup \{-\infty\}$ and $b : \mathcal{Y} \to \mathbb{R} \cup \{-\infty\}$ be upper semi-continuous functions such that $a$ and $b$ are absolutely integrable with respect to $\mu$ and $\nu$ respectively. Let $c : \mathcal{X} \times \mathcal{Y} \to \mathbb{R} \cup \{+\infty\}$ be lower semi-continuous, such that $c(x, y) \geq a(x) + b(y)$ for all $(x, y)$. Then there exists an optimal coupling $\pi \in \Pi(\mu, \nu)$ which minimizes the total cost $\mathbb{E}[c(X, Y)]$.

Lemma B.3 (Lemma 4.4 in Villani [2008])
Let $\mathcal{X}$ and $\mathcal{Y}$ be two Polish spaces. Let $\mathcal{P} \subseteq \mathcal{P}(\mathcal{X})$ and $\mathcal{Q} \subseteq \mathcal{P}(\mathcal{Y})$ be tight subsets of $\mathcal{P}(\mathcal{X})$ and $\mathcal{P}(\mathcal{Y})$ respectively. Then the set of all transportation plans whose marginals lie in $\mathcal{P}$ and $\mathcal{Q}$ respectively is itself tight in $\mathcal{P}(\mathcal{X} \times \mathcal{Y})$.

Theorem B.4 (Theorem 6.9 in Villani [2008])
Let $(\mathcal{X}, d)$ be a Polish space and $p \in [1, +\infty)$. The Wasserstein distance $W_p$ metrizes weak convergence in $\mathcal{P}_p(\mathcal{X})$. That is, if $\{\mu_i\}_{i \in \mathbb{N}}$ is a sequence of measures in $\mathcal{P}_p(\mathcal{X})$ and $\mu \in \mathcal{P}_p(\mathcal{X})$, then $\mu_i \Rightarrow \mu$ if and only if $W_p(\mu_i, \mu) \to 0$.

Definition B.1 (Lower semi-continuity)
We say that $f : \mathcal{X} \to \mathbb{R}$ is lower semi-continuous if for any $x_0 \in \mathcal{X}$ and any $y < f(x_0)$, there exists a neighborhood $U$ of $x_0$ such that $f(x) > y$ for all $x \in U$. In the case of a metric space, this is equivalent to $\liminf_{x \to x_0} f(x) \geq f(x_0)$ for any $x_0 \in \mathcal{X}$.

B.2 Proof of Lemma 3.1
We first show that, for any $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$, the following inequality holds true:
$$\underline{PW}_{p,k}(\mu, \nu) \leq PW_{p,k}(\mu, \nu) \leq W_p(\mu, \nu). \tag{B.1}$$
Indeed, by the definitions of $\underline{PW}_{p,k}$ and $PW_{p,k}$, the first inequality is trivial. For the second inequality, we derive from the definition of $PW_{p,k}$ that
$$PW_{p,k}^p(\mu, \nu) = \sup_{E \in \mathcal{S}_{d,k}} W_p^p(E_\star \mu, E_\star \nu) = \sup_{E \in \mathcal{S}_{d,k}} \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|E^\top (x - y)\|^p\, d\pi(x, y).$$
Since $E \in \mathcal{S}_{d,k}$, we have $\|E^\top (x - y)\| \leq \|x - y\|$. Thus $PW_{p,k}^p(\mu, \nu) \leq W_p^p(\mu, \nu)$. Putting these pieces together yields Eq. (B.1). For any sequence $\{\mu_i\}_{i \in \mathbb{N}} \subseteq \mathcal{P}_p(\mathbb{R}^d)$ and $\mu \in \mathcal{P}_p(\mathbb{R}^d)$, we conclude from Eq. (B.1) that $W_p(\mu_i, \mu) \to 0$ implies $PW_{p,k}(\mu_i, \mu) \to 0$, which in turn implies $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$. It therefore remains to show that $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$ implies $W_p(\mu_i, \mu) \to 0$.

Indeed, we first prove that $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$ implies $\mu_i \Rightarrow \mu$. Let $Z_i \sim \mu_i$; then $E^\top Z_i \sim E_\star \mu_i$. By the definition of the IPRW distance (cf. Definition 2.3) and using the fact that $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$, the sequence $(\|E^\top Z_i\|^p)_{i \in \mathbb{N}}$ is uniformly integrable for every $E \in \mathcal{S}_{d,k}$. Since $\mathcal{S}_{d,k}$ is compact, there exists a finite set $\{E_1, E_2, \ldots, E_I\} \subseteq \mathcal{S}_{d,k}$ such that $\|x\| \leq \sum_{j=1}^{I} \|E_j^\top x\|$ for all $x \in \mathbb{R}^d$. Therefore, we have
$$\|Z_i\|^p \leq \Big( \sum_{j=1}^{I} \|E_j^\top Z_i\| \Big)^p \leq I^p \Big( \max_{1 \leq j \leq I} \|E_j^\top Z_i\|^p \Big) \leq I^p \sum_{j=1}^{I} \|E_j^\top Z_i\|^p.$$
Therefore, we deduce that $(\|Z_i\|^p)_{i \in \mathbb{N}}$ is uniformly integrable, which implies the tightness of $\{\mu_i\}_{i \in \mathbb{N}}$. Using Prokhorov's theorem (cf. Theorem B.1), every subsequence of $\{\mu_i\}_{i \in \mathbb{N}}$ has a weakly convergent subsequence. (For Prokhorov's theorem we only present the result on Euclidean space; for general separable metric spaces, we refer the interested readers to Billingsley [2013].)

The next step is to show that all weakly convergent subsequences converge to the same probability measure $\mu$. We fix an arbitrary subsequence and, for simplicity, abbreviate the subscripts and still denote it by $\{\mu_i\}_{i \in \mathbb{N}}$. Let $\tilde{\mu}_i$ be the limit of any given weakly convergent subsequence $(\mu_{i_j})_{j \in \mathbb{N}}$; we need to prove that $\tilde{\mu}_i = \mu$. In particular, we define the characteristic function of any probability measure $\nu$ by
$$\Phi_\nu(z) := \int_{\mathbb{R}^d} e^{\mathrm{i}\langle z, x \rangle}\, d\nu(x) \quad \text{for all } z \in \mathbb{R}^d.$$
Since $\mu_{i_j} \Rightarrow \tilde{\mu}_i$, we have $\Phi_{\mu_{i_j}}(z) \to \Phi_{\tilde{\mu}_i}(z)$ for all $z \in \mathbb{R}^d$. Thus, we need to show that $\Phi_{\mu_{i_j}}(z) \to \Phi_\mu(z)$ for all $z \in \mathbb{R}^d$. This is trivial when $z = 0_d$, since $\Phi_{\mu_{i_j}}(0_d) = \Phi_\mu(0_d) = 1$ for all $j \in \mathbb{N}$. Otherwise, let $r := \|z\|$ and $v := z / \|z\|$; we have
$$\lim_{j \to +\infty} \Phi_{\mu_{i_j}}(z) = \lim_{j \to +\infty} \int_{\mathbb{R}^d} e^{\mathrm{i}\langle z, x \rangle}\, d\mu_{i_j}(x) = \lim_{j \to +\infty} \int_{\mathbb{R}^d} e^{\mathrm{i} r \langle v, x \rangle}\, d\mu_{i_j}(x).$$
Since $\|v\| = 1$, we define $\bar{E} \in \mathcal{S}_{d,k}$ whose first column is $v$. Let $\bar{r}$ be the $k$-dimensional vector whose first coordinate is $r$ and whose other coordinates are zero. Then $r\langle v, x \rangle = \langle \bar{r}, \bar{E}^\top x \rangle$. Putting these pieces together yields
$$\lim_{j \to +\infty} \Phi_{\mu_{i_j}}(z) = \lim_{j \to +\infty} \int_{\mathbb{R}^k} e^{\mathrm{i}\langle \bar{r}, y \rangle}\, d\bar{E}_\star \mu_{i_j}(y).$$
Since $\underline{PW}_{p,k}(\mu_{i_j}, \mu) \to 0$, we deduce that $W_p(\bar{E}_\star \mu_{i_j}, \bar{E}_\star \mu) \to 0$. Using Theorem B.4, we have $\bar{E}_\star \mu_{i_j} \Rightarrow \bar{E}_\star \mu$. Since $r\langle v, x \rangle = \langle \bar{r}, \bar{E}^\top x \rangle$, we have
$$\lim_{j \to +\infty} \int_{\mathbb{R}^k} e^{\mathrm{i}\langle \bar{r}, x \rangle}\, d\bar{E}_\star \mu_{i_j}(x) = \int_{\mathbb{R}^k} e^{\mathrm{i}\langle \bar{r}, x \rangle}\, d\bar{E}_\star \mu(x) = \int_{\mathbb{R}^d} e^{\mathrm{i} r \langle v, x \rangle}\, d\mu(x) = \int_{\mathbb{R}^d} e^{\mathrm{i}\langle z, x \rangle}\, d\mu(x).$$
Putting these pieces together yields that $\Phi_{\mu_{i_j}}(z) \to \Phi_\mu(z)$ for all $z \in \mathbb{R}^d \setminus \{0_d\}$, and hence $\tilde{\mu}_i = \mu$ for all $i \in \mathbb{N}$. Using Prokhorov's theorem again, the whole sequence $\{\mu_i\}_{i \in \mathbb{N}}$ has the limit $\mu$ in the weak sense. Therefore, $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$ implies $\mu_i \Rightarrow \mu$. Since the Wasserstein distances metrize weak convergence (cf. Theorem B.4), we conclude that $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$ implies $W_p(\mu_i, \mu) \to 0$. This completes the proof.
B.3 Proof of Theorem 3.2
By Lemma 3.1, $PW_{p,k}(\mu_i, \mu) \to 0$ if and only if $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$, if and only if $W_p(\mu_i, \mu) \to 0$. Moreover, $\mu_i \Rightarrow \mu$ if and only if $W_p(\mu_i, \mu) \to 0$ (cf. Theorem B.4). Putting these pieces together yields the desired result.

B.4 Proof of Theorem 3.3
Fixing $E \in \mathcal{S}_{d,k}$, the mapping $x \mapsto E^\top x$ is continuous from $\mathbb{R}^d$ to $\mathbb{R}^k$. Since $\mu_i \Rightarrow \mu$ and $\nu_i \Rightarrow \nu$, the continuous mapping theorem implies that $E_\star \mu_i \Rightarrow E_\star \mu$ and $E_\star \nu_i \Rightarrow E_\star \nu$. The next step is the key ingredient in the proof: we show that
$$W_p^p(E_\star \mu, E_\star \nu) \leq \liminf_{i \to +\infty} W_p^p(E_\star \mu_i, E_\star \nu_i) \quad \text{for all } E \in \mathcal{S}_{d,k}. \tag{B.2}$$
From Theorem B.2, there exists a coupling $\pi_i \in \Pi(E_\star \mu_i, E_\star \nu_i)$ such that $W_p^p(E_\star \mu_i, E_\star \nu_i) = \int_{\mathbb{R}^k \times \mathbb{R}^k} \|x - y\|^p\, d\pi_i(x, y)$. By the definition of $\liminf$, there exists a subsequence of $\{\pi_i\}_{i \in \mathbb{N}}$ along which $\int_{\mathbb{R}^k \times \mathbb{R}^k} \|x - y\|^p\, d\pi_{i'}(x, y)$ converges to $\liminf_{i \to +\infty} W_p^p(E_\star \mu_i, E_\star \nu_i)$; for simplicity, we still denote it by $\{\pi_i\}_{i \in \mathbb{N}}$. By Lemma B.3 and Prokhorov's theorem (cf. Theorem B.1), $\{\pi_i\}_{i \in \mathbb{N}}$ is sequentially compact in the weak sense. Thus, there exists a subsequence $\{\pi_{i_j}\}_{j \in \mathbb{N}}$ such that $\pi_{i_j} \Rightarrow \tilde{\pi} \in \mathcal{P}(\mathbb{R}^k \times \mathbb{R}^k)$. Putting these pieces together yields
$$\liminf_{i \to +\infty} W_p^p(E_\star \mu_i, E_\star \nu_i) = \int_{\mathbb{R}^k \times \mathbb{R}^k} \|x - y\|^p\, d\tilde{\pi}(x, y).$$
By the definition of the Wasserstein distance, it suffices to show that $\tilde{\pi} \in \Pi(E_\star \mu, E_\star \nu)$. Indeed, let $f : \mathbb{R}^k \to \mathbb{R}$ be a continuous and bounded function; then
$$\int_{\mathbb{R}^k \times \mathbb{R}^k} f(x)\, d\tilde{\pi}(x, y) = \lim_{j \to +\infty} \int_{\mathbb{R}^k \times \mathbb{R}^k} f(x)\, d\pi_{i_j}(x, y).$$
Since $\pi_{i_j} \in \Pi(E_\star \mu_{i_j}, E_\star \nu_{i_j})$ and $E_\star \mu_i \Rightarrow E_\star \mu$, we have
$$\lim_{j \to +\infty} \int_{\mathbb{R}^k \times \mathbb{R}^k} f(x)\, d\pi_{i_j}(x, y) = \lim_{j \to +\infty} \int_{\mathbb{R}^k} f(x)\, dE_\star \mu_{i_j}(x) = \int_{\mathbb{R}^k} f(x)\, dE_\star \mu(x).$$
Since $E_\star \nu_i \Rightarrow E_\star \nu$, the same argument implies $\int_{\mathbb{R}^k \times \mathbb{R}^k} f(y)\, d\tilde{\pi}(x, y) = \int_{\mathbb{R}^k} f(y)\, dE_\star \nu(y)$. Putting these pieces together yields Eq. (B.2).

For the IPRW distance $\underline{PW}_{p,k}^p$, we derive from Eq. (B.2) and Fatou's lemma that
$$\underline{PW}_{p,k}^p(\mu, \nu) = \int_{\mathcal{S}_{d,k}} W_p^p(E_\star \mu, E_\star \nu)\, d\sigma(E) \leq \liminf_{i \to +\infty} \int_{\mathcal{S}_{d,k}} W_p^p(E_\star \mu_i, E_\star \nu_i)\, d\sigma(E) = \liminf_{i \to +\infty} \underline{PW}_{p,k}^p(\mu_i, \nu_i).$$
For the PRW distance, we derive from Eq. (B.2) and the fact that the supremum of a family of lower semi-continuous mappings is lower semi-continuous that
$$PW_{p,k}^p(\mu, \nu) = \sup_{E \in \mathcal{S}_{d,k}} W_p^p(E_\star \mu, E_\star \nu) \leq \liminf_{i \to +\infty} PW_{p,k}^p(\mu_i, \nu_i).$$
C Postponed Proofs in Subsection 3.2
In this section, we provide the detailed proofs of Theorems 3.4-3.8.

C.1 Preliminary technical results
To facilitate reading, we collect several preliminary technical results which will be used in the proofs for Subsection 3.2.
Theorem C.1 (Tonelli's theorem) If $(\mathcal{X}, \mathcal{A}, \mu)$ and $(\mathcal{Y}, \mathcal{B}, \nu)$ are $\sigma$-finite measure spaces and $f : \mathcal{X} \times \mathcal{Y} \to [0, +\infty]$ is a non-negative measurable function, then
$$\int_{\mathcal{X}} \left( \int_{\mathcal{Y}} f(x, y)\, dy \right) dx = \int_{\mathcal{Y}} \left( \int_{\mathcal{X}} f(x, y)\, dx \right) dy = \int_{\mathcal{X} \times \mathcal{Y}} f(x, y)\, d(x, y).$$

The following proposition provides the state-of-the-art general bound for the Wasserstein distance between the true measure and its empirical version in $\mathbb{R}^d$. Note that we do not assume any additional structure on the true measure. Similar results can be found in many classical works, e.g., Fournier and Guillin [2015, Theorem 1], Weed and Bach [2019, Theorem 1] and Lei [2020, Theorem 3.1]. Here we present the result in the form of Lei [2020, Theorem 3.1].

Proposition C.2
Let $\mu^\star \in \mathcal{P}_q(\mathbb{R}^d)$ and $M_q := M_q(\mu^\star) < +\infty$. Then we have
$$\mathbb{E}[W_p(\hat{\mu}_n, \mu^\star)] \lesssim_{p,q} n^{-\left[\frac{1}{2p \vee d} \wedge \left(\frac{1}{p} - \frac{1}{q}\right)\right]} (\log n)^{\zeta'_{p,q,d}/p}, \quad \text{for all } n \geq 1, \tag{C.1}$$
where $\lesssim_{p,q}$ refers to "less than" up to a constant depending only on $(p, q)$, and $\zeta'_{p,q,d}$ is a logarithmic exponent that is nonzero only in the boundary cases (e.g., $d = 2p$, $q = dp/(d - p)$, or $q > d = 2p$) and zero otherwise; the precise case distinction is as in Lei [2020, Theorem 3.1].

We also need a bound on the covering number of $\mathcal{S}_{d,k}$ in the operator norm of a matrix, denoted by $\|\cdot\|_{\mathrm{op}}$. This is a straightforward consequence of the classical results on the covering number of the unit sphere in $\mathbb{R}^d$ in the Euclidean norm. For the proof details, we refer the interested readers to Niles-Weed and Rigollet [2019, Lemma 4]; for background material on covering numbers, we refer the interested readers to Wainwright [2019, Chapter 5]. For ease of presentation, we provide a formal definition of the covering number of $\mathcal{S}_{d,k}$ in $\|\cdot\|_{\mathrm{op}}$ as follows. For any $\epsilon \in (0, 1)$, the $\epsilon$-covering number of $\mathcal{S}_{d,k}$ in $\|\cdot\|_{\mathrm{op}}$ is defined by
$$N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}}) = \inf\left\{ N \in \mathbb{N} : \exists\, x_1, x_2, \ldots, x_N \in \mathcal{S}_{d,k} \text{ s.t. } \mathcal{S}_{d,k} \subseteq \bigcup_{i=1}^{N} B(x_i, \epsilon) \right\},$$
where $B(x, r) = \{y \in \mathcal{S}_{d,k} : \|y - x\|_{\mathrm{op}} \leq r\}$ is the ball of radius $r > 0$ centered at $x \in \mathcal{S}_{d,k}$ in the operator norm.

Proposition C.3
There exists a universal constant $c > 0$ such that for all $\epsilon \in (0, 1)$, the $\epsilon$-covering number of $\mathcal{S}_{d,k}$ in $\|\cdot\|_{\mathrm{op}}$ satisfies $N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}}) \leq (c\sqrt{k}\, \epsilon^{-1})^{dk}$.

The following theorem summarizes the concentration results under the Bernstein tail condition for a product measure. Indeed, let $\{X_i\}_{i \in [n]}$ be independent samples from probability measures $\mu_i$ on spaces $\mathcal{X}_i$, and let $X'_i$ be independent copies of $X_i$ for all $i \in [n]$. Denote $X = (X_1, \ldots, X_n)$ and $X'(i) = (X_1, \ldots, X'_i, \ldots, X_n)$, which is identical to $X$ except for the $i$-th coordinate. Let $f : \prod_{i=1}^n \mathcal{X}_i \to \mathbb{R}$ be a function such that $\mathbb{E}[|f(X)|] < +\infty$, and define $D_i = f(X) - f(X'(i))$.

Theorem C.4 Suppose that there exist some $\sigma_i, M > 0$ such that $\mathbb{E}[|D_i|^k \mid X_{-i}] \leq \frac{1}{2} \sigma_i^2\, k!\, M^{k-2}$ for all $k \geq 2$. Then the following statement holds:
$$\mathbb{P}(f(X) - \mathbb{E}[f(X)] > t) \leq \exp\left( - \frac{t^2}{2\sum_{i=1}^n \sigma_i^2 + 2tM} \right).$$

The following theorem summarizes the concentration results under the Poincaré inequality for a product measure. We denote by $\|\nabla_i f\|$ the length of the gradient with respect to the $i$-th coordinate.

Theorem C.5 (Corollary 4.6 in Ledoux [1999])
Denote by $\mu^n$ the product measure of $\mu$ on $\otimes_{i=1}^n \mathbb{R}^d$, where $\mu \in \mathcal{P}(\mathbb{R}^d)$ satisfies the Poincaré inequality (cf. Definition 3.4). Let $f$ be a function on $\otimes_{i=1}^n \mathbb{R}^d$ satisfying $\mathbb{E}[|f(X)|] < +\infty$, $\sum_{i=1}^n \|\nabla_i f(X)\|^2 \leq \alpha^2$ and $\max_{1 \leq i \leq n} \|\nabla_i f(X)\| \leq \beta$ almost surely. Then the following statement holds true for $X \sim \mu^n$:
$$\mathbb{P}(f(X) - \mathbb{E}[f(X)] > t) \leq \exp\left( - K \min\left\{ \frac{t}{\beta}, \frac{t^2}{\alpha^2} \right\} \right),$$
where $K > 0$ depends only on the constant $M$ in the Poincaré inequality.

C.2 Proof of Theorem 3.4
Note that $\mu^\star \in \mathcal{P}_q(\mathbb{R}^d)$ and $M_q := M_q(\mu^\star) < +\infty$. Fixing $E \in \mathcal{S}_{d,k}$, we have $E_\star \mu^\star \in \mathcal{P}_q(\mathbb{R}^k)$ and $M_q(E_\star \mu^\star) \leq M_q < +\infty$. Then Proposition C.2 implies that
$$\mathbb{E}[W_p(E_\star \hat{\mu}_n, E_\star \mu^\star)] \lesssim_{p,q} n^{-\left[\frac{1}{2p \vee k} \wedge \left(\frac{1}{p} - \frac{1}{q}\right)\right]} (\log n)^{\zeta'_{p,q,k}/p} \quad \text{for all } n \geq 1.$$
Since $W_p(E_\star \hat{\mu}_n, E_\star \mu^\star) \geq 0$ for all $E \in \mathcal{S}_{d,k}$ and $\mu^\star \in \mathcal{P}_q(\mathbb{R}^d)$, Theorem C.1 implies that
$$\mathbb{E}\left[\underline{PW}_{p,k}(\hat{\mu}_n, \mu^\star)\right] = \mathbb{E}\left[\int_{\mathcal{S}_{d,k}} W_p(E_\star \hat{\mu}_n, E_\star \mu^\star)\, d\sigma(E)\right] = \int_{\mathcal{S}_{d,k}} \mathbb{E}[W_p(E_\star \hat{\mu}_n, E_\star \mu^\star)]\, d\sigma(E).$$
Note that $\zeta_{p,q,k} = \zeta'_{p,q,k}$, where $\zeta_{p,q,k}$ is defined in Theorem 3.4. Putting these pieces together yields the desired result.

C.3 Proof of Theorem 3.5
By the definition of PW p,k ( (cid:98) µ n , µ (cid:63) ), we have E (cid:2) PW p,k ( (cid:98) µ n , µ (cid:63) ) (cid:3) ≤ sup E ∈ S d,k E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )] (C.2)+ E (cid:34) sup E ∈ S d,k (cid:0) W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) ) − E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )] (cid:1)(cid:35) . Using the same arguments for proving Theorem 3.4, we havesup E ∈ S d,k E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )] (cid:46) p,q n − [ p ) ∨ k ∧ ( p − q )] (log( n )) ζp,q,kp for all n ≥ . (C.3)21he remaining step is to bound the gap E [sup E ∈ S d,k ( W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) ) − E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )])]. We firstclaim that W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) ) − E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )] is sub-exponential with parameters (2 σn / − /p , V n − /p )for all E ∈ S d,k if the true measure µ (cid:63) satisfies the projection Bernstein-type tail condition (cf. Defini-tion 3.1). Indeed, let f ( X ) = W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) ), we have D i = f ( X ) − f ( X (cid:48) ( i ) ) ≤ W p ( E (cid:63) (cid:98) µ n , E (cid:63) (cid:98) µ (cid:48) n ) ≤ n − /p (cid:0) (cid:107) E (cid:63) ( X i ) − E (cid:63) ( X (cid:48) i ) (cid:107) (cid:1) . By the triangle inequality and using the projection Bernstein-type tail condition, we have E [ | D i | k | X − i ] ≤ k n − k/p ( E X ∼ E (cid:63) µ [ | X | k ]) ≤ k − n − k/p σ k ! V k − = (2 n − /p σ ) k !(2 n − /p V ) k − . This implies that the condition in Theorem C.4 holds true with σ i = 2 n − /p σ and M = 2 n − /p V .Equipped with Theorem C.4 yields that P (cid:0) W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) ) − E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )] ≥ t (cid:1) ≤ exp (cid:18) − t σ n − /p + 4 tV n − /p (cid:19) . 
For the simplicity, let Z E = W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) ) − E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )]. Then we have E [ Z E ] = 0 and P ( Z E ≥ t ) ≤ exp( − t / (8 σ n − /p +4 tV n − /p )). This together with the definition of Z E and Wainwright[2019, Theorem 2.2] yields the desired claim.We then interpret { Z E } E ∈ S d,k as an empirical process indexed by E ∈ S d,k and claim that there existsa random variable L satisfying E [ L ] ≤ M q ( µ (cid:63) ) so that | Z U − Z V | ≤ L (cid:107) U − V (cid:107) op for all U, V ∈ S d,k .Indeed, we have | Z U − Z V | ≤ W p ( U (cid:63) (cid:98) µ n , V (cid:63) (cid:98) µ n ) + W p ( U (cid:63) µ (cid:63) , V (cid:63) µ (cid:63) )+ E (cid:2) W p ( U (cid:63) (cid:98) µ n , V (cid:63) (cid:98) µ n ) + W p ( U (cid:63) µ (cid:63) , V (cid:63) µ (cid:63) ) (cid:3) . Let X ∼ µ , we have | Z U − Z V | ≤ E ( (cid:107) ( U − V ) X (cid:107) p )) /p + (cid:32) n n (cid:88) i =1 (cid:107) ( U − V ) X i (cid:107) p (cid:33) /p + E (cid:32) n n (cid:88) i =1 (cid:107) ( U − V ) X i (cid:107) p (cid:33) /p ≤ (cid:107) U − V (cid:107) op E ( (cid:107) X (cid:107) p )) /p + (cid:32) n n (cid:88) i =1 (cid:107) X i (cid:107) p (cid:33) /p + E (cid:32) n n (cid:88) i =1 (cid:107) X i (cid:107) p (cid:33) /p := L (cid:107) U − V (cid:107) op . Note that X n = ( X , . . . , X n ) are independent and identically distributed samples according to µ (cid:63) .By the Jensen’s inequality and using the fact that q > p ≥
1, we have
\[
\mathbb{E}[L] \le 4\big(\mathbb{E}\|X\|^p\big)^{1/p} \le 4\big(\mathbb{E}\|X\|^q\big)^{1/q} = 4 M_q(\mu_\star).
\]
Thus, by a standard $\epsilon$-net argument, we obtain that
\[
\mathbb{E}\Big[\sup_{E \in \mathcal{S}_{d,k}} Z_E\Big] \le \inf_{\epsilon > 0}\Big\{\epsilon\,\mathbb{E}[L] + 4\sigma n^{1/2-1/p}\sqrt{\log N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}})} + 2 V n^{-1/p} \log N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}})\Big\},
\]
where there exists a universal constant $c > 0$ such that $\log N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}}) \le dk \log(c\sqrt{k}/\epsilon)$ (cf. Proposition C.3). Putting these pieces together and choosing $\epsilon = \sqrt{k}\, n^{-1/p}$ yields that
\[
\mathbb{E}\Big[\sup_{E \in \mathcal{S}_{d,k}} Z_E\Big] \lesssim_{p,q} \inf_{\epsilon > 0}\bigg\{\epsilon + n^{1/2-1/p}\sqrt{dk \log\Big(\frac{c\sqrt{k}}{\epsilon}\Big)} + n^{-1/p}\, dk \log\Big(\frac{c\sqrt{k}}{\epsilon}\Big)\bigg\} \lesssim_{p,q} n^{1/2-1/p}\sqrt{dk\log n} + n^{-1/p}\, dk\log n.
\]
Therefore, we conclude that
\[
\mathbb{E}\Big[\sup_{E \in \mathcal{S}_{d,k}} \big(W_p(E_\star\hat\mu_n, E_\star\mu_\star) - \mathbb{E}[W_p(E_\star\hat\mu_n, E_\star\mu_\star)]\big)\Big] \lesssim_{p,q} n^{1/2-1/p}\sqrt{dk\log n} + n^{-1/p}\, dk\log n.
\]
This together with Eq. (C.2) and Eq. (C.3) yields the desired inequality.
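As a quick numerical sanity check of the quantity bounded above (a Monte Carlo sketch with $k = 1$ and $p = 1$, not part of the proof; the direction count and sample sizes below are arbitrary choices), one can approximate the supremum over one-dimensional projections of the Wasserstein distance between an empirical measure and a large reference sample, and observe the decay in $n$:

```python
import numpy as np
from scipy.stats import wasserstein_distance


def sup_projected_w1(x, y, n_dirs=200, seed=0):
    # Approximate sup over one-dimensional projections (k = 1) of the
    # Wasserstein-1 distance between two empirical measures in R^d,
    # using randomly sampled unit directions.
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_dirs, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return max(wasserstein_distance(x @ e, y @ e) for e in dirs)


rng = np.random.default_rng(1)
d = 10
reference = rng.standard_normal((20000, d))   # stand-in for mu_star
small = rng.standard_normal((100, d))         # empirical measure, n = 100
large = rng.standard_normal((10000, d))       # empirical measure, n = 10000

d_small = sup_projected_w1(small, reference)
d_large = sup_projected_w1(large, reference)
# The sup-projected distance shrinks as n grows, consistent with the bound.
```

The random-direction maximum is only a lower bound on the true supremum over the Stiefel manifold, but it already exhibits the qualitative decay in $n$ that the $\epsilon$-net argument quantifies.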
C.4 Proof of Theorem 3.6
Using the same arguments as in the proof of Theorem 3.5, we obtain Eq. (C.2) and Eq. (C.3). So it suffices to bound the gap $\mathbb{E}[\sup_{E\in\mathcal{S}_{d,k}}(W_p(E_\star\hat\mu_n, E_\star\mu_\star) - \mathbb{E}[W_p(E_\star\hat\mu_n, E_\star\mu_\star)])]$ under a different condition.

We first claim that $W_p(E_\star\hat\mu_n, E_\star\mu_\star) - \mathbb{E}[W_p(E_\star\hat\mu_n, E_\star\mu_\star)]$ is sub-exponential with parameters $(\sqrt{K/2}\, n^{-1/(2\vee p)}, (K/2)\, n^{-1/p})$ for all $E \in \mathcal{S}_{d,k}$ if the true measure $\mu_\star$ satisfies the projection Poincaré inequality (cf. Definition 3.2). Indeed, we consider $X = (X_1, \ldots, X_n)$ and $X' = (X_1', \ldots, X_n')$ where $X_i, X_i'$ are independent samples from $E_\star\mu_\star$. Let $f(X) = W_p(E_\star\hat\mu_n, E_\star\mu_\star)$; then $\mathbb{E}[|f(X)|] < +\infty$. By the triangle inequality, we have
\[
|f(X) - f(X')| \le n^{-1/p}\Big(\sum_{i=1}^n \|X_i - X_i'\|^p\Big)^{1/p} \le n^{-1/(2\vee p)}\, \|X - X'\|.
\]
This implies that the following statement holds almost surely:
\[
\sum_{i=1}^n \|\nabla_i f(X)\|^2 \le n^{-2/(2\vee p)} \quad \text{and} \quad \max_{1\le i \le n} \|\nabla_i f(X)\| \le n^{-1/p}.
\]
In addition, the probability measure $E_\star\mu_\star \in \mathcal{P}(\mathbb{R}^k)$ is assumed to satisfy the Poincaré inequality. Applying Theorem C.5 yields that
\[
\mathbb{P}\big(W_p(E_\star\hat\mu_n, E_\star\mu_\star) - \mathbb{E}[W_p(E_\star\hat\mu_n, E_\star\mu_\star)] \ge t\big) \le \exp\Big(-\frac{1}{K}\min\big\{t\, n^{1/p},\, t^2\, n^{2/(2\vee p)}\big\}\Big).
\]
For simplicity, let $Z_E = W_p(E_\star\hat\mu_n, E_\star\mu_\star) - \mathbb{E}[W_p(E_\star\hat\mu_n, E_\star\mu_\star)]$. Then we have $\mathbb{E}[Z_E] = 0$ and $\mathbb{P}(Z_E \ge t) \le \exp(-K^{-1}\min\{n^{1/p}\, t,\; n^{2/(2\vee p)}\, t^2\})$.
This together with the definition of $Z_E$ and Wainwright [2019, Theorem 2.2] yields the desired claim.

Using the same argument as in the proof of Theorem 3.5, we can interpret $\{Z_E\}_{E\in\mathcal{S}_{d,k}}$ as an empirical process indexed by $E \in \mathcal{S}_{d,k}$ and show that there exists a random variable $L$ satisfying $\mathbb{E}[L] \le 4 M_q(\mu_\star)$ such that $|Z_U - Z_V| \le L\,\|U - V\|_{\mathrm{op}}$ for all $U, V \in \mathcal{S}_{d,k}$. By a standard $\epsilon$-net argument, we obtain that
\[
\mathbb{E}\Big[\sup_{E\in\mathcal{S}_{d,k}} Z_E\Big] \le \inf_{\epsilon>0}\Big\{\epsilon\,\mathbb{E}[L] + \sqrt{K}\, n^{-1/(2\vee p)}\sqrt{\log N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}})} + (K/2)\, n^{-1/p}\log N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}})\Big\}.
\]
Combining Proposition C.3 and choosing $\epsilon = \sqrt{k}\, n^{-1/p}$ yields that
\[
\mathbb{E}\Big[\sup_{E\in\mathcal{S}_{d,k}} Z_E\Big] \lesssim_{p,q} \inf_{\epsilon>0}\bigg\{\epsilon + n^{-1/(2\vee p)}\sqrt{dk\log\Big(\frac{c\sqrt{k}}{\epsilon}\Big)} + n^{-1/p}\, dk \log\Big(\frac{c\sqrt{k}}{\epsilon}\Big)\bigg\} \lesssim_{p,q} n^{-1/(2\vee p)}\sqrt{dk\log n} + n^{-1/p}\, dk \log n.
\]
Therefore, we conclude that
\[
\mathbb{E}\Big[\sup_{E\in\mathcal{S}_{d,k}} \big(W_p(E_\star\hat\mu_n, E_\star\mu_\star) - \mathbb{E}[W_p(E_\star\hat\mu_n, E_\star\mu_\star)]\big)\Big] \lesssim_{p,q} n^{-1/(2\vee p)}\sqrt{dk\log n} + n^{-1/p}\, dk\log n.
\]
This together with Eq. (C.2) and Eq. (C.3) yields the desired inequality.
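The bounded-difference inequality driving both concentration arguments, namely that replacing a single sample moves the distance by at most $n^{-1/p}$ times the displacement of that sample, can be checked numerically in the simplest case $p = 1$, $k = 1$ (a sketch; `wasserstein_distance` is SciPy's one-dimensional $W_1$, and the sample sizes are arbitrary):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n = 50
x = rng.standard_normal(n)       # empirical sample defining mu_hat_n
ref = rng.standard_normal(5000)  # fixed reference sample standing in for mu_star

base = wasserstein_distance(x, ref)
for i in range(n):
    x_prime = x.copy()
    x_prime[i] = rng.standard_normal()  # resample one coordinate only
    perturbed = wasserstein_distance(x_prime, ref)
    # Moving one of the n atoms changes W_1 by at most |x_i - x_i'| / n,
    # certified by the coupling that transports only the changed atom.
    assert abs(perturbed - base) <= abs(x_prime[i] - x[i]) / n + 1e-12
```

The inner assertion is exactly the bounded-difference property that Theorems C.4 and C.5 convert into sub-exponential tail bounds.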
C.5 Proof of Theorem 3.7
The arguments in this proof hold for both the IPRW and PRW distances; for simplicity, we write $\mathcal{W} = PW_{p,k}$ or $\mathcal{W} = \underline{PW}_{p,k}$. Let $f(X) = \mathcal{W}(\hat\mu_n, \mu_\star)$; then
\[
D_i = f(X) - f(X'^{(i)}) \le \mathcal{W}(\hat\mu_n, \hat\mu_n') \le n^{-1/p}\Big(\sup_{E\in\mathcal{S}_{d,k}} \big\|E_\star(X_i) - E_\star(X_i')\big\|\Big).
\]
By the triangle inequality, we have
\[
\mathbb{E}\big[|D_i|^k \mid X_{-i}\big] \le 2^k n^{-k/p}\, \mathbb{E}\Big[\sup_{E\in\mathcal{S}_{d,k},\, X\sim E_\star\mu_\star} \|X\|^k\Big].
\]
Since the true measure $\mu_\star$ satisfies the Bernstein-type tail condition (cf. Definition 3.3), we have
\[
\mathbb{E}\big[|D_i|^k \mid X_{-i}\big] \le 2^{k-1} n^{-k/p}\, \sigma^2\, k!\, V^{k-2} = (2 n^{-1/p}\sigma)^2\,\frac{k!}{2}\,(2 n^{-1/p} V)^{k-2},
\]
so the condition in Theorem C.4 holds true with $\sigma_i = 2 n^{-1/p}\sigma$ and $M = 2 n^{-1/p} V$. Applying Theorem C.4 yields the desired inequality.

C.6 Proof of Theorem 3.8
The arguments in this proof hold for both the IPRW and PRW distances; for simplicity, we write $\mathcal{W} = PW_{p,k}$ or $\mathcal{W} = \underline{PW}_{p,k}$. We consider $X = (X_1, X_2, \ldots, X_n)$ and $X' = (X_1', X_2', \ldots, X_n')$ where $X_i, X_i'$ are independent samples from $\mu_\star$. Let $f(X) = \mathcal{W}(\hat\mu_n, \mu_\star)$; then $\mathbb{E}[|f(X)|] < +\infty$. By the triangle inequality, we have
\[
|f(X) - f(X')| \le n^{-1/p}\Big(\sum_{i=1}^n \|X_i - X_i'\|^p\Big)^{1/p} \le n^{-1/(2\vee p)}\, \|X - X'\|.
\]
This implies that the following statement holds almost surely:
\[
\sum_{i=1}^n \|\nabla_i f(X)\|^2 \le n^{-2/(2\vee p)} \quad \text{and} \quad \max_{1\le i\le n} \|\nabla_i f(X)\| \le n^{-1/p}.
\]
In addition, the true measure $\mu_\star$ satisfies the Poincaré inequality (cf. Definition 3.4). Applying Theorem C.5 yields the desired inequality.

D Postponed Proofs in Subsection 3.3
In this section, we provide detailed proofs for Theorems 3.9-3.11 and Theorems A.1-A.2. Our results are derived analogously to the proofs in Bernton et al. [2019] for estimators based on the Wasserstein distance and in Nadjahi et al. [2019] for estimators based on the sliced Wasserstein distance.
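The minimum-distance estimation setup analyzed below can be illustrated numerically (a hedged sketch with $k = 1$, $p = 1$, and a one-dimensional Gaussian location model; the grid search, sample sizes, and seeds are arbitrary choices for illustration, not the paper's method):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
theta_star = 0.5
data = theta_star + rng.standard_normal(2000)   # observations from mu_star
base = rng.standard_normal(2000)                # model noise: mu_theta = theta + base

# Minimum-Wasserstein estimate over a parameter grid (a crude stand-in for
# the argmin over Theta in the consistency theorems below).
grid = np.linspace(-2.0, 2.0, 401)
losses = [wasserstein_distance(data, theta + base) for theta in grid]
theta_hat = grid[int(np.argmin(losses))]
# theta_hat should approach theta_star as the sample size grows (consistency).
```

The theorems in this section make this convergence rigorous: almost surely, the argmin sets of the empirical objective concentrate inside the argmin set of the population objective.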
D.1 Preliminary technical results
To facilitate reading, we collect several preliminary technical results which will be used in the postponed proofs of Subsection 3.3.
Theorem D.1 (Theorem 2.43 in Aliprantis and Border [2006])
A real-valued lower semi-continuous function on a compact space attains a minimum value, and the nonempty set of minimizers is compact. Similarly, an upper semi-continuous function on a compact space attains a maximum value, and the nonempty set of maximizers is compact.
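A minimal numerical illustration of this statement (a sketch; the jump function below is an illustrative example of mine, not taken from the text): a lower semi-continuous function with a downward jump still attains its minimum on a compact interval.

```python
import numpy as np


def f(x):
    # Lower semi-continuous: f(0) = 0 <= liminf_{x -> 0} (x**2 + 1) = 1.
    return 0.0 if x == 0 else x * x + 1.0


# Evaluate on a grid of the compact set [-1, 1] that contains the jump point.
grid = np.arange(-1000, 1001) / 1000.0       # includes x = 0 exactly
values = np.array([f(x) for x in grid])
x_min = grid[int(np.argmin(values))]
# The minimum value 0 is attained (at x = 0), as the theorem guarantees;
# the upper semi-continuous function -f attains its maximum 0 there as well.
```

A continuous-from-nowhere counterexample on a non-compact or non-semi-continuous setting would fail to attain its infimum, which is why the proofs below repeatedly verify lower semi-continuity before invoking this theorem.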
Definition D.1 (epiconvergence)
Let $\mathcal{X}$ be a metric space and $\{f_i\}_{i\in\mathbb{N}}$ be a sequence of real-valued functions from $\mathcal{X}$ to $\mathbb{R}$. We say that the sequence $\{f_i\}_{i\in\mathbb{N}}$ epiconverges to a function $f: \mathcal{X} \to \mathbb{R}$ if for each $x \in \mathcal{X}$ the following statements hold true:
\[
\liminf_{i\to+\infty} f_i(x_i) \ge f(x) \quad \text{for every sequence } \{x_i\}_{i\in\mathbb{N}} \text{ such that } x_i \to x,
\]
\[
\limsup_{i\to+\infty} f_i(x_i) \le f(x) \quad \text{for some sequence } \{x_i\}_{i\in\mathbb{N}} \text{ such that } x_i \to x.
\]

Proposition D.2 (Proposition 7.29 in Rockafellar and Wets [2009])
Let $\mathcal{X}$ be a metric space, $\{f_i\}_{i\in\mathbb{N}}$ be a sequence of real-valued functions from $\mathcal{X}$ to $\mathbb{R}$, and $f: \mathcal{X} \to \mathbb{R}$ be lower semi-continuous. Then the sequence $\{f_i\}_{i\in\mathbb{N}}$ epiconverges to $f$ if and only if
\[
\liminf_{i\to+\infty}\Big(\inf_{x\in K} f_i(x)\Big) \ge \inf_{x\in K} f(x) \quad \text{for every compact set } K \subseteq \mathcal{X},
\]
\[
\limsup_{i\to+\infty}\Big(\inf_{x\in O} f_i(x)\Big) \le \inf_{x\in O} f(x) \quad \text{for every open set } O \subseteq \mathcal{X}.
\]

Define $\delta\text{-argmin}_{x\in\mathcal{X}} f = \{x \in \mathcal{X} : f(x) \le \inf_{x\in\mathcal{X}} f + \delta\}$ for a generic function $f: \mathcal{X} \to \mathbb{R}$ and $\delta \ge 0$. The following theorem gives asymptotic properties of the infimum and the $\delta$-argmin of epiconvergent functions, and thus a standard approach to proving the existence and consistency of the estimators.

Theorem D.3 (Theorem 7.31 in Rockafellar and Wets [2009])
Let $\mathcal{X}$ be a metric space and $\{f_i\}_{i\in\mathbb{N}}$ be a sequence of functions which epiconverges to a lower semi-continuous function $f$ with $\inf_{x\in\mathcal{X}} f \in (-\infty, +\infty)$. Then we have the following statements:
1. $\inf_{x\in\mathcal{X}} f_i \to \inf_{x\in\mathcal{X}} f$ if and only if for every $\delta > 0$ there exist a compact set $B \subseteq \mathcal{X}$ and $N \in \mathbb{N}$ such that $\inf_{x\in B} f_i \le \inf_{x\in\mathcal{X}} f_i + \delta$ for all $i \ge N$.
2. $\limsup_{i\to+\infty}(\delta\text{-argmin}_{x\in\mathcal{X}} f_i) \subseteq \delta\text{-argmin}_{x\in\mathcal{X}} f$ for any $\delta \ge 0$, and $\limsup_{i\to+\infty}(\delta_i\text{-argmin}_{x\in\mathcal{X}} f_i) \subseteq \text{argmin}_{x\in\mathcal{X}} f$ whenever $\delta_i \downarrow 0$.
3. Assume that $\inf_{x\in\mathcal{X}} f_i \to \inf_{x\in\mathcal{X}} f$; then there exists a sequence $\delta_i \downarrow 0$ such that $\delta_i\text{-argmin}_{x\in\mathcal{X}} f_i \to \text{argmin}_{x\in\mathcal{X}} f$. Conversely, if $\text{argmin}_{x\in\mathcal{X}} f \neq \emptyset$ and such a sequence exists, then $\inf_{x\in\mathcal{X}} f_i \to \inf_{x\in\mathcal{X}} f$.

The following theorem summarizes the well-known Skorokhod's representation theorem.
Theorem D.4 (Skorokhod’s representation theorem)
Let $\{\mu_n\}_{n\in\mathbb{N}}$ be a sequence of probability measures on a metric space $S$ such that $\mu_n$ converges weakly to some probability measure $\mu_\infty$ on $S$ as $n \to \infty$. Suppose also that the support of $\mu_\infty$ is separable. Then there exist random variables $X_n$ defined on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$ such that the law of $X_n$ is $\mu_n$ for all $n$ (including $n = \infty$) and such that $X_n$ converges to $X_\infty$ almost surely.

The following theorem presents the classical results which lead to a standard approach for proving the measurability of the estimators. Note that the projection is $\mathrm{proj}(D) = \{x \in \mathcal{X} : \exists\, y \in \mathcal{Y} \text{ s.t. } (x, y) \in D\}$ for each $D \subseteq \mathcal{X} \times \mathcal{Y}$, and the section is $D_x = \{y \in \mathcal{Y} : (x, y) \in D\}$ for each $x \in \mathrm{proj}(D)$.

Theorem D.5 (Corollary 1 in Brown and Purves [1973])
Let $\mathcal{X}, \mathcal{Y}$ be complete separable metric spaces and $f$ be a real-valued Borel measurable function defined on a Borel subset $D$ of $\mathcal{X} \times \mathcal{Y}$. Suppose that for each $x \in \mathrm{proj}(D)$, the section $D_x$ is $\sigma$-compact and $f(x, \cdot)$ is lower semi-continuous with respect to the relative topology on $D_x$. Then:
1. The sets $G = \mathrm{proj}(D)$ and $I = \{x \in G : \exists\, y \in D_x \text{ s.t. } y = \text{argmin}_{z\in\mathcal{Y}} f(x, z)\}$ are Borel.
2. For each $\epsilon > 0$, there exists a Borel measurable function $\varphi_\epsilon$ satisfying, for $x \in G$,
\[
f(x, \varphi_\epsilon(x)) \;\begin{cases} = \inf_{y\in D_x} f(x, y), & \text{if } x \in I, \\ \le \epsilon + \inf_{y\in D_x} f(x, y), & \text{if } x \notin I \text{ and } \inf_{y\in D_x} f(x, y) \neq -\infty, \\ \le -\epsilon^{-1}, & \text{if } x \notin I \text{ and } \inf_{y\in D_x} f(x, y) = -\infty. \end{cases}
\]

To show that the MEPRW estimator is measurable, we establish the lower semi-continuity of the expectation of the empirical PRW distance in the following lemma.
Lemma D.6
The expected empirical PRW distance is lower semi-continuous in the usual weak topology. That is, if the sequences $\{\mu_i\}_{i\in\mathbb{N}}, \{\nu_i\}_{i\in\mathbb{N}} \subseteq \mathcal{P}(\mathbb{R}^d)$ satisfy $\mu_i \Rightarrow \mu \in \mathcal{P}(\mathbb{R}^d)$ and $\nu_i \Rightarrow \nu \in \mathcal{P}(\mathbb{R}^d)$, then
\[
\mathbb{E}[PW_{p,k}(\mu, \hat\nu_m)] \le \liminf_{i\to+\infty} \mathbb{E}[PW_{p,k}(\mu_i, \hat\nu_{i,m})],
\]
where $\hat\nu_m = (1/m)\sum_{j=1}^m \delta_{Z_j}$ for i.i.d. samples $Z_1, \ldots, Z_m$ from $\nu$, and $\{\hat\nu_{i,m}\}_{i\in\mathbb{N}}$ are defined similarly.

D.2 Proof of Theorem 3.9

We first prove that $\text{argmin}_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) \neq \emptyset$. Indeed, by Assumption 3.2 and Theorem 3.3, the mapping $\theta \mapsto PW_{p,k}(\mu_\star, \mu_\theta)$ is lower semi-continuous. By Assumption 3.3, the set $\Theta_\star(\tau)$ is bounded for some $\tau >$
0. By the definition of the infimum, there exists $\theta' \in \Theta$ such that $PW_{p,k}(\mu_\star, \mu_{\theta'}) \le \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) + \tau/$
2. This implies that $\theta' \in \Theta_\star(\tau)$, so $\Theta_\star(\tau)$ is nonempty. By the lower semi-continuity of the mapping $\theta \mapsto PW_{p,k}(\mu_\star, \mu_\theta)$, the set $\Theta_\star(\tau)$ is closed. Putting these pieces together yields that $\Theta_\star(\tau)$ is compact. Therefore, we conclude the desired result from Theorem D.1.

Then we show that there exists a set $E \subseteq \Omega$ with $\mathbb{P}(E) = 1$ such that, for all $\omega \in E$, the sequence of mappings $\theta \mapsto PW_{p,k}(\hat\mu_n(\omega), \mu_\theta)$ epiconverges to the mapping $\theta \mapsto PW_{p,k}(\mu_\star, \mu_\theta)$ as $n \to +\infty$. Indeed, we only need to prove that the conditions in Proposition D.2 hold true.

Fix a compact set $K \subseteq \Theta$. By the lower semi-continuity of the mapping $\theta \mapsto PW_{p,k}(\hat\mu_n(\omega), \mu_\theta)$ (cf. Assumption 3.2 and Theorem 3.3), Theorem D.1 implies that
\[
\inf_{\theta\in K} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) = PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n})
\]
for some sequence $\theta_n = \theta_n(\omega) \in K$. Thus, we have
\[
\liminf_{n\to+\infty} \inf_{\theta\in K} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) = \liminf_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n}).
\]
By the definition of $\liminf$, there exists a subsequence of $\{\theta_n\}_{n\in\mathbb{N}}$ such that $PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n})$ converges to $\liminf_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n})$ along this subsequence. By the compactness of $K$, this subsequence must have a convergent subsubsequence. We denote this subsubsequence as $\{\theta_{n_j}\}_{j\in\mathbb{N}}$ and its limit as $\bar\theta \in K$. Then
\[
\liminf_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n}) = \lim_{j\to+\infty} PW_{p,k}(\hat\mu_{n_j}(\omega), \mu_{\theta_{n_j}}).
\]
Since $\omega \in E$ where $\mathbb{P}(E) = 1$, Assumptions 3.1 and 3.2 imply $\hat\mu_{n_j}(\omega) \Rightarrow \mu_\star$ and $\mu_{\theta_{n_j}} \Rightarrow \mu_{\bar\theta}$. These pieces together with the lower semi-continuity of the PRW distance (cf. Theorem 3.3) yield that $\lim_{j\to+\infty} PW_{p,k}(\hat\mu_{n_j}(\omega), \mu_{\theta_{n_j}}) \ge PW_{p,k}(\mu_\star, \mu_{\bar\theta})$.
Putting these pieces together yields that
\[
\liminf_{n\to+\infty} \inf_{\theta\in K} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \ge \inf_{\theta\in K} PW_{p,k}(\mu_\star, \mu_\theta).
\]
Fix an arbitrary open set $O \subseteq \Theta$. By the definition of the infimum, there exists a sequence $\theta_n' = \theta_n'(\omega) \in O$ such that $PW_{p,k}(\mu_\star, \mu_{\theta_n'}) \to \inf_{\theta\in O} PW_{p,k}(\mu_\star, \mu_\theta)$. In addition, $\inf_{\theta\in O} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n'})$. Thus, we have
\[
\limsup_{n\to+\infty} \inf_{\theta\in O} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le \limsup_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n'}) \le \limsup_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_\star) + \limsup_{n\to+\infty} PW_{p,k}(\mu_\star, \mu_{\theta_n'}).
\]
Since $\omega \in E$ where $\mathbb{P}(E) = 1$, Assumption 3.1 implies $\limsup_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_\star) = 0$. By the definition of $\theta_n'$, $\limsup_{n\to+\infty} PW_{p,k}(\mu_\star, \mu_{\theta_n'}) = \inf_{\theta\in O} PW_{p,k}(\mu_\star, \mu_\theta)$. Putting these pieces together yields that $\limsup_{n\to+\infty} \inf_{\theta\in O} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le \inf_{\theta\in O} PW_{p,k}(\mu_\star, \mu_\theta)$.

Proposition D.2 guarantees that there exists a set $E \subseteq \Omega$ with $\mathbb{P}(E) = 1$ such that, for all $\omega \in E$, the sequence of mappings $\theta \mapsto PW_{p,k}(\hat\mu_n(\omega), \mu_\theta)$ epiconverges to the mapping $\theta \mapsto PW_{p,k}(\mu_\star, \mu_\theta)$ as $n \to +\infty$. Then the second statement of Theorem D.3 implies that
\[
\limsup_{n\to+\infty} \text{argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \subseteq \text{argmin}_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta). \tag{D.1}
\]
The next step is to show that, for every $\delta >$
0, there exist a compact set $B \subseteq \Theta$ and $N \in \mathbb{N}$ such that $\inf_{\theta\in B} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) + \delta$. In what follows, we prove a stronger statement: the above inequality holds true with $\delta = 0$. Indeed, by the same reasoning as for the open-set case in the proof of epiconvergence, we have
\[
\limsup_{n\to+\infty} \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta).
\]
By Assumption 3.3 and the previous argument, $\Theta_\star(\tau)$ is nonempty and compact for some $\tau >$
0. The above inequality implies that there exists $n_1(\omega) > 0$ such that, for all $n \ge n_1(\omega)$, the set $\{\theta \in \Theta : PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le \inf_{\theta'\in\Theta} PW_{p,k}(\mu_\star, \mu_{\theta'}) + \tau/2\}$ is nonempty. For any $\theta$ in this set and $n \ge n_1(\omega)$, we have
\[
PW_{p,k}(\mu_\star, \mu_\theta) \le PW_{p,k}(\mu_\star, \hat\mu_n(\omega)) + \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) + \tau/2.
\]
By Assumption 3.1, there exists $n_2(\omega) > 0$ such that, for all $n \ge n_2(\omega)$,
\[
PW_{p,k}(\mu_\star, \hat\mu_n(\omega)) \le W_p(\mu_\star, \hat\mu_n(\omega)) \le \tau/2.
\]
Putting these pieces together yields that, for all $n \ge \max\{n_1(\omega), n_2(\omega)\}$,
\[
PW_{p,k}(\mu_\star, \mu_\theta) \le \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) + \tau.
\]
This implies that, for all $n \ge \max\{n_1(\omega), n_2(\omega)\}$,
\[
\Big\{\theta \in \Theta : PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le \inf_{\theta'\in\Theta} PW_{p,k}(\mu_\star, \mu_{\theta'}) + \tau/2\Big\} \subseteq \Theta_\star(\tau).
\]
Therefore, we have $\inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) = \inf_{\theta\in\Theta_\star(\tau)} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta)$. This together with the compactness of $\Theta_\star(\tau)$ yields the desired result.

The first statement of Theorem D.3 implies that
\[
\inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \to \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta), \quad \text{as } n \to +\infty. \tag{D.2}
\]
By Assumption 3.2 and Theorem 3.3, the mapping $\theta \mapsto PW_{p,k}(\hat\mu_n(\omega), \mu_\theta)$ is lower semi-continuous. Theorem D.1 implies that $\text{argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta)$ is nonempty for all $n \ge \max\{n_1(\omega), n_2(\omega)\}$. Together with Eq. (D.1) and Eq. (D.2), this yields the desired results.

Finally, we remark that these results hold true for $\delta_n\text{-argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta)$ with $\delta_n \to$
0. For Eq. (D.1) and Eq. (D.2), the analogous results can be derived by using the second and third statements of Theorem D.3. To show that $\delta_n\text{-argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta)$ is nonempty, we notice that it contains the nonempty set $\text{argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta)$.

D.3 Proof of Theorem 3.10
Following the same approach as in the proof of Theorem 3.9, it is straightforward to derive that $\text{argmin}_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) \neq \emptyset$. Then we show that there exists a set $E \subseteq \Omega$ with $\mathbb{P}(E) = 1$ such that, for all $\omega \in E$, the sequence of mappings $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ epiconverges to $\theta \mapsto PW_{p,k}(\mu_\star, \mu_\theta)$ as $n \to +\infty$. Indeed, it suffices to verify the conditions in Proposition D.2.

Fix an arbitrary compact set $K \subseteq \Theta$. By Assumption 3.2 and Lemma D.6, the mapping $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ is lower semi-continuous. Then Theorem D.1 implies that
\[
\inf_{\theta\in K} \mathbb{E}\big[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n\big] = \mathbb{E}\big[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta_n,m(n)}) \mid X^n\big]
\]
for some sequence $\theta_n = \theta_n(\omega) \in K$. Thus, we have
\[
\liminf_{n\to+\infty} \inf_{\theta\in K} \mathbb{E}\big[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n\big] = \liminf_{n\to+\infty} \mathbb{E}\big[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta_n,m(n)}) \mid X^n\big].
\]
Following the same approach as in the proof of Theorem 3.9, there exists a subsequence of $\{\theta_n\}_{n\in\mathbb{N}}$, denoted by $\{\theta_{n_j}\}_{j\in\mathbb{N}}$ with limit $\bar\theta \in K$, such that
\[
\liminf_{n\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta_n,m(n)}) \mid X^n] = \lim_{j\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_{n_j}(\omega), \hat\mu_{\theta_{n_j},m(n_j)}) \mid X^{n_j}] \ge \liminf_{j\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_{n_j}(\omega), \mu_{\theta_{n_j}})] - \limsup_{j\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_{n_j}}, \hat\mu_{\theta_{n_j},m(n_j)}) \mid X^{n_j}].
\]
Since $\omega \in E$ where $\mathbb{P}(E) = 1$, Assumptions 3.1 and 3.2 imply $\hat\mu_{n_j}(\omega) \Rightarrow \mu_\star$ and $\mu_{\theta_{n_j}} \Rightarrow \mu_{\bar\theta}$. These pieces together with the lower semi-continuity of the PRW distance (cf. Theorem 3.3) yield that $\liminf_{j\to+\infty} PW_{p,k}(\hat\mu_{n_j}(\omega), \mu_{\theta_{n_j}}) \ge PW_{p,k}(\mu_\star, \mu_{\bar\theta})$.
By Assumption 3.4 and using $\theta_{n_j} \to \bar\theta$, we have $\limsup_{j\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_{n_j}}, \hat\mu_{\theta_{n_j},m(n_j)}) \mid X^{n_j}] =$
0. Putting these pieces together yields that
\[
\liminf_{n\to+\infty} \inf_{\theta\in K} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \ge \inf_{\theta\in K} PW_{p,k}(\mu_\star, \mu_\theta).
\]
Fix an arbitrary open set $O \subseteq \Theta$. By the definition of the infimum, there exists a sequence $\theta_n' = \theta_n'(\omega) \in O$ such that $PW_{p,k}(\mu_\star, \mu_{\theta_n'}) \to \inf_{\theta\in O} PW_{p,k}(\mu_\star, \mu_\theta)$. In addition, we have
\[
\inf_{\theta\in O} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta_n',m(n)}) \mid X^n].
\]
Thus, we have
\[
\limsup_{n\to+\infty} \inf_{\theta\in O} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \limsup_{n\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta_n',m(n)}) \mid X^n] \le \limsup_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_\star) + \limsup_{n\to+\infty} PW_{p,k}(\mu_\star, \mu_{\theta_n'}) + \limsup_{n\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_n'}, \hat\mu_{\theta_n',m(n)}) \mid X^n].
\]
Since $\omega \in E$ where $\mathbb{P}(E) = 1$, Assumption 3.1 implies $\limsup_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_\star) = 0$. By the definition of $\theta_n'$, we have $\limsup_{n\to+\infty} PW_{p,k}(\mu_\star, \mu_{\theta_n'}) = \inf_{\theta\in O} PW_{p,k}(\mu_\star, \mu_\theta)$. Using Assumption 3.4, we have $\limsup_{n\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_n'}, \hat\mu_{\theta_n',m(n)}) \mid X^n] = 0$. Putting these pieces together yields that $\limsup_{n\to+\infty} \inf_{\theta\in O} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \inf_{\theta\in O} PW_{p,k}(\mu_\star, \mu_\theta)$.

Proposition D.2 guarantees that there exists a set $E \subseteq \Omega$ with $\mathbb{P}(E) = 1$ such that, for all $\omega \in E$, the sequence of mappings $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ epiconverges to the mapping $\theta \mapsto PW_{p,k}(\mu_\star, \mu_\theta)$ as $n \to +\infty$.
Then the second statement of Theorem D.3 implies that
\[
\limsup_{n\to+\infty} \text{argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \subseteq \text{argmin}_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta). \tag{D.3}
\]
The next step is to show that, for every $\delta >$
0, there exist a compact set $B \subseteq \Theta$ and $N \in \mathbb{N}$ such that $\inf_{\theta\in B} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] + \delta$. In what follows, we prove a stronger statement: the above inequality holds true with $\delta = 0$. Indeed, by the same reasoning as for the open-set case in the proof of epiconvergence, we have
\[
\limsup_{n\to+\infty} \inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta).
\]
By Assumption 3.3 and the previous argument, $\Theta_\star(\tau)$ is nonempty and compact for some $\tau >$
0. The above inequality implies that there exists $n_1(\omega) > 0$ such that, for all $n \ge n_1(\omega)$, the set $\{\theta \in \Theta : \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \inf_{\theta'\in\Theta} PW_{p,k}(\mu_\star, \mu_{\theta'}) + \tau/2\}$ is nonempty. For any $\theta$ in this set and $n \ge n_1(\omega)$, we have
\[
PW_{p,k}(\mu_\star, \mu_\theta) \le PW_{p,k}(\mu_\star, \hat\mu_n(\omega)) + \mathbb{E}[PW_{p,k}(\mu_\theta, \hat\mu_{\theta,m(n)}) \mid X^n] + \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) + \tau/2.
\]
By Assumption 3.1, there exists $n_2(\omega) > 0$ such that, for all $n \ge n_2(\omega)$,
\[
PW_{p,k}(\mu_\star, \hat\mu_n(\omega)) \le W_p(\mu_\star, \hat\mu_n(\omega)) \le \tau/4.
\]
By Assumption 3.4, there exists $n_3(\omega) > 0$ such that, for all $n \ge n_3(\omega)$,
\[
\mathbb{E}[PW_{p,k}(\hat\mu_{\theta,m(n)}, \mu_\theta) \mid X^n] \le \mathbb{E}[W_p(\hat\mu_{\theta,m(n)}, \mu_\theta) \mid X^n] \le \tau/4.
\]
Putting these pieces together yields that, for all $n \ge \max\{n_1(\omega), n_2(\omega), n_3(\omega)\}$,
\[
PW_{p,k}(\mu_\star, \mu_\theta) \le \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) + \tau.
\]
This implies that, for all $n \ge \max\{n_1(\omega), n_2(\omega), n_3(\omega)\}$,
\[
\Big\{\theta \in \Theta : \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \inf_{\theta'\in\Theta} PW_{p,k}(\mu_\star, \mu_{\theta'}) + \tau/2\Big\} \subseteq \Theta_\star(\tau).
\]
Therefore, we have $\inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] = \inf_{\theta\in\Theta_\star(\tau)} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$. This together with the compactness of $\Theta_\star(\tau)$ yields the desired result.

The first statement of Theorem D.3 implies that
\[
\inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \to \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta), \quad \text{as } n \to +\infty. \tag{D.4}
\]
By Assumption 3.2 and Lemma D.6, the mapping $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ is lower semi-continuous. Theorem D.1 implies that $\text{argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ is nonempty for all $n \ge \max\{n_1(\omega), n_2(\omega), n_3(\omega)\}$. Together with Eq.
(D.3) and Eq. (D.4), this yields the desired results. Finally, we remark that these results hold true for $\delta_n\text{-argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ with $\delta_n \to$
0. For Eq. (D.3) and Eq. (D.4), the analogous results can be derived by using the second and third statements of Theorem D.3. To show that $\delta_n\text{-argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ is nonempty, we notice that it contains the nonempty set $\text{argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$.

D.4 Proof of Theorem 3.11
We first prove that $\text{argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta) \neq \emptyset$. Indeed, by Assumption 3.2 and Theorem 3.3, the mapping $\theta \mapsto PW_{p,k}(\hat\mu_n, \mu_\theta)$ is lower semi-continuous. By Assumption 3.5, the set $\Theta_n(\tau)$ is bounded for some $\tau >$
0. By the definition of the infimum, there exists $\theta_n' \in \Theta$ such that $PW_{p,k}(\hat\mu_n, \mu_{\theta_n'}) \le \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta) + \tau/$
2. This implies that $\theta_n' \in \Theta_n(\tau)$, so $\Theta_n(\tau)$ is nonempty. By the lower semi-continuity of the mapping $\theta \mapsto PW_{p,k}(\hat\mu_n, \mu_\theta)$, the set $\Theta_n(\tau)$ is closed. Putting these pieces together yields that $\Theta_n(\tau)$ is compact. Therefore, we conclude the desired result from Theorem D.1.

Then we show that the sequence of mappings $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ epiconverges to $\theta \mapsto PW_{p,k}(\hat\mu_n, \mu_\theta)$ as $m \to +\infty$. Indeed, it suffices to verify the conditions in Proposition D.2.

Fix an arbitrary compact set $K \subseteq \Theta$. By Assumption 3.2 and Lemma D.6, the mapping $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ is lower semi-continuous. Then Theorem D.1 implies that
\[
\inf_{\theta\in K} \mathbb{E}\big[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n\big] = \mathbb{E}\big[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta_m,m}) \mid X^n\big]
\]
for some sequence $\theta_m \in K$. Thus, we have
\[
\liminf_{m\to+\infty} \inf_{\theta\in K} \mathbb{E}\big[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n\big] = \liminf_{m\to+\infty} \mathbb{E}\big[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta_m,m}) \mid X^n\big].
\]
Following the same approach as in the proof of Theorem 3.9, there exists a subsequence of $\{\theta_m\}_{m\in\mathbb{N}}$, denoted by $\{\theta_{m_j}\}_{j\in\mathbb{N}}$ with limit $\bar\theta \in K$, such that
\[
\liminf_{m\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta_m,m}) \mid X^n] = \lim_{j\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta_{m_j},m_j}) \mid X^n] \ge \liminf_{j\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_n, \mu_{\theta_{m_j}})] - \limsup_{j\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_{m_j}}, \hat\mu_{\theta_{m_j},m_j}) \mid X^n].
\]
Assumptions 3.1 and 3.2 imply $\mu_{\theta_{m_j}} \Rightarrow \mu_{\bar\theta}$. Together with the lower semi-continuity of the PRW distance, this yields that $\liminf_{j\to+\infty} PW_{p,k}(\hat\mu_n, \mu_{\theta_{m_j}}) \ge PW_{p,k}(\hat\mu_n, \mu_{\bar\theta})$. By Assumption 3.4 and using $\theta_{m_j} \to \bar\theta$, we have $\limsup_{j\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_{m_j}}, \hat\mu_{\theta_{m_j},m_j}) \mid X^n] = 0$.
Thus, we conclude that
\[
\liminf_{m\to+\infty} \inf_{\theta\in K} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \ge \inf_{\theta\in K} PW_{p,k}(\hat\mu_n, \mu_\theta).
\]
Fix an arbitrary open set $O \subseteq \Theta$. By the definition of the infimum, there exists a sequence $\theta_m' \in O$ such that $PW_{p,k}(\hat\mu_n, \mu_{\theta_m'}) \to \inf_{\theta\in O} PW_{p,k}(\hat\mu_n, \mu_\theta)$. In addition, we have
\[
\inf_{\theta\in O} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta_m',m}) \mid X^n].
\]
Thus, we have
\[
\limsup_{m\to+\infty} \inf_{\theta\in O} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \limsup_{m\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta_m',m}) \mid X^n] \le \limsup_{m\to+\infty} PW_{p,k}(\hat\mu_n, \mu_{\theta_m'}) + \limsup_{m\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_m'}, \hat\mu_{\theta_m',m}) \mid X^n].
\]
By the definition of $\theta_m'$, we have $\limsup_{m\to+\infty} PW_{p,k}(\hat\mu_n, \mu_{\theta_m'}) = \inf_{\theta\in O} PW_{p,k}(\hat\mu_n, \mu_\theta)$. Using Assumption 3.4, we have $\limsup_{m\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_m'}, \hat\mu_{\theta_m',m}) \mid X^n] = 0$. Putting these pieces together yields that $\limsup_{m\to+\infty} \inf_{\theta\in O} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \inf_{\theta\in O} PW_{p,k}(\hat\mu_n, \mu_\theta)$.

Proposition D.2 guarantees that the sequence of mappings $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ epiconverges to the mapping $\theta \mapsto PW_{p,k}(\hat\mu_n, \mu_\theta)$ as $m \to +\infty$. Then the second statement of Theorem D.3 implies that
\[
\limsup_{m\to+\infty} \text{argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \subseteq \text{argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta). \tag{D.5}
\]
The next step is to show that, for every $\delta >$
0, there exist a compact set $B \subseteq \Theta$ and $N \in \mathbb{N}$ such that $\inf_{\theta\in B} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] + \delta$. In what follows, we prove a stronger statement: the above inequality holds true with $\delta = 0$. Indeed, by the same reasoning as for the open-set case in the proof of epiconvergence, we have
\[
\limsup_{m\to+\infty} \inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta).
\]
By Assumption 3.5 and the previous argument, $\Theta_n(\tau)$ is nonempty and compact for some $\tau > 0$. The above inequality implies that there exists $m_1 > 0$ such that, for all $m \ge m_1$, the set $\{\theta \in \Theta : \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta) + \tau/2\}$ is nonempty. For any $\theta$ in this set and $m \ge m_1$, we have
\[
PW_{p,k}(\hat\mu_n, \mu_\theta) \le \mathbb{E}[PW_{p,k}(\hat\mu_{\theta,m}, \mu_\theta) \mid X^n] + \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta) + \tau/2.
\]
By Assumption 3.4, there exists $m_2 > 0$ such that, for all $m \ge m_2$,
\[
\mathbb{E}[PW_{p,k}(\hat\mu_{\theta,m}, \mu_\theta) \mid X^n] \le \mathbb{E}[W_p(\hat\mu_{\theta,m}, \mu_\theta) \mid X^n] \le \tau/2.
\]
Putting these pieces together yields that $PW_{p,k}(\hat\mu_n, \mu_\theta) \le \inf_{\theta'\in\Theta} PW_{p,k}(\hat\mu_n, \mu_{\theta'}) + \tau$ for all $m \ge \max\{m_1, m_2\}$. This implies that, for all $m \ge \max\{m_1, m_2\}$,
\[
\Big\{\theta \in \Theta : \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \inf_{\theta'\in\Theta} PW_{p,k}(\hat\mu_n, \mu_{\theta'}) + \tau/2\Big\} \subseteq \Theta_n(\tau).
\]
Therefore, we have $\inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] = \inf_{\theta\in\Theta_n(\tau)} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$. This together with the compactness of $\Theta_n(\tau)$ yields the desired result.

The first statement of Theorem D.3 implies that
\[
\inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \to \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta), \quad \text{as } m \to +\infty. \tag{D.6}
\]
By Assumption 3.2 and Lemma D.6, the mapping $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ is lower semi-continuous. Theorem D.1 implies that $\text{argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ is nonempty for all $m \ge \max\{m_1, m_2\}$. Together with Eq. (D.5) and Eq. (D.6), this yields the desired results.

Finally, we remark that these results hold true for $\delta_n\text{-argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ with $\delta_n \to$
0. For Eq. (D.5) and Eq. (D.6), the analogous results can be derived by using the second and third statements of Theorem D.3. To show that $\delta_n\text{-argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ is nonempty, we notice that it contains the nonempty set $\text{argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$.

D.5 Proof of Lemma D.6
Since $\nu_i \Rightarrow \nu \in \mathcal{P}(\mathbb{R}^d)$ and $\mathbb{R}^d$ is separable, the Skorokhod representation theorem (cf. Theorem D.4) implies that there exist $m$ sequences of random variables $\{\{Z_i^k\}_{i\in\mathbb{N}}, k \in [m]\}$ and $m$ random variables $\{Z^k, k \in [m]\}$ such that the distribution of $Z_i^k$ is $\nu_i$, the distribution of $Z^k$ is $\nu$, and $\{Z_i^k\}_{i\in\mathbb{N}}$ converges to $Z^k$ almost surely for all $k \in [m]$.

Suppose that $\hat\nu_{i,m} = (1/m)\sum_{k=1}^m \delta_{Z_i^k}$ and $\hat\nu_m = (1/m)\sum_{k=1}^m \delta_{Z^k}$. We proceed to the key part of the proof and show that $\{\hat\nu_{i,m}\}_{i\in\mathbb{N}}$ weakly converges to $\hat\nu_m$. Indeed, it suffices to consider the deterministic case where $\hat\nu_{i,m} = (1/m)\sum_{k=1}^m \delta_{z_i^k}$ and $\hat\nu_m = (1/m)\sum_{k=1}^m \delta_{z^k}$, where $\{\{z_i^k\}_{i\in\mathbb{N}}, k \in [m]\}$ and $\{z^k, k \in [m]\}$ are all deterministic and satisfy $\lim_{i\to+\infty}(\max_{k\in[m]} \|z_i^k - z^k\|) = 0$. Since the Wasserstein distance metrizes weak convergence (cf. Theorem B.4), we only need to show that $\lim_{i\to+\infty} W_1(\hat\nu_{i,m}, \hat\nu_m) = 0$. By the definition of the Wasserstein distance, $\{\hat\nu_{i,m}\}_{i\in\mathbb{N}}$ and $\hat\nu_m$, we have
\[
W_1(\hat\nu_{i,m}, \hat\nu_m) \le \max_{k\in[m]} \|z_i^k - z^k\|.
\]
Putting these pieces together yields that $\{\hat\nu_{i,m}\}_{i\in\mathbb{N}}$ weakly converges to $\hat\nu_m$ almost surely.

Finally, we conclude from the lower semi-continuity of the PRW distance (cf. Theorem 3.3) and Fatou's lemma that
\[
\mathbb{E}[PW_{p,k}(\mu, \hat\nu_m)] \le \mathbb{E}\Big[\liminf_{i\to+\infty} PW_{p,k}(\mu_i, \hat\nu_{i,m})\Big] \le \liminf_{i\to+\infty} \mathbb{E}[PW_{p,k}(\mu_i, \hat\nu_{i,m})].
\]
This completes the proof.
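The key inequality in this proof, that the Wasserstein distance between two equally weighted empirical measures is bounded by the maximal atom displacement, can be verified numerically in one dimension (a sketch using SciPy's one-dimensional $W_1$; the atom count and perturbation scale are arbitrary):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
m = 20
z = rng.standard_normal(m)                 # atoms of nu_hat_m
z_i = z + 0.05 * rng.standard_normal(m)    # perturbed atoms of nu_hat_{i,m}

lhs = wasserstein_distance(z_i, z)         # W_1 between the two empirical measures
rhs = np.max(np.abs(z_i - z))              # maximal atom displacement
# The coupling that matches z_i[k] to z[k] certifies W_1 <= max_k |z_i[k] - z[k]|.
assert lhs <= rhs + 1e-12
```

As the atoms $z_i^k$ converge to $z^k$, the right-hand side vanishes, which is exactly how the proof deduces weak convergence of the empirical measures.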
D.6 Proof of Theorem A.1
Using Assumption 3.2 and Theorem 3.3, the mapping $(\mu, \theta) \mapsto PW_{p,k}(\mu, \mu_\theta)$ is lower semi-continuous on $\mathcal{P}(\mathbb{R}^d) \times \Theta$. It remains to verify that the conditions of Theorem D.5 are satisfied. We notice that the empirical measure $\hat{\mu}_n(\omega)$ depends on $\omega \in \Omega$ only through $X_n \in \otimes_{i=1}^n \mathbb{R}^d$. Thus, we can write $\hat{\mu}_n(\omega) = \hat{\mu}_n(x)$ as a function on $\otimes_{i=1}^n \mathbb{R}^d$. Let $D = (\otimes_{i=1}^n \mathbb{R}^d) \times \Theta$; it is a Borel subset of $(\otimes_{i=1}^n \mathbb{R}^d) \times \mathbb{R}$. Since $\mathbb{R}^d$ is a Polish space, $\mathbb{R}^d \times \ldots \times \mathbb{R}^d$ endowed with the product topology is a Polish space. Moreover, $D_x$ is $\sigma$-compact for any $x \in \mathrm{proj}(D)$ since $D_x \subseteq \Theta$ and $\Theta$ is $\sigma$-compact.

Define $f(x, \theta) = PW_{p,k}(\hat{\mu}_n(x), \mu_\theta)$. We claim that $f$ is measurable on $D$ and $f(x, \cdot)$ is lower semi-continuous on $D_x$. Indeed, we have shown that the mapping $(\mu, \theta) \mapsto PW_{p,k}(\mu, \mu_\theta)$ is lower semi-continuous, and thus measurable, on $\mathcal{P}(\mathbb{R}^d) \times \Theta$. The mapping $x \mapsto \hat{\mu}_n(x)$ is measurable on $\otimes_{i=1}^n \mathbb{R}^d$. Since the composition of measurable functions is measurable, $f$ is measurable on $D$. Moreover, for any $x \in \otimes_{i=1}^n \mathbb{R}^d$, $f(x, \cdot)$ is lower semi-continuous on $D_x$ since the mapping $(\mu, \theta) \mapsto PW_{p,k}(\mu, \mu_\theta)$ is lower semi-continuous on $D$. Putting these pieces together yields the desired results.

D.7 Proof of Theorem A.2
Using Assumption 3.2 and Lemma D.6, the mapping $(\nu, \theta) \mapsto \mathbb{E}[PW_{p,k}(\nu, \hat{\mu}_{\theta,m}) \mid X_n]$ is lower semi-continuous on $\mathcal{P}(\mathbb{R}^d) \times \Theta$. The proof then proceeds along the same lines as the proof of Theorem A.1, using this result together with Theorem D.5.
E Postponed Proofs in Subsection 3.4
In this section, we provide detailed proofs for Theorem 3.12 and Theorem A.3. Our derivation refines the analysis in Bernton et al. [2019] for minimum Wasserstein estimators.
E.1 Preliminary technical results
To facilitate reading, we collect several preliminary technical results which will be used in the postponed proofs of Subsection 3.4. Let $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$ be a normed linear space and let $\theta \mapsto f_\theta$ be a map from a subset $\Theta$ of $\mathbb{R}^d$ into $\mathcal{X}$. The statistical information comes from a sequence $\{f_n\}_{n \in \mathbb{N}}$ of random elements of $\mathcal{X}$, each of which is assumed to be measurable with respect to the $\sigma$-algebra generated by the balls in $\mathcal{X}$. In some sense, $f_n$ should converge to $f_{\theta^\star}$, where $\theta^\star$ is some fixed (but unknown) point in the interior of $\Theta$. To avoid abuse of notation, we write $K(x, \beta)$ here.

Theorem E.1 (Theorem 4.2 in Pollard [1980])
Suppose the following assumptions hold:
1. $\inf_{\theta \notin N} \|f_\theta - f_{\theta^\star}\|_{\mathcal{X}} > 0$ for every neighborhood $N$ of $\theta^\star$.
2. $\theta \mapsto f_\theta$ is norm differentiable with non-singular derivative $D_{\theta^\star}$ at $\theta^\star$.
3. There exists a random element $G^\star \in \mathcal{X}$ for which $G_n := \sqrt{n}(f_n - f_{\theta^\star}) \Rightarrow G^\star$ in the sense of the metric induced by the norm $\|\cdot\|_{\mathcal{X}}$.
Then the limiting distribution of the goodness-of-fit statistic is given by
$$\sqrt{n}\inf_{\theta \in \Theta} \|f_n - f_\theta\|_{\mathcal{X}} \Rightarrow \inf_{\theta \in \Theta} \|G^\star - \langle \theta, D_{\theta^\star}\rangle\|_{\mathcal{X}}.$$
Let $K(x, \beta) = \{\theta : \|x - \langle \theta, D_{\theta^\star}\rangle\|_{\mathcal{X}} \leq \inf_{\theta' \in \Theta} \|x - \langle \theta', D_{\theta^\star}\rangle\|_{\mathcal{X}} + \beta\}$ and let $M_n$ be defined by
$$M_n = \Big\{\theta \in \Theta : \|f_n - f_\theta\|_{\mathcal{X}} \leq \inf_{\theta' \in \Theta} \|f_n - f_{\theta'}\|_{\mathcal{X}} + \eta_n/\sqrt{n}\Big\},$$
where $\eta_n > 0$ satisfies $\mathbb{P}(\eta_n \to 0) = 1$ and $M_n$ is nonempty.

Theorem E.2 (Theorem 7.2 in Pollard [1980])
Under the conditions of Theorem E.1, there exists a sequence of real numbers $\beta_n \downarrow 0$ satisfying
$$\mathbb{P}^\star\big(M_n \subseteq \theta^\star + n^{-1/2} K(G_n, \beta_n)\big) \to 1, \quad \text{as } n \to +\infty.$$
Moreover, for any $\epsilon > 0$, we have $\mathbb{P}(d_H(K(G_n^\star, 0), K(G_n, \beta_n)) < \epsilon) \to 1$ as $n \to +\infty$.

E.2 Proof of Theorem 3.12
First, we show that $M_n \subseteq N$ with (inner) probability approaching 1 as $n \to +\infty$. Indeed, with inner probability approaching 1, we have
$$\mathrm{argmin}_{\theta \in \Theta}\, PW_{1,1}(\hat{\mu}_n, \mu_\theta) \subseteq \mathrm{argmin}_{\theta \in \Theta}\, PW_{1,1}(\mu^\star, \mu_\theta).$$
By the definition of $PW_{1,1}$, we conclude that any minimizer of $\|\hat{F}_n - F_\theta\|_{\mathcal{L}}$ will be included in the set of minimizers of $\|F^\star - F_\theta\|_{\mathcal{L}}$ with inner probability approaching 1. By Assumption 3.8, the minimizer of $\|F^\star - F_\theta\|_{\mathcal{L}}$ is unique and $N$ is a neighborhood of this minimizer. Putting these pieces together yields that the set $\mathrm{argmin}_{\theta \in \Theta}\, PW_{1,1}(\hat{\mu}_n, \mu_\theta)$ is contained in the set $N$ with (inner) probability approaching 1 as $n \to +\infty$. By the definition of $M_n$, we achieve the desired result.

Then we make three key claims. First, we claim that $M_n \subseteq \Theta_n$ with (inner) probability approaching 1 as $n \to +\infty$, where $\Theta_n$ is defined by
$$\Theta_n = \left\{\theta \in \Theta : \|\theta - \theta^\star\|_\Theta \leq \frac{2\sqrt{n}\,\|\hat{F}_n - F^\star\|_{\mathcal{L}} + 2\eta_n}{c^\star\sqrt{n}}\right\}.$$
Indeed, for any $\theta \in N$, we derive from the triangle inequality that
$$\|\hat{F}_n - F_\theta\|_{\mathcal{L}} - \|\hat{F}_n - F_{\theta^\star}\|_{\mathcal{L}} \geq \|F_\theta - F^\star\|_{\mathcal{L}} - \|F_{\theta^\star} - F^\star\|_{\mathcal{L}} - 2\|\hat{F}_n - F^\star\|_{\mathcal{L}}.$$
Using the definition of $PW_{1,1}$ together with Assumption 3.8, we have
$$\|\hat{F}_n - F_\theta\|_{\mathcal{L}} - \|\hat{F}_n - F_{\theta^\star}\|_{\mathcal{L}} \geq c^\star\|\theta - \theta^\star\|_\Theta - 2\|\hat{F}_n - F^\star\|_{\mathcal{L}}. \tag{E.1}$$
Since $M_n \subseteq N$ with (inner) probability approaching one, Eq. (E.1) holds true for any $\theta \in M_n$ with (inner) probability approaching one.
Moreover, by the definition of $M_n$, any $\theta \in M_n$ satisfies
$$\|\hat{F}_n - F_\theta\|_{\mathcal{L}} \leq \inf_{\theta' \in \Theta} PW_{1,1}(\hat{\mu}_n, \mu_{\theta'}) + \frac{\eta_n}{\sqrt{n}} \leq \|\hat{F}_n - F_{\theta^\star}\|_{\mathcal{L}} + \frac{\eta_n}{\sqrt{n}}. \tag{E.2}$$
Combining Eq. (E.1), Eq. (E.2) and the definition of $\Theta_n$, we conclude that $\theta \in \Theta_n$ if $\theta \in M_n$ with (inner) probability approaching 1. This completes the proof of the first claim.

Second, we claim that $\mathrm{argmin}_{\theta' \in N}\, \|G_n - \langle\sqrt{n}(\theta' - \theta^\star), D_{\theta^\star}\rangle\|_{\mathcal{L}} \subseteq N \cap \Theta_n$ with (inner) probability approaching 1 as $n \to +\infty$. Indeed, by the definition of $G_n$, we have
$$\|G_n - \langle\sqrt{n}(\theta - \theta^\star), D_{\theta^\star}\rangle\|_{\mathcal{L}} = \sqrt{n}\,\|\hat{F}_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}.$$
For simplicity of notation, we let $R_\theta = F_\theta - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle$. By Assumption 3.6, we have $\|R_\theta\|_{\mathcal{L}} = o(\|\theta - \theta^\star\|_\Theta)$. By the definition of $N$, we have $\|R_\theta\|_{\mathcal{L}} \leq (1/2)c^\star\|\theta - \theta^\star\|_\Theta$. Therefore, for any $\theta \in N$, we have
$$\|\hat{F}_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}} \geq \|\hat{F}_n - F_\theta\|_{\mathcal{L}} - \|R_\theta\|_{\mathcal{L}} \overset{\text{Eq. (E.1)}}{\geq} \|\hat{F}_n - F_{\theta^\star}\|_{\mathcal{L}} + (1/2)c^\star\|\theta - \theta^\star\|_\Theta - 2\|\hat{F}_n - F^\star\|_{\mathcal{L}}.$$
This implies that, for any $\theta \in N \setminus \Theta_n$, we have
$$\|\hat{F}_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}} \geq \|\hat{F}_n - F_{\theta^\star}\|_{\mathcal{L}} \geq \inf_{\theta' \in N \cap \Theta_n} \|\hat{F}_n - F_{\theta^\star} - \langle\theta' - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}.$$
This completes the proof of the second claim.

Third, we claim that there is a uniform control over the difference between $\theta \mapsto \sqrt{n}\|\hat{F}_n - F_\theta\|_{\mathcal{L}}$ and the convex map $\theta \mapsto \|G_n - \sqrt{n}\langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}$ over the set $\Theta_n$ with (inner) probability approaching 1 as $n \to +\infty$. Indeed, we define
$$\Gamma_n = \sup_{\theta \in \Theta_n} \big|\sqrt{n}\|\hat{F}_n - F_\theta\|_{\mathcal{L}} - \|G_n - \sqrt{n}\langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}\big|.$$
By the definition of $G_n$, we have
$$\Gamma_n = \sup_{\theta \in \Theta_n} \big|\sqrt{n}\|\hat{F}_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle - R_\theta\|_{\mathcal{L}} - \sqrt{n}\|\hat{F}_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}\big| = o\Big(\sup_{\theta \in \Theta_n} \sqrt{n}\|\theta - \theta^\star\|_\Theta\Big) = o\big(\sqrt{n}\|\hat{F}_n - F^\star\|_{\mathcal{L}}\big).$$
By Assumption 3.7, we have $\Gamma_n \to 0$ in probability as $n \to +\infty$. This completes the proof of the third claim.

By the definition of $G_n$ and $G_n^\star$, we have $\|G_n - G_n^\star\|_{\mathcal{L}} = \|\sqrt{n}(\hat{F}_n - F^\star) - G^\star\|_{\mathcal{L}}$. By Assumption 3.7, there exists a sequence $\tau_n \to 0$ such that $\mathbb{P}(\|G_n - G_n^\star\|_{\mathcal{L}} > \tau_n) \to 0$. By the definition of $\Gamma_n$ and $\eta_n$, there exist two sequences $\tau_n' \to 0$ and $\tau_n'' \to 0$ such that $\mathbb{P}(\Gamma_n > \tau_n') \to 0$ and $\mathbb{P}(\eta_n > \tau_n'') \to 0$. Letting $\beta_n = \max\{2\tau_n, 2\tau_n' + \tau_n''\}$, we have $\beta_n \to 0$ as $n \to +\infty$.

It remains to show that $M_n \subseteq K(G_n, \beta_n)$ with (inner) probability approaching 1 as $n \to +\infty$. Indeed, we have
$$\inf_{\theta' \in N} \|G_n - \langle\sqrt{n}(\theta' - \theta^\star), D_{\theta^\star}\rangle\|_{\mathcal{L}} \geq \inf_{\theta' \in N} \sqrt{n}\|\hat{F}_n - F_{\theta'}\|_{\mathcal{L}} - \tau_n'.$$
By the definition of $M_n$, for any $\theta \in M_n$ the above inequality implies
$$\inf_{\theta' \in N} \|G_n - \langle\sqrt{n}(\theta' - \theta^\star), D_{\theta^\star}\rangle\|_{\mathcal{L}} \geq \sqrt{n}\|\hat{F}_n - F_\theta\|_{\mathcal{L}} - \tau_n' - \tau_n''.$$
Since $M_n \subseteq \Theta_n$ with (inner) probability approaching 1 as $n \to +\infty$, we have
$$\sqrt{n}\|\hat{F}_n - F_\theta\|_{\mathcal{L}} \geq \|G_n - \langle\sqrt{n}(\theta - \theta^\star), D_{\theta^\star}\rangle\|_{\mathcal{L}} - \tau_n'.$$
Putting these pieces together with $\beta_n \geq 2\tau_n' + \tau_n''$ yields that $\theta \in K(G_n, \beta_n)$.

Finally, let $\epsilon > 0$; we prove that $\mathbb{P}(d_H(K(G_n^\star, 0), K(G_n, \beta_n)) < \epsilon) \to 1$ as $n \to +\infty$. Indeed, by the triangle inequality, $\theta \in K(G_n^\star, 0)$ implies $\theta \in K(G_n, 2\|G_n - G_n^\star\|_{\mathcal{L}})$. Therefore, we conclude that $K(G_n^\star, 0) \subseteq K(G_n, \beta_n)$ with (inner) probability approaching one as $n \to +\infty$. On the other hand, $\theta \in K(G_n, \beta_n)$ implies $\theta \in K(G_n^\star, \beta_n + 2\|G_n - G_n^\star\|_{\mathcal{L}})$. By the definition of $\beta_n$, $G_n$ and $G_n^\star$, we obtain that $\beta_n + 2\|G_n - G_n^\star\|_{\mathcal{L}} \to 0$ in probability as $n \to +\infty$. By the definition of the Hausdorff metric, we conclude the desired result.

E.3 Proof of Theorem A.3
Unlike the proof of Theorem 3.12, the proof of Theorem A.3 is relatively straightforward and is based on Theorems E.1 and E.2. This is mostly because there exists $\theta^\star$ in the interior of $\Theta$ such that $F^\star = F_{\theta^\star}$. More specifically, we consider $f_\theta = F_\theta$ and $f_n = \hat{F}_n$, where
$$F_\theta(u, t) = \int_{\mathbb{R}^d} \mathbb{1}_{(-\infty, t]}(\langle u, x\rangle)\, d\mu_\theta(x), \qquad \hat{F}_n(u, t) = (1/n)\,|\{i \in [n] : \langle u, X_i\rangle \leq t\}|.$$
Let $\mathcal{X} = L^1(\mathbb{S}^{d-1} \times \mathbb{R})$ and $\|\cdot\|_{\mathcal{X}} = \|\cdot\|_{\mathcal{L}}$; we can check that $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$ is a normed linear space. By the definition of $PW_{1,1}$, we have $PW_{1,1}(\hat{\mu}_n, \mu_\theta) = \|\hat{F}_n - F_\theta\|_{\mathcal{X}}$. By Assumption 3.1, $\hat{F}_n$ converges to $F^\star$. Moreover, in the well-specified setting, $F^\star = F_{\theta^\star}$, where $\theta^\star$ is some fixed (but unknown) point in the interior of $\Theta$. Now we are ready to check the conditions of Theorem E.1.

First, Assumption A.1 and $PW_{1,1}(\hat{\mu}_n, \mu_\theta) = \|\hat{F}_n - F_\theta\|_{\mathcal{X}}$ imply C1. Furthermore, by the definition of norm differentiability, Assumption 3.6 and Assumption A.2 imply C2. Finally, Assumption 3.7 and $F^\star = F_{\theta^\star}$ imply C3. Therefore, we conclude from Theorem E.1 that
$$\sqrt{n}\inf_{\theta \in \Theta} PW_{1,1}(\hat{\mu}_n, \mu_\theta) = \sqrt{n}\inf_{\theta \in \Theta} \|\hat{F}_n - F_\theta\|_{\mathcal{L}} \Rightarrow \inf_{t \in \Theta} \|G^\star - \langle t, D_{\theta^\star}\rangle\|_{\mathcal{L}},$$
in the sense of the metric induced by the norm $\|\cdot\|_{\mathcal{L}}$. This together with the definition of the norm $\|\cdot\|_{\mathcal{L}}$ implies the desired result for the goodness-of-fit statistics.

On the other hand, Theorem E.2 can be applied with a specific choice of $\eta_n$. More specifically, we notice that the estimator $\hat{\theta}_n$ is well defined by
$$\hat{\theta}_n := \mathrm{argmin}_{\theta \in \Theta}\, PW_{1,1}(\hat{\mu}_n, \mu_\theta) = \mathrm{argmin}_{\theta \in \Theta}\, \|\hat{F}_n - F_\theta\|_{\mathcal{L}}.$$
Letting $\eta_n = 0$, the set $M_n = \{\hat{\theta}_n\}$ is a singleton. This implies that $\sqrt{n}(\hat{\theta}_n - \theta^\star) \Rightarrow K(G^\star, 0)$ as $n \to +\infty$ under the Hausdorff metric topology. Since the random map $\theta \mapsto \max_{u \in \mathbb{S}^{d-1}} \int_{\mathbb{R}} |G^\star(u, t) - \langle\theta, D^\star(u, t)\rangle|\, dt$ has a unique infimum almost surely, $K(G^\star, 0)$ is a singleton defined by
$$K(G^\star, 0) = \mathrm{argmin}_{\theta \in \Theta}\, \max_{u \in \mathbb{S}^{d-1}} \int_{\mathbb{R}} |G^\star(u, t) - \langle\theta, D^\star(u, t)\rangle|\, dt.$$
In this case, the Hausdorff metric is simply induced by the norm $\|\cdot\|_{\mathcal{L}}$. Putting these pieces together yields the desired result for the MPRW estimator of order 1.

E.4 Minor Technical Issues

We use the notation of Bernton et al. [2019, Theorem B.8] throughout this subsection. On pages 38-39 of the recent arXiv version of Bernton et al. [2019], the authors prove that $m(H_n) = \inf_{u \in L_n} f(H_n, u)$, implicitly assuming that the minimizer of the map $\theta \mapsto \sqrt{n}\|F_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}$ is contained in the set $\{\theta \in N : \|\theta - \theta^\star\|_{\mathcal{H}} \leq c^\star/2\}$. However, this result is not obvious. Indeed, it seems difficult to derive it from the existing fact that the minimizer of $\theta \mapsto \sqrt{n}\|F_n - F_\theta\|_{\mathcal{L}}$ is contained in $N$. We only have uniform control over the difference between $\theta \mapsto \sqrt{n}\|F_n - F_\theta\|_{\mathcal{L}}$ and $\theta \mapsto \sqrt{n}\|F_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}$ over the set $S_n$ instead of the whole set, so there is little relationship between the minimizers of these two mappings. Moreover, the techniques from the proof of Pollard [1980, Theorem 7.2] cannot be applied to fix this issue here, since that proof depends on the assumption $\mu^\star = \mu_{\theta^\star}$, which does not hold under model misspecification.

F Computational Aspects
The computation of the PRW distance is in general intractable when the projection dimension is $k \geq 2$. In this section, we describe a practical scheme for approximately computing $PW_{2,k}(\hat{\mu}_n, \hat{\nu}_n)$ when the projection dimension is $k \geq 2$. Part of the results can be found in the appendix of the concurrent work [Lin et al., 2020]; we provide the details for the sake of completeness.
Approximation of $PW_{2,k}$. We consider the computation of $PW_{2,k}$ between empirical measures. Indeed, let $\{x_1, x_2, \ldots, x_n\} \subseteq \mathbb{R}^d$ and $\{y_1, y_2, \ldots, y_n\} \subseteq \mathbb{R}^d$ denote sets of $n$ atoms, and let $(r_1, r_2, \ldots, r_n) \in \Delta^n$ and $(c_1, c_2, \ldots, c_n) \in \Delta^n$ denote weight vectors. We define the discrete measures $\hat{\mu}_n := \sum_{i=1}^n r_i \delta_{x_i}$ and $\hat{\nu}_n := \sum_{j=1}^n c_j \delta_{y_j}$. The computation of $PW_{2,k}(\hat{\mu}_n, \hat{\nu}_n)$ is equivalent to solving a structured max-min optimization model, where the maximization and minimization are performed over the Stiefel manifold $\mathrm{St}(d, k) := \{U \in \mathbb{R}^{d \times k} \mid U^\top U = I_k\}$ and the transportation polytope $\Pi(\mu, \nu) := \{\pi \in \mathbb{R}^{n \times n}_+ \mid r(\pi) = r,\ c(\pi) = c\}$, respectively. Formally, we have
$$\max_{U \in \mathbb{R}^{d \times k}}\ \min_{\pi \in \mathbb{R}^{n \times n}_+}\ \sum_{i=1}^n \sum_{j=1}^n \pi_{i,j} \|U^\top x_i - U^\top y_j\|^2 \quad \text{s.t.} \quad U^\top U = I_k,\ r(\pi) = r,\ c(\pi) = c. \tag{F.1}$$
Eq. (F.1) is equivalent to the following non-convex nonsmooth optimization model:
$$\max_{U \in \mathrm{St}(d,k)}\ f(U) := \min_{\pi \in \Pi(\mu,\nu)} \sum_{i=1}^n \sum_{j=1}^n \pi_{i,j} \|U^\top x_i - U^\top y_j\|^2. \tag{F.2}$$
For fixed $U \in \mathrm{St}(d, k)$, Eq. (F.2) becomes a classical OT problem which can be solved either by the Sinkhorn iteration [Cuturi, 2013] or by the variant of the network simplex method in the POT package [Flamary and Courty, 2017]. The key challenge is the maximization over the Stiefel manifold $\mathrm{St}(d, k)$.

Eq. (F.2) is a special instance of a Stiefel manifold optimization problem. The dimension of $\mathrm{St}(d, k)$ is equal to $dk - k(k+1)/2$, and the tangent space at $Z \in \mathrm{St}(d, k)$ is defined below.
Algorithm 1: Riemannian SuperGradient Ascent with Network Simplex Iteration (RSGAN)

Input: measures $\{(x_i, r_i)\}_{i \in [n]}$ and $\{(y_j, c_j)\}_{j \in [n]}$, dimension $k$ and tolerance $\epsilon$.
Initialize: $U_0 \in \mathrm{St}(d, k)$ and $\gamma_0 > 0$.
for $t = 0, 1, 2, \ldots, T - 1$ do
  Compute $\pi_{t+1} \leftarrow \mathrm{OT}(\{(x_i, r_i)\}_{i \in [n]}, \{(y_j, c_j)\}_{j \in [n]}, U_t)$.
  Compute $\xi_{t+1} \leftarrow P_{T_{U_t}\mathrm{St}}(2 V_{\pi_{t+1}} U_t)$.
  Compute $\gamma_{t+1} \leftarrow \gamma_0/\sqrt{t+1}$.
  Compute $U_{t+1} \leftarrow \mathrm{Retr}_{U_t}(\gamma_{t+1} \xi_{t+1})$.
end for

The tangent space to $\mathrm{St}(d, k)$ at $Z$ is $T_Z\mathrm{St} := \{\xi \in \mathbb{R}^{d \times k} : \xi^\top Z + Z^\top \xi = 0\}$. We endow $\mathrm{St}(d, k)$ with the Riemannian metric inherited from the Euclidean inner product $\langle X, Y\rangle$ for any
$X, Y \in T_Z\mathrm{St}$ and $Z \in \mathrm{St}(d, k)$. The projection of any $G \in \mathbb{R}^{d \times k}$ onto $T_Z\mathrm{St}$ is given by Absil et al. [2009, Example 3.6.2]:
$$P_{T_Z\mathrm{St}}(G) = G - Z(G^\top Z + Z^\top G)/2.$$
We also make use of a retraction, which is a first-order approximation of the exponential mapping on the manifold and which is amenable to computation [Absil et al., 2009, Definition 4.1.1]. For the Stiefel manifold, we have the following definition:

Definition F.1
A retraction on $\mathrm{St} \equiv \mathrm{St}(d, k)$ is a smooth mapping $\mathrm{Retr} : T\mathrm{St} \to \mathrm{St}$ from the tangent bundle $T\mathrm{St}$ onto $\mathrm{St}$ such that the restriction of $\mathrm{Retr}$ to $T_Z\mathrm{St}$, denoted by $\mathrm{Retr}_Z$, satisfies: (i) $\mathrm{Retr}_Z(0) = Z$ for all $Z \in \mathrm{St}$, where $0$ denotes the zero element of $T\mathrm{St}$; and (ii) for any $Z \in \mathrm{St}$, it holds that
$$\lim_{\xi \in T_Z\mathrm{St},\ \xi \to 0} \frac{\|\mathrm{Retr}_Z(\xi) - (Z + \xi)\|_F}{\|\xi\|_F} = 0.$$
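One standard retraction satisfying Definition F.1 is based on the QR decomposition. The following is a minimal numpy sketch (not the authors' code) of the tangent-space projection above and of a QR-based retraction, applied to one supergradient-ascent step; the random matrix `G` is a stand-in for the actual supergradient $2V_\pi U$:

```python
import numpy as np

def proj_tangent(Z, G):
    """Project an ambient matrix G onto the tangent space T_Z St,
    using P(G) = G - Z (G^T Z + Z^T G) / 2."""
    sym = (G.T @ Z + Z.T @ G) / 2.0
    return G - Z @ sym

def retr_qr(Z, xi):
    """QR-based retraction: the Q factor of Z + xi, with column signs
    fixed so that R has a positive diagonal (unique factorization)."""
    Q, R = np.linalg.qr(Z + xi)
    return Q * np.sign(np.sign(np.diag(R)) + 0.5)

# one Riemannian (super)gradient ascent step at the current iterate U
rng = np.random.default_rng(0)
d, k = 5, 2
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
G = rng.standard_normal((d, k))       # stand-in for 2 * V_pi @ U
xi = proj_tangent(U, G)               # Riemannian supergradient
U_next = retr_qr(U, 0.1 * xi)         # step back onto St(d, k)
```

The projected direction satisfies $\xi^\top U + U^\top \xi = 0$, and the retracted iterate has exactly orthonormal columns, which is the property that makes this retraction convenient in practice.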
Our algorithm uses the retraction based on the QR decomposition, as suggested by Liu et al. [2019]. More specifically, $\mathrm{Retr}^{qr}_Z(\xi) = \mathrm{qr}(Z + \xi)$, where $\mathrm{qr}(A)$ is the Q factor of the QR factorization of $A$.

We start with a brief overview of the Riemannian supergradient ascent algorithm for nonsmooth Stiefel optimization, denoted by $\max_{U \in \mathrm{St}(d,k)} F(U)$. A generic Riemannian supergradient ascent algorithm for solving this problem is given by
$$U_{t+1} \leftarrow \mathrm{Retr}_{U_t}(\gamma_{t+1}\xi_{t+1}) \quad \text{for any } \xi_{t+1} \in \mathrm{subdiff}\, F(U_t),$$
where $\mathrm{subdiff}\, F(U_t)$ is the Riemannian subdifferential of $F$ at $U_t$ and $\mathrm{Retr}$ is any retraction on $\mathrm{St}(d, k)$. The step size is set as $\gamma_{t+1} = \gamma_0/\sqrt{t+1}$, as suggested by Li et al. [2019]. By the definition of the Riemannian subdifferential, $\xi_t$ can be obtained by taking $\xi \in \partial F(U)$ and setting $\xi_t = P_{T_U\mathrm{St}}(\xi)$. Thus, it is necessary for us to specify the subdifferential of $f$ in Eq. (F.2). We define $V_\pi = \sum_{i=1}^n \sum_{j=1}^n \pi_{i,j}(x_i - y_j)(x_i - y_j)^\top \in \mathbb{R}^{d \times d}$, which is symmetric, and derive that
$$\partial f(U) = \mathrm{Conv}\big\{2 V_{\pi^\star} U \mid \pi^\star \in \mathrm{argmin}_{\pi \in \Pi(\mu,\nu)}\, \langle UU^\top, V_\pi\rangle\big\}, \quad \text{for any } U \in \mathbb{R}^{d \times k}.$$
It remains to solve an OT problem with a given $U$ at each inner loop of the maximization and use the output $\pi(U)$ to obtain a supergradient of $f$. The network simplex method can solve this linear program exactly. We summarize the pseudocode of the RSGAN algorithm in Algorithm 1.

Approximation of $\underline{PW}_{2,k}$. We recall the definition of the IPRW distance of order 2:
$$\underline{PW}_{2,k}^2(\mu, \nu) = \int_{\mathcal{S}_{d,k}} W_2^2(E^\star_\sharp\mu, E^\star_\sharp\nu)\, d\sigma(E),$$
where $\sigma$ is the uniform distribution on $\mathcal{S}_{d,k}$ and $E^\star$ is the linear transformation associated with $E$, defined for any $x \in \mathbb{R}^d$ by $E^\star(x) = E^\top x$. For any measurable function $f$ and $\mu \in \mathcal{P}(\mathbb{R}^d)$, we denote by $f_\sharp\mu$ the push-forward of $\mu$ by $f$, so that $f_\sharp\mu(A) = \mu(f^{-1}(A))$, where $f^{-1}(A) = \{x \in \mathbb{R}^d : f(x) \in A\}$ for any Borel set $A$.
We approximate the integral by selecting a finite set of projections $\mathcal{S} \subseteq \mathcal{S}_{d,k}$ and computing the empirical average:
$$\underline{PW}_{2,k}^2(\mu, \nu) \approx \frac{1}{\mathrm{card}(\mathcal{S})} \sum_{E \in \mathcal{S}} W_2^2(E^\star_\sharp\mu, E^\star_\sharp\nu).$$
The quality of this approximation depends on the sampling of $\mathcal{S}_{d,k}$. In this paper, we use random samples picked uniformly on $\mathcal{S}_{d,k}$, which is analogous to the approach proposed by Bonneel et al. [2015] for the case of $k = 1$; see the sampling schemes paragraph for details.
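This Monte Carlo average is straightforward to implement. The sketch below (our own illustrative code, not the paper's implementation) assumes uniform empirical measures with equally many atoms, for which an optimal transport plan is a permutation, so each projected $W_2$ can be computed exactly with an assignment solver instead of a general OT solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_uniform(X, Y):
    """Exact W2 between two uniform empirical measures with the same
    number of atoms: an optimal plan is a permutation (assignment)."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(C)
    return np.sqrt(C[rows, cols].mean())

def iprw2(X, Y, k, n_proj=100, seed=0):
    """Monte Carlo estimate of the IPRW distance of order 2: average
    squared W2 over projections drawn (approximately Haar-)uniformly
    from S_{d,k} via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        Q, R = np.linalg.qr(rng.standard_normal((d, k)))
        E = Q * np.sign(np.sign(np.diag(R)) + 0.5)  # sign fix -> Haar
        total += w2_uniform(X @ E, Y @ E) ** 2
    return np.sqrt(total / n_proj)
```

As a sanity check, the IPRW estimate between a point cloud and its translate by a vector $v$ equals $\|v\|$ when $k = d$, since every projected measure is then a translate of the other by an orthogonal image of $v$.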
Approximation of $PW_{p,1}$. We recall the definition of the PRW distance of order $p$ with projection dimension $k = 1$:
$$PW_{p,1}^p(\mu, \nu) := \sup_{u \in \mathcal{S}_{d,1}} W_p^p(u^\star_\sharp\mu, u^\star_\sharp\nu) = \sup_{u \in \mathcal{S}_{d,1}} \int_0^1 |F^{-1}_{u^\star_\sharp\mu}(t) - F^{-1}_{u^\star_\sharp\nu}(t)|^p\, dt,$$
where $u \in \mathcal{S}_{d,1}$ is a unit $d$-dimensional vector, $u^\star$ is the linear transformation associated with $u$, defined for any $x \in \mathbb{R}^d$ by $u^\star(x) = u^\top x$, and $F^{-1}_\xi$ is the quantile function of $\xi$. This integral can be estimated using a Monte Carlo estimate and a linear interpolation of the quantile function. Following Nadjahi et al. [2019, Appendix 4], we consider two approximations of this quantity. The first one is given by
$$PW_{p,1}^p(\mu, \nu) \approx \sup_{u \in \mathcal{S}_{d,1}} \frac{1}{K} \sum_{k=1}^K |\tilde{F}^{-1}_{u^\star_\sharp\mu}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\nu}(t_k)|^p, \tag{F.3}$$
where $\{t_k\}_{k=1}^K$ are uniform and independent samples from $[0, 1]$ and $\tilde{F}^{-1}_\xi$ is a linear interpolation of $F^{-1}_\xi$, which denotes either the exact quantile function of a discrete measure $\xi$ or an approximation by a Monte Carlo procedure. The second one is given by
$$PW_{p,1}^p(\mu, \nu) \approx \sup_{u \in \mathcal{S}_{d,1}} \frac{1}{K} \sum_{k=1}^K |s_k - \tilde{F}^{-1}_{u^\star_\sharp\nu}(\tilde{F}_{u^\star_\sharp\mu}(s_k))|^p, \tag{F.4}$$
where $\{s_k\}_{k=1}^K$ are independent samples from $u^\star_\sharp\mu$ and $\tilde{F}_\xi$ (resp. $\tilde{F}^{-1}_\xi$) is a linear interpolation of $F_\xi$ (resp. $F^{-1}_\xi$), which denotes either the exact cumulative distribution function (resp. quantile function) of a discrete measure $\xi$ or an approximation by a Monte Carlo procedure.

Sampling schemes.
We explain the methods used to generate i.i.d. samples from the uniform distribution on the set of $d \times k$ orthogonal matrices, i.e., $\mathcal{S}_{d,k} = \{E \in \mathbb{R}^{d \times k} : E^\top E = I_k\}$, and i.i.d. samples from multivariate elliptically contoured stable distributions.

To sample from $\mathcal{S}_{d,k}$, we first construct the $d \times k$ matrix $Z$ by drawing each of its components from the standard normal distribution $\mathcal{N}(0, 1)$ and then perform the QR decomposition of it: $E = \mathrm{qr}(Z)$. By definition, $E \in \mathcal{S}_{d,k}$ is a uniform sample.

To sample from multivariate elliptically contoured stable distributions, we follow the approach presented in Nadjahi et al. [2019, Appendix 4]. Indeed, we recall that if $Y \in \mathbb{R}^d$ is $\alpha$-stable and elliptically contoured, i.e., $Y \sim \mathcal{E}_\alpha\mathcal{S}_c(\Sigma, m)$, then its joint characteristic function is given, for any $t \in \mathbb{R}^d$, by
$$\mathbb{E}\big[\exp(it^\top Y)\big] = \exp\big(-(t^\top \Sigma t)^{\alpha/2} + it^\top m\big), \tag{F.5}$$
where $\Sigma$ is a positive definite matrix (akin to a correlation matrix), $m \in \mathbb{R}^d$ is a location vector (equal to the mean if it exists) and $\alpha \in (0, 2)$ controls the thickness of the tails. Elliptically contoured stable distributions are scale mixtures of multivariate Gaussian distributions [Samoradnitsky, 2017, Proposition 2.5.2] with computationally intractable densities. Fortunately, it was shown by Nolan [2013] that sampling from multivariate elliptically contoured stable distributions is possible: let $A \sim \mathcal{S}_{\alpha/2}(\beta, \gamma, \delta)$ be a one-dimensional positive $(\alpha/2)$-stable random variable with $\beta = 1$, $\gamma = 2\cos(\pi\alpha/4)^{2/\alpha}$ and $\delta = 0$, and let $G \sim \mathcal{N}(0, \Sigma)$. By definition, $Y = \sqrt{A}\,G + m$ satisfies Eq. (F.5) and $Y \sim \mathcal{E}_\alpha\mathcal{S}_c(\Sigma, m)$.

Optimization methods.
Computing the MPRW and MEPRW estimators is intractable in general, mainly because the PRW distance requires a maximization over infinitely many projections. Formally, we hope to solve the following minimax optimization model:
$$\min_{\theta \in \Theta} PW_{p,1}^p(\mu_\theta, \mu^\star) = \min_{\theta \in \Theta}\ \max_{u \in \mathcal{S}_{d,1}} \int_0^1 |F^{-1}_{u^\star_\sharp\mu_\theta}(t) - F^{-1}_{u^\star_\sharp\mu^\star}(t)|^p\, dt,$$
where $\{\mu_\theta : \theta \in \Theta\}$ is the model and $\mu^\star$ is the data-generating process. Following the approach presented in Nadjahi et al. [2019] together with the approximation of $PW_{p,1}$, we use the ADAM optimization method to minimize the (expected) PRW distance over the set of parameters, while applying multiple projected supergradient ascent steps to find an approximate projection $u$ maximizing over $\mathcal{S}_{d,1}$ at each inner loop. The ADAM optimization method is used with the default parameter setting suggested by Kingma and Ba [2015]. At each inner loop, we run 5 projected supergradient ascent steps with learning rate $10^{-}$.

Gaussian models. For the MPRW estimator, we consider the approximate $PW_{2,1}$ distance based on Eq. (F.4). Indeed, let $\mu$ denote $\mathcal{N}(m_0, \sigma_0^2 I)$ and let $\hat{\nu}$ denote the empirical probability measure of $n$ samples drawn from the data-generating process. We define the function $f(m_0, \sigma_0^2, u)$ as
$$f(m_0, \sigma_0^2, u) = \frac{1}{\mathrm{card}(\mathcal{S})} \sum_{s \in \mathcal{S}} |s - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(\tilde{F}_{u^\star_\sharp\mu}(s))|^2\, \mathcal{N}(s; u^\top m_0, \sigma_0^2),$$
where $\mathcal{S} \subseteq \mathbb{R}$ and $\mathcal{N}(s; u^\top m_0, \sigma_0^2)$ refers to the density function of a Gaussian with parameters $(u^\top m_0, \sigma_0^2)$ evaluated at $s \in \mathcal{S}$.
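As a crude, self-contained check of the inner maximization, the Eq. (F.3)-style approximation of $PW_{2,1}$ can be evaluated with numpy's linearly interpolated empirical quantiles. In this sketch (all function names are ours) the supremum over $u$ is approximated by random search over unit directions rather than the projected supergradient ascent used in the paper:

```python
import numpy as np

def sliced_wp_p(xu, yu, t, p=2):
    """Eq. (F.3)-style integrand for one direction: compare the linearly
    interpolated quantile functions of two projected samples at levels t."""
    qx = np.quantile(xu, t)   # default interpolation is linear
    qy = np.quantile(yu, t)
    return np.mean(np.abs(qx - qy) ** p)

def max_sliced_wp(X, Y, p=2, n_dir=200, K=100, seed=0):
    """Rough estimate of PW_{p,1} by random search over unit directions
    (the paper maximizes with projected supergradient ascent instead)."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(size=K)             # Monte Carlo quantile levels
    best = 0.0
    for _ in range(n_dir):
        u = rng.standard_normal(X.shape[1])
        u /= np.linalg.norm(u)
        best = max(best, sliced_wp_p(X @ u, Y @ u, t, p))
    return best ** (1.0 / p)
```

For two point clouds that differ by a translation $v$, each sliced term equals $|u^\top v|^p$, so the random-search estimate is bounded above by $\|v\|$ and approaches it as more directions are drawn.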
We compute the explicit gradients of $f(m_0, \sigma_0^2, u)$ with respect to the mean $m_0$, the variance $\sigma_0^2$ and the projection vector $u$ as follows:
$$\nabla_{m_0} f(m_0, \sigma_0^2, u) = \frac{1}{\sigma_0^2\,\mathrm{card}(\mathcal{S})} \sum_{s \in \mathcal{S}} |s - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(\tilde{F}_{u^\star_\sharp\mu}(s))|^2\, \mathcal{N}(s; u^\top m_0, \sigma_0^2)\,(s - u^\top m_0)\, u,$$
$$\nabla_{\sigma_0^2} f(m_0, \sigma_0^2, u) = \frac{1}{2\sigma_0^4\,\mathrm{card}(\mathcal{S})} \sum_{s \in \mathcal{S}} |s - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(\tilde{F}_{u^\star_\sharp\mu}(s))|^2\, \mathcal{N}(s; u^\top m_0, \sigma_0^2)\,\big((s - u^\top m_0)^2 - \sigma_0^2\big),$$
$$\nabla_u f(m_0, \sigma_0^2, u) = \frac{1}{\sigma_0^2\,\mathrm{card}(\mathcal{S})} \sum_{s \in \mathcal{S}} |s - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(\tilde{F}_{u^\star_\sharp\mu}(s))|^2\, \mathcal{N}(s; u^\top m_0, \sigma_0^2)\,(s - u^\top m_0)\, m_0.$$

For the MEPRW estimator, we consider the approximate $PW_{2,1}$ distance based on Eq. (F.3). Indeed, let $\hat{\mu}$ and $\hat{\nu}$ denote the empirical probability measures of $m$ samples drawn from $\mathcal{N}(m_0, \sigma_0^2 I)$ and $n$ samples drawn from the data-generating process, respectively. We define the function $f(m_0, \sigma_0^2, u)$ as
$$f(m_0, \sigma_0^2, u) = \frac{1}{K} \sum_{k=1}^K |\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)|^2,$$
where $\{t_k\}_{k=1}^K$ are uniform and independent samples from $[0, 1]$. We compute the gradients of $f(m_0, \sigma_0^2, u)$ with respect to the mean $m_0$, the variance $\sigma_0^2$ and the projection vector $u$ as follows:
$$\nabla_{m_0} f(m_0, \sigma_0^2, u) = \frac{2}{K} \sum_{k=1}^K \big(\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)\big)\, u,$$
$$\nabla_{\sigma_0^2} f(m_0, \sigma_0^2, u) = \frac{1}{\sigma_0^2 K} \sum_{k=1}^K \big(\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)\big)\big(\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - u^\top m_0\big),$$
$$\nabla_u f(m_0, \sigma_0^2, u) = \frac{2}{K} \sum_{k=1}^K \big(\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)\big)\, m_0.$$

Elliptically contoured stable models.
When comparing the MEPRW estimator with the MPRW estimator using elliptically contoured stable models, we also approximate these estimators using the ADAM optimization method with the default parameter setting. We consider the approximate $PW_{2,1}$ distance based on Eq. (F.3). Indeed, let $\hat{\mu}$ and $\hat{\nu}$ denote the empirical probability measures of $m$ samples drawn from $\mathcal{E}_\alpha\mathcal{S}_c(I, m_0)$ and $n$ samples drawn from the data-generating process, respectively. We define the function $f(m_0, u)$ as
$$f(m_0, u) = \frac{1}{K} \sum_{k=1}^K |\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)|^2,$$
where $\{t_k\}_{k=1}^K$ are uniform and independent samples from $[0, 1]$. We compute the gradients of $f(m_0, u)$ with respect to the location parameter $m_0$ and the projection vector $u$ as follows:
$$\nabla_{m_0} f(m_0, u) = \frac{2}{K} \sum_{k=1}^K \big(\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)\big)\, u, \qquad \nabla_u f(m_0, u) = \frac{2}{K} \sum_{k=1}^K \big(\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)\big)\, m_0.$$

Generative modeling.
We use the ADAM optimizer provided by PyTorch, running on GPU.
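Returning to the sampling schemes described earlier in this section, the elliptically contoured stable draw $Y = \sqrt{A}\,G + m$ can be sketched as follows. This is a hypothetical illustration (not the paper's code): it generates the positive stable factor with the Chambers-Mallows-Stuck recipe for a totally skewed stable law, and it treats $\gamma$ as a multiplicative scale; stable-law parametrization conventions vary across references and should be checked against Nolan [2013]:

```python
import numpy as np

def positive_stable(a, size, rng):
    """Chambers-Mallows-Stuck sampler for a totally skewed (beta = 1)
    a-stable variable with a in (0, 1), which is positive a.s."""
    b = np.arctan(np.tan(np.pi * a / 2)) / a            # shift angle
    s = (1 + np.tan(np.pi * a / 2) ** 2) ** (1 / (2 * a))
    theta = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    num = np.sin(a * (theta + b))
    return (s * num / np.cos(theta) ** (1 / a)
            * (np.cos(theta - a * (theta + b)) / w) ** ((1 - a) / a))

def sample_ecs(alpha, Sigma, m, n, rng):
    """Draw n samples via the scale-mixture representation
    Y = sqrt(gamma * A) * G + m, with A positive (alpha/2)-stable."""
    d = len(m)
    gamma = 2 * np.cos(np.pi * alpha / 4) ** (2 / alpha)  # scale in the text
    A = gamma * positive_stable(alpha / 2, n, rng)
    G = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    return np.sqrt(A)[:, None] * G + m
```

Because the mixing variable $A$ is heavy-tailed, the resulting samples exhibit the thick tails controlled by $\alpha$ while keeping the elliptical contours determined by $\Sigma$.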
G Experimental Setup
Computing infrastructure.
For the experiments on the uniform distribution over hypercubes, we implement the code in Python 3.7 with Numpy 1.18 on a workstation with an Intel Core i5-9400F (6 cores and 6 threads) and 32GB memory, running Ubuntu 18.04. For the experiments on the MPRW and MEPRW estimators, we implement the code in Python 2.7 with Numpy 1.16 and IPython 5.8 on the same machine. These experiments were not conducted with a GPU. For the experiments on neural networks, we use the same machine with 2 GPUs (GeForce GTX 1070 and GeForce RTX 2070).

Figure 5: Mean values (Top) and mean computational times (Bottom) of the IPRW and PRW distances of order 2 between empirical measures $\hat{\mu}_n$ and $\hat{\nu}_n$ as the number of points $n$ varies. Results are averaged over 100 runs.
We conduct the experiment on the uniform distribution over different hypercubes, as also used in the experiments of Paty and Cuturi [2019]. In particular, we consider $\mu = \nu = \mathrm{U}([-v, v]^d)$, a uniform distribution over a hypercube, where $d$ and $v$ stand for the dimension and the scale of the distribution, respectively. $\hat{\mu}_n$ and $\hat{\nu}_n$ are empirical distributions corresponding to $\mu$ and $\nu$ with $n$ samples. We evaluate the PRW and IPRW distances in terms of mean values and mean computational times over 100 runs for $(d, v) \in \{(10, ), (10, ), (30, ), (30, ), (50, ), (50, )\}$. For the PRW distance, we run Algorithm 1 with the emd solver in the POT package [Flamary and Courty, 2017] and terminate the algorithm either when the maximum number of iterations $T = 30$ is reached or when $\|U_{t+1} - U_t\|_F \leq 10^{-}$. For the IPRW distance, we draw 100 uniform and independent projections from $\mathcal{S}_{d,k}$ and compute each Wasserstein distance using the emd solver in the POT package.
Model misspecification.
We conduct the experiments on three types of data: mixtures of 8, 12 and 25 Gaussian distributions, with Gaussian models $\mathcal{M} = \{\mathcal{N}(m_0, \sigma_0^2 I) : m_0 \in \mathbb{R}^2, \sigma_0^2 > 0\}$ and elliptically contoured stable models $\mathcal{M} = \{\mathcal{E}_\alpha\mathcal{S}_c(I, m_0) : m_0 \in \mathbb{R}^2\}$. For the data-generating process, we fix $k$ centers $\{(a_i, b_i)\}_{1 \leq i \leq k}$. For each sample, we first select a center $m$ uniformly at random from the centers and then draw the sample from $\mathcal{N}(2m, \cdot)$.

We use the ADAM optimization method with the default parameter setting to compute the MPRW and MEPRW estimators. At each inner loop, we run 5 projected supergradient ascent steps with learning rate $10^{-}$. For the Gaussian models, we estimate the densities of $\hat{\sigma}_n$ with a kernel density estimator by computing the MPRW estimator of order 1 over 100 runs. The maximum number of ADAM iterations is set to 20000. To illustrate the consistency of the MPRW and MEPRW estimators, we compute the MPRW and MEPRW estimators of order 2 over 100 runs, where the maximum numbers of ADAM iterations are set to 20000 and 10000, respectively. We also verify the convergence of MEPRW to MPRW by computing these estimators 100 times on a fixed set of $n = 2000$ observations for different numbers $m$ of samples generated from the model. The maximum numbers of ADAM iterations for the MPRW and MEPRW estimators are set to 20000 and 10000. For the elliptically contoured stable models, we verify the consistency property of MEPRW and the convergence of MEPRW to MPRW. For the former, we compute the MEPRW estimator of order 2 over 100 runs and set the maximum number of ADAM iterations to 10000. For the latter, we compute the MPRW and MEPRW estimators of order 2 over 100 runs on a fixed set of $n = 100$ observations for different numbers $m$ of samples generated from the model. The maximum numbers of ADAM iterations are set to 20000 and 10000. All of these settings are used consistently on the mixtures of 8, 12 and 25 Gaussian distributions.

Figure 6: Probability density of the estimate of the centered and rescaled $\hat{\sigma}_n$ on the Gaussian model for different $n$. (a) Mixture of 12 Gaussian distributions; (b) Mixture of 25 Gaussian distributions.
The procedure of the max-SW generator is summarized as follows: we first sample a random variable Z from a fixed distribution on the base space Z, and then transform Z through a neural network parametrized by θ. This provides a parametric function T_θ : Z → R^d which allows us to generate images from a distribution μ_θ. Our goal is to optimize the neural network parameters θ by minimizing the max-SW distance [Deshpande et al., 2019] between μ_θ and the data-generating distribution. We use a neural network with the fully connected configuration from Deshpande et al. [2018, Appendix D] and train our model on CIFAR10 and ImageNet200 (available at https://tiny-imagenet.herokuapp.com/). The former consists of 60000 training and 10000 testing images of size 3 × 32 × 32, while the latter consists of 100000 training and 10000 testing images. We use the minimal expected max-SW estimator of order 2, approximated with 50 projected gradient ascent steps and a fixed learning rate, and train for 1000 iterations with the ADAM optimizer [Kingma and Ba, 2015].

Figure 7: Minimal PRW and expected PRW estimations using Gaussian models and n samples from the mixture of 12 Gaussian distributions: (a) MPRW vs. n; (b) MEPRW vs. n = m; (c) MEPRW with n = 2000 vs. m. Results are averaged over 100 runs and shaded areas represent standard deviation.

Figure 8: Minimal PRW and expected PRW estimations using Gaussian models and n samples from the mixture of 25 Gaussian distributions: (a) MPRW vs. n; (b) MEPRW vs. n = m; (c) MEPRW with n = 2000 vs. m. Results are averaged over 100 runs and shaded areas represent standard deviation.

H Additional Experimental Results
Convergence and concentration.
Figure 5 presents average distances and computational times for (d, v) ∈ {(10, ·), (30, ·), (50, ·)}, where the shaded areas show the max-min values over 100 runs. We also observe that the IPRW distance is smaller than the PRW distance for small n, especially so when d and v are large. The two distances are close when n is large, supporting in practice the theoretical results given by Theorem 3.4 and Theorem 3.6. The computation of the PRW distance is relatively faster than that of the IPRW distance in these experiments. Model misspecification: Gaussian models.
Figure 6 shows the distributions of the estimates centered and rescaled by √n for a range of moderately large n, based on two underlying models: the mixture of 12 Gaussian distributions and the mixture of 25 Gaussian distributions. The left panel supports the convergence rate and the limiting distribution of the estimator derived in Theorem 3.12 on the mixture of 12 Gaussian distributions. The right panel suggests that the limiting distribution is not normal when the underlying model is the mixture of 25 Gaussian distributions. In the latter case, the result is not as anticipated by Theorem 3.12. This is possibly because we only run 5 projected supergradient ascent steps at each inner loop, which may not be enough to achieve a good approximate projection u. Figures 7 and 8 demonstrate the large-sample consistency of the MPRW and MEPRW estimators on the mixtures of 12 and 25 Gaussian distributions, which is expected since Assumptions 3.1–3.3 are mild. The MEPRW estimator also converges to the MPRW estimator on the mixture of 12 Gaussian distributions, confirming Theorem 3.11. One exception in these experiments is the failure of MEPRW to converge to MPRW on the mixture of 25 Gaussian distributions; apparently, the conclusion of Theorem 3.11 does not hold in this experimental setting. This is likely due to a violation of Assumption 3.5, which is necessary for Theorem 3.11 to hold. Model misspecification: Elliptically contoured stable models.
Figure 9 (a) illustrates the consistency of the MEPRW estimator m̂_{n,m}, approximated with 5 projected supergradient ascent steps, in the same way as for the Gaussian models. Figure 9 (b) confirms the convergence of m̂_{n,m} to the MPRW estimator m̂_n, where we fix n = 100 observations and compute the mean squared error between these two estimators (using 5 projected supergradient ascent steps) for different values of m. Note that the MPRW estimator is approximated by the MEPRW estimator obtained for a large enough value of m. Taken together, our results on elliptically contoured stable models confirm Theorem 3.9, Theorem 3.10 and Theorem 3.11 in practice. Generative modeling.
Figure 10 presents the mean test loss on CIFAR10 over 10 runs, where the shaded areas show the max-min values over the runs. Here the minimal expected max-SW estimator of order 2 is approximated with 20 projected gradient ascent steps and a fixed learning rate, and we train for 1000 iterations with the ADAM optimizer [Kingma and Ba, 2015]. We also train the neural networks with (n, m) ∈ {(100, ·), (1000, ·), (5000, ·), (10000, ·)}, where n is the number of training samples and m is the number of generated samples, and compute the test losses using the trained models on the testing dataset (n = 10000) with m = 250 generated samples. We compare these test losses to that of a neural network trained using n = 60000 (i.e., the entire training dataset) and m = 200, and present them in Figure 10. Again, our results confirm Theorem 3.10 in practice.

Figure 10: Mean test loss for different values of (n, m) on CIFAR10.

Figure 9: Minimal expected PRW estimations using elliptically contoured stable models and n samples from the mixture of 8 Gaussian distributions (top), 12 Gaussian distributions (middle) and 25 Gaussian distributions (bottom), and m samples generated from the model: (a) MEPRW; (b) MEPRW, n⋆ = 100. Results are averaged over 100 runs and shaded areas represent standard deviation.
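Throughout these experiments, the PRW and max-SW objectives are approximated by a few projected (super)gradient ascent steps over the projection direction. As a rough illustration of that inner routine, and not the exact implementation used above, the following minimal NumPy sketch computes the order-2 max-sliced Wasserstein distance, i.e., the k = 1 case of PRW, between two empirical measures with equal numbers of points; the function names, step count and learning rate below are illustrative choices of ours:

```python
import numpy as np

def w2_1d(x, y):
    """Order-2 Wasserstein distance between two 1-D empirical measures with
    the same number of points: sort both samples and pair them monotonically."""
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

def max_sliced_w2(X, Y, steps=5, lr=1e-2, seed=0):
    """Approximate the max-sliced W2 distance between empirical measures in R^d
    by projected (super)gradient ascent on the unit sphere S^{d-1}."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(X.shape[1])
    u /= np.linalg.norm(u)
    for _ in range(steps):
        # Supergradient of W2^2(u^T X, u^T Y) w.r.t. u, holding the optimal
        # monotone coupling of the projected (sorted) samples fixed.
        ix, iy = np.argsort(X @ u), np.argsort(Y @ u)
        diff = (X @ u)[ix] - (Y @ u)[iy]          # paired 1-D displacements
        grad = 2 * np.mean(diff[:, None] * (X[ix] - Y[iy]), axis=0)
        u = u + lr * grad                         # ascent step
        u /= np.linalg.norm(u)                    # project back onto the sphere
    return w2_1d(X @ u, Y @ u), u
```

As a sanity check of the sketch: when Y is a pure shift of X, say Y = X + c, the ascent drives u toward ±c/‖c‖ and the returned distance toward ‖c‖, which is the max-sliced value in that case.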