On Projection Robust Optimal Transport: Sample Complexity and Model Misspecification
Tianyi Lin, Zeyu Zheng, Elynn Y. Chen, Marco Cuturi, Michael I. Jordan
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
Department of Industrial Engineering and Operations Research, University of California, Berkeley
Department of Statistics, University of California, Berkeley
CREST - ENSAE
Google Brain
June 30, 2020
Abstract
Optimal transport (OT) distances are increasingly used as loss functions for statistical inference, notably in the learning of generative models or supervised learning. Yet the behavior of minimum Wasserstein estimators is poorly understood, notably in high-dimensional regimes or under model misspecification. In this work we adopt the viewpoint of projection robust (PR) OT, which seeks to maximize the OT cost between two measures by choosing a k-dimensional subspace onto which they can be projected. Our first contribution is to establish several fundamental statistical properties of PR Wasserstein distances, complementing and improving previous literature that has been restricted to one-dimensional and well-specified cases. Next, we propose the integral PR Wasserstein (IPRW) distance as an alternative to the PRW distance, obtained by averaging rather than optimizing over subspaces. Our complexity bounds can help explain why both PRW and IPRW distances outperform Wasserstein distances empirically in high-dimensional inference tasks. Finally, we consider parametric inference using the PRW distance. We provide asymptotic guarantees for two types of minimum PRW estimators and formulate a central limit theorem for the max-sliced Wasserstein estimator under model misspecification. To enable our analysis of PRW with projection dimension larger than one, we devise a novel combination of variational analysis and statistical theory.

Introduction

Recent years have witnessed an ever-increasing role for ideas from optimal transport (OT) [Villani, 2008] in machine learning. Combining OT distances with the general principles of minimum distance estimation (MDE) [Wolfowitz, 1957, Basu et al., 2011] yields a powerful basis for various statistical inference problems, such as density estimation [Bassetti et al., 2006], training of generative models [Arjovsky et al., 2017, Gulrajani et al., 2017, Montavon et al., 2016, Adler and Lunz, 2018, Cao et al., 2019], auto-encoders [Tolstikhin et al., 2018], clustering [Cuturi and Doucet, 2014, Bonneel et al., 2016, Ho et al., 2017, Ye et al., 2017], multitask regression [Janati et al., 2020], trajectory inference [Hashimoto et al., 2016, Schiebinger et al., 2017, Yang et al., 2020, Tong et al., 2020] and nonparametric testing [Ramdas et al., 2017]; see Peyré and Cuturi [2019] and Panaretos and Zemel [2019] for reviews on these topics.

For OT ideas to continue to bear fruit in machine learning, it will be necessary to tackle two characteristic challenges: (1) high dimensionality and (2) model misspecification. Initial progress has been made on the latter problem by Bernton et al. [2019], who showed that in the misspecified case the minimum Wasserstein estimator (MWE) outputs the Wasserstein projection of the data-generating distribution onto the fitted model class. These authors also obtained results on robustness and the asymptotic distribution of the projection, although these results only apply to the one-dimensional setting. High-dimensional settings are challenging; indeed, it is known that the sample complexity of estimating the Wasserstein distance can grow exponentially in dimension [Dudley, 1969, Fournier and Guillin, 2015, Weed and Bach, 2019, Lei, 2020].

We focus on a promising approach to treating high-dimensional problems: compute the OT distance between low-dimensional projections of high-dimensional input measures. The simplest and most representative example of this approach is the sliced Wasserstein distance [Rabin et al., 2011, Bonnotte, 2013, Bonneel et al., 2015, Deshpande et al., 2019, Kolouri et al., 2019a, Nadjahi et al., 2020], which is defined as the average OT distance obtained between random one-dimensional projections, and which has been shown to be practical in real applications [Deshpande et al., 2018, 2019, Kolouri et al., 2016, 2019b, Carriere et al., 2017, Wu et al., 2019, Liutkus et al., 2019]. In an important extension, Paty and Cuturi [2019] and Niles-Weed and Rigollet [2019] recently proposed to seek the k-dimensional subspace (k > 1) that maximizes the OT cost between the projected measures. The resulting projection robust Wasserstein (PRW) distance is conceptually simple and does overcome the curse of dimensionality in the so-called spiked model, as proved in [Niles-Weed and Rigollet, 2019, Theorem 1], recovering an optimal 1/√n rate under the Talagrand transport inequality. This result suggests that PRW can be significantly more useful than the OT distance for inference tasks when the dimension is large. From a computational point of view, PRW becomes the max-sliced Wasserstein distance when the projection dimension is k = 1 and admits an efficient implementation [Deshpande et al., 2019]. For general k ≥ 1, Lin et al. [2020] proposed to compute PRW using a Riemannian optimization toolbox and provided theoretical guarantees and encouraging empirical results. However, obtaining rigorous guarantees for the practical performance of PRW requires a more thorough understanding of its statistical behavior.
Contributions.
In this paper, we study the statistical properties of PRW and of the so-called integrated PRW (IPRW) distance, which replaces the maximum in the original PRW with an average of OT distances over k-dimensional projections. Our contributions can be summarized as follows.

1. We prove that the empirical measure µ̂_n converges to the true measure µ⋆ under both the PRW and IPRW distances, with different parametric rates. For example, when the order is p = 3/2 and k ≥ 3, the parametric rate is n^{−1/k} (IPRW). For PRW, the rate acquires two additional terms of order n^{−1/(2p)}√(dk log(n)) and n^{−1/p} dk log(n) when µ⋆ satisfies a projection Bernstein tail condition, and analogous terms when µ⋆ satisfies a projection Poincaré inequality. Concentration results are also presented under stronger conditions.

2. We establish asymptotic guarantees for the minimum PRW and expected PRW estimators under model misspecification. For the minimum PRW estimator with p = 1 and k = 1, we derive an asymptotic distribution for arbitrary dimension d with the parametric n^{−1/2} rate in the Hausdorff metric. Our assumptions are weaker than those used in Bernton et al. [2019], requiring neither the nonsingularity of the Jacobian nor the separability of the parameters.

3. We conduct extensive experiments on synthetic data and neural networks to validate our theory. As a byproduct, we present a simple optimization algorithm that can efficiently compute the PRW distance in practice even when k ≥ 2; see Appendix F or Lin et al. [2020, Appendix B].

This quantity is also named Wasserstein Projection Pursuit (WPP) [Niles-Weed and Rigollet, 2019]. For simplicity, we refer from now on to PRW/WPP as PRW.
In summary, our work provides an enhanced understanding of two PRW distances and the associated minimum distance estimators under model misspecification, complementing the existing literature [Niles-Weed and Rigollet, 2019, Bernton et al., 2019, Nadjahi et al., 2019, 2020]. Our proof techniques consist of a novel combination of classical results from variational analysis and statistical theory, which may be of independent interest.
In this section, we provide some technical background material on projection optimal transport. Throughout the paper, ||·|| denotes the Euclidean norm (in the corresponding vector space).
Wasserstein and sliced Wasserstein.
Let p ≥ 1. We denote by P(R^d) the set of all Borel probability measures on R^d, and by P_p(R^d) the subset of measures satisfying M_p(µ) := ∫_{R^d} ||x||^p dµ(x) < +∞. For two probability measures µ, ν ∈ P_p(R^d), their Wasserstein distance of order p is defined as follows:

W_p(µ, ν) := ( inf_{π ∈ Π(µ,ν)} ∫_{R^d × R^d} ||x − y||^p dπ(x, y) )^{1/p},   (2.1)

where the infimum is taken over Π(µ, ν) ⊆ P(R^d × R^d), the set of probability measures with marginals µ and ν. In the one-dimensional (1D) case, Rachev and Rüschendorf [1998, Theorem 3.1.2.(a)] have shown that W_p^p(µ, ν) = ∫_0^1 |F_µ^{−1}(t) − F_ν^{−1}(t)|^p dt, where F_µ^{−1} and F_ν^{−1} are the quantile functions of µ and ν. This 1D formula motivates the sliced Wasserstein (SW) and max-sliced Wasserstein (max-SW) distances [Bonnotte, 2013, Bonneel et al., 2015, Deshpande et al., 2019]. In particular, the idea is to use, as a proxy for (2.1), the average or the maximum of a set of 1D Wasserstein distances constructed by projecting the d-dimensional measures onto a random collection of 1D subspaces. Being computationally appealing, both the SW and max-SW distances are widely used in practice, especially in generative modeling [Kolouri et al., 2019b, Deshpande et al., 2019, Liutkus et al., 2019]. Practitioners observe that the SW distance only yields a good Monte-Carlo approximation with a large number of projection samples, while the max-SW distance can achieve similar results with fewer samples [Kolouri et al., 2019a, Nguyen et al., 2020].
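Since both SW and max-SW reduce to one-dimensional transport, the quantile formula gives an exact, sort-based evaluation for empirical measures with equally many atoms. A minimal sketch (our own illustration; the function name is not from the paper):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p between two 1D empirical measures with equally many atoms.

    By the quantile formula W_p(mu, nu)^p = int_0^1 |F_mu^{-1}(t) - F_nu^{-1}(t)|^p dt,
    and because the quantile functions of n equally weighted atoms are step
    functions, the integral reduces to an average over matched order statistics.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    assert x.shape == y.shape, "equal sample sizes keep the formula exact"
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

# The SW distance averages wasserstein_1d over random directions, while
# max-SW maximizes it; both reuse exactly this 1D routine.
```

For example, `wasserstein_1d([0, 1], [1, 2], p=1)` returns 1.0, since each order statistic is shifted by exactly one.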
Projection robust Wasserstein. Encouraged by the success of SW and max-SW, Paty and Cuturi [2019] ask whether we can gain more by using a subspace of dimension k ≥ 2. The resulting quantity is the projection robust Wasserstein (PRW) distance, which is well posed whenever the order is p ≥ 1. More specifically, let S_{d,k} = {E ∈ R^{d×k} : E^⊤E = I_k} be the set of d × k orthogonal matrices, and let E_⋆ be the linear transformation associated with E, defined for any x ∈ R^d by E_⋆(x) = E^⊤x. For any measurable function f and µ ∈ P(R^d), we write fµ for the push-forward of µ by f, so that fµ(A) = µ(f^{−1}(A)), where f^{−1}(A) = {x ∈ R^d : f(x) ∈ A} for any Borel set A; in particular, E_⋆µ is the projection of µ onto the k-dimensional subspace spanned by the columns of E. The PRW distance of order p between µ and ν is defined by

PW_{p,k}(µ, ν) := sup_{E ∈ S_{d,k}} W_p(E_⋆µ, E_⋆ν).   (2.2)

As an alternative, we define the IPRW distance, which replaces the supremum in Eq. (2.2) with an average. Formally, the IPRW distance of order p between µ and ν is defined by

IPW_{p,k}(µ, ν) := ( ∫_{S_{d,k}} W_p^p(E_⋆µ, E_⋆ν) dσ(E) )^{1/p},   (2.3)

where σ is the uniform distribution on S_{d,k}. The IPRW and PRW distances generalize the SW and max-SW distances, respectively, to the high-dimensional projection setting. Compared to the PRW distance, the IPRW distance can be shown to behave better statistically, but it remains unfavorable in a computational sense: a large number of samples from S_{d,k} is necessary to approximate the IPRW distance accurately. Improving this computational efficiency is interesting but beyond the scope of this paper.
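Both (2.2) and (2.3) can be approximated directly from samples: draw projections E uniformly from S_{d,k} (via QR factorization of a Gaussian matrix), evaluate the projected OT cost exactly by optimal assignment (valid for uniform empirical measures with equally many atoms), then maximize or average. The sketch below is our own illustration, not the optimization algorithm of Appendix F, and it assumes SciPy is available for the assignment solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_p(x, y, p=2):
    """Exact W_p between uniform empirical measures with equally many atoms,
    computed via the optimal assignment between the two point clouds."""
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p
    rows, cols = linear_sum_assignment(cost)
    return float(np.mean(cost[rows, cols]) ** (1.0 / p))

def random_stiefel(d, k, rng):
    """Draw E in S_{d,k} (so that E^T E = I_k) as the Q factor of a Gaussian
    matrix; this gives an approximately uniform sample of the manifold."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q

def prw_iprw_estimates(x, y, k=2, p=2, n_proj=50, seed=0):
    """Monte-Carlo sketch: the max over sampled projections lower-bounds the
    PRW distance (2.2); averaging W_p^p over projections and taking the p-th
    root approximates the IPRW distance (2.3)."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_proj):
        e = random_stiefel(x.shape[1], k, rng)
        vals.append(wasserstein_p(x @ e, y @ e, p))
    vals = np.asarray(vals)
    return float(vals.max()), float(np.mean(vals ** p) ** (1.0 / p))
```

Since the maximum dominates the p-th power mean over the same sampled projections, the first return value is never smaller than the second; replacing the random search over E by Riemannian ascent on S_{d,k} recovers the approach of Lin et al. [2020].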
Convergence of empirical measures. Let X_n = (X_1, ..., X_n) be independent and identically distributed samples from the true measure µ⋆ ∈ P_q(R^d). The empirical measure of X_n is defined by µ̂_n := (1/n) Σ_{i=1}^n δ_{X_i}. It is known that µ̂_n ⇒ µ⋆ almost surely, and that W_p(µ̂_n, µ⋆) → 0 with E[W_p(µ̂_n, µ⋆)] ≍ n^{−1/d} whenever µ⋆ is absolutely continuous with respect to the Lebesgue measure and d > 2p [Dudley, 1969, Fournier and Guillin, 2015, Weed and Bach, 2019]. The convergence is slow when the dimension is high, an instance of the well-known curse-of-dimensionality phenomenon.

Due to the low-dimensional structure of the IPRW and PRW distances, their parametric rate is expected to be of order n^{−1/k} in the large-n limit. Similar rates have been derived for E[|PW_{p,k}(µ̂_n, ν̂_n) − W_p(µ, ν)|] as a function of n under a spiked transport model; see Niles-Weed and Rigollet [2019, Theorem 8]. Their bound depends on the problem dimension d and requires µ and ν to satisfy the Talagrand transport inequality [Talagrand, 1996]. For the special case k = 1, the rate for the IPRW distance was studied in [Nadjahi et al., 2020]. To the best of our knowledge, there has been no other work on the statistical properties of the IPRW and PRW distances.
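The dimension-dependence discussed above is easy to probe numerically. The toy experiment below (ours, purely illustrative) tracks the average one-dimensional W_1 between two independent empirical measures of the uniform law on [0, 1], which decays at the parametric-type n^{−1/2} rate that projections aim to recover:

```python
import numpy as np

def w1_1d(x, y):
    # 1D W_1 between equal-size empirical measures: match order statistics.
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(0)
mean_err = {}
for n in (16, 256, 4096):
    runs = [w1_1d(rng.random(n), rng.random(n)) for _ in range(50)]
    mean_err[n] = float(np.mean(runs))
# mean_err decreases with n, roughly like n^{-1/2} in one dimension.
```

Repeating this with d-dimensional clouds and a full OT solver exhibits the much slower n^{−1/d} decay, which is precisely what motivates projecting onto k-dimensional subspaces.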
Parametric modeling and inference. A statistical model is a family of distributions, M = {µ_θ ∈ P(R^d) | θ ∈ Θ}, where Θ is the parameter space. A minimal set of conditions for a proper family of distributions is: (i) (Θ, ||·||_Θ) is a Polish space; (ii) Θ is σ-compact, i.e., it is the union of countably many compact subspaces; and (iii) the parameters are identifiable, i.e., µ_θ = µ_θ′ implies θ = θ′. Since the space P_p(R^d) endowed with the distance W_p is a Polish space, we estimate model coefficients using minimum distance estimation (MDE) [Wolfowitz, 1957, Basu et al., 2011], where the distance we consider here is PRW. The main reason why we do not choose IPRW in this setting is computational. The minimum projection robust Wasserstein (MPRW) estimator is defined as follows:

θ̂_n := argmin_{θ∈Θ} PW_{p,k}(µ̂_n, µ_θ).   (2.4)

Note that the probability density function of µ_θ can be difficult to evaluate in practice, especially when µ_θ is a generative model. Nevertheless, in various settings, even if the density is not available, one can generate samples Z_m from µ_θ and use them to approximate µ_θ. With this approximation, a natural alternative is the minimum expected projection robust Wasserstein (MEPRW) estimator, defined as follows [Bernton et al., 2019, Nadjahi et al., 2019]:

θ̂_{n,m} := argmin_{θ∈Θ} E[PW_{p,k}(µ̂_n, µ̂_{θ,m}) | X_n],   (2.5)

where n is the number of samples from the data distribution µ⋆, m is the number of samples from the parametric distribution µ_θ, and µ̂_{θ,m} is an empirical version of µ_θ based on the samples Z_m.
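To make the minimum-distance recipe concrete, here is a toy fit in the spirit of (2.4) with p = 1 and k = 1 (so that PRW is the max-sliced distance). The location model, the grid search, and all names are our own hypothetical choices; the paper's actual computation is deferred to Appendix F:

```python
import numpy as np

def max_sliced_w1(x, y, n_dir=64, seed=0):
    """Lower bound on the max-sliced (k = 1) W_1 distance by maximizing the
    1D order-statistics formula over finitely many random unit directions."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_dir):
        u = rng.standard_normal(x.shape[1])
        u /= np.linalg.norm(u)
        best = max(best, float(np.mean(np.abs(np.sort(x @ u) - np.sort(y @ u)))))
    return best

# Hypothetical model {N(theta * 1_d, I_d)}: fit theta by grid search on the
# MPRW-style objective, using m = n model samples shared across the grid.
rng = np.random.default_rng(1)
d, n = 5, 400
data = 2.0 + rng.standard_normal((n, d))   # observations, true theta* = 2
noise = rng.standard_normal((n, d))        # common randomness for a fair comparison
grid = np.linspace(0.0, 4.0, 21)
losses = [max_sliced_w1(data, theta + noise) for theta in grid]
theta_hat = float(grid[int(np.argmin(losses))])
```

With common randomness across the grid, the objective is minimized near the true θ⋆ = 2; replacing the grid by gradient-based search over θ and the random directions by an ascent over S_{d,k} yields practical versions of the MPRW and MEPRW estimators.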
Existing works have established asymptotic guarantees for minimum Wasserstein and minimum sliced Wasserstein estimators [Bernton et al., 2019, Nadjahi et al., 2019]. Despite the similar proof paths, we remark that our results for the MPRW and MEPRW estimators are new and are derived under weaker assumptions and in more general settings than previous work; see Sections 3.3 and 3.4.

Main Results on Projection Robust Optimal Transport Estimation
Throughout this section, we assume p ≥ 1 and k ∈ [d] unless stated otherwise. Focusing on the IPRW and PRW distances, we prove that they are lower semi-continuous and metrize weak convergence. Through a new sample complexity analysis, we derive the convergence rate of empirical measures under both distances, as well as an improved rate for the PRW distance when µ⋆ satisfies either a Bernstein tail condition or the Poincaré inequality. For generative models with the PRW distance, we study the misspecified setting, where the limit θ⋆ is not necessarily the limit of the maximum likelihood estimator. We establish the asymptotic properties of the MPRW and MEPRW estimators and formulate a central limit theorem when p = 1 and k = 1.

We begin with results on the relationship between the IPRW, PRW and Wasserstein distances. The following lemma demonstrates their equivalence in a topological sense.
Lemma 3.1
The IPRW, PRW and Wasserstein distances are equivalent. In other words, for any sequence of probability measures {µ_i}_{i∈N} and any probability measure µ in P_p(R^d), we have IPW_{p,k}(µ_i, µ) → 0 if and only if PW_{p,k}(µ_i, µ) → 0 if and only if W_p(µ_i, µ) → 0.

Lemma 3.1 is a generalization of Bayraktar and Guo [2019, Theorem 1], where the projection dimension is k = 1. By Lemma 3.1 and Villani [2008, Theorem 6.9], we obtain the following result regarding the topology induced by the PRW distance of order p.

Theorem 3.2
The IPRW and PRW distances both metrize weak convergence. In other words, for any sequence of probability measures {µ_i}_{i∈N} and any probability measure µ in P_p(R^d), we have IPW_{p,k}(µ_i, µ) → 0 if and only if PW_{p,k}(µ_i, µ) → 0 if and only if µ_i ⇒ µ.

Theorem 3.2 generalizes Villani [2008, Theorem 6.9], since the PRW distance coincides with the Wasserstein distance when the projection dimension is k = d. When k = 1, Theorem 3.2 implies that the SW and max-SW distances metrize weak convergence. Note that this implication is stronger than Nadjahi et al. [2019, Theorem 1], which only provides a one-sided argument.

Theorem 3.3
The IPRW and PRW distances are both lower semi-continuous in the usual weak topology. In other words, if the sequences of probability measures {µ_i}_{i∈N}, {ν_i}_{i∈N} ⊆ P(R^d) satisfy µ_i ⇒ µ and ν_i ⇒ ν for probability measures µ, ν ∈ P(R^d), then we have

IPW_{p,k}(µ, ν) ≤ lim inf_{i→+∞} IPW_{p,k}(µ_i, ν_i) and PW_{p,k}(µ, ν) ≤ lim inf_{i→+∞} PW_{p,k}(µ_i, ν_i).

Theorem 3.3 generalizes Nadjahi et al. [2019, Lemma S6] and is pivotal to our subsequent analysis of the asymptotic properties of the MPRW and MEPRW estimators.
We now provide the parametric rate of convergence of the empirical measure under the IPRW and PRW distances. In particular, we present our main result on convergence rates in the following theorem.
Theorem 3.4
Let µ⋆ ∈ P_q(R^d) with M_q := M_q(µ⋆) < +∞. Then we have

E[IPW_{p,k}(µ̂_n, µ⋆)] ≲_{p,q} n^{−[1/((2p)∨k)] ∧ [(1/p)−(1/q)]} (log(n))^{ζ_{p,q,k}/p}, for all n ≥ 2,   (3.1)

where ≲_{p,q} denotes an inequality up to a constant depending only on (p, q), and

ζ_{p,q,k} = 2 if k = q = 2p; 1 if (k ≠ 2p and q = kp/(k−p)) or (q > 2p and k = 2p); 0 otherwise.

Remark 3.1
Theorem 3.4 can be compared with Lei [2020, Theorem 3.1]. The key difference is that our bound does not depend on d, whereas all bounds for the Wasserstein distance grow exponentially worse in d when d ≥ 2p. This improvement makes a qualitative difference, showing that the projection robust distances do not suffer from the curse of dimensionality while retaining flexibility via the choice of k.

Definition 3.1: µ ∈ P(R^d) satisfies a projection Bernstein tail condition if there exist σ, V > 0 such that, for all E ∈ S_{d,k} and X ∼ E_⋆µ, E[|X_i|^r] ≤ (1/2) σ² r! V^{r−2} for all i ∈ [k] and all r ≥ 2.

Theorem 3.5
Suppose that µ⋆ ∈ P_q(R^d) satisfies a projection Bernstein tail condition, and assume the same setting as in Theorem 3.4. For all n ≥ 2, the following inequality holds true:

E[PW_{p,k}(µ̂_n, µ⋆)] ≲_{p,q} n^{−[1/((2p)∨k)] ∧ [(1/p)−(1/q)]} (log(n))^{ζ_{p,q,k}/p} + n^{−1/(2p)} √(dk log(n)) + n^{−1/p} dk log(n).

Definition 3.2: µ ∈ P(R^d) satisfies a projection Poincaré inequality if there exists M > 0 such that, for all E ∈ S_{d,k} and X ∼ E_⋆µ, Var(f(X)) ≤ M E[||∇f(X)||²] for any differentiable f satisfying E[f(X)²] < +∞ and E[||∇f(X)||²] < +∞.

Theorem 3.6
Suppose that µ⋆ ∈ P_q(R^d) satisfies a projection Poincaré inequality, and assume the same setting as in Theorem 3.4. For all n ≥ 2, the following inequality holds true:

E[PW_{p,k}(µ̂_n, µ⋆)] ≲_{p,q} n^{−[1/((2p)∨k)] ∧ [(1/p)−(1/q)]} (log(n))^{ζ_{p,q,k}/p} + n^{−1/(2∨p)} √(dk log(n)) + n^{−1/p} dk log(n).

Remark 3.2
Theorems 3.5 and 3.6 present the parametric rate under the PRW distance when µ⋆ satisfies certain conditions. While the first term matches that in Theorem 3.4, the extra two terms come from bounding the gap E[sup_{E∈S_{d,k}} (W_p(E_⋆µ̂_n, E_⋆µ⋆) − E[W_p(E_⋆µ̂_n, E_⋆µ⋆)])]. Compared with Niles-Weed and Rigollet [2019, Theorem 8], in which µ⋆ is assumed to satisfy the Talagrand transport inequality, our conditions in Definitions 3.1 and 3.2 are strictly weaker, yet our parametric rate matches their n^{−1/k} + n^{−1/2}√(dk log(n)) rate in the large-n limit when p = 1.

We next present concentration results when µ⋆ satisfies stronger conditions than those of Definitions 3.1 and 3.2.

Definition 3.3: µ ∈ P(R^d) satisfies a Bernstein tail condition if there exist σ, V > 0 such that, for Y ∼ µ, E[sup_{E∈S_{d,k}} |(E^⊤Y)_i|^r] ≤ (1/2) σ² r! V^{r−2} for all i ∈ [k] and all r ≥ 2.

Theorem 3.7: If µ⋆ ∈ P(R^d) satisfies a Bernstein tail condition, then the following statement holds true for both W = IPW_{p,k} and W = PW_{p,k}:

P(|W(µ̂_n, µ⋆) − E[W(µ̂_n, µ⋆)]| ≥ t) ≤ 2 exp( − t² / (2σ² n^{−2/p} + 4tV n^{−1/p}) ).   (3.3)

Definition 3.4: µ ∈ P(R^d) satisfies a Poincaré inequality if there exists M > 0 such that, for X ∼ µ, Var[f(X)] ≤ M E[||∇f(X)||²] for any differentiable f satisfying E[f(X)²] < +∞ and E[||∇f(X)||²] < +∞.

Theorem 3.8: If µ⋆ ∈ P(R^d) satisfies a Poincaré inequality, then the following statement holds true for both W = IPW_{p,k} and W = PW_{p,k}:

P(|W(µ̂_n, µ⋆) − E[W(µ̂_n, µ⋆)]| ≥ t) ≤ 2 exp( − K^{−1} min{ n^{1/p} t, n^{2/(2∨p)} t² } ),   (3.4)

where K > 0 depends only on the constant M defined in Definition 3.4.

Remark 3.3
Theorem 3.8 provides strictly better bounds than Theorem 3.7 when p > 2. Moreover, our tail condition in Definition 3.3 is stronger than that in Definition 3.1, yet weaker than the standard Bernstein tail condition, in which X ∼ µ appears inside the expectation without a supremum; see Wainwright [2019]. The Poincaré inequality is usually weaker than the log-Sobolev inequality and is satisfied by various exponential measures and by measures induced by structured Markov processes [Ledoux, 1999].

We now derive the asymptotic properties of the MPRW and MEPRW estimators under model misspecification and data dependence, both of which are common in practice. Our setting is more general than that considered in [Nadjahi et al., 2019], and our results provide theory to support applications in real-world scenarios.
Assumption 3.1
There exists a probability measure µ⋆ ∈ P(R^d) such that the data-generating process satisfies lim_{n→+∞} W_p(µ̂_n, µ⋆) = 0 almost surely.

Assumption 3.2
The map θ ↦ µ_θ is continuous: ||θ_n − θ||_Θ → 0 implies µ_{θ_n} ⇒ µ_θ.

Assumption 3.3
There exists a constant τ > 0 such that the set Θ⋆(τ) ⊆ Θ is bounded, where Θ⋆(τ) = {θ ∈ Θ : PW_{p,k}(µ⋆, µ_θ) ≤ inf_{θ′∈Θ} PW_{p,k}(µ⋆, µ_θ′) + τ}.

Theorem 3.9
Under Assumptions 3.1-3.3, there exists an event Ω with P(Ω) = 1 such that, for all ω ∈ Ω,

lim_{n→+∞} inf_{θ∈Θ} PW_{p,k}(µ̂_n(ω), µ_θ) = inf_{θ∈Θ} PW_{p,k}(µ⋆, µ_θ) and
lim sup_{n→+∞} argmin_{θ∈Θ} PW_{p,k}(µ̂_n(ω), µ_θ) ⊆ argmin_{θ∈Θ} PW_{p,k}(µ⋆, µ_θ).   (3.5)

There also exists n(ω) > 0 such that argmin_{θ∈Θ} PW_{p,k}(µ̂_n(ω), µ_θ) ≠ ∅ for all n ≥ n(ω).

Assumption 3.4: If ||θ_n − θ||_Θ → 0, then E[W_p(µ̂_{θ_n,n}, µ_{θ_n}) | X_n] → 0.

In the next result, we present an analogous version of Theorem 3.9 for the MEPRW estimator as min{n, m} → +∞. For simplicity, we set m := m(n) such that m(n) → +∞ as n → +∞.

Theorem 3.10
Under Assumptions 3.1-3.4, there exists an event Ω with P(Ω) = 1 such that, for all ω ∈ Ω,

lim_{n→+∞} inf_{θ∈Θ} E[PW_{p,k}(µ̂_n(ω), µ̂_{θ,m(n)}) | X_n] = inf_{θ∈Θ} PW_{p,k}(µ⋆, µ_θ) and
lim sup_{n→+∞} argmin_{θ∈Θ} E[PW_{p,k}(µ̂_n(ω), µ̂_{θ,m(n)}) | X_n] ⊆ argmin_{θ∈Θ} PW_{p,k}(µ⋆, µ_θ).   (3.6)

There also exists n(ω) > 0 such that argmin_{θ∈Θ} E[PW_{p,k}(µ̂_n(ω), µ̂_{θ,m(n)}) | X_n] ≠ ∅ for all n ≥ n(ω).

Assumption 3.5: There exists a constant τ > 0 such that the set Θ_n(τ) ⊆ Θ is bounded, where Θ_n(τ) = {θ ∈ Θ : PW_{p,k}(µ̂_n, µ_θ) ≤ inf_{θ′∈Θ} PW_{p,k}(µ̂_n, µ_θ′) + τ}.

Theorem 3.11
Under Assumptions 3.2, 3.4 and 3.5, the following statements hold true:

lim_{m→+∞} inf_{θ∈Θ} E[PW_{p,k}(µ̂_n, µ̂_{θ,m}) | X_n] = inf_{θ∈Θ} PW_{p,k}(µ̂_n, µ_θ),   (3.7)
lim sup_{m→+∞} argmin_{θ∈Θ} E[PW_{p,k}(µ̂_n, µ̂_{θ,m}) | X_n] ⊆ argmin_{θ∈Θ} PW_{p,k}(µ̂_n, µ_θ).   (3.8)

There also exists m_n > 0 such that argmin_{θ∈Θ} E[PW_{p,k}(µ̂_n, µ̂_{θ,m}) | X_n] ≠ ∅ for m ≥ m_n.

In summary, the MPRW and MEPRW estimators both asymptotically converge to θ⋆ ∈ Θ, a minimizer of θ ↦ PW_{p,k}(µ⋆, µ_θ), assuming such a minimizer exists. Moreover, θ⋆ is not necessarily the limit of the maximum likelihood estimator, and it satisfies µ_{θ⋆} = µ⋆ in the well-specified setting.

We now investigate the asymptotic distribution of the MPRW estimator under model misspecification and establish the rate of convergence when k = p = 1. For any u ∈ S^{d−1} and t ∈ R, we define

F_θ(u, t) = ∫_{R^d} 1_{(−∞,t]}(⟨u, x⟩) dµ_θ(x),   F̂_n(u, t) = (1/n) |{i ∈ [n] : ⟨u, X_i⟩ ≤ t}|.   (3.9)

The functions F_θ(u, ·) and F̂_n(u, ·) are the cumulative distribution functions of u_⋆µ_θ and u_⋆µ̂_n, where u ∈ S^{d−1} is a unit vector. Let L¹(S^{d−1} × R) be the class of functions f on S^{d−1} × R such that f(·, t) is continuous and f(u, ·) is absolutely integrable, equipped with the norm ||f||_{L¹} = sup_{u∈S^{d−1}} ∫_R |f(u, t)| dt.

Assumption 3.6 is strictly weaker than a norm-differentiation condition, in which D⋆ would have to be nonsingular. Assumption 3.7 permits model misspecification, in which there is no θ⋆ ∈ Θ such that F_{θ⋆} = F⋆, and is thus more general than Nadjahi et al. [2019, A8]. Assumption 3.8 encodes local strong identifiability of the model µ_θ around θ⋆ and is necessary for the fast n^{−1/2} rate under model misspecification. (Bernton et al. [2019] assume the analogous condition for the Wasserstein distance; however, their analysis depends on a much stronger version with N = Θ.) Thanks to Assumption 3.8, we do not require the parameters to be weakly separable in the PRW sense.

Assumption 3.6
There exists a measurable function D⋆ : S^{d−1} × R → R^{d_θ} such that ||F_θ(u, t) − F_{θ⋆}(u, t) − ⟨θ − θ⋆, D⋆(u, t)⟩||_{L¹} = o(||θ − θ⋆||_Θ).

Assumption 3.7
There exists a random element G⋆ : S^{d−1} × R → R such that the stochastic process √n(F̂_n − F⋆) converges weakly in L¹(S^{d−1} × R) to G⋆.

Assumption 3.8
There exists a neighborhood N of θ⋆ ∈ Θ and a positive constant c⋆ such that PW_{1,1}(µ_θ, µ⋆) ≥ PW_{1,1}(µ_{θ⋆}, µ⋆) + c⋆ ||θ − θ⋆||_Θ for all θ ∈ N.

As pointed out by Nadjahi et al. [2019], one can prove that Assumption 3.7 holds in general by extending [Dede, 2009, Proposition 3.5] and [del Barrio et al., 1999, Theorem 2.1(a)] under some mild conditions on the tails of u_⋆µ⋆.

Remark 3.4: In the well-specified setting, where there exists θ⋆ ∈ Θ such that F⋆ = F_{θ⋆}, it is straightforward to derive the norm-differentiation condition from Assumptions 3.6 and 3.8. This is not true, however, under model misspecification. Moreover, there are minor technical issues in the proof of Bernton et al. [2019, Theorem B.8]; see Appendix E.4. Fixing them is easy but requires additional assumptions; fortunately, we can overcome the issue using some new techniques. Thus, with some refinement, our results can be interpreted as an improvement over Bernton et al. [2019] with fewer assumptions.

To study the asymptotic distributions in the misspecified setting, we employ definitions from Pollard [1980, Section 7]. (Note, however, that our proof technique is different from that of Pollard [1980], which depends on the nonsingularity of D⋆ and requires µ⋆ = µ_{θ⋆} for some θ⋆ in the interior of Θ.)

Definition 3.5 (Hausdorff metric)
Let S be the class of convex and compact sets in L¹(S^{d−1} × R), equipped with ||·||_{L¹}. The Hausdorff metric on S is defined by d_H(S_1, S_2) = inf{δ > 0 : S_1 ⊆ S_2^δ and S_2 ⊆ S_1^δ}, where S^δ = ∪_{x∈S} {z ∈ L¹(S^{d−1} × R) : ||z − x||_{L¹} ≤ δ}.

Definition 3.6 (Approximate MPRW estimators)
The set of approximate MPRW estimators is defined by M_n = {θ ∈ Θ : PW_{1,1}(µ̂_n, µ_θ) ≤ inf_{θ′∈Θ} PW_{1,1}(µ̂_n, µ_θ′) + η_n/√n}, where η_n > 0 is any sequence such that P(η_n → 0) = 1 and M_n is nonempty.

Theorem 3.12
Suppose that Assumptions 3.1-3.3 and 3.6-3.8 hold for some θ⋆ in the interior of Θ, and let G_n = √n(F̂_n − F_{θ⋆}) and G⋆_n = G⋆ + √n(F⋆ − F_{θ⋆}). We also define K(x, β) = {θ ∈ N₀ : ||x − √n⟨θ − θ⋆, D⋆⟩||_{L¹} ≤ inf_{θ′∈N₀} ||x − √n⟨θ′ − θ⋆, D⋆⟩||_{L¹} + β}, where

N₀ = {θ ∈ N : ||F_θ − F_{θ⋆} − ⟨θ − θ⋆, D⋆⟩||_{L¹} / ||θ − θ⋆||_Θ ≤ c⋆}.   (3.10)

Then there exists a sequence satisfying lim_{n→+∞} β_n = 0 such that P⋆(M_n ⊆ K(G_n, β_n)) → 1 as n → +∞. For any ε > 0, we have P(d_H(K(G⋆_n, 0), K(G_n, β_n)) < ε) → 1 as n → +∞.

Remark 3.5
Since K(G⋆_n, 0) = argmin_{θ∈N₀} ||G⋆ + √n(F⋆ − F_{θ⋆} − ⟨θ − θ⋆, D⋆⟩)||_{L¹}, Theorem 3.12 says that the distributional limit of the approximate MPRW estimator set is close, in the Hausdorff metric, to the limit of the sets argmin_{θ∈N₀} ||G⋆ + √n(F⋆ − F_{θ⋆} − ⟨θ − θ⋆, D⋆⟩)||_{L¹}. This result provides a theoretical guarantee for generative modeling with the max-sliced Wasserstein distance.

Remark 3.6
In the well-specified setting, Assumption 3.8 can be replaced by Assumptions A.1-A.2. Under certain conditions, we derive a central limit theorem (cf. Theorem A.3) analogous to Nadjahi et al. [2019, Theorem 6] for minimum sliced Wasserstein estimators; see Appendix A for the details.
We empirically validate our theoretical findings through several experiments on synthetic and real data. Given the space limit, we present the experimental setup in Appendix G and explain an optimization algorithm for computing the PRW distance and the associated estimators in Appendix F. (Here P⋆ denotes the (inner) probability; see Pollard [1980] for details.)

Figure 1: Mean values (top) and mean computational times (bottom) of the IPRW and PRW distances of order 2 between the empirical measures µ̂_n and ν̂_n as the number of points n varies. Results are averaged over 100 runs.

Convergence and concentration.
We set µ = ν = U([−v, v]^d), the uniform distribution over a hypercube, and study the convergence and computation of IPW_{2,k}(µ̂_n, ν̂_n) and PW_{2,k}(µ̂_n, ν̂_n) for increasing values of n. Figure 1 presents the average distances and computational times for (d, v) with d ∈ {10, 30, 50}, where the shaded areas show the max-min values over 100 runs. First, the IPRW distance is significantly smaller than the PRW distance for small n, especially when d and v are large. This confirms Theorem 3.4, which shows that the IPRW rate is independent of d. Second, the PRW distance nearly matches the IPRW distance when n is large. This confirms Theorem 3.6, since the uniform distribution, having a bounded domain, satisfies the Poincaré inequality. Finally, the current computation of the PRW distance is faster than that of the IPRW distance.

Figure 3: Probability density of the estimates of the centered and rescaled σ̂_n on the Gaussian model.

Model misspecification.
We consider parametric inference using Gaussian models $\mathcal{M} = \{\mathcal{N}(m, \sigma^2 I) : m \in \mathbb{R}^d,\ \sigma > 0\}$ and a collection of i.i.d. observations generated from a mixture of 8 Gaussian distributions. This simple setting is useful since the closed-form expression of the Gaussian density makes the computation of the MPRW estimator of order 1 tractable in practice. Following the setup of Nadjahi et al. [2019, Section 4], we illustrate the consistency of the MPRW and MEPRW estimators of order 1 and the convergence of the MEPRW estimator of order 1 to the MPRW estimator of order 1. Results are shown in Figure 2, where $m^\star = \hat{m}$; they are consistent with Theorems 3.9, 3.10 and 3.11.

Figure 2: (a) MPRW vs. $n$; (b) MEPRW vs. $n = m$; (c) MEPRW with $n = 2000$ vs. $m$. Minimum PRW and expected PRW estimations using Gaussian models and $n$ samples from the mixture of 8 Gaussian distributions. Results are averaged over 100 runs and shaded areas represent standard deviations.

Despite the model misspecification, our estimators still converge as the number of observations increases, and the MEPRW estimator converges to the MPRW estimator as we generate more samples. We also verify our central limit theorem by estimating the density of $\hat{\sigma}_n$ with a kernel density estimator over 100 runs. Figure 3 shows the distribution centered and rescaled by $\sqrt{n}$ for each $n$, where $\sigma^\star = \hat{\sigma}$, and it confirms the convergence rate derived in Theorem 3.12. We refer interested readers to Appendix H for further results on i.i.d. observations generated from mixtures of 12 or 25 Gaussian distributions and from elliptically contoured stable models.

Figure 4: Mean test loss for different values of $(n, m)$ on
ImageNet200.

Generative modeling.
We conduct experiments on image generation using the PRW generator of order 2, as an alternative to the SW generator [Deshpande et al., 2018]. We train the neural networks (NNs) with $(n, m) \in \{(100, \cdot), (1000, \cdot), (5000, \cdot), (10000, \cdot)\}$, where $n$ is the number of training samples and $m$ is the number of generated samples. We compare their testing losses to that of a NN trained using the whole training dataset and $m = 200$. All the testing losses are evaluated using the trained models on the testing dataset with $m = 250$ generated samples. Figure 4 presents the mean testing loss on ImageNet200 over 10 runs, where the shaded areas show the max-min values over the runs. We defer the results on other datasets to Appendix G and H.
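The projected distances driving the experiments above are simple to approximate. The sketch below is an illustration under our own naming, not the paper's Riemannian-optimization algorithm of Appendix F: it estimates the order-2 IPRW distance with $k = 1$ by averaging one-dimensional projected Wasserstein distances over random directions, and a crude finite-direction surrogate (lower bound) for the PRW distance by maximizing over the same directions.

```python
import numpy as np

def w2_1d(x, y):
    # W_2 between two 1-D empirical measures with the same number of
    # points: sort both samples and match order statistics.
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))

def projected_w2(X, Y, n_proj=100, seed=0):
    # Project both point clouds onto n_proj random unit directions
    # (k = 1) and return the 1-D W_2 distance along each direction.
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n_proj, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    return np.array([w2_1d(X @ u, Y @ u) for u in U])

def iprw2(X, Y, **kw):
    # Monte Carlo IPRW of order 2: average W_2^2 over directions, then
    # take the 1/2 power, matching the (E[W_2^2])^{1/2} definition.
    return np.mean(projected_w2(X, Y, **kw) ** 2) ** 0.5

def prw2_lower_bound(X, Y, **kw):
    # Max over the sampled directions: a lower bound on the PRW
    # distance with k = 1 (the true PRW optimizes over all subspaces).
    return np.max(projected_w2(X, Y, **kw))
```

Since a maximum dominates a root-mean-square average over the same directions, `prw2_lower_bound` is always at least `iprw2`, mirroring the ordering of the PRW and IPRW curves in Figure 1.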
Acknowledgements
This work was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.
References
P-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009. (Cited on pages 37 and 38.)

J. Adler and S. Lunz. Banach Wasserstein GAN. In NIPS, pages 6754–6763, 2018. (Cited on page 1.)

C. D. Aliprantis and K. C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer Science & Business Media, 2006. (Cited on page 25.)

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, pages 214–223, 2017. (Cited on page 1.)

F. Bassetti, A. Bodini, and E. Regazzini. On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76(12):1298–1302, 2006. (Cited on page 1.)

A. Basu, H. Shioya, and C. Park. Statistical Inference: The Minimum Distance Approach. CRC Press, 2011. (Cited on pages 1 and 4.)

E. Bayraktar and G. Guo. Strong equivalence between metrics of Wasserstein type. ArXiv Preprint: 1912.08247, 2019. (Cited on page 5.)

E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. On parameter estimation with the Wasserstein distance. Information and Inference: A Journal of the IMA, 8(4):657–676, 2019. (Cited on pages 1, 2, 3, 4, 8, 9, 25, 33, and 37.)

P. Billingsley. Convergence of Probability Measures. John Wiley & Sons, 2013. (Cited on page 17.)

N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015. (Cited on pages 2, 3, and 39.)

N. Bonneel, G. Peyré, and M. Cuturi. Wasserstein barycentric coordinates: histogram regression using optimal transport. ACM Transactions on Graphics, 35(4):71:1–71:10, 2016. (Cited on page 1.)

N. Bonnotte. Unidimensional and Evolution Methods for Optimal Transportation. PhD thesis, Paris 11, 2013. (Cited on pages 2 and 3.)
L. D. Brown and R. Purves. Measurable selections of extrema. The Annals of Statistics, 1(5):902–912, 1973. (Cited on page 26.)

J. Cao, L. Mo, Y. Zhang, K. Jia, C. Shen, and M. Tan. Multi-marginal Wasserstein GAN. In NeurIPS, pages 1774–1784, 2019. (Cited on page 1.)

M. Carriere, M. Cuturi, and S. Oudot. Sliced Wasserstein kernel for persistence diagrams. In ICML, pages 664–673, 2017. (Cited on page 2.)

M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In ICML, pages 685–693, 2014. (Cited on page 1.)

M. Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. In NeurIPS, pages 2292–2300, 2013. (Cited on page 37.)

S. Dede. An empirical central limit theorem in L1 for stationary sequences. Stochastic Processes and Their Applications, 119(10):3494–3515, 2009. (Cited on page 8.)

E. del Barrio, E. Giné, and C. Matrán. Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Annals of Probability, pages 1009–1071, 1999. (Cited on page 8.)

I. Deshpande, Z. Zhang, and A. G. Schwing. Generative modeling using the sliced Wasserstein distance. In CVPR, pages 3483–3491, 2018. (Cited on pages 2, 11, and 43.)

I. Deshpande, Y-T. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, Z. Zhao, D. Forsyth, and A. G. Schwing. Max-sliced Wasserstein distance and its use for GANs. In CVPR, pages 10648–10656, 2019. (Cited on pages 2, 3, and 43.)

R. M. Dudley. The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50, 1969. (Cited on pages 2 and 4.)

R. Flamary and N. Courty. POT: Python Optimal Transport library, 2017. URL https://github.com/rflamary/POT. (Cited on pages 37 and 42.)

N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015. (Cited on pages 2, 4, and 20.)
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NeurIPS, pages 5767–5777, 2017. (Cited on page 1.)

T. Hashimoto, D. Gifford, and T. Jaakkola. Learning population-level diffusions with generative RNNs. In ICML, pages 2417–2426, 2016. (Cited on page 1.)

N. Ho, X. Nguyen, M. Yurochkin, H. H. Bui, V. Huynh, and D. Phung. Multilevel clustering via Wasserstein means. In ICML, pages 1501–1509, 2017. (Cited on page 1.)

H. Janati, T. Bazeille, B. Thirion, M. Cuturi, and A. Gramfort. Multi-subject MEG/EEG source imaging with sparse multi-task regression. NeuroImage, page 116847, 2020. (Cited on page 1.)

D. P. Kingma and J. Ba. ADAM: a method for stochastic optimization. In ICLR, 2015. (Cited on pages 40, 43, and 45.)

S. Kolouri, Y. Zou, and G. K. Rohde. Sliced Wasserstein kernels for probability distributions. In CVPR, pages 5258–5267, 2016. (Cited on page 2.)

S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. Rohde. Generalized sliced Wasserstein distances. In NeurIPS, pages 261–272, 2019a. (Cited on pages 2 and 3.)

S. Kolouri, P. E. Pope, C. E. Martin, and G. K. Rohde. Sliced Wasserstein auto-encoders. In ICLR, 2019b. (Cited on pages 2 and 3.)

M. Ledoux. Concentration of measure and logarithmic Sobolev inequalities. In Séminaire de Probabilités XXXIII, pages 120–216. Springer, 1999. (Cited on pages 7 and 21.)

J. Lei. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli, 26(1):767–798, 2020. (Cited on pages 2, 6, and 20.)

X. Li, S. Chen, Z. Deng, Q. Qu, Z. Zhu, and A. M-C. So. Nonsmooth optimization over Stiefel manifold: Riemannian subgradient methods. ArXiv Preprint: 1911.05047, 2019. (Cited on page 38.)

T. Lin, C. Fan, N. Ho, M. Cuturi, and M. I. Jordan. Projection robust Wasserstein distance and Riemannian optimization. ArXiv Preprint: 2006.07458, 2020. (Cited on pages 2 and 37.)

H. Liu, A. M-C. So, and W. Wu. Quadratic optimization with orthogonality constraint: explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods. Mathematical Programming, 178(1-2):215–262, 2019. (Cited on page 38.)

A. Liutkus, U. Simsekli, S. Majewski, A. Durmus, and F-R. Stöter. Sliced-Wasserstein flows: nonparametric generative modeling via optimal transport and diffusions. In ICML, pages 4104–4113, 2019. (Cited on pages 2 and 3.)
G. Montavon, K-R. Müller, and M. Cuturi. Wasserstein training of restricted Boltzmann machines. In NIPS, pages 3718–3726, 2016. (Cited on page 1.)

K. Nadjahi, A. Durmus, U. Simsekli, and R. Badeau. Asymptotic guarantees for learning generative models with the sliced-Wasserstein distance. In NeurIPS, pages 250–260, 2019. (Cited on pages 3, 4, 5, 7, 8, 9, 10, 25, 39, and 40.)

K. Nadjahi, A. Durmus, L. Chizat, S. Kolouri, S. Shahrampour, and U. Şimşekli. Statistical and topological properties of sliced probability divergences. ArXiv Preprint: 2003.05783, 2020. (Cited on pages 2, 3, and 4.)

K. Nguyen, N. Ho, T. Pham, and H. Bui. Distributional sliced-Wasserstein and applications to generative modeling. ArXiv Preprint: 2002.07367, 2020. (Cited on page 3.)

J. Niles-Weed and P. Rigollet. Estimation of Wasserstein distances in the spiked transport model. ArXiv Preprint: 1909.07513, 2019. (Cited on pages 2, 3, 4, 6, 20, and 37.)

J. P. Nolan. Multivariate elliptically contoured stable distributions: theory and estimation. Computational Statistics, 28(5):2067–2089, 2013. (Cited on page 40.)

V. M. Panaretos and Y. Zemel. Statistical aspects of Wasserstein distances. Annual Review of Statistics and its Application, 6:405–431, 2019. (Cited on page 1.)

F-P. Paty and M. Cuturi. Subspace robust Wasserstein distances. In ICML, pages 5072–5081, 2019. (Cited on pages 2, 3, 37, and 42.)

G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019. (Cited on page 1.)

D. Pollard. The minimum distance method of testing. Metrika, 27(1):43–70, 1980. (Cited on pages 9, 33, 34, and 37.)

J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011. (Cited on page 2.)

S. T. Rachev and L. Rüschendorf. Mass Transportation Problems: Volume I: Theory, volume 1. Springer Science & Business Media, 1998. (Cited on page 3.)

A. Ramdas, N. G. Trillos, and M. Cuturi. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017. (Cited on page 1.)

R. T. Rockafellar and R. J-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009. (Cited on pages 25 and 26.)

G. Samoradnitsky. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Routledge, 2017. (Cited on page 40.)

G. Schiebinger, J. Shu, M. Tabaka, B. Cleary, V. Subramanian, A. Solomon, S. Liu, S. Lin, P. Berube, L. Lee, et al. Reconstruction of developmental landscapes by optimal-transport analysis of single-cell gene expression sheds light on cellular reprogramming. bioRxiv, page 191056, 2017. (Cited on page 1.)
M. Talagrand. Transportation cost for Gaussian and other product measures. Geometric & Functional Analysis GAFA, 6(3):587–600, 1996. (Cited on page 4.)

I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. In ICLR, 2018. (Cited on page 1.)

A. Tong, J. Huang, G. Wolf, D. van Dijk, and S. Krishnaswamy. TrajectoryNet: a dynamic optimal transport network for modeling cellular dynamics. ArXiv Preprint: 2002.04461, 2020. (Cited on page 1.)

C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008. (Cited on pages 1, 4, 5, and 17.)

M. J. Wainwright. High-dimensional Statistics: A Non-asymptotic Viewpoint, volume 48. Cambridge University Press, 2019. (Cited on pages 7, 20, 22, and 23.)

J. Weed and F. Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019. (Cited on pages 2, 4, and 20.)

J. Wolfowitz. The minimum distance method. The Annals of Mathematical Statistics, pages 75–88, 1957. (Cited on pages 1 and 4.)

J. Wu, Z. Huang, D. Acharya, W. Li, J. Thoma, D. P. Paudel, and L. V. Gool. Sliced Wasserstein generative models. In CVPR, pages 3713–3722, 2019. (Cited on page 2.)

K. D. Yang, K. Damodaran, S. Venkatachalapathy, A. C. Soylemezoglu, G. V. Shivashankar, and C. Uhler. Predicting cell lineages using autoencoders and optimal transport. PLoS Computational Biology, 16(4):e1007828, 2020. (Cited on page 1.)

J. Ye, P. Wu, J. Z. Wang, and J. Li. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317–2332, 2017. (Cited on page 1.)

A Further Results on the MPRW and MEPRW Estimators
In this section, we discuss the measurability of the MPRW and MEPRW estimators. For a generic function $f$ on the domain $\mathcal{X}$, we define $\delta\text{-argmin}_{x \in \mathcal{X}} f = \{x \in \mathcal{X} : f(x) \leq \inf_{x \in \mathcal{X}} f + \delta\}$. Our results are summarized in the following two theorems.

Theorem A.1
Under Assumption 3.1, for any $n \geq 1$ and $\delta > 0$, there exists a Borel measurable function $\hat{\theta}_n : \Omega \to \Theta$ such that
$$\hat{\theta}_n(\omega) \in \begin{cases} \operatorname{argmin}_{\theta \in \Theta} PW_{p,k}(\hat{\mu}_n(\omega), \mu_\theta) & \text{if this set is nonempty}, \\ \delta\text{-}\operatorname{argmin}_{\theta \in \Theta} PW_{p,k}(\hat{\mu}_n(\omega), \mu_\theta) & \text{otherwise}. \end{cases}$$

Theorem A.2
Under Assumption 3.1, for any $n \geq 1$, $m \geq 1$ and $\delta > 0$, there exists a Borel measurable function $\hat{\theta}_{n,m} : \Omega \to \Theta$ such that
$$\hat{\theta}_{n,m}(\omega) \in \begin{cases} \operatorname{argmin}_{\theta \in \Theta} \mathbb{E}[PW_{p,k}(\hat{\mu}_n(\omega), \hat{\mu}_{\theta,m}) \mid X_n] & \text{if this set is nonempty}, \\ \delta\text{-}\operatorname{argmin}_{\theta \in \Theta} \mathbb{E}[PW_{p,k}(\hat{\mu}_n(\omega), \hat{\mu}_{\theta,m}) \mid X_n] & \text{otherwise}. \end{cases}$$

We also present the asymptotic distribution of the goodness-of-fit statistic as well as the MPRW estimator in the well-specified setting and establish the rate of convergence. For this we require the well-separability of the model in Assumption A.1 and the non-singularity of $D^\star$ in Assumption A.2 in place of the local strong identifiability in Assumption 3.8.

Assumption A.1
For any $\epsilon > 0$, there exists $\delta > 0$ such that $\inf_{\theta \in \Theta: \|\theta - \theta^\star\|_\Theta \geq \epsilon} PW_{1,1}(\mu_{\theta^\star}, \mu_\theta) > \delta$.
There exists a non-singular $D^\star$ such that Assumption 3.6 holds true.
Suppose that $\mu^\star = \mu_{\theta^\star}$ for some $\theta^\star$ in the interior of $\Theta$. Under Assumptions 3.1-3.3, 3.6-3.7 and A.1-A.2, the goodness-of-fit statistic satisfies
$$\sqrt{n}\, \inf_{\theta \in \Theta} PW_{1,1}(\hat{\mu}_n, \mu_\theta) \;\Rightarrow\; \inf_{\theta \in \Theta} \max_{u \in S^{d-1}} \int_{\mathbb{R}} \left| G^\star(u, t) - \langle \theta, D^\star(u, t) \rangle \right| dt, \quad \text{as } n \to +\infty.$$
Suppose also that the random map $\theta \mapsto \max_{u \in S^{d-1}} \int_{\mathbb{R}} |G^\star(u, t) - \langle \theta, D^\star(u, t) \rangle|\, dt$ has a unique infimum almost surely. Then the MPRW estimator of order 1 satisfies
$$\sqrt{n}\, (\hat{\theta}_n - \theta^\star) \;\Rightarrow\; \operatorname{argmin}_{\theta \in \Theta} \max_{u \in S^{d-1}} \int_{\mathbb{R}} \left| G^\star(u, t) - \langle \theta, D^\star(u, t) \rangle \right| dt, \quad \text{as } n \to +\infty.$$
Both weak convergence results are valid for the metric induced by the norm $\|\cdot\|_{L_1}$.

B Postponed Proofs in Subsection 3.1
This section lays out the detailed proofs of Lemma 3.1 and Theorems 3.2 and 3.3.

B.1 Preliminary technical results
For completeness, we collect several preliminary technical results which will be used in the proofs.

Theorem B.1 (Prokhorov's theorem)
Let $\mathcal{P}(\mathbb{R}^d)$ denote the collection of all probability measures defined on $\mathbb{R}^d$ with the Borel $\sigma$-algebra, and let $\{\mu_i\}_{i \in \mathbb{N}}$ be a tight sequence in $\mathcal{P}(\mathbb{R}^d)$. Then every subsequence of $\{\mu_i\}_{i \in \mathbb{N}}$ has a further subsequence that converges weakly in $\mathcal{P}(\mathbb{R}^d)$. Moreover, if every weakly convergent subsequence has the same limit, the whole sequence converges weakly to this limit.

Theorem B.2 (Theorem 4.1 in Villani [2008])
Let $(\mathcal{X}, \mu)$ and $(\mathcal{Y}, \nu)$ be two Polish probability spaces; let $a : \mathcal{X} \to \mathbb{R} \cup \{-\infty\}$ and $b : \mathcal{Y} \to \mathbb{R} \cup \{-\infty\}$ be upper semi-continuous functions such that $a$ and $b$ are absolutely integrable with respect to $\mu$ and $\nu$ respectively. Let $c : \mathcal{X} \times \mathcal{Y} \to \mathbb{R} \cup \{+\infty\}$ be lower semi-continuous, such that $c(x, y) \geq a(x) + b(y)$ for all $(x, y)$. Then there exists an optimal coupling $\pi \in \Pi(\mu, \nu)$ which minimizes the total cost $\mathbb{E}[c(X, Y)]$.

Lemma B.3 (Lemma 4.4 in Villani [2008])
Let $\mathcal{X}$ and $\mathcal{Y}$ be two Polish spaces. Let $\mathcal{P} \subseteq \mathcal{P}(\mathcal{X})$ and $\mathcal{Q} \subseteq \mathcal{P}(\mathcal{Y})$ be tight subsets of $\mathcal{P}(\mathcal{X})$ and $\mathcal{P}(\mathcal{Y})$ respectively. Then the set of all transportation plans whose marginals lie in $\mathcal{P}$ and $\mathcal{Q}$ respectively is itself tight in $\mathcal{P}(\mathcal{X} \times \mathcal{Y})$.

Theorem B.4 (Theorem 6.9 in Villani [2008])
Let $(\mathcal{X}, d)$ be a Polish space and $p \in [1, +\infty)$. The Wasserstein distance $W_p$ metrizes weak convergence in $\mathcal{P}_p(\mathcal{X})$. That is, if $\{\mu_i\}_{i \in \mathbb{N}}$ is a sequence of measures in $\mathcal{P}_p(\mathcal{X})$ and $\mu \in \mathcal{P}_p(\mathcal{X})$, then $\mu_i \Rightarrow \mu$ if and only if $W_p(\mu_i, \mu) \to 0$.

Definition B.1 (Lower semi-continuity)
We say that $f : \mathcal{X} \to \mathbb{R}$ is lower semi-continuous if for any $x_0 \in \mathcal{X}$ and any $y < f(x_0)$, there exists a neighborhood $U$ of $x_0$ such that $f(x) > y$ for all $x \in U$. In the case of a metric space, this is equivalent to $\liminf_{x \to x_0} f(x) \geq f(x_0)$ for any $x_0 \in \mathcal{X}$.

B.2 Proof of Lemma 3.1
We first show that, for any $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$, the following inequality holds true:
$$\underline{PW}_{p,k}(\mu, \nu) \leq PW_{p,k}(\mu, \nu) \leq W_p(\mu, \nu). \tag{B.1}$$
Indeed, by the definitions of $\underline{PW}_{p,k}$ and $PW_{p,k}$, the first inequality is trivial. For the second inequality, we derive from the definition of $PW_{p,k}$ that
$$PW_{p,k}^p(\mu, \nu) = \sup_{E \in \mathcal{S}_{d,k}} W_p^p(E_\star \mu, E_\star \nu) = \sup_{E \in \mathcal{S}_{d,k}} \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|E^\top (x - y)\|^p\, d\pi(x, y).$$
Since $E \in \mathcal{S}_{d,k}$, we have $\|E^\top (x - y)\| \leq \|x - y\|$. Thus $PW_{p,k}^p(\mu, \nu) \leq W_p^p(\mu, \nu)$. Putting these pieces together yields Eq. (B.1). For any sequence $\{\mu_i\}_{i \in \mathbb{N}} \subseteq \mathcal{P}_p(\mathbb{R}^d)$ and $\mu \in \mathcal{P}_p(\mathbb{R}^d)$, we conclude from Eq. (B.1) that $W_p(\mu_i, \mu) \to 0$ implies $PW_{p,k}(\mu_i, \mu) \to 0$, which in turn implies $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$. It therefore remains to show that $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$ implies $W_p(\mu_i, \mu) \to 0$.

Indeed, we first prove that $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$ implies $\mu_i \Rightarrow \mu$. Let $Z_i \sim \mu_i$; then $E^\top Z_i \sim E_\star \mu_i$. By the definition of the IPRW distance (cf. Definition 2.3) and using the fact that $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$, the sequence $(\|E^\top Z_i\|^p)_{i \in \mathbb{N}}$ is uniformly integrable for every $E \in \mathcal{S}_{d,k}$. Since $\mathcal{S}_{d,k}$ is compact, there exists a finite set $\{E_1, E_2, \ldots, E_I\} \subseteq \mathcal{S}_{d,k}$ such that $\|x\| \leq \sum_{j=1}^{I} \|E_j^\top x\|$ for all $x \in \mathbb{R}^d$. Therefore, we have
$$\|Z_i\|^p \leq \Big( \sum_{j=1}^{I} \|E_j^\top Z_i\| \Big)^p \leq I^p \Big( \max_{1 \leq j \leq I} \|E_j^\top Z_i\|^p \Big) \leq I^p \sum_{j=1}^{I} \|E_j^\top Z_i\|^p.$$
Therefore, we deduce that $(\|Z_i\|^p)_{i \in \mathbb{N}}$ is uniformly integrable, which implies the tightness of $\{\mu_i\}_{i \in \mathbb{N}}$. Using Prokhorov's theorem (cf. Theorem B.1), every subsequence of $\{\mu_i\}_{i \in \mathbb{N}}$ has a weakly convergent subsequence. (For Prokhorov's theorem we only present the result on Euclidean space; for general separable metric spaces, we refer the interested readers to Billingsley [2013].)

The next step is to show that all weakly convergent subsequences converge to the same probability measure $\mu$. We fix an arbitrary subsequence and, for simplicity, abbreviate the subscripts and still denote it by $\{\mu_i\}_{i \in \mathbb{N}}$. Let $\tilde{\mu}_i$ be the limit of any given weakly convergent subsequence $(\mu_{i_j})_{j \in \mathbb{N}}$; we need to prove that $\tilde{\mu}_i = \mu$. In particular, we define the characteristic function of any probability measure $\nu$ by
$$\Phi_\nu(z) := \int_{\mathbb{R}^d} e^{\mathrm{i}\langle z, x \rangle}\, d\nu(x) \quad \text{for all } z \in \mathbb{R}^d.$$
Since $\mu_{i_j} \Rightarrow \tilde{\mu}_i$, we have $\Phi_{\mu_{i_j}}(z) \to \Phi_{\tilde{\mu}_i}(z)$ for all $z \in \mathbb{R}^d$. Thus, we need to show that $\Phi_{\mu_{i_j}}(z) \to \Phi_\mu(z)$ for all $z \in \mathbb{R}^d$. This is trivial when $z = 0_d$, since $\Phi_{\mu_{i_j}}(0_d) = \Phi_\mu(0_d) = 1$ for all $j \in \mathbb{N}$. Otherwise, let $r := \|z\|$ and $v := z / \|z\|$; we have
$$\lim_{j \to +\infty} \Phi_{\mu_{i_j}}(z) = \lim_{j \to +\infty} \int_{\mathbb{R}^d} e^{\mathrm{i}\langle z, x \rangle}\, d\mu_{i_j}(x) = \lim_{j \to +\infty} \int_{\mathbb{R}^d} e^{\mathrm{i} r \langle v, x \rangle}\, d\mu_{i_j}(x).$$
Since $\|v\| = 1$, we define $\bar{E} \in \mathcal{S}_{d,k}$ whose first column is $v$. Let $\bar{r}$ be the $k$-dimensional vector whose first coordinate is $r$ and whose other coordinates are zero. Then $r\langle v, x \rangle = \langle \bar{r}, \bar{E}^\top x \rangle$. Putting these pieces together yields
$$\lim_{j \to +\infty} \Phi_{\mu_{i_j}}(z) = \lim_{j \to +\infty} \int_{\mathbb{R}^k} e^{\mathrm{i}\langle \bar{r}, y \rangle}\, d\bar{E}_\star \mu_{i_j}(y).$$
Since $\underline{PW}_{p,k}(\mu_{i_j}, \mu) \to 0$, we deduce that $W_p(\bar{E}_\star \mu_{i_j}, \bar{E}_\star \mu) \to 0$. Using Theorem B.4, we have $\bar{E}_\star \mu_{i_j} \Rightarrow \bar{E}_\star \mu$. Since $r\langle v, x \rangle = \langle \bar{r}, \bar{E}^\top x \rangle$, we have
$$\lim_{j \to +\infty} \int_{\mathbb{R}^k} e^{\mathrm{i}\langle \bar{r}, x \rangle}\, d\bar{E}_\star \mu_{i_j}(x) = \int_{\mathbb{R}^k} e^{\mathrm{i}\langle \bar{r}, x \rangle}\, d\bar{E}_\star \mu(x) = \int_{\mathbb{R}^d} e^{\mathrm{i} r \langle v, x \rangle}\, d\mu(x) = \int_{\mathbb{R}^d} e^{\mathrm{i}\langle z, x \rangle}\, d\mu(x).$$
Putting these pieces together yields that $\Phi_{\mu_{i_j}}(z) \to \Phi_\mu(z)$ for all $z \in \mathbb{R}^d \setminus \{0_d\}$, and hence $\tilde{\mu}_i = \mu$ for all $i \in \mathbb{N}$. Using Prokhorov's theorem again, the whole sequence $\{\mu_i\}_{i \in \mathbb{N}}$ has the limit $\mu$ in the weak sense. Therefore, $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$ implies $\mu_i \Rightarrow \mu$. Since the Wasserstein distances metrize weak convergence (cf. Theorem B.4), we conclude that $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$ implies $W_p(\mu_i, \mu) \to 0$. This completes the proof.
B.3 Proof of Theorem 3.2
By Lemma 3.1, $PW_{p,k}(\mu_i, \mu) \to 0$ if and only if $\underline{PW}_{p,k}(\mu_i, \mu) \to 0$, if and only if $W_p(\mu_i, \mu) \to 0$. Moreover, $\mu_i \Rightarrow \mu$ if and only if $W_p(\mu_i, \mu) \to 0$ (cf. Theorem B.4). Putting these pieces together yields the desired result.

B.4 Proof of Theorem 3.3
Fixing $E \in \mathcal{S}_{d,k}$, the mapping $x \mapsto E^\top x$ is continuous from $\mathbb{R}^d$ to $\mathbb{R}^k$. Since $\mu_i \Rightarrow \mu$ and $\nu_i \Rightarrow \nu$, the continuous mapping theorem implies that $E_\star \mu_i \Rightarrow E_\star \mu$ and $E_\star \nu_i \Rightarrow E_\star \nu$. The next step is the key ingredient in the proof: we show that
$$W_p^p(E_\star \mu, E_\star \nu) \leq \liminf_{i \to +\infty} W_p^p(E_\star \mu_i, E_\star \nu_i) \quad \text{for all } E \in \mathcal{S}_{d,k}. \tag{B.2}$$
From Theorem B.2, there exists a coupling $\pi_i \in \Pi(E_\star \mu_i, E_\star \nu_i)$ such that $W_p^p(E_\star \mu_i, E_\star \nu_i) = \int_{\mathbb{R}^k \times \mathbb{R}^k} \|x - y\|^p\, d\pi_i(x, y)$. By the definition of $\liminf$, there exists a subsequence of $\{\pi_i\}_{i \in \mathbb{N}}$ along which $\int_{\mathbb{R}^k \times \mathbb{R}^k} \|x - y\|^p\, d\pi_{i'}(x, y)$ converges to $\liminf_{i \to +\infty} W_p^p(E_\star \mu_i, E_\star \nu_i)$; for simplicity, we still denote it by $\{\pi_i\}_{i \in \mathbb{N}}$. By Lemma B.3 and Prokhorov's theorem (cf. Theorem B.1), $\{\pi_i\}_{i \in \mathbb{N}}$ is sequentially compact in the weak sense. Thus, there exists a subsequence $\{\pi_{i_j}\}_{j \in \mathbb{N}}$ such that $\pi_{i_j} \Rightarrow \tilde{\pi} \in \mathcal{P}(\mathbb{R}^k \times \mathbb{R}^k)$. Putting these pieces together yields
$$\liminf_{i \to +\infty} W_p^p(E_\star \mu_i, E_\star \nu_i) = \int_{\mathbb{R}^k \times \mathbb{R}^k} \|x - y\|^p\, d\tilde{\pi}(x, y).$$
By the definition of the Wasserstein distance, it suffices to show that $\tilde{\pi} \in \Pi(E_\star \mu, E_\star \nu)$. Indeed, let $f : \mathbb{R}^k \to \mathbb{R}$ be a continuous and bounded function; then
$$\int_{\mathbb{R}^k \times \mathbb{R}^k} f(x)\, d\tilde{\pi}(x, y) = \lim_{j \to +\infty} \int_{\mathbb{R}^k \times \mathbb{R}^k} f(x)\, d\pi_{i_j}(x, y).$$
Since $\pi_{i_j} \in \Pi(E_\star \mu_{i_j}, E_\star \nu_{i_j})$ and $E_\star \mu_i \Rightarrow E_\star \mu$, we have
$$\lim_{j \to +\infty} \int_{\mathbb{R}^k \times \mathbb{R}^k} f(x)\, d\pi_{i_j}(x, y) = \lim_{j \to +\infty} \int_{\mathbb{R}^k} f(x)\, dE_\star \mu_{i_j}(x) = \int_{\mathbb{R}^k} f(x)\, dE_\star \mu(x).$$
Since $E_\star \nu_i \Rightarrow E_\star \nu$, the same argument implies $\int_{\mathbb{R}^k \times \mathbb{R}^k} f(y)\, d\tilde{\pi}(x, y) = \int_{\mathbb{R}^k} f(y)\, dE_\star \nu(y)$. Putting these pieces together yields Eq. (B.2).

For the IPRW distance $\underline{PW}_{p,k}^p$, we derive from Eq. (B.2) and Fatou's lemma that
$$\underline{PW}_{p,k}^p(\mu, \nu) = \int_{\mathcal{S}_{d,k}} W_p^p(E_\star \mu, E_\star \nu)\, d\sigma(E) \leq \liminf_{i \to +\infty} \int_{\mathcal{S}_{d,k}} W_p^p(E_\star \mu_i, E_\star \nu_i)\, d\sigma(E) = \liminf_{i \to +\infty} \underline{PW}_{p,k}^p(\mu_i, \nu_i).$$
For the PRW distance, we derive from Eq. (B.2) and the fact that the supremum of a family of lower semi-continuous mappings is lower semi-continuous that
$$PW_{p,k}^p(\mu, \nu) = \sup_{E \in \mathcal{S}_{d,k}} W_p^p(E_\star \mu, E_\star \nu) \leq \liminf_{i \to +\infty} PW_{p,k}^p(\mu_i, \nu_i).$$
C Postponed Proofs in Subsection 3.2
In this section, we provide the detailed proofs of Theorems 3.4-3.8.

C.1 Preliminary technical results
To facilitate reading, we collect several preliminary technical results which will be used in the proofs for Subsection 3.2.
Theorem C.1 (Tonelli's theorem) If $(\mathcal{X}, \mathcal{A}, \mu)$ and $(\mathcal{Y}, \mathcal{B}, \nu)$ are $\sigma$-finite measure spaces and $f : \mathcal{X} \times \mathcal{Y} \to [0, +\infty]$ is a non-negative measurable function, then
$$\int_{\mathcal{X}} \left( \int_{\mathcal{Y}} f(x, y)\, dy \right) dx = \int_{\mathcal{Y}} \left( \int_{\mathcal{X}} f(x, y)\, dx \right) dy = \int_{\mathcal{X} \times \mathcal{Y}} f(x, y)\, d(x, y).$$

The following proposition provides the state-of-the-art general bound for the Wasserstein distance between the true measure and its empirical version in $\mathbb{R}^d$. Note that we do not assume any additional structure on the true measure. Similar results can be found in many classical works, e.g., Fournier and Guillin [2015, Theorem 1], Weed and Bach [2019, Theorem 1] and Lei [2020, Theorem 3.1]. Here we present the result in the form of Lei [2020, Theorem 3.1].

Proposition C.2
Let $\mu^\star \in \mathcal{P}_q(\mathbb{R}^d)$ and $M_q := M_q(\mu^\star) < +\infty$. Then we have
$$\mathbb{E}[W_p(\hat{\mu}_n, \mu^\star)] \lesssim_{p,q} n^{-\left[\frac{1}{2p \vee d} \wedge \left(\frac{1}{p} - \frac{1}{q}\right)\right]} (\log n)^{\zeta'_{p,q,d}/p}, \quad \text{for all } n \geq 1, \tag{C.1}$$
where $\lesssim_{p,q}$ refers to "less than" up to a constant depending only on $(p, q)$, and $\zeta'_{p,q,d}$ is a logarithmic exponent that is nonzero only in the boundary cases (e.g., $d = 2p$, $q = dp/(d - p)$, or $q > d = 2p$) and zero otherwise; the precise case distinction is as in Lei [2020, Theorem 3.1].

We also need a bound on the covering number of $\mathcal{S}_{d,k}$ in the operator norm of a matrix, denoted by $\|\cdot\|_{\mathrm{op}}$. This is a straightforward consequence of the classical results on the covering number of the unit sphere in $\mathbb{R}^d$ in the Euclidean norm. For the proof details, we refer the interested readers to Niles-Weed and Rigollet [2019, Lemma 4]; for background material on covering numbers, we refer the interested readers to Wainwright [2019, Chapter 5]. For ease of presentation, we provide a formal definition of the covering number of $\mathcal{S}_{d,k}$ in $\|\cdot\|_{\mathrm{op}}$ as follows. For any $\epsilon \in (0, 1)$, the $\epsilon$-covering number of $\mathcal{S}_{d,k}$ in $\|\cdot\|_{\mathrm{op}}$ is defined by
$$N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}}) = \inf\left\{ N \in \mathbb{N} : \exists\, x_1, x_2, \ldots, x_N \in \mathcal{S}_{d,k} \text{ s.t. } \mathcal{S}_{d,k} \subseteq \bigcup_{i=1}^{N} B(x_i, \epsilon) \right\},$$
where $B(x, r) = \{y \in \mathcal{S}_{d,k} : \|y - x\|_{\mathrm{op}} \leq r\}$ is the ball of radius $r > 0$ centered at $x \in \mathcal{S}_{d,k}$ in the operator norm.

Proposition C.3
There exists a universal constant $c > 0$ such that for all $\epsilon \in (0, 1)$, the $\epsilon$-covering number of $\mathcal{S}_{d,k}$ in $\|\cdot\|_{\mathrm{op}}$ satisfies $N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}}) \leq (c\sqrt{k}\, \epsilon^{-1})^{dk}$.

The following theorem summarizes the concentration results under the Bernstein tail condition for a product measure. Indeed, let $\{X_i\}_{i \in [n]}$ be independent samples from probability measures $\mu_i$ on spaces $\mathcal{X}_i$, and let $X'_i$ be independent copies of $X_i$ for all $i \in [n]$. Denote $X = (X_1, \ldots, X_n)$ and $X'(i) = (X_1, \ldots, X'_i, \ldots, X_n)$, which is identical to $X$ except for the $i$-th coordinate. Let $f : \prod_{i=1}^n \mathcal{X}_i \to \mathbb{R}$ be a function such that $\mathbb{E}[|f(X)|] < +\infty$, and define $D_i = f(X) - f(X'(i))$.

Theorem C.4 Suppose that there exist some $\sigma_i, M > 0$ such that $\mathbb{E}[|D_i|^k \mid X_{-i}] \leq \frac{1}{2} \sigma_i^2\, k!\, M^{k-2}$ for all $k \geq 2$. Then the following statement holds:
$$\mathbb{P}(f(X) - \mathbb{E}[f(X)] > t) \leq \exp\left( - \frac{t^2}{2\sum_{i=1}^n \sigma_i^2 + 2tM} \right).$$

The following theorem summarizes the concentration results under the Poincaré inequality for a product measure. We denote by $\|\nabla_i f\|$ the length of the gradient with respect to the $i$-th coordinate.

Theorem C.5 (Corollary 4.6 in Ledoux [1999])
Denote by $\mu^n$ the product measure of $\mu$ on $\otimes_{i=1}^n \mathbb{R}^d$, where $\mu \in \mathcal{P}(\mathbb{R}^d)$ satisfies the Poincaré inequality (cf. Definition 3.4). Let $f$ be a function on $\otimes_{i=1}^n \mathbb{R}^d$ satisfying $\mathbb{E}[|f(X)|] < +\infty$, $\sum_{i=1}^n \|\nabla_i f(X)\|^2 \leq \alpha^2$ and $\max_{1 \leq i \leq n} \|\nabla_i f(X)\| \leq \beta$ almost surely. Then the following statement holds true for $X \sim \mu^n$:
$$\mathbb{P}(f(X) - \mathbb{E}[f(X)] > t) \leq \exp\left( - K \min\left\{ \frac{t}{\beta}, \frac{t^2}{\alpha^2} \right\} \right),$$
where $K > 0$ depends only on the constant $M$ in the Poincaré inequality.

C.2 Proof of Theorem 3.4
Note that $\mu^\star \in \mathcal{P}_q(\mathbb{R}^d)$ and $M_q := M_q(\mu^\star) < +\infty$. Fixing $E \in \mathcal{S}_{d,k}$, we have $E_\star \mu^\star \in \mathcal{P}_q(\mathbb{R}^k)$ and $M_q(E_\star \mu^\star) \leq M_q < +\infty$. Then Proposition C.2 implies that
$$\mathbb{E}[W_p(E_\star \hat{\mu}_n, E_\star \mu^\star)] \lesssim_{p,q} n^{-\left[\frac{1}{2p \vee k} \wedge \left(\frac{1}{p} - \frac{1}{q}\right)\right]} (\log n)^{\zeta'_{p,q,k}/p} \quad \text{for all } n \geq 1.$$
Since $W_p(E_\star \hat{\mu}_n, E_\star \mu^\star) \geq 0$ for all $E \in \mathcal{S}_{d,k}$ and $\mu^\star \in \mathcal{P}_q(\mathbb{R}^d)$, Theorem C.1 implies that
$$\mathbb{E}\left[\underline{PW}_{p,k}(\hat{\mu}_n, \mu^\star)\right] = \mathbb{E}\left[\int_{\mathcal{S}_{d,k}} W_p(E_\star \hat{\mu}_n, E_\star \mu^\star)\, d\sigma(E)\right] = \int_{\mathcal{S}_{d,k}} \mathbb{E}[W_p(E_\star \hat{\mu}_n, E_\star \mu^\star)]\, d\sigma(E).$$
Note that $\zeta_{p,q,k} = \zeta'_{p,q,k}$, where $\zeta_{p,q,k}$ is defined in Theorem 3.4. Putting these pieces together yields the desired result.

C.3 Proof of Theorem 3.5
By the definition of PW p,k ( (cid:98) µ n , µ (cid:63) ), we have E (cid:2) PW p,k ( (cid:98) µ n , µ (cid:63) ) (cid:3) ≤ sup E ∈ S d,k E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )] (C.2)+ E (cid:34) sup E ∈ S d,k (cid:0) W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) ) − E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )] (cid:1)(cid:35) . Using the same arguments for proving Theorem 3.4, we havesup E ∈ S d,k E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )] (cid:46) p,q n − [ p ) ∨ k ∧ ( p − q )] (log( n )) ζp,q,kp for all n ≥ . (C.3)21he remaining step is to bound the gap E [sup E ∈ S d,k ( W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) ) − E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )])]. We firstclaim that W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) ) − E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )] is sub-exponential with parameters (2 σn / − /p , V n − /p )for all E ∈ S d,k if the true measure µ (cid:63) satisfies the projection Bernstein-type tail condition (cf. Defini-tion 3.1). Indeed, let f ( X ) = W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) ), we have D i = f ( X ) − f ( X (cid:48) ( i ) ) ≤ W p ( E (cid:63) (cid:98) µ n , E (cid:63) (cid:98) µ (cid:48) n ) ≤ n − /p (cid:0) (cid:107) E (cid:63) ( X i ) − E (cid:63) ( X (cid:48) i ) (cid:107) (cid:1) . By the triangle inequality and using the projection Bernstein-type tail condition, we have E [ | D i | k | X − i ] ≤ k n − k/p ( E X ∼ E (cid:63) µ [ | X | k ]) ≤ k − n − k/p σ k ! V k − = (2 n − /p σ ) k !(2 n − /p V ) k − . This implies that the condition in Theorem C.4 holds true with σ i = 2 n − /p σ and M = 2 n − /p V .Equipped with Theorem C.4 yields that P (cid:0) W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) ) − E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )] ≥ t (cid:1) ≤ exp (cid:18) − t σ n − /p + 4 tV n − /p (cid:19) . 
For the simplicity, let Z E = W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) ) − E [ W p ( E (cid:63) (cid:98) µ n , E (cid:63) µ (cid:63) )]. Then we have E [ Z E ] = 0 and P ( Z E ≥ t ) ≤ exp( − t / (8 σ n − /p +4 tV n − /p )). This together with the definition of Z E and Wainwright[2019, Theorem 2.2] yields the desired claim.We then interpret { Z E } E ∈ S d,k as an empirical process indexed by E ∈ S d,k and claim that there existsa random variable L satisfying E [ L ] ≤ M q ( µ (cid:63) ) so that | Z U − Z V | ≤ L (cid:107) U − V (cid:107) op for all U, V ∈ S d,k .Indeed, we have | Z U − Z V | ≤ W p ( U (cid:63) (cid:98) µ n , V (cid:63) (cid:98) µ n ) + W p ( U (cid:63) µ (cid:63) , V (cid:63) µ (cid:63) )+ E (cid:2) W p ( U (cid:63) (cid:98) µ n , V (cid:63) (cid:98) µ n ) + W p ( U (cid:63) µ (cid:63) , V (cid:63) µ (cid:63) ) (cid:3) . Let X ∼ µ , we have | Z U − Z V | ≤ E ( (cid:107) ( U − V ) X (cid:107) p )) /p + (cid:32) n n (cid:88) i =1 (cid:107) ( U − V ) X i (cid:107) p (cid:33) /p + E (cid:32) n n (cid:88) i =1 (cid:107) ( U − V ) X i (cid:107) p (cid:33) /p ≤ (cid:107) U − V (cid:107) op E ( (cid:107) X (cid:107) p )) /p + (cid:32) n n (cid:88) i =1 (cid:107) X i (cid:107) p (cid:33) /p + E (cid:32) n n (cid:88) i =1 (cid:107) X i (cid:107) p (cid:33) /p := L (cid:107) U − V (cid:107) op . Note that X n = ( X , . . . , X n ) are independent and identically distributed samples according to µ (cid:63) .By the Jensen’s inequality and using the fact that q > p ≥
1, we have
\[
\mathbb{E}[L] \le 4\big(\mathbb{E}\|X\|^p\big)^{1/p} \le 4\big(\mathbb{E}\|X\|^q\big)^{1/q} = 4 M_q(\mu_\star).
\]
Thus, by a standard $\epsilon$-net argument, we obtain that
\[
\mathbb{E}\Big[\sup_{E \in \mathcal{S}_{d,k}} Z_E\Big] \le \inf_{\epsilon > 0}\Big\{\epsilon\,\mathbb{E}[L] + 4\sigma n^{1/2-1/p}\sqrt{\log N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}})} + 2 V n^{-1/p} \log N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}})\Big\},
\]
where there exists a universal constant $c > 0$ such that $\log N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}}) \le dk \log(c\sqrt{k}/\epsilon)$ (cf. Proposition C.3). Putting these pieces together and choosing $\epsilon = \sqrt{k}\, n^{-1/p}$ yields that
\[
\mathbb{E}\Big[\sup_{E \in \mathcal{S}_{d,k}} Z_E\Big] \lesssim_{p,q} \inf_{\epsilon > 0}\bigg\{\epsilon + n^{1/2-1/p}\sqrt{dk \log\Big(\frac{c\sqrt{k}}{\epsilon}\Big)} + n^{-1/p}\, dk \log\Big(\frac{c\sqrt{k}}{\epsilon}\Big)\bigg\} \lesssim_{p,q} n^{1/2-1/p}\sqrt{dk\log n} + n^{-1/p}\, dk\log n.
\]
Therefore, we conclude that
\[
\mathbb{E}\Big[\sup_{E \in \mathcal{S}_{d,k}} \big(W_p(E_\star\hat\mu_n, E_\star\mu_\star) - \mathbb{E}[W_p(E_\star\hat\mu_n, E_\star\mu_\star)]\big)\Big] \lesssim_{p,q} n^{1/2-1/p}\sqrt{dk\log n} + n^{-1/p}\, dk\log n.
\]
This together with Eq. (C.2) and Eq. (C.3) yields the desired inequality.
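As a quick numerical sanity check of the quantity bounded above (a Monte Carlo sketch with $k = 1$ and $p = 1$, not part of the proof; the direction count and sample sizes below are arbitrary choices), one can approximate the supremum over one-dimensional projections of the Wasserstein distance between an empirical measure and a large reference sample, and observe the decay in $n$:

```python
import numpy as np
from scipy.stats import wasserstein_distance


def sup_projected_w1(x, y, n_dirs=200, seed=0):
    # Approximate sup over one-dimensional projections (k = 1) of the
    # Wasserstein-1 distance between two empirical measures in R^d,
    # using randomly sampled unit directions.
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_dirs, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return max(wasserstein_distance(x @ e, y @ e) for e in dirs)


rng = np.random.default_rng(1)
d = 10
reference = rng.standard_normal((20000, d))   # stand-in for mu_star
small = rng.standard_normal((100, d))         # empirical measure, n = 100
large = rng.standard_normal((10000, d))       # empirical measure, n = 10000

d_small = sup_projected_w1(small, reference)
d_large = sup_projected_w1(large, reference)
# The sup-projected distance shrinks as n grows, consistent with the bound.
```

The random-direction maximum is only a lower bound on the true supremum over the Stiefel manifold, but it already exhibits the qualitative decay in $n$ that the $\epsilon$-net argument quantifies.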
C.4 Proof of Theorem 3.6
Using the same arguments as in the proof of Theorem 3.5, we obtain Eq. (C.2) and Eq. (C.3). So it suffices to bound the gap $\mathbb{E}[\sup_{E\in\mathcal{S}_{d,k}}(W_p(E_\star\hat\mu_n, E_\star\mu_\star) - \mathbb{E}[W_p(E_\star\hat\mu_n, E_\star\mu_\star)])]$ under a different condition.

We first claim that $W_p(E_\star\hat\mu_n, E_\star\mu_\star) - \mathbb{E}[W_p(E_\star\hat\mu_n, E_\star\mu_\star)]$ is sub-exponential with parameters $(\sqrt{K/2}\, n^{-1/(2\vee p)}, (K/2)\, n^{-1/p})$ for all $E \in \mathcal{S}_{d,k}$ if the true measure $\mu_\star$ satisfies the projection Poincaré inequality (cf. Definition 3.2). Indeed, we consider $X = (X_1, \ldots, X_n)$ and $X' = (X_1', \ldots, X_n')$ where $X_i, X_i'$ are independent samples from $E_\star\mu_\star$. Let $f(X) = W_p(E_\star\hat\mu_n, E_\star\mu_\star)$; then $\mathbb{E}[|f(X)|] < +\infty$. By the triangle inequality, we have
\[
|f(X) - f(X')| \le n^{-1/p}\Big(\sum_{i=1}^n \|X_i - X_i'\|^p\Big)^{1/p} \le n^{-1/(2\vee p)}\, \|X - X'\|.
\]
This implies that the following statement holds almost surely:
\[
\sum_{i=1}^n \|\nabla_i f(X)\|^2 \le n^{-2/(2\vee p)} \quad \text{and} \quad \max_{1\le i \le n} \|\nabla_i f(X)\| \le n^{-1/p}.
\]
In addition, the probability measure $E_\star\mu_\star \in \mathcal{P}(\mathbb{R}^k)$ is assumed to satisfy the Poincaré inequality. Applying Theorem C.5 yields that
\[
\mathbb{P}\big(W_p(E_\star\hat\mu_n, E_\star\mu_\star) - \mathbb{E}[W_p(E_\star\hat\mu_n, E_\star\mu_\star)] \ge t\big) \le \exp\Big(-\frac{1}{K}\min\big\{t\, n^{1/p},\, t^2\, n^{2/(2\vee p)}\big\}\Big).
\]
For simplicity, let $Z_E = W_p(E_\star\hat\mu_n, E_\star\mu_\star) - \mathbb{E}[W_p(E_\star\hat\mu_n, E_\star\mu_\star)]$. Then we have $\mathbb{E}[Z_E] = 0$ and $\mathbb{P}(Z_E \ge t) \le \exp(-K^{-1}\min\{n^{1/p}\, t,\; n^{2/(2\vee p)}\, t^2\})$.
This together with the definition of $Z_E$ and Wainwright [2019, Theorem 2.2] yields the desired claim.

Using the same argument as in the proof of Theorem 3.5, we can interpret $\{Z_E\}_{E\in\mathcal{S}_{d,k}}$ as an empirical process indexed by $E \in \mathcal{S}_{d,k}$ and show that there exists a random variable $L$ satisfying $\mathbb{E}[L] \le 4 M_q(\mu_\star)$ such that $|Z_U - Z_V| \le L\,\|U - V\|_{\mathrm{op}}$ for all $U, V \in \mathcal{S}_{d,k}$. By a standard $\epsilon$-net argument, we obtain that
\[
\mathbb{E}\Big[\sup_{E\in\mathcal{S}_{d,k}} Z_E\Big] \le \inf_{\epsilon>0}\Big\{\epsilon\,\mathbb{E}[L] + \sqrt{K}\, n^{-1/(2\vee p)}\sqrt{\log N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}})} + (K/2)\, n^{-1/p}\log N(\mathcal{S}_{d,k}, \epsilon, \|\cdot\|_{\mathrm{op}})\Big\}.
\]
Combining Proposition C.3 and choosing $\epsilon = \sqrt{k}\, n^{-1/p}$ yields that
\[
\mathbb{E}\Big[\sup_{E\in\mathcal{S}_{d,k}} Z_E\Big] \lesssim_{p,q} \inf_{\epsilon>0}\bigg\{\epsilon + n^{-1/(2\vee p)}\sqrt{dk\log\Big(\frac{c\sqrt{k}}{\epsilon}\Big)} + n^{-1/p}\, dk \log\Big(\frac{c\sqrt{k}}{\epsilon}\Big)\bigg\} \lesssim_{p,q} n^{-1/(2\vee p)}\sqrt{dk\log n} + n^{-1/p}\, dk \log n.
\]
Therefore, we conclude that
\[
\mathbb{E}\Big[\sup_{E\in\mathcal{S}_{d,k}} \big(W_p(E_\star\hat\mu_n, E_\star\mu_\star) - \mathbb{E}[W_p(E_\star\hat\mu_n, E_\star\mu_\star)]\big)\Big] \lesssim_{p,q} n^{-1/(2\vee p)}\sqrt{dk\log n} + n^{-1/p}\, dk\log n.
\]
This together with Eq. (C.2) and Eq. (C.3) yields the desired inequality.
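The bounded-difference inequality driving both concentration arguments, namely that replacing a single sample moves the distance by at most $n^{-1/p}$ times the displacement of that sample, can be checked numerically in the simplest case $p = 1$, $k = 1$ (a sketch; `wasserstein_distance` is SciPy's one-dimensional $W_1$, and the sample sizes are arbitrary):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n = 50
x = rng.standard_normal(n)       # empirical sample defining mu_hat_n
ref = rng.standard_normal(5000)  # fixed reference sample standing in for mu_star

base = wasserstein_distance(x, ref)
for i in range(n):
    x_prime = x.copy()
    x_prime[i] = rng.standard_normal()  # resample one coordinate only
    perturbed = wasserstein_distance(x_prime, ref)
    # Moving one of the n atoms changes W_1 by at most |x_i - x_i'| / n,
    # certified by the coupling that transports only the changed atom.
    assert abs(perturbed - base) <= abs(x_prime[i] - x[i]) / n + 1e-12
```

The inner assertion is exactly the bounded-difference property that Theorems C.4 and C.5 convert into sub-exponential tail bounds.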
C.5 Proof of Theorem 3.7
The arguments in this proof hold for both the IPRW and PRW distances; for simplicity, we write $\mathcal{W} = PW_{p,k}$ or $\mathcal{W} = \underline{PW}_{p,k}$. Let $f(X) = \mathcal{W}(\hat\mu_n, \mu_\star)$; then
\[
D_i = f(X) - f(X'^{(i)}) \le \mathcal{W}(\hat\mu_n, \hat\mu_n') \le n^{-1/p}\Big(\sup_{E\in\mathcal{S}_{d,k}} \big\|E_\star(X_i) - E_\star(X_i')\big\|\Big).
\]
By the triangle inequality, we have
\[
\mathbb{E}\big[|D_i|^k \mid X_{-i}\big] \le 2^k n^{-k/p}\, \mathbb{E}\Big[\sup_{E\in\mathcal{S}_{d,k},\, X\sim E_\star\mu_\star} \|X\|^k\Big].
\]
Since the true measure $\mu_\star$ satisfies the Bernstein-type tail condition (cf. Definition 3.3), we have
\[
\mathbb{E}\big[|D_i|^k \mid X_{-i}\big] \le 2^{k-1} n^{-k/p}\, \sigma^2\, k!\, V^{k-2} = (2 n^{-1/p}\sigma)^2\,\frac{k!}{2}\,(2 n^{-1/p} V)^{k-2},
\]
so the condition in Theorem C.4 holds true with $\sigma_i = 2 n^{-1/p}\sigma$ and $M = 2 n^{-1/p} V$. Applying Theorem C.4 yields the desired inequality.

C.6 Proof of Theorem 3.8
The arguments in this proof hold for both the IPRW and PRW distances; for simplicity, we write $\mathcal{W} = PW_{p,k}$ or $\mathcal{W} = \underline{PW}_{p,k}$. We consider $X = (X_1, X_2, \ldots, X_n)$ and $X' = (X_1', X_2', \ldots, X_n')$ where $X_i, X_i'$ are independent samples from $\mu_\star$. Let $f(X) = \mathcal{W}(\hat\mu_n, \mu_\star)$; then $\mathbb{E}[|f(X)|] < +\infty$. By the triangle inequality, we have
\[
|f(X) - f(X')| \le n^{-1/p}\Big(\sum_{i=1}^n \|X_i - X_i'\|^p\Big)^{1/p} \le n^{-1/(2\vee p)}\, \|X - X'\|.
\]
This implies that the following statement holds almost surely:
\[
\sum_{i=1}^n \|\nabla_i f(X)\|^2 \le n^{-2/(2\vee p)} \quad \text{and} \quad \max_{1\le i\le n} \|\nabla_i f(X)\| \le n^{-1/p}.
\]
In addition, the true measure $\mu_\star$ satisfies the Poincaré inequality (cf. Definition 3.4). Applying Theorem C.5 yields the desired inequality.

D Postponed Proofs in Subsection 3.3
In this section, we provide detailed proofs for Theorems 3.9-3.11 and Theorems A.1-A.2. Our results are derived analogously to the proofs in Bernton et al. [2019] for estimators based on the Wasserstein distance and in Nadjahi et al. [2019] for estimators based on the sliced Wasserstein distance.
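The minimum-distance estimation setup analyzed below can be illustrated numerically (a hedged sketch with $k = 1$, $p = 1$, and a one-dimensional Gaussian location model; the grid search, sample sizes, and seeds are arbitrary choices for illustration, not the paper's method):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
theta_star = 0.5
data = theta_star + rng.standard_normal(2000)   # observations from mu_star
base = rng.standard_normal(2000)                # model noise: mu_theta = theta + base

# Minimum-Wasserstein estimate over a parameter grid (a crude stand-in for
# the argmin over Theta in the consistency theorems below).
grid = np.linspace(-2.0, 2.0, 401)
losses = [wasserstein_distance(data, theta + base) for theta in grid]
theta_hat = grid[int(np.argmin(losses))]
# theta_hat should approach theta_star as the sample size grows (consistency).
```

The theorems in this section make this convergence rigorous: almost surely, the argmin sets of the empirical objective concentrate inside the argmin set of the population objective.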
D.1 Preliminary technical results
To facilitate reading, we collect several preliminary technical results which will be used in the postponed proofs of Subsection 3.3.
Theorem D.1 (Theorem 2.43 in Aliprantis and Border [2006])
A real-valued lower semi-continuous function on a compact space attains a minimum value, and the nonempty set of minimizers is compact. Similarly, an upper semi-continuous function on a compact space attains a maximum value, and the nonempty set of maximizers is compact.
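A minimal numerical illustration of this statement (a sketch; the jump function below is an illustrative example of mine, not taken from the text): a lower semi-continuous function with a downward jump still attains its minimum on a compact interval.

```python
import numpy as np


def f(x):
    # Lower semi-continuous: f(0) = 0 <= liminf_{x -> 0} (x**2 + 1) = 1.
    return 0.0 if x == 0 else x * x + 1.0


# Evaluate on a grid of the compact set [-1, 1] that contains the jump point.
grid = np.arange(-1000, 1001) / 1000.0       # includes x = 0 exactly
values = np.array([f(x) for x in grid])
x_min = grid[int(np.argmin(values))]
# The minimum value 0 is attained (at x = 0), as the theorem guarantees;
# the upper semi-continuous function -f attains its maximum 0 there as well.
```

A continuous-from-nowhere counterexample on a non-compact or non-semi-continuous setting would fail to attain its infimum, which is why the proofs below repeatedly verify lower semi-continuity before invoking this theorem.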
Definition D.1 (epiconvergence)
Let $\mathcal{X}$ be a metric space and $\{f_i\}_{i\in\mathbb{N}}$ be a sequence of real-valued functions from $\mathcal{X}$ to $\mathbb{R}$. We say that the sequence $\{f_i\}_{i\in\mathbb{N}}$ epiconverges to a function $f: \mathcal{X} \to \mathbb{R}$ if for each $x \in \mathcal{X}$ the following statements hold true:
\[
\liminf_{i\to+\infty} f_i(x_i) \ge f(x) \quad \text{for every sequence } \{x_i\}_{i\in\mathbb{N}} \text{ such that } x_i \to x,
\]
\[
\limsup_{i\to+\infty} f_i(x_i) \le f(x) \quad \text{for some sequence } \{x_i\}_{i\in\mathbb{N}} \text{ such that } x_i \to x.
\]

Proposition D.2 (Proposition 7.29 in Rockafellar and Wets [2009])
Let $\mathcal{X}$ be a metric space, $\{f_i\}_{i\in\mathbb{N}}$ be a sequence of real-valued functions from $\mathcal{X}$ to $\mathbb{R}$, and $f: \mathcal{X} \to \mathbb{R}$ be lower semi-continuous. Then the sequence $\{f_i\}_{i\in\mathbb{N}}$ epiconverges to $f$ if and only if
\[
\liminf_{i\to+\infty}\Big(\inf_{x\in K} f_i(x)\Big) \ge \inf_{x\in K} f(x) \quad \text{for every compact set } K \subseteq \mathcal{X},
\]
\[
\limsup_{i\to+\infty}\Big(\inf_{x\in O} f_i(x)\Big) \le \inf_{x\in O} f(x) \quad \text{for every open set } O \subseteq \mathcal{X}.
\]

Define $\delta\text{-argmin}_{x\in\mathcal{X}} f = \{x \in \mathcal{X} : f(x) \le \inf_{x\in\mathcal{X}} f + \delta\}$ for a generic function $f: \mathcal{X} \to \mathbb{R}$ and $\delta \ge 0$. The following theorem gives asymptotic properties of the infimum and the $\delta$-argmin of epiconvergent functions, and thus a standard approach to proving the existence and consistency of the estimators.

Theorem D.3 (Theorem 7.31 in Rockafellar and Wets [2009])
Let $\mathcal{X}$ be a metric space and $\{f_i\}_{i\in\mathbb{N}}$ be a sequence of functions which epiconverges to a lower semi-continuous function $f$ with $\inf_{x\in\mathcal{X}} f \in (-\infty, +\infty)$. Then we have the following statements:
1. $\inf_{x\in\mathcal{X}} f_i \to \inf_{x\in\mathcal{X}} f$ if and only if for every $\delta > 0$ there exist a compact set $B \subseteq \mathcal{X}$ and $N \in \mathbb{N}$ such that $\inf_{x\in B} f_i \le \inf_{x\in\mathcal{X}} f_i + \delta$ for all $i \ge N$.
2. $\limsup_{i\to+\infty}(\delta\text{-argmin}_{x\in\mathcal{X}} f_i) \subseteq \delta\text{-argmin}_{x\in\mathcal{X}} f$ for any $\delta \ge 0$, and $\limsup_{i\to+\infty}(\delta_i\text{-argmin}_{x\in\mathcal{X}} f_i) \subseteq \text{argmin}_{x\in\mathcal{X}} f$ whenever $\delta_i \downarrow 0$.
3. Assume that $\inf_{x\in\mathcal{X}} f_i \to \inf_{x\in\mathcal{X}} f$; then there exists a sequence $\delta_i \downarrow 0$ such that $\delta_i\text{-argmin}_{x\in\mathcal{X}} f_i \to \text{argmin}_{x\in\mathcal{X}} f$. Conversely, if $\text{argmin}_{x\in\mathcal{X}} f \neq \emptyset$ and such a sequence exists, then $\inf_{x\in\mathcal{X}} f_i \to \inf_{x\in\mathcal{X}} f$.

The following theorem summarizes the well-known Skorokhod's representation theorem.
Theorem D.4 (Skorokhod’s representation theorem)
Let $\{\mu_n\}_{n\in\mathbb{N}}$ be a sequence of probability measures on a metric space $S$ such that $\mu_n$ converges weakly to some probability measure $\mu_\infty$ on $S$ as $n \to \infty$. Suppose also that the support of $\mu_\infty$ is separable. Then there exist random variables $X_n$ defined on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$ such that the law of $X_n$ is $\mu_n$ for all $n$ (including $n = \infty$) and such that $X_n$ converges to $X_\infty$ almost surely.

The following theorem presents the classical results which lead to a standard approach for proving the measurability of the estimators. Note that the projection is $\mathrm{proj}(D) = \{x \in \mathcal{X} : \exists\, y \in \mathcal{Y} \text{ s.t. } (x, y) \in D\}$ for each $D \subseteq \mathcal{X} \times \mathcal{Y}$, and the section is $D_x = \{y \in \mathcal{Y} : (x, y) \in D\}$ for each $x \in \mathrm{proj}(D)$.

Theorem D.5 (Corollary 1 in Brown and Purves [1973])
Let $\mathcal{X}, \mathcal{Y}$ be complete separable metric spaces and $f$ be a real-valued Borel measurable function defined on a Borel subset $D$ of $\mathcal{X} \times \mathcal{Y}$. Suppose that for each $x \in \mathrm{proj}(D)$, the section $D_x$ is $\sigma$-compact and $f(x, \cdot)$ is lower semi-continuous with respect to the relative topology on $D_x$. Then:
1. The sets $G = \mathrm{proj}(D)$ and $I = \{x \in G : \exists\, y \in D_x \text{ s.t. } y = \text{argmin}_{z\in\mathcal{Y}} f(x, z)\}$ are Borel.
2. For each $\epsilon > 0$, there exists a Borel measurable function $\varphi_\epsilon$ satisfying, for $x \in G$,
\[
f(x, \varphi_\epsilon(x)) \;\begin{cases} = \inf_{y\in D_x} f(x, y), & \text{if } x \in I, \\ \le \epsilon + \inf_{y\in D_x} f(x, y), & \text{if } x \notin I \text{ and } \inf_{y\in D_x} f(x, y) \neq -\infty, \\ \le -\epsilon^{-1}, & \text{if } x \notin I \text{ and } \inf_{y\in D_x} f(x, y) = -\infty. \end{cases}
\]

To show that the MEPRW estimator is measurable, we establish the lower semi-continuity of the expectation of the empirical PRW distance in the following lemma.
Lemma D.6
The expected empirical PRW distance is lower semi-continuous in the usual weak topology. That is, if the sequences $\{\mu_i\}_{i\in\mathbb{N}}, \{\nu_i\}_{i\in\mathbb{N}} \subseteq \mathcal{P}(\mathbb{R}^d)$ satisfy $\mu_i \Rightarrow \mu \in \mathcal{P}(\mathbb{R}^d)$ and $\nu_i \Rightarrow \nu \in \mathcal{P}(\mathbb{R}^d)$, then
\[
\mathbb{E}[PW_{p,k}(\mu, \hat\nu_m)] \le \liminf_{i\to+\infty} \mathbb{E}[PW_{p,k}(\mu_i, \hat\nu_{i,m})],
\]
where $\hat\nu_m = (1/m)\sum_{j=1}^m \delta_{Z_j}$ for i.i.d. samples $Z_1, \ldots, Z_m$ from $\nu$, and $\{\hat\nu_{i,m}\}_{i\in\mathbb{N}}$ are defined similarly.

D.2 Proof of Theorem 3.9

We first prove that $\text{argmin}_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) \neq \emptyset$. Indeed, by Assumption 3.2 and Theorem 3.3, the mapping $\theta \mapsto PW_{p,k}(\mu_\star, \mu_\theta)$ is lower semi-continuous. By Assumption 3.3, the set $\Theta_\star(\tau)$ is bounded for some $\tau >$
0. By the definition of the infimum, there exists $\theta' \in \Theta$ such that $PW_{p,k}(\mu_\star, \mu_{\theta'}) \le \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) + \tau/$
2. This implies that $\theta' \in \Theta_\star(\tau)$, so $\Theta_\star(\tau)$ is nonempty. By the lower semi-continuity of the mapping $\theta \mapsto PW_{p,k}(\mu_\star, \mu_\theta)$, the set $\Theta_\star(\tau)$ is closed. Putting these pieces together yields that $\Theta_\star(\tau)$ is compact. Therefore, we conclude the desired result from Theorem D.1.

Then we show that there exists a set $E \subseteq \Omega$ with $\mathbb{P}(E) = 1$ such that, for all $\omega \in E$, the sequence of mappings $\theta \mapsto PW_{p,k}(\hat\mu_n(\omega), \mu_\theta)$ epiconverges to the mapping $\theta \mapsto PW_{p,k}(\mu_\star, \mu_\theta)$ as $n \to +\infty$. Indeed, we only need to prove that the conditions in Proposition D.2 hold true.

Fix a compact set $K \subseteq \Theta$. By the lower semi-continuity of the mapping $\theta \mapsto PW_{p,k}(\hat\mu_n(\omega), \mu_\theta)$ (cf. Assumption 3.2 and Theorem 3.3), Theorem D.1 implies that
\[
\inf_{\theta\in K} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) = PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n})
\]
for some sequence $\theta_n = \theta_n(\omega) \in K$. Thus, we have
\[
\liminf_{n\to+\infty} \inf_{\theta\in K} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) = \liminf_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n}).
\]
By the definition of $\liminf$, there exists a subsequence of $\{\theta_n\}_{n\in\mathbb{N}}$ such that $PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n})$ converges to $\liminf_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n})$ along this subsequence. By the compactness of $K$, this subsequence must have a convergent subsubsequence. We denote this subsubsequence as $\{\theta_{n_j}\}_{j\in\mathbb{N}}$ and its limit as $\bar\theta \in K$. Then
\[
\liminf_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n}) = \lim_{j\to+\infty} PW_{p,k}(\hat\mu_{n_j}(\omega), \mu_{\theta_{n_j}}).
\]
Since $\omega \in E$ where $\mathbb{P}(E) = 1$, Assumptions 3.1 and 3.2 imply $\hat\mu_{n_j}(\omega) \Rightarrow \mu_\star$ and $\mu_{\theta_{n_j}} \Rightarrow \mu_{\bar\theta}$. These pieces together with the lower semi-continuity of the PRW distance (cf. Theorem 3.3) yield that $\lim_{j\to+\infty} PW_{p,k}(\hat\mu_{n_j}(\omega), \mu_{\theta_{n_j}}) \ge PW_{p,k}(\mu_\star, \mu_{\bar\theta})$.
Putting these pieces together yields that
\[
\liminf_{n\to+\infty} \inf_{\theta\in K} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \ge \inf_{\theta\in K} PW_{p,k}(\mu_\star, \mu_\theta).
\]
Fix an arbitrary open set $O \subseteq \Theta$. By the definition of the infimum, there exists a sequence $\theta_n' = \theta_n'(\omega) \in O$ such that $PW_{p,k}(\mu_\star, \mu_{\theta_n'}) \to \inf_{\theta\in O} PW_{p,k}(\mu_\star, \mu_\theta)$. In addition, $\inf_{\theta\in O} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n'})$. Thus, we have
\[
\limsup_{n\to+\infty} \inf_{\theta\in O} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le \limsup_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_{\theta_n'}) \le \limsup_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_\star) + \limsup_{n\to+\infty} PW_{p,k}(\mu_\star, \mu_{\theta_n'}).
\]
Since $\omega \in E$ where $\mathbb{P}(E) = 1$, Assumption 3.1 implies $\limsup_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_\star) = 0$. By the definition of $\theta_n'$, $\limsup_{n\to+\infty} PW_{p,k}(\mu_\star, \mu_{\theta_n'}) = \inf_{\theta\in O} PW_{p,k}(\mu_\star, \mu_\theta)$. Putting these pieces together yields that $\limsup_{n\to+\infty} \inf_{\theta\in O} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le \inf_{\theta\in O} PW_{p,k}(\mu_\star, \mu_\theta)$.

Proposition D.2 guarantees that there exists a set $E \subseteq \Omega$ with $\mathbb{P}(E) = 1$ such that, for all $\omega \in E$, the sequence of mappings $\theta \mapsto PW_{p,k}(\hat\mu_n(\omega), \mu_\theta)$ epiconverges to the mapping $\theta \mapsto PW_{p,k}(\mu_\star, \mu_\theta)$ as $n \to +\infty$. Then the second statement of Theorem D.3 implies that
\[
\limsup_{n\to+\infty} \text{argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \subseteq \text{argmin}_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta). \tag{D.1}
\]
The next step is to show that, for every $\delta >$
0, there exist a compact set $B \subseteq \Theta$ and $N \in \mathbb{N}$ such that $\inf_{\theta\in B} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) + \delta$. In what follows, we prove a stronger statement: the above inequality holds true with $\delta = 0$. Indeed, by the same reasoning as for the open-set case in the proof of epiconvergence, we have
\[
\limsup_{n\to+\infty} \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta).
\]
By Assumption 3.3 and the previous argument, $\Theta_\star(\tau)$ is nonempty and compact for some $\tau >$
0. The above inequality implies that there exists $n_1(\omega) > 0$ such that, for all $n \ge n_1(\omega)$, the set $\{\theta \in \Theta : PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le \inf_{\theta'\in\Theta} PW_{p,k}(\mu_\star, \mu_{\theta'}) + \tau/2\}$ is nonempty. For any $\theta$ in this set and $n \ge n_1(\omega)$, we have
\[
PW_{p,k}(\mu_\star, \mu_\theta) \le PW_{p,k}(\mu_\star, \hat\mu_n(\omega)) + \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) + \tau/2.
\]
By Assumption 3.1, there exists $n_2(\omega) > 0$ such that, for all $n \ge n_2(\omega)$,
\[
PW_{p,k}(\mu_\star, \hat\mu_n(\omega)) \le W_p(\mu_\star, \hat\mu_n(\omega)) \le \tau/2.
\]
Putting these pieces together yields that, for all $n \ge \max\{n_1(\omega), n_2(\omega)\}$,
\[
PW_{p,k}(\mu_\star, \mu_\theta) \le \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) + \tau.
\]
This implies that, for all $n \ge \max\{n_1(\omega), n_2(\omega)\}$,
\[
\Big\{\theta \in \Theta : PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \le \inf_{\theta'\in\Theta} PW_{p,k}(\mu_\star, \mu_{\theta'}) + \tau/2\Big\} \subseteq \Theta_\star(\tau).
\]
Therefore, we have $\inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) = \inf_{\theta\in\Theta_\star(\tau)} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta)$. This together with the compactness of $\Theta_\star(\tau)$ yields the desired result.

The first statement of Theorem D.3 implies that
\[
\inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta) \to \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta), \quad \text{as } n \to +\infty. \tag{D.2}
\]
By Assumption 3.2 and Theorem 3.3, the mapping $\theta \mapsto PW_{p,k}(\hat\mu_n(\omega), \mu_\theta)$ is lower semi-continuous. Theorem D.1 implies that $\text{argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n(\omega), \mu_\theta)$ is nonempty for all $n \ge \max\{n_1(\omega), n_2(\omega)\}$. Together with Eq. (D.1) and Eq. (D.2), this yields the desired results.

Finally, we remark that these results hold true for $\delta_n\text{-argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta)$ with $\delta_n \to$
0. For Eq. (D.1) and Eq. (D.2), the analogous results can be derived by using the second and third statements of Theorem D.3. To show that $\delta_n\text{-argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta)$ is nonempty, we notice that it contains the nonempty set $\text{argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta)$.

D.3 Proof of Theorem 3.10
Following the same approach as in the proof of Theorem 3.9, it is straightforward to derive that $\text{argmin}_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) \neq \emptyset$. Then we show that there exists a set $E \subseteq \Omega$ with $\mathbb{P}(E) = 1$ such that, for all $\omega \in E$, the sequence of mappings $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ epiconverges to $\theta \mapsto PW_{p,k}(\mu_\star, \mu_\theta)$ as $n \to +\infty$. Indeed, it suffices to verify the conditions in Proposition D.2.

Fix an arbitrary compact set $K \subseteq \Theta$. By Assumption 3.2 and Lemma D.6, the mapping $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ is lower semi-continuous. Then Theorem D.1 implies that
\[
\inf_{\theta\in K} \mathbb{E}\big[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n\big] = \mathbb{E}\big[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta_n,m(n)}) \mid X^n\big]
\]
for some sequence $\theta_n = \theta_n(\omega) \in K$. Thus, we have
\[
\liminf_{n\to+\infty} \inf_{\theta\in K} \mathbb{E}\big[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n\big] = \liminf_{n\to+\infty} \mathbb{E}\big[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta_n,m(n)}) \mid X^n\big].
\]
Following the same approach as in the proof of Theorem 3.9, there exists a subsequence of $\{\theta_n\}_{n\in\mathbb{N}}$, denoted by $\{\theta_{n_j}\}_{j\in\mathbb{N}}$ with limit $\bar\theta \in K$, such that
\[
\liminf_{n\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta_n,m(n)}) \mid X^n] = \lim_{j\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_{n_j}(\omega), \hat\mu_{\theta_{n_j},m(n_j)}) \mid X^{n_j}] \ge \liminf_{j\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_{n_j}(\omega), \mu_{\theta_{n_j}})] - \limsup_{j\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_{n_j}}, \hat\mu_{\theta_{n_j},m(n_j)}) \mid X^{n_j}].
\]
Since $\omega \in E$ where $\mathbb{P}(E) = 1$, Assumptions 3.1 and 3.2 imply $\hat\mu_{n_j}(\omega) \Rightarrow \mu_\star$ and $\mu_{\theta_{n_j}} \Rightarrow \mu_{\bar\theta}$. These pieces together with the lower semi-continuity of the PRW distance (cf. Theorem 3.3) yield that $\liminf_{j\to+\infty} PW_{p,k}(\hat\mu_{n_j}(\omega), \mu_{\theta_{n_j}}) \ge PW_{p,k}(\mu_\star, \mu_{\bar\theta})$.
By Assumption 3.4 and using $\theta_{n_j} \to \bar\theta$, we have $\limsup_{j\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_{n_j}}, \hat\mu_{\theta_{n_j},m(n_j)}) \mid X^{n_j}] =$
0. Putting these pieces together yields that
\[
\liminf_{n\to+\infty} \inf_{\theta\in K} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \ge \inf_{\theta\in K} PW_{p,k}(\mu_\star, \mu_\theta).
\]
Fix an arbitrary open set $O \subseteq \Theta$. By the definition of the infimum, there exists a sequence $\theta_n' = \theta_n'(\omega) \in O$ such that $PW_{p,k}(\mu_\star, \mu_{\theta_n'}) \to \inf_{\theta\in O} PW_{p,k}(\mu_\star, \mu_\theta)$. In addition, we have
\[
\inf_{\theta\in O} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta_n',m(n)}) \mid X^n].
\]
Thus, we have
\[
\limsup_{n\to+\infty} \inf_{\theta\in O} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \limsup_{n\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta_n',m(n)}) \mid X^n] \le \limsup_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_\star) + \limsup_{n\to+\infty} PW_{p,k}(\mu_\star, \mu_{\theta_n'}) + \limsup_{n\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_n'}, \hat\mu_{\theta_n',m(n)}) \mid X^n].
\]
Since $\omega \in E$ where $\mathbb{P}(E) = 1$, Assumption 3.1 implies $\limsup_{n\to+\infty} PW_{p,k}(\hat\mu_n(\omega), \mu_\star) = 0$. By the definition of $\theta_n'$, we have $\limsup_{n\to+\infty} PW_{p,k}(\mu_\star, \mu_{\theta_n'}) = \inf_{\theta\in O} PW_{p,k}(\mu_\star, \mu_\theta)$. Using Assumption 3.4, we have $\limsup_{n\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_n'}, \hat\mu_{\theta_n',m(n)}) \mid X^n] = 0$. Putting these pieces together yields that $\limsup_{n\to+\infty} \inf_{\theta\in O} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \inf_{\theta\in O} PW_{p,k}(\mu_\star, \mu_\theta)$.

Proposition D.2 guarantees that there exists a set $E \subseteq \Omega$ with $\mathbb{P}(E) = 1$ such that, for all $\omega \in E$, the sequence of mappings $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ epiconverges to the mapping $\theta \mapsto PW_{p,k}(\mu_\star, \mu_\theta)$ as $n \to +\infty$.
Then the second statement of Theorem D.3 implies that
\[
\limsup_{n\to+\infty} \text{argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \subseteq \text{argmin}_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta). \tag{D.3}
\]
The next step is to show that, for every $\delta >$
0, there exist a compact set $B \subseteq \Theta$ and $N \in \mathbb{N}$ such that $\inf_{\theta\in B} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] + \delta$. In what follows, we prove a stronger statement: the above inequality holds true with $\delta = 0$. Indeed, by the same reasoning as for the open-set case in the proof of epiconvergence, we have
\[
\limsup_{n\to+\infty} \inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta).
\]
By Assumption 3.3 and the previous argument, $\Theta_\star(\tau)$ is nonempty and compact for some $\tau >$
0. The above inequality implies that there exists $n_1(\omega) > 0$ such that, for all $n \ge n_1(\omega)$, the set $\{\theta \in \Theta : \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \inf_{\theta'\in\Theta} PW_{p,k}(\mu_\star, \mu_{\theta'}) + \tau/2\}$ is nonempty. For any $\theta$ in this set and $n \ge n_1(\omega)$, we have
\[
PW_{p,k}(\mu_\star, \mu_\theta) \le PW_{p,k}(\mu_\star, \hat\mu_n(\omega)) + \mathbb{E}[PW_{p,k}(\mu_\theta, \hat\mu_{\theta,m(n)}) \mid X^n] + \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) + \tau/2.
\]
By Assumption 3.1, there exists $n_2(\omega) > 0$ such that, for all $n \ge n_2(\omega)$,
\[
PW_{p,k}(\mu_\star, \hat\mu_n(\omega)) \le W_p(\mu_\star, \hat\mu_n(\omega)) \le \tau/4.
\]
By Assumption 3.4, there exists $n_3(\omega) > 0$ such that, for all $n \ge n_3(\omega)$,
\[
\mathbb{E}[PW_{p,k}(\hat\mu_{\theta,m(n)}, \mu_\theta) \mid X^n] \le \mathbb{E}[W_p(\hat\mu_{\theta,m(n)}, \mu_\theta) \mid X^n] \le \tau/4.
\]
Putting these pieces together yields that, for all $n \ge \max\{n_1(\omega), n_2(\omega), n_3(\omega)\}$,
\[
PW_{p,k}(\mu_\star, \mu_\theta) \le \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta) + \tau.
\]
This implies that, for all $n \ge \max\{n_1(\omega), n_2(\omega), n_3(\omega)\}$,
\[
\Big\{\theta \in \Theta : \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \le \inf_{\theta'\in\Theta} PW_{p,k}(\mu_\star, \mu_{\theta'}) + \tau/2\Big\} \subseteq \Theta_\star(\tau).
\]
Therefore, we have $\inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] = \inf_{\theta\in\Theta_\star(\tau)} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$. This together with the compactness of $\Theta_\star(\tau)$ yields the desired result.

The first statement of Theorem D.3 implies that
\[
\inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n] \to \inf_{\theta\in\Theta} PW_{p,k}(\mu_\star, \mu_\theta), \quad \text{as } n \to +\infty. \tag{D.4}
\]
By Assumption 3.2 and Lemma D.6, the mapping $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ is lower semi-continuous. Theorem D.1 implies that $\text{argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ is nonempty for all $n \ge \max\{n_1(\omega), n_2(\omega), n_3(\omega)\}$. Together with Eq.
(D.3) and Eq. (D.4), this yields the desired results. Finally, we remark that these results hold true for $\delta_n\text{-argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ with $\delta_n \to$
0. For Eq. (D.3) and Eq. (D.4), the analogous results can be derived by using the second and third statements of Theorem D.3. To show that $\delta_n\text{-argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$ is nonempty, we notice that it contains the nonempty set $\text{argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n(\omega), \hat\mu_{\theta,m(n)}) \mid X^n]$.

D.4 Proof of Theorem 3.11
We first prove that $\text{argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta) \neq \emptyset$. Indeed, by Assumption 3.2 and Theorem 3.3, the mapping $\theta \mapsto PW_{p,k}(\hat\mu_n, \mu_\theta)$ is lower semi-continuous. By Assumption 3.5, the set $\Theta_n(\tau)$ is bounded for some $\tau >$
0. By the definition of the infimum, there exists $\theta_n' \in \Theta$ such that $PW_{p,k}(\hat\mu_n, \mu_{\theta_n'}) \le \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta) + \tau/$
2. This implies that $\theta_n' \in \Theta_n(\tau)$, so $\Theta_n(\tau)$ is nonempty. By the lower semi-continuity of the mapping $\theta \mapsto PW_{p,k}(\hat\mu_n, \mu_\theta)$, the set $\Theta_n(\tau)$ is closed. Putting these pieces together yields that $\Theta_n(\tau)$ is compact. Therefore, we conclude the desired result from Theorem D.1.

Then we show that the sequence of mappings $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ epiconverges to $\theta \mapsto PW_{p,k}(\hat\mu_n, \mu_\theta)$ as $m \to +\infty$. Indeed, it suffices to verify the conditions in Proposition D.2.

Fix an arbitrary compact set $K \subseteq \Theta$. By Assumption 3.2 and Lemma D.6, the mapping $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ is lower semi-continuous. Then Theorem D.1 implies that
\[
\inf_{\theta\in K} \mathbb{E}\big[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n\big] = \mathbb{E}\big[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta_m,m}) \mid X^n\big]
\]
for some sequence $\theta_m \in K$. Thus, we have
\[
\liminf_{m\to+\infty} \inf_{\theta\in K} \mathbb{E}\big[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n\big] = \liminf_{m\to+\infty} \mathbb{E}\big[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta_m,m}) \mid X^n\big].
\]
Following the same approach as in the proof of Theorem 3.9, there exists a subsequence of $\{\theta_m\}_{m\in\mathbb{N}}$, denoted by $\{\theta_{m_j}\}_{j\in\mathbb{N}}$ with limit $\bar\theta \in K$, such that
\[
\liminf_{m\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta_m,m}) \mid X^n] = \lim_{j\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta_{m_j},m_j}) \mid X^n] \ge \liminf_{j\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_n, \mu_{\theta_{m_j}})] - \limsup_{j\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_{m_j}}, \hat\mu_{\theta_{m_j},m_j}) \mid X^n].
\]
Assumptions 3.1 and 3.2 imply $\mu_{\theta_{m_j}} \Rightarrow \mu_{\bar\theta}$. Together with the lower semi-continuity of the PRW distance, this yields that $\liminf_{j\to+\infty} PW_{p,k}(\hat\mu_n, \mu_{\theta_{m_j}}) \ge PW_{p,k}(\hat\mu_n, \mu_{\bar\theta})$. By Assumption 3.4 and using $\theta_{m_j} \to \bar\theta$, we have $\limsup_{j\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_{m_j}}, \hat\mu_{\theta_{m_j},m_j}) \mid X^n] = 0$.
Thus, we conclude that
\[
\liminf_{m\to+\infty} \inf_{\theta\in K} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \ge \inf_{\theta\in K} PW_{p,k}(\hat\mu_n, \mu_\theta).
\]
Fix an arbitrary open set $O \subseteq \Theta$. By the definition of the infimum, there exists a sequence $\theta_m' \in O$ such that $PW_{p,k}(\hat\mu_n, \mu_{\theta_m'}) \to \inf_{\theta\in O} PW_{p,k}(\hat\mu_n, \mu_\theta)$. In addition, we have
\[
\inf_{\theta\in O} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta_m',m}) \mid X^n].
\]
Thus, we have
\[
\limsup_{m\to+\infty} \inf_{\theta\in O} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \limsup_{m\to+\infty} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta_m',m}) \mid X^n] \le \limsup_{m\to+\infty} PW_{p,k}(\hat\mu_n, \mu_{\theta_m'}) + \limsup_{m\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_m'}, \hat\mu_{\theta_m',m}) \mid X^n].
\]
By the definition of $\theta_m'$, we have $\limsup_{m\to+\infty} PW_{p,k}(\hat\mu_n, \mu_{\theta_m'}) = \inf_{\theta\in O} PW_{p,k}(\hat\mu_n, \mu_\theta)$. Using Assumption 3.4, we have $\limsup_{m\to+\infty} \mathbb{E}[PW_{p,k}(\mu_{\theta_m'}, \hat\mu_{\theta_m',m}) \mid X^n] = 0$. Putting these pieces together yields that $\limsup_{m\to+\infty} \inf_{\theta\in O} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \inf_{\theta\in O} PW_{p,k}(\hat\mu_n, \mu_\theta)$.

Proposition D.2 guarantees that the sequence of mappings $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ epiconverges to the mapping $\theta \mapsto PW_{p,k}(\hat\mu_n, \mu_\theta)$ as $m \to +\infty$. Then the second statement of Theorem D.3 implies that
\[
\limsup_{m\to+\infty} \text{argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \subseteq \text{argmin}_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta). \tag{D.5}
\]
The next step is to show that, for every $\delta >$
0, there exist a compact set $B \subseteq \Theta$ and $N \in \mathbb{N}$ such that $\inf_{\theta\in B} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] + \delta$. In what follows, we prove a stronger statement: the above inequality holds true with $\delta = 0$. Indeed, by the same reasoning as for the open-set case in the proof of epiconvergence, we have
\[
\limsup_{m\to+\infty} \inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta).
\]
By Assumption 3.5 and the previous argument, $\Theta_n(\tau)$ is nonempty and compact for some $\tau > 0$. The above inequality implies that there exists $m_1 > 0$ such that, for all $m \ge m_1$, the set $\{\theta \in \Theta : \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta) + \tau/2\}$ is nonempty. For any $\theta$ in this set and $m \ge m_1$, we have
\[
PW_{p,k}(\hat\mu_n, \mu_\theta) \le \mathbb{E}[PW_{p,k}(\hat\mu_{\theta,m}, \mu_\theta) \mid X^n] + \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta) + \tau/2.
\]
By Assumption 3.4, there exists $m_2 > 0$ such that, for all $m \ge m_2$,
\[
\mathbb{E}[PW_{p,k}(\hat\mu_{\theta,m}, \mu_\theta) \mid X^n] \le \mathbb{E}[W_p(\hat\mu_{\theta,m}, \mu_\theta) \mid X^n] \le \tau/2.
\]
Putting these pieces together yields that $PW_{p,k}(\hat\mu_n, \mu_\theta) \le \inf_{\theta'\in\Theta} PW_{p,k}(\hat\mu_n, \mu_{\theta'}) + \tau$ for all $m \ge \max\{m_1, m_2\}$. This implies that, for all $m \ge \max\{m_1, m_2\}$,
\[
\Big\{\theta \in \Theta : \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \le \inf_{\theta'\in\Theta} PW_{p,k}(\hat\mu_n, \mu_{\theta'}) + \tau/2\Big\} \subseteq \Theta_n(\tau).
\]
Therefore, we have $\inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] = \inf_{\theta\in\Theta_n(\tau)} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$. This together with the compactness of $\Theta_n(\tau)$ yields the desired result.

The first statement of Theorem D.3 implies that
\[
\inf_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n] \to \inf_{\theta\in\Theta} PW_{p,k}(\hat\mu_n, \mu_\theta), \quad \text{as } m \to +\infty. \tag{D.6}
\]
By Assumption 3.2 and Lemma D.6, the mapping $\theta \mapsto \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ is lower semi-continuous. Theorem D.1 implies that $\text{argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ is nonempty for all $m \ge \max\{m_1, m_2\}$. Together with Eq. (D.5) and Eq. (D.6), this yields the desired results.

Finally, we remark that these results hold true for $\delta_n\text{-argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ with $\delta_n \to$
0. For Eq. (D.5) and Eq. (D.6), the analogous results can be derived by using the second and third statements of Theorem D.3. To show that $\delta_n\text{-argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$ is nonempty, we notice that it contains the nonempty set $\text{argmin}_{\theta\in\Theta} \mathbb{E}[PW_{p,k}(\hat\mu_n, \hat\mu_{\theta,m}) \mid X^n]$.

D.5 Proof of Lemma D.6
Since $\nu_i \Rightarrow \nu \in \mathcal{P}(\mathbb{R}^d)$ and $\mathbb{R}^d$ is separable, the Skorokhod representation theorem (cf. Theorem D.4) implies that there exist $m$ sequences of random variables $\{\{Z_i^k\}_{i\in\mathbb{N}}, k \in [m]\}$ and $m$ random variables $\{Z^k, k \in [m]\}$ such that the distribution of $Z_i^k$ is $\nu_i$, the distribution of $Z^k$ is $\nu$, and $\{Z_i^k\}_{i\in\mathbb{N}}$ converges to $Z^k$ almost surely for all $k \in [m]$.

Suppose that $\hat\nu_{i,m} = (1/m)\sum_{k=1}^m \delta_{Z_i^k}$ and $\hat\nu_m = (1/m)\sum_{k=1}^m \delta_{Z^k}$. We proceed to the key part of the proof and show that $\{\hat\nu_{i,m}\}_{i\in\mathbb{N}}$ weakly converges to $\hat\nu_m$. Indeed, it suffices to consider the deterministic case where $\hat\nu_{i,m} = (1/m)\sum_{k=1}^m \delta_{z_i^k}$ and $\hat\nu_m = (1/m)\sum_{k=1}^m \delta_{z^k}$, where $\{\{z_i^k\}_{i\in\mathbb{N}}, k \in [m]\}$ and $\{z^k, k \in [m]\}$ are all deterministic and satisfy $\lim_{i\to+\infty}(\max_{k\in[m]} \|z_i^k - z^k\|) = 0$. Since the Wasserstein distance metrizes weak convergence (cf. Theorem B.4), we only need to show that $\lim_{i\to+\infty} W_1(\hat\nu_{i,m}, \hat\nu_m) = 0$. By the definition of the Wasserstein distance, $\{\hat\nu_{i,m}\}_{i\in\mathbb{N}}$ and $\hat\nu_m$, we have
\[
W_1(\hat\nu_{i,m}, \hat\nu_m) \le \max_{k\in[m]} \|z_i^k - z^k\|.
\]
Putting these pieces together yields that $\{\hat\nu_{i,m}\}_{i\in\mathbb{N}}$ weakly converges to $\hat\nu_m$ almost surely.

Finally, we conclude from the lower semi-continuity of the PRW distance (cf. Theorem 3.3) and Fatou's lemma that
\[
\mathbb{E}[PW_{p,k}(\mu, \hat\nu_m)] \le \mathbb{E}\Big[\liminf_{i\to+\infty} PW_{p,k}(\mu_i, \hat\nu_{i,m})\Big] \le \liminf_{i\to+\infty} \mathbb{E}[PW_{p,k}(\mu_i, \hat\nu_{i,m})].
\]
This completes the proof.
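The key inequality in this proof, that the Wasserstein distance between two equally weighted empirical measures is bounded by the maximal atom displacement, can be verified numerically in one dimension (a sketch using SciPy's one-dimensional $W_1$; the atom count and perturbation scale are arbitrary):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
m = 20
z = rng.standard_normal(m)                 # atoms of nu_hat_m
z_i = z + 0.05 * rng.standard_normal(m)    # perturbed atoms of nu_hat_{i,m}

lhs = wasserstein_distance(z_i, z)         # W_1 between the two empirical measures
rhs = np.max(np.abs(z_i - z))              # maximal atom displacement
# The coupling that matches z_i[k] to z[k] certifies W_1 <= max_k |z_i[k] - z[k]|.
assert lhs <= rhs + 1e-12
```

As the atoms $z_i^k$ converge to $z^k$, the right-hand side vanishes, which is exactly how the proof deduces weak convergence of the empirical measures.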
D.6 Proof of Theorem A.1
Using Assumption 3.2 and Theorem 3.3, the mapping $(\mu, \theta) \mapsto PW_{p,k}(\mu, \mu_\theta)$ is lower semi-continuous on $\mathcal{P}(\mathbb{R}^d) \times \Theta$. It remains to verify that the conditions of Theorem D.5 are satisfied. We notice that the empirical measure $\hat{\mu}_n(\omega)$ depends on $\omega \in \Omega$ only through $X_n \in \otimes_{i=1}^n \mathbb{R}^d$. Thus, we can write $\hat{\mu}_n(\omega) = \hat{\mu}_n(x)$ as a function on $\otimes_{i=1}^n \mathbb{R}^d$. Let $D = (\otimes_{i=1}^n \mathbb{R}^d) \times \Theta$; it is a Borel subset of $(\otimes_{i=1}^n \mathbb{R}^d) \times \mathbb{R}$. Since $\mathbb{R}^d$ is a Polish space, $\mathbb{R}^d \times \ldots \times \mathbb{R}^d$ endowed with the product topology is a Polish space. Moreover, $D_x$ is $\sigma$-compact for any $x \in \mathrm{proj}(D)$ since $D_x \subseteq \Theta$ and $\Theta$ is $\sigma$-compact.

Define $f(x, \theta) = PW_{p,k}(\hat{\mu}_n(x), \mu_\theta)$. We claim that $f$ is measurable on $D$ and $f(x, \cdot)$ is lower semi-continuous on $D_x$. Indeed, we have shown that the mapping $(\mu, \theta) \mapsto PW_{p,k}(\mu, \mu_\theta)$ is lower semi-continuous, and thus measurable, on $\mathcal{P}(\mathbb{R}^d) \times \Theta$. The mapping $x \mapsto \hat{\mu}_n(x)$ is measurable on $\otimes_{i=1}^n \mathbb{R}^d$. Since the composition of measurable functions is measurable, $f$ is measurable on $D$. Moreover, for any $x \in \otimes_{i=1}^n \mathbb{R}^d$, $f(x, \cdot)$ is lower semi-continuous on $D_x$ since the mapping $(\mu, \theta) \mapsto PW_{p,k}(\mu, \mu_\theta)$ is lower semi-continuous on $D$. Putting these pieces together yields the desired results.

D.7 Proof of Theorem A.2
Using Assumption 3.2 and Lemma D.6, the mapping $(\nu, \theta) \mapsto \mathbb{E}[PW_{p,k}(\nu, \hat{\mu}_{\theta,m}) \mid X_n]$ is lower semi-continuous on $\mathcal{P}(\mathbb{R}^d) \times \Theta$. The proof then proceeds along the same lines as the proof of Theorem A.1, using this result together with Theorem D.5.
E Postponed Proofs in Subsection 3.4
In this section, we provide detailed proofs for Theorem 3.12 and Theorem A.3. Our derivation refines the analysis in Bernton et al. [2019] for minimum Wasserstein estimators.
E.1 Preliminary technical results
To facilitate reading, we collect several preliminary technical results which will be used in the postponed proofs of Subsection 3.4. Let $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$ be a normed linear space and let $\theta \mapsto f_\theta$ be a map from a subset $\Theta$ of $\mathbb{R}^d$ into $\mathcal{X}$. The statistical information comes from a sequence $\{f_n\}_{n \in \mathbb{N}}$ of random elements of $\mathcal{X}$, each of which is assumed to be measurable with respect to the $\sigma$-algebra generated by the balls in $\mathcal{X}$. In some sense, $f_n$ should converge to $f_{\theta^\star}$, where $\theta^\star$ is some fixed (but unknown) point in the interior of $\Theta$. To avoid abuse of notation, we write $K(x, \beta)$ here.

Theorem E.1 (Theorem 4.2 in Pollard [1980])
Suppose the following assumptions hold:
1. $\inf_{\theta \notin N} \|f_\theta - f_{\theta^\star}\|_{\mathcal{X}} > 0$ for every neighborhood $N$ of $\theta^\star$.
2. $\theta \mapsto f_\theta$ is norm differentiable with non-singular derivative $D_{\theta^\star}$ at $\theta^\star$.
3. There exists a random element $G^\star \in \mathcal{X}$ for which $G_n := \sqrt{n}(f_n - f_{\theta^\star}) \Rightarrow G^\star$ in the sense of the metric induced by the norm $\|\cdot\|_{\mathcal{X}}$.
Then the limiting distribution of the goodness-of-fit statistic is given by
$$\sqrt{n}\inf_{\theta \in \Theta} \|f_n - f_\theta\|_{\mathcal{X}} \Rightarrow \inf_{\theta \in \Theta} \|G^\star - \langle \theta, D_{\theta^\star}\rangle\|_{\mathcal{X}}.$$
Let $K(x, \beta) = \{\theta : \|x - \langle \theta, D_{\theta^\star}\rangle\|_{\mathcal{X}} \leq \inf_{\theta' \in \Theta} \|x - \langle \theta', D_{\theta^\star}\rangle\|_{\mathcal{X}} + \beta\}$ and let $M_n$ be defined by
$$M_n = \Big\{\theta \in \Theta : \|f_n - f_\theta\|_{\mathcal{X}} \leq \inf_{\theta' \in \Theta} \|f_n - f_{\theta'}\|_{\mathcal{X}} + \eta_n/\sqrt{n}\Big\},$$
where $\eta_n > 0$ satisfies $\mathbb{P}(\eta_n \to 0) = 1$ and $M_n$ is nonempty.

Theorem E.2 (Theorem 7.2 in Pollard [1980])
Under the conditions of Theorem E.1, there exists a sequence of real numbers $\beta_n \downarrow 0$ satisfying
$$\mathbb{P}^\star\big(M_n \subseteq \theta^\star + n^{-1/2} K(G_n, \beta_n)\big) \to 1, \quad \text{as } n \to +\infty.$$
Moreover, for any $\epsilon > 0$, we have $\mathbb{P}(d_H(K(G_n^\star, 0), K(G_n, \beta_n)) < \epsilon) \to 1$ as $n \to +\infty$.

E.2 Proof of Theorem 3.12
First, we show that $M_n \subseteq N$ with (inner) probability approaching 1 as $n \to +\infty$. Indeed, with inner probability approaching 1, we have
$$\mathrm{argmin}_{\theta \in \Theta}\, PW_{1,1}(\hat{\mu}_n, \mu_\theta) \subseteq \mathrm{argmin}_{\theta \in \Theta}\, PW_{1,1}(\mu^\star, \mu_\theta).$$
By the definition of $PW_{1,1}$, we conclude that any minimizer of $\|\hat{F}_n - F_\theta\|_{\mathcal{L}}$ will be included in the set of minimizers of $\|F^\star - F_\theta\|_{\mathcal{L}}$ with inner probability approaching 1. By Assumption 3.8, the minimizer of $\|F^\star - F_\theta\|_{\mathcal{L}}$ is unique and $N$ is a neighborhood of this minimizer. Putting these pieces together yields that the set $\mathrm{argmin}_{\theta \in \Theta}\, PW_{1,1}(\hat{\mu}_n, \mu_\theta)$ is contained in the set $N$ with (inner) probability approaching 1 as $n \to +\infty$. By the definition of $M_n$, we achieve the desired result.

Then we make three key claims. First, we claim that $M_n \subseteq \Theta_n$ with (inner) probability approaching 1 as $n \to +\infty$, where $\Theta_n$ is defined by
$$\Theta_n = \left\{\theta \in \Theta : \|\theta - \theta^\star\|_\Theta \leq \frac{2\sqrt{n}\,\|\hat{F}_n - F^\star\|_{\mathcal{L}} + 2\eta_n}{c^\star\sqrt{n}}\right\}.$$
Indeed, for any $\theta \in N$, we derive from the triangle inequality that
$$\|\hat{F}_n - F_\theta\|_{\mathcal{L}} - \|\hat{F}_n - F_{\theta^\star}\|_{\mathcal{L}} \geq \|F_\theta - F^\star\|_{\mathcal{L}} - \|F_{\theta^\star} - F^\star\|_{\mathcal{L}} - 2\|\hat{F}_n - F^\star\|_{\mathcal{L}}.$$
Using the definition of $PW_{1,1}$ together with Assumption 3.8, we have
$$\|\hat{F}_n - F_\theta\|_{\mathcal{L}} - \|\hat{F}_n - F_{\theta^\star}\|_{\mathcal{L}} \geq c^\star\|\theta - \theta^\star\|_\Theta - 2\|\hat{F}_n - F^\star\|_{\mathcal{L}}. \tag{E.1}$$
Since $M_n \subseteq N$ with (inner) probability approaching one, Eq. (E.1) holds true for any $\theta \in M_n$ with (inner) probability approaching one.
Moreover, by the definition of $M_n$, any $\theta \in M_n$ satisfies
$$\|\hat{F}_n - F_\theta\|_{\mathcal{L}} \leq \inf_{\theta' \in \Theta} PW_{1,1}(\hat{\mu}_n, \mu_{\theta'}) + \frac{\eta_n}{\sqrt{n}} \leq \|\hat{F}_n - F_{\theta^\star}\|_{\mathcal{L}} + \frac{\eta_n}{\sqrt{n}}. \tag{E.2}$$
Combining Eq. (E.1), Eq. (E.2) and the definition of $\Theta_n$, we conclude that $\theta \in \Theta_n$ if $\theta \in M_n$ with (inner) probability approaching 1. This completes the proof of the first claim.

Second, we claim that $\mathrm{argmin}_{\theta' \in N}\, \|G_n - \langle\sqrt{n}(\theta' - \theta^\star), D_{\theta^\star}\rangle\|_{\mathcal{L}} \subseteq N \cap \Theta_n$ with (inner) probability approaching 1 as $n \to +\infty$. Indeed, by the definition of $G_n$, we have
$$\|G_n - \langle\sqrt{n}(\theta - \theta^\star), D_{\theta^\star}\rangle\|_{\mathcal{L}} = \sqrt{n}\,\|\hat{F}_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}.$$
For simplicity of notation, we let $R_\theta = F_\theta - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle$. By Assumption 3.6, we have $\|R_\theta\|_{\mathcal{L}} = o(\|\theta - \theta^\star\|_\Theta)$. By the definition of $N$, we have $\|R_\theta\|_{\mathcal{L}} \leq (1/2)c^\star\|\theta - \theta^\star\|_\Theta$. Therefore, for any $\theta \in N$, we have
$$\|\hat{F}_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}} \geq \|\hat{F}_n - F_\theta\|_{\mathcal{L}} - \|R_\theta\|_{\mathcal{L}} \overset{\text{Eq. (E.1)}}{\geq} \|\hat{F}_n - F_{\theta^\star}\|_{\mathcal{L}} + (1/2)c^\star\|\theta - \theta^\star\|_\Theta - 2\|\hat{F}_n - F^\star\|_{\mathcal{L}}.$$
This implies that, for any $\theta \in N \setminus \Theta_n$, we have
$$\|\hat{F}_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}} \geq \|\hat{F}_n - F_{\theta^\star}\|_{\mathcal{L}} \geq \inf_{\theta' \in N \cap \Theta_n} \|\hat{F}_n - F_{\theta^\star} - \langle\theta' - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}.$$
This completes the proof of the second claim.

Third, we claim that there is a uniform control over the difference between $\theta \mapsto \sqrt{n}\|\hat{F}_n - F_\theta\|_{\mathcal{L}}$ and the convex map $\theta \mapsto \|G_n - \sqrt{n}\langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}$ over the set $\Theta_n$ with (inner) probability approaching 1 as $n \to +\infty$. Indeed, we define
$$\Gamma_n = \sup_{\theta \in \Theta_n} \big|\sqrt{n}\|\hat{F}_n - F_\theta\|_{\mathcal{L}} - \|G_n - \sqrt{n}\langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}\big|.$$
By the definition of $G_n$, we have
$$\Gamma_n = \sup_{\theta \in \Theta_n} \big|\sqrt{n}\|\hat{F}_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle - R_\theta\|_{\mathcal{L}} - \sqrt{n}\|\hat{F}_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}\big| = o\Big(\sup_{\theta \in \Theta_n} \sqrt{n}\|\theta - \theta^\star\|_\Theta\Big) = o\big(\sqrt{n}\|\hat{F}_n - F^\star\|_{\mathcal{L}}\big).$$
By Assumption 3.7, we have $\Gamma_n \to 0$ in probability as $n \to +\infty$. This completes the proof of the third claim.

By the definition of $G_n$ and $G_n^\star$, we have $\|G_n - G_n^\star\|_{\mathcal{L}} = \|\sqrt{n}(\hat{F}_n - F^\star) - G^\star\|_{\mathcal{L}}$. By Assumption 3.7, there exists a sequence $\tau_n \to 0$ such that $\mathbb{P}(\|G_n - G_n^\star\|_{\mathcal{L}} > \tau_n) \to 0$. By the definition of $\Gamma_n$ and $\eta_n$, there exist two sequences $\tau_n' \to 0$ and $\tau_n'' \to 0$ such that $\mathbb{P}(\Gamma_n > \tau_n') \to 0$ and $\mathbb{P}(\eta_n > \tau_n'') \to 0$. Letting $\beta_n = \max\{2\tau_n, 2\tau_n' + \tau_n''\}$, we have $\beta_n \to 0$ as $n \to +\infty$.

It remains to show that $M_n \subseteq K(G_n, \beta_n)$ with (inner) probability approaching 1 as $n \to +\infty$. Indeed, we have
$$\inf_{\theta' \in N} \|G_n - \langle\sqrt{n}(\theta' - \theta^\star), D_{\theta^\star}\rangle\|_{\mathcal{L}} \geq \inf_{\theta' \in N} \sqrt{n}\|\hat{F}_n - F_{\theta'}\|_{\mathcal{L}} - \tau_n'.$$
By the definition of $M_n$, for any $\theta \in M_n$ the above inequality implies
$$\inf_{\theta' \in N} \|G_n - \langle\sqrt{n}(\theta' - \theta^\star), D_{\theta^\star}\rangle\|_{\mathcal{L}} \geq \sqrt{n}\|\hat{F}_n - F_\theta\|_{\mathcal{L}} - \tau_n' - \tau_n''.$$
Since $M_n \subseteq \Theta_n$ with (inner) probability approaching 1 as $n \to +\infty$, we have
$$\sqrt{n}\|\hat{F}_n - F_\theta\|_{\mathcal{L}} \geq \|G_n - \langle\sqrt{n}(\theta - \theta^\star), D_{\theta^\star}\rangle\|_{\mathcal{L}} - \tau_n'.$$
Putting these pieces together with $\beta_n \geq 2\tau_n' + \tau_n''$ yields that $\theta \in K(G_n, \beta_n)$.

Finally, let $\epsilon > 0$; we prove that $\mathbb{P}(d_H(K(G_n^\star, 0), K(G_n, \beta_n)) < \epsilon) \to 1$ as $n \to +\infty$. Indeed, by the triangle inequality, $\theta \in K(G_n^\star, 0)$ implies $\theta \in K(G_n, 2\|G_n - G_n^\star\|_{\mathcal{L}})$. Therefore, we conclude that $K(G_n^\star, 0) \subseteq K(G_n, \beta_n)$ with (inner) probability approaching one as $n \to +\infty$. On the other hand, $\theta \in K(G_n, \beta_n)$ implies $\theta \in K(G_n^\star, \beta_n + 2\|G_n - G_n^\star\|_{\mathcal{L}})$. By the definition of $\beta_n$, $G_n$ and $G_n^\star$, we obtain that $\beta_n + 2\|G_n - G_n^\star\|_{\mathcal{L}} \to 0$ in probability as $n \to +\infty$. By the definition of the Hausdorff metric, we conclude the desired result.

E.3 Proof of Theorem A.3
Unlike the proof of Theorem 3.12, the proof of Theorem A.3 is relatively straightforward and is based on Theorems E.1 and E.2. This is mostly because there exists $\theta^\star$ in the interior of $\Theta$ such that $F^\star = F_{\theta^\star}$. More specifically, we consider $f_\theta = F_\theta$ and $f_n = \hat{F}_n$, where
$$F_\theta(u, t) = \int_{\mathbb{R}^d} \mathbb{1}_{(-\infty, t]}(\langle u, x\rangle)\, d\mu_\theta(x), \qquad \hat{F}_n(u, t) = (1/n)\,|\{i \in [n] : \langle u, X_i\rangle \leq t\}|.$$
Let $\mathcal{X} = L^1(\mathbb{S}^{d-1} \times \mathbb{R})$ and $\|\cdot\|_{\mathcal{X}} = \|\cdot\|_{\mathcal{L}}$; we can check that $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$ is a normed linear space. By the definition of $PW_{1,1}$, we have $PW_{1,1}(\hat{\mu}_n, \mu_\theta) = \|\hat{F}_n - F_\theta\|_{\mathcal{X}}$. By Assumption 3.1, $\hat{F}_n$ converges to $F^\star$. Moreover, in the well-specified setting, $F^\star = F_{\theta^\star}$, where $\theta^\star$ is some fixed (but unknown) point in the interior of $\Theta$. Now we are ready to check the conditions of Theorem E.1.

First, Assumption A.1 and $PW_{1,1}(\hat{\mu}_n, \mu_\theta) = \|\hat{F}_n - F_\theta\|_{\mathcal{X}}$ imply C1. Furthermore, by the definition of norm differentiability, Assumption 3.6 and Assumption A.2 imply C2. Finally, Assumption 3.7 and $F^\star = F_{\theta^\star}$ imply C3. Therefore, we conclude from Theorem E.1 that
$$\sqrt{n}\inf_{\theta \in \Theta} PW_{1,1}(\hat{\mu}_n, \mu_\theta) = \sqrt{n}\inf_{\theta \in \Theta} \|\hat{F}_n - F_\theta\|_{\mathcal{L}} \Rightarrow \inf_{t \in \Theta} \|G^\star - \langle t, D_{\theta^\star}\rangle\|_{\mathcal{L}},$$
in the sense of the metric induced by the norm $\|\cdot\|_{\mathcal{L}}$. This together with the definition of the norm $\|\cdot\|_{\mathcal{L}}$ implies the desired result for the goodness-of-fit statistics.

On the other hand, Theorem E.2 can be applied with a specific choice of $\eta_n$. More specifically, we notice that the estimator $\hat{\theta}_n$ is well defined by
$$\hat{\theta}_n := \mathrm{argmin}_{\theta \in \Theta}\, PW_{1,1}(\hat{\mu}_n, \mu_\theta) = \mathrm{argmin}_{\theta \in \Theta}\, \|\hat{F}_n - F_\theta\|_{\mathcal{L}}.$$
Letting $\eta_n = 0$, the set $M_n = \{\hat{\theta}_n\}$ is a singleton. This implies that $\sqrt{n}(\hat{\theta}_n - \theta^\star) \Rightarrow K(G^\star, 0)$ as $n \to +\infty$ under the Hausdorff metric topology. Since the random map $\theta \mapsto \max_{u \in \mathbb{S}^{d-1}} \int_{\mathbb{R}} |G^\star(u, t) - \langle\theta, D^\star(u, t)\rangle|\, dt$ has a unique infimum almost surely, $K(G^\star, 0)$ is a singleton defined by
$$K(G^\star, 0) = \mathrm{argmin}_{\theta \in \Theta}\, \max_{u \in \mathbb{S}^{d-1}} \int_{\mathbb{R}} |G^\star(u, t) - \langle\theta, D^\star(u, t)\rangle|\, dt.$$
In this case, the Hausdorff metric is simply induced by the norm $\|\cdot\|_{\mathcal{L}}$. Putting these pieces together yields the desired result for the MPRW estimator of order 1.

E.4 Minor Technical Issues

We use the notation of Bernton et al. [2019, Theorem B.8] throughout this subsection. On pages 38-39 of the recent arXiv version of Bernton et al. [2019], the authors prove that $m(H_n) = \inf_{u \in L_n} f(H_n, u)$, implicitly assuming that the minimizer of the map $\theta \mapsto \sqrt{n}\|F_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}$ is contained in the set $\{\theta \in N : \|\theta - \theta^\star\|_{\mathcal{H}} \leq c^\star/2\}$. However, this result is not obvious. Indeed, it seems difficult to derive it from the existing fact that the minimizer of $\theta \mapsto \sqrt{n}\|F_n - F_\theta\|_{\mathcal{L}}$ is contained in $N$. We only have uniform control over the difference between $\theta \mapsto \sqrt{n}\|F_n - F_\theta\|_{\mathcal{L}}$ and $\theta \mapsto \sqrt{n}\|F_n - F_{\theta^\star} - \langle\theta - \theta^\star, D_{\theta^\star}\rangle\|_{\mathcal{L}}$ over the set $S_n$ instead of the whole set, so there is little relationship between the minimizers of these two mappings. Moreover, the techniques from the proof of Pollard [1980, Theorem 7.2] cannot be applied to fix this issue here, since that proof depends on the assumption $\mu^\star = \mu_{\theta^\star}$, which does not hold under model misspecification.

F Computational Aspects
The computation of the PRW distance is in general intractable when the projection dimension is $k \geq 2$. In this section, we describe a practical scheme for approximately computing $PW_{2,k}(\hat{\mu}_n, \hat{\nu}_n)$ when the projection dimension is $k \geq 2$. Part of the results can be found in the appendix of the concurrent work [Lin et al., 2020]; we provide the details for the sake of completeness.
Approximation of $PW_{2,k}$. We consider the computation of $PW_{2,k}$ between empirical measures. Indeed, let $\{x_1, x_2, \ldots, x_n\} \subseteq \mathbb{R}^d$ and $\{y_1, y_2, \ldots, y_n\} \subseteq \mathbb{R}^d$ denote sets of $n$ atoms, and let $(r_1, r_2, \ldots, r_n) \in \Delta^n$ and $(c_1, c_2, \ldots, c_n) \in \Delta^n$ denote weight vectors. We define the discrete measures $\hat{\mu}_n := \sum_{i=1}^n r_i \delta_{x_i}$ and $\hat{\nu}_n := \sum_{j=1}^n c_j \delta_{y_j}$. The computation of $PW_{2,k}(\hat{\mu}_n, \hat{\nu}_n)$ is equivalent to solving a structured max-min optimization model, where the maximization and minimization are performed over the Stiefel manifold $\mathrm{St}(d, k) := \{U \in \mathbb{R}^{d \times k} \mid U^\top U = I_k\}$ and the transportation polytope $\Pi(\mu, \nu) := \{\pi \in \mathbb{R}^{n \times n}_+ \mid r(\pi) = r,\ c(\pi) = c\}$, respectively. Formally, we have
$$\max_{U \in \mathbb{R}^{d \times k}}\ \min_{\pi \in \mathbb{R}^{n \times n}_+}\ \sum_{i=1}^n \sum_{j=1}^n \pi_{i,j} \|U^\top x_i - U^\top y_j\|^2 \quad \text{s.t.} \quad U^\top U = I_k,\ r(\pi) = r,\ c(\pi) = c. \tag{F.1}$$
Eq. (F.1) is equivalent to the following non-convex nonsmooth optimization model:
$$\max_{U \in \mathrm{St}(d,k)}\ f(U) := \min_{\pi \in \Pi(\mu,\nu)} \sum_{i=1}^n \sum_{j=1}^n \pi_{i,j} \|U^\top x_i - U^\top y_j\|^2. \tag{F.2}$$
For fixed $U \in \mathrm{St}(d, k)$, Eq. (F.2) becomes a classical OT problem which can be solved either by the Sinkhorn iteration [Cuturi, 2013] or by the variant of the network simplex method in the POT package [Flamary and Courty, 2017]. The key challenge is the maximization over the Stiefel manifold $\mathrm{St}(d, k)$.

Eq. (F.2) is a special instance of a Stiefel manifold optimization problem. The dimension of $\mathrm{St}(d, k)$ is equal to $dk - k(k+1)/2$, and the tangent space at $Z \in \mathrm{St}(d, k)$ is defined below.
Algorithm 1: Riemannian SuperGradient Ascent with Network Simplex Iteration (RSGAN)

Input: measures $\{(x_i, r_i)\}_{i \in [n]}$ and $\{(y_j, c_j)\}_{j \in [n]}$, dimension $k$ and tolerance $\epsilon$.
Initialize: $U_0 \in \mathrm{St}(d, k)$ and $\gamma_0 > 0$.
for $t = 0, 1, 2, \ldots, T - 1$ do
  Compute $\pi_{t+1} \leftarrow \mathrm{OT}(\{(x_i, r_i)\}_{i \in [n]}, \{(y_j, c_j)\}_{j \in [n]}, U_t)$.
  Compute $\xi_{t+1} \leftarrow P_{T_{U_t}\mathrm{St}}(2 V_{\pi_{t+1}} U_t)$.
  Compute $\gamma_{t+1} \leftarrow \gamma_0/\sqrt{t+1}$.
  Compute $U_{t+1} \leftarrow \mathrm{Retr}_{U_t}(\gamma_{t+1} \xi_{t+1})$.
end for

The tangent space to $\mathrm{St}(d, k)$ at $Z$ is $T_Z\mathrm{St} := \{\xi \in \mathbb{R}^{d \times k} : \xi^\top Z + Z^\top \xi = 0\}$. We endow $\mathrm{St}(d, k)$ with the Riemannian metric inherited from the Euclidean inner product $\langle X, Y\rangle$ for any
$X, Y \in T_Z\mathrm{St}$ and $Z \in \mathrm{St}(d, k)$. The projection of any $G \in \mathbb{R}^{d \times k}$ onto $T_Z\mathrm{St}$ is given by Absil et al. [2009, Example 3.6.2]:
$$P_{T_Z\mathrm{St}}(G) = G - Z(G^\top Z + Z^\top G)/2.$$
We also make use of a retraction, which is a first-order approximation of the exponential mapping on the manifold and which is amenable to computation [Absil et al., 2009, Definition 4.1.1]. For the Stiefel manifold, we have the following definition:

Definition F.1
A retraction on $\mathrm{St} \equiv \mathrm{St}(d, k)$ is a smooth mapping $\mathrm{Retr} : T\mathrm{St} \to \mathrm{St}$ from the tangent bundle $T\mathrm{St}$ onto $\mathrm{St}$ such that the restriction of $\mathrm{Retr}$ to $T_Z\mathrm{St}$, denoted by $\mathrm{Retr}_Z$, satisfies: (i) $\mathrm{Retr}_Z(0) = Z$ for all $Z \in \mathrm{St}$, where $0$ denotes the zero element of $T\mathrm{St}$; and (ii) for any $Z \in \mathrm{St}$, it holds that
$$\lim_{\xi \in T_Z\mathrm{St},\ \xi \to 0} \frac{\|\mathrm{Retr}_Z(\xi) - (Z + \xi)\|_F}{\|\xi\|_F} = 0.$$
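One standard retraction satisfying Definition F.1 is based on the QR decomposition. The following is a minimal numpy sketch (not the authors' code) of the tangent-space projection above and of a QR-based retraction, applied to one supergradient-ascent step; the random matrix `G` is a stand-in for the actual supergradient $2V_\pi U$:

```python
import numpy as np

def proj_tangent(Z, G):
    """Project an ambient matrix G onto the tangent space T_Z St,
    using P(G) = G - Z (G^T Z + Z^T G) / 2."""
    sym = (G.T @ Z + Z.T @ G) / 2.0
    return G - Z @ sym

def retr_qr(Z, xi):
    """QR-based retraction: the Q factor of Z + xi, with column signs
    fixed so that R has a positive diagonal (unique factorization)."""
    Q, R = np.linalg.qr(Z + xi)
    return Q * np.sign(np.sign(np.diag(R)) + 0.5)

# one Riemannian (super)gradient ascent step at the current iterate U
rng = np.random.default_rng(0)
d, k = 5, 2
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
G = rng.standard_normal((d, k))       # stand-in for 2 * V_pi @ U
xi = proj_tangent(U, G)               # Riemannian supergradient
U_next = retr_qr(U, 0.1 * xi)         # step back onto St(d, k)
```

The projected direction satisfies $\xi^\top U + U^\top \xi = 0$, and the retracted iterate has exactly orthonormal columns, which is the property that makes this retraction convenient in practice.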
Our algorithm uses the retraction based on the QR decomposition, as suggested by Liu et al. [2019]. More specifically, $\mathrm{Retr}^{qr}_Z(\xi) = \mathrm{qr}(Z + \xi)$, where $\mathrm{qr}(A)$ is the Q factor of the QR factorization of $A$.

We start with a brief overview of the Riemannian supergradient ascent algorithm for nonsmooth Stiefel optimization, denoted by $\max_{U \in \mathrm{St}(d,k)} F(U)$. A generic Riemannian supergradient ascent algorithm for solving this problem is given by
$$U_{t+1} \leftarrow \mathrm{Retr}_{U_t}(\gamma_{t+1}\xi_{t+1}) \quad \text{for any } \xi_{t+1} \in \mathrm{subdiff}\, F(U_t),$$
where $\mathrm{subdiff}\, F(U_t)$ is the Riemannian subdifferential of $F$ at $U_t$ and $\mathrm{Retr}$ is any retraction on $\mathrm{St}(d, k)$. The step size is set as $\gamma_{t+1} = \gamma_0/\sqrt{t+1}$, as suggested by Li et al. [2019]. By the definition of the Riemannian subdifferential, $\xi_t$ can be obtained by taking $\xi \in \partial F(U)$ and setting $\xi_t = P_{T_U\mathrm{St}}(\xi)$. Thus, it is necessary for us to specify the subdifferential of $f$ in Eq. (F.2). We define $V_\pi = \sum_{i=1}^n \sum_{j=1}^n \pi_{i,j}(x_i - y_j)(x_i - y_j)^\top \in \mathbb{R}^{d \times d}$, which is symmetric, and derive that
$$\partial f(U) = \mathrm{Conv}\big\{2 V_{\pi^\star} U \mid \pi^\star \in \mathrm{argmin}_{\pi \in \Pi(\mu,\nu)}\, \langle UU^\top, V_\pi\rangle\big\}, \quad \text{for any } U \in \mathbb{R}^{d \times k}.$$
It remains to solve an OT problem with a given $U$ at each inner loop of the maximization and use the output $\pi(U)$ to obtain a supergradient of $f$. The network simplex method can solve this linear program exactly. We summarize the pseudocode of the RSGAN algorithm in Algorithm 1.

Approximation of $\underline{PW}_{2,k}$. We recall the definition of the IPRW distance of order 2:
$$\underline{PW}_{2,k}^2(\mu, \nu) = \int_{\mathcal{S}_{d,k}} W_2^2(E^\star_\sharp\mu, E^\star_\sharp\nu)\, d\sigma(E),$$
where $\sigma$ is the uniform distribution on $\mathcal{S}_{d,k}$ and $E^\star$ is the linear transformation associated with $E$, defined for any $x \in \mathbb{R}^d$ by $E^\star(x) = E^\top x$. For any measurable function $f$ and $\mu \in \mathcal{P}(\mathbb{R}^d)$, we denote by $f_\sharp\mu$ the push-forward of $\mu$ by $f$, so that $f_\sharp\mu(A) = \mu(f^{-1}(A))$, where $f^{-1}(A) = \{x \in \mathbb{R}^d : f(x) \in A\}$ for any Borel set $A$.
We approximate the integral by selecting a finite set of projections $\mathcal{S} \subseteq \mathcal{S}_{d,k}$ and computing the empirical average:
$$\underline{PW}_{2,k}^2(\mu, \nu) \approx \frac{1}{\mathrm{card}(\mathcal{S})} \sum_{E \in \mathcal{S}} W_2^2(E^\star_\sharp\mu, E^\star_\sharp\nu).$$
The quality of this approximation depends on the sampling of $\mathcal{S}_{d,k}$. In this paper, we use random samples picked uniformly on $\mathcal{S}_{d,k}$, which is analogous to the approach proposed by Bonneel et al. [2015] for the case of $k = 1$; see the sampling schemes paragraph for details.
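This Monte Carlo average is straightforward to implement. The sketch below (our own illustrative code, not the paper's implementation) assumes uniform empirical measures with equally many atoms, for which an optimal transport plan is a permutation, so each projected $W_2$ can be computed exactly with an assignment solver instead of a general OT solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_uniform(X, Y):
    """Exact W2 between two uniform empirical measures with the same
    number of atoms: an optimal plan is a permutation (assignment)."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(C)
    return np.sqrt(C[rows, cols].mean())

def iprw2(X, Y, k, n_proj=100, seed=0):
    """Monte Carlo estimate of the IPRW distance of order 2: average
    squared W2 over projections drawn (approximately Haar-)uniformly
    from S_{d,k} via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        Q, R = np.linalg.qr(rng.standard_normal((d, k)))
        E = Q * np.sign(np.sign(np.diag(R)) + 0.5)  # sign fix -> Haar
        total += w2_uniform(X @ E, Y @ E) ** 2
    return np.sqrt(total / n_proj)
```

As a sanity check, the IPRW estimate between a point cloud and its translate by a vector $v$ equals $\|v\|$ when $k = d$, since every projected measure is then a translate of the other by an orthogonal image of $v$.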
Approximation of $PW_{p,1}$. We recall the definition of the PRW distance of order $p$ with projection dimension $k = 1$:
$$PW_{p,1}^p(\mu, \nu) := \sup_{u \in \mathcal{S}_{d,1}} W_p^p(u^\star_\sharp\mu, u^\star_\sharp\nu) = \sup_{u \in \mathcal{S}_{d,1}} \int_0^1 |F^{-1}_{u^\star_\sharp\mu}(t) - F^{-1}_{u^\star_\sharp\nu}(t)|^p\, dt,$$
where $u \in \mathcal{S}_{d,1}$ is a unit $d$-dimensional vector, $u^\star$ is the linear transformation associated with $u$, defined for any $x \in \mathbb{R}^d$ by $u^\star(x) = u^\top x$, and $F^{-1}_\xi$ is the quantile function of $\xi$. This integral can be estimated using a Monte Carlo estimate and a linear interpolation of the quantile function. Following Nadjahi et al. [2019, Appendix 4], we consider two approximations of this quantity. The first one is given by
$$PW_{p,1}^p(\mu, \nu) \approx \sup_{u \in \mathcal{S}_{d,1}} \frac{1}{K} \sum_{k=1}^K |\tilde{F}^{-1}_{u^\star_\sharp\mu}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\nu}(t_k)|^p, \tag{F.3}$$
where $\{t_k\}_{k=1}^K$ are uniform and independent samples from $[0, 1]$ and $\tilde{F}^{-1}_\xi$ is a linear interpolation of $F^{-1}_\xi$, which denotes either the exact quantile function of a discrete measure $\xi$ or an approximation by a Monte Carlo procedure. The second one is given by
$$PW_{p,1}^p(\mu, \nu) \approx \sup_{u \in \mathcal{S}_{d,1}} \frac{1}{K} \sum_{k=1}^K |s_k - \tilde{F}^{-1}_{u^\star_\sharp\nu}(\tilde{F}_{u^\star_\sharp\mu}(s_k))|^p, \tag{F.4}$$
where $\{s_k\}_{k=1}^K$ are independent samples from $u^\star_\sharp\mu$ and $\tilde{F}_\xi$ (resp. $\tilde{F}^{-1}_\xi$) is a linear interpolation of $F_\xi$ (resp. $F^{-1}_\xi$), which denotes either the exact cumulative distribution function (resp. quantile function) of a discrete measure $\xi$ or an approximation by a Monte Carlo procedure.

Sampling schemes.
We explain the methods used to generate i.i.d. samples from the uniform distribution on the set of $d \times k$ orthogonal matrices, i.e., $\mathcal{S}_{d,k} = \{E \in \mathbb{R}^{d \times k} : E^\top E = I_k\}$, and i.i.d. samples from multivariate elliptically contoured stable distributions.

To sample from $\mathcal{S}_{d,k}$, we first construct the $d \times k$ matrix $Z$ by drawing each of its components from the standard normal distribution $\mathcal{N}(0, 1)$ and then perform the QR decomposition of it: $E = \mathrm{qr}(Z)$. By definition, $E \in \mathcal{S}_{d,k}$ is a uniform sample.

To sample from multivariate elliptically contoured stable distributions, we follow the approach presented in Nadjahi et al. [2019, Appendix 4]. Indeed, we recall that if $Y \in \mathbb{R}^d$ is $\alpha$-stable and elliptically contoured, i.e., $Y \sim \mathcal{E}_\alpha\mathcal{S}_c(\Sigma, m)$, then its joint characteristic function is given, for any $t \in \mathbb{R}^d$, by
$$\mathbb{E}\big[\exp(it^\top Y)\big] = \exp\big(-(t^\top \Sigma t)^{\alpha/2} + it^\top m\big), \tag{F.5}$$
where $\Sigma$ is a positive definite matrix (akin to a correlation matrix), $m \in \mathbb{R}^d$ is a location vector (equal to the mean if it exists) and $\alpha \in (0, 2)$ controls the thickness of the tails. Elliptically contoured stable distributions are scale mixtures of multivariate Gaussian distributions [Samoradnitsky, 2017, Proposition 2.5.2] with computationally intractable densities. Fortunately, it was shown by Nolan [2013] that sampling from multivariate elliptically contoured stable distributions is possible: let $A \sim \mathcal{S}_{\alpha/2}(\beta, \gamma, \delta)$ be a one-dimensional positive $(\alpha/2)$-stable random variable with $\beta = 1$, $\gamma = 2\cos(\pi\alpha/4)^{2/\alpha}$ and $\delta = 0$, and let $G \sim \mathcal{N}(0, \Sigma)$. By definition, $Y = \sqrt{A}\,G + m$ satisfies Eq. (F.5) and $Y \sim \mathcal{E}_\alpha\mathcal{S}_c(\Sigma, m)$.

Optimization methods.
Computing the MPRW and MEPRW estimators is intractable in general, mainly because the PRW distance requires a maximization over infinitely many projections. Formally, we hope to solve the following minimax optimization model:
$$\min_{\theta \in \Theta} PW_{p,1}^p(\mu_\theta, \mu^\star) = \min_{\theta \in \Theta}\ \max_{u \in \mathcal{S}_{d,1}} \int_0^1 |F^{-1}_{u^\star_\sharp\mu_\theta}(t) - F^{-1}_{u^\star_\sharp\mu^\star}(t)|^p\, dt,$$
where $\{\mu_\theta : \theta \in \Theta\}$ is the model and $\mu^\star$ is the data-generating process. Following the approach presented in Nadjahi et al. [2019] together with the approximation of $PW_{p,1}$, we use the ADAM optimization method to minimize the (expected) PRW distance over the set of parameters, while applying multiple projected supergradient ascent steps to find an approximate projection $u$ maximizing over $\mathcal{S}_{d,1}$ at each inner loop. The ADAM optimization method is used with the default parameter setting suggested by Kingma and Ba [2015]. At each inner loop, we run 5 projected supergradient ascent steps with learning rate $10^{-}$.

Gaussian models. For the MPRW estimator, we consider the approximate $PW_{2,1}$ distance based on Eq. (F.4). Indeed, let $\mu$ denote $\mathcal{N}(m_0, \sigma_0^2 I)$ and let $\hat{\nu}$ denote the empirical probability measure of $n$ samples drawn from the data-generating process. We define the function $f(m_0, \sigma_0^2, u)$ as
$$f(m_0, \sigma_0^2, u) = \frac{1}{\mathrm{card}(\mathcal{S})} \sum_{s \in \mathcal{S}} |s - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(\tilde{F}_{u^\star_\sharp\mu}(s))|^2\, \mathcal{N}(s; u^\top m_0, \sigma_0^2),$$
where $\mathcal{S} \subseteq \mathbb{R}$ and $\mathcal{N}(s; u^\top m_0, \sigma_0^2)$ refers to the density function of a Gaussian with parameters $(u^\top m_0, \sigma_0^2)$ evaluated at $s \in \mathcal{S}$.
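As a crude, self-contained check of the inner maximization, the Eq. (F.3)-style approximation of $PW_{2,1}$ can be evaluated with numpy's linearly interpolated empirical quantiles. In this sketch (all function names are ours) the supremum over $u$ is approximated by random search over unit directions rather than the projected supergradient ascent used in the paper:

```python
import numpy as np

def sliced_wp_p(xu, yu, t, p=2):
    """Eq. (F.3)-style integrand for one direction: compare the linearly
    interpolated quantile functions of two projected samples at levels t."""
    qx = np.quantile(xu, t)   # default interpolation is linear
    qy = np.quantile(yu, t)
    return np.mean(np.abs(qx - qy) ** p)

def max_sliced_wp(X, Y, p=2, n_dir=200, K=100, seed=0):
    """Rough estimate of PW_{p,1} by random search over unit directions
    (the paper maximizes with projected supergradient ascent instead)."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(size=K)             # Monte Carlo quantile levels
    best = 0.0
    for _ in range(n_dir):
        u = rng.standard_normal(X.shape[1])
        u /= np.linalg.norm(u)
        best = max(best, sliced_wp_p(X @ u, Y @ u, t, p))
    return best ** (1.0 / p)
```

For two point clouds that differ by a translation $v$, each sliced term equals $|u^\top v|^p$, so the random-search estimate is bounded above by $\|v\|$ and approaches it as more directions are drawn.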
We compute the explicit gradients of $f(m_0, \sigma_0^2, u)$ with respect to the mean $m_0$, the variance $\sigma_0^2$ and the projection vector $u$ as follows:
$$\nabla_{m_0} f(m_0, \sigma_0^2, u) = \frac{1}{\sigma_0^2\,\mathrm{card}(\mathcal{S})} \sum_{s \in \mathcal{S}} |s - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(\tilde{F}_{u^\star_\sharp\mu}(s))|^2\, \mathcal{N}(s; u^\top m_0, \sigma_0^2)\,(s - u^\top m_0)\, u,$$
$$\nabla_{\sigma_0^2} f(m_0, \sigma_0^2, u) = \frac{1}{2\sigma_0^4\,\mathrm{card}(\mathcal{S})} \sum_{s \in \mathcal{S}} |s - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(\tilde{F}_{u^\star_\sharp\mu}(s))|^2\, \mathcal{N}(s; u^\top m_0, \sigma_0^2)\,\big((s - u^\top m_0)^2 - \sigma_0^2\big),$$
$$\nabla_u f(m_0, \sigma_0^2, u) = \frac{1}{\sigma_0^2\,\mathrm{card}(\mathcal{S})} \sum_{s \in \mathcal{S}} |s - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(\tilde{F}_{u^\star_\sharp\mu}(s))|^2\, \mathcal{N}(s; u^\top m_0, \sigma_0^2)\,(s - u^\top m_0)\, m_0.$$

For the MEPRW estimator, we consider the approximate $PW_{2,1}$ distance based on Eq. (F.3). Indeed, let $\hat{\mu}$ and $\hat{\nu}$ denote the empirical probability measures of $m$ samples drawn from $\mathcal{N}(m_0, \sigma_0^2 I)$ and $n$ samples drawn from the data-generating process, respectively. We define the function $f(m_0, \sigma_0^2, u)$ as
$$f(m_0, \sigma_0^2, u) = \frac{1}{K} \sum_{k=1}^K |\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)|^2,$$
where $\{t_k\}_{k=1}^K$ are uniform and independent samples from $[0, 1]$. We compute the gradients of $f(m_0, \sigma_0^2, u)$ with respect to the mean $m_0$, the variance $\sigma_0^2$ and the projection vector $u$ as follows:
$$\nabla_{m_0} f(m_0, \sigma_0^2, u) = \frac{2}{K} \sum_{k=1}^K \big(\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)\big)\, u,$$
$$\nabla_{\sigma_0^2} f(m_0, \sigma_0^2, u) = \frac{1}{\sigma_0^2 K} \sum_{k=1}^K \big(\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)\big)\big(\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - u^\top m_0\big),$$
$$\nabla_u f(m_0, \sigma_0^2, u) = \frac{2}{K} \sum_{k=1}^K \big(\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)\big)\, m_0.$$

Elliptically contoured stable models.
When comparing the MEPRW estimator with the MPRW estimator using elliptically contoured stable models, we also approximate these estimators using the ADAM optimization method with the default parameter setting. We consider the approximate $PW_{2,1}$ distance based on Eq. (F.3). Indeed, let $\hat{\mu}$ and $\hat{\nu}$ denote the empirical probability measures of $m$ samples drawn from $\mathcal{E}_\alpha\mathcal{S}_c(I, m_0)$ and $n$ samples drawn from the data-generating process, respectively. We define the function $f(m_0, u)$ as
$$f(m_0, u) = \frac{1}{K} \sum_{k=1}^K |\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)|^2,$$
where $\{t_k\}_{k=1}^K$ are uniform and independent samples from $[0, 1]$. We compute the gradients of $f(m_0, u)$ with respect to the location parameter $m_0$ and the projection vector $u$ as follows:
$$\nabla_{m_0} f(m_0, u) = \frac{2}{K} \sum_{k=1}^K \big(\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)\big)\, u, \qquad \nabla_u f(m_0, u) = \frac{2}{K} \sum_{k=1}^K \big(\tilde{F}^{-1}_{u^\star_\sharp\hat{\mu}}(t_k) - \tilde{F}^{-1}_{u^\star_\sharp\hat{\nu}}(t_k)\big)\, m_0.$$

Generative modeling.
We use the ADAM optimizer provided by PyTorch, running on GPU.
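Returning to the sampling schemes described earlier in this section, the elliptically contoured stable draw $Y = \sqrt{A}\,G + m$ can be sketched as follows. This is a hypothetical illustration (not the paper's code): it generates the positive stable factor with the Chambers-Mallows-Stuck recipe for a totally skewed stable law, and it treats $\gamma$ as a multiplicative scale; stable-law parametrization conventions vary across references and should be checked against Nolan [2013]:

```python
import numpy as np

def positive_stable(a, size, rng):
    """Chambers-Mallows-Stuck sampler for a totally skewed (beta = 1)
    a-stable variable with a in (0, 1), which is positive a.s."""
    b = np.arctan(np.tan(np.pi * a / 2)) / a            # shift angle
    s = (1 + np.tan(np.pi * a / 2) ** 2) ** (1 / (2 * a))
    theta = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    num = np.sin(a * (theta + b))
    return (s * num / np.cos(theta) ** (1 / a)
            * (np.cos(theta - a * (theta + b)) / w) ** ((1 - a) / a))

def sample_ecs(alpha, Sigma, m, n, rng):
    """Draw n samples via the scale-mixture representation
    Y = sqrt(gamma * A) * G + m, with A positive (alpha/2)-stable."""
    d = len(m)
    gamma = 2 * np.cos(np.pi * alpha / 4) ** (2 / alpha)  # scale in the text
    A = gamma * positive_stable(alpha / 2, n, rng)
    G = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    return np.sqrt(A)[:, None] * G + m
```

Because the mixing variable $A$ is heavy-tailed, the resulting samples exhibit the thick tails controlled by $\alpha$ while keeping the elliptical contours determined by $\Sigma$.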
G Experimental Setup
Computing infrastructure.
For the experiments on the uniform distribution over hypercubes, we implement the code in Python 3.7 with Numpy 1.18 on a workstation with an Intel Core i5-9400F (6 cores and 6 threads) and 32GB memory, running Ubuntu 18.04. For the experiments on the MPRW and MEPRW estimators, we implement the code in Python 2.7 with Numpy 1.16 and IPython 5.8 on the same machine. These experiments were not conducted with a GPU. For the experiments on neural networks, we use the same machine with 2 GPUs (GeForce GTX 1070 and GeForce RTX 2070).

Figure 5: Mean values (Top) and mean computational times (Bottom) of the IPRW and PRW distances of order 2 between empirical measures $\hat{\mu}_n$ and $\hat{\nu}_n$ as the number of points $n$ varies. Results are averaged over 100 runs.
We conduct the experiment on the uniform distribution over different hypercubes, as also used in the experiments of Paty and Cuturi [2019]. In particular, we consider $\mu = \nu = \mathrm{U}([-v, v]^d)$, a uniform distribution over a hypercube, where $d$ and $v$ stand for the dimension and the scale of the distribution, respectively. $\hat{\mu}_n$ and $\hat{\nu}_n$ are empirical distributions corresponding to $\mu$ and $\nu$ with $n$ samples. We evaluate the PRW and IPRW distances in terms of mean values and mean computational times over 100 runs for $(d, v) \in \{(10, ), (10, ), (30, ), (30, ), (50, ), (50, )\}$. For the PRW distance, we run Algorithm 1 with the emd solver in the POT package [Flamary and Courty, 2017] and terminate the algorithm either when the maximum number of iterations $T = 30$ is reached or when $\|U_{t+1} - U_t\|_F \leq 10^{-}$. For the IPRW distance, we draw 100 uniform and independent projections from $\mathcal{S}_{d,k}$ and compute each Wasserstein distance using the emd solver in the POT package.
Model misspecification.
We conduct the experiments on three types of data: mixtures of 8, 12 and 25 Gaussian distributions, with Gaussian models $\mathcal{M} = \{\mathcal{N}(m_0, \sigma_0^2 I) : m_0 \in \mathbb{R}^2, \sigma_0^2 > 0\}$ and elliptically contoured stable models $\mathcal{M} = \{\mathcal{E}_\alpha\mathcal{S}_c(I, m_0) : m_0 \in \mathbb{R}^2\}$. For the data-generating process, we fix $k$ centers $\{(a_i, b_i)\}_{1 \leq i \leq k}$. For each sample, we first select a center $m$ uniformly at random from the centers and then draw the sample from $\mathcal{N}(2m, \cdot)$.

We use the ADAM optimization method with the default parameter setting to compute the MPRW and MEPRW estimators. At each inner loop, we run 5 projected supergradient ascent steps with learning rate $10^{-}$. For the Gaussian models, we estimate the densities of $\hat{\sigma}_n$ with a kernel density estimator by computing the MPRW estimator of order 1 over 100 runs. The maximum number of ADAM iterations is set to 20000. To illustrate the consistency of the MPRW and MEPRW estimators, we compute the MPRW and MEPRW estimators of order 2 over 100 runs, where the maximum numbers of ADAM iterations are set to 20000 and 10000, respectively. We also verify the convergence of MEPRW to MPRW by computing these estimators 100 times on a fixed set of $n = 2000$ observations for different numbers $m$ of samples generated from the model. The maximum numbers of ADAM iterations for the MPRW and MEPRW estimators are set to 20000 and 10000. For the elliptically contoured stable models, we verify the consistency property of MEPRW and the convergence of MEPRW to MPRW. For the former, we compute the MEPRW estimator of order 2 over 100 runs and set the maximum number of ADAM iterations to 10000. For the latter, we compute the MPRW and MEPRW estimators of order 2 over 100 runs on a fixed set of $n = 100$ observations for different numbers $m$ of samples generated from the model. The maximum numbers of ADAM iterations are set to 20000 and 10000. All of these settings are used consistently on the mixtures of 8, 12 and 25 Gaussian distributions.

Figure 6: Probability density of the estimate of the centered and rescaled $\hat{\sigma}_n$ on the Gaussian model for different $n$. (a) Mixture of 12 Gaussian distributions; (b) Mixture of 25 Gaussian distributions.
The procedure of the max-SW generator is summarized as follows: we first sample a random variable Z from a fixed distribution on the base space Z, and then transform Z through a neural network parametrized by θ. This provides a parametric function T_θ : Z → R^d which allows us to generate images from a distribution μ_θ. Our goal is to optimize the neural network parameters θ by minimizing the max-SW distance [Deshpande et al., 2019] between μ_θ and the data-generating distribution. We use a neural network with the fully connected configuration from Deshpande et al. [2018, Appendix D] and train our model on CIFAR10 and ImageNet200 (available at https://tiny-imagenet.herokuapp.com/). The former consists of 60000 training and 10000 testing images of size 3 × 32 × 32, while the latter consists of 100000 training and 10000 testing images. We use the minimal expected max-SW estimator of order 2, approximated with 50 projected gradient ascent steps and a fixed learning rate, and train for 1000 iterations with the ADAM optimizer [Kingma and Ba, 2015].

Figure 7: Minimal PRW and expected PRW estimations using Gaussian models and n samples from the mixture of 12 Gaussian distributions: (a) MPRW vs. n; (b) MEPRW vs. n = m; (c) MEPRW with n = 2000 vs. m. Results are averaged over 100 runs and shaded areas represent standard deviation.

Figure 8: Minimal PRW and expected PRW estimations using Gaussian models and n samples from the mixture of 25 Gaussian distributions: (a) MPRW vs. n; (b) MEPRW vs. n = m; (c) MEPRW with n = 2000 vs. m. Results are averaged over 100 runs and shaded areas represent standard deviation.

H Additional Experimental Results
Convergence and concentration.
Figure 5 presents average distances and computational times for (d, v) ∈ {(10, ·), (30, ·), (50, ·)}, where the shaded areas show the max-min values over 100 runs. We also observe that the IPRW distance is smaller than the PRW distance for small n, especially so when d and v are large. The two distances are close when n is large, supporting in practice the theoretical results given by Theorem 3.4 and Theorem 3.6. The computation of the PRW distance is relatively faster than that of the IPRW distance in these experiments. Model misspecification: Gaussian models.
Figure 6 shows the distributions of the estimates centered and rescaled by √n for a range of moderately large n, based on two underlying models: the mixture of 12 Gaussian distributions and the mixture of 25 Gaussian distributions. The left panel supports the convergence rate and the limiting distribution of the estimator derived in Theorem 3.12 on the mixture of 12 Gaussian distributions. The right panel suggests that the limiting distribution is not normal when the underlying model is the mixture of 25 Gaussian distributions. In the latter case, the result is not as anticipated by Theorem 3.12. This is possibly because we only run 5 projected supergradient ascent steps at each inner loop, which may not be enough to achieve a good approximate projection u. Figures 7 and 8 demonstrate the large-sample consistency of the MPRW and MEPRW estimators on the mixtures of 12 and 25 Gaussian distributions, which is expected since Assumptions 3.1–3.3 are mild. The MEPRW estimator also converges to the MPRW estimator on the mixture of 12 Gaussian distributions, confirming Theorem 3.11. One exception in these experiments is the failure of MEPRW to converge to MPRW on the mixture of 25 Gaussian distributions; apparently, the conclusion of Theorem 3.11 does not hold in this experimental setting. This is likely due to a violation of Assumption 3.5, which is necessary for Theorem 3.11 to hold. Model misspecification: Elliptically contoured stable models.
Figure 9 (a) illustrates the consistency of the MEPRW estimator m̂_{n,m}, approximated with 5 projected supergradient ascent steps, in the same way as for the Gaussian models. Figure 9 (b) confirms the convergence of m̂_{n,m} to the MPRW estimator m̂_n, where we fix n = 100 observations and compute the mean squared error between these two estimators (using 5 projected supergradient ascent steps) for different values of m. Note that the MPRW estimator is approximated by the MEPRW estimator obtained for a large enough value of m. Taken together, our results on elliptically contoured stable models confirm Theorem 3.9, Theorem 3.10 and Theorem 3.11 in practice. Generative modeling.
Figure 10 presents the mean test loss on CIFAR10 over 10 runs, where the shaded areas show the max-min values over the runs. Here the minimal expected max-SW estimator of order 2 is approximated with 20 projected gradient ascent steps and a fixed learning rate, and we train for 1000 iterations with the ADAM optimizer [Kingma and Ba, 2015]. We also train the neural networks with (n, m) ∈ {(100, ·), (1000, ·), (5000, ·), (10000, ·)}, where n is the number of training samples and m is the number of generated samples, and compute the test losses using the trained models on the testing dataset (n = 10000) with m = 250 generated samples. We compare these test losses to that of a neural network trained using n = 60000 (i.e., the entire training dataset) and m = 200, and present them in Figure 10. Again, our results confirm Theorem 3.10 in practice.

Figure 10: Mean test loss for different values of (n, m) on CIFAR10.

Figure 9: Minimal expected PRW estimations using elliptically contoured stable models and n samples from the mixture of 8 Gaussian distributions (top), 12 Gaussian distributions (middle) and 25 Gaussian distributions (bottom), and m samples generated from the model: (a) MEPRW; (b) MEPRW, n⋆ = 100. Results are averaged over 100 runs and shaded areas represent standard deviation.
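Throughout these experiments, the PRW and max-SW objectives are approximated by a few projected (super)gradient ascent steps over the projection direction. As a rough illustration of that inner routine, and not the exact implementation used above, the following minimal NumPy sketch computes the order-2 max-sliced Wasserstein distance, i.e., the k = 1 case of PRW, between two empirical measures with equal numbers of points; the function names, step count and learning rate below are illustrative choices of ours:

```python
import numpy as np

def w2_1d(x, y):
    """Order-2 Wasserstein distance between two 1-D empirical measures with
    the same number of points: sort both samples and pair them monotonically."""
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

def max_sliced_w2(X, Y, steps=5, lr=1e-2, seed=0):
    """Approximate the max-sliced W2 distance between empirical measures in R^d
    by projected (super)gradient ascent on the unit sphere S^{d-1}."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(X.shape[1])
    u /= np.linalg.norm(u)
    for _ in range(steps):
        # Supergradient of W2^2(u^T X, u^T Y) w.r.t. u, holding the optimal
        # monotone coupling of the projected (sorted) samples fixed.
        ix, iy = np.argsort(X @ u), np.argsort(Y @ u)
        diff = (X @ u)[ix] - (Y @ u)[iy]          # paired 1-D displacements
        grad = 2 * np.mean(diff[:, None] * (X[ix] - Y[iy]), axis=0)
        u = u + lr * grad                         # ascent step
        u /= np.linalg.norm(u)                    # project back onto the sphere
    return w2_1d(X @ u, Y @ u), u
```

As a sanity check of the sketch: when Y is a pure shift of X, say Y = X + c, the ascent drives u toward ±c/‖c‖ and the returned distance toward ‖c‖, which is the max-sliced value in that case.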