Two-sample Test with Kernel Projected Wasserstein Distance
Jie Wang, Rui Gao, Yao Xie ∗ February 15, 2021
Abstract
We develop a kernel projected Wasserstein distance for the two-sample test, an essential building block in statistics and machine learning: given two sets of samples, to determine whether they are from the same distribution. This method operates by finding the nonlinear mapping in the data space which maximizes the distance between projected distributions. In contrast to existing works about projected Wasserstein distance, the proposed method circumvents the curse of dimensionality more efficiently. We present practical algorithms for computing this distance function together with the non-asymptotic uncertainty quantification of empirical estimates. Numerical examples validate our theoretical results and demonstrate good performance of the proposed method.
∗ J. Wang and Y. Xie are with the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology. R. Gao is with the Department of Information, Risk, and Operations Management, University of Texas at Austin.

Two-sample hypothesis testing is a fundamental problem in statistical inference (Casella & Berger, 2001), which aims to determine whether two collections of samples are from the same distribution or not. This problem has wide applications in scientific discovery and machine learning problems such as anomaly detection (Chandola et al., 2009, 2010; Bhuyan et al., 2013), where the abnormal observations follow a distinct distribution compared with the typical observations. Similarly, in change-point detection (Xie & Siegmund, 2013; Cao et al., 2018; Xie & Xie, 2020), the post-change observations follow a different distribution from the pre-change one. Other examples include model criticism (Lloyd & Ghahramani, 2015; Chwialkowski et al., 2016; Bińkowski et al., 2018), causal inference (Lopez-Paz & Oquab, 2018), and health care (Schober & Vetter, 2019).

Traditional methods for two-sample tests have focused on the parametric or low-dimensional testing scenario. One can design a parametric test when extra information about the underlying distribution is available. Typical examples include Hotelling's two-sample test (Hotelling, 1931), Student's t-test (Pfanzagl & Sheynin, 1996), etc. Non-parametric two-sample tests can be more appealing when it is difficult to specify the exact parametric form of the distributions. Integral probability metrics are popular for the design of non-parametric two-sample tests, since the corresponding empirical test statistics can be estimated directly from samples without knowing the underlying distributions. Some earlier works construct two-sample tests based on the Kolmogorov-Smirnov distance (Pratt & Gibbons, 1981; Jr., 1951), the total variation distance (Györfi & Van Der Meulen, 1991), and the Wasserstein distance (del Barrio et al., 1999; Ramdas et al., 2015). However, those tests are not applicable when performing two-sample tests in high-dimensional settings, since the sample complexity for estimating those distance functions from collected samples suffers from the curse of dimensionality.

There is a strong need for developing non-parametric tests for high-dimensional data, especially for modern applications. A notable contribution is the two-sample test based on the Maximum Mean Discrepancy (MMD) statistic (Gretton et al., 2012a, 2009, 2012b). However, as pointed out in (Reddi et al., 2014), the power of MMD-based tests decreases polynomially with an increasing dimension of the target distributions, which indicates that it is not well suited for high-dimensional testing. Recent works (Wang et al., 2020; Xie & Xie, 2020) use the idea of dimensionality reduction for handling high-dimensional data and introduce the projected Wasserstein distance as the test statistic, i.e., the test statistic works by finding the linear projector such that the distance between projected distributions is maximized.
However, a linear projector may not be an optimal design for maximizing the power of tests.

In this paper, we present a new non-parametric two-sample test statistic aimed at the high-dimensional setting, based on a kernel projected Wasserstein (KPW) distance, with a nonlinear projector based on a reproducing kernel Hilbert space designed to optimize the test power via maximizing the probability distance between the distributions after projection. In addition, our contributions include the following:

• We develop a computationally efficient algorithm for evaluating the KPW distance, using a representer theorem to reformulate the problem as a finite-dimensional optimization problem and a block coordinate descent optimization algorithm which is guaranteed to find an $\epsilon$-stationary point within the iteration complexity stated in Theorem 2.

• To quantify the false detection rate, which is essential in setting the detection threshold, we develop non-asymptotic bounds for the empirical KPW distance based on the Rademacher complexity and an $\epsilon$-net argument.

• We present numerical experiments that validate our theoretical results and demonstrate the good performance of our proposed test.

The remainder of this paper is organized as follows. Section 3 introduces some preliminary knowledge on two-sample testing and related probability distances, Section 4 outlines a practical algorithm for computing the KPW distance, Section 5 studies the uncertainty quantification of the empirical KPW distance together with some theoretical analysis for the designed two-sample test, Section 6 demonstrates some numerical experiments, and Section 7 presents some concluding remarks.
Generalization on Integral Probability Metrics.
Our proposed KPW distance falls into the framework of integral probability metrics (IPMs), which are widely used in non-parametric two-sample tests. In particular, several efficient two-sample tests have been constructed based on the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012a, 2009, 2012b), a special case of IPM. The key to designing efficient two-sample tests is to balance the discriminative-generalization trade-off, i.e., the discriminator function space should be large enough to make $\mathrm{IPM}(\mu,\nu)$ discriminative, so that $\mathrm{IPM}(\mu,\nu) = 0$ implies $\mu = \nu$, and the space should be relatively small to make $\mathrm{IPM}(\mu,\nu)$ generalizable, in the sense that the sample complexity for estimating the IPM from empirical samples is small. (Arora et al., 2017) studies the generalization bound for general GAN metrics, and it turns out that the neural-net distance enjoys generalization properties. (Zhang et al., 2018) discusses necessary and sufficient conditions for the discriminative and generalization properties of general IPMs. A recent work Wang et al. (2020) extends the arguments above to investigate the two-sample convergence rate of empirical IPMs. The proof for the uncertainty quantification of the empirical KPW distance follows similar arguments to this work.

The Wasserstein distance is a special case of IPM where the discriminator function space is the collection of 1-Lipschitz functions. However, in high-dimensional settings, the Wasserstein distance is difficult to compute, and the convergence rate of the empirical Wasserstein distance is slow. Several variants of the Wasserstein distance have been developed in the literature to address these two issues. The following outlines two approaches in existing works for breaking the curse of high dimensions.

Smoothed Wasserstein Distance.
The Sinkhorn divergence was first proposed in (Cuturi, 2013) to reduce the computational cost by adding entropic regularization to the original Wasserstein distance. A recent work (Genevay et al., 2019) shows that the convergence rate of empirical Sinkhorn divergences is of order $O(n^{-1/2})$ with respect to the sample size $n$, which is independent of the dimension of the underlying distributions. The follow-up work (Mena & Weed, 2019) further extends this convergence result to unbounded supports and tightens the convergence rate up to constants. Another variant of the smoothed Wasserstein distance considers the Wasserstein distance between the distributions convolved with an isotropic Gaussian distribution. (Goldfeld et al., 2020b) shows that, under a sub-Gaussian assumption, the convergence rate of such Gaussian-smoothed empirical measures (for the Wasserstein distance, total variation distance, etc.) is parametric regardless of the dimension $d$. Based on this result, (Goldfeld & Greenewald, 2020) further establishes the related metric structure and non-asymptotic statistical efficiency of the Gaussian-smoothed Wasserstein distance, and (Goldfeld et al., 2020a) establishes the associated asymptotic statistical results.

Dimension Reduction.
Low-dimensional projections are widely used for understanding the structure of high-dimensional distributions. Typical examples include principal component analysis (Jolliffe, 1986), linear discriminant analysis (McLachlan & Sons, 1992), etc. It is intuitive to identify the differences between two collections of high-dimensional samples by projecting those samples into low-dimensional spaces along some proper directions. (Wei et al., 2013; Ghosh & Biswas, 2016) design the direction by training binary linear classifiers on samples. (Mueller & Jaakkola, 2015; Xie & Xie, 2020) find the worst-case linear projector that maximizes the probability distance between the sample points projected into one dimension. The recent work Lin et al. (2020a) naturally extends this idea by considering projectors that map sample points into a $k$-dimensional linear subspace with $k \ge 1$, called the projected Wasserstein distance. Since computing the projected Wasserstein distance is costly for large-scale problems, Huang et al. (2021) proposes a block coordinate descent (BCD) algorithm to find an $\epsilon$-stationary point of the optimization problem in polynomially many iterations in $1/\epsilon$. The statistical properties of this metric are discussed comprehensively in (Lin et al., 2020b). Based on these uncertainty quantification results, (Wang et al., 2020) applies the projected Wasserstein distance to the task of two-sample testing. However, the uncertainty quantification results are loose in general, and a linear projection mapping may not be the optimal choice for maximizing the performance of two-sample tests. In contrast, our work focuses on a more general metric called the KPW distance, in which a general nonlinear mapping defined in a Hilbert space is considered. We follow a similar idea to Huang et al. (2021) to design a BCD algorithm for computing the KPW distance, so that an $\epsilon$-stationary point can be found in polynomially many iterations in $1/\epsilon$.

Let $x^n := \{x_i\}_{i=1}^n$ and $y^m := \{y_i\}_{i=1}^m$ be i.i.d. samples generated from distributions $\mu$ and $\nu$, respectively. Our goal is to design a two-sample test which, given samples $x^n$ and $y^m$, decides to accept the null hypothesis $H_0: \mu = \nu$ or the alternative hypothesis $H_1: \mu \neq \nu$. Denote by $T: (x^n, y^m) \to \{d_0, d_1\}$ the two-sample test, where $d_i$ means that we accept the hypothesis $H_i$ and reject the other, $i = 0, 1$.
Define the type-I risk as the probability of rejecting hypothesis $H_0$ when it is true, and the type-II risk as the probability of accepting $H_0$ when $\mu \neq \nu$:
$$\epsilon^{(\mathrm{I})}_{n,m} = \mathbb{P}_{x^n\sim\mu,\,y^m\sim\nu}\big(T(x^n,y^m) = d_1\big) \text{ under } H_0, \qquad \epsilon^{(\mathrm{II})}_{n,m} = \mathbb{P}_{x^n\sim\mu,\,y^m\sim\nu}\big(T(x^n,y^m) = d_0\big) \text{ under } H_1.$$
Given $\alpha, \beta \in (0,1)$, we aim at building a two-sample test such that, when applied to $n$-observation samples $x^n$ and $m$-observation samples $y^m$, it has type-I risk at most $\alpha$ (i.e., at level $\alpha$) and type-II risk at most $\beta$ (i.e., of power $1-\beta$). Moreover, we want to ensure these specifications with sample sizes $n, m$ as small as possible.

Denote by $\hat\mu_n$ and $\hat\nu_m$ the empirical distributions constructed from i.i.d. samples from $\mu$ and $\nu$:
$$\hat\mu_n \triangleq \frac{1}{n}\sum_{i=1}^n\delta_{x_i}, \qquad \hat\nu_m \triangleq \frac{1}{m}\sum_{i=1}^m\delta_{y_i}.$$
We propose a non-parametric test by considering probability distance functions between the two empirical distributions constructed from the observed samples. Specifically, we design a test $T$ such that the null hypothesis $H_0$ is rejected when
$$D(\hat\mu_n, \hat\nu_m) > \ell,$$
where $D(\cdot,\cdot)$ is a divergence quantifying the difference between two distributions, and $\ell$ is a data-dependent threshold. In particular, we consider the divergence $D$ as a special type of integral probability metric (IPM).

Definition 1 (Integral Probability Metric). Given two distributions $\mu$ and $\nu$, define the integral probability metric as
$$\mathrm{IPM}_{\mathcal{F}}(\mu,\nu) = \sup_{f\in\mathcal{F}}\left(\int f(x)\,\mathrm{d}\mu(x) - \int f(y)\,\mathrm{d}\nu(y)\right),$$
where $\mathcal{F}$ denotes the discriminator function space.

Several existing tests can be viewed as special cases of the IPM, including the MMD test, the total variation distance test, etc. In this paper, we design the divergence $D$ based on the 1-Wasserstein distance. General $p$-Wasserstein distances with $p\in(1,\infty]$ are beyond the scope of IPMs, so we leave the discussion of general $p$-Wasserstein distances to future work.

Definition 2 (1-Wasserstein Distance). Given two distributions $\mu$ and $\nu$, the 1-Wasserstein distance is defined as
$$W(\mu,\nu) = \min_{\pi\in\Pi(\mu,\nu)}\int\|x-y\|\,\mathrm{d}\pi(x,y),$$
where $\|\cdot\|$ denotes the Euclidean norm, and $\Pi(\mu,\nu)$ denotes the set of joint distributions with marginal distributions $\mu$ and $\nu$.

While the Wasserstein distance has wide applications in machine learning, the finite-sample convergence rate of the Wasserstein distance between empirical distributions is slow in high-dimensional settings. Therefore, it is not suitable for high-dimensional two-sample tests. Instead, we use the projection idea to remedy this issue.

Definition 3 (Projected Wasserstein Distance). Given two distributions $\mu$ and $\nu$, define the projected Wasserstein distance as
$$PW(\mu,\nu) = \max_{\substack{A:\,\mathbb{R}^D\to\mathbb{R}^d\\ A^TA = I_d}} W(\mathcal{A}_{\#}\mu, \mathcal{A}_{\#}\nu),$$
where the operator $\#$ denotes the push-forward operator, i.e., $\mathcal{A}(z)\sim\mathcal{A}_{\#}\mu$ for $z\sim\mu$, and we denote by $\mathcal{A}$ the linear operator such that $\mathcal{A}(z) = A^Tz$ with $z\in\mathbb{R}^D$ and $A\in\mathbb{R}^{D\times d}$.

Although this idea is demonstrated to be useful for breaking the curse of dimensionality of the original Wasserstein distance, a linear projector is not an optimal choice for dimensionality reduction. Instead, we will consider a nonlinear projector to obtain a more powerful two-sample test, and we use functions in a vector-valued reproducing kernel Hilbert space (RKHS) for projection.
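As a small illustration of Definitions 2 and 3, the sketch below evaluates the inner objective of the projected Wasserstein distance, i.e., the 1-Wasserstein distance between two empirical samples after a fixed linear projection $\mathcal{A}(z)=A^Tz$; with equal sample sizes and uniform weights this reduces to an optimal assignment. The helper names and the use of `scipy.optimize.linear_sum_assignment` are illustrative choices, and the maximization over $A$ (or over $f$ in the KPW case) is not performed here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def projected_w1(X, Y, A):
    """1-Wasserstein distance between two equal-size empirical samples after
    the linear projection z -> A^T z (uniform weights, so the optimal coupling
    can be taken to be a permutation)."""
    Xp, Yp = X @ A, Y @ A
    cost = np.linalg.norm(Xp[:, None, :] - Yp[None, :, :], axis=-1)
    row, col = linear_sum_assignment(cost)
    return cost[row, col].mean()

rng = np.random.default_rng(0)
D, d, n = 20, 2, 100
X = rng.normal(size=(n, D))
Y = rng.normal(loc=0.3, size=(n, D))
A, _ = np.linalg.qr(rng.normal(size=(D, d)))   # random orthonormal projector, A^T A = I_d
print(projected_w1(X, Y, A))                   # inner objective of Definition 3 for this A
```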
Definition 4 (Vector-valued RKHS). A function $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}^{d\times d}$ is said to be a positive semi-definite kernel if
$$\sum_{i=1}^N\sum_{j=1}^N\langle y_i, K(x_i,x_j)y_j\rangle \ge 0$$
for any finite set of points $\{x_i\}_{i=1}^N$ in $\mathcal{X}$ and $\{y_i\}_{i=1}^N$ in $\mathbb{R}^d$. Given such a kernel, there exists a unique $\mathbb{R}^d$-valued Hilbert space $\mathcal{H}_K$ with reproducing kernel $K$. For fixed $x\in\mathcal{X}$ and $y\in\mathbb{R}^d$, define the kernel section $K_x$ with the action $y$ as the mapping $K_xy:\mathcal{X}\to\mathbb{R}^d$ such that
$$(K_xy)(x') = K(x',x)y, \quad \forall x'\in\mathcal{X}.$$
In particular, the Hilbert space $\mathcal{H}_K$ satisfies the reproducing property: for all $f\in\mathcal{H}_K$, $\langle f, K_xy\rangle_{\mathcal{H}_K} = \langle f(x), y\rangle$.

Definition 5 (Kernel Projected Wasserstein Distance). Consider an $\mathbb{R}^d$-valued RKHS $\mathcal{H}$ with corresponding kernel function $K$. Given two distributions $\mu$ and $\nu$, define the kernel projected Wasserstein (KPW) distance as
$$KPW(\mu,\nu) = \max_{f\in\mathcal{H}}\Big\{W(f_{\#}\mu, f_{\#}\nu):\ \|f\|_{\mathcal{H}}\le 1\Big\}.$$

In this paper, we design the two-sample test such that the acceptance region for the null hypothesis $H_0$ is given by
$$\big\{(x^n, y^m):\ KPW(\hat\mu_n, \hat\nu_m)\le\gamma_{n,m}\big\}, \qquad (1)$$
where the hyper-parameter $\gamma_{n,m}$ is chosen such that the two-sample test is at level $\alpha$. In the following sections, we consider how to compute this divergence and how to obtain $\gamma_{n,m}$ based on concentration results.
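To make Definitions 4 and 5 concrete, the following sketch evaluates a vector-valued RKHS element of the finite-expansion form $f(\cdot)=\sum_i K_{x_i}a_i$ for a separable kernel $K(x,x')=k(x,x')P$, together with its squared RKHS norm $\sum_{i,j}\langle a_i, K(x_i,x_j)a_j\rangle$; the Gaussian scalar kernel and the particular positive semi-definite matrix $P$ are assumptions made only for illustration.

```python
import numpy as np

def gaussian_k(x, xp, sigma=1.0):
    # scalar Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def eval_rkhs_map(z, anchors, coeffs, P, sigma=1.0):
    """Evaluate f(z) = sum_i K(z, x_i) a_i with the separable kernel K = k * P."""
    return sum(gaussian_k(z, xi, sigma) * (P @ ai) for xi, ai in zip(anchors, coeffs))

def rkhs_sq_norm(anchors, coeffs, P, sigma=1.0):
    """||f||_H^2 = sum_{i,j} <a_i, K(x_i, x_j) a_j> for f = sum_i K_{x_i} a_i."""
    return sum(gaussian_k(xi, xj, sigma) * (ai @ P @ aj)
               for xi, ai in zip(anchors, coeffs)
               for xj, aj in zip(anchors, coeffs))

rng = np.random.default_rng(0)
D, d, n = 10, 2, 5                                    # data dim, projected dim, anchors
X = rng.normal(size=(n, D))                           # anchor points x_1, ..., x_n
A = rng.normal(size=(n, d))                           # coefficients a_1, ..., a_n
P = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)           # an illustrative PSD output matrix
z = rng.normal(size=D)
print(eval_rkhs_map(z, X, A, P), rkhs_sq_norm(X, A, P))
```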
Computing KPW Distance

By the definition of the Wasserstein distance, computing $KPW(\hat\mu_n,\hat\nu_m)$ is equivalent to the following max-min problem:
$$\max_{f\in\mathcal{H}}\ \min_{\pi\in\Gamma}\ \sum_{i,j}\pi_{i,j}\,\|f(x_i) - f(y_j)\|, \quad \text{s.t. } \|f\|_{\mathcal{H}}\le 1, \qquad (2)$$
where $\Gamma = \big\{\pi\in\mathbb{R}^{n\times m}_+:\ \sum_j\pi_{i,j} = \tfrac{1}{n},\ \sum_i\pi_{i,j} = \tfrac{1}{m}\big\}$.

The computation of the KPW distance poses several challenges. It is crucial to design a suitable kernel function to obtain low computational complexity and reliable testing power, which will be discussed in Section 6.1. Moreover, the function $f\in\mathcal{H}$ is a countable combination of basis functions, i.e., problem (2) is an infinite-dimensional optimization problem. By developing the representer theorem in Theorem 1, we are able to convert this problem into a finite-dimensional one. Finally, there is no theoretical guarantee for finding the global optimum, since (2) is a non-convex, non-smooth optimization problem; moreover, Sion's minimax theorem is not applicable due to the lack of convex-concave structure. Based on this observation, we focus on optimization algorithms that find a locally optimal point in polynomial time.

Theorem 1 (Representer Theorem for the KPW Distance). There exists an optimal solution to (2) that admits the following expression:
$$\hat f = \sum_{i=1}^n K_{x_i}a_{x,i} - \sum_{j=1}^m K_{y_j}a_{y,j},$$
where $K_x(\cdot)$ denotes the kernel section and $a_{x,i}, a_{y,j}\in\mathbb{R}^d$ are coefficients to be determined.

To express the optimal solution in compact matrix form, define $a_x\in\mathbb{R}^{nd}$ as the concatenation of the coefficients $a_{x,i}$ for $i = 1,\ldots,n$, and
$$K_z(x^n) = \big(K(z,x_1)\ \cdots\ K(z,x_n)\big)\in\mathbb{R}^{d\times nd}.$$
We define the vector $a_y$ and the matrix $K_z(y^m)$ likewise. Then we have
$$\hat f(z) = K_z(x^n)a_x - K_z(y^m)a_y, \quad \forall z\in\mathcal{X}.$$
Define the Gram matrix $K(x^n,x^n)$ as the $n\times n$ block matrix whose $(i,j)$-th block is $K(x_i,x_j)$. The Gram matrices $K(x^n,y^m)$, $K(y^m,x^n)$ and $K(y^m,y^m)$ are defined likewise. Denote by $G$ the concatenation of Gram matrices:
$$G = \begin{pmatrix} K(x^n,x^n) & -K(x^n,y^m) \\ -K(y^m,x^n) & K(y^m,y^m) \end{pmatrix},$$
and we assume that $G$ is positive definite; otherwise, we add a small multiple of the identity matrix to $G$ to make it invertible.
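A minimal sketch of the Gram-matrix bookkeeping above, assuming the separable kernel $K(x,x')=k(x,x')P$ used later in Section 6.1 (so each block Gram matrix is a Kronecker product) and a small diagonal jitter when $G$ is not numerically positive definite; the function names and the jitter value are illustrative.

```python
import numpy as np

def scalar_gram(X, Z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2)), returned as an |X| x |Z| matrix
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def block_gram(Xn, Ym, P, sigma=1.0, jitter=1e-8):
    """G = [[K(x,x), -K(x,y)], [-K(y,x), K(y,y)]] with K(.,.) = k(.,.) kron P."""
    kxx, kxy, kyy = (scalar_gram(Xn, Xn, sigma),
                     scalar_gram(Xn, Ym, sigma),
                     scalar_gram(Ym, Ym, sigma))
    top = np.hstack([np.kron(kxx, P), -np.kron(kxy, P)])
    bot = np.hstack([-np.kron(kxy, P).T, np.kron(kyy, P)])
    G = np.vstack([top, bot])
    return G + jitter * np.eye(G.shape[0])     # keep G numerically positive definite

rng = np.random.default_rng(0)
Xn, Ym = rng.normal(size=(4, 3)), rng.normal(size=(5, 3))
P = np.eye(2)
G = block_gram(Xn, Ym, P)
U = np.linalg.cholesky(np.linalg.inv(G))       # G^{-1} = U U^T, used for the change of variable s = U^{-1} w
print(G.shape, U.shape)
```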
Substituting the expression of $\hat f(z)$, $z\in\mathcal{X}$, into (2), we obtain a finite-dimensional optimization problem:
$$\max_{\omega}\ \min_{\pi\in\Gamma}\ \sum_{i,j}\pi_{i,j}c_{i,j}, \quad \text{s.t. } \omega^TG\omega \le 1,$$
where $\omega\in\mathbb{R}^{d(n+m)}$ is the concatenation of the vectors $a_x, a_y$, and
$$A_{i,j} = \big[K_{x_i}(x^n) - K_{y_j}(x^n),\ K_{y_j}(y^m) - K_{x_i}(y^m)\big], \qquad c_{i,j} = \|A_{i,j}\omega\|.$$
Suppose that the inverse of $G$ admits the Cholesky decomposition $G^{-1} = UU^T$. Then, by the homogeneity of the $\ell_2$ norm and the change of variable $s = U^{-1}\omega$, we obtain the norm-constrained optimization problem
$$\max_{s\in\mathcal{S}^{d(n+m)-1}}\ \min_{\pi\in\Gamma}\ \sum_{i,j}\pi_{i,j}c_{i,j}, \qquad (3)$$
where $\mathcal{S}^{d(n+m)-1} = \{s\in\mathbb{R}^{d(n+m)}: s^Ts = 1\}$.

Note that the $\ell_2$ norm in the objective function is not continuously differentiable at zero. To this end, we approximate the $\ell_2$ norm with the following form:
$$\|x\| \approx \|x\|_{2,\kappa} := \sqrt{\sum_i x_i^2 + \kappa^2} - \kappa,$$
where $\kappa$ is a small positive smoothing constant. It follows that the approximated function is $1/\kappa$-smooth.

For ease of optimization, we consider the entropic regularization of problem (3):
$$\max_{s\in\mathcal{S}^{d(n+m)-1}}\ \min_{\pi\in\Gamma}\ \sum_{i,j}\pi_{i,j}c_{i,j} - \eta H(\pi). \qquad (4)$$
By the duality theory of entropic optimal transport and the change-of-variable technique, (4) is equivalent to the following minimization problem:
$$\min_{s\in\mathcal{S}^{d(n+m)-1},\ u\in\mathbb{R}^n,\ v\in\mathbb{R}^m} F(u,v,s), \qquad (5)$$
where
$$\pi_{i,j}(u,v,s) = \exp\Big(-\frac{c_{i,j}}{\eta} + u_i + v_j\Big), \qquad F(u,v,s) = \sum_{i,j}\pi_{i,j}(u,v,s) - \frac{1}{n}\sum_{i=1}^n u_i - \frac{1}{m}\sum_{j=1}^m v_j.$$

Based on this formulation, we consider a Riemannian block coordinate descent (BCD) method for optimization, whose $t$-th iteration performs the following updates:
$$u^{t+1} = \arg\min_{u\in\mathbb{R}^n} F(u, v^t, s^t), \qquad (6a)$$
$$v^{t+1} = \arg\min_{v\in\mathbb{R}^m} F(u^{t+1}, v, s^t), \qquad (6b)$$
$$\zeta^{t+1} = \sum_{i,j}\nabla_s\pi_{i,j}(u^{t+1}, v^{t+1}, s^t), \qquad (6c)$$
$$\xi^{t+1} = P_{s^t}\big(\zeta^{t+1}\big), \qquad (6d)$$
$$s^{t+1} = \mathrm{Retr}_{s^t}\big(-\tau\xi^{t+1}\big), \qquad (6e)$$
where the operator $P_s(\zeta)$ denotes the orthogonal projection of the vector $\zeta$ onto the tangent space of the manifold $\mathcal{S}^{d(n+m)-1}$ at $s$,
$$P_s(\zeta) = \zeta - \langle s,\zeta\rangle s, \quad s\in\mathcal{S}^{d(n+m)-1},$$
and the retraction on this manifold is defined as
$$\mathrm{Retr}_s(-\tau\xi) = \frac{s-\tau\xi}{\|s-\tau\xi\|}, \quad s\in\mathcal{S}^{d(n+m)-1}.$$
Note that the update steps (6a) and (6b) have closed-form expressions:
$$u^{t+1} = u^t + \bigg\{\log\frac{1/n}{\sum_j\pi_{i,j}(u^t,v^t,s^t)}\bigg\}_{i\in[n]}, \qquad (6f)$$
$$v^{t+1} = v^t + \bigg\{\log\frac{1/m}{\sum_i\pi_{i,j}(u^{t+1},v^t,s^t)}\bigg\}_{j\in[m]}, \qquad (6g)$$
and the Euclidean gradient $\zeta^{t+1}$ in (6c) can be computed using the chain rule:
$$\zeta^{t+1} = -\frac{1}{\eta}\sum_{i,j}\pi_{i,j}(u^{t+1},v^{t+1},s^t)\cdot\frac{U^TA_{i,j}^TA_{i,j}Us^t}{\|A_{i,j}Us^t\|_{2,\kappa}}. \qquad (6h)$$
By making use of the structure of the matrices $A_{i,j}$, the gradient $\zeta^{t+1}$ can be implemented while avoiding double-loop computations. The overall algorithm for solving problem (5) is summarized in Algorithm 1.

Algorithm 1 BCD Algorithm for Solving (5)
Require: Empirical distributions $\hat\mu_n$ and $\hat\nu_m$.
  Initialize $v^0$, $s^0$.
  for $t = 0, 1, 2, \ldots, T-1$ do
    Update $u^{t+1} = \arg\min_u F(u, v^t, s^t)$ according to (6f).
    Update $v^{t+1} = \arg\min_v F(u^{t+1}, v, s^t)$ according to (6g).
    Update the Euclidean gradient $\zeta^{t+1}$ and the Riemannian gradient $\xi^{t+1}$ according to (6h) and (6d).
    Update $s^{t+1}$ according to (6e).
  end for
  Return $u^* = u^T$, $v^* = v^T$, $s^* = s^T$.

Theorem 2 (Convergence Analysis for BCD). We say that $(\hat u, \hat v, \hat s)$ is an $(\epsilon_1,\epsilon_2)$-stationary point of (5) if
$$\|\mathrm{Grad}_sF(\hat u,\hat v,\hat s)\| \le \epsilon_1/\eta, \qquad F(\hat u,\hat v,\hat s) - \min_{u,v}F(u,v,\hat s) \le \epsilon_2,$$
where $\mathrm{Grad}_sF(u,v,s)$ denotes the partial derivative of $F$ with respect to the variable $s$ on the sphere $\mathcal{S}^{d(n+m)-1}$. Let $\{u^t,v^t,s^t\}$ be the sequence generated by Algorithm 1. Then Algorithm 1 returns an $(\epsilon_1,\epsilon_2)$-stationary point within a number of iterations $T$ that depends polynomially on $1/\epsilon_1$ and $1/\epsilon_2$ (and on the smoothing parameter $\kappa$), where the constant depends on the Gram matrix $G$ and the Lipschitz properties of the retraction operator.

The proof of Theorem 2 follows a similar idea as in Huang et al. (2021) and can be found in Appendix B.
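The loop below is a minimal Python sketch of Algorithm 1, not the authors' implementation: it applies the closed-form updates (6f)-(6g), the gradient (6h), the tangent-space projection (6d), and the sphere retraction (6e). The smoothing constant $\kappa$, step size $\tau$, regularization $\eta$, and the naive double loop over $(i,j)$ (which the paper notes can be avoided) are illustrative choices; the $A_{i,j}$ blocks are assumed to be assembled from the kernel sections as in Theorem 1.

```python
import numpy as np

def smooth_l2(x, kappa):
    # ||x||_{2,kappa} = sqrt(sum x_i^2 + kappa^2) - kappa
    return np.sqrt(np.sum(x**2) + kappa**2) - kappa

def bcd_kpw(A_blocks, U, n, m, eta=0.1, tau=0.01, kappa=1e-3, iters=200, seed=0):
    """One possible realization of the BCD updates (6f)-(6e).
    A_blocks[i][j] is the d x d(n+m) matrix A_{i,j}; U satisfies G^{-1} = U U^T."""
    rng = np.random.default_rng(seed)
    dim = U.shape[1]
    s = rng.normal(size=dim); s /= np.linalg.norm(s)
    u, v = np.zeros(n), np.zeros(m)

    def costs(s):
        return np.array([[smooth_l2(A_blocks[i][j] @ (U @ s), kappa)
                          for j in range(m)] for i in range(n)])

    for _ in range(iters):
        C = costs(s)
        pi = np.exp(-C / eta + u[:, None] + v[None, :])
        u = u + np.log((1.0 / n) / pi.sum(axis=1))            # update (6f)
        pi = np.exp(-C / eta + u[:, None] + v[None, :])
        v = v + np.log((1.0 / m) / pi.sum(axis=0))            # update (6g)
        pi = np.exp(-C / eta + u[:, None] + v[None, :])
        zeta = np.zeros(dim)                                   # Euclidean gradient (6h)
        for i in range(n):
            for j in range(m):
                Aus = A_blocks[i][j] @ (U @ s)
                # gradient of the smoothed norm of A_{i,j} U s
                zeta -= pi[i, j] / eta * (U.T @ A_blocks[i][j].T @ Aus) / (smooth_l2(Aus, kappa) + kappa)
        xi = zeta - np.dot(s, zeta) * s                        # tangent projection (6d)
        step = s - tau * xi
        s = step / np.linalg.norm(step)                        # retraction (6e)
    return u, v, s, (pi * C).sum()   # entropic transport cost at the last inner iterate
```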
In order to design the two-sample test at a certain level, we first build a finite-sample guarantee for the convergence of the kernel projected Wasserstein (KPW) distance. Results throughout this section are based on the following assumption.
Assumption 1.
For any $x, x'\in\mathcal{X}$, the matrix-valued kernel $K(x,x')$ is symmetric and satisfies $0 \preceq K(x,x') \preceq BI_d$.

Proposition 1 (Finite-sample Guarantee for the Convergence of KPW). Under Assumption 1, the following concentration bounds hold.

1. With probability at least $1-\alpha$, we have
$$\big|KPW(\hat\mu_n,\hat\nu_m) - KPW(\mu,\nu)\big| \le \sqrt{\frac{2B(m+n)}{mn}\log\frac{2}{\alpha}} + \frac{2\sqrt{dB}}{\sqrt n} + \frac{2\sqrt{dB}}{\sqrt m} + \zeta_{n,d} + \zeta_{m,d},$$
where $\zeta_{N,d}$ denotes a term of order
$$\zeta_{N,d} = \begin{cases}\tilde{\mathcal{O}}(N^{-1/2}), & \text{if } d\le 2,\\ \mathcal{O}(N^{-1/d}), & \text{if } d\ge 3.\end{cases}$$

2. Assume further that the sample sizes satisfy $n = m$ and the target distributions satisfy $\mu = \nu$. Then with probability at least $1-\alpha$, we have
$$\big|KPW(\hat\mu_n,\hat\nu_n)\big| \le 2\sqrt{\frac{B}{n}\log\frac{2}{\alpha}} + \frac{2\sqrt{dB}}{\sqrt n} + \zeta_{n,d}.$$

The projected Wasserstein distance was proposed to break the curse of dimensionality of the original Wasserstein distance, and there are several existing works on the sample complexity of the projected Wasserstein distance, including (Mueller & Jaakkola, 2015; Wang et al., 2020; Lin et al., 2020b). However, the upper bounds presented in (Mueller & Jaakkola, 2015; Lin et al., 2020b) do not express the constant term explicitly, and the theoretical guarantee discussed in (Wang et al., 2020) relies on the restrictive assumption that the support of the target distributions is compact. In contrast, our finite-sample guarantee on the kernel projected Wasserstein distance gives tighter concentration results and applies to distributions with unbounded support. Therefore, the proposed kernel projected Wasserstein distance breaks the curse of dimensionality more efficiently.

Based on the concentration inequality above, we can develop a two-sample test at a certain level. It turns out that the acceptance region does not depend on the dimension of the support of the distributions, but only on the number of samples and the dimension of the projected space.
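For the separable kernels used later in Section 6.1, $K(x,x') = k(x,x')P$ with a Gaussian scalar kernel (so $0\le k\le 1$), Assumption 1 holds with $B$ equal to the largest eigenvalue of $P$; the snippet below is a small sanity check under that assumption, with an illustrative choice of $P$.

```python
import numpy as np

# For K(x, x') = k(x, x') * P with 0 <= k(x, x') <= 1 (e.g., a Gaussian k),
# Assumption 1 holds with B = lambda_max(P), since 0 <= k * P <= B * I_d.
rho, d = 0.5, 3
P = (1.0 - rho) * np.ones((d, d)) + rho * np.eye(d)   # illustrative PSD choice of P
B = np.linalg.eigvalsh(P).max()
print("B =", B)
```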
Theorem 3 (Test Based on Finite-sample Uncertainty Quantification). Designs of the two-sample test at a certain level can be summarized as follows.

1. A hypothesis test of level $\alpha$ for the null hypothesis $\mu=\nu$ has the acceptance region
$$KPW(\hat\mu_n,\hat\nu_m) \le \sqrt{\frac{2B(m+n)}{mn}\log\frac{2}{\alpha}} + \frac{2\sqrt{dB}}{\sqrt n} + \frac{2\sqrt{dB}}{\sqrt m} + \zeta_{n,d} + \zeta_{m,d}.$$

2. Assume further that $m = n$. Then a hypothesis test of level $\alpha$ for the null hypothesis $\mu=\nu$ has the acceptance region
$$KPW(\hat\mu_n,\hat\nu_n) \le 2\sqrt{\frac{B}{n}\log\frac{2}{\alpha}} + \frac{2\sqrt{dB}}{\sqrt n} + \zeta_{n,d}.$$
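Taking the reconstructed bound in Theorem 3 at face value (the absolute constants here may differ from those in the original typesetting), the acceptance threshold can be evaluated as below; the remainder terms $\zeta_{n,d}$ and $\zeta_{m,d}$ are left as user-supplied inputs since only their order is specified.

```python
import numpy as np

def kpw_threshold(n, m, d, B, alpha, zeta_n=0.0, zeta_m=0.0):
    """Acceptance threshold gamma_{n,m} from the finite-sample bound (Theorem 3).
    zeta_n / zeta_m stand in for the O(n^{-1/d})-type remainder terms."""
    return (np.sqrt(2.0 * B * (m + n) / (m * n) * np.log(2.0 / alpha))
            + 2.0 * np.sqrt(d * B) / np.sqrt(n)
            + 2.0 * np.sqrt(d * B) / np.sqrt(m)
            + zeta_n + zeta_m)

print(kpw_threshold(n=100, m=100, d=2, B=1.0, alpha=0.05))
```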
Moreover, we provide an upper bound on the type-II error of our designed test based on Markov's inequality.

Theorem 4 (Power of the Two-sample Test). Provided that $KPW(\mu,\nu) > \gamma_{m,n}$, the type-II error of the test with the acceptance region presented in (1) can be upper bounded as
$$\Pr_{H_1}\Big(KPW(\hat\mu_n, \hat\nu_m) < \gamma_{m,n}\Big) \le \frac{\mathrm{MSE}(\mu,\nu)}{\big(KPW(\mu,\nu) - \gamma_{m,n}\big)^2},$$
where $\mathrm{MSE}(\mu,\nu) = \mathbb{E}\big[\big(KPW(\hat\mu_n,\hat\nu_m) - KPW(\mu,\nu)\big)^2\big]$.

Figure 1: MSE of KPW distances between the empirical distributions $\hat\mu_n$ and $\hat\nu_n$ across different numbers of samples $n$ (log-MSE versus sample size $n\in\{5, 20, 50, 125, 250, 625\}$, for $d\in\{1,2,5\}$ and for both $KPW$ and $KPW^*$). Results are averaged over 10 independent trials.

The upper bound in Theorem 4 characterizes two interesting features of the two-sample test: as the MSE between the underlying KPW distance and its empirical estimate decreases, the type-II error of the designed test decreases; and when the discrepancy between $\mu$ and $\nu$ increases, i.e., $KPW(\mu,\nu)$ increases, the type-II error of the designed test decreases. In Section 6.1 we design the matrix-valued kernel to approximately minimize this upper bound on the type-II error.

Before the practical implementation, we outline how to design the matrix-valued kernel to approximately minimize the type-II error. Most existing works, including Minh & Sindhwani (2011); Micchelli & Pontil (2005); Caponnetto et al. (2008); Baldassarre et al. (2010), consider the design of the matrix-valued kernel function as
$$K(x,x') = k(x,x')\cdot P,$$
where $k(\cdot,\cdot)$ denotes a scalar-valued kernel function and $P\in\mathbb{R}^{d\times d}$ is a positive semi-definite matrix that encodes the relations among the output dimensions. We follow this type of kernel design in order to decrease the computational complexity. As a result, the Gram matrices presented in Section 4 can be expressed as tensor products, e.g., $K(x^n,x^n) = k(x^n,x^n)\otimes P$. More specifically, we choose the scalar-valued kernel $k(\cdot,\cdot)$ to be a standard Gaussian kernel with bandwidth $\sigma$, and $P = (1-\rho)\mathbf{1}\mathbf{1}^T + \rho I_d$ with $\rho\in[0,1]$. Next, we focus on optimizing the hyper-parameters $\sigma$ and $\rho$, where the performance criterion is based on the upper bound in Theorem 4. For existing works on scalar-valued kernel design for maximizing the power of MMD, see Liu et al. (2020); Sutherland et al. (2021).
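A sketch of the separable kernel design described above, assuming the common parameterizations $k(x,x') = \exp(-\|x-x'\|^2/(2\sigma^2))$ and $P = (1-\rho)\mathbf{1}\mathbf{1}^T + \rho I_d$ (the exact forms of $k$ and $P$ used by the authors may differ), with the bandwidth initialized by the median heuristic.

```python
import numpy as np

def median_heuristic_bandwidth(X, Y):
    """Median of pairwise Euclidean distances among the pooled samples."""
    Z = np.vstack([X, Y])
    d2 = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    d2 = np.maximum(d2, 0.0)
    return np.median(np.sqrt(d2[np.triu_indices_from(d2, k=1)]))

def separable_kernel(x, xp, sigma, rho, d):
    """K(x, x') = k(x, x') * P with a Gaussian k and P = (1 - rho) 11^T + rho I_d."""
    k = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
    P = (1.0 - rho) * np.ones((d, d)) + rho * np.eye(d)
    return k * P

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(50, 10)), rng.normal(size=(50, 10))
sigma = median_heuristic_bandwidth(X, Y)
print(separable_kernel(X[0], Y[0], sigma, rho=0.5, d=2))
```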
For a reasonably large sample size, the denominator of the upper bound in Theorem 4 can be approximated by $(KPW(\mu,\nu))^2$. Since optimizing a fraction has a pathological landscape, with gradient vanishing and explosion issues, we instead choose the hyper-parameters so that
$$\big((\sigma^*)^2, \rho^*\big) = \arg\min_{\sigma^2,\rho}\ \mathrm{MSE}(\mu,\nu) - \tfrac{1}{2}\big(KPW(\mu,\nu)\big)^2.$$
We use the bootstrap method to approximately evaluate this objective function in practice. In particular, we approximate $\mathrm{MSE}(\mu,\nu)$ with
$$\frac{1}{L}\sum_{\ell=1}^L\Big(KPW(\hat\mu_n^{(\ell)}, \hat\nu_m^{(\ell)}) - KPW(\hat\mu_n, \hat\nu_m)\Big)^2,$$
and $(KPW(\mu,\nu))^2$ with $(KPW(\hat\mu_n^{(0)}, \hat\nu_m^{(0)}))^2$, where for $\ell = 0, 1, \ldots, L$, $\hat\mu_n^{(\ell)}$ and $\hat\nu_m^{(\ell)}$ are empirical distributions re-sampled from $\hat\mu_n$ and $\hat\nu_m$, respectively. The optimization is performed using the simulated annealing algorithm from the MATLAB optimization toolbox with a fixed maximum number of iterations. The initial guess of $\rho$ is fixed, and that of $\sigma$ is based on the median heuristic outlined in (Section 8, Gretton et al. (2012a)). The advantage of this kernel design will be demonstrated in the following experiments. Throughout this section, we denote by $KPW$ the KPW distance with hyper-parameters selected by the median heuristic unless otherwise specified, and by $KPW^*$ the distance with optimized hyper-parameters.
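The bootstrap evaluation of the hyper-parameter objective can be organized as in the sketch below; `kpw` stands for any routine returning the empirical KPW distance (e.g., a wrapper around Algorithm 1), and the $\tfrac{1}{2}$ weight mirrors the reconstructed objective above, which should be treated as an assumption.

```python
import numpy as np

def bootstrap_objective(kpw, X, Y, L=20, seed=0):
    """Bootstrap estimate of MSE(mu, nu) - 0.5 * KPW(mu, nu)^2 for one
    hyper-parameter setting; `kpw(X, Y)` returns the empirical KPW distance."""
    rng = np.random.default_rng(seed)
    base = kpw(X, Y)                              # KPW(mu_hat_n, nu_hat_m)
    resampled = []
    for _ in range(L + 1):                        # l = 0, 1, ..., L
        Xb = X[rng.integers(0, len(X), len(X))]   # resample from mu_hat_n
        Yb = Y[rng.integers(0, len(Y), len(Y))]   # resample from nu_hat_m
        resampled.append(kpw(Xb, Yb))
    mse = np.mean([(r - base) ** 2 for r in resampled[1:]])   # bootstrap MSE estimate
    kpw_sq = resampled[0] ** 2                                # proxy for KPW(mu, nu)^2
    return mse - 0.5 * kpw_sq
```

The pair $(\sigma,\rho)$ minimizing this estimate can then be searched with a generic global optimizer such as simulated annealing, as done in the paper.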
We set $\mu = \nu = U([-1,1]^D)$ for several data dimensions $D$ and examine the convergence of $KPW(\hat\mu_n,\hat\nu_n)$ with respect to the sample size $n$. The dimension of the projected space is $d\in\{1,2,5\}$. In this experiment, we fix the hyper-parameters $\sigma$ and $\rho$ for $KPW$, and the hyper-parameters for $KPW^*$ are optimized for each pair of $(d, D)$ using 100 sample points from $\mu$ and $\nu$, respectively. Fig. 1 presents the values of $\log KPW$ and $\log KPW^*$ between the two empirical distributions $\hat\mu_n$ and $\hat\nu_n$ for $n\in\{5, 20, 50, 125, 250, 625\}$ over 10 independent trials. As we can see, the value of $KPW(\hat\mu_n,\hat\nu_n)$ decreases to zero quickly with respect to the sample size, which justifies the non-asymptotic uncertainty quantification results in Proposition 1. The design of the kernel further improves the convergence rate, especially for smaller values of $d$. Interestingly, the values of the empirical KPW distances for the smaller data dimensions are slightly smaller than those with $D = 30$. The intuition is that all information about the high-dimensional data points is encoded into the Gram matrix induced by the kernel functions, and the computation of the KPW distance does not depend on the data points directly but only on the Gram matrix.

Figure 3: Comparison of test power of different methods across different dimensions $D$. a) Power of tests under mean-shifted Gaussian distributions; b) power of tests under covariance-shifted Gaussian distributions with the dimension of intrinsic differences $I = 1$; c) power of tests under covariance-shifted Gaussian distributions with the dimension of intrinsic differences $I = 3$.

Our proposed KPW distance also provides a visual interpretation for identifying the differences between collected samples. Consider the HDGM dataset with $D = 2$, so that $\mu = \sum_{i=1}^2\frac{1}{2}N(m_{h_i}, I_D)$ and $\nu = \sum_{i=1}^2\frac{1}{2}N(m_{h_i}, \Sigma_i)$, where $m_{h_1}, m_{h_2}\in\mathbb{R}^2$ are fixed mean vectors, $\Sigma_1 = I_D$, and $\Sigma_2$ is a fixed non-identity covariance matrix.

Figure 2: a) Scatter plots for the 2-dimensional HDGM dataset with sample size $n = m = 400$. b) KDE plot for the empirical distributions with the projector trained by the linear projected Wasserstein distance. c) KDE plot for the empirical distributions with the projector trained by the KPW distance. The dashed line corresponds to the KPW distance with the standard kernel function (denoted $KPW$), and the solid line corresponds to the distance with the optimized kernel function (denoted $KPW^*$). The KPW distance with the optimized kernel function identifies the high-dimensional differences better, since $KPW^*$ is larger than $KPW$ on this dataset.

Fig. 2a) shows the scatter plots for this dataset with sample size $n = m = 400$. Fig. 2b) presents the kernel density estimate (KDE) plot in which the target distributions are projected into a one-dimensional linear subspace such that the Wasserstein distance between the projected distributions is maximized. Fig. 2c) demonstrates the KDE plot in which the projectors are trained by computing the KPW distance. In particular, the dashed line and the solid line correspond to $KPW$ and $KPW^*$, respectively. In contrast to the (linear) projected Wasserstein distance, the support of the projected distributions is bounded in the $\ell_2$ norm with radius $\sqrt B$ as long as Assumption 1 holds. Moreover, we can improve the performance for identifying high-dimensional differences by designing the kernel functions; in this scenario, the distance with the optimized kernel functions $KPW^*$ is larger than the distance with the standard kernel functions $KPW$.

Finally, we investigate the performance of the proposed two-sample tests. Benchmark algorithms are chosen to be Hotelling's $T^2$ test and the MMD test with standard Gaussian kernel functions. Since the acceptance region based on the concentration inequality gives conservative results in general, we first project observations into the $d$-dimensional subspace and then use MMD to identify whether the two sets of projected points belong to the same distribution or not. We reject the null hypothesis when $p < 0.05$.
The sample size was chosen as $n = m = 100$, and results were averaged over 100 independent trials. We follow a similar setup as in (Section VI.A, Xie & Xie (2020)) to examine the performance of different methods, where the datasets are specified as follows.

• $\mu, \nu$ are Gaussian distributions with different mean values, i.e., $\mu = N(0, I_D)$ and $\nu = N(\delta\mathbf{1}/\sqrt D, I_D)$ for a fixed shift $\delta > 0$. The division of the mean vector by $\sqrt D$ is to make the testing problem remain equivalently difficult to perform as $D$ grows.

• $\mu, \nu$ are covariance-shifted Gaussian distributions. In particular, $\mu = N(0, I_D)$ and $\nu = N(0, \Sigma)$, with $\Sigma$ a diagonal matrix such that $\Sigma_{i,i} = 0.25$ if $i\le I$ and $\Sigma_{i,i} = 1$ otherwise. Here $I\in\{1,3\}$ quantifies the dimension of the intrinsic differences between $\mu$ and $\nu$.

Fig. 3a)-Fig. 3c) demonstrate the power across different dimensions $D$ under the mean-shifted and covariance-shifted distributions. It can be seen that the power decreases with increasing dimension $D$. Benchmark algorithms perform well in low-dimensional settings, but their performance degrades quickly for high-dimensional problems. In contrast, our proposed tests outperform the benchmarks in high-dimensional settings. As long as a good low-dimensional projection mapping is found, the power of tests based on KPW tends to decrease more slowly compared with Hotelling's $T^2$ test and the MMD test. Interestingly, we can see that $d = 1$ is not the optimal choice to maximize the power, even when the dimension of intrinsic differences is $I = 1$. Therefore, it is of research interest to determine the optimal choice of $d$ to obtain more efficient tests, which we leave for future work. Moreover, tests based on $KPW^*$ usually outperform tests based on $KPW$.

We proposed the KPW distance for the task of two-sample testing, which operates by finding the nonlinear mapping in the data space that maximizes the distance between projected distributions. Practical algorithms together with uncertainty quantification of empirical estimates are discussed to help with this task. In future work we will develop sharper uncertainty quantification results on the empirical KPW distance.
References
Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Generalization andequilibrium in generative adversarial nets (gans), 2017.Baldassarre, L., Rosasco, L., Barla, A., and Verri, A. Vector field learning viaspectral filtering. In Balc´azar, J. L., Bonchi, F., Gionis, A., and Sebag, M.(eds.),
Machine Learning and Knowledge Discovery in Databases , pp. 56–71,Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.Bhuyan, M. H., Bhattacharyya, D. K., and Kalita, J. K. Network anomalydetection: methods, systems and tools.
Ieee communications surveys &tutorials , 16(1):303–336, 2013.Bi´nkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifyingmmd gans, 2018.Boumal, N., Absil, P.-A., and Cartis, C. Global rates of convergence for nonconvexoptimization on manifolds.
IMA Journal of Numerical Analysis , 39(1):1–33,Feb 2018.Cao, Y., Guigues, V., Juditsky, A., Nemirovski, A., and Xie, Y. Change detectionvia affine and quadratic detectors, 2018.Caponnetto, A., Micchelli, C. A., Pontil, M., and Ying, Y. Universal multi-taskkernels.
Journal of Machine Learning Research , 9(52):1615–1646, 2008.Casella, G. and Berger, R.
Statistical Inference . Duxbury Resource Center, June2001.Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A survey.
ACMComput. Surv. , 41(3), July 2009. ISSN 0360-0300.Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection for discretesequences: A survey.
IEEE transactions on knowledge and data engineering ,24(5):823–839, 2010.Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness offit.
Proceedings of Machine Learning Research , 48:2606–2615, 20–22 Jun 2016.Cover, T. M. and Thomas, J. A.
Elements of Information Theory (Wiley Seriesin Telecommunications and Signal Processing) . Wiley-Interscience, USA, 2006.Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transporta-tion distances, 2013. 16el Barrio, E., Cuesta-Albertos, J. A., Matr´an, C., and Rodriguez-Rodriguez,J. M. Tests of goodness of fit based on the l -wasserstein distance. Ann.Statist. , 27(4):1230–1239, 08 1999.Genevay, A., Chizat, L., Bach, F., Cuturi, M., and Peyr´e, G. Sample complexityof sinkhorn divergences, 2019.Ghosh, A. K. and Biswas, M. Distribution-free high-dimensional two-sampletests based on discriminating hyperplanes.
TEST , 25(3):525–547, 2016.Gin, E. and Nickl, R.
Mathematical Foundations of Infinite-Dimensional Statis-tical Models . Cambridge University Press, USA, 1st edition, 2015.Goldfeld, Z. and Greenewald, K. Gaussian-smooth optimal transport: Metricstructure and statistical efficiency, 2020.Goldfeld, Z., Greenewald, K., and Kato, K. Asymptotic guarantees for generativemodeling based on the smooth wasserstein distance, 2020a.Goldfeld, Z., Greenewald, K., Polyanskiy, Y., and Weed, J. Convergence ofsmoothed empirical measures with applications to entropy estimation, 2020b.Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. K. A fast, con-sistent kernel two-sample test. In Bengio, Y., Schuurmans, D., Lafferty, J. D.,Williams, C. K. I., and Culotta, A. (eds.),
Advances in Neural InformationProcessing Systems 22 , pp. 673–681. Curran Associates, Inc., 2009.Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch¨olkopf, B., and Smola, A. Akernel two-sample test.
J. Mach. Learn. Res. , 13:723–773, March 2012a.Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M.,Fukumizu, K., and Sriperumbudur, B. K. Optimal kernel choice for large-scaletwo-sample tests. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger,K. Q. (eds.),
Advances in Neural Information Processing Systems 25 , pp.1205–1213. Curran Associates, Inc., 2012b.Gy¨orfi, L. and Van Der Meulen, E. C.
A Consistent Goodness of Fit Test Basedon the Total Variation Distance , pp. 631–645. Springer Netherlands, Dordrecht,1991.Hotelling, H. The generalization of student’s ratio.
Ann. Math. Statist. , 2(3):360–378, 08 1931.Huang, M., Ma, S., and Lai, L. A riemannian block coordinate descent methodfor computing the projection robust wasserstein distance, 2021.Jolliffe, I.
Principal Component Analysis . Springer Verlag, 1986.Jr., F. J. M. The kolmogorov-smirnov test for goodness of fit.
Journal of theAmerican Statistical Association , 46(253):68–78, 1951.17in, T., Fan, C., Ho, N., Cuturi, M., and Jordan, M. I. Projection robustwasserstein distance and riemannian optimization, 2020a.Lin, T., Zheng, Z., Chen, E. Y., Cuturi, M., and Jordan, M. I. On projectionrobust optimal transport: Sample complexity and model misspecification,2020b.Liu, F., Xu, W., Lu, J., Zhang, G., Gretton, A., and Sutherland, D. J. Learningdeep kernels for non-parametric two-sample tests, 2020.Lloyd, J. R. and Ghahramani, Z. Statistical model criticism using kernel twosample tests. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., andGarnett, R. (eds.),
Advances in Neural Information Processing Systems 28 ,pp. 829–837. Curran Associates, Inc., 2015.Lopez-Paz, D. and Oquab, M. Revisiting classifier two-sample tests, 2018.Maurer, A. A vector-contraction inequality for rademacher complexities, 2016.McDiarmid, C.
On the method of bounded differences , pp. 148–188. LondonMathematical Society Lecture Note Series. Cambridge University Press, 1989.McLachlan, G. and Sons, J. W. .
Discriminant Analysis and Statistical PatternRecognition . Wiley Series in Probability and Statistics. Wiley, 1992.Mena, G. and Weed, J. Statistical bounds for entropic optimal transport: samplecomplexity and the central limit theorem, 2019.Micchelli, C. A. and Pontil, M. A. On learning vector-valued functions.
NeuralComput. , 17(1):177–204, January 2005.Minh, H. Q. and Sindhwani, V. Vector-valued manifold regularization. In
Pro-ceedings of the 28th International Conference on International Conference onMachine Learning , ICML’11, pp. 57–64, Madison, WI, USA, 2011. Omnipress.Mueller, J. and Jaakkola, T. Principal differences analysis: Interpretable charac-terization of differences between distributions, 2015.Pfanzagl, J. and Sheynin, O. Studies in the history of probability and statisticsXLIV A forerunner of the t-distribution.
Biometrika , 83(4):891–898, 12 1996.Pratt, J. W. and Gibbons, J. D.
Kolmogorov-Smirnov Two-Sample Tests , pp.318–344. Springer New York, New York, NY, 1981.Ramdas, A., Garcia, N., and Cuturi, M. On wasserstein two sample testing andrelated families of nonparametric tests, 2015.Reddi, S. J., Ramdas, A., PAczos, B., Singh, A., and Wasserman, L. On thedecreasing power of kernel and distance based nonparametric hypothesis testsin high dimensions, 2014. 18chober, P. and Vetter, T. Two-sample unpaired t tests in medical research.
Anesthesia and analgesia , 129:911, 10 2019.Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola,A., and Gretton, A. Generative models and model criticism via optimizedmaximum mean discrepancy, 2021.Wang, J., Gao, R., and Xie, Y. Two-sample test using projected wassersteindistance: Breaking the curse of dimensionality, 2020.Wei, S., Lee, C., Wichers, L., Li, G., and Marron, J. S. Direction-projection-permutation for high dimensional hypothesis tests, 2013.Xie, L. and Xie, Y. Sequential change detection by optimal weighted (cid:96) divergence,2020.Xie, Y. and Siegmund, D. Sequential multi-sensor change-point detection. TheAnnals of Statistics , 41(2):670–692, Apr 2013.Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. On the discrimination-generalization tradeoff in gans, 2018.19
Preliminary Technical Results
Theorem 5 (Pinsker's Inequality (Cover & Thomas, 2006)). Consider two discrete probability distributions $p = \{p_i\}_{i=1}^n$ and $q = \{q_i\}_{i=1}^n$; then we have
$$\sum_{i=1}^n p_i\log\frac{p_i}{q_i} \ge \frac{1}{2}\|p-q\|_1^2.$$

Proposition 2 (Lipschitz Properties of the Retraction Operator (Boumal et al., 2018)). There exist constants $L_1, L_2$ such that the following inequalities hold:
$$\|\mathrm{Retr}_U(\zeta) - U\| \le L_1\|\zeta\|, \qquad \|\mathrm{Retr}_U(\zeta) - (U+\zeta)\| \le L_2\|\zeta\|^2.$$

Definition 6 (Rademacher Complexity). Given the function space $\mathcal{F}$ and the distribution $\mu$, define the Rademacher complexity as
$$\mathcal{R}_n(\mathcal{F},\mu) := \mathbb{E}_{x_i\sim\mu,\,\sigma_i}\Bigg[\sup_{f\in\mathcal{F}}\bigg|\frac{1}{n}\sum_{i=1}^n\sigma_if(x_i)\bigg|\Bigg],$$
where the $\sigma_i$'s are i.i.d. and take values in $\{-1,1\}$ with equal probability.

Lemma 1 (Contraction for Vector-valued Functions (Maurer, 2016)). Let $\mathcal{H}$ be a collection of $\mathbb{R}^k$-valued functions on $\mathcal{X}\subseteq\mathbb{R}^d$. Let $\ell:\mathbb{R}^k\to\mathbb{R}$ be an $L_\ell$-Lipschitz continuous function. Denote $\ell\circ\mathcal{H}\triangleq\{z\mapsto\ell(h(z)): h\in\mathcal{H}\}$. Then we have
$$\mathcal{R}_n(\ell\circ\mathcal{H},\mu) \le \frac{\sqrt2\,L_\ell}{n}\,\mathbb{E}_{x_i\sim\mu,\,\sigma}\sup_{h\in\mathcal{H}}\sum_{i=1}^n\sum_{j=1}^k\sigma_{i,j}h_j(x_i).$$

Definition 7 (Integral Probability Metric). Given two distributions $\mu$ and $\nu$ together with the function space $\mathcal{F}$, define the integral probability metric as
$$\mathrm{IPM}_{\mathcal{F}}(\mu,\nu) = \sup_{f\in\mathcal{F}}\left(\int f(x)\,\mathrm{d}\mu(x) - \int f(y)\,\mathrm{d}\nu(y)\right).$$

Lemma 2 (Generalization Bound on IPM (Zhang et al., 2018)). Consider the integral probability metric $\mathrm{IPM}_{\mathcal{F}}$ with the discriminator function space $\mathcal{F}$; then
$$\mathbb{E}[\mathrm{IPM}_{\mathcal{F}}(\mu,\hat\mu_n)] \le 2\mathcal{R}_n(\mathcal{F},\mu),$$
where $\hat\mu_n$ is the empirical distribution constructed from $n$ i.i.d. samples generated from $\mu$.

Theorem 6 (McDiarmid's Inequality (McDiarmid, 1989)). Let $X_1,\ldots,X_n$ be independent random variables, where $X_i$ has support $\mathcal{X}_i$. Let $f:\mathcal{X}_1\times\mathcal{X}_2\times\cdots\times\mathcal{X}_n\to\mathbb{R}$ be any function with the $(c_1,\ldots,c_n)$ bounded difference property, i.e., for $i\in\{1,\ldots,n\}$ and for any $(x_1,\ldots,x_n), (x_1',\ldots,x_n')$ that differ only in the $i$-th coordinate, we have $|f(x_1,\ldots,x_n) - f(x_1',\ldots,x_n')|\le c_i$. Then for any $t>0$, we have
$$\Pr\Big\{\big|f(X_1,\ldots,X_n) - \mathbb{E}[f(X_1,\ldots,X_n)]\big| \ge t\Big\} \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$$

Lemma 3 (Equivalent Definition for Sub-Gaussian Variables (Lemma 2.3.2 in Gin & Nickl (2015))). Assume that $\mathbb{E}[\zeta] = 0$ and
$$\mathbb{P}\{|\zeta|\ge t\} \le C\exp\left(-\frac{t^2}{\sigma^2}\right), \quad t>0,$$
for some $C\ge 1$ and $\sigma>0$. Then the random variable $\zeta$ is sub-Gaussian with constant $\tilde\sigma^2 = 12(2C+1)\sigma^2$.
Proof of Theorem 1.
Assume that ˆ f is an optimal solution to the problem (2).Let S be the subspace S = n (cid:88) i =1 m (cid:88) j =1 ( K x i − K y j ) a i,j : a i,j ∈ R d . Then by the projection theorem, we have ˆ f = ˆ f S + ˆ f S ⊥ and (cid:107) ˆ f (cid:107) H = (cid:107) ˆ f S (cid:107) H + (cid:107) ˆ f S ⊥ (cid:107) H . It remains to show that ˆ f S shares the same objective value with ˆ f . Forfixed i, j , we have that (cid:107) ˆ f ( x i ) − ˆ f ( y j ) (cid:107) = max a i,j : (cid:107) a i,j (cid:107) ≤ (cid:104) ˆ f ( x i ) − ˆ f ( y j ) , a i,j (cid:105) = max a i,j : (cid:107) a i,j (cid:107) ≤ (cid:104) ˆ f ( x i ) , a i,j (cid:105) − (cid:104) ˆ f ( y j ) , a i,j (cid:105) = max a i,j : (cid:107) a i,j (cid:107) ≤ (cid:104) ˆ f , K x i a i,j (cid:105) − (cid:104) ˆ f , K y j a i,j (cid:105) = max a i,j : (cid:107) a i,j (cid:107) ≤ (cid:104) ˆ f , ( K x i − K y j ) a i,j (cid:105) = max a i,j : (cid:107) a i,j (cid:107) ≤ (cid:104) ˆ f S , ( K x i − K y j ) a i,j (cid:105) = (cid:107) ˆ f S ( x i ) − ˆ f S ( y j ) (cid:107) . Therefore, there always exists an optimal solution that lies in the subspace S , which means that there exists an optimal solution to (2) that admits thefollowing expression: ˆ f = n (cid:88) i =1 m (cid:88) j =1 ( K x i − K y j ) a i,j . Define a x,i = (cid:80) mj =1 a i,j and a y,j = (cid:80) ni =1 a i,j completes the proof.21 emma 4 (Properties of the gradient with respect to s ) . For fixed ( i, j ) , denoteby ∇ s c i,j [ s ] the gradient of c i,j = (cid:107) A i,j ω (cid:107) ,κ = (cid:107) A i,j U s (cid:107) ,κ with respect to thevariable s . We have that (cid:107)∇ s c i,j [ s ] (cid:107) ≤ κ (cid:107) AU (cid:107) ∞ , ∀ s ∈ S d ( n + m ) − , (7a) (cid:107)∇ s c i,j [ s ] − ∇ s c i,j [ s ] (cid:107) ≤ (cid:107) AU (cid:107) ∞ κ (cid:107) s − s (cid:107) , ∀ s , s ∈ S d ( n + m ) − , (7b) where (cid:107) AU (cid:107) ∞ = max i,j (cid:107) A i,j U (cid:107) .Proof of Lemma 4. The gradient of c i,j equals ∇ s c i,j [ s ] = U T A T i,j A i,j U s (cid:107) A i,j U s (cid:107) ,κ . Therefore, (cid:107)∇ s c i,j [ s ] (cid:107) ≤ (cid:107) U T A T i,j A i,j U s (cid:107) κ ≤ κ (cid:107) U T A T i,j A i,j U (cid:107) ≤ κ (cid:107) AU (cid:107) ∞ , where the first inequality is because (cid:107) A i,j U s (cid:107) ,κ ≥ κ , the second inequalityis because (cid:107) Ax (cid:107) le (cid:107) A (cid:107) for any (cid:107) x (cid:107) ≤
1, and the last inequality is because (cid:107) A (cid:107) = (cid:107) A T (cid:107) for any matrix A . It suffices to upper bound the spectral radius ofthe Hessian matrix of c i,j with respect to s when showing the Lipschitz propertyof ∇ s c i,j [ s ]. In particular, ∇ s c i,j [ s ] = (cid:107) A i,j U s (cid:107) ,κ U T A T i,j A i,j U − ( U T A T i,j A i,j U s )( U T A T i,j A i,j U s ) T (cid:107) A i,j U s (cid:107) ,κ (cid:22) κ (cid:107) AU (cid:107) ∞ · I where I stands for the identity matrix of size d ( n + m ). Lemma 5 (Lipschitzness of ∇ s F ( u, v, s )) . Let { u t , v t , s t } t be the sequence gen-erated from Algorithm 1. The following inequality holds for any s ∈ S d ( n + m ) − and λ ∈ [0 , : (cid:107)∇ s F ( u t +1 , v t +1 , λs + (1 − λ ) s t ) − ∇ s F ( u t +1 , v t +1 , s t ) (cid:107) ≤ (cid:37)λ (cid:107) s t − s (cid:107) , where (cid:37) = (cid:107) AU (cid:107) ∞ κη + mn (cid:107) AU (cid:107) ∞ κ η .Proof of Lemma 5. An intermediate result is that (cid:88) i π i,j ( u t +1 , v t +1 , s t ) = (cid:88) i exp (cid:18) − η c i,j [ s t ] + u t +1 i (cid:19) exp (cid:0) v t +1 j (cid:1) = (cid:88) i exp (cid:18) − η c i,j [ s t ] + u t +1 i (cid:19) exp (cid:0) v tj (cid:1) /m (cid:80) i π i,j ( u t +1 , v t , s t )= 1 m (cid:80) i π i,j ( u t +1 , v t , s t ) (cid:80) i π i,j ( u t +1 , v t , s t ) = 1 /m. (cid:80) i,j π i,j ( u t +1 , v t , s t ) = 1. For fixed s t , define s λ = λs + (1 − λ ) s t . Then we have that (cid:107)∇ s F ( u t +1 , v t +1 , λs + (1 − λ ) s t ) − ∇ s F ( u t +1 , v t +1 , s t ) (cid:107) = 1 η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i,j π i,j ( u t +1 , v t +1 , s λ ) · ∇ s c i,j [ s λ ] − (cid:88) i,j π i,j ( u t +1 , v t +1 , s t ) · ∇ s c i,j [ s t ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i,j π i,j ( u t +1 , v t +1 , s λ ) · (cid:18) ∇ s c i,j [ s λ ] − ∇ s c i,j [ s t ] (cid:19)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + 1 η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i,j (cid:18) π i,j ( u t +1 , v t +1 , s λ ) − π i,j ( u t +1 , v t +1 , s t ) (cid:19) · ∇ s c i,j [ s t ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:107) AU (cid:107) ∞ κη (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) i,j π i,j ( u t +1 , v t +1 , s λ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) s λ − s t (cid:107) + (cid:107) AU (cid:107) ∞ κη (cid:88) i,j (cid:12)(cid:12) π i,j ( u t +1 , v t +1 , s λ ) − π i,j ( u t +1 , v t +1 , s t ) (cid:12)(cid:12) = (cid:107) AU (cid:107) ∞ κη (cid:107) s λ − s t (cid:107) + (cid:107) AU (cid:107) ∞ κη (cid:88) i,j (cid:12)(cid:12) π i,j ( u t +1 , v t +1 , s λ ) − π i,j ( u t +1 , v t +1 , s t ) (cid:12)(cid:12) ≤ (cid:107) AU (cid:107) ∞ κη (cid:107) s λ − s t (cid:107) + mn (cid:107) AU (cid:107) ∞ κ η (cid:107) s λ − s t (cid:107) . where the second inequality is by applying (7a) and (7b), and the last inequalityis by evaluating the gradient of π i,j with respect to s and then applying (7a).Then applying the condition that (cid:107) s λ − s t (cid:107) = λ (cid:107) s − s t (cid:107) completes the proof. Lemma 6 (Boundedness of ∇ s F ( u, v, s )) . Let { u t , v t , s t } t be the sequence gen-erated from Algorithm 1. Define ζ t +1 = ∇ s F ( u t +1 , v t +1 , s t ) , then (cid:107) ζ t +1 (cid:107) ≤ (cid:107) AU (cid:107) ∞ κη . Proof of Lemma 6.
By the matrix calculus knowledge, (cid:107) ζ t +1 (cid:107) = 1 η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i,j π i,j ( u t +1 , v t +1 , s t ) · ∇ s c i,j [ s t ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:107) AU (cid:107) ∞ κη (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) i,j π i,j ( u t +1 , v t +1 , s t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:107) AU (cid:107) ∞ κη . where the inequality is by applying (7a), and the last equality is because that (cid:80) i,j π i,j ( u t +1 , v t , s t ) = 1. Lemma 7 (Decrease of F in s ) . Let { u t , v t , s t } t be the sequence generated fromAlgorithm 1. The following inequality holds for any k ≥ : F ( u t +1 , v t +1 , s t +1 ) − F ( u t +1 , v t +1 , s t ) ≤ − (cid:107) AU (cid:107) ∞ L / ( κη ) + 2 (cid:37)L (cid:107) ξ t +1 (cid:107) . roof of Lemma 7. Note that (cid:12)(cid:12) F ( u t +1 , v t +1 , s t +1 ) − F ( u t +1 , v t +1 , s t ) − (cid:104)∇ t F ( u t +1 , v t +1 , s t ) , s t +1 − s t (cid:105) (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) (cid:104)∇ s F ( u t +1 , v t +1 , λs t +1 + (1 − λ ) s t ) − ∇ s F ( u t +1 , v t +1 , s t ) , s t +1 − s t (cid:105) d λ (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:90) (cid:107)∇ s F ( u t +1 , v t +1 , λs t +1 + (1 − λ ) s t ) − ∇ s F ( u t +1 , v t +1 , s t ) (cid:107)(cid:107) s t +1 − s t (cid:107) d λ ≤ (cid:90) (cid:37)λ (cid:107) s t +1 − s t (cid:107) d λ = (cid:37) (cid:107) s t +1 − s t (cid:107) = (cid:37) (cid:13)(cid:13) Retr s t (cid:0) − τ ξ t +1 (cid:1) − s t (cid:13)(cid:13) ≤ (cid:37)τ L (cid:107) ξ t +1 (cid:107) . where the second inequality is by applying Lemma 5, and the last inequality isby applying Proposition 2. Moreover, we have that (cid:104)∇ t F ( u t +1 , v t +1 , s t ) , s t +1 − s t (cid:105) = (cid:104)∇ t F ( u t +1 , v t +1 , s t ) , − τ ξ t +1 (cid:105) + (cid:104)∇ t F ( u t +1 , v t +1 , s t ) , Retr s t (cid:0) − τ ξ t +1 (cid:1) − ( s t − τ ξ t +1 ) (cid:105)≤ − τ (cid:107) ξ t +1 (cid:107) + (cid:107)∇ s F ( u t +1 , v t +1 , s t ) (cid:107) (cid:107) Retr s t (cid:0) − τ ξ t +1 (cid:1) − ( s t − τ ξ t +1 ) (cid:107)≤ − τ (cid:107) ξ t +1 (cid:107) + (cid:107) ζ t +1 (cid:107) · L τ (cid:107) ξ t +1 (cid:107) ≤ − τ (cid:107) ξ t +1 (cid:107) + (cid:107) AU (cid:107) ∞ L τ κη (cid:107) ξ t +1 (cid:107) . Combining those inequalities above implies that F ( u k +1 , v k +1 , t k +1 ) − F ( u k +1 , v k +1 , t k ) ≤ − τ (cid:18) − (cid:20) (cid:107) AU (cid:107) ∞ L κη + (cid:37) L (cid:21) τ (cid:19) (cid:107) ξ t +1 (cid:107) . Taking τ = (cid:107) AU (cid:107) ∞ L / ( κη )+ (cid:37)L gives the desired result. Lemma 8 (Decrease of F in v ) . Let { u t , v t , s t } t be the sequence generated fromAlgorithm 1. The following inequality holds for any k ≥ : F ( u t +1 , v t +1 , s t ) − F ( u t +1 , v t , s t ) ≤ − (cid:107) /m − π ( u t +1 , v t , s t ) T (cid:107) . where (cid:107) /m − π ( u t +1 , v t , s t ) (cid:107) = (cid:88) j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m − (cid:88) i π i,j ( u t +1 , v t , s t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . roof of Lemma 8. 
According to the expression of F , we have that F ( u t +1 , v t +1 , s t ) − F ( u t +1 , v t , s t )= (cid:88) i,j π i,j ( u t +1 , v t +1 , s t ) − (cid:88) i,j π i,j ( u t +1 , v t , s t ) + 1 m m (cid:88) j =1 ( v tj − v t +1 j )= 1 m m (cid:88) j =1 ( v tj − v t +1 j ) = − m m (cid:88) j =1 log 1 /m (cid:80) i π i,j ( u t +1 , v t , s t ) , where the second equality is because that (cid:88) i π i,j ( u t +1 , v t +1 , s t ) = 1 m , (cid:88) j π i,j ( u t +1 , v t , s t ) = 1 n . Therefore, applying the Pinsker’s inequality in Theorem 5 implies that F ( u t +1 , v t +1 , s t ) − F ( u t +1 , v t , s t ) ≤ − (cid:88) j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m − (cid:88) i π i,j ( u t +1 , v t , s t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Lemma 9 (Decrease of F in u ) . Let { u t , v t , s t } t be the sequence generated fromAlgorithm 1. The following inequality holds for any t ≥ : F ( u t +1 , v t , s t ) − F ( u t , v t , s t ) ≤ − (cid:107) /n − π ( u t , v t , s t )1 (cid:107) . where (cid:107) /n − π ( u t , v t , s t )1 (cid:107) = (cid:88) i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n − (cid:88) j π i,j ( u k , v k , t k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Proof of Lemma 9.
For fixed i ∈ [ n ], define h i = (cid:88) j π i,j ( u t +1 , v t , s t ) − (cid:88) j π i,j ( u t , v t , s t ) − n log 1 /n (cid:80) j π i,j ( u t , v t , s t )According to the expression of F , F ( u t +1 , v t , s t ) − F ( u t , v t , s t ) = (cid:88) i h i , h i , i ∈ [ n ]. By substituting theexpression of u t +1 into h i , we have that h i = (cid:88) j π i,j ( u t , v t , s t ) (cid:34) /n (cid:80) j π i,j ( u t , v t , s t ) − (cid:35) − n log 1 /n (cid:80) j π i,j ( u t , v t , s t )= 1 n − (cid:0) π ( u t , v t , s t )1 (cid:1) i − n log 1 /n (cid:0) π ( u t , v t , s t )1 (cid:1) i Define the function (cid:96) ( x ) = 1 n − x − n log 1 /nx + ( x − /n ) . We can see that this function attains its maximum at x = 1 /n , with (cid:96) (1 /n ) = 0.It follows that h i ≤ − (cid:18)(cid:0) π ( u t , v t , s t )1 (cid:1) i − n (cid:19) . The proof is completed.It is reasonable to assume that the sequence { u t , v t , s t } t is bounded sincethe following lemma shows that the objective value F ( u t , v t , s t ) is monotonedecreasing and lower bounded. Hence, we assume that (cid:107) u t (cid:107) ∞ ≤ U, (cid:107) v t (cid:107) ∞ ≤ V, ∀ t. Lemma 10.
Lemma 10. Let $\{u^t,v^t,s^t\}_t$ be the sequence generated by Algorithm 1, and suppose the algorithm is terminated at iteration $t = T$, when the following conditions hold:
\[
\|\xi^{t+1}\| \le \frac{\epsilon_1}{\eta},\qquad
\big\|\mathbf{1}/n - \pi(u^t,v^t,s^t)\mathbf{1}\big\|_1 \le \frac{\epsilon_2}{4U},\qquad
\big\|\mathbf{1}/m - \pi(u^{t+1},v^t,s^t)^{\top}\mathbf{1}\big\|_1 \le \frac{\epsilon_2}{4V}.
\]
Then $(u^T,v^T,s^T)$ is an $(\epsilon_1,\epsilon_2)$-stationary point of (5).

Proof of Lemma 10. The condition $\|\xi^{T+1}\| \le \epsilon_1/\eta$ directly implies that
\[
\|\mathrm{Grad}_s F(u^T,v^T,s^T)\| \le \frac{\epsilon_1}{\eta}.
\]
Suppose that $\pi(u^T,v^T,s^T)\mathbf{1} = r$ and $\pi(u^T,v^T,s^T)^{\top}\mathbf{1} = c$, where $\|\mathbf{1}/n - r\|_1 \le \epsilon_2/(4U)$ and $\|\mathbf{1}/m - c\|_1 \le \epsilon_2/(4V)$. Then $(u^T,v^T)$ is the optimal solution to the optimization problem
\[
\min_{u\in\mathbb{R}^n,\,v\in\mathbb{R}^m}\ F'(u,v,s^T),\qquad\text{where}\quad
\pi_{i,j}(u,v,s) = \exp\big(-\eta c_{i,j} + u_i + v_j\big),\quad
F'(u,v,s) = \sum_{i,j}\pi_{i,j}(u,v,s) - \sum_{i=1}^n r_iu_i - \sum_{j=1}^m c_jv_j,
\]
since $F'$ is convex and $(u^T,v^T)$ satisfies its first-order optimality conditions. Denote by $(u^*,v^*)$ the minimizer of the problem $\min_{u,v}F(u,v,s^T)$. It follows that
\begin{align*}
&F(u^T,v^T,s^T) - \min_{u,v}F(u,v,s^T) \\
&= \big[F(u^T,v^T,s^T) - F'(u^T,v^T,s^T)\big] + \big[F'(u^T,v^T,s^T) - \min_{u,v}F(u,v,s^T)\big] \\
&\le \Big|\sum_{i=1}^n\big(r_i - 1/n\big)u^T_i\Big| + \Big|\sum_{j=1}^m\big(c_j - 1/m\big)v^T_j\Big| + \big[F'(u^*,v^*,s^T) - F(u^*,v^*,s^T)\big] \\
&\le \Big|\sum_{i=1}^n\big(r_i - 1/n\big)u^T_i\Big| + \Big|\sum_{j=1}^m\big(c_j - 1/m\big)v^T_j\Big|
+ \Big|\sum_{i=1}^n\big(r_i - 1/n\big)u^*_i\Big| + \Big|\sum_{j=1}^m\big(c_j - 1/m\big)v^*_j\Big| \\
&\le 2\big\|\mathbf{1}/n - \pi(u^T,v^T,s^T)\mathbf{1}\big\|_1\,U + 2\big\|\mathbf{1}/m - \pi(u^T,v^T,s^T)^{\top}\mathbf{1}\big\|_1\,V \le \epsilon_2.
\end{align*}
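In implementation terms, Lemma 10 says that three cheaply computable quantities certify approximate stationarity. A minimal sketch of the corresponding stopping rule (the threshold forms follow the statement above; the argument names are illustrative):

\begin{verbatim}
import numpy as np

def should_stop(xi, pi, n, m, eps1, eps2, eta, U, V):
    """Termination test of Lemma 10 for the current iterate.

    xi : Riemannian gradient at the current s
    pi : current coupling matrix of shape (n, m)
    """
    grad_ok = np.linalg.norm(xi) <= eps1 / eta
    row_ok = np.abs(1.0 / n - pi.sum(axis=1)).sum() <= eps2 / (4.0 * U)
    col_ok = np.abs(1.0 / m - pi.sum(axis=0)).sum() <= eps2 / (4.0 * V)
    return grad_ok and row_ok and col_ok
\end{verbatim}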
Lemma 11 (Lower boundedness of $F$). Denote by $(u^*,v^*,s^*)$ the global optimum of (5). Then we have that $F(u^*,v^*,s^*) \ge -\eta\|AU\|_\infty$.

Proof of Lemma 11. It is easy to show that
\[
\sum_{i,j}\pi_{i,j}(u^*,v^*,s^*) = 1.
\]
Moreover, for any $(i,j)$, we have that $c_{i,j} \le \|AU\|_\infty$. It follows that
\[
\exp\big(-\eta\|AU\|_\infty + u^*_i + v^*_j\big) \le \pi_{i,j}(u^*,v^*,s^*) \le 1,
\]
and therefore $u^*_i + v^*_j \le \eta\|AU\|_\infty$ for any $(i,j)$. Hence we conclude that
\[
\sum_{i,j}\pi_{i,j}(u^*,v^*,s^*) - \frac{1}{n}\sum_{i=1}^n u^*_i - \frac{1}{m}\sum_{j=1}^m v^*_j \ge -\eta\|AU\|_\infty.
\]
In the following, we give a re-statement of Theorem 2 together with its formal proof.
Theorem (Re-statement of Theorem 2). Choose the parameters
\[
\tau = \frac{1}{2\|AU\|_\infty L_2/(\kappa\eta) + \varrho L_1^2},\qquad
\eta = \epsilon_2,\qquad
\varrho = \frac{\|AU\|_\infty}{\kappa\eta} + \frac{mn\|AU\|_\infty^2}{\kappa^2\eta^2},
\]
and terminate Algorithm 1 when
\[
\|\xi^{t+1}\| \le \frac{\epsilon_1}{\eta},\qquad
\big\|\mathbf{1}/n - \pi(u^t,v^t,s^t)\mathbf{1}\big\|_1 \le \frac{\epsilon_2}{4U},\qquad
\big\|\mathbf{1}/m - \pi(u^{t+1},v^t,s^t)^{\top}\mathbf{1}\big\|_1 \le \frac{\epsilon_2}{4V}.
\]
We say that $(\hat u,\hat v,\hat s)$ is an $(\epsilon_1,\epsilon_2)$-stationary point of (5) if
\[
\|\mathrm{Grad}_s F(\hat u,\hat v,\hat s)\| \le \frac{\epsilon_1}{\eta},\qquad
F(\hat u,\hat v,\hat s) - \min_{u,v}F(u,v,\hat s) \le \epsilon_2,
\]
where $\mathrm{Grad}_s F(u,v,s)$ denotes the partial gradient of $F$ with respect to the variable $s$ on the sphere $\mathcal S^{d(n+m)-1}$. (Since the gradient $\zeta^{t+1}$ can be expressed as a vector multiplied by $1/\eta$, we impose the rescaled condition $\eta\,\|\mathrm{Grad}_s F(\hat u,\hat v,\hat s)\| \le \epsilon_1$.) Then Algorithm 1 returns an $(\epsilon_1,\epsilon_2)$-stationary point within
\[
T = O\!\left(\kappa\cdot\left[\frac{1}{\epsilon_1^{2}} + \frac{1}{\epsilon_1^{2}\epsilon_2}\right]\right)
\]
iterations.

Proof of Theorem 2.
Combining Lemma 7, Lemma 8, and Lemma 9 yields the following one-iteration descent bound:
\begin{align*}
&F(u^{t+1},v^{t+1},s^{t+1}) - F(u^t,v^t,s^t) \\
&\le -\left(\big\|\mathbf{1}/n - \pi(u^t,v^t,s^t)\mathbf{1}\big\|_2^2
+ \frac{1}{2}\big\|\mathbf{1}/m - \pi(u^{t+1},v^t,s^t)^{\top}\mathbf{1}\big\|_1^2
+ \frac{\|\xi^{t+1}\|^2}{4\|AU\|_\infty L_2/(\kappa\eta) + 2\varrho L_1^2}\right) \\
&= -\left(\big\|\mathbf{1}/n - \pi(u^t,v^t,s^t)\mathbf{1}\big\|_2^2
+ \frac{1}{2}\big\|\mathbf{1}/m - \pi(u^{t+1},v^t,s^t)^{\top}\mathbf{1}\big\|_1^2
+ \frac{\kappa^2\eta^2\|\xi^{t+1}\|^2}{D}\right),
\end{align*}
where the equality follows by substituting the choice of $\varrho$ and we abbreviate $D := 2\kappa\eta\|AU\|_\infty(L_1^2+2L_2) + 2mn\|AU\|_\infty^2L_1^2$. Summing over $t = 0,\dots,T-1$ gives
\begin{align*}
F(u^T,v^T,s^T) - F(u^0,v^0,s^0)
&\le -\sum_{t=0}^{T-1}\left(\big\|\mathbf{1}/n - \pi(u^t,v^t,s^t)\mathbf{1}\big\|_2^2
+ \frac{1}{2}\big\|\mathbf{1}/m - \pi(u^{t+1},v^t,s^t)^{\top}\mathbf{1}\big\|_1^2
+ \frac{\kappa^2\eta^2}{D}\|\xi^{t+1}\|^2\right) \\
&\le -\min\Big\{\frac{1}{2},\,\frac{\kappa^2}{D}\Big\}
\sum_{t=0}^{T-1}\left(\big\|\mathbf{1}/n - \pi(u^t,v^t,s^t)\mathbf{1}\big\|_2^2
+ \big\|\mathbf{1}/m - \pi(u^{t+1},v^t,s^t)^{\top}\mathbf{1}\big\|_1^2
+ \eta^2\|\xi^{t+1}\|^2\right) \\
&\le -T\cdot\min\Big\{\frac{1}{2},\,\frac{\kappa^2}{D}\Big\}
\cdot\min\Big\{\epsilon_1^2,\,\frac{\epsilon_2^2}{16U^2},\,\frac{\epsilon_2^2}{16V^2}\Big\},
\end{align*}
where the last inequality holds because, before termination, at least one of the three stopping conditions is violated at every iteration. Rearranging and applying Lemma 11,
\begin{align*}
T &\le \big[F(u^0,v^0,s^0) - F(u^T,v^T,s^T)\big]
\cdot\max\Big\{2,\,\frac{D}{\kappa^2}\Big\}
\cdot\max\Big\{\frac{1}{\epsilon_1^2},\,\frac{16U^2}{\epsilon_2^2},\,\frac{16V^2}{\epsilon_2^2}\Big\} \\
&\le \big(F(u^0,v^0,s^0) + \eta\|AU\|_\infty\big)
\cdot\max\Big\{2,\,\frac{D}{\kappa^2}\Big\}
\cdot\max\Big\{\frac{1}{\epsilon_1^2},\,\frac{16U^2}{\epsilon_2^2},\,\frac{16V^2}{\epsilon_2^2}\Big\}
= O\!\left(\kappa\cdot\left[\frac{1}{\epsilon_1^{2}} + \frac{1}{\epsilon_1^{2}\epsilon_2}\right]\right).
\end{align*}
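For completeness, here is a tiny helper that evaluates the step size and smoothness parameter prescribed by the re-stated theorem; the formulas follow the reconstruction above, with $L_1, L_2$ the retraction constants of Proposition 2, so the exact constants should be treated as indicative rather than definitive.

\begin{verbatim}
def theorem2_parameters(AU_inf, kappa, eta, m, n, L1, L2):
    """Step size tau and smoothness parameter rho used in the re-stated Theorem 2."""
    rho = AU_inf / (kappa * eta) + m * n * AU_inf**2 / (kappa**2 * eta**2)
    tau = 1.0 / (2.0 * AU_inf * L2 / (kappa * eta) + rho * L1**2)
    return tau, rho
\end{verbatim}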
C  Technical Proofs in Section 5

We first present several technical lemmas before giving the proof of Theorem 1.
Lemma 12.
Based on Assumption 1, for $f\in\{f\in\mathcal H : \|f\|_{\mathcal H}\le 1\}$, we have
\[
\|f(x)\| \le \sqrt{B},\qquad \forall x\in\mathbb R^D.
\]

Proof of Lemma 12.
For fixed $x\in\mathcal X$, the norm of $f(x)$ can be upper bounded as follows:
\[
\|f(x)\|^2 = \langle f(x), f(x)\rangle = \langle f, K_xf(x)\rangle_{\mathcal H}
\le \|f\|_{\mathcal H}\,\|K_xf(x)\|_{\mathcal H} \le \|K_xf(x)\|_{\mathcal H}.
\]
In particular,
\[
\|K_xf(x)\|_{\mathcal H}^2 = \langle K_xf(x), K_xf(x)\rangle_{\mathcal H}
= \big\langle \big(K_xf(x)\big)(x),\, f(x)\big\rangle
= \langle K(x,x)f(x),\, f(x)\rangle
= f(x)^{\top}K(x,x)f(x) \le B\,\|f(x)\|^2.
\]
Combining the two relations above implies the desired result.
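As a quick numerical sanity check of Lemma 12, the sketch below uses a separable matrix-valued kernel $K(x,x')=k(x,x')I_d$ with a Gaussian scalar kernel (so $\|K(x,x)\|\le B$ with $B=1$), builds a kernel expansion normalized to $\|f\|_{\mathcal H}\le 1$, and verifies $\|f(x)\|\le\sqrt B$ at random inputs. The kernel choice and construction are illustrative only, not the paper's experimental setup.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
D, d, N, B = 3, 2, 8, 1.0          # input dim, output dim, expansion size, k(x,x)=B

def k(x, y, bw=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * bw ** 2))

Z = rng.standard_normal((N, D))     # expansion points z_1, ..., z_N
A = rng.standard_normal((N, d))     # coefficients a_1, ..., a_N in R^d
G = np.array([[k(zi, zj) for zj in Z] for zi in Z])   # scalar Gram matrix
# ||f||_H^2 = sum_{i,j} k(z_i, z_j) <a_i, a_j>; rescale so that ||f||_H = 1.
A /= np.sqrt(np.einsum('ij,ik,jk->', G, A, A))

def f(x):
    """f(x) = sum_i K(x, z_i) a_i = sum_i k(x, z_i) a_i for a separable kernel."""
    return np.array([k(x, zi) for zi in Z]) @ A

vals = [np.linalg.norm(f(rng.standard_normal(D))) for _ in range(200)]
print(max(vals) <= np.sqrt(B) + 1e-9)   # True: ||f(x)|| <= sqrt(B)
\end{verbatim}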
Lemma 13.
Let $\ell:\mathbb R^d\to\mathbb R$ be a $1$-Lipschitz continuous function. Define the function space
\[
\mathcal F_\ell = \big\{\mathbb R^D\ni z\mapsto \ell(f(z)):\ f\in\mathcal H,\ \|f\|_{\mathcal H}\le 1\big\}.
\]
Then we have that
\[
\mathcal R_n(\mathcal F_\ell,\mu) \le \frac{\sqrt{2dB}}{\sqrt n}.
\]

Proof of Lemma 13. The Rademacher complexity can be upper bounded by applying the contraction inequality in Lemma 1:
\begin{align*}
\mathcal R_n(\mathcal F_\ell,\mu)
&\le \frac{\sqrt 2}{n}\,\mathbb E_{x_i\sim\mu,\sigma}\left[\sup_{f\in\mathcal H}\sum_{i=1}^n\sum_{j=1}^d\sigma_{i,j}f_j(x_i)\right]
= \frac{\sqrt 2}{n}\,\mathbb E_{x_i\sim\mu,\sigma}\left[\sup_{f\in\mathcal H}\sum_{i=1}^n\langle\sigma_i, f(x_i)\rangle\right] \\
&= \frac{\sqrt 2}{n}\,\mathbb E_{x_i\sim\mu,\sigma}\left[\sup_{f\in\mathcal H}\sum_{i=1}^n\langle f, K_{x_i}\sigma_i\rangle_{\mathcal H}\right]
= \frac{\sqrt 2}{n}\,\mathbb E_{x_i\sim\mu,\sigma}\left[\sup_{f\in\mathcal H}\Big\langle f, \sum_{i=1}^n K_{x_i}\sigma_i\Big\rangle_{\mathcal H}\right],
\end{align*}
where the first equality expresses the inner summation over $j$ in inner-product form, the second equality uses the reproducing property, and the last equality uses the linearity of the inner product. Moreover, the inner product can be upper bounded as follows:
\[
\Big\langle f, \sum_{i=1}^n K_{x_i}\sigma_i\Big\rangle_{\mathcal H}
\le \|f\|_{\mathcal H}\Big\|\sum_{i=1}^n K_{x_i}\sigma_i\Big\|_{\mathcal H}
= \sqrt{\Big\langle \sum_{i=1}^n K_{x_i}\sigma_i,\ \sum_{i=1}^n K_{x_i}\sigma_i\Big\rangle_{\mathcal H}}
= \sqrt{\sum_i\sum_j\langle K_{x_i}\sigma_i, K_{x_j}\sigma_j\rangle_{\mathcal H}}
= \sqrt{\sum_i\sum_j\big\langle\big(K_{x_i}\sigma_i\big)(x_j),\ \sigma_j\big\rangle}
= \sqrt{\sum_i\sum_j\sigma_j^{\top}K(x_j,x_i)\sigma_i}.
\]
Therefore,
\[
\mathbb E_{x_i\sim\mu,\sigma}\left[\sup_{f\in\mathcal H}\Big\langle f, \sum_{i=1}^n K_{x_i}\sigma_i\Big\rangle_{\mathcal H}\right]
\le \mathbb E_{x_i\sim\mu,\sigma}\sqrt{\sum_i\sum_j\sigma_j^{\top}K(x_j,x_i)\sigma_i}
\le \sqrt{\mathbb E_{x_i\sim\mu,\sigma}\sum_i\sum_j\sigma_j^{\top}K(x_j,x_i)\sigma_i}
= \sqrt{\mathbb E_{x_i\sim\mu,\sigma}\Big[\sum_i\sigma_i^{\top}K(x_i,x_i)\sigma_i\Big]}
\le \sqrt{\mathbb E_{\sigma}\sum_i B\,\sigma_i^{\top}\sigma_i}
= \sqrt{ndB},
\]
where the second inequality uses Jensen's inequality and the equality holds because the cross terms with $i\neq j$ vanish in expectation. This implies $\mathcal R_n(\mathcal F_\ell,\mu) \le \frac{\sqrt{2dB}}{\sqrt n}$.
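The last chain of inequalities bounds $\mathbb E_\sigma\big[(\sum_{i,j}\sigma_j^{\top}K(x_j,x_i)\sigma_i)^{1/2}\big]$ by $\sqrt{ndB}$. A quick Monte Carlo check with a separable Gaussian kernel (purely illustrative choices of kernel and data):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n, d, D, B = 50, 2, 3, 1.0

def k(x, y, bw=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * bw ** 2))

X = rng.standard_normal((n, D))
G = np.array([[k(xi, xj) for xj in X] for xi in X])   # scalar Gram matrix, k(x,x)=B=1

# Monte Carlo estimate of E_sigma sqrt( sum_{i,j} k(x_i, x_j) <sigma_i, sigma_j> ).
draws = [rng.choice([-1.0, 1.0], size=(n, d)) for _ in range(200)]
est = np.mean([np.sqrt(np.einsum('ij,ik,jk->', G, S, S)) for S in draws])
print(est, np.sqrt(n * d * B))   # the estimate stays below sqrt(n * d * B)
\end{verbatim}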
Lemma 14. We have that
\[
\mathbb E[\mathrm{KPW}(\hat\mu_n,\mu)] \le \frac{2\sqrt{2dB}}{\sqrt n} + \zeta_{n,d},
\]
where
\[
\zeta_{n,d} = \inf_{\epsilon>0}\left\{\epsilon + 6\sqrt{\frac{B}{n}}\sqrt{\log\Big(\Big\lceil\frac{4\sqrt B}{\epsilon}\Big\rceil + 1\Big) + \Big(1 + \frac{4\sqrt B}{\epsilon}\Big)^d\log 2}\right\}.
\]

Proof of Lemma 14.
Define the function space
\[
\mathcal F_\ell = \big\{\mathbb R^D\ni z\mapsto \ell(f(z)):\ f\in\mathcal H,\ \|f\|_{\mathcal H}\le 1\big\}.
\]
Without loss of generality, for the kernel projected Wasserstein distance it suffices to consider the following discriminator function space:
\[
T = \Big\{\ell: B(\sqrt B)\to\mathbb R\ \Big|\ \ell(0)=0,\ |\ell(y)-\ell(y')| \le \|y-y'\|\ \ \forall y,y'\in B(\sqrt B)\Big\},
\]
where $B(r) := \{y\in\mathbb R^d: \|y\|\le r\}$. Then, by the bias-variation decomposition, we have that
\[
\mathbb E[\mathrm{KPW}(\hat\mu_n,\mu)]
= \mathbb E\Big[\sup_{\ell\in T}\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)\Big]
\le \sup_{\ell\in T}\mathbb E\big[\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)\big]
+ \mathbb E\Big[\sup_{\ell\in T}\big(\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu) - \mathbb E[\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)]\big)\Big].
\]
The first term satisfies
\[
\sup_{\ell\in T}\mathbb E\big[\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)\big]
\le 2\sup_{\ell\in T}\mathcal R_n(\mathcal F_\ell,\mu) \le \frac{2\sqrt{2dB}}{\sqrt n}.
\]
Now we upper bound the variation term. Define the empirical process $X_\ell = \mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu) - \mathbb E[\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)]$. It is easy to see that $\mathbb E[X_\ell] = 0$, and for fixed $\ell$ the random variable $X_\ell$ is sub-Gaussian: write $\hat\mu_n = \frac1n\sum_{i=1}^n\delta_{z_i}$, $Z = \{z_i\}_{i=1}^n$, and $g(Z) = \mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)$. Then we have that
\[
\big|g(Z) - g(Z'^{(i)})\big| \le \mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\hat\mu_n')
= \frac1n\sup_{f\in\mathcal H,\,\|f\|_{\mathcal H}\le1}\big|\ell(f(z_i))-\ell(f(z_i'))\big|
\le \frac1n\sup_{f\in\mathcal H,\,\|f\|_{\mathcal H}\le1}\|f(z_i)-f(z_i')\|
\le \frac{2\sqrt B}{n}.
\]
Therefore, applying McDiarmid's inequality (Theorem 6) implies
\[
\Pr\{|X_\ell| \ge u\} \le 2\exp\Big(-\frac{nu^2}{2B}\Big).
\]
Applying Lemma 3 then implies that, for fixed $\ell$, $X_\ell$ is a sub-Gaussian random variable with parameter $\sigma^2 = B/n$. Consider the metric space $(T, d_\infty)$ on the collection of Lipschitz functions $\ell\in T$, with the distance
\[
d_\infty(\ell,\ell') = \sup_{y\in B(\sqrt B)}|\ell(y) - \ell'(y)|.
\]
Then for any $\ell,\ell'\in T$, we have that
\begin{align*}
|X_\ell - X_{\ell'}|
&\le \big|\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu) - \mathrm{IPM}_{\mathcal F_{\ell'}}(\hat\mu_n,\mu)\big|
+ \big|\mathbb E[\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)] - \mathbb E[\mathrm{IPM}_{\mathcal F_{\ell'}}(\hat\mu_n,\mu)]\big| \\
&\le \sup_f\mathbb E_{\hat\mu_n}\big[|\ell(f(z)) - \ell'(f(z))|\big] + \sup_f\mathbb E_{\mu}\big[|\ell(f(z)) - \ell'(f(z))|\big] \\
&\quad + \mathbb E\Big[\sup_f\mathbb E_{\hat\mu_n}\big[|\ell(f(z)) - \ell'(f(z))|\big] + \sup_f\mathbb E_{\mu}\big[|\ell(f(z)) - \ell'(f(z))|\big]\Big] \\
&\le 4\,d_\infty(\ell,\ell').
\end{align*}
Applying the $\epsilon$-net argument gives
\[
\mathbb E\Big[\sup_{\ell\in T}X_\ell\Big] \le \inf_{\epsilon>0}\left\{\epsilon + 6\sqrt{\frac{B}{n}}\sqrt{\log N(T,d_\infty,\epsilon)}\right\},
\]
together with the covering number estimate
\[
N(T,d_\infty,\epsilon)
\le \left(\Big\lceil\frac{4\sqrt B}{\epsilon}\Big\rceil + 1\right)\cdot 2^{\,N(B(\sqrt B),\,\epsilon/2,\,\|\cdot\|)}
\le \left(\Big\lceil\frac{4\sqrt B}{\epsilon}\Big\rceil + 1\right)\cdot 2^{\,(1+4\sqrt B/\epsilon)^d},
\]
which implies that
\[
\mathbb E\Big[\sup_{\ell\in T}X_\ell\Big]
\le \inf_{\epsilon>0}\left\{\epsilon + 6\sqrt{\frac{B}{n}}\sqrt{\log\Big(\Big\lceil\frac{4\sqrt B}{\epsilon}\Big\rceil + 1\Big) + \Big(1 + \frac{4\sqrt B}{\epsilon}\Big)^d\log 2}\right\}.
\]
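The bound above is only informative after the infimum over $\epsilon$ is evaluated. Here is a small sketch that minimizes the expression for $\zeta_{n,d}$ over a grid of $\epsilon$ values; the constants follow the reconstruction of the display above, so the numerical output should be treated as indicative.

\begin{verbatim}
import numpy as np

def zeta(n, d, B, eps_grid=None):
    """Grid search for zeta_{n,d} = inf_eps { eps + 6 sqrt(B/n) sqrt(log-covering term) }."""
    if eps_grid is None:
        eps_grid = np.logspace(-3, 1, 400)
    log_cover = (np.log(np.ceil(4 * np.sqrt(B) / eps_grid) + 1)
                 + (1 + 4 * np.sqrt(B) / eps_grid) ** d * np.log(2))
    return float(np.min(eps_grid + 6 * np.sqrt(B / n) * np.sqrt(log_cover)))

print(zeta(n=1000, d=2, B=1.0))
\end{verbatim}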
Proof of Proposition 1. For the first part of this proposition, we consider the following bias-variance decomposition of $|\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathrm{KPW}(\mu,\nu)|$:
\[
\big|\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathrm{KPW}(\mu,\nu)\big|
\le \big|\mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)] - \mathrm{KPW}(\mu,\nu)\big|
+ \big|\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)]\big|,
\]
where the first term quantifies the bias of the empirical estimate and the second term quantifies its variance. The bias term can be upper bounded by applying the triangle inequality together with Lemma 14:
\[
\big|\mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)] - \mathrm{KPW}(\mu,\nu)\big|
\le \mathbb E[\mathrm{KPW}(\hat\mu_n,\mu)] + \mathbb E[\mathrm{KPW}(\hat\nu_m,\nu)]
\le \frac{2\sqrt{2dB}}{\sqrt n} + \frac{2\sqrt{2dB}}{\sqrt m} + \zeta_{n,d} + \zeta_{m,d}.
\]
It remains to bound the variance term $|\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)]|$. For fixed $f\in\mathcal H$, define $X = \{x_i\}_{i=1}^n$, $Y = \{y_i\}_{i=1}^m$, and $g(X,Y) = \mathcal W(f_{\#}\hat\mu_n, f_{\#}\hat\nu_m)$. We use $X'^{(i)}$ to denote a collection of data points that differs from $X$ only in the $i$-th component, and $Y'^{(i)}$ is defined likewise. Then we have that
\[
g(X,Y) - g(X'^{(i)},Y) \le \mathcal W(f_{\#}\hat\mu_n, f_{\#}\hat\mu_n')
\le n^{-1}\|f(x_i) - f(x_i')\| \le \frac{2\sqrt B}{n},
\]
where the last inequality follows by applying Lemma 12. Similarly, we have $g(X,Y) - g(X,Y'^{(i)}) \le 2\sqrt B/m$. Therefore, applying McDiarmid's inequality (Theorem 6) with $c_i = 2\sqrt B/n$ for $i = 1,\dots,n$ and $c_i = 2\sqrt B/m$ for $i = n+1,\dots,n+m$ implies
\[
\Pr\Big\{\big|\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)]\big| \ge \epsilon\Big\}
\le 2\exp\Big(-\frac{\epsilon^2mn}{2B(m+n)}\Big).
\]
In other words, with probability at least $1-\alpha$, we have that
\[
\big|\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)]\big|
\le \sqrt{\frac{2B(m+n)}{mn}\log\frac{2}{\alpha}}.
\]
The first part then follows by combining this with the upper bound on $\mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)]$. Now we prove the second part of this proposition. In this case, we can follow the same idea as in Lemma 14 to show that
\[
\mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_n)] \le \frac{4\sqrt{2dB}}{\sqrt n} + 2\zeta_{n,d}.
\]
Define $X = \{x_i\}_{i=1}^n$, $Y = \{y_i\}_{i=1}^n$, and $\Delta(X,Y) = \mathrm{KPW}(\hat\mu_n,\hat\nu_n)$. Then
\[
\big|\Delta(X,Y) - \Delta(X'^{(i)},Y)\big| \le \frac{2\sqrt B}{n},\qquad
\big|\Delta(X,Y) - \Delta(X,Y'^{(i)})\big| \le \frac{2\sqrt B}{n}.
\]
Hence, applying McDiarmid's inequality (Theorem 6) with $c_i = 2\sqrt B/n$ for $i = 1,\dots,2n$ implies
\[
\Pr\Big\{\big|\mathrm{KPW}(\hat\mu_n,\hat\nu_n) - \mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_n)]\big| \ge \epsilon\Big\}
\le 2\exp\Big(-\frac{n\epsilon^2}{4B}\Big).
\]
The second part of this proposition is proved by combining this with the upper bound on $\mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_n)]$.
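The McDiarmid step above yields an explicit finite-sample deviation radius for the two-sample statistic. A one-line helper evaluating it (the constant follows the display above; $\alpha$ is the confidence level):

\begin{verbatim}
import numpy as np

def kpw_deviation_radius(n, m, B, alpha):
    """With probability >= 1 - alpha, the empirical KPW statistic deviates from
    its expectation by at most this amount (per the McDiarmid bound above)."""
    return np.sqrt(2.0 * B * (m + n) / (m * n) * np.log(2.0 / alpha))

print(kpw_deviation_radius(n=500, m=500, B=1.0, alpha=0.05))
\end{verbatim}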
Proof of Theorem 4. The type-II error admits the following upper bound:
\begin{align*}
\text{Type-II error}
&= \Pr_{H_1}\big(\mathrm{KPW}(\hat\mu_n,\hat\nu_m) < \gamma_{m,n}\big) \\
&= \Pr_{H_1}\big(\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathrm{KPW}(\mu,\nu) < \gamma_{m,n} - \mathrm{KPW}(\mu,\nu)\big) \\
&= \Pr_{H_1}\big(\mathrm{KPW}(\mu,\nu) - \mathrm{KPW}(\hat\mu_n,\hat\nu_m) > \mathrm{KPW}(\mu,\nu) - \gamma_{m,n}\big) \\
&\le \Pr_{H_1}\big(\big|\mathrm{KPW}(\mu,\nu) - \mathrm{KPW}(\hat\mu_n,\hat\nu_m)\big| > \mathrm{KPW}(\mu,\nu) - \gamma_{m,n}\big) \\
&\le \frac{\mathrm{MSE}(\mu,\nu)}{\big(\mathrm{KPW}(\mu,\nu) - \gamma_{m,n}\big)^2},
\end{align*}
where the last inequality follows from Chebyshev's inequality.
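The final display is directly computable once an upper estimate of the mean-squared estimation error and the test threshold $\gamma_{m,n}$ are available. A minimal helper (the argument names are illustrative; an MSE bound could be obtained from Proposition 1):

\begin{verbatim}
def type2_error_bound(kpw_true, gamma_mn, mse_bound):
    """Chebyshev-type bound on the type-II error; informative only when
    kpw_true exceeds the test threshold gamma_mn."""
    if kpw_true <= gamma_mn:
        return 1.0
    return min(1.0, mse_bound / (kpw_true - gamma_mn) ** 2)

print(type2_error_bound(kpw_true=0.5, gamma_mn=0.2, mse_bound=0.01))
\end{verbatim}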