Two-sample Test with Kernel Projected Wasserstein Distance
Jie Wang, Rui Gao, Yao Xie ∗ February 15, 2021
Abstract
We develop a kernel projected Wasserstein distance for the two-sample test, an essential building block in statistics and machine learning: given two sets of samples, to determine whether they are from the same distribution. This method operates by finding the nonlinear mapping in the data space which maximizes the distance between projected distributions. In contrast to existing works about projected Wasserstein distance, the proposed method circumvents the curse of dimensionality more efficiently. We present practical algorithms for computing this distance function together with the non-asymptotic uncertainty quantification of empirical estimates. Numerical examples validate our theoretical results and demonstrate good performance of the proposed method.
∗ J. Wang and Y. Xie are with the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology. R. Gao is with the Department of Information, Risk, and Operations Management, University of Texas at Austin.

Two-sample hypothesis testing is a fundamental problem in statistical inference (Casella & Berger, 2001), which aims to determine whether two collections of samples are from the same distribution or not. This problem has wide applications in scientific discovery and machine learning problems such as anomaly detection (Chandola et al., 2009, 2010; Bhuyan et al., 2013), where the abnormal observations follow a distinct distribution compared with the typical observations. Similarly, in change-point detection (Xie & Siegmund, 2013; Cao et al., 2018; Xie & Xie, 2020), the post-change observations follow a different distribution from the pre-change one. Other examples include model criticism (Lloyd & Ghahramani, 2015; Chwialkowski et al., 2016; Bińkowski et al., 2018), causal inference (Lopez-Paz & Oquab, 2018), and health care (Schober & Vetter, 2019).

Traditional methods for two-sample tests have focused on the parametric or low-dimensional testing scenario. One can design a parametric test when extra information about the underlying distribution is available. Typical examples include Hotelling's two-sample test (Hotelling, 1931), Student's t-test (Pfanzagl & Sheynin, 1996), etc. Non-parametric two-sample tests can be more appealing when it is difficult to specify the exact parametric form of the distributions. Integral probability metrics are popular for the design of non-parametric two-sample tests, since the corresponding empirical test statistics can be estimated directly from samples without knowing the underlying distributions. Some earlier works construct two-sample tests based on the Kolmogorov-Smirnov distance (Pratt & Gibbons, 1981; Jr., 1951), the total variation distance (Györfi & Van Der Meulen, 1991), and the Wasserstein distance (del Barrio et al., 1999; Ramdas et al., 2015). However, those tests are not applicable when performing two-sample tests in high-dimensional settings, since the sample complexity for estimating those distance functions from collected samples suffers from the curse of dimensionality.

There is a strong need for developing non-parametric tests for high-dimensional data, especially for modern applications. A notable contribution is the two-sample test based on the Maximum Mean Discrepancy (MMD) statistic (Gretton et al., 2012a, 2009, 2012b). However, as pointed out in (Reddi et al., 2014), the power of MMD-based tests decreases polynomially with an increasing dimension of the target distributions, which indicates that it is not well suited for high-dimensional testing. Recent works (Wang et al., 2020; Xie & Xie, 2020) use the idea of dimensionality reduction for handling high-dimensional data and introduce the projected Wasserstein distance as the test statistic, i.e., the test statistic works by finding the linear projector such that the distance between projected distributions is maximized.
However, a linear projector may not be an optimal design for maximizing the power of tests.

In this paper, we present a new non-parametric two-sample test statistic aimed at the high-dimensional setting, based on a kernel projected Wasserstein (KPW) distance, with a nonlinear projector based on a reproducing kernel Hilbert space designed to optimize the test power via maximizing the probability distance between the distributions after projection. In addition, our contributions include the following:

• We develop a computationally efficient algorithm for evaluating the KPW distance, using a representer theorem to reformulate the problem as a finite-dimensional optimization problem and a block coordinate descent optimization algorithm which is guaranteed to find an $\epsilon$-stationary point within the iteration complexity stated in Theorem 2.

• To quantify the false detection rate, which is essential in setting the detection threshold, we develop non-asymptotic bounds for the empirical KPW distance based on the Rademacher complexity and an $\epsilon$-net argument.

• We present numerical experiments that validate our theoretical results and demonstrate the good performance of our proposed test.

The remainder of this paper is organized as follows. Section 3 introduces some preliminary knowledge on two-sample testing and related probability distances, Section 4 outlines a practical algorithm for computing the KPW distance, Section 5 studies the uncertainty quantification of the empirical KPW distance together with some theoretical analysis for the designed two-sample test, Section 6 demonstrates some numerical experiments, and Section 7 presents some concluding remarks.
Generalization on Integral Probability Metrics.
Our proposed KPW distance falls into the framework of integral probability metrics (IPMs), which are widely used in non-parametric two-sample tests. In particular, several efficient two-sample tests have been constructed based on the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012a, 2009, 2012b), a special case of IPM. The key to designing efficient two-sample tests is to balance the discriminative-generalization trade-off, i.e., the discriminator function space should be large enough to make $\mathrm{IPM}(\mu,\nu)$ discriminative, so that $\mathrm{IPM}(\mu,\nu) = 0$ implies $\mu = \nu$, and the space should be relatively small to make $\mathrm{IPM}(\mu,\nu)$ generalizable, in the sense that the sample complexity for estimating the IPM from empirical samples is small. (Arora et al., 2017) studies the generalization bound for general GAN metrics, and it turns out that the neural-net distance enjoys generalization properties. (Zhang et al., 2018) discusses necessary and sufficient conditions for the discriminative and generalization properties of general IPMs. A recent work Wang et al. (2020) extends the arguments above to investigate the two-sample convergence rate of empirical IPMs. The proof for the uncertainty quantification of the empirical KPW distance follows similar arguments to this work.

The Wasserstein distance is a special case of IPM where the discriminator function space is the collection of 1-Lipschitz functions. However, in high-dimensional settings, the Wasserstein distance is difficult to compute, and the convergence rate of the empirical Wasserstein distance is slow. Several variants of the Wasserstein distance have been developed in the literature to address these two issues. The following outlines two approaches in existing works for breaking the curse of high dimensions.

Smoothed Wasserstein Distance.
The Sinkhorn divergence was first proposed in (Cuturi, 2013) to reduce the computational cost by adding entropic regularization to the original Wasserstein distance. A recent work (Genevay et al., 2019) shows that the convergence rate of empirical Sinkhorn divergences is of order $O(n^{-1/2})$ with respect to the sample size $n$, which is independent of the dimension of the underlying distributions. The follow-up work (Mena & Weed, 2019) further extends this convergence result to unbounded supports and tightens the convergence rate up to constants. Another variant of the smoothed Wasserstein distance considers the Wasserstein distance between the distributions convolved with an isotropic Gaussian distribution. (Goldfeld et al., 2020b) shows that, under a sub-Gaussian assumption, the convergence rate of such Gaussian-smoothed empirical measures (for the Wasserstein distance, total variation distance, etc.) is parametric regardless of the dimension $d$. Based on this result, (Goldfeld & Greenewald, 2020) further establishes the related metric structure and non-asymptotic statistical efficiency of the Gaussian-smoothed Wasserstein distance, and (Goldfeld et al., 2020a) establishes the associated asymptotic statistical results.

Dimension Reduction.
Low-dimensional projections are widely used for understanding the structure of high-dimensional distributions. Typical examples include principal component analysis (Jolliffe, 1986), linear discriminant analysis (McLachlan & Sons, 1992), etc. It is intuitive to identify the differences between two collections of high-dimensional samples by projecting those samples into low-dimensional spaces along some proper directions. (Wei et al., 2013; Ghosh & Biswas, 2016) design the direction by training binary linear classifiers on samples. (Mueller & Jaakkola, 2015; Xie & Xie, 2020) find the worst-case linear projector that maximizes the probability distance between the sample points projected into one dimension. The recent work Lin et al. (2020a) naturally extends this idea by considering projectors that map sample points into a $k$-dimensional linear subspace with $k \ge 1$, called the projected Wasserstein distance. Since computing the projected Wasserstein distance is costly for large-scale problems, Huang et al. (2021) proposes a block coordinate descent (BCD) algorithm to find an $\epsilon$-stationary point of the optimization problem in polynomially many iterations in $1/\epsilon$. The statistical properties of this metric are discussed comprehensively in (Lin et al., 2020b). Based on these uncertainty quantification results, (Wang et al., 2020) applies the projected Wasserstein distance to the task of two-sample testing. However, the uncertainty quantification results are loose in general, and a linear projection mapping may not be the optimal choice for maximizing the performance of two-sample tests. In contrast, our work focuses on a more general metric called the KPW distance, in which a general nonlinear mapping defined in a Hilbert space is considered. We follow a similar idea to Huang et al. (2021) to design a BCD algorithm for computing the KPW distance, so that an $\epsilon$-stationary point can be found in polynomially many iterations in $1/\epsilon$.

Let $x^n := \{x_i\}_{i=1}^n$ and $y^m := \{y_i\}_{i=1}^m$ be i.i.d. samples generated from distributions $\mu$ and $\nu$, respectively. Our goal is to design a two-sample test which, given samples $x^n$ and $y^m$, decides to accept the null hypothesis $H_0: \mu = \nu$ or the alternative hypothesis $H_1: \mu \neq \nu$. Denote by $T: (x^n, y^m) \to \{d_0, d_1\}$ the two-sample test, where $d_i$ means that we accept the hypothesis $H_i$ and reject the other, $i = 0, 1$.
Define the type-I risk as the probability of rejecting hypothesis $H_0$ when it is true, and the type-II risk as the probability of accepting $H_0$ when $\mu \neq \nu$:
$$\epsilon^{(\mathrm{I})}_{n,m} = \mathbb{P}_{x^n\sim\mu,\,y^m\sim\nu}\big(T(x^n,y^m) = d_1\big) \text{ under } H_0, \qquad \epsilon^{(\mathrm{II})}_{n,m} = \mathbb{P}_{x^n\sim\mu,\,y^m\sim\nu}\big(T(x^n,y^m) = d_0\big) \text{ under } H_1.$$
Given $\alpha, \beta \in (0,1)$, we aim at building a two-sample test such that, when applied to $n$-observation samples $x^n$ and $m$-observation samples $y^m$, it has type-I risk at most $\alpha$ (i.e., at level $\alpha$) and type-II risk at most $\beta$ (i.e., of power $1-\beta$). Moreover, we want to ensure these specifications with sample sizes $n, m$ as small as possible.

Denote by $\hat\mu_n$ and $\hat\nu_m$ the empirical distributions constructed from i.i.d. samples from $\mu$ and $\nu$:
$$\hat\mu_n \triangleq \frac{1}{n}\sum_{i=1}^n\delta_{x_i}, \qquad \hat\nu_m \triangleq \frac{1}{m}\sum_{i=1}^m\delta_{y_i}.$$
We propose a non-parametric test by considering probability distance functions between the two empirical distributions constructed from the observed samples. Specifically, we design a test $T$ such that the null hypothesis $H_0$ is rejected when
$$D(\hat\mu_n, \hat\nu_m) > \ell,$$
where $D(\cdot,\cdot)$ is a divergence quantifying the difference between two distributions, and $\ell$ is a data-dependent threshold. In particular, we consider the divergence $D$ as a special type of integral probability metric (IPM).

Definition 1 (Integral Probability Metric). Given two distributions $\mu$ and $\nu$, define the integral probability metric as
$$\mathrm{IPM}_{\mathcal{F}}(\mu,\nu) = \sup_{f\in\mathcal{F}}\left(\int f(x)\,\mathrm{d}\mu(x) - \int f(y)\,\mathrm{d}\nu(y)\right),$$
where $\mathcal{F}$ denotes the discriminator function space.

Several existing tests can be viewed as special cases of the IPM, including the MMD test, the total variation distance test, etc. In this paper, we design the divergence $D$ based on the 1-Wasserstein distance. General $p$-Wasserstein distances with $p\in(1,\infty]$ are beyond the scope of IPMs, so we leave the discussion of general $p$-Wasserstein distances to future work.

Definition 2 (1-Wasserstein Distance). Given two distributions $\mu$ and $\nu$, the 1-Wasserstein distance is defined as
$$W(\mu,\nu) = \min_{\pi\in\Pi(\mu,\nu)}\int\|x-y\|\,\mathrm{d}\pi(x,y),$$
where $\|\cdot\|$ denotes the Euclidean norm, and $\Pi(\mu,\nu)$ denotes the set of joint distributions with marginal distributions $\mu$ and $\nu$.

While the Wasserstein distance has wide applications in machine learning, the finite-sample convergence rate of the Wasserstein distance between empirical distributions is slow in high-dimensional settings. Therefore, it is not suitable for high-dimensional two-sample tests. Instead, we use the projection idea to remedy this issue.

Definition 3 (Projected Wasserstein Distance). Given two distributions $\mu$ and $\nu$, define the projected Wasserstein distance as
$$PW(\mu,\nu) = \max_{\substack{A:\,\mathbb{R}^D\to\mathbb{R}^d\\ A^TA = I_d}} W(\mathcal{A}_{\#}\mu, \mathcal{A}_{\#}\nu),$$
where the operator $\#$ denotes the push-forward operator, i.e., $\mathcal{A}(z)\sim\mathcal{A}_{\#}\mu$ for $z\sim\mu$, and we denote by $\mathcal{A}$ the linear operator such that $\mathcal{A}(z) = A^Tz$ with $z\in\mathbb{R}^D$ and $A\in\mathbb{R}^{D\times d}$.

Although this idea is demonstrated to be useful for breaking the curse of dimensionality of the original Wasserstein distance, a linear projector is not an optimal choice for dimensionality reduction. Instead, we will consider a nonlinear projector to obtain a more powerful two-sample test, and we use functions in a vector-valued reproducing kernel Hilbert space (RKHS) for projection.
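As a small illustration of Definitions 2 and 3, the sketch below evaluates the inner objective of the projected Wasserstein distance, i.e., the 1-Wasserstein distance between two empirical samples after a fixed linear projection $\mathcal{A}(z)=A^Tz$; with equal sample sizes and uniform weights this reduces to an optimal assignment. The helper names and the use of `scipy.optimize.linear_sum_assignment` are illustrative choices, and the maximization over $A$ (or over $f$ in the KPW case) is not performed here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def projected_w1(X, Y, A):
    """1-Wasserstein distance between two equal-size empirical samples after
    the linear projection z -> A^T z (uniform weights, so the optimal coupling
    can be taken to be a permutation)."""
    Xp, Yp = X @ A, Y @ A
    cost = np.linalg.norm(Xp[:, None, :] - Yp[None, :, :], axis=-1)
    row, col = linear_sum_assignment(cost)
    return cost[row, col].mean()

rng = np.random.default_rng(0)
D, d, n = 20, 2, 100
X = rng.normal(size=(n, D))
Y = rng.normal(loc=0.3, size=(n, D))
A, _ = np.linalg.qr(rng.normal(size=(D, d)))   # random orthonormal projector, A^T A = I_d
print(projected_w1(X, Y, A))                   # inner objective of Definition 3 for this A
```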
Definition 4 (Vector-valued RKHS). A function $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}^{d\times d}$ is said to be a positive semi-definite kernel if
$$\sum_{i=1}^N\sum_{j=1}^N\langle y_i, K(x_i,x_j)y_j\rangle \ge 0$$
for any finite set of points $\{x_i\}_{i=1}^N$ in $\mathcal{X}$ and $\{y_i\}_{i=1}^N$ in $\mathbb{R}^d$. Given such a kernel, there exists a unique $\mathbb{R}^d$-valued Hilbert space $\mathcal{H}_K$ with reproducing kernel $K$. For fixed $x\in\mathcal{X}$ and $y\in\mathbb{R}^d$, define the kernel section $K_x$ with the action $y$ as the mapping $K_xy:\mathcal{X}\to\mathbb{R}^d$ such that
$$(K_xy)(x') = K(x',x)y, \quad \forall x'\in\mathcal{X}.$$
In particular, the Hilbert space $\mathcal{H}_K$ satisfies the reproducing property: for all $f\in\mathcal{H}_K$, $\langle f, K_xy\rangle_{\mathcal{H}_K} = \langle f(x), y\rangle$.

Definition 5 (Kernel Projected Wasserstein Distance). Consider an $\mathbb{R}^d$-valued RKHS $\mathcal{H}$ with corresponding kernel function $K$. Given two distributions $\mu$ and $\nu$, define the kernel projected Wasserstein (KPW) distance as
$$KPW(\mu,\nu) = \max_{f\in\mathcal{H}}\Big\{W(f_{\#}\mu, f_{\#}\nu):\ \|f\|_{\mathcal{H}}\le 1\Big\}.$$

In this paper, we design the two-sample test such that the acceptance region for the null hypothesis $H_0$ is given by
$$\big\{(x^n, y^m):\ KPW(\hat\mu_n, \hat\nu_m)\le\gamma_{n,m}\big\}, \qquad (1)$$
where the hyper-parameter $\gamma_{n,m}$ is chosen such that the two-sample test is at level $\alpha$. In the following sections, we consider how to compute this divergence and how to obtain $\gamma_{n,m}$ based on concentration results.
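To make Definitions 4 and 5 concrete, the following sketch evaluates a vector-valued RKHS element of the finite-expansion form $f(\cdot)=\sum_i K_{x_i}a_i$ for a separable kernel $K(x,x')=k(x,x')P$, together with its squared RKHS norm $\sum_{i,j}\langle a_i, K(x_i,x_j)a_j\rangle$; the Gaussian scalar kernel and the particular positive semi-definite matrix $P$ are assumptions made only for illustration.

```python
import numpy as np

def gaussian_k(x, xp, sigma=1.0):
    # scalar Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def eval_rkhs_map(z, anchors, coeffs, P, sigma=1.0):
    """Evaluate f(z) = sum_i K(z, x_i) a_i with the separable kernel K = k * P."""
    return sum(gaussian_k(z, xi, sigma) * (P @ ai) for xi, ai in zip(anchors, coeffs))

def rkhs_sq_norm(anchors, coeffs, P, sigma=1.0):
    """||f||_H^2 = sum_{i,j} <a_i, K(x_i, x_j) a_j> for f = sum_i K_{x_i} a_i."""
    return sum(gaussian_k(xi, xj, sigma) * (ai @ P @ aj)
               for xi, ai in zip(anchors, coeffs)
               for xj, aj in zip(anchors, coeffs))

rng = np.random.default_rng(0)
D, d, n = 10, 2, 5                                    # data dim, projected dim, anchors
X = rng.normal(size=(n, D))                           # anchor points x_1, ..., x_n
A = rng.normal(size=(n, d))                           # coefficients a_1, ..., a_n
P = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)           # an illustrative PSD output matrix
z = rng.normal(size=D)
print(eval_rkhs_map(z, X, A, P), rkhs_sq_norm(X, A, P))
```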
Computing KPW Distance

By the definition of the Wasserstein distance, computing $KPW(\hat\mu_n,\hat\nu_m)$ is equivalent to the following max-min problem:
$$\max_{f\in\mathcal{H}}\ \min_{\pi\in\Gamma}\ \sum_{i,j}\pi_{i,j}\,\|f(x_i) - f(y_j)\|, \quad \text{s.t. } \|f\|_{\mathcal{H}}\le 1, \qquad (2)$$
where $\Gamma = \big\{\pi\in\mathbb{R}^{n\times m}_+:\ \sum_j\pi_{i,j} = \tfrac{1}{n},\ \sum_i\pi_{i,j} = \tfrac{1}{m}\big\}$.

The computation of the KPW distance poses several challenges. It is crucial to design a suitable kernel function to obtain low computational complexity and reliable testing power, which will be discussed in Section 6.1. Moreover, the function $f\in\mathcal{H}$ is a countable combination of basis functions, i.e., problem (2) is an infinite-dimensional optimization problem. By developing the representer theorem in Theorem 1, we are able to convert this problem into a finite-dimensional one. Finally, there is no theoretical guarantee for finding the global optimum, since (2) is a non-convex, non-smooth optimization problem; moreover, Sion's minimax theorem is not applicable due to the lack of convex-concave structure. Based on this observation, we focus on optimization algorithms that find a locally optimal point in polynomial time.

Theorem 1 (Representer Theorem for the KPW Distance). There exists an optimal solution to (2) that admits the following expression:
$$\hat f = \sum_{i=1}^n K_{x_i}a_{x,i} - \sum_{j=1}^m K_{y_j}a_{y,j},$$
where $K_x(\cdot)$ denotes the kernel section and $a_{x,i}, a_{y,j}\in\mathbb{R}^d$ are coefficients to be determined.

To express the optimal solution in compact matrix form, define $a_x\in\mathbb{R}^{nd}$ as the concatenation of the coefficients $a_{x,i}$ for $i = 1,\ldots,n$, and
$$K_z(x^n) = \big(K(z,x_1)\ \cdots\ K(z,x_n)\big)\in\mathbb{R}^{d\times nd}.$$
We define the vector $a_y$ and the matrix $K_z(y^m)$ likewise. Then we have
$$\hat f(z) = K_z(x^n)a_x - K_z(y^m)a_y, \quad \forall z\in\mathcal{X}.$$
Define the Gram matrix $K(x^n,x^n)$ as the $n\times n$ block matrix whose $(i,j)$-th block is $K(x_i,x_j)$. The Gram matrices $K(x^n,y^m)$, $K(y^m,x^n)$ and $K(y^m,y^m)$ are defined likewise. Denote by $G$ the concatenation of Gram matrices:
$$G = \begin{pmatrix} K(x^n,x^n) & -K(x^n,y^m) \\ -K(y^m,x^n) & K(y^m,y^m) \end{pmatrix},$$
and we assume that $G$ is positive definite; otherwise, we add a small multiple of the identity matrix to $G$ to make it invertible.
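A minimal sketch of the Gram-matrix bookkeeping above, assuming the separable kernel $K(x,x')=k(x,x')P$ used later in Section 6.1 (so each block Gram matrix is a Kronecker product) and a small diagonal jitter when $G$ is not numerically positive definite; the function names and the jitter value are illustrative.

```python
import numpy as np

def scalar_gram(X, Z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2)), returned as an |X| x |Z| matrix
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def block_gram(Xn, Ym, P, sigma=1.0, jitter=1e-8):
    """G = [[K(x,x), -K(x,y)], [-K(y,x), K(y,y)]] with K(.,.) = k(.,.) kron P."""
    kxx, kxy, kyy = (scalar_gram(Xn, Xn, sigma),
                     scalar_gram(Xn, Ym, sigma),
                     scalar_gram(Ym, Ym, sigma))
    top = np.hstack([np.kron(kxx, P), -np.kron(kxy, P)])
    bot = np.hstack([-np.kron(kxy, P).T, np.kron(kyy, P)])
    G = np.vstack([top, bot])
    return G + jitter * np.eye(G.shape[0])     # keep G numerically positive definite

rng = np.random.default_rng(0)
Xn, Ym = rng.normal(size=(4, 3)), rng.normal(size=(5, 3))
P = np.eye(2)
G = block_gram(Xn, Ym, P)
U = np.linalg.cholesky(np.linalg.inv(G))       # G^{-1} = U U^T, used for the change of variable s = U^{-1} w
print(G.shape, U.shape)
```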
Substituting the expression of $\hat f(z)$, $z\in\mathcal{X}$, into (2), we obtain a finite-dimensional optimization problem:
$$\max_{\omega}\ \min_{\pi\in\Gamma}\ \sum_{i,j}\pi_{i,j}c_{i,j}, \quad \text{s.t. } \omega^TG\omega \le 1,$$
where $\omega\in\mathbb{R}^{d(n+m)}$ is the concatenation of the vectors $a_x, a_y$, and
$$A_{i,j} = \big[K_{x_i}(x^n) - K_{y_j}(x^n),\ K_{y_j}(y^m) - K_{x_i}(y^m)\big], \qquad c_{i,j} = \|A_{i,j}\omega\|.$$
Suppose that the inverse of $G$ admits the Cholesky decomposition $G^{-1} = UU^T$. Then, by the homogeneity of the $\ell_2$ norm and the change of variable $s = U^{-1}\omega$, we obtain the norm-constrained optimization problem
$$\max_{s\in\mathcal{S}^{d(n+m)-1}}\ \min_{\pi\in\Gamma}\ \sum_{i,j}\pi_{i,j}c_{i,j}, \qquad (3)$$
where $\mathcal{S}^{d(n+m)-1} = \{s\in\mathbb{R}^{d(n+m)}: s^Ts = 1\}$.

Note that the $\ell_2$ norm in the objective function is not continuously differentiable at zero. To this end, we approximate the $\ell_2$ norm with the following form:
$$\|x\| \approx \|x\|_{2,\kappa} := \sqrt{\sum_i x_i^2 + \kappa^2} - \kappa,$$
where $\kappa$ is a small positive smoothing constant. It follows that the approximated function is $1/\kappa$-smooth.

For ease of optimization, we consider the entropic regularization of problem (3):
$$\max_{s\in\mathcal{S}^{d(n+m)-1}}\ \min_{\pi\in\Gamma}\ \sum_{i,j}\pi_{i,j}c_{i,j} - \eta H(\pi). \qquad (4)$$
By the duality theory of entropic optimal transport and the change-of-variable technique, (4) is equivalent to the following minimization problem:
$$\min_{s\in\mathcal{S}^{d(n+m)-1},\ u\in\mathbb{R}^n,\ v\in\mathbb{R}^m} F(u,v,s), \qquad (5)$$
where
$$\pi_{i,j}(u,v,s) = \exp\Big(-\frac{c_{i,j}}{\eta} + u_i + v_j\Big), \qquad F(u,v,s) = \sum_{i,j}\pi_{i,j}(u,v,s) - \frac{1}{n}\sum_{i=1}^n u_i - \frac{1}{m}\sum_{j=1}^m v_j.$$

Based on this formulation, we consider a Riemannian block coordinate descent (BCD) method for optimization, whose $t$-th iteration performs the following updates:
$$u^{t+1} = \arg\min_{u\in\mathbb{R}^n} F(u, v^t, s^t), \qquad (6a)$$
$$v^{t+1} = \arg\min_{v\in\mathbb{R}^m} F(u^{t+1}, v, s^t), \qquad (6b)$$
$$\zeta^{t+1} = \sum_{i,j}\nabla_s\pi_{i,j}(u^{t+1}, v^{t+1}, s^t), \qquad (6c)$$
$$\xi^{t+1} = P_{s^t}\big(\zeta^{t+1}\big), \qquad (6d)$$
$$s^{t+1} = \mathrm{Retr}_{s^t}\big(-\tau\xi^{t+1}\big), \qquad (6e)$$
where the operator $P_s(\zeta)$ denotes the orthogonal projection of the vector $\zeta$ onto the tangent space of the manifold $\mathcal{S}^{d(n+m)-1}$ at $s$,
$$P_s(\zeta) = \zeta - \langle s,\zeta\rangle s, \quad s\in\mathcal{S}^{d(n+m)-1},$$
and the retraction on this manifold is defined as
$$\mathrm{Retr}_s(-\tau\xi) = \frac{s-\tau\xi}{\|s-\tau\xi\|}, \quad s\in\mathcal{S}^{d(n+m)-1}.$$
Note that the update steps (6a) and (6b) have closed-form expressions:
$$u^{t+1} = u^t + \bigg\{\log\frac{1/n}{\sum_j\pi_{i,j}(u^t,v^t,s^t)}\bigg\}_{i\in[n]}, \qquad (6f)$$
$$v^{t+1} = v^t + \bigg\{\log\frac{1/m}{\sum_i\pi_{i,j}(u^{t+1},v^t,s^t)}\bigg\}_{j\in[m]}, \qquad (6g)$$
and the Euclidean gradient $\zeta^{t+1}$ in (6c) can be computed using the chain rule:
$$\zeta^{t+1} = -\frac{1}{\eta}\sum_{i,j}\pi_{i,j}(u^{t+1},v^{t+1},s^t)\cdot\frac{U^TA_{i,j}^TA_{i,j}Us^t}{\|A_{i,j}Us^t\|_{2,\kappa}}. \qquad (6h)$$
By making use of the structure of the matrices $A_{i,j}$, the gradient $\zeta^{t+1}$ can be implemented while avoiding double-loop computations. The overall algorithm for solving problem (5) is summarized in Algorithm 1.

Algorithm 1 BCD Algorithm for Solving (5)
Require: Empirical distributions $\hat\mu_n$ and $\hat\nu_m$.
  Initialize $v^0$, $s^0$.
  for $t = 0, 1, 2, \ldots, T-1$ do
    Update $u^{t+1} = \arg\min_u F(u, v^t, s^t)$ according to (6f).
    Update $v^{t+1} = \arg\min_v F(u^{t+1}, v, s^t)$ according to (6g).
    Update the Euclidean gradient $\zeta^{t+1}$ and the Riemannian gradient $\xi^{t+1}$ according to (6h) and (6d).
    Update $s^{t+1}$ according to (6e).
  end for
  Return $u^* = u^T$, $v^* = v^T$, $s^* = s^T$.

Theorem 2 (Convergence Analysis for BCD). We say that $(\hat u, \hat v, \hat s)$ is an $(\epsilon_1,\epsilon_2)$-stationary point of (5) if
$$\|\mathrm{Grad}_sF(\hat u,\hat v,\hat s)\| \le \epsilon_1/\eta, \qquad F(\hat u,\hat v,\hat s) - \min_{u,v}F(u,v,\hat s) \le \epsilon_2,$$
where $\mathrm{Grad}_sF(u,v,s)$ denotes the partial derivative of $F$ with respect to the variable $s$ on the sphere $\mathcal{S}^{d(n+m)-1}$. Let $\{u^t,v^t,s^t\}$ be the sequence generated by Algorithm 1. Then Algorithm 1 returns an $(\epsilon_1,\epsilon_2)$-stationary point within a number of iterations $T$ that depends polynomially on $1/\epsilon_1$ and $1/\epsilon_2$ (and on the smoothing parameter $\kappa$), where the constant depends on the Gram matrix $G$ and the Lipschitz properties of the retraction operator.

The proof of Theorem 2 follows a similar idea as in Huang et al. (2021) and can be found in Appendix B.
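The loop below is a minimal Python sketch of Algorithm 1, not the authors' implementation: it applies the closed-form updates (6f)-(6g), the gradient (6h), the tangent-space projection (6d), and the sphere retraction (6e). The smoothing constant $\kappa$, step size $\tau$, regularization $\eta$, and the naive double loop over $(i,j)$ (which the paper notes can be avoided) are illustrative choices; the $A_{i,j}$ blocks are assumed to be assembled from the kernel sections as in Theorem 1.

```python
import numpy as np

def smooth_l2(x, kappa):
    # ||x||_{2,kappa} = sqrt(sum x_i^2 + kappa^2) - kappa
    return np.sqrt(np.sum(x**2) + kappa**2) - kappa

def bcd_kpw(A_blocks, U, n, m, eta=0.1, tau=0.01, kappa=1e-3, iters=200, seed=0):
    """One possible realization of the BCD updates (6f)-(6e).
    A_blocks[i][j] is the d x d(n+m) matrix A_{i,j}; U satisfies G^{-1} = U U^T."""
    rng = np.random.default_rng(seed)
    dim = U.shape[1]
    s = rng.normal(size=dim); s /= np.linalg.norm(s)
    u, v = np.zeros(n), np.zeros(m)

    def costs(s):
        return np.array([[smooth_l2(A_blocks[i][j] @ (U @ s), kappa)
                          for j in range(m)] for i in range(n)])

    for _ in range(iters):
        C = costs(s)
        pi = np.exp(-C / eta + u[:, None] + v[None, :])
        u = u + np.log((1.0 / n) / pi.sum(axis=1))            # update (6f)
        pi = np.exp(-C / eta + u[:, None] + v[None, :])
        v = v + np.log((1.0 / m) / pi.sum(axis=0))            # update (6g)
        pi = np.exp(-C / eta + u[:, None] + v[None, :])
        zeta = np.zeros(dim)                                   # Euclidean gradient (6h)
        for i in range(n):
            for j in range(m):
                Aus = A_blocks[i][j] @ (U @ s)
                # gradient of the smoothed norm of A_{i,j} U s
                zeta -= pi[i, j] / eta * (U.T @ A_blocks[i][j].T @ Aus) / (smooth_l2(Aus, kappa) + kappa)
        xi = zeta - np.dot(s, zeta) * s                        # tangent projection (6d)
        step = s - tau * xi
        s = step / np.linalg.norm(step)                        # retraction (6e)
    return u, v, s, (pi * C).sum()   # entropic transport cost at the last inner iterate
```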
In order to design the two-sample test at a certain level, we first build a finite-sample guarantee for the convergence of the kernel projected Wasserstein (KPW) distance. Results throughout this section are based on the following assumption.
Assumption 1.
For any $x, x'\in\mathcal{X}$, the matrix-valued kernel $K(x,x')$ is symmetric and satisfies $0 \preceq K(x,x') \preceq BI_d$.

Proposition 1 (Finite-sample Guarantee for the Convergence of KPW). Under Assumption 1, the following concentration bounds hold.

1. With probability at least $1-\alpha$, we have
$$\big|KPW(\hat\mu_n,\hat\nu_m) - KPW(\mu,\nu)\big| \le \sqrt{\frac{2B(m+n)}{mn}\log\frac{2}{\alpha}} + \frac{2\sqrt{dB}}{\sqrt n} + \frac{2\sqrt{dB}}{\sqrt m} + \zeta_{n,d} + \zeta_{m,d},$$
where $\zeta_{N,d}$ denotes a term of order
$$\zeta_{N,d} = \begin{cases}\tilde{\mathcal{O}}(N^{-1/2}), & \text{if } d\le 2,\\ \mathcal{O}(N^{-1/d}), & \text{if } d\ge 3.\end{cases}$$

2. Assume further that the sample sizes satisfy $n = m$ and the target distributions satisfy $\mu = \nu$. Then with probability at least $1-\alpha$, we have
$$\big|KPW(\hat\mu_n,\hat\nu_n)\big| \le 2\sqrt{\frac{B}{n}\log\frac{2}{\alpha}} + \frac{2\sqrt{dB}}{\sqrt n} + \zeta_{n,d}.$$

The projected Wasserstein distance was proposed to break the curse of dimensionality of the original Wasserstein distance, and there are several existing works on the sample complexity of the projected Wasserstein distance, including (Mueller & Jaakkola, 2015; Wang et al., 2020; Lin et al., 2020b). However, the upper bounds presented in (Mueller & Jaakkola, 2015; Lin et al., 2020b) do not express the constant term explicitly, and the theoretical guarantee discussed in (Wang et al., 2020) relies on the restrictive assumption that the support of the target distributions is compact. In contrast, our finite-sample guarantee on the kernel projected Wasserstein distance gives tighter concentration results and applies to distributions with unbounded support. Therefore, the proposed kernel projected Wasserstein distance breaks the curse of dimensionality more efficiently.

Based on the concentration inequality above, we can develop a two-sample test at a certain level. It turns out that the acceptance region does not depend on the dimension of the support of the distributions, but only on the number of samples and the dimension of the projected space.
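For the separable kernels used later in Section 6.1, $K(x,x') = k(x,x')P$ with a Gaussian scalar kernel (so $0\le k\le 1$), Assumption 1 holds with $B$ equal to the largest eigenvalue of $P$; the snippet below is a small sanity check under that assumption, with an illustrative choice of $P$.

```python
import numpy as np

# For K(x, x') = k(x, x') * P with 0 <= k(x, x') <= 1 (e.g., a Gaussian k),
# Assumption 1 holds with B = lambda_max(P), since 0 <= k * P <= B * I_d.
rho, d = 0.5, 3
P = (1.0 - rho) * np.ones((d, d)) + rho * np.eye(d)   # illustrative PSD choice of P
B = np.linalg.eigvalsh(P).max()
print("B =", B)
```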
Theorem 3 (Test Based on Finite-sample Uncertainty Quantification). Designs of the two-sample test at a certain level can be summarized as follows.

1. A hypothesis test of level $\alpha$ for the null hypothesis $\mu=\nu$ has the acceptance region
$$KPW(\hat\mu_n,\hat\nu_m) \le \sqrt{\frac{2B(m+n)}{mn}\log\frac{2}{\alpha}} + \frac{2\sqrt{dB}}{\sqrt n} + \frac{2\sqrt{dB}}{\sqrt m} + \zeta_{n,d} + \zeta_{m,d}.$$

2. Assume further that $m = n$. Then a hypothesis test of level $\alpha$ for the null hypothesis $\mu=\nu$ has the acceptance region
$$KPW(\hat\mu_n,\hat\nu_n) \le 2\sqrt{\frac{B}{n}\log\frac{2}{\alpha}} + \frac{2\sqrt{dB}}{\sqrt n} + \zeta_{n,d}.$$
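Taking the reconstructed bound in Theorem 3 at face value (the absolute constants here may differ from those in the original typesetting), the acceptance threshold can be evaluated as below; the remainder terms $\zeta_{n,d}$ and $\zeta_{m,d}$ are left as user-supplied inputs since only their order is specified.

```python
import numpy as np

def kpw_threshold(n, m, d, B, alpha, zeta_n=0.0, zeta_m=0.0):
    """Acceptance threshold gamma_{n,m} from the finite-sample bound (Theorem 3).
    zeta_n / zeta_m stand in for the O(n^{-1/d})-type remainder terms."""
    return (np.sqrt(2.0 * B * (m + n) / (m * n) * np.log(2.0 / alpha))
            + 2.0 * np.sqrt(d * B) / np.sqrt(n)
            + 2.0 * np.sqrt(d * B) / np.sqrt(m)
            + zeta_n + zeta_m)

print(kpw_threshold(n=100, m=100, d=2, B=1.0, alpha=0.05))
```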
Moreover, we provide an upper bound on the type-II error of our designed test based on Markov's inequality.

Theorem 4 (Power of the Two-sample Test). Provided that $KPW(\mu,\nu) > \gamma_{m,n}$, the type-II error of the test with the acceptance region presented in (1) can be upper bounded as
$$\Pr_{H_1}\Big(KPW(\hat\mu_n, \hat\nu_m) < \gamma_{m,n}\Big) \le \frac{\mathrm{MSE}(\mu,\nu)}{\big(KPW(\mu,\nu) - \gamma_{m,n}\big)^2},$$
where $\mathrm{MSE}(\mu,\nu) = \mathbb{E}\big[\big(KPW(\hat\mu_n,\hat\nu_m) - KPW(\mu,\nu)\big)^2\big]$.

Figure 1: MSE of KPW distances between the empirical distributions $\hat\mu_n$ and $\hat\nu_n$ across different numbers of samples $n$ (log-MSE versus sample size $n\in\{5, 20, 50, 125, 250, 625\}$, for $d\in\{1,2,5\}$ and for both $KPW$ and $KPW^*$). Results are averaged over 10 independent trials.

The upper bound in Theorem 4 characterizes two interesting features of the two-sample test: as the MSE between the underlying KPW distance and its empirical estimate decreases, the type-II error of the designed test decreases; and when the discrepancy between $\mu$ and $\nu$ increases, i.e., $KPW(\mu,\nu)$ increases, the type-II error of the designed test decreases. In Section 6.1 we design the matrix-valued kernel to approximately minimize this upper bound on the type-II error.

Before the practical implementation, we outline how to design the matrix-valued kernel to approximately minimize the type-II error. Most existing works, including Minh & Sindhwani (2011); Micchelli & Pontil (2005); Caponnetto et al. (2008); Baldassarre et al. (2010), consider the design of the matrix-valued kernel function as
$$K(x,x') = k(x,x')\cdot P,$$
where $k(\cdot,\cdot)$ denotes a scalar-valued kernel function and $P\in\mathbb{R}^{d\times d}$ is a positive semi-definite matrix that encodes the relations among the output dimensions. We follow this type of kernel design in order to decrease the computational complexity. As a result, the Gram matrices presented in Section 4 can be expressed as tensor products, e.g., $K(x^n,x^n) = k(x^n,x^n)\otimes P$. More specifically, we choose the scalar-valued kernel $k(\cdot,\cdot)$ to be a standard Gaussian kernel with bandwidth $\sigma$, and $P = (1-\rho)\mathbf{1}\mathbf{1}^T + \rho I_d$ with $\rho\in[0,1]$. Next, we focus on optimizing the hyper-parameters $\sigma$ and $\rho$, where the performance criterion is based on the upper bound in Theorem 4. For existing works on scalar-valued kernel design for maximizing the power of MMD, see Liu et al. (2020); Sutherland et al. (2021).
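A sketch of the separable kernel design described above, assuming the common parameterizations $k(x,x') = \exp(-\|x-x'\|^2/(2\sigma^2))$ and $P = (1-\rho)\mathbf{1}\mathbf{1}^T + \rho I_d$ (the exact forms of $k$ and $P$ used by the authors may differ), with the bandwidth initialized by the median heuristic.

```python
import numpy as np

def median_heuristic_bandwidth(X, Y):
    """Median of pairwise Euclidean distances among the pooled samples."""
    Z = np.vstack([X, Y])
    d2 = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    d2 = np.maximum(d2, 0.0)
    return np.median(np.sqrt(d2[np.triu_indices_from(d2, k=1)]))

def separable_kernel(x, xp, sigma, rho, d):
    """K(x, x') = k(x, x') * P with a Gaussian k and P = (1 - rho) 11^T + rho I_d."""
    k = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
    P = (1.0 - rho) * np.ones((d, d)) + rho * np.eye(d)
    return k * P

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(50, 10)), rng.normal(size=(50, 10))
sigma = median_heuristic_bandwidth(X, Y)
print(separable_kernel(X[0], Y[0], sigma, rho=0.5, d=2))
```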
For a reasonably large sample size, the denominator of the upper bound in Theorem 4 can be approximated by $(KPW(\mu,\nu))^2$. Since optimizing a fraction has a pathological landscape, with gradient vanishing and explosion issues, we instead choose the hyper-parameters so that
$$\big((\sigma^*)^2, \rho^*\big) = \arg\min_{\sigma^2,\rho}\ \mathrm{MSE}(\mu,\nu) - \tfrac{1}{2}\big(KPW(\mu,\nu)\big)^2.$$
We use the bootstrap method to approximately evaluate this objective function in practice. In particular, we approximate $\mathrm{MSE}(\mu,\nu)$ with
$$\frac{1}{L}\sum_{\ell=1}^L\Big(KPW(\hat\mu_n^{(\ell)}, \hat\nu_m^{(\ell)}) - KPW(\hat\mu_n, \hat\nu_m)\Big)^2,$$
and $(KPW(\mu,\nu))^2$ with $(KPW(\hat\mu_n^{(0)}, \hat\nu_m^{(0)}))^2$, where for $\ell = 0, 1, \ldots, L$, $\hat\mu_n^{(\ell)}$ and $\hat\nu_m^{(\ell)}$ are empirical distributions re-sampled from $\hat\mu_n$ and $\hat\nu_m$, respectively. The optimization is performed using the simulated annealing algorithm from the MATLAB optimization toolbox with a fixed maximum number of iterations. The initial guess of $\rho$ is fixed, and that of $\sigma$ is based on the median heuristic outlined in (Section 8, Gretton et al. (2012a)). The advantage of this kernel design will be demonstrated in the following experiments. Throughout this section, we denote by $KPW$ the KPW distance with hyper-parameters selected by the median heuristic unless otherwise specified, and by $KPW^*$ the distance with optimized hyper-parameters.
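The bootstrap evaluation of the hyper-parameter objective can be organized as in the sketch below; `kpw` stands for any routine returning the empirical KPW distance (e.g., a wrapper around Algorithm 1), and the $\tfrac{1}{2}$ weight mirrors the reconstructed objective above, which should be treated as an assumption.

```python
import numpy as np

def bootstrap_objective(kpw, X, Y, L=20, seed=0):
    """Bootstrap estimate of MSE(mu, nu) - 0.5 * KPW(mu, nu)^2 for one
    hyper-parameter setting; `kpw(X, Y)` returns the empirical KPW distance."""
    rng = np.random.default_rng(seed)
    base = kpw(X, Y)                              # KPW(mu_hat_n, nu_hat_m)
    resampled = []
    for _ in range(L + 1):                        # l = 0, 1, ..., L
        Xb = X[rng.integers(0, len(X), len(X))]   # resample from mu_hat_n
        Yb = Y[rng.integers(0, len(Y), len(Y))]   # resample from nu_hat_m
        resampled.append(kpw(Xb, Yb))
    mse = np.mean([(r - base) ** 2 for r in resampled[1:]])   # bootstrap MSE estimate
    kpw_sq = resampled[0] ** 2                                # proxy for KPW(mu, nu)^2
    return mse - 0.5 * kpw_sq
```

The pair $(\sigma,\rho)$ minimizing this estimate can then be searched with a generic global optimizer such as simulated annealing, as done in the paper.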
We set $\mu = \nu = U([-1,1]^D)$ for several data dimensions $D$ and examine the convergence of $KPW(\hat\mu_n,\hat\nu_n)$ with respect to the sample size $n$. The dimension of the projected space is $d\in\{1,2,5\}$. In this experiment, we fix the hyper-parameters $\sigma$ and $\rho$ for $KPW$, and the hyper-parameters for $KPW^*$ are optimized for each pair of $(d, D)$ using 100 sample points from $\mu$ and $\nu$, respectively. Fig. 1 presents the values of $\log KPW$ and $\log KPW^*$ between the two empirical distributions $\hat\mu_n$ and $\hat\nu_n$ for $n\in\{5, 20, 50, 125, 250, 625\}$ over 10 independent trials. As we can see, the value of $KPW(\hat\mu_n,\hat\nu_n)$ decreases to zero quickly with respect to the sample size, which justifies the non-asymptotic uncertainty quantification results in Proposition 1. The design of the kernel further improves the convergence rate, especially for smaller values of $d$. Interestingly, the values of the empirical KPW distances for the smaller data dimensions are slightly smaller than those with $D = 30$. The intuition is that all information about the high-dimensional data points is encoded into the Gram matrix induced by the kernel functions, and the computation of the KPW distance does not depend on the data points directly but only on the Gram matrix.

Figure 3: Comparison of test power of different methods across different dimensions $D$. a) Power of tests under mean-shifted Gaussian distributions; b) power of tests under covariance-shifted Gaussian distributions with the dimension of intrinsic differences $I = 1$; c) power of tests under covariance-shifted Gaussian distributions with the dimension of intrinsic differences $I = 3$.

Our proposed KPW distance also provides a visual interpretation for identifying the differences between collected samples. Consider the HDGM dataset with $D = 2$, so that $\mu = \sum_{i=1}^2\frac{1}{2}N(m_{h_i}, I_D)$ and $\nu = \sum_{i=1}^2\frac{1}{2}N(m_{h_i}, \Sigma_i)$, where $m_{h_1}, m_{h_2}\in\mathbb{R}^2$ are fixed mean vectors, $\Sigma_1 = I_D$, and $\Sigma_2$ is a fixed non-identity covariance matrix.

Figure 2: a) Scatter plots for the 2-dimensional HDGM dataset with sample size $n = m = 400$. b) KDE plot for the empirical distributions with the projector trained by the linear projected Wasserstein distance. c) KDE plot for the empirical distributions with the projector trained by the KPW distance. The dashed line corresponds to the KPW distance with the standard kernel function (denoted $KPW$), and the solid line corresponds to the distance with the optimized kernel function (denoted $KPW^*$). The KPW distance with the optimized kernel function identifies the high-dimensional differences better, since $KPW^*$ is larger than $KPW$ on this dataset.

Fig. 2a) shows the scatter plots for this dataset with sample size $n = m = 400$. Fig. 2b) presents the kernel density estimate (KDE) plot in which the target distributions are projected into a one-dimensional linear subspace such that the Wasserstein distance between the projected distributions is maximized. Fig. 2c) demonstrates the KDE plot in which the projectors are trained by computing the KPW distance. In particular, the dashed line and the solid line correspond to $KPW$ and $KPW^*$, respectively. In contrast to the (linear) projected Wasserstein distance, the support of the projected distributions is bounded in the $\ell_2$ norm with radius $\sqrt B$ as long as Assumption 1 holds. Moreover, we can improve the performance for identifying high-dimensional differences by designing the kernel functions; in this scenario, the distance with the optimized kernel functions $KPW^*$ is larger than the distance with the standard kernel functions $KPW$.

Finally, we investigate the performance of the proposed two-sample tests. Benchmark algorithms are chosen to be Hotelling's $T^2$ test and the MMD test with standard Gaussian kernel functions. Since the acceptance region based on the concentration inequality gives conservative results in general, we first project observations into the $d$-dimensional subspace and then use MMD to identify whether the two sets of projected points belong to the same distribution or not. We reject the null hypothesis when $p < 0.05$.
The sample size was chosen as $n = m = 100$, and results were averaged over 100 independent trials. We follow a similar setup as in (Section VI.A, Xie & Xie (2020)) to examine the performance of different methods, where the datasets are specified as follows.

• $\mu, \nu$ are Gaussian distributions with different mean values, i.e., $\mu = N(0, I_D)$ and $\nu = N(\delta\mathbf{1}/\sqrt D, I_D)$ for a fixed shift $\delta > 0$. The division of the mean vector by $\sqrt D$ is to make the testing problem remain equivalently difficult to perform as $D$ grows.

• $\mu, \nu$ are covariance-shifted Gaussian distributions. In particular, $\mu = N(0, I_D)$ and $\nu = N(0, \Sigma)$, with $\Sigma$ a diagonal matrix such that $\Sigma_{i,i} = 0.25$ if $i\le I$ and $\Sigma_{i,i} = 1$ otherwise. Here $I\in\{1,3\}$ quantifies the dimension of the intrinsic differences between $\mu$ and $\nu$.

Fig. 3a)-Fig. 3c) demonstrate the power across different dimensions $D$ under the mean-shifted and covariance-shifted distributions. It can be seen that the power decreases with increasing dimension $D$. Benchmark algorithms perform well in low-dimensional settings, but their performance degrades quickly for high-dimensional problems. In contrast, our proposed tests outperform the benchmarks in high-dimensional settings. As long as a good low-dimensional projection mapping is found, the power of tests based on KPW tends to decrease more slowly compared with Hotelling's $T^2$ test and the MMD test. Interestingly, we can see that $d = 1$ is not the optimal choice to maximize the power, even when the dimension of intrinsic differences is $I = 1$. Therefore, it is of research interest to determine the optimal choice of $d$ to obtain more efficient tests, which we leave for future work. Moreover, tests based on $KPW^*$ usually outperform tests based on $KPW$.

We proposed the KPW distance for the task of two-sample testing, which operates by finding the nonlinear mapping in the data space that maximizes the distance between projected distributions. Practical algorithms together with uncertainty quantification of empirical estimates are discussed to help with this task. In future work we will develop sharper uncertainty quantification results on the empirical KPW distance.
References
Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Generalization andequilibrium in generative adversarial nets (gans), 2017.Baldassarre, L., Rosasco, L., Barla, A., and Verri, A. Vector field learning viaspectral filtering. In Balc´azar, J. L., Bonchi, F., Gionis, A., and Sebag, M.(eds.),
Machine Learning and Knowledge Discovery in Databases , pp. 56–71,Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.Bhuyan, M. H., Bhattacharyya, D. K., and Kalita, J. K. Network anomalydetection: methods, systems and tools.
Ieee communications surveys &tutorials , 16(1):303–336, 2013.Bi´nkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifyingmmd gans, 2018.Boumal, N., Absil, P.-A., and Cartis, C. Global rates of convergence for nonconvexoptimization on manifolds.
IMA Journal of Numerical Analysis , 39(1):1–33,Feb 2018.Cao, Y., Guigues, V., Juditsky, A., Nemirovski, A., and Xie, Y. Change detectionvia affine and quadratic detectors, 2018.Caponnetto, A., Micchelli, C. A., Pontil, M., and Ying, Y. Universal multi-taskkernels.
Journal of Machine Learning Research , 9(52):1615–1646, 2008.Casella, G. and Berger, R.
Statistical Inference . Duxbury Resource Center, June2001.Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A survey.
ACMComput. Surv. , 41(3), July 2009. ISSN 0360-0300.Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection for discretesequences: A survey.
IEEE transactions on knowledge and data engineering ,24(5):823–839, 2010.Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness offit.
Proceedings of Machine Learning Research , 48:2606–2615, 20–22 Jun 2016.Cover, T. M. and Thomas, J. A.
Elements of Information Theory (Wiley Seriesin Telecommunications and Signal Processing) . Wiley-Interscience, USA, 2006.Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transporta-tion distances, 2013. 16el Barrio, E., Cuesta-Albertos, J. A., Matr´an, C., and Rodriguez-Rodriguez,J. M. Tests of goodness of fit based on the l -wasserstein distance. Ann.Statist. , 27(4):1230–1239, 08 1999.Genevay, A., Chizat, L., Bach, F., Cuturi, M., and Peyr´e, G. Sample complexityof sinkhorn divergences, 2019.Ghosh, A. K. and Biswas, M. Distribution-free high-dimensional two-sampletests based on discriminating hyperplanes.
TEST , 25(3):525–547, 2016.Gin, E. and Nickl, R.
Mathematical Foundations of Infinite-Dimensional Statis-tical Models . Cambridge University Press, USA, 1st edition, 2015.Goldfeld, Z. and Greenewald, K. Gaussian-smooth optimal transport: Metricstructure and statistical efficiency, 2020.Goldfeld, Z., Greenewald, K., and Kato, K. Asymptotic guarantees for generativemodeling based on the smooth wasserstein distance, 2020a.Goldfeld, Z., Greenewald, K., Polyanskiy, Y., and Weed, J. Convergence ofsmoothed empirical measures with applications to entropy estimation, 2020b.Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. K. A fast, con-sistent kernel two-sample test. In Bengio, Y., Schuurmans, D., Lafferty, J. D.,Williams, C. K. I., and Culotta, A. (eds.),
Advances in Neural InformationProcessing Systems 22 , pp. 673–681. Curran Associates, Inc., 2009.Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch¨olkopf, B., and Smola, A. Akernel two-sample test.
J. Mach. Learn. Res. , 13:723–773, March 2012a.Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M.,Fukumizu, K., and Sriperumbudur, B. K. Optimal kernel choice for large-scaletwo-sample tests. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger,K. Q. (eds.),
Advances in Neural Information Processing Systems 25 , pp.1205–1213. Curran Associates, Inc., 2012b.Gy¨orfi, L. and Van Der Meulen, E. C.
A Consistent Goodness of Fit Test Basedon the Total Variation Distance , pp. 631–645. Springer Netherlands, Dordrecht,1991.Hotelling, H. The generalization of student’s ratio.
Ann. Math. Statist. , 2(3):360–378, 08 1931.Huang, M., Ma, S., and Lai, L. A riemannian block coordinate descent methodfor computing the projection robust wasserstein distance, 2021.Jolliffe, I.
Principal Component Analysis . Springer Verlag, 1986.Jr., F. J. M. The kolmogorov-smirnov test for goodness of fit.
Journal of theAmerican Statistical Association , 46(253):68–78, 1951.17in, T., Fan, C., Ho, N., Cuturi, M., and Jordan, M. I. Projection robustwasserstein distance and riemannian optimization, 2020a.Lin, T., Zheng, Z., Chen, E. Y., Cuturi, M., and Jordan, M. I. On projectionrobust optimal transport: Sample complexity and model misspecification,2020b.Liu, F., Xu, W., Lu, J., Zhang, G., Gretton, A., and Sutherland, D. J. Learningdeep kernels for non-parametric two-sample tests, 2020.Lloyd, J. R. and Ghahramani, Z. Statistical model criticism using kernel twosample tests. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., andGarnett, R. (eds.),
Advances in Neural Information Processing Systems 28 ,pp. 829–837. Curran Associates, Inc., 2015.Lopez-Paz, D. and Oquab, M. Revisiting classifier two-sample tests, 2018.Maurer, A. A vector-contraction inequality for rademacher complexities, 2016.McDiarmid, C.
On the method of bounded differences , pp. 148–188. LondonMathematical Society Lecture Note Series. Cambridge University Press, 1989.McLachlan, G. and Sons, J. W. .
Discriminant Analysis and Statistical PatternRecognition . Wiley Series in Probability and Statistics. Wiley, 1992.Mena, G. and Weed, J. Statistical bounds for entropic optimal transport: samplecomplexity and the central limit theorem, 2019.Micchelli, C. A. and Pontil, M. A. On learning vector-valued functions.
NeuralComput. , 17(1):177–204, January 2005.Minh, H. Q. and Sindhwani, V. Vector-valued manifold regularization. In
Pro-ceedings of the 28th International Conference on International Conference onMachine Learning , ICML’11, pp. 57–64, Madison, WI, USA, 2011. Omnipress.Mueller, J. and Jaakkola, T. Principal differences analysis: Interpretable charac-terization of differences between distributions, 2015.Pfanzagl, J. and Sheynin, O. Studies in the history of probability and statisticsXLIV A forerunner of the t-distribution.
Biometrika , 83(4):891–898, 12 1996.Pratt, J. W. and Gibbons, J. D.
Kolmogorov-Smirnov Two-Sample Tests , pp.318–344. Springer New York, New York, NY, 1981.Ramdas, A., Garcia, N., and Cuturi, M. On wasserstein two sample testing andrelated families of nonparametric tests, 2015.Reddi, S. J., Ramdas, A., PAczos, B., Singh, A., and Wasserman, L. On thedecreasing power of kernel and distance based nonparametric hypothesis testsin high dimensions, 2014. 18chober, P. and Vetter, T. Two-sample unpaired t tests in medical research.
Anesthesia and analgesia , 129:911, 10 2019.Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola,A., and Gretton, A. Generative models and model criticism via optimizedmaximum mean discrepancy, 2021.Wang, J., Gao, R., and Xie, Y. Two-sample test using projected wassersteindistance: Breaking the curse of dimensionality, 2020.Wei, S., Lee, C., Wichers, L., Li, G., and Marron, J. S. Direction-projection-permutation for high dimensional hypothesis tests, 2013.Xie, L. and Xie, Y. Sequential change detection by optimal weighted (cid:96) divergence,2020.Xie, Y. and Siegmund, D. Sequential multi-sensor change-point detection. TheAnnals of Statistics , 41(2):670–692, Apr 2013.Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. On the discrimination-generalization tradeoff in gans, 2018.19
Preliminary Technical Results
Theorem 5 (Pinsker's Inequality (Cover & Thomas, 2006)). Consider two discrete probability distributions $p = \{p_i\}_{i=1}^n$ and $q = \{q_i\}_{i=1}^n$; then we have
$$\sum_{i=1}^n p_i\log\frac{p_i}{q_i} \ge \frac{1}{2}\|p-q\|_1^2.$$

Proposition 2 (Lipschitz Properties of the Retraction Operator (Boumal et al., 2018)). There exist constants $L_1, L_2$ such that the following inequalities hold:
$$\|\mathrm{Retr}_U(\zeta) - U\| \le L_1\|\zeta\|, \qquad \|\mathrm{Retr}_U(\zeta) - (U+\zeta)\| \le L_2\|\zeta\|^2.$$

Definition 6 (Rademacher Complexity). Given the function space $\mathcal{F}$ and the distribution $\mu$, define the Rademacher complexity as
$$\mathcal{R}_n(\mathcal{F},\mu) := \mathbb{E}_{x_i\sim\mu,\,\sigma_i}\Bigg[\sup_{f\in\mathcal{F}}\bigg|\frac{1}{n}\sum_{i=1}^n\sigma_if(x_i)\bigg|\Bigg],$$
where the $\sigma_i$'s are i.i.d. and take values in $\{-1,1\}$ with equal probability.

Lemma 1 (Contraction for Vector-valued Functions (Maurer, 2016)). Let $\mathcal{H}$ be a collection of $\mathbb{R}^k$-valued functions on $\mathcal{X}\subseteq\mathbb{R}^d$. Let $\ell:\mathbb{R}^k\to\mathbb{R}$ be an $L_\ell$-Lipschitz continuous function. Denote $\ell\circ\mathcal{H}\triangleq\{z\mapsto\ell(h(z)): h\in\mathcal{H}\}$. Then we have
$$\mathcal{R}_n(\ell\circ\mathcal{H},\mu) \le \frac{\sqrt2\,L_\ell}{n}\,\mathbb{E}_{x_i\sim\mu,\,\sigma}\sup_{h\in\mathcal{H}}\sum_{i=1}^n\sum_{j=1}^k\sigma_{i,j}h_j(x_i).$$

Definition 7 (Integral Probability Metric). Given two distributions $\mu$ and $\nu$ together with the function space $\mathcal{F}$, define the integral probability metric as
$$\mathrm{IPM}_{\mathcal{F}}(\mu,\nu) = \sup_{f\in\mathcal{F}}\left(\int f(x)\,\mathrm{d}\mu(x) - \int f(y)\,\mathrm{d}\nu(y)\right).$$

Lemma 2 (Generalization Bound on IPM (Zhang et al., 2018)). Consider the integral probability metric $\mathrm{IPM}_{\mathcal{F}}$ with the discriminator function space $\mathcal{F}$; then
$$\mathbb{E}[\mathrm{IPM}_{\mathcal{F}}(\mu,\hat\mu_n)] \le 2\mathcal{R}_n(\mathcal{F},\mu),$$
where $\hat\mu_n$ is the empirical distribution constructed from $n$ i.i.d. samples generated from $\mu$.

Theorem 6 (McDiarmid's Inequality (McDiarmid, 1989)). Let $X_1,\ldots,X_n$ be independent random variables, where $X_i$ has support $\mathcal{X}_i$. Let $f:\mathcal{X}_1\times\mathcal{X}_2\times\cdots\times\mathcal{X}_n\to\mathbb{R}$ be any function with the $(c_1,\ldots,c_n)$ bounded difference property, i.e., for $i\in\{1,\ldots,n\}$ and for any $(x_1,\ldots,x_n), (x_1',\ldots,x_n')$ that differ only in the $i$-th coordinate, we have $|f(x_1,\ldots,x_n) - f(x_1',\ldots,x_n')|\le c_i$. Then for any $t>0$, we have
$$\Pr\Big\{\big|f(X_1,\ldots,X_n) - \mathbb{E}[f(X_1,\ldots,X_n)]\big| \ge t\Big\} \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$$

Lemma 3 (Equivalent Definition for Sub-Gaussian Variables (Lemma 2.3.2 in Gin & Nickl (2015))). Assume that $\mathbb{E}[\zeta] = 0$ and
$$\mathbb{P}\{|\zeta|\ge t\} \le C\exp\left(-\frac{t^2}{\sigma^2}\right), \quad t>0,$$
for some $C\ge 1$ and $\sigma>0$. Then the random variable $\zeta$ is sub-Gaussian with constant $\tilde\sigma^2 = 12(2C+1)\sigma^2$.
Proof of Theorem 1.
Assume that ˆ f is an optimal solution to the problem (2).Let S be the subspace S = n (cid:88) i =1 m (cid:88) j =1 ( K x i − K y j ) a i,j : a i,j ∈ R d . Then by the projection theorem, we have ˆ f = ˆ f S + ˆ f S ⊥ and (cid:107) ˆ f (cid:107) H = (cid:107) ˆ f S (cid:107) H + (cid:107) ˆ f S ⊥ (cid:107) H . It remains to show that ˆ f S shares the same objective value with ˆ f . Forfixed i, j , we have that (cid:107) ˆ f ( x i ) − ˆ f ( y j ) (cid:107) = max a i,j : (cid:107) a i,j (cid:107) ≤ (cid:104) ˆ f ( x i ) − ˆ f ( y j ) , a i,j (cid:105) = max a i,j : (cid:107) a i,j (cid:107) ≤ (cid:104) ˆ f ( x i ) , a i,j (cid:105) − (cid:104) ˆ f ( y j ) , a i,j (cid:105) = max a i,j : (cid:107) a i,j (cid:107) ≤ (cid:104) ˆ f , K x i a i,j (cid:105) − (cid:104) ˆ f , K y j a i,j (cid:105) = max a i,j : (cid:107) a i,j (cid:107) ≤ (cid:104) ˆ f , ( K x i − K y j ) a i,j (cid:105) = max a i,j : (cid:107) a i,j (cid:107) ≤ (cid:104) ˆ f S , ( K x i − K y j ) a i,j (cid:105) = (cid:107) ˆ f S ( x i ) − ˆ f S ( y j ) (cid:107) . Therefore, there always exists an optimal solution that lies in the subspace S , which means that there exists an optimal solution to (2) that admits thefollowing expression: ˆ f = n (cid:88) i =1 m (cid:88) j =1 ( K x i − K y j ) a i,j . Define a x,i = (cid:80) mj =1 a i,j and a y,j = (cid:80) ni =1 a i,j completes the proof.21 emma 4 (Properties of the gradient with respect to s ) . For fixed ( i, j ) , denoteby ∇ s c i,j [ s ] the gradient of c i,j = (cid:107) A i,j ω (cid:107) ,κ = (cid:107) A i,j U s (cid:107) ,κ with respect to thevariable s . We have that (cid:107)∇ s c i,j [ s ] (cid:107) ≤ κ (cid:107) AU (cid:107) ∞ , ∀ s ∈ S d ( n + m ) − , (7a) (cid:107)∇ s c i,j [ s ] − ∇ s c i,j [ s ] (cid:107) ≤ (cid:107) AU (cid:107) ∞ κ (cid:107) s − s (cid:107) , ∀ s , s ∈ S d ( n + m ) − , (7b) where (cid:107) AU (cid:107) ∞ = max i,j (cid:107) A i,j U (cid:107) .Proof of Lemma 4. The gradient of c i,j equals ∇ s c i,j [ s ] = U T A T i,j A i,j U s (cid:107) A i,j U s (cid:107) ,κ . Therefore, (cid:107)∇ s c i,j [ s ] (cid:107) ≤ (cid:107) U T A T i,j A i,j U s (cid:107) κ ≤ κ (cid:107) U T A T i,j A i,j U (cid:107) ≤ κ (cid:107) AU (cid:107) ∞ , where the first inequality is because (cid:107) A i,j U s (cid:107) ,κ ≥ κ , the second inequalityis because (cid:107) Ax (cid:107) le (cid:107) A (cid:107) for any (cid:107) x (cid:107) ≤
1, and the last inequality is because (cid:107) A (cid:107) = (cid:107) A T (cid:107) for any matrix A . It suffices to upper bound the spectral radius ofthe Hessian matrix of c i,j with respect to s when showing the Lipschitz propertyof ∇ s c i,j [ s ]. In particular, ∇ s c i,j [ s ] = (cid:107) A i,j U s (cid:107) ,κ U T A T i,j A i,j U − ( U T A T i,j A i,j U s )( U T A T i,j A i,j U s ) T (cid:107) A i,j U s (cid:107) ,κ (cid:22) κ (cid:107) AU (cid:107) ∞ · I where I stands for the identity matrix of size d ( n + m ). Lemma 5 (Lipschitzness of ∇ s F ( u, v, s )) . Let { u t , v t , s t } t be the sequence gen-erated from Algorithm 1. The following inequality holds for any s ∈ S d ( n + m ) − and λ ∈ [0 , : (cid:107)∇ s F ( u t +1 , v t +1 , λs + (1 − λ ) s t ) − ∇ s F ( u t +1 , v t +1 , s t ) (cid:107) ≤ (cid:37)λ (cid:107) s t − s (cid:107) , where (cid:37) = (cid:107) AU (cid:107) ∞ κη + mn (cid:107) AU (cid:107) ∞ κ η .Proof of Lemma 5. An intermediate result is that (cid:88) i π i,j ( u t +1 , v t +1 , s t ) = (cid:88) i exp (cid:18) − η c i,j [ s t ] + u t +1 i (cid:19) exp (cid:0) v t +1 j (cid:1) = (cid:88) i exp (cid:18) − η c i,j [ s t ] + u t +1 i (cid:19) exp (cid:0) v tj (cid:1) /m (cid:80) i π i,j ( u t +1 , v t , s t )= 1 m (cid:80) i π i,j ( u t +1 , v t , s t ) (cid:80) i π i,j ( u t +1 , v t , s t ) = 1 /m. (cid:80) i,j π i,j ( u t +1 , v t , s t ) = 1. For fixed s t , define s λ = λs + (1 − λ ) s t . Then we have that (cid:107)∇ s F ( u t +1 , v t +1 , λs + (1 − λ ) s t ) − ∇ s F ( u t +1 , v t +1 , s t ) (cid:107) = 1 η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i,j π i,j ( u t +1 , v t +1 , s λ ) · ∇ s c i,j [ s λ ] − (cid:88) i,j π i,j ( u t +1 , v t +1 , s t ) · ∇ s c i,j [ s t ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i,j π i,j ( u t +1 , v t +1 , s λ ) · (cid:18) ∇ s c i,j [ s λ ] − ∇ s c i,j [ s t ] (cid:19)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + 1 η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i,j (cid:18) π i,j ( u t +1 , v t +1 , s λ ) − π i,j ( u t +1 , v t +1 , s t ) (cid:19) · ∇ s c i,j [ s t ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:107) AU (cid:107) ∞ κη (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) i,j π i,j ( u t +1 , v t +1 , s λ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) s λ − s t (cid:107) + (cid:107) AU (cid:107) ∞ κη (cid:88) i,j (cid:12)(cid:12) π i,j ( u t +1 , v t +1 , s λ ) − π i,j ( u t +1 , v t +1 , s t ) (cid:12)(cid:12) = (cid:107) AU (cid:107) ∞ κη (cid:107) s λ − s t (cid:107) + (cid:107) AU (cid:107) ∞ κη (cid:88) i,j (cid:12)(cid:12) π i,j ( u t +1 , v t +1 , s λ ) − π i,j ( u t +1 , v t +1 , s t ) (cid:12)(cid:12) ≤ (cid:107) AU (cid:107) ∞ κη (cid:107) s λ − s t (cid:107) + mn (cid:107) AU (cid:107) ∞ κ η (cid:107) s λ − s t (cid:107) . where the second inequality is by applying (7a) and (7b), and the last inequalityis by evaluating the gradient of π i,j with respect to s and then applying (7a).Then applying the condition that (cid:107) s λ − s t (cid:107) = λ (cid:107) s − s t (cid:107) completes the proof. Lemma 6 (Boundedness of ∇ s F ( u, v, s )) . Let { u t , v t , s t } t be the sequence gen-erated from Algorithm 1. Define ζ t +1 = ∇ s F ( u t +1 , v t +1 , s t ) , then (cid:107) ζ t +1 (cid:107) ≤ (cid:107) AU (cid:107) ∞ κη . Proof of Lemma 6.
By the matrix calculus knowledge, (cid:107) ζ t +1 (cid:107) = 1 η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i,j π i,j ( u t +1 , v t +1 , s t ) · ∇ s c i,j [ s t ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:107) AU (cid:107) ∞ κη (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) i,j π i,j ( u t +1 , v t +1 , s t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:107) AU (cid:107) ∞ κη . where the inequality is by applying (7a), and the last equality is because that (cid:80) i,j π i,j ( u t +1 , v t , s t ) = 1. Lemma 7 (Decrease of F in s ) . Let { u t , v t , s t } t be the sequence generated fromAlgorithm 1. The following inequality holds for any k ≥ : F ( u t +1 , v t +1 , s t +1 ) − F ( u t +1 , v t +1 , s t ) ≤ − (cid:107) AU (cid:107) ∞ L / ( κη ) + 2 (cid:37)L (cid:107) ξ t +1 (cid:107) . roof of Lemma 7. Note that (cid:12)(cid:12) F ( u t +1 , v t +1 , s t +1 ) − F ( u t +1 , v t +1 , s t ) − (cid:104)∇ t F ( u t +1 , v t +1 , s t ) , s t +1 − s t (cid:105) (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) (cid:104)∇ s F ( u t +1 , v t +1 , λs t +1 + (1 − λ ) s t ) − ∇ s F ( u t +1 , v t +1 , s t ) , s t +1 − s t (cid:105) d λ (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:90) (cid:107)∇ s F ( u t +1 , v t +1 , λs t +1 + (1 − λ ) s t ) − ∇ s F ( u t +1 , v t +1 , s t ) (cid:107)(cid:107) s t +1 − s t (cid:107) d λ ≤ (cid:90) (cid:37)λ (cid:107) s t +1 − s t (cid:107) d λ = (cid:37) (cid:107) s t +1 − s t (cid:107) = (cid:37) (cid:13)(cid:13) Retr s t (cid:0) − τ ξ t +1 (cid:1) − s t (cid:13)(cid:13) ≤ (cid:37)τ L (cid:107) ξ t +1 (cid:107) . where the second inequality is by applying Lemma 5, and the last inequality isby applying Proposition 2. Moreover, we have that (cid:104)∇ t F ( u t +1 , v t +1 , s t ) , s t +1 − s t (cid:105) = (cid:104)∇ t F ( u t +1 , v t +1 , s t ) , − τ ξ t +1 (cid:105) + (cid:104)∇ t F ( u t +1 , v t +1 , s t ) , Retr s t (cid:0) − τ ξ t +1 (cid:1) − ( s t − τ ξ t +1 ) (cid:105)≤ − τ (cid:107) ξ t +1 (cid:107) + (cid:107)∇ s F ( u t +1 , v t +1 , s t ) (cid:107) (cid:107) Retr s t (cid:0) − τ ξ t +1 (cid:1) − ( s t − τ ξ t +1 ) (cid:107)≤ − τ (cid:107) ξ t +1 (cid:107) + (cid:107) ζ t +1 (cid:107) · L τ (cid:107) ξ t +1 (cid:107) ≤ − τ (cid:107) ξ t +1 (cid:107) + (cid:107) AU (cid:107) ∞ L τ κη (cid:107) ξ t +1 (cid:107) . Combining those inequalities above implies that F ( u k +1 , v k +1 , t k +1 ) − F ( u k +1 , v k +1 , t k ) ≤ − τ (cid:18) − (cid:20) (cid:107) AU (cid:107) ∞ L κη + (cid:37) L (cid:21) τ (cid:19) (cid:107) ξ t +1 (cid:107) . Taking τ = (cid:107) AU (cid:107) ∞ L / ( κη )+ (cid:37)L gives the desired result. Lemma 8 (Decrease of F in v ) . Let { u t , v t , s t } t be the sequence generated fromAlgorithm 1. The following inequality holds for any k ≥ : F ( u t +1 , v t +1 , s t ) − F ( u t +1 , v t , s t ) ≤ − (cid:107) /m − π ( u t +1 , v t , s t ) T (cid:107) . where (cid:107) /m − π ( u t +1 , v t , s t ) (cid:107) = (cid:88) j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m − (cid:88) i π i,j ( u t +1 , v t , s t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . roof of Lemma 8. 
According to the expression of F , we have that F ( u t +1 , v t +1 , s t ) − F ( u t +1 , v t , s t )= (cid:88) i,j π i,j ( u t +1 , v t +1 , s t ) − (cid:88) i,j π i,j ( u t +1 , v t , s t ) + 1 m m (cid:88) j =1 ( v tj − v t +1 j )= 1 m m (cid:88) j =1 ( v tj − v t +1 j ) = − m m (cid:88) j =1 log 1 /m (cid:80) i π i,j ( u t +1 , v t , s t ) , where the second equality is because that (cid:88) i π i,j ( u t +1 , v t +1 , s t ) = 1 m , (cid:88) j π i,j ( u t +1 , v t , s t ) = 1 n . Therefore, applying the Pinsker’s inequality in Theorem 5 implies that F ( u t +1 , v t +1 , s t ) − F ( u t +1 , v t , s t ) ≤ − (cid:88) j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m − (cid:88) i π i,j ( u t +1 , v t , s t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Lemma 9 (Decrease of F in u ) . Let { u t , v t , s t } t be the sequence generated fromAlgorithm 1. The following inequality holds for any t ≥ : F ( u t +1 , v t , s t ) − F ( u t , v t , s t ) ≤ − (cid:107) /n − π ( u t , v t , s t )1 (cid:107) . where (cid:107) /n − π ( u t , v t , s t )1 (cid:107) = (cid:88) i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n − (cid:88) j π i,j ( u k , v k , t k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Proof of Lemma 9.
For fixed i ∈ [ n ], define h i = (cid:88) j π i,j ( u t +1 , v t , s t ) − (cid:88) j π i,j ( u t , v t , s t ) − n log 1 /n (cid:80) j π i,j ( u t , v t , s t )According to the expression of F , F ( u t +1 , v t , s t ) − F ( u t , v t , s t ) = (cid:88) i h i , h i , i ∈ [ n ]. By substituting theexpression of u t +1 into h i , we have that h i = (cid:88) j π i,j ( u t , v t , s t ) (cid:34) /n (cid:80) j π i,j ( u t , v t , s t ) − (cid:35) − n log 1 /n (cid:80) j π i,j ( u t , v t , s t )= 1 n − (cid:0) π ( u t , v t , s t )1 (cid:1) i − n log 1 /n (cid:0) π ( u t , v t , s t )1 (cid:1) i Define the function (cid:96) ( x ) = 1 n − x − n log 1 /nx + ( x − /n ) . We can see that this function attains its maximum at x = 1 /n , with (cid:96) (1 /n ) = 0.It follows that h i ≤ − (cid:18)(cid:0) π ( u t , v t , s t )1 (cid:1) i − n (cid:19) . The proof is completed.It is reasonable to assume that the sequence { u t , v t , s t } t is bounded sincethe following lemma shows that the objective value F ( u t , v t , s t ) is monotonedecreasing and lower bounded. Hence, we assume that (cid:107) u t (cid:107) ∞ ≤ U, (cid:107) v t (cid:107) ∞ ≤ V, ∀ t. Lemma 10.
Lemma 10. Let $\{u^t,v^t,s^t\}_t$ be the sequence generated by Algorithm 1, and suppose the algorithm is terminated at iteration $t = T$, when the following conditions hold:
\[
\|\xi^{t+1}\| \le \frac{\epsilon_1}{\eta},\qquad
\big\|\mathbf{1}/n - \pi(u^t,v^t,s^t)\mathbf{1}\big\|_1 \le \frac{\epsilon_2}{4U},\qquad
\big\|\mathbf{1}/m - \pi(u^{t+1},v^t,s^t)^{\top}\mathbf{1}\big\|_1 \le \frac{\epsilon_2}{4V}.
\]
Then $(u^T,v^T,s^T)$ is an $(\epsilon_1,\epsilon_2)$-stationary point of (5).

Proof of Lemma 10. The condition $\|\xi^{T+1}\| \le \epsilon_1/\eta$ directly implies that
\[
\|\mathrm{Grad}_s F(u^T,v^T,s^T)\| \le \frac{\epsilon_1}{\eta}.
\]
Suppose that $\pi(u^T,v^T,s^T)\mathbf{1} = r$ and $\pi(u^T,v^T,s^T)^{\top}\mathbf{1} = c$, where $\|\mathbf{1}/n - r\|_1 \le \epsilon_2/(4U)$ and $\|\mathbf{1}/m - c\|_1 \le \epsilon_2/(4V)$. Then $(u^T,v^T)$ is the optimal solution to the optimization problem
\[
\min_{u\in\mathbb{R}^n,\,v\in\mathbb{R}^m}\ F'(u,v,s^T),\qquad\text{where}\quad
\pi_{i,j}(u,v,s) = \exp\big(-\eta c_{i,j} + u_i + v_j\big),\quad
F'(u,v,s) = \sum_{i,j}\pi_{i,j}(u,v,s) - \sum_{i=1}^n r_iu_i - \sum_{j=1}^m c_jv_j,
\]
since $F'$ is convex and $(u^T,v^T)$ satisfies its first-order optimality conditions. Denote by $(u^*,v^*)$ the minimizer of the problem $\min_{u,v}F(u,v,s^T)$. It follows that
\begin{align*}
&F(u^T,v^T,s^T) - \min_{u,v}F(u,v,s^T) \\
&= \big[F(u^T,v^T,s^T) - F'(u^T,v^T,s^T)\big] + \big[F'(u^T,v^T,s^T) - \min_{u,v}F(u,v,s^T)\big] \\
&\le \Big|\sum_{i=1}^n\big(r_i - 1/n\big)u^T_i\Big| + \Big|\sum_{j=1}^m\big(c_j - 1/m\big)v^T_j\Big| + \big[F'(u^*,v^*,s^T) - F(u^*,v^*,s^T)\big] \\
&\le \Big|\sum_{i=1}^n\big(r_i - 1/n\big)u^T_i\Big| + \Big|\sum_{j=1}^m\big(c_j - 1/m\big)v^T_j\Big|
+ \Big|\sum_{i=1}^n\big(r_i - 1/n\big)u^*_i\Big| + \Big|\sum_{j=1}^m\big(c_j - 1/m\big)v^*_j\Big| \\
&\le 2\big\|\mathbf{1}/n - \pi(u^T,v^T,s^T)\mathbf{1}\big\|_1\,U + 2\big\|\mathbf{1}/m - \pi(u^T,v^T,s^T)^{\top}\mathbf{1}\big\|_1\,V \le \epsilon_2.
\end{align*}
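In implementation terms, Lemma 10 says that three cheaply computable quantities certify approximate stationarity. A minimal sketch of the corresponding stopping rule (the threshold forms follow the statement above; the argument names are illustrative):

\begin{verbatim}
import numpy as np

def should_stop(xi, pi, n, m, eps1, eps2, eta, U, V):
    """Termination test of Lemma 10 for the current iterate.

    xi : Riemannian gradient at the current s
    pi : current coupling matrix of shape (n, m)
    """
    grad_ok = np.linalg.norm(xi) <= eps1 / eta
    row_ok = np.abs(1.0 / n - pi.sum(axis=1)).sum() <= eps2 / (4.0 * U)
    col_ok = np.abs(1.0 / m - pi.sum(axis=0)).sum() <= eps2 / (4.0 * V)
    return grad_ok and row_ok and col_ok
\end{verbatim}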
Lemma 11 (Lower boundedness of $F$). Denote by $(u^*,v^*,s^*)$ the global optimum of (5). Then we have that $F(u^*,v^*,s^*) \ge -\eta\|AU\|_\infty$.

Proof of Lemma 11. It is easy to show that
\[
\sum_{i,j}\pi_{i,j}(u^*,v^*,s^*) = 1.
\]
Moreover, for any $(i,j)$, we have that $c_{i,j} \le \|AU\|_\infty$. It follows that
\[
\exp\big(-\eta\|AU\|_\infty + u^*_i + v^*_j\big) \le \pi_{i,j}(u^*,v^*,s^*) \le 1,
\]
and therefore $u^*_i + v^*_j \le \eta\|AU\|_\infty$ for any $(i,j)$. Hence we conclude that
\[
\sum_{i,j}\pi_{i,j}(u^*,v^*,s^*) - \frac{1}{n}\sum_{i=1}^n u^*_i - \frac{1}{m}\sum_{j=1}^m v^*_j \ge -\eta\|AU\|_\infty.
\]
In the following, we give a re-statement of Theorem 2 together with its formal proof.
Theorem (Re-statement of Theorem 2). Choose the parameters
\[
\tau = \frac{1}{2\|AU\|_\infty L_2/(\kappa\eta) + \varrho L_1^2},\qquad
\eta = \epsilon_2,\qquad
\varrho = \frac{\|AU\|_\infty}{\kappa\eta} + \frac{mn\|AU\|_\infty^2}{\kappa^2\eta^2},
\]
and terminate Algorithm 1 when
\[
\|\xi^{t+1}\| \le \frac{\epsilon_1}{\eta},\qquad
\big\|\mathbf{1}/n - \pi(u^t,v^t,s^t)\mathbf{1}\big\|_1 \le \frac{\epsilon_2}{4U},\qquad
\big\|\mathbf{1}/m - \pi(u^{t+1},v^t,s^t)^{\top}\mathbf{1}\big\|_1 \le \frac{\epsilon_2}{4V}.
\]
We say that $(\hat u,\hat v,\hat s)$ is an $(\epsilon_1,\epsilon_2)$-stationary point of (5) if
\[
\|\mathrm{Grad}_s F(\hat u,\hat v,\hat s)\| \le \frac{\epsilon_1}{\eta},\qquad
F(\hat u,\hat v,\hat s) - \min_{u,v}F(u,v,\hat s) \le \epsilon_2,
\]
where $\mathrm{Grad}_s F(u,v,s)$ denotes the partial gradient of $F$ with respect to the variable $s$ on the sphere $\mathcal S^{d(n+m)-1}$. (Since the gradient $\zeta^{t+1}$ can be expressed as a vector multiplied by $1/\eta$, we impose the rescaled condition $\eta\,\|\mathrm{Grad}_s F(\hat u,\hat v,\hat s)\| \le \epsilon_1$.) Then Algorithm 1 returns an $(\epsilon_1,\epsilon_2)$-stationary point within
\[
T = O\!\left(\kappa\cdot\left[\frac{1}{\epsilon_1^{2}} + \frac{1}{\epsilon_1^{2}\epsilon_2}\right]\right)
\]
iterations.

Proof of Theorem 2.
Combining Lemma 7, Lemma 8, and Lemma 9 yields the following one-iteration descent bound:
\begin{align*}
&F(u^{t+1},v^{t+1},s^{t+1}) - F(u^t,v^t,s^t) \\
&\le -\left(\big\|\mathbf{1}/n - \pi(u^t,v^t,s^t)\mathbf{1}\big\|_2^2
+ \frac{1}{2}\big\|\mathbf{1}/m - \pi(u^{t+1},v^t,s^t)^{\top}\mathbf{1}\big\|_1^2
+ \frac{\|\xi^{t+1}\|^2}{4\|AU\|_\infty L_2/(\kappa\eta) + 2\varrho L_1^2}\right) \\
&= -\left(\big\|\mathbf{1}/n - \pi(u^t,v^t,s^t)\mathbf{1}\big\|_2^2
+ \frac{1}{2}\big\|\mathbf{1}/m - \pi(u^{t+1},v^t,s^t)^{\top}\mathbf{1}\big\|_1^2
+ \frac{\kappa^2\eta^2\|\xi^{t+1}\|^2}{D}\right),
\end{align*}
where the equality follows by substituting the choice of $\varrho$ and we abbreviate $D := 2\kappa\eta\|AU\|_\infty(L_1^2+2L_2) + 2mn\|AU\|_\infty^2L_1^2$. Summing over $t = 0,\dots,T-1$ gives
\begin{align*}
F(u^T,v^T,s^T) - F(u^0,v^0,s^0)
&\le -\sum_{t=0}^{T-1}\left(\big\|\mathbf{1}/n - \pi(u^t,v^t,s^t)\mathbf{1}\big\|_2^2
+ \frac{1}{2}\big\|\mathbf{1}/m - \pi(u^{t+1},v^t,s^t)^{\top}\mathbf{1}\big\|_1^2
+ \frac{\kappa^2\eta^2}{D}\|\xi^{t+1}\|^2\right) \\
&\le -\min\Big\{\frac{1}{2},\,\frac{\kappa^2}{D}\Big\}
\sum_{t=0}^{T-1}\left(\big\|\mathbf{1}/n - \pi(u^t,v^t,s^t)\mathbf{1}\big\|_2^2
+ \big\|\mathbf{1}/m - \pi(u^{t+1},v^t,s^t)^{\top}\mathbf{1}\big\|_1^2
+ \eta^2\|\xi^{t+1}\|^2\right) \\
&\le -T\cdot\min\Big\{\frac{1}{2},\,\frac{\kappa^2}{D}\Big\}
\cdot\min\Big\{\epsilon_1^2,\,\frac{\epsilon_2^2}{16U^2},\,\frac{\epsilon_2^2}{16V^2}\Big\},
\end{align*}
where the last inequality holds because, before termination, at least one of the three stopping conditions is violated at every iteration. Rearranging and applying Lemma 11,
\begin{align*}
T &\le \big[F(u^0,v^0,s^0) - F(u^T,v^T,s^T)\big]
\cdot\max\Big\{2,\,\frac{D}{\kappa^2}\Big\}
\cdot\max\Big\{\frac{1}{\epsilon_1^2},\,\frac{16U^2}{\epsilon_2^2},\,\frac{16V^2}{\epsilon_2^2}\Big\} \\
&\le \big(F(u^0,v^0,s^0) + \eta\|AU\|_\infty\big)
\cdot\max\Big\{2,\,\frac{D}{\kappa^2}\Big\}
\cdot\max\Big\{\frac{1}{\epsilon_1^2},\,\frac{16U^2}{\epsilon_2^2},\,\frac{16V^2}{\epsilon_2^2}\Big\}
= O\!\left(\kappa\cdot\left[\frac{1}{\epsilon_1^{2}} + \frac{1}{\epsilon_1^{2}\epsilon_2}\right]\right).
\end{align*}
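For completeness, here is a tiny helper that evaluates the step size and smoothness parameter prescribed by the re-stated theorem; the formulas follow the reconstruction above, with $L_1, L_2$ the retraction constants of Proposition 2, so the exact constants should be treated as indicative rather than definitive.

\begin{verbatim}
def theorem2_parameters(AU_inf, kappa, eta, m, n, L1, L2):
    """Step size tau and smoothness parameter rho used in the re-stated Theorem 2."""
    rho = AU_inf / (kappa * eta) + m * n * AU_inf**2 / (kappa**2 * eta**2)
    tau = 1.0 / (2.0 * AU_inf * L2 / (kappa * eta) + rho * L1**2)
    return tau, rho
\end{verbatim}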
C  Technical Proofs in Section 5

We first present several technical lemmas before giving the proof of Theorem 1.
Lemma 12.
Based on Assumption 1, for $f\in\{f\in\mathcal H : \|f\|_{\mathcal H}\le 1\}$, we have
\[
\|f(x)\| \le \sqrt{B},\qquad \forall x\in\mathbb R^D.
\]

Proof of Lemma 12.
For fixed $x\in\mathcal X$, the norm of $f(x)$ can be upper bounded as follows:
\[
\|f(x)\|^2 = \langle f(x), f(x)\rangle = \langle f, K_xf(x)\rangle_{\mathcal H}
\le \|f\|_{\mathcal H}\,\|K_xf(x)\|_{\mathcal H} \le \|K_xf(x)\|_{\mathcal H}.
\]
In particular,
\[
\|K_xf(x)\|_{\mathcal H}^2 = \langle K_xf(x), K_xf(x)\rangle_{\mathcal H}
= \big\langle \big(K_xf(x)\big)(x),\, f(x)\big\rangle
= \langle K(x,x)f(x),\, f(x)\rangle
= f(x)^{\top}K(x,x)f(x) \le B\,\|f(x)\|^2.
\]
Combining the two relations above implies the desired result.
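As a quick numerical sanity check of Lemma 12, the sketch below uses a separable matrix-valued kernel $K(x,x')=k(x,x')I_d$ with a Gaussian scalar kernel (so $\|K(x,x)\|\le B$ with $B=1$), builds a kernel expansion normalized to $\|f\|_{\mathcal H}\le 1$, and verifies $\|f(x)\|\le\sqrt B$ at random inputs. The kernel choice and construction are illustrative only, not the paper's experimental setup.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
D, d, N, B = 3, 2, 8, 1.0          # input dim, output dim, expansion size, k(x,x)=B

def k(x, y, bw=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * bw ** 2))

Z = rng.standard_normal((N, D))     # expansion points z_1, ..., z_N
A = rng.standard_normal((N, d))     # coefficients a_1, ..., a_N in R^d
G = np.array([[k(zi, zj) for zj in Z] for zi in Z])   # scalar Gram matrix
# ||f||_H^2 = sum_{i,j} k(z_i, z_j) <a_i, a_j>; rescale so that ||f||_H = 1.
A /= np.sqrt(np.einsum('ij,ik,jk->', G, A, A))

def f(x):
    """f(x) = sum_i K(x, z_i) a_i = sum_i k(x, z_i) a_i for a separable kernel."""
    return np.array([k(x, zi) for zi in Z]) @ A

vals = [np.linalg.norm(f(rng.standard_normal(D))) for _ in range(200)]
print(max(vals) <= np.sqrt(B) + 1e-9)   # True: ||f(x)|| <= sqrt(B)
\end{verbatim}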
Lemma 13.
Let $\ell:\mathbb R^d\to\mathbb R$ be a $1$-Lipschitz continuous function. Define the function space
\[
\mathcal F_\ell = \big\{\mathbb R^D\ni z\mapsto \ell(f(z)):\ f\in\mathcal H,\ \|f\|_{\mathcal H}\le 1\big\}.
\]
Then we have that
\[
\mathcal R_n(\mathcal F_\ell,\mu) \le \frac{\sqrt{2dB}}{\sqrt n}.
\]

Proof of Lemma 13. The Rademacher complexity can be upper bounded by applying the contraction inequality in Lemma 1:
\begin{align*}
\mathcal R_n(\mathcal F_\ell,\mu)
&\le \frac{\sqrt 2}{n}\,\mathbb E_{x_i\sim\mu,\sigma}\left[\sup_{f\in\mathcal H}\sum_{i=1}^n\sum_{j=1}^d\sigma_{i,j}f_j(x_i)\right]
= \frac{\sqrt 2}{n}\,\mathbb E_{x_i\sim\mu,\sigma}\left[\sup_{f\in\mathcal H}\sum_{i=1}^n\langle\sigma_i, f(x_i)\rangle\right] \\
&= \frac{\sqrt 2}{n}\,\mathbb E_{x_i\sim\mu,\sigma}\left[\sup_{f\in\mathcal H}\sum_{i=1}^n\langle f, K_{x_i}\sigma_i\rangle_{\mathcal H}\right]
= \frac{\sqrt 2}{n}\,\mathbb E_{x_i\sim\mu,\sigma}\left[\sup_{f\in\mathcal H}\Big\langle f, \sum_{i=1}^n K_{x_i}\sigma_i\Big\rangle_{\mathcal H}\right],
\end{align*}
where the first equality expresses the inner summation over $j$ in inner-product form, the second equality uses the reproducing property, and the last equality uses the linearity of the inner product. Moreover, the inner product can be upper bounded as follows:
\[
\Big\langle f, \sum_{i=1}^n K_{x_i}\sigma_i\Big\rangle_{\mathcal H}
\le \|f\|_{\mathcal H}\Big\|\sum_{i=1}^n K_{x_i}\sigma_i\Big\|_{\mathcal H}
= \sqrt{\Big\langle \sum_{i=1}^n K_{x_i}\sigma_i,\ \sum_{i=1}^n K_{x_i}\sigma_i\Big\rangle_{\mathcal H}}
= \sqrt{\sum_i\sum_j\langle K_{x_i}\sigma_i, K_{x_j}\sigma_j\rangle_{\mathcal H}}
= \sqrt{\sum_i\sum_j\big\langle\big(K_{x_i}\sigma_i\big)(x_j),\ \sigma_j\big\rangle}
= \sqrt{\sum_i\sum_j\sigma_j^{\top}K(x_j,x_i)\sigma_i}.
\]
Therefore,
\[
\mathbb E_{x_i\sim\mu,\sigma}\left[\sup_{f\in\mathcal H}\Big\langle f, \sum_{i=1}^n K_{x_i}\sigma_i\Big\rangle_{\mathcal H}\right]
\le \mathbb E_{x_i\sim\mu,\sigma}\sqrt{\sum_i\sum_j\sigma_j^{\top}K(x_j,x_i)\sigma_i}
\le \sqrt{\mathbb E_{x_i\sim\mu,\sigma}\sum_i\sum_j\sigma_j^{\top}K(x_j,x_i)\sigma_i}
= \sqrt{\mathbb E_{x_i\sim\mu,\sigma}\Big[\sum_i\sigma_i^{\top}K(x_i,x_i)\sigma_i\Big]}
\le \sqrt{\mathbb E_{\sigma}\sum_i B\,\sigma_i^{\top}\sigma_i}
= \sqrt{ndB},
\]
where the second inequality uses Jensen's inequality and the equality holds because the cross terms with $i\neq j$ vanish in expectation. This implies $\mathcal R_n(\mathcal F_\ell,\mu) \le \frac{\sqrt{2dB}}{\sqrt n}$.
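The last chain of inequalities bounds $\mathbb E_\sigma\big[(\sum_{i,j}\sigma_j^{\top}K(x_j,x_i)\sigma_i)^{1/2}\big]$ by $\sqrt{ndB}$. A quick Monte Carlo check with a separable Gaussian kernel (purely illustrative choices of kernel and data):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n, d, D, B = 50, 2, 3, 1.0

def k(x, y, bw=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * bw ** 2))

X = rng.standard_normal((n, D))
G = np.array([[k(xi, xj) for xj in X] for xi in X])   # scalar Gram matrix, k(x,x)=B=1

# Monte Carlo estimate of E_sigma sqrt( sum_{i,j} k(x_i, x_j) <sigma_i, sigma_j> ).
draws = [rng.choice([-1.0, 1.0], size=(n, d)) for _ in range(200)]
est = np.mean([np.sqrt(np.einsum('ij,ik,jk->', G, S, S)) for S in draws])
print(est, np.sqrt(n * d * B))   # the estimate stays below sqrt(n * d * B)
\end{verbatim}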
Lemma 14. We have that
\[
\mathbb E[\mathrm{KPW}(\hat\mu_n,\mu)] \le \frac{2\sqrt{2dB}}{\sqrt n} + \zeta_{n,d},
\]
where
\[
\zeta_{n,d} = \inf_{\epsilon>0}\left\{\epsilon + 6\sqrt{\frac{B}{n}}\sqrt{\log\Big(\Big\lceil\frac{4\sqrt B}{\epsilon}\Big\rceil + 1\Big) + \Big(1 + \frac{4\sqrt B}{\epsilon}\Big)^d\log 2}\right\}.
\]

Proof of Lemma 14.
Define the function space
\[
\mathcal F_\ell = \big\{\mathbb R^D\ni z\mapsto \ell(f(z)):\ f\in\mathcal H,\ \|f\|_{\mathcal H}\le 1\big\}.
\]
Without loss of generality, for the kernel projected Wasserstein distance it suffices to consider the following discriminator function space:
\[
T = \Big\{\ell: B(\sqrt B)\to\mathbb R\ \Big|\ \ell(0)=0,\ |\ell(y)-\ell(y')| \le \|y-y'\|\ \ \forall y,y'\in B(\sqrt B)\Big\},
\]
where $B(r) := \{y\in\mathbb R^d: \|y\|\le r\}$. Then, by the bias-variation decomposition, we have that
\[
\mathbb E[\mathrm{KPW}(\hat\mu_n,\mu)]
= \mathbb E\Big[\sup_{\ell\in T}\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)\Big]
\le \sup_{\ell\in T}\mathbb E\big[\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)\big]
+ \mathbb E\Big[\sup_{\ell\in T}\big(\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu) - \mathbb E[\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)]\big)\Big].
\]
The first term satisfies
\[
\sup_{\ell\in T}\mathbb E\big[\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)\big]
\le 2\sup_{\ell\in T}\mathcal R_n(\mathcal F_\ell,\mu) \le \frac{2\sqrt{2dB}}{\sqrt n}.
\]
Now we upper bound the variation term. Define the empirical process $X_\ell = \mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu) - \mathbb E[\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)]$. It is easy to see that $\mathbb E[X_\ell] = 0$, and for fixed $\ell$ the random variable $X_\ell$ is sub-Gaussian: write $\hat\mu_n = \frac1n\sum_{i=1}^n\delta_{z_i}$, $Z = \{z_i\}_{i=1}^n$, and $g(Z) = \mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)$. Then we have that
\[
\big|g(Z) - g(Z'^{(i)})\big| \le \mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\hat\mu_n')
= \frac1n\sup_{f\in\mathcal H,\,\|f\|_{\mathcal H}\le1}\big|\ell(f(z_i))-\ell(f(z_i'))\big|
\le \frac1n\sup_{f\in\mathcal H,\,\|f\|_{\mathcal H}\le1}\|f(z_i)-f(z_i')\|
\le \frac{2\sqrt B}{n}.
\]
Therefore, applying McDiarmid's inequality (Theorem 6) implies
\[
\Pr\{|X_\ell| \ge u\} \le 2\exp\Big(-\frac{nu^2}{2B}\Big).
\]
Applying Lemma 3 then implies that, for fixed $\ell$, $X_\ell$ is a sub-Gaussian random variable with parameter $\sigma^2 = B/n$. Consider the metric space $(T, d_\infty)$ on the collection of Lipschitz functions $\ell\in T$, with the distance
\[
d_\infty(\ell,\ell') = \sup_{y\in B(\sqrt B)}|\ell(y) - \ell'(y)|.
\]
Then for any $\ell,\ell'\in T$, we have that
\begin{align*}
|X_\ell - X_{\ell'}|
&\le \big|\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu) - \mathrm{IPM}_{\mathcal F_{\ell'}}(\hat\mu_n,\mu)\big|
+ \big|\mathbb E[\mathrm{IPM}_{\mathcal F_\ell}(\hat\mu_n,\mu)] - \mathbb E[\mathrm{IPM}_{\mathcal F_{\ell'}}(\hat\mu_n,\mu)]\big| \\
&\le \sup_f\mathbb E_{\hat\mu_n}\big[|\ell(f(z)) - \ell'(f(z))|\big] + \sup_f\mathbb E_{\mu}\big[|\ell(f(z)) - \ell'(f(z))|\big] \\
&\quad + \mathbb E\Big[\sup_f\mathbb E_{\hat\mu_n}\big[|\ell(f(z)) - \ell'(f(z))|\big] + \sup_f\mathbb E_{\mu}\big[|\ell(f(z)) - \ell'(f(z))|\big]\Big] \\
&\le 4\,d_\infty(\ell,\ell').
\end{align*}
Applying the $\epsilon$-net argument gives
\[
\mathbb E\Big[\sup_{\ell\in T}X_\ell\Big] \le \inf_{\epsilon>0}\left\{\epsilon + 6\sqrt{\frac{B}{n}}\sqrt{\log N(T,d_\infty,\epsilon)}\right\},
\]
together with the covering number estimate
\[
N(T,d_\infty,\epsilon)
\le \left(\Big\lceil\frac{4\sqrt B}{\epsilon}\Big\rceil + 1\right)\cdot 2^{\,N(B(\sqrt B),\,\epsilon/2,\,\|\cdot\|)}
\le \left(\Big\lceil\frac{4\sqrt B}{\epsilon}\Big\rceil + 1\right)\cdot 2^{\,(1+4\sqrt B/\epsilon)^d},
\]
which implies that
\[
\mathbb E\Big[\sup_{\ell\in T}X_\ell\Big]
\le \inf_{\epsilon>0}\left\{\epsilon + 6\sqrt{\frac{B}{n}}\sqrt{\log\Big(\Big\lceil\frac{4\sqrt B}{\epsilon}\Big\rceil + 1\Big) + \Big(1 + \frac{4\sqrt B}{\epsilon}\Big)^d\log 2}\right\}.
\]
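The bound above is only informative after the infimum over $\epsilon$ is evaluated. Here is a small sketch that minimizes the expression for $\zeta_{n,d}$ over a grid of $\epsilon$ values; the constants follow the reconstruction of the display above, so the numerical output should be treated as indicative.

\begin{verbatim}
import numpy as np

def zeta(n, d, B, eps_grid=None):
    """Grid search for zeta_{n,d} = inf_eps { eps + 6 sqrt(B/n) sqrt(log-covering term) }."""
    if eps_grid is None:
        eps_grid = np.logspace(-3, 1, 400)
    log_cover = (np.log(np.ceil(4 * np.sqrt(B) / eps_grid) + 1)
                 + (1 + 4 * np.sqrt(B) / eps_grid) ** d * np.log(2))
    return float(np.min(eps_grid + 6 * np.sqrt(B / n) * np.sqrt(log_cover)))

print(zeta(n=1000, d=2, B=1.0))
\end{verbatim}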
Proof of Proposition 1. For the first part of this proposition, we consider the following bias-variance decomposition of $|\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathrm{KPW}(\mu,\nu)|$:
\[
\big|\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathrm{KPW}(\mu,\nu)\big|
\le \big|\mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)] - \mathrm{KPW}(\mu,\nu)\big|
+ \big|\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)]\big|,
\]
where the first term quantifies the bias of the empirical estimate and the second term quantifies its variance. The bias term can be upper bounded by applying the triangle inequality together with Lemma 14:
\[
\big|\mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)] - \mathrm{KPW}(\mu,\nu)\big|
\le \mathbb E[\mathrm{KPW}(\hat\mu_n,\mu)] + \mathbb E[\mathrm{KPW}(\hat\nu_m,\nu)]
\le \frac{2\sqrt{2dB}}{\sqrt n} + \frac{2\sqrt{2dB}}{\sqrt m} + \zeta_{n,d} + \zeta_{m,d}.
\]
It remains to bound the variance term $|\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)]|$. For fixed $f\in\mathcal H$, define $X = \{x_i\}_{i=1}^n$, $Y = \{y_i\}_{i=1}^m$, and $g(X,Y) = \mathcal W(f_{\#}\hat\mu_n, f_{\#}\hat\nu_m)$. We use $X'^{(i)}$ to denote a collection of data points that differs from $X$ only in the $i$-th component, and $Y'^{(i)}$ is defined likewise. Then we have that
\[
g(X,Y) - g(X'^{(i)},Y) \le \mathcal W(f_{\#}\hat\mu_n, f_{\#}\hat\mu_n')
\le n^{-1}\|f(x_i) - f(x_i')\| \le \frac{2\sqrt B}{n},
\]
where the last inequality follows by applying Lemma 12. Similarly, we have $g(X,Y) - g(X,Y'^{(i)}) \le 2\sqrt B/m$. Therefore, applying McDiarmid's inequality (Theorem 6) with $c_i = 2\sqrt B/n$ for $i = 1,\dots,n$ and $c_i = 2\sqrt B/m$ for $i = n+1,\dots,n+m$ implies
\[
\Pr\Big\{\big|\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)]\big| \ge \epsilon\Big\}
\le 2\exp\Big(-\frac{\epsilon^2mn}{2B(m+n)}\Big).
\]
In other words, with probability at least $1-\alpha$, we have that
\[
\big|\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)]\big|
\le \sqrt{\frac{2B(m+n)}{mn}\log\frac{2}{\alpha}}.
\]
The first part then follows by combining this with the upper bound on $\mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_m)]$. Now we prove the second part of this proposition. In this case, we can follow the same idea as in Lemma 14 to show that
\[
\mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_n)] \le \frac{4\sqrt{2dB}}{\sqrt n} + 2\zeta_{n,d}.
\]
Define $X = \{x_i\}_{i=1}^n$, $Y = \{y_i\}_{i=1}^n$, and $\Delta(X,Y) = \mathrm{KPW}(\hat\mu_n,\hat\nu_n)$. Then
\[
\big|\Delta(X,Y) - \Delta(X'^{(i)},Y)\big| \le \frac{2\sqrt B}{n},\qquad
\big|\Delta(X,Y) - \Delta(X,Y'^{(i)})\big| \le \frac{2\sqrt B}{n}.
\]
Hence, applying McDiarmid's inequality (Theorem 6) with $c_i = 2\sqrt B/n$ for $i = 1,\dots,2n$ implies
\[
\Pr\Big\{\big|\mathrm{KPW}(\hat\mu_n,\hat\nu_n) - \mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_n)]\big| \ge \epsilon\Big\}
\le 2\exp\Big(-\frac{n\epsilon^2}{4B}\Big).
\]
The second part of this proposition is proved by combining this with the upper bound on $\mathbb E[\mathrm{KPW}(\hat\mu_n,\hat\nu_n)]$.
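The McDiarmid step above yields an explicit finite-sample deviation radius for the two-sample statistic. A one-line helper evaluating it (the constant follows the display above; $\alpha$ is the confidence level):

\begin{verbatim}
import numpy as np

def kpw_deviation_radius(n, m, B, alpha):
    """With probability >= 1 - alpha, the empirical KPW statistic deviates from
    its expectation by at most this amount (per the McDiarmid bound above)."""
    return np.sqrt(2.0 * B * (m + n) / (m * n) * np.log(2.0 / alpha))

print(kpw_deviation_radius(n=500, m=500, B=1.0, alpha=0.05))
\end{verbatim}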
Proof of Theorem 4. The type-II error admits the following upper bound:
\begin{align*}
\text{Type-II error}
&= \Pr_{H_1}\big(\mathrm{KPW}(\hat\mu_n,\hat\nu_m) < \gamma_{m,n}\big) \\
&= \Pr_{H_1}\big(\mathrm{KPW}(\hat\mu_n,\hat\nu_m) - \mathrm{KPW}(\mu,\nu) < \gamma_{m,n} - \mathrm{KPW}(\mu,\nu)\big) \\
&= \Pr_{H_1}\big(\mathrm{KPW}(\mu,\nu) - \mathrm{KPW}(\hat\mu_n,\hat\nu_m) > \mathrm{KPW}(\mu,\nu) - \gamma_{m,n}\big) \\
&\le \Pr_{H_1}\big(\big|\mathrm{KPW}(\mu,\nu) - \mathrm{KPW}(\hat\mu_n,\hat\nu_m)\big| > \mathrm{KPW}(\mu,\nu) - \gamma_{m,n}\big) \\
&\le \frac{\mathrm{MSE}(\mu,\nu)}{\big(\mathrm{KPW}(\mu,\nu) - \gamma_{m,n}\big)^2},
\end{align*}
where the last inequality follows from Chebyshev's inequality.
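The final display is directly computable once an upper estimate of the mean-squared estimation error and the test threshold $\gamma_{m,n}$ are available. A minimal helper (the argument names are illustrative; an MSE bound could be obtained from Proposition 1):

\begin{verbatim}
def type2_error_bound(kpw_true, gamma_mn, mse_bound):
    """Chebyshev-type bound on the type-II error; informative only when
    kpw_true exceeds the test threshold gamma_mn."""
    if kpw_true <= gamma_mn:
        return 1.0
    return min(1.0, mse_bound / (kpw_true - gamma_mn) ** 2)

print(type2_error_bound(kpw_true=0.5, gamma_mn=0.2, mse_bound=0.01))
\end{verbatim}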