Asymptotically Optimal One- and Two-Sample Testing with Kernels
Shengyu Zhu†, Biao Chen‡, Zhitang Chen†, and Pengfei Yang§
†Huawei Noah's Ark Lab, Shatin, Hong Kong
‡Syracuse University, Syracuse, NY
§Cubist Systematic Strategies, New York, NY
Abstract
We characterize the asymptotic performance of nonparametric one- and two-sample testing. The exponential decay rate or error exponent of the type-II error probability is used as the asymptotic performance metric, and an optimal test achieves the maximum rate subject to a constant level constraint on the type-I error probability. With Sanov's theorem, we derive a sufficient condition for one-sample tests to achieve the optimal error exponent in the universal setting, i.e., for any distribution defining the alternative hypothesis. We then show that two classes of Maximum Mean Discrepancy (MMD) based tests attain the optimal type-II error exponent on $\mathbb{R}^d$, while the quadratic-time Kernel Stein Discrepancy (KSD) based tests achieve this optimality with an asymptotic level constraint. For general two-sample testing, however, Sanov's theorem is insufficient to obtain a similar sufficient condition. We proceed to establish an extended version of Sanov's theorem and derive an exact error exponent for the quadratic-time MMD based two-sample tests. The obtained error exponent is further shown to be optimal among all two-sample tests satisfying a given level constraint. Our results not only solve a long-standing open problem in information theory and statistics, but also provide an achievability result for optimal nonparametric one- and two-sample testing. Application to off-line change detection and related issues are also discussed.

Index Terms
Universal hypothesis testing, error exponent, large deviations, maximum mean discrepancy (MMD), kernel Steindiscrepancy (KSD)
I. INTRODUCTION
We study two fundamental problems in statistical hypothesis testing: the one- and two-sample testing. One-sample testing, also referred to as goodness of fit testing, aims to determine how well a given distribution P fitsthe observed sample y m := { y i } mi =1 . This goal can be achieved by testing the null hypothesis H : P = Q againstthe alternative hypothesis H : P = Q , where Q is the true distribution governing the sample y m . In two-sampleor homogeneity testing, one wishes to test if two samples x n and y m originate from the same distribution. Let P and Q denote the underlying unknown distributions for the respective samples. Then a two-sample test decideswhether to accept H : P = Q or H : P = Q .Both one- and two-sample testing have a long history in statistics and find applications in a variety of areas.In anomaly detection [2]–[4], the abnormal sample is supposed to come from a distribution that deviates from the This work was presented in part at the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), Naha, Okinawa,Japan, April 2019 [1].Correspondence to: Shengyu Zhu (email: [email protected]) typical distribution or sample. Similarly in change-point detection [5]–[9], the post-change observations originatefrom a different source from the pre-change one. In bioinformatics, two-sample testing may be conducted to comparemicro-array data from identical tissue types measured by different laboratories, to decide whether the data can beanalyzed jointly [10]. Other examples include spectrum sensing in cognitive radio [11], [12], criticizing statisticalmodels [13], [14], and measuring quality of samples drawn from a given probability density function (up to thenormalization constant) by Markov Chain Monte Carlo (MCMC) methods [15]–[17].In this paper, we consider the universal nonparametric setting, in which no prior information on the unknowndistributions is available. We will only allow for tests that are independent of the unknown distributions, whereasthe statistical performance of the tests may depend on the unknown distributions (cf. Section II). Typical testsin this setting are constructed based on some probability distance measures between distributions, which possessthe property that they are zero if and only if two distributions are identical; hence a larger sample estimate ofthe distance measure indicates that the two distributions are more likely to be different. Examples in some earliertests include the Kolmogorov-Smirnov distance [18]–[20], total variation distance [21], and Wasserstein distance[22]–[24]. Although the constructed tests have satisfactory theoretic properties and work well in low dimensions(namely, R ), they do not in general apply to high dimension data. Recent tests have also used the Kullback-LeiblerDivergence (KLD) [25], [26] and Sinkhorn divergence (smoothed Wasserstein distance) [24], [27], whose statisticsare estimated by solving some optimization problems and can better handle higher dimensions. We refer the readerto related works, e.g., [17], [24], [28], and references therein for a more detailed review of existing one- andtwo-sample tests.More recently, kernel based statistics have attracted much attention in machine learning, as they possess severalkey advantages such as computational efficiency and fast convergence [29], [30]. A particular example is theMaximum Mean Discrepancy (MMD), defined by the distance between the mean embeddings of two distributionsinto a Reproducing Kernel Hilbert Space (RKHS) [28]. 
There have been several effective two-sample tests thatare constructed based on the MMD: a vanilla test statistic can be computed by plugging in the sample empiricaldistributions with a quadratic-time computation complexity in terms of number of samples, and some variants havebeen proposed with even lower complexities [35]–[40]. Applying the MMD to one-sample testing is straightforwardbut requires integrals with respect to (w.r.t.) the target distribution P [41]–[43]. Another idea, in the context ofmodel criticism, is to conduct a two-sample testing by drawing samples from P [13], [14]. A difficulty with thisapproach is to determine the required number of samples drawn from P relative to m , the sample number ofthe test sequence. Alternatively, there exist more efficient one-sample tests constructed based on classes of Steintransformed RKHS functions [15]–[17], [44], [45], where the test statistic is the norm of the smoothness-constrainedfunction with largest expectation under Q and is referred to as the Kernel Stein Discrepancy (KSD). The KSD haszero expectation under P and does not require computing integrals or drawing samples. Additionally, constructingexplicit features of distributions attains a linear-time one-sample test that is also more interpretable [46].Distinguishing distributions with high success probability at a given fixed sample size, however, is not possiblewithout any prior assumptions regarding the difference between P and Q (see an example for two-sample testingin [28, Section 3]). Consequently, statistical performance in the universal setting are often considered in the largesample regime. A test is said to be consistent if its type-II error probability approaches zero in the limit, subjectto a constant level constraint on the type-I error probability. While consistency is a desired property for hypothesistests, it is even more desirable to characterize the decay rate w.r.t. sample size as it provides a natural metric The MMD is closely related to another probability distance measure, the energy distance [31], [32]. Roughly speaking, for every distancemetric, there exists a suitable kernel (and also vice versa) so that the MMD and the energy distance are equivalent; see [33], [34] for details.In this paper, we focus on kernel based statistics, and particularly, the MMD and KSD. for comparing tests’ performance. Indeed, the decay rate of the type-II error probability has been investigated forexisting kernel based one- and two-sample tests. For the one-sample tests in [41]–[43] and two-sample tests in[35]–[40], analysis is based on test statistics, through their asymptotic distributions or some probabilistic bounds ontheir convergence to the population statistics. The statistical characterizations depend on kernels and are loose ingeneral (more details are given in Section IV-C). For KSD based one-sample tests, current characterization restrictsto consistency; attempts in characterizing their asymptotic performance, in particular, the optimal error decay rate,have not been fruitful [17], [44], [46].The present work is devoted to characterizing the statistical optimality of nonparametric one- and two-sample testsin the universal setting. This is motivated by the success of Hoeffding’s result in [47] which established universaloptimality of the so-called likelihood ratio test for testing against a multinomial distribution. 
For the general case,i.e., with arbitrary sample space, testing between H : y m ∼ P and H : y m ∼ Q can be extremely hard when Q is arbitrary but unknown, as opposed to the simple case where Q is known. With independent and identicallydistributed (i.i.d.) sample and known Q , the type-II error probability of an optimal test, subject to a constant levelconstraint on the type-I error probability, vanishes exponentially fast w.r.t. the sample size m , and the exponentialdecay rate or error exponent coincides with the KLD between P and Q (cf. Lemma 1). This motivates the so-calledUniversal Hypothesis Testing (UHT) problem [47]: does there exist a nonparametric one-sample test that achievesthe same optimal error exponent as in the simple hypothesis testing problem where Q is known? Over the years,universally optimal tests are known to exist only when the sample space is finite [47], [48]. For a more generalsample space, attempts have been largely fruitless except the works of [49]–[51]. These results, however, wereobtained at the cost of weaker optimalities and the proposed tests were rather complicated for practical use. Herewe remark that even the existence of an optimal test for the UHT problem remains unknown in the latter case.Closely related to the current setting is a broader class of composite hypothesis testing, where there is uncertaintyin the distributions associated with the hypotheses. This uncertainty, if known a priori , could be used to devise teststo optimize the worst-case performance, leading to generalized likelihood ratio tests or other minimax based tests,e.g., [52]. By contrast, the universal optimality criterion used in this paper is much stronger in that the optimummust be achieved for any distribution defining the alternative. Also related are the works [46] and [53], whichrespectively use the approximate Bahadur slope and detection boundary as performance metric to compare kernelbased one-sample tests. The authors of [46] show that their linear-time test has a greater relative efficiency thanthe linear-time test proposed in [44], assuming a mean-shift alternative. In [53], a nonparametric kernel based testis proposed to achieve the minimax optimality for a composite alternative. It is noted that, while the tests in [46],[53] are able to work in the universal nonparametric setting, the corresponding statisical results need to assume aparticular composite alternative.
A. Contributions
We show that a simple kernel test, comparing the MMD between the given distribution and the sample empiricaldistribution with a proper threshold, is an optimal approach to the UHT problem on Polish, locally compactHausdorff space, e.g., R d . Taking into account the difficulty of obtaining closed-form integrals for non-Gaussiandistributions, we then follow [13] to cast one-sample testing into a two-sample problem. We establish the sameoptimality for the quadratic-time kernel two-sample tests proposed in [28], provided that a suitable number ofindependent samples are drawn from the given distribution. For the KSD based tests, the constant level constrainton the type-I error probability is difficult to meet for all possible sample sizes. By relaxing the constraint to an Throughout the rest of this paper, the UHT problem refers to the specific problem of finding a universally optimal one-sample test interms of the type-II error exponent. asymptotic one, we show that the quadratic-time KSD based tests proposed in [17], [44] are also optimal for theUHT problem under suitable conditions. Key to our approach are Sanov’s theorem and the weak convergenceproperties of the MMD [54], [55] and the KSD [16], which enable us to directly investigate the acceptance regiondefined by the test, rather than using the test statistic as an intermediary.As another contribution, we investigate the quadratic-time kernel two-sample tests in a more general setting wherethe sample sizes scale in the same order. The original Sanov’s theorem, however, is insufficient in this setting as itinvolves only a single distribution. To proceed, we derive an extended version of Sanov’s theorem, based on whichan exact type-II error exponent of the two-sample test is established. The obtained error exponent is then shown tobe optimal among all two-sample tests under the same level constraint, and is independent of the choice of kernelsprovided that they are bounded continuous and characteristic.Finally, we discuss related issues, including how two other statistical criteria, exact Bahadur slope and Chernoffindex, perform under the universal optimality criterion. Application of our results to nonparametric off-line changedetection is also included, and we establish an optimal change detection result when no prior information on thepost-change distribution is available.
B. Paper Organization
Section II formally presents the problems of one- and two-sample testing, along with the optimality criterion used in this paper. A sufficient condition for a one-sample test to be universally optimal in terms of the type-II error exponent is proposed in Section III. We briefly review the MMD and KSD, and related tests, in Section IV. Section V presents two classes of MMD based tests and the KSD based tests that are optimal for the UHT problem. Section VI establishes an extended version of Sanov's theorem and shows that the quadratic-time MMD based two-sample test is also universally optimal for two-sample testing. We apply our results to nonparametric off-line change-point detection in Section VII and conclude this paper in Section VIII.

Mostly standard notations are used throughout the paper. We use boldface $\mathbf{P}$ to denote the probability of a set w.r.t. a distribution specified by the subscript, e.g., $\mathbf{P}_{y^m \sim Q}(A)$ denotes the probability of $y^m \in A$ with $y^m \stackrel{\text{i.i.d.}}{\sim} Q$.

II. PROBLEM STATEMENT
In this section, we formally state the problems of one- and two-sample testing and also introduce the optimalitycriterion used in this paper.
A. One-Sample Testing
Throughout the rest of this paper, let $\mathcal{X}$ denote a Polish space (that is, a separable completely metrizable topological space) and $\mathcal{P}$ the set of Borel probability measures defined on $\mathcal{X}$. Given a distribution $P \in \mathcal{P}$ and a sample sequence $y^m$ from an unknown distribution $Q \in \mathcal{P}$, we want to determine whether to accept $H_0: P = Q$ or $H_1: P \neq Q$. A hypothesis test $\Omega(m) = \{\Omega_0(m), \Omega_1(m)\}$ partitions $\mathcal{X}^m$ into two disjoint sets with $\Omega_0(m) \cup \Omega_1(m) = \mathcal{X}^m$. If $y^m \in \Omega_i(m)$, $i = 0, 1$, a decision is made in favor of hypothesis $H_i$. We say that $\Omega_0(m)$ is an acceptance region for the null hypothesis $H_0$ and $\Omega_1(m)$ the rejection region. A type-I error is made when $P = Q$ is rejected while $H_0$ is true, and a type-II error occurs when $P = Q$ is accepted despite $H_1$ being true. The two error probabilities are respectively
$$\alpha_m := \mathbf{P}_{y^m \sim P}(\Omega_1(m)), \ \text{under } H_0, \qquad \beta_m := \mathbf{P}_{y^m \sim Q}(\Omega_0(m)), \ \text{under } H_1.$$
In general, the two error probabilities cannot be minimized simultaneously. A commonly used approach is the Neyman-Pearson approach [56], which imposes the type-I error probability constraint in the form of $\alpha_m \leq \alpha$ for a pre-defined $\alpha \in (0, 1)$. A level $\alpha$ test is said to be consistent when the type-II error probability vanishes in the large sample limit. Such a test is exponentially consistent if the error probability vanishes exponentially fast w.r.t. the sample size, i.e., when
$$\liminf_{m \to \infty} -\frac{1}{m} \log \beta_m > 0.$$
Here and throughout the rest of this paper, $\log$ denotes the logarithm to the base $e$. The above limit is also referred to as the type-II error exponent [57]. We next present the Chernoff-Stein lemma, which gives the optimal type-II error exponent of any level $\alpha$ test for simple hypothesis testing between two known distributions. Let $D(P\|Q)$ denote the KLD between $P$ and $Q$. That is, $D(P\|Q) = \mathbb{E}_P \log(dP/dQ)$, where $dP/dQ$ stands for the Radon-Nikodym derivative of $P$ w.r.t. $Q$ when it exists, and $D(P\|Q) = \infty$ otherwise [58].

Lemma 1 (Chernoff-Stein Lemma [57], [58]):
Let $y^m \stackrel{\text{i.i.d.}}{\sim} R$. Consider hypothesis testing between $H_0: R = P \in \mathcal{P}$ and $H_1: R = Q \in \mathcal{P}$, with $0 < D(P\|Q) < \infty$. Given $0 < \alpha < 1$, let $\Omega^*(m, P, Q) = (\Omega_0^*(m, P, Q), \Omega_1^*(m, P, Q))$ be the optimal level $\alpha$ test with which the type-II error probability is minimized for each sample size $m$. It follows that the type-II error probability satisfies
$$\lim_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}(\Omega_0^*(m, P, Q)) = D(P\|Q).$$
We can now describe the universal optimality criterion used in this paper. Let $\Omega(m) = (\Omega_0(m), \Omega_1(m))$ be a nonparametric one-sample test of level $\alpha$. With $y^m \stackrel{\text{i.i.d.}}{\sim} Q$ under the alternative hypothesis $H_1$, the corresponding type-II error probability $\mathbf{P}_{y^m \sim Q}(\Omega_0(m))$ cannot be lower than the minimum $\mathbf{P}_{y^m \sim Q}(\Omega_0^*(m, P, Q))$. As a result, the Chernoff-Stein lemma indicates that its type-II error exponent is upper bounded by $D(P\|Q)$. The UHT problem is then to find a test $\Omega(m)$ for a given $P$ so that, for an arbitrary $Q$ with $0 < D(P\|Q) < \infty$, it holds that
a) under $H_0$: $\mathbf{P}_{y^m \sim P}(\Omega_1(m)) \leq \alpha$,
b) under $H_1$: $\liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}(\Omega_0(m)) = D(P\|Q)$,
giving rise to the name universal hypothesis testing. Here we remark that even the existence of such a test remains unknown when the sample space $\mathcal{X}$ is non-finite.

B. Two-Sample Testing
Let $x^n$ and $y^m$ be independent samples with $x^n \sim P$ and $y^m \sim Q$, where both $P$ and $Q$ are unknown. The goal of two-sample testing is to decide between $H_0: P = Q$ and $H_1: P \neq Q$ based on the observed samples. We use $\Omega(n, m) = (\Omega_0(n, m), \Omega_1(n, m))$ to denote a two-sample test, with $\Omega_0(n, m) \cap \Omega_1(n, m) = \emptyset$ and $\Omega_0(n, m) \cup \Omega_1(n, m) = \mathcal{X}^{n+m}$. The type-I and type-II error probabilities are given by
$$\alpha_{n,m} := \mathbf{P}_{x^n \sim P,\, y^m \sim P}((x^n, y^m) \in \Omega_1(n, m)), \ \text{under } H_0,$$
$$\beta_{n,m} := \mathbf{P}_{x^n \sim P,\, y^m \sim Q}((x^n, y^m) \in \Omega_0(n, m)), \ \text{under } H_1,$$
respectively. Notice that both $\alpha_{n,m}$ and $\beta_{n,m}$ are defined w.r.t. the underlying yet unknown distributions under the respective hypotheses. Motivated by the UHT problem, we also consider the error exponent of $\beta_{n,m}$ defined in the large sample limit, with a constant level constraint on $\alpha_{n,m}$. That is, we would like to maximize
$$\liminf_{n,m \to \infty} -\frac{1}{n+m} \log \beta_{n,m}, \quad \text{subject to } \alpha_{n,m} \leq \alpha. \tag{1}$$
Unlike one-sample testing, there does not exist a characterization of the optimal type-II error exponent for two-sample testing. As such, we would like not only to derive an exact characterization of the type-II error exponent for a given two-sample test, but also to investigate if the characterization is optimal among all two-sample tests satisfying the level constraint.
III. A SUFFICIENT CONDITION FOR UNIVERSAL HYPOTHESIS TESTING
A useful tool for establishing the exponential decay of a hypothesis test is Sanov’s theorem from large deviationtheory. In this section, we will use it to derive a sufficient condition for one-sample tests to be universally optimal,followed by discussions on why various tests fail to meet this condition.We start with the weak convergence of probability measures, followed by Sanov’s theorem.
Definition 1 (Weak Convergence):
For a sequence of probability measures $P_l \in \mathcal{P}$, we say that $P_l \to P$ weakly if and only if $\mathbb{E}_{x \sim P_l} f(x) \to \mathbb{E}_{x \sim P} f(x)$ for every bounded continuous function $f: \mathcal{X} \to \mathbb{R}$. The topology on $\mathcal{P}$ induced by this weak convergence is referred to as the weak topology.

Theorem 1 (Sanov's Theorem [58], [59]):
Let $y^m \stackrel{\text{i.i.d.}}{\sim} Q \in \mathcal{P}$. Denote by $\hat{Q}_m$ the empirical measure of sample $y^m$, i.e., $\hat{Q}_m = \frac{1}{m}\sum_{i=1}^m \delta_{y_i}$, where $\delta_y$ is the Dirac measure at $y$. For a set $\Gamma \subset \mathcal{P}$ defined on the Polish space $\mathcal{X}$, it holds that
$$\limsup_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}(\hat{Q}_m \in \Gamma) \leq \inf_{R \in \operatorname{int} \Gamma} D(R\|Q),$$
$$\liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}(\hat{Q}_m \in \Gamma) \geq \inf_{R \in \operatorname{cl} \Gamma} D(R\|Q),$$
where $\operatorname{int} \Gamma$ and $\operatorname{cl} \Gamma$ are the interior and closure of $\Gamma$ w.r.t. the weak topology, respectively. A useful property of the KLD is its lower semi-continuity.

Lemma 2 (Lower Semi-Continuity of the KLD [47], [60]):
For a fixed $Q \in \mathcal{P}$, $D(\cdot\|Q)$ is a lower semi-continuous function w.r.t. the weak topology of $\mathcal{P}$. That is, for any $\epsilon > 0$, there exists a neighborhood $U \subset \mathcal{P}$ of $P$ such that for any $P' \in U$, $D(P'\|Q) \geq D(P\|Q) - \epsilon$ if $D(P\|Q) < \infty$, and $D(P'\|Q) \to \infty$ as $P'$ tends to $P$ if $D(P\|Q) = \infty$. We can now present a sufficient condition which follows from Sanov's theorem and the lower semi-continuity of the KLD.

Theorem 2:
Let $y^m \stackrel{\text{i.i.d.}}{\sim} Q$. Let $\Omega(m) = (\Omega_0(m), \Omega_1(m))$ be a one-sample test based on $y^m$ and $P$. Then it is optimal for the UHT problem if
a) $\mathbf{P}_{y^m \sim P}(\Omega_1(m)) \leq \alpha$ with $P = Q$;
b) $\Omega_0(m) \subset \{y^m: d(P, \hat{Q}_m) \leq \gamma_m\}$, where $d(\cdot,\cdot)$ is a probability metric that metrizes the weak topology on $\mathcal{P}$, $\hat{Q}_m$ denotes the empirical measure of $y^m$, and $\gamma_m > 0$ denotes the test threshold and goes to $0$ as $m \to \infty$.

Proof:
Condition a) is simply the constant constraint on the type-I error probability. By the Chernoff-Stein lemma, we only need to show that the type-II error exponent is no lower than $D(P\|Q)$. Assuming Condition b), we have
$$\liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}(\Omega_0(m)) \geq \liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}\left(d(P, \hat{Q}_m) \leq \gamma_m\right). \tag{2}$$
To proceed, we notice that deciding if $y^m \in \{y^m: d(P, \hat{Q}_m) \leq \gamma_m\}$ is equivalent to deciding if the empirical measure $\hat{Q}_m \in \{P': d(P, P') \leq \gamma_m\}$. Since $\gamma_m \to 0$ as $m \to \infty$, for any given $\gamma > 0$, there exists an integer $m_0$ such that $\gamma_m < \gamma$ for all $m > m_0$. Therefore, $\{P': d(P, P') \leq \gamma_m\} \subset \{P': d(P, P') \leq \gamma\}$ for large enough $m$. It follows that for any $\gamma > 0$,
$$\liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}\left(d(P, \hat{Q}_m) \leq \gamma_m\right) \geq \liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}\left(d(P, \hat{Q}_m) \leq \gamma\right) \geq \inf_{\{P' \in \mathcal{P}:\, d(P, P') \leq \gamma\}} D(P'\|Q), \tag{3}$$
where the last inequality is from Sanov's theorem and the fact that $\{P': d(P, P') \leq \gamma\}$ is closed w.r.t. the weak topology. Moreover, for any given $\epsilon > 0$, there exists some $\gamma > 0$ such that
$$\inf_{\{P':\, d(P, P') \leq \gamma\}} D(P'\|Q) \geq D(P\|Q) - \epsilon, \tag{4}$$
due to the lower semi-continuity of the KLD in Lemma 2 and the assumption that $0 < D(P\|Q) < \infty$ under $H_1$. Since $\epsilon$ can be arbitrarily small, combining (2), (3) and (4) gives
$$\liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}(\Omega_0(m)) \geq D(P\|Q), \quad \text{under } H_1: P \neq Q,$$
which completes the proof.

Remark 1:
It is worth noting that Condition b) requires only a vanishing threshold $\gamma_m$ and does not place any constraint on how fast it vanishes. Indeed, the requirement on the vanishing rate is determined by the type-I error constraint in Condition a). Consequently, if a test has its acceptance region in the form of $\{y^m: d(P, \hat{Q}_m) \leq \gamma_m\}$ and is universally optimal, then any such test with a vanishing threshold $\gamma'_m > 0$, where $\gamma'_m \geq \gamma_m$, also satisfies the two conditions, as $\{y^m: d(P, \hat{Q}_m) \leq \gamma_m\} \subset \{y^m: d(P, \hat{Q}_m) \leq \gamma'_m\}$. However, using a larger threshold may result in a higher type-II error probability in the finite sample regime. Several methods have been proposed to choose a tighter threshold, but they introduce additional randomness. More discussions will be given in Section V-D.

Remark 2:
A direct extension of the above result is to obtain a similar theorem for nonparametric two-sample tests. However, Sanov's theorem works only with a single distribution, whereas there are two distributions involved in two-sample testing. Extending Sanov's theorem to handle two distributions would be key to establishing a sufficient condition for two-sample testing similar to Theorem 2. This is the topic of Section VI-A.

While Theorem 2 is somewhat straightforward since most of the hard work has been done in proving Sanov's theorem [58], [59], the two conditions are indeed quite hard to meet simultaneously. For example, the KLD and total variation distance do not metrize weak convergence, and tests that are constructed from them fail to meet Condition b). While other distances, such as the Lévy metric, Wasserstein distance, and the bounded Lipschitz metric, metrize weak convergence, their sample estimates are usually not easy to compute. Moreover, there does not exist a uniform threshold such that Condition a) is satisfied. To the best of our knowledge, universally optimal one-sample tests only exist for distributions defined on a finite sample space, where the empirical KLD [47] or mismatched distance [48] are used for constructing one-sample tests. Seeking a proper probability distance becomes key to meeting the sufficient condition given in Theorem 2.

Meanwhile, kernel based probability distances have been an active research topic in the machine learning community. While several efficient tests have been constructed based on these probability distances, little is known about their statistical optimality. In the next section, we will introduce two such kernel based probability distances and their empirical estimates for constructing nonparametric one- and two-sample tests.
IV. MAXIMUM MEAN DISCREPANCY AND KERNEL STEIN DISCREPANCY
We introduce two kernel based probability distances, followed by a brief review of related one- and two-sampletests.
A. Maximum Mean Discrepancy
Let $\mathcal{H}_k$ be an RKHS defined on $\mathcal{X}$ with reproducing kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is positive definite. Let $x$ be an $\mathcal{X}$-valued random variable with probability measure $P$, and $\mathbb{E}_{x \sim P} f(x)$ the expectation of $f(x)$ for a function $f: \mathcal{X} \to \mathbb{R}$. Assume that $k$ is bounded continuous. Then for every Borel probability measure $P$ defined on $\mathcal{X}$, there exists a unique element $\mu_k(P) \in \mathcal{H}_k$ such that $\mathbb{E}_{x \sim P} f(x) = \langle f, \mu_k(P) \rangle_{\mathcal{H}_k}$ for all $f \in \mathcal{H}_k$ [61]. The MMD between two Borel probability measures $P$ and $Q$ is defined as
$$d_k(P, Q) := \sup_{\|f\|_{\mathcal{H}_k} \leq 1} \left(\mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{x \sim Q} f(x)\right),$$
where $\|f\|_{\mathcal{H}_k} \leq 1$ denotes the unit ball in the RKHS [28]. The MMD belongs to a class of metrics for probability measures, called the Integral Probability Metric (IPM). Choosing an appropriate set of functions over which the supremum is taken can recover many other popular distance measures, including the total variation distance and Wasserstein distance. We refer the reader to [62] for more details on the IPM.

The use of the unit ball in the RKHS brings in an equivalent formulation of the MMD. It is shown in [28] that the MMD can also be expressed as the RKHS-distance between $\mu_k(P)$ and $\mu_k(Q)$:
$$d_k(P, Q) = \|\mu_k(P) - \mu_k(Q)\|_{\mathcal{H}_k} = \left(\mathbb{E}_{x,x'} k(x, x') + \mathbb{E}_{y,y'} k(y, y') - 2\,\mathbb{E}_{x,y} k(x, y)\right)^{1/2}, \tag{5}$$
where $x, x' \stackrel{\text{i.i.d.}}{\sim} P$ and $y, y' \stackrel{\text{i.i.d.}}{\sim} Q$. If kernel $k$ is characteristic, then $d_k(\cdot,\cdot)$ becomes a metric on $\mathcal{P}$ [28], [63], which enables the MMD to distinguish between different distributions. Moreover, [54], [55] have also established the weak metrizable property of $d_k(\cdot,\cdot)$, as stated below.

Theorem 3 ([54], [55]):
The MMD $d_k(\cdot,\cdot)$ metrizes the weak convergence on $\mathcal{P}$ if the following conditions hold:
• (A1) the sample space $\mathcal{X}$ is Polish, locally compact and Hausdorff;
• (A2) the kernel $k$ is bounded continuous and characteristic.
As discussed in Section III, the weak metrizable property is key to the sufficient condition in Theorem 2. (Indeed, it is shown in [54] that $\mathcal{X}$ only needs to be locally compact Hausdorff; we require $\mathcal{X}$ to be Polish in order to utilize Sanov's theorem.) We note that this property is also important to training deep generative models [64], [65] in machine learning. Examples under Condition A1 include any finite set and $\mathbb{R}^d$, and Condition A2 is satisfied by the Gaussian and Laplace kernels defined on $\mathbb{R}^d$, which are $k(x, y) = \exp(-\|x - y\|_2^2/\gamma^2)$ and $k(x, y) = \exp(-\|x - y\|_1/\gamma)$, respectively, with $x, y \in \mathbb{R}^d$ and $\gamma > 0$ being a kernel parameter.

B. Kernel Stein Discrepancy
The KSD was recently proposed in [17], [44] and calculates a discrepancy from distribution $Q$ to a given distribution $P$ by using a Stein operator. For the rest of this paper, the discussion of the KSD always assumes $\mathcal{X} = \mathbb{R}^d$. Denote by $p$ and $q$ the density functions (w.r.t. the Lebesgue measure) of $P$ and $Q$, respectively. In [17], [44], the KSD is defined as
$$d_S(P, Q) := \sup_{\|f\|_{\mathcal{H}_k} \leq 1} \mathbb{E}_{x \sim Q}\left[s_p(x) f(x) + \nabla_x f(x)\right],$$
where $\|f\|_{\mathcal{H}_k} \leq 1$ is the unit ball in the RKHS $\mathcal{H}_k$, and $s_p(x) = \nabla_x \log p(x)$ is the score function of $p(x)$. With a $C_0$-universal kernel [66, Definition 1 and Theorem 4.1] (where $C_0$ denotes the set of continuous functions vanishing at infinity) and $\mathbb{E}_{x \sim Q}\|\nabla_x \log p(x) - \nabla_x \log q(x)\|^2 < \infty$, $d_S(P, Q) = 0$ if and only if $P = Q$ [17, Theorem 2.2]. A nice property of the KSD is that this result requires only the knowledge of $p(x)$ up to the normalization constant. To see this, notice that $\nabla_x \log p(x) = \nabla_x \log(\eta p(x))$ for any constant $\eta > 0$. An equivalent expression of the squared KSD can be derived as
$$d_S^2(P, Q) = \mathbb{E}_{x \sim Q}\, \mathbb{E}_{x' \sim Q}\, h_p(x, x'),$$
where
$$h_p(x, x') := s_p^T(x) s_p(x') k(x, x') + s_p^T(x') \nabla_x k(x, x') + s_p^T(x) \nabla_{x'} k(x, x') + \operatorname{trace}(\nabla_{x,x'} k(x, x')). \tag{6}$$
Although the KSD is not a probability metric on $\mathcal{P}$, it has been shown to be lower bounded in terms of some weak metrizable measures, as stated below.

Theorem 4 ([16]):
If a) $\mathcal{X} = \mathbb{R}$ and $k(x, y) = \Phi(x - y)$ for some $\Phi \in C^2$ (twice continuously differentiable) with a non-vanishing generalized Fourier transform; or b) $k(x, y) = \Phi(x - y)$ for some $\Phi \in C^2$ with a non-vanishing generalized Fourier transform and the sequence $\{\hat{Q}_m\}_{m \geq 1}$ is uniformly tight, then there exists a weak metrizable measure $d_W$ such that
$$d_W(P, \hat{Q}_m) \leq g(d_S(P, \hat{Q}_m)),$$
where $g$ is a function involving some unknown constants and $g(w) \to 0$ as $w \to 0$.

This theorem indicates that $d_S(P, \hat{Q}_m) \to 0$ only if $\hat{Q}_m \to P$ weakly, i.e., weak convergence of $\hat{Q}_m$ to $P$ is a necessary condition for $d_S(P, \hat{Q}_m)$ to vanish. The Gaussian kernel defined on $\mathbb{R}$ satisfies Condition a), and an example under Condition b) is the inverse multi-quadric kernel $k(x, y) = (c^2 + \|x - y\|_2^2)^{\eta}$, with $c > 0$ and $-1 < \eta < 0$.

C. Preliminary Results
We end this section with some preliminary results on MMD and KSD based one- and two-sample tests. As one will see, these results depend on kernels and are generally loose. Nevertheless, they are important to finding a proper test threshold to meet the level constraint on the type-I error probability.
1) MMD based one-sample test statistic:
From the definition of the MMD, a one-sample test statistic can be directly obtained by plugging in the empirical distribution of the observed sample. With sample $y^m$ and its empirical distribution $\hat{Q}_m$, the squared MMD can be estimated as
$$d_k^2(P, \hat{Q}_m) = \mathbb{E}_{x,x'} k(x, x') + \frac{1}{m^2}\sum_{i=1}^m \sum_{j=1}^m k(y_i, y_j) - \frac{2}{m}\sum_{i=1}^m \mathbb{E}_x k(x, y_i), \tag{7}$$
where $x, x' \stackrel{\text{i.i.d.}}{\sim} P$. A statistical characterization of this statistic is given as follows.

Lemma 3 ([41], [42]): Assume A1 and A2, with $0 \leq k(\cdot,\cdot) \leq K$. Given $y^m \stackrel{\text{i.i.d.}}{\sim} Q$, it follows that
$$\mathbf{P}_{y^m \sim Q}\left(\left|d_k(P, \hat{Q}_m) - d_k(P, Q)\right| > (2K/m)^{1/2} + \epsilon\right) \leq \exp\left(-\frac{\epsilon^2 m}{2K}\right).$$
2) MMD based two-sample test statistic:
Given two samples $x^n$ and $y^m$, a two-sample test statistic can be constructed as
$$d_k^2(\hat{P}_n, \hat{Q}_m) = \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n k(x_i, x_j) + \frac{1}{m^2}\sum_{i=1}^m \sum_{j=1}^m k(y_i, y_j) - \frac{2}{nm}\sum_{i=1}^n \sum_{j=1}^m k(x_i, y_j), \tag{8}$$
where $\hat{P}_n$ and $\hat{Q}_m$ are the empirical distributions of $x^n$ and $y^m$, respectively. This statistic was proposed in [28] and is a biased estimator of $d_k^2(P, Q)$. The following lemma states the convergence of $d_k(\hat{P}_n, \hat{Q}_m)$ to $d_k(P, Q)$.

Lemma 4 ([28, Theorem 7]): Assume the same conditions as in Lemma 3. With $x^n \stackrel{\text{i.i.d.}}{\sim} P$ and $y^m \stackrel{\text{i.i.d.}}{\sim} Q$, it holds that
$$\mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left(\left|d_k(\hat{P}_n, \hat{Q}_m) - d_k(P, Q)\right| > 2(K/n)^{1/2} + 2(K/m)^{1/2} + \epsilon\right) \leq 2\exp\left(-\frac{\epsilon^2 nm}{2K(n+m)}\right).$$
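For concreteness, the following is a minimal sketch of the quadratic-time biased statistic in (8), using a Gaussian kernel (so that $K = 1$). The kernel choice, bandwidth, and function names are illustrative assumptions and are not prescribed by Lemma 4; any bounded continuous characteristic kernel with $0 \leq k \leq K$ may be used instead.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / gamma^2); bounded by K = 1."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / gamma**2)

def mmd2_biased(X, Y, gamma=1.0):
    """Biased quadratic-time estimate of the squared MMD in (8)."""
    n, m = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, gamma)
    Kyy = gaussian_kernel(Y, Y, gamma)
    Kxy = gaussian_kernel(X, Y, gamma)
    return Kxx.sum() / n**2 + Kyy.sum() / m**2 - 2.0 * Kxy.sum() / (n * m)

# toy usage: two samples from the same distribution give a small positive value
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = rng.normal(size=(300, 2))
print(mmd2_biased(X, Y))
```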
3) KSD based one-sample test statistic:
Given sample $y^m$, we may estimate $d_S^2(P, Q)$ by
$$d_S^2(P, \hat{Q}_m) = \frac{1}{m^2}\sum_{i=1}^m \sum_{j=1}^m h_p(y_i, y_j),$$
where $h_p(\cdot,\cdot)$ is defined in (6). The statistic $d_S^2(P, \hat{Q}_m)$ is a degenerate V-statistic under the null hypothesis $H_0: P = Q$ [17]. To our best knowledge, existing results only characterize its limiting behavior, as stated in the following lemma.

Lemma 5 ([17, Proposition 3.1]): If $h_p$ is Lipschitz continuous and $\mathbb{E}_{x \sim Q}\, h_p(x, x) < \infty$, then under the null hypothesis, $m\, d_S^2(P, \hat{Q}_m)$ converges weakly to some distribution. (The authors of [17] assume $\tau$-mixing as the notion of dependence within the observations, which holds in the i.i.d. case; they also assume a technical condition $\sum_{t=1}^{\infty} t^2 \sqrt{\tau(t)} < \infty$ on $\tau$-mixing. See details in [17], [67].)

Letting $P = Q$ so that $d_k(P, Q) = 0$ in Lemmas 3 and 4, one can easily obtain a distribution-free threshold to meet the constant type-I error constraint. With such a threshold, however, the type-II error probability under the alternative hypothesis $H_1: P \neq Q$ depends on the kernel $k$ (more precisely, an upper bound $K$), and the resulting type-II error exponent is not tight. As to the KSD based test statistic, since there is no finite sample result like Lemmas 3 and 4, the constant level constraint cannot be satisfied for each sample size $m$. In Section V-C, we will relax the level constraint to an asymptotic one for KSD based one-sample tests.

V. ASYMPTOTICALLY OPTIMAL ONE-SAMPLE TESTS
In this section, we investigate three classes of kernel based one-sample tests for the UHT problem: the first testdirectly computes the MMD between the given distribution and the sample empirical distribution, which requiresclosed-form integrals w.r.t. the given distribution; the second test relaxes the exact integration by drawing samplesfrom the target distribution but needs more treatment in applying Sanov’s theorem; and the third test is morecomputationally favorable but only meets an asymptotic level constraint.
A. Simple Kernel Tests
The first test relies on the statistic $d_k(P, \hat{Q}_m)$ defined via (7) and has been studied in [41]–[43], [53], yet its optimality for the UHT problem remains unknown. Let $\hat{Q}_m$ be the empirical measure of $y^m$. A simple kernel test can be constructed with acceptance region
$$\Omega_0(m) = \left\{y^m: d_k(P, \hat{Q}_m) \leq \gamma_m\right\},$$
where $\gamma_m$ denotes the test threshold. On the one hand, we want the threshold $\gamma_m$ to be small so that the type-II error probability is low; on the other hand, the threshold cannot be too small, in order to satisfy the level constraint on the type-I error probability. The balance between the two error probabilities is attained with a threshold that vanishes at an appropriate rate.

Theorem 5:
For $P \in \mathcal{P}$ and $y^m \stackrel{\text{i.i.d.}}{\sim} Q \in \mathcal{P}$, assume $0 < D(P\|Q) < \infty$ under the alternative hypothesis $H_1$. Assume A1, A2, where kernel $k$ satisfies $0 \leq k(\cdot,\cdot) \leq K$ and $K > 0$ is a constant. For a given $\alpha$, $0 < \alpha < 1$, set $\gamma_m = \sqrt{2K/m}\left(1 + \sqrt{-\log\alpha}\right)$. Then the simple kernel test $d_k(P, \hat{Q}_m) \leq \gamma_m$ is an optimal level $\alpha$ test for the UHT problem, that is,
a) under $H_0$: $\mathbf{P}_{y^m \sim P}\left(d_k(P, \hat{Q}_m) > \gamma_m\right) \leq \alpha$,
b) under $H_1$: $\liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}\left(d_k(P, \hat{Q}_m) \leq \gamma_m\right) = D(P\|Q)$.

Proof:
That the test $d_k(P, \hat{Q}_m) \leq \gamma_m$ has level $\alpha$ can be directly verified by Lemma 3 with $P = Q$. The rest follows from Theorem 2, since $d_k(\cdot,\cdot)$ metrizes weak convergence on $\mathcal{P}$ and $\gamma_m \to 0$ as $m \to \infty$.

The statistic $d_k^2(P, \hat{Q}_m)$ is a biased estimator of $d_k^2(P, Q)$. Replacing $\frac{1}{m^2}\sum_i \sum_j k(y_i, y_j)$ in $d_k^2(P, \hat{Q}_m)$ with $\frac{1}{m(m-1)}\sum_i \sum_{j \neq i} k(y_i, y_j)$ results in an unbiased statistic, which is denoted as $d_u(P, \hat{Q}_m)$. We remark that $d_u(P, \hat{Q}_m)$ is not a squared quantity and can be negative, a consequence of its unbiasedness. The following result shows that $d_u(P, \hat{Q}_m)$ can also be used to construct a universally optimal one-sample test.

Corollary 1:
Under the same conditions as in Theorem 5, the test $d_u(P, \hat{Q}_m) \leq \gamma_m^2 + K/m$ is level $\alpha$ and optimal for the UHT problem.

Proof:
Since $0 \leq k(\cdot,\cdot) \leq K$, we have
$$\left|d_u(P, \hat{Q}_m) - d_k^2(P, \hat{Q}_m)\right| = \left|\frac{1}{m^2(m-1)}\sum_{i=1}^m \sum_{j \neq i} k(y_i, y_j) - \frac{1}{m^2}\sum_{i=1}^m k(y_i, y_i)\right| \leq K/m.
$$
It follows that
$$\left\{y^m: d_k(P, \hat{Q}_m) \leq \gamma_m\right\} \subset \left\{y^m: d_u(P, \hat{Q}_m) \leq \gamma_m^2 + K/m\right\} \subset \left\{y^m: d_k^2(P, \hat{Q}_m) \leq \gamma_m^2 + 2K/m\right\}.$$
Under $H_0$, we thus have
$$\mathbf{P}_{y^m \sim P}\left(d_u(P, \hat{Q}_m) > \gamma_m^2 + K/m\right) \leq \mathbf{P}_{y^m \sim P}\left(d_k(P, \hat{Q}_m) > \gamma_m\right) \leq \alpha,$$
where the last inequality is from Lemma 3 and the fact that $d_k(P, \hat{Q}_m) \geq 0$. Under $H_1: P \neq Q$, the type-II error exponent satisfies
$$\liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}\left(d_u(P, \hat{Q}_m) \leq \gamma_m^2 + K/m\right) \geq \liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}\left(d_k(P, \hat{Q}_m) \leq \sqrt{\gamma_m^2 + 2K/m}\right) \geq D(P\|Q).$$
The last inequality follows from Theorem 2 because $\sqrt{\gamma_m^2 + 2K/m} \to 0$ as $m \to \infty$. Applying the Chernoff-Stein lemma completes the proof.

It is worth noting that the tests in this section require closed-form integrals, namely, $\mathbb{E}_x k(x, y_i)$ and $\mathbb{E}_{x,x'} k(x, x')$, which may be difficult to obtain for non-Gaussian distributions. Our purpose here is to show that the universally optimal type-II error exponent is indeed achievable for non-finite sample spaces, providing a meaningful optimality criterion for nonparametric one-sample tests. In the next section, we consider another class of MMD based tests without the need for closed-form integrals.
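As an illustration, when the target $P$ is an isotropic Gaussian and $k$ is the Gaussian kernel, the integrals in (7) do have closed forms, so the simple kernel test of Theorem 5 can be run directly. The sketch below is a minimal implementation under these illustrative assumptions; it uses the standard Gaussian identities $\mathbb{E}_{x \sim \mathcal{N}(\mu, \sigma^2 I_d)} e^{-\|x-y\|^2/\gamma^2} = (\gamma^2/(\gamma^2+2\sigma^2))^{d/2} e^{-\|\mu-y\|^2/(\gamma^2+2\sigma^2)}$ and $\mathbb{E}_{x,x'} k(x,x') = (\gamma^2/(\gamma^2+4\sigma^2))^{d/2}$, and the threshold of Theorem 5; all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / gamma**2)

def mmd2_one_sample(Y, mu, sigma, gamma=1.0):
    """Estimate (7) of d_k^2(P, Q_hat_m) for P = N(mu, sigma^2 I_d) and Gaussian kernel."""
    m, d = Y.shape
    term_pp = (gamma**2 / (gamma**2 + 4.0 * sigma**2)) ** (d / 2.0)   # E_{x,x'~P} k(x, x')
    c = gamma**2 / (gamma**2 + 2.0 * sigma**2)
    term_py = c ** (d / 2.0) * np.exp(-np.sum((Y - mu) ** 2, 1) / (gamma**2 + 2.0 * sigma**2))
    term_qq = gaussian_kernel(Y, Y, gamma).sum() / m**2
    return term_pp + term_qq - 2.0 * term_py.mean()

def threshold(m, alpha, K=1.0):
    """Distribution-free threshold gamma_m of Theorem 5 (on the MMD scale, not squared)."""
    return np.sqrt(2.0 * K / m) * (1.0 + np.sqrt(-np.log(alpha)))

# toy usage: reject H0 if the estimated MMD exceeds gamma_m
rng = np.random.default_rng(0)
Y = rng.normal(0.0, 1.0, size=(500, 2))
stat = np.sqrt(max(mmd2_one_sample(Y, mu=np.zeros(2), sigma=1.0), 0.0))
print(stat, threshold(500, alpha=0.05))
```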
B. Kernel Two-Sample Tests for One-Sample Testing

In the context of model criticism, [13] casts one-sample testing into a two-sample problem where one draws a sample $x^n$ from distribution $P$. A question that arises is the choice of the number of samples, which is not obvious due to the lack of an explicit criterion. In light of UHT, we may ask how many samples would suffice for a kernel two-sample test to attain the type-II error exponent $D(P\|Q)$.

Denote by $\hat{P}_n$ the empirical measure of $x^n$. Consider a two-sample test with acceptance region
$$\Omega_0(n, m) = \left\{(x^n, y^m): d_k(\hat{P}_n, \hat{Q}_m) \leq \gamma_{n,m}\right\},$$
where $d_k(\hat{P}_n, \hat{Q}_m)$ is given by (8), $K$ is a finite bound on $k(\cdot,\cdot)$, and
$$\gamma_{n,m} = 2(K/n)^{1/2} + 2(K/m)^{1/2} + \left(-\log(\alpha/2)\,(2K/m + 2K/n)\right)^{1/2}. \tag{9}$$
Notice that the type-II error probability now depends on both $P$ and $Q$, due to the use of $\hat{P}_n$. Although additional randomness is introduced, it does not hurt the type-II error exponent provided that $n$ is sufficiently large, as stated below.

Theorem 6:
Assume the same conditions as in Theorem 5, and that $x^n \stackrel{\text{i.i.d.}}{\sim} P$ and $y^m \stackrel{\text{i.i.d.}}{\sim} Q$. Let $\Omega_1(n, m) = \mathcal{X}^{n+m} \setminus \Omega_0(n, m)$ be the rejection region. Letting $n$ be such that $n/m \to \infty$ as $m \to \infty$, we have
a) under $H_0: P = Q$, $\mathbf{P}_{x^n \sim P,\, y^m \sim P}(\Omega_1(n, m)) \leq \alpha$,
b) under $H_1: P \neq Q$, $\liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{x^n \sim P,\, y^m \sim Q}(\Omega_0(n, m)) = D(P\|Q)$.

Proof:
That the two-sample test is level $\alpha$ can be verified by Lemma 4. The rest is to show that the type-II error exponent equals $D(P\|Q)$. To proceed, we write the type-II error probability as
$$\mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left(d_k(\hat{P}_n, \hat{Q}_m) \leq \gamma_{n,m}\right) = \beta^u_{n,m} + \beta^l_{n,m},$$
where
$$\beta^u_{n,m} = \mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left(d_k(\hat{P}_n, \hat{Q}_m) \leq \gamma_{n,m},\; d_k(P, \hat{P}_n) > \gamma'_{n,m}\right),$$
$$\beta^l_{n,m} = \mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left(d_k(\hat{P}_n, \hat{Q}_m) \leq \gamma_{n,m},\; d_k(P, \hat{P}_n) \leq \gamma'_{n,m}\right),$$
$$\gamma'_{n,m} = \sqrt{2K/n} + \sqrt{2KmD(P\|Q)/n}.$$
It suffices to show that both $\beta^u_{n,m}$ and $\beta^l_{n,m}$ decrease at least exponentially fast at a rate of $D(P\|Q)$. We first have
$$\beta^u_{n,m} \leq \mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left(d_k(P, \hat{P}_n) > \gamma'_{n,m}\right) \leq \mathbf{P}_{x^n \sim P}\left(d_k(P, \hat{P}_n) > \gamma'_{n,m}\right) \leq e^{-mD(P\|Q)}, \tag{10}$$
where the last inequality is due to Lemma 3. Thus, $\beta^u_{n,m}$ vanishes at least exponentially fast with exponent $D(P\|Q)$. For $\beta^l_{n,m}$, we have
$$\beta^l_{n,m} \leq \mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left(d_k(\hat{P}_n, \hat{Q}_m) + d_k(P, \hat{P}_n) \leq \gamma_{n,m} + \gamma'_{n,m}\right) \stackrel{(a)}{\leq} \mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left(d_k(P, \hat{Q}_m) \leq \gamma_{n,m} + \gamma'_{n,m}\right) \leq \mathbf{P}_{y^m \sim Q}\left(d_k(P, \hat{Q}_m) \leq \gamma_{n,m} + \gamma'_{n,m}\right),$$
where $(a)$ is from the triangle inequality for the metric $d_k$. Similar to (3), we get
$$\liminf_{m \to \infty} -\frac{1}{m} \log \beta^l_{n,m} \geq D(P\|Q),$$
because $\gamma_{n,m} + \gamma'_{n,m} \to 0$ as $m \to \infty$. Together with (10), we have under $H_1: P \neq Q$,
$$\liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left(d_k(\hat{P}_n, \hat{Q}_m) \leq \gamma_{n,m}\right) \geq D(P\|Q).$$
We next show the other direction under $H_1$. We can write
$$\mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left(d_k(\hat{P}_n, \hat{Q}_m) \leq \gamma_{n,m}\right) \stackrel{(a)}{\geq} \mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left(d_k(\hat{P}_n, P) \leq \gamma'_n,\; d_k(P, \hat{Q}_m) \leq \gamma'_m\right) = \mathbf{P}_{x^n \sim P}\left(d_k(\hat{P}_n, P) \leq \gamma'_n\right) \mathbf{P}_{y^m \sim Q}\left(d_k(P, \hat{Q}_m) \leq \gamma'_m\right),$$
where $(a)$ is because $d_k$ is a metric, and we choose $\gamma'_m = \sqrt{2K/m}\left(1 + \sqrt{\log \alpha^{-1/2}}\right)$ and $\gamma'_n = \sqrt{2K/n}\left(1 + \sqrt{\log \alpha^{-1/2}}\right)$ so that $\gamma_{n,m} > \gamma'_n + \gamma'_m$. Then Lemma 3 gives $\mathbf{P}_{x^n \sim P}(d_k(P, \hat{P}_n) \leq \gamma'_n) > 1 - \sqrt{\alpha}$ and $\mathbf{P}_{y^m \sim Q}(d_k(P, \hat{Q}_m) \leq \gamma'_m) > 1 - \sqrt{\alpha}$, where the latter implies that $d_k(P, \hat{Q}_m) \leq \gamma'_m$ is a level $\sqrt{\alpha}$ test for testing $H_0: y^m \sim P$ against $H_1: y^m \sim Q$ with $P \neq Q$. Together with the Chernoff-Stein lemma, we get
$$\liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left(d_k(\hat{P}_n, \hat{Q}_m) \leq \gamma_{n,m}\right) \leq \liminf_{m \to \infty} -\frac{1}{m} \log\left(1 - \sqrt{\alpha}\right) + \liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}\left(d_k(P, \hat{Q}_m) \leq \gamma'_m\right) \leq D(P\|Q).$$
The proof is complete.

Replacing the first two terms in $d_k^2(\hat{P}_n, \hat{Q}_m)$ with $\frac{1}{n(n-1)}\sum_{i=1}^n \sum_{j \neq i} k(x_i, x_j)$ and $\frac{1}{m(m-1)}\sum_{i=1}^m \sum_{j \neq i} k(y_i, y_j)$ also results in an unbiased statistic, which we denote as $d_u(\hat{P}_n, \hat{Q}_m)$ [28]. We then have the following universally optimal test based on $d_u(\hat{P}_n, \hat{Q}_m)$.

Corollary 2:
Under the same assumptions as in Theorem 6, the test defined by $\Omega_0 = \{(x^n, y^m): d_u(\hat{P}_n, \hat{Q}_m) \leq \gamma_{n,m}^2 + K/n + K/m\}$ has its type-I error probability below $\alpha$ and type-II error exponent equal to $D(P\|Q)$, provided that $n/m \to \infty$ as $m \to \infty$.

Proof (Sketch):
Similar to the proof of Corollary 1, by noting that $|d_u(\hat{P}_n, \hat{Q}_m) - d_k^2(\hat{P}_n, \hat{Q}_m)| \leq K/n + K/m$.

Remark 3:
The above result can be treated as a special case of the two-sample problem where the sample sizes scale in different orders. For the case with $0 < \lim_{m \to \infty} n/m < \infty$, however, the current approach is not readily applicable. A naive attempt is to decompose the acceptance region $\Omega_0(n, m)$ into $\Omega'_0(n) \times \Omega''_0(m)$, with $\Omega'_0(n)$ and $\Omega''_0(m)$ being respectively decided by $x^n$ and $y^m$, and then apply Sanov's theorem to each set. Unfortunately, such a decomposition is not feasible for the MMD based two-sample tests. We postpone a further investigation until Section VI, after studying the KSD based one-sample tests.

C. Kernel Stein Discrepancy based One-Sample Tests
As mentioned in Section IV-C, there does not exist a uniform or distribution-free probabilistic bound on $d_S^2(P, \hat{Q}_m)$, and it becomes difficult to find a threshold meeting the constant level constraint for all sample sizes. To proceed, we relax the level constraint to an asymptotic one and use the result of Lemma 5, which states that $m\, d_S^2(P, \hat{Q}_m)$ converges weakly to some distribution under $H_0: P = Q$. We assume a fixed $\alpha$-quantile $\gamma_\alpha$ of the limiting cumulative distribution function, i.e., $\lim_{m \to \infty} \mathbf{P}(m\, d_S^2(P, \hat{Q}_m) > \gamma_\alpha) = \alpha$. If $\gamma_m$ is such that $\gamma_m \to 0$ and $m\gamma_m \to \infty$, e.g., $\gamma_m = \sqrt{2/m}\left(1 + \sqrt{-\log\alpha}\right)$, we get $m\gamma_m > \gamma_\alpha$ in the limit and thus $\lim_{m \to \infty} \mathbf{P}(d_S^2(P, \hat{Q}_m) > \gamma_m) \leq \alpha$. Together with the weak convergence properties of the KSD, we have the following result.

Theorem 7:
Let $P$ and $Q$ be distributions defined on $\mathbb{R}^d$, with $0 < D(P\|Q) < \infty$ under the alternative hypothesis. Assume $y^m \stackrel{\text{i.i.d.}}{\sim} Q$ and set $\gamma_m = \sqrt{2/m}\left(1 + \sqrt{-\log\alpha}\right)$. It follows that
1) assuming the conditions in Lemma 5, we have $\lim_{m \to \infty} \mathbf{P}_{y^m \sim P}\left(d_S^2(P, \hat{Q}_m) > \gamma_m\right) \leq \alpha$ under $H_0$;
2) assuming that the kernel satisfies the conditions in Theorem 4, it follows that
$$\liminf_{m \to \infty} -\frac{1}{m} \log \mathbf{P}_{y^m \sim Q}\left(d_S^2(P, \hat{Q}_m) \leq \gamma_m\right) = D(P\|Q), \quad \text{under } H_1.$$

Proof:
To establish the type-II error exponent, let $d_W$ denote the weak metrizable metric that lower bounds the KSD in Theorem 4. Then $d_W(P, \hat{Q}_m) \leq g(d_S(P, \hat{Q}_m))$, where $g(d_S) \to 0$ as $d_S \to 0$. Hence there exists $\gamma'_m$ such that $\{y^m: d_S^2(P, \hat{Q}_m) \leq \gamma_m\} \subset \{y^m: d_W(P, \hat{Q}_m) \leq \gamma'_m\}$ and $\gamma'_m \to 0$ as $m \to \infty$. The rest follows from the sufficient condition in Theorem 2.

An unbiased U-statistic $d_S^{(u)}(P, \hat{Q}_m) = \frac{1}{m(m-1)}\sum_{i=1}^m \sum_{j \neq i} h_p(y_i, y_j)$ for estimating $d_S^2(P, Q)$ was proposed in [44]. A similar result holds under an additional assumption on the boundedness of $h_p$, by the same argument as Corollary 1; the detailed proof is omitted.

Corollary 3:
Assume the same conditions as in Theorem 7 and further that $h_p(\cdot,\cdot) \leq H_p$ for some $H_p \in \mathbb{R}_+$. Then the test $d_S^{(u)}(P, \hat{Q}_m) \leq \gamma_m + H_p/m$ is asymptotically level $\alpha$ and achieves the optimal type-II error exponent $D(P\|Q)$.
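To make the KSD statistics concrete, the following sketch evaluates the Stein kernel $h_p$ in (6) and the corresponding V- and U-statistics, assuming a Gaussian kernel and a user-supplied score function $s_p = \nabla_x \log p$; the kernel choice, bandwidth, and the standard-Gaussian score in the usage line are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np

def stein_kernel_matrix(Y, score_fn, gamma=1.0):
    """Matrix [h_p(y_i, y_j)] from (6) for k(x, x') = exp(-||x - x'||^2 / gamma^2)."""
    m, d = Y.shape
    S = score_fn(Y)                                   # (m, d): score at each sample point
    diff = Y[:, None, :] - Y[None, :, :]              # (m, m, d): y_i - y_j
    sq = np.sum(diff**2, axis=2)
    K = np.exp(-sq / gamma**2)
    t1 = (S @ S.T) * K                                # s_p(x)^T s_p(x') k(x, x')
    # s_p(x')^T grad_x k + s_p(x)^T grad_{x'} k, with grad_x k = -(2/gamma^2)(x - x') k
    sx_dot_diff = np.einsum('id,ijd->ij', S, diff)    # s_p(y_i)^T (y_i - y_j)
    sxp_dot_diff = np.einsum('jd,ijd->ij', S, diff)   # s_p(y_j)^T (y_i - y_j)
    t2 = (2.0 / gamma**2) * (sx_dot_diff - sxp_dot_diff) * K
    # trace(grad_{x,x'} k) = (2 d / gamma^2 - 4 ||x - x'||^2 / gamma^4) k
    t3 = (2.0 * d / gamma**2 - 4.0 * sq / gamma**4) * K
    return t1 + t2 + t3

def ksd2_v(Y, score_fn, gamma=1.0):
    """Biased V-statistic d_S^2(P, Q_hat_m) = (1/m^2) sum_{i,j} h_p(y_i, y_j)."""
    return stein_kernel_matrix(Y, score_fn, gamma).mean()

def ksd2_u(Y, score_fn, gamma=1.0):
    """Unbiased U-statistic d_S^(u)(P, Q_hat_m): off-diagonal terms only."""
    H = stein_kernel_matrix(Y, score_fn, gamma)
    m = len(Y)
    return (H.sum() - np.trace(H)) / (m * (m - 1))

# toy usage with a standard Gaussian target, whose score is s_p(x) = -x
rng = np.random.default_rng(0)
Y = rng.normal(size=(300, 2))
print(ksd2_v(Y, lambda X: -X), ksd2_u(Y, lambda X: -X))
```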
D. Remarks

We have the following remarks regarding our results.
1) Threshold choice:
The distribution-free thresholds used in the MMD based tests are generally too conservative, as the actual distribution $P$ is not taken into account. Alternatively, one may use Monte Carlo or bootstrap methods to empirically estimate the acceptance threshold [17], [28], [46], making the tests asymptotically level $\alpha$, i.e., $\lim_{m \to \infty} \alpha_m \leq \alpha$. Bootstrap thresholds have also been proposed for the KSD based tests in [68]–[70]. These methods, however, introduce additional randomness into the threshold choice and hence into the type-II error probability. As a result, it becomes difficult to establish the optimal type-II error exponent. A simple fix is to take the minimum of the Monte Carlo or bootstrap threshold and the distribution-free one, guaranteeing a deterministically vanishing threshold and hence the optimal type-II error exponent.
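The "simple fix" admits a short sketch: estimate the $(1-\alpha)$-quantile of the statistic under $H_0$ by Monte Carlo (here by redrawing samples of size $m$ from $P$, an illustrative choice; permutation or wild-bootstrap resampling would play the same role for the other tests), then cap it by the distribution-free threshold so that the resulting threshold still vanishes deterministically. The statistic, sampler, and threshold value in the usage lines are placeholders.

```python
import numpy as np

def capped_threshold(stat_fn, sample_from_p, m, alpha, dist_free_threshold, n_mc=500, seed=0):
    """Monte Carlo (1 - alpha)-quantile of the statistic under H0, capped by the
    distribution-free threshold so that the cap still vanishes with m."""
    rng = np.random.default_rng(seed)
    null_stats = np.array([stat_fn(sample_from_p(m, rng)) for _ in range(n_mc)])
    mc_threshold = np.quantile(null_stats, 1.0 - alpha)
    return min(mc_threshold, dist_free_threshold)

# toy usage with placeholder statistic and sampler (in practice, e.g., the
# one-sample MMD estimate of Section V-A and a sampler for the target P)
stat_fn = lambda Y: abs(Y.mean())
sample_from_p = lambda m, rng: rng.normal(size=(m, 1))
print(capped_threshold(stat_fn, sample_from_p, m=500, alpha=0.05, dist_free_threshold=0.25))
```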
2) Weak metrizable property:
To apply Sanov's theorem as in our approach, we find a superset of probability measures for the equivalent acceptance region, which is required to be closed and to converge (in terms of weak convergence) to $P$ in the large sample limit. Without the weak convergence property, the equivalent acceptance region may contain probability measures that are not close to $P$, and the minimum KLD over the superset would be hard to obtain. An example can be found in [16, Theorem 6], where the KSDs are driven to zero by sequences of probability measures not converging to $P$. Consequently, our approach does not establish the optimal type-II error exponent for the linear-time KSD based tests in [44], [46], the linear-time kernel two-sample test in [28], the kernel based B-test in [40], and a pseudometric based two-sample test in [35], due to the lack of the weak metrizable property.
3) Non-i.i.d. data:
We notice that [17] considered non-i.i.d. data by use of wild bootstrap. In general, however,statistical optimality with non-i.i.d. data is difficult to establish even for simple hypothesis testing.
E. Other Asymptotic Criteria
Before ending this section, we would like to discuss two other related asymptotic statistical criteria.
1) Exact Bahadur slope:
We consider the exact Bahadur slope for its close relationship with our asymptotic statistical criterion [71], [72]. In particular, the exact Bahadur slope of a hypothesis test is equivalent to twice the type-I error exponent, subject to a constant constraint on the type-II error probability, that is,
$$\liminf_{m \to \infty} -\frac{1}{m} \log \alpha_m, \quad \text{subject to } \beta_m \leq \beta.$$
The optimal exact Bahadur slope is given by $D(Q\|P)$, assuming that $0 < D(Q\|P) < \infty$. However, universal optimality w.r.t. this criterion cannot be achieved by any one-sample test. To see this, notice that a nonparametric one-sample test, including both the test statistic and the threshold, is constructed only through the sample $y^m$ and the target distribution $P$. Moreover, the type-I error probability is defined w.r.t. the null hypothesis where $y^m \sim P$. Therefore, the type-I error exponent of a nonparametric one-sample test is characterized by $P$ alone and cannot capture the information of the alternative distribution $Q$; it thereby cannot attain the optimum $D(Q\|P)$ in the universal setting.
2) Chernoff index:
The Chernoff index of a hypothesis test is defined as the minimum of its type-I and type-II error exponents, i.e.,
$$\min\left\{\liminf_{m \to \infty} -\frac{1}{m} \log \alpha_m,\;\; \liminf_{m \to \infty} -\frac{1}{m} \log \beta_m\right\}. \tag{11}$$
Assuming that $P$ and $Q$ are mutually absolutely continuous, the maximum Chernoff index is given by the Chernoff information and is achieved by the likelihood ratio test, whose type-I and type-II error exponents are equal [57], [58]. As discussed above for the exact Bahadur slope, the type-I error probability of a nonparametric test is independent of the alternative distribution $Q$, whereas the optimal Chernoff index, the Chernoff information, depends on both $P$ and $Q$. Therefore, the optimal Chernoff index cannot be achieved in the universal setting either.

VI. ASYMPTOTICALLY OPTIMAL TWO-SAMPLE TESTS
In this section, we present our main results on the type-II error exponent for general kernel two-sample tests. As discussed in Section V-B, the first and most important step is to establish an extended Sanov's theorem that works with two sample sequences.
A. Extended Sanov’s Theorem
We begin with our definition of pairwise weak convergence for probability measures: we say $(P_l, Q_l) \to (P, Q)$ weakly if and only if both $P_l \to P$ and $Q_l \to Q$ weakly. Consider $\mathcal{P} \times \mathcal{P}$ endowed with the topology induced by this pairwise weak convergence; it can be verified that this topology is equivalent to the product topology on $\mathcal{P} \times \mathcal{P}$ where each $\mathcal{P}$ is endowed with the topology of weak convergence. An extended version of Sanov's theorem handling two distributions is stated below, which may be of independent interest for other large deviation applications.

Theorem 8 (Extended Sanov's Theorem):
Let $\mathcal{X}$ be a Polish space, $x^n \stackrel{\text{i.i.d.}}{\sim} P$, and $y^m \stackrel{\text{i.i.d.}}{\sim} Q$. Assume $0 < \lim_{n,m \to \infty} \frac{n}{n+m} =: c < 1$. Then for a set $\Gamma \subset \mathcal{P} \times \mathcal{P}$, it holds that
$$\limsup_{n,m \to \infty} -\frac{1}{n+m} \log \mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left((\hat{P}_n, \hat{Q}_m) \in \Gamma\right) \leq \inf_{(R,S) \in \operatorname{int} \Gamma} cD(R\|P) + (1-c)D(S\|Q),$$
$$\liminf_{n,m \to \infty} -\frac{1}{n+m} \log \mathbf{P}_{x^n \sim P,\, y^m \sim Q}\left((\hat{P}_n, \hat{Q}_m) \in \Gamma\right) \geq \inf_{(R,S) \in \operatorname{cl} \Gamma} cD(R\|P) + (1-c)D(S\|Q),$$
where $\operatorname{int} \Gamma$ and $\operatorname{cl} \Gamma$ are the interior and closure of $\Gamma$ w.r.t. the pairwise weak convergence, respectively.

Proof:
See Appendix A. We comment that the extended Sanov's theorem is not apparent from the original one, as existing tools, e.g., Cramér's theorem [58], which is used for proving the original Sanov's theorem, can only handle a single distribution. Our proof first establishes the result for finite sample spaces and then extends it to general Polish spaces, with two simple combinatorial lemmas as prerequisites.
B. Exact and Optimal Error Exponent
With the extended Sanov's theorem and a vanishing threshold, we are ready to establish the type-II error exponent for the kernel two-sample test defined in Section V-B. Our result follows.

Theorem 9:
Assume A1, A2, and $0 < \lim_{n,m \to \infty} \frac{n}{n+m} =: c < 1$. Further assume, under the alternative hypothesis $H_1$, that
$$0 < D^* := \inf_{R \in \mathcal{P}} cD(R\|P) + (1-c)D(R\|Q) < \infty.$$
Given $0 < \alpha < 1$, the kernel test $d_k(\hat{P}_n, \hat{Q}_m) \leq \gamma_{n,m}$, with $d_k(\hat{P}_n, \hat{Q}_m)$ and $\gamma_{n,m}$ being respectively given in Section V-B, is an exponentially consistent level $\alpha$ test with type-II error exponent $D^*$, that is, $\alpha_{n,m} \leq \alpha$ and
$$\liminf_{n,m \to \infty} -\frac{1}{n+m} \log \beta_{n,m} = D^*.$$

Proof:
A proof is given in Appendix B; it is similar to the proof of Theorem 2 but requires some extra treatment of the pairwise weak convergence. Therefore, when $0 < c < 1$, the type-II error probability vanishes as $O(e^{-(n+m)(D^* - \epsilon)})$, where $\epsilon \in (0, D^*)$ is fixed and can be arbitrarily small. The result also shows that the choice of kernel only affects the sub-exponential term in the type-II error probability, provided that the kernel meets the conditions of A2. Now that we have obtained the exact type-II error exponent for the kernel two-sample test, we proceed to derive an upper bound on the optimal type-II error exponent for any (asymptotically) level $\alpha$ test.

Theorem 10:
Let $x^n$, $y^m$, $P$, $Q$, and $D^*$ be defined as in Theorem 9. For a nonparametric two-sample test $\Omega'(n, m)$ which is (asymptotically) level $\alpha$, $0 < \alpha < 1$, its type-II error probability $\beta'_{n,m}$ satisfies
$$\liminf_{n,m \to \infty} -\frac{1}{n+m} \log \beta'_{n,m} \leq D^*, \quad \text{if } 0 < \lim_{n,m \to \infty} \frac{n}{n+m} = c < 1 \text{ and } 0 < D^* < \infty,$$
and
$$\liminf_{n,m \to \infty} -\frac{1}{m} \log \beta'_{n,m} \leq D(P\|Q), \quad \text{if } \lim_{n,m \to \infty} \frac{n}{m} = \infty \text{ and } 0 < D(P\|Q) < \infty.$$

Proof:
Our proof is based on the notion of relative entropy typical set, where the relative entropy is anothername for the KLD [57].We begin with the case where < c < . Let P ′ be such that cD ( P ′ k P ) + (1 − c ) D ( P ′ k Q ) = D ∗ . Considerfirst D ( P ′ k P ) = 0 and D ( P ′ k Q ) = 0 . Since D ∗ is assumed to be finite, we have both D ( P ′ k P ) and D ( P ′ k Q ) being finite. This implies that P ′ is absolutely continuous w.r.t. both P and Q , so the Radon-Nikodym derivatives dP ′ /dP and dP ′ /dQ exist.We can define the following set A n = (cid:26) x n : D ( P ′ k P ) − ǫ ≤ n log dP ′ ( x n ) dP ( x n ) ≤ D ( P ′ k P ) + ǫ (cid:27) , (12)which contains samples so that the log-likelihood ratios are ǫ -close to the true KLD, and is called the relativeentropy typical set. Recall the definition of the KLD: D ( P ′ k P ) = E x ∼ P ′ log( dP ′ ( x ) /dP ( x )) . By law of largenumbers, we have for any given ǫ > , P x n ∼ P ′ ( A n ) ≥ − ǫ/ , for large enough n. Similarly, define B m = (cid:26) y m : D ( P ′ k Q ) − ǫ ≤ m log dP ′ ( y m ) dQ ( y m ) ≤ D ( P ′ k Q ) + ǫ (cid:27) , (13)and we have P y m ∼ P ′ ( B m ) ≥ − ǫ/ , for large enough m. Therefore, we get P x n ∼ P ′ ,y m ∼ P ′ ( A n × B m ) = P x n ∼ P ′ ,y m ∼ P ′ ( x n ∈ A n , y m ∈ B m ) ≥ − ǫ, for large enough n and m. (14)Now consider the type-II error probability for a level α test. First, if a test is level α , we have its acceptanceregion satisfy P x n ∼ P,y m ∼ P (cid:0) Ω ′ ( n, m ) (cid:1) > − α, (15)when the null hypothesis H holds, i.e., when x n and y m are i.i.d. according to a common distribution (which isnot necessarily P ′ ). Then under the alternative hypothesis H : P = Q , we have β ′ n,m = P x n ∼ P,y m ∼ Q (cid:0) Ω ′ ( n, m ) (cid:1) ≥ P x n ∼ P,y m ∼ Q (cid:0) A n × B m ∩ Ω ′ ( n, m ) (cid:1) = Z A n × B m ∩ Ω ′ ( n,m ) dP ( x n ) dQ ( y m ) ( a ) ≥ Z A n × B m ∩ Ω ′ ( n,m ) − n ( D ( P ′ k P )+ ǫ ) − m ( D ( P ′ k Q )+ ǫ ) dP ′ ( x n ) dP ′ ( y m )= 2 − nD ( P ′ k P ) − m ( D ( P ′ k Q ) − ( n + m ) ǫ Z A n × B m ∩ Ω ′ ( n,m ) dP ′ ( x n ) dP ′ ( y m ) ( b ) ≥ − nD ( P ′ k P ) − mD ( P ′ k Q ) − ( n + m ) ǫ (1 − α − ǫ ) , where ( a ) is from the defintion of A n and B m , and ( b ) is due to (14) and (15). Thus, when ǫ is small enough sothat − α − ǫ > , we get lim inf n,m →∞ − n + m log β ′ n,m ≤ lim inf n,m →∞ n + m (cid:0) nD ( P ′ k P ) + m ( D ( P ′ k Q ) + ( n + m ) ǫ (cid:1) = D ∗ + ǫ. (16)If a test is an asymptotic level α test, we can replace α by α + ǫ ′ where ǫ ′ can be made arbitrarily small providedthat n and m are large enough. Thus, the above equation (16) holds, too. Since ǫ can also be arbitrarily small, weconclude that lim inf n,m →∞ − n + m log β ′ n,m ≤ D ∗ . If P ′ = P , then A n contains all x n ∈ X n and the above procedure results in the same result.Finally, the same argument also applies the case with lim n,m →∞ nm = ∞ and we have lim inf n,m →∞ − m log β ′ n,m ≤ D ( P k Q ) . This theorem shows that the kernel test d k ( ˆ P n , ˆ Q m ) ≤ γ n,m is an optimal level α two-sample test, by choosingthe type-II error exponent as the asymptotic performance metric. Moreover, Theorems 9 and 10 together provide away of identifying more universally optimal two-sample tests: • Assuming n = m , the test d u ( ˆ P n , ˆ Q m ) ≤ (4 K/ √ n ) p log( α − ) is also level α [28, Corollary 11]. As k ( · , · ) is finitely bounded by K , its type-II error probability vanishes exponentially at a rate of inf R ∈P D ( R k P ) + D ( R k Q ) , which can be shown by the same argument of Corollary 1. 
• It is also possible to consider a family of kernels for the test statistic [36], [55]. For a given family $\kappa$, the test statistic is $\sup_{k\in\kappa} d_k(\hat P_n, \hat Q_m)$, which also metrizes weak convergence under suitable conditions, e.g., when $\kappa$ consists of finitely many Gaussian kernels [55, Theorem 3.2]. If $K$ remains an upper bound for all $k\in\kappa$, then comparing $\sup_{k\in\kappa} d_k(\hat P_n, \hat Q_m)$ with $\gamma_{n,m}$ defined in Section V-B results in an asymptotically optimal level $\alpha$ test.
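To make the first bullet concrete, the following is a minimal Python sketch of the distribution-free level-$\alpha$ test for $n = m$, assuming a Gaussian kernel (so $K = 1$) and taking $d_u$ to be the unbiased squared-MMD statistic of [28]; the function names and example data are illustrative, not from the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix exp(-||x - y||^2 / bandwidth); bounded by K = 1."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / bandwidth)

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of the squared MMD between the samples X and Y."""
    n, m = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    # Drop diagonal terms so that the estimator is unbiased.
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

def mmd_test(X, Y, alpha=0.05, bandwidth=1.0, K=1.0):
    """Distribution-free level-alpha two-sample test for n = m [28, Corollary 11]."""
    n = len(X)
    assert len(Y) == n, "this particular threshold assumes equal sample sizes"
    threshold = (4.0 * K / np.sqrt(n)) * np.sqrt(np.log(1.0 / alpha))
    stat = mmd2_unbiased(X, Y, bandwidth)
    return stat > threshold, stat, threshold  # True means reject H0: P = Q

# Example: two Gaussian samples with shifted means.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
Y = rng.standard_normal((200, 2)) + 0.75
print(mmd_test(X, Y))
```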
Remark 4: There is a result similar to Lemma 3 for the unbiased two-sample statistic $d_u(\hat P_n, \hat Q_m)$. Assume that $0 < \lim_{n,m\to\infty} n/(n+m) < 1$ and that the kernel $k(\cdot,\cdot)$ is bounded by $K$; then [38, Theorem 12] shows that the statistic $(n+m)\, d_u(\hat P_n, \hat Q_m)$ converges in distribution under the null hypothesis. With a fixed $\alpha$-quantile $\gamma'_\alpha$ of the limiting distribution, the test $(n+m)\, d_u(\hat P_n, \hat Q_m) \le \gamma'_\alpha$ is level $\alpha$ in the large-sample limit. Consequently, the (asymptotic) level $\alpha$ constraint requires the threshold for $d_u(\hat P_n, \hat Q_m)$ to decrease at most as fast as $O(1/(n+m))$.
Remark 5: In [73], a notion of fair alternative is proposed for two-sample testing as the dimension increases: fix $D(P\|Q)$ under the alternative hypothesis for all dimensions. This idea is guided by the fact that the KLD is a fundamental information-theoretic quantity determining the hardness of hypothesis testing problems. This approach, however, does not take into account the impact of sample sizes. In light of our results, a perhaps better choice is to fix $D^*$ in Theorem 9 when the sample sizes grow in the same order (a short sketch below illustrates $D^*$ for discrete distributions). In practice, $D^*$ may be hard to compute, so fixing its upper bound $(1-c)D(P\|Q)$, and hence $D(P\|Q)$, is reasonable.
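For finite alphabets, the quantity $D^* = \inf_{R} cD(R\|P)+(1-c)D(R\|Q)$ (the form used in the proofs of this paper) has a closed form: the minimizer is the normalized geometric mixture $R^* \propto P^c Q^{1-c}$, and $D^* = -\log\sum_x P(x)^c Q(x)^{1-c}$. The sketch below is only an illustration of this definition and is not a routine from the paper.

```python
import numpy as np

def dstar_discrete(P, Q, c):
    """D* = inf_R c*D(R||P) + (1-c)*D(R||Q) for discrete P, Q (natural log).

    The infimum is attained at R* proportional to P^c * Q^(1-c), so
    D* = -log(sum_x P(x)^c * Q(x)^(1-c)).  Illustrative sketch only.
    """
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    Z = np.sum(P**c * Q**(1.0 - c))
    return -np.log(Z)

# Example: the exponent for a fixed pair (P, Q) at several sample-size ratios c.
P = [0.7, 0.2, 0.1]
Q = [0.2, 0.3, 0.5]
for c in (0.3, 0.5, 0.7):
    print(f"c = {c}:  D* = {dstar_discrete(P, Q, c):.4f}")
```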
Remark 6: The main results indicate that the type-II error exponent is independent of the choice of kernel, as long as the kernel is bounded, continuous, and characteristic. This does not contradict previous studies on kernel choice, since the sub-exponential term can dominate in the finite-sample regime. In light of the exponential decay, the results raise interesting connections with the kernel selection strategy in which part of the samples are used as training data to choose a kernel and the remaining samples are used with the selected kernel to compute the test statistic [38], [39]. On the one hand, the training set should not be too small, so that there are enough data to select a good kernel. On the other hand, if the number of samples is large enough that the exponential decay term dominates, directly using the entire sample may already yield a low type-II error probability, provided the kernel is not too poor. We conduct a toy experiment to further illustrate this point in Appendix C. Selecting a proper kernel is an important ongoing research topic, and we refer the reader to existing works on kernel selection, e.g., [38], [39].
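A minimal sketch of the data-splitting workflow just described follows. It is a much-simplified proxy for the strategy of [38], [39]: those works select the kernel by maximizing an estimate of test power (roughly the MMD divided by its standard deviation), whereas here we simply pick the bandwidth with the largest unbiased MMD estimate on the training split, purely to illustrate the splitting step. The function mmd2_unbiased from the earlier sketch is assumed to be in scope; all names and defaults are illustrative.

```python
import numpy as np

def select_bandwidth_then_test(X, Y, bandwidths, split_ratio=0.5, alpha=0.05, K=1.0):
    """Pick a Gaussian-kernel bandwidth on a training split, then test on the rest."""
    n = len(X)
    n_train = int(split_ratio * n)
    X_tr, X_te = X[:n_train], X[n_train:]
    Y_tr, Y_te = Y[:n_train], Y[n_train:]

    # Kernel selection on the training split (simplified criterion, see lead-in).
    scores = [mmd2_unbiased(X_tr, Y_tr, bw) for bw in bandwidths]
    best_bw = bandwidths[int(np.argmax(scores))]

    # Test with the selected kernel on the held-out split (distribution-free threshold, n = m).
    n_te = len(X_te)
    threshold = (4.0 * K / np.sqrt(n_te)) * np.sqrt(np.log(1.0 / alpha))
    stat = mmd2_unbiased(X_te, Y_te, best_bw)
    return best_bw, stat, stat > threshold  # True means reject H0
```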
VII. APPLICATION TO OFF-LINE CHANGE DETECTION
In this section, we apply our results to off-line change detection. To the best of our knowledge, no tests have been shown to be optimal, in terms of either error probability or error exponent, for detecting the change when no prior information on the post-change distribution [5] is available. We only study the case where both the pre- and post-change distributions are unknown; the case with a known pre-change distribution can be handled similarly and is omitted.

Let $z_1, \ldots, z_n \in \mathbb{R}^d$ be an independent sequence of observations. Assume that there is at most one change-point at index $0 < t < n$ which, if it exists, indicates that $z_i \sim P$ for $1 \le i \le t$ and $z_i \sim Q$ for $t+1 \le i \le n$ with $P \neq Q$. The off-line change-point analysis consists of two steps: 1) detect if there is a change-point in the sample sequence; 2) estimate the index $t$ if such a change-point exists. Notice that a method may readily extend to multiple change-point and on-line settings through sliding windows running along the sequence, as in [6], [7], [9].

The first step in the change-point analysis is usually formulated as a hypothesis testing problem:
$$H_0: z_i \sim P, \quad i = 1,\ldots,n,$$
$$H_1: \text{there exists } 0 < t < n \text{ such that } z_i \sim P,\ 1 \le i \le t, \text{ and } z_i \sim Q \neq P,\ t+1 \le i \le n.$$
Let $\hat P_i$ and $\hat Q_{n-i}$ denote the empirical measures of the sequences $z_1,\ldots,z_i$ and $z_{i+1},\ldots,z_n$, respectively. Then an MMD based test can be directly constructed using the maximum partition strategy:
$$\text{decide } H_0, \text{ if } \max_{a_n \le i \le b_n} d_k(\hat P_i, \hat Q_{n-i}) \le \gamma_n,$$
where the maximum is searched in the interval $[a_n, b_n]$ with $a_n > 1$ and $b_n < n$. If the test favors $H_1$, we can proceed to estimate the change-point index by $\arg\max_{a_n \le i \le b_n} d_k(\hat P_i, \hat Q_{n-i})$; a minimal sketch of this two-step procedure is given after the proof of Theorem 11. Here we characterize the performance of detecting the presence of a change for this test, using Theorems 9 and 10. We remark that the assumptions on the search interval and on the change-point index in the following theorem are standard practice for nonparametric change detection [5]-[9].
Theorem 11: Let $0 < u < v < 1$, $a_n/n \to u$ and $b_n/n \to v$ as $n\to\infty$. Under the alternative hypothesis $H_1$, assume that the change-point index $t$ satisfies $u < \lim_{n\to\infty} t/n = c < v$, and that $0 < D^* < \infty$, where $D^*$ is defined w.r.t. $P$, $Q$, and $c$ in Theorem 9. Further assume that the kernel $k$ satisfies A2, with $K > 0$ being an upper bound. Given $0 < \alpha < 1$, set $c_n = \min\{a_n(n-a_n),\ b_n(n-b_n)\}$ and
$$\gamma_n = 2\sqrt{K/a_n} + 2\sqrt{K/(n-b_n)} + \sqrt{K n \log(2n\alpha^{-1})/c_n}.$$
Then the test $\max_{a_n\le i\le b_n} d_k(\hat P_i, \hat Q_{n-i}) \le \gamma_n$ is level $\alpha$ and achieves the optimal type-II error exponent. That is,
$$\alpha_n \le \alpha, \quad \text{and} \quad \liminf_{n\to\infty} -\frac{1}{n}\log\beta_n = D^*,$$
where $\alpha_n$ and $\beta_n$ are the type-I and type-II error probabilities, respectively. Proof:
We first have
$$P_{z^n\sim P}\Big(\max_{a_n\le i\le b_n} d_k(\hat P_i, \hat Q_{n-i}) > \gamma_n\Big) \le \sum_{a_n\le i\le b_n} P_{z^n\sim P}\big(d_k(\hat P_i, \hat Q_{n-i}) > \gamma_n\big).$$
To meet the type-I error constraint, it suffices to make each $P_{z^n}(d_k(\hat P_i, \hat Q_{n-i}) > \gamma_n) \le \alpha/n$ under the null hypothesis $H_0$. This can be verified using Lemma 4 with the choice of $\gamma_n$ in the theorem. To see the optimal type-II error exponent, consider a simpler problem where the possible change-point $t$ is known, i.e., a two-sample problem between $z_1,\ldots,z_t$ and $z_{t+1},\ldots,z_n$. Since $\gamma_n \to 0$ as $n\to\infty$, applying Theorems 9 and 10 establishes the optimal type-II error exponent.
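The sketch below illustrates the two-step procedure of this section: a maximum-partition MMD scan for detection, followed by change-point estimation via the maximizing index. It assumes the plug-in (biased) MMD estimate between the empirical measures, a Gaussian kernel with $K = 1$, and gaussian_kernel from the earlier sketch in scope; the threshold follows the form of Theorem 11 as reconstructed above and, like the function and parameter names, is illustrative only.

```python
import numpy as np

def mmd_biased(X, Y, bandwidth=1.0):
    """Plug-in (biased) MMD estimate d_k between the empirical measures of X and Y."""
    Kxx = gaussian_kernel(X, X, bandwidth).mean()
    Kyy = gaussian_kernel(Y, Y, bandwidth).mean()
    Kxy = gaussian_kernel(X, Y, bandwidth).mean()
    return np.sqrt(max(Kxx + Kyy - 2.0 * Kxy, 0.0))

def detect_change(z, alpha=0.05, u=0.1, v=0.9, bandwidth=1.0, K=1.0):
    """Max-partition MMD scan: decide H1 if the max statistic exceeds gamma_n,
    and report the maximizing index as the change-point estimate."""
    n = len(z)
    a_n, b_n = int(u * n), int(v * n)
    c_n = min(a_n * (n - a_n), b_n * (n - b_n))
    gamma_n = (2.0 * np.sqrt(K / a_n) + 2.0 * np.sqrt(K / (n - b_n))
               + np.sqrt(K * n * np.log(2.0 * n / alpha) / c_n))
    stats = {i: mmd_biased(z[:i], z[i:], bandwidth) for i in range(a_n, b_n + 1)}
    i_star = max(stats, key=stats.get)
    return stats[i_star] > gamma_n, i_star  # (change detected?, estimated index)
```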
VIII. CONCLUSION AND DISCUSSION

In this paper, we have established the statistical optimality of the MMD and KSD based one-sample tests in the spirit of universal hypothesis testing. The KSD based tests are more computationally efficient, as there is no need to draw samples or compute integrals. In contrast, the MMD based tests are more statistically favorable, as they require weaker assumptions and can meet the level constraint for any sample size. Following the same optimality criterion, we further show that the quadratic-time MMD based two-sample tests are also asymptotically optimal in the universal setting, by extending Sanov's theorem to the two-sample case. Our results provide a practically meaningful approach for constructing universally optimal one- and two-sample tests.

A future direction is to generalize the result to a Polish sample space, without the locally compact Hausdorff assumption [50]. Although we cannot establish this result in the current work, we believe that our approach would be feasible once a proper metric is found to meet the two conditions in Theorem 2, since both Sanov's theorem and its extended version are established w.r.t. the Polish space.

APPENDIX A
PROOF OF THE EXTENDED SANOV'S THEOREM
We prove the extended Sanov's theorem for a finite sample space and then extend the result to a general Polish space. Our proof is inspired by [74], which proved Sanov's theorem (w.r.t. a single distribution) in the $\tau$-topology. The prerequisites are two combinatorial lemmas that are standard tools in information theory.

For a positive integer $t$, let $\mathcal{P}_n(t)$ denote the set of probability distributions on $\{1,\ldots,t\}$ of the form $P = \left(\frac{n_1}{n}, \cdots, \frac{n_t}{n}\right)$, with integers $n_1,\ldots,n_t$. Stated below are the two lemmas.
Lemma 6 ([57, Theorem 11.1.1]):
The cardinality $|\mathcal{P}_n(t)| \le (n+1)^t$.
Lemma 7 ([57, Theorem 11.1.4]):
Assume $x^n \overset{\text{i.i.d.}}{\sim} Q$ where $Q$ is a distribution defined on $\{1,\ldots,t\}$. For any $P \in \mathcal{P}_n(t)$, the probability that the empirical distribution $\hat P_n$ of $x^n$ equals $P$ satisfies
$$(n+1)^{-t}\, e^{-nD(P\|Q)} \le P_{x^n\sim Q}(\hat P_n = P) \le e^{-nD(P\|Q)}.$$
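A quick numerical check of Lemma 7 can be carried out by comparing the exact probability of observing a given empirical type with the two bounds; this is a minimal sketch (natural-log KLD, as in the lemma above), and the function names and example numbers are illustrative.

```python
import numpy as np
from math import lgamma, log, exp

def log_multinomial_coef(counts):
    """log of n! / (n_1! ... n_t!)."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)

def check_type_class_bounds(counts, Q):
    """Compare the exact probability that the empirical type equals counts/n under Q
    with the lower and upper bounds of Lemma 7."""
    counts = np.asarray(counts, dtype=float)
    Q = np.asarray(Q, dtype=float)
    n, t = counts.sum(), len(counts)
    P = counts / n
    # Exact probability: multinomial coefficient times prod_i Q_i^{n_i}.
    log_prob = log_multinomial_coef(counts) + np.sum(counts * np.log(Q))
    # KLD D(P || Q) in nats, with the convention 0 log 0 = 0.
    mask = counts > 0
    kld = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    lower, upper = -t * log(n + 1) - n * kld, -n * kld
    return lower <= log_prob <= upper, exp(log_prob), exp(lower), exp(upper)

# Example: n = 30 draws from Q = (0.5, 0.3, 0.2), empirical type (10, 10, 10)/30.
print(check_type_class_bounds([10, 10, 10], [0.5, 0.3, 0.2]))
```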
A. Finite Sample Space

1) Upper bound: Let $t$ denote the cardinality of $\mathcal{X}$. Without loss of generality, assume that $\inf_{(R,S)\in \operatorname{int}\Gamma}\, cD(R\|P) + (1-c)D(S\|Q) < \infty$, which indicates that the open set $\operatorname{int}\Gamma$ is non-empty. As $0 < \lim_{n,m\to\infty} \frac{n}{n+m} = c < 1$, we can find $n_0$ and $m_0$ such that there exists $(P'_n, Q'_m) \in \operatorname{int}\Gamma \cap \mathcal{P}_n(t)\times\mathcal{P}_m(t)$ for all $n > n_0$ and $m > m_0$, and that $cD(P'_n\|P) + (1-c)D(Q'_m\|Q) \to \inf_{(R,S)\in \operatorname{int}\Gamma}\, cD(R\|P) + (1-c)D(S\|Q)$ as $n,m\to\infty$. Then we have, with $n > n_0$ and $m > m_0$,
$$\begin{aligned}
P_{x^n\sim P,\, y^m\sim Q}\big((\hat P_n, \hat Q_m) \in \Gamma\big)
&= \sum_{(R,S)\in \Gamma\cap\mathcal{P}_n(t)\times\mathcal{P}_m(t)} P_{x^n\sim P,\, y^m\sim Q}(\hat P_n = R,\ \hat Q_m = S)\\
&\ge \sum_{(R,S)\in \operatorname{int}\Gamma\cap\mathcal{P}_n(t)\times\mathcal{P}_m(t)} P_{x^n\sim P,\, y^m\sim Q}(\hat P_n = R,\ \hat Q_m = S)\\
&\ge P_{x^n\sim P,\, y^m\sim Q}(\hat P_n = P'_n,\ \hat Q_m = Q'_m)\\
&= P_{x^n\sim P}(\hat P_n = P'_n)\, P_{y^m\sim Q}(\hat Q_m = Q'_m)\\
&\ge (n+1)^{-t}(m+1)^{-t}\, e^{-nD(P'_n\|P)}\, e^{-mD(Q'_m\|Q)},
\end{aligned}$$
where the last inequality is from Lemma 7. It follows that
$$\begin{aligned}
\limsup_{n,m\to\infty} -\frac{1}{n+m}\log P_{x^n\sim P,\, y^m\sim Q}\big((\hat P_n, \hat Q_m)\in\Gamma\big)
&\le \lim_{n,m\to\infty} \frac{1}{n+m}\big(t\log((n+1)(m+1)) + nD(P'_n\|P) + mD(Q'_m\|Q)\big)\\
&= \lim_{n,m\to\infty} \frac{1}{n+m}\big(nD(P'_n\|P) + mD(Q'_m\|Q)\big)\\
&= \inf_{(R,S)\in \operatorname{int}\Gamma}\big(cD(R\|P) + (1-c)D(S\|Q)\big).
\end{aligned}$$
2) Lower bound:
We can write the probability as
$$\begin{aligned}
P_{x^n\sim P,\, y^m\sim Q}\big((\hat P_n, \hat Q_m)\in\Gamma\big)
&= \sum_{(R,S)\in \Gamma\cap\mathcal{P}_n(t)\times\mathcal{P}_m(t)} P_{x^n\sim P}(\hat P_n = R)\, P_{y^m\sim Q}(\hat Q_m = S)\\
&\overset{(a)}{\le} \sum_{(R,S)\in \Gamma\cap\mathcal{P}_n(t)\times\mathcal{P}_m(t)} e^{-nD(R\|P)}\, e^{-mD(S\|Q)}\\
&\overset{(b)}{\le} (n+1)^t (m+1)^t \sup_{(R,S)\in\Gamma} e^{-nD(R\|P)}\, e^{-mD(S\|Q)}, \qquad (17)
\end{aligned}$$
where $(a)$ and $(b)$ are due to Lemmas 7 and 6, respectively. This gives
$$\liminf_{n,m\to\infty} -\frac{1}{n+m}\log P_{x^n\sim P,\, y^m\sim Q}\big((\hat P_n, \hat Q_m)\in\Gamma\big) \ge \inf_{(R,S)\in\Gamma}\, cD(R\|P) + (1-c)D(S\|Q),$$
and hence the lower bound by noting that $\Gamma \subset \operatorname{cl}\Gamma$. Indeed, when the right hand side is finite, the infimum over $\Gamma$ equals the infimum over $\operatorname{cl}\Gamma$, as a result of the continuity of the KLD for finite sample spaces.
B. Polish Sample Space

We now consider the general case with $\mathcal{X}$ being a Polish space. Here $\mathcal{P}$ is the space of probability measures on $\mathcal{X}$ endowed with the topology of weak convergence. To proceed, we introduce another topology on $\mathcal{P}$ and an equivalent definition of the KLD.

$\tau$-topology: Denote by $\Pi$ the set of all partitions $\mathcal{A} = \{A_1,\ldots,A_t\}$ of $\mathcal{X}$ into a finite number of measurable sets $A_i$. For $P\in\mathcal{P}$, $\mathcal{A}\in\Pi$, and $\zeta > 0$, denote
$$U(P, \mathcal{A}, \zeta) = \{P' \in \mathcal{P} : |P'(A_i) - P(A_i)| < \zeta,\ i = 1,\ldots,t\}. \qquad (18)$$
The $\tau$-topology on $\mathcal{P}$ is the coarsest topology in which the mappings $P \mapsto P(F)$ are continuous for every measurable set $F\subset\mathcal{X}$. A base for this topology is the collection of sets (18). We write $\mathcal{P}_\tau$ when we refer to $\mathcal{P}$ endowed with this $\tau$-topology, and write the interior and closure of a set $\Gamma\subset\mathcal{P}_\tau$ as $\operatorname{int}_\tau\Gamma$ and $\operatorname{cl}_\tau\Gamma$, respectively. We remark that the $\tau$-topology is stronger than the weak topology: any open set in $\mathcal{P}$ w.r.t. the weak topology is also open in $\mathcal{P}_\tau$ (see more details in [58], [74]).

The product topology on $\mathcal{P}_\tau\times\mathcal{P}_\tau$ is determined by the base of sets of the form $U(P,\mathcal{A}_1,\zeta_1)\times U(Q,\mathcal{A}_2,\zeta_2)$, for $(P,Q)\in\mathcal{P}_\tau\times\mathcal{P}_\tau$, $\mathcal{A}_1,\mathcal{A}_2\in\Pi$, and $\zeta_1,\zeta_2 > 0$. We still use $\operatorname{int}_\tau(\Gamma)$ and $\operatorname{cl}_\tau(\Gamma)$ to denote the interior and closure of a set $\Gamma\subset\mathcal{P}_\tau\times\mathcal{P}_\tau$. As there always exists $\mathcal{A}\in\Pi$ that refines both $\mathcal{A}_1$ and $\mathcal{A}_2$, any element of the base has an open subset
$$\tilde U(P, Q, \mathcal{A}, \zeta) := U(P,\mathcal{A},\zeta)\times U(Q,\mathcal{A},\zeta) \subset \mathcal{P}_\tau\times\mathcal{P}_\tau, \quad \text{for some } \zeta > 0.$$

Another definition of the KLD: We now introduce an equivalent definition of the KLD,
$$D(P\|Q) = \sup_{\mathcal{A}\in\Pi} \sum_{i=1}^t P(A_i)\log\frac{P(A_i)}{Q(A_i)} = \sup_{\mathcal{A}\in\Pi} D(P^{\mathcal{A}}\|Q^{\mathcal{A}}),$$
with the conventions $0\log\frac{0}{0} = 0$ and $a\log\frac{a}{0} = +\infty$ if $a > 0$. Here $P^{\mathcal{A}}$ denotes the discrete probability measure $(P(A_1),\ldots,P(A_t))$ obtained from the probability measure $P$ and the partition $\mathcal{A}$. It is not hard to verify that for $0 < c < 1$,
$$cD(R\|P) + (1-c)D(S\|Q) = c\sup_{\mathcal{A}_1\in\Pi} D(R^{\mathcal{A}_1}\|P^{\mathcal{A}_1}) + (1-c)\sup_{\mathcal{A}_2\in\Pi} D(S^{\mathcal{A}_2}\|Q^{\mathcal{A}_2}) = \sup_{\mathcal{A}\in\Pi}\big(cD(R^{\mathcal{A}}\|P^{\mathcal{A}}) + (1-c)D(S^{\mathcal{A}}\|Q^{\mathcal{A}})\big), \qquad (19)$$
due to the existence of $\mathcal{A}$ refining both $\mathcal{A}_1$ and $\mathcal{A}_2$ and the log-sum inequality [57]. We are now ready to prove the extended Sanov's theorem for a Polish space.
1) Upper bound:
It suffices to consider only non-empty open $\Gamma$. If $\Gamma$ is open in $\mathcal{P}\times\mathcal{P}$, then $\Gamma$ is also open in $\mathcal{P}_\tau\times\mathcal{P}_\tau$. Therefore, for any $(R,S)\in\Gamma$, there exist a finite (measurable) partition $\mathcal{A} = \{A_1,\ldots,A_t\}$ of $\mathcal{X}$ and $\zeta > 0$ such that
$$\tilde U(R,S,\mathcal{A},\zeta) = \big\{(R',S') : |R(A_i)-R'(A_i)| < \zeta,\ |S(A_i)-S'(A_i)| < \zeta,\ i=1,\ldots,t\big\} \subset \Gamma. \qquad (20)$$
Define the function $T:\mathcal{X}\to\{1,\ldots,t\}$ with $T(x) = i$ for $x\in A_i$. Then $(\hat P_n, \hat Q_m)\in \tilde U(R,S,\mathcal{A},\zeta)$ if and only if the empirical measures $\hat P^{\circ}_n$ of $\{T(x_1),\ldots,T(x_n)\} := T(x^n)$ and $\hat Q^{\circ}_m$ of $\{T(y_1),\ldots,T(y_m)\} := T(y^m)$ lie in
$$U^{\circ}(R,S,\mathcal{A},\zeta) = \{(R^{\circ},S^{\circ}) : |R^{\circ}(i) - R(A_i)| < \zeta,\ |S^{\circ}(i) - S(A_i)| < \zeta,\ i=1,\ldots,t\} \subset \mathbb{R}^t\times\mathbb{R}^t.$$
Thus, we have
$$P_{x^n\sim P,\, y^m\sim Q}\big((\hat P_n,\hat Q_m)\in\Gamma\big) \ge P_{x^n\sim P,\, y^m\sim Q}\big((\hat P_n,\hat Q_m)\in \tilde U(R,S,\mathcal{A},\zeta)\big) = P_{T(x^n),\,T(y^m)}\big((\hat P^{\circ}_n, \hat Q^{\circ}_m)\in U^{\circ}(R,S,\mathcal{A},\zeta)\big).$$
As $T(x)$ and $T(y)$ take values in a finite alphabet and $U^{\circ}(R,S,\mathcal{A},\zeta)$ is open, we obtain
$$\begin{aligned}
\limsup_{n,m\to\infty} -\frac{1}{n+m}\log P_{x^n\sim P,\, y^m\sim Q}\big((\hat P_n,\hat Q_m)\in\Gamma\big)
&\le \limsup_{n,m\to\infty} -\frac{1}{n+m}\log P_{T(x^n),\,T(y^m)}\big((\hat P^{\circ}_n, \hat Q^{\circ}_m)\in U^{\circ}(R,S,\mathcal{A},\zeta)\big)\\
&\le \inf_{(R^{\circ},S^{\circ})\in U^{\circ}(R,S,\mathcal{A},\zeta)} \big(cD(R^{\circ}\|P^{\mathcal{A}}) + (1-c)D(S^{\circ}\|Q^{\mathcal{A}})\big)\\
&= \inf_{(R',S')\in \tilde U(R,S,\mathcal{A},\zeta)} \big(cD(R'^{\mathcal{A}}\|P^{\mathcal{A}}) + (1-c)D(S'^{\mathcal{A}}\|Q^{\mathcal{A}})\big)\\
&\le cD(R\|P) + (1-c)D(S\|Q), \qquad (21)
\end{aligned}$$
where we have used the definition of the KLD in Eq. (19) and $(R,S)\in \tilde U(R,S,\mathcal{A},\zeta)$ in the last inequality. As $(R,S)$ is arbitrary in $\Gamma$, the upper bound is established by taking the infimum over $\Gamma$.
2) Lower bound:
With the notations
$$\Gamma^{\mathcal{A}} = \{(R^{\mathcal{A}}, S^{\mathcal{A}}) : (R,S)\in\Gamma\}, \qquad \Gamma(\mathcal{A}) = \{(R,S) : (R^{\mathcal{A}}, S^{\mathcal{A}})\in\Gamma^{\mathcal{A}}\},$$
where $\mathcal{A} = \{A_1,\ldots,A_t\}$ is a finite partition, it holds that
$$\begin{aligned}
P_{x^n\sim P,\, y^m\sim Q}\big((\hat P_n,\hat Q_m)\in\Gamma\big)
&\le P_{x^n\sim P,\, y^m\sim Q}\big((\hat P_n,\hat Q_m)\in\Gamma(\mathcal{A})\big)\\
&= P_{x^n\sim P,\, y^m\sim Q}\big((\hat P^{\mathcal{A}}_n,\hat Q^{\mathcal{A}}_m)\in\Gamma^{\mathcal{A}}\cap\mathcal{P}_n(t)\times\mathcal{P}_m(t)\big)\\
&\le (n+1)^t(m+1)^t \max_{(R^{\circ},S^{\circ})\in\Gamma^{\mathcal{A}}\cap\mathcal{P}_n(t)\times\mathcal{P}_m(t)} P_{x^n\sim P,\, y^m\sim Q}\big(\hat P^{\mathcal{A}}_n = R^{\circ},\ \hat Q^{\mathcal{A}}_m = S^{\circ}\big)\\
&\le (n+1)^t(m+1)^t \exp\Big(-\inf_{(R,S)\in\Gamma}\big(nD(R^{\mathcal{A}}\|P^{\mathcal{A}}) + mD(S^{\mathcal{A}}\|Q^{\mathcal{A}})\big)\Big),
\end{aligned}$$
where the last two inequalities are from Lemmas 6 and 7. As the above holds for any $\mathcal{A}\in\Pi$, Eq. (19) indicates
$$\limsup_{n,m\to\infty} \frac{1}{n+m}\log P_{x^n\sim P,\, y^m\sim Q}\big((\hat P_n,\hat Q_m)\in\Gamma\big) \le \inf_{\mathcal{A}}\Big(-\inf_{(R,S)\in\Gamma}\big(cD(R^{\mathcal{A}}\|P^{\mathcal{A}}) + (1-c)D(S^{\mathcal{A}}\|Q^{\mathcal{A}})\big)\Big) = -\sup_{\mathcal{A}}\inf_{(R,S)\in\Gamma}\big(cD(R^{\mathcal{A}}\|P^{\mathcal{A}}) + (1-c)D(S^{\mathcal{A}}\|Q^{\mathcal{A}})\big).$$
To obtain the lower bound, it remains to show
$$\sup_{\mathcal{A}}\inf_{(R,S)\in\Gamma}\big(cD(R^{\mathcal{A}}\|P^{\mathcal{A}}) + (1-c)D(S^{\mathcal{A}}\|Q^{\mathcal{A}})\big) \ge \inf_{(R,S)\in\operatorname{cl}\Gamma}\big(cD(R\|P) + (1-c)D(S\|Q)\big).$$
Assuming, without loss of generality, that the left hand side is finite, we only need to show
$$\operatorname{cl}\Gamma\cap B(P,Q,\eta) \neq \emptyset, \quad \text{whenever } \eta > \sup_{\mathcal{A}}\inf_{(R,S)\in\Gamma}\big(cD(R^{\mathcal{A}}\|P^{\mathcal{A}}) + (1-c)D(S^{\mathcal{A}}\|Q^{\mathcal{A}})\big).$$
Here $B(P,Q,\eta)$ is the divergence ball
$$B(P,Q,\eta) = \{(R,S) : cD(R\|P) + (1-c)D(S\|Q) \le \eta\},$$
which is compact in $\mathcal{P}\times\mathcal{P}$ w.r.t. the weak topology, due to the lower semi-continuity of $D(\cdot\|P)$ and $D(\cdot\|Q)$ as well as the fact that $0 < c < 1$.

To this end, we first show the following:
$$\operatorname{cl}\Gamma = \bigcap_{\mathcal{A}} \operatorname{cl}\Gamma(\mathcal{A}). \qquad (22)$$
The inclusion $\subset$ is straightforward since $\Gamma \subset \Gamma(\mathcal{A})$. The reverse means that if $(R,S)\in\operatorname{cl}\Gamma(\mathcal{A})$ for each $\mathcal{A}$, then any neighborhood of $(R,S)$ w.r.t. weak convergence intersects $\Gamma$. To verify this, let $O(R,S)$ be a neighborhood of $(R,S)$ w.r.t. weak convergence; then there exists $\tilde U(R,S,\mathcal{B},\zeta)\subset O(R,S)$ over a finite partition $\mathcal{B}$, as $O(R,S)$ is also open in $\mathcal{P}_\tau\times\mathcal{P}_\tau$. Furthermore, the partition $\mathcal{B}$ can be chosen to refine $\mathcal{A}$ so that $\operatorname{cl}\Gamma(\mathcal{B})\subset\operatorname{cl}\Gamma(\mathcal{A})$. As the $\tau$-topology is stronger than the weak topology, a closed set in $\mathcal{P}_\tau\times\mathcal{P}_\tau$ is closed in $\mathcal{P}\times\mathcal{P}$, and hence $\operatorname{cl}\Gamma(\mathcal{B})\subset\operatorname{cl}_\tau\Gamma(\mathcal{B})$. That $(R,S)\in\operatorname{cl}_\tau\Gamma(\mathcal{B})$ implies that there exists $(R',S')\in \tilde U(R,S,\mathcal{B},\zeta)\cap\Gamma(\mathcal{B})$. By the definition of $\Gamma(\mathcal{B})$, we can also find $(\tilde R,\tilde S)\in\Gamma$ such that $\tilde R(B_i) = R'(B_i)$ and $\tilde S(B_i) = S'(B_i)$ for each $B_i\in\mathcal{B}$, and hence $(\tilde R,\tilde S)\in \tilde U(R,S,\mathcal{B},\zeta)$. In summary, we have $(\tilde R,\tilde S)\in \tilde U(R,S,\mathcal{B},\zeta)\subset O(R,S)$ and $(\tilde R,\tilde S)\in\Gamma$. Therefore, $\Gamma\cap O(R,S)\neq\emptyset$ and the claim follows.

Next we show that, for each partition $\mathcal{A}$,
$$\Gamma(\mathcal{A})\cap B(P,Q,\eta)\neq\emptyset. \qquad (23)$$
By Eq. (19), there exists $(\tilde P,\tilde Q)\in\Gamma$ such that $cD(\tilde P^{\mathcal{A}}\|P^{\mathcal{A}}) + (1-c)D(\tilde Q^{\mathcal{A}}\|Q^{\mathcal{A}}) \le \eta$. For such $(\tilde P,\tilde Q)$, we can construct $(P',Q')\in\Gamma(\mathcal{A})$ as
$$P'(F) = \sum_{i=1}^t \frac{\tilde P(A_i)}{P(A_i)}\, P(F\cap A_i), \qquad Q'(F) = \sum_{i=1}^t \frac{\tilde Q(A_i)}{Q(A_i)}\, Q(F\cap A_i),$$
for any measurable subset $F\subset\mathcal{X}$. If $P(A_i) = 0$ (resp. $Q(A_i) = 0$), and hence $\tilde P(A_i) = 0$ (resp. $\tilde Q(A_i) = 0$) since $D(\tilde P^{\mathcal{A}}\|P^{\mathcal{A}}) < \infty$ (resp. $D(\tilde Q^{\mathcal{A}}\|Q^{\mathcal{A}}) < \infty$), for some $i$, the corresponding term above is set equal to $0$. Then $(P',Q')$ belongs to $\Gamma(\mathcal{A})$ and also lies in $B(P,Q,\eta)$.
The latter is because $D(P'\|P) = D(\tilde P^{\mathcal{A}}\|P^{\mathcal{A}})$ and $D(Q'\|Q) = D(\tilde Q^{\mathcal{A}}\|Q^{\mathcal{A}})$: one can verify that any $\mathcal{B}$ that refines $\mathcal{A}$ satisfies
$$D(P'^{\mathcal{B}}\|P^{\mathcal{B}}) = D(\tilde P^{\mathcal{A}}\|P^{\mathcal{A}}), \qquad D(Q'^{\mathcal{B}}\|Q^{\mathcal{B}}) = D(\tilde Q^{\mathcal{A}}\|Q^{\mathcal{A}}).$$
For any finite collection of partitions $\mathcal{A}_i\in\Pi$ and $\mathcal{A}\in\Pi$ refining each $\mathcal{A}_i$, each $\Gamma(\mathcal{A}_i)$ contains $\Gamma(\mathcal{A})$. This implies that
$$\bigcap_{i=1}^r \big(\Gamma(\mathcal{A}_i)\cap B(P,Q,\eta)\big) \neq \emptyset, \quad \text{for any finite } r.$$
Finally, the set $\operatorname{cl}\Gamma(\mathcal{A})\cap B(P,Q,\eta)$ for any $\mathcal{A}$ is compact due to the compactness of $B(P,Q,\eta)$, and any finite collection of these sets has non-empty intersection. By the finite intersection property, the intersection of all of them is also non-empty. This completes the proof.

APPENDIX B
PROOF OF THEOREM 9

Under the assumptions of the theorem, $d_k$ metrizes weak convergence over $\mathcal{P}$. That $\alpha_{n,m}\le\alpha$ can be verified by Lemma 4, and we only need to show that the type-II error probability $\beta_{n,m}$ vanishes exponentially as $n$ and $m$ scale. For convenience, we write the error exponent of $\beta_{n,m}$ as $\beta$.

We first show $\beta \ge D^*$. With a fixed $\gamma > 0$, we have $\gamma_{n,m} \le \gamma$ for sufficiently large $n$ and $m$. Therefore,
$$\begin{aligned}
\beta &= \liminf_{n,m\to\infty} -\frac{1}{n+m}\log P_{x^n\sim P,\, y^m\sim Q}\big(d_k(\hat P_n,\hat Q_m)\le\gamma_{n,m}\big)\\
&\ge \liminf_{n,m\to\infty} -\frac{1}{n+m}\log P_{x^n\sim P,\, y^m\sim Q}\big(d_k(\hat P_n,\hat Q_m)\le\gamma\big)\\
&\ge \inf_{(R,S):\, d_k(R,S)\le\gamma}\big(cD(R\|P) + (1-c)D(S\|Q)\big) := D^*_\gamma, \qquad (24)
\end{aligned}$$
where the last inequality is from the extended Sanov's theorem and the fact that $d_k$ metrizes weak convergence of $\mathcal{P}$, so that $\{(R,S): d_k(R,S)\le\gamma\}$ is closed in the product topology on $\mathcal{P}\times\mathcal{P}$. Since $\gamma > 0$ can be arbitrarily small, we have $\beta \ge \lim_{\gamma\to 0^+} D^*_\gamma$, where the limit on the right hand side must exist as $D^*_\gamma$ is positive, non-decreasing as $\gamma$ decreases, and bounded by $D^*$, which is assumed to be finite. It then suffices to show $\lim_{\gamma\to 0^+} D^*_\gamma = D^*$.

To this end, let $(R_\gamma, S_\gamma)$ be such that $d_k(R_\gamma, S_\gamma)\le\gamma$ and $cD(R_\gamma\|P) + (1-c)D(S_\gamma\|Q) = D^*_\gamma$. Notice that $R_\gamma$ and $S_\gamma$ must lie in
$$\mathcal{W} := \left\{W : D(W\|P)\le \frac{D^*}{c},\ D(W\|Q)\le \frac{D^*}{1-c}\right\},$$
for otherwise $D^*_\gamma > D^*$. We remark that $\mathcal{W}$ is a compact set in $\mathcal{P}$, as a result of the lower semi-continuity of the KLD w.r.t. the weak topology on $\mathcal{P}$ [58], [60]. Existence of such a pair $(R_\gamma, S_\gamma)$ is a consequence of the facts that $\{(R,S): d_k(R,S)\le\gamma\}$ is closed and convex, and that both $D(\cdot\|P)$ and $D(\cdot\|Q)$ are convex functions [60].

Assume that $D^*$ cannot be achieved, so that we can write
$$\lim_{\gamma\to 0^+} D^*_\gamma = D^* - \epsilon, \qquad (25)$$
for some $\epsilon > 0$. By the definition of lower semi-continuity, there exists a $\kappa_W > 0$ for each $W\in\mathcal{W}$ such that
$$cD(R\|P) + (1-c)D(S\|Q) \ge cD(W\|P) + (1-c)D(W\|Q) - \frac{\epsilon}{2} \ge D^* - \frac{\epsilon}{2}, \qquad (26)$$
whenever $R$ and $S$ are both from $S_W = \{R : d_k(R,W) < \kappa_W\}$. Here the last inequality comes from the definition of $D^*$ given in Theorem 9. To find a contradiction, define $S'_W = \{R : d_k(R,W) < \kappa_W/2\}$. Since $S'_W$ is open and $\bigcup_W S'_W$ covers $\mathcal{W}$, the compactness of $\mathcal{W}$ implies that there exist finitely many $S'_W$'s, denoted by $S'_{W_1},\ldots,S'_{W_N}$, covering $\mathcal{W}$. Define $\kappa^* = \min_{i=1,\ldots,N} \kappa_{W_i} > 0$. Now let $\gamma < \kappa^*/2$, as $\gamma$ can be made arbitrarily small. Since $\bigcup_{i=1}^N S'_{W_i}$ covers $\mathcal{W}$, we can find a $W_i$ with $R_\gamma\in S'_{W_i}\subset S_{W_i}$. Thus, it holds that
$$d_k(S_\gamma, W_i) \le d_k(S_\gamma, R_\gamma) + d_k(R_\gamma, W_i) < \kappa_{W_i}.$$
That is, $S_\gamma$ also lies in $S_{W_i}$. By Eq. (26) we get $cD(R_\gamma\|P) + (1-c)D(S_\gamma\|Q) \ge D^* - \epsilon/2$. However, by our assumption in Eq. (25), it should hold that $cD(R_\gamma\|P) + (1-c)D(S_\gamma\|Q) \le D^* - \epsilon$, a contradiction. Therefore, $\beta \ge D^*$.

The other direction can be simply seen from the optimal type-II error exponent in Theorem 10.
Alternatively, we can use the Chernoff-Stein lemma in a similar manner as in the proof of Theorem 6. Let $P'$ be such that $cD(P'\|P) + (1-c)D(P'\|Q) = D^*$. Such a $P'$ exists because $0 < D^* < \infty$ and $D(\cdot\|P)$ and $D(\cdot\|Q)$ are convex. That $D^*$ is bounded implies that both $D(P'\|P)$ and $D(P'\|Q)$ are finite. We have
$$\begin{aligned}
\beta_{n,m} &= P_{x^n\sim P,\, y^m\sim Q}\big(d_k(\hat P_n,\hat Q_m)\le\gamma_{n,m}\big)\\
&\overset{(a)}{\ge} P_{x^n\sim P,\, y^m\sim Q}\big(d_k(\hat P_n,P') + d_k(\hat Q_m,P')\le\gamma_{n,m}\big)\\
&\overset{(b)}{\ge} P_{x^n\sim P,\, y^m\sim Q}\big(d_k(\hat P_n,P')\le\gamma_n,\ d_k(\hat Q_m,P')\le\gamma_m\big)\\
&= P_{x^n\sim P}\big(d_k(\hat P_n,P')\le\gamma_n\big)\, P_{y^m\sim Q}\big(d_k(\hat Q_m,P')\le\gamma_m\big),
\end{aligned}$$
where $(a)$ and $(b)$ are from the triangle inequality of the metric $d_k$, and we pick $\gamma_n = \sqrt{K/n}\,(1+\sqrt{-\log\alpha})$ and $\gamma_m = \sqrt{K/m}\,(1+\sqrt{-\log\alpha})$ so that $\gamma_{n,m} > \gamma_n + \gamma_m$. Then Lemma 3 implies $P_{x^n\sim P'}\big(d_k(\hat P_n,P')\le\gamma_n\big) > 1-\alpha$.

For now assume that $D(P'\|P) > 0$ and $D(P'\|Q) > 0$. We can regard $\{x^n : d_k(\hat P_n,P')\le\gamma_n\}$ as an acceptance region for testing $H_0: x^n\sim P'$ against $H_1: x^n\sim P$. Clearly, this test performs no better than the optimal level $\alpha$ test for this simple hypothesis testing problem in terms of the type-II error probability. Therefore, the Chernoff-Stein lemma implies
$$\liminf_{n\to\infty} -\frac{1}{n}\log P_{x^n\sim P}\big(d_k(\hat P_n,P')\le\gamma_n\big) \le D(P'\|P). \qquad (27)$$
Analogously, we have
$$\liminf_{m\to\infty} -\frac{1}{m}\log P_{y^m\sim Q}\big(d_k(\hat Q_m,P')\le\gamma_m\big) \le D(P'\|Q). \qquad (28)$$
Now assume without loss of generality that $D(P'\|P) = 0$, i.e., $P' = P$. Then $D(P'\|Q) > 0$ under the alternative hypothesis $H_1: P\neq Q$, and Eq. (28) still holds. Using Lemma 3, we have $P_{x^n\sim P}\big(d_k(\hat P_n,P')\le\gamma_n\big) > 1-\alpha$, which gives a zero exponent; therefore, Eq. (27) also holds with $P' = P$. As $\lim_{n,m\to\infty}\frac{n}{n+m} = c$, we conclude that
$$\beta = \liminf_{n,m\to\infty} -\frac{1}{n+m}\log\beta_{n,m} \le D^*.$$
The proof is complete.
APPENDIX C
EXPERIMENTS: KERNEL CHOICE VS. SAMPLE SIZE
Following the discussion in Remark 6, we conduct a toy experiment to investigate how kernel choice and sample size affect the type-II error probability of the test. We consider Gaussian kernels determined by their bandwidth $\gamma$: $k(x,y) = \exp(-\|x-y\|^2/\gamma)$. The work [39] uses part of the samples as training data to select the bandwidth, which we refer to as the trained bandwidth in this paper. The estimated MMD is then computed using the trained bandwidth and the remaining samples.

We take a setting similar to [39], using the Blobs dataset [38]: $P$ is a grid of 2D standard Gaussians with equal spacing between neighboring centers; $Q$ is laid out identically, but with correlation $\frac{\epsilon-1}{\epsilon+1}$ between the coordinates. Here we pick $\epsilon = 6$ and generate $n = m = 720$ samples from each distribution; an example of these samples is shown in Fig. 1. For our purpose, we pick splitting ratios $r = 0.25$ and $r = 0.5$ for computing the trained bandwidth. Correspondingly, there are $n = m = 540$ and $n = m = 360$ samples used to calculate the test statistic, respectively. Under the level constraint $\alpha$, we report in Fig. 2 the type-II error probabilities over different bandwidths, averaged over repeated trials, for each case with $n = m \in \{360, 540, 720\}$. The unbiased test statistic $d_u(\hat P_n, \hat Q_m)$ is used, and the test threshold takes the minimum of the distribution-free threshold and the bootstrap one obtained by permutations [28]. We also mark the trained bandwidths corresponding to the respective sample sizes in the figure (red star marker).

Fig. 1. An example of samples drawn from distributions P (blue dot) and Q (orange plus sign).

Fig. 2. Experiment results for kernel choice vs. sample size (type-II error vs. bandwidth, for n = 360, 540, 720). Red star denotes the trained bandwidth.

Fig. 2 verifies that the trained bandwidth is close to the optimal one in terms of the type-II error probability. Moreover, it indicates that a large range of bandwidths leads to lower or comparable error probabilities if we directly use the entire sample for testing. As the sample size increases, the exponential decay term in the type-II error probability becomes dominant and the effect of kernel choice diminishes. Since the desired range of bandwidths is not known in advance, an interesting question is when we should split data for kernel selection and what a proper splitting ratio is.
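The following minimal sketch reproduces the flavor of this setup: it draws a Blobs-style sample pair and sweeps Gaussian-kernel bandwidths. The grid size, spacing, random seed, and bandwidth grid are illustrative assumptions (the exact experimental values are not reproduced here), and mmd2_unbiased from the earlier sketch is assumed to be in scope.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_blobs(n, eps, grid=5, spacing=10.0, rng=rng):
    """Draw n points each from a Blobs-style pair (P, Q): a grid of 2D Gaussians,
    where Q uses correlation rho = (eps - 1) / (eps + 1) between the coordinates."""
    centers = spacing * np.stack(np.meshgrid(np.arange(grid), np.arange(grid)), -1).reshape(-1, 2)
    rho = (eps - 1.0) / (eps + 1.0)
    cov_q = np.array([[1.0, rho], [rho, 1.0]])
    idx_p = rng.integers(len(centers), size=n)
    idx_q = rng.integers(len(centers), size=n)
    X = centers[idx_p] + rng.standard_normal((n, 2))                      # P: identity covariance
    Y = centers[idx_q] + rng.multivariate_normal([0, 0], cov_q, size=n)   # Q: correlated blobs
    return X, Y

# Sweep bandwidths and record the unbiased MMD statistic on the full sample.
X, Y = sample_blobs(n=720, eps=6)
for gamma in np.logspace(-1, 3, 9):
    print(f"bandwidth {gamma:8.2f}  MMD_u^2 {mmd2_unbiased(X, Y, gamma):.5f}")
```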
REFERENCES

[1] S. Zhu, B. Chen, P. Yang, and Z. Chen, "Universal hypothesis testing with kernels: Asymptotically optimal tests for goodness of fit," in the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), Okinawa, Japan, 2019.
[2] Y. Li, S. Nitinawarat, and V. V. Veeravalli, "Universal outlier hypothesis testing,"
IEEE Trans. Inf. Theory , vol. 60, no. 7, pp. 4066–4082,2014.[3] S. Zou, Y. Liang, H. V. Poor, and X. Shi, “Nonparametric detection of anomalous data streams,”
IEEE Trans. Signal Process. , vol. 65,no. 21, pp. 5785–5797, 2017.[4] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,”
ACM Computing Surveys (CSUR) , vol. 41, no. 3, p. 15,2009.[5] M. Basseville and I. V. Nikiforov,
Detection of Abrupt Changes: Theory and Application . Englewood Cliffs: Prentice Hall, 1993.[6] F. Desobry, M. Davy, and C. Doncarli, “An online kernel change detection algorithm,”
IEEE Trans. Signal Process. , vol. 53, no. 8, pp.2961–2974, 2005.[7] Z. Harchaoui, E. Moulines, and F. R. Bach, “Kernel change-point analysis,” in
Advances in Neural Information Processing Systems ,2009.[8] B. James, K. L. James, and D. Siegmund, “Tests for a change-point,”
Biometrika , vol. 74, no. 1, pp. 71–83, 1987.[9] S. Li, Y. Xie, H. Dai, and L. Song, “M-statistic for kernel change-point detection,” in
Advances in Neural Information ProcessingSystems , 2015, pp. 3366–3374.[10] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Sch¨olkopf, and A. J. Smola, “Integrating structured biological data bykernel maximum mean discrepancy,”
Bioinformatics , vol. 22, no. 14, pp. e49–e57, Jul. 2006.[11] H. Wang, E.-H. Yang, Z. Zhao, and W. Zhang, “Spectrum sensing in cognitive radio using goodness of fit testing,”
IEEE Trans. WirelessCommun. , vol. 8, no. 11, pp. 5427–5430, 2009.[12] D. Denkovski, V. Atanasovski, and L. Gavrilovska, “HOS based goodness-of-fit testing signal detection,”
IEEE Communications Letters ,vol. 16, no. 3, pp. 310–313, 2012.[13] J. R. Lloyd and Z. Ghahramani, “Statistical model criticism using kernel two sample tests,” in
Advances in Neural Information ProcessingSystems , 2015.[14] B. Kim, R. Khanna, and O. O. Koyejo, “Examples are not enough, learn to criticize! Criticism for interpretability,” in
Advances inNeural Information Processing Systems , 2016.[15] J. Gorham and L. Mackey, “Measuring sample quality with Stein’s method,” in
Advances in Neural Information Processing Systems ,2015. [16] ——, “Measuring sample quality with kernels,” in International Conference on Machine Learning , 2017.[17] K. Chwialkowski, H. Strathmann, and A. Gretton, “A kernel test of goodness of fit,” in
International Conference on Machine Learning ,2016.[18] A. Kolmogorov, “Sulla determinazione empirica di una legge di distribuzione,”
Giornale dell’ Istituto Italiano degli Attuari , vol. 4, pp.83–91, 1933.[19] N. Smirnov, “Table for estimating the goodness of fit of empirical distributions,”
Annals of Mathematical Statistics , vol. 19, no. 2, pp.279–281, 1948.[20] A. Justel, D. Pe˜na, and R. Zamar, “A multivariate Kolmogorov-Smirnov test of goodness of fit,”
Statistics & Probability Letters , vol. 35,no. 3, pp. 251–259, 1997.[21] L. Gy¨orfi and E. C. Van Der Meulen, “A consistent goodness of fit test based on the total variation distance,” in
NonparametricFunctional Estimation and Related Topics . Springer, 1991, pp. 631–645.[22] E. Del Barrio, J. A. Cuesta-Albertos, C. Matr´an, and J. M. Rodr´ıguez-Rodr´ıguez, “Tests of goodness of fit based on the L2-Wassersteindistance,”
Annals of Statistics , pp. 1230–1239, 1999.[23] E. Del Barrio, E. Gin´e, and F. Utzet, “Asymptotics for L2 functionals of the empirical quantile process, with applications to tests offit based on weighted Wasserstein distances,”
Bernoulli , vol. 11, no. 1, pp. 131–189, 2005.[24] A. Ramdas, N. Trillos, and M. Cuturi, “On Wasserstein two-sample testing and related families of nonparametric tests,”
Entropy , vol. 19,no. 2, p. 47, 2017.[25] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Estimating divergence functionals and the likelihood ratio by convex risk minimization,”
IEEE Trans. Inf. Theory , vol. 56, no. 11, pp. 5847–5861, 2010.[26] T. Kanamori, T. Suzuki, and M. Sugiyama, “ f -divergence estimation and two-sample homogeneity test under semiparametric density-ratio models,” IEEE Trans. Inf. Theory , vol. 58, no. 2, pp. 708–720, 2012.[27] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” in
Advances in Neural Information Processing Systems ,2013, pp. 2292–2300.[28] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch¨olkopf, and A. Smola, “A kernel two-sample test,”
Journal of Machine LearningResearch , vol. 13, no. Mar, pp. 723–773, 2012.[29] A. Smola, A. Gretton, L. Song, and B. Sch¨olkopf, “A Hilbert space embedding for distributions,” in
International Conference onAlgorithmic Learning Theory , 2007.[30] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schlkopf, “Kernel mean embedding of distributions: A review and beyond,”
Foundations and Trends in Machine Learning , vol. 10, no. 1-2, pp. 1–141, 2017.[31] L. Baringhaus and C. Franz, “On a new multivariate two-sample test,”
Journal of Multivariate Analysis , vol. 88, no. 1, pp. 190–206,2004.[32] G. J. Sz´ekely and M. L. Rizzo, “Testing for equal distributions in high dimension,”
InterStat , 2004.[33] R. Lyons, “Distance covariance in metric spaces,”
The Annals of Probability , vol. 41, no. 5, pp. 3284–3305, 2013.[34] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu, “Equivalence of distance-based and RKHS-based statistics in hypothesistesting,”
The Annals of Statistics , vol. 41, no. 5, pp. 2263–2291, 2013.[35] K. P. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton, “Fast two-sample testing with analytic representations of probabilitymeasures,” in
Advances in Neural Information Processing Systems , 2015.[36] K. Fukumizu, A. Gretton, G. R. Lanckriet, B. Sch¨olkopf, and B. K. Sriperumbudur, “Kernel choice and classifiability for RKHSembeddings of probability distributions,” in
Advances in Neural Information Processing Systems , 2009.[37] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur, “A fast, consistent kernel two-sample test,” in
Advances in NeuralInformation Processing Systems , 2009.[38] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur, “Optimal kernel choicefor large-scale two-sample tests,” in
Advances in Neural Information Processing Systems , 2012.[39] D. Sutherland, H. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton, “Generative models and model criticism viaoptimized maximum mean discrepancy,” in
International Conference on Learning Representations , 2017.[40] W. Zaremba, A. Gretton, and M. Blaschko, “B-test: A non-parametric, low variance kernel two-sample test,” in
Advances in NeuralInformation Processing Systems , 2013.[41] Y. Altun and A. Smola, “Unifying divergence minimization and statistical inference via convex duality,” in the 19th Annual Conferenceon Learning Theory , 2006.[42] Z. Szab´o, A. Gretton, B. P´oczos, and B. Sriperumbudur, “Two-stage sampled learning theory on distributions,” in
Artificial Intelligenceand Statistics , 2015.[43] Z. Szab´o, B. K. Sriperumbudur, B. P´oczos, and A. Gretton, “Learning theory for distribution regression,”
The Journal of MachineLearning Research , vol. 17, no. 1, pp. 5272–5311, 2016.[44] Q. Liu, J. Lee, and M. Jordan, “A kernelized Stein discrepancy for goodness-of-fit tests,” in
International Conference on MachineLearning , 2016. [45] C. J. Oates, M. Girolami, and N. Chopin, “Control functionals for Monte Carlo integration,” Journal of the Royal Statistical Society:Series B (Statistical Methodology) , vol. 79, no. 3, pp. 695–718, 2017.[46] W. Jitkrittum, W. Xu, Z. Szabo, K. Fukumizu, and A. Gretton, “A linear-time kernel goodness-of-fit test,” in
Advances in NeuralInformation Processing Systems , 2017.[47] W. Hoeffding, “Asymptotically optimal tests for multinomial distributions,”
The Annals of Mathematical Statistics , pp. 369–401, 1965.[48] J. Unnikrishnan, D. Huang, S. P. Meyn, A. Surana, and V. V. Veeravalli, “Universal and composite hypothesis testing via mismatcheddivergence,”
IEEE Trans. Inf. Theory , vol. 57, no. 3, pp. 1587–1603, 2011.[49] G. Tusn´ady, “On asymptotically optimal tests,”
The Annals of Statistics , vol. 5, no. 2, pp. 385–393, 1977.[50] O. Zeitouni and M. Gutman, “On universal hypotheses testing via large deviations,”
IEEE Trans. Inf. Theory , vol. 37, no. 2, pp.285–290, 1991.[51] P. Yang and B. Chen, “Robust Kullback-Leibler divergence and universal hypothesis testing for continuous distributions,”
IEEE Trans.Inf. Theory , vol. 65, no. 4, pp. 2360–2374, 2019.[52] M. Feder and N. Merhav, “Universal composite hypothesis testing: A competitive minimax approach,”
IEEE Trans Inf Theory , vol. 48,no. 6, pp. 1504–1517, 2002.[53] K. Balasubramanian, T. Li, and M. Yuan, “On the optimality of kernel-embedding based goodness-of-fit tests,” arXiv preprintarxiv:1709.08148 , 2017.[54] C.-J. Simon-Gabriel and B. Sch¨olkopf, “Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metricson distributions,”
The Journal of Machine Learning Research , vol. 19, no. 1, pp. 1708–1736, 2018.[55] B. Sriperumbudur, “On the optimal estimation of probability measures in weak and strong topologies,”
Bernoulli , vol. 22, no. 3, pp.1839–1893, 08 2016.[56] G. Casella and R. Berger,
Statistical Inference . Duxbury Thomson Learning, 2002.[57] T. M. Cover and J. A. Thomas,
Elements of Information Theory , 2nd ed. New York: Wiley, 2006.[58] A. Dembo and O. Zeitouni,
Large Deviations Techniques and Applications . New York: Springer, 2009.[59] I. N. Sanov, “On the probability of large deviations of random variables,” North Carolina State University. Dept. of Statistics, Tech.Rep., 1958.[60] T. Van Erven and P. Harremos, “R´enyi divergence and Kullback-Leibler divergence,”
IEEE Trans. Inf. Theory , vol. 60, no. 7, pp.3797–3820, 2014.[61] A. Berlinet and C. Thomas-Agnan,
Reproducing Kernel Hilbert Spaces in Probability and Statistics . Springer Science & BusinessMedia, 2011.[62] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Sch¨olkopf, and G. R. Lanckriet, “On the empirical estimation of integral probabilitymetrics,”
Electronic Journal of Statistics , vol. 6, pp. 1550–1599, 2012.[63] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Sch¨olkopf, and G. R. Lanckriet, “Hilbert space embeddings and metrics on probabilitymeasures,”
Journal of Machine Learning Research , vol. 11, no. Apr, pp. 1517–1561, 2010.[64] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in
International Conference on MachineLearning , 2017.[65] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. P´oczos, “MMD GAN: Towards deeper understanding of moment matching network,”in
Advances in Neural Information Processing Systems , 2017.[66] C. Carmeli, E. De Vito, A. Toigo, and V. Umanit´a, “Vector valued reproducing kernel Hilbert spaces and universality,”
Analysis andApplications , vol. 8, no. 01, pp. 19–61, 2010.[67] J. Dedecker, P. Doukhan, G. Lang, J. Leon, S. Louhichi, and C. Prieur,
Weak Dependence: With Examples and Applications . NewYork: Springer, 2007.[68] M. A. Arcones and E. Gine, “On the bootstrap of U and V statistics,”
The Annals of Statistics , pp. 655–674, 1992.[69] K. P. Chwialkowski, D. Sejdinovic, and A. Gretton, “A wild bootstrap for degenerate kernel tests,” in
Advances in Neural InformationProcessing Systems , 2014.[70] A. Leucht, “Degenerate U- and V-statistics under weak dependence: Asymptotic theory and bootstrap consistency,”
Bernoulli , vol. 18,no. 2, pp. 552–585, 2012.[71] R. R. Bahadur, “Stochastic comparison of tests,”
The Annals of Mathematical Statistics , pp. 276–295, 1960.[72] R. J. Serfling,
Approximation Theorems of Mathematical Statistics . John Wiley & Sons, 2009.[73] A. Ramdas, S. J. Reddi, B. P´oczos, A. Singh, and L. A. Wasserman, “On the decreasing power of kernel and distance based nonparametrichypothesis tests in high dimensions,” in
AAAI Conference on Artificial Intelligence, 2015.
[74] I. Csiszár, "A simple proof of Sanov's theorem,"