A Kernel Two-Sample Test for Functional Data
George Wynne and Andrew B. Duncan
Imperial College London, Department of Mathematics
The Alan Turing Institute
Abstract
We propose a nonparametric two-sample test procedure based on Maximum Mean Discrepancy (MMD) for testing the hypothesis that two samples of functions have the same underlying distribution, using kernels defined on function spaces. This construction is motivated by a scaling analysis of the efficiency of MMD-based tests for datasets of increasing dimension. Theoretical properties of kernels on function spaces and their associated MMD are established and employed to ascertain the efficacy of the newly proposed test, as well as to assess the effects of using functional reconstructions based on discretised function samples. The theoretical results are demonstrated over a range of synthetic and real world datasets.
1 Introduction

Nonparametric two-sample tests for equality of distributions are widely studied in statistics, driven by applications in goodness-of-fit testing, anomaly and change-point detection and clustering. Classical examples of such tests include the Kolmogorov-Smirnov test [41, 69, 62] and the Wald-Wolfowitz runs test [84], with subsequent multivariate extensions [25].

Due to advances in the ability to collect large amounts of real-time or spatially distributed data there is a need to develop statistical methods appropriate for functional data, where each data sample is a discretised function. Such data has been studied for decades in the Functional Data Analysis (FDA) literature [32, 35], particularly in the context of analysing populations of time series, or in statistical shape analysis [45]. More recently, due to this modern abundance of functional data, increased study has been made in the machine learning literature of algorithms suited to such data [7, 15, 37, 12, 88].

In this paper we consider the case where the two probability distributions being compared are supported over a real, separable Hilbert space, for example $L^2(D)$ with $D \subset \mathbb{R}^d$, and we have discretised observations of the function samples. If the samples consist of evaluations of the functions over a common mesh of points, then well-known methods for nonparametric two-sample testing for vector data can be used directly. Aside from the practical issue that observations are often made on irregular meshes for each different sample, there is also the issue of degrading performance of classical tests as mesh size increases, meaning the observed vectors are high dimensional. As is typical with nonparametric two-sample tests, the testing power will degenerate rapidly with increasing data dimension. We therefore seek to better understand how to develop testing methods which are not strongly affected by the mesh resolution, exploiting intrinsic statistical properties of the underlying functional probability distributions.

In the past two decades kernels have seen a surge of use in statistical applications [47, 28, 78, 9]. In particular, kernel based two-sample testing [28, 27] has become increasingly popular. These approaches are based on a distance on the space of probability measures known as Maximum Mean Discrepancy (MMD). Given two probability distributions $P$ and $Q$, a kernel $k$ is employed to construct a mapping, known as the mean embedding, of the two distributions into an infinite dimensional Reproducing Kernel Hilbert Space (RKHS). The MMD between $P$ and $Q$, denoted $\mathrm{MMD}_k(P,Q)$, is given by the RKHS norm of the difference between the two embeddings, and defines a pseudo-metric on the space of probability measures. This becomes a metric if $k$ is characteristic, see Section 3. By the kernel trick, MMD simplifies to a closed form, up to expectations with respect to $P$ and $Q$, which can be estimated unbiasedly using Monte Carlo simulations.

A major advantage of kernel two-sample tests is that they can be constructed on any input space which admits a well-defined kernel, including Riemannian manifolds [55], as well as discrete structures such as graphs [66] and strings [26]. The flexibility in the choice of kernel is one of the strengths of MMD-based testing, where a priori knowledge of the structure of the underlying distributions can be encoded within the kernel to improve the sensitivity or specificity of the corresponding test.
The particular choice of kernel strongly influences the efficiency of the test, however a general recipe for constructing a good kernel is still an open problem. On Euclidean spaces, radial basis function (RBF) kernels are often used, i.e. kernels of the form $k(x,y) = \phi(\gamma^{-1}\|x-y\|)$, where $\|\cdot\|$ is the Euclidean norm, $\phi\colon \mathbb{R}^+ \to \mathbb{R}^+$ is a function and $\gamma > 0$ is the bandwidth. Numerous kernels used in practice belong to this class, including the Gaussian kernel $\phi(r) = e^{-r^2/2}$, the Laplace kernel $\phi(r) = e^{-r}$ and others including the rational quadratic kernel, the Matérn kernel and the multiquadric kernel. The problem of selecting the bandwidth parameter to maximise test efficiency over a particular input space has been widely studied. One commonly used strategy is the median heuristic, where the bandwidth is chosen to be the median of the inter-sample distance. Despite its popularity, there is only limited understanding of the median heuristic, with some notable exceptions. In Ramdas et al. [58, 59] the authors investigate the diminishing power of the kernel two-sample test using a Gauss kernel for distributions of white Gaussian random vectors with increasing dimension, demonstrating that under appropriate alternatives, the power of the test will decay with a rate dependent on the relative scaling of $\gamma$ with respect to dimension. Related to kernel based tests are energy distance tests [79, 80]; the relationship was made clear in Sejdinovic et al. [64].

There has been relatively little work on understanding the theoretical properties of kernels on function spaces. A Gauss type kernel on $L^2([0,1])$ was briefly considered in Christmann and Steinwart [16, Example 3]. Recently, in Chevyrev and Oberhauser [15] a kernel was defined on the Banach space of paths on $[0,1]$ of unbounded 1-variation, using a novel approach based on path signatures, demonstrating that this is a characteristic kernel over the space of such paths. The associated MMD has been employed as a loss function to train models generating stochastic processes [40]. Furthermore, in Nelsen and Stuart [50] the authors propose an extension of the random Fourier feature kernel of Rahimi and Recht [57] to the setting of an infinite dimensional Banach space, with the objective of regression between Banach spaces. This paper will build on aspects of these works, but with a specific emphasis on two-sample testing for functional data.

Two-sample testing in function spaces has received much attention in FDA and is studied in a variety of contexts. Broadly speaking there are two classes of methods. The first approach seeks to initially reduce the problem to a finite dimensional problem through a projection onto a finite orthonormal basis within the function space, typically using principal components, and then makes use of standard multivariate two-sample tests [4, 43]. The second approach poses a two-sample test directly on function space [2, 10, 34, 56, 11]. Many of these works construct the test on the Hilbert space $L^2(D)$ using the $L^2(D)$ norm as the testing statistic. A priori, it is not obvious why this norm will be well suited to the testing problem in general. The impact of the choice of distance in distance based tests for functional data has been studied in the literature [14, 13, 90] and a distance other than $L^2(D)$ for the functional data was advocated. This motivates the investigation into kernels which involve distances other than $L^2(D)$ in their formulation.
In many works, the two-sample tests are designed to handle a specific class of discrepancy, such as a shift in mean, for example Horváth et al. [33] and Zhang et al. [89], or a shift in covariance structure [52, 23, 24].

This paper has two main aims. First, to naturally generalise the finite dimensional theory of kernels to real, separable Hilbert spaces: to establish kernels that are characteristic, identify their RKHS and establish topological properties of the associated MMD. In particular, the proof of characteristicness builds upon the spectral methods introduced in Sriperumbudur et al. [72] and the weak convergence results build upon Simon-Gabriel and Schölkopf [67]. Second, we apply such kernels to the two-sample testing problem and analyse the power of the tests as well as the statistical impact of performing the tests using data reconstructed from discrete functional observations. The specific contributions are as follows.

1. For Gaussian processes, we identify a scaling of the Gauss kernel bandwidth with mesh-size which results in testing power which is asymptotically independent of mesh-size, under mean-shift alternatives. In the scaling of vanishing mesh-size we demonstrate that the associated kernel converges to a kernel over functions.
2. Motivated by this, we construct a family of kernels defined on real, separable Hilbert spaces, identify sufficient conditions for the kernels to be characteristic and for MMD to metrise the weak topology, and provide an explicit construction of the reproducing kernel Hilbert space for a Gauss type kernel.
3. Using these kernels we investigate the statistical effect of using reconstructed functional data in the two-sample test.
4. We numerically validate our theory and compare the kernel based test with established two-sample tests from the functional data analysis literature.

The remainder of the paper is as follows. Section 2 covers preliminaries of modelling random functional data such as the Karhunen-Loève expansion and Gaussian measures. Section 3 recalls some important properties of kernels and their associated reproducing kernel Hilbert spaces, defines maximum mean discrepancy and the kernel two-sample test. Section 4 outlines the scaling of test power that occurs when an increasingly finer observation mesh is used for functional data. Section 5 defines a broad class of kernels and offers an integral feature map interpretation, as well as outlining when the kernels are characteristic, meaning the two-sample test is valid. Section 6 highlights the statistical impact of fitting curves to discretised functions before performing the test; a relationship between MMD and weak convergence is highlighted and closed form expressions for the MMD and mean-embeddings when the distributions are Gaussian processes are given. Section 7 provides multiple examples of choices for the kernel hyperparameters and principled methods of constructing them. Section 8 contains multiple numerical experiments validating the theory in the paper: a simulation is performed to validate the scaling arguments of Section 4, and synthetic and real data sets are used to compare the performance of the kernel based test against existing functional two-sample tests. Concluding remarks and thoughts about future work are provided in Section 9.

2 Preliminaries

In this paper we shall follow the Hilbert space approach to functional data analysis and use this section to outline the required preliminaries [17, 35]. Before discussing random functions we establish notation for families of operators that will be used extensively.
Let $X$ be a real, separable Hilbert space with inner product $\langle\cdot,\cdot\rangle_X$. Then $L(X)$ denotes the set of bounded linear maps from $X$ to itself, and $L^+(X)$ denotes the subset of $L(X)$ of operators that are self-adjoint (also known as symmetric) and non-negative, meaning $\langle Tx, y\rangle_X \geq 0$ for all $x, y \in X$. The subset of $L^+(X)$ of trace class operators is denoted $L^+_1(X)$, and by the spectral theorem [74, Theorem A.5.13] such operators can be diagonalised. This means for every $T \in L^+_1(X)$ there exists an orthonormal basis of eigenfunctions $\{e_n\}_{n=1}^\infty$ in $X$ such that $Tx = \sum_{n=1}^\infty \lambda_n \langle x, e_n\rangle_X e_n$, where $\{\lambda_n\}_{n=1}^\infty$ are non-negative eigenvalues and the trace satisfies $\mathrm{Tr}(T) = \sum_{n=1}^\infty \lambda_n < \infty$. When the eigenvalues are square summable the operator is called Hilbert-Schmidt and the Hilbert-Schmidt norm satisfies $\|T\|_{\mathrm{HS}}^2 = \sum_{n=1}^\infty \lambda_n^2$.

We now outline the Karhunen-Loève expansion of stochastic processes. Let $x(\cdot)$ be a stochastic process in $X = L^2([0,1])$; the following will hold for a stochastic process taking values in any real, separable Hilbert space but we focus on $L^2([0,1])$ since it is the most common setting for functional data. Suppose that the point-wise covariance function $\mathbb{E}[x(s)x(t)] = k(s,t)$ is continuous; then the mean function $m(t) = \mathbb{E}[x(t)]$ is also in $X$. Define the covariance operator $C_k\colon X \to X$ associated with $x$ by $C_k y(t) = \int k(s,t)y(s)\,ds$. Then $C_k \in L^+_1(X)$, and denote the spectral decomposition $C_k y = \sum_{n=1}^\infty \lambda_n \langle y, e_n\rangle_X e_n$. The Karhunen-Loève (KL) expansion [76, Theorem 11.4] provides a characterisation of the law of the process $x(\cdot)$ in terms of an infinite-series expansion. More specifically, we can write $x(\cdot) \sim m + \sum_{n=1}^\infty \lambda_n^{1/2} \eta_n e_n(\cdot)$, where $\{\eta_n\}_{n=1}^\infty$ are unit-variance uncorrelated random variables. Additionally, Mercer's theorem [75] provides an expansion of the covariance as $k(s,t) = \sum_{n=1}^\infty \lambda_n e_n(s)e_n(t)$, where the convergence is uniform.

An important case of random functions are Gaussian processes [60]. Given a kernel $k$, see Section 3, and a function $m$, we say $x$ is a Gaussian process with mean function $m$ and covariance function $k$ if for every finite collection of points $\{s_n\}_{n=1}^N$ the random vector $(x(s_1), \ldots, x(s_N))$ is a multivariate Gaussian random variable with mean vector $(m(s_1), \ldots, m(s_N))$ and covariance matrix $(k(s_n, s_m))_{n,m=1}^N$. The mean function and covariance function completely determine the Gaussian process. We write $x \sim \mathrm{GP}(m, k)$ to denote the Gaussian process with mean function $m$ and covariance function $k$. If $x \sim \mathrm{GP}(0, k)$ then in the Karhunen-Loève representation $\eta_n \sim N(0,1)$ and the $\eta_n$ are all independent.
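As a concrete illustration of the KL expansion, the following minimal sketch samples approximate Brownian motion paths on $[0,1]$ by truncating the series; the eigenpairs $\lambda_n = ((n - 1/2)\pi)^{-2}$ and $e_n(t) = \sqrt{2}\sin((n-1/2)\pi t)$ are the known eigenpairs of the covariance $k(s,t) = \min(s,t)$, while the truncation level and grid are illustrative choices, not taken from the paper.

```python
import numpy as np

def kl_sample_brownian(n_terms=100, n_grid=256, rng=None):
    """Sample a Brownian motion path on [0, 1] via a truncated
    Karhunen-Loeve expansion x(t) = sum_n lambda_n^{1/2} eta_n e_n(t)."""
    rng = np.random.default_rng(rng)
    t = np.linspace(0.0, 1.0, n_grid)
    n = np.arange(1, n_terms + 1)
    # eigenpairs of the covariance operator of k(s, t) = min(s, t)
    lam = 1.0 / ((n - 0.5) ** 2 * np.pi ** 2)                  # eigenvalues
    e = np.sqrt(2.0) * np.sin(np.outer(t, (n - 0.5) * np.pi))  # eigenfunctions
    eta = rng.standard_normal(n_terms)                         # i.i.d. N(0, 1)
    return t, e @ (np.sqrt(lam) * eta)

t, x = kl_sample_brownian(rng=0)
```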
Gaussian processes that take values in $X$ can be associated with Gaussian measures on $X$. Gaussian measures are natural generalisations of Gaussian distributions on $\mathbb{R}^d$ to infinite dimensional spaces, defined by a mean element and covariance operator rather than a mean vector and covariance matrix; for an introduction see Da Prato [18, Chapter 1]. Specifically, $x \sim \mathrm{GP}(m, k)$ can be associated with the Gaussian measure $N_{m,C_k}$ with mean $m$ and covariance operator $C_k$, the covariance operator associated with $k$ as outlined above. Similarly, given any $m \in X$ and $C \in L^+_1(X)$ there exists a Gaussian measure $N_{m,C}$ with mean $m$ and covariance operator $C$ [18, Theorem 1.12]. In fact, the Gaussian measure $N_{m,C}$ is characterised as the unique probability measure on $X$ with Fourier transform $\widehat{N}_{m,C}(y) = \exp(i\langle m, y\rangle_X - \frac{1}{2}\langle Cy, y\rangle_X)$. Finally, if $C$ is injective then a Gaussian measure with covariance operator $C$ is called non-degenerate and has full support on $X$ [18, Proposition 1.25].

3 Kernels and Maximum Mean Discrepancy

This section outlines what a kernel and a reproducing kernel Hilbert space are, with examples and associated references. Subsection 3.1 defines kernels and RKHSs, Subsection 3.2 defines MMD and the corresponding estimators, and Subsection 3.3 outlines the testing procedure.
3.1 Kernels and Reproducing Kernel Hilbert Spaces

Given a nonempty set $X$, a kernel is a function $k\colon X \times X \to \mathbb{R}$ which is symmetric, meaning $k(x,y) = k(y,x)$ for all $x, y \in X$, and positive definite, that is, the matrix $(k(x_n, x_m))_{n,m=1}^N$ is positive semi-definite for all $\{x_n\}_{n=1}^N \subset X$ and all $N \in \mathbb{N}$. For each kernel $k$ there is an associated Hilbert space of functions over $X$ known as the reproducing kernel Hilbert space (RKHS), denoted $\mathcal{H}_k(X)$ [6, 74, 22]. RKHSs have found numerous applications in function approximation and inference for decades since their original application to spline interpolation [83]. Multiple detailed surveys exist in the literature [61, 54]. The RKHS associated with $k$ satisfies the following two properties: (i) $k(\cdot, x) \in \mathcal{H}_k(X)$ for all $x \in X$; (ii) $\langle f, k(\cdot, x)\rangle_{\mathcal{H}_k(X)} = f(x)$ for all $x \in X$ and $f \in \mathcal{H}_k(X)$. The latter is known as the reproducing property. The RKHS is constructed from the kernel in a natural way. The linear span of a kernel $k$ with one input fixed, $\mathcal{H}_0(X) = \{\sum_{n=1}^N a_n k(\cdot, x_n) : N \in \mathbb{N},\ \{a_n\}_{n=1}^N \subset \mathbb{R},\ \{x_n\}_{n=1}^N \subset X\}$, is a pre-Hilbert space equipped with the inner product $\langle f, g\rangle_{\mathcal{H}_0(X)} = \sum_{n=1}^N \sum_{m=1}^M a_n b_m k(x_n, y_m)$, where $f = \sum_{n=1}^N a_n k(\cdot, x_n)$ and $g = \sum_{m=1}^M b_m k(\cdot, y_m)$. The RKHS $\mathcal{H}_k(X)$ of $k$ is then obtained from $\mathcal{H}_0(X)$ through completion. More specifically, $\mathcal{H}_k(X)$ is the set of functions which are pointwise limits of Cauchy sequences in $\mathcal{H}_0(X)$ [6, Theorem 3]. The relationship between kernels and RKHSs is one-to-one: for every kernel the RKHS is unique, and for every Hilbert space of functions admitting a function $k$ satisfying the two properties above, that $k$ is unique and is a kernel. This result is known as the Aronszajn theorem [6, Theorem 3].

A kernel $k$ on $X \subseteq \mathbb{R}^d$ is said to be translation invariant if it can be written as $k(x,y) = \phi(x-y)$ for some $\phi$. Bochner's theorem, Theorem 10 in the Appendix, tells us that if $k$ is continuous and translation invariant then there exists a Borel measure $\mu_k$ on $X$ such that $\widehat{\mu}_k(x-y) = k(x,y)$; we call $\mu_k$ the spectral measure of $k$. The spectral measure is an important tool in the analysis of kernel methods and shall become important later when discussing the two-sample problem.

3.2 Maximum Mean Discrepancy

Given a kernel $k$ and associated RKHS $\mathcal{H}_k(X)$, let $\mathcal{P}$ be the set of Borel probability measures on $X$ and, assuming $k$ is measurable, define $\mathcal{P}_k \subset \mathcal{P}$ as the set of all $P \in \mathcal{P}$ such that $\int k(x,x)^{1/2}\,dP(x) < \infty$. Note that $\mathcal{P}_k = \mathcal{P}$ if and only if $k$ is bounded [72, Proposition 2], which is very common in practice and shall be the case for all kernels considered in this paper. For $P, Q \in \mathcal{P}_k$ we define the Maximum Mean Discrepancy, denoted $\mathrm{MMD}_k(P,Q)$, as $\mathrm{MMD}_k(P,Q) = \sup_{\|f\|_{\mathcal{H}_k(X)} \leq 1} |\int f\,dP - \int f\,dQ|$. This is an integral probability metric [48, 72] and without further assumptions defines a pseudo-metric on $\mathcal{P}_k$, which permits the possibility that $\mathrm{MMD}_k(P,Q) = 0$ but $P \neq Q$. We introduce the mean embedding $\Phi_k P$ of $P \in \mathcal{P}_k$ into $\mathcal{H}_k(X)$, defined by $\Phi_k P = \int k(\cdot, x)\,dP(x)$. This can be viewed as the mean in $\mathcal{H}_k(X)$ of the function $k(x, \cdot)$ with respect to $P$ in the sense of a Bochner integral [35, Section 2.6]. Following Sriperumbudur et al.
[72, Section 2] this allows us to write
$$\mathrm{MMD}_k(P,Q) = \sup_{\|f\|_{\mathcal{H}_k(X)} \leq 1} \Big|\int f\,dP - \int f\,dQ\Big| = \sup_{\|f\|_{\mathcal{H}_k(X)} \leq 1} |\langle \Phi_k P - \Phi_k Q, f\rangle_{\mathcal{H}_k(X)}| = \|\Phi_k P - \Phi_k Q\|_{\mathcal{H}_k(X)}. \quad (1)$$
The crucial observation which motivates the use of MMD as an effective measure of discrepancy is that the supremum can be eliminated using the reproducing property of the inner product [72, Section 2]. This yields the following closed form representation:
$$\mathrm{MMD}_k^2(P,Q) = \iint k(x,x')\,dP(x)\,dP(x') + \iint k(y,y')\,dQ(y)\,dQ(y') - 2\iint k(x,y)\,dP(x)\,dQ(y). \quad (2)$$
It is clear that $\mathrm{MMD}_k$ is a metric over $\mathcal{P}_k$ if and only if the map $\Phi_k\colon \mathcal{P}_k \to \mathcal{H}_k(X)$ is injective. Given a subset $\mathcal{P}_0 \subseteq \mathcal{P}_k$, a kernel is characteristic to $\mathcal{P}_0$ if the map $\Phi_k$ is injective over $\mathcal{P}_0$. In the case that $\mathcal{P}_0 = \mathcal{P}$ we just say that $k$ is characteristic. Various works have provided sufficient conditions for a kernel over finite dimensional spaces to be characteristic [72, 73, 67].

Given independent samples $X_n = \{x_i\}_{i=1}^n$ from $P$ and $Y_m = \{y_i\}_{i=1}^m$ from $Q$, we wish to estimate $\mathrm{MMD}_k(P,Q)$. A number of estimators have been proposed. For clarity of presentation we shall assume that $m = n$, but stress that all of the following can be generalised to situations where the two data-sets are unbalanced. Given samples $X_n$ and $Y_n$, the following U-statistic is an unbiased estimator of $\mathrm{MMD}_k^2(P,Q)$:
$$\widehat{\mathrm{MMD}}_k^2(X_n, Y_n) := \frac{1}{n(n-1)} \sum_{i \neq j}^n h(z_i, z_j), \quad (3)$$
where $z_i = (x_i, y_i)$ and $h(z_i, z_j) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)$. This estimator can be evaluated in $O(n^2)$ time. An unbiased linear time estimator proposed in Jitkrittum et al. [36] is given by
$$\widehat{\mathrm{MMD}}_{k,\mathrm{lin}}^2(X_n, Y_n) := \frac{2}{n} \sum_{i=1}^{n/2} h(z_{2i-1}, z_{2i}), \quad (4)$$
where it is assumed that $n$ is even. While the cost of computing $\widehat{\mathrm{MMD}}_{k,\mathrm{lin}}^2(X_n, Y_n)$ is only $O(n)$, this comes at the cost of reduced efficiency, i.e. $\mathrm{Var}(\widehat{\mathrm{MMD}}_k^2(X_n, Y_n)) < \mathrm{Var}(\widehat{\mathrm{MMD}}_{k,\mathrm{lin}}^2(X_n, Y_n))$, see for example Sutherland [77]. Various probabilistic bounds have been derived on the error between the estimator and $\mathrm{MMD}_k(P,Q)$ [28, Theorem 10, Theorem 15].

3.3 The Testing Procedure

Given independent samples $X_n = \{x_i\}_{i=1}^n$ from $P$ and $Y_n = \{y_i\}_{i=1}^n$ from $Q$, we seek to test the hypothesis $H_0\colon P = Q$ against the alternative hypothesis $H_1\colon P \neq Q$ without making any distributional assumptions. The kernel two-sample test of Gretton et al. [28] employs an estimator of MMD as the test statistic. Indeed, fixing a characteristic kernel $k$, we reject $H_0$ if $\widehat{\mathrm{MMD}}_k^2(X_n, Y_n) > c_\alpha$, where $c_\alpha$ is a threshold selected to ensure a false-positive rate of $\alpha$. While we do not have a closed-form expression for $c_\alpha$, it can be estimated using a permutation bootstrap. More specifically, we randomly shuffle $X_n \cup Y_n$ and split it into two data sets $X'_n$ and $Y'_n$, from which $\widehat{\mathrm{MMD}}_k^2(X'_n, Y'_n)$ is calculated. This is repeated numerous times so that an estimator $\hat{c}_\alpha$ of the threshold is then obtained as the $(1-\alpha)$-th quantile of the resulting empirical distribution.
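To make the estimator (3) and the permutation bootstrap concrete, the following is a minimal sketch in Python (in the spirit of, but not taken from, the code accompanying the paper); the Gaussian kernel and all function names are illustrative choices, and the samples are assumed to be stored as rows of NumPy arrays.

```python
import numpy as np

def gaussian_gram(A, B, gamma):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / (2 gamma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * gamma ** 2))

def mmd2_u(X, Y, gamma):
    """Unbiased U-statistic estimator (3) of the squared MMD."""
    n = X.shape[0]
    Kxx = gaussian_gram(X, X, gamma)
    Kyy = gaussian_gram(Y, Y, gamma)
    Kxy = gaussian_gram(X, Y, gamma)
    # off-diagonal sums implement the i != j restriction in (3)
    s_xx = Kxx.sum() - np.trace(Kxx)
    s_yy = Kyy.sum() - np.trace(Kyy)
    s_xy = Kxy.sum() - np.trace(Kxy)
    return (s_xx + s_yy - 2 * s_xy) / (n * (n - 1))

def permutation_test(X, Y, gamma, n_perm=1000, alpha=0.05, rng=None):
    """Reject H0: P = Q when the statistic exceeds the bootstrap quantile."""
    rng = np.random.default_rng(rng)
    stat = mmd2_u(X, Y, gamma)
    Z, n = np.vstack([X, Y]), X.shape[0]
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(2 * n)
        null[b] = mmd2_u(Z[idx[:n]], Z[idx[n:]], gamma)
    return stat > np.quantile(null, 1 - alpha)
```

In the functional setting of later sections, the rows of X and Y would hold function evaluations on a common mesh, with the bandwidth gamma set, for instance, by the median heuristic of (8) below.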
The same test procedure may be performed using the linear time MMD estimator as the test statistic.

The efficiency of a test is characterised by its false-positive rate $\alpha$ and its false-negative rate $\beta$. The power of a test is a measure of its ability to correctly reject the null hypothesis. More specifically, fixing $\alpha$ and obtaining an estimator $\hat{c}_\alpha$ of the threshold, we define the power of the test at $\alpha$ to be $P(n\,\widehat{\mathrm{MMD}}_k^2(X_n, Y_n) \geq \hat{c}_\alpha)$. Invoking the central limit theorem for U-statistics [65], we can quantify the decrease in variance of the unbiased MMD estimators, asymptotically as $n \to \infty$.

Theorem 1. [28, Corollary 16] Suppose that $\mathbb{E}_{x \sim P, y \sim Q}[h^2(x,y)] < \infty$. Then under the alternative hypothesis $P \neq Q$, the estimator $\widehat{\mathrm{MMD}}_k^2(X_n, Y_n)$ converges in distribution to a Gaussian as follows:
$$\sqrt{n}\big(\widehat{\mathrm{MMD}}_k^2(X_n, Y_n) - \mathrm{MMD}_k^2(P,Q)\big) \xrightarrow{D} N(0, \xi_1), \quad n \to \infty,$$
where $\xi_1 = 4\,\mathrm{Var}_z[\mathbb{E}_{z'}[h(z, z')]]$. An analogous result holds for the linear-time estimator, with $\xi_2 = \mathrm{Var}_{z,z'}[h(z, z')]$ instead of $\xi_1$.

In particular, under the conditions of Theorem 1, for large $n$, the power of the test will satisfy the following asymptotic result:
$$P\big(n\,\widehat{\mathrm{MMD}}_k^2(X_n, Y_n) > \hat{c}_\alpha\big) \approx \Phi\bigg(\frac{\sqrt{n}\,\mathrm{MMD}_k^2(P,Q)}{\sqrt{\xi_1}} - \frac{c_\alpha}{\sqrt{n\xi_1}}\bigg), \quad (5)$$
where $\Phi$ is the CDF of a standard Gaussian distribution and $\xi_1 = 4\,\mathrm{Var}_z[\mathbb{E}_{z'}[h(z, z')]]$. The analogous result for the linear-time estimator holds with $\xi_2 = \mathrm{Var}_{z,z'}[h(z, z')]$ instead of $\xi_1$ [59, 42]. This suggests that the test power can be maximised by maximising $\mathrm{MMD}_k^2(P,Q)/\sqrt{\xi_1}$, which can be seen as a signal-to-noise ratio [42]. It is evident from previous works that the properties of the kernel will have a very significant impact on the power of the test, and methods have been proposed for increasing test power by optimising the kernel parameters using the signal-to-noise ratio as an objective [78, 59, 42].

4 Test Power and Mesh Scaling

To motivate the construction of kernel two-sample tests for random functions, in this section we consider the case where the samples $X_n$ and $Y_n$ are independent realisations of two Gaussian processes, observed along a regular mesh $\Xi_N = \{t_1, \ldots, t_N\}$ of $N$ points in $D$, where $D \subset \mathbb{R}^d$ is some compact set. Therefore $N$ will be the dimension of the observed vectors. To develop ideas, we shall focus on a mean-shift alternative, where the underlying Gaussian processes are given by $\mathrm{GP}(0, k_0)$ and $\mathrm{GP}(m, k_0)$ respectively, where $k_0$ is a covariance function and $m \in L^2(D)$ is the mean function. We use the subscript to distinguish $k_0$ from the kernel $k$ we use to perform the test. We will use the linear time test due to easier calculations. This reduces to a multivariate two-sample hypothesis test problem on $\mathbb{R}^N$, with samples $X_n = \{x_i\}_{i=1}^n$ from $P = N(0, \Sigma)$ and $Y_n = \{y_i\}_{i=1}^n$ from $Q = N(m_N, \Sigma)$, where $\Sigma_{i,j} = k_0(t_i, t_j)$ for $i,j = 1, \ldots, N$ and $m_N = (m(t_1), \ldots, m(t_N))^\top$.

We consider applying a two-sample kernel test as detailed in Section 3, with a Gaussian kernel $k(x,y) = \exp(-\gamma_N^{-2}\|x-y\|^2)$ on $\mathbb{R}^N$, where $\gamma_N$ may depend on $N$. The large $N$ limit was studied in Ramdas et al. [58], but not in the context of functional data.
This motivates the question of whether there is a scaling of $\gamma_N$ with respect to $N$ which, employing the structure of the underlying random functions, guarantees that the statistical power remains independent of the mesh size $N$. To better understand the influence of bandwidth on power, we use the signal-to-noise ratio as a convenient proxy and study its behaviour in the large $N$ limit. We say the mesh $\Xi_N$ satisfies the Riemann scaling property if $N^{-1}\|m_N\|^2 = N^{-1}\sum_{i=1}^N m(t_i)^2 \to \int_D m(t)^2\,dt = \|m\|_{L^2(D)}^2$ as $N \to \infty$ for all $m \in L^2(D)$; this will be used in the next result to characterise the signal-to-noise ratio from the previous section.

Proposition 1.
Let $P, Q$ be as above with $\Xi_N$ satisfying the Riemann scaling property and $\gamma_N = \Omega(N^\alpha)$ with $\alpha > 1/2$. Then if $k_0(s,t) = \delta_{st}$,
$$\frac{\mathrm{MMD}_k^2(P,Q)}{\sqrt{\xi_2}} \sim \frac{\sqrt{N}\,\|m\|_{L^2(D)}^2}{2\sqrt{1 + \|m\|_{L^2(D)}^2}}, \quad (6)$$
and if $k_0$ is continuous and bounded then
$$\frac{\mathrm{MMD}_k^2(P,Q)}{\sqrt{\xi_2}} \sim \frac{\|m\|_{L^2(D)}^2}{2\sqrt{\|C_{k_0}\|_{\mathrm{HS}}^2 + \|C_{k_0}^{1/2} m\|_{L^2(D)}^2}}, \quad (7)$$
where $\sim$ means asymptotically equal in the sense that the ratio of the left and right hand sides converges to one as $N \to \infty$.

The proof of this result is in the Appendix and generalises Ramdas et al. [58] by considering non-identity $\Sigma$. The way this ratio increases with $N$, the number of observation points, in the white noise case makes sense since each observation reveals new information about the signal, as the noise is independent at each observation. On the other hand, a non-identity covariance matrix means the noise is not independent at each observation and thus new information is not obtained at each observation point. Indeed, the stronger the correlations, meaning the slower the decay of the eigenvalues of the covariance operator $C_{k_0}$, the smaller this ratio shall be, since the Hilbert-Schmidt norm in the denominator will be larger.

It is important to note that the right hand sides of (7) and (6) are independent of the choice of $\alpha$ once $\alpha > 1/2$, meaning that beyond $1/2$ this parameter will be ineffective for obtaining greater testing power. The next subsection discusses how $\alpha = 1/2$ provides a scaling resulting in kernels defined directly over function spaces, facilitating other methods to gain better test power.

4.1 Kernel Scaling

Proposition 1 does not include the case $\gamma_N = \Theta(N^{1/2})$, however it can be shown that the ratio does not degenerate in this case, see Theorem 7 and Theorem 8. In fact, the two different scales of the ratio, when $\Sigma$ is the identity matrix or a kernel matrix, still occur. This is numerically verified in Section 8.

Suppose $\gamma_N = \gamma N^{1/2}$ for some $\gamma \in \mathbb{R}$ and one uses a kernel of the form $k(x,y) = f(\gamma_N^{-2}\|x-y\|^2)$ over $\mathbb{R}^N$ for some continuous $f$. Suppose now though that our inputs shall be $x_N, y_N$, discretisations of functions $x, y \in L^2(D)$ observed on a mesh $\Xi_N$ that satisfies the Riemann scaling property. Then as the mesh gets finer we observe the following scaling:
$$k(x_N, y_N) = f(\gamma_N^{-2}\|x_N - y_N\|^2) \xrightarrow{N \to \infty} f(\gamma^{-2}\|x - y\|_{L^2(D)}^2).$$
Therefore the kernel, as the discretisation resolution increases, converges to a kernel over $L^2(D)$ where the Euclidean norm is replaced with the $L^2(D)$ norm. For example, the Gauss kernel would become $\exp(-\gamma^{-2}\|x-y\|_{L^2(D)}^2)$.

This scaling coincidentally is similar to the scaling of the widely used median heuristic, defined as
$$\gamma^2 = \mathrm{Median}\big\{\|a - b\|^2 : a, b \in \{x_i\}_{i=1}^n \cup \{y_i\}_{i=1}^m,\ a \neq b\big\}, \quad (8)$$
where $\{x_i\}_{i=1}^n$ are the samples from $P$ and $\{y_i\}_{i=1}^m$ the samples from $Q$. It was not designed with scaling in mind, however in Ramdas et al. [59] it was noted that it results in a $\gamma^2 = \Theta(N)$ scaling for the mean shift, identity matrix case.
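A minimal sketch of the median heuristic (8) on discretised functions; the vectorised pairwise-distance computation and function name are illustrative.

```python
import numpy as np

def median_heuristic_sq(X, Y):
    """gamma^2 as the median of squared pairwise distances (8),
    over the pooled samples, excluding self-pairs."""
    Z = np.vstack([X, Y])              # pooled samples, one row per function
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    iu = np.triu_indices(len(Z), k=1)  # each unordered pair once, a != b
    return np.median(sq[iu])
```

By Lemma 1 below, for discretised Gaussian processes this median grows like the expected squared distance, i.e. like $N$, which matches the $\gamma_N^2 = \gamma^2 N$ scaling above.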
The next lemma makes this more precise by relating the median of the squared distance to its expectation.

Lemma 1. Let $P = N(\mu_1, \Sigma_1)$ and $Q = N(\mu_2, \Sigma_2)$ be independent Gaussian distributions on $\mathbb{R}^N$. Then $\mathbb{E}_{x \sim P, y \sim Q}[\|x-y\|^2] = \mathrm{Tr}(\Sigma_1 + \Sigma_2) + \|\mu_1 - \mu_2\|^2$ and
$$\bigg|\frac{\mathrm{Median}_{x \sim P, y \sim Q}[\|x-y\|^2]}{\mathbb{E}_{x \sim P, y \sim Q}[\|x-y\|^2]} - 1\bigg| \leq \sqrt{2}\bigg(1 - \frac{\|\mu_1 - \mu_2\|^4}{(\mathrm{Tr}(\Sigma_1 + \Sigma_2) + \|\mu_1 - \mu_2\|^2)^2}\bigg)^{1/2};$$
in particular, if $P, Q$ are discretisations of Gaussian processes $\mathrm{GP}(m_1, k_1)$, $\mathrm{GP}(m_2, k_2)$ on a mesh $\Xi_N$ of $N$ points satisfying the Riemann scaling property over some compact $D \subset \mathbb{R}^d$, with $m_1, m_2 \in L^2(D)$ and $k_1, k_2$ continuous, then $\mathbb{E}_{x \sim P, y \sim Q}[\|x-y\|^2] \sim N(\mathrm{Tr}(C_{k_1} + C_{k_2}) + \|m_1 - m_2\|_{L^2(D)}^2)$ and as $N \to \infty$ the right hand side of the above inequality converges to
$$\sqrt{2}\bigg(1 - \frac{\|m_1 - m_2\|_{L^2(D)}^4}{(\mathrm{Tr}(C_{k_1} + C_{k_2}) + \|m_1 - m_2\|_{L^2(D)}^2)^2}\bigg)^{1/2}.$$

The above lemma does not show that the median heuristic results in $\gamma_N = \gamma N^{1/2}$, but relates it to the expected squared distance, which does scale as $N$. Therefore investigating the properties of such a scaling is natural. Since $L^2(D)$ is a real, separable Hilbert space, when using kernels defined directly over $L^2(D)$ in later sections we can leverage the theory of probability measures on such Hilbert spaces to deduce results about the testing performance of such kernels. In fact, we shall move past $L^2(D)$ and obtain results for kernels over arbitrary real, separable Hilbert spaces. Note that a different scaling of $\gamma_N$ would not result in such a scaling of the norm to $L^2(D)$, so such theory could not be applied.

5 Kernels on Function Spaces

For the rest of the paper, unless specified otherwise, for example in Theorem 4, the spaces $X$, $Y$ will be real, separable Hilbert spaces with inner products and norms $\langle\cdot,\cdot\rangle_X$, $\langle\cdot,\cdot\rangle_Y$, $\|\cdot\|_X$, $\|\cdot\|_Y$. We adopt the notation in Section 2 for the various families of operators.

5.1 The SE-T Kernel

Motivated by the scaling discussions in Section 4 we define a kernel that acts directly on a Hilbert space.
Definition 1.
For $T\colon X \to Y$ the squared-exponential $T$ kernel (SE-$T$) is defined as
$$k_T(x,y) = e^{-\frac{1}{2}\|T(x) - T(y)\|_Y^2}.$$

We use the name squared-exponential instead of Gauss because the SE-$T$ kernel is not always the Fourier transform of a Gaussian distribution, whereas the Gauss kernel on $\mathbb{R}^d$ is, which is a key distinction and is relevant for our proofs. Lemma 2 in the Appendix assures us this function is a kernel. This definition allows us to adapt results about the Gauss kernel on $\mathbb{R}^d$ to the SE-$T$ kernel, since it is the natural infinite dimensional generalisation. For example, the following theorem characterises the RKHS of the SE-$T$ kernel for a certain choice of $T$, as was done in the finite dimensional case in Minh [46].
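For intuition, a small sketch of the SE-$T$ kernel with $T = I$ on functions discretised over a uniform mesh of $[0,1]$, approximating the $L^2([0,1])$ norm by a Riemann sum; the bandwidth parametrisation is an illustrative choice.

```python
import numpy as np

def se_kernel_l2(x, y, gamma=1.0):
    """SE-I kernel exp(-||x - y||_{L^2}^2 / (2 gamma^2)) for two functions
    sampled on a common uniform mesh of [0, 1] with N points."""
    n = len(x)
    l2_sq = np.sum((x - y) ** 2) / n  # Riemann sum for ||x - y||_{L^2}^2
    return np.exp(-l2_sq / (2 * gamma ** 2))
```

Note that the 1/N quadrature weight is exactly the $\gamma_N^2 = \gamma^2 N$ bandwidth rescaling identified in Section 4.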
Before we state the result we introduce the infinite dimensional generalisation of a multi-index: define $\Gamma$ to be the set of summable sequences indexed by $\mathbb{N}$ taking values in $\mathbb{N} \cup \{0\}$, and for $\gamma \in \Gamma$ set $|\gamma| = \sum_{n=1}^\infty \gamma_n$, so $\gamma \in \Gamma$ if and only if $\gamma_n = 0$ for all but finitely many $n \in \mathbb{N}$, meaning $\Gamma$ is a countable set. We set $\Gamma_n = \{\gamma \in \Gamma : |\gamma| = n\}$, and the notation $\sum_{|\gamma| \geq 0}$ shall mean $\sum_{n=0}^\infty \sum_{\gamma \in \Gamma_n}$, which is a countable sum.

Theorem 2. Let $T \in L^+(X)$ be of the form $Tx = \sum_{n=1}^\infty \lambda_n^{1/2}\langle x, e_n\rangle_X e_n$, with convergence in $X$, for some orthonormal basis $\{e_n\}_{n=1}^\infty$ and bounded positive coefficients $\{\lambda_n\}_{n=1}^\infty$. Then the RKHS of the SE-$T$ kernel is
$$\mathcal{H}_{k_T}(X) = \bigg\{F(x) = e^{-\frac{1}{2}\|Tx\|_X^2}\sum_{|\gamma| \geq 0} w_\gamma x^\gamma : \sum_{|\gamma| \geq 0} \frac{\gamma!}{\lambda^\gamma} w_\gamma^2 < \infty\bigg\},$$
where $x^\gamma = \prod_{n=1}^\infty x_n^{\gamma_n}$, $x_n = \langle x, e_n\rangle_X$, $\lambda^\gamma = \prod_{n=1}^\infty \lambda_n^{\gamma_n}$ and $\gamma! = \prod_{n=1}^\infty \gamma_n!$, and $\mathcal{H}_{k_T}(X)$ is equipped with the inner product $\langle F, G\rangle_{\mathcal{H}_{k_T}(X)} = \sum_{|\gamma| \geq 0} \frac{\gamma!}{\lambda^\gamma} w_\gamma v_\gamma$, where $F(x) = e^{-\frac{1}{2}\|Tx\|_X^2}\sum_{|\gamma| \geq 0} w_\gamma x^\gamma$ and $G(x) = e^{-\frac{1}{2}\|Tx\|_X^2}\sum_{|\gamma| \geq 0} v_\gamma x^\gamma$.

Remark 1. In the proof of Theorem 2 an orthonormal basis of $\mathcal{H}_{k_T}(X)$ is given which resembles the infinite dimensional Hermite polynomials used throughout infinite dimensional analysis and probability theory, for example see Da Prato and Zabczyk [19, Chapter 10] and Nourdin and Peccati [51, Chapter 2]. In particular they are used to define Sobolev spaces for functions over a real, separable Hilbert space [19, Theorem 9.2.12], which raises the interesting and, as far as we are aware, open question of how $\mathcal{H}_{k_T}(X)$ relates to such Sobolev spaces for different choices of $T$.

For the two-sample test to be valid we need the kernel to be characteristic, meaning the mean-embedding is injective over $\mathcal{P}$, so the test can tell the difference between any two probability measures. To understand the problem better we again leverage results regarding the Gauss kernel on $\mathbb{R}^d$, in particular the proof in Sriperumbudur et al. [72, Theorem 9] that the Gauss kernel on $\mathbb{R}^d$ is characteristic. This uses the fact that the Gauss kernel on $\mathbb{R}^d$ is the Fourier transform of a Gaussian distribution on $\mathbb{R}^d$, whose full support implies the kernel is characteristic. By choosing $T$ such that the SE-$T$ kernel is the Fourier transform of a Gaussian measure on $X$ that has full support, we can use the same argument.

Theorem 3.
Let $T \in L^+_1(X)$. Then the SE-$T$ kernel is characteristic if and only if $T$ is injective.

This is dissatisfyingly limiting, since $T \in L^+_1(X)$ is a restrictive assumption; for example, it does not include $T = I$, the identity operator. We shall employ a limiting argument to reduce the requirements on $T$. To this end we define admissible maps.

Definition 2.
A map $T\colon X \to Y$ is called admissible if it is Borel measurable, continuous and injective.
The next result provides a broad family of kernels which are characteristic. It applies for $X$ more general than a real, separable Hilbert space. A Polish space is a separable, completely metrizable topological space. Multiple examples of admissible $T$ are given in Section 7 and are examined numerically in Section 8.

Theorem 4.
Let $X$ be a Polish space, $Y$ a real, separable Hilbert space and $T$ an admissible map. Then the SE-$T$ kernel is characteristic.

Theorem 4 generalises Theorem 3. A critical result used in the proof is the Minlos-Sazanov theorem, detailed as Theorem 11 in the Appendix, which is an infinite dimensional version of Bochner's theorem. The result allows us to identify spectral properties of the SE-$T$ kernel which are used to deduce characteristicness.

5.2 Integral Kernels

Let $k_0\colon \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a kernel, $C \in L^+_1(X)$ and $N_C$ the corresponding mean zero Gaussian measure on $X$, and define $k_{C,k_0}\colon X \times X \to \mathbb{R}$ as follows:
$$k_{C,k_0}(x,y) := \int_X k_0(\langle x, h\rangle_X, \langle y, h\rangle_X)\,dN_C(h).$$
Consider the particular case where $k_0(s,t) = \langle \Phi(s), \Phi(t)\rangle_F$, where $\Phi\colon \mathbb{R} \to F$ is a continuous feature map, mapping into a Hilbert space $(F, \langle\cdot,\cdot\rangle_F)$, which will typically be $\mathbb{R}^F$ for some $F \in \mathbb{N}$. In this case the functions $x \mapsto \Phi(\langle x, h\rangle_X)$ can be viewed as $F$-valued random features for each $h \in X$ randomly sampled from $N_C$, and $k_{C,k_0}$ is very similar to the random feature kernels considered in Nelsen and Stuart [50] and Bach [3]. Following these previous works, we may completely characterise the RKHS of this kernel. The result involves $L^2_{N_C}(X; F)$, the space of equivalence classes of functions from $X$ to $F$ that are square integrable in the $F$ norm with respect to $N_C$, and $L^2_{N_C}(X) := L^2_{N_C}(X; \mathbb{R})$.

Proposition 2.
Suppose that $\psi(x, h) = \Phi(\langle x, h\rangle_X)$ satisfies $\psi \in L^2_{N_C \times N_C}(X \times X; F)$. Then the RKHS defined by the kernel $k_{C,k_0}$ is given by
$$\mathcal{H}_{k_{C,k_0}}(X) = \bigg\{\int \langle v(h), \psi(\cdot, h)\rangle_F\,dN_C(h) : v \in L^2_{N_C}(X; F)\bigg\} \subset L^2_{N_C}(X).$$

The proof of this result is an immediate generalisation of the real-valued case given in Nelsen and Stuart [50]. Using the spectral representation of translation invariant kernels we can provide conditions for $k_{C,k_0}$ to be a characteristic kernel.

Proposition 3. If $k_0$ is a kernel over $\mathbb{R} \times \mathbb{R}$ then $k_{C,k_0}$ is a kernel over $X \times X$. If $C$ is injective and $k_0$ is also continuous and translation invariant with spectral measure $\mu_0$ such that there exists an interval $(a,b) \subset \mathbb{R}$ with $\mu_0(U) > 0$ for every open subset $U \subset (a,b)$, then $k_{C,k_0}$ is characteristic.

For certain choices of $T$ the SE-$T$ kernel falls into this family of integral kernels. Indeed, if $k_0(x,y) = \cos(x-y)$ then $k_{C,k_0}$ is the SE-$C^{1/2}$ kernel:
$$k_{C,k_0}(x,y) = \widehat{N}_C(x-y) = e^{-\frac{1}{2}\|x-y\|_C^2} = e^{-\frac{1}{2}\sum_{n=1}^\infty \lambda_n(x_n - y_n)^2},$$
where $\|x-y\|_C^2 = \langle C(x-y), x-y\rangle_X$, $\{\lambda_n\}_{n=1}^\infty$ are the eigenvalues of $C$ and $x_n = \langle x, e_n\rangle_X$ are the coefficients with respect to the eigenfunction basis $\{e_n\}_{n=1}^\infty$ of $C$. Secondly, let $\gamma > 0$, assume $C$ is non-degenerate and set $k_0$ to be the complex exponential of $\gamma$ multiplied by the white noise mapping associated with $C$, see Da Prato [18, Section 1.2.4]; then $k_{C,k_0}$ is the SE-$\gamma I$ kernel
$$k_{C,k_0}(x,y) = k_{\gamma I}(x,y) = e^{-\frac{\gamma^2}{2}\|x-y\|_X^2}. \quad (9)$$
Note that $k_{\gamma I}$ is not the Fourier transform of any Gaussian measure on $X$ [44, Proposition 1.2.11], which shows how the integral kernel framework is more general than only using the Fourier transform of Gaussian measures to obtain kernels, as was done in Theorem 3.
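The cosine example can be checked numerically in a finite-dimensional truncation: averaging $\cos\langle h, x-y\rangle$ over draws $h \sim N_C$ recovers the Fourier transform of $N_C$, i.e. the SE-$C^{1/2}$ kernel. A small sketch with an illustrative diagonal $C$ on $\mathbb{R}^d$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
lam = 1.0 / np.arange(1, d + 1) ** 2          # eigenvalues of a diagonal C
x, y = rng.standard_normal(d), rng.standard_normal(d)

h = rng.standard_normal((200_000, d)) * np.sqrt(lam)   # h ~ N(0, C)
mc = np.cos(h @ (x - y)).mean()                # integral kernel with k0 = cos
exact = np.exp(-0.5 * np.sum(lam * (x - y) ** 2))      # SE-C^{1/2} kernel
print(mc, exact)                               # agree up to Monte Carlo error
```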
The integral framework can also yield non-SE type kernels. Let $N_1$ be the measure associated with the Gaussian distribution $N(0,1)$ on $\mathbb{R}$, let $C$ be non-degenerate and $k_0(x,y) = \widehat{N}_1(x-y)$; then we have
$$k_{C,k_0}(x,y) = \int_X \int_{\mathbb{R}} e^{iz\langle h, x-y\rangle_X}\,dN_1(z)\,dN_C(h) = \big(\|x-y\|_C^2 + 1\big)^{-1/2}. \quad (10)$$

Definition 3. For $T\colon X \to Y$ the inverse multi-quadric $T$ kernel (IMQ-$T$) is defined as
$$k_T(x,y) = \big(\|T(x) - T(y)\|_Y^2 + 1\big)^{-1/2}.$$

By using Proposition 3 we immediately obtain that if $T \in L^+_1(X)$ and $T$ is non-degenerate then the IMQ-$T$ kernel is characteristic. But by the same limiting argument as Theorem 4 and the integral kernel formulation of IMQ-$T$ we obtain a more general result.

Corollary 1. Under the same conditions as Theorem 4, the IMQ-$T$ kernel is characteristic.

6 Properties of the Two-Sample Test

In Section 5 we derived kernels directly over function spaces that are characteristic, meaning that the MMD induced by them is a metric on $\mathcal{P}(X)$. Therefore a two-sample test based on such kernels may be constructed, as detailed in Section 3, using the same form of U-statistic estimators and bootstrap technique as the finite dimensional scenario. This section explores properties of the test. Subsection 6.1 investigates the effect of performing the test on reconstructions of the random functions based on observed data. Subsection 6.2 provides explicit calculations for MMD when $P, Q$ are Gaussian processes. Subsection 6.3 discusses the topology on $\mathcal{P}(X)$ induced by MMD and how it relates to the weak topology.

6.1 Testing with Reconstructed Functions

In practice, rather than having access to the full realisation of the random functions, the data available will be some finite-dimensional representation of the functions, for example through discretisation over a mesh, or as a projection onto a finite dimensional basis of $X$. Therefore to compute the kernel a user may need to approximate the true underlying functions from this finite dimensional representation. We wish to ensure that the effectiveness of the tests is preserved when using reconstructed data.

We formalise the notion of discretisation and reconstruction as follows. Assume that we observe $\{\mathcal{I}x_i\}_{i=1}^n$, where $\{x_i\}_{i=1}^n$ are the random samples from $P$ and $\mathcal{I}\colon X \to \mathbb{R}^N$ is a discretisation map. For example, $\mathcal{I}$ could be point evaluation at some fixed $t_1, t_2, \ldots, t_N \in D$, i.e. $\mathcal{I}x_i = (x_i(t_1), \ldots, x_i(t_N))^\top$. Noisy point evaluation can also be considered in this framework. Then a reconstruction map $\mathcal{R}\colon \mathbb{R}^N \to X$ is employed, so that $\mathcal{RI}X_n = \{\mathcal{RI}x_i\}_{i=1}^n$ is used to perform the test, and analogously for $\mathcal{RI}Y_n$. For example, $\mathcal{R}$ could be a kernel smoother or a spline interpolation operator. In practice one might have a different number of observations for each function; the following results can be adapted to this case straightforwardly.

Proposition 4.
Assume $k$ is a kernel on $X$ satisfying $|k(x,y) - k(u,v)| \leq L\|x - y - (u - v)\|_X$ for all $u, v, x, y \in X$, for some $L > 0$, and let $P, Q \in \mathcal{P}$ with $X_n = \{x_i\}_{i=1}^n$, $Y_n = \{y_i\}_{i=1}^n$ i.i.d. samples from $P, Q$ respectively, with reconstructed data $\mathcal{RI}X_n = \{\mathcal{RI}x_i\}_{i=1}^n$, $\mathcal{RI}Y_n = \{\mathcal{RI}y_i\}_{i=1}^n$. Then
$$\Big|\widehat{\mathrm{MMD}}_k^2(X_n, Y_n) - \widehat{\mathrm{MMD}}_k^2(\mathcal{RI}X_n, \mathcal{RI}Y_n)\Big| \leq \frac{4L}{n}\sum_{i=1}^n \big(\|\mathcal{RI}x_i - x_i\|_X + \|\mathcal{RI}y_i - y_i\|_X\big).$$

Corollary 2. If $k_T$ is the SE-$T$ or IMQ-$T$ kernel then the above bound holds with $\|T(\mathcal{RI}x_i) - T(x_i)\|_Y$, $\|T(\mathcal{RI}y_i) - T(y_i)\|_Y$ instead of $\|\mathcal{RI}x_i - x_i\|_X$, $\|\mathcal{RI}y_i - y_i\|_X$, with the corresponding Lipschitz constants $L$ of the SE and IMQ profiles respectively.

An analogous result can be derived for the linear time estimator with the same proof technique. While Proposition 4 provides a statement on the approximation of $\widehat{\mathrm{MMD}}_k^2(\mathcal{RI}X_n, \mathcal{RI}Y_n)$, we are primarily concerned with its statistical properties. Asymptotically, the test efficiency is characterised via the Gaussian approximation in Theorem 1, specifically through the asymptotic variance in (5). The following result provides conditions under which a similar central limit theorem holds for the estimator based on reconstructed data, with the same asymptotic variance. It imposes conditions on the number of discretisation points per function sample $N$, the error of the approximations and the number of function samples $n$.

Theorem 5.
Let $k$ satisfy the condition in Proposition 4 and let $X_n = \{x_i\}_{i=1}^n$ and $Y_n = \{y_i\}_{i=1}^n$ be i.i.d. samples from $P$ and $Q$ respectively with $P \neq Q$, with associated reconstructions $\mathcal{RI}X_n$ and $\mathcal{RI}Y_n$ based on $N(n)$ dimensional discretisations $\mathcal{I}X_n$ and $\mathcal{I}Y_n$, where $N(n) \to \infty$ as $n \to \infty$. If $n\,\mathbb{E}_{x \sim P}[\|x - \mathcal{RI}x\|_X] \to 0$ and $n\,\mathbb{E}_{y \sim Q}[\|y - \mathcal{RI}y\|_X] \to 0$ as $n \to \infty$, then for $\xi_1 = 4\,\mathrm{Var}_z[\mathbb{E}_{z'}[h(z, z')]]$,
$$\sqrt{n}\big(\widehat{\mathrm{MMD}}_k^2(\mathcal{RI}X_n, \mathcal{RI}Y_n) - \mathrm{MMD}_k^2(P,Q)\big) \xrightarrow{d} N(0, \xi_1).$$

A similar result can be derived for the linear time estimator by using the linear time estimator version of Proposition 4. The discretisation map, the number of discretisations per function sample and the reconstruction map need to combine to satisfy the convergence assumption. For example, if a weaker reconstruction map is used then more observations per function sample will be needed to compensate. Additionally, if the discretisation map offers less information about the underlying function, for example it provides observations that are noise corrupted, then more observations per function sample are needed.

We now discuss three settings in which these assumptions hold, relevant to different applications. We shall assume that $k$ satisfies the conditions of Proposition 4 and that $T = I$.

Let $X = L^2([0,1])$ and let $\Xi_{N(n)} = \{t_i\}_{i=1}^{N(n)}$ be a mesh of evaluation points where $t_{i+1} - t_i = N(n)^{-1}$ for all $i$, and define $\mathcal{I}x = (x(t_1), \ldots, x(t_{N(n)}))^\top \in \mathbb{R}^{N(n)}$. Let $\mathcal{R}$ be the piecewise linear interpolant defined as
$$(\mathcal{RI}x)(t) = (x(t_{i+1}) - x(t_i))\frac{t - t_i}{t_{i+1} - t_i} + x(t_i), \quad \text{for } t \in [t_i, t_{i+1}).$$
Suppose that realisations $x \sim P$ and $y \sim Q$ are almost surely in $C^2([0,1])$ and in particular satisfy $\mathbb{E}_{x \sim P}[\|x''\|_X] < \infty$ and $\mathbb{E}_{y \sim Q}[\|y''\|_X] < \infty$. Then
$$\mathbb{E}_{x \sim P}[\|x - \mathcal{RI}x\|_X] \leq N(n)^{-2}\,\mathbb{E}_{x \sim P}[\|x''\|_X],$$
and analogously for $y \sim Q$. Therefore if $N(n) \sim n^\alpha$ with $\alpha > 1/2$ then the conditions of Theorem 5 are satisfied.

Let $X = L^2(D)$ with $D \subset \mathbb{R}^d$ compact. As in the previous example, $\mathcal{I}$ will be the evaluation operator over a set of points $\Xi_{N(n)} = \{t_i\}_{i=1}^{N(n)}$, but now we assume the points are placed quasi-uniformly in the scattered data approximation sense, for example regularly placed grid points; see Wynne et al. [87] and Wendland [85, Chapter 4] for other methods to obtain quasi-uniform points. We set $\mathcal{R}$ as the kernel interpolant using a kernel $k$ with RKHS norm equivalent to $W_2^\nu(D)$ with $\nu > d/2$; this is achieved by the common Matérn and Wendland kernels [85, 38]. For this choice of recovery operator, $(\mathcal{RI}x)(t) = k(t, \Xi)K^{-1}\mathcal{I}(x)$, where $k(t, \Xi) = (k(t, t_1), \ldots, k(t, t_{N(n)}))$ and $K$ is the kernel matrix of $k$ over $\Xi_{N(n)}$. Suppose the realisations of $P$ and $Q$ lie almost surely in $W_2^\tau(D)$ for some $\tau > d/2$ (this assumption is discussed for $P, Q$ Gaussian processes in Kanagawa et al. [38, Section 4]); then
$$\mathbb{E}_{x \sim P}[\|x - \mathcal{RI}x\|_X] \leq C N(n)^{-(\tau \wedge \nu)/d},$$
for some constant $C > 0$, with an identical result holding for realisations of $Q$ [87, 49]. It follows that choosing $N(n) \sim n^{2d/(\tau \wedge \nu)}$ guarantees that the conditions of Theorem 5 hold in this setting.
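A sketch of the first setting: each discretised sample is reconstructed by piecewise linear interpolation onto a fine common reference grid before forming Gram matrices; np.interp implements exactly this interpolant, and the grid sizes are illustrative.

```python
import numpy as np

def reconstruct_linear(values, mesh, grid):
    """Piecewise linear reconstruction R I x of one function sample,
    evaluated on a common reference grid."""
    return np.interp(grid, mesh, values)

# e.g. a sample observed on its own irregular mesh of [0, 1]
rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 512)
mesh = np.sort(rng.uniform(0.0, 1.0, 40))
obs = np.sin(2 * np.pi * mesh)               # stand-in for I x
x_hat = reconstruct_linear(obs, mesh, grid)  # ready for the SE-I kernel
```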
In the previous setting we see that, to maintain a scaling of $N(n)$ independent of the dimension $d$, we need the signal smoothness $\nu, \tau$ to increase with $d$. Note that the case where $\mathcal{I}$ is pointwise evaluation corrupted by noise may be treated in a similar way by using results from Bayesian non-parametrics, for example van der Vaart and van Zanten [82, Theorem 5]. In this case $\mathcal{R}$ would be the posterior mean of a Gaussian process conditioned on $\mathcal{I}(x)$.

Let $X$ be an arbitrary real, separable Hilbert space and $\{e_n\}_{n=1}^\infty$ an orthonormal basis. Suppose that $\mathcal{I}$ is a projection operator onto the first $N(n)$ elements of the basis, $\mathcal{I}x = (\langle x, e_1\rangle_X, \ldots, \langle x, e_{N(n)}\rangle_X)^\top$, and $\mathcal{R}$ constructs a function from basis coefficients, $\mathcal{R}(\beta_1, \ldots, \beta_{N(n)}) = \sum_{i=1}^{N(n)} \beta_i e_i$, meaning $\mathcal{RI}x = \sum_{i=1}^{N(n)} \langle x, e_i\rangle_X e_i$. A typical example on $L^2([0,1])$ would be a Fourier series representation of the samples $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$, from which the functions can be recovered efficiently via an inverse Fast Fourier Transform. By Parseval's theorem, $\mathbb{E}_{x \sim P}[\|x - \mathcal{RI}x\|_X^2] = \mathbb{E}_{x \sim P}\big[\sum_{i=N(n)+1}^\infty |\langle x, e_i\rangle_X|^2\big] \to 0$ as $N(n) \to \infty$. For the conditions of Theorem 5 to hold, we require that $n\big(\mathbb{E}_{x \sim P}\big[\sum_{i=N(n)+1}^\infty |\langle x, e_i\rangle_X|^2\big]\big)^{1/2} \to 0$ as $n \to \infty$, which means $N(n)$ will need to grow in a way that compensates for the auto-correlation of realisations of $P$ and $Q$.

In this setting the use of the integral kernels described in Section 5.2 is particularly convenient. Indeed, let $C = \sum_{i=1}^\infty \lambda_i e_i \otimes e_i \in L^+_1(X)$ and consider the integral kernel $k_{C,k_0}$ where $k_0(s,t) = \Phi(s)\Phi(t)$ for a feature map $\Phi$ taking values in $\mathbb{R}$. An evaluation of the kernel $k(x,y)$ can then be approximated using a random Fourier feature approach [57] by
$$k(x,y) \approx \frac{1}{n_S}\sum_{l=1}^{n_S} \Phi\bigg(\sum_{i=1}^{N(n)} \lambda_i^{1/2} x_i \eta_{li}\bigg)\Phi\bigg(\sum_{j=1}^{N(n)} \lambda_j^{1/2} y_j \eta_{lj}\bigg),$$
where $x_i = \langle x, e_i\rangle_X$, $y_i = \langle y, e_i\rangle_X$ and $\eta_{li} \sim N(0,1)$ i.i.d. for $i = 1, \ldots, N(n)$ and $l = 1, \ldots, n_S$, for some $n_S \in \mathbb{N}$. This permits opportunities to reduce the computational cost of MMD tests, as judicious choices of $\Phi$ will permit accurate approximations of $k(x,y)$ with $n_S$ small. Similarly, the weights $\lambda_i$ can be chosen to reduce the dimensionality of the functions $x_i$ and $y_i$.
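A sketch of this feature-map approximation from truncated basis coefficients; the choice of $\Phi$, the weights and the feature count are illustrative assumptions.

```python
import numpy as np

def integral_kernel_rff(xc, yc, lam, n_s=256, phi=np.cos, rng=None):
    """Monte Carlo approximation of the integral kernel k_{C,k0} from
    truncated basis coefficients xc, yc (shape (N,)), with
    k0(s, t) = phi(s) phi(t) and C diagonal with eigenvalues lam."""
    rng = np.random.default_rng(rng)
    eta = rng.standard_normal((n_s, len(xc)))  # eta_{li} ~ N(0, 1)
    sx = eta @ (np.sqrt(lam) * xc)             # <x, h_l> for h_l ~ N(0, C)
    sy = eta @ (np.sqrt(lam) * yc)
    return np.mean(phi(sx) * phi(sy))
```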
6.2 Closed Form Expressions for Gaussian Measures

A key property of the SE-$T$ kernel is that the mean-embedding $\Phi_{k_T}P$ and $\mathrm{MMD}_{k_T}(P,Q)$ have closed form expressions when $P, Q$ are Gaussian measures. Using the natural correspondence between Gaussian measures and Gaussian processes from Section 2, we may obtain closed form expressions for Gaussian processes. This addresses the open question regarding the link between Bayesian non-parametric methods and kernel mean-embeddings that was discussed in Muandet et al. [47, Section 6.2]. Before stating the next result we need to introduce the concept of determinant for an operator: for $S \in L^+_1(X)$ define $\det(I + S) = \prod_{n=1}^\infty (1 + \lambda_n)$, where $\{\lambda_n\}_{n=1}^\infty$ are the eigenvalues of $S$. The equality $\det((I+S)(I+R)) = \det(I+S)\det(I+R)$ holds and is frequently used.

Theorem 6. Let $k_T$ be the SE-$T$ kernel for some $T \in L^+(X)$ and let $P = N_{a,S}$ be a non-degenerate Gaussian measure on $X$. Then
$$\Phi_{k_T}(N_{a,S})(x) = \det(I + TST)^{-1/2}\,e^{-\frac{1}{2}\langle (I + TST)^{-1}T(x-a),\,T(x-a)\rangle_X}.$$

Theorem 7.
Let $k_T$ be the SE-$T$ kernel for some $T \in L^+(X)$ and let $P = N_{a,S}$, $Q = N_{b,R}$ be non-degenerate Gaussian measures on $X$. Then
$$\begin{aligned}\mathrm{MMD}_{k_T}^2(P,Q) = {}& \det(I + 2TST)^{-1/2} + \det(I + 2TRT)^{-1/2} \\ & - 2\det\big((I + TST)(I + (TRT)^{1/2}(I + TST)^{-1}(TRT)^{1/2})\big)^{-1/2} \\ & \qquad \times e^{-\frac{1}{2}\langle (I + T(S+R)T)^{-1}T(a-b),\,T(a-b)\rangle_X}.\end{aligned}$$

These results outline the geometry of Gaussian measures with respect to the distance induced by the SE-$T$ kernel. We see that the means only occur in the formula through their difference, and if both mean elements are zero then the distance is measured purely in terms of the spectrum of the covariance operators.

Corollary 3.
Under the assumptions of Theorem 7, if $T$, $S$, $R$ commute then
$$\mathrm{MMD}_{k_T}^2(P,Q) = \det(I + 2TST)^{-1/2} + \det(I + 2TRT)^{-1/2} - 2\det(I + T(S+R)T)^{-1/2}\,e^{-\frac{1}{2}\langle (I + T(S+R)T)^{-1}T(a-b),\,T(a-b)\rangle_X}.$$
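As a numerical sanity check of this commuting-case formula in a finite-dimensional truncation, the following sketch compares it against a plain Monte Carlo estimate of (2) for diagonal $T$, $S$, $R$; all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
t = np.full(d, 0.7)                                   # diagonal entries of T
s = 1.0 / np.arange(1, d + 1)                         # spectrum of S
r = 0.5 / np.arange(1, d + 1)                         # spectrum of R
a, b = np.zeros(d), 0.3 * np.ones(d)                  # mean elements

def se_kernel(u, v):                                  # SE-T kernel, T diagonal
    return np.exp(-0.5 * np.sum((t * (u - v)) ** 2, axis=-1))

# closed form of Corollary 3
q = 1 + t ** 2 * (s + r)
closed = (np.prod(1 + 2 * t ** 2 * s) ** -0.5
          + np.prod(1 + 2 * t ** 2 * r) ** -0.5
          - 2 * np.prod(q) ** -0.5
            * np.exp(-0.5 * np.sum(t ** 2 * (a - b) ** 2 / q)))

# Monte Carlo estimate of (2)
n = 200_000
X = a + rng.standard_normal((n, d)) * np.sqrt(s)
Y = b + rng.standard_normal((n, d)) * np.sqrt(r)
X2, Y2 = np.roll(X, 1, axis=0), np.roll(Y, 1, axis=0)  # independent pairings
mc = (se_kernel(X, X2).mean() + se_kernel(Y, Y2).mean()
      - 2 * se_kernel(X, Y).mean())
print(closed, mc)                                      # agree up to MC error
```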
Since the variance terms $\xi_1, \xi_2$ from Section 3 are simply multiple integrals of the SE-$T$ kernel against Gaussian measures, we may obtain closed forms for them too. Theorem 8 is a particular instance of the more general Theorem 13 in the Appendix.

Theorem 8. Let $P = N_{0,S}$, $Q = N_{m,S}$ be non-degenerate Gaussian measures on $X$, let $T \in L^+_1(X)$ and assume $T$ and $S$ commute. Then when using the SE-$T$ kernel,
$$\begin{aligned}\xi_1 = {}& 2\det(\Sigma_S)^{-1/2}\big(e^{-\langle (I + 3TST)^{-1}Tm,\,Tm\rangle_X} - e^{-\frac{1}{2}\langle (I + 2TST)\Sigma_S^{-1}Tm,\,Tm\rangle_X}\big) \\ & - 2\det(I + 2TST)^{-1}\big(e^{-\langle (I + 2TST)^{-1}Tm,\,Tm\rangle_X} - e^{-\frac{1}{2}\langle (I + 2TST)^{-1}Tm,\,Tm\rangle_X}\big),\end{aligned}$$
$$\begin{aligned}\xi_2 = {}& 2\det(I + 4TST)^{-1/2}\,e^{-\langle (I + 4TST)^{-1}Tm,\,Tm\rangle_X} \\ & - 2\det(I + 2TST)^{-1}\big(e^{-\langle (I + 2TST)^{-1}Tm,\,Tm\rangle_X} - e^{-\frac{1}{2}\langle (I + 2TST)^{-1}Tm,\,Tm\rangle_X}\big) \\ & - 2\det(\Sigma_S)^{-1/2}\,e^{-\frac{1}{2}\langle (I + 2TST)\Sigma_S^{-1}Tm,\,Tm\rangle_X},\end{aligned}$$
where $\Sigma_S = (I + TST)(I + 3TST)$.

6.3 MMD and Weak Convergence

We know that MMD is a metric on $\mathcal{P}$ when $k$ is characteristic, so it is natural to identify the topology it generates, and in particular how it relates to the standard topology for elements of $\mathcal{P}$, the weak topology.

Theorem 9.
Let $X$ be a Polish space, $k$ a bounded, continuous, characteristic kernel on $X \times X$ and $P \in \mathcal{P}$. Then $P_n \xrightarrow{w} P$ implies $\mathrm{MMD}_k(P_n, P) \to 0$, and if $\{P_n\}_{n=1}^\infty \subset \mathcal{P}$ is tight then $\mathrm{MMD}_k(P_n, P) \to 0$ implies $P_n \xrightarrow{w} P$, where $\xrightarrow{w}$ denotes weak convergence.

For a discussion of weak convergence and tightness see Billingsley [8]. The tightness is used to compensate for the lack of compactness of $X$, which is often assumed in analogous finite dimensional results. In particular, in Chevyrev and Oberhauser [15] an example where $\mathrm{MMD}_k(P_n, P) \to 0$ but $P_n$ does not converge to $P$ was given without the assumption of tightness. A precise characterisation of the relationship between MMD and weak convergence over a Polish space is an open problem.

7 Choosing the Kernel

We now present examples and techniques to choose kernels and construct maps $T$ that are admissible. Two main categories will be discussed: integral operators induced by kernels, and scenario specific kernels.

For the first category, assume $X = Y = L^2(D)$ for some compact $D \subset \mathbb{R}^d$, let $k$ be a measurable kernel over $D \times D$ and set $T = C_k$, where $C_k$ is the covariance operator associated with $k$, see Section 2. We call $k$ an admissible kernel if $C_k$ is admissible. If $k$ is continuous then by Mercer's theorem $C_k x = \sum_{n=1}^\infty \lambda_n \langle x, e_n\rangle e_n$ for some positive sequence $\{\lambda_n\}_{n=1}^\infty$ and orthonormal set $\{e_n\}_{n=1}^\infty$ [74, Chapter 4.5]. To be admissible, $C_k$ needs to be injective, which is equivalent to $\{e_n\}_{n=1}^\infty$ forming a basis [75, Proof of Theorem 3.1]. Call $k$ integrally strictly positive definite (ISPD) if $\int_D \int_D x(s)k(s,t)x(t)\,ds\,dt > 0$ for all non-zero $x \in X$. Recall that if $k$ is translation invariant then by Theorem 10 there exists a measure $\mu_k$ such that $\widehat{\mu}_k(s-t) = k(s,t)$.

Proposition 5.
Let $D \subset \mathbb{R}^d$ be compact and $k$ a continuous kernel on $D$. If $k$ is ISPD then $k$ is admissible. In particular, if $k$ is continuous and translation invariant and $\mu_k$ has full support on $D$, then $k$ is admissible.

For multiple examples of ISPD kernels see Sriperumbudur et al. [73], and of $\mu_k$ see Sriperumbudur et al. [72]. Using the product-to-convolution property of the Fourier transform one can construct $k$ such that $\mu_k$ has full support relatively easily, or modify standard integral operators which are not admissible. For example, for some $F \in \mathbb{N}$ consider the kernel $k_{\cos}^{(F)}(s,t) = \sum_{n=0}^{F-1}\cos(2\pi n(s-t))$ on $[0,1]$, whose spectral measure is a sum of Dirac measures and so does not have full support. If the Dirac measures are convolved with a Gaussian then they would be smoothed out, resulting in a measure with full support. Since convolution in the frequency domain corresponds to a product in the space domain, the new kernel $k_{\text{c-exp}}^{(F,l)}(s,t) = e^{-l^{-2}(s-t)^2}k_{\cos}^{(F)}(s,t)$ satisfies the conditions of Proposition 5. This technique of frequency modification has found success in modelling audio signals [86, Section 3.4].
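A sketch of using such an integral operator inside the SE-$T$ kernel on a uniform grid: $T = C_k$ is applied by quadrature, here with the cosine-exponential kernel above; the bandwidth, truncation and quadrature rule are illustrative assumptions.

```python
import numpy as np

def cexp_kernel(s, t, F=20, l=1.0):
    """Cosine-exponential kernel k_cexp^(F, l)(s, t) on [0, 1]."""
    d = s[:, None] - t[None, :]
    kcos = sum(np.cos(2 * np.pi * n * d) for n in range(F))
    return np.exp(-d ** 2 / l ** 2) * kcos

def se_cov_operator_kernel(x, y, grid, gamma=1.0, **kw):
    """SE-T kernel with T = C_k, applying the quadrature
    (T x)(t) ~ mean_j k(s_j, t) x(s_j) on a uniform grid of [0, 1]."""
    K = cexp_kernel(grid, grid, **kw)
    Tx, Ty = K @ x / len(grid), K @ y / len(grid)
    l2_sq = np.sum((Tx - Ty) ** 2) / len(grid)  # ||Tx - Ty||_{L^2}^2
    return np.exp(-l2_sq / (2 * gamma ** 2))
```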
In general, any operator of the form $Tx = \sum_{n=1}^\infty \lambda_n\langle x, e_n\rangle_X e_n$ for positive, bounded $\{\lambda_n\}_{n=1}^\infty$ and an orthonormal basis $\{e_n\}_{n=1}^\infty$ is admissible, even if it is not induced by a kernel, for example the functional Mahalanobis distance [7].

The second category is scenario specific choices. By this we mean kernels whose structure is specialised to the testing problem at hand. For example, while the kernel two-sample test may be applied for distributions with an arbitrary difference, one may tailor it for a specific testing problem, such as a difference of covariance operator. If one only wishes to test for a difference in covariance operator of the two probability measures, then an appropriate kernel would be $k_{\mathrm{cov}}(x,y) = \langle x, y\rangle_X^2$, which is not characteristic but satisfies $\mathrm{MMD}_{k_{\mathrm{cov}}}(P,Q) = 0$ if and only if $P, Q$ have the same covariance operator. However, a practitioner may want a kernel which emphasises differences in covariance operator, due to prior knowledge regarding the data, while still being able to detect an arbitrary difference, in case the difference is more complicated than initially thought. We now present examples of $T$ which do this. To emphasise higher order moments, let $X \subset L^4(D)$ and let $Y$ be the direct sum of $L^2(D)$ with itself, equipped with the norm $\|(x, x')\|_Y^2 = \|x\|_{L^2(D)}^2 + \|x'\|_{L^2(D)}^2$, and set $T(x) = (x, x^2)$. This map captures second order differences and first order differences individually, as opposed to a polynomial map which combines them. Alternatively, one might be in a scenario where the difference in the distributions is presumed to be in the lower frequencies. In this case a map of the form $T(x) = \sum_{n=1}^F \lambda_n \langle x, e_n\rangle_X e_n$ could be used for decreasing, positive $\lambda_n$ and some orthonormal $e_n$. This will not be characteristic, since $T$ only acts on $F$ frequencies, however if $F$ is picked large enough then good performance could still be obtained in practice. For example, $\lambda_n, e_n$ could be calculated empirically from the function samples using functional principal component analysis, and $F$ could be picked so that the components explain a certain percentage of total variance. See Horváth and Kokoszka [32] for a deeper discussion of functional principal components and their central role in functional data analysis.

All of the choices outlined above have associated hyperparameters; for example, if $T = C_k$ then hyperparameters of $k$, such as the bandwidth, are hyperparameters of $T$. It is outside the scope of this paper to investigate new methods to choose these parameters, but we believe it is important future work. Multiple methods for finite dimensional data have been proposed using the surrogates for test power outlined in Section 4 [78, 29, 42], which could have potential for use in the infinite dimensional scenario.

8 Numerical Experiments

In this section we perform numerical simulations on real and synthetic data to reinforce the theoretical results. Code is available at https://github.com/georgewynne/Kernel-Functional-Data.

We first verify the power scaling of the mean-shift two-sample test for functional data discussed in Section 4. Specifically, we perform the two-sample test using the SE-$I$ kernel with $x \sim \mathrm{GP}(0, k_l)$ and $y \sim \mathrm{GP}(m, k_l)$, where $m(t)$ is a small constant mean shift for $t \in [0,1]$ and $k_l(s,t) = e^{-l^{-1}(s-t)^2}$, with the same number of samples from each distribution. This is repeated to calculate power, with permutations used in the bootstrap to simulate the null. The observation points are a uniform grid on $[0,1]$ with $N$ points, meaning $N$ will be the dimension of the observed discretised function vectors. The parameter $l$ dictates the dependency of the fluctuations. Small $l$ means less dependency between the random function values, so the covariance matrix is closer to the identity. When the random functions are $m$ with $N(0,1)$ i.i.d. corruption, the corresponding value of $l$ is zero, which essentially means $k_l(x,y) = \delta_{xy}$.
In thiscase the scaling of power is expected to follow (6) and grow asymptotically as √ N .On the other hand if l > the fluctuations within each random function are dependentand we expect scaling as (7) which does not grow asymptotically with N .Figure 1 confirms this theory showing that power increases with a finer observationmesh only when there is no dependence in the random functions values. We see someincrease of power as the mesh gets finer for the case of small dependency however therate of increase is much smaller than the i.i.d. setting. N . . . . . . . . . P o w e r i.i.d. l = 0 . · − l = 1 · − l = 2 · − l = 4 · − l = 8 · − Figure 1: Test power as mesh size decreases given different point dependency strengths
The tests are all performed using the k cov kernel from Section 7 and the SE- T kernel forfour different choices of T and unless stated otherwise Y = L ([0 , and we use theshort hand L for L ([0 , and n x , n y will denote the sample sizes of the two samples.To calculate power each test is repeated times and permutations are used inthe bootstrap to simulate the null distribution.COV will denote the k cov kernel, which can only detect difference of covarianceoperator. ID will denote T = I . CEXP will denote T = C k with k = k c-exp (20 , √ the cosine exponential kernel. SQR will denote T ( x ) = ( x, x ) with Y the di-rect sum of L ([0 , with itself as detailed in Section 7. FPCA will denote T x = (cid:80) Fn =1 λ / n (cid:104) x, e n (cid:105) e n where λ n , e n are empirical functional principal components andprincipal values computed from the union of the two collections of samples with F chosen such that of variance is explained. The abbreviations are summarised inTable 1 along with references to the other tests being compared against.For the four uses of the SE- T kernel exp ( − γ (cid:107) T ( x ) − T ( y ) (cid:107) Y ) we use, forall but SQR scenario, the median heuristic γ = Median (cid:8) (cid:107) T ( a ) − T ( b ) (cid:107) Y : a, b ∈{ x i } n X i =1 ∪ { y i } n Y i =1 , a (cid:54) = b (cid:9) . As the SQR scenario involves two norms in the ex-ponent two calculations of median heuristic are needed so that the kernel used is exp( − γ (cid:107) x − y (cid:107) L − γ (cid:107) x − y (cid:107) L ) with γ j = Median (cid:8) (cid:107) a j − b j (cid:107) L : a, b ∈{ x i } n X i =1 ∪ { y i } n Y i =1 , a (cid:54) = b (cid:9) for j = 1 , . Difference of Mean
We compare to the Functional Anderson-Darling (FAD) test in Pomann et al. [56]which involves computing functional principal components and then doing multipleAnderson-Darling tests. Independent realisations { x i } n x i =1 and { y j } n y j =1 of the ran-dom functions x, y over [0 , are observed on a grid of uniform points with n x = n y = 100 and observation noise N (0 , . . The two distributions are x ( t ) ∼ t + ξ √ πt ) + ξ √ πt ) ,y ( t ) ∼ t + δt + η √ πt ) + η √ πt ) , T kernel, T = I Section 5FPCA SE- T kernel, T based on functional principle components Section 7SQR SE- T kernel, T squaring feature expansion Section 7CEXP SE- T kernel, T based on the cosine-exponential kernel Section 7COV Covariance kernel k ( x, y ) = < x, y > X Section 7FAD Functional Anderson-Darling [56]CVM Functional Cramer-von Mises [30]BOOT-HS Bootstrap Hilbert-Schmidt [53]FPCA- χ Functional Principal Component χ [24]Table 1: Summary of two-sample tests and kernels used in numerical experiments with ξ , η i.i.d ∼ N (0 , and ξ , η i.i.d ∼ N (0 , . The δ parameter measures thedeviation from the null hypothesis that x, y have the same distribution. The range ofthe parameter is δ ∈ { , . , , . , } .Figure 2 shows CEXP performing best among all the choices which makes sensesince this choice explicitly smooths the signal to make the mean more identifiable com-pared to the noise. We see that FPCA performs poorly because the principal compo-nents are deduced entirely from the covariance structure and do not represent the meandifference well. Likewise COV performs poorly since it can only detect difference incovariance, not mean. Except from FPCA and COV all choices of T out perform theFAD method. This is most likely because the FAD method involves computing mul-tiple principle components, an estimation which is inherently random, and computesmultiple FAD tests with a Bonferroni correction which can cause too harsh a require-ment for significance. There is a slight inflation of test size, meaning rejection is mildlylarger than when the null hypothesis is true.Figure 3 shows an ROC curve. On the x -axis is α the false positive rate parameterin the test, see Section 3, and on the y -axis is the power of the test, meaning the truepositive rate. The plot was obtained using δ = 1 . , n x = 50 , n y = 50 with the sameobservation locations and noise as described above. The dashed line is y = x whichcorresponds to a test with trivial performance. We see that COV and FPCA performstrivially weakly implying the calculated principal components are uninformative foridentifying the difference in mean. CEXP has the best curve and the other three choicesof T perform equally well. Difference of Variance
We investigate two synthetic data sets, the first from Pomann et al. [56] and the secondfrom Paparoditis and Sapatinas [53]. The first represents a difference in covariance ina specific frequency and the second a difference across all frequencies.In the first data set n x = n y = 100 , observations are made on a uniform grid of . . . . . δ . . . . . . . . . . P o w e r IDCEXPSQRFPCACOVFAD
Figure 2: Test power under mean difference for different kernels. .
05 0 .
15 0 .
25 0 .
35 0 .
45 0 .
55 0 .
65 0 .
75 0 .
85 0 . α . . . . . . . . . . P o w e r IDCEXPSQRFPCACOV
Figure 3: ROC curve for different kernels. points and the observation noise is N (0 , . . The two distributions are x ( t ) ∼ ξ √ πt ) + ξ √ πt ) ,y ( t ) ∼ η δ √ πt ) + η √ πt ) , with ξ , η i.i.d ∼ N (0 , and ξ ∼ N (0 , and η δ ∼ N (0 ,
10 + δ ) . Therefore thedifference in covariance structure is manifested in the first frequency. The range of theparameter is δ ∈ { , , , , } and we again compare against the FAD test.Figure 4 shows that COV performs the best which is to be expected since it isspecifically designed to only detect change in covariance. SQR and FPCA performwell since they are designed to capture covariance information too. CEXP performsalmost identically to ID since it designed to improve performance on mean shift tests,not covariance shift. δ . . . . . . . . . . P o w e r IDCEXPSQRFPCACOVFAD
Figure 4: Test power under variance difference in one frequency for different kernels.
The second dataset is from Paparoditis and Sapatinas [53] and we compare againstthe data reported there of a bootstrap Hilbert-Schmidt norm (BOOT-HS) test [53, Sec-tion 2.2] and a functional principal component chi-squared (FPCA- χ ) test [24], whichis similar to the test in Panaretos et al. [52]. The number of function samples is n x = n y = 25 and each sample is observed on a uniform grid over [0 , consisting of points. The first distribution is defined as x ( t ) ∼ (cid:88) n =1 ξ n n − √ πnt ) + η n n − √ πnt ) , where ξ n , η n are i.i.d. Student’s t -distribution random variables with degrees of free-dom. For δ ∈ R the other function distribution is y ∼ δx (cid:48) where x (cid:48) is an i.i.d. copy of x . When δ = 1 the two distributions are the same. The entire covariance structure of Y is different from that of X when δ (cid:54) = 1 which is in contrast the previous numericalexample where the covariance structure differed at only one frequency. The range ofthe deviation parameter is δ ∈ { , . , . , . , . , } .Figure 5 shows again that COV and SQR performs the best. The BOOT-HS andFPCA- χ tests are both conservative, providing rejection rates below when the nullis true as opposed to the kernel based tests which all lie at or very close to the level. Difference of Higher Orders
Data from Hall and Keilegom [30] is used when performing the test. The randomfunctions x, y are distributed as x ( t ) ∼ (cid:88) n =1 e − n ξ xn ψ n ( t ) ,y ( t ) ∼ (cid:88) n =1 e − n ξ yn, ψ n ( t ) + δ (cid:88) n =1 n − ξ yn, ψ ∗ n ( t ) , . . . . . . δ . . . . . . . . . . P o w e r IDCEXPSQRFPCACOVBOOT-HSFPCA- χ Figure 5: Test power under variance difference across all frequencies for different kernels. with ξ xn , ξ yn, , ξ yn, i.i.d ∼ N (0 , , ψ ( t ) = 1 , ψ n ( t ) = √ k − πt ) for n > and ψ ∗ ( t ) = 1 , ψ ∗ n ( t ) = √ k − π (2 t − if n > is even, ψ ∗ n ( t ) = √ k − π (2 t − if n > is odd. The observation noise for x is N (0 , . and for y is N (0 , . . The range of the parameter is δ ∈ { , , , , } and we compare againstthe FAD test and the Cramer-von Mises test in Hall and Keilegom [30]. The numberof samples is n x = n y = 15 and for each random function observation locationsare sampled randomly according to p x or p y with p x being the uniform distribution on [0 , and p y the distribution with density function . . t on [0 , .Since the data is noisy and irregularly sampled, curves were fit to the data before thetest was performed. The posterior mean of a Gaussian process with noise parameter σ = 0 . was fit to each data sample using a Mat´ern- . kernel k Mat ( s, t ) = (1 + √ s − t )) e −√ s − t ) .Figure 6 shows that the COV, SQR perform the best with other choices of T per-forming equally. Good power is still obtained against the existing methods despite thefunction reconstructions, validating the theoretical results of Section 6. Berkeley Growth Data
We now perform tests on the Berkeley growth dataset which contains the height of male and female children from age to and locations. The data can be foundin the R package fda. We perform the two sample test on this data for the five differentchoices of T with γ chosen via the median heuristic outlined in the previous subsec-tion. To identify the effect on test performance of sample size we perform randomsubsampling of the datasets and repeat the test to calculate test power. For each samplesize M ∈ { , , , } we sample M functions from each data set and perform thetest, this is repeated times to calculate test power. The results are plotted in Figure7. Similarly, to investigate the size of the test we sample two disjoint subsets of size δ . . . . . . . . . . P o w e r IDCEXPSQRFPCACOVFADCVM
Figure 6: Test power under difference of higher orders for different kernels. M ∈ { , , } from the female data set and perform the test and record whether thenull was incorrectly rejected, this is repeated times to obtain a rate of incorrectrejection of the null, the results are reported in Table 2.Figure 7 shows ID, SQR performing the best, COV performs weaker than CEXPsuggesting that it is not just a difference of covariance operator that distinguishes thetwo samples. Table 2 shows nearly all the tests have the correct empirical size. M . . . . . . . . . . P o w e r IDCEXPSQRFPCACOV
Figure 7: Test power under subsamples of size M using Berkeley growth data. NEU Steel Data
We perform the two-sample test on two classes from the North Eastern University steeldefect dataset [70, 31, 20]. The dataset consists of × pixel grey scale imagesof defects of steel surfaces with different classes of defects and images in each
26 ID CEXP COV SQR FPCA5 5.0 4.8 4.2 5.6 5.015 4.4 4.6 4.4 5.2 4.625 5.0 4.6 5.4 5.8 5.6Table 2: Empirical size, meaning the rate of rejection of the null when the null is true, of thetwo-sample test performed on the female Berkeley growth data for different sample sizesacross different choices of T . The values are written as percentages, a value above showstoo frequent rejection and below shows too conservative a test. class. We perform the test on the two classes which are most visually similar, calledrolled-in scale and crazing. See the URL [71] for further description of the dataset.For each sample size M ∈ { , , , } we sample M images from each class andperform the test, this is repeated times to calculate test power. Again we assessthe empirical size by sampling two distinct subsets from one class, the rolled-in class,for sample sizes M ∈ { , , , } and repeat this times and report the rate ofincorrect null rejection. For CEXP we use the two dimensional tensor product kernelinduced by the CEXP kernel with frequencies and l = √ / to normalise thesize of the images.Figure 8 shows SQR having the best performance, CEXP performs well and so doesID. Table 3 shows that the empirical size is inflated under some choices of T especiallyCEXP. Once the test is performed with samples the sizes return to an acceptablelevel for SQR and FPCA. This inflation of empirical size should be taken into accountwhen viewing the powers of the tests.
10 20 30 40 M . . . . . . . . . . P o w e r IDCEXPSQRFPCACOV
Figure 8: Test power under subsamples of size M using NEU steel data.27 ID CEXP COV SQR FPCA10 4.2 4.2 5.0 5.8 5.820 4.8 4.8 6.6 5.8 6.230 5.2 6.4 6.0 6.2 6.640 6.4 7.0 5.2 4.4 4.8Table 3: Empirical size, meaning the rate of rejection of the null when the null is true, of thetwo-sample test performed on the rolled-in scale class from the NEU steel defect data, fordifferent sample sizes across different choices of T . The values are written as percentages,a value above shows too frequent rejection and below shows too conservative a test. In this paper we studied properties of kernels on real, separable Hilbert spaces, andthe associated Maximum Mean Discrepancy distance. Based on this, we formulated anovel kernel-based two-sample testing procedure for functional data. The developmentof kernels on Hilbert spaces was motivated by the observation that certain scaling ofkernel parameters in the finite dimensional regime can result in a kernel over a Hilbertspace. Indeed, multiple theoretical properties emerged as natural infinite dimensionalgeneralisations of the finite dimensional case. The development of kernels defined di-rectly over Hilbert spaces facilitates the use of hyperparameters adapted for functionaldata, such as the choice of T in the SE- T kernel, which can result in greater test power.While other nonparametric two-sample tests for functional data have been proposedrecently, we believe that kernel-based approaches offer unique advantages. In particu-lar, the ability to choose the kernel to reflect a priori knowledge about the data, such asany underlying dependencies, or to emphasise at which spatial scales the comparisonshould be made between the samples can be of significant benefit to practitioners. Theconstruction of kernels which are tailor-made for two-sample testing of specific formsof functional data, for example time series and spatial data, is an interesting and openquestion, which we shall defer to future work.The theory of kernels on function spaces is of independent interest and our workhighlights how existing results on probability measures on infinite dimensional spacescan be applied to kernel methods, for example the use of the Minlos-Sazanov theo-rem when proving characteristicness. Recent theoretical developments relating to ker-nels and MMD on general topological spaces in the absence of compactness or localcompactness have revealed the challenges in establishing important properties of suchmetrics, for example determining weak convergence of sequences of probability mea-sures [67, 68, 15], which have important implications for the development of effectiveMMD-based tests for functional data.A further application of kernels on function spaces is statistical learning of mapsbetween function spaces, in particular, the challenge of learning surrogate models forlarge-scale PDE systems which can be viewed as a nonlinear deterministic maps froman input function space, initial or boundary data, to an output function space, a systemresponse. Here there are fundamental challenges to be addressed relating to the univer- ality properties of such kernels. Preliminary work in Nelsen and Stuart [50] indicatesthat this is a promising direction of research. Acknowledgments
GW was supported by an EPSRC Industrial CASE award [18000171] in partnershipwith Shell UK Ltd. AD was supported by the Lloyds Register Foundation Programmeon Data Centric Engineering and by The Alan Turing Institute under the EPSRC grant[EP/N510129/1]. We thank Sebastian Vollmer for helpful comments.
References [1] S. Albeverio and S. Mazzucchi. An introduction to infinite-dimensional oscillatory and probabilistic integrals. In
Stochastic Analysis: A Series of Lectures , pages 1–54. Springer Basel, 2015.[2] A. Aue, G. Rice, and O. S¨onmez. Detecting and dating structural breaks in functional data without dimensionreduction.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 80(3):509–529, 2018.[3] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions.
The Journal ofMachine Learning Research , 18(1):714–751, 2017.[4] M. Benko, W. H¨ardle, and A. Kneip. Common functional principal components.
The Annals of Statistics , 37(1):1–34, 2009.[5] C. Berg, J. P. R. Christensen, and P. Ressel.
Harmonic Analysis on Semigroups . Springer New York, 1984.[6] A. Berlinet and C. Thomas-Agnan.
Reproducing Kernel Hilbert Spaces in Probability and Statistics . SpringerUS, 2004.[7] J. R. Berrendero, B. Bueno-Larraz, and A. Cuevas. On Mahalanobis distance in functional settings.
Journal ofMachine Learning Research , 21(9):1–33, 2020.[8] P. Billingsley.
Weak Convergence of Measures . Society for Industrial and Applied Mathematics, 1971.[9] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Scholkopf, and A. J. Smola. Integrating structuredbiological data by kernel maximum mean discrepancy.
Bioinformatics , 22(14):49–57, 2006.[10] B. Bucchia and M. Wendler. Change-point detection and bootstrap for Hilbert space valued random fields.
Journal of Multivariate Analysis , 155:344–368, 2017.[11] A. Cabana, A. M. Estrada, J. Pena, and A. J. Quiroz. Permutation tests in the two-sample problem for functionaldata. In
Functional Statistics and Related Fields , pages 77–85. Springer, 2017.[12] C. Carmeli, E. de Vito, A. Toigo, and V. Umanit`a. Vector valued reproducing kernel hilbert spaces and univer-sality.
Analysis and Applications , 08(01):19–61, Jan. 2010.[13] S. Chakraborty and X. Zhang. A new framework for distance and kernel-based metrics in high dimensions. arXiv:1909.13469 , 2019.[14] H. Chen, P. T. Reiss, and T. Tarpey. Optimally weighted L2 distance for functional data.
Biometrics , 70(3):516–525, 2014.[15] I. Chevyrev and H. Oberhauser. Signature moments to characterize laws of stochastic processes. arXiv:1810.10971 , 2018.[16] A. Christmann and I. Steinwart. Universal kernels on non-standard input spaces.
Advances in Neural InformationProcessing Systems 23 , pages 406–414, 2010.[17] A. Cuevas. A partial overview of the theory of statistics with functional data.
Journal of Statistical Planning andInference , 147:1–23, 2014.[18] G. Da Prato.
An Introduction to Infinite-Dimensional Analysis . Springer Berlin Heidelberg, 2006.[19] G. Da Prato and J. Zabczyk.
Second Order Partial Differential Equations in Hilbert Spaces . Cambridge Univer-sity Press, 2002.[20] H. Dong, K. Song, Y. He, J. Xu, Y. Yan, and Q. Meng. PGA-Net: pyramid feature fusion and global contextattention network for automated surface defect detection.
IEEE Transactions on Industrial Informatics , 2019.[21] S. N. Ethier and T. G. Kurtz, editors.
Markov Processes . John Wiley & Sons, Inc., 1986.[22] G. Fasshauer and M. McCourt.
Kernel-based Approximation Methods using MATLAB . World Scientific, June2014.[23] F. Ferraty and P. Vieu. Curves discrimination: a nonparametric functional approach.
Computational Statistics &Data Analysis , 44(1-2):161–173, 2003.[24] S. Fremdt, J. G. Steinbach, L. Horv´ath, and P. Kokoszka. Testing the equality of covariance operators in functionalsamples.
Scandinavian Journal of Statistics , 40(1):138–152, 2012.[25] J. H. Friedman and L. C. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sampletests.
The Annals of Statistics , pages 697–717, 1979.[26] T. G¨artner. A survey of kernels for structured data.
ACM SIGKDD Explorations Newsletter , 5(1):49, 2003.
27] A. Gretton, K. Borgwardt, M. Rasch, B. Sch¨olkopf, and A. J. Smola. A kernel method for the two-sample-problem.
Advances in Neural Information Processing Systems 19 , pages 513–520, 2007.[28] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch¨olkopf, and A. Smola. A kernel two-sample test.
Journal ofMachine Learning Research , 13(1):723–773, 2012.[29] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur.Optimal kernel choice for large-scale two-sample tests.
Advances in Neural Information Processing Systems 25 ,pages 1205–1213, 2012.[30] P. Hall and I. V. Keilegom. Two-sample tests in functional data analysis starting from discrete data.
StatisticaSinica , 17(4):1511–1531, 2007. ISSN 10170405, 19968507.[31] Y. He, K. Song, Q. Meng, and Y. Yan. An end-to-end steel surface defect detection approach via fusing multiplehierarchical features.
IEEE Transactions on Instrumentation and Measurement , 69(4):1493–1504, 2020.[32] L. Horv´ath and P. Kokoszka.
Inference For Functional Data With Applications . Springer Science & BusinessMedia, 2012.[33] L. Horv´ath, P. Kokoszka, and R. Reeder. Estimation of the mean of functional time series and a two-sampleproblem.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 75(1):103–122, 2012.[34] L. Horv´ath, P. Kokoszka, and G. Rice. Testing stationarity of functional time series.
Journal of Econometrics ,179(1):66–82, 2014.[35] T. Hsing and R. Eubank.
Theoretical Foundations of Functional Data Analysis, with an Introduction to LinearOperators . John Wiley & Sons, Ltd, 2015.[36] W. Jitkrittum, W. Xu, Z. Szabo, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test.
Advancesin Neural Information Processing Systems 30 , pages 262–271, 2017.[37] H. Kadri, E. Duflos, P. Preux, S. Canu, A. Rakotomamonjy, and J. Audiffren. Operator-valued kernels for learningfrom functional response data.
Journal of Machine Learning Research , 17(20):1–54, 2016.[38] M. Kanagawa, P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur. Gaussian processes and kernel methods: Areview on connections and equivalences. arXiv:1807.02582 , 2018.[39] A. S. Kechris.
Classical Descriptive Set Theory . Springer New York, 1995.[40] P. Kidger, P. Bonnier, I. Perez Arribas, C. Salvi, and T. Lyons. Deep signature transforms. In
Advances in NeuralInformation Processing Systems 32 , pages 3105–3115. Curran Associates, Inc., 2019.[41] A. Kolmogorov-Smirnov, A. Kolmogorov, and M. Kolmogorov. Sulla determinazione empirica di uma legge didistribuzione.
Giornale dell’Istituto Italiano degli Attuari , 1933.[42] F. Liu, W. Xu, J. Lu, G. Zhang, A. Gretton, and D. J. Sutherland. Learning deep kernels for non-parametrictwo-sample tests. In
Proceedings of the 37th International Conference on Machine Learning , 2020.[43] M. Lopes, L. Jacob, and M. J. Wainwright. A more powerful two-sample test in high dimensions using randomprojection.
Advances in Neural Information Processing Systems , pages 1206–1214, 2011.[44] S. Maniglia and A. Rhandi. Gaussian measures on separable hilbert spaces and applications, 2004.[45] K. Mardia and I. Dryden. The statistical analysis of shape data.
Biometrika , 76(2):271–281, 1989.[46] H. Q. Minh. Some properties of Gaussian reproducing kernel Hilbert spaces and their implications for functionapproximation and learning theory.
Constructive Approximation , 32(2):307–338, 2009.[47] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Sch¨olkopf. Kernel mean embedding of distributions: Areview and beyond.
Foundations and Trends® in Machine Learning , 10(1-2):1–141, 2017.[48] A. M¨uller. Integral probability metrics and their generating classes of functions.
Advances in Applied Probability ,29(2):429–443, 1997.[49] F. J. Narcowich, J. D. Ward, and H. Wendland. Sobolev error estimates and a Bernstein inequality for scattereddata interpolation via radial basis functions.
Constructive Approximation , 24(2):175–186, 2006.[50] N. H. Nelsen and A. M. Stuart. The random feature model for input-output maps between Banach spaces. arXiv:2005.10224 , 2020.[51] I. Nourdin and G. Peccati.
Normal Approximations with Malliavin Calculus . Cambridge University Press, 2009.[52] V. M. Panaretos, D. Kraus, and J. H. Maddocks. Second-order comparison of Gaussian random functions and thegeometry of DNA minicircles.
Journal of the American Statistical Association , 105(490):670–682, 2010.[53] E. Paparoditis and T. Sapatinas. Bootstrap-based testing of equality of mean functions or equality of covarianceoperators for functional data.
Biometrika , 103(3):727–733, 2016.[54] V. I. Paulsen and M. Raghupathi.
An introduction to the theory of reproducing kernel Hilbert spaces , volume 152of
Cambridge Studies in Advanced Mathematics . Cambridge University Press, Cambridge, 2016.[55] B. Pelletier. Kernel density estimation on riemannian manifolds.
Statistics & Probability Letters , 73(3):297–304,2005.[56] G.-M. Pomann, A.-M. Staicu, and S. Ghosh. A two-sample distribution-free test for functional data with appli-cation to a diffusion tensor imaging study of multiple sclerosis.
Journal of the Royal Statistical Society: Series C(Applied Statistics) , 65(3):395–414, Jan. 2016.[57] A. Rahimi and B. Recht. Random features for large-scale kernel machines.
Advances in Neural InformationProcessing Systems 20 , pages 1177–1184, 2008.[58] A. Ramdas, S. J. Reddi, B. P´oczos, A. Singh, and L. Wasserman. On the decreasing power of kernel anddistance based nonparametric hypothesis tests in high dimensions. In
Twenty-Ninth AAAI Conference on ArtificialIntelligence , 2015.[59] A. Ramdas, S. J. Reddi, B. P´oczos, A. R. Singh, and L. A. Wasserman. On the high-dimensional power of inear-time kernel two-sample testing under mean-difference alternatives. In , 2015.[60] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning . MIT Press, 2006.[61] S. Saitoh and Y. Sawano.
Theory of reproducing kernels and applications . Springer, 2016.[62] P. Schmid. On the Kolmogorov and Smirnov limit theorems for discontinuous distribution functions.
The Annalsof Mathematical Statistics , 29(4):1011–1027, 1958.[63] I. J. Schoenberg. Metric spaces and completely monotone functions.
The Annals of Mathematics , 39(4):811,1938.[64] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-basedstatistics in hypothesis testing.
The Annals of Statistics , 41(5):2263–2291, 2013.[65] R. J. Serfling.
Approximation Theorems of Mathematical Statistics . John Wiley & Sons, Inc., Nov. 1980.[66] J. Shawe-Taylor and N. Cristianini.
Kernel Methods for Pattern Analysis . Cambridge University Press, 2004.[67] C.-J. Simon-Gabriel and B. Sch¨olkopf. Kernel distribution embeddings: Universal kernels, characteristic kernelsand kernel metrics on distributions.
Journal of Machine Learning Research , 19(44):1–29, 2018.[68] C.-J. Simon-Gabriel, A. Barp, and L. Mackey. Metrizing weak convergence with Maximum Mean Discrepancies. arXiv:2006.09268 , 2020.[69] N. Smirnov. Table for estimating the goodness of fit of empirical distributions.
The Annals of MathematicalStatistics , 19(2):279–281, 1948.[70] K. Song and Y. Yan. A noise robust method based on completed local binary patterns for hot-rolled steel stripsurface defects.
Applied Surface Science , 285:858–864, 2013.[71] K. Song and Y. Yan. NEU Steel Dataset Description. http://faculty.neu.edu.cn/yunhyan/NEU_surface_defect_database.html , 2020. Last accessed 29/07/2020.[72] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Sch¨olkopf, and G. R. Lanckriet. Hilbert space embeddingsand metrics on probability measures.
Journal of Machine Learning Research , 11:1517–1561, 2010. ISSN 1532-4435.[73] B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHSembedding of measures.
Journal of Machine Learning Research , 12:2389–2410, 2011. ISSN 1532-4435.[74] I. Steinwart and A. Christmann.
Support Vector Machines . Springer, 2008.[75] I. Steinwart and C. Scovel. Mercer’s theorem on general domains: On the interaction between measures, kernels,and RKHSs.
Constructive Approximation , 35(3):363–417, 2012.[76] T. Sullivan.
Introduction to Uncertainty Quantification . Springer International Publishing, 2015.[77] D. J. Sutherland. Unbiased estimators for the variance of MMD estimators. arXiv:1906.02104 , 2019.[78] D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton. Generative models andmodel criticism via optimized maximum mean discrepancy. In
Proceedings of the 8th International Conferenceon Learning Representations , 2016.[79] G. J. Sz´ekely. E-statistics: The energy of statistical samples.
Bowling Green State University, Department ofMathematics and Statistics Technical Report , 3(05):1–18, 2003.[80] G. J. Sz´ekely and M. L. Rizzo. Testing for equal distributions in high dimension.
InterStat , 5(16.10):1249–1272,2004.[81] N. N. Vakhania, V. I. Tarieladze, and S. A. Chobanyan.
Probability Distributions on Banach Spaces . SpringerNetherlands, 1987.[82] A. W. van der Vaart and J. H. van Zanten. Information rates of nonparametric Gaussian process methods.
Journalof Machine Learning Research , 12:2095–2119, 2011.[83] G. Wahba.
Spline models for observational data , volume 59. Siam, 1990.[84] A. Wald and J. Wolfowitz. On a test whether two samples are from the same distribution.
The Annals ofMathematical Statistics , 11:147–162, 1940.[85] H. Wendland.
Scattered Data Approximation . Cambridge University Press, 2005.[86] W. J. Wilkinson.
Gaussian Process Modelling for Audio Signals . PhD thesis, Queen Mary Univeristy London,2019.[87] G. Wynne, F.-X. Briol, and M. Girolami. Convergence guarantees for Gaussian process means with misspecifiedlikelihoods and smoothness. arXiv:2001.10818 , 2020.[88] H. Zhang, Y. Xu, and Q. Zhang. Refinement of operator-valued reproducing kernels.
Journal of MachineLearning Research , 13(4):91–136, 2012.[89] J.-T. Zhang, X. Liang, and S. Xiao. On the two-sample Behrens-Fisher problem for functional data.
Journal ofStatistical Theory and Practice , 4(4):571–587, 2010.[90] C. Zhu, S. Yao, X. Zhang, and X. Shao. Distance-based and RKHS-based dependence metrics in high dimension.
Annals of Statistics , 2019. To appear. Appendix
A.1 Bochner and Minlos-Sazanov Theorem
Bochner’s theorem provides an exact relationship between continuous, translation in-variant kernels on R d , meaning k ( x, y ) = φ ( x − y ) for some continuous φ , and theFourier transforms of finite Borel measures on R d . For a proof see Wendland [85,Theorem 6.6]. Theorem 10 ( Bochner ) . A continuous function φ : R d → C is positive definite if andonly if it is the Fourier transform of a finite Borel measure µ φ on R d ˆ µ φ ( x ) := (cid:90) R d e ix T y dµ φ ( y ) = φ ( x ) . Bochner’s theorem does not continue to hold in infinite dimensions, for example thekernel k ( x, y ) = e − (cid:107) x − y (cid:107) X when X is an infinite dimensional, real, separable Hilbertspace is not the Fourier transform of a finite Borel measure on X [44, Proposition1.2.11]. Instead, a stronger continuity property is required, this is the content of theMinlos-Sazanov theorem. For a proof see [44, Theorem 1.1.5] or Vakhania et al. [81,Theorem VI.1.1]. Theorem 11 ( Minlos-Sazanov ) . Let X be a real, seperable Hilbert space and φ : X → C a positive definite function on X then the following are equivalent1. φ is the Fourier transform of a finite Borel measure on X
2. There exists C ∈ L +1 ( X ) such that φ is continuous with respect to the norminduced by C given by (cid:107) x (cid:107) C = (cid:104) Cx, x (cid:105) X . The existence of such an operator is a much stronger continuity property than stan-dard continuity on X and will be crucial in the proof of Theorem 12, one of our mainresults. To see that continuity with respect to such a C is stronger than usual conti-nuity consider the following example. Fix any ε > and assume we only know that φ : X → R is continuous and for simplicity assume that φ (0) = 1 , then we know thereexists some δ > such that (cid:107) x (cid:107) X < δ implies | φ ( x ) − | < ε meaning we have controlof φ ( x ) over the bounded set (cid:107) x (cid:107) X < δ . On the other hand, if φ is continuous withrespect to (cid:107)·(cid:107) C for some C ∈ L +1 ( X ) then we know there exists some δ (cid:48) > such that (cid:107) x (cid:107) C < δ (cid:48) implies | φ ( x ) − | < ε so we have control of φ ( x ) over the unbounded set (cid:107) x (cid:107) C < δ (cid:48) . To see this set is unbounded let { λ n , e n } ∞ n =1 be the orthonormal eigen-system of C and for n ∈ N let y n = δ (cid:48) e n λ n if λ n > otherwise y n = ne n , then since C ∈ L +1 ( X ) we know λ n → so (cid:107) y n (cid:107) X → ∞ . Since we used elements from theeigensystem it is clear that (cid:107) y n (cid:107) C ≤ δ (cid:48) / . Therefore we have constructed a subset ofthe ball with respect to (cid:107)·(cid:107) C of radius δ (cid:48) that has unbounded norm. A.2 Proofs for Section 4
Proof of Proposition We begin by outlining the scaling that occurs for each of thetwo cases. For economy of notation we set γ = γ N . If k ( s, t ) = δ st then Tr (Σ i ) = N or all i ∈ N and (cid:104) m N , Σ m N (cid:105) = (cid:107) m N (cid:107) = Θ( N ) by the Riemann scaling. If k is continuous and bounded then Tr (Σ i ) = Θ( N i ) since N − i Tr (Σ i ) → Tr ( C ik ) and (cid:104) m N , Σ i m N (cid:105) = Θ( N i +1 ) since N − ( i +1) (cid:104) m N , Σ i m N (cid:105) → (cid:82) D (cid:82) D m ( s ) C ik m ( t ) dsdt .We shall outline the proof of the result in the second case, the proof for the firstcase is entirely analogous and is completed by substituting the above scaling resultswhere appropriate. Throughout this proof we shall repeatedly use the following Taylorseries approximation, for a positive definite matrix Σ ∈ R N × N , based on the classicalGirard-Waring formulas. det( I + ε Σ) = 1 + ε Tr (Σ) + ε R ( s )= 1 + ε Tr (Σ) + ε Tr (Σ) − Tr (Σ )) + ε R ( s (cid:48) ) , for some < s, s (cid:48) ≤ ε where R ( s ) = (cid:18)(cid:88) λ i (1 + sλ i ) (cid:19) − det( I + s Σ) (cid:88) λ i (1 + sλ i ) , and R ( s ) = − I + s Σ)) (cid:88) λ i (1 + sλ i ) (cid:88) λ i (1 + sλ i ) + 2 det( I + s Σ) (cid:88) λ i (1 + sλ i ) , where { λ , . . . , λ N } are the eigenvalues of Σ . We may bound these remainders R ( s ) ≤ (cid:18)(cid:88) λ i (1 + sλ i ) (cid:19) ≤ Tr (Σ) = O ( N ) , and R ( s ) ≤ I + s Σ) (cid:88) λ i (1 + sλ i ) ≤ e ε Tr (Σ) Tr (Σ ) . In our scenario ε = γ − = O ( N − α ) = o ( N − ) therefore e ε Tr (Σ) is bounded sinceTr (Σ) = O ( N ) . Finally, Tr (Σ ) = O ( N ) therefore R ( s ) = O ( N ) . Setting T = γ I and R = S = Σ in Corollary 3 and applying the above Taylor expansion along withthe expansions for x − / and ( I + ε Σ) − with Lagrange remainder termMMD k ( P, Q ) = 2 det( I + 2Σ /γ ) − (cid:16) − e − γ (cid:104) ( I +2Σ /γ ) − µ,µ (cid:105) (cid:17) = (cid:18) γ r (cid:19) (cid:18) γ (cid:10) ( I + 2Σ /γ ) − µ, µ (cid:11) + 1 γ r (cid:19) = (cid:18) γ r (cid:19) (cid:18) γ (cid:104) µ, µ (cid:105) + 1 γ r + 1 γ r (cid:19) , where r , r , r are remainder terms. o leading order r /γ = O ( N − α ) , r /γ = O ( N − α ) , r /γ = O ( N − α ) so it follows that MMD k ( P, Q ) = γ − (cid:107) m N (cid:107) + r ( N ) where, to leading order, r ( N ) = O ( N − α ) . We now consider the denominator. To this end, applying Theorem 8 with T = γ − I and S = Σ we obtain ξ = 2 det( I + 4Σ /γ ) − (cid:16) e − γ (cid:104) µ, ( I +4Σ /γ ) − µ (cid:105) (cid:17) − I + 2Σ /γ ) − (1 + e − γ (cid:104) µ ( I +2Σ /γ ) − µ (cid:105) − e − γ (cid:104) µ ( I +2Σ /γ ) − µ (cid:105) ) − I + Σ /γ )( I + 3Σ /γ )) − / e − γ (cid:104) µ, ( I +2Σ /γ )( I +3Σ /γ ) − ( I +Σ /γ ) − µ (cid:105) . We split this into two terms, ξ = A + A , where A = 4 det( I + 4Σ /γ ) − + 4 det( I + 2Σ /γ ) − − I + Σ /γ )( I + 3Σ /γ )) − / , and A = 2 det( I + 4Σ /γ ) − (cid:16) − e − γ (cid:104) µ, ( I +4Σ /γ ) − µ (cid:105) (cid:17) − I + 2Σ /γ ) − (3 + e − γ (cid:104) µ ( I +2Σ /γ ) − µ (cid:105) − e − γ (cid:104) µ ( I +2Σ /γ ) − µ (cid:105) ) − I + Σ /γ )( I + 3Σ /γ )) − / · ( − e − γ (cid:104) µ, ( I +2Σ /γ )( I +3Σ /γ ) − ( I +Σ /γ ) − µ (cid:105) ) . 
Applying Taylor’s theorem for the determinant and exponential terms we obtain: A = 4 (cid:20) − γ Tr (Σ) + 2 γ ( Tr (Σ)) + 4 γ Tr (Σ ) + r (cid:21) + 4 (cid:20) − γ Tr (Σ) + 2 γ ( Tr (Σ)) + 2 γ Tr (Σ ) + r (cid:21) − (cid:20) − γ Tr (Σ) + 18 γ Tr (Σ) + 14 γ Tr (Σ ) + r (cid:21) × (cid:20) − γ Tr (Σ) + 98 γ ( Tr (Σ)) + 94 γ Tr (Σ ) + r (cid:21) , where to leading order r , r , r , r = O ( N − α + N − α ) . After simplification weobtain A = γ Tr (Σ ) + r ( N ) where r ( N ) = O ( N − α ) . Similarly writing A = B + B + B , we obtain: B = 2 γ (1 − Tr (Σ) /γ + r ) (cid:18) −(cid:104) µ, µ (cid:105) + 4 γ (cid:104) µ, Σ µ (cid:105) + 12 γ (cid:104) µ, µ (cid:105) + r (cid:19) ,B = − γ (1 − Tr (Σ) /γ + r ) (cid:18) (cid:104) µ, µ (cid:105) − γ (cid:104) µ, Σ µ (cid:105) + r (cid:19) ,B = − γ (cid:0) − Tr (Σ) /γ + r (cid:1) (cid:18) − (cid:104) µ, µ (cid:105) + 1 γ (cid:104) µ, Σ µ (cid:105) + 18 γ (cid:104) µ, µ (cid:105) + r (cid:19) , here r , . . . , r are remainder terms which satisfy r i = O ( N − α + N − α ) for i = 5 , . . . , . Expanding B + B + B and collecting into powers of γ , we see thatthe constant terms and the terms with denominator γ cancel out. Collecting terms with γ denominator gives A = γ (cid:104) µ, Σ µ (cid:105) + r ( N ) where r ( d ) = O ( d − α ) is a remainderterm containing the higher order terms. Combining the leading order expressions for A and A , and collecting together the remainder terms we obtainMMD k ( P, Q ) √ ξ = (cid:107) µ (cid:107) /γ + r ( N ) (cid:112) Tr (Σ ) /γ + 4 (cid:104) µ, Σ µ (cid:105) /γ + r (cid:48) ( N ) , where r ( N ) = O ( N − α ) and r (cid:48) ( N ) = O ( N − α ) It follows thatMMD k ( P, Q ) √ ξ = (cid:107) µ (cid:107) (cid:112) Tr (Σ ) + 4 (cid:104) µ, Σ µ (cid:105) r ( N ) (cid:107) µ (cid:107) /γ (cid:113) r (cid:48) ( N )4 Tr (Σ ) /γ +4 (cid:104) µ, Σ µ (cid:105) /γ . (11)As discussed at the start of the proof (cid:107) m N (cid:107) = Θ( N ) and Tr (Σ ) , (cid:104) m N , Σ m N (cid:105) =Θ( N ) meaning that γ r ( N ) / (cid:107) m N (cid:107) = O ( N − α ) and so converges to zero. Like-wise the fraction in the square root in the denominator converges to zero, meaning theterm in the brackets of (11) converge to as N → ∞ , yielding the advertised resultonce numerator and denominator is divided by N since N − Tr (Σ ) → (cid:90) D (cid:90) D k ( s, t ) dsdt = (cid:107) C k (cid:107) HS ,N − (cid:104) m N , Σ m N (cid:105) → (cid:90) D (cid:90) D m ( s ) k ( s, t ) m ( t ) dsdt = (cid:107) C / k m (cid:107) L ( D ) . For the case of identity matrix substituting the corresponding scaling ofTr (Σ ) , (cid:104) m N , Σ m N (cid:105) in the relevant places will produce the result. Proof of Lemma 1 . We use the standard fact that for a real valued random variables η the following inequality holds | E [ η ] − Median [ η ] | ≤ Var [ η ] and we will be using thisinequality on (cid:107) x − y (cid:107) where x ∼ P, y ∼ Q . For economy of notation let µ = µ − µ and Σ = Σ + Σ . Using standard Gaussian integral identities we obtain E [ (cid:107) x − y (cid:107) ] = ( Tr (Σ) + (cid:107) µ (cid:107) ) , E [ (cid:107) x − y (cid:107) ] = 2 Tr (Σ ) + 4 (cid:104) µ, Σ µ (cid:105) + ( Tr (Σ) + (cid:107) µ (cid:107) ) , thereforeVar [ (cid:107) x − y (cid:107) ] = E [ (cid:107) x − y (cid:107) ] − E [ (cid:107) x − y (cid:107) ] = 2 Tr (Σ) + 4 (cid:107) µ (cid:107) Tr (Σ) . Substituting into the inequality at the start of the proof (cid:12)(cid:12)(cid:12)(cid:12)
Median [ (cid:107) x − y (cid:107) ] E [ (cid:107) x − y (cid:107) ] − (cid:12)(cid:12)(cid:12)(cid:12) ≤ Tr (Σ) + 4 (cid:107) µ (cid:107) Tr (Σ)( Tr (Σ) + (cid:107) µ (cid:107) ) = 2 (cid:18) − (cid:107) µ (cid:107) ( Tr (Σ) + (cid:107) µ (cid:107) ) (cid:19) , which completes the first part of the proof. The special case follows from dividing thenumerator and denominator of the fraction in the right hand side by N and using theassumption that the kernels are continuous and the mesh satisfies the Riemann scalingproperty. Writing m N for the discretised version of m − m gives N − (cid:107) m N (cid:107) →(cid:107) m − m (cid:107) L ( D ) , N − Tr (Σ) → Tr ( C k + C k ) , N − (cid:107) m N (cid:107) → (cid:107) m − m (cid:107) L ( D ) . .3 Proofs for Section 5 Lemma 2.
The SE- T function is a kernel. Proof of Lemma 2 . Consider first the case T = I , it is shown in Schoenberg [63,Theorem 3] that if k ( x, y ) = φ ( (cid:107) x − y (cid:107) Y ) for φ a completely monotone functionthen k is positive definite on Y and it is well known that e − ax is such a functionfor a > therefore k I is a kernel. Now take k T to be the SE- T kernel then forany N ∈ N , { a n } Nn =1 ⊂ R , { x n } Nn =1 ⊂ X we have (cid:80) Nn,m =1 a n a m k T ( x n , x m ) = (cid:80) Nn,m =1 a n a m k I ( T ( x n ) , T ( x m )) ≥ Proof of Theorem 2 . This proof uses the argument of Minh [46, Theorem 1]. The planis to first show the function space stated in the theorem is an RKHS and that the kernelis the SE- T kernel so by uniqueness of kernel for RKHS we are done. This is doneusing the Aronszajn theorem [46, Theorem 9] which identifies the kernel as an infinitesum of basis functions, see also Paulsen and Raghupathi [54, Theorem 2.4].First we prove that H k T ( X ) is a separable Hilbert space. That it is an inner productspace is clear from the definition of the inner product and the assumption that λ n > for all n ∈ N . The definition of the inner product means completeness of H k T ( X ) equivalent to completeness of the weighted l space given by l λ (Γ) = ( w γ ) γ ∈ Γ : (cid:107) ( w γ ) γ ∈ Γ (cid:107) l λ (Γ) := (cid:88) γ ∈ Γ γ ! λ γ w γ < ∞ which can be easily seen to be complete since Γ is countable and γ ! /λ γ is positive forall γ ∈ Γ . To see that H k T ( X ) is separable observe that from the definition of the innerproduct, the countable set of functions φ γ ( x ) = (cid:112) λ γ /γ ! e − (cid:107) T x (cid:107) X x γ is orthonormaland spans H k T ( X ) hence is an orthonormal basis.Next we prove that H k T ( X ) is an RKHS. Expanding the kernel through the expo-nential function gives k T ( x, y ) = e − (cid:107) T x (cid:107) X e − (cid:107) T x (cid:107) X e (cid:104) T x,T y (cid:105) X = e − (cid:107) T x (cid:107) X e − (cid:107) T x (cid:107) X ∞ (cid:88) n =0 (cid:104) T x, T y (cid:105) n X n ! , and by the assumption on T we know (cid:104) T x, T y (cid:105) n X = (cid:0) (cid:80) ∞ m =1 λ m x m y m (cid:1) n = (cid:80) | γ | = n n ! γ ! λ γ x γ y γ where x m = (cid:104) x, e m (cid:105) X , similarly for y m , therefore k T ( x, y ) = e − (cid:107) T x (cid:107) X e − (cid:107) T x (cid:107) X (cid:88) | γ |≥ λ γ γ ! x γ y γ = (cid:88) | γ |≥ φ γ ( x ) φ γ ( y ) . So for any F ∈ H k T ( X ) we have (cid:104) F, k T ( · , x ) (cid:105) H kT ( X ) = (cid:80) γ ∈ Γ (cid:104) F, φ γ (cid:105) H kT ( X ) φ γ ( x ) = F ( x ) so k T is a reproducing kernel of H k T ( X ) so by uniqueness of reproducing kernelswe have that H k T ( X ) is the RKHS of k T . roof of Theorem 3 . 
If k T ( x, y ) = ˆ µ ( x − y ) for some Borel measure on X then (2)lets us writeMMD k T ( P, Q ) = (cid:90) X (cid:90) X k T ( x, y ) d ( P − Q )( x ) d ( P − Q )( y )= (cid:90) X (cid:90) X (cid:90) X e i (cid:104) h,x − y (cid:105) X dµ ( h ) d ( P − Q )( x ) d ( P − Q )= (cid:90) X (cid:18)(cid:90) X e i (cid:104) h,x (cid:105) X d ( P − Q )( x ) (cid:19) (cid:18)(cid:90) X e − i (cid:104) h,y (cid:105) X d ( P − Q )( y ) (cid:19) dµ ( h ) (12) = (cid:90) X (cid:12)(cid:12)(cid:12) (cid:98) P ( h ) − (cid:98) Q ( h ) (cid:12)(cid:12)(cid:12) dµ ( h ) , where (12) is obtained by using Fubini’s theorem to swap the integrals which is per-mitted since | e i (cid:104) h,x − y (cid:105) X | = 1 and is therefore integrable with respect to P and Q .Fourier transforms of finite Borel measures on X are uniformly continuous [1,Proposition 2.21] therefore if µ has full support, meaning that µ ( U ) > for everyopen U ⊂ X , then we may immediately conclude that (cid:98) P = (cid:98) Q and since the Fouriertransform of finite Borel measures on X is injective [18, Proposition 1.7] we may con-clude that P = Q meaning that Φ k T is injective. Assume T is injective. If k T is theSE- T kernel then Da Prato [18, Proposition 1.25] shows that µ = N T has full supportsince T is also injective. Therefore k T is characteristic. For the converse direction weuse the contrapositive and assume that T is not injective meaning there exists x ∗ ∈ X with x ∗ (cid:54) = 0 and T x = 0 . Set P = δ and Q = δ x ∗ the Dirac measures on , x ∗ then Φ k T ( P ) = k T (0 , · ) = k T ( x ∗ , · ) = Φ k T ( Q ) therefore Φ k T is not injective so k T is notcharacteristic. Proof of Theorem 4 . The result is a corollary of the next theorem which is where The-orem 11 is employed.
Theorem 12.
Let X be a real, separable Hilbert space and T = I then the SE- T kernelis characteristic. Proof of Theorem 12 . The idea of this proof is to use the contrapositive and assume P (cid:54) = Q and conclude that MMD k I ( P, Q ) > . The main tool shall be Theorem 11since P (cid:54) = Q implies (cid:98) P (cid:54) = (cid:98) Q and Theorem 11 implies that these Fourier transformsvary slowly in some sense so there will be a set of big enough measure such that (cid:98) P ( x ) (cid:54) = (cid:98) Q ( x ) for x in this set, which will allow us to deduce MMD k I ( P, Q ) > .Suppose P, Q are Borel probability measures on X with P (cid:54) = Q then (cid:98) P (cid:54) = (cid:98) Q [18,Proposition 1.7] so there exists x ∗ ∈ X , ε > such that | (cid:98) P ( x ∗ ) − (cid:98) Q ( x ∗ ) | > ε . ByTheorem 11 there exists S, R ∈ L +1 ( X ) such that (cid:98) P (respectively (cid:98) Q ) is continuous withrespect to the norm induced by S (repectively R ). For r > let B S ( x ∗ , r ) = { x ∈X : (cid:104) S ( x − x ∗ ) , x − x ∗ (cid:105) X < r } be the ball based at x ∗ of radius r with respect to thenorm induced by S , and B R ( x ∗ , r ) will denote the analogous ball with respect to the orm induced by R . By the continuity properties of (cid:98) P , (cid:98) Q there exists r > such that x ∈ B S ( x ∗ , r ) = ⇒ | (cid:98) P ( x ) − (cid:98) P ( x ∗ ) | < ε x ∈ B R ( x ∗ , r ) = ⇒ | (cid:98) Q ( x ) − (cid:98) Q ( x ∗ ) | < ε . The set B S ( x ∗ , r ) ∩ B R ( x ∗ , r ) is non-empty since it contains x ∗ and if x ∈ B S ( x ∗ , r ) ∩ B R ( x ∗ , r ) then by reverse triangle inequality | (cid:98) P ( x ) − (cid:98) Q ( x ) | = | (cid:98) P ( x ) − (cid:98) P ( x ∗ ) + (cid:98) P ( x ∗ ) − (cid:98) Q ( x ∗ ) + (cid:98) Q ( x ∗ ) − (cid:98) Q ( x ) |≥ | (cid:98) P ( x ∗ ) − (cid:98) Q ( x ∗ ) | − | (cid:98) P ( x ) − (cid:98) P ( x ∗ ) | − | (cid:98) Q ( x ) − (cid:98) Q ( x ∗ ) | > ε − ε − ε ε . (13)Now define the operator U = S + R which is positive, symmetric and trace classsince both S and R have these properties. Note that B U ( x ∗ , r ) ⊂ B S ( x ∗ , r ) ∩ B R ( x ∗ , r ) because (cid:107) x − x ∗ (cid:107) U = (cid:104) U ( x − x ∗ ) , x − x ∗ (cid:105) X = (cid:104) S ( x − x ∗ ) , x − x ∗ (cid:105) X + (cid:104) R ( x − x ∗ ) , x − x ∗ (cid:105) X = (cid:107) x − x ∗ (cid:107) S + (cid:107) x − x ∗ (cid:107) R . Since U is a positive, compact, symmetric operator there exists a decompositioninto its eigenvalues { λ n } ∞ n =1 , a non-negative sequence converging to zero, and eigen-functions { e n } ∞ n =1 which form an orthonormal basis of X . We will later need to as-sociate a non-degenerate Gaussian measure with U . To this end define V to be thepositive, symmetric, trace class operator with eigenvalues { ρ n } ∞ n =1 where ρ n = λ n if λ n > otherwise ρ n = n − , and eigenfunctions { e n } ∞ n =1 inherited from U . Then byconstruction V is injective, positive, symmetric and trace class. The V induced normdominates the U induced norm therefore B V ( x ∗ , r ) ⊂ B U ( x ∗ , r ) so for x ∈ B V ( x ∗ , r ) we have | (cid:98) P ( x ) − (cid:98) Q ( x ) | > ε/ .Now we construct an operator which will approximate I , define the operator I m x = (cid:80) ∞ n =1 ω ( m ) n (cid:104) x, e n (cid:105) X e n where ω ( m ) n = 1 for n ≤ m and ω ( m ) n = n − for n > m and { e n } ∞ n =1 is the eigenbasis of V , then I m ∈ L +1 ( X ) for every m ∈ N . 
It is easy to see k I / m converges pointwise to k I as m → ∞ since e − x is a continuous function on R and (cid:107) x (cid:107) I / m → (cid:107) x (cid:107) I = (cid:107) x (cid:107) X however clearly I / m does not converge in operatornorm to I since I is not a compact operator. Since k I m ≤ for all m we may use thebounded convergence theorem to obtainMMD k I ( P, Q ) = (cid:90) X (cid:90) X k I ( x, y ) d ( P − Q )( x ) d ( P − Q )( y )= lim m →∞ (cid:90) X (cid:90) X k I / m ( x, y ) d ( P − Q )( x ) d ( P − Q )( y )= lim m →∞ (cid:90) X | (cid:98) P ( x ) − (cid:98) Q ( x ) | dN I m ( x ) , (14) here (14) is by the same reasoning as in the proof of Theorem 3. In light of the lowerbound we derived earlier over B V ( x ∗ , r ) of the integrand in (13)MMD k I ( P, Q ) = lim m →∞ (cid:90) X | (cid:98) P ( x ) − (cid:98) Q ( x ) | dN I m ( x ) ≥ lim m →∞ (cid:90) B V ( x ∗ ,r ) ε dN I m ( x ) , so if we can lower bound N I m ( B V ( x ∗ , r )) by a positive constant indepdendent of m then we are done. This set is the ball with respect to V ∈ L +1 ( X ) which is a some-how large set, see the discussion after Theorem 11, and we will use a push-forward ofmeasure argument.Define T ( x ) = x − x ∗ then N I m ( B V ( x ∗ , r )) = T N I m ( B V (0 , r )) and [18, Propo-sition 1.17] tells us T N I m ( B V (0 , r )) = N − x ∗ ,I m ( B V (0 , r )) . Next we note that N − x ∗ ,I m ( B V (0 , r )) = V N − x ∗ ,I m ( B (0 , r )) and [18, Proposition 1.18] tells us that V N − x ∗ ,I m ( B (0 , r )) = N − V x ∗ ,V I m V ( B (0 , r )) . For ease of notation let y ∗ = − V x ∗ and since we choose to construct I m fromthe eigenbasis of V we have V m x := V I m V x = (cid:80) n ∈ N ρ ( m ) n (cid:104) x, e n (cid:105) X e n where ρ ( m ) n = ρ n for n ≤ m and ρ ( m ) n = ρ n n − for n > m so V m ∈ L +1 ( X ) and is injectivefor every m ∈ N . We follow the proof of [18, Proposition 1.25] and define the sets A l = (cid:40) x ∈ X : l (cid:88) n =1 (cid:104) x, e n (cid:105) X ≤ r (cid:41) B l = (cid:40) x ∈ X : ∞ (cid:88) n = l +1 (cid:104) x, e n (cid:105) X < r (cid:41) . Since V m is non-degenerate for every m ∈ N we have that for every l ∈ N the events A l , B l are independent under N y ∗ ,V m [18, Example 1.22] meaning ∀ m, l ∈ N we have N y ∗ ,V m ( B (0 , r )) ≥ N y ∗ ,V m ( A l ∩ B l ) = N y ∗ ,V m ( A l ) N y ∗ ,V m ( B l ) , and by the measure theoretic Chebyshev inequality, for every l ∈ N N y ∗ ,V m ( B l ) ≥ − N y ∗ ,V m ( B cl ) ≥ − r ∞ (cid:88) n = l +1 (cid:90) X (cid:104) x, e n (cid:105) X dN y ∗ ,V m = 1 − r (cid:32) (cid:88) n = l +1 ρ ( m ) n + (cid:104) y ∗ , e n (cid:105) X (cid:33) ≥ − r (cid:32) ∞ (cid:88) n = l +1 ρ n + (cid:104) y ∗ , e n (cid:105) X (cid:33) . As the final line involves the tail of a finite sum with no dependcy on m there existsan L ∈ N such that N y ∗ ,V m ( B L ) > for every m ∈ N and l ≥ L . Note that for m > L we have N y ∗ ,V m ( A L ) = N y ∗ ,V ( A L ) since A L depends only on the first L coordinatesand ρ ( m ) n = ρ n for n ≤ L if m > L . So for m > LN y ∗ ,V m ( A L ) = N y ∗ ,V ( A L ) ≥ N y ∗ ,V (cid:18) B (cid:18) , r √ (cid:19)(cid:19) > c > , or some c since V is non-degenerate [18, Proposition 1.25].Overall we have shown that there exists an L ∈ N such that for m > L we have N I m ( B V ( x ∗ , r )) = N y ∗ ,V m ( B (0 , r )) > c . 
Therefore, by substituting back into thelower bound for MMD k I ( P, Q ) MMD k I ( P, Q ) = lim m →∞ (cid:90) X | (cid:98) P ( x ) − (cid:98) Q ( x ) | dN I m ( x ) ≥ lim m →∞ (cid:90) B V ( x ∗ ,r ) ε dN I m ( x ) > ε c > , which implies by contrapositive that k I is characteristic.With Theorem 12 proved we proceed to prove Theorem 4. By (2)MMD k T ( P, Q ) = (cid:90) (cid:90) k T ( x, x (cid:48) ) dP ( x ) dP ( x (cid:48) ) + (cid:90) (cid:90) k T ( y, y (cid:48) ) dQ ( y ) dQ ( y (cid:48) ) − (cid:90) (cid:90) k T ( x, y ) dQ ( x ) dP ( y )= (cid:90) (cid:90) k I ( x, x (cid:48) ) dT P ( x ) dT P ( x (cid:48) ) + (cid:90) (cid:90) k I ( y, y (cid:48) ) dT Q ( y ) dT Q ( y (cid:48) ) − (cid:90) (cid:90) k I ( x, y ) dT P ( x ) dT Q ( y ) (15) = MMD k I ( T P, T Q ) . Using Theorem 12 we know that if MMD k T ( P, Q ) = 0 then T P = T Q and allthat is left to show is that the assumption on T implies P = Q . By the definitionof push-forward measure we know that for every B ∈ B ( Y ) we have the equality P ( T − B ) = Q ( T − B ) . By the assumptions on X , Y , T we know that T ( A ) ∈ B ( Y ) for every A ∈ B ( X ) [39, Theorem 15.1]. Hence for any A ∈ B ( X ) take B = T ( A ) then P ( A ) = P ( T − B ) = Q ( T − B ) = Q ( A ) , which shows P = Q . Proof of Proposition 3 . For any N ∈ N take any { a n } Nn =1 ⊂ R , { x n } Nn =1 ⊂ X then N (cid:88) n,m =1 k C,k ( x n , x m ) = (cid:90) X N (cid:88) n,m =1 k ( (cid:104) x n , h (cid:105) X , (cid:104) x m , h (cid:105) X ) dN C ( h ) ≥ , since this is the integral of a non-negative quantity as k is a kernel. Symmetry of k C,k follows since k is symmetric meaning k C,k is a kernel. Expanding k using itsspectral measure we have k C,k ( x, y ) = (cid:90) X (cid:90) R e i (cid:104) x − y,rh (cid:105) X dµ ( r ) dN C ( h ) = ˆ ν ( x − y ) , where ν ( A ) = (cid:82) X (cid:82) R A ( rh ) dµ ( r ) dN C ( h ) for all A ∈ B ( X ) . This is the law of the X valued random variable ξX where ξ ∼ µ and X ∼ N C are independent. We will how that ν has full support on X from which is follows that k C,k is characteristic byfollowing the proof of Theorem 3.Fix any open ball B = B ( h, r ) ⊂ X then given the assumption on µ by intersectingwith (0 , ∞ ) or ( −∞ , we may assume that a, b have the same sign. Assume that a, b > , the proof for when a, b < is analogous. We first treat the case h (cid:54) = 0 .Set δ = min( ( ba − , r (cid:107) h (cid:107) X ) so that ( a, a (1 + δ )) ⊂ ( a, b ) . Now consider the ball B (cid:48) = B ( ha (1+ δ ) , r a (1+ δ ) ) , take any c ∈ ( a, a (1 + δ )) and any x ∈ B (cid:48) then (cid:107) cx − h (cid:107) X ≤ (cid:13)(cid:13)(cid:13)(cid:13) ξx − ξha (1 + δ ) (cid:13)(cid:13)(cid:13)(cid:13) X + (cid:13)(cid:13)(cid:13)(cid:13) ξha (1 + δ ) − h (cid:13)(cid:13)(cid:13)(cid:13) X ≤ cr a (1 + δ ) + (cid:107) h (cid:107) X (cid:18) − ca (1 + δ ) (cid:19) < r (cid:107) h (cid:107) X (cid:18) −
11 + δ (cid:19) < r r < r. Therefore for any c ∈ ( a, a (1 + δ )) we have cB (cid:48) ⊂ B . Hence P ( ξX ∈ B ) = ν ( B ) ≥ µ (cid:0) ( a, a (1 + δ )) (cid:1) N C ( B (cid:48) ) > by the assumptions on µ and the way N C isnon-degenerate.The case h = 0 is analogous, take B (cid:48) = B (0 , r b ) then for every c ∈ ( a, b ) we have cB (cid:48) ⊂ B and we again conclude ν ( B ) > . Proof of Corollary 1 . The idea of the proof is to represent the IMQ- T kernel as anintegral of the SE- T kernel then use the same limit argument as in the proof of Theorem12 and push-forward argument of Theorem 4.Throughout this proof k IMQ T and k SE T will denote the IMQ- T and SE- T kernelsrespectively. By the same proof technique as Theorem 4 is suffices to prove that k IMQ I is characteristic. Let { e n } ∞ n =1 be any orthonormal basis of X and let I m x = (cid:80) mn =1 (cid:104) x, e n (cid:105) X e n so that I m converges to I pointwise. Then by the same limiting ar-gument in the proof of Theorem 12 using bounded convergence theorem, letting N bethe N (0 , measure on R we haveMMD k IMQ I ( P, Q ) = lim m →∞ (cid:90) X (cid:90) X k IMQ I m ( x, y ) d ( P − Q )( x ) d ( P − Q )( y )= lim m →∞ (cid:90) X (cid:90) X (cid:90) R k SE I m ( zx, zy ) dN ( z ) d ( P − Q )( x ) d ( P − Q )( y ) (16) = (cid:90) R MMD k SE I ( z P, z Q ) dN ( z ) , (17)where (16) is from the integral representation of k IMQ I m in (10) which can be used since I m ∈ L +1 ( X ) , (17) is obtained by using Fubini’s theorem and bounded convergencetheorem and z P denotes the push-forward of the measure under the linear map from X to X defined as multiplication by the scalar z . The integrand in (17) can be rewritten s MMD k SE I ( z P, z Q ) = (cid:90) X (cid:90) X k SE I ( x, y ) d ( z P − z Q )( x ) d ( z P − z Q )( y )= (cid:90) X (cid:90) X k SE I ( zx, zy ) d ( P − Q )( x ) d ( P − Q )( y )= (cid:90) X (cid:90) X e − z (cid:107) x − y (cid:107) X d ( P − Q )( x ) d ( P − Q )( y )= MMD k SE z I ( P, Q ) , which makes it clear that MMD k SE I ( z P, z Q ) is a continuous, non-negative functionof z and equals if and only if z = 0 . Using this we deduceMMD k IMQ I ( P, Q ) = (cid:90) R MMD k SE I ( z P, z Q ) dN σ ( z )= (cid:90) R MMD k SE z I ( P, Q ) dN σ ( z ) > , since N σ has strictly positive density. This proves that the IMQ- I kernel is charac-teristic and by the same push-forward argument as in the proof of Theorem 4 we mayconclude that the IMQ- T kernel is characteristic too. A.4 Proofs for Section 6
A.4 Proofs for Section 6

Proof of Proposition 4. For any $i, j$, using the assumption on $k$ we have
\begin{align}
|k(x_i,y_j) - k(RIx_i, RIy_j)| &\le L\,\|x_i - y_j - RIx_i + RIy_j\|_X \tag{18}\\
&\le L\,\big(\|RIx_i - x_i\|_X + \|RIy_j - y_j\|_X\big), \tag{19}
\end{align}
where (18) is by assumption and (19) uses the triangle inequality. Using (3) gives
\begin{align}
\Big|\widehat{\mathrm{MMD}}{}^2_k(X_n,Y_n) - \widehat{\mathrm{MMD}}{}^2_k(RIX_n, RIY_n)\Big| &\le \frac{1}{n(n-1)}\sum_{i\neq j}\big|h(z_i,z_j) - h(RIz_i, RIz_j)\big|\nonumber\\
&\le \frac{2L}{n(n-1)}\sum_{i\neq j}\big(\|RIx_i - x_i\|_X + \|RIx_j - x_j\|_X + \|RIy_i - y_i\|_X + \|RIy_j - y_j\|_X\big) \tag{20}\\
&= \frac{4L}{n}\sum_{i=1}^n\big(\|RIx_i - x_i\|_X + \|RIy_i - y_i\|_X\big), \tag{21}
\end{align}
where (20) follows from expanding using the definition of $h$ in Section 3 and the triangle inequality, and (21) follows from counting the number of pairs of indices in the sum.

Proof of Corollary 2. This can be deduced from the Lipschitz constants of $e^{-x/2}$ and $(x+1)^{-1/2}$; the proof of Proposition 4 may then be continued in the same manner.
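The bound (21) is straightforward to check numerically in finite dimensions, with a small random perturbation of each sample standing in for the reconstruction $RI$ and with $L = e^{-1/2}$, a Lipschitz constant for the SE-$I$ kernel in the sense of Proposition 4. The NumPy sketch below is not part of the paper; dimension, sample size and perturbation scale are arbitrary illustrative choices.

import numpy as np

# Numerical check of the bound (21) for the SE-I kernel on R^d.
# L = e^{-1/2} bounds |d/dr exp(-r^2/2)|, so
# |k(x,y) - k(x',y')| <= L ||x - y - x' + y'||.

rng = np.random.default_rng(2)
d, n = 3, 100
L = np.exp(-0.5)

def gram(U, V):
    D2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * D2)

def mmd2_unbiased(X, Y):
    # unbiased estimate of MMD^2, summing h(z_i, z_j) over i != j
    n = len(X)
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    np.fill_diagonal(Kxx, 0.0); np.fill_diagonal(Kyy, 0.0); np.fill_diagonal(Kxy, 0.0)
    return (Kxx + Kyy - Kxy - Kxy.T).sum() / (n * (n - 1))

X = rng.normal(size=(n, d))
Y = rng.normal(loc=0.3, size=(n, d))
Xr = X + 0.05 * rng.normal(size=(n, d))   # stand-in for RI X_n
Yr = Y + 0.05 * rng.normal(size=(n, d))   # stand-in for RI Y_n

lhs = abs(mmd2_unbiased(X, Y) - mmd2_unbiased(Xr, Yr))
rhs = 4 * L / n * (np.linalg.norm(Xr - X, axis=1).sum()
                   + np.linalg.norm(Yr - Y, axis=1).sum())
print(lhs, rhs, lhs <= rhs)   # the bound (21) holds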
Proof of Theorem 5. Define the two random variables
\begin{align*}
A_n &= \sqrt{n}\,\big(\widehat{\mathrm{MMD}}{}^2_k(RIX_n, RIY_n) - \widehat{\mathrm{MMD}}{}^2_k(X_n,Y_n)\big)\\
B_n &= \sqrt{n}\,\big(\widehat{\mathrm{MMD}}{}^2_k(X_n,Y_n) - \mathrm{MMD}_k(P,Q)^2\big).
\end{align*}
It is known that $B_n \xrightarrow{d} N(0,\xi^2)$ [28, Corollary 16], so the proof is complete by Slutsky's theorem if $A_n \xrightarrow{P} 0$. Fix any $\varepsilon > 0$; then by Proposition 4
\begin{align}
P(|A_n| > \varepsilon) &\le P\bigg(\frac{4L}{\sqrt{n}}\sum_{i=1}^n \|RIx_i - x_i\|_X + \|RIy_i - y_i\|_X > \varepsilon\bigg)\nonumber\\
&\le \frac{4L}{\varepsilon\sqrt{n}}\,\mathbb{E}\bigg[\sum_{i=1}^n \|RIx_i - x_i\|_X + \|RIy_i - y_i\|_X\bigg] \tag{22}\\
&= \frac{4L\sqrt{n}}{\varepsilon}\,\mathbb{E}\big[\|RIx - x\|_X + \|RIy - y\|_X\big] \tag{23}\\
&\to 0, \tag{24}
\end{align}
where (22) is by Markov's inequality, (23) is by the assumption that the samples from $P$, $Q$ and the discretisations $U$, $V$ are i.i.d. across samples, and (24) is by assumption.
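The mechanism of Theorem 5 can be illustrated in finite dimensions: when the reconstruction error decays fast enough that $\sqrt{n}\,\mathbb{E}\|RIx - x\|_X \to 0$, the rescaled difference $A_n$ visibly shrinks with $n$. A NumPy sketch (not part of the paper), reusing gram and mmd2_unbiased from the previous listing; the error decay rate $n^{-1}$, which is $o(n^{-1/2})$, is an arbitrary illustrative choice.

import numpy as np

# Illustration of Theorem 5: a perturbation of size 1/n stands in for the
# reconstruction error, so sqrt(n) * E||RI x - x|| -> 0 and
# A_n = sqrt(n) (MMD^2(reconstructed) - MMD^2(exact)) vanishes in
# probability; both statistics then share the same Gaussian limit by
# Slutsky's theorem. Requires gram / mmd2_unbiased from the previous sketch.

rng = np.random.default_rng(3)
d = 3
for n in [50, 200, 800]:
    X = rng.normal(size=(n, d))
    Y = rng.normal(loc=0.3, size=(n, d))
    eps = 1.0 / n
    Xr = X + eps * rng.normal(size=(n, d))
    Yr = Y + eps * rng.normal(size=(n, d))
    A_n = np.sqrt(n) * (mmd2_unbiased(Xr, Yr) - mmd2_unbiased(X, Y))
    print(n, A_n)   # shrinks towards zero as n grows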
Proof of Theorem 6. Note that $\Phi_{k_T}(N_{a,S})(x) = \Phi_{k_I}(N_{Ta,TST})(Tx)$ [18, Proposition 1.18]. The proof simply uses [19, Proposition 1.2.8] to calculate the Gaussian integrals:
\begin{align}
\Phi_{k_I}(N_{a,S})(x) &= \int_X e^{-\frac{1}{2}\langle x-y,\,x-y\rangle_X}\,dN_{a,S}(y)\nonumber\\
&= e^{-\frac{1}{2}\langle a-x,\,a-x\rangle_X}\int_X e^{-\frac{1}{2}\langle y,y\rangle_X}\, e^{-\langle y,\,a-x\rangle_X}\,dN_S(y)\nonumber\\
&= \det(I+S)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle a-x,\,a-x\rangle_X}\, e^{\frac{1}{2}\|(I+S)^{-\frac{1}{2}}S^{\frac{1}{2}}(a-x)\|_X^2} \tag{25}\\
&= \det(I+S)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle a-x,\,a-x\rangle_X}\, e^{\frac{1}{2}\langle S^{\frac{1}{2}}(I+S)^{-1}S^{\frac{1}{2}}(a-x),\,a-x\rangle_X}\nonumber\\
&= \det(I+S)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle (I - S^{\frac{1}{2}}(I+S)^{-1}S^{\frac{1}{2}})(a-x),\,a-x\rangle_X}\nonumber\\
&= \det(I+S)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle (I+S)^{-1}(x-a),\,x-a\rangle_X}, \tag{26}
\end{align}
where (25) is due to Da Prato and Zabczyk [19, Proposition 1.2.8] and the last equality is due to the Sherman-Morrison-Woodbury identity for operators [35, Theorem 3.5.6]. Substituting $Ta$ for $a$, $TST$ for $S$ and $Tx$ for $x$ gives the desired expression, as discussed at the start of the proof.
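The closed form (26) is easy to verify by Monte Carlo in $\mathbb{R}^d$. The NumPy sketch below is not part of the paper; the dimension and the particular $a$, $S$, $x$ are arbitrary illustrative choices.

import numpy as np

# Monte Carlo check of (26) in R^d: for y ~ N(a, S),
#   E_y[exp(-||x-y||^2 / 2)]
#     = det(I+S)^{-1/2} exp(-<(I+S)^{-1}(x-a), x-a> / 2).

rng = np.random.default_rng(4)
d = 4
B = rng.normal(size=(d, d))
S = B @ B.T / d                      # a positive semi-definite covariance
a = rng.normal(size=d)
x = rng.normal(size=d)

y = rng.multivariate_normal(a, S, size=1_000_000)
mc = np.exp(-0.5 * ((x - y) ** 2).sum(axis=1)).mean()

I = np.eye(d)
exact = (np.linalg.det(I + S) ** -0.5
         * np.exp(-0.5 * (x - a) @ np.linalg.solve(I + S, x - a)))
print(mc, exact)   # agree to roughly three decimal places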
Proof of Theorem 7. The idea of the proof is that $\mathrm{MMD}_{k_T}(P,Q)^2$ is simply a sum of double integrals of the kernel with respect to Gaussian measures. One integral was completed in Theorem 6 and we apply Da Prato and Zabczyk [19, Proposition 1.2.8] again. Note that $\mathrm{MMD}_{k_T}(N_{a,S}, N_{b,R}) = \mathrm{MMD}_{k_I}(N_{Ta,TST}, N_{Tb,TRT})$, so it suffices to do the calculations for $k_I$ and substitute the other values in. Also, since $k_T$ is translation invariant, we may without loss of generality assume $a = 0$ and replace $b$ with $a - b$ at the end.
\begin{align}
\int_X\int_X k_I(x,y)\,dN_S(x)&\,dN_{b,R}(y)\nonumber\\
&= \det(I+S)^{-\frac{1}{2}}\int_X e^{-\frac{1}{2}\langle (I+S)^{-1}(y-b),\,y-b\rangle_X}\,dN_R(y) \tag{27}\\
&= \det(I+S)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle (I+S)^{-1}b,\,b\rangle_X}\int_X e^{-\frac{1}{2}\langle (I+S)^{-1}y,\,y\rangle_X}\, e^{\langle y,\,(I+S)^{-1}b\rangle_X}\,dN_R(y)\nonumber\\
&= \det(I+S)^{-\frac{1}{2}}\det\big(I + R^{\frac{1}{2}}(I+S)^{-1}R^{\frac{1}{2}}\big)^{-\frac{1}{2}} \tag{28}\\
&\qquad\times e^{-\frac{1}{2}\langle (I+S)^{-1}b,\,b\rangle_X}\, e^{\frac{1}{2}\langle (I+S)^{-1}R^{\frac{1}{2}}(I + R^{\frac{1}{2}}(I+S)^{-1}R^{\frac{1}{2}})^{-1}R^{\frac{1}{2}}(I+S)^{-1}b,\,b\rangle_X}\nonumber\\
&= \det(I+S)^{-\frac{1}{2}}\det\big(I + R^{\frac{1}{2}}(I+S)^{-1}R^{\frac{1}{2}}\big)^{-\frac{1}{2}} \tag{29}\\
&\qquad\times e^{-\frac{1}{2}\langle (I - (I+S)^{-1}R^{\frac{1}{2}}(I + R^{\frac{1}{2}}(I+S)^{-1}R^{\frac{1}{2}})^{-1}R^{\frac{1}{2}})(I+S)^{-1}b,\,b\rangle_X}\nonumber\\
&= \det(I+S)^{-\frac{1}{2}}\det\big(I + R^{\frac{1}{2}}(I+S)^{-1}R^{\frac{1}{2}}\big)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle (I+S+R)^{-1}b,\,b\rangle_X}, \tag{30}
\end{align}
where (27) is obtained by substituting the result of Theorem 6, (28) is an application of Da Prato and Zabczyk [19, Proposition 1.2.8], (29) is a rearrangement of terms and (30) uses the Sherman-Morrison-Woodbury identity for operators [35, Theorem 3.5.6]. The proof is completed by writing $\mathrm{MMD}^2$ in terms of the three double integrals and substituting in the appropriate values of $S$, $R$, $b$, in line with the description at the start of the proof. In particular, when $b = 0$ and $S = R$,
\[
\det(I+S)\det\big(I + S^{\frac{1}{2}}(I+S)^{-1}S^{\frac{1}{2}}\big) = \det\big((I+S)(I + (I+S)^{-1}S)\big) = \det(I+2S),
\]
by the Sherman-Morrison-Woodbury identity for operators.
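Likewise, the building block (30) can be checked by Monte Carlo in $\mathbb{R}^d$. The NumPy sketch below is not part of the paper; dimension and parameters are arbitrary illustrative choices.

import numpy as np

# Monte Carlo check of (30) in R^d: for independent x ~ N(a, S), y ~ N(b, R),
#   E[exp(-||x-y||^2 / 2)]
#     = det(I+S+R)^{-1/2} exp(-<(I+S+R)^{-1}(a-b), a-b> / 2).

rng = np.random.default_rng(5)
d, m = 3, 1_000_000
BS = rng.normal(size=(d, d)); S = BS @ BS.T / d
BR = rng.normal(size=(d, d)); R = BR @ BR.T / d
a, b = rng.normal(size=d), rng.normal(size=d)

x = rng.multivariate_normal(a, S, size=m)
y = rng.multivariate_normal(b, R, size=m)
mc = np.exp(-0.5 * ((x - y) ** 2).sum(axis=1)).mean()

I = np.eye(d)
exact = (np.linalg.det(I + S + R) ** -0.5
         * np.exp(-0.5 * (a - b) @ np.linalg.solve(I + S + R, a - b)))
print(mc, exact)   # agree to roughly three decimal places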
Proof of Theorem 8. A more general result, of which Theorem 8 is a specific case, shall be proved.

Theorem 13.
Let $P = N_{a,S}$, $Q = N_{b,R}$ be two non-degenerate Gaussian measures on $X$, let $T \in L^+(X)$ and assume $T, S, R$ all commute. Then, when using the SE-$T$ kernel,
\begin{align*}
\xi_1 &= \alpha(T,S,R,a,b) + \alpha(T,R,S,a,b)\\
\xi_2 &= \beta(T,S,R,a,b) + \beta(T,R,S,a,b),
\end{align*}
where
\begin{align*}
\alpha(T,S,R,a,b) ={}& \det\big((I+TST)(I+3TST)\big)^{-\frac{1}{2}} - \det(I+2TST)^{-1}\\
&+ \det\big((I+TRT)(I+T(2S+R)T)\big)^{-\frac{1}{2}}\, e^{-\langle (I+T(2S+R)T)^{-1}T(a-b),\,T(a-b)\rangle_X}\\
&- \det\big(I+T(S+R)T\big)^{-1}\, e^{-\langle (I+T(S+R)T)^{-1}T(a-b),\,T(a-b)\rangle_X}\\
&- 2\det(\Sigma_S)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle (I+2TST)\Sigma_S^{-1}T(a-b),\,T(a-b)\rangle_X}\\
&+ 2\det\big((I+2TST)(I+T(S+R)T)\big)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle (I+T(S+R)T)^{-1}T(a-b),\,T(a-b)\rangle_X},\\[4pt]
\beta(T,S,R,a,b) ={}& \det(I+4TST)^{-\frac{1}{2}} - \det(I+2TST)^{-1}\\
&+ \det\big(I+2T(S+R)T\big)^{-\frac{1}{2}}\, e^{-\langle (I+2T(S+R)T)^{-1}T(a-b),\,T(a-b)\rangle_X}\\
&- \det\big(I+T(S+R)T\big)^{-1}\, e^{-\langle (I+T(S+R)T)^{-1}T(a-b),\,T(a-b)\rangle_X}\\
&+ 4\det\big((I+T(S+R)T)(I+2TST)\big)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle (I+T(S+R)T)^{-1}T(a-b),\,T(a-b)\rangle_X}\\
&- 4\det(\Sigma_S)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle (I+2TST)\Sigma_S^{-1}T(a-b),\,T(a-b)\rangle_X},
\end{align*}
and $\Sigma_X = (I+TST)(I+TRT) +
TXT(2I + T(S+R)T)$ for $X \in \{S, R\}$.

Proof of Theorem 13. As in the proof of Theorem 7 it suffices to consider $T = I$ and $a = 0$. Set $k = k_I$ and $\langle\cdot,\cdot\rangle = \langle\cdot,\cdot\rangle_X$ for ease of notation. The expression for $\xi_1$ is derived first. The simplifications in Sutherland [77] reveal
\begin{align*}
\xi_1 ={}& \mathbb{E}_x\big[\mathbb{E}_{x'}[k(x,x')]^2\big] - \mathbb{E}_{x,x'}[k(x,x')]^2 + \mathbb{E}_y\big[\mathbb{E}_{y'}[k(y,y')]^2\big] - \mathbb{E}_{y,y'}[k(y,y')]^2\\
&+ \mathbb{E}_x\big[\mathbb{E}_y[k(x,y)]^2\big] - \mathbb{E}_{x,y}[k(x,y)]^2 + \mathbb{E}_y\big[\mathbb{E}_x[k(y,x)]^2\big] - \mathbb{E}_{y,x}[k(y,x)]^2\\
&- 2\,\mathbb{E}_x\big[\mathbb{E}_{x'}[k(x,x')]\,\mathbb{E}_y[k(x,y)]\big] + 2\,\mathbb{E}_{x,x'}[k(x,x')]\,\mathbb{E}_{x,y}[k(x,y)]\\
&- 2\,\mathbb{E}_y\big[\mathbb{E}_{y'}[k(y,y')]\,\mathbb{E}_x[k(y,x)]\big] + 2\,\mathbb{E}_{y,y'}[k(y,y')]\,\mathbb{E}_{x,y}[k(x,y)].
\end{align*}
To calculate this only three of the terms need to be computed; the rest are deduced by substituting in different values. For example, $\mathbb{E}_{x,x'}[k(x,x')]$ can be deduced from the formula for $\mathbb{E}_{x,y}[k(x,y)]$ by setting $b = 0$ and $S = R$, since this makes $y$ act as an independent copy of $x$ in the expectation. The three terms needed are
\begin{align}
&\mathbb{E}_{x,y}[k(x,y)] \tag{31}\\
&\mathbb{E}_x\big[\mathbb{E}_y[k(x,y)]^2\big] \tag{32}\\
&\mathbb{E}_x\big[\mathbb{E}_{x'}[k(x,x')]\,\mathbb{E}_y[k(x,y)]\big]. \tag{33}
\end{align}
Expression (31) was derived in the proof of Theorem 7 as
\[
\mathbb{E}_{x,y}[k(x,y)] = \det(I+S+R)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle (I+S+R)^{-1}b,\,b\rangle}.
\]
Next a formula for (32) is derived. First note that $\mathbb{E}_y[k(x,y)]$ is the content of Theorem 6; the rest follows by using [19, Proposition 1.2.8] and rearranging terms:
\begin{align*}
\mathbb{E}_x\big[\mathbb{E}_y[k(x,y)]^2\big] &= \det(I+R)^{-1}\int_X e^{-\langle (I+R)^{-1}(x-b),\,x-b\rangle}\,dN_S(x)\\
&= \det\big((I+R)(I+R+2S)\big)^{-\frac{1}{2}}\, e^{-\langle (I+R)^{-1}b,\,b\rangle}\, e^{2\langle S(I+R)^{-1}(I+2S(I+R)^{-1})^{-1}(I+R)^{-1}b,\,b\rangle}\\
&= \det\big((I+R)(I+R+2S)\big)^{-\frac{1}{2}}\, e^{-\langle (I+R+2S)^{-1}b,\,b\rangle}.
\end{align*}
Finally (33) is derived, which involves the longest calculation. The terms inside the first expectation are the content of Theorem 6:
\begin{align*}
\mathbb{E}_x\big[\mathbb{E}_{x'}[k(x,x')]\,\mathbb{E}_y[k(x,y)]\big] &= \det\big((I+S)(I+R)\big)^{-\frac{1}{2}}\int_X e^{-\frac{1}{2}\langle (I+S)^{-1}x,\,x\rangle}\, e^{-\frac{1}{2}\langle (I+R)^{-1}(x-b),\,x-b\rangle}\,dN_S(x)\\
&= \det\big((I+S)(I+R)\big)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle (I+R)^{-1}b,\,b\rangle}\int_X e^{-\frac{1}{2}\langle ((I+S)^{-1}+(I+R)^{-1})x,\,x\rangle}\, e^{\langle x,\,(I+R)^{-1}b\rangle}\,dN_S(x)\\
&= \det\big((I+S)(I+R)\big)^{-\frac{1}{2}}\det\big(I + S((I+S)^{-1}+(I+R)^{-1})\big)^{-\frac{1}{2}}\\
&\qquad\times e^{\frac{1}{2}\langle [(I + S((I+S)^{-1}+(I+R)^{-1}))^{-1}S(I+R)^{-1} - I]b,\,(I+R)^{-1}b\rangle}\\
&= \det(\Sigma_S)^{-\frac{1}{2}}\, e^{-\frac{1}{2}\langle (I+2S)\Sigma_S^{-1}b,\,b\rangle},
\end{align*}
where $\Sigma_S = (I+S)(I+R) + S(2I+S+R)$; the last equality is obtained by rearranging the terms in the exponent and determinant. Substituting the derivations for (31), (32) and (33) into the formula for $\xi_1$ completes the derivation of $\xi_1$.
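Since the inner expectations in (33) are available in closed form from Theorem 6, the formula can be verified with only an outer Monte Carlo average. The NumPy sketch below is not part of the paper; diagonal $S$ and $R$ are used so that the commutation assumption of Theorem 13 holds, and all parameters are arbitrary illustrative choices.

import numpy as np

# Monte Carlo check of (33): with x, x' ~ N(0, S) and y ~ N(b, R) independent,
#   E_x[ E_{x'}[k(x,x')] E_y[k(x,y)] ]
#     = det(Sigma_S)^{-1/2} exp(-<(I+2S) Sigma_S^{-1} b, b> / 2),
# Sigma_S = (I+S)(I+R) + S(2I+S+R). Inner expectations use Theorem 6's
# closed form; only the outer expectation over x is sampled.

rng = np.random.default_rng(6)
d, m = 3, 500_000
s = rng.uniform(0.2, 1.0, size=d)   # diagonal of S
r = rng.uniform(0.2, 1.0, size=d)   # diagonal of R (diagonal => S, R commute)
b = rng.normal(size=d)

x = rng.normal(size=(m, d)) * np.sqrt(s)   # x ~ N(0, S)

# inner closed forms from Theorem 6
inner_xx = np.prod(1 + s) ** -0.5 * np.exp(-0.5 * (x ** 2 / (1 + s)).sum(axis=1))
inner_xy = np.prod(1 + r) ** -0.5 * np.exp(-0.5 * ((x - b) ** 2 / (1 + r)).sum(axis=1))
mc = (inner_xx * inner_xy).mean()

sigma = (1 + s) * (1 + r) + s * (2 + s + r)   # diagonal of Sigma_S
exact = np.prod(sigma) ** -0.5 * np.exp(-0.5 * ((1 + 2 * s) / sigma * b ** 2).sum())
print(mc, exact)   # agree to roughly three decimal places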
The simplification of $\xi_2$ in [77] is
\begin{align*}
\xi_2 ={}& \mathbb{E}_{x,x'}\big[k(x,x')^2\big] - \mathbb{E}_{x,x'}[k(x,x')]^2 + \mathbb{E}_{y,y'}\big[k(y,y')^2\big] - \mathbb{E}_{y,y'}[k(y,y')]^2\\
&+ 2\,\mathbb{E}_{x,y}\big[k(x,y)^2\big] - 2\,\mathbb{E}_{x,y}[k(x,y)]^2\\
&- 4\,\mathbb{E}_x\big[\mathbb{E}_{x'}[k(x,x')]\,\mathbb{E}_y[k(x,y)]\big] + 4\,\mathbb{E}_{x,x'}[k(x,x')]\,\mathbb{E}_{x,y}[k(x,y)]\\
&- 4\,\mathbb{E}_y\big[\mathbb{E}_{y'}[k(y,y')]\,\mathbb{E}_x[k(y,x)]\big] + 4\,\mathbb{E}_{y,y'}[k(y,y')]\,\mathbb{E}_{y,x}[k(y,x)],
\end{align*}
so only the terms involving $k^2$ remain to be calculated. Note that $k_I^2 = k_{\sqrt{2}I}$, meaning that if $S$, $R$, $b$ are replaced by $2S$, $2R$, $\sqrt{2}b$ then the formula for (31) immediately gives a formula for the terms involving $k^2$. Combining these derived formulas gives the desired expression for $\xi_2$. Theorem 8 is recovered by substituting $S = R$, $a = 0$, $b = m$ into Theorem 13.

Proof of Theorem 9. Suppose $P_n \xrightarrow{w} P$; then by Simon-Gabriel and Schölkopf [67, Lemma 10], which holds in our case since the key intermediate result Berg et al. [5, Theorem 3.3] only requires $X$ to be a Hausdorff space, we have $\mathrm{MMD}_k(P_n, P) \to 0$.

Suppose $\mathrm{MMD}_k(P_n, P) \to 0$; by Prokhorov's theorem [8, Section 5] we know that $\{P_n\}_{n=1}^\infty$ is relatively compact. Since $k$ is characteristic we know that $\mathcal{H}_k$ is a separating set in the sense of Ethier and Kurtz [21, Chapter 4], and $\mathrm{MMD}_k(P_n, P) \to 0$ implies that $\lim_{n\to\infty}\int F\,dP_n = \int F\,dP$ for every $F \in \mathcal{H}_k$; therefore Ethier and Kurtz [21, Lemma 3.4.3] applies and we may conclude that $P_n \xrightarrow{w} P$.
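Theorem 9 can be illustrated with the closed form of Theorem 7 in one dimension: $P_n = N(1/n, s)$ converges weakly to $P = N(0, s)$, and $\mathrm{MMD}_{k_I}(P_n, P)^2 = 2(1+2s)^{-1/2}\big(1 - e^{-(1/n)^2/(2(1+2s))}\big)$ decreases to zero accordingly. A small NumPy sketch, not part of the paper; $s = 1$ is an arbitrary illustrative choice.

import numpy as np

# Theorem 9 illustrated via Theorem 7's closed form in one dimension:
# the MMD between N(1/n, s) and N(0, s) under the SE-I kernel vanishes
# as the mean difference 1/n -> 0, matching weak convergence.

s = 1.0
for n in [1, 10, 100, 1000]:
    mmd2 = 2 * (1 + 2 * s) ** -0.5 * (1 - np.exp(-(1.0 / n) ** 2 / (2 * (1 + 2 * s))))
    print(n, mmd2)   # decreases to zero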
A.5 Proof for Section 7

Proof of Proposition 5. Suppose $k$ is ISPD and $C_k$ is not injective. Then there exists a non-zero $x \in L^2(D)$ such that $C_k x = 0$, so
\[
\int_D\int_D x(s)\,k(s,t)\,x(t)\,ds\,dt = \langle x, C_k x\rangle_{L^2(D)} = \langle x, 0\rangle_{L^2(D)} = 0,
\]
contradicting $k$ being ISPD. Combining Sriperumbudur et al. [73, Proposition 5] and Sriperumbudur et al. [72, Theorem 9] shows that if $\mu_k$ has full support then $k$ is ISPD.
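Both halves of Proposition 5 can be seen numerically by discretising $C_k$ on a mesh. The NumPy sketch below is not part of the paper; the exponential kernel $e^{-|s-t|}$, whose Cauchy spectral density has full support on $\mathbb{R}$ (so $k$ is ISPD by the cited results), and the mesh size are arbitrary illustrative choices.

import numpy as np

# Discretising (C_k x)(s) = \int_D k(s,t) x(t) dt on a uniform mesh of
# D = [0,1] gives the matrix K * h, whose eigenvalues approximate the
# spectrum of C_k. For an ISPD kernel they should all be strictly
# positive, i.e. the discretised C_k has no non-zero null vector.

m = 200
t = np.linspace(0.0, 1.0, m)
h = 1.0 / m                                  # quadrature weight
K = np.exp(-np.abs(t[:, None] - t[None, :]))
eig = np.linalg.eigvalsh(K * h)
print(eig.min())                             # strictly positive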