Testing Hypotheses by Regularized Maximum Mean Discrepancy
Somayeh Danafar, Paola M.V. Rancoita, Tobias Glasmachers, Kevin Whittingstall, Juergen Schmidhuber
Somayeh Danafar∗†, Paola M.V. Rancoita∗‡, Tobias Glasmachers§, Kevin Whittingstall¶‖, Jürgen Schmidhuber∗†
∗ IDSIA/SUPSI, Manno-Lugano, Switzerland
† Università della Svizzera Italiana, Lugano, Switzerland
‡ CUSSB, Vita-Salute San Raffaele University, Milan, Italy
§ Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
¶ Dept. of Diagnostic Radiology, Université de Sherbrooke, QC, Canada
‖ Sherbrooke Molecular Imaging Center, Université de Sherbrooke, QC, Canada
Abstract
Do two data samples come from different distributions? Recent studies of this fundamental problem focused on embedding probability distributions into sufficiently rich characteristic Reproducing Kernel Hilbert Spaces (RKHSs), to compare distributions by the distance between their embeddings. We show that Regularized Maximum Mean Discrepancy (RMMD), our novel measure for kernel-based hypothesis testing, yields substantial improvements even when sample sizes are small, and excels at hypothesis tests involving multiple comparisons with power control. We derive asymptotic distributions under the null and alternative hypotheses, and assess power control. Outstanding results are obtained on challenging EEG data, MNIST, the Berkeley Covertype, and the Flare-Solar dataset.
I. INTRODUCTION
Homogeneity testing is an important problem in statistics and machine learning. It tests whether two samples are drawn from different distributions. This is relevant for many applications, for instance, schema matching in databases [9] and speaker identification [13]. Popular two-sample tests like Kolmogorov-Smirnov [2] and Cramér-von Mises [17] cannot capture statistical information of densities with high-frequency features. Non-parametric kernel-based statistical tests such as Maximum Mean Discrepancy (MMD) [9], [10] achieve greater power than such density-based methods. MMD is applicable not only to Euclidean spaces $\mathbb{R}^n$, but also to groups and semigroups [8], and to structures such as strings or graphs in bioinformatics and robotics problems [1]. Here we consider a regularized version of MMD to address hypothesis testing.

With more than two distributions to be compared simultaneously, we face the multiple comparisons setting, for which statistical methods exist to deal with the issue of multiple test correction [23]. Given a prescribed global significance threshold $\alpha$ (type I error) for the set of all comparisons, however, the corresponding threshold per comparison becomes small, which greatly reduces the power of the test. In situations where one wants to retain the null hypothesis, tests with small $\alpha$ are not conservative. Our main contribution is the definition of a regularized MMD (RMMD) method.

The regularization term in RMMD allows us to control the power of the test statistic. The regularizer is set provably optimal for maximal power; there is no need for fine-tuning by the user. RMMD improves on MMD through higher power, especially for small sample sizes, while preserving the advantages of MMD. Power control enables us to look for true sets of null distributions among the significant ones in challenging multiple comparison tasks.

We provide experimental evidence of good performance on a challenging Electroencephalography (EEG) dataset, artificially generated periodic and Gaussian data, and the MNIST and Covertype datasets. We also assess power control with the Asymptotic Relative Efficiency (ARE) test.

The paper is organized as follows. In Section 2, we elaborate on hypothesis testing and define maximum mean discrepancy (MMD) as a metric. We describe how to use MMD for homogeneity testing, and how to extend it to multiple comparisons. In Section 3, we define RMMD for hypothesis testing, compare it to MMD and Kernel Fisher Discriminant Analysis (KFDA), and assess power control through ARE. Additional empirical justification of our test on various datasets is presented in Section 4.

II. STATISTICAL HYPOTHESIS TESTING
A statistical hypothesis test is a method which, based on experimental data, aims to decide whether a hypothesis (called the null hypothesis, $H_0$) is true or false, against an alternative hypothesis ($H_1$). The level of significance $\alpha$ of the test represents the probability of rejecting $H_0$ under the assumption that $H_0$ is true (type I error). A type II error ($\beta$) occurs when we reject $H_1$ although it holds. The power of the statistical test is usually defined as $1 - \beta$. A desirable property of a statistical test is that for a prescribed global significance level $\alpha$ the power equals one in the population limit. We divide the discussion of hypothesis testing into two topics: homogeneity testing and multiple comparisons.

A. Maximum Mean Discrepancy (MMD)
Embedding probability distributions into Reproducing Kernel Hilbert Spaces (RKHSs) yields a linear method that takes information of higher order statistics into account [9], [20], [21]. Characteristic kernels [6], [21], [8] injectively map the probability distribution onto its mean element in the corresponding RKHS. The distance between the mean elements ($\mu$) in the RKHS is known as MMD [9], [10]. The definition of MMD [9] is given in the following theorem:

Theorem 1. Let $(\mathcal{X}, \mathcal{B})$ be a metric space, and let $P$, $Q$ be two Borel probability measures defined on $\mathcal{X}$. The kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ embeds the points $x \in \mathcal{X}$ into the corresponding reproducing kernel Hilbert space $\mathcal{H}$. Then $P = Q$ if and only if $\mathrm{MMD}(P, Q) = 0$, where

$$\mathrm{MMD}(P, Q) := \|\mu_P - \mu_Q\|_{\mathcal{H}} = \|E_P[k(x, \cdot)] - E_Q[k(y, \cdot)]\|_{\mathcal{H}} = \left( E_{x,x' \sim P}[k(x, x')] + E_{y,y' \sim Q}[k(y, y')] - 2 E_{x \sim P, y \sim Q}[k(x, y)] \right)^{1/2}. \qquad (1)$$

B. Homogeneity Testing
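As a concrete reference point for the test discussed in this subsection, the squared MMD of Eq. (1) can be estimated by replacing each expectation with a sample average, and a null threshold can be approximated by resampling. The sketch below is only illustrative: the Gaussian kernel, its bandwidth `sigma`, the sample sizes, and the function names are assumptions made for the example, and label permutation is used here as a simple stand-in for the bootstrap procedure mentioned in the text.

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    """Gram matrix of the Gaussian kernel exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2_biased(x, y, sigma=1.0):
    """Plug-in estimate of MMD^2: mean k(x,x') + mean k(y,y') - 2 mean k(x,y)."""
    return (gaussian_gram(x, x, sigma).mean()
            + gaussian_gram(y, y, sigma).mean()
            - 2.0 * gaussian_gram(x, y, sigma).mean())

def permutation_pvalue(x, y, sigma=1.0, n_perm=200, seed=0):
    """Approximate p-value obtained by reshuffling the pooled sample labels."""
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(x, y, sigma)
    pooled = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if mmd2_biased(pooled[idx[:n]], pooled[idx[n:]], sigma) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic as one of the resamples.
    return (exceed + 1) / (n_perm + 1)

# Illustrative data: two Gaussians whose means differ clearly.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(100, 2))
y = rng.normal(1.5, 1.0, size=(100, 2))
p_shifted = permutation_pvalue(x, y)
```

A small `p_shifted` would lead to rejecting the hypothesis that both samples share one distribution. The plug-in estimate is biased but non-negative, which keeps the sketch short.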
A two-sample test investigates whether two samples are generated by the same distribution. For such a test, MMD can be used to measure the distance between the embedded probability distributions in the RKHS. Besides calculating the distance measure, we need to check whether this distance is significantly different from zero. For this, the asymptotic distribution of the distance measure is used to obtain a threshold on MMD values and to extract the statistically significant cases. We perform a hypothesis test with null hypothesis $H_0 : P = Q$ and alternative $H_1 : P \neq Q$ on samples drawn from two distributions $P$ and $Q$. If the result of MMD is close enough to zero, we accept $H_0$, which indicates that the distributions $P$ and $Q$ coincide; otherwise the alternative is assumed to hold. With $\alpha$ as the significance level, the $(1 - \alpha)$-quantile of the asymptotic distribution of the empirical MMD under $H_0$ (when $P = Q$) serves as the threshold: values above it are statistically significant. Our MMD test determines this quantile by means of a bootstrap procedure.

C. Multiple Comparisons
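The arithmetic behind multiple test correction is short enough to sketch directly. With a family-wise level $\alpha$ and $n$ independent comparisons, the Dunn-Šidák relation $\alpha = 1 - (1 - \bar{\alpha})^n$ can be inverted to obtain the per-comparison level $\bar{\alpha}$. In the example below the value $\alpha = 0.05$, the choice of 10 classes, and the function names are illustrative assumptions, not settings taken from the experiments.

```python
def sidak_per_comparison(alpha, n_comparisons):
    """Invert alpha = 1 - (1 - alpha_bar)^n to get the per-comparison level."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_comparisons)

def pairwise_count(m):
    """Number of unordered pairs among m distributions."""
    return m * (m - 1) // 2

alpha = 0.05                       # illustrative family-wise level
n_pairs = pairwise_count(10)       # e.g. 10 classes compared pairwise
alpha_bar = sidak_per_comparison(alpha, n_pairs)

# Sanity check: plugging alpha_bar back in recovers the family-wise level.
recovered = 1.0 - (1.0 - alpha_bar) ** n_pairs
```

The per-comparison level shrinks quickly as the number of comparisons grows, which is the loss of power that motivates the rest of this section.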
Statistical analysis of a data set typically requires testing many hypotheses. The multiple comparisons or multiple testing problem arises when we evaluate several statistical hypotheses simultaneously. Let $\alpha$ be the overall type I error, and let $\bar{\alpha}$ denote the type I error of a single comparison in the multiple testing scenario. Maintaining the prescribed significance level $\alpha$ in multiple comparisons requires $\bar{\alpha}$ to be more stringent than $\alpha$. Nevertheless, in many studies $\alpha = \bar{\alpha}$ is used without correction. Several statistical techniques have been developed to control $\alpha$ [23]. We use the Dunn-Šidák method: for $n$ independent comparisons in multiple testing, the significance level $\alpha$ is obtained by $\alpha = 1 - (1 - \bar{\alpha})^n$. As $\bar{\alpha}$ decreases, the probability of a type II error ($\beta$) increases and the power of the test decreases. It is therefore necessary to control $\beta$ while correcting $\alpha$. To tackle this problem, and to control $\beta$, we define in the next section a new hypothesis test based on RMMD, which has higher power than the MMD-based test. To compare the distributions in the multiple testing problem we use two approaches: one-vs-all and pairwise comparisons. In the one-vs-all case each distribution is compared to all other distributions in the family, thus $M$ distributions require $M - 1$ comparisons. In the pairwise case each pair of distributions is compared, at the cost of $M(M - 1)/2$ comparisons.

III. REGULARIZED MAXIMUM MEAN DISCREPANCY (RMMD)

The main contribution of this paper is a novel regularization of the MMD measure called RMMD. This regularization aims to provide a test statistic with greater power (power closer to 1 with a prescribed type I error $\alpha$). Erdogmus and Principe [5] showed that $-\log \|\mu_P\|^2_{\mathcal{H}}$ is the Parzen window estimate of the Rényi entropy [16]. With RMMD we obtain a statistical test with greater power by penalizing the term $\|\mu_P\|^2_{\mathcal{H}} + \|\mu_Q\|^2_{\mathcal{H}}$. We formulate RMMD and its empirical estimator as follows:
$$\mathrm{RMMD}(P, Q) := \mathrm{MMD}(P, Q)^2 - \kappa_P \|\mu_P\|^2_{\mathcal{H}} - \kappa_Q \|\mu_Q\|^2_{\mathcal{H}} \qquad (2)$$

$$\widehat{\mathrm{RMMD}}(P, Q) := \|\hat{\mu}_P - \hat{\mu}_Q\|^2_{\mathcal{H}} - \kappa_P \|\hat{\mu}_P\|^2_{\mathcal{H}} - \kappa_Q \|\hat{\mu}_Q\|^2_{\mathcal{H}} \qquad (3)$$

where $\kappa_P$ and $\kappa_Q$ are non-negative regularization constants. For simplicity we consider $\kappa_P = \kappa_Q = \kappa$ in many applications; however, we can introduce prior knowledge about the complexity of the distributions by choosing $\kappa_P \neq \kappa_Q$. The modified Jensen-Shannon divergence (JS) [3] corresponding to RMMD is defined as:

$$D(P, Q) := H_s(P, Q) - (\kappa + 1)(H_s(P) + H_s(Q)) \qquad (4)$$

where $H_s$ denotes the (cross) entropy. Since $\kappa$ is positive, the absolute value of the second term on the right-hand side of Eq. (4) increases, leading to a higher weight for the mutual information than for the entropy (and vice versa if $\kappa$ were less than $-1$).

Here we summarize the notation needed in the next section. Given samples $\{x_i\}_{i=1}^{n_1}$ and $\{y_i\}_{i=1}^{n_2}$ drawn from distributions $P$ and $Q$, respectively, the mean element, the cross-covariance operator and the covariance operator are defined as follows [7], [9]:

$$\hat{\mu}_P = \frac{1}{n_1} \sum_{i=1}^{n_1} k(x_i, \cdot), \qquad \hat{\Sigma}_{PQ} = \frac{n_1 n_2}{n_1 + n_2} (\hat{\mu}_P - \hat{\mu}_Q) \otimes (\hat{\mu}_P - \hat{\mu}_Q), \qquad \hat{\Sigma}_P = \frac{1}{n_1} \sum_{i=1}^{n_1} k(x_i, \cdot) \otimes k(x_i, \cdot) - \hat{\mu}_P \otimes \hat{\mu}_P,$$

where $u \otimes v$ for $u, v \in \mathcal{H}$ is defined for all $f \in \mathcal{H}$ as $(u \otimes v) f = \langle v, f \rangle_{\mathcal{H}} \, u$. The quantities $\hat{\mu}_Q$ and $\hat{\Sigma}_Q$ are defined analogously for the second sample $\{y_i\}_{i=1}^{n_2}$. The population counterparts, i.e., the population mean element and the population covariance operator, are defined for any probability measure $P$ by $\langle \mu_P, f \rangle_{\mathcal{H}} = E[f(x)]$ for all $f \in \mathcal{H}$, and $\langle f, \Sigma_P g \rangle_{\mathcal{H}} = \mathrm{cov}_P[f(x), g(x)]$ for $f, g \in \mathcal{H}$. From now on we call $\Sigma_B = \Sigma_{PQ}$ the between-distribution covariance. The pooled covariance operator (which we also call the within-distribution covariance) is denoted by:

$$\Sigma_W = \frac{n_1}{n_1 + n_2} \Sigma_P + \frac{n_2}{n_1 + n_2} \Sigma_Q.$$

RMMD with negative-valued $\kappa$ can be used in clustering as a divergence to compare clusters.
We achieve greater entropy with broader clusters. The resulting clustering method avoids overfitting with narrow clusters.

A. Limit Distribution of RMMD Under Null and Fixed Alternative Hypotheses
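The two quantities that drive this subsection, the U-statistic built from the pairwise function $h$ and the plug-in variance that enters the normal limit, can be sketched numerically for paired samples of equal size. The code below is a minimal illustration under stated assumptions: a Gaussian kernel with bandwidth `sigma`, $\kappa_P = \kappa_Q = \kappa$, and a standard plug-in estimator for the variance of a U-statistic. The function names are hypothetical and this is not claimed to be the authors' implementation.

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    """Gram matrix of the Gaussian kernel exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def h_matrix(x, y, kappa=1.0, sigma=1.0):
    """h(z_i, z_j) = (1 - kappa)(k(x_i,x_j) + k(y_i,y_j)) - k(x_i,y_j) - k(x_j,y_i)."""
    kxx = gaussian_gram(x, x, sigma)
    kyy = gaussian_gram(y, y, sigma)
    kxy = gaussian_gram(x, y, sigma)
    return (1.0 - kappa) * (kxx + kyy) - kxy - kxy.T

def u_statistic_and_variance(x, y, kappa=1.0, sigma=1.0):
    """U-statistic of h over pairs i != j, and a plug-in estimate of its
    asymptotic variance 4 * Var_z(E_z'[h(z, z')])."""
    h = h_matrix(x, y, kappa, sigma)
    n = h.shape[0]
    np.fill_diagonal(h, 0.0)              # exclude the i == j terms
    u = h.sum() / (n * (n - 1))
    g = h.sum(axis=1) / (n - 1)           # row means approximate E_z'[h(z_i, z')]
    sigma2 = 4.0 * (np.mean(g**2) - u**2)
    return u, sigma2

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(80, 2))
y = rng.normal(1.0, 1.0, size=(80, 2))
u0, s0 = u_statistic_and_variance(x, y, kappa=0.0)   # unregularized case
u1, s1 = u_statistic_and_variance(x, y, kappa=1.0)   # kappa = 1
```

With `kappa=0.0` the statistic is an unbiased-style estimate of the squared MMD, while with `kappa=1.0` only the cross terms remain, so the statistic is negative for a positive kernel.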
Now we derive the distribution of the test statistic under the null hypothesis of homogeneity $H_0 : P = Q$ (Theorem 2), which implies $\mu_P = \mu_Q$ and $\Sigma_P = \Sigma_Q = \Sigma_W$. Consistency of the test is guaranteed by the form of the distribution under $H_1 : P \neq Q$ (Theorem 2). Assume that $\{x_i\}_{i=1}^{n_1}$ and $\{y_i\}_{i=1}^{n_2}$ are independent samples from $P$ and $Q$, respectively (a priori they are not equally distributed). Let $z_i := (x_i, y_i)$, $h(z_i, z_j) := k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i) - h'(z_i, z_j)$, and $h'(z_i, z_j) := \kappa_P k(x_i, x_j) + \kappa_Q k(y_i, y_j)$, and let $\xrightarrow{D}$ denote convergence in distribution. Without loss of generality we assume $n_1 = n_2 = n$ and $\kappa_P = \kappa_Q = \kappa$. The proofs hold even when $\kappa_P \neq \kappa_Q$. Based on Hoeffding [14] and on Theorem A (p. 192) and Theorem B (p. 193) of Serfling [19], we can prove the following theorem:

Theorem 2. If $E[h^2] < \infty$, then under $H_1$, $\widehat{\mathrm{RMMD}}$ is asymptotically normally distributed,

$$\sqrt{n} \, (\widehat{\mathrm{RMMD}} - \mathrm{RMMD}) \xrightarrow{D} N(0, \hat{\sigma}^2),$$

with variance $\hat{\sigma}^2 = 4 \left( E_z\left[ (E_{z'}[h(z, z')])^2 \right] - (E_{z,z'}[h(z, z')])^2 \right)$, uniformly at rate $1/\sqrt{n}$. Under $H_0$, the same convergence holds with $\hat{\sigma}^2 = 4 \left( E_z\left[ (E_{z'}[h'(z, z')])^2 \right] - (E_{z,z'}[h'(z, z')])^2 \right) > 0$.

To increase the power of our RMMD-based test we need to decrease the variance under $H_1$ in Theorem 2. The following theorem can be used to obtain maximal power by setting $\kappa = 1$. This gives us a fixed hyper-parameter; there is no need for user tuning. The optimal value of $\kappa$ decreases the variance under both $H_0$ and $H_1$ simultaneously, and the fixed $\alpha$ is defined over the changed variance under $H_0$.

Theorem 3. The highest power of RMMD is obtained for $\kappa = 1$.

Proof. Let $A = k(x_i, x_j) + k(y_i, y_j)$ and $B = k(x_i, y_j) + k(x_j, y_i)$, so that $h = (1 - \kappa) A - B$. Based on Theorem 2, the variance under $H_1$ is obtained by:

$$\begin{aligned}
\hat{\sigma}^2 &= 4 \left( E_z\left[ (E_{z'}[h(z, z')])^2 \right] - (E_{z,z'}[h(z, z')])^2 \right) \\
&= 4 \left( E\left[ ((1 - \kappa) A - B)^2 \right] - (E[(1 - \kappa) A - B])^2 \right) \\
&= 4 \left( (1 - \kappa)^2 (E[A^2] - E[A]^2) + E[B^2] - E[B]^2 \right) \\
&= 4 \left( (1 - \kappa)^2 \mathrm{var}(A) + \mathrm{var}(B) \right), \qquad (5)
\end{aligned}$$

where $\mathrm{var}(A)$ and $\mathrm{var}(B)$ denote the variances, and the cross term between $A$ and $B$ is neglected in the third step. To get maximal power, we set

$$\frac{\partial \left( (1 - \kappa)^2 \mathrm{var}(A) + \mathrm{var}(B) \right)}{\partial \kappa} = 0, \qquad (6)$$

which yields $\kappa = 1$.

B. Comparison between RMMD, MMD, and KFDA
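As a concrete reference for the comparison in this subsection, the empirical estimator of Eq. (3) can be written in terms of three Gram-matrix averages, which makes the relation to MMD explicit: with $\kappa_P = \kappa_Q = 0$ the estimate reduces to the (plug-in) squared MMD. The sketch below assumes a Gaussian kernel and uses hypothetical function names; it is an illustration, not the authors' code.

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    """Gram matrix of the Gaussian kernel exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def embedding_terms(x, y, sigma=1.0):
    """Return ||mu_P_hat||^2, ||mu_Q_hat||^2 and <mu_P_hat, mu_Q_hat>."""
    norm_p = gaussian_gram(x, x, sigma).mean()
    norm_q = gaussian_gram(y, y, sigma).mean()
    cross = gaussian_gram(x, y, sigma).mean()
    return norm_p, norm_q, cross

def rmmd_hat(x, y, kappa_p=1.0, kappa_q=1.0, sigma=1.0):
    """Plug-in version of Eq. (3): squared MMD minus the weighted squared norms."""
    norm_p, norm_q, cross = embedding_terms(x, y, sigma)
    mmd2 = norm_p + norm_q - 2.0 * cross
    return mmd2 - kappa_p * norm_p - kappa_q * norm_q

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(60, 3))
y = rng.normal(0.5, 1.0, size=(60, 3))
norm_p, norm_q, cross = embedding_terms(x, y)
mmd2 = rmmd_hat(x, y, kappa_p=0.0, kappa_q=0.0)   # kappa = 0 recovers MMD^2
rmmd1 = rmmd_hat(x, y, kappa_p=1.0, kappa_q=1.0)  # kappa = 1 leaves -2 * cross
```

For $\kappa = 1$ the two squared norms cancel and only the cross term $-2\langle \hat{\mu}_P, \hat{\mu}_Q \rangle$ survives, which is one way to see why the statistic no longer vanishes under the null hypothesis.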
According to Theorem 8 of Gretton et al. [9], under the null hypothesis the MMD test statistic degenerates. This corresponds to $\hat{\sigma}^2 = 0$ in our Theorem 2. For large sample sizes the null distribution of MMD converges in distribution to an infinite weighted sum of independent $\chi^2$ random variables, with weights equal to the eigenvalues of the within-distribution covariance operator $\Sigma_W$. If we denote the test statistic based on MMD by $\hat{T}_n^{\mathrm{MMD}}$, then $\hat{T}_n^{\mathrm{MMD}} \xrightarrow{D} C \sum_{l=1}^{\infty} \lambda_l (z_l^2 - 1)$, where $z_l \sim N(0, 1)$ are i.i.d. random variables, and $C$ is a scaling factor. Harchaoui et al. [13] introduced Kernel Fisher Discriminant Analysis (KFDA) as a homogeneity test by regularizing MMD with the within-distribution covariance operator. The maximum Fisher discriminant ratio defines this test statistic. The empirical KFDA test statistic is denoted as

$$\widehat{\mathrm{KFDA}}(P, Q) = \frac{n_1 n_2}{n_1 + n_2} \left\| (\hat{\Sigma}_W + \gamma_n I)^{-1/2} (\hat{\mu}_P - \hat{\mu}_Q) \right\|^2_{\mathcal{H}}.$$

To analyze the asymptotic behaviour of this statistic under the null hypothesis, Harchaoui et al. [13] consider two situations regarding the regularization parameter $\gamma_n$: 1) one where $\gamma_n$ is held fixed, obtaining a limit distribution similar to that of MMD under $H_0$; 2) one where $\gamma_n$ tends to zero more slowly than $n^{-1/2}$. In the first situation the test statistic converges to $\hat{T}_n^{\mathrm{KFDA}(\gamma_n)} \xrightarrow{D} C \sum_{l=1}^{\infty} (\lambda_l + \gamma_n)^{-1} \lambda_l (z_l^2 - 1)$. Thus, the test statistic based on KFDA normalizes the weights of the $\chi^2$ random variables by using the covariance operator as the regularizer. In comparison, MMD is more sensitive to the information of higher order moments because of their bigger weights (larger eigenvalues of the covariance operator). In the second situation (applicable in practice only for very large sample sizes) the test statistic converges to $\hat{T}_n^{\mathrm{KFDA}(\gamma_n)} \xrightarrow{D} N(C, [\ldots])$, where $C$ is a constant.

The asymptotic convergence of the test statistic based on RMMD is $\hat{T}_n^{\mathrm{RMMD}} \xrightarrow{D} N(0, \hat{\sigma}^2)$, where $\hat{\sigma}^2$ is the variance of the function $h$ in Theorem 2. This precise analytical normal distribution gives RMMD its higher power. Because of the degeneracy ($\hat{\sigma}^2 = 0$ in the asymptotic distribution), MMD and KFDA rely on an estimate of the distribution under the null hypothesis, which loses accuracy and affects the power. In contrast to MMD and KFDA, RMMD is consistent since this degeneracy under the null hypothesis no longer occurs. RMMD is the generalized form of the test statistic based on MMD, which we obtain for $\kappa = 0$. Moreover, by minimizing the variance of the normal distribution, we obtain the best power for $\kappa = 1$, and thus the hyper-parameter $\kappa$ is fixed without requiring tuning by the user.

In comparison to KFDA, RMMD does not require restrictive constraints to obtain high power. It also results in higher power than MMD and KFDA in cases with small sample size.
The speed of power convergence in KFDA is $O_p(1)$, which is slower than the $O_p(n^{-1/2})$ of RMMD when $n \to \infty$.

Regarding the computational complexity, for MMD a parametric model with lower order moments of the test statistic is used to estimate the value of MMD, which degenerates under $H_0$ and which has no consistency or accuracy guarantee. In comparison, the bootstrap resampling and the eigen-spectrum of the Gram matrix are more consistent estimates, with computational cost of $O(n)$, where $n$ is the number of samples [11]. For RMMD, the convergence of the test statistic to a normal distribution enables a fast, consistent and straightforward estimation of the null distribution within $O(n)$ time, without the need for a separate estimation method. The results of the power comparison between these tests are reported in Section 4.

C. Asymptotic Relative Efficiency of Statistical Tests
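The relative efficiency used in this subsection is a ratio of required sample sizes, so it can be read off from empirical power curves. The sketch below assumes that the power of two tests $T$ and $V$ has been estimated on a common grid of sample sizes; the numbers are hypothetical placeholders chosen for illustration, not results from the paper, and the function names are assumptions.

```python
def required_sample_size(sample_sizes, powers, target_power):
    """Smallest tabulated sample size whose estimated power reaches the target,
    or None if the target is never reached on the grid."""
    for n, p in sorted(zip(sample_sizes, powers)):
        if p >= target_power:
            return n
    return None

def relative_efficiency(n_t, n_v):
    """e_{T,V} = N_V / N_T: values below 1 mean V needs fewer samples than T."""
    return n_v / n_t

# Hypothetical power curves for two tests on the same grid of sample sizes.
sizes = [50, 100, 200, 400]
power_t = [0.30, 0.55, 0.85, 0.99]
power_v = [0.60, 0.92, 1.00, 1.00]

n_t = required_sample_size(sizes, power_t, target_power=0.9)
n_v = required_sample_size(sizes, power_v, target_power=0.9)
e_tv = relative_efficiency(n_t, n_v)
```

In this toy setting `e_tv` is below one, so test $V$ would be preferred. Pushing the target power towards one on a finer grid gives an empirical approximation of the limiting efficiency.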
To assess the power control we use the asymptotic relative efficiency. This criterion shows that RMMD is a better test statistic and obtains higher power than KFDA and MMD with smaller sample sizes. Relative efficiency enables one to select the most effective statistical test quantitatively [15]. Let $T$ and $V$ be test statistics to be compared. The sample size necessary for the test statistic $T$ to achieve the power $1 - \beta$ at significance level $\alpha$ is denoted by $N_T(\alpha, 1 - \beta)$. The relative efficiency of the statistical test $T$ with respect to the statistical test $V$ is given by:

$$e_{T,V}(\alpha, 1 - \beta) = N_V(\alpha, 1 - \beta) / N_T(\alpha, 1 - \beta). \qquad (7)$$

Since calculating $N_T(\alpha, 1 - \beta)$ is hard even for the simplest test statistics, the limit value of $e_{T,V}(\alpha, 1 - \beta)$ as $1 - \beta \to 1$ is used. The limiting value is called the Bahadur Asymptotic Relative Efficiency (ARE), denoted by $e^B_{T,V}$:

$$e^B_{T,V} := \lim_{1 - \beta \to 1} e_{T,V}(\alpha, 1 - \beta). \qquad (8)$$

The test statistic $V$ is considered better than $T$ if $e_{T,V}$ is smaller than 1, because this means that $V$ needs a lower sample size to obtain a power of $1 - \beta$ for the given $\alpha$. In [13], the authors assessed the power control by means of an analysis of local alternatives, which works for very large sample sizes or when $n$ tends to infinity. In this article, we focus our attention on the small sample size case, which is more challenging. In Section 4, we compute $e^B_{\mathrm{MMD},\mathrm{RMMD}} = N_{\mathrm{RMMD}} / N_{\mathrm{MMD}}$, $e^B_{\mathrm{MMD},\mathrm{KFDA}} = N_{\mathrm{KFDA}} / N_{\mathrm{MMD}}$, and $e^B_{\mathrm{KFDA},\mathrm{RMMD}} = N_{\mathrm{RMMD}} / N_{\mathrm{KFDA}}$ using artificial datasets and two types of kernels, and we obtain a smaller ARE for RMMD than for KFDA and MMD. This means RMMD gives higher power with a much smaller sample size. Results for different data sets are reported in Table 2, Figure 2, and Figure 3.

IV. EXPERIMENTS
MMD [9] was experimentally shown to outperform many traditional two-sample tests such as the generalized Wald-Wolfowitz test, the generalized Kolmogorov-Smirnov (KS) test [2], the Hall-Tajvidi (Hall) test [12], and the Biau-Györfi test. It was shown [13] that KFDA outperforms the Hall-Tajvidi test. We select KS and Hall as traditional baseline methods, against which we compare RMMD, KFDA, and MMD. To experimentally evaluate the utility of the proposed hypothesis testing method, we present results on various artificial and real-world benchmark datasets.
A. Artificial Benchmarks with Periodic and Gaussian Distributions
Our proposed method can be used for testing the homogeneity of structured data, which is an advantage over traditional two-sample tests. We artificially generated distributions from Locally Compact Abelian Groups (periodic data) and applied our RMMD test to decide whether the samples come from the same distribution or not. Suppose the first sample is drawn from a uniform distribution $P$ on the unit interval. The other sample is drawn from a perturbed uniform distribution $Q_\omega$ with density $1 + \sin(\omega x)$. For higher perturbation frequencies $\omega$ it becomes harder to discriminate $Q_\omega$ from $P$. Since the distributions have a periodic nature, we use a characteristic kernel tailored to the periodic domain, $k(x, y) = \cosh(\pi - (x - y)_{\bmod 2\pi})$. For 200 samples from each distribution, the type II error is computed by comparing the prediction to the ground truth over 1000 repetitions. We average the results over 10 runs. The significance level is set to $\alpha = 0.[\ldots]$. We perform the same experiment with MMD, KFDA, KS and Hall. The powers of the homogeneity test for comparing $P$ and $Q_{[\ldots]}$ with the above-mentioned methods are reported in Table 1 as Periodic1. The best power is achieved by RMMD and, as expected, the results of the kernel methods are better than those of the traditional ones.

Since the selection of the kernel is a critical choice in kernel-based methods, we also investigated a different kernel and replaced the previous one with $k(x, y) = -\log(1 - 2\theta \cos(x - y) + \theta^2)$, where $\theta$ is a hyperparameter. We report the best results, achieved with $\theta = 0.[\ldots]$, as Periodic2 in Table 1. The reader is referred to [4], [8] for a detailed study of these kernels.

We also report the results on the toy problem of comparing two $[\ldots]$-dimensional Gaussian distributions with $[\ldots]$ samples, both with zero mean vector but with covariance matrices $[\ldots] I$ and $[\ldots] I$, respectively. This dataset is referred to as Gaussian in Table 1.

TABLE I
The power obtained on the periodic data, the Gaussian, the MNIST, Covertype, and Flare-Solar datasets, by applying RMMD with $\kappa = 0.[\ldots]$ for the periodic data and $\kappa = 1$ for the others, and KFDA with $\gamma = 10^{-[\ldots]}$.

              RMMD    KFDA    MMD     KS      Hall
Periodic1     […]     […]     […]     […]     […]
Periodic2     […]     […]     […]     […]     […]
Gaussian      […]     […]     […]     […]     […]
MNIST         […]     […]     […]     […]     […]
Covertype     […]     […]     […]     […]     […]
Flare-Solar   […]     […]     […]     […]     […]
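The periodic benchmark can be mimicked with a few lines of code. The sketch below rests on explicit assumptions: the perturbed density is taken to be $1 + \sin(\omega x)$ on the unit interval (a proper density when $\omega$ is a multiple of $2\pi$), the kernel is taken to be $\cosh(\pi - (x - y) \bmod 2\pi)$, and samples are drawn by rejection sampling. The function names and the choice $\omega = 2\pi$ are hypothetical and serve only as an illustration.

```python
import numpy as np

def sample_perturbed_uniform(n, omega, rng):
    """Rejection sampling from the density 1 + sin(omega * x) on [0, 1].
    Proposals are uniform; the density is bounded above by 2."""
    accepted = []
    while len(accepted) < n:
        u = rng.uniform(0.0, 1.0, size=2 * n)
        keep = rng.uniform(0.0, 2.0, size=2 * n) < 1.0 + np.sin(omega * u)
        accepted.extend(u[keep].tolist())
    return np.array(accepted[:n])

def periodic_kernel(x, y):
    """Gram matrix of k(x, y) = cosh(pi - ((x - y) mod 2 pi)) for 1-D inputs."""
    d = np.mod(x[:, None] - y[None, :], 2.0 * np.pi)
    return np.cosh(np.pi - d)

rng = np.random.default_rng(0)
p_sample = rng.uniform(0.0, 1.0, size=2000)                     # uniform P
q_sample = sample_perturbed_uniform(2000, 2.0 * np.pi, rng)     # perturbed Q
gram = periodic_kernel(q_sample[:50], q_sample[:50])
frac_lower_half = np.mean(q_sample < 0.5)
```

With $\omega = 2\pi$ most of the mass of the perturbed density lies in the first half of the interval, so `frac_lower_half` is well above one half. For larger $\omega$ the oscillations become finer and the two samples look increasingly alike to coarse summaries.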
An investigation of the effect of kernel selection and tuning parameters [22] showed that the best results for MMD are achieved by those kernels and parameters that attain the supremum value of MMD. Our reported results agree. The results of the kernel-based test statistics (RMMD, KFDA, and MMD) are improved by kernel selection and parameter tuning, and in all cases RMMD outperforms KFDA and MMD. For instance, the result of the periodic kernel with tuned hyper-parameter $\theta$ is better than that of the first periodic kernel without hyper-parameter (reported in Table 1 as Periodic2 and Periodic1, respectively). For datasets processed with the Gaussian kernel, setting the bandwidth to the median distance between data points provided the best results. We used 5-fold cross validation to tune the parameters in our experiments.

The effect of changing $\kappa$ on the power is simulated in two tests: first, by testing the similarity between the uniform distribution and $Q_{[\ldots]}$, and second with $Q_{[\ldots]}$. In both cases, the best power is obtained for $\kappa = 0.[\ldots]$. The results differ slightly from the theoretical value ($\kappa = 1$) because of the relatively small sample sizes ($n_1 = n_2 = 200$) used for the tests. For samples with larger sizes we obtained maximal power with $\kappa = 1$. The results are depicted in Figure 1.

Fig. 1. Effect of $\kappa$ on the power of the test. The alternatives are $Q_{[\ldots]}$ in the left and $Q_{[\ldots]}$ in the right figure.

To ensure that the statistical test is not too aggressive in rejecting the null hypothesis, we report the type I error for RMMD, KFDA, and MMD with different sample sizes in Figure 2. Both samples are drawn from $Q_{[\ldots]}$. We used a Gaussian kernel with a variance equal to the median distance of the data points.
The results are averaged over 100 runs, and the confidence intervals are obtained from 10 replicates. RMMD obtains zero type I error with smaller sample sizes, and the results of KFDA and MMD are comparable.

To assess the power control of the test statistics we also compared $e^B_{\mathrm{MMD},\mathrm{RMMD}}$, $e^B_{\mathrm{MMD},\mathrm{KFDA}}$, and $e^B_{\mathrm{KFDA},\mathrm{RMMD}}$ under $H_1$ when $P$ is a uniform distribution and the alternative is $Q_{[\ldots]}$. We obtained a smaller ARE for RMMD than for KFDA and MMD. This means RMMD gives higher power with fewer samples. Table 2 shows the results, averaged over 1000 runs, for periodic data (Periodic1 and Periodic2). Figure 3 depicts the detailed results of the type II error for RMMD, MMD, and KFDA for different sample sizes $n$.

AREs are also calculated for more complex tasks. Suppose the first sample is drawn from a uniform distribution $P$ on the unit area. The other sample is drawn from the perturbed uniform distribution $Q_\omega$ with density $1 + \sin(\omega x) \sin(\omega y)$. For increasing values of $\omega$, the discrimination of $Q_\omega$ from $P$ becomes harder (Figure 4). The value of $\omega$ ranges from 1 to 6. We call these problems Puni1 to Puni6, respectively. The best results for all statistical kernel-based methods are achieved by using a characteristic kernel tailored to the periodic domain, $k(x, y) = \prod_{i=1}^{2} 1 / (1 - 2\theta \cos(x_i - y_i) + \theta^2)$, with $\theta = 0.[\ldots]$ tuned using 5-fold cross validation. The results reported in Table 2 show much smaller values of ARE for RMMD than for KFDA and MMD. Figure 5 shows the detailed results of the type II error for RMMD, MMD, and KFDA for different sample sizes $n$ and different frequencies $\omega$. As displayed in Figure 5, RMMD obtains the robust result of zero type II error for 100 samples over all frequencies. In contrast, KFDA and MMD need much larger samples for the more difficult cases with larger $\omega$ to obtain a power of one.

Fig. 2. Type I error as a function of the sample size $n$.

TABLE II
The ARE obtained on the periodic data, by applying RMMD with $\kappa = 1$, and $\theta = 0.[\ldots]$ in periodic kernels, and KFDA with $\gamma = 10^{-[\ldots]}$.

              e^B_{MMD,RMMD}   e^B_{MMD,KFDA}   e^B_{KFDA,RMMD}
Periodic1     […]              […]              […]
Periodic2     […]              […]              […]
Puni1         […]              […]              […]
Puni2         […]              […]              […]
Puni3         […]              […]              […]
Puni4         […]              […]              […]
Puni5         […]              […]              […]
Puni6         […]              […]              […]

B. MNIST, Covertype, and Flare-Solar Datasets
Moving from synthetic data to standard benchmarks, we tested our method on three datasets: 1) the MNIST dataset of handwritten digits (LibSVM library: 10 classes, 5000 data points, and 784 dimensions); 2) the Covertype dataset of forest cover types (LibSVM library: 7 classes, 1400 instances, and 54 dimensions); 3) the Flare-Solar dataset (mldata.org: 2 classes, 100 instances, 10 dimensions). We compare the performance of RMMD with $\kappa = 1$, KFDA with $\gamma = 10^{-[\ldots]}$, and MMD, using the pairwise approach and testing for differences between the distributions of the classes; see Table 1. We average the results over 10 runs. The family-wise level is set to $\alpha = 0.[\ldots]$ (resulting in $\bar{\alpha} = 0.[\ldots]$, $\bar{\alpha} = 0.[\ldots]$ and $\bar{\alpha} = 0.[\ldots]$ for each individual comparison for the MNIST, Covertype and Flare-Solar datasets, respectively). The RMMD-based test achieves higher power than the other methods (see Table 1).

C. Electroencephalography Data
We recorded EEG from four subjects performing a visual task. A checkerboard was presented in the subject's left visual field. We refer to [25] for details on data collection and preprocessing. In our learning task, for each subject we have 64 signal distributions assigned to 64 electrodes. The data contain 360 instances of a 200-dimensional feature vector for each distribution. The goal of hypothesis testing is to disambiguate signals recorded from electrodes corresponding to early visual cortex from the rest. This is difficult because of the low signal-to-noise ratio and the similarity of the patterns of all electrodes. Moreover, the high number of electrodes makes this experiment a good candidate to assess the multiple comparison part of our method. In the one-vs-all approach the normalized distribution of each electrode is compared to the normalized combined distribution of the other 63 electrodes. RMMD with $\kappa = 1$ and a Gaussian kernel is used as our hypothesis test. The parameter $\sigma$ of the Gaussian kernel is set to the median distance of the data points. The results of our hypothesis test reject the null hypothesis and confirm the dissimilarity of distributions in 63 electrodes. The results of the pairwise approach with RMMD and MMD are depicted in Figure 6.

Neuroscientists usually subjectively assess the results obtained from imaging techniques and inferred from machine learning. For instance, in the current experiment the expectation is that electrodes in region A1 are categorized together by means of EEG imaging techniques and multiple comparisons. But electrodes of other areas (such as A2 and A3, see Figure 7) can be confused as belonging to A1 due to the high noise. Figure 7 describes the categorization of the electrodes. We assess our results quantitatively by means of False Discovery Rates (FDR), using the following FDRs to compare the results of RMMD to those of MMD:

$$\mathrm{FDR}_1 = \frac{\text{no. of electrodes categorized for the visual task in } A_2 \cup A_3 \cup B}{U}, \quad \mathrm{FDR}_2 = \frac{\text{no. of electrodes categorized for the visual task in } A_{[\ldots]} \cup B}{U}, \quad \mathrm{FDR}_3 = \frac{\text{no. of electrodes categorized for the visual task in } B}{U},$$

where $U$ is the total number of electrodes categorized for the task. The results are depicted in Figure 7. RMMD obtained more robust and better results than MMD, with smaller FDRs.

Fig. 3. Type II error as a function of the sample size $n$. On the left, the results with Periodic kernel 1, and on the right, the results with Periodic kernel 2.

Fig. 4. The probability density function of Puni1 with $\omega = 1$ on the left and the probability density function of Puni6 with $\omega = 6$ on the right. As $\omega$ increases, the probability density function looks more similar to the uniform distribution and the discrimination of $P$ and $Q_\omega$ becomes more difficult for the test statistics.

Fig. 5. Type II error for different sample sizes $n$ and different frequencies $\omega$: on the left for the RMMD-based test, in the middle for the KFDA-based test, and on the right for the MMD-based test.

Fig. 6. The results of RMMD and MMD as hypothesis tests on the EEG data recorded from 64 electrodes per subject, in the top row and the bottom row, respectively. Categorized electrodes recognized by the two methods as related to the visual task are colored.

Fig. 7. The reference image of the EEG electrodes is shown on the left. We categorized the electrodes into four groups as follows: A1, the electrodes corresponding to visual cortex in the region of interest; A2, the peripheral electrodes that can be wrongly detected due to noise; A3, the electrodes in the left visual cortex often detected due to noise or interrelation between brain areas; and B, all the remaining electrodes. On the right, the results of RMMD and MMD are quantitatively compared based on the FDRs defined in the text. The smallest and most robust FDRs are obtained by RMMD.
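The false discovery rates defined in the text are simple proportions and can be sketched as follows. The electrode labels below are hypothetical and purely illustrative, the function name is an assumption, and the set of groups counted as false discoveries is passed in as a parameter so that each of the three rates can be computed by choosing the corresponding groups.

```python
def false_discovery_rate(detected_groups, false_groups):
    """Fraction of detected electrodes whose group counts as a false discovery.
    detected_groups: group label ('A1', 'A2', 'A3' or 'B') of each detected electrode.
    false_groups: set of labels treated as false discoveries."""
    total = len(detected_groups)
    if total == 0:
        return 0.0
    false_hits = sum(1 for g in detected_groups if g in false_groups)
    return false_hits / total

# Hypothetical detections for one subject (not data from the paper).
detected = ["A1", "A1", "A1", "A2", "A3", "B", "A1", "A1"]

fdr_1 = false_discovery_rate(detected, {"A2", "A3", "B"})  # strictest criterion
fdr_3 = false_discovery_rate(detected, {"B"})              # most lenient criterion
```

Because the sets of false groups are nested, the three rates are ordered from the strictest to the most lenient criterion, so a method that does well on the first rate also does well on the others.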
V. CONCLUSION
Our novel regularized maximum mean discrepancy (RMMD) is a kernel-based test statistic generalizing the MMD test. We proved that RMMD has higher power than MMD and KFDA; power consistency is obtained at a higher rate. Power control makes RMMD a good hypothesis test for multiple comparisons, especially for the crucial case of small sample sizes. In contrast to KFDA and MMD, the convergence of RMMD-based test statistics to the normal distribution under the null and alternative hypotheses yields fast and straightforward RMMD estimates. Experiments with gold-standard benchmarks (MNIST, Covertype and Flare-Solar datasets) and with EEG data yield state-of-the-art results.

VI. ACKNOWLEDGEMENT
This work was partially funded by the Nano-resolved Multi-scale Investigations of human tactile Sensations and tissue Engineered nanobio-Sensors (SINERGIA:CRSIKO 122697/1) grant, and the State representation in reward based learning from spiking neuron models to psychophysics (NaNoBio Touch:228844) grant.

REFERENCES
[1] Borgwardt, K., Gretton, A., Rasch, M., Kriegel, H. P., Schölkopf, B., Smola, A. J.: Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, vol. 22(14), pp. 49-57, (2006)
[2] Friedman, J., Rafsky, L.: Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, vol. 7, pp. 697-717, (1979)
[3] Fuglede, B., Topsøe, F.: Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the International Symposium on Information Theory (ISIT), (2004)
[4] Danafar, S., Gretton, A., Schmidhuber, J.: Characteristic kernels on structured domains excel in robotics and human action recognition. ECML: Machine Learning and Knowledge Discovery in Databases, pp. 264-279, LNCS, Springer, (2006)
[5] Erdogmus, D., Principe, J. C.: From linear adaptive filtering to nonlinear information processing. IEEE Signal Processing Magazine, vol. 23(6), pp. 14-23, (2006)
[6] Fukumizu, K., Bach, F. R., Jordan, M. I.: Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, vol. 5, pp. 73-99, (2004)
[7] Fukumizu, K., Bach, F. R., Gretton, A.: Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, vol. 8, pp. 361-383, (2007)
[8] Fukumizu, K., Sriperumbudur, B., Gretton, A., Schölkopf, B.: Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems. Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.), vol. 21, pp. 473-480, Curran, Red Hook, NY, USA, (2009)