A Kernel Test for Three-Variable Interactions
Dino Sejdinovic, Arthur Gretton
Gatsby Unit, CSML, UCL, UK
{dino.sejdinovic, arthur.gretton}@gmail.com

Wicher Bergsma
Department of Statistics, LSE, UK
[email protected]
Abstract
We introduce kernel nonparametric tests for Lancaster three-variable interaction and for total independence, using embeddings of signed measures into a reproducing kernel Hilbert space. The resulting test statistics are straightforward to compute, and are used in powerful interaction tests, which are consistent against all alternatives for a large family of reproducing kernels. We show the Lancaster test to be sensitive to cases where two independent causes individually have weak influence on a third dependent variable, but their combined effect has a strong influence. This makes the Lancaster test especially suited to finding structure in directed graphical models, where it outperforms competing nonparametric tests in detecting such V-structures.
1 Introduction

The problem of nonparametric testing of interaction between variables has been widely treated in the machine learning and statistics literature. Much of the work in this area focuses on measuring or testing pairwise interaction: for instance, the Hilbert-Schmidt Independence Criterion (HSIC) or Distance Covariance [1, 2, 3], kernel canonical correlation [4, 5, 6], and mutual information [7]. In cases where more than two variables interact, however, the questions we can ask about their interaction become significantly more involved. The simplest case we might consider is whether the variables are mutually independent, P_X = ∏_{i=1}^{d} P_{X_i}, as considered in R^d by [8]. This is already a more general question than pairwise independence, since pairwise independence does not imply total (mutual) independence, while the implication holds in the other direction. For example, if X and Y are i.i.d. uniform on {−1, 1}, then (X, Y, XY) is a pairwise independent but mutually dependent triplet [9]. Tests of total and pairwise independence are insufficient, however, since they do not rule out all third order factorizations of the joint distribution.

An important class of higher order interactions occurs when the simultaneous effect of two variables on a third may not be additive. In particular, it may be possible that X ⊥⊥ Z and Y ⊥⊥ Z, whereas ¬((X, Y) ⊥⊥ Z) (for example, neither adding sugar to coffee nor stirring the coffee individually have an effect on its sweetness, but the joint presence of the two does). In addition, study of three-variable interactions can elucidate certain switching mechanisms between positive and negative correlation of two genes' expressions, as controlled by a third gene [10]. The presence of such interactions is typically tested using some form of analysis of variance (ANOVA) model which includes additional interaction terms, such as products of individual variables.
Since each such additional term requires a new hypothesis test, this increases the risk that some hypothesis test will produce a false positive by chance. Therefore, a test that is able to directly detect the presence of any kind of higher-order interaction would be of broad interest in statistical modeling. In the present work, we provide to our knowledge the first nonparametric test for three-variable interaction. This work generalizes the HSIC test of pairwise independence, and has as its test statistic the norm of an embedding of an appropriate signed measure to a reproducing kernel Hilbert space (RKHS). When the statistic is non-zero, all third order factorizations can be ruled out. Moreover, this test is applicable to the cases where X, Y and Z are themselves multivariate objects, and may take values in non-Euclidean or structured domains.

One important application of interaction measures is in learning structure for graphical models. If the graphical model is assumed to be Gaussian, then second order interaction statistics may be used to construct an undirected graph [11, 12]. When the interactions are non-Gaussian, however, other approaches are brought to bear. An alternative approach to structure learning is to employ conditional independence tests. In the PC algorithm [13, 14, 15], a V-structure (two independent variables with directed edges towards a third variable) is detected when an independence test between the parent variables accepts the null hypothesis, while a test of dependence of the parents conditioned on the child rejects the null hypothesis. The PC algorithm gives a correct equivalence class of structures subject to the causal Markov and faithfulness assumptions, in the absence of hidden common causes. The original implementations of the PC algorithm rely on partial correlations for testing, and assume Gaussianity.
A number of algorithms have since extended the basic PC algorithm to arbitrary probability distributions over multivariate random variables [16, 17, 18], by using nonparametric kernel independence tests [19] and conditional dependence tests [20, 18]. We observe that our Lancaster interaction based test provides a strong alternative to the conditional dependence testing approach, and is seen to outperform earlier approaches in detecting cases where independent parent variables weakly influence the child variable when considered individually, but have a strong combined influence.

We begin our presentation in Section 2 with a definition of interaction measures, these being the signed measures we will embed in an RKHS. We cover this embedding procedure in Section 3. We then proceed in Section 4 to define pairwise and three-way interactions. We describe a statistic to test mutual independence for more than three variables, and provide a brief overview of the more complex higher-order interactions that may be observed when four or more variables are considered. Finally, we provide experimental benchmarks in Section 5. Matlab code for the interaction tests considered in the paper is available at
2 Interaction Measures

An interaction measure [21, 22] associated to a multidimensional probability distribution P of a random vector (X_1, ..., X_D) taking values in the product space X_1 × ··· × X_D is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions. For the case D = 2, the correct interaction measure coincides with the notion introduced by Lancaster [21] as a formal product

  ∆_L P = ∏_{i=1}^{D} ( P*_{X_i} − P_{X_i} ),   (1)

where ∏_{j=1}^{D′} P*_{X_{i_j}} is understood as a joint probability distribution of a subvector (X_{i_1}, ..., X_{i_{D′}}). We will term the signed measure in (1) the Lancaster interaction measure. In the case of a bivariate distribution, the Lancaster interaction measure is simply the difference between the joint probability distribution and the product of the marginal distributions (the only possible non-trivial factorization for D = 2), ∆_L P = P_{XY} − P_X P_Y, while in the case D = 3, we obtain

  ∆_L P = P_{XYZ} − P_{XY} P_Z − P_{YZ} P_X − P_{XZ} P_Y + 2 P_X P_Y P_Z.   (2)

It is readily checked that

  (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X ⇒ ∆_L P = 0.   (3)

For D > 3, however, (1) does not capture all possible factorizations of the joint distribution: e.g., for D = 4, it need not vanish if (X_1, X_2) ⊥⊥ (X_3, X_4), but X_1 and X_2 are dependent and X_3 and X_4 are dependent. (As the reader might imagine, the situation becomes more complex again when four or more variables interact simultaneously; we provide a brief technical overview in Section 4.3.) Streitberg [22] corrected this definition using a more complicated construction with the Möbius function on the lattice of partitions, which we describe in Section 4.3.
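The implication (3) can be checked numerically on a small discrete distribution. The sketch below (our own NumPy illustration, not the authors' code) builds a joint pmf on {0,1}³ with (X, Y) ⊥⊥ Z and evaluates the Lancaster measure (2), which then vanishes identically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random joint pmf on {0,1}^3 with (X,Y) jointly independent of Z:
# P_XYZ(x,y,z) = P_XY(x,y) * P_Z(z), a non-trivial factorization.
P_XY = rng.random((2, 2)); P_XY /= P_XY.sum()
P_Z = rng.random(2); P_Z /= P_Z.sum()
P = P_XY[:, :, None] * P_Z[None, None, :]

# Marginals appearing in the Lancaster measure (2).
P_X, P_Y = P.sum(axis=(1, 2)), P.sum(axis=(0, 2))
P_YZ, P_XZ = P.sum(axis=0), P.sum(axis=1)

# Delta_L P = P_XYZ - P_XY P_Z - P_XZ P_Y - P_YZ P_X + 2 P_X P_Y P_Z
delta_L = (P
           - P_XY[:, :, None] * P_Z[None, None, :]
           - P_XZ[:, None, :] * P_Y[None, :, None]
           - P_YZ[None, :, :] * P_X[:, None, None]
           + 2 * np.einsum('i,j,k->ijk', P_X, P_Y, P_Z))

print(np.abs(delta_L).max() < 1e-12)   # True: the measure vanishes under (3)
```

The same computation with a non-factorizing P would generally yield a nonzero ∆_L P.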
In this work, however, we will focus on the case of three variables and formulate interaction tests based on embedding of (2) into an RKHS.

The implication (3) states that the presence of Lancaster interaction rules out the possibility of any factorization of the joint distribution, but the converse is not generally true; see Appendix C for details. In addition, it is important to note the distinction between the absence of Lancaster interaction and the total (mutual) independence of (X, Y, Z), i.e., P_{XYZ} = P_X P_Y P_Z. While total independence implies the absence of Lancaster interaction, the signed measure ∆_{tot} P = P_{XYZ} − P_X P_Y P_Z associated to the total (mutual) independence of (X, Y, Z) does not vanish if, e.g., (X, Y) ⊥⊥ Z, but X and Y are dependent.

In this contribution, we construct the nonparametric test for the hypothesis ∆_L P = 0 (no Lancaster interaction), as well as the nonparametric test for the hypothesis ∆_{tot} P = 0 (total independence), based on the embeddings of the corresponding signed measures ∆_L P and ∆_{tot} P into an RKHS. Both tests are particularly suited to the cases where X, Y and Z take values in a high-dimensional space, and, moreover, they remain valid for a variety of non-Euclidean and structured domains, i.e., for all topological spaces where it is possible to construct a valid positive definite function; see [23] for details. In the case of total independence testing, our approach can be viewed as a generalization of the tests proposed in [24] based on the empirical characteristic functions.

3 Kernel Embeddings

We review the embedding of signed measures to a reproducing kernel Hilbert space. The RKHS norms of such embeddings will then serve as our test statistics. Let Z be a topological space. According to the Moore-Aronszajn theorem [25, p.
19], for every symmetric, positive definite function (henceforth kernel) k : Z × Z → R, there is an associated reproducing kernel Hilbert space (RKHS) H_k of real-valued functions on Z with reproducing kernel k. The map φ : Z → H_k, φ : z ↦ k(·, z) is called the canonical feature map or the Aronszajn map of k. Denote by M(Z) the Banach space of all finite signed Borel measures on Z. The notion of a feature map can then be extended to kernel embeddings of elements of M(Z) [25, Chapter 4].

Definition 1 (Kernel embedding). Let k be a kernel on Z, and ν ∈ M(Z). The kernel embedding of ν into the RKHS H_k is µ_k(ν) ∈ H_k such that ∫ f(z) dν(z) = ⟨f, µ_k(ν)⟩_{H_k} for all f ∈ H_k.

Alternatively, the kernel embedding can be defined by the Bochner integral µ_k(ν) = ∫ k(·, z) dν(z). If a measurable kernel k is a bounded function, it is straightforward to show using the Riesz representation theorem that µ_k(ν) exists for all ν ∈ M(Z). For many interesting bounded kernels k, including the Gaussian, Laplacian and inverse multiquadratics, the embedding µ_k : M(Z) → H_k is injective. Such kernels are said to be integrally strictly positive definite (ISPD) [27, p. 4]. A related but weaker notion is that of a characteristic kernel [20, 28], which requires the kernel embedding to be injective only on the set of probability measures. In the case that k is ISPD, since H_k is a Hilbert space, we can introduce a notion of an inner product between two signed measures ν, ν′ ∈ M(Z),

  ⟨⟨ν, ν′⟩⟩_k := ⟨µ_k(ν), µ_k(ν′)⟩_{H_k} = ∫ k(z, z′) dν(z) dν′(z′).

Since µ_k is injective, this is a valid inner product and induces a norm on M(Z), for which ‖ν‖_k = ⟨⟨ν, ν⟩⟩_k^{1/2} = 0 if and only if ν = 0. This fact has been used extensively in the literature to formulate: (a) a nonparametric two-sample test based on estimation of the maximum mean discrepancy ‖P − Q‖_k, for samples {X_i}_{i=1}^n i.i.d.
∼ P, {Y_i}_{i=1}^m i.i.d. ∼ Q [29], and (b) a nonparametric independence test based on estimation of ‖P_{XY} − P_X P_Y‖_{k⊗l}, for a joint sample {(X_i, Y_i)}_{i=1}^n i.i.d. ∼ P_{XY} [19] (the latter is also called the Hilbert-Schmidt independence criterion), with kernel k ⊗ l on the product space defined as k(x, x′) l(y, y′). When a bounded characteristic kernel is used, the above tests are consistent against all alternatives, and their alternative interpretation is as a generalization [3, 26] of energy distance [30, 31] and distance covariance [2, 32].

(Unbounded kernels can also be considered, however [26]. In this case, one can still study embeddings of the signed measures M_k^{1/2}(Z) ⊂ M(Z) which satisfy a finite moment condition, i.e., M_k^{1/2}(Z) = { ν ∈ M(Z) : ∫ k^{1/2}(z, z) d|ν|(z) < ∞ }.)

Table 1: V-statistic estimates of ⟨⟨ν, ν′⟩⟩_{k⊗l} in the two-variable case

  ν \ ν′      P_XY                   P_X P_Y
  P_XY        (1/n²)(K∘L)_{++}       (1/n³)(KL)_{++}
  P_X P_Y                            (1/n⁴) K_{++} L_{++}

In this article, we extend this approach to the three-variable case, and formulate tests for both the Lancaster interaction and for the total independence, using simple consistent estimators of ‖∆_L P‖²_{k⊗l⊗m} and ‖∆_{tot} P‖²_{k⊗l⊗m} respectively, which we describe in the next Section. Using the same arguments as in the tests of [29, 19], these tests are also consistent against all alternatives as long as ISPD kernels are used.

Notational remarks:
Throughout the paper, ∘ denotes the Hadamard (entrywise) product. Let A be an n × n matrix, and K a symmetric n × n matrix. We will fix the following notational conventions: 1 denotes an n × 1 column of ones; A_{+j} = Σ_{i=1}^n A_{ij} denotes the sum of all elements of the j-th column of A; A_{i+} = Σ_{j=1}^n A_{ij} denotes the sum of all elements of the i-th row of A; A_{++} = Σ_{i=1}^n Σ_{j=1}^n A_{ij} denotes the sum of all elements of A; K_+ = 1 1ᵀ K, i.e., [K_+]_{ij} = K_{+j} = K_{j+}, and [K_+ᵀ]_{ij} = K_{i+} = K_{+i}.

We provide a short overview of the kernel independence test of [19], which we write as the RKHS norm of the embedding of a signed measure. While this material is not new (it appears in [29, Section 7.4]), it will help define how to proceed when a third variable is introduced, and the signed measures become more involved. We begin by expanding the squared RKHS norm ‖P_{XY} − P_X P_Y‖²_{k⊗l} as inner products, and applying the reproducing property,

  ‖P_{XY} − P_X P_Y‖²_{k⊗l} = E_{XY} E_{X′Y′} k(X, X′) l(Y, Y′) + E_X E_{X′} k(X, X′) E_Y E_{Y′} l(Y, Y′) − 2 E_{X′Y′}[ E_X k(X, X′) E_Y l(Y, Y′) ],   (4)

where (X, Y) and (X′, Y′) are independent copies of random variables on X × Y with distribution P_{XY}.

Given a joint sample {(X_i, Y_i)}_{i=1}^n i.i.d. ∼ P_{XY}, an empirical estimator of ‖P_{XY} − P_X P_Y‖²_{k⊗l} is obtained by substituting corresponding empirical means into (4), which can be represented using Gram matrices K and L (K_{ij} = k(X_i, X_j), L_{ij} = l(Y_i, Y_j)):

  Ê_{XY} Ê_{X′Y′} k(X, X′) l(Y, Y′) = (1/n²) Σ_{a=1}^n Σ_{b=1}^n K_{ab} L_{ab} = (1/n²)(K∘L)_{++},

  Ê_X Ê_{X′} k(X, X′) Ê_Y Ê_{Y′} l(Y, Y′) = (1/n⁴) Σ_{a=1}^n Σ_{b=1}^n Σ_{c=1}^n Σ_{d=1}^n K_{ab} L_{cd} = (1/n⁴) K_{++} L_{++},

  Ê_{X′Y′}[ Ê_X k(X, X′) Ê_Y l(Y, Y′) ] = (1/n³) Σ_{a=1}^n Σ_{b=1}^n Σ_{c=1}^n K_{ac} L_{bc} = (1/n³)(KL)_{++}.

Since these are V-statistics, there is a bias of O_P(n⁻¹); U-statistics may be used if an unbiased estimate is needed.
Each of the terms above corresponds to an estimate of an inner product ⟨⟨ν, ν′⟩⟩_{k⊗l} for probability measures ν and ν′ taking values in {P_{XY}, P_X P_Y}, as summarized in Table 1. Even though the second and third terms involve triple and quadruple sums, each of the empirical means can be computed using sums of all terms of certain matrices, where the dominant computational cost is in computing the matrix product KL. In fact, the overall estimator can be computed in an even simpler form (see Proposition 9 in Appendix F), as

  ‖P̂_{XY} − P̂_X P̂_Y‖²_{k⊗l} = (1/n²)(K ∘ HLH)_{++},

where H = I − (1/n) 1 1ᵀ is the centering matrix. Note that by the idempotence of H, we also have that (K ∘ HLH)_{++} = (HKH ∘ HLH)_{++}. In the rest of the paper, for any Gram matrix K, we will denote its corresponding centered matrix HKH by K̃.

Table 2: V-statistic estimates of ⟨⟨ν, ν′⟩⟩_{k⊗l⊗m} in the three-variable case

  ν \ ν′           nP̂_XYZ           n²P̂_XY P̂_Z      n²P̂_XZ P̂_Y      n²P̂_YZ P̂_X      n³P̂_X P̂_Y P̂_Z
  nP̂_XYZ          (K∘L∘M)_{++}      ((K∘L)M)_{++}     ((K∘M)L)_{++}     ((M∘L)K)_{++}     tr(K_+∘L_+∘M_+)
  n²P̂_XY P̂_Z                       (K∘L)_{++} M_{++} (MKL)_{++}        (KLM)_{++}        (KL)_{++} M_{++}
  n²P̂_XZ P̂_Y                                         (K∘M)_{++} L_{++} (KML)_{++}        (KM)_{++} L_{++}
  n²P̂_YZ P̂_X                                                           (L∘M)_{++} K_{++} (LM)_{++} K_{++}
  n³P̂_X P̂_Y P̂_Z                                                                         K_{++} L_{++} M_{++}

When three variables are present, a two-variable test already allows us to determine whether, for instance, (X, Y) ⊥⊥ Z, i.e., whether P_{XYZ} = P_{XY} P_Z. It is sufficient to treat (X, Y) as a single variable on the product space X × Y, with the product kernel k ⊗ l. Then, the Gram matrix associated to (X, Y) is simply K ∘ L, and the corresponding V-statistic is (1/n²)(K ∘ L ∘ M̃)_{++}. What is not obvious, however, is if a V-statistic for the Lancaster interaction (which can be thought of as a surrogate for the composite hypothesis of various factorizations) can be obtained in a similar form. We will address this question in the next section.
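Before moving to three variables, the two-variable estimator can be verified numerically: the term-by-term form read off from (4) must agree with the compact form (1/n²)(K ∘ HLH)_{++}. A minimal sketch (our own NumPy code, not the authors' Matlab implementation; the Gaussian kernel and unit bandwidth are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 1))
Y = X + 0.5 * rng.normal(size=(n, 1))   # dependent pair, so the statistic is nonzero

def gram(A, sigma=1.0):
    # Gaussian kernel Gram matrix (illustrative kernel/bandwidth choice)
    d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

K, L = gram(X), gram(Y)
H = np.eye(n) - np.ones((n, n)) / n     # centering matrix

# term-by-term estimator obtained from (4) ...
stat_terms = ((K * L).sum() / n**2
              + K.sum() * L.sum() / n**4
              - 2 * (K @ L).sum() / n**3)

# ... agrees with the compact form (1/n^2)(K ∘ HLH)_{++}
stat_compact = (K * (H @ L @ H)).sum() / n**2
print(np.isclose(stat_terms, stat_compact))   # True
```

The agreement follows by expanding HLH entrywise, exactly as in the derivation above.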
As in the two-variable case, it suffices to derive V-statistics for inner products ⟨⟨ν, ν′⟩⟩_{k⊗l⊗m}, where ν and ν′ take values in all possible combinations of the joint and the products of the marginals, i.e., P_{XYZ}, P_{XY} P_Z, etc. Again, it is easy to see that these can be expressed as certain expectations of kernel functions, and thereby can be calculated by an appropriate manipulation of the three Gram matrices. We summarize the resulting expressions in Table 2; their derivation is a tedious but straightforward linear algebra exercise. For compactness, the appropriate normalizing terms are moved inside the measures considered.

Based on the individual RKHS inner product estimators, we can now easily derive estimators for various signed measures arising as linear combinations of P_{XYZ}, P_{XY} P_Z, and so on. The first such measure is an "incomplete" Lancaster interaction measure ∆^{(Z)} P = P_{XYZ} + P_X P_Y P_Z − P_{YZ} P_X − P_{XZ} P_Y, which vanishes if (Y, Z) ⊥⊥ X or (X, Z) ⊥⊥ Y, but not necessarily if (X, Y) ⊥⊥ Z. We obtain the following result for the empirical measure P̂.

Proposition 2 (Incomplete Lancaster interaction). ‖∆^{(Z)} P̂‖²_{k⊗l⊗m} = (1/n²)(K̃ ∘ L̃ ∘ M)_{++}.

Analogous expressions hold for ∆^{(X)} P̂ and ∆^{(Y)} P̂. Unlike in the two-variable case, where either matrix or both can be centered, centering of each matrix in the three-variable case has a different meaning. In particular, one requires centering of all three kernel matrices to perform a "complete" Lancaster interaction test, as given by the following Proposition.

Proposition 3 (Lancaster interaction). ‖∆_L P̂‖²_{k⊗l⊗m} = (1/n²)(K̃ ∘ L̃ ∘ M̃)_{++}.

The proofs of these Propositions are given in Appendix A. We summarize the various hypotheses and the associated V-statistics in Appendix B.
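Proposition 3 yields a very short implementation: center each Gram matrix and sum the entries of their triple Hadamard product. A sketch (our own code; the kernel and bandwidth choices are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X, Y, Z = (rng.normal(size=(n, 1)) for _ in range(3))

def gram(A, sigma=1.0):
    # Gaussian kernel Gram matrix (illustrative kernel/bandwidth choice)
    d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

H = np.eye(n) - np.ones((n, n)) / n
Kt, Lt, Mt = (H @ gram(A) @ H for A in (X, Y, Z))   # centred Gram matrices K~, L~, M~

# Proposition 3: ||Delta_L P_hat||^2_{k⊗l⊗m} = (1/n^2) (K~ ∘ L~ ∘ M~)_{++}
lancaster_stat = (Kt * Lt * Mt).sum() / n**2
print(lancaster_stat >= 0)   # True: a squared RKHS norm, hence nonnegative
```

The cost is dominated by the matrix products in the centering, i.e., it is quadratic in n up to the Gram-matrix construction.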
As we will demonstrate in the experiments in Section 5, while particularly useful for testing the factorization hypothesis, i.e., for (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X, the statistic ‖∆_L P̂‖²_{k⊗l⊗m} can also be used for powerful tests of either the individual hypotheses (Y, Z) ⊥⊥ X, (X, Z) ⊥⊥ Y, or (X, Y) ⊥⊥ Z, or for total independence testing, i.e., P_{XYZ} = P_X P_Y P_Z, as it vanishes in all of these cases. (In general, however, this approach would require some care since, e.g., X and Y could be measured on very different scales, and the choice of kernels k and l needs to take this into account.) The null distribution under each of these hypotheses can be estimated using a standard permutation-based approach described in Appendix D.

Another way to obtain the Lancaster interaction statistic is as the RKHS norm of the joint "central moment" Σ_{XYZ} = E_{XYZ}[(k_X − µ_X) ⊗ (l_Y − µ_Y) ⊗ (m_Z − µ_Z)] of RKHS-valued random variables k_X, l_Y and m_Z (understood as an element of the tensor RKHS H_k ⊗ H_l ⊗ H_m). This is related to a classical characterization of the Lancaster interaction [21, Ch. XII]: there is no Lancaster interaction between X, Y and Z if and only if cov[f(X), g(Y), h(Z)] = 0 for all L² functions f, g and h. There is an analogous result in our case (proof is given in Appendix A), which states:

Proposition 4. ‖∆_L P‖_{k⊗l⊗m} = 0 if and only if cov[f(X), g(Y), h(Z)] = 0 for all f ∈ H_k, g ∈ H_l, h ∈ H_m.

And finally, we give an estimator of the RKHS norm of the total independence measure ∆_{tot} P.

Proposition 5 (Total independence). Let ∆_{tot} P̂ = P̂_{XYZ} − P̂_X P̂_Y P̂_Z. Then:

  ‖∆_{tot} P̂‖²_{k⊗l⊗m} = (1/n²)(K ∘ L ∘ M)_{++} − (2/n⁴) tr(K_+ ∘ L_+ ∘ M_+) + (1/n⁶) K_{++} L_{++} M_{++}.

The proof follows simply from reading off the corresponding inner-product V-statistics from Table 2.
While the test statistic for total independence has a somewhat more complicated form than that of the Lancaster interaction, it can also be computed in quadratic time.
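Proposition 5 likewise needs only Gram-matrix sums. The sketch below (our own code) computes the statistic and a permutation p-value; the resampling scheme shown (independently permuting the Y and Z samples to simulate the total independence null) is our assumed form of the procedure in Appendix D, not necessarily the authors' exact scheme:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
X, Y, Z = (rng.normal(size=(n, 1)) for _ in range(3))

def gram(A, sigma=1.0):
    # Gaussian kernel Gram matrix (illustrative kernel/bandwidth choice)
    d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def total_stat(K, L, M):
    # Prop. 5: (1/n^2)(K∘L∘M)_++ - (2/n^4) tr(K_+∘L_+∘M_+) + (1/n^6) K_++ L_++ M_++
    n = K.shape[0]
    row = lambda G: G.sum(axis=1)          # vector of row sums G_{a+}
    return ((K * L * M).sum() / n**2
            - 2 / n**4 * (row(K) * row(L) * row(M)).sum()
            + K.sum() * L.sum() * M.sum() / n**6)

K, L, M = gram(X), gram(Y), gram(Z)
observed = total_stat(K, L, M)

# Permutation null (assumed scheme): independently permute the Y and Z samples.
perm_stats = []
for _ in range(200):
    p, q = rng.permutation(n), rng.permutation(n)
    perm_stats.append(total_stat(K, L[np.ix_(p, p)], M[np.ix_(q, q)]))
pvalue = (1 + sum(s >= observed for s in perm_stats)) / (1 + len(perm_stats))
print(0 < pvalue <= 1)   # True
```

For the independent data simulated here, the p-value should be roughly uniform over repeated draws.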
4.3 Interaction for D > 3

Streitberg's correction of the interaction measure for D > 3 has the form

  ∆_S P = Σ_π (−1)^{|π|−1} (|π| − 1)! J_π P,   (5)

where the sum is taken over all partitions π of the set {1, 2, ..., n}, |π| denotes the size of the partition (number of blocks), and J_π : P ↦ P_π is the partition operator on probability measures, which for a fixed partition π = π_1 | π_2 | ··· | π_r maps the probability measure P to the product measure P_π = ∏_{j=1}^r P_{π_j}, where P_{π_j} is the marginal distribution of the subvector (X_i : i ∈ π_j). The coefficients correspond to the Möbius inversion on the partition lattice [33]. While the Lancaster interaction has an interpretation in terms of joint central moments, Streitberg's correction corresponds to joint cumulants [22, Section 4]. Therefore, a central moment expression like E_{X_1...X_n}[ (k^{(1)}_{X_1} − µ_{X_1}) ⊗ ··· ⊗ (k^{(n)}_{X_n} − µ_{X_n}) ] does not capture the correct notion of the interaction measure. Thus, while one can in principle construct RKHS embeddings of higher-order interaction measures, and compute RKHS norms using a calculus of V-statistics and Gram matrices analogous to that of Table 2, it does not seem possible to avoid summing over all partitions when computing the corresponding statistics, yielding a computationally prohibitive approach in general. This can be viewed by analogy with the scalar case, where it is well known that the second and third cumulants coincide with the second and third central moments, whereas the higher order cumulants are neither moments nor central moments, but some other polynomials of the moments.

Total independence for D > 3. In general, the test statistic for total independence in the D-variable case is

  ‖P̂_{X_{1:D}} − ∏_{i=1}^D P̂_{X_i}‖²_{⊗_{i=1}^D k^{(i)}} = (1/n²) Σ_{a=1}^n Σ_{b=1}^n ∏_{i=1}^D K^{(i)}_{ab} − (2/n^{D+1}) Σ_{a=1}^n ∏_{i=1}^D Σ_{b=1}^n K^{(i)}_{ab} + (1/n^{2D}) ∏_{i=1}^D Σ_{a=1}^n Σ_{b=1}^n K^{(i)}_{ab}.
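The D-variable total independence statistic above is straightforward to implement for any list of Gram matrices; a sketch (our own code, with the exponents read directly off the displayed formula):

```python
import numpy as np

def total_independence_vstat(grams):
    # grams: list of D symmetric n x n Gram matrices K^(1), ..., K^(D)
    n = grams[0].shape[0]
    D = len(grams)
    term1 = np.prod(grams, axis=0).sum() / n**2                    # (1/n^2) sum_ab prod_i K^(i)_ab
    term2 = np.prod([G.sum(axis=1) for G in grams], axis=0).sum()  # sum_a prod_i K^(i)_{a+}
    term3 = np.prod([G.sum() for G in grams])                      # prod_i K^(i)_{++}
    return term1 - 2 * term2 / n**(D + 1) + term3 / n**(2 * D)
```

As a sanity check, D = 1 gives exactly zero (a measure minus itself), and D = 2 recovers the two-variable statistic (1/n²)(K ∘ HLH)_{++}.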
Figure 1: Two-variable kernel independence tests and the test for (X, Y) ⊥⊥ Z using the Lancaster statistic (null acceptance rate / Type II error vs. dimension; marginal independence tests on Datasets A and B).

Figure 2: Total independence: ∆_{tot} P̂ vs. ∆_L P̂ (null acceptance rate / Type II error vs. dimension, Datasets A and B).

A similar statistic for total independence is discussed by [24], where testing of total independence based on empirical characteristic functions is considered. Our test has a direct interpretation in terms of characteristic functions as well, which is straightforward to see in the case of translation invariant kernels on Euclidean spaces, using their Bochner representation, similarly as in [28, Corollary 4].

5 Experiments

We investigate the performance of various permutation based tests that use the Lancaster statistic ‖∆_L P̂‖²_{k⊗l⊗m} and the total independence statistic ‖∆_{tot} P̂‖²_{k⊗l⊗m} on two synthetic datasets where X, Y and Z are random vectors of increasing dimensionality:

Dataset A: Pairwise independent, mutually dependent data.
Our first dataset is a triplet of random vectors (X, Y, Z) on R^p × R^p × R^p, with X, Y i.i.d. ∼ N(0, I_p), W ∼ Exp(1/√2), Z_1 = sign(X_1 Y_1) W, and Z_{2:p} ∼ N(0, I_{p−1}), i.e., the product X_1 Y_1 determines the sign of Z_1, while the remaining p − 1 dimensions are independent (and serve as noise in this example). In this case, (X, Y, Z) is clearly a pairwise independent but mutually dependent triplet. The mutual dependence becomes increasingly difficult to detect as the dimensionality p increases. Note that there is no reason for X, Y and Z to have the same dimensionality p; this is done for simplicity of exposition.

Figure 3: Factorization hypothesis: Lancaster statistic vs. a two-variable based test; test for X ⊥⊥ Y | Z from [18] (null acceptance rate / Type II error vs. dimension; V-structure discovery on Datasets A and B).

Dataset B: Joint dependence can be easier to detect.
In this example, we consider a triplet of random vectors (X, Y, Z) on R^p × R^p × R^p, with X, Y i.i.d. ∼ N(0, I_p), Z_{2:p} ∼ N(0, I_{p−1}), and

  Z_1 = X_1² + ε, w.p. 1/3;  Y_1² + ε, w.p. 1/3;  X_1 Y_1 + ε, w.p. 1/3,

where ε ∼ N(0, 0.1²). Thus, the dependence of Z on the pair (X, Y) is stronger than on X and Y individually.

In all cases, we use permutation tests as described in Appendix D. The test level is set to α = 0.05, and we use Gaussian kernels with bandwidth set to the interpoint median distance. In Figure 1, we plot the null hypothesis acceptance rates of the standard kernel two-variable tests for X ⊥⊥ Y (which is true for both datasets A and B, and accepted at the correct rate across all dimensions) and for X ⊥⊥ Z (which is true only for dataset A), as well as of the standard kernel two-variable test for (X, Y) ⊥⊥ Z, and the test for (X, Y) ⊥⊥ Z using the Lancaster statistic. As expected, in dataset B, we see that the dependence of Z on the pair (X, Y) is somewhat easier to detect than on X individually with two-variable tests. In both datasets, however, the Lancaster interaction appears significantly more sensitive in detecting this dependence as the dimensionality p increases. Figure 2 plots the Type II error of total independence tests with statistics ‖∆_L P̂‖²_{k⊗l⊗m} and ‖∆_{tot} P̂‖²_{k⊗l⊗m}. The Lancaster statistic outperforms the total independence statistic everywhere apart from Dataset B when the number of dimensions is small (between 1 and 5). Figure 3 plots the Type II error of the factorization test, i.e., the test for (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X with the Lancaster statistic with Holm-Bonferroni correction as described in Appendix D, as well as the two-variable based test (which performs three standard two-variable tests and applies the Holm-Bonferroni correction). We also plot the Type II error for the conditional independence test for X ⊥⊥ Y | Z from [18].
Under the assumption that X ⊥⊥ Y (correct on both datasets), the negation of each of these three hypotheses is equivalent to the presence of the V-structure X → Z ← Y, so the rejection of the null can be viewed as a V-structure detection procedure. As dimensionality increases, the Lancaster statistic appears significantly more sensitive to the interactions present than the competing approaches, which is particularly pronounced in Dataset A.

6 Conclusion

We have constructed permutation-based nonparametric tests for three-variable interactions, including the Lancaster interaction and total independence. The tests can be used in datasets where only higher-order interactions persist, i.e., variables are pairwise independent; as well as in cases where joint dependence may be easier to detect than pairwise dependence, for instance when the effect of two variables on a third is not additive. The flexibility of the framework of RKHS embeddings of signed measures allows us to consider variables that are themselves multidimensional. While the total independence case readily generalizes to more than three dimensions, the combinatorial nature of joint cumulants implies that detecting interactions of higher order requires significantly more costly computation, and is an interesting topic for future work.
References

[1] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, pages 63–78, 2005.
[2] G. Székely, M. Rizzo, and N.K. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.
[3] D. Sejdinovic, A. Gretton, B. Sriperumbudur, and K. Fukumizu. Hypothesis testing using pairwise distances and associated kernels. In ICML, 2012.
[4] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, 2002.
[5] K. Fukumizu, F. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. J. Mach. Learn. Res., 8:361–383, 2007.
[6] J. Dauxois and G. M. Nkiet. Nonlinear canonical analysis and independence tests. Ann. Stat., 26(4):1254–1278, 1998.
[7] D. Pal, B. Poczos, and Cs. Szepesvari. Estimation of Renyi entropy and mutual information based on generalized nearest-neighbor graphs. In NIPS 23, 2010.
[8] A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995.
[9] S. Bernstein. The Theory of Probabilities. Gastehizdat Publishing House, Moscow, 1946.
[10] M. Kayano, I. Takigawa, M. Shiga, K. Tsuda, and H. Mamitsuka. Efficiently finding genome-wide three-way gene interactions from transcript- and genotype-data. Bioinformatics, 25(21):2735–2743, 2009.
[11] N. Meinshausen and P. Buhlmann. High dimensional graphs and variable selection with the lasso. Ann. Stat., 34(3):1436–1462, 2006.
[12] P. Ravikumar, M.J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat., 4:935–980, 2011.
[13] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2001.
[14] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. 2nd edition, 2000.
[15] M. Kalisch and P. Buhlmann. Estimating high-dimensional directed acyclic graphs with the PC algorithm. J. Mach. Learn. Res., 8:613–636, 2007.
[16] X. Sun, D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. In ICML, pages 855–862, 2007.
[17] R. Tillman, A. Gretton, and P. Spirtes. Nonlinear directed acyclic structure learning with weakly additive noise models. In NIPS 22, 2009.
[18] K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. In UAI, pages 804–813, 2011.
[19] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In NIPS 20, pages 585–592, Cambridge, MA, 2008. MIT Press.
[20] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In NIPS 20, pages 489–496, 2008.
[21] H.O. Lancaster. The Chi-Squared Distribution. Wiley, London, 1969.
[22] B. Streitberg. Lancaster interactions revisited. Ann. Stat., 18(4):1878–1885, 1990.
[23] K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups and semigroups. In NIPS 21, pages 473–480, 2009.
[24] A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995.
[25] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, 2004.
[26] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. arXiv:1207.6076, 2012.
[27] B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res., 12:2389–2410, 2011.
[28] B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11:1517–1561, 2010.
[29] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, 2012.
[30] G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, (5), November 2004.
[31] L. Baringhaus and C. Franz. On a new multivariate two-sample test. J. Multivariate Anal., 88(1):190–206, 2004.
[32] G. Székely and M. Rizzo. Brownian distance covariance. Ann. Appl. Stat., 4(3):1233–1303, 2009.
[33] T.P. Speed. Cumulants and partition lattices. Austral. J. Statist., 25:378–388, 1983.
[34] S. Holm. A simple sequentially rejective multiple test procedure. Scand. J. Statist., 6(2):65–70, 1979.
[35] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur. A fast, consistent kernel two-sample test. In NIPS 22, Red Hook, NY, 2009. Curran Associates Inc.
A Proofs
A.1 Proof of Proposition 2
Some basic matrix algebra used in this proof is reviewed in Appendix F. The proof of the following simple Lemma directly follows from the results therein.
Lemma 6.
The following equalities hold for symmetric $n \times n$ matrices $K$, $L$, $M$:

1. $(K_+ \circ L_+ \circ M)_{++} = (K_+^\top \circ L_+^\top \circ M)_{++} = \mathrm{tr}(K_+ \circ L_+ \circ M_+) = \sum_{a=1}^n K_{a+} L_{a+} M_{a+}$,
2. $(K_+ \circ L \circ M_+^\top)_{++} = (KLM)_{++}$.

Now, we take a kernel matrix $M$ and consider its Hadamard product with $\tilde K \circ \tilde L$:

$$
\begin{aligned}
\tilde K \circ \tilde L \circ M
={}& K \circ L \circ M
- \frac{1}{n}\Big[\underbrace{K \circ L_+ \circ M}_{A} + \underbrace{K \circ L_+^\top \circ M}_{A^\top} + \underbrace{K_+ \circ L \circ M}_{B} + \underbrace{K_+^\top \circ L \circ M}_{B^\top}\Big] \\
&+ \frac{1}{n^2}\left(K_{++}\, L \circ M + L_{++}\, K \circ M\right)
+ \frac{1}{n^2}\Big[\underbrace{K_+ \circ L_+ \circ M}_{C} + \underbrace{K_+^\top \circ L_+^\top \circ M}_{C^\top} + \underbrace{K_+ \circ L_+^\top \circ M}_{D} + \underbrace{K_+^\top \circ L_+ \circ M}_{D^\top}\Big] \\
&- \frac{1}{n^3}K_{++}\left[L_+ \circ M + L_+^\top \circ M\right]
- \frac{1}{n^3}L_{++}\left[K_+ \circ M + K_+^\top \circ M\right]
+ \frac{1}{n^4}K_{++}L_{++}M.
\end{aligned}
$$

By Lemma 8, $A_{++} = ((K \circ M) \circ L_+)_{++} = ((K \circ M)L)_{++}$, and similarly $B_{++} = ((L \circ M)K)_{++}$. Also, $C_{++} = \mathrm{tr}(K_+ \circ L_+ \circ M_+)$ and $D_{++} = (LMK)_{++}$, and each transposed term has the same total sum as its counterpart. Thus:

$$
\begin{aligned}
\left(\tilde K \circ \tilde L \circ M\right)_{++}
={}& (K \circ L \circ M)_{++}
- \frac{2}{n}\left[((K \circ M)L)_{++} + ((L \circ M)K)_{++}\right]
+ \frac{1}{n^2}\left[K_{++}(L \circ M)_{++} + L_{++}(K \circ M)_{++}\right] \\
&+ \frac{2}{n^2}\left[\mathrm{tr}(K_+ \circ L_+ \circ M_+) + (LMK)_{++}\right]
- \frac{2}{n^3}\left[K_{++}(LM)_{++} + L_{++}(KM)_{++}\right]
+ \frac{1}{n^4}K_{++}L_{++}M_{++}.
\end{aligned}
$$

By comparing to the table of V-statistics (Table 3), we obtain that

$$\frac{1}{n^2}\left(\tilde K \circ \tilde L \circ M\right)_{++} = \left\|\Delta^{(Z)}\hat P\right\|^2_{k\otimes l\otimes m},$$

where $\Delta^{(Z)}\hat P = \hat P_{XYZ} + \hat P_X\hat P_Y\hat P_Z - \hat P_{YZ}\hat P_X - \hat P_{XZ}\hat P_Y$, which completes the proof of Proposition 2. Proposition 3 can be proved in an analogous way by including the additional terms corresponding to the centering of $M$, i.e., those involving $(\tilde K \circ \tilde L \circ M_+)_{++}$ and $M_{++}(\tilde K \circ \tilde L)_{++}$. In the next section, however, we give an alternative proof, which gives more insight into the role that the centering of each Gram matrix plays.

A.2 Proof of Proposition 3
It will be useful to introduce the kernel centered at a probability measure $\nu$, given by

$$\tilde k_\nu(z,z') := k(z,z') + \iint k(w,w')\,d\nu(w)\,d\nu(w') - \int \left[k(z,w) + k(z',w)\right] d\nu(w). \qquad (6)$$

Note that $\iint \tilde k_\nu(z,z')\,d\nu(z)\,d\nu(z') = 0$, i.e., $\mu_{\tilde k_\nu}(\nu) \equiv 0$. By expanding the population expression of the kernel norm of the joint under the kernels centered at the marginals, we obtain

$$\|P_{XYZ}\|^2_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}} = \iint \left[\tilde k_{P_X}(x,x')\,\tilde l_{P_Y}(y,y')\,\tilde m_{P_Z}(z,z')\right] dP_{XYZ}(x,y,z)\,dP_{XYZ}(x',y',z').$$

Substituting the definition of the centered kernel in (6), it is readily obtained that $\|P_{XYZ}\|^2_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}} = \|\Delta_L P\|^2_{k\otimes l\otimes m}$. Now, $\|P_{XYZ}\|^2_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}}$ is the first term in the expansion of $\|\Delta_L P\|^2_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}}$. Let us show that all the other terms are equal to zero. Indeed, all the other terms are of the form $\langle\langle P_W Q, Q'\rangle\rangle_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}}$, where $W$ is an individual variable $X$, $Y$, or $Z$. Without loss of generality, let $W = X$. Then

$$
\begin{aligned}
\langle\langle P_X Q, Q'\rangle\rangle_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}}
&= \iiint \tilde k_{P_X}(x,x')\,\tilde l_{P_Y}(y,y')\,\tilde m_{P_Z}(z,z')\,dP_X(x)\,dQ(y,z)\,dQ'(x',y',z') \\
&= \iint \underbrace{\int \tilde k_{P_X}(x,x')\,dP_X(x)}_{=\left[\mu_{\tilde k_{P_X}}(P_X)\right](x')\,=\,0}\ \tilde l_{P_Y}(y,y')\,\tilde m_{P_Z}(z,z')\,dQ(y,z)\,dQ'(x',y',z') = 0.
\end{aligned}
$$

Therefore,

$$\|\Delta_L P\|^2_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}} = \|P_{XYZ}\|^2_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}} = \|\Delta_L P\|^2_{k\otimes l\otimes m}.$$

This holds for any joint distribution $P_{XYZ}$, and in particular for the empirical joint, whereby

$$\left\|\Delta_L \hat P\right\|^2_{k\otimes l\otimes m} = \left\|\hat P_{XYZ}\right\|^2_{\tilde k_{\hat P_X}\otimes\tilde l_{\hat P_Y}\otimes\tilde m_{\hat P_Z}} = \frac{1}{n^2}\left(\tilde K \circ \tilde L \circ \tilde M\right)_{++}.$$

A.3 Proof of Proposition 4
Consider the element of $\mathcal H_k \otimes \mathcal H_l \otimes \mathcal H_m$ given by $\mathbb E_{XYZ}\left[k(\cdot,X) \otimes l(\cdot,Y) \otimes m(\cdot,Z)\right]$. This can be identified with a Hilbert–Schmidt uncentered covariance operator $C_{(XY)Z} : \mathcal H_k \otimes \mathcal H_l \to \mathcal H_m$, such that for all $f \in \mathcal H_k$, $g \in \mathcal H_l$, $h \in \mathcal H_m$:

$$\left\langle C_{(XY)Z}[f \otimes g], h\right\rangle_{\mathcal H_m} = \mathbb E_{XYZ}\left[f(X)g(Y)h(Z)\right].$$

By replacing $k$, $l$, $m$ with kernels centered at the marginals, we obtain a centered covariance operator $\Sigma_{(XY)Z}$, for which

$$\left\langle \Sigma_{(XY)Z}[f \otimes g], h\right\rangle_{\mathcal H_m} = \mathbb E_{XYZ}\left[\tilde f(X)\tilde g(Y)\tilde h(Z)\right] = \mathrm{cov}\left[f(X), g(Y), h(Z)\right],$$

where we wrote $\tilde f(X) = f(X) - \mathbb E f(X)$, and similarly for $\tilde g$ and $\tilde h$. Using the usual isometries between Hilbert–Schmidt spaces and tensor product spaces:

$$\left\|\Sigma_{(XY)Z}\right\|^2_{HS} = \left\|\mathbb E_{XYZ}\left[\tilde k_{P_X}(\cdot,X) \otimes \tilde l_{P_Y}(\cdot,Y) \otimes \tilde m_{P_Z}(\cdot,Z)\right]\right\|^2_{\mathcal H_k\otimes\mathcal H_l\otimes\mathcal H_m} = \|P_{XYZ}\|^2_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}} = \|\Delta_L P\|^2_{k\otimes l\otimes m}.$$

Now, consider the supremum of the three-way covariance taken over the unit balls of the respective RKHSs:

$$\sup_{f,g,h} \mathrm{cov}\left[f(X),g(Y),h(Z)\right] = \sup_{f,g,h}\left\langle\Sigma_{(XY)Z}[f\otimes g],h\right\rangle_{\mathcal H_m} = \sup_{f,g}\left\|\Sigma_{(XY)Z}[f\otimes g]\right\|_{\mathcal H_m} \le \sup_{\substack{F \in \mathcal H_k\otimes\mathcal H_l \\ \|F\|\le 1}}\left\|\Sigma_{(XY)Z}F\right\|_{\mathcal H_m} = \left\|\Sigma_{(XY)Z}\right\|_{op} \le \left\|\Sigma_{(XY)Z}\right\|_{HS},$$

and thus $\|\Delta_L P\|_{k\otimes l\otimes m} = 0$ implies $\sup_{f,g,h}\mathrm{cov}[f(X),g(Y),h(Z)] = 0$. Conversely, if $\mathrm{cov}[f(X),g(Y),h(Z)] = 0$ for all $f$, $g$, $h$, then $\Sigma_{(XY)Z}[f\otimes g] \equiv 0$ for all $f$, $g$, so the linear operator $\Sigma_{(XY)Z}$ vanishes.

B The effect of centering
In a two-variable test, either or both of the kernel matrices can be centered when computing the test statistic, since $(K \circ \tilde L)_{++} = (\tilde K \circ L)_{++} = (\tilde K \circ \tilde L)_{++}$. To see this, simply note that, by the idempotence of $H$,

Table 3: V-statistics for various hypotheses

$(X,Y) \perp\!\!\!\perp Z$ : $\frac{1}{n^2}\left(K \circ L \circ \tilde M\right)_{++}$
$(X,Z) \perp\!\!\!\perp Y$ : $\frac{1}{n^2}\left(K \circ \tilde L \circ M\right)_{++}$
$(Y,Z) \perp\!\!\!\perp X$ : $\frac{1}{n^2}\left(\tilde K \circ L \circ M\right)_{++}$
$\Delta^{(X)}P = 0$ : $\frac{1}{n^2}\left(K \circ \tilde L \circ \tilde M\right)_{++}$
$\Delta^{(Y)}P = 0$ : $\frac{1}{n^2}\left(\tilde K \circ L \circ \tilde M\right)_{++}$
$\Delta^{(Z)}P = 0$ : $\frac{1}{n^2}\left(\tilde K \circ \tilde L \circ M\right)_{++}$
$\Delta_L P = 0$ : $\frac{1}{n^2}\left(\tilde K \circ \tilde L \circ \tilde M\right)_{++}$

Table 4:
An example of the Lancaster interaction measure vanishing for a case where no variable is independent of the other two.

P(0,0,0) = 0.2   P(0,0,1) = 0.1   P(0,1,0) = 0.1   P(0,1,1) = 0.1
P(1,0,0) = 0.1   P(1,0,1) = 0.1   P(1,1,0) = 0.1   P(1,1,1) = 0.2

$$\left(K \circ \tilde L\right)_{++} = \mathrm{tr}(KHLH) = \mathrm{tr}\left(KH^2LH^2\right) = \mathrm{tr}(HKH\,HLH) = (HKH \circ HLH)_{++} = \left(\tilde K \circ \tilde L\right)_{++}. \qquad (7)$$

This is no longer true in the three-variable case, where the centering of each Gram matrix has a different meaning. The various hypotheses and their corresponding V-statistics are summarized in Table 3. Note that the "composite" hypotheses are obtained simply by an appropriate centering of the Gram matrices.

C $\Delta_L P = 0 \nRightarrow (X,Y) \perp\!\!\!\perp Z \,\vee\, (X,Z) \perp\!\!\!\perp Y \,\vee\, (Y,Z) \perp\!\!\!\perp X$

Consider the following simple example with binary variables $X$, $Y$, $Z$, with the $2 \times 2 \times 2$ probability table given in Table 4. It is readily checked that all conditional covariances are equal, so $\Delta_L P = 0$. It is also clear, however, that no variable is independent of the other two. Therefore, a test for Lancaster interaction per se is not equivalent to testing for the possibility of any factorization of the joint distribution, but our empirical results suggest that it can nonetheless provide a useful surrogate. In other words, while rejection of the null hypothesis $\Delta_L P = 0$ is highly informative, since it implies that interaction is present and no non-trivial factorization of the joint distribution is available, acceptance of this null hypothesis should be considered carefully, and additional methods to rule out interaction should be sought.

D Permutation test
A permutation test for total independence is easy to construct: it suffices to compute the value of the statistic (either the Lancaster statistic $\|\Delta_L \hat P\|^2_{k\otimes l\otimes m}$ or the total independence statistic $\|\Delta_{tot} \hat P\|^2_{k\otimes l\otimes m}$) on $\{(X^{(i)}, Y^{(\sigma(i))}, Z^{(\tau(i))})\}_{i=1}^n$, for randomly drawn independent permutations $\sigma, \tau \in S_n$, in order to obtain a sample from the null distribution.

When testing for only one of the hypotheses $(Y,Z) \perp\!\!\!\perp X$, $(X,Z) \perp\!\!\!\perp Y$, or $(X,Y) \perp\!\!\!\perp Z$, either with a Lancaster statistic or with a standard two-variable kernel statistic, only one of the samples should be permuted; e.g., if testing for $(Y,Z) \perp\!\!\!\perp X$, statistics should be computed on $\{(X^{(\sigma(i))}, Y^{(i)}, Z^{(i)})\}_{i=1}^n$, for $\sigma \in S_n$. However, when testing for the disjunction of these hypotheses, i.e., for the existence of a non-trivial factorization of the joint distribution, we are within a multiple hypothesis testing framework (even though one may deal with a single test statistic, as in the Lancaster case). To ensure that the required significance level $\alpha$ is attained for the factorization hypothesis, in the experiments reported in Figure 3, Holm's sequentially rejective Bonferroni method [34] is used for both the two-variable based and the Lancaster based factorization tests. Namely, $p$-values are computed for each of the hypotheses $(Y,Z) \perp\!\!\!\perp X$, $(X,Z) \perp\!\!\!\perp Y$, and $(X,Y) \perp\!\!\!\perp Z$ using the permutation test, and sorted in ascending order as $p_{(1)}, p_{(2)}, p_{(3)}$. Hypotheses are then rejected sequentially whenever $p_{(l)} < \alpha/(4-l)$. The factorization hypothesis is then rejected if and only if all three hypotheses are rejected.

E Asymptotic behavior
Using terminology from [26], kernels $k$ and $k'$ are said to be equivalent if they induce the same semimetric on the domain, i.e., if

$$k(x,x) + k(x',x') - 2k(x,x') = k'(x,x) + k'(x',x') - 2k'(x,x') \quad \forall x, x'.$$

It can be shown that the Lancaster statistic is invariant to changing kernels within the kernel equivalence class, i.e., that $\|\Delta_L \hat P\|^2_{k\otimes l\otimes m} = \|\Delta_L \hat P\|^2_{k'\otimes l'\otimes m'}$ whenever $(k, k')$, $(l, l')$ and $(m, m')$ are equivalent pairs. Since a kernel centered at a marginal is equivalent to the original kernel, it follows that

$$\left\|\Delta_L \hat P\right\|^2_{k\otimes l\otimes m} = \left\|\Delta_L \hat P\right\|^2_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}}.$$

In Section A.2, we were able to show a similar expression, but only for changing $k$ to its version $\tilde k_{\hat P_X}$ centered at the empirical marginal. Now, under the assumption of total independence, i.e., that $P_{XYZ} = P_X P_Y P_Z$, the dominating term in $\|\Delta_L \hat P\|^2_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}}$ is $\|\hat P_{XYZ}\|^2_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}}$. By standard arguments, under total independence, this converges in distribution to a sum of weighted independent chi-squared variables,

$$n\left\|\hat P_{XYZ}\right\|^2_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}} \rightsquigarrow \sum_{a=1}^\infty \sum_{b=1}^\infty \sum_{c=1}^\infty \lambda_a \eta_b \theta_c\, N_{abc}^2, \qquad (8)$$

where $\{\lambda_a\}$, $\{\eta_b\}$, $\{\theta_c\}$ are, respectively, the eigenvalues of the integral operators associated to $\tilde k_{P_X}$, $\tilde l_{P_Y}$ and $\tilde m_{P_Z}$, and $N_{abc} \overset{i.i.d.}{\sim} \mathcal N(0,1)$. The other terms in $\|\Delta_L \hat P\|^2_{\tilde k_{P_X}\otimes\tilde l_{P_Y}\otimes\tilde m_{P_Z}}$ can be shown to drop to zero at a faster rate, as in the two-variable case. The resulting distribution of such a sum of weighted chi-squares can, in principle, be estimated using a Monte Carlo method, by computing a number of eigenvalues of $\tilde K$, $\tilde L$ and $\tilde M$, as in [35, 18].
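The Monte Carlo approach just described can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the Gaussian kernels, the unit bandwidth, the truncation to the top 10 eigenvalues per matrix, and the sample sizes are all our own illustrative assumptions. Eigenvalues of $\frac{1}{n}\tilde K$, $\frac{1}{n}\tilde L$, $\frac{1}{n}\tilde M$ serve as plug-in estimates of $\{\lambda_a\}$, $\{\eta_b\}$, $\{\theta_c\}$, and draws from a truncated version of (8) are formed from products of eigenvalue triples:

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gaussian kernel Gram matrix for a 1-d sample x (illustrative kernel choice)."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def centered(G):
    """HGH with H = I - (1/n) 11^T."""
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ G @ H

def null_samples(Kt, Lt, Mt, n_eig=10, n_mc=1000, seed=None):
    """Monte Carlo draws from a truncation of the limit (8):
    sum_{a,b,c} lam_a eta_b theta_c N_abc^2, with eigenvalues estimated
    from the centered Gram matrices (top n_eig of each)."""
    rng = np.random.default_rng(seed)
    n = Kt.shape[0]
    lam = np.sort(np.linalg.eigvalsh(Kt / n))[::-1][:n_eig]
    eta = np.sort(np.linalg.eigvalsh(Lt / n))[::-1][:n_eig]
    theta = np.sort(np.linalg.eigvalsh(Mt / n))[::-1][:n_eig]
    # all products lam_a * eta_b * theta_c, flattened into one weight vector
    w = np.einsum('a,b,c->abc', lam, eta, theta).ravel()
    N2 = rng.standard_normal((n_mc, w.size)) ** 2  # i.i.d. chi-squared(1) variables
    return N2 @ w

# Data generated under total independence (the null of interest here).
rng = np.random.default_rng(0)
n = 200
X, Y, Z = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)
Kt, Lt, Mt = (centered(gaussian_gram(v)) for v in (X, Y, Z))
draws = null_samples(Kt, Lt, Mt, seed=1)

# Plug-in estimate of the leading-order bias discussed next:
bias = np.trace(Kt) * np.trace(Lt) * np.trace(Mt) / n ** 4
```

The truncation level `n_eig` trades accuracy for cost; since the Gaussian kernel's spectrum decays quickly, a small number of eigenvalues typically captures most of the mass of the limiting distribution.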
This is of little practical value, though, as it is in most cases simpler and faster to run a permutation test, as we describe in Appendix D. On the other hand, the above result quantifies the highest-order bias of the V-statistic under total independence as $\frac{1}{n}\sum_{a=1}^\infty \lambda_a \sum_{b=1}^\infty \eta_b \sum_{c=1}^\infty \theta_c$, which can be estimated by $\frac{1}{n^4}\,\mathrm{Tr}(\tilde K)\,\mathrm{Tr}(\tilde L)\,\mathrm{Tr}(\tilde M)$.

We emphasize that (8) refers to a null distribution under total independence; if, say, the null holds because $(X,Y) \perp\!\!\!\perp Z$ while $X$ and $Y$ are dependent, one needs to instead consider a kernel on $\mathcal X \times \mathcal Y$ centered at $P_{XY}$, and the eigenvalues of its integral operator then replace $\{\lambda_a \eta_b\}$ (the triple sum becomes a double sum). This also implies that the bias term needs to be corrected appropriately.

F Some useful basic matrix algebra
Lemma 7.
Let $A$, $B$ be $n \times n$ matrices, and let $\mathbf 1$ denote the $n \times 1$ vector of ones. The following results hold:

1. $\mathbf 1^\top \mathbf 1 = n$
2. $[\mathbf 1\mathbf 1^\top]_{ij} = 1$ for all $i, j$, and thus $(\mathbf 1\mathbf 1^\top)_{++} = n^2$
3. $\left(I - \frac{1}{n}\mathbf 1\mathbf 1^\top\right)^2 = I - \frac{1}{n}\mathbf 1\mathbf 1^\top$
4. $[A\mathbf 1]_i = A_{i+}$, $[\mathbf 1^\top A]_j = A_{+j}$
5. $\mathbf 1^\top A \mathbf 1 = A_{++}$
6. $(A\mathbf 1\mathbf 1^\top)_{++} = (\mathbf 1\mathbf 1^\top A)_{++} = n A_{++}$
7. $(\alpha A + \beta B)_{++} = \alpha A_{++} + \beta B_{++}$
8. $(A\mathbf 1\mathbf 1^\top B)_{++} = A_{++} B_{++}$

Proof. (3):

$$\left(I - \frac{1}{n}\mathbf 1\mathbf 1^\top\right)^2 = I - \frac{2}{n}\mathbf 1\mathbf 1^\top + \frac{1}{n^2}\mathbf 1 \underbrace{\mathbf 1^\top \mathbf 1}_{n} \mathbf 1^\top = I - \frac{1}{n}\mathbf 1\mathbf 1^\top.$$

(8): From (4), $[A\mathbf 1\mathbf 1^\top B]_{ij} = A_{i+} B_{+j}$, implying

$$(A\mathbf 1\mathbf 1^\top B)_{++} = \sum_{i=1}^n A_{i+} \sum_{j=1}^n B_{+j} = A_{++} B_{++}.$$

Now, let $K$ be a symmetric matrix, write $K_+ := \mathbf 1\mathbf 1^\top K$ (so that $[K_+]_{ij} = K_{+j}$), and denote by $H = I - \frac{1}{n}\mathbf 1\mathbf 1^\top$ the centering matrix. Then:

$$HKH = \left(I - \frac{1}{n}\mathbf 1\mathbf 1^\top\right) K \left(I - \frac{1}{n}\mathbf 1\mathbf 1^\top\right) = K - \frac{1}{n}\left(K_+ + K_+^\top\right) + \frac{1}{n^2} K_{++} \mathbf 1\mathbf 1^\top.$$

Note that:

$$(HKH)_{++} = K_{++} - \frac{1}{n}\left((K_+)_{++} + (K_+^\top)_{++}\right) + \frac{1}{n^2} K_{++} (\mathbf 1\mathbf 1^\top)_{++} = K_{++} - 2K_{++} + K_{++} = 0.$$

Lemma 8.
The following results hold:

1. $A \circ \mathbf 1\mathbf 1^\top = \mathbf 1\mathbf 1^\top \circ A = A$
2. $(I \circ A)_{++} = \mathrm{tr}(A)$
3. $(A \circ B)_{++} = \mathrm{tr}(AB^\top)$
4. For a symmetric matrix $K$ and any matrix $A$: $(A \circ K_+)_{++} = (AK)_{++}$ and $(A \circ K_+^\top)_{++} = (KA)_{++}$
5. For symmetric matrices $K$, $L$: $(K_+ \circ L_+)_{++} = (K_+^\top \circ L_+^\top)_{++} = n(KL)_{++}$
6. For symmetric matrices $K$, $L$: $(K_+ \circ L_+^\top)_{++} = (K_+^\top \circ L_+)_{++} = K_{++} L_{++}$

Proof. (4):

$$(A \circ K_+)_{++} = \mathrm{tr}\left(A (\mathbf 1\mathbf 1^\top K)^\top\right) = \mathrm{tr}\left(A K \mathbf 1\mathbf 1^\top\right) = (AK)_{++}.$$

(5):

$$(K_+ \circ L_+)_{++} = (K_+ L)_{++} = (\mathbf 1\mathbf 1^\top K L)_{++} = n(KL)_{++}.$$

Proposition 9.
Denote $H = I - \frac{1}{n}\mathbf 1\mathbf 1^\top$. Then:

$$(K \circ HLH)_{++} = (K \circ L)_{++} - \frac{2}{n}(KL)_{++} + \frac{1}{n^2} K_{++} L_{++}.$$

Proof. Let $K$ and $L$ be symmetric matrices and consider $K \circ HLH$. We obtain:

$$K \circ HLH = K \circ \left(L - \frac{1}{n}\left(L_+ + L_+^\top\right) + \frac{1}{n^2} L_{++} \mathbf 1\mathbf 1^\top\right) = K \circ L - \frac{1}{n}\left(K \circ L_+ + K \circ L_+^\top\right) + \frac{1}{n^2} L_{++} K,$$

so that:

$$(K \circ HLH)_{++} = (K \circ L)_{++} - \frac{2}{n}(KL)_{++} + \frac{1}{n^2} K_{++} L_{++}.$$

Corollary 10. $\mathrm{tr}(HLH) = \mathrm{tr}(L) - \frac{1}{n} L_{++}$.
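As a closing sanity check, the identities in this appendix and the Gram-matrix form of the Lancaster statistic from Proposition 3 can be verified numerically. The sketch below is our own illustration, not the paper's code: the Gaussian kernel, unit bandwidth, sample size, noise level, and permutation count are all illustrative assumptions. It checks Proposition 9, Corollary 10 and the vanishing of $(HKH)_{++}$ on random symmetric matrices, then computes $\frac{1}{n^2}(\tilde K \circ \tilde L \circ \tilde M)_{++}$ on a V-structure $Z = XY + \varepsilon$ together with a permutation p-value as in Appendix D:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
H = np.eye(n) - np.ones((n, n)) / n  # centering matrix H = I - (1/n) 11^T

def gram(x, sigma=1.0):
    """Gaussian kernel Gram matrix of a 1-d sample (illustrative kernel choice)."""
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

# Random symmetric matrices for checking the matrix-algebra identities.
A = rng.standard_normal((n, n)); K = A + A.T
B = rng.standard_normal((n, n)); L = B + B.T

# Proposition 9: (K o HLH)_++ = (K o L)_++ - (2/n)(KL)_++ + (1/n^2) K_++ L_++
lhs = (K * (H @ L @ H)).sum()
rhs = (K * L).sum() - (2 / n) * (K @ L).sum() + K.sum() * L.sum() / n ** 2
assert np.isclose(lhs, rhs)

# Corollary 10: tr(HLH) = tr(L) - (1/n) L_++
assert np.isclose(np.trace(H @ L @ H), np.trace(L) - L.sum() / n)

# (HKH)_++ = 0, as shown after Lemma 7.
assert abs((H @ K @ H).sum()) < 1e-8

def lancaster_stat(x, y, z):
    """Empirical Lancaster statistic (1/n^2)(K~ o L~ o M~)_++ (Proposition 3)."""
    Kt, Lt, Mt = (H @ gram(v) @ H for v in (x, y, z))
    return (Kt * Lt * Mt).sum() / len(x) ** 2

# V-structure data: X, Y independent causes, Z depends on their product.
X, Y = rng.standard_normal(n), rng.standard_normal(n)
Z = X * Y + 0.1 * rng.standard_normal(n)
stat = lancaster_stat(X, Y, Z)  # a squared RKHS norm, hence nonnegative

# Permutation test for total independence: permute Y and Z independently.
null = [lancaster_stat(X, rng.permutation(Y), rng.permutation(Z))
        for _ in range(200)]
pvalue = (1 + sum(s >= stat for s in null)) / (1 + len(null))
```

The permutation count of 200 keeps the sketch fast; in practice a larger number of permutations gives a finer-grained p-value.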