Double Generative Adversarial Networks for Conditional Independence Testing
Chengchun Shi, Tianlin Xu, and Wicher Bergsma
London School of Economics and Political Science
Lexin Li
University of California, Berkeley

Abstract
In this article, we consider the problem of high-dimensional conditional independence testing, which is a key building block in statistics and machine learning. We propose a double generative adversarial networks (GANs)-based inference procedure. We first introduce a double GANs framework to learn two generators, and integrate the two generators to construct a doubly-robust test statistic. We next consider multiple generalized covariance measures, and take their maximum as our test statistic. Finally, we obtain the empirical distribution of our test statistic through multiplier bootstrap. We show that our test controls type-I error, while the power approaches one asymptotically. More importantly, these theoretical guarantees are obtained under much weaker and practically more feasible conditions compared to existing tests. We demonstrate the efficacy of our test through both synthetic and real datasets.
1 Introduction

Conditional independence (CI) is a fundamental concept in statistics and machine learning. Testing conditional independence is a key building block and plays a central role in a wide variety of statistical learning problems, for instance, causal inference (Pearl, 2009), graphical models (Koller & Friedman, 2009), and dimension reduction (Li, 2018), among others. In this article, we aim at testing whether two random variables $X$ and $Y$ are conditionally independent given a set of confounding variables $Z$. That is, we test the hypotheses
\[
H_0: X \perp\!\!\!\perp Y \mid Z \quad \textrm{versus} \quad H_1: X \not\!\perp\!\!\!\perp Y \mid Z, \tag{1}
\]
given observed data consisting of $n$ i.i.d. copies $\{(X_i, Y_i, Z_i)\}_{1 \le i \le n}$ of $(X, Y, Z)$. For our problem, $X$, $Y$ and $Z$ can all be multivariate. However, the main challenge arises when the confounding set of variables $Z$ is high-dimensional. As such, we primarily focus on the scenario with a univariate $X$ and $Y$, and a multivariate $Z$. Meanwhile, our proposed method is applicable to the multivariate $X$ and $Y$ scenario as well. Another challenge is the limited sample size compared to the dimensionality of $Z$. As a result, many existing tests are ineffective, with either an inflated type-I error or insufficient power to detect the alternatives. See Section 2 for a detailed review.

We propose a double generative adversarial networks (GANs; Goodfellow et al., 2014)-based inference procedure for the CI testing problem (1). Our proposal involves two key components: a double GANs framework to learn two generators that approximate the conditional distributions of $X$ given $Z$ and of $Y$ given $Z$, and a maximum of generalized covariance measures over multiple combinations of transformation functions of $X$ and $Y$. We first establish that our test statistic is doubly-robust, which offers additional protection against potential misspecification of the conditional distributions (see Theorems 1 and 2). Second, we show that the resulting test achieves a valid control of the type-I error asymptotically, and more importantly, under conditions that are much weaker and practically more feasible than those of existing tests (see Theorem 3). Finally, we prove that the power of our test approaches one asymptotically (see Theorem 4), and demonstrate empirically that it is more powerful than the competing tests.

2 Existing Tests

There has been a growing literature on conditional independence testing in recent years; see Li & Fan (2019) for a review. Broadly speaking, the existing testing methods can be cast into four main categories: the metric-based tests, e.g., (Su & White, 2007, 2014; Wang et al., 2015); the conditional randomization-based tests (Candes et al., 2018; Bellot & van der Schaar, 2019); the kernel-based tests (Fukumizu et al., 2008; Zhang et al., 2011); and the regression-based tests (Hoyer et al., 2009; Zhang et al., 2018; Shah & Peters, 2018). There are other types of tests as well, e.g., (Bergsma, 2004; Doran et al., 2014; Sen et al., 2017; Berrett et al., 2019), to mention a few.

The metric-based tests typically employ kernel smoothers to estimate the conditional characteristic function or the distribution function of $Y$ given $X$ and $Z$. Kernel smoothers, however, are known to suffer from the curse of dimensionality, and as such, these tests are not suitable when the dimension of $Z$ is high. The conditional randomization-based tests require knowledge of the conditional distribution of $X$ given $Z$ (Candes et al., 2018).
If unknown, the type-I error rates of these tests rely critically on the quality of the approximation of this conditional distribution. The kernel-based tests are built upon the notion of maximum mean discrepancy (MMD; Gretton et al., 2012), and could have inflated type-I errors. The regression-based tests have valid type-I error control, but may suffer from inadequate power. Next, we discuss in detail the conditional randomization-based tests, in particular the work of Bellot & van der Schaar (2019), as well as the regression-based and the MMD-based tests, since our proposal is closely related to them.

2.1 Conditional Randomization-Based Tests

The family of conditional randomization-based tests is built upon the following basis. If the conditional distribution $P_{X|Z}$ of $X$ given $Z$ is known, then one can independently draw $X^{(1)}_i \sim P_{X|Z=Z_i}$ for $i = 1, \ldots, n$, and these samples are independent of the observed samples $X_i$'s and $Y_i$'s. Write $\mathbf{X} = (X_1, \ldots, X_n)^\top$, $\mathbf{X}^{(1)} = (X^{(1)}_1, \ldots, X^{(1)}_n)^\top$, $\mathbf{Y} = (Y_1, \ldots, Y_n)^\top$, and $\mathbf{Z} = (Z_1, \ldots, Z_n)^\top$, where we use boldface letters to denote data matrices consisting of $n$ samples. The joint distributions of $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ and $(\mathbf{X}^{(1)}, \mathbf{Y}, \mathbf{Z})$ are the same under $H_0$, so any large difference between the two distributions can be interpreted as evidence against $H_0$. Therefore, one can repeat the process $M$ times and generate $X^{(m)}_i \sim P_{X|Z=Z_i}$, for $i = 1, \ldots, n$ and $m = 1, \ldots, M$. Write $\mathbf{X}^{(m)} = (X^{(m)}_1, \ldots, X^{(m)}_n)^\top$. Then, for any given test statistic $\rho = \rho(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$, its associated p-value is
\[
p = \Big[1 + \sum_{m=1}^{M} \mathbb{I}\{\rho(\mathbf{X}^{(m)}, \mathbf{Y}, \mathbf{Z}) \ge \rho(\mathbf{X}, \mathbf{Y}, \mathbf{Z})\}\Big] \Big/ (1 + M),
\]
where $\mathbb{I}(\cdot)$ is the indicator function. Since the triplets $(\mathbf{X}, \mathbf{Y}, \mathbf{Z}), (\mathbf{X}^{(1)}, \mathbf{Y}, \mathbf{Z}), \ldots, (\mathbf{X}^{(M)}, \mathbf{Y}, \mathbf{Z})$ are exchangeable under $H_0$, the p-value is valid, satisfying $P(p \le \alpha \mid H_0) \le \alpha + o(1)$ for any $0 < \alpha < 1$.
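To make these mechanics concrete, here is a minimal sketch of the randomization p-value in Python. The sampler `sample_x_given_z` and the statistic `rho` are user-supplied placeholders we introduce for illustration; in the GCIT discussed next, the sampler is replaced by a GAN generator fitted to the data.

```python
import numpy as np

def crt_pvalue(X, Y, Z, sample_x_given_z, rho, M=100, seed=None):
    """Randomization p-value in the spirit of Candes et al. (2018).

    `sample_x_given_z(Z, rng)` draws one pseudo-copy of X for every row of Z
    from (an approximation of) P_{X|Z}; `rho(X, Y, Z)` is any scalar statistic,
    with larger values indicating stronger dependence. Both are placeholders.
    """
    rng = np.random.default_rng(seed)
    rho_obs = rho(X, Y, Z)
    exceed = 0
    for _ in range(M):
        X_m = sample_x_given_z(Z, rng)          # X_i^(m) ~ P_{X|Z=Z_i}, independently over i
        exceed += int(rho(X_m, Y, Z) >= rho_obs)
    return (1 + exceed) / (1 + M)               # valid under exchangeability of the M+1 triplets
```

For instance, $\rho$ can be the absolute sample correlation between $X$ and $Y$; the validity of $p$ relies only on the exchangeability argument above, not on the choice of $\rho$.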
In practice, however, $P_{X|Z}$ is rarely known, and Bellot & van der Schaar (2019) proposed to approximate it using GANs. Specifically, they learned a generator $G_X(\cdot, \cdot)$ from the observed data, then took $Z_i$ and a noise variable $v^{(m)}_{i,X}$ as input to obtain a sample $\widetilde{X}^{(m)}_i = G_X(Z_i, v^{(m)}_{i,X})$, where $G_X$ is trained to minimize the divergence between the distributions of $(X_i, Z_i)$ and $(\widetilde{X}^{(m)}_i, Z_i)$. The p-value is then computed by replacing $\mathbf{X}^{(m)}$ with $\widetilde{\mathbf{X}}^{(m)} = (\widetilde{X}^{(m)}_1, \ldots, \widetilde{X}^{(m)}_n)^\top$. They called this test GCIT, short for generative conditional independence test. By Theorem 1 of Bellot & van der Schaar (2019), the excess type-I error of this test is upper bounded by
\[
P(p \le \alpha \mid H_0) - \alpha \;\le\; E\, d_{TV}(\widetilde{P}_{X|Z}, P_{X|Z}) \;=\; E \sup_A \big|P(X \in A \mid Z) - P(\widetilde{X}^{(m)} \in A \mid Z)\big| \;\equiv\; D, \tag{2}
\]
where $d_{TV}$ is the total variation distance between two probability distributions, the supremum is taken over all measurable sets, and the expectations in (2) are taken with respect to $Z$.

By definition, the quantity $D$ on the right-hand side of (2) measures the quality of the conditional distribution approximation. Bellot & van der Schaar (2019) argued that this error term is negligible, owing to the capacity of deep neural nets for estimating conditional distributions. To the contrary, we find this approximation error is usually not negligible, and consequently it may inflate the type-I error and potentially invalidate the test. We next consider a simple example to further elaborate this point.

Example 1. Suppose $X$ is one-dimensional, and follows a simple linear regression model, $X = Z^\top \beta_0 + \varepsilon$, where the error $\varepsilon$ is independent of $Z$ and $\varepsilon \sim N(0, \sigma^2)$ for some $\sigma > 0$.

Suppose we know a priori that the linear regression model holds. We thus estimate $\beta_0$ by ordinary least squares, and denote the resulting estimator by $\widehat{\beta}$. For simplicity, suppose $\sigma^2$ is known too. For this simple example, we have the following result regarding the approximation error term $D$. We use $o(1)$ to denote a quantity that converges to zero as the sample size diverges to infinity.

Proposition 1. Suppose the linear regression model holds. The derived conditional distribution $\widetilde{P}_{\mathbf{X}|\mathbf{Z}}$ is $N(\mathbf{Z}\widehat{\beta}, \sigma^2 I_n)$, where $I_n$ is the $n \times n$ identity matrix. Then $D$ is not $o(1)$.

To gain some insight about $D$, we sketch an outline of the proof of Proposition 1; a detailed proof is given in the appendix. Let $\widetilde{P}_{X|Z=Z_i}$ denote the conditional distribution of $\widetilde{X}^{(m)}_i$ given $Z_i$, which is $N(Z_i^\top \widehat{\beta}, \sigma^2)$ in this example. If $D = o(1)$, then
\[
\widetilde{D} \equiv n^{1/2} \sqrt{E\, d^2_{TV}(\widetilde{P}_{X|Z=Z_i}, P_{X|Z=Z_i})} = o(1). \tag{3}
\]
In other words, the validity of GCIT requires the root mean squared total variation distance in (3) to converge at a rate faster than $n^{-1/2}$. However, this rate cannot be achieved in general: in our simple Example 1, we have $\widetilde{D} \ge c_0$ for some universal constant $c_0 > 0$, and consequently $D$ in (2) is not $o(1)$. Proposition 1 shows that, even if we know a priori that the linear model holds, $D$ does not decay to zero as $n$ grows to infinity. In practice, we do not have such prior model information, and it would be even more difficult to estimate the conditional distribution $P_{X|Z}$. Therefore, using GANs to approximate $P_{X|Z}$ guarantees neither a negligible approximation error nor the validity of the test.
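As a quick numerical companion to Proposition 1, the following sketch estimates $\widetilde{D}$ in (3) by Monte Carlo under Example 1, using the closed form $d_{TV}\{N(\mu_1, \sigma^2), N(\mu_2, \sigma^2)\} = 2\Phi(|\mu_1 - \mu_2|/(2\sigma)) - 1$. The design dimension and the other constants are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

def scaled_rms_tv(n, d=10, sigma=1.0, reps=200, seed=0):
    """Monte Carlo estimate of D-tilde in (3) under Example 1.

    The OLS plug-in N(Z_i' beta_hat, sigma^2) is compared with the truth
    N(Z_i' beta, sigma^2) using the closed form
    d_TV(N(m1, s^2), N(m2, s^2)) = 2 * Phi(|m1 - m2| / (2 s)) - 1.
    """
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(reps):
        Z = rng.standard_normal((n, d))
        beta = rng.standard_normal(d)
        X = Z @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.lstsq(Z, X, rcond=None)[0]
        tv = 2.0 * norm.cdf(np.abs(Z @ (beta_hat - beta)) / (2.0 * sigma)) - 1.0
        vals.append(np.sqrt(n * np.mean(tv ** 2)))   # sqrt(n) * root mean squared d_TV
    return float(np.mean(vals))

# scaled_rms_tv(500) and scaled_rms_tv(5000) both stabilize near a positive
# constant (roughly sqrt(d / (2 * pi)) here) instead of vanishing, in line
# with Proposition 1.
```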
2.2 Regression-Based Tests

The family of regression-based tests is built upon a key quantity, the generalized covariance measure,
\[
\mathrm{GCM}(X, Y) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big\{X_i - \widehat{E}(X_i \mid Z_i)\big\} \big\{Y_i - \widehat{E}(Y_i \mid Z_i)\big\},
\]
where $\widehat{E}(X \mid Z)$ and $\widehat{E}(Y \mid Z)$ are predictions of the conditional means $E(X \mid Z)$ and $E(Y \mid Z)$, respectively, obtained by any supervised learner. When the prediction errors of $\widehat{E}(X \mid Z)$ and $\widehat{E}(Y \mid Z)$ satisfy certain convergence rates, Shah & Peters (2018) proved that GCM is asymptotically normal. Under $H_0$, the asymptotic mean of GCM is zero, and its asymptotic standard deviation can be consistently estimated by a standard error estimator, denoted by $\widehat{s}(\mathrm{GCM})$. Therefore, for a given significance level $\alpha$, we reject $H_0$ if $|\mathrm{GCM}| / \widehat{s}(\mathrm{GCM})$ exceeds the upper $\alpha/2$-th quantile of a standard normal distribution.
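For concreteness, a compact sketch of this test follows; `fit_predict` stands in for whatever supervised learner one uses for the two conditional means, and is a hypothetical placeholder rather than part of Shah & Peters' specification.

```python
import numpy as np
from scipy.stats import norm

def gcm_test(X, Y, Z, fit_predict):
    """Generalized covariance measure test in the spirit of Shah & Peters (2018).

    `fit_predict(Z, target)` is a user-supplied supervised learner returning
    fitted values of E(target | Z); the choice of learner is left open.
    """
    rx = X - fit_predict(Z, X)                 # residuals of X given Z
    ry = Y - fit_predict(Z, Y)                 # residuals of Y given Z
    prod = rx * ry
    n = len(prod)
    gcm = np.sqrt(n) * prod.mean()             # GCM(X, Y) = n^{-1/2} sum_i rx_i * ry_i
    se = prod.std(ddof=1)                      # consistent standard error estimate
    stat = gcm / se
    pvalue = 2.0 * norm.sf(abs(stat))          # reject H0 when |stat| is large
    return stat, pvalue
```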
Such a test is valid. However, it may not have sufficient power to detect $H_1$. This is because the mean of GCM is driven by $\mathrm{GCM}^*(X, Y) = E[\{X - E(X \mid Z)\}\{Y - E(Y \mid Z)\}]$, and the regression-based tests require $|\mathrm{GCM}^*|$ to be nonzero under $H_1$ in order to have power. However, there is no guarantee that this requirement holds. We again consider a simple example to elaborate.

Example 2. Suppose $X^*$, $Y$ and $Z$ are independent random variables. Besides, $X^*$ has mean zero, and $X = X^* g(Y)$ for some function $g$.

For this example, we have $E(X \mid Z) = E(X)$, since both $X^*$ and $Y$ are independent of $Z$, and so is $X$. Besides, $E(X) = E(X^*) E\{g(Y)\} = 0$, since $X^*$ is independent of $Y$ and $E(X^*) = 0$. As such, $\mathrm{GCM}^*(X, Y) = E[\{X - E(X)\}\{Y - E(Y \mid Z)\}] = 0$ for any function $g$. On the other hand, $X$ and $Y$ are conditionally dependent given $Z$ as long as $g$ is not a constant function. Therefore, for this example, the regression-based tests would fail to discriminate between $H_0$ and $H_1$.

2.3 Kernel MMD-Based Tests

The family of kernel-based tests often involves the notion of maximum mean discrepancy as a measure of independence. For any two probability measures $P$, $Q$ and a function space $\mathcal{F}$, define
\[
\mathrm{MMD}(P, Q \mid \mathcal{F}) = \sup_{f \in \mathcal{F}} \{E f(W_1) - E f(W_2)\}, \quad W_1 \sim P, \; W_2 \sim Q.
\]
Let $H_1$, $H_2$ be function spaces of square integrable functions, i.e., $E h_1^2(X) < +\infty$ and $E h_2^2(Y) < +\infty$ for $h_k \in H_k$. Define $\phi_{XY} = \mathrm{MMD}(P_{XY}, Q_{XY} \mid H_1 \otimes H_2)$, where $\otimes$ is the tensor product, $P_{XY}$ is the joint distribution of $(X, Y)$, and $Q_{XY}$ is the conditionally independent distribution with the same $X$ and $Y$ margins as $P_{XY}$. Then, after calculations given in the appendix, we have
\[
\phi_{XY} = \sup_{h_1 \in H_1,\, h_2 \in H_2} E\big[h_1(X) - E\{h_1(X) \mid Z\}\big]\big[h_2(Y) - E\{h_2(Y) \mid Z\}\big]. \tag{4}
\]
We see that $\phi_{XY}$ measures the average conditional association between $X$ and $Y$ given $Z$. Under $H_0$, it equals zero, and hence an estimator of this measure can be used as a test statistic for $H_0$.

3 Double GANs-Based Testing Procedure

We propose a double GANs-based testing procedure for the conditional independence testing problem (1). Conceptually, our test integrates the GCIT, regression-based and MMD-based tests, while addressing the limitations of the existing ones. Unlike GCIT, which only learned the conditional distribution of $X \mid Z$, we learn two generators $G_X$ and $G_Y$ to approximate the conditional distributions of both $X \mid Z$ and $Y \mid Z$. We then integrate the two generators in an appropriate way to construct a doubly-robust test statistic, and we only require the root mean squared total variation norm to converge at a rate of $n^{-\kappa}$ for some $\kappa > 1/4$. Such a requirement is much weaker and practically more feasible than the condition in (3).

Moreover, to improve the power of the test, we consider a set of GCMs, $\{\mathrm{GCM}(h_1(X), h_2(Y)) : h_1, h_2\}$, for multiple combinations of transformation functions $h_1(X)$ and $h_2(Y)$, and take the maximum of all these GCMs as our test statistic. This essentially targets $\phi_{XY}$, which is connected with the notion of MMD. To see why the maximum-type statistic can enhance the power, we quickly revisit Example 2. When $g$ is not a constant function, there exists some nonlinear function $h_1$ such that $h^*(Y) = E\{h_1(X) \mid Y\}$ is not a constant function of $Y$. Set $h_2 = h^*$. We have $\mathrm{GCM}^*\{h_1(X), h_2(Y)\} = E[h_1(X)\{h^*(Y) - E h^*(Y)\}] = \mathrm{Var}\{h^*(Y)\} > 0$. This enables us to discriminate the null from the alternative hypothesis.

We next detail our test. An overview of our testing procedure is depicted in Figure 1.

Test statistic. We begin with two function spaces, $H_1 = \{h_{1,\theta_1} : \theta_1 \in \mathbb{R}^{d_1}\}$ and $H_2 = \{h_{2,\theta_2} : \theta_2 \in \mathbb{R}^{d_2}\}$, indexed by the parameters $\theta_1$ and $\theta_2$, respectively. We then randomly generate $B$ functions, $h_{1,1}, \ldots, h_{1,B} \in H_1$ and $h_{2,1}, \ldots, h_{2,B} \in H_2$, by independently sampling i.i.d. multivariate normal vectors $\theta_{1,1}, \ldots, \theta_{1,B} \sim N(0, I_{d_1}/d_1)$ and $\theta_{2,1}, \ldots, \theta_{2,B} \sim N(0, I_{d_2}/d_2)$, and setting $h_{1,b} = h_{1,\theta_{1,b}}$ and $h_{2,b} = h_{2,\theta_{2,b}}$, $b = 1, \ldots, B$.
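The specific indexed families are left to the user, subject to the boundedness and Lipschitz conditions used in the theory below. Purely for illustration, the sketch below instantiates $h_\theta$ as random cosine features with $d = 2$; this particular functional form is our assumption, not the paper's prescription.

```python
import numpy as np

def sample_transformations(B, seed=None):
    """Draw B random functions h_theta with theta ~ N(0, I_d / d).

    The cosine form below (d = 2) is an assumed placeholder; the theory only
    requires a bounded class indexed smoothly by theta.
    """
    rng = np.random.default_rng(seed)
    d = 2
    thetas = rng.normal(0.0, 1.0 / np.sqrt(d), size=(B, d))   # theta ~ N(0, I_d / d)
    return [lambda w, t=th: np.cos(t[0] * w + t[1]) for th in thetas]
```

Two independent draws, one list for $X$ and one for $Y$, then index the $B^2$ generalized covariance measures entering the maximum below.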
Consider the following maximum-type test statistic,
\[
\max_{b_1, b_2 \in \{1, \ldots, B\}} \widehat{\sigma}^{-1}_{b_1, b_2} \left| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big[ h_{1,b_1}(X_i) - \widehat{E}\{h_{1,b_1}(X_i) \mid Z_i\} \Big] \Big[ h_{2,b_2}(Y_i) - \widehat{E}\{h_{2,b_2}(Y_i) \mid Z_i\} \Big] \right|, \tag{5}
\]
where
\[
\widehat{\sigma}^2_{b_1, b_2} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \Big[ h_{1,b_1}(X_i) - \widehat{E}\{h_{1,b_1}(X_i) \mid Z_i\} \Big] \Big[ h_{2,b_2}(Y_i) - \widehat{E}\{h_{2,b_2}(Y_i) \mid Z_i\} \Big] - \frac{\mathrm{GCM}\{h_{1,b_1}(X), h_{2,b_2}(Y)\}}{\sqrt{n}} \right)^2.
\]
To compute (5), however, we need to estimate the conditional means $E\{h_{1,b_1}(X) \mid Z\}$ and $E\{h_{2,b_2}(Y) \mid Z\}$ for $b_1, b_2 = 1, \ldots, B$. Separately applying supervised learning algorithms $2B$ times to compute these means is computationally very expensive for a large value of $B$. Instead, we propose to implement this step based on the generators $G_X$ and $G_Y$ estimated using GANs, which is computationally much more efficient.

Specifically, for $i = 1, \ldots, n$, we randomly generate i.i.d. random noises $\{v^{(m)}_{i,X}\}_{m=1}^{M}$ and $\{v^{(m)}_{i,Y}\}_{m=1}^{M}$, and output the pseudo samples $\widetilde{X}^{(m)}_i = G_X(Z_i, v^{(m)}_{i,X})$ and $\widetilde{Y}^{(m)}_i = G_Y(Z_i, v^{(m)}_{i,Y})$, for $m = 1, \ldots, M$, to approximate the conditional distributions of $X_i$ and $Y_i$ given $Z_i$. We then compute $\widehat{E}\{h_{1,b_1}(X_i) \mid Z_i\} = M^{-1} \sum_{m=1}^{M} h_{1,b_1}(\widetilde{X}^{(m)}_i)$ and $\widehat{E}\{h_{2,b_2}(Y_i) \mid Z_i\} = M^{-1} \sum_{m=1}^{M} h_{2,b_2}(\widetilde{Y}^{(m)}_i)$, for $b_1, b_2 = 1, \ldots, B$. Plugging these estimated means into (5) produces our test statistic, $T \equiv \max_{b_1, b_2} |n^{-1/2} \sum_{i=1}^{n} \psi_{b_1, b_2, i}|$, where
\[
\psi_{b_1, b_2, i} = \widehat{\sigma}^{-1}_{b_1, b_2} \left\{ h_{1,b_1}(X_i) - \frac{1}{M} \sum_{m=1}^{M} h_{1,b_1}\big(\widetilde{X}^{(m)}_i\big) \right\} \left\{ h_{2,b_2}(Y_i) - \frac{1}{M} \sum_{m=1}^{M} h_{2,b_2}\big(\widetilde{Y}^{(m)}_i\big) \right\}.
\]
To help reduce the type-I error of our test, we further employ a data splitting and cross-fitting strategy, which is commonly used in statistical testing (Romano & DiCiccio, 2019). That is, we use different subsets of the data to learn the GANs and to construct the test statistic. We summarize our procedure for computing the test statistic in Algorithm 1.

Figure 1: Illustration of conditional independence testing with double GANs.
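Putting the pieces together, the following minimal sketch computes $T$ and the standardized summands $\psi_{b_1,b_2,i}$ from the observed data and the generator pseudo-samples; for clarity it omits the data splitting and cross-fitting steps formalized in Algorithm 1 below.

```python
import numpy as np

def dgcit_statistic(X, Y, X_tilde, Y_tilde, h1_list, h2_list):
    """Maximum-type statistic in (5) from generator pseudo-samples.

    X, Y: (n,) observed vectors; X_tilde, Y_tilde: (n, M) pseudo-samples
    from the fitted generators; h1_list, h2_list: length-B lists of
    vectorized transformation functions. Returns T and the (B, B, n)
    array of standardized summands psi_{b1, b2, i}.
    """
    n = X.shape[0]
    # h(X_i) - M^{-1} sum_m h(X_i^(m)): plug-in centred transformations, (B, n)
    rx = np.stack([h(X) - h(X_tilde).mean(axis=1) for h in h1_list])
    ry = np.stack([h(Y) - h(Y_tilde).mean(axis=1) for h in h2_list])
    prod = rx[:, None, :] * ry[None, :, :]          # products, shape (B, B, n)
    sigma = prod.std(axis=2, ddof=1)                # sigma_hat_{b1,b2} as in (5)
    psi = prod / sigma[:, :, None]                  # psi_{b1, b2, i}
    T = np.abs(psi.sum(axis=2)).max() / np.sqrt(n)  # max_{b1,b2} |n^{-1/2} sum_i psi|
    return T, psi
```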
Algorithm 1
Compute the test statistic.
Input: number of transformation functions $B$, number of pseudo samples $M$, and number of data splits $L$.
Step 1: Divide $\{1, \ldots, n\}$ into $L$ folds $\mathcal{I}^{(1)}, \ldots, \mathcal{I}^{(L)}$, and let $\mathcal{I}^{(-\ell)} = \{1, \ldots, n\} \setminus \mathcal{I}^{(\ell)}$.
Step 2: For $\ell = 1, \ldots, L$, train two generators $G_X^{(\ell)}$ and $G_Y^{(\ell)}$ on $\{(X_i, Z_i)\}_{i \in \mathcal{I}^{(-\ell)}}$ and $\{(Y_i, Z_i)\}_{i \in \mathcal{I}^{(-\ell)}}$, respectively, to approximate the conditional distributions of $X \mid Z$ and $Y \mid Z$.
Step 3: For $\ell = 1, \ldots, L$ and $i \in \mathcal{I}^{(\ell)}$, generate i.i.d. random noises $\{v^{(m)}_{i,X}\}_{m=1}^{M}$ and $\{v^{(m)}_{i,Y}\}_{m=1}^{M}$. Set $\widetilde{X}^{(m)}_i = G_X^{(\ell)}(Z_i, v^{(m)}_{i,X})$ and $\widetilde{Y}^{(m)}_i = G_Y^{(\ell)}(Z_i, v^{(m)}_{i,Y})$, $m = 1, \ldots, M$.
Step 4: Randomly generate $h_{1,1}, \ldots, h_{1,B} \in H_1$ and $h_{2,1}, \ldots, h_{2,B} \in H_2$.
Step 5: Compute the test statistic $T$.

Computing the p-value. Next, we propose a multiplier bootstrap method to approximate the distribution of $T$ under $H_0$ and to compute the corresponding p-value. The key observation is that each $n^{-1/2} \sum_{i=1}^{n} \psi_{b_1, b_2, i}$ is asymptotically normal with zero mean under $H_0$; see the proof of Theorem 3 in the appendix for details. As such, $T = \max_{b_1, b_2} |n^{-1/2} \sum_{i=1}^{n} \psi_{b_1, b_2, i}|$ converges to a maximum of normal variables in absolute value. To approximate this limiting distribution, we first estimate the covariance matrix of the $B^2$-dimensional vector formed by $\{n^{-1/2} \sum_i \psi_{b_1, b_2, i}\}_{b_1, b_2}$ using the sample covariance matrix $\widehat{\Sigma}$. We then generate i.i.d. normal random vectors with covariance matrix $\widehat{\Sigma}$, and compute the maximum element of each of these vectors in absolute value. Finally, we use these maximum absolute values to approximate the distribution of $T$ under the null. We summarize this procedure in Algorithm 2.
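Before the formal statement in Algorithm 2, here is a minimal sketch of this multiplier bootstrap; it assumes the array of $\psi_{b_1,b_2,i}$ values returned by the test statistic computation above.

```python
import numpy as np

def bootstrap_pvalue(T, psi, J=1000, seed=None):
    """Multiplier-bootstrap p-value for the maximum-type statistic.

    psi: the (B, B, n) array of standardized summands; the null law of T is
    approximated by max-abs coordinates of N(0, Sigma_hat) draws, where
    Sigma_hat is the sample covariance of the flattened psi array.
    """
    rng = np.random.default_rng(seed)
    B1, B2, n = psi.shape
    V = psi.reshape(B1 * B2, n)                 # rows: B^2 coordinates; columns: samples
    Sigma = np.cov(V)                           # (B^2, B^2) sample covariance
    w, U = np.linalg.eigh(Sigma)                # PSD square root (Sigma may be singular)
    root = (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T
    Zs = rng.standard_normal((J, B1 * B2))
    T_boot = np.abs(Zs @ root).max(axis=1)      # ||Sigma_hat^{1/2} Z_j||_infinity
    return float(np.mean(T_boot >= T))
```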
Algorithm 2
Compute the p-value.
Input: number of bootstrap samples $J$, and $\{\psi_{b_1, b_2, i}\}_{b_1, b_2, i}$.
Step 1: Compute the $B^2 \times B^2$ matrix $\widehat{\Sigma}$ whose $\{b_1 + B(b_2 - 1),\, b_3 + B(b_4 - 1)\}$-th entry is given by $(n-1)^{-1} \sum_{i=1}^{n} (\psi_{b_1, b_2, i} - \bar{\psi}_{b_1, b_2})(\psi_{b_3, b_4, i} - \bar{\psi}_{b_3, b_4})$, where $\bar{\psi}_{b_1, b_2} = n^{-1} \sum_{i=1}^{n} \psi_{b_1, b_2, i}$.
Step 2: Generate i.i.d. standard normal variables $Z_{j,b}$ for $j = 1, \ldots, J$ and $b = 1, \ldots, B^2$. Set $Z_j = (Z_{j,1}, \ldots, Z_{j,B^2})^\top$ and $\widetilde{T}_j = \|\widehat{\Sigma}^{1/2} Z_j\|_\infty$, where $\widehat{\Sigma}^{1/2}$ is a positive semi-definite matrix satisfying $\widehat{\Sigma}^{1/2} \widehat{\Sigma}^{1/2} = \widehat{\Sigma}$, and $\|\cdot\|_\infty$ denotes the largest element of a vector in absolute value.
Step 3: Compute the p-value, $p = J^{-1} \sum_{j=1}^{J} \mathbb{I}(\widetilde{T}_j \ge T)$.

Estimating the conditional distributions. We adopt the proposal of Genevay et al. (2017) to learn the conditional distributions $P_{X|Z}$ and $P_{Y|Z}$. Recall that $\widetilde{P}_{X|Z}$ is the distribution of the pseudo outcome generated by the generator $G_X$ given $Z$. We consider estimating $P_{X|Z}$ by optimizing $\min_{G_X} \max_{c} \widetilde{D}_{c,\epsilon}(P_{X|Z}, \widetilde{P}_{X|Z})$, where $\widetilde{D}_{c,\epsilon}$ denotes the Sinkhorn loss between two probability measures with respect to some cost function $c$ and some regularization parameter $\epsilon > 0$; a detailed definition of $\widetilde{D}_{c,\epsilon}$ is given in the appendix. Intuitively, the closer the two probability measures, the smaller the Sinkhorn loss. As such, maximizing the loss with respect to the cost function learns a discriminator that can better discriminate the samples generated from $P_{X|Z}$ and from $\widetilde{P}_{X|Z}$. On the other hand, minimizing the maximum loss with respect to the generator $G_X$ pushes $\widetilde{P}_{X|Z}$ toward the true distribution $P_{X|Z}$. This yields the minimax formulation $\min_{G_X} \max_{c} \widetilde{D}_{c,\epsilon}(P_{X|Z}, \widetilde{P}_{X|Z})$ that we target. In practice, we approximate the cost and the generator by neural networks, and the integrations in the objective $\widetilde{D}_{c,\epsilon}(P_{X|Z}, \widetilde{P}_{X|Z})$ are approximated by sample averages. A pseudocode detailing our learning procedure is given in the appendix. The conditional distribution $P_{Y|Z}$ is estimated similarly.
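As a rough illustration of the inner loss only, the sketch below evaluates an entropic optimal-transport cost and the corresponding debiased Sinkhorn divergence between two samples, with the ground cost fixed to squared Euclidean distance. The paper's actual procedure additionally learns the cost adversarially and updates the generator by gradient steps, as detailed in the appendix; we do not reproduce that here.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_cost(x, y, eps=1.0, iters=300):
    """Entropic optimal-transport cost between samples x (n, d) and y (m, d),
    with uniform weights and squared-Euclidean ground cost, via log-domain
    Sinkhorn iterations."""
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)     # (n, m) cost matrix
    n, m = C.shape
    loga, logb = -np.log(n), -np.log(m)
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(iters):                                 # dual potential updates
        f = -eps * logsumexp((g[None, :] - C) / eps + logb, axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps + loga, axis=0)
    P = np.exp((f[:, None] + g[None, :] - C) / eps + loga + logb)  # transport plan
    return float((P * C).sum())

def sinkhorn_divergence(x, y, eps=1.0):
    """Debiased Sinkhorn divergence; vanishes when both samples share a law."""
    return (sinkhorn_cost(x, y, eps)
            - 0.5 * sinkhorn_cost(x, x, eps)
            - 0.5 * sinkhorn_cost(y, y, eps))
```

In a training loop, `x` would stack observed pairs $(X_i, Z_i)$ and `y` generated pairs $(\widetilde{X}_i, Z_i)$, with the generator updated to decrease the divergence.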
4 Theoretical Properties

To derive the theoretical properties of the test statistic $T$, we first introduce the notion of an "oracle" test statistic $T^*$. If $P_{X|Z}$ and $P_{Y|Z}$ were known a priori, then one could draw $\{X^{(m)}_i\}_m$ and $\{Y^{(m)}_i\}_m$ from $P_{X|Z=Z_i}$ and $P_{Y|Z=Z_i}$ directly, and compute the test statistic by replacing $\{\widetilde{X}^{(m)}_i\}_m$ and $\{\widetilde{Y}^{(m)}_i\}_m$ with $\{X^{(m)}_i\}_m$ and $\{Y^{(m)}_i\}_m$. We call the resulting statistic $T^*$ the "oracle" test statistic.

We next establish the double-robustness property of $T$, which helps explain why our proposed test can relax the requirement in (3). Informally speaking, double-robustness means that $T$ is asymptotically equivalent to $T^*$ when either the conditional distribution of $X \mid Z$, or that of $Y \mid Z$, is well approximated by the GANs.

Theorem 1 (Double-robustness). Suppose $H_1$, $H_2$ are bounded function classes, $M$ is proportional to $n$, and $B = O(n^c)$ for some constant $c > 0$. Suppose $\min_{h_1 \in H_1, h_2 \in H_2} \mathrm{Var}[\{h_1(X) - E(h_1(X) \mid Z)\}\{h_2(Y) - E(h_2(Y) \mid Z)\}] \ge c^*$ for some constant $c^* > 0$. Then $T - T^* = o_p(1)$ when, for each $\ell = 1, \ldots, L$, either $\big(E[d^2_{TV}\{\widetilde{P}^{(\ell)}_{X|Z}, P_{X|Z}\}]\big)^{1/2} = o(1)$ or $\big(E[d^2_{TV}\{\widetilde{P}^{(\ell)}_{Y|Z}, P_{Y|Z}\}]\big)^{1/2} = o(1)$.

A consequence of the double-robustness is that, when both total variation distances converge to zero, $T - T^*$ converges at a faster rate than these total variation distances. Therefore, we can greatly relax the condition in (3) and replace it with: for each $\ell = 1, \ldots, L$,
\[
\big[E\{d^2_{TV}(\widetilde{P}^{(\ell)}_{X|Z}, P_{X|Z})\}\big]^{1/2} = O(n^{-\kappa}) \quad \text{and} \quad \big[E\{d^2_{TV}(\widetilde{P}^{(\ell)}_{Y|Z}, P_{Y|Z})\}\big]^{1/2} = O(n^{-\kappa}), \tag{6}
\]
for some constant $0 < \kappa \le 1/2$, where $\widetilde{P}^{(\ell)}_{X|Z}$ and $\widetilde{P}^{(\ell)}_{Y|Z}$ denote the conditional distributions approximated via the GANs trained on the $\ell$-th subset of the data. The next theorem summarizes this discussion.
Theorem 2. Suppose the conditions of Theorem 1 and condition (6) hold. Then $T - T^* = O_p(n^{-2\kappa})$.

Since $\kappa > 0$, the convergence rate of $T - T^*$ is faster than the rate $n^{-\kappa}$ in (6). To ensure $\sqrt{n}(T - T^*) = o_p(1)$, it suffices to require $\kappa > 1/4$. In contrast to (3), this rate is achievable; the condition holds in a wide range of settings, and we consider two examples to illustrate.

Example 3 (Parametric setting).
Suppose the parametric forms of $P_{X|Z}$ and $P_{Y|Z}$ are correctly specified. Then the requirement $\kappa > 1/4$ holds if $k = O(n^c)$ for some $c < 1/2$, where $k$ is the dimension of the parameters defining the parametric model.

Example 4 (Nonparametric setting with binary data).
Suppose $X$ and $Y$ are binary variables. Then it suffices to estimate the conditional means of $X$ and $Y$ given $Z$. The requirement $\kappa > 1/4$ holds if the mean squared prediction errors of both nonparametric estimators are $O(n^{-\kappa_0})$ for some $\kappa_0 > 1/2$.

More detailed discussion of the above two examples can be found in Section 5.1 of Berrett et al. (2019).
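To make the rate arithmetic behind these two examples explicit, here is a back-of-the-envelope derivation; it is a sketch assuming a Pinsker-type bound and the standard parametric estimation rate in Example 3, and uses the fact that the total variation distance between two Bernoulli conditionals equals the absolute difference of their success probabilities in Example 4:
\[
\text{Ex. 3:} \quad E\, d^2_{TV}(\widetilde{P}_{X|Z}, P_{X|Z}) \,\lesssim\, \tfrac{1}{2}\, E\, \mathrm{KL}(P_{X|Z} \,\|\, \widetilde{P}_{X|Z}) \,\lesssim\, \frac{k}{n} = n^{c-1}, \quad \text{so } \kappa = \frac{1-c}{2} > \frac{1}{4} \iff c < \frac{1}{2};
\]
\[
\text{Ex. 4:} \quad d_{TV}(\widetilde{P}_{X|Z}, P_{X|Z}) = |\widehat{p}(Z) - p(Z)|, \quad \text{so } E\, d^2_{TV} = O(n^{-\kappa_0}) \text{ gives } \kappa = \frac{\kappa_0}{2} > \frac{1}{4} \iff \kappa_0 > \frac{1}{2}.
\]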
Next, we establish the size (type-I error) and the power properties of our proposed test.

Theorem 3 (Type-I error). Suppose the conditions of Theorem 1 hold, and (6) holds for some $\kappa > 1/4$. Then the p-value from Algorithm 2 satisfies $P(p \le \alpha \mid H_0) = \alpha + o(1)$.

Theorem 4 (Power).
Suppose the conditions of Theorem 3 hold. Suppose $B$ diverges to infinity with $n$, and $\phi_{XY} \ge c^*$ for some constant $c^* > 0$ under $H_1$. For any $\theta_{1,1}, \theta_{1,2} \in \mathbb{R}^{d_1}$ and $\theta_{2,1}, \theta_{2,2} \in \mathbb{R}^{d_2}$, suppose there exists some constant $C > 0$ such that $E|h_{1,\theta_{1,1}}(X) - h_{1,\theta_{1,2}}(X)| \le C \|\theta_{1,1} - \theta_{1,2}\|_2$ and $E|h_{2,\theta_{2,1}}(Y) - h_{2,\theta_{2,2}}(Y)| \le C \|\theta_{2,1} - \theta_{2,2}\|_2$. Then the p-value from Algorithm 2 satisfies $P(p \le \alpha \mid H_1) \to 1$ as $n \to \infty$.

Theorem 3 essentially shows that our proposed test controls the type-I error, whereas Theorem 4 shows that the power of our test approaches one as the sample size approaches infinity. In Theorem 4, we require $B$ to diverge to infinity, at an arbitrary polynomial order in the sample size, to ensure that the type-II error decays to zero with the sample size. Such a condition is not needed to establish the size property in Theorem 3. The dimensions $d_1$ and $d_2$ of the parameters indexing the function spaces $H_1$ and $H_2$ are held fixed in Theorem 4 to simplify the analysis. However, we allow the dimension $d_Z$ of $Z$ to diverge with $n$ in Theorems 3 and 4. Whereas there is no explicit requirement on $d_Z$, implicitly this requirement is embedded in (6), as the rate $\kappa$ in (6) is expected to decay with $d_Z$.

The time complexity of our testing procedure is dominated by Step 2 of Algorithm 1, where we use GANs to estimate the conditional distributions $P_{X|Z}$ and $P_{Y|Z}$. The complexity of each SGD iteration is $O(RN)$; see the appendix. All experiments were run on 16 N1 CPUs on the Google Cloud Computing platform; the wall clock time for computing a single test statistic was about 3 minutes.

5 Experiments

Synthetic data. We generate synthetic data following the post non-linear noise model, similarly as in Zhang et al. (2011); Doran et al. (2014); Bellot & van der Schaar (2019): $X = \sin(a_f^\top Z + \varepsilon_f)$ and $Y = \cos(a_g^\top Z + bX + \varepsilon_g)$. The entries of $a_f$ and $a_g$ are randomly and uniformly sampled from $[0, 1]$, then normalized to unit norm. The noise variables $\varepsilon_f$ and $\varepsilon_g$ are independently sampled from a zero-mean normal distribution with a small fixed variance. In this model, the parameter $b$ determines the degree of conditional dependence: when $b = 0$, $H_0$ holds, and otherwise $H_1$ holds. The sample size is fixed at $n = 1000$.

We call our proposed test DGCIT, short for double GANs-based conditional independence test. We compare it with the GCIT test of Bellot & van der Schaar (2019), the regression-based test (RCIT) of Shah & Peters (2018), and the kernel MMD-based test (KCIT) of Zhang et al. (2011).
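A sketch of this data-generating process is below; the noise standard deviation is an assumed placeholder, since the exact noise variance used in the paper is not recoverable from the text.

```python
import numpy as np

def generate_data(n=1000, d_z=100, b=0.0, laplace=False, noise_sd=0.5, seed=None):
    """Post non-linear noise model of the synthetic study:
    X = sin(a_f' Z + eps_f), Y = cos(a_g' Z + b X + eps_g); b = 0 gives H0.

    noise_sd is an assumed placeholder value."""
    rng = np.random.default_rng(seed)
    Z = rng.laplace(size=(n, d_z)) if laplace else rng.standard_normal((n, d_z))
    a_f = rng.uniform(0.0, 1.0, d_z); a_f /= np.linalg.norm(a_f)   # uniform entries, unit norm
    a_g = rng.uniform(0.0, 1.0, d_z); a_g /= np.linalg.norm(a_g)
    X = np.sin(Z @ a_f + noise_sd * rng.standard_normal(n))
    Y = np.cos(Z @ a_g + b * X + noise_sd * rng.standard_normal(n))
    return X, Y, Z
```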
Type-I error under $H_0$. We vary the dimension of $Z$ as $d_Z = 50, 100, 150, 200, 250$, and consider two generating distributions: we first generate $Z$ from a standard normal distribution, then from a Laplace distribution. We set the significance level at $\alpha = 0.05$ and $0.1$. The top panels of Figure 2 report the empirical size of the tests aggregated over 500 data replications. We make the following observations. First, the type-I error rates of our test and RCIT are close to or below the nominal level in nearly all cases. Second, KCIT fails, in that its type-I error is considerably larger than the nominal level in all cases. Third, GCIT has an inflated type-I error in some cases; for instance, when $Z$ is normal with $d_Z = 250$ and $\alpha = 0.1$, its empirical size substantially exceeds the nominal level. This is consistent with our discussion in Section 2.1, as GCIT requires a very strong condition to control the type-I error.

Power under $H_1$. We generate $Z$ from a standard normal distribution with $d_Z = 100$ or $200$, and vary the coefficient $b$ that controls the magnitude of the alternative. The bottom panels of Figure 2 report the empirical power of the tests aggregated over 500 data replications. We observe that our test is the most powerful, and its empirical power approaches 1 as $b$ grows, demonstrating the consistency of the test. Meanwhile, both GCIT and RCIT have essentially no power in all cases. We do not report the power of KCIT because, as shown above, it cannot control the size, so its empirical power is not meaningful.

Figure 2: Top panels: the empirical type-I error rate of the tests under $H_0$; from left to right: normal $Z$ with $\alpha = 0.05$, normal $Z$ with $\alpha = 0.1$, Laplacian $Z$ with $\alpha = 0.05$, and Laplacian $Z$ with $\alpha = 0.1$. Bottom panels: the empirical power of the tests under $H_1$; from left to right: $d_Z = 100$ with $\alpha = 0.05$, $d_Z = 100$ with $\alpha = 0.1$, $d_Z = 200$ with $\alpha = 0.05$, and $d_Z = 200$ with $\alpha = 0.1$.

Real data application. We illustrate our proposed test with an anti-cancer drug dataset from the Cancer Cell Line Encyclopedia (Barretina et al., 2012). We concentrate on a subset, the CCLE data, that measures the treatment response to the drug PLX4720. It is well known that a patient's cancer treatment response can be strongly influenced by alterations in the genome (Garnett et al., 2012). These data measure 1638 genetic mutations of $n = 472$ cell lines, and the goal of our analysis is to determine which genetic mutations are significantly correlated with the drug response after conditioning on all other mutations. The same data were also analyzed in Tansey et al. (2018) and Bellot & van der Schaar (2019). We adopt the same screening procedure as theirs to screen out irrelevant mutations, which leaves a total of 466 potential mutations for our conditional independence testing.

The ground truth is unknown for these data. Instead, we compare with the variable importance measures obtained from fitting an elastic net (EN) model and a random forest (RF) model, as reported in Barretina et al. (2012). In addition, we compare with the GCIT test of Bellot & van der Schaar (2019). Table 1 reports the corresponding variable importance measures and the p-values for the 10 mutations that were also reported by Bellot & van der Schaar (2019). We see that the p-values of the tests generally agree well with the variable importance measures from the EN and RF models. Meanwhile, the two conditional independence tests agree relatively well, except for two genetic mutations, MAP3K5 and FTL3. GCIT concluded that MAP3K5 is significant ($p < 0.001$) but FTL3 is not ($p = 0.521$), whereas our test leads to the opposite conclusion that MAP3K5 is insignificant ($p = 0.794$) but FTL3 is significant ($p = 0$). Besides, both EN and RF place FTL3 as an important mutation. We then compare our findings with the cancer drug response literature. MAP3K5 has not previously been reported as being directly linked to the PLX4720 drug response, whereas there is strong evidence connecting the FLT3 mutation with cancer response (Tsai et al., 2008; Larrosa-Garcia & Baer, 2017).

Table 1: The variable importance measures of the elastic net (EN) and random forest (RF) models, versus the p-values of the GCIT and DGCIT tests, for the anti-cancer drug example.

        BRAF.V600E  BRAF.MC  HIP1   FTL3   CDC42BPA  THBS3  DNMT1  PRKD1  PIP5K1A  MAP3K5
EN      1           3        4      5      7         8      9      10     19       78
RF      1           2        3      14     8         34     28     18     7        9
GCIT    <0.001      <0.001   0.008  0.521  0.050     0.013  0.020  0.002  0.001    <0.001
DGCIT   0           0        0      0      0         0      0      0      0        0.794
Combining the existing literature with our theoretical and synthetic results, we have more confidence in the findings of our proposed test.

6 Discussion

Our test statistic is constructed based on $\phi_{XY}$ in (4). Meanwhile, we may consider another test based on $\phi_{XYZ} = \mathrm{MMD}(P_{XYZ}, Q_{XYZ} \mid H_1 \otimes H_2 \otimes H_3)$, where $P_{XYZ}$ is the joint distribution of $(X, Y, Z)$, $Q_{XYZ} = P_{X|Z}\, P_{Y|Z}\, P_Z$, and $H_3$ is a class of square integrable functions of $Z$. This type of test may be more powerful for certain alternative hypotheses, and we leave it as future research.

References
Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A. A., Kim, S., Wilson, C. J., Lehár, J., Kryukov, G. V., Sonkin, D., et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603-607, 2012.

Bellot, A. and van der Schaar, M. Conditional independence testing using generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 2199-2208, 2019.

Bergsma, W. P. Testing Conditional Independence for Continuous Random Variables. Eurandom, 2004.

Berrett, T. B., Wang, Y., Barber, R. F., and Samworth, R. J. The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society: Series B (Statistical Methodology), accepted, 2019.

Candes, E., Fan, Y., Janson, L., and Lv, J. Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551-577, 2018.

Doran, G., Muandet, K., Zhang, K., and Schölkopf, B. A permutation-based kernel conditional independence test. In UAI, pp. 132-141, 2014.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, pp. 489-496, 2008.

Garnett, M., Edelman, E., Gill, S., Greenman, C., Dastur, A., Lau, K., Greninger, P., Thompson, R., Luo, X., Soares, J., Liu, Q., Iorio, F., Surdez, D., Chen, L., Milano, R., Bignell, G., Tam, A., Davies, H., Stevenson, J., and Benes, C. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature, 483:570-575, 2012. doi: 10.1038/nature11005.

Genevay, A., Peyré, G., and Cuturi, M. Learning generative models with Sinkhorn divergences. arXiv preprint arXiv:1706.00292, 2017.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.

Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., and Schölkopf, B. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, pp. 689-696, 2009.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. MIT Press, 2009.

Larrosa-Garcia, M. and Baer, M. R. FLT3 inhibitors in acute myeloid leukemia: current status and future directions. Molecular Cancer Therapeutics, 16(6):991-1001, 2017.

Li, B. Sufficient Dimension Reduction: Methods and Applications with R. CRC Press, 2018.

Li, C. and Fan, X. On nonparametric conditional independence tests for continuous variables. Wiley Interdisciplinary Reviews: Computational Statistics, pp. e1489, 2019.

Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, 2nd edition, 2009.

Romano, J. and DiCiccio, C. Multiple data splitting for testing. Technical report, 2019.

Sen, R., Suresh, A. T., Shanmugam, K., Dimakis, A. G., and Shakkottai, S. Model-powered conditional independence test. In Advances in Neural Information Processing Systems, pp. 2951-2961, 2017.

Shah, R. D. and Peters, J. The hardness of conditional independence testing and the generalised covariance measure. arXiv preprint arXiv:1804.07203, 2018.

Su, L. and White, H. A consistent characteristic function-based test for conditional independence. Journal of Econometrics, 141(2):807-834, 2007.

Su, L. and White, H. Testing conditional independence via empirical likelihood. Journal of Econometrics, 182(1):27-44, 2014.

Tansey, W., Veitch, V., Zhang, H., Rabadan, R., and Blei, D. M. The holdout randomization test: principled and easy black box feature selection. arXiv preprint arXiv:1811.00645, 2018.

Tsai, J., Lee, J. T., Wang, W., Zhang, J., Cho, H., Mamo, S., Bremer, R., Gillette, S., Kong, J., Haass, N. K., Sproesser, K., Li, L., Smalley, K. S. M., Fong, D., Zhu, Y.-L., Marimuthu, A., Nguyen, H., Lam, B., Liu, J., Cheung, I., Rice, J., Suzuki, Y., Luu, C., Settachatgul, C., Shellooe, R., Cantwell, J., Kim, S.-H., Schlessinger, J., Zhang, K. Y. J., West, B. L., Powell, B., Habets, G., Zhang, C., Ibrahim, P. N., Hirth, P., Artis, D. R., Herlyn, M., and Bollag, G. Discovery of a selective inhibitor of oncogenic B-Raf kinase with potent antimelanoma activity. Proceedings of the National Academy of Sciences, 105(8):3041-3046, 2008. doi: 10.1073/pnas.0711741105.

Wang, X., Pan, W., Hu, W., Tian, Y., and Zhang, H. Conditional distance correlation. Journal of the American Statistical Association, 110(512):1726-1734, 2015.

Zhang, H., Zhou, S., and Guan, J. Measuring conditional independence by independent residuals: theoretical results and application in causal discovery. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. Kernel-based conditional independence test and application in causal discovery. In 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), 2011.