A Critical View of the Structural Causal Model
Tomer Galanti Ofir Nabati Lior Wolf
Abstract
In the univariate case, we show that by comparing the individual complexities of univariate cause and effect, one can identify the cause and the effect, without considering their interaction at all. In our framework, complexities are captured by the reconstruction error of an autoencoder that operates on the quantiles of the distribution. Comparing the reconstruction errors of the two autoencoders, one for each variable, is shown to perform surprisingly well on the accepted causality directionality benchmarks. Hence, the decision as to which of the two is the cause and which is the effect may not be based on causality but on complexity.

In the multivariate case, where one can ensure that the complexities of the cause and effect are balanced, we propose a new adversarial training method that mimics the disentangled structure of the causal model. We prove that in the multidimensional case, such modeling is likely to fit the data only in the direction of causality. Furthermore, a uniqueness result shows that the learned model is able to identify the underlying causal and residual (noise) components. Our multidimensional method outperforms the literature methods on both synthetic and real world datasets.
1. Introduction
A long standing debate in the causality literature is whether causality can be inferred without intervention (Pearl, 2009; Spirtes et al., 2000). The Structural Causal Model (SCM) (Spirtes et al., 2000) is a simple causative model for which many results demonstrate the possibility of such inference (Stegle et al., 2010; Bloebaum et al., 2018; Goudet et al., 2018; Lopez-Paz et al., 2017; 2015). In this model, the effect (Y) is a function of the cause (X) and some independent random noise (E).

School of Computer Science, Tel Aviv University, Israel; Facebook AI Research (FAIR). Correspondence to: Tomer Galanti <[email protected]>, Lior Wolf <[email protected]>.

In this work, we take a critical perspective of the univariate SCM. We demonstrate empirically that for the univariate case, which is the dominant case in the existing literature, the SCM leads to an effect that has a lower complexity than the cause. Therefore, one can identify the cause and the effect by measuring their individual complexities, with no need to make the inference based on both variables simultaneously. Thus, the decision as to which of the two is the cause and which is the effect may not be based on causality but on complexity.

Since we are dealing with unordered univariate random variables, the complexity measure has to be based on the probability distribution function. As we show empirically, comparing the entropies of the distributions of two random variables is ineffective for inferring the causal direction. We, therefore, consider the quantiles, i.e., fixed sized vectors that are obtained as sub-sequences of the sorted sampled values of the variable.

We consider suitable complexity scores for these vectors. In our analysis, we show that the reconstruction error of an autoencoder of a multivariate random variable is a valid complexity measure. In addition, we link the reconstruction error based complexity, in the case of variational autoencoders, to the differential entropy of the input random variable.
Hence, by computing the reconstruction errors of trained autoencoders on these vectors, we estimate the entropies of the quantile vectors of X and Y.

The challenges of measuring causality independently of complexity in the 1D case lead us to consider the multidimensional case, where the complexity can be controlled by, for example, manipulating the dimension of the noise signal in the SCM. Note that unlike (Goudet et al., 2018), we consider pairs of multivariate vectors and not many univariate variables in a graph structure. We demonstrate that for the multidimensional case, any method that is based on comparing the complexity of the individual random variables X and Y fails to infer causality. Furthermore, we extend a related univariate result by (Zhang & Hyvärinen, 2010) to the multidimensional case and prove that an SCM is unlikely to hold in both directions X → Y and Y → X.

Based on our observations, we propose a new causality inference method for multidimensional cause and effect. The algorithm learns three networks in a way that mimics the parts of the SCM. The noise part is unknown and is replaced by a function that is constrained to be independent of the cause, as captured by an adversarial loss. However, we show empirically that even without the explicit constraint, in several cases, such an independence emerges.

Our empirical results support our analysis and demonstrate that in the univariate case, assigning cause and effect based on complexity is competitive with the state of the art methods. In the multidimensional case, we show that the proposed method outperforms existing multivariate methods, as well as new extensions of univariate literature methods.
1.1. Problem Setup

We investigate the problem of causal inference from observational data. A non-linear structural causal model (SCM for short) is a generative process of the following form:

X ∼ P_X,   E ∼ P_E,   Y ← g(f(X), E)   (1)

The functions g : R^{d_f + d_e} → R^{d_y} and f : R^{d_x} → R^{d_f} are fixed and unknown. In general, g and f are non-linear. Here, X is the input random variable and E is the environment random variable, which is independent of X. We say that X ∈ R^{d_x} causes Y ∈ R^{d_y} if they satisfy a generative process such as Eq. 1.

We present methods for inferring whether X causes Y (denoted by X → Y) or Y causes X, or neither. The algorithm is provided with i.i.d samples {(x_i, y_i)}_{i=1}^m ∼ P^m_{X,Y} (the distribution of m i.i.d samples from the joint distribution P_{X,Y}) from the generative process of Eq. 1. In general (cf. Prop 4.8, (Peters et al., 2017)), for any joint distribution P_{X,Y} of two random variables X and Y, there is an SCM, Y = g(f(X), E), where E is a noise variable such that X ⊥ E and f, g are some (measurable) functions. Therefore, in general, deciding whether X causes Y or vice versa is ill-posed when only provided with samples from the joint distribution. However, (Zhang & Hyvärinen, 2010) showed for the one dimensional case (i.e., X, Y ∈ R) that, under reasonable conditions, a representation Y = g(f(X) + E) holds only in one direction. In Sec. 3.2, we extend this theorem and show that a representation Y = g(f(X), E) holds only in one direction when g and f are assumed to be neural networks and X, Y are multidimensional (we call such SCMs neural SCMs).

Throughout the paper, we denote by P_U[u] := P[U ≤ u] the cumulative distribution function of a uni/multi-variate real valued random variable U, where P is the standard Lebesgue measure. Additionally, we denote by p_U(u) = (d/du) P_U[u] the probability density function of U (if it exists, i.e., if P_U[u] is absolutely continuous).
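For illustration, sampling from a generative process of the form of Eq. 1 takes only a few lines. In the sketch below, the specific mechanisms f and g and all dimensions are our own arbitrary choices, not ones used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# fixed (but unknown to the learner) mechanism parameters
W_f = rng.standard_normal((3, 2))   # f : R^3 -> R^2  (d_x = 3, d_f = 2)
W_g1 = rng.standard_normal((4, 2))  # g : R^4 -> R^2  (d_f + d_e = 4, d_y = 2)
W_g2 = rng.standard_normal((4, 2))

def f(x):
    return np.tanh(x @ W_f)

def g(fx, e):
    z = np.concatenate([fx, e], axis=1)   # g acts on (f(X), E) jointly
    return z @ W_g1 + np.sin(z @ W_g2)

m = 1000
X = rng.standard_normal((m, 3))  # X ~ P_X
E = rng.standard_normal((m, 2))  # E ~ P_E, drawn independently of X
Y = g(f(X), E)                   # Y <- g(f(X), E), as in Eq. 1
```

Causal inference then amounts to deciding, from the pairs (X, Y) alone, which side played the role of the cause.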
We denote by E_{u∼U}[f(u)] the expected value of f(u) for u that is distributed by P_U[u]. The identity matrix of dimension n × n is denoted by I_n, or I when the dimension is obvious from the context.

1.2. Related Work

In causal inference, the algorithm is provided with a dataset of matched samples (x, y) of two random variables X and Y and decides whether X causes Y or vice versa. The early wisdom in this area asserted that the asymmetry of the data generating process (i.e., that Y is computed from X and not vice versa) is not apparent from looking at P_{X,Y} alone. That is, in general, being provided with samples from the joint distribution P_{X,Y} of two variables
X, Y does not tell us whether it has been induced by an SCM from X to Y or from Y to X.

In publications such as (Pearl, 2009; Spirtes et al., 2000), it is argued that in order to decide whether X causes Y or vice versa, one needs to observe the influence of interventions on the environment parameter. To avoid employing interventions, most publications assume prior knowledge on the generating process and/or independence between the cause and the mechanism.

Various methods for causal inference under the SCM have been suggested. Many of these methods are based on independence testing, where the algorithm models the data as Y = g(f(X), E) (and vice versa) and decides upon the side that provides a better fit in terms of mapping accuracy and independence between f(X) and E = r(X, Y). The LiNGAM (Shimizu et al., 2006) algorithm assumes that the SCM takes the form Y = βX + E, where X ⊥ E, β ∈ R and E is non-Gaussian. The algorithm learns β such that X and Y − βX are independent, by applying independent component analysis (ICA). Direct-LiNGAM (Shimizu et al., 2011) extends this method and replaces the mutual information minimization with a non-parametric kernel based loss (Bach & Jordan, 2003). However, the computation of this loss is of order Θ(m) in the worst case (m is the number of samples).

The ANM approach (Hoyer et al., 2009) extends LiNGAM's modeling and assumes that Y = f(X) + E, where X ⊥ E. A Gaussian Process is employed as the learned mechanism between the two random variables. The function f is trained to map between X and Y (and vice versa) and the method then tests whether X and f(X) − Y are independent. The independence test is based on kernels (Gretton et al., 2005).

A different extension of LiNGAM is the PNL algorithm by (Zhang & Hyvärinen, 2010). This algorithm learns a mapping between X and Y (and vice versa) of the form Y = g(f(X) + E), where f(X) and E are restricted to be independent.
To do so, PNL trains two neural networks f and g to minimize the mutual information between f(X) and E = g⁻¹(Y) − f(X). The main disadvantage of this method is the reliance on the minimization of the mutual information. It is often hard to measure and optimize the mutual information directly, especially in higher dimensions. In many cases, it requires an explicit modeling of the density functions, due to the computation of the expected log-probability within the formulation of the entropy measure.

In our multivariate method, we take a similar approach to the above methods. However, our GAN-based independence constraint is non-parametric, is applied on the observations rather than on an explicit modeling of the density functions, and the method is computationally efficient. In addition, we do not make restrictive structural assumptions and treat the generic case, where the effect is of the form Y = g(f(X), E).

Another independence constraint is applied by the Information Geometric Causal Inference (IGCI) (Daniusis et al., 2012) approach, which determines the causal relationship in a deterministic setting Y = f(X), under an independence assumption between the cause X and the mechanism f: Cov(log f′(x), p_X) = 0.

The Conditional Distribution Similarity Statistic (CDS) (Fonollosa, 2016) measures the standard deviation of the values of Y (resp. X) after binning in the X (resp. Y) direction. The lower the standard deviation, the more likely the pair is to be X → Y. The CURE algorithm (Sgouritsa et al., 2015) compares the X → Y and Y → X directions in the following manner: if we can estimate p_{X|Y} based on samples from p_Y more accurately than p_{Y|X} based on samples from p_X, then X → Y is inferred.

The BivariateFit method learns a Gaussian Process regressor in both directions and decides upon the side that had the lowest error.
The RECI method (Bloebaum et al., 2018) trains a regression model (a logistic function, polynomial functions, support vector regression, or a neural network) in both directions, and returns the side that produced a lower MSE loss. The CGNN algorithm (Goudet et al., 2018) uses the Maximum Mean Discrepancy (MMD) distance between the distribution produced by modeling Y as an effect of X, (X, g(X, E)) (and vice versa), and the ground truth distribution. The algorithm compares the two distances and returns the direction that led to a smaller distance. The Gaussian Process Inference model (GPI) (Stegle et al., 2010) builds two generative models, one for X → Y and one for Y → X. The distribution of the candidate cause variable is modelled as a Gaussian Mixture Model, and the mechanism f is a Gaussian Process. The causal direction is determined from the generative model that best fits the data.

Finally, it is worth mentioning that several other methods, such as (Heinze-Deml et al., 2017; Zhang et al., 2011), assume a different type of SCM, where the algorithm is provided with separate datasets that correspond to different environments, i.e., sampled i.i.d from P_{X,Y|E}, where the value of E is fixed for all samples in the dataset. In these publications, a different independence condition is assumed: Y is independent of E given X. This assumption fails in our setting, since we focus on the vanilla SCM, where the algorithm is provided only with observational i.i.d. samples of X and Y = g(f(X), E), and the samples are not divided into subsets that are invariant w.r.t. E.
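To make the ANM recipe above concrete, here is a minimal sketch (our own toy version, not the original Gaussian-Process implementation): fit a regressor in each direction and compare how dependent the residual is on the putative cause, with a simple correlation statistic standing in for the kernel independence test.

```python
import numpy as np

def anm_dependence(a, b, deg=3):
    """Fit b = f(a) + residual with a polynomial and return a crude
    dependence score between the residual and a (lower = more independent)."""
    coeffs = np.polyfit(a, b, deg)
    resid = b - np.polyval(coeffs, a)
    # crude stand-in for a kernel independence test
    return abs(np.corrcoef(a, resid)[0, 1]) + abs(np.corrcoef(a**2, resid**2)[0, 1])

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
y = x**3 + 0.3 * rng.standard_normal(2000)   # ground truth: X -> Y

forward = anm_dependence(x, y)    # residual should look independent of x
backward = anm_dependence(y, x)   # no additive model fits this direction well
# ANM infers X -> Y when forward < backward
```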
2. The Univariate Case
In this section, we show that the univariate SCM does not necessarily capture causality. For this purpose, we describe a method for identifying the cause and the effect, which considers each of the two variables independently, without considering the mapping between them. The success of this method, despite neglecting any interaction between the variables, indicates that univariate SCM challenges can be solved without considering causality.

The proposed method computes a complexity score for X and, independently, for Y. It then compares the scores and decides that the cause is the random variable with the larger score among them. Capturing the complexity of a univariate random variable without being able to anchor the observations in additional features is challenging. One can observe the probability distribution function and compute, for example, its entropy. As we show empirically in Sec. 4, this is ineffective.

Our complexity scoring method, therefore, has a few stages. As a first step, it converts the random variable at hand (say, X) into a multivariate random variable. This is done by sorting the samples of the random variable, and then cutting the obtained list into fixed sized vectors of length k. We discard the largest measurements in the case where the number of samples is not a multiple of k. We denote the random variable obtained this way by U. At the second stage, the method computes the complexity of the obtained random variable U using an autoencoder reconstruction error.

One can consider the complexity of a multivariate random variable in various ways. We consider non-negative complexity measures C(X), which satisfy the weak assumption that when X and Y are independent, their complexities are lower than the complexity of their concatenation:

C(X, Y) ≥ max(C(X), C(Y)).   (2)
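The first stage, constructing U from the scalar samples, can be written directly. In this sketch (names ours), we cut the sorted samples into consecutive length-k chunks, which is one natural reading of "sub-sequences of the sorted sampled values":

```python
import numpy as np

def quantile_vectors(samples, k):
    """Sort the scalar samples and cut the sorted list into consecutive
    vectors of length k; the largest leftover measurements are discarded."""
    s = np.sort(np.asarray(samples))
    n = (len(s) // k) * k        # drop the tail so the length is a multiple of k
    return s[:n].reshape(-1, k)  # each row is one sample of U

rng = np.random.default_rng(0)
x = rng.standard_normal(1005)
U = quantile_vectors(x, k=250)   # 4 vectors; the 5 largest samples are dropped
```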
Examples of complexity measures that satisfy this condition are the Shannon Entropy and the Kolmogorov Complexity. The following lemma shows that a complexity that is based on autoencoder modeling is also in this family.

Let F = {H_d}_{d=1}^∞ be a family of classes of autoencoders A : R^d → R^d. Assume that the family F is closed to fixations, i.e., for any autoencoder A ∈ H_{d_1+d_2} and any fixed vector y* ∈ R^{d_2} (x* ∈ R^{d_1}), we have: A(·, y*)_{1:d_1} ∈ H_{d_1} (A(x*, ·)_{d_1+1:d_1+d_2} ∈ H_{d_2}). Here, v_{i:j} = (v_i, ..., v_j). Note that this is the typical situation when considering neural networks with biases.

Let X be a multivariate random variable of dimension d. We define the autoencoding complexity of X as follows:

C_F(X) := min_{A ∈ H_d} E_{x∼X}[ℓ(A(x), x)]   (3)

where ℓ(a, b) is some loss function.

Lemma 1.
Let {H_d}_{d=1}^∞ be a family of classes of autoencoders that is closed to fixations. Then the function C_F(X) is a proper complexity measure.

The AEQ method we propose estimates and compares the autoencoder reconstruction error of the quantile vectors of X and Y. It is important to note that this does not imply that the AEQ method compares the entropies of X and Y.

Once the random variable U is obtained as the quantiles of a random variable (either X or Y), our method trains an autoencoder A : R^k → R^k on U. A is trained to minimize the following objective:

L_recon(A) := E_{u∼U}[ℓ(A(u), u)]   (4)

where ℓ(a, b) is some loss function. In our implementation, we employ the L2 loss, defined as ℓ(a, b) = ‖a − b‖². Finally, the method uses the value of L_recon(A), which we refer to as the AEQ score, as a proxy for the complexity of X (a smaller loss means a lower complexity). It decides that X or Y is the cause, based on which side provides a higher AEQ score.

As we show in Sec. 4, the proposed causality-free method is as successful at solving SCM challenges as the leading literature methods. However, we do not propose it as a standalone method; rather, we develop it to show the shortcoming of the univariate SCM setting and the associated literature datasets.
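An end-to-end sketch of the AEQ decision rule. For a self-contained, deterministic example, we use the best rank-r linear autoencoder (i.e., PCA, by the Eckart-Young theorem) as a stand-in for the trained neural autoencoder of Eq. 4; the helper names and this substitution are ours:

```python
import numpy as np

def quantile_vectors(samples, k=250):
    s = np.sort(np.asarray(samples))
    return s[: (len(s) // k) * k].reshape(-1, k)

def aeq_score(samples, k=250, r=3):
    """Mean reconstruction error of a rank-r linear autoencoder
    on the quantile vectors -- a linear stand-in for Eq. 4."""
    U = quantile_vectors(samples, k)
    Uc = U - U.mean(axis=0)      # centering plays the role of bias terms
    svals = np.linalg.svd(Uc, compute_uv=False)
    # energy outside the top-r principal directions = reconstruction error
    return float(np.sum(svals[r:] ** 2) / U.shape[0])

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)                      # candidate variable 1
y = np.tanh(x) + 0.1 * rng.standard_normal(5000)   # candidate variable 2

# AEQ declares the variable with the HIGHER score to be the cause
cause_is_x = aeq_score(x) > aeq_score(y)
```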
3. The Multivariate Case
For the univariate case, one can consider the complexity of the X and Y variables of the SCM and infer directionality. We propose the AEQ complexity for this case, since more conventional complexities are ill-defined for unordered 1D data or, in the case of entropy, found to be ineffective.

The following technical lemma shows that for any complexity measure C, one cannot infer directionality in the multivariate SCM based on C.

Lemma 2.
Let C be a complexity measure of multivariate random variables (i.e., non-negative and satisfying Eq. 2). Then, there are triplets of random variables (X, E, Y) and (X̂, E, Y) and functions g and g′, such that Y = g(X, E), Y = g′(X̂, E), C(X) < C(Y) and C(X̂) > C(Y). Therefore, C cannot serve as a score for causal inference.

We now turn our attention to a new multivariate causality inference method.
3.1. Method

Our causality inference algorithm trains neural networks
G, F, R and D. The success of fitting these networks serves as the score for the causality test. The function F models the function f, G models g, and R(Y) aims to model the environment parameter E. In general, our method aims at solving the following objective:

min_{G,F,R} L_err(G, F, R) := (1/m) Σ_{i=1}^m ‖G(F(a_i), R(b_i)) − b_i‖²   s.t. A ⊥ R(B)   (5)

where A is either X or Y and B is the other option, and a_i = x_i, b_i = y_i or a_i = y_i, b_i = x_i accordingly. To decide whether X → Y or vice versa, we train a different triplet G, F, R for each direction and check whether we can minimize the mapping error L_err subject to independence. We decide upon a specific direction if the loss can be minimized subject to independence. In general, searching within the space of functions that satisfy A ⊥ R(B) is an intractable problem. However, we can replace the constraint with a loss term that is minimized when A ⊥ R(B).

Independence loss
We would like R(B) to capture the information encoded in E. Therefore, we restrict R(B) and A to be independent of each other. We propose an adversarial loss for this purpose, which is a modified version of a loss proposed by (Brakel & Bengio, 2017) and later analyzed by (Press et al., 2019).

This loss measures the discrepancy between the joint distribution P_{A,R(B)} and the product of the marginal distributions P_A × P_{R(B)}. Let d_F (d_R) be the dimension of F's (R's) output. To measure the discrepancy, we make use of a discriminator D : R^{d_a + d_R} → [0, 1] (d_a equals d_x or d_y, depending on whether A = X or A = Y) that minimizes the following term:

L_D(D; R) := (1/m) Σ_{i=1}^m ℓ(D(a_i, R(b_i)), 1) + (1/m) Σ_{i=1}^m ℓ(D(â_i, R(b̂_i)), 0)   (6)

where D is a discriminator network, and ℓ(p, q) = −(q log(p) + (1 − q) log(1 − p)) is the binary cross entropy loss for p ∈ [0, 1] and q ∈ {0, 1}. In addition, {(â_i, b̂_i)}_{i=1}^m are i.i.d samples from P_A × P_B. To create these samples, we sample â_i and b̂_i independently from the respective training sets {a_i}_{i=1}^m and {b_i}_{i=1}^m, and then arbitrarily match them into couples (â_i, b̂_i).

To restrict R(B) and A to be independent, R is trained to confuse the discriminator D, such that the two sets of samples are indistinguishable by D:

L_indep(R; D) := (1/m) Σ_{i=1}^m ℓ(D(a_i, R(b_i)), 0) + (1/m) Σ_{i=1}^m ℓ(D(â_i, R(b̂_i)), 1)   (7)

Full objective
The full objective of our method is then translated into the following program:

min_{G,F,R} L_err(G, F, R) + λ · L_indep(R; D),    min_D L_D(D; R)   (8)

where λ is some positive constant. The discriminator D minimizes the loss L_D(D; R) concurrently with the other networks. Our method decides whether X causes Y or vice versa by comparing the score L_err(G, F, R) of the two directions; a lower error means a better fit. The full description of the architecture employed for the encoders, the generator and the discriminator is given in Appendix A. A sensitivity experiment for the parameter λ is provided in Appendix B.

In addition to the success in fitting, we also measure the degree of independence between A and R(B). We denote by c_real the percentage of samples (a_i, b_i) that the discriminator classifies as 1, and by c_fake the percentage of samples (â_i, b̂_i) that are classified as 0. We note that when c_real ≈ 1 − c_fake, the discriminator is unable to discriminate between the two distributions, i.e., it is wrong in classifying half of the samples. We, therefore, use |c_real + c_fake − 1| as a measure of independence.

3.2. Analysis

In this section, we analyze the proposed method. In Thm. 1, we show that if X and Y admit an SCM in one direction, then they admit an SCM in the opposite direction only if the involved functions satisfy a specific partial differential equation.

Theorem 1 (Identifiability of neural SCMs). Let P_{X,Y} admit a neural SCM from X to Y as in Eq. 1, such that p_X and the activation functions of f and g are three-times differentiable. Then it admits a neural SCM from Y to X only if p_X, f, g satisfy Eq. 27 in the appendix.

This result generalizes the one-dimensional case presented in (Zhang & Hyvärinen, 2010), where a one-dimensional version of this differential equation is shown to hold in the analog case.

In the following theorem, we show that minimizing the proposed losses is sufficient to recover the different components, i.e., F(X) ∝ f(X) and R(Y) ∝ E, where A ∝ B means that A = f(B) for some invertible function f.
Theorem 2 (Uniqueness of Representation). Let P_{X,Y} admit a nonlinear model from X to Y as in Eq. 1, i.e., Y = g(f(X), E) for some random variable E ⊥ X. Assume that f and g are invertible. Let G, F and R be functions, such that L_err := E_{(x,y)∼(X,Y)}[‖G(F(x), R(y)) − y‖²] = 0, G and F are invertible functions, and X ⊥ R(Y). Then, F(X) ∝ f(X) and R(Y) ∝ E.

Here, L_err is the mapping error proposed in Eq. 5. In addition, the assumption X ⊥ R(Y) is enforced by the independence loss.

A more general result, which requires additional terminology, is stated as Thm. 3 in Appendix C. It extends Thm. 2 to the case where the mapping loss is not necessarily zero and the independence X ⊥ R(Y) is replaced by a discriminator-based independence measure. Thm. 3 also removes the assumption that the various mappings f, g and F, G are invertible. In this case, instead of showing that R(Y) ∝ E, we provide an upper bound on the reconstruction of E out of R(Y) (and vice versa) that improves as the training loss of G, F and R decreases.

To conclude our analysis: by Thm. 1, under reasonable assumptions, if X and Y admit a multivariate SCM in direction X → Y, then there is no such representation in the other direction. By Thm. 2, by training our method in both directions, one is able to capture the causal model in the correct direction, something that, by Thm. 1, is impossible in the other direction.
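Putting Eqs. 5-8 together, the quantities our method computes can be sketched in a few lines of numpy. Here, untrained toy networks stand in for G, F, R and D, and the label convention (joint pairs as 1, shuffled pairs as 0), the value of λ and all names are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_a, d_b, d_F, d_R = 500, 3, 2, 4, 2

# toy linear/tanh stand-ins for the learned networks
W_F = rng.standard_normal((d_a, d_F))
W_R = rng.standard_normal((d_b, d_R))
W_G = rng.standard_normal((d_F + d_R, d_b))
w_D = rng.standard_normal(d_a + d_R)

F = lambda a: np.tanh(a @ W_F)
R = lambda b: np.tanh(b @ W_R)
G = lambda fa, rb: np.concatenate([fa, rb], axis=1) @ W_G
D = lambda a, rb: 1 / (1 + np.exp(-np.concatenate([a, rb], axis=1) @ w_D))

a = rng.standard_normal((m, d_a))   # candidate cause samples
b = rng.standard_normal((m, d_b))   # candidate effect samples

# Eq. 5: mapping error of the direction A -> B
L_err = np.mean(np.sum((G(F(a), R(b)) - b) ** 2, axis=1))

# samples from the product of marginals: break the pairing with a permutation
perm = rng.permutation(m)
a_hat, b_hat = a[perm], b

def bce(p, q):                      # l(p, q) of Eq. 6
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(q * np.log(p) + (1 - q) * np.log(1 - p))

# Eq. 6: D labels joint pairs 1 and shuffled pairs 0
L_D = np.mean(bce(D(a, R(b)), 1)) + np.mean(bce(D(a_hat, R(b_hat)), 0))
# Eq. 7: R is trained with flipped labels to fool D
L_indep = np.mean(bce(D(a, R(b)), 0)) + np.mean(bce(D(a_hat, R(b_hat)), 1))
# Eq. 8 (generator side)
lam = 0.1
L_total = L_err + lam * L_indep

# independence diagnostic |c_real + c_fake - 1|
c_real = np.mean(D(a, R(b)) > 0.5)            # joint pairs classified as 1
c_fake = np.mean(D(a_hat, R(b_hat)) <= 0.5)   # shuffled pairs classified as 0
ind_score = abs(c_real + c_fake - 1.0)
```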
4. Experiments
This section is divided into two parts. The first is devoted to showing that causal inference in the one-dimensional case highly depends on the complexities of the distributions of X and Y. In the second part, we show that our multivariate causal inference method outperforms existing baselines. Most of the baseline implementations were taken from the Causality Discovery Toolbox of (Kalainathan & Goudet, 2019). The experiments with PNL (Zhang & Hyvärinen, 2010), LiNGAM (Shimizu et al., 2006) and GPI (Stegle et al., 2010) are based on their original Matlab code.
We compared the autoencoder method on several well-known one dimensional cause-effect pairs datasets. Each dataset consists of a list of pairs of real valued random variables (X, Y) with their direction, 1 or 0, depending on whether X → Y or Y → X (resp.). For each pair, we have a dataset of samples {(x_i, y_i)}_{i=1}^m.

Five cause-effect inference datasets, covering a wide range of associations, are used. CE-Net (Goudet et al., 2018) contains 300 artificial cause-effect pairs generated using random distributions as causes, and neural networks as causal mechanisms. CE-Gauss contains 300 artificial cause-effect pairs as generated by (Mooij et al., 2016), using random mixtures of Gaussians as causes, and Gaussian Process priors as causal mechanisms. CE-Multi (Goudet et al., 2018) contains 300 artificial cause-effect pairs built with random linear and polynomial causal mechanisms. In this dataset, simulated additive or multiplicative noise is applied before or after the causal mechanism.

The real-world datasets include the diabetes dataset by (Frank & Asuncion, 2010), where causality is from Insulin → Glucose. Glucose curves and Insulin doses were analysed for 69 patients, each serving as a separate dataset. To match the literature protocols, the pairs are taken in an orderless manner, ignoring the time series aspect of the problem. Finally, the Tübingen cause-effect pairs dataset by (Mooij et al., 2016) is employed. This dataset is a collection of 100 heterogeneous, hand-collected, real-world cause-effect samples.

The autoencoder A employed in our method, Eq. 4, is a fully-connected five-layered neural network with three layers for the encoder and two layers for the decoder. The hyperparameters of this algorithm are the sizes of each layer, the activation function, and the input dimension, i.e., the length of the sorted cuts (denoted by k in Sec. 2). Throughout the experiments, we noticed that the hyperparameter with the highest influence is the input dimension.
For all datasets, results are stable across a range of values of k, and we therefore use k = 250 throughout the experiments. For all datasets, we employed the ReLU activation function, except for the Tübingen dataset, where the sigmoid activation function produced better results (results are also reasonable with ReLU, but not state of the art).

In addition to our method, we also present results obtained with the entropy of each individual variable as a complexity measure.

Table 1.
Mean AUC rates of various baselines on different one dimensional cause-effect pairs datasets. Our interaction-less AEQ algorithm achieves competitive results on most datasets.
Method                          | CE-Net | CE-Gauss | CE-Multi | Tübingen | Diabetes
BivariateFit                    | 77.6   | 36.3     | 55.4     | 58.4     | 0.0
LiNGAM (Shimizu et al., 2006)   | 43.7   | 66.5     | 59.3     | 39.7     | –
CDS (Fonollosa, 2016)           | 89.5   | 84.3     | 37.2     | 59.8     | 12.0
IGCI (Daniusis et al., 2012)    | 57.2   | 33.2     | 80.7     | 62.2     | –
ANM (Hoyer et al., 2009)        | 85.1   | 88.9     | 35.5     | 53.7     | 22.2
PNL (Zhang & Hyvärinen, 2010)   | 75.5   | 83.0     | 49.0     | 68.1     | 28.1
GPI (Stegle et al., 2010)       | 88.4   | –        | –        | –        | –
We first compare our method on several synthetic datasets. Each dataset consists of a list of pairs of real multivariate random variables (X, Y) with their direction, 1 or 0, depending on whether X → Y or Y → X (resp.). For each pair, we have a dataset of samples {(x_i, y_i)}_{i=1}^m.

We employ five datasets, covering multiple associations. Each dataset contains 300 artificial cause-effect pairs. The cause random variable is of the form X = h(z), where h is some function and z ∼ N(0, σ²·I_n). The effect is of the form Y = g(u(X, E)), where E ∼ N(0, σ²·I_n) is independent of X, u is a fixed function that combines the cause X and the noise term E, and g is the causal mechanism. For each dataset, the functions h and g are taken from the same family of causal mechanisms H. Each pair of random variables is specified by randomly selected functions h and g.
Table 2.
Accuracy rates of various baselines on the CE-Tüb dataset. Our interaction-less algorithm AEQ achieves almost SOTA accuracy.
Method                              | Supervised | Acc
LiNGAM (Shimizu et al., 2006)       | -          | 44.3%
BivariateFit                        | -          | 44.9%
Entropy as a complexity measure     | -          | 52.5%
IGCI (Daniusis et al., 2012)        | -          | 62.6%
CDS (Fonollosa, 2016)               | -          | 65.5%
ANM (Hoyer et al., 2009)            | -          | 59.5%
CURE (Sgouritsa et al., 2015)       | -          | 60.0% (a)
GPI (Stegle et al., 2010)           | -          | 62.6%
PNL (Zhang & Hyvärinen, 2010)       | -          | 66.2%
CGNN (Goudet et al., 2018)          | -          | 74.4%
RECI (Bloebaum et al., 2018)        | -          | 77.5%
SLOPE (Marx & Vreeken, 2017)        | -          | –
Our AEQ comparison                  | -          | 80.0%
Jarfo (Fonollosa, 2016)             | +          | 59.5%
RCC (Lopez-Paz et al., 2015)        | +          | 75.0% (b)
NCC (Lopez-Paz et al., 2017)        | +          | 79.0%

(a) The accuracy of CURE is reported on version 0.8 of the dataset in (Sgouritsa et al., 2015) as 75%. In (Bloebaum et al., 2018), they re-ran this algorithm and achieved an accuracy rate of around 60%.
(b) The accuracy scores reported in (Lopez-Paz et al., 2015) are for version 0.8 of the dataset; in (Lopez-Paz et al., 2017), they re-ran RCC (Lopez-Paz et al., 2015) on version 1.0 of the dataset.

The synthetic datasets extend the standard synthetic data generators of (Kalainathan & Goudet, 2019) to multivariate causal pairs. MCE-Poly is generated with element-wise polynomials composed on linear transformations as mechanisms, and u(X, E) = X + E. MCE-Net pairs are generated using neural networks as causal mechanisms, and u is the concatenation operator. The mechanism in MCE-SigMix consists of a linear transformation, followed by the element-wise application of q_{a,b,c}(x̃) := ab(x̃ + c)/(1 + |b·(x̃ + c)|), where a, b, c are random real valued numbers, sampled for each pair, and x̃ = x + e, where e is the environment random variable.
In this case, u(X, E) = X + E.

We noticed that, a-priori, the produced datasets are imbalanced, in the sense that the reconstruction error of a standard autoencoder on each random variable can be employed as a score that predicts the cause variable with high accuracy. Therefore, in order to create balanced datasets, we varied the number of noise dimensions and their intensity until the autoencoder reconstruction errors of both X and Y became similar. Note that for these multivariate variables, we do not use quantiles; we use the variables themselves. As the autoencoder reconstruction results in Tab. 3 show, in the MCE-SigMix dataset, balancing was only partly successful.

Table 3.
Mean AUC rates of various baselines on different multivariate cause-effect pairs datasets. The datasets are designed and balanced such that an autoencoder method would fail. Our method achieves SOTA results.
Method                          | MCE-Poly | MCE-Net | MCE-SigMix | MOUS-MEG
AE reconstruction               | 57.2     | 42.4    | 22.3       | 41.2
BivariateFit                    | 54.7     | 48.4    | 48.2       | 44.2
IGCI (Daniusis et al., 2012)    | 41.9     | 49.3    | 59.8       | 56.0
CDS (Fonollosa, 2016)           | 63.8     | 57.0    | 62.1       | 89.9
ANM (Hoyer et al., 2009)        | 52.2     | 51.1    | 46.4       | 52.4
PNL (Zhang & Hyvärinen, 2010)   | 76.4     | 54.7    | 16.8       | 56.3
CGNN (Goudet et al., 2018)      | 47.8     | 67.8    | 58.8       | 40.9
Our method                      | 95.3     | 84.2    | 98.5       | 97.7
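For concreteness, one MCE-SigMix-like cause-effect pair can be generated as follows (a sketch of the generator described above; the dimensions, σ, the choice h = tanh, and the exact placement of the linear map are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 8, 1000, 1.0

def q(x_tilde, a, b, c):
    # element-wise q_{a,b,c}(x~) = a*b*(x~ + c) / (1 + |b*(x~ + c)|), so |q| < |a|
    return a * b * (x_tilde + c) / (1 + np.abs(b * (x_tilde + c)))

a_p, b_p, c_p = rng.standard_normal(3)        # per-pair random mechanism parameters
W = rng.standard_normal((n, n)) / np.sqrt(n)  # linear part of the mechanism

z = sigma * rng.standard_normal((m, n))
X = np.tanh(z)                            # cause X = h(z)
E = sigma * rng.standard_normal((m, n))   # environment noise, independent of X
Y = q((X + E) @ W, a_p, b_p, c_p)         # u(X, E) = X + E, then the mechanism
```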
Table 4.
Results of various methods on different variations of the MOUS-MEG dataset. R stands for the MEG scan at rest, W for the word presented to the subject, and A for the MEG scan when the subject is active.
Method                          R + W → A   R → A   W → A
Expected to be causal           Yes         No      No
AE reconstruction               41.2        51.7    98.6
BivariateFit                    44.2        58.1    0.0
IGCI (Daniusis et al., 2012)    56.0        50.6    42.2
CDS (Fonollosa, 2016)           89.9        52.1    90.2
ANM (Hoyer et al., 2009)        52.4        49.3    0.0
PNL (Zhang & Hyvärinen, 2010)   56.3        43.7    0.0
CGNN (Goudet et al., 2018)      40.9        52.2    100.0
Our method                      97.7        44.4    0.0

We compare our results to two types of baseline methods: (i) BivariateFit and ANM (Hoyer et al., 2009), which were designed (also) for the multivariate case, and (ii) CGNN (Goudet et al., 2018) and PNL (Zhang & Hyvärinen, 2010), which extend naturally to this case. To extend the CDS (Fonollosa, 2016) and IGCI (Daniusis et al., 2012) methods to higher dimensions, we applied quantization to the data samples, i.e., we cluster the samples {x_i}_{i=1}^m and {y_i}_{i=1}^m using two distinct k-means instances with k = 10, and then replace each sample with its corresponding cluster index to obtain a univariate representation of the data. After preprocessing the data, we apply the corresponding method. To select the hyperparameter k, we varied its value across different scales and found k = 10 to provide the best results. RECI (Bloebaum et al., 2018) could be extended as well; however, RECI's runtime grows rapidly with n, where n is
Table 5.
Emergence of independence. Ind C (Ind E) is the mean of |c_real + c_fake − 1| over all pairs of random variables, epochs and samples, when training the method from X to Y (vice versa). "w/o backprop" means without backpropagating gradients from D to R.

                     Full method               w/o backprop
Dataset          AUC    Ind C   Ind E      AUC    Ind C   Ind E
MCE-Poly         95.3   0.06    0.05       95.1   0.10    0.10
MCE-Net          84.2   0.28    0.31       65.1   0.55    0.55
MCE-SigMix       98.5   0.05    0.06       98.8   0.16    0.20
MOUS-MEG         97.7   0.14    0.14       80.7   0.74    0.75

the input dimension. Other methods cannot be extended or require significant modifications. For example, the SLOPE method (Marx & Vreeken, 2017) relies heavily on the ability to order the samples of the random variables X and Y, which is impossible in the multivariate case. We could not find an open-source implementation of the CURE algorithm (Sgouritsa et al., 2015). The results, given in Tab. 3, show a clear advantage over the literature methods across the four datasets. The exact same architecture is used throughout all experiments, with the same λ parameter; see Sec. 1 of the supplementary material. A sensitivity analysis (see supplementary Sec. 2) shows that our results are better than all baseline methods, regardless of the parameter λ. In addition to the synthetic datasets, we also employ the MOUS-MEG real-world dataset, provided to us by the authors of (King et al., 2020). This dataset is part of the Mother Of Unification Studies (MOUS) dataset (Schoffelen et al., 2019), which contains magnetoencephalography (MEG) recordings of 102 healthy Dutch-speaking subjects performing a reading task (9 of them were excluded due to corrupted data). Each subject was asked to read 120 sentences in Dutch, both in the right order and in randomly mixed order, which adds up to a total of over 1000 words. Each word was presented on the computer screen for 351 ms on average and was separated from the next word by 3-4 seconds.
Each time step consists of 301 MEG readings from the magnetometers attached to different parts of the head. For more information, see (Schoffelen et al., 2019). For each pair (X, Y), X is the interval [−. s, −. s] relative to the word onset, concatenated with the word embedding (using the spaCy python module with the Dutch language model); this represents the subject in the "rest" state (i.e., the cause). Y is the interval [0, . s] relative to the word onset, which represents the subject in the "active" state (i.e., the effect). To validate the soundness of the dataset, we ran a few experiments on variations of the dataset and report the results as additional columns in Tab. 4. As can be seen, a dataset where the cause consists of the word embedding and the effect consists of the subject's "active" state is highly imbalanced. This is reasonable, since the word embedding and the MEG readings are encoded differently and are of different dimensions. In addition, when the cause is selected to be the "rest" state and the effect is the "active" state, the various algorithms are unable to infer which side is the cause and which is the effect, since the word is missing. Finally, when considering the Rest+Word → Active variation, the relationship is expected to be causal, the AE reconstruction indicates that the dataset is balanced, and our method is the only one to achieve a high AUC rate.
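The k-means quantization used above to extend the univariate CDS and IGCI methods to multivariate pairs can be sketched as follows; the data and the plain k-means implementation are illustrative (the paper's experiments may well use a library implementation):

```python
import numpy as np

def kmeans_labels(Z, k=10, iters=50, seed=0):
    """Plain k-means; returns one integer cluster index per sample."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = Z[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

rng = np.random.default_rng(1)
x = rng.normal(size=(500, 8))             # hypothetical multivariate X samples
y = np.tanh(x @ rng.normal(size=(8, 6)))  # hypothetical multivariate Y samples

# Two distinct k-means instances, one per variable, as described in the text.
x_q = kmeans_labels(x, k=10)
y_q = kmeans_labels(y, k=10)
# x_q and y_q are univariate (integer) representations that a univariate
# cause-effect method such as CDS or IGCI can then consume.
```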
Emergence of independence
To check the importance of our adversarial loss in identifying the direction of causality and capturing the implicit independent representation f(X) and E, we applied our method without training R against the discriminator. In this case, the discriminator only serves as a test of whether X and R(Y) are independent; it does not contribute to the training loss of R (λ = 0).
As mentioned in Sec. 3.1, the distance between c_real + c_fake and 1 indicates the amount of dependence between X and R(Y). We denote by Ind C the mean value of |c_real + c_fake − 1| over all pairs of random variables and samples when training our method in the causal direction. The same mean score when training in the anti-causal direction is denoted Ind E. As is evident from Tab. 5, the independence is similar between the two directions, emphasizing the importance of the reconstruction error in the score.
As can be seen in Tab. 5, the adversarial loss improves the results when there is no implicit emergence of independence. However, in cases where there is emergence of independence, the results are similar. We noticed that the values of Ind C and Ind E are smaller for the full method. However, in MCE-Poly and MCE-SigMix they are still very small without it and, therefore, there is implicit emergence of independence between X and R(Y), even without explicitly training R(Y) to be independent of X.
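The Ind C / Ind E statistic can be computed directly from the discriminator's averaged outputs c_real (on real pairs (x, R(y))) and c_fake (on shuffled pairs (x′, R(y))). A minimal sketch, assuming sigmoid discriminator outputs averaged per batch; the numeric values are illustrative:

```python
import numpy as np

def independence_score(c_real, c_fake):
    """Mean |c_real + c_fake - 1| over batches: close to 0 when the
    discriminator cannot tell real pairs (x, R(y)) from shuffled pairs."""
    c_real, c_fake = np.asarray(c_real), np.asarray(c_fake)
    return float(np.mean(np.abs(c_real + c_fake - 1.0)))

# A confused discriminator outputs ~0.5 on both kinds of pairs,
# so the score is near 0 (independence emerged).
ind_low = independence_score([0.51, 0.49, 0.50], [0.50, 0.52, 0.48])

# A confident discriminator separates the two kinds of pairs,
# so the score is large (X and R(Y) are dependent).
ind_high = independence_score([0.95, 0.97, 0.96], [0.93, 0.96, 0.94])
```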
5. Summary
We discover an imbalance in the complexities of cause and effect in the univariate SCM and suggest a method to exploit it. Since the method does not consider the interactions between the variables, its success in predicting cause and effect indicates an inherent bias in the univariate datasets. Turning our attention to the multivariate case, where the complexity can be actively balanced, we propose a new method in which the learned networks model the underlying SCM itself. Since the noise term E is unknown, we replace it with a function of Y that is enforced to be independent of X. We also show that under reasonable conditions, the independence emerges even without explicitly enforcing it.
6. Acknowledgements
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974). The authors would like to thank Dimitry Shaiderman for insightful discussions. The contribution of Tomer Galanti is part of Ph.D. thesis research conducted at Tel Aviv University.
References
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pp. 214–223, 2017.

Bach, F. R. and Jordan, M. I. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, March 2003.

Bloebaum, P., Janzing, D., Washio, T., Shimizu, S., and Schoelkopf, B. Cause-effect inference by comparing regression errors. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 900–909. PMLR, 2018.

Brakel, P. and Bengio, Y. Learning independent features with adversarial nets for non-linear ICA, 2017.

Chazelle, B. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, 2000.

Daniusis, P., Janzing, D., Mooij, J. M., Zscheischler, J., Steudel, B., Zhang, K., and Schölkopf, B. Inferring deterministic causal relations. CoRR, 2012.

Fonollosa, J. A. R. Conditional distribution variability measures for causality detection. ArXiv, abs/1601.06680, 2016.

Frank, A. and Asuncion, A. UCI machine learning repository, 2010. http://archive.ics.uci.edu/ml.

Goudet, O., Kalainathan, D., Caillou, P., Lopez-Paz, D., Guyon, I., and Sebag, M. Learning functional causal models with generative neural networks. In Explainable and Interpretable Models in Computer Vision and Machine Learning, Springer Series on Challenges in Machine Learning. Springer International Publishing, 2018.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. Kernel methods for measuring independence. J. Mach. Learn. Res., 6:2075–2129, December 2005.

Heinze-Deml, C., Peters, J., and Meinshausen, N. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6, 2017.

Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., and Schölkopf, B. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21, pp. 689–696. Curran Associates, Inc., 2009.

Kalainathan, D. and Goudet, O. Causal discovery toolbox: Uncover causal relationships in Python, 2019.

King, J.-R., Charton, F., Oquab, M., and Lopez-Paz, D. Measuring causal influence with back-to-back regression: the linear case, 2020. URL https://openreview.net/forum?id=B1lKDlHtwS.

Lopez-Paz, D., Muandet, K., Schölkopf, B., and Tolstikhin, I. Towards a learning theory of cause-effect inference. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1452–1461. PMLR, 2015.

Lopez-Paz, D., Nishihara, R., Chintala, S., Schölkopf, B., and Bottou, L. Discovering causal signals in images. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pp. 58–66. IEEE, 2017.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.

Marx, A. and Vreeken, J. Telling cause from effect using MDL-based local and global regression. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 307–316, November 2017.

Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. Distinguishing cause from effect using observational data: Methods and benchmarks. Journal of Machine Learning Research, 17(32):1–102, 2016.

Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429–443, 1997.

Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, 2nd edition, 2009.

Peters, J., Janzing, D., and Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. The MIT Press, 2017.

Press, O., Galanti, T., Benaim, S., and Wolf, L. Emerging disentanglement in auto-encoder based unsupervised image content transfer. In International Conference on Learning Representations, 2019.

Schoffelen, J.-M., Oostenveld, R., Lam, N. H., Uddén, J., Hultén, A., and Hagoort, P. A 204-subject multimodal neuroimaging dataset to study language processing. Scientific Data, 6(1):17, 2019.

Sgouritsa, E., Janzing, D., Hennig, P., and Schölkopf, B. Inference of cause and effect with unsupervised inverse regression. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pp. 847–855. PMLR, 2015.

Shimizu, S., Hoyer, P. O., Hyvärinen, A., and Kerminen, A. A linear non-Gaussian acyclic model for causal discovery. J. Mach. Learn. Res., 7:2003–2030, December 2006.

Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P. O., and Bollen, K. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. J. Mach. Learn. Res., 12:1225–1248, July 2011.

Silvester, J. R. Determinants of block matrices. The Mathematical Gazette, 1999.

Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.

Stegle, O., Janzing, D., Zhang, K., Mooij, J. M., and Schölkopf, B. Probabilistic latent variable models for distinguishing between cause and effect. In Advances in Neural Information Processing Systems 23, pp. 1687–1695. Curran Associates, Inc., 2010.

Zhang, K. and Hyvärinen, A. Distinguishing causes from effects using nonlinear acyclic causal models. In Proceedings of the Workshop on Causality: Objectives and Assessment at NIPS 2008, volume 6 of Proceedings of Machine Learning Research, pp. 157–164. PMLR, 2010.

Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI'11, pp. 804–813. AUAI Press, 2011.
Figure 1.
Sensitivity experiment. The graph presents the AUC of our algorithm on the MOUS-MEG dataset as a function of λ, varied up to 1 on a logarithmic scale.

A. Architecture for All Multivariate Experiments
The functions G, F, R and D in the adversarial multivariate method are fully connected neural networks. F is a 2-layer network whose output F(X) has 50 dimensions, and R is a 3-layer network whose output R(Y) has 20 dimensions. G is a 2-layer network whose input has 50 + 20 dimensions (50 for F(X) and 20 for R(Y)). The discriminator is a 3-layer network whose input, the concatenation of X and R(Y), has 100 + 20 dimensions. The activation function in all networks is the sigmoid function, except for the discriminator, which applies the leaky ReLU activation. For all networks, the activation is not applied at the output layer. Throughout the experiments, the learning rate for training G, F and R is 0.01, and the learning rate of D is 0.001.

B. Sensitivity Experiment
To check that our results are robust with respect to λ, we conducted a sensitivity analysis. In this experiment, we ran our algorithm on the MOUS-MEG dataset (i.e., the Rest + Word → Active variation) with λ varying up to 1 on a logarithmic scale. As can be seen in Fig. 1, our algorithm is highly stable to the selection of λ across most of this range; the performance decays (gradually) only for the largest values of λ.

C. Analysis
C.1. Terminology and Notations
We recall some relevant notation and terminology. For a vector x = (x_1, ..., x_n) ∈ R^n, we denote by ‖x‖ := (Σ_{i=1}^n x_i^2)^{1/2} the Euclidean norm of x. For a differentiable function f : R^m → R^n and x ∈ R^m, we denote by

J(f(x)) := (∂f_i/∂ζ_j (x))_{i ∈ [n], j ∈ [m]}   (9)

the Jacobian matrix of f at x. For a twice-differentiable function f : R^m → R, we denote by

H(f(x)) := (∂²f/(∂ζ_i ∂ζ_j) (x))_{i,j ∈ [m]}   (10)

the Hessian matrix of f at x. Additionally, for a twice-differentiable function f : R^m → R^n, f(x) = (f_1(x), ..., f_n(x)), we denote the Hessian of f by H(f(x)) := (H(f_1(x)), ..., H(f_n(x))). For a scalar function f : R^m → R, the gradient notation ∇(f(x)) := J(f(x)) is employed instead of the Jacobian notation. For two positive functions f(x) and g(x), we write f(x) ≲ g(x) if there is a constant C > 0 such that f(x) ≤ C · g(x).

C.2. Proofs for the Results
In this section, we provide the proofs of the main results in the paper.
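Before turning to the proofs, the Jacobian notation of Sec. C.1 can be sanity-checked numerically with finite differences; the function below is an arbitrary example:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian: J(f(x))_{ij} = d f_i / d x_j."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (f(xp) - fx) / eps
    return J

f = lambda x: np.array([x[0] * x[1], np.sin(x[1])])  # arbitrary f : R^2 -> R^2
x0 = np.array([2.0, 0.0])
J = jacobian_fd(f, x0)
# Analytic Jacobian at (2, 0): [[x_2, x_1], [0, cos(x_2)]] = [[0, 2], [0, 1]]
J_true = np.array([[0.0, 2.0], [0.0, 1.0]])
```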
Lemma 1.
Let {H_d}_{d=1}^∞ be a family of classes of autoencoders that is closed to fixations. The function C_F(X) is a proper complexity measure.

Proof. First, since ℓ(a, b) ≥ 0 for all a, b ∈ R^k, the function is non-negative. Next, we would like to show that C_F(X, Y) ≥ max(C_F(X), C_F(Y)). Let A be the minimizer of E_{(x,y)∼(X,Y)}[ℓ(A(x, y), (x, y))] within H_{d_1+d_2}. We consider that there is a vector y*, such that:

E_{(x,y)∼(X,Y)}[ℓ(A(x, y), (x, y))] ≥ E_{y∼Y} E_{x∼X}[ℓ(A(x, y), (x, y))] ≥ E_{x∼X}[ℓ(A(x, y*), (x, y*))] ≥ E_{x∼X}[ℓ(A(x, y*)_{d_1}, x)]   (11)

where A(x, y*)_{d_1} denotes the restriction of A(x, y*) to its first d_1 coordinates. We note that A(·, y*)_{d_1} ∈ H_{d_1}, since the family is closed to fixations. Therefore,

E_{(x,y)∼(X,Y)}[ℓ(A(x, y), (x, y))] ≥ min_{A*∈H_{d_1}} E_{x∼X}[ℓ(A*(x), x)] = C_F(X)   (12)

By similar considerations, C_F(X, Y) ≥ C_F(Y). Lemma 2.
Let C be a complexity measure of multivariate random variables (i.e., non-negative and satisfying Eq. 2).
Then, there are triplets of random variables (X, E, Y) and (X̂, E, Y) and functions g and g′, such that Y = g(X, E), Y = g′(X̂, E), C(X) < C(Y) and C(X̂) > C(Y). Therefore, C cannot serve as a score for causal inference.

Proof. Let X be a random variable and E ⊥ X, such that Y = g(X, E). Assume that C(X) < C(Y). Then, let X′ be a random variable independent of X, such that C(X′) > C(Y). According to the definition of a complexity measure, we have C(X, X′) > C(Y), and also Y = g′(X, X′, E) for g′(a, b, c) = g(a, c).

The following lemma is an extension of Thm. 1 in (Zhang & Hyvärinen, 2010) to real-valued random variables of dimension greater than one.

Lemma 3.
Assume that (X, Y) can be described by both:

Y = g_1(f_1(X) + E_1), s.t. X ⊥ E_1 and g_1 is invertible   (13)

and

X = g_2(f_2(Y) + E_2), s.t. Y ⊥ E_2 and g_2 is invertible   (14)

and let:

T_1 := g_1^{-1}(Y) and h_1 := f_1 ∘ g_2,   T_2 := g_2^{-1}(X) and h_2 := f_2 ∘ g_1   (15)

Assume that the involved densities p_{T_2}, p_{E_1} and the nonlinear functions f_1, g_1 and f_2, g_2 are third-order differentiable. We then have the following equation for all (X, Y) satisfying the above:

H(η_1(t_2))·J(h_2(t_1)) − H(η_2(e_1))·J(h_1(t_2)) + H(η_2(e_1))·J(h_1(t_2))·J(h_2(t_1))·J(h_1(t_2)) − ∇η_2(e_1)·H(h_1(t_2))·J(h_2(t_1)) = 0   (16)

where η_1(t_2) := log p_{T_2}(t_2) and η_2(e_1) := log p_{E_1}(e_1).

Proof. The proof is an extension of the proof of Thm. 1 in (Zhang & Hyvärinen, 2010). We define:

T_1 := g_1^{-1}(Y) and h_1 := f_1 ∘ g_2,   T_2 := g_2^{-1}(X) and h_2 := f_2 ∘ g_1   (17)

Since g_2 is invertible, the independence between X and E_1 is equivalent to the independence between T_2 and E_1. Similarly, the independence between Y and E_2 is equivalent to the independence between T_1 and E_2. Consider the transformation F : (E_2, T_1) ↦ (E_1, T_2):

E_1 = T_1 − f_1(X) = T_1 − f_1(g_2(T_2))
T_2 = f_2(Y) + E_2 = f_2(g_1(T_1)) + E_2   (18)

The Jacobian matrix of this transformation is given by:

J := J(F(e_2, t_1)) =
[ −J(h_1(t_2))    I − J(h_1(t_2))·J(h_2(t_1)) ]
[ I               J(h_2(t_1))                 ]   (19)

Since I commutes with any matrix, by Thm. 3 in (Silvester, 1999), we have:

|det(J(F(E_2, T_1)))| = |det(−J(h_1(T_2))·J(h_2(T_1)) − I·(I − J(h_1(T_2))·J(h_2(T_1))))| = |det(−I)| = 1   (20)

Therefore, we have: p_{T_2}(t_2)·p_{E_1}(e_1) = p_{T_1,E_2}(t_1, e_2)/|det J| = p_{T_1,E_2}(t_1, e_2).
Hence, log p_{T_1,E_2}(t_1, e_2) = η_1(t_2) + η_2(e_1), and we have:

∂ log p_{T_1,E_2}(t_1, e_2) / ∂e_2 = ∇η_1(t_2) − ∇η_2(e_1)·J(h_1(t_2))   (21)

Therefore,

∂² log p_{T_1,E_2}(t_1, e_2) / (∂e_2 ∂t_1)
= H(η_1(t_2))·J(h_2(t_1)) − H(η_2(e_1))·(I − J(h_1(t_2))·J(h_2(t_1)))·J(h_1(t_2)) − ∇η_2(e_1)·H(h_1(t_2))·J(h_2(t_1))
= H(η_1(t_2))·J(h_2(t_1)) − H(η_2(e_1))·J(h_1(t_2)) + H(η_2(e_1))·J(h_1(t_2))·J(h_2(t_1))·J(h_1(t_2)) − ∇η_2(e_1)·H(h_1(t_2))·J(h_2(t_1))   (22)

The independence between T_1 and E_2 implies that for every possible (t_1, e_2), we have ∂² log p_{T_1,E_2}(t_1, e_2) / (∂e_2 ∂t_1) = 0.

Lemma 4 (Reduction to post-linear models). Let f(x) = σ_1(W_d ··· σ_1(W_1 x)) and g(u, v) = σ_2(U_k ··· σ_2(U_1(u, v))) be two neural networks. Then, if Y = g(f(X), E) for some E ⊥ X, we can represent Y = ĝ(f̂(X) + N) for some N ⊥ X.

Proof. Let f(x) = σ_1(W_d ··· σ_1(W_1 x)) and g(u, v) = σ_2(U_k ··· σ_2(U_1(u, v))) be two neural networks. Here, (u, v) is the concatenation of the vectors u and v. We consider that U_1(f(X), E) = U_1^1 f(X) + U_1^2 E, where U_1^1 and U_1^2 are the sub-matrices of U_1 that act on f(X) and E, respectively. We define a noise variable N := U_1^2 E and have X ⊥ N. In addition, let f̂(x) := U_1^1 f(x) and ĝ(z) := σ_2(U_k ··· σ_2(U_2 σ_2(z))). We consider that Y = ĝ(f̂(X) + N), as desired.

Theorem 1 (Identifiability of neural SCMs). Let P_{X,Y} admit a neural SCM from X to Y as in Eq. 1, such that p_X and the activation functions of f and g are three-times differentiable. Then it admits a neural SCM from Y to X only if p_X, f, g satisfy Eq. 27 in the appendix.

Proof. Let f_i(z) = σ_1(W_{i,d} ··· σ_1(W_{i,1} z)) and g_i(u, v) = σ_2(U_{i,k} ··· σ_2(U_{i,1}(u, v))) (where i = 1, 2) be pairs of neural networks, such that σ_1 and σ_2 are three-times differentiable. Assume that:

Y = g_1(f_1(X), E_1) and X = g_2(f_2(Y), E_2)   (23)

for some E_1 ⊥ X and E_2 ⊥ Y. By Lem.
4, we can represent Y = ĝ_1(f̂_1(X) + N_1), where

N_1 = U_{1,1}^2 E_1,  f̂_1 = U_{1,1}^1 ∘ f_1  and  ĝ_1(z) = σ_2(U_{1,k} ··· σ_2(U_{1,2} σ_2(z)))   (24)

and also X = ĝ_2(f̂_2(Y) + N_2), where

N_2 = U_{2,1}^2 E_2,  f̂_2 = U_{2,1}^1 ∘ f_2  and  ĝ_2(z) = σ_2(U_{2,k} ··· σ_2(U_{2,2} σ_2(z)))   (25)

Here, for each i = 1, 2 and j = 1, 2, the U_{i,1}^j are the sub-matrices of U_{i,1} that satisfy:

U_{i,1}(f_i(X), E_i) = U_{i,1}^1 f_i(X) + U_{i,1}^2 E_i   (26)

From the proof of Lem. 4, it is evident that the constructed ĝ_1, f̂_1 and ĝ_2, f̂_2 are three-times differentiable whenever σ_1 and σ_2 are. Therefore, by Lem. 3, the following differential equation holds:

H(η_1(t_2))·J(h_2(t_1)) − H(η_2(n_1))·J(h_1(t_2)) + H(η_2(n_1))·J(h_1(t_2))·J(h_2(t_1))·J(h_1(t_2)) − ∇η_2(n_1)·H(h_1(t_2))·J(h_2(t_1)) = 0   (27)

where

T_1 := ĝ_1^{-1}(Y) and h_1 := f̂_1 ∘ ĝ_2,   T_2 := ĝ_2^{-1}(X) and h_2 := f̂_2 ∘ ĝ_1   (28)

and η_1(t_2) := log p_{T_2}(t_2) and η_2(n_1) := log p_{N_1}(n_1).

Theorem 2 (Uniqueness of Representation). Let P_{X,Y} admit a nonlinear model from X to Y as in Eq. 1, i.e., Y = g(f(X), E) for some random variable E ⊥ X. Assume that f and g are invertible. Let G, F and R be functions, such that L_err := E_{(x,y)∼(X,Y)}[‖G(F(x), R(y)) − y‖²] = 0, G and F are invertible functions, and X ⊥ R(Y). Then, F(X) ∝ f(X) and R(Y) ∝ E.

Proof. Since F and f are invertible, one can represent F(X) = F(f^{-1}(f(X))) and f(X) = f(F^{-1}(F(X))); hence, F(X) ∝ f(X). Similarly, since G and g are invertible, we also have (F(X), R(Y)) ∝ (f(X), E). Since (F(X), R(Y)) ∝ (f(X), E) and F(X) ∝ f(X), we can write R(Y) = Q(F(X), E) for some function Q. However, R(Y) ⊥ F(X), and therefore we can represent R(Y) = P(E), and vice versa.

C.3. An Extension of Thm. 2
In this section, we extend Thm. 2. As a reminder, in our method we employ two losses: a mapping loss L_err(G, F, R) and a GAN-like independence loss L_indep(R; D).
Informally, in a similar fashion to Thm. 2, we would like to claim that when the algorithm successfully minimizes the losses, the information present in r(Y) := E can be recovered from R(Y). In Thm. 2, it is shown that whenever the losses are optimal, we have R(Y) ∝ r(Y). In Thm. 3, we relax the optimality assumption and express the recoverability of r(Y) given R(Y) in terms of the success of the algorithm in minimizing the losses. By similar arguments, we can also show that f(X) can be recovered from F(X).
To define a measure of recoverability of one random variable given another, we consider a class T of transformations T : R^n → R^n. The reconstruction of a given random variable V out of U is given by:

Rec_T(V; U) := inf_{T∈T} E_{(u,v)∼(U,V)}[‖T(u) − v‖²]   (29)

The class T of transformations serves as the set of possible candidate mappings from U to V.
In our case, we are interested in measuring the ability to recover the information present in r(Y) given R(Y). Therefore, we would like to show that our algorithm implicitly minimizes:

Rec_T(r(Y); R(Y)) = inf_{T∈T} E_{y∼Y}[‖T(R(y)) − r(y)‖²]   (30)

To do so, we upper bound the recoverability using the mapping error and a discriminator-based divergence. In our bound, instead of employing L_indep(R; D) directly, we make use of a different discriminator-based measure of independence. For simplicity, we assume that T consists of functions T : ∪_{n∈N} R^n → R^{d_e} and that for every fixed u ∈ R^k, we have T_u(x) := T(x, u) ∈ T.
This is the case for T = ∪_{n∈N} T_n, where T_n is the class of fully-connected neural networks (with biases) with input dimension n and fixed hidden dimensions.
The proposed measure of independence is based on the discrepancy measure (Chazelle, 2000; Mansour et al., 2009). For a given class D of discriminator functions D : X → R, we define the D-discrepancy, also known as an Integral Probability Metric (Müller, 1997), between two random variables X_1 and X_2 over X by:

disc_D[X_1 ‖ X_2] := sup_{D∈D} { E_{x∼X_1}[D(x)] − E_{x∼X_2}[D(x)] }   (31)

A well-known example of this measure is the WGAN divergence (Arjovsky et al., 2017), which is specified by a class D of neural networks with Lipschitz constant at most 1.
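To make the discrepancy concrete, consider the toy class D of linear discriminators with unit-norm weights (an illustrative stand-in for the neural-network classes used here). For this class, the supremum in Eq. 31 has a closed form, namely the distance between the means:

```python
import numpy as np

def disc_linear(X1, X2):
    """Empirical discrepancy for D = {x -> <w, x> : ||w|| <= 1}.
    sup_w <w, mean(X1) - mean(X2)> = ||mean(X1) - mean(X2)||."""
    return float(np.linalg.norm(X1.mean(axis=0) - X2.mean(axis=0)))

rng = np.random.default_rng(0)
A = rng.normal(size=(5000, 3))           # samples of X1
B = rng.normal(size=(5000, 3))           # same distribution: disc near 0
C = rng.normal(size=(5000, 3)) + 2.0     # mean-shifted: disc near ||(2,2,2)||

d_same = disc_linear(A, B)
d_diff = disc_linear(A, C)
```

Richer discriminator classes, such as the Lipschitz neural networks of the WGAN divergence, additionally detect differences beyond the first moment.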
In our bound, to measure the independence between F(X) and R(Y), we make use of the term:

disc_D[(F(X), R(Y), Y) ‖ (F(X′), R(Y), Y)]   (32)

for some class of discriminators D, where X′ is an independent copy of X. Even though we do not use the original measure of independence, the idea is very similar: instead of using a GAN-like divergence between (X, R(Y)) and (X′, R(Y)), we employ a WGAN-like divergence between (F(X), R(Y)) and (F(X′), R(Y)). From a theoretical standpoint, it is easier to work with the discrepancy measure, since it resembles a distance measure. The selection of D is a technical by-product of the proof of the theorem, and one can treat it as an "expressive enough" class of functions. Specifically, each discriminator D ∈ D takes the following form:

D(u_1, u_2, u_3) = ‖T(u_1, u_2) − Q(u_3)‖²   (33)

where T ∈ T and Q ∈ Q. Here, u_1 ∈ R^{d_f}, u_2 ∈ R^{d_e} and u_3 ∈ R^{d_y}. In particular, the discrepancy measure is:

disc_D[(F(X), R(Y), Y) ‖ (F(X′), R(Y), Y)] = sup_{T∈T, Q∈Q} { E_{(x,y)}[‖T(F(x), R(y)) − Q(y)‖²] − E_{(x′,x,y)}[‖T(F(x′), R(y)) − Q(y)‖²] }   (34)

where (x, y) ∼ (X, Y) and x′ ∼ X is an independent copy of x. A small discrepancy indicates that there is no discriminator D ∈ D that is able to separate between (F(X), R(Y), Y) and (F(X′), R(Y), Y). In particular, if F(X) ⊥ R(Y), then disc_D[(F(X), R(Y), Y) ‖ (F(X′), R(Y), Y)] = 0.

Theorem 3.
Let P_{X,Y} admit a nonlinear model from X to Y, i.e., Y = g(f(X), E) for some random variable E ⊥ X. We denote by G, F and R the classes from which the algorithm selects the mappings G, F, R (resp.). Let Q be a class of L-Lipschitz continuous functions Q : R^{d_y} → R^{d_e}. Let T be a class of functions that satisfies Q ∘ G ⊂ T. Let D = {D(u_1, u_2, u_3) := ‖T(u_1, u_2) − Q(u_3)‖²}_{Q∈Q, T∈T} be the class of discriminators. Then, for any G ∈ G, F ∈ F and R ∈ R, we have:

Rec_T(r(Y); R(Y)) ≲ L_err(G, F, R) + λ + disc_D[(F(X), R(Y), Y) ‖ (F(X′), R(Y), Y)]   (35)

where λ := inf_{Q∈Q} E_{y∼Y}[‖Q(y) − r(y)‖²].

As can be seen from Thm. 3, when Q is expressive enough, such that λ is small, and T is expressive enough to satisfy Q ∘ G ⊂ T, then for any functions
G, F, R, the recoverability of r(Y) given R(Y) is upper bounded by the sum of the mapping error and the discriminator-based independence measure. Hence, when selecting G, F, R that minimize both losses, one implicitly learns a modeling G(F(X), R(Y)) such that r(Y) can be recovered from R(Y). By a similar argument, the same relation holds for f(X) and F(X).

Proof.
Let Q* ∈ arg min_{Q∈Q} E_{y∼Y}[‖Q(y) − r(y)‖²]. We consider that:

inf_{T∈T} E_{(x,y)∼(X,Y)}‖T(R(y)) − r(y)‖²
≤ 3 inf_{T∈T} E_{(x,y)∼(X,Y)}‖T(R(y)) − Q*(y)‖² + 3 inf_{Q∈Q} E_{y∼Y}‖Q(y) − r(y)‖²
= 3 inf_{T∈T} E_{(x,y)∼(X,Y)}‖T(R(y)) − Q*(y)‖² + 3λ
= 3 inf_{T∈T} E_{x′∼X, (x,y)∼(X,Y)}‖T(F(x′), R(y)) − Q*(y)‖² + 3λ   (36)

where x′ and x are two independent copies of X. The last equation follows from the fact that x′ and y are independent and from the definition of T:

inf_{T∈T} E_{x′∼X, (x,y)∼(X,Y)}‖T(F(x′), R(y)) − Q*(y)‖²
≥ inf_{T∈T} E_{x′} E_{(x,y)∼(X,Y)}‖T_{F(x′)}(R(y)) − Q*(y)‖²
≥ E_{x′} inf_{T∈T} E_{(x,y)∼(X,Y)}‖T_{F(x′)}(R(y)) − Q*(y)‖²
≥ E_{x′} inf_{T∈T} E_{(x,y)∼(X,Y)}‖T(R(y)) − Q*(y)‖²
= inf_{T∈T} E_{(x,y)∼(X,Y)}‖T(R(y)) − Q*(y)‖²   (37)

Next, we consider that for any T ∈ T, we can rewrite:

E_{x′∼X, (x,y)∼(X,Y)}‖T(F(x′), R(y)) − Q*(y)‖²
= E_{(x,y)∼(X,Y)}‖T(F(x), R(y)) − Q*(y)‖² + { E_{x′∼X, (x,y)∼(X,Y)}‖T(F(x′), R(y)) − Q*(y)‖² − E_{(x,y)∼(X,Y)}‖T(F(x), R(y)) − Q*(y)‖² }
≤ E_{(x,y)∼(X,Y)}‖T(F(x), R(y)) − Q*(y)‖² + disc_D[(F(X), R(Y), Y) ‖ (F(X′), R(Y), Y)]   (38)

Since the class T includes Q* ∘ G, we have:

inf_{T∈T} E_{(x,y)∼(X,Y)}‖T(R(y)) − r(y)‖²
≤ 3 E_{(x,y)∼(X,Y)}‖Q*(G(F(x), R(y))) − Q*(y)‖² + 3 disc_D[(F(X), R(Y), Y) ‖ (F(X′), R(Y), Y)] + 3λ   (39)

Since Q* is an L-Lipschitz function for some constant L > 0