Conditionally-additive-noise Models for Structure Learning
Daniel Chicharro
Neural Computation Laboratory, Center for Neuroscience and Cognitive Systems@UniTn, Istituto Italiano di Tecnologia, 38068 Rovereto, Italy
Department of Neurobiology, Harvard Medical School, Boston, MA 02115
[email protected]
[email protected]
Stefano Panzeri
Neural Computation Laboratory, Center for Neuroscience and Cognitive Systems@UniTn, Istituto Italiano di Tecnologia, 38068 Rovereto, Italy
[email protected]
Ilya Shpitser
Department of Computer Science, Whiting School of Engineering, Johns Hopkins University
[email protected]
Abstract
Constraint-based structure learning algorithms infer the causal structure of multivariate systems from observational data by determining an equivalence class of causal structures compatible with the conditional independencies in the data. Methods based on additive-noise (AN) models have been proposed to further discriminate between causal structures that are equivalent in terms of conditional independencies. These methods rely on a particular form of the generative functional equations, with an additive noise structure, which allows inferring the directionality of causation by testing the independence between the residuals of a nonlinear regression and the predictors (nrr-independencies). Full causal structure identifiability has been proven for systems that contain only additive-noise equations and have no hidden variables. We extend the AN framework in several ways. We introduce alternative regression-free tests of independence based on conditional variances (cv-independencies). We consider conditionally-additive-noise (CAN) models, in which the equations may have the AN form only after conditioning. We exploit asymmetries in nrr-independencies or cv-independencies resulting from the CAN form to derive a criterion that infers the causal relation between a pair of variables in a multivariate system without any assumption about the form of the equations or the presence of hidden variables.
Inferring the causal structure of multivariate systems from observational data has become an indispensable need in many domains of science, from physics, to neuroscience, to finance (Lütkepohl, 2006; Wibral et al., 2014; Peters et al., 2017). Constraint-based structure learning algorithms have been used to infer the causal structure by determining an equivalence class of causal structures compatible with
Preprint. Work in progress.

the conditional independencies in the data (Spirtes et al., 2000; Pearl, 2009). Additive-noise (AN) models were proposed as powerful solutions that allow further discriminating between structures within these equivalence classes (Hoyer et al., 2009; Peters et al., 2014; Mooij et al., 2016). A pure AN functional equation requires that the noise is additively separable from the causes of a variable, and in the standard approach this property is exploited by testing independencies of the residuals of a nonlinear regression with the regression predictors. For multivariate systems, algorithms testing these nonlinear regression residual independencies (nrr-independencies) proceed by inferring a global causal ordering (Mooij et al., 2009) under the assumption that the noise is separable in all equations. This approach has been mostly studied in the case of causal sufficiency (no hidden variables; but see Janzing et al., 2009, for an exception). In this work we extend the AN framework on four fronts. First, allowing for the presence of hidden variables. Second, considering functional equations that have the AN form only after conditioning on certain variables. Third, introducing an alternative regression-free test to infer causality exploiting the independencies present in AN models. Fourth, proposing a criterion to infer the causal relation between a specific pair of variables in a multivariate system with hidden variables, without restrictions on the form of the functional equations and without involving the inference of a global causal ordering.

In more detail, we generalize AN models to partial conditionally-additive-noise (CAN) models with hidden variables. These models contain both equations reducible and irreducible to the AN form, and the AN form may only be obtained after conditioning on some of the observable variables.
We show how structure learning for partial CAN models can be formulated in terms of nrr-independence asymmetries, analogously to AN models (Hoyer et al., 2009; Peters et al., 2014). Furthermore, we introduce a regression-free test to detect additive noise. This test assesses the independence of the residuals' second-order moments from the predictors indirectly, estimating conditional variance independencies (cv-independencies) that do not require an actual reconstruction of the noise variables. We formulate a criterion to infer a potential cause from one particular variable to another in the presence of hidden variables, which does not require inferring a global causal ordering. Finally, we discuss the extension of CAN models by generalizing post-nonlinear AN models (Zhang and Hyvärinen, 2009), which allow for the presence in the functional equations of a global invertible nonlinear transformation of the AN terms. We believe that this work will lead to a structure learning algorithm alternative to the ones existing for AN models with no hidden variables (Mooij et al., 2009; Peters et al., 2014; Bühlmann et al., 2014). The proposal of such an algorithm exploiting the new criterion we propose is left for a future contribution.

This paper is organized as follows. In Section 2, we review previous work on AN models and post-nonlinear AN models. In Section 3, we describe the regression-free test based on cv-independence. In Section 4, we extend AN models to CAN models, providing conditions for the existence of cv-independencies and nrr-independencies that appear after conditioning. We introduce a criterion that exploits these independencies to infer causal relations in the presence of hidden variables and for systems which may be only partially CAN models. In Section 5, we examine examples of concrete systems. In Section 6, we extend our approach to post-nonlinear AN models.

We start with some basic notions for graphs.
We use capital letters for random variables and bold letters for sets and vectors. Consider a set of random variables $\mathbf{V} = \{V_1, \dots, V_n\}$. A graph $G = (\mathbf{V}, \mathbf{E})$ consists of nodes $\mathbf{V}$ and edges $\mathbf{E}$ between the nodes, with $(V, V) \notin \mathbf{E}$ for any $V \in \mathbf{V}$. We write $V_i \to V_j$ for $(V_i, V_j) \in \mathbf{E}$. We refer to $V$ as both variable $V$ and its corresponding node. A node $V_i$ is called a parent of $V_j$ if $(V_i, V_j) \in \mathbf{E}$. The set of parents of $V_j$ is denoted by $\mathbf{Pa}_j$. A path in $G$ is a sequence of (at least two) distinct nodes $V_1, \dots, V_n$ such that there is an edge between $V_k$ and $V_{k+1}$ for all $k = 1, \dots, n-1$. If all edges are $V_k \to V_{k+1}$, the path is a causal or directed path. A node $V_i$ is a collider in a path if it has incoming arrows $V_{i-1} \to V_i \leftarrow V_{i+1}$, and is a noncollider otherwise. The set of descendants $D(V_i)$ of node $V_i$ comprises those variables that can be reached going forward through causal pathways from $V_i$. The set of non-descendants $ND(V_i)$ of $V_i$ is complementary to $D(V_i)$ and includes $V_i$. In Directed Acyclic Graphs (DAGs) no node is its own descendant. Two nodes $V_i$ and $V_j$ are adjacent if either $(V_i, V_j) \in \mathbf{E}$, $(V_j, V_i) \in \mathbf{E}$, or there is a hidden common parent $V_k$ between them (i.e., $(V_k, V_i) \in \mathbf{E}$, $(V_k, V_j) \in \mathbf{E}$, and $V_k$ is not observable). $V_i$ is a potential cause of $V_j$ if it is a parent of $V_j$ or they share a hidden common parent. Under the faithfulness assumption (Spirtes et al., 2000), which ensures that the probability distribution contains only independencies induced by the causal structure, conditional independence between two variables is equivalent to d-separation (Pearl, 2009) of their corresponding nodes. Accordingly, a conditional dependence between $V_1$ and $V_n$ given $\mathbf{S}$, i.e., $V_1 \not\perp V_n \mid \mathbf{S}$, exists iff the nodes are connected by a path that is active when blocking the nodes in $\mathbf{S}$ (an $\mathbf{S}$-active path).
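As a concrete companion to these notions, the parent, descendant, and non-descendant sets can be computed from an adjacency representation of a toy DAG. This is a minimal sketch of ours; the dictionary encoding and helper names are not from the paper:

```python
# Toy DAG over V = {V1, V2, V3} with edges V1 -> V2, V2 -> V3, V1 -> V3.
# The adjacency dictionary maps each node to its children.
dag = {"V1": ["V2", "V3"], "V2": ["V3"], "V3": []}

def parents(g, v):
    # Pa_v: all nodes with an edge into v
    return {p for p, children in g.items() if v in children}

def descendants(g, v):
    # D(v): nodes reachable from v going forward along directed edges
    out, stack = set(), list(g[v])
    while stack:
        w = stack.pop()
        if w not in out:
            out.add(w)
            stack.extend(g[w])
    return out

def non_descendants(g, v):
    # ND(v): complementary to D(v); with the paper's convention it contains
    # v itself, since in a DAG no node is its own descendant
    return set(g) - descendants(g, v)

print(sorted(parents(dag, "V3")))          # ['V1', 'V2']
print(sorted(descendants(dag, "V1")))      # ['V2', 'V3']
print(sorted(non_descendants(dag, "V2")))  # ['V1', 'V2']
```

In this toy graph, V1 and V2 are both potential causes of V3, and V3 is a collider on the path V1 -> V3 <- V2.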
See Spirtes et al. (2000) for a more detailed description.

The functional equation generating variable $V_i$ has the AN form if it conforms to

$V_i = f_i(\mathbf{V}_i, \varepsilon_i) = f_i(V_{i,1}, \dots, V_{i,n_i}) + \varepsilon_i$,   (1)

with $\mathbf{Pa}_i = \mathbf{V}_i = \{V_{i,1}, \dots, V_{i,n_i}\}$ and noise $\varepsilon_i$ by definition independent of the parents. Most of the work with AN models assumes that all variables are observable. Under the assumption of no hidden variables, since the noise is additively separable from the parents, an estimate $\hat\varepsilon_i$ can be obtained by nonlinear regression as $\hat\varepsilon_i \equiv V_i - \hat f_i(\mathbf{V}_i)$. If $\varepsilon_i$ is properly reconstructed, $\hat\varepsilon_i \perp V$ $\forall V \in \mathbf{V}_i$, that is, the independence of the noise from the parents is recovered. Consider a particular variable $Y$ and parent $X \in \mathbf{Pa}_y$. If all variables are observed and the equation of $Y$ has the AN form, it is guaranteed that an independent noise can be reconstructed.

Proposition 1 (Nrr-independence with AN functional equations): 'If the functional equation of $Y$ has the AN form, then $\forall X \in \mathbf{Pa}_y$ $\exists \mathbf{S}$ and $\hat f_y(X, \mathbf{S})$ such that $\hat\varepsilon_y \perp X$, with $\hat\varepsilon_y \equiv Y - \hat f_y(X, \mathbf{S})$.'

The existence of at least one set $\mathbf{S}$ is guaranteed because $\mathbf{S} = \mathbf{Pa}_y \setminus X$ leads to $\hat\varepsilon_y \perp X$. If we knew that $\hat\varepsilon_y$ reconstructs a truly generative noise variable, Proposition 1 would suffice to infer a cause from $X$ to $Y$ (assuming no hidden variables). This is because, if $Y$ is adjacent to both $X$ and $\hat\varepsilon_y$ (there are edges $Y - X$ and $Y - \hat\varepsilon_y$), the fact that $X \not\perp Y$ and $X \perp \hat\varepsilon_y$ is a sufficient condition for $Y$ to be a collider ($X \to Y \leftarrow \hat\varepsilon_y$) (Pearl, 2009). However, because $\hat\varepsilon_y$ is only a reconstruction of the presumed underlying noise variable, extra checks are required: the question is whether the nrr-independence $X \perp \hat\varepsilon_y$ could also occur when $\hat\varepsilon_y$ is estimated but the generative model contains the reverse causal relation. Hoyer et al.
(2009) proved that, if $Y$ has a generative AN functional equation and $X \in \mathbf{Pa}_y$, nrr-independence holds for $\hat\varepsilon_y = Y - \hat f_y(X, \mathbf{S})$, given $\mathbf{S} = \mathbf{Pa}_y \setminus X$ fixed, and in general not for the direction opposite to causality, that is, there is no nrr-independence for $\hat\varepsilon_x = X - \hat f_x(Y, \mathbf{S})$. However, they also showed that nrr-independence in both directions holds for a family of distributions $p(X, Y \mid \mathbf{S})$ which, for $\mathbf{S}$ fixed, is characterized as the solutions of a third-order linear inhomogeneous differential equation. For example, Gaussian distributions belong to that family. Accordingly, if a system only contained AN equations with no hidden variables, an asymmetry $\hat\varepsilon_y \perp X$ and $\hat\varepsilon_x \not\perp Y$ would suffice to infer a cause from $X$ to $Y$. This is because $\hat\varepsilon_y \perp X$ always holds given the AN form of the functional equation of $Y$, and nrr-independence holds in both directions only if the data-generating distribution is within the special family of Hoyer et al. (2009), in which case nothing can be concluded.

However, generally not all functional equations have an AN form. Focusing on the bivariate case, Janzing and Steudel (2010) discussed the necessary assumptions for structure learning based on asymmetries of nrr-independencies. They indicated that, for a generative functional equation with the opposite direction of causality, $X = f_x(Y, \varepsilon_x)$, it has to be assumed that $\hat\varepsilon_y \perp X$ will not hold for any $\hat\varepsilon_y = Y - \hat f_y(X)$, except within the family of Hoyer et al. (2009). Janzing and Steudel (2010) justified the fulfillment of this assumption because $\hat\varepsilon_y \perp X$ would impose constraints making $p(Y)$ and $p(X \mid Y)$ dependent. This dependence requires a fine tuning of the distribution $p(Y)$ of the cause, given the mechanism $p(X \mid Y)$, and hence is fragile to changes in $p(Y)$ if the cause distribution changes independently of the causal mechanism, as expected.
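This asymmetry can be illustrated numerically. The sketch below (our illustration, not the estimators used in the cited works) simulates an AN model $Y = X + X^3 + \varepsilon_y$ with uniform, hence non-Gaussian, noise, fits a simple Nadaraya-Watson regression in both directions, and compares the dependence between residuals and predictor using a sample distance correlation; points near the support boundary are discarded to limit kernel edge bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1500
x = rng.uniform(-2, 2, n)
y = x + x**3 + rng.uniform(-0.5, 0.5, n)   # AN model X -> Y, non-Gaussian noise

def nw_regress(pred, target, bw=0.1):
    # Nadaraya-Watson kernel estimate of E[target | pred]
    d = pred[:, None] - pred[None, :]
    w = np.exp(-0.5 * (d / bw) ** 2)
    return (w @ target) / w.sum(axis=1)

def dcor(a, b):
    # Sample distance correlation: close to 0 when a and b are independent
    def dc(v):
        D = np.abs(v[:, None] - v[None, :])
        return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()
    A, B = dc(a), dc(b)
    return np.sqrt((A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean()))

res_fwd = y - nw_regress(x, y)   # causal direction: residual ~ independent of X
res_bwd = x - nw_regress(y, x)   # anticausal direction: residual depends on Y
m = np.abs(x) < 1.7              # trim support edges to limit boundary bias
d_fwd, d_bwd = dcor(res_fwd[m], x[m]), dcor(res_bwd[m], y[m])
print(d_fwd, d_bwd)
```

In the causal direction the residual dependence is small, while in the anticausal direction the residuals are markedly heteroscedastic in $Y$. With a linear mechanism and Gaussian input and noise, the same procedure would instead give small dependence in both directions, consistent with the special family above.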
These arguments are tightly related to the justification of the faithfulness assumption for conditional independencies based on stability. In particular, stability rules out 'pathological parameterizations' (Pearl, 2009) in which a conditional independence does not correspond to a d-separation present in the causal structure, because such independencies also require tuning the parameters of the functional equation and will vanish with small changes of these parameters.

When testing for nrr-independencies in multivariate systems, the common procedure starts by inferring a global causal ordering of the variables (Mooij et al., 2009). This step already uses nrr-independencies, and relies on the fact that conditioning on a descendant introduces a dependence between $Y$ and its noise variable. Subsequently, nrr-independencies are tested with regression models that, if the causal ordering is correct, do not take descendants as arguments. This allows removing superfluous edges from non-descendants that are not parents of $Y$. To our knowledge, for the multivariate case an analogous assumption of faithfulness has not been formulated explicitly. For the sake of comparison with our results, we here explicitly state the following assumption:

Assumption 1 (Nrr-independence faithfulness for non-additive-noise functional equations): 'If the generative functional equation of $X$, with $Y \in \mathbf{Pa}_x$, does not have an AN form, then $\hat\varepsilon_y \not\perp X$ $\forall \{\mathbf{S}, \hat f_y(X, \mathbf{S})\}$ with $\mathbf{S} \subseteq ND(X)$, $\mathbf{Pa}_x \setminus Y \subseteq \mathbf{S}$, and $\hat\varepsilon_y \equiv Y - \hat f_y(X, \mathbf{S})$.'

This assumption is a multivariate version of the bivariate one discussed in Janzing and Steudel (2010). It considers that all other parents of $X$ are included in the regression, and that only $X$ and $Y$ are exchanged. The assumption can be used iteratively when determining the causal ordering.
It ensures that, if a functional equation does not have the AN form and hence Proposition 1 does not guarantee independence in the right direction, an asymmetry of independence does not appear in the wrong direction. The assumption focuses on equations without an AN form because, by Proposition 1, with the AN form nrr-independence in the wrong direction only leads to symmetric nrr-independencies.

Finally, we also review post-nonlinear AN models, where a global nonlinearity transforms the AN equation (Zhang and Hyvärinen, 2009):

$V_i = f_i(\mathbf{V}_i, \varepsilon_i) = h_{i,1}(h_{i,2}(\mathbf{V}_i) + \varepsilon_i)$.   (2)

Here $h_{i,1}$ is an invertible nonlinear function. For the bivariate case, with $Y = h_1(h_2(X) + \varepsilon_y)$, Zhang and Hyvärinen (2009) generalized the work of Hoyer et al. (2009), extending the characterization of the special family of distributions that admits a statistical post-nonlinear AN model in both directions. Furthermore, they showed how to fit a nonlinear model to extract residuals $\hat\varepsilon_y \equiv \hat h_1^{-1}(y) - \hat h_2(x)$ to test nrr-independencies. For the multivariate case, assuming no hidden variables, Zhang and Hyvärinen (2009) used regressions to evaluate nrr-independencies given sets of candidate parents previously determined by examining conditional independencies between the variables (Spirtes et al., 2000).

Additive-noise models are a well-established approach for structure learning, which has been mostly studied in the case of causal sufficiency (no hidden variables). A pure AN functional equation requires that the noise is separable as in Eq. 1, and in the standard approach this property is exploited by testing the independence of the residuals of a nonlinear regression from the predictors. For multivariate systems, the application of these tests proceeds by inferring a global causal ordering.
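The post-nonlinear residual extraction reviewed above can be illustrated with a toy model $Y = (X + \varepsilon_y)^3$, i.e., $h_1(u) = u^3$ and $h_2(x) = x$. In the sketch below (ours; for simplicity the inverse $\hat h_1^{-1}$ is taken as known rather than fitted, unlike in Zhang and Hyvärinen, 2009), the naive AN residual remains dependent on $X$, while the residual computed after inverting $h_1$ recovers the additive noise:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1500
x = rng.uniform(-2, 2, n)
y = (x + rng.uniform(-0.3, 0.3, n)) ** 3   # post-nonlinear: h1(u)=u**3, h2(x)=x

def nw_regress(pred, target, bw=0.1):
    # Nadaraya-Watson kernel estimate of E[target | pred]
    d = pred[:, None] - pred[None, :]
    w = np.exp(-0.5 * (d / bw) ** 2)
    return (w @ target) / w.sum(axis=1)

def dcor(a, b):
    # Sample distance correlation: close to 0 when a and b are independent
    def dc(v):
        D = np.abs(v[:, None] - v[None, :])
        return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()
    A, B = dc(a), dc(b)
    return np.sqrt((A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean()))

m = np.abs(x) < 1.7   # trim support edges to limit kernel boundary bias
# Naive AN residual: regress Y on X directly -> strongly heteroscedastic in X
res_naive = y - nw_regress(x, y)
# Post-nonlinear residual: invert h1 first (assumed known here), then regress
u = np.cbrt(y)
res_pn = u - nw_regress(x, u)
d_pn, d_naive = dcor(res_pn[m], x[m]), dcor(res_naive[m], x[m])
print(d_pn, d_naive)
```

Inverting the outer nonlinearity before regressing is what restores the nrr-independence that the pure AN test misses.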
We extend the AN framework on four fronts, allowing for the presence of hidden variables, considering functional equations that have the AN form only after conditioning on certain variables, introducing an alternative regression-free test to infer causality, and modifying the procedure so as not to rely on the inference of a global causal ordering.

We start by introducing a regression-free test for causal directionality, alternative to the regression-based analysis of nrr-independencies. For this purpose, we continue to consider pure AN equations of the form of Eq. 1. The key property of AN functional equations is that the independent noise $\varepsilon_i$ is additively separable from the parents. For a particular variable $Y$ and a parent $X$, define $\mathbf{Z} \equiv \mathbf{Pa}_y \setminus X$. We can study the conditional variance $\sigma_{Y|X,\mathbf{Z}}$ as a variable which is a function only of $X$, with $\mathbf{Z}$ fixed. For AN functional equations, the independence and separability of the noise leads to $\sigma_{Y|X,\mathbf{Z}} = \sigma_{\varepsilon_y}$, which is independent of $X$ ($\sigma_{Y|X,\mathbf{Z}} \perp X$). This independence reflects indirectly the independence of the second-order moments of the residuals from the predictors, and does not require an actual reconstruction of the noise variables. Analogously to Proposition 1, the AN form suffices for this type of independence, which we call conditional variance independence (cv-independence).

Proposition 2 (Cv-independence with AN functional equations): 'If the functional equation of $Y$ has the AN form, then $\forall X \in \mathbf{Pa}_y$ $\exists \mathbf{S}$: $\sigma_{Y|X,\mathbf{S}} \perp X$ $\forall \mathbf{S} = \mathbf{s}$.'

The existence of at least one set $\mathbf{S}$ is guaranteed because $\mathbf{S} = \mathbf{Z}$ leads to $\sigma_{Y|X,\mathbf{S}} \perp X$. Because cv-independence follows from the fact that the noise is independent and separable from the other arguments of the equation, for the special family characterized by Hoyer et al. (2009), in which an AN statistical model can also be constructed in the reverse direction, cv-independence holds in both directions.
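A minimal regression-free check of the cv-independence in Proposition 2 (our sketch; the data and bin counts are illustrative) estimates $\sigma_{Y|X}$ by the variance of $Y$ inside quantile bins of $X$, and compares the spread of these binned variances in both directions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30000
x = rng.uniform(-2, 2, n)
y = x + np.tanh(2 * x) + rng.normal(0, 0.3, n)   # AN model X -> Y

def binned_cond_var(pred, target, nbins=25):
    # Crude conditional variance: variance of target inside quantile bins of pred
    qs = np.quantile(pred, np.linspace(0, 1, nbins + 1))
    idx = np.clip(np.searchsorted(qs, pred, side="right") - 1, 0, nbins - 1)
    return np.array([target[idx == b].var() for b in range(nbins)])

spread = lambda v: v.max() / v.min()
s_fwd = spread(binned_cond_var(x, y))   # sigma_{Y|X}: nearly flat in X
s_bwd = spread(binned_cond_var(y, x))   # sigma_{X|Y}: varies strongly with Y
print(s_fwd, s_bwd)
```

In the causal direction the binned variances are nearly flat (they equal the noise variance up to within-bin variation of the mean function), while in the anticausal direction they vary strongly; no regression model is fitted at any point.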
For general systems, possibly containing functional equations without the AN form, an assumption analogous to Assumption 1 is required to ensure that the asymmetry of cv-independencies does not hold in the direction inconsistent with the causal relation. Like for Assumption 1, to formulate this assumption of faithfulness we consider a functional equation with the opposite causal direction for $X$ and $Y$, and compare the cv-dependence of $\sigma_{Y|X,\mathbf{S}}$ with respect to the independence stated in Proposition 2.

Assumption 2 (Cv-independence faithfulness for non-additive-noise functional equations): 'If the functional equation of $X$, with $Y \in \mathbf{Pa}_x$, does not have an AN form, then $\sigma_{Y|X,\mathbf{S}} \not\perp X$ $\forall \mathbf{S} \subseteq ND(X)$, $\mathbf{Pa}_x \setminus Y \subseteq \mathbf{S}$.'

The two faithfulness assumptions are related by the following conditions:

Proposition 3 (Relation between cv-independence faithfulness and nrr-independence faithfulness): 'The fulfillment of Assumption 2 implies that of Assumption 1, but not the opposite.'

Proof of Proposition 3: See Appendix.

Despite this theoretical asymmetry between the two faithfulness assumptions, the fulfillment of Assumption 1 and not Assumption 2 would impose further constraints on the probability distributions. It would require that $p(\hat\varepsilon_y \mid x, \mathbf{s})$ is such that dependencies appear in third- or higher-order moments, so that cv-independence holds despite nrr-dependence. Furthermore, because we are considering a functional equation where $Y$ is a parent of $X$, the fulfillment of faithfulness regards $p(\hat\varepsilon_y \mid x, \mathbf{s})$, which does not correspond to the generative direction. Accordingly, cases in which nrr-independence faithfulness is violated and cv-independence faithfulness holds require a specific tuning introducing a dependence between the probability of the causes and the causal mechanism (Janzing and Steudel, 2010).
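The gap between the two faithfulness assumptions can be made concrete with a toy noise (our construction) whose variance is constant in $X$ but whose shape is not: cv-independence then holds exactly, while full residual independence fails in higher-order moments:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40000
x = rng.uniform(-1, 1, n)
# Unit-variance noise everywhere, but Gaussian-shaped for x < 0 and
# uniform-shaped for x >= 0 (uniform on [-sqrt(3), sqrt(3)] has variance 1)
noise = np.where(x < 0, rng.normal(0, 1, n), rng.uniform(-np.sqrt(3), np.sqrt(3), n))
y = np.sin(x) + noise

eps = y - np.sin(x)                    # true residual, for illustration
left, right = eps[x < 0], eps[x >= 0]

kurt = lambda e: ((e - e.mean()) ** 4).mean() / e.var() ** 2
var_l, var_r = left.var(), right.var()
kurt_l, kurt_r = kurt(left), kurt(right)
print(var_l, var_r)    # second moments agree: both near 1
print(kurt_l, kurt_r)  # fourth moments do not: near 3 (Gaussian) vs 1.8 (uniform)
```

Here a cv-based test sees no dependence of the residual on $X$, whereas a test sensitive to higher-order moments does: exactly the kind of higher-order dependence discussed above.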
The necessity of this tuning renders these cases fragile to changes in the distribution of the causes, and hence nonstable.

Testing nrr-independencies intrinsically requires a regression-based approach, fitting a (nonlinear) regression model. On the other hand, while cv-independencies can also be evaluated using the variance of the residuals, they can alternatively be tested in a regression-free approach, estimating the conditional variance of the variables without reconstructing the noise variables. The latter has the advantage that it does not rely on a particular regression model. However, in some cases a test of variance homogeneity, if $\mathbf{S}$ is high-dimensional, may require more data than the nonlinear regression approach. These practical issues are out of the scope of this work. As we will see below, the cv-independence formulation is particularly intuitive for deriving an extension of AN models to partial CAN models.

For systems in which all functional equations have the AN form, full identifiability of the causal structure has been proven when there are no hidden variables (Peters et al., 2014). For partial AN models, for which only some of the equations have the AN form, asymmetries in nrr-independencies have been used (Tillman et al., 2009) as a method to complement constraint-based causal discovery algorithms such as the PC algorithm (Spirtes et al., 2000), which exploit conditional independencies between the variables. However, to our knowledge, it has not been examined how extra inferential power can be gained from functional equations that, although not having a pure AN form, are converted to the AN form after conditioning on some variables. We call this type of equation a conditionally-additive-noise (CAN) functional equation. We derive the conditions on the form of a functional equation so that it can be converted to the CAN form in order to test nrr-independencies or cv-independencies.
Furthermore, we now drop the assumption of causal sufficiency and also consider the existence of hidden variables.

To derive which equations have the conditionally-additive-noise form, we start by expressing a generic functional equation as:

$V_i = f_i(\mathbf{V}_i, \varepsilon_i) = f_{i,1}(V_{i,1,1}, \dots, V_{i,1,n_1}) + f_{i,2}(V_{i,2,1}, \dots, V_{i,2,n_2}, \varepsilon_i) + f_\varepsilon(\varepsilon_i)$.   (3)

Here the form allowed for $f_{i,1}$ and $f_{i,2}$ should be understood as complementary to simpler terms. That is, $f_{i,1}$ comprises any function of only $\mathbf{V}_i$. Function $f_{i,2}$ comprises any function that contains $\varepsilon_i$ as an argument, but excluding terms that only contain $\varepsilon_i$. The sets $\mathbf{V}_{i,1}$ and $\mathbf{V}_{i,2}$ can overlap. Any functional equation can be expressed in this form. In particular, if $\mathbf{V}_{i,2} = \emptyset$ the equation reduces to the AN form.

Consider $Y = V_i$ and a particular parent $X \in \mathbf{V}_y$. We want to determine under which conditions cv-independencies or nrr-independencies can occur. As a first remark, if $X \in \mathbf{V}_{y,2}$, $\sigma_{Y|X,\mathbf{S}} \not\perp X$ for any set $\mathbf{S}$, since $\varepsilon_y$ is an argument of $f_{y,2}$ and $X$ modulates its variance. For the same reason the residuals cannot be independent of $X$ when $X \in \mathbf{V}_{y,2}$. Subsequently, we focus on variables $X \in \mathbf{V}_{y,1}$. Taking a particular variable $X$ as reference, Eq. 3 can be expanded into the following subterms, where we also differentiate between observable variables ($V$) and hidden variables ($U$):

$Y = f_{1,1}(X, \mathbf{V}_{1,1}, \mathbf{U}_{1,1}) + f_{1,2}(\mathbf{V}_{1,2}, \mathbf{U}_{1,2}) + \sum_{j=1}^{n_{1,3}} f_{1,3,j}(\tilde{\mathbf{V}}_{1,3,j}) V_{1,3,j} + \sum_j \beta_j V_{1,4,j} + \sum_{j=1}^{n_{1,5}} f_{1,5,j}(\tilde{\mathbf{V}}_{1,5,j}) U_{1,5,j} + \sum_j \alpha_j U_{1,6,j} + f_2(\mathbf{V}_2, \mathbf{U}_2, \varepsilon_y) + f_\varepsilon(\varepsilon_y)$.   (4)

We dropped the subindex $y$ from all variables and functions to simplify the notation. As in Eq. 3, the meaning of each function is determined by opposition to the simpler terms explicitly separated. For example, $f_{1,2}$ is any function that does not have $\varepsilon_y$ as an argument and does not include the other explicit simpler terms that do not include $\varepsilon_y$ either.
As will be appreciated below, we only separate those terms that are subject to different constraints in the conditions to obtain the CAN form. Only the function $f_{1,1}$ has $X$ as an argument. Function $f_{1,3}$ is linear in some observable variables $V_{1,3,j}$, with a coefficient that is a function $f_{1,3,j}$ of other observable variables $\tilde{\mathbf{V}}_{1,3,j}$. Function $f_{1,5}$ is linear in each hidden variable of $\mathbf{U}_{1,5}$, with a coefficient that is a function $f_{1,5,j}$ of observable variables $\tilde{\mathbf{V}}_{1,5,j}$. Here $\tilde{\mathbf{V}}_{1,3} = \{\tilde{\mathbf{V}}_{1,3,1}, \dots, \tilde{\mathbf{V}}_{1,3,n_{1,3}}\}$ and $\tilde{\mathbf{V}}_{1,5} = \{\tilde{\mathbf{V}}_{1,5,1}, \dots, \tilde{\mathbf{V}}_{1,5,n_{1,5}}\}$. Similarly, $\mathbf{V}_{1,3} = \{V_{1,3,1}, \dots, V_{1,3,n_{1,3}}\}$ and $\mathbf{U}_{1,5} = \{U_{1,5,1}, \dots, U_{1,5,n_{1,5}}\}$. $\mathbf{V}_y = \{\mathbf{V}_{1,1}, \mathbf{V}_{1,2}, \mathbf{V}_{1,3}, \tilde{\mathbf{V}}_{1,3}, \tilde{\mathbf{V}}_{1,5}, \mathbf{V}_{1,4}, \mathbf{V}_2\}$ contains all other observable parents apart from $X$, and $\mathbf{U}_y = \{\mathbf{U}_{1,1}, \mathbf{U}_{1,2}, \mathbf{U}_{1,5}, \mathbf{U}_{1,6}, \mathbf{U}_2\}$ all hidden parents. There can be overlaps between subgroups of $\mathbf{V}_y$ or of $\mathbf{U}_y$.

We determine the conditions that lead to cv-independencies and nrr-independencies. We will focus on the case in which, for a certain variable $X \in \mathbf{Pa}_y$ whose causal relation with $Y$ is examined, $X$ is adjacent to all other potential causes of $Y$, i.e., parents and variables sharing a hidden common cause with $Y$. This is because, as discussed above, it suffices that two observable potential causes are nonadjacent to infer that $Y$ is a collider for them using conditional independencies (Spirtes et al., 2000). This means that the conditions we derive could be relaxed, but the knowledge obtained would be redundant with the one provided by conditional independencies. Because cv-independencies only rely on second-order moments, there is a difference in the conditions needed to obtain cv-independence and nrr-independence. We start with cv-independencies, which lead to less restrictive conditions. We define the cv-CAN form as the form of a functional equation leading to cv-independence:
Definition 1 (Cv-independence with cv-CAN functional equations): 'The functional equation of $Y$ has the cv-CAN form for $X$ when conditioning on $\mathbf{S}$ if $\sigma_{Y|X,\mathbf{S}} \perp X$ $\forall \mathbf{S} = \mathbf{s}$.'

We now enunciate when a functional equation can be set into the cv-CAN form. For this purpose, expressing the functional equation of $Y$ as in Eq. 4, we define the functions $Y_{1,2} \equiv f_{1,2}(\mathbf{U}_{1,2}; \mathbf{V}_{1,2})$ and $Y_2 \equiv f_2(\mathbf{U}_2, \varepsilon_y; \mathbf{V}_2)$, where $\mathbf{V}_{1,2}$ and $\mathbf{V}_2$ play the role of fixed parameters, and we also define $\mathbf{S}' \equiv \{Y_{1,2}, Y_2, \mathbf{U}_{1,5}, \mathbf{U}_{1,6}\}$. The cv-CAN form is characterized as follows.

Theorem 1 (Functional equations with the cv-CAN form): 'Consider an $X \in \mathbf{Pa}_y$ and a set $\mathbf{S}$. For the case in which $X$ is adjacent to all other potential causes of $Y$, the functional equation of $Y$ has the cv-CAN form with respect to $X$ given the set $\mathbf{S}$ if and only if the hidden variables fulfill the following conditions

i) $\mathbf{U}_{1,1} = \emptyset$;
ii) $X \perp U_k \mid \mathbf{S}$ $\forall U_k \in \{\mathbf{U}_{1,2}, \mathbf{U}_2\}$;
iii) $\sigma_{U_k|X,\mathbf{S}} \perp X$ $\forall U_k \in \{\mathbf{U}_{1,5}, \mathbf{U}_{1,6}\}$;
iv) $\sigma_{Z_i Z_j|X,\mathbf{S}} \perp X$ $\forall Z_i, Z_j \in \mathbf{S}'$;   (5)

the set $\mathbf{S}$ is such that $\{\mathbf{V}_{1,1}, \mathbf{V}_{1,2}, \tilde{\mathbf{V}}_{1,3}, \tilde{\mathbf{V}}_{1,5}, \mathbf{V}_2, \mathbf{V}_{1,3,c}, \mathbf{V}_{1,4,c}\} \subseteq \mathbf{S}$, where $\mathbf{V}_{1,4,c}$ is defined as $\mathbf{V}_{1,4,c} \equiv \mathbf{V}_{1,4} \setminus \mathbf{V}_{1,4,u}$, with $\mathbf{V}_{1,4,u} \subseteq \mathbf{V}_{1,4}$ such that $\forall V_i \in \mathbf{V}_{1,4,u}$: $\sigma_{V_i|X,\mathbf{S}} \perp X$, and $\mathbf{V}_{1,3,c}$ is defined as $\mathbf{V}_{1,3,c} \equiv \mathbf{V}_{1,3} \setminus \mathbf{V}_{1,3,u}$, with $\mathbf{V}_{1,3,u} \subseteq \mathbf{V}_{1,3}$ such that $\forall V_i \in \mathbf{V}_{1,3,u}$: $\sigma_{V_i|X,\mathbf{S}} \perp X$; and the unconditioned observable variables also fulfill the following conditions

v) $\sigma_{V_i V_j|X,\mathbf{S}} \perp X$ $\forall V_i, V_j \in \mathbf{S}''$;
vi) $\sigma_{V_i Z_j|X,\mathbf{S}} \perp X$ $\forall V_i \in \mathbf{S}''$, $Z_j \in \mathbf{S}'$,   (6)

where $\mathbf{S}'' = \{\mathbf{V}_{1,3,u}, \mathbf{V}_{1,4,u}\}$.'

Proof of Theorem 1: See Appendix.

To understand the logic of these conditions, we rewrite Eq. 4 as

$Y = f_{1,1}(X; \mathbf{V}_{1,1}) + \big[ f_{1,2}(\mathbf{U}_{1,2}; \mathbf{V}_{1,2}) + \sum_j \tilde\beta_j V_{1,3,u,j} + \sum_j \beta_j V_{1,4,u,j} + \sum_j \tilde\alpha_j U_{1,5,j} + \sum_j \alpha_j U_{1,6,j} + f_2(\mathbf{U}_2, \varepsilon_y; \mathbf{V}_2) + f_\varepsilon(\varepsilon_y) + c \big]$.   (7)

$\tilde\beta_j = f_{1,3,j}(\tilde{\mathbf{V}}_{1,3,j})$ and $\tilde\alpha_j = f_{1,5,j}(\tilde{\mathbf{V}}_{1,5,j})$ are constant coefficients because $\{\tilde{\mathbf{V}}_{1,3}, \tilde{\mathbf{V}}_{1,5}\} \subseteq \mathbf{S}$. The constant $c$ equals $\sum_j \tilde\beta_j V_{1,3,c,j} + \sum_j \beta_j V_{1,4,c,j}$ because $\{\tilde{\mathbf{V}}_{1,3}, \mathbf{V}_{1,3,c}, \mathbf{V}_{1,4,c}\} \subseteq \mathbf{S}$. Eq.
7 can be summarized as:

$Y = f_{1,1}(X; \mathbf{S}) + g(\mathbf{V}_{1,3,u}, \mathbf{V}_{1,4,u}, \mathbf{U}_{1,2}, \mathbf{U}_{1,5}, \mathbf{U}_{1,6}, \mathbf{U}_2; \mathbf{S}) = f_{1,1}(X; \mathbf{S}) + \xi_{y|\mathbf{S}}$,   (8)

where the function $g$ plays the role of a noise $\xi_{y|\mathbf{S}}$ analogous to the additive noise term of a pure AN equation, and hence the equation has the additive-noise form when seen as a function of $X$. $\mathbf{U}_{1,2}$ and $\mathbf{U}_2$ are conditionally independent of $X$ given $\mathbf{S}$, and $g$ is linear in all the other arguments, with their variances and covariances conditionally independent of $X$ given $\mathbf{S}$. This leads to $\sigma_{\xi_{y|\mathbf{S}}|X,\mathbf{S}} \perp X$. Note that, to fulfill the conditions in Eqs. 5 and 6, $\mathbf{S}$ may need to include other variables that are not parents of $Y$. Furthermore, the constraints are intertwined because independencies change depending on which variables are included in $\mathbf{S}$. Since all variables in $\mathbf{V}_{1,3}$ and $\mathbf{V}_{1,4}$ are observable, it is always possible to try to find a valid set $\mathbf{S}$ with $\mathbf{V}_{1,3,u} = \emptyset$ and $\mathbf{V}_{1,4,u} = \emptyset$. In that case, the constraints of Eq. 6 vanish. It is also possible to formulate a simpler sufficient condition by demanding $X \perp U_k \mid \mathbf{S}$ $\forall U_k \in \{\mathbf{U}_y \setminus \mathbf{U}_{1,1}\}$.

Note that the cv-CAN form is obtained relative to a certain variable. The existence of a valid set $\mathbf{S}$ to place an equation in the CAN form relative to a variable is not guaranteed for all the observable parents. This is for two reasons. First, it may be that for a certain $X$ hidden variables are present that do not fulfill the conditions of Theorem 1. This limitation is common to pure AN functional equations if hidden variables are allowed, since AN equations are CAN equations with $\mathbf{V}_{y,2} = \emptyset$. Second, even with no hidden variables, $\forall V \in \mathbf{V}_{y,2}$ $\nexists \mathbf{S}: \sigma_{Y|V,\mathbf{S}} \perp V$. That is, certain parents are not additively separable from the noise and cannot lead to any cv-independence.
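A toy system (ours, not one of the paper's Section 5 examples) illustrates the conditioning mechanism behind Eq. 8: with $Y = \tanh(X) + (1 + Z^2)\varepsilon_y$ and an observable $Z$ correlated with $X$, the equation is not AN in $X$ marginally, but the conditional variance becomes flat in $X$ once $Z$ is (approximately) fixed:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200000
x = rng.uniform(-2, 2, n)
z = x + rng.normal(0, 0.5, n)                         # observable, correlated with X
y = np.tanh(x) + (1 + z**2) * rng.normal(0, 0.2, n)   # noise scale set by Z

def binned_cond_var(pred, target, nbins=8):
    # Variance of target inside quantile bins of pred
    qs = np.quantile(pred, np.linspace(0, 1, nbins + 1))
    idx = np.clip(np.searchsorted(qs, pred, side="right") - 1, 0, nbins - 1)
    return np.array([target[idx == b].var() for b in range(nbins)])

spread = lambda v: v.max() / v.min()

# Marginally, sigma_{Y|X} inherits the Z-driven noise scale and varies with X
s_marginal = spread(binned_cond_var(x, y))

# Conditioning on Z (a narrow slice around z = 1), sigma_{Y|X,Z} is ~flat in X
m = np.abs(z - 1.0) < 0.1
s_conditional = spread(binned_cond_var(x[m], y[m]))
print(s_marginal, s_conditional)
```

This is the cv-CAN pattern: holding $Z$ fixed turns the noise term into a $\xi_{y|\mathbf{S}}$ whose variance is independent of $X$, so a cv-independence appears only after conditioning.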
The fact that only some equations in the system, and only relative to certain variables, have the CAN form hinders the application of structure learning algorithms in which a global causal ordering is inferred by searching for the ordering that leads to the highest estimates of residual independence (Mooij et al., 2009; Peters et al., 2014), which are designed for systems in which all equations have the pure AN form. This is because now a lack of independence can be due not to the wrong order, but to the lack of separability of the noise, for the reasons mentioned above.

Theorem 1 states which form of a functional equation will create a cv-independence. Assuming that a certain functional equation is known or hypothesized, and for a certain context in which the existence of certain hidden variables is known or hypothesized, the theorem allows determining if a cv-independence exists. However, the theorem cannot be applied for inference, given that the conditions in Eqs. 5 and 6 involve hidden variables and hence their fulfillment cannot be tested from data. To derive a criterion applicable for inference, we identify the assumptions required so that a specific asymmetry of cv-independencies provides information about the causal relation between the corresponding pair of variables, without inferring a global causal ordering.

Assumption 3 (Cv-independence faithfulness for non-conditionally-additive-noise functional equations): 'If $\forall \mathbf{S}_0 \subseteq \mathbf{S}$ the generative functional equation of $X$, with $Y \in \mathbf{Pa}_x$, does not have the cv-CAN form for $Y$ conditioned on $\mathbf{S}_0$, then $\sigma_{Y|X,\mathbf{S}} \not\perp X$.'

In comparison to the previous assumptions of faithfulness, here there is no restriction of $\mathbf{S}$ to non-descendants of $X$; $\mathbf{S}$ is not limited based on any causal knowledge. The assumption again focuses on functional equations which do not have the CAN form. In the Appendix we indicate that, like for pure AN equations, a special family of joint distributions $p(X, Y \mid \mathbf{S})$, as described by Hoyer et al.
(2009), allows a CAN statistical form in both directions. Assumption 3 can be used to infer a potential cause from $X$ to $Y$, that is, to infer that $X$ causes $Y$ or there is a latent common cause:

Proposition 4 (Inferring noncausality with cv-independence asymmetries): 'Consider two adjacent variables $X$ and $Y$. Under the assumption of cv-independence faithfulness for non-cv-CAN functional equations (Assumption 3), if $\exists \mathbf{S}: \sigma_{Y|X,\mathbf{S}} \perp X$ and $\sigma_{X|Y,\mathbf{S}_0} \not\perp Y$ $\forall \mathbf{S}_0 \subseteq \mathbf{S}$, then there is no causality from $Y$ to $X$, that is, $X$ is a potential cause of $Y$.'

Proof of Proposition 4: If $\exists \mathbf{S}: \sigma_{Y|X,\mathbf{S}} \perp X$, it does not hold that $\sigma_{Y|X,\mathbf{S}} \not\perp X$. By Assumption 3, this implies that either $Y \notin \mathbf{Pa}_x$ or the functional equation of $X$ has the cv-CAN form for $Y$ conditioning on $\mathbf{S}_0$ for some $\mathbf{S}_0 \subseteq \mathbf{S}$. The latter is discarded since we have $\sigma_{X|Y,\mathbf{S}_0} \not\perp Y$ $\forall \mathbf{S}_0 \subseteq \mathbf{S}$. □

We now provide some intuition about this criterion. First, if there is only a latent common cause between $X$ and $Y$, it is valid to infer a potential cause in either direction. Therefore, what we need is to avoid inferring the potential cause in the wrong direction when there is a genuine cause. For the bivariate case, the asymmetry of cv-independencies suffices if we assume faithfulness for non-CAN functional equations. However, conditioning on some set $\mathbf{S}$ not only can convert an equation to the CAN form, it can also introduce cv-dependencies that were not present when conditioning only on a subset of $\mathbf{S}$. An asymmetry could appear in the following way: for a certain $\mathbf{S}^*$, not only does the functional equation of $Y$ have the CAN form relative to $X$, but furthermore the conditional joint distribution $p(X, Y \mid \mathbf{S}^*)$ belongs to the special family that allows a CAN statistical model in both directions. For $\mathbf{S}^*$, a symmetry of cv-independencies is obtained. However, conditioning on a larger set ($\mathbf{S}^* \subset \mathbf{S}$) can introduce a cv-dependence that only appears in the direction in which the independence given $\mathbf{S}^*$ was consistent with the causal structure.
Accordingly, for S an unfaithfulasymmetry is obtained. See Section 5 for an example of a system in which this type unfaithful ofasymmetry occurs. Checking if σ X | Y , S M Y @ S Ď S , we can find the S ˚ Ă S for which symmetricindependencies were obtained, showing that the observed asymmetry is not reliable.Altogether, Theorem 1 states when cv-independencies occur as a consequence of the causal structure,and Assumption 3 specifies the faithfulness assumption required so that cv-independencies do notoccur inconsistently with the causal structure, which allows formulating the criterion of Proposition to infer noncausality from data. That is, Theorem 1 provides us an analytical tool to establishcv-independencies from a known or hypothesized functional equation, and Proposition provides usan empirical tool to infer the causal information from data. We now define the nrr-CAN form as the form of a functional equation leading to nrr-independence:
Definition (Nrr-independence with nrr-CAN functional equations): 'The functional equation of $Y$ has the nrr-CAN form for $X$ when conditioning on $S$ if for every $S = s$ there exists $\hat f_y(X; S)$ such that $\hat\varepsilon_{y|X;S} \perp X$, with $\hat\varepsilon_{y|X;S} \equiv Y - \hat f_y(X; S)$.'

We distinguish $X$ and $S$ as an argument and constant parameters of the function $\hat f_y$, since $S = s$ is fixed when conditioning. We now enunciate the conditions under which a functional equation can be put into the nrr-CAN form. As in Theorem 1, we focus on conditions for the case in which $X$ is adjacent to all other potential causes of $Y$, since otherwise the rules based on conditional independencies would already be applicable to extract the same causal information. For this purpose, we first introduce some further notation. Consider a variable $Z$, observed or hidden, that has a linearly additive contribution to the functional equation of $Y$ (Eq. 4); hence $\hat f_y(X; S)$ will contain an additive component associated with the term in which $Z$ appears. This component corresponds to the conditional mean of $Z$ given $X$ and $S$, scaled by its coefficient in Eq. 4. The contribution of this term to the residual of $Y$ is hence proportional to the residual $\varepsilon_{z|X;S}$ that would result from a separate regression to estimate $Z$. Therefore, we define $\varepsilon_{z|X;S} \equiv Z - \hat f_z(X; S)$ for any such $Z$. We use an analogous definition for the part of the residual of $Y$ associated with the variables that acquire a linearly additive contribution in Eq. 4 only after conditioning on their accompanying observed sets. The nrr-CAN form is characterized as follows:

Theorem 2 (Functional equations with nrr-CAN form): 'Consider an $X \in Pa_Y$ and the case in which $X$ is adjacent to all other potential causes of $Y$. Express the functional equation of $Y$ as in Eq. 4.
The equation has the nrr-CAN form with respect to $X$ given $S$ if and only if the hidden variables fulfill the following conditions:

i) the set of hidden variables that enter a nonlinear component together with $X$ is empty;
ii) $X \perp U_k \mid S$ for every hidden variable $U_k$ entering the other nonlinear components;
iii) $\varepsilon_{U_k|X;S} \perp X$ for every hidden variable $U_k$ that contributes linearly, (9)

and the conditioning set $S$ contains the observed variables that enter the nonlinear components of Eq. 4, together with those linearly contributing observed variables $V_i$ for which $\varepsilon_{V_i|X;S} \perp X$ does not hold.'

Proof of Theorem 2: See Appendix.

The correspondence between Theorems 1 and 2 can be understood by considering that conditional variances only quantify, in a regression-free way, dependencies of $X$ with the second-order moments of the residuals of $Y$, whereas nrr-independencies are sensitive also to dependencies of $X$ with the residuals' higher-order moments. Accordingly, while conditions i-ii) of Theorem 1, which require conditional independencies, are preserved in Theorem 2, the rest of the conditions iii-vi), specific to second-order moments, are modified. Condition iii) of Theorem 2 is analogous to condition iii) of Theorem 1: it indicates that for a linearly contributing hidden variable $U$, a dependence with $X$ can exist in the mean $\mu_{u|X;S}$, which will be captured by the regression function, but any other dependence with $X$ in $\varepsilon_{u|X;S}$ will also create an nrr-dependence between $X$ and the residuals $\varepsilon_{y|X;S}$. The other conditions of Theorem 1, iv-vi), are already fulfilled given the standard assumption of faithfulness for conditional independencies (Spirtes et al., 2000). This is because in Theorem 1 condition iii) and the requirements in the selection of the excluded linear variables only involve conditional variances, and the conditional variance of $Y$ also depends on the covariances between the different linear contributions in its functional equation. Conversely, in Theorem 2 condition iii) and the corresponding selection requirements are conditional independence constraints.
Any dependence between $X$ and a subset of the variables that exists despite $X$ being independent of each of these variables individually would violate the standard assumption of faithfulness for conditional independencies.

For most functional equations, both or neither of the CAN forms are obtainable, because the existence of higher-order dependencies without second-order dependencies imposes restrictive constraints on the form of the functional equations. However, the specific cases in which the cv-CAN form holds and the nrr-CAN form does not may still be stable, in the sense that they do not depend on a specific tuning of the distribution of the causes (Janzing and Steudel, 2010): the independencies required in Theorems 1 and 2 may depend exclusively on the form of the functional equations. The relation between the fulfillment of the cv-CAN form and of the nrr-CAN form is thus qualitatively different from the relation between cv-independence faithfulness and nrr-independence faithfulness, as discussed in relation to Proposition 3. In the latter case, because the violation of faithfulness concerns dependencies with residuals extracted in the direction opposite to the generative functional equation, cases in which cv-independence faithfulness is violated while nrr-independence faithfulness is not will occur only for specific tunings of the distribution of the causes, as discussed above.

As with the formulation based on cv-independencies, the conditions in Eq. 9 are not testable experimentally, since they involve hidden variables. Again, Theorem 2 serves to identify for which type of functional equations nrr-independencies will exist as a consequence of the form of the equation, but a criterion for inference from data still has to be introduced.
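The gap between the two notions can be illustrated numerically. In the following sketch (a hypothetical construction chosen for illustration, not an example from the paper), the residual of $Y$ given $X$ has constant variance but an $X$-dependent skewness, so a conditional-variance check sees a cv-independence while a check sensitive to higher moments detects an nrr-dependence:

```python
import numpy as np

# Hypothetical illustration: a noise term with constant conditional variance
# but X-dependent skewness, so the cv-CAN form holds while nrr-independence
# fails in moments beyond the second.
rng = np.random.default_rng(1)
n = 40000
x = rng.normal(size=n)

# Mixture of a standardized exponential (skew +2) and its mirror (skew -2),
# with an X-dependent mixture weight: mean 0 and variance 1 for every X.
p = 1.0 / (1.0 + np.exp(-2.0 * x))
e = rng.exponential(size=n) - 1.0
sign = np.where(rng.random(n) < p, 1.0, -1.0)
y = x + sign * e
res = y - x                                  # residual of the true regression

# Conditional second and third moments of the residual across bins of X.
edges = np.quantile(x, np.linspace(0, 1, 11))
bin_var, bin_skew = [], []
for lo, hi in zip(edges[:-1], edges[1:]):
    r = res[(x >= lo) & (x <= hi)]
    bin_var.append(r.var())
    bin_skew.append(((r - r.mean()) ** 3).mean() / r.std() ** 3)

var_ratio = max(bin_var) / min(bin_var)      # near 1: cv-independence holds
skew_range = max(bin_skew) - min(bin_skew)   # large: an nrr-dependence exists
```

The binned variances stay within sampling error of a constant while the binned skewness tracks $X$, reproducing the situation in which cv-independence holds despite an nrr-dependence.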
For this purpose we formulate for nrr-independence a faithfulness assumption analogous to Assumption 3:

Assumption 4 (Nrr-independence faithfulness for non-conditionally-additive-noise functional equations): 'If for every $S \subseteq \mathbf{S}$ the generative functional equation of $X$, with $Y \in Pa_X$, does not have the nrr-CAN form for $Y$ conditioned on $S$, then $\hat\varepsilon_{y|X;S} \not\perp X$ for any regression $\hat f_y(X; S)$, with $\hat\varepsilon_{y|X;S} \equiv Y - \hat f_y(X; S)$.'

Based on this assumption, we can state a criterion of noncausality using nrr-independencies analogous to Proposition 4:

Proposition 5 (Inferring noncausality with nrr-independence asymmetries): 'Under the assumption of nrr-independence faithfulness for non-nrr-CAN functional equations (Assumption 4), if there exist $S$ and $\hat f_y(X; S)$ such that $\hat\varepsilon_{y|X;S} \perp X$, with $\hat\varepsilon_{y|X;S} \equiv Y - \hat f_y(X; S)$, and $\hat\varepsilon_{x|Y;S} \not\perp Y$ for any regression $\hat f_x(Y; S)$ and all $S \subseteq \mathbf{S}$, with $\hat\varepsilon_{x|Y;S} \equiv X - \hat f_x(Y; S)$, then there is no causality from $Y$ to $X$.'

Proof of Proposition 5: If there exist $S$ and $\hat f_y(X; S)$ such that $\hat\varepsilon_{y|X;S} \perp X$, it does not hold that $\hat\varepsilon_{y|X;S} \not\perp X$ for any $\hat f_y(X; S)$. By Assumption 4, this implies that either $Y \notin Pa_X$ or the functional equation of $X$ has the nrr-CAN form for $Y$ conditioning on $S$ for some $S \subseteq \mathbf{S}$. The latter is excluded because $\hat\varepsilon_{x|Y;S} \not\perp Y$ for any regression $\hat f_x(Y; S)$ and all $S \subseteq \mathbf{S}$. $\square$

This criterion is analogous to the one based on cv-independencies. However, because nrr-independencies rely on regression, there is an extra condition requiring that the dependencies hold for any possible regression. Theoretically, this is an additional requirement for applying nrr-independencies to causal discovery, as opposed to cv-independencies. Pragmatically, it reduces to the requirement of a good regression model, in the same way that we need a good estimate of the conditional variances.
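In practice, the criterion of Proposition 5 requires a nonlinear regression plus an independence test on its residuals; published implementations use, for example, Gaussian-process regression with kernel independence tests (Hoyer et al., 2009; Mooij et al., 2009). The following minimal sketch replaces both with crude stand-ins (a piecewise-linear fit and a residual-variance-ratio statistic, our simplifications rather than the paper's method) to show the directional asymmetry on a toy AN equation:

```python
import numpy as np

def nrr_residual_stat(pred, target, n_bins=20):
    """Crude stand-in for an nrr-independence test: fit a piecewise-linear
    regression of `target` on `pred` (one linear fit per quantile bin of
    `pred`) and return the max/min ratio of the within-bin residual
    variances.  A ratio near 1 is consistent with residuals independent of
    the predictor; a large ratio signals an nrr-dependence."""
    edges = np.quantile(pred, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, pred, side="right") - 1, 0, n_bins - 1)
    variances = []
    for k in range(n_bins):
        p, t = pred[idx == k], target[idx == k]
        a, b = np.polyfit(p, t, 1)           # local linear regression
        variances.append(np.var(t - (a * p + b)))
    return max(variances) / min(variances)

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=40000)
y = x + x**3 + rng.normal(size=40000)        # generative (causal) direction

forward = nrr_residual_stat(x, y)            # residuals ~ independent of X
backward = nrr_residual_stat(y, x)           # residuals depend on Y
```

In the causal direction the residual variance is flat across bins of $X$, while in the anticausal direction it varies strongly with $Y$, which is the asymmetry the criterion exploits.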
Note that the use of nonlinear regressions here differs from their common use in algorithms that infer a global causal ordering (Mooij et al., 2009). In that approach, a regression takes as predictors all the candidate parents of a variable. Conversely, here the regression operates on $X$ with all variables in $S$ conditioned on; or at least, regarding the terms in Eq. 7, it has to estimate the nonlinear component as a function of $X$ and of the subset of its variables that does not appear in any other term, while conditioning on the rest. The relation between a formulation of nrr-independence in terms of conditional regressions and in terms of multivariate regressions will be addressed in future work.

We now examine some concrete examples to understand the different possible effects that conditioning on an extra variable can have, conferring the CAN form on a functional equation or removing it. For this purpose, we first consider systems within the class of linear mixed models (LMMs) (West et al., 2007). This widely applied class of models takes into account the existence of random effects, that is, coefficients of the predictors which are themselves random variables. A functional equation in a linear mixed model has the form

$$ V_i = \sum_k b_{ik} V_{b,k} + \sum_k \epsilon_{ik} V_{\epsilon,k} + \xi_i, \qquad (10) $$

where the sets of parents $\mathbf{V}_b = \{V_{b,1}, \dots, V_{b,n_b}\}$ with fixed coefficients and $\mathbf{V}_\epsilon = \{V_{\epsilon,1}, \dots, V_{\epsilon,n_\epsilon}\}$ with random coefficients can overlap. Here $b_{ik}$ indicates a constant fixed coefficient, while $\epsilon_{ik}$ indicates a random coefficient, that is, $\epsilon_{ik}$ is itself a random variable. For example, $\epsilon_{ik}$ can represent across-subjects variability in the strength of influence of a parent variable. All random coefficients are hidden variables. Furthermore, only a subset of the parents may be observed. For simplicity, we restrict the examples to Gaussian linear mixed models. Because linear Gaussian models belong to the special family of Hoyer et al.
(2009) for which cv-independencies hold symmetrically, this has the advantage that in these examples we can relate cv(nrr)-dependencies only to the presence of random effects introducing nonlinearities in the equations. LMM equations are in the AN form only if the random coefficients vanish; a CAN form can be obtained by conditioning on the parents in $\mathbf{V}_\epsilon$. We use LMM models for illustrative purposes because the connection between random effects and cv(nrr)-dependencies facilitates the explanation. However, as is clear from the general form of the functional equations that can have the cv(nrr)-CAN forms according to Theorems 1 and 2, cv(nrr)-independencies will exist in a much wider class of systems than LMM models.

Figure 1: Examples of the effect of conditioning on cv-independencies. The examples (panels A-D) represent Gaussian linear mixed models (Eq. 10), with random coefficients indicated by edges labeled $\epsilon$. The corresponding cv-dependencies between $X$ and $Y$, conditioned or unconditioned on $Z$, are collected in Table 1. Systems with the same causal structures as in panels A and B but more general functional equations are described in Eqs. 11-13.

Table 1: Cv-independencies in the examples of Figure 1 ($\perp$: independent; $\not\perp$: dependent). The last column indicates whether a potential cause can be inferred with the criterion of Proposition 4.

         $\sigma_{Y|X}$   $\sigma_{X|Y}$   $\sigma_{Y|X,Z}$   $\sigma_{X|Y,Z}$   Potential cause?
    A    $\perp$          $\not\perp$      $\perp$            $\not\perp$        Yes
    B    $\not\perp$      $\not\perp$      $\perp$            $\not\perp$        Yes
    C    $\not\perp$      $\not\perp$      $\perp$            $\perp$            No
    D    $\perp$          $\perp$          $\perp$            $\not\perp$        No

We will later discuss general versions of these examples, sharing the same causal structures as in Figure 1 but with a more general form of the functional equations. Note also that the random coefficients do not play any special role other than being hidden variables that appear multiplicatively with the observed variables.

Figure 1 shows examples of the different effects that conditioning has on the cv (or nrr) independence asymmetry. For simplicity, from now on we describe these examples referring only to cv-independencies, but the same reasoning holds for nrr-independencies. To reflect the form of the equation in the graphical representation, we indicate by an $\epsilon$-labeled arrow from $V_i$ to $V_j$ the presence of $\epsilon_{V_i}$ in the equation of $V_j$; as mentioned above, the random effects are just hidden variables. We focus on cv-dependencies between $X$ and $Y$, conditioned or unconditioned on $Z$, which are collected in Table 1.

In Figure 1A, conditioning on $Z$ does not alter the asymmetry. This is because it is the influence of $\epsilon_V$ on $Y$ that leads to a cv-dependence in the direction $Y \to X$. Because $V$ is independent of $X$, $\epsilon_V$ acts effectively as a source of noise on $Y$, and the equation of $Y$ has the CAN form for $X$, conditioned or unconditioned on $Z$. The contribution of $\epsilon_V$ is not Gaussian, which brings the distribution of $X$ and $Y$ out of the special family of Hoyer et al. (2009) and leads to cv-dependencies in the direction opposite to causality. In this case, if $V$ is observable, the collider $V \to Y \leftarrow X$ can be identified using conditional independencies; otherwise, $\sigma_{Y|X} \perp X$, $\sigma_{X|Y} \not\perp Y$ provides new causal information.

In Figure 1B, conditioning on $X$ activates the collider $V \to X \leftarrow Z$, activating a path of dependence between $V$ and $Y$. Changes in the mean of $V$ modulate the variance of the random-effect term, leading to $\sigma_{Y|X} \not\perp X$. In the opposite direction, again $\epsilon_V$ acts as a source of non-Gaussian noise, leading to $\sigma_{X|Y} \not\perp Y$.
Conditioning on $Z$ inactivates the alternative path between $V$ and $Y$, providing the CAN form to the equation of $Y$. The non-Gaussian influence from $\epsilon_V$ results in the asymmetry $\sigma_{Y|X,Z} \perp X$, $\sigma_{X|Y,Z} \not\perp Y$.

In Figure 1C, conditioning does not help to find an asymmetry. When $Z$ is not conditioned on, conditioning on either $X$ or $Y$ changes the mean of $Z$, which modulates the variance of the random-effect term. After conditioning on $Z$, the system reduces to a linear Gaussian model, leading to a symmetry of cv-independencies.

Finally, in Figure 1D, conditioning creates a misleading asymmetry. Because the random effect only affects $Z$, without conditioning on $Z$ the system is linear Gaussian, resulting in a symmetric cv-independence. After conditioning on $Z$, a dependence is created between the random effect and both $X$ and $Y$. Because $Y \perp Z \mid X$, this dependence is inactivated when further conditioning on $X$, leading to $\sigma_{Y|X,Z} \perp X$; that is, in this case the cv-independence results from a more general conditional independence. In the opposite direction, $Y$ cannot inactivate the dependence between $X$ and $Z$ ($X \not\perp Z \mid Y$), and the effect of the random coefficient leads to $\sigma_{X|Y,Z} \not\perp Y$. Because the asymmetry appears only after conditioning on $Z$, the extra check of Proposition 4 detects that it is not reliable for inferring a potential cause from $X$ to $Y$.

These examples do not cover all possible effects of conditioning, but they indicate that conditioning can maintain an informative asymmetry (Figure 1A), create an informative asymmetry (Figure 1B), exchange symmetries of cv-dependencies and cv-independencies (Figure 1C), and create a misleading asymmetry that has to be detected by the extra checks of Proposition 4 (Figure 1D). Note that the graphs of Figure 1 do not have the structure of DAGs over all variables, since the random-effect variables are assigned to edges instead of nodes.
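The mechanism behind these examples, a hidden random coefficient whose effect conditioning can neutralize, can be sketched numerically. The following system is a hypothetical minimal instance of Eq. 10 (not one of the exact panels of Figure 1), with $Z \to X$, $X \to Y$, and a random coefficient on the influence of $Z$ on $Y$:

```python
import numpy as np

def binned_var_ratio(x, y, n_bins=10):
    """Max/min ratio of Var(Y | X in bin) over quantile bins of X;
    a ratio near 1 is consistent with a cv-independence."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    v = [y[idx == k].var() for k in range(n_bins)]
    return max(v) / min(v)

rng = np.random.default_rng(0)
n = 200000
z = rng.normal(size=n)
x = z + rng.normal(size=n)              # Z -> X
eps = 2.0 * rng.normal(size=n)          # hidden random coefficient of Z
y = x + eps * z + rng.normal(size=n)    # X -> Y and Z -> Y with random effect

# Unconditionally Var(Y|X) = E[Z^2|X] Var(eps) + const grows with |X|:
# no cv-independence, the equation of Y is not in the AN form.
ratio_uncond = binned_var_ratio(x, y)

# Conditioning on Z (here, restricting to Z ~ 0) makes the random-effect
# term act as pure additive noise: the equation acquires the CAN form.
mask = np.abs(z) < 0.1
ratio_cond = binned_var_ratio(x[mask], y[mask])
```

Unconditionally, the binned variance of $Y$ grows with $|X|$, whereas restricting to $Z \approx 0$ flattens it, illustrating how conditioning on a random-effect parent confers the CAN form.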
However, the way they provide information about cv(nrr)-independencies suggests that graphical criteria can be used to read cv(nrr)-independencies. A formal introduction of such graphical criteria will be described in forthcoming work.

We now discuss more general forms of systems that lead to the cv-independence asymmetries reported in rows A and B of Table 1, corresponding to the causal structures of Figure 1A-B, that is, the examples for which it is possible to infer a potential cause from $X$ to $Y$. The pattern of cv-independencies of Table 1, row A, is more generally compatible with any system of the form

$$ Z = \eta_z; \quad X = b_{xz} Z + \eta_x; \quad Y = b_{yx} X + b_{yz} Z + f_y(V, \epsilon, \varepsilon_y), \qquad (11) $$

where $\eta$ indicates a Gaussian noise. We follow the same notational rule as in Section 4, writing the functional equations in the most generic form possible given the constraints we require. This class of systems is more general than Gaussian LMM models, since $f_y$ can have any form, including nonlinearities, and $\varepsilon_y$ can be non-Gaussian. This is because, with respect to $X$, the component $f_y(V, \epsilon, \varepsilon_y)$ acts as an additive noise, in agreement with the CAN form of Eq. 8. Furthermore, the pattern $\sigma_{Y|X,Z} \perp X$ and $\sigma_{X|Y,Z} \not\perp Y$ of Table 1, row A, which by itself allows inferring the potential cause from $X$ to $Y$, holds for a larger class of systems compatible with the causal structure of Figure 1A:

$$ Z = \varepsilon_z; \quad X = f_x(Z, \varepsilon_x); \quad Y = f_{y,1}(X, Z) + f_{y,2}(Z, V, \epsilon, \varepsilon_y), \qquad (12) $$

where all noises can have generic distributions and $f_x$, $f_{y,1}$, and $f_{y,2}$ are generic and can be nonlinear. Again, after conditioning on $Z$, given the causal structure of Figure 1A and the form of the functional equation of $Y$ in Eq.
12, the cv-CAN form holds according to Theorem 1. In the same way, the pattern of Table 1, row B, is also obtained for a much wider class of functional equations compatible with the causal structure of Figure 1B:

$$ Z = \varepsilon_z; \quad X = f_x(Z, V, \epsilon, \varepsilon_x); \quad Y = f_{y,1}(X, Z) + f_{y,2}(Z, \varepsilon_y), \qquad (13) $$

where all noises have generic distributions and $f_x$, $f_{y,1}$, and $f_{y,2}$ are generic.

The analysis of these concrete examples illustrates how, when the functional equations are known or hypothesized, Theorem 1 (or Theorem 2) allows determining which cv (or nrr) independencies exist. In application to data, the criterion of Proposition 4 (or Proposition 5) would be applied after estimating the cv(nrr)-independencies, and the patterns displayed in Table 1 determine whether a potential cause would be inferred.

Finally, we briefly consider how post-nonlinear AN equations (Zhang and Hyvärinen, 2009) can also be extended to a post-nonlinear CAN form. From Theorems 1 and 2, it is straightforward to derive the same conditions for CAN post-nonlinear forms, simply considering that the conditions apply to $h_i^{-1}(V_i)$ in Eq. 2. However, this class of models can be further generalized. To see this, consider a functional equation of the form

$$ Y = h_1\big(h_2\big(h_3(X, V_1, U_1, \varepsilon_y)\big) + h_4(X, V_2, U_2)\big), \qquad (14) $$

where both $h_1$ and $h_2$ are nonlinear invertible functions and $h_3$ is a function that has the CAN form for $X$ given a certain conditioning set $S$, with $X$ the parent of interest for which we examine the causal relation with $Y$. The equation can be re-expressed as

$$ h_2^{-1}\big(h_1^{-1}(Y) - h_4(X, V_2, U_2)\big) = h_3(X, V_1, U_1, \varepsilon_y). \qquad (15) $$

If $U_2 = \emptyset$, considering the set $S' = S \cup V_2$ and using the same notation as in Eq. 8 for the CAN function $h_3$, Eq. 15 has the form

$$ h(Y, X; S') = f(X; S') + \xi_{y|S'}. \qquad (16) $$

Exploiting a model of this type requires estimating the functions $h$ and $f$ to minimize the information between $X$ and $\hat\xi_{y|S'} \equiv \hat h(Y, X; S') - \hat f(X; S')$.
If $X$ is not an argument of $h$, this reduces to the same estimation problem studied in Zhang and Hyvärinen (2009), with $\hat\xi_{y|S'} \equiv \hat h(Y; S') - \hat f(X; S')$.

The form of Eq. 14 suggests a generalization by iterative composition of two operations: an invertible nonlinear univariate transformation $g(z)$ and the bivariate sum $s(z_1, z_2) = z_1 + z_2$. Starting from a function $h(X, V, U, \varepsilon_y)$ that has the CAN form for $X$ given a certain conditioning set $S$, a set of invertible nonlinear functions $g_i(z)$, $i = 1, \dots, m$, and a set of arguments $z_{2,i} = \tilde f_i(X, V_i)$, $i = 1, \dots, m$, the functional equation of $Y$ can be constructed by iterating the composition, starting from $g_2(g_1(h) + z_{2,1}) + z_{2,2}$ and continuing with $s_k\big(g_k\big(s_{k-1}(g_{k-1}, z_{2,k-1})\big), z_{2,k}\big)$. Because all functions $g_k$ are invertible, the functional equation of $Y$ can be expressed in the form of Eq. 16 by inverting the operations. As in the case of Eq. 14, if $X$ is not an argument of the functions $\tilde f_k$, the expression further simplifies to the form studied in Zhang and Hyvärinen (2009). The required conditioning set is $S \cup \{V_1, \dots, V_m\}$. The same procedure can be followed replacing the sum operation by the product. This procedure results in increasingly complex functional equations for which, in principle, cv-independencies and nrr-independencies can be tested. In practice, the difficulty of the estimation problem of Eq. 16 will depend on the number of these operations, on the extra variables introduced in the functions analogous to $h_4$, on the number of variables in $h(X, V, U, \varepsilon_y)$, and on the complexity of the functions.

In this paper we extended the theory behind the AN framework for structure learning in several ways. We first introduced an alternative regression-free test of independence. This test does not require the reconstruction of the additive noise using the residuals of a nonlinear regression.
Instead of testing the independence between the residuals and the parents of a variable (nrr-independencies), it indirectly evaluates the independence between the noise variance and the parents using conditional variances (cv-independencies). The use of cv-independencies is expected to be especially useful when the form of the functional equation is complex: in that case, the family of regression models used may not be powerful enough to capture the form of the actual dependencies, and our indirect estimate of independencies may be particularly beneficial. On the other hand, the examination of cv-independencies and nrr-independencies is not mutually exclusive, and the two could be combined to improve learning.

We formulated all the other contributions of this work both for cv-independencies and for nrr-independencies. In the latter case, the implementations of nonlinear regressions developed in previous work (see the actual implementations provided by Hoyer et al., 2009; Mooij et al., 2009; Peters et al., 2014; Bühlmann et al., 2014) can already be applied to implement this extended framework. We generalized AN models to partial conditionally-additive-noise (CAN) models with hidden variables. In these models, only some functional equations, and only for certain parents, have the AN form, possibly after conditioning. We determined when a functional equation has the CAN form that results in cv (or nrr) independencies. Exploiting asymmetries in cv (or nrr) independencies, we then introduced a criterion to infer the causal relation between specific pairs of variables in a multivariate system with hidden variables, without restrictions on the form of the functional equations. The criterion can be applied locally, if the CAN form holds for a certain functional equation, and without inferring a global causal ordering (Mooij et al., 2009).
Because the class of functional equations that have a CAN form is substantially larger than the class of pure additive-noise functional equations, we can expect that cv(nrr)-independencies induced by the CAN form will exist more often, and hence that in more practical cases the AN framework will increase the inferential power of standard methods based on conditional independencies. The magnitude of this increase will be specific to each domain of application, depending on the properties of the generative functional equations.

The new criterion can readily be applied to complement the existing algorithms that, in the presence of hidden variables, extract equivalence classes of causal structures given conditional independencies (Spirtes et al., 2000; Drton and Maathuis, 2017; Heinze-Deml et al., 2018). As with any standard causal-orientation rule used in constraint-based structure-learning algorithms (e.g. Spirtes et al., 2000), this new criterion relies on faithfulness assumptions. While it is an ongoing subject of research to understand when faithfulness holds (Uhler et al., 2013), only under these types of assumptions can the corresponding analysis of independencies be applied for structure learning. In future work we will address in full detail how to exploit the new criterion in combination with conditional independencies as part of a structure-learning algorithm.
Acknowledgments
This research was supported by the NIH Brain Initiative (Grant No. U19 NS107464) and by the Fondation Bertarelli.
Appendix
Proof of Proposition 3

Proof of Proposition 3: We first prove that cv-independence faithfulness implies nrr-independence faithfulness. Suppose a nonlinear regression is implemented such that $\hat\varepsilon_y \equiv Y - \hat f_y(X, S)$ is independent of $X$ despite $Y \in Pa_X$. Then the statistical model $Y = \hat f_y(X, S) + \hat\varepsilon_y$ has the AN form, and it follows that $\sigma_{Y|X,S} \perp X$. Given that $\hat\varepsilon_y \perp X$ implies $\sigma_{Y|X,S} \perp X$, conversely $\sigma_{Y|X,S} \not\perp X$ implies $\hat\varepsilon_y \not\perp X$. Because cv-independence faithfulness assumes $\sigma_{Y|X,S} \not\perp X$ for all $S \subseteq ND(X)$ with $Pa_X \setminus Y \subseteq S$, this implies $\hat\varepsilon_y \not\perp X$ for all $S \subseteq ND(X)$ with $Pa_X \setminus Y \subseteq S$, which corresponds to the assumption of nrr-independence faithfulness. We now justify that nrr-independence faithfulness does not imply cv-independence faithfulness. To see this, it suffices to realize that nrr-independence requires that all moments of the residual variable $\hat\varepsilon_y$ be independent of $X$, whereas cv-independence only requires that the variance of the residual variable be independent. The distribution $p(\hat\varepsilon_y | x, s)$ can be such that the dependence appears only in the third and higher moments; in that case, cv-independence holds despite nrr-dependence. $\square$

Proof of Theorem 1 and Theorem 2

We first prove the if-and-only-if conditions of Theorem 1 for the functional equation of $Y$ to be in the cv-CAN form with respect to a parent $X$ given the set $S$ when $X$ is adjacent to all other potential causes of $Y$.

Proof of Theorem 1: We proceed by justifying the necessary and sufficient requirements for each set of hidden and observed variables of Eq. 4. First, the set of hidden variables that appear together with $X$ as arguments of the nonlinear component of Eq. 4 must be empty, because $X$ modulates the variance of any such variable. Also for the hidden variables entering the other nonlinear component, dependencies on $X$ produce a change in its variance due to the nonlinearity, even if its observed arguments are conditioned on. Because these variables are hidden, we have to require $X \perp U_k \mid S$ for each of them, and, for the same reason, for the hidden variables that contribute linearly.
On the other hand, the hidden variables that contribute linearly to $Y$ after conditioning on their accompanying observed variables behave analogously. Accordingly, to prevent these terms from introducing a dependence of $\sigma_{Y|X,S}$ on $X$, it is required that $\sigma_{U_k|X,S} \perp X$ for each such hidden variable $U_k$, and that the covariances fulfill $\sigma_{Z_i Z_j|X,S} \perp X$ for every pair $Z_i, Z_j$ of linearly contributing terms.

Regarding the observable variables, the parents that enter the nonlinear components must be included in $S$, because otherwise $X$ modulates their variance through the nonlinearity, and because they can modulate the variance of the hidden variables that accompany them. The same holds for the observed variables whose conditioning renders other contributions linear, and for the parents that modulate the variance of $\varepsilon_y$. Regarding the parents that contribute only linearly, we can divide them into two groups: the variables $V_i$ with $\sigma_{V_i|X,S} \perp X$ may not need to be conditioned on, because although they are not independent of $X$, the dependence does not affect their conditional variance; the remaining ones must be included in $S$. Because the linearly contributing parents are observable, one option is to try to find a valid $S$ conditioning on all of them; alternatively, we can exclude from the conditioning set some subsets of the variables whose conditional variance is independent of $X$. If such subsets are excluded, we also need to require that the covariance between any pair of variables from these subsets is not modulated by $X$, and similarly that their covariance with the other linear terms in $S$ is independent of $X$. Together, this guarantees that the linear contributions of the functional equation do not create a dependence of $\sigma_{Y|X,S}$ on $X$. Altogether, the observable variables to be included in $S$ are those listed in Theorem 1.
Because for each set of variables we described the requirements necessary and sufficient to eliminate its contribution to any dependence of the conditional variance, the fulfillment of these requirements leads to cv-independence. $\square$

Proof of Theorem 2: The proof is analogous to that of Theorem 1; we only highlight the differences. In contrast to cv-independence, nrr-independence concerns all moments of the residuals. According to Eq. 7, the hidden and observed variables that remain unconditioned contribute linearly to $Y$ after conditioning on $S$. The regression will result in a residual for $Y$ that can be decomposed as comprising a contribution from each of these linearly additive terms, proportional to the residual from the separate regression of each such variable $Z$. That is, the regression $\hat f_y(X; S)$ will contain a component fitted to each conditional mean $\mu_{z|X;S}$. Accordingly, the residual of $Y$ has a contribution from the residuals $\varepsilon_{z|X;S} \equiv Z - \mu_{z|X;S}$, and hence any dependence of $Z$ on $X$ in any moment other than the mean produces an nrr-dependence between $X$ and the residual of $Y$. $\square$

The special family of distributions with bidirectional statistical CAN form
We show here that, analogously to the case of pure AN equations studied in Hoyer et al. (2009), there is a special family of joint distributions $p(X, Y|S)$ that allows a CAN statistical form in both directions. The proof relies on the one of Hoyer et al. (2009). It suffices to realize that when the functional equation of $Y$ admits the CAN form for $X$ given the set $S$, the bivariate distribution $p(X, Y|S)$, for fixed $S$, can be expressed in the same form used in the proof of Hoyer et al. (2009). In particular, using the notation of Eq. 8,

$$ \log p(X, Y|S) = \log p_{\xi_{y|S}}\big(Y - f(X; S)\big) + \log p(X), \qquad (17) $$

which is analogous to Equation 5 in Hoyer et al. (2009). The rest of the proof follows equivalently.

References
Bühlmann, P., Peters, J., and Ernest, J. (2014). CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526-2556.

Drton, M. and Maathuis, M. H. (2017). Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4:365-393.

Heinze-Deml, C., Maathuis, M. H., and Meinshausen, N. (2018). Causal structure learning. Annual Review of Statistics and Its Application, 5:371-391.

Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., and Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. Proceedings of the 21st Conference on Advances in Neural Information Processing Systems (NIPS 2008), pages 689-696.

Janzing, D., Peters, J., Mooij, J. M., and Schölkopf, B. (2009). Identifying confounders using additive noise models. Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 249-257.

Janzing, D. and Steudel, B. (2010). Justifying additive-noise-model based causal discovery via algorithmic information theory. Open Systems and Information Dynamics, 17(2):189-212.

Lütkepohl, H. (2006). New Introduction to Multiple Time Series Analysis. Springer-Verlag, Berlin.

Mooij, J. M., Janzing, D., Peters, J., and Schölkopf, B. (2009). Regression by dependence minimization and its application to causal inference. Proceedings of the 26th International Conference on Machine Learning (ICML), pages 745-752.

Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. (2016). Distinguishing cause from effect using observational data: Methods and benchmarks. Journal of Machine Learning Research, 17:1-102.

Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2nd edition.

Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA.

Peters, J., Mooij, J. M., Janzing, D., and Schölkopf, B. (2014). Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15:2009-2053.

Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition.

Tillman, R., Gretton, A., and Spirtes, P. (2009). Nonlinear directed acyclic structure learning with weakly additive noise models. Proceedings of the 22nd Conference on Advances in Neural Information Processing Systems (NIPS 2009), pages 1847-1855.

Uhler, C., Raskutti, G., Bühlmann, P., and Yu, B. (2013). Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, 41(2):436-463.

West, B. T., Welch, K. B., and Galecki, A. T. (2007). Linear Mixed Models: A Practical Guide Using Statistical Software. Chapman and Hall/CRC, New York.

Wibral, M., Vicente, R., and Lizier, J. T. (2014). Directed Information Measures in Neuroscience. Springer-Verlag, Berlin Heidelberg.

Zhang, K. and Hyvärinen, A. (2009). On the identifiability of the post-nonlinear causal model. Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI).