Perturbations and Causality in Gaussian Latent Variable Models
Armeen Taeb and Peter Bühlmann*
Seminar for Statistics, ETH Zürich
*Correspondence email: [email protected]
Abstract
Causal inference is understood to be a very challenging problem with observational data alone. Without making additional strong assumptions, it is typically only possible given access to data arising from perturbing the underlying system. To identify causal relations among a collection of covariates and a target or response variable, existing procedures rely on at least one of the following assumptions: i) the target variable remains unperturbed, ii) the hidden variables remain unperturbed, and iii) the hidden effects are dense. In this paper, we consider a perturbation model for interventional data (involving soft and hard interventions) over a collection of Gaussian variables that does not satisfy any of these conditions and can be viewed as a mixed-effects linear structural causal model. We propose a maximum-likelihood estimator, dubbed DirectLikelihood, that exploits system-wide invariances to uniquely identify the population causal structure from perturbation data. Our theoretical guarantees also carry over to settings where the variables are non-Gaussian but are generated according to a linear structural causal model. Further, we demonstrate that the population causal parameters are solutions to a worst-case risk with respect to distributional shifts from a certain perturbation class. We illustrate the utility of our perturbation model and the DirectLikelihood estimator on synthetic data as well as real data involving protein expressions.
1 Introduction

Identifying causal relations from observational data is challenging and often impossible. In the context of SCMs [24, 21], one possibility is to find the Markov equivalence class (MEC) of directed acyclic graphs under the faithfulness assumption [32] or the beta-min condition [31]. Some of the well-known algorithms for structure learning of MECs with observational data include the constraint-based PC algorithm [29], the score-based greedy algorithm GES [5], and hybrid methods that integrate constraint-based and score-based approaches such as ARGES [19]. In many applications though, we have available both observational and interventional or perturbation data, where the latter come from experiments with unknown targets. In genomics, for example, with the advance of gene-editing technologies, high-throughput interventional gene expression data is being produced [8]. Interventional data can be viewed as perturbations to components of the system and can offer substantial gains in identifiability: [12] demonstrated that combining interventional with observational data reduces ambiguity and enhances identifiability to a smaller equivalence class than the MEC, known as the I-MEC (interventional MEC). A variety of methods have been proposed for causal structure learning from observational and interventional data. These include the modified GES algorithm by [12] known as GIES, permutation-based causal structure learning [33], the penalized maximum-likelihood procedure in Gaussian models [14], and methods based on the causal invariance framework [17, 23]. For a more comprehensive list, see [9] and the references therein.

A common challenge for accurate structure learning is that there may be hidden variables for which it is expensive or impossible to obtain sample observations. Such unobserved variables pose a significant difficulty as the causal graphical model structure is not closed under marginalization; the graphical structure corresponding to the marginal distribution of the observed variables therefore contains potentially many confounding dependencies that are induced by the marginalization over the hidden variables. There are causal structure learning methods that account for the presence of hidden variables. In the observational setting, two prominent examples are Fast Causal Inference (and its variants) [29] for DAG learning and the two-stage deconfounding procedure [11], which uses the sparse-plus-low-rank decomposition framework [4] as the first stage and a standard DAG learning procedure in the second stage. As discussed earlier, the two-stage deconfounding procedure will only enable us to infer a certain MEC but not the causal parameter and structure itself. In the observational and interventional setting with unperturbed hidden variables and only shift interventions on the observed covariates, Causal Dantzig [25] consistently estimates the causal relations of a target variable assuming that the interventions do not directly affect the target variable. Such an assumption is relaxed in the backShift procedure [27], although it still requires that the hidden variables remain unperturbed for identifying the causal structure.

Guaranteed identifiability using these previous techniques for perturbation data relies on at least one of the following assumptions: i) the target variable remains unperturbed, ii) the hidden variables remain unperturbed, and iii) the hidden effects are dense. In this paper, we propose a modeling framework and an estimator that do not rely on any of these assumptions and yet identify the population DAG structure.
Fig 1 demonstrates a toy example of our setup among 4 observed variables $X_1, X_2, X_3, X_4$ and hidden variables $H$, where $\mathcal{A}$ represents external variables (external to the graphical structure among observed and hidden variables) that provide perturbations.

Figure 1: Toy example of 4 observed variables $X_1, X_2, X_3, X_4$ and hidden variables $H$, where solid lines are connections among observed variables and dotted lines are connections between observed and hidden variables; left: without perturbations, right: perturbations $\mathcal{A}$ on all components indicated with red dotted lines.

We consider a Gaussian structural causal model (SCM) specifying the perturbation model and the relationship between $p$ observed variables $X \in \mathbb{R}^p$ and hidden variables $H \in \mathbb{R}^h$. We consider the setting with heterogeneous grouped data from different environments $e \in \mathcal{E}$. Here, $e$ denotes the index of an environment or a sub-population and $\mathcal{E}$ is the space of different observed environments. As we will formalize in Section 2.1, each group or environment $e$ corresponds to some perturbations of the underlying SCM. The grouped data, across different environments, is denoted by $(X^e, H^e)$ with $e \in \mathcal{E}$. The SCM is parameterized by a connectivity matrix encoding the causal relationships among the observed variables, a coefficient matrix encoding the hidden variable effects, and nuisance parameters involving the noise variances and perturbation magnitudes among all of the variables. A key property of this modeling framework is that the connectivity matrix and the hidden variable coefficient matrix remain invariant across all of the perturbation environments. With this insight, we propose a maximum-likelihood estimator, dubbed DirectLikelihood, to score a given DAG structure.
DirectLikelihood provides a flexible framework to incorporate additional knowledge, including do-interventions when their intervention locations are known, or additional information on the perturbation structure (such as statistically identical perturbations on all of the observed variables). Further, the framework can be specialized to the setting considered by [25, 27] where the hidden variables are not perturbed across environments (i.e. $\mathcal{A}$ does not point to $H$ in Fig 1), or to the setting where there is no hidden confounding (i.e. $H$ does not point to the covariates in Fig 1).

Besides the novel methodology, we provide conditions under which DirectLikelihood correctly identifies the population DAG structure. In particular, we demonstrate that with at least two interventional environments, where one of the environments consists of sufficiently large interventions on each of the observed variables, and with the hidden effects satisfying a hidden spuriousness assumption, DirectLikelihood provides consistent estimates. The hidden spuriousness assumption for an environment $e$ states that the hidden variables induce confounding dependencies among the observed variables; formally, there exists at least one pair of variables $(X^e_k, X^e_l)$ such that $X^e_k \perp\!\!\!\perp X^e_l \mid \{X^e_{\setminus\{k,l\}}, H^e\}$ and $X^e_k \not\perp\!\!\!\perp X^e_l \mid X^e_{\setminus\{k,l\}}$, where $X^e_{\setminus\{k,l\}}$ denotes the collection of variables $X^e$ excluding $X^e_k, X^e_l$. The hidden spuriousness assumption is substantially weaker than the hidden denseness assumption required in the two-stage deconfounding procedure in [11], which insists that there are many pairs of variables satisfying the condition above. Our theoretical results are further specialized to the setting where the hidden variables remain unperturbed across all of the environments or where there are no hidden variables. When the hidden variables are unperturbed, DirectLikelihood requires no assumption on the hidden structure for identifiability, whereas the two-stage deconfounding procedure still requires hidden denseness. We remark that the main focus of our analysis is on identifiability guarantees; we discuss in Section 7 future work on understanding high-dimensional consistency properties of the DirectLikelihood procedure.

Further, we highlight a connection between distributional robustness and the causal parameters in our perturbation model. Specifically, we prove that the population causal parameters are solutions to a worst-case risk over the space of DAGs and distributional shifts from a certain perturbation class. Here, the risk is measured by the Kullback-Leibler divergence between the estimated and population Gaussian distributions. As with DAG identifiability, existing results on the relation between causality and distributional robustness rely on the stringent assumption that the perturbations do not directly affect the target variable or the hidden variables [3, 26]. The results in this paper provide a more complete picture of the connection between perturbations, causality, and distributional robustness (see also Table 1).

As our final contribution, we propose an optimization procedure to solve DirectLikelihood, and we demonstrate the utility of our proposed estimator with synthetic data and real data involving protein expressions [28].
The outline of this paper is as follows. In Section 2, we describe the model for observational and perturbation data and its representation as a mixed-effects model, and then present the maximum-likelihood estimator DirectLikelihood to score a given DAG structure. In Section 3, we provide theoretical guarantees for the optimally scoring DAGs (scored via DirectLikelihood). In Section 4, the connection between the causal parameters of the proposed perturbation model and distributional robustness is explored. In Section 5, we present an optimization strategy for solving DirectLikelihood for a given DAG structure and show how to use it to obtain the best scoring DAGs. In Section 6, we demonstrate the utility of our approach with real and synthetic data. We conclude with a discussion and future research directions in Section 7.
We have mentioned differences to backShift [27] and the two-stage deconfounding procedure and provide more comparisons throughout the paper. DirectLikelihood is similar in spirit to approaches based on invariance principles [23, 25], as it exploits certain model parameters (such as the connectivity matrix and the hidden variable effects) remaining unchanged across perturbations. However, a key difference between DirectLikelihood and these other techniques, in addition to its ability to incorporate perturbations on the hidden variables, is that DirectLikelihood models the entire system of observed variables as opposed to just the regression of the target variable on the remaining observed variables. The virtue of this system-wide modeling is that all of the variables can experience perturbations without sacrificing consistency guarantees, while the methods [23, 25] assume that the perturbations do not directly affect the target variable. This perspective was also adopted in the backShift procedure [27], although DirectLikelihood can additionally allow for perturbations on the hidden variables. For a summary of the assumptions for DirectLikelihood as compared to competing methods, see Table 1.
Method                    | Perturbed target | Unperturbed hiddens   | Perturbed hiddens
IV, ICP, Causal Dantzig   | ✗                | ✓                     | ✗
two-stage deconfounding   | ✓                | ✓ (hidden denseness)  | ✓ (hidden denseness)
backShift                 | ✓                | ✓                     | ✗
DirectLikelihood          | ✓                | ✓                     | ✓ (hidden spuriousness)
Table 1: Comparison of DirectLikelihood with competing methods in the following settings: the target variable is perturbed, the hidden variables are unperturbed, and the hidden variables are perturbed. The methods are Instrumental Variables (IV) [1], Invariant Causal Prediction (ICP) [23], two-stage deconfounding [11] tailored for observational and interventional data, backShift [27], and our proposal DirectLikelihood.

Notation: We denote the identity matrix by $I$, with the size being clear from context. The collection of positive-semidefinite matrices in $\mathbb{S}^d$ is denoted by $\mathbb{S}^d_+$ and the collection of strictly positive-definite matrices by $\mathbb{S}^d_{++}$. The collection of non-negative vectors in $\mathbb{R}^d$ is denoted by $\mathbb{R}^d_+$ and strictly positive vectors by $\mathbb{R}^d_{++}$. Given a matrix $M \in \mathbb{R}^{d \times d}$ and a set $S \subseteq \{1, 2, \dots, d\}$, we denote the restriction of $M$ to rows and columns indexed by $S$ by $[M]_S$. We denote the number of nonzeros in a matrix $M \in \mathbb{R}^{p \times p}$ by $\|M\|_{\ell_0}$, and we apply a similar notation to count the number of edges in a graph. Given a DAG $\mathcal{D}$, we denote by moral($\mathcal{D}$) the moralization of $\mathcal{D}$. We denote the index set of the parents of a random variable $X_p$ by PA($p$), and the index sets of the descendants and ancestors by DES($p$) and ANC($p$), respectively. Further, letting $\mathcal{D}$ be the DAG underlying a collection of variables $(X, H)$, we denote the subgraph of $\mathcal{D}$ restricted to the variables $X$ by $\mathcal{D}_X$ and likewise for $\mathcal{D}_H$. Given a matrix $M \in \mathbb{R}^{d \times d}$, we denote by $\|M\|_2$ its largest singular value (spectral norm). For two vectors $z_1, z_2 \in \mathbb{R}^d$, we write $z_1 \succeq z_2$ to denote element-wise inequality. Finally, for random variables $V_1, V_2$ and a random vector $Z$, we use the notation $\rho(V_1, V_2 \mid Z)$ to denote the partial correlation between $V_1$ and $V_2$ given $Z$.

2 The perturbation model and the DirectLikelihood estimator

In Section 2.1, we describe a data generation process associated with the perturbation model in Fig 1. In Section 2.2, we propose DirectLikelihood, a maximum-likelihood estimator with respect to the marginal distribution of the observed variables. DirectLikelihood identifies estimates of the unknown perturbation effects, the hidden effects, and the causal relations among the observed variables.
2.1 A model for perturbation data

We consider a directed acyclic graph $\mathcal{D}^\star$ whose $p + h$ nodes correspond to jointly Gaussian and centered random variables $(X, H) \in \mathbb{R}^p \times \mathbb{R}^h$, where $X$ are observed and $H$ are hidden variables.¹ As described in Section 1.1, our methodology is also applicable in the setting where one may be primarily interested in the causal effects on a target variable. As such, we distinguish $X_p$ as the target or response variable. Owing to the joint Gaussianity of $(X, H)$, the random pair $(X, H)$ satisfies the following (compactified) SCM:

$$X = B^\star X + \Gamma^\star H + \epsilon. \qquad (1)$$

Here, $\epsilon = (\epsilon_1, \epsilon_2, \dots, \epsilon_p)$ are independent Gaussian random variables, independent of $H$, with $\epsilon \sim \mathcal{N}(0, \mathrm{diag}(w^{1,\star}))$ for some $w^{1,\star} \in \mathbb{R}^p_{++}$. The connectivity matrix $B^\star \in \mathbb{R}^{p \times p}$ contains zeros on the diagonal and encodes the causal relationships among the observables $X$, i.e. $B^\star_{k,j} \neq 0$ if and only if $j \in \mathrm{PA}_{\mathcal{D}^\star_X}(k)$. The $p$-th row vector $B^\star_{p,:}$ encodes the causal parents of the target variable and the magnitudes of their effects. The matrix $\Gamma^\star \in \mathbb{R}^{p \times h}$ in (1) encodes the effects of the hidden variables on the observed variables, where $\Gamma^\star_{k,j} \neq 0$ if and only if $j \in \mathrm{PA}_{\mathcal{D}^\star_H}(k)$. For the sake of generality, we do not immediately put any assumptions on the number of hidden variables $h$ or the denseness of their effects.

¹Without loss of generality, we assume that the observed variables are centered.

The compact SCM (1) describes the generating process of $X$ in the observational setting where there are no external perturbations on the system. We next describe how the data generation process changes under perturbations of the variables $(X, H)$. We consider perturbations $\mathcal{A}$ that directly shift the distributions of the random variables by some noise, either acting additively on the system or being a do-intervention that eliminates the connections between the perturbed variable and its parents. These perturbations yield the random pair $(X^e, H^e)$ for each $e \in \mathcal{E}$ satisfying the following SCM:

$$X^e = F_{do(e)^c}\big(B^\star X^e + \Gamma^\star H^e + \epsilon^e + \delta^e\big) + F_{do(e)}(\delta^e), \qquad H^e \sim \mathcal{N}(0, \Psi^{e,\star}), \qquad (2)$$

where for every $e \in \mathcal{E}$, $\epsilon^e \overset{\mathrm{dist}}{=} \epsilon$, the variables $(H^e, \delta^e, \epsilon^e)$ are jointly independent, and the collection $(X^e, H^e, \delta^e, \epsilon^e)$ is independent across $e$. Further, $\delta^e \in \mathbb{R}^p$ is a Gaussian random vector (independent across the coordinates) that represents the additive perturbations, $H^e \in \mathbb{R}^h$ is a Gaussian random vector that represents the perturbed hidden variables with covariance $\Psi^{e,\star} \in \mathbb{S}^h_{++}$, and $do(e) \subseteq \{1, \dots, p\}$ denotes the do-intervention locations in the sub-graph $\mathcal{D}^\star_X$. Finally, $F_S \in \mathbb{R}^{p \times p}$ is a diagonal matrix with ones at the coordinates inside $S \subseteq \{1, \dots, p\}$ and zeros elsewhere. Notice that $\epsilon^e, \delta^e$ are in general not identifiable from the sum $\epsilon^e + \delta^e$ in (2); we specify below in Section 2.1.1 an identifiable parametrization for the terms $\epsilon^e + \delta^e$.

The compactified SCM (2) characterizes the distribution among all of the observed variables and encodes system-wide invariances. Specifically, (2) insists that for every $k = 1, 2, \dots, p$, the regression coefficients when regressing $X^e_k$ on the parent sets $\{X^e_j : j \in \mathrm{PA}_{\mathcal{D}^\star_X}(k)\}$ and $\{H^e_l : l \in \mathrm{PA}_{\mathcal{D}^\star_H}(k)\}$ remain invariant for all environments $e \in \mathcal{E}$. This is a point of departure from instrumental variable techniques or invariant causal prediction in two significant ways: 1) those approaches assume no perturbations on the target variable $X_p$ (i.e. they assume $H^e \overset{\mathrm{dist}}{=} H$ and $\delta^e_p \equiv 0$ for all $e \in \mathcal{E}$), and 2) they only consider "local" invariances arising from the distribution $X^e_p \mid \{(X^e_j, H^e_l) : j \in \mathrm{PA}_{\mathcal{D}^\star_X}(p),\ l \in \mathrm{PA}_{\mathcal{D}^\star_H}(p)\}$.
The virtue of considering a joint model over all of the variables and exploiting system-wide invariances is that we can propose a maximum-likelihood estimator, DirectLikelihood, which identifies the population DAG structure even with perturbations on the target variable and the hidden variables.

The SCM (2) is similar in spirit to previous modeling frameworks in the literature. The authors of [14] consider jointly observational and interventional Gaussian data where the interventions are limited to do-interventions and there are no hidden variables; in the context of (2), this means that $\delta^e \equiv 0$ and $\Gamma^\star \equiv 0$. As such, the framework considered in this paper is a substantial generalization of [14]. Further, the backShift procedure [27] considers the linear SCM (2) with some modifications: i) there are no do-interventions, i.e. $do(e) = \emptyset$ for all $e \in \mathcal{E}$, ii) there are no perturbations to the hidden variables, i.e. $H^e \overset{\mathrm{dist}}{=} H$ for all $e \in \mathcal{E}$, and iii) $B^\star$ may correspond to a cyclic directed graph. In addition, the backShift algorithm relies on exploiting invariances of differences of estimated covariance matrices across environments. Our DirectLikelihood is more in the "culture of likelihood modeling and inference" and has the advantage that it can cope well with having only a few observations per group or environment. This likelihood perspective also fits much more into the context of inference for mixed models, as briefly discussed next.
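For concreteness, the following minimal Python sketch simulates grouped data from the SCM (2); the toy DAG, all numerical values, and the helper name sample_env are our own illustrative choices, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, h = 4, 1  # numbers of observed and hidden variables (toy sizes, as in Fig 1)

# Hypothetical population parameters: a chain DAG X_1 -> X_2 -> X_3 -> X_4.
B = np.zeros((p, p))
B[1, 0], B[2, 1], B[3, 2] = 0.8, -0.6, 0.5      # B[k, j] != 0 iff j is a parent of k
Gamma = rng.normal(size=(p, h))                 # hidden-variable effects Gamma*
w_obs = 0.5 * np.ones(p)                        # observational noise variances w^{1,*}

def sample_env(n, zeta=0.0, psi=0.0, do=()):
    """Draw n samples of X^e from (2): shift magnitude zeta on the observed
    variables, hidden perturbation psi (Psi^{e,*} = (1+psi) I), do-locations `do`."""
    F = np.eye(p)
    F[list(do), list(do)] = 0.0                 # F_{do(e)^c}: zeros at do-locations
    H = rng.normal(scale=np.sqrt(1.0 + psi), size=(n, h))
    eps_delta = rng.normal(scale=np.sqrt(w_obs + zeta), size=(n, p))  # eps^e + delta^e
    do_vals = rng.normal(size=(n, p))           # values injected at do-locations
    # (2) rearranged: (I - F B) X = F (Gamma H + eps + delta) + (I - F) do_vals.
    rhs = (H @ Gamma.T + eps_delta) @ F + do_vals @ (np.eye(p) - F)
    return np.linalg.solve(np.eye(p) - F @ B, rhs.T).T

X1 = sample_env(500)                            # observational environment e = 1
X2 = sample_env(500, zeta=2.0, psi=1.0)         # shift-perturbed environment e = 2
X3 = sample_env(500, do=(0,))                   # do-intervention on X_1
```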
2.1.1 An identifiable parametrization

Clearly, one cannot distinguish the parameters for $\epsilon^e$ and $\delta^e$. We thus write, for all $e \in \mathcal{E}$:

$$\epsilon^e + \delta^e \sim \mathcal{N}(0, \mathrm{diag}(w^{e,\star})), \qquad w^{e,\star} \in \mathbb{R}^p_+.$$

Since we are mainly interested in the connectivity matrix $B^\star$, the parameters $\Gamma^\star, w^{e,\star}, \Psi^{e,\star}$ are nuisance parameters, and we may simplify the modeling framework by restricting the parameter space for the covariances $\Psi^{e,\star}$. Our default proposal is to model the hidden variables as independent and identically distributed across the environments. Specifically, we let $\Psi^{e,\star} = \Psi^\star + \psi^{e,\star} I$ where $\psi^{e,\star} \in \mathbb{R}_+$. Further, without loss of generality, $\Psi^\star$ can be taken to be the identity matrix by absorbing its effect into $\Gamma^\star$ via the transformation $\Gamma^\star \to \Gamma^\star \Psi^{\star 1/2}$, so that:

$$\Psi^{e,\star} = (1 + \psi^{e,\star}) I, \qquad \psi^{e,\star} \in \mathbb{R}_+.$$

Further, as an additional default setting, we assume that we have access to an observational environment ($e = 1$ without loss of generality) so that:

$$w^{e,\star} \succeq w^{1,\star}, \qquad \psi^{1,\star} = 0.$$

Here, the inequality $w^{e,\star} \succeq w^{1,\star}$ is element-wise. In the setting where the hidden variables are unperturbed across the environments, one can take $\Psi^{e,\star} \equiv I$ after the transformation $\Gamma^\star (\Psi^{e,\star})^{1/2} \to \Gamma^\star$. Fitting a model with equally distributed perturbations across the coordinates may be attained by the reparametrization $w^{e,\star} = w^{1,\star} + \zeta^{e,\star} \mathbf{1}$ for $\zeta^{e,\star} \in \mathbb{R}_+$. In general, other models for the random terms $\epsilon^e + \delta^e$ and $H^e$ are possible. A connection to random effects modeling is discussed next.

The framework in (2) bears some similarities to standard random effects mixed models [16]. In particular, random effects mixed models are widely employed to model grouped data across multiple environments, where some parameter components remain fixed and others are random. In the context of our problem, the fixed parameters are the matrices $B^\star, \Gamma^\star$ and the random parameters are the shift perturbations $\delta^e$. For example, we can write for the target variable $Y = X_p$ and, for simplicity, in the absence of hidden variables: for each environment or group $e$,

$$Y^e = X^e \beta + Z^e b^e + \epsilon^e, \qquad e = 1, \dots, m, \qquad (3)$$

where $Y^e, \epsilon^e$ are $n_e \times 1$ vectors, $X^e$ is an $n_e \times p$ design matrix, $Z^e = I_{n_e}$, $n_e$ is the sample size within group or environment $e$, and the variables across $e$ are independent. The correspondence to (2) is as follows: $\epsilon^e \sim \mathcal{N}(0, w^{1,\star}_p I_{n_e})$, $b^e \sim \mathcal{N}(0, v^{e,\star}_p I_{n_e})$ (where $v^{e,\star}_p = w^{e,\star}_p - w^{1,\star}_p$), and $\beta = (B^\star_{p,:})^T$. There are three main differences to standard mixed models. First, the distribution of $b^e \sim \mathcal{N}(0, v^{e,\star}_p I_{n_e})$ changes with $e$ and the shrinkage effect across groups is abandoned. Second, we take a multivariate viewpoint for all the variables $X^e_j$ ($j = 1, \dots, p$) in (2): they are all modelled with random effects and can be individually written as in (3), but we allow for dependence among all the $p$ variables.
Finally, a difference between our model in (1) and (2) and standard mixed models is that the group-specific random effects, the random parameters $\delta^e$ in (2) or the random parameter vector $b^e$ in (3), act in a dynamic way on the system: the effects of $\delta^e$ are propagated through the structural equations, and in practice the order of propagation is usually unknown. Thus, our model in (2) leads to a different way of describing group-specific perturbations, calling also for a different likelihood calculation: in fact, as we show, such dynamic perturbations allow us to identify the causal structure. The latter is not possible with standard mixed models; due to the connection pointed out above, we refer to our formalization in (2) as "mixed-effects linear structural causal modeling". We believe that the causal inference literature has not much exploited this connection. We argue here that our random effects approach is very useful and practical for modeling perturbation data where the perturbations are believed to propagate to other variables in the system.

2.2 DirectLikelihood
Let $\mathcal{D}$ be a given DAG structure among the observed variables (which we can think of as the restriction $\mathcal{D}_X$ of a DAG among observed and hidden variables). In this section, we score this DAG via the maximum-likelihood procedure DirectLikelihood. We suppose that there are $m$ environments, $|\mathcal{E}| = m$, and for every environment $e = 1, 2, \dots, m$, we have samples of the random pairs $(X^e, H^e)$: $\{X^e(i)\}_{i=1}^{n_e}$ for some positive integer $n_e$, which are IID for each $e$ and independent across $e$. Thus, since the $X^e$'s are independent for $e = 1, 2, \dots, m$ and the samples for each environment $e$ are IID, the maximum-likelihood estimator for the DAG structure $\mathcal{D}$ is given by:

$$\arg\min_{\substack{B \in \mathbb{R}^{p \times p},\ \Gamma \in \mathbb{R}^{p \times \bar{h}} \\ \{\Psi^e\}_{e=1}^m \subseteq \mathbb{S}^{\bar{h}}_{++},\ \{w^e\}_{e=1}^m \subseteq \mathbb{R}^p_{++}}} \sum_{e=1}^m \hat{\pi}_e \sum_{i=1}^{n_e} -\log \mathrm{prob}\big(X^e(i) \mid B, \Gamma, \Psi^e, w^e\big) \quad \text{subject to } B \text{ compatible with } \mathcal{D}. \qquad (4)$$

Here, $\mathrm{prob}(X^e(i) \mid B, \Gamma, \Psi^e, w^e)$ represents the Gaussian likelihood of $X^e(i)$ given the parameters $B, \Gamma, \Psi^e, w^e$; $\bar{h} \le p$ is an estimate of the number of hidden variables; the constraint "$B$ compatible with $\mathcal{D}$" ensures that the estimated $B$ has its support restricted to the structure of $\mathcal{D}$, i.e. $B_{i,j} \neq 0$ only if $j \to i$ in $\mathcal{D}$; $\hat{\pi}_e = n_e / \sum_{e=1}^m n_e$; and $w^e$ is a surrogate for the variances of the sum $\delta^e + \epsilon$. It is straightforward to see that the negative log-likelihood decouples across the parameters $(B, \Gamma, \{\Psi^e\}_{e=1}^m, \{w^e_{do(e)^c}\}_{e=1}^m)$ and $\{w^e_{do(e)}\}_{e=1}^m$. In other words, the structure of the DAG $\mathcal{D}$ only plays a role in the term involving the parameters $(B, \Gamma, \{\Psi^e\}_{e=1}^m, \{w^e_{do(e)^c}\}_{e=1}^m)$, and we thus focus on that component of the likelihood:

$$(\hat{B}, \hat{\Gamma}, \{(\hat{\Psi}^e, \hat{w}^e)\}_{e=1}^m) = \arg\min_{\substack{B \in \mathbb{R}^{p \times p},\ \Gamma \in \mathbb{R}^{p \times \bar{h}} \\ \{\Psi^e\}_{e=1}^m \subseteq \mathbb{S}^{\bar{h}}_{++},\ \{w^e\}_{e=1}^m \subseteq \mathbb{R}^p_{++}}} \sum_{e=1}^m \hat{\pi}_e\, \ell(B, \Gamma, \Psi^e, w^e; do(e), \hat{\Sigma}^e) \quad \text{subject to } B \text{ compatible with } \mathcal{D}, \qquad (5)$$

where

$$\ell(\cdot) = \log\det\Big(\big[\mathrm{diag}(w^e) + \Gamma \Psi^e \Gamma^T\big]_{do(e)^c}\Big) + \mathrm{trace}\Big(\big[\mathrm{diag}(w^e) + \Gamma \Psi^e \Gamma^T\big]_{do(e)^c}^{-1} \big[(I - F_{do(e)^c} B)\hat{\Sigma}^e (I - F_{do(e)^c} B)^T\big]_{do(e)^c}\Big),$$

and $\hat{\Sigma}^e$ is the sample covariance matrix of the data $\{X^e(i)\}_{i=1}^{n_e}$. We assume that the locations of the do-interventions are known, so that the inputs to the program (5) are the sample covariance matrices $\hat{\Sigma}^e$, the do-intervention locations $do(e)$, and the estimate $\bar{h}$ for the number of hidden variables. We note that the DirectLikelihood estimator can be specialized to different modeling options based on appropriate reparametrizations of the nuisance parameters $\Psi^e, w^e$ in (5). For example, in our default setting of IID hidden variables with the environment $e = 1$ being observational (see Section 2.1.1), we add the following constraints to (5):

$$\Psi^e = (1 + \psi^e) I \text{ with } \psi^e \in \mathbb{R}_+ \text{ for } e = 1, \dots, m; \qquad w^e \succeq w^1 \text{ for } e = 2, \dots, m; \qquad \psi^1 = 0.$$

Given estimates $(\hat{B}, \hat{\Gamma}, \{\hat{\Psi}^e\}_{e=1}^m, \{\hat{w}^e\}_{e=1}^m)$, our score for the DAG $\mathcal{D}$ is:

$$\mathrm{score}_\lambda(\mathcal{D}) = \sum_{e=1}^m \hat{\pi}_e\, \ell(\hat{B}, \hat{\Gamma}, \hat{\Psi}^e, \hat{w}^e; do(e), \hat{\Sigma}^e) + \lambda \|\mathrm{moral}(\mathcal{D})\|_{\ell_0}. \qquad (6)$$

Here, moral($\mathcal{D}$) denotes the moralization of $\mathcal{D}$, which forms an undirected graph from $\mathcal{D}$ by adding edges between nodes that have common children, and $\lambda > 0$ is a regularization parameter. The penalty $\lambda \|\mathrm{moral}(\mathcal{D})\|_{\ell_0}$ is akin to the Bayesian Information Criterion (BIC) score: it prevents overfitting by incorporating the denseness of the moral graph of $\mathcal{D}$ in the likelihood score.
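As an illustration of the likelihood computation, the following sketch evaluates the per-environment term $\ell$ of (5) and the weighted score of (6) under the default parametrization $\Psi^e = (1 + \psi^e) I$; the function names are ours, and the code is a minimal rendering of the formulas above rather than the authors' implementation.

```python
import numpy as np

def neg_loglik_env(B, Gamma, psi_e, w_e, do_e, Sigma_hat_e):
    """The term ell(B, Gamma, (1+psi^e) I, w^e; do(e), Sigma_hat^e) from (5),
    up to additive constants."""
    p = B.shape[0]
    obs = np.setdiff1d(np.arange(p), np.asarray(do_e, dtype=int))   # do(e)^c
    F = np.zeros((p, p))
    F[obs, obs] = 1.0                                               # F_{do(e)^c}
    M = (np.diag(w_e) + (1.0 + psi_e) * Gamma @ Gamma.T)[np.ix_(obs, obs)]
    R = (np.eye(p) - F @ B) @ Sigma_hat_e @ (np.eye(p) - F @ B).T
    _, logdet = np.linalg.slogdet(M)
    return logdet + np.trace(np.linalg.solve(M, R[np.ix_(obs, obs)]))

def score(B, Gamma, psis, ws, dos, Sigmas, ns, lam=0.0, moral_edges=0):
    """Weighted sum over environments as in (5), plus the penalty of (6);
    `moral_edges` stands in for the edge count ||moral(D)||_0."""
    pis = np.asarray(ns, dtype=float) / np.sum(ns)
    ell = sum(pi * neg_loglik_env(B, Gamma, ps, w, d, S)
              for pi, ps, w, d, S in zip(pis, psis, ws, dos, Sigmas))
    return ell + lam * moral_edges
```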
In principle, a collection of DAGs can each be individually scored via (6) to find the best fit to the data. We remark that regularization terms controlling for the complexity of estimated DAGs are commonly employed in structural causal learning (see [9] and the references therein). A classically known fact is that in a single-environment setting, the moral graphs of the DAGs in the Markov equivalence class have the same cardinality [32]. In the context of this paper with perturbations, incorporating the sparsity of the moral graph plays a central role in our theoretical analysis for proving identifiability; see Section 3.1 for more discussion.

In comparison to the DirectLikelihood procedure, backShift [27] fits the SCM (2) (with some restrictions outlined in Section 2.1) by performing joint diagonalization of the differences of sample covariance matrices. DirectLikelihood allows for much more modeling flexibility. First, in contrast to backShift, where the hidden effects are subtracted out by computing differences of covariances, DirectLikelihood explicitly models these effects. This feature of DirectLikelihood enables the possibility of perturbations to the hidden variables and provides a means to control the number of estimated hidden variables (as opposed to an arbitrary number of hidden variables with backShift). We discuss in Section 3 that controlling the number of hidden variables may lead to identifiability using DirectLikelihood with a single interventional environment, whereas backShift is guaranteed to fail. Second, DirectLikelihood also models the perturbation magnitudes in each environment, allowing for the flexibility of constraining the perturbation magnitudes to improve estimation accuracy. Finally, DirectLikelihood allows pooling of information over different environments $e$ for the parameter $B$ of interest: this enables DirectLikelihood to be used with only a few sample points per environment.
The DirectLikelihood estimator (5) fits a Gaussian perturbation model (2) to the data. However, the perturbation data of the observed variables may be non-Gaussian but still satisfy the linear SCM (2). In particular, the random variables $H^e, \delta^e$ may be non-Gaussian while still inducing a linear relationship with the observed variables $X^e$. Nonetheless, since the DirectLikelihood estimator only operates on second moments, one may still use the DirectLikelihood procedure to find the best scoring DAGs and the associated connectivity matrices without compromising the identifiability guarantees shown in Section 3, which still imply corresponding estimation consistency. Further, we empirically explore the robustness of the DirectLikelihood procedure to non-Gaussianity as well as other model misspecifications via numerical experiments in Section 6.
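For instance, in the simulation sketch of Section 2.1, the Gaussian draw of $\epsilon^e + \delta^e$ can be swapped for any zero-mean noise with matching variances; a hypothetical example with Laplace noise:

```python
import numpy as np

rng = np.random.default_rng(1)

def scaled_laplace(var, size):
    """Zero-mean Laplace noise with prescribed coordinate-wise variances;
    a Laplace(0, b) variable has variance 2 b^2, hence b = sqrt(var / 2)."""
    return rng.laplace(scale=np.sqrt(np.asarray(var) / 2.0), size=size)

# Replacing the Gaussian eps^e + delta^e in sample_env by
# scaled_laplace(w_obs + zeta, (n, p)) leaves every population covariance
# Sigma^{e,*} (the only quantity entering DirectLikelihood) unchanged.
```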
3 Theoretical guarantees for DirectLikelihood
We next investigate the theoretical properties of the DirectLikelihood procedure. The main theorem in this section (Theorem 1) considers the general setting with perturbed latent variables and establishes identifiability properties under some population assumptions. Subsequently, Theorem 2 and Theorem 3 in Section 3.2 analyze DirectLikelihood under specializations of the modeling framework, namely unperturbed hidden variables and no hidden variables, respectively. Throughout, the notation with $\star$ indicates the true underlying population objects which we aim to estimate from data.

Setup: We consider the perturbation model in (2) with a population connectivity matrix $B^\star \in \mathbb{R}^{p \times p}$ and hidden effects coefficient matrix $\Gamma^\star \in \mathbb{R}^{p \times h}$. For every environment $e$, the random vector $H^e$ has a covariance matrix $\Psi^{e,\star} \in \mathbb{S}^h_{++}$ and the random vector $\epsilon^e + \delta^e$ has a diagonal covariance matrix $\mathrm{diag}(w^{e,\star})$ for $w^{e,\star} \in \mathbb{R}^p_+$. In the subsequent discussion, we allow $H^e$, $\delta^e$, and $\epsilon$ to be non-Gaussian random vectors. As prescribed in Section 2.1.1, but not requiring Gaussianity, we assume that the hidden variables are independent and identically distributed (our theoretical results can be extended to the setting with non-IID latent variables), i.e. $\Psi^{e,\star} = (1 + \psi^{e,\star}) I$ with $\psi^{e,\star} \in \mathbb{R}_+$. We assume that for every environment $e = 1, 2, \dots, m$, we have IID data $\{X^e(i)\}_{i=1}^{n_e} \subseteq \mathbb{R}^p$, where $e = 1$ is an observational environment with no perturbations (although the results can be readily extended when there are perturbations in all environments). To score a given DAG $\mathcal{D}$, we consider the modified DirectLikelihood estimator (5) in population:

$$\begin{aligned} \min_{\substack{B \in \mathbb{R}^{p \times p},\ \Gamma \in \mathbb{R}^{p \times \bar{h}} \\ \{(\psi^e, w^e)\}_{e=1}^m \subseteq \mathbb{R}_+ \times \mathbb{R}^p_{++}}}\ & \sum_{e=1}^m \pi^{e,\star}\, \ell(B, \Gamma, (1 + \psi^e) I, w^e; do(e), \Sigma^{e,\star}) \\ \text{subject to}\ & B \text{ compatible with } \mathcal{D}; \quad \psi^e \le C_\psi \text{ for } e = 1, \dots, m; \quad \psi^1 = 0; \quad w^e \succeq w^1 \text{ for } e = 2, \dots, m. \end{aligned} \qquad (7)$$

Comparing (7) to the DirectLikelihood estimator (5), the reparametrization $\Psi^e \to (1 + \psi^e) I$ accounts for the hidden variables being IID, and the constraints $\psi^1 = 0$ and $w^e \succeq w^1$ for $e = 2, \dots, m$ account for $e = 1$ being an observational environment. Further, the constraint $\psi^e \le C_\psi$ bounds the strength of the hidden perturbations for some user-specified parameter $C_\psi \ge 0$. The optimal scoring DAGs and the associated connectivity matrices in population are:

$$\mathcal{D}_{\mathrm{opt}} = \arg\min_{\mathrm{DAG}\ \mathcal{D}} \mathrm{score}_{\lambda=0}(\mathcal{D}); \qquad \mathcal{B}_{\mathrm{opt}}: \text{ associated connectivity matrix(ces)}. \qquad (8)$$

Here, $\mathrm{score}_{\lambda=0}(\mathcal{D})$ is the achieved minimum in (7); it is the analogue of (6) but using the population covariance matrices $\Sigma^{e,\star}$ and the population optimizers from (7). The sample DirectLikelihood procedure replaces $\Sigma^{e,\star}$ and $\pi^{e,\star}$ in (7) with the sample covariance matrix $\hat{\Sigma}^e$ and the mixture coefficients $\hat{\pi}_e$, respectively. Further, in the sample setting, the regularization parameter $\lambda$ in the score evaluation (6) is set to scale with the BIC score. Using the sample quantities as described above, we denote by

$$\hat{\mathcal{D}}_{\mathrm{opt}}, \hat{\mathcal{B}}_{\mathrm{opt}} \qquad (9)$$

the optimal scoring DAGs and connectivity matrices in the sample version. Our objective is to demonstrate that under some assumptions, $\mathcal{D}_{\mathrm{opt}} = \mathcal{D}^\star_X$, $\mathcal{B}_{\mathrm{opt}} = B^\star$, and, in the limit of the sample sizes for all environments tending to infinity, $\hat{\mathcal{D}}_{\mathrm{opt}} \to \mathcal{D}^\star_X$ and $\hat{\mathcal{B}}_{\mathrm{opt}} \to B^\star$ in probability.

Our consistency results in the general setting where there are perturbations on the hidden variables require an assumption on the hidden variable effects, dubbed hidden spuriousness, that is formalized below:

Definition 1 (hidden spuriousness for $e \in \mathcal{E}$). The random variables $(X^e, H^e)$ satisfy hidden spuriousness if there exists a pair $k, l$ such that:

$$\rho(X^e_k, X^e_l \mid X^e_{\setminus\{k,l\}}, H^e) = 0 \quad \text{and} \quad \rho(X^e_k, X^e_l \mid X^e_{\setminus\{k,l\}}) \neq 0.$$

In words, Definition 1 states that the hidden variables induce "some" confounding dependencies among the observed variables in environment $e \in \mathcal{E}$.
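For Gaussian models, Definition 1 can be checked at the population level from covariance matrices, since a partial correlation is, up to sign, a normalized entry of the corresponding precision matrix. A small sketch (the helper names are ours):

```python
import numpy as np

def partial_corr(Sigma, k, l):
    """rho(V_k, V_l | all remaining coordinates of Sigma), via the precision matrix."""
    K = np.linalg.inv(Sigma)
    return -K[k, l] / np.sqrt(K[k, k] * K[l, l])

def hidden_spuriousness(Sigma_XH, p, tol=1e-8):
    """Check Definition 1 given the joint covariance of (X^e, H^e): some pair (k, l)
    of observed variables is partially uncorrelated given the remaining observed
    variables and H^e, yet partially correlated given only the observed variables."""
    Sigma_X = Sigma_XH[:p, :p]          # marginal covariance of the observed block
    for k in range(p):
        for l in range(k + 1, p):
            if (abs(partial_corr(Sigma_XH, k, l)) < tol
                    and abs(partial_corr(Sigma_X, k, l)) > tol):
                return True
    return False
```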
In comparison, the hidden denseness assumption needed for consistency of the two-stage deconfounding procedures [4, 11] requires that the hidden variables induce "many" confounding dependencies. As such, hidden spuriousness is a strictly (and often substantially) weaker condition than the denseness assumption required for the success of two-stage deconfounding. We investigate whether DirectLikelihood is able to identify the population connectivity matrix $B^\star$ under this weaker condition, and answer in the affirmative under appropriate conditions on the strength and heterogeneity of the interventions. We provide two sets of assumptions that lead to identifiability. The first set requires two interventional environments that have sufficiently large interventions on the observed variables, and the second set requires two interventional environments with one interventional environment consisting of much stronger interventions on the observed variables than the other interventional environment. The two sets of assumptions are described below, where the observational environment is denoted by $e = 1$ and the two interventional environments are denoted by $e = 2, 3$.

Assumption 1: no do-interventions: $do(e) = \emptyset$ for $e = 2, 3$.
Assumption 2: the mixture effects are non-vanishing: $\pi^{e,\star} > 0$ for $e = 1, 2, 3$.
Assumption 3: heterogeneity of perturbations for $e = 2, 3$:
$$\frac{w^{2,\star}_k - (1 + \psi^{2,\star}) w^{1,\star}_k}{w^{2,\star}_l - (1 + \psi^{2,\star}) w^{1,\star}_l} \neq \frac{w^{3,\star}_k - (1 + \psi^{3,\star}) w^{1,\star}_k}{w^{3,\star}_l - (1 + \psi^{3,\star}) w^{1,\star}_l} \quad \text{for all } k \neq l.$$
Assumption 4: hidden spuriousness (Definition 1) for the environments $e = 2, 3$.
Assumption 5: sufficiently large interventions on the variables for $e = 2, 3$:
$$\frac{\min_k w^{e,\star}_k}{\max_k w^{1,\star}_k} > \kappa^\star (1 + 2 C_\psi)(1 + \|w^{1,\star}\|_\infty)(1 + \|\Gamma^\star\|_2 + \|\Gamma^\star\|_2^2), \qquad (10)$$

or

Assumption 6: no do-interventions: $do(e) = \emptyset$ for $e = 2, 3$.
Assumption 7: the mixture effects are non-vanishing: $\pi^{e,\star} > 0$ for $e = 1, 2, 3$.
Assumption 8: heterogeneity of perturbations for $e = 2, 3$:
$$\frac{w^{2,\star}_k - (1 + \psi^{2,\star}) w^{1,\star}_k}{w^{2,\star}_l - (1 + \psi^{2,\star}) w^{1,\star}_l} \neq \frac{w^{3,\star}_k - \frac{\psi^{3,\star}}{\psi^{2,\star}} w^{2,\star}_k}{w^{3,\star}_l - \frac{\psi^{3,\star}}{\psi^{2,\star}} w^{2,\star}_l} \quad \text{for all } k \neq l.$$
Assumption 9: hidden spuriousness (Definition 1) for the environment $e = 3$.
Assumption 10: sufficiently large interventions on the variables for $e = 3$:
$$\frac{\min_k w^{3,\star}_k}{\max_k w^{2,\star}_k} > \kappa^\star (1 + 2 C_\psi)(1 + \|w^{1,\star}\|_\infty)(1 + \|\Gamma^\star\|_2 + \|\Gamma^\star\|_2^2), \qquad (11)$$

where the quantity $\kappa^\star$, appearing in Assumption 5 of (10) and Assumption 10 of (11), is a constant depending only on the column norms $\|B^\star_{:,i}\|_2$ of the population connectivity matrix. Assumptions 1-5 in (10) and 6-10 in (11) impose conditions on the population quantities associated with the environments $e = 1, 2, 3$. Notably, the perturbations on the observed variables in the two interventional environments may be statistically identical (i.e. $w^{2,\star} = w^{3,\star}$), or one of them may even be zero (i.e. $w^{2,\star} = w^{1,\star}$) with different hidden variable perturbations (i.e. $\psi^{2,\star} \neq \psi^{3,\star}$), without compromising Assumption 3 in (10) or Assumption 8 in (11). Additionally, one can show that if the parameters $w^{1,\star}, w^{2,\star}, w^{3,\star}$ and $\psi^{2,\star}, \psi^{3,\star}$ are drawn from continuous distributions, Assumption 3 in (10) and Assumption 8 in (11) are satisfied almost surely. Assumption 4 in (10) and Assumption 9 in (11) insist that hidden spuriousness (Definition 1) is satisfied, so that the hidden variables induce at least a single spurious dependency among the observed variables. Finally, Assumption 5 in (10) and Assumption 10 in (11) require that the perturbations on the observed variables are sufficiently large. This is akin to the strong-instruments assumption in the instrumental variables literature [1]. Given Assumptions 1-5 in (10) or Assumptions 6-10 in (11), we first analyze the theoretical properties of the population DirectLikelihood procedure.
Theorem 1 (Identifiability in population: perturbed hidden variables). Suppose that the user-specified parameters $\bar{h}$ and $C_\psi$ in (7) are chosen conservatively so that $\bar{h} \ge \dim(H)$ and $C_\psi \ge \psi^{e,\star}$ for all $e = 1, 2, \dots, m$. Under Assumptions 1-5 in (10) or Assumptions 6-10 in (11), the following are satisfied for $\mathcal{D}_{\mathrm{opt}}$ in (8):

1. $\mathcal{D}^\star_X \in \mathcal{D}_{\mathrm{opt}}$, and any other optimum $\mathcal{D} \in \mathcal{D}_{\mathrm{opt}}$ satisfies moral($\mathcal{D}^\star_X$) $\subseteq$ moral($\mathcal{D}$).
2. The optimum of $\arg\min_{\mathcal{D} \in \mathcal{D}_{\mathrm{opt}}} \|\mathrm{moral}(\mathcal{D})\|_{\ell_0}$ is unique and equal to $\mathcal{D}^\star_X$. Further, the associated connectivity matrix is equal to $B^\star$.

The proof sketch of Theorem 1 is provided in Section 3.1 and the full proof is presented in the supplementary material, Section A.1. The first assertion in Theorem 1 states that the moral graph of any optimum $\mathcal{D} \in \mathcal{D}_{\mathrm{opt}}$ of the DirectLikelihood procedure is a superset of the moral graph of $\mathcal{D}^\star_X$, and the second assertion states that the connectivity matrices yielding the sparsest moral graphs among the optima are unique and equal to $B^\star$. These statements do not guarantee recovery of the other model parameters, viewed here as the nuisance part, including $\Gamma^\star$ and $\{(\psi^{e,\star}, w^{e,\star})\}_{e=1}^m$. However, under additional assumptions, namely $\bar{h} = \dim(H)$ and incoherence of the subspace col-space($\Gamma^\star$), recovery of $\Gamma^\star \Gamma^{\star T}$ and $\{(\psi^{e,\star}, w^{e,\star})\}_{e=1}^m$ can be shown.

We note that Assumptions 1-5 in (10) or Assumptions 6-10 in (11) are sufficient conditions for identifiability and are generally not necessary. As an example, we show in supplementary material Section A.2 that identifiability cannot be achieved with only a single interventional environment if $\bar{h} = p$ (i.e. the most conservative choice for the number of latent variables). However, we also demonstrate that if $\bar{h} < p$, DirectLikelihood will attain identifiability under certain configurations of the model parameters (i.e. dense hidden effects with a sparse population DAG $\mathcal{D}^\star_X$). Thus, Assumptions 1-5 in (10) or 6-10 in (11) serve as protection for arbitrary population DAG structures and classes of model parameters. We believe that relaxing these assumptions while retaining identifiability guarantees is an interesting direction for future research.

The virtue of incorporating the regularization term $\lambda \|\mathrm{moral}(\mathcal{D})\|_{\ell_0}$ in (6) is that in the large data limit, this penalty term encourages sparser moral graphs. Thus, in conjunction with the results of Theorem 1, we demonstrate that in the large data limit, the sets $\hat{\mathcal{D}}_{\mathrm{opt}}$ and $\hat{\mathcal{B}}_{\mathrm{opt}}$ asymptotically converge to $\mathcal{D}^\star_X$ and $B^\star$, respectively. To appeal to standard empirical process theory results, we constrain the parameter space to be compact as described in the corollary below:

Corollary 1 (Asymptotic consistency for perturbed hidden variables). Consider the sample version of the DirectLikelihood procedure in (7) with the compactness constraints $\max\{1/\min_k w^e_k, \|B\|_2\} \le C_{\mathrm{comp}}$ for every $e = 1, 2, \dots, m$, where $C_{\mathrm{comp}} > \max\{1/\min_k w^{e,\star}_k, \|B^\star\|_2\}$, so that the true parameters are in the feasible region. Further, let $\lambda \sim O\big(\log(\sum_{e=1}^m n_e) / \sum_{e=1}^m n_e\big)$ in (6). Under the conditions in Theorem 1, the following are satisfied for $\hat{\mathcal{D}}_{\mathrm{opt}}$ and $\hat{\mathcal{B}}_{\mathrm{opt}}$ in (9): $\hat{\mathcal{D}}_{\mathrm{opt}} \to \mathcal{D}^\star_X$ and $\hat{\mathcal{B}}_{\mathrm{opt}} \to B^\star$, in probability, as $n_e \to \infty$ for $e = 1, 2, 3$.

The proof of Corollary 1 is a straightforward consequence of Theorem 1 and is left out for brevity. The combined results of Theorem 1 and Corollary 1 state that under perturbations that are sufficiently different across environments, together with the hidden spuriousness condition, two interventional environments suffice for consistent estimation.

Remark 1: Assumptions 1-5 in (10) and 6-10 in (11), needed for identifiability, suggest that perturbations on the hidden variables can improve identifiability. Specifically, the perturbations on the observed variables in one interventional environment may be statistically identical to those in another environment, or may even be completely equal to zero, and still preserve identifiability as long as the hidden variables have been perturbed.

Remark 2: As described in Section 2.1, the perturbation model (2) offers flexibility with respect to many components of the model, such as the structure of the perturbations on the observed or hidden variables. In particular, one may fit to data the perturbation model (2) where the perturbation magnitudes are equal across the coordinates, e.g. $\mathrm{diag}(w^{e,\star}) \propto I$. We demonstrate in the supplementary material Section A.3 that DirectLikelihood, under assumptions similar to (10) or (11), provides consistent estimators in this setting. Thus, in principle one may have only two additional perturbation parameters per environment: a scalar for the hidden variables and a scalar for the observed variables. By contrast, in the setting where the perturbations among the observed variables may vary, there are $p + 1$ new parameters for each environment. The substantial reduction in the number of parameters can lead to better statistical properties in practice.

3.1 Proof sketch of Theorem 1
We sketch the main proof idea of Theorem 1 under Assumptions 1-5 in (10); the proof strategy under Assumptions 6-10 in (11) is similar. It is straightforward to see that any optimal connectivity matrix $B \in \mathcal{B}_{\mathrm{opt}}$ of the population DirectLikelihood procedure satisfies the following condition for some $\{(\psi^e, w^e)\}_{e=1}^m \subset \mathbb{R}_+ \times \mathbb{R}^p_{++}$ and $\Gamma \in \mathbb{R}^{p \times \bar{h}}$ with $\psi^e \le C_\psi$ for all $e = 1, 2, \dots, m$:

$$\Sigma^{e,\star} = (I - F_{do(e)^c} B)^{-1} \big(\mathrm{diag}(w^e) + [(1 + \psi^e) \Gamma \Gamma^T]_{do(e)^c}\big) (I - F_{do(e)^c} B)^{-T} \quad \text{for all } e = 1, 2, \dots, m \text{ where } \pi^{e,\star} > 0. \qquad (12)$$

In addition to the population parameters $(B^\star, \Gamma^\star, \{(\psi^{e,\star}, w^{e,\star})\}_{e=1}^m)$, suppose there is another collection of optimal parameters $(\tilde{B}, \tilde{\Gamma}, \{(\tilde{\psi}^e, \tilde{w}^e)\}_{e=1}^m)$ satisfying (12). Focusing on the environments $e = 1, 2, 3$, Assumption 5 implies that the matrix $\Sigma^{e,\star} - (1 + \tilde{\psi}^e) \Sigma^{1,\star}$ is invertible for $e = 2, 3$, where the inverse can be expressed as:

$$\big(\Sigma^{e,\star} - (1 + \tilde{\psi}^e) \Sigma^{1,\star}\big)^{-1} = (I - B^\star)^T (D^e - L^e)(I - B^\star) = (I - \tilde{B})^T \tilde{D}^e (I - \tilde{B}),$$

where $\tilde{D}^e = \mathrm{diag}\big(\tilde{w}^e - \tilde{w}^1 (1 + \tilde{\psi}^e)\big)^{-1}$. Moreover, under Assumptions 3 and 5, $D^e, \tilde{D}^e \in \mathbb{S}^p$ are invertible diagonal matrices with $D^2_{k,k} / D^2_{l,l} \neq D^3_{k,k} / D^3_{l,l}$ for all $k \neq l$. Further, $L^e \in \mathbb{S}^p$ is identically zero if $\psi^{e,\star} = \tilde{\psi}^e$, and a rank-$h$ matrix otherwise. The nonzero entries of $(I - B^\star)^T D^e (I - B^\star)$ and $(I - \tilde{B})^T \tilde{D}^e (I - \tilde{B})$ encode the moral graphs associated with $B^\star$ and $\tilde{B}$, respectively. By Assumption 5, the magnitude of any nonzero entry of the matrix $(I - B^\star)^T D^e (I - B^\star)$ is greater than the maximal nonzero entry of $(I - B^\star)^T L^e (I - B^\star)$. This proves the first component of Theorem 1.

We next sketch the proof of the second component. If $L^e \neq 0$, then by Assumption 4 there exists a pair of indices $(k, l)$ such that $[(I - B^\star)^T L^e (I - B^\star)]_{k,l} \neq 0$ and $[(I - B^\star)^T D^e (I - B^\star)]_{k,l} = 0$. This would imply that moral($B^\star$) $\subset$ moral($\tilde{B}$). Since we are interested in connectivity matrices that yield the sparsest moral graph, we conclude that $L^e = 0$. We have now established that $\tilde{B}$ yields the same moral graph as $B^\star$ and satisfies, for $e = 2, 3$:

$$(I - B^\star)^T D^e (I - B^\star) = (I - \tilde{B})^T \tilde{D}^e (I - \tilde{B}).$$

Since $D^2, D^3$ are non-singular matrices with the property that $D^2_{k,k} / D^2_{l,l} \neq D^3_{k,k} / D^3_{l,l}$ for all $k \neq l$, one can then show via algebraic manipulations that $\tilde{B} = B^\star$.
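The covariance differencing at the heart of this argument is easy to verify numerically. The sketch below checks, for randomly drawn parameters (our own choices) and no do-interventions, that with the true $\psi^e$ plugged in, the hidden contribution cancels and the difference takes the stated inner-diagonal form:

```python
import numpy as np

def cov_env(B, Gamma, psi, w):
    """Population covariance implied by (2) with do(e) empty:
    Sigma^{e,*} = (I-B)^{-1} (diag(w) + (1+psi) Gamma Gamma^T) (I-B)^{-T}."""
    A = np.linalg.inv(np.eye(B.shape[0]) - B)
    return A @ (np.diag(w) + (1.0 + psi) * Gamma @ Gamma.T) @ A.T

rng = np.random.default_rng(2)
p, h = 4, 1
B = np.triu(rng.normal(size=(p, p)), k=1) * 0.5   # strictly triangular => acyclic
Gamma = rng.normal(size=(p, h))
w1, w2, psi2 = 0.5 + rng.random(p), 2.0 + rng.random(p), 0.7

# Sigma^{2,*} - (1+psi^2) Sigma^{1,*} = (I-B)^{-1} diag(w^2 - (1+psi^2) w^1) (I-B)^{-T},
# since the (1+psi^2) Gamma Gamma^T terms cancel; its inverse is the
# (I-B)^T D (I-B) expression used in the proof sketch.
lhs = cov_env(B, Gamma, psi2, w2) - (1.0 + psi2) * cov_env(B, Gamma, 0.0, w1)
A = np.linalg.inv(np.eye(p) - B)
rhs = A @ np.diag(w2 - (1.0 + psi2) * w1) @ A.T
assert np.allclose(lhs, rhs)
```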
DirectLikelihood procedure when the hidden variables remainunperturbed across the environments, i.e. the perturbation A does not point to H in Fig 1. Specifically, we considerthe setup described in the beginning of Section 3 with the modification that ψ e,(cid:63) = 0. Thus, we also modify the DirectLikelihood estimator (7) by setting ψ ≡
0. We further consider an arbitrary hidden effects matrix Γ (cid:63) , wherethe two-stage deconfounding procedure will not perform well, since hidden denseness may not be satisfied. Wedemonstrate on the other hand, that the under sufficient interventions, the connectivity matrix that attains theoptimum score via
DirectLikelihood in population is unique and equal to B (cid:63) . Theorem 2 (Identifiability in population: unperturbed hidden variables) . Let ¯ h ≥ h in the DirectLikelihood estimator (7) . Letting S ⊆ { , , . . . , p } encode the location of perturbations, suppose that Assumption 3 in (10) ismodified to w ,(cid:63)k − w ,(cid:63)k w ,(cid:63)l − w ,(cid:63)l (cid:54) = w ,(cid:63)k − w ,(cid:63)k w ,(cid:63)l − w ,(cid:63)l for all k, l ∈ S, k (cid:54) = l and Assumption 5 in (10) is modified to w e,(cid:63)k > w ,(cid:63)k for e = 2 , and all k ∈ S . Then, under Assumptions 1,2 in (10) and modified Assumptions 3 and 5, we have for D opt and B opt as in (8) :(a) D opt = D (cid:63)X and B opt = B (cid:63) if S = { , . . . , p } .(b) Any optimum B ∈ B opt satisfies B p, : = B (cid:63)p, : for the sets ANC ( p ) ⊆ S or DES ( p ) ∪ p ⊆ S .(c) ¯ B = arg min B ∈ B opt (cid:107) B (cid:107) (cid:96) satisfies ¯ B p, : = B (cid:63)p, : if PA ( p ) ∪ p ⊆ S and additionally D (cid:63)X is faithful to the distributionof X | H . The proof of Theorem 2 is similar in nature to that of Theorem 1 and can be found in the supplementarymaterial Section A.4. Further, analogous to Corollary 1, one can readily show the large limit convergence of thepopulation
DirectLikelihood to the sample
DirectLikelihood , although we omit this for brevity.
Remark 3: The conditions needed for identifiability in the unperturbed hidden variable setting (Theorem 2) differ from those in the perturbed setting (Theorem 1) in multiple ways. First, there are no conditions on the strength of the perturbations on the observed variables. Further, the hidden coefficient matrix $\Gamma^\star$ may be arbitrary, without needing conditions like hidden spuriousness. Finally, the setting with unperturbed latent variables requires two interventional environments where all observed variables are perturbed, whereas the setting with perturbed hidden variables only requires a single environment with perturbations on all the observed variables and another environment where the hidden variables are perturbed.

Remark 4: Theorem 2(a) is similar in nature to the backShift procedure [27]. Nonetheless, DirectLikelihood provides additional flexibility, such as controlling the number of latent variables, incorporating do-interventions, and imposing structure on the strength of the shift interventions, which leads to more desirable statistical properties. As an example, a necessary condition for identifiability using the backShift procedure is that there are at least two interventional environments. We demonstrate in supplementary material Section A.2 that this is also a necessary condition for DirectLikelihood if $\bar{h} = p$. However, for $\bar{h} < p$, DirectLikelihood may attain identifiability with only a single interventional environment. As another example, a single interventional environment consisting of the same magnitude of perturbation across the coordinates is sufficient for consistency (see Section A.4 of the supplementary material for the theoretical statement).

We next focus on the setting where there is no hidden confounding, i.e. $H$ does not point to the covariates $X$ in Fig 1. We consider the setup described in the beginning of Section 3 with the modification $\Gamma^\star \equiv 0$. Accordingly, we also modify the DirectLikelihood estimator (7) by setting $\Gamma \equiv 0$. Below, we provide identifiability guarantees for the DirectLikelihood scoring procedure:
Theorem 3 (Identifiability in population: no hidden variables). Suppose that Assumptions 1 and 2 in (10) are modified to only apply to the environments $e = 1, 2$. Further, letting $S \subseteq \{1, 2, \dots, p\}$ encode the locations of the perturbations, Assumption 3 is modified to

$$\frac{w^{2,\star}_k}{w^{2,\star}_l} \neq \frac{w^{1,\star}_k}{w^{1,\star}_l} \quad \text{for all } k, l \in S,\ k \neq l,$$

and Assumption 5 to $w^{2,\star}_k > w^{1,\star}_k$ for all $k \in S$. Then, under these modified Assumptions 1, 2, 3, 5, we have for $\mathcal{D}_{\mathrm{opt}}$ and $\mathcal{B}_{\mathrm{opt}}$ as in (8):

(a) $\mathcal{D}_{\mathrm{opt}} = \mathcal{D}^\star_X$ and $\mathcal{B}_{\mathrm{opt}} = B^\star$ if $S = \{1, \dots, p\}$.
(b) Any optimum $B \in \mathcal{B}_{\mathrm{opt}}$ satisfies $B_{p,:} = B^\star_{p,:}$ if $\mathrm{ANC}(p) \subseteq S$ or $\mathrm{DES}(p) \cup \{p\} \subseteq S$.
(c) $\bar{B} = \arg\min_{B \in \mathcal{B}_{\mathrm{opt}}} \|B\|_{\ell_0}$ satisfies $\bar{B}_{p,:} = B^\star_{p,:}$ if $\mathrm{PA}(p) \cup \{p\} \subseteq S$ and additionally $\mathcal{D}^\star_X$ is faithful to the distribution of $X$.

Remark 5: The proof of Theorem 3 is shown in the supplementary material, Section A.4. Notice that, unlike the settings with hidden confounding (perturbed or unperturbed), one interventional environment suffices for consistent estimation. Furthermore, as in the case of hidden confounding, the shift interventions may in principle be independent and identically distributed without compromising the identifiability results.
4 Causal parameters and distributional robustness

Recent works have demonstrated an intrinsic connection between distributional robustness and causal inference. Specifically, in the setting where the target variable is not directly perturbed and there is no hidden confounding, the causal parameter $B^\star_{p,:}$ linking the covariates $X_{\setminus p}$ to the target variable $X_p$ in the SCM (2) solves the following max-risk optimization problem:

$$B^\star_{p,:} = \arg\min_{\beta \in \mathbb{R}^p,\ \beta_p = 0}\ \max_{P^e \in \mathcal{P}}\ \mathbb{E}_{X^e \sim P^e} \big\|X^e_p - (X^e)^T \beta\big\|_2^2, \qquad (13)$$

for a certain perturbation distribution class $\mathcal{P}$ consisting of distributions $P^e$ indexed by environments $e$ [3]. In particular, the causal coefficients $B^\star_{p,:}$ are solutions to a robust optimization problem subject to distributional changes to the system which do not act directly on $X_p$. Given access to exogenous variables or different environments, [26] allow for non-perturbed latent variables and possibly direct action of change on the target of interest, and prove a relation between the causal parameters and a particular robust optimization program.

In this section, we demonstrate that the joint causal parameters $B^\star$ are solutions to a certain max-risk problem in the setting where there may be perturbations to all the variables, including the hidden variables, further strengthening the connection between causal inference and distributional robustness. We consider the following perturbation distribution class parameterized by the quantities $C_\zeta, C_\psi \ge 0$:

$$\mathcal{P}_{C_\zeta, C_\psi} = \left\{ \text{distributions } P^e \text{ over random pairs } (X^e, H^e) \text{ satisfying the default SCM (2) with } w^{e,\star} = w^{1,\star} + \zeta^{e,\star} \mathbf{1},\ \zeta^{e,\star} \in [0, C_\zeta],\ \psi^{e,\star} \in [0, C_\psi],\ do(e) = \emptyset \right\},$$

where the default SCM is the setting with IID hidden variables, i.e. $\Psi^{e,\star} = (1 + \psi^{e,\star}) I$. Recall that the sum $\epsilon^e + \delta^e$ in the SCM (2) is distributed as $\epsilon^e + \delta^e \sim \mathcal{N}(0, \mathrm{diag}(w^{e,\star}))$. The constraints on $w^{e,\star}$ ensure that the perturbations on the observed variables are IID with magnitude less than a pre-specified level $C_\zeta$; the constraints on $\psi^{e,\star}$ ensure that the perturbations on the hidden variables have magnitude less than a pre-specified level $C_\psi$; finally, the constraint $do(e) = \emptyset$ ensures that there are no do-interventions. We note that the distributions inside $\mathcal{P}_{C_\zeta, C_\psi}$ are specified by parameters that are invariant, namely the population connectivity matrix $B^\star$, the hidden effects matrix $\Gamma^\star$, and the noise variable $\epsilon$ with the variances of its coordinates encoded in $w^{1,\star}$. We consider the following worst-case optimization program that identifies parameters $B, \Gamma, w$ that are robust to perturbations from the class $\mathcal{P}_{C_\zeta, C_\psi}$:

$$(B_{\mathrm{robust}}, \Gamma_{\mathrm{robust}}, w) = \arg\min_{\substack{B \text{ is a DAG} \\ \Gamma \in \mathbb{R}^{p \times \bar{h}},\ w \in \mathbb{R}^p_{++}}}\ \max_{\substack{P^e \in \mathcal{P}_{C_\zeta, C_\psi} \\ (X^e, H^e) \sim P^e}} \mathrm{KL}\big(\Sigma^{e,\star}, \Sigma_{B, \Gamma, w}(\bar{\zeta}^e, \bar{\psi}^e)\big), \qquad (14)$$

where $\Sigma_{B, \Gamma, w}(\cdot, \cdot)$ is an estimated covariance model with definition shown below and KL is the Gaussian Kullback-Leibler divergence between the estimated and population covariance models. Here, $\bar{\zeta}^e, \bar{\psi}^e \in \mathbb{R}_+$ are estimates for the nuisance perturbation parameters $\zeta^{e,\star}, \psi^{e,\star}$ that vary across the perturbation distributions. For a given $B, \Gamma, w$, the quantities $(\bar{\zeta}^e, \bar{\psi}^e)$ are obtained by finding the best fit to the data:

$$(\bar{\zeta}^e, \bar{\psi}^e) = \arg\min_{0 \le \zeta^e \le C_\zeta,\ 0 \le \psi^e \le C_\psi} \mathrm{KL}\big(\Sigma^{e,\star}, \Sigma_{B, \Gamma, w}(\zeta^e, \psi^e)\big),$$

where

$$\Sigma_{B, \Gamma, w}(\zeta^e, \psi^e) = (I - F_{do(e)^c} B)^{-1}\big(\mathrm{diag}(w) + \zeta^e I + (1 + \psi^e) \Gamma \Gamma^T\big)(I - F_{do(e)^c} B)^{-T}$$

is the covariance specified by the model parameters. In comparison to (13), the risk in (14) is measured jointly over the entire collection of observed variables (via the covariance matrix). As observed previously, this system-wide perspective is crucial for allowing perturbations on all of the variables. The following theorem connects the max-risk solutions $B_{\mathrm{robust}}$ to the causal parameter $B^\star$.
Theorem 4. Suppose that the estimated number of latent variables $\bar{h}$ in (14) is chosen conservatively, i.e. $\bar{h} \ge \dim(H)$. Let the maximum perturbation size on the observed variables in the perturbation class satisfy $C_\zeta \ge \kappa^\star (1 + 2 C_\psi)(1 + \|w^{1,\star}\|_\infty)(1 + \|\Gamma^\star\|_2 + \|\Gamma^\star\|_2^2)$, where $\kappa^\star$ is the same constant as in Assumption 5 of (10). Suppose there exists a perturbation distribution $P^e \in \mathcal{P}_{C_\zeta, C_\psi}$ with parameters $\zeta^{e,\star} = C_\zeta$, $\psi^{e,\star} \neq 0$ such that the random pairs $(X^e, H^e)$ drawn from this distribution satisfy the hidden spuriousness assumption of Definition 1. Then:

1. Any optimal connectivity matrix $B \in \mathcal{B}_{\mathrm{robust}}$ satisfies moral($B^\star$) $\subseteq$ moral($B$).
2. The optimum of $\arg\min_{B \in \mathcal{B}_{\mathrm{robust}}} \|\mathrm{moral}(B)\|_{\ell_0}$ is unique and equal to $B^\star$.

Remark 6: The proof of Theorem 4 is presented in the supplementary material, Section A.5. This result states that the causal parameter $B^\star$ is a solution to a max-risk optimization problem over a certain perturbation class (and produces the sparsest moral graph among the optima), establishing a fundamental relation between causality and distributional robustness. Further, under assumptions similar to those required in Theorem 2 and Theorem 3, analogous connections can be established for the settings with unperturbed and no hidden variables, respectively.
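The risk in (14) is the Kullback-Leibler divergence between centered Gaussians, which is a closed-form function of the two covariance matrices. A minimal sketch, where we read the notation KL($\Sigma^{e,\star}, \Sigma_{B,\Gamma,w}$) as KL($\mathcal{N}(0, \Sigma^{e,\star}) \,\|\, \mathcal{N}(0, \Sigma_{B,\Gamma,w})$), an assumption on our part:

```python
import numpy as np

def gaussian_kl(Sigma0, Sigma1):
    """KL( N(0, Sigma0) || N(0, Sigma1) )
    = 0.5 * ( tr(Sigma1^{-1} Sigma0) - p + log det Sigma1 - log det Sigma0 )."""
    p = Sigma0.shape[0]
    _, ld0 = np.linalg.slogdet(Sigma0)
    _, ld1 = np.linalg.slogdet(Sigma1)
    return 0.5 * (np.trace(np.linalg.solve(Sigma1, Sigma0)) - p + ld1 - ld0)
```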
Solving the DirectLikelihood estimator (5) for a DAG $\mathcal{D}$ is a challenging task, as the problem is non-convex in the decision variables $B, \Gamma, \{\Psi^e\}_{e=1}^m, \{w^e\}_{e=1}^m$. Further, searching over the space of DAGs is super-exponential in the number of variables. These computational challenges are common in causal structure learning problems and are made worse by the presence of multiple environments and hidden confounding. In this section, we propose some heuristics for computing DirectLikelihood based on perturbation data to find optimal scoring DAGs; we discuss open questions regarding computations involving DirectLikelihood in Section 7.

The outline of this section is as follows: in Section 5.1, we describe a method to compute DirectLikelihood for a given DAG structure, that is, when the support of $B$ is pre-specified. Building on this, in Section 5.2, we describe some computational heuristics for structure search over different DAGs.

5.1 Scoring a given DAG

As announced above, we assume here that a DAG, and hence also the support of $B$, is pre-specified. The goal is, for a given DAG, to estimate the unknown parameters. As prescribed in Section 2.1, we employ the DirectLikelihood procedure to fit the perturbation model where the hidden variables are independent and identically distributed across all the environments. In other words, we reparametrize (5) by setting $\Psi^e \to (1 + \psi^e)I$. While the optimization program (5) is jointly non-convex, solving for the connectivity matrix $B$ with the nuisance parameters $\psi, \Gamma$, and $\{w^e\}_{e=1}^m$ fixed is a convex program. Since we are mainly interested in an accurate estimate of the connectivity matrix, we propose the following alternating minimization strategy: starting from an initialization of all of the model parameters, we fix $B$ and perform gradient updates to obtain updated estimates of the nuisance parameters, and then update $B$ by solving a convex program to optimality with the remaining parameters fixed. We find that the alternating method is relatively robust to the initialization scheme, but we nonetheless propose the following concrete strategy:

1) obtain $B^{(0)}$ via linear regression with observational data;
2) set $u^{(0)} = \mathrm{diag}\big\{(I - B^{(0)})\hat\Sigma^1(I - B^{(0)})^T\big\}$;
3) set $\Gamma^{(0)} = U D^{1/2}$, where $U D U^T$ is the eigendecomposition of $(I - B^{(0)})\hat\Sigma^1(I - B^{(0)})^T - \mathrm{diag}(u^{(0)})$;
4) initialize $w^{e,(0)} = u^{(0)} + \zeta^e$ and solve for $\zeta^e, \psi^e$ by 2-dimensional gridding,     (15)

where the first step follows since the DAG structure is known, and the fourth step is based on the observation that for fixed $B, \Gamma, u$, the optimization problems over $\zeta^e$ and $\psi^e$ decouple across the environments $e = 1, 2, \ldots, m$. The entire procedure, involving the initialization step and the parameter updates, is presented in Algorithm 1.
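A compact sketch of the initialization (15), under the stated assumptions: the DAG support is supplied as a parent map, $\hat\Sigma^1$ is the observational sample covariance, and `fit_nuisance` is the illustrative grid-fit routine sketched in Section 4. This is our illustration, not the authors' code.

import numpy as np

def initialize(Sigma_hat, dag_parents, h_bar, C_zeta, C_psi):
    p = Sigma_hat[0].shape[0]
    # 1) connectivity matrix by regressing each node on its known parents
    B0 = np.zeros((p, p))
    for k, pa in dag_parents.items():
        if pa:
            S = Sigma_hat[0]
            B0[k, pa] = np.linalg.solve(S[np.ix_(pa, pa)], S[pa, k])
    # 2) residual variances from the observational environment
    R = (np.eye(p) - B0) @ Sigma_hat[0] @ (np.eye(p) - B0).T
    u0 = np.diag(R).copy()
    # 3) hidden-effect matrix from the top h_bar eigen-components of the
    #    off-diagonal residual structure
    evals, evecs = np.linalg.eigh(R - np.diag(u0))
    idx = np.argsort(evals)[::-1][:h_bar]
    Gamma0 = evecs[:, idx] @ np.diag(np.sqrt(np.maximum(evals[idx], 0.0)))
    # 4) per-environment nuisance parameters by 2-D gridding
    nuisance = [fit_nuisance(S_e, B0, Gamma0, u0, C_zeta, C_psi)
                for S_e in Sigma_hat]
    return B0, Gamma0, u0, nuisance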
Algorithm 1 Scoring $\mathcal{D}$ via DirectLikelihood

Input: $\hat\Sigma^e$ and $do(e)$ for $e = 1, 2, \ldots, m$; regularization parameter $\lambda \geq 0$
1. Initialize parameters via relation (15)
2. Alternating minimization: perform alternating iterates for positive integers $t$ until convergence at iteration $T$:
   (a) Fixing $(\Gamma^{(t)}, \{(\psi^{e,(t)}, w^{e,(t)})\}_{e=1}^m)$, update $B^{(t+1)}$ by solving the convex optimization program (5); fixing $B^{(t+1)}$, perform gradient updates until convergence to find $(\Gamma^{(t+1)}, \{(\psi^{e,(t+1)}, w^{e,(t+1)})\}_{e=1}^m)$
   (b) Repeat the alternating steps until convergence
3. Compute $\mathrm{score}_\lambda(\mathcal{D})$: plug the estimates $(B^{(T)}, \Gamma^{(T)}, \{(\psi^{e,(T)}, w^{e,(T)})\}_{e=1}^m)$ into (6)
Output: $\mathrm{score}_\lambda(\mathcal{D})$ and the connectivity matrix $B^{(T)}$

The alternating-minimization step of Algorithm 1 involves two convergence criteria: the convergence of the gradient steps for the parameters $(\Gamma^{(t)}, \{(\psi^{e,(t)}, w^{e,(t)})\}_{e=1}^m)$, and the convergence of the alternating procedure. For the first criterion, we terminate the gradient descent when the relative change in the likelihood score falls below a tolerance $\epsilon_1$. For the second criterion, we terminate the alternating minimization at step $T$ when $\|B^{(T)} - B^{(T-1)}\|_\infty \leq \epsilon_2$. In the numerical experiments in Section 6, we set $\epsilon_1$ and $\epsilon_2$ to small fixed tolerances. Further, we set the learning rate in the gradient steps to a small constant, and $\lambda$ to twice the BIC analogue [20].
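The following runnable sketch illustrates the alternating scheme in a simplified form: the nuisance step is reduced to the grid fit of $(\zeta^e, \psi^e)$ from the earlier snippet (the paper additionally takes gradient steps in $\Gamma$ and $w$, which we keep fixed at their initialization here), and the convex $B$-step is solved numerically over the free entries of $B$ determined by the candidate DAG. All names are illustrative.

import numpy as np
from scipy.optimize import minimize

def b_objective(b_free, support, Sigma_hat, weights, inner_covs):
    # weighted trace terms of the negative log-likelihood; the log-det terms
    # do not involve B and are dropped
    p = inner_covs[0].shape[0]
    B = np.zeros((p, p)); B[support] = b_free
    A = np.eye(p) - B
    return sum(pi * np.trace(np.linalg.solve(W, A @ S @ A.T))
               for pi, S, W in zip(weights, Sigma_hat, inner_covs))

def algorithm1_loop(Sigma_hat, support, Gamma, u, weights,
                    C_zeta, C_psi, eps2=1e-3, max_iter=50):
    # `support` is a (rows, cols) index pair for the free entries of B;
    # eps2 and max_iter are placeholder tolerances
    p = Sigma_hat[0].shape[0]
    b = np.zeros(len(support[0]))
    for _ in range(max_iter):
        B = np.zeros((p, p)); B[support] = b
        nuis = [fit_nuisance(S, B, Gamma, u, C_zeta, C_psi) for S in Sigma_hat]
        inner = [np.diag(u) + z * np.eye(p) + (1 + s) * Gamma @ Gamma.T
                 for z, s in nuis]
        b_new = minimize(b_objective, b,
                         args=(support, Sigma_hat, weights, inner)).x
        if np.max(np.abs(b_new - b)) <= eps2:   # stopping rule on B iterates
            b = b_new
            break
        b = b_new
    B = np.zeros((p, p)); B[support] = b
    return B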
5.2 Structure search

We have discussed how to score a given DAG using the DirectLikelihood estimator. Searching over all DAGs is typically not possible unless the number of observed variables $p$ is small; indeed, such a combinatorial search is known to be NP-hard in general [6]. One could rely on greedy strategies [5]; we discuss below a strategy that exploits a reasonable set of candidate DAGs. In some applications with domain expertise, a set of plausible DAGs may be available as candidates to be scored by DirectLikelihood. Without such knowledge, however, the candidate set must be obtained from data. In this section, we propose a heuristic to identify a collection of candidate DAGs to be scored via Algorithm 1. Our approach is to identify these DAGs by assuming no hidden confounding. In general, fitting a DAG without taking into account the effect of hidden variables yields a denser graph (compared to the population or Markov equivalent DAGs), since marginalization over hidden variables induces confounding dependencies. As such, scoring such DAGs using Algorithm 1 may yield connectivity matrices that are denser than the population connectivity matrix, although the magnitudes of the spurious edges will be small. Thus, simply retaining the strongest edges or crudely thresholding the entries of the connectivity matrices may compensate for the overly dense candidate DAGs; alternatively, one could perform backward deletion from the scored DAGs, thereby also considering sub-DAGs of the candidate set.

In Section 1, we outlined procedures to identify DAGs without hidden confounding, with the constraint-based PC algorithm, the score-based GES, and the hybrid method ARGES being among the most popular for structure learning in the observational setting. In principle, when domain expertise is not available, many of these methods can be used to find candidate DAGs. For simplicity, in our synthetic illustrations in Section 6, we perform GES on the pooled environmental data. The GES procedure greedily adds or deletes edges in the space of Markov equivalence classes of DAGs based on an $\ell_0$-regularized likelihood score and is asymptotically consistent [5]. We select the regularization parameter to be twice the analogue from the BIC score (as suggested in [20]). Algorithm 2 presents the entire procedure of finding candidate DAGs, scoring them, and selecting the final output.

Algorithm 2 Optimizing DirectLikelihood

Input: $\hat\Sigma^e$ and $do(e)$ for $e = 1, 2, \ldots, m$; regularization parameter $\lambda > 0$
1. Find candidate DAGs: $\mathcal{D}_{\text{cand}} = \{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_q\}$ using domain expertise, GES with pooled data, or some structure learning algorithm
2. Score each DAG: for each $\mathcal{D}_i$, compute $\mathrm{score}_\lambda(\mathcal{D}_i)$ and the corresponding $\hat B_i$ via Algorithm 1
Output: compute $\hat{\mathcal{D}}_{\text{opt}} = \arg\max_{\mathcal{D} \in \mathcal{D}_{\text{cand}}} \mathrm{score}_\lambda(\mathcal{D})$ and the associated $\hat B_{\text{opt}}$. (Note that the arg max may not be unique in the infinite-sample limit, due to potential non-identifiability. In practice, the optimization finds all optimal DAGs within a tolerance of the optimum and also outputs their associated parameter estimates.)

A few remarks regarding Algorithm 2 are in order. First, since the scoring procedure in Algorithm 1 involves some heuristics and there may be multiple DAGs that yield an optimal score, one may more conservatively retain the "near" optimal scoring DAGs in the output step of Algorithm 2. Second, since the candidate DAGs $\mathcal{D}_{\text{cand}}$ may be overly dense due to the hidden effects, the optimal connectivity matrix $\hat B_{\text{opt}}$ may contain weak but nonzero spurious edges. In our numerical experiments in Section 6, we use the entries of $\hat B_{\text{opt}}$ to rank the edges according to their absolute strengths; one may also simply threshold the edges at a pre-specified level.
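A minimal sketch of Algorithm 2's selection step, assuming a hypothetical routine `score_dag` (e.g. wrapping Algorithm 1) that maps a candidate DAG to a pair (score, estimated connectivity matrix); candidates are assumed to come from domain knowledge or GES on pooled data.

def optimize_direct_likelihood(candidates, score_dag, tol=1e-6):
    # Keep all DAGs whose score is within `tol` of the optimum, since the
    # optimizer need not be unique under limited perturbations. Here the
    # score is treated as a loss to be minimized; flip the comparison if
    # score_lambda is defined as a gain.
    scored = [(score_dag(dag), dag) for dag in candidates]
    best = min(s for (s, _), _ in scored)
    return [(dag, B_hat) for (s, B_hat), dag in scored if s <= best + tol]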
6 Numerical experiments

In this section, we illustrate the utility of DirectLikelihood on simulated and real data. In Section 6.1, we study the accuracy of the DirectLikelihood procedure in estimating the population causal graph underlying the observed variables. In Section 6.2, we provide comparisons of DirectLikelihood with the two-stage deconfounding procedure [11]. In Section 6.3, we examine DirectLikelihood under model misspecification, namely non-Gaussian variables in a linear structural equation model, dependent latent variables, and non-linear SCMs. Finally, in Section 6.4, we evaluate DirectLikelihood on the task of learning a causal network from a protein mass spectroscopy data set [28].
Setup for synthetic experiments: we consider a collection of $p = 10$ observed variables influenced by $h \in \{1, 2\}$ hidden variables. To generate the connectivity matrix $B^\star \in \mathbb{R}^{p \times p}$, we sample from an Erdős-Rényi graph with a fixed edge probability and construct $B^\star$ by setting the edge strengths equal to 0.7. The resulting DAG and connectivity matrix consist of 10 nonzero entries. The entries of the hidden coefficient matrix $\Gamma^\star \in \mathbb{R}^{p \times h}$ are generated IID uniformly from the interval $[0, \sqrt{2}]$, with small entries set to zero. The noise $\epsilon$ has IID centered Gaussian coordinates, and unless otherwise specified, the hidden variables are generated as $H \sim \mathcal{N}(0, I_h)$. These parameters specify the distribution of the observed and hidden variables when there are no perturbations; we denote this environment by $e = 1$. In addition to this observational environment, we suppose there are $m - 1$ interventional environments, with $n_1 = 300$ observational samples and $n_e = 5t$ interventional samples for a positive integer $t$. The values of $t$, the number of environments, and the magnitudes of the perturbations on the observed and hidden variables are specified later.

We compare the DAG structure of an estimated connectivity matrix with that of the population connectivity matrix as follows: a false positive is incurred if an edge produced in the estimated DAG is missing or reversed in the population DAG; a true positive is achieved if an edge in the estimated DAG is present with the correct direction in the population DAG.

6.1 DAG structural recovery

We consider the synthetic setup described above. For each environment, we set $\delta^e_k \sim \mathcal{N}(0, \zeta + U_k)$ for $k = 1, 2, \ldots, p$, where $U_k$ is a uniformly drawn offset and $\zeta$ takes certain specified values, and $H^e \sim \mathcal{N}(0, (1 + \psi^{e,\star})I_h)$ with $\psi^{e,\star} \sim \mathrm{Unif}(0, \psi_{\max})$. We take $m = 7$ environments, one observational environment with no perturbations and six interventional environments, and consider the following four settings: (a) $h = 1, \zeta = 5$; (b) $h = 1, \zeta = 2$; (c) $h = 2, \zeta = 5$; and (d) $h = 1, \zeta = 5$, where in the last five environments two randomly chosen observed variables receive do-interventions with values set identically equal to 5. The perturbation data for each setting is supplied to the DirectLikelihood procedure to score each DAG in a collection of candidate DAGs. We set $\bar h = h + 1$ in the DirectLikelihood estimator (5) and constrain the hidden-variable perturbations to $\psi^e \leq \psi_{\max}$ for $e = 2, 3, \ldots, m$. Figure 2 demonstrates the results of the DirectLikelihood procedure (Algorithm 2) for DAG structural recovery in each of the settings (a)-(d), averaged across 10 independent trials, when the candidate DAGs are obtained by performing the GES algorithm on pooled data. To ensure that the accuracy of our results is not simply due to the pooled GES, we compute the average size of the observational Markov equivalence class obtained after the pooled GES step in setting (a): it ranges from roughly 6 to 10 DAGs across the values of $t$.

[Figure 2 about here: curves of true positives versus false positives for $t \in \{1, 2, 4, 16, 64\}$, with panels (a) $h = 1, \zeta = 5$; (b) $h = 2, \zeta = 5$; (c) $h = 1, \zeta = 2$; (d) $h = 1, \zeta = 5$ with do-interventions.]

Figure 2: Structure estimation accuracy of Algorithm 2 (best scoring DAG) using candidate DAGs obtained by the GES algorithm on pooled data for different problem settings. The total number of true discoveries equals 10. The curve for each $t$ corresponds to $5t$ samples in each interventional environment, with $t \in \{1, 2, 4, 16, 64\}$. For each curve, the accuracy of the estimated DAG relative to the population DAG is calculated by ordering the edges according to their strengths and sequentially counting (from strongest to weakest) an edge as a false discovery if it is missing or reversed in the population DAG, and as a true discovery otherwise. Each curve is averaged across 10 independent trials.
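The synthetic setup above can be simulated with a few lines of code. The following sketch draws one environment's data under the perturbation model (2), i.e. $X = (I - B)^{-1}(\Gamma H + \epsilon + \delta)$; the base noise variance and the uniform spread of the perturbation variances are placeholders, and the function name is ours.

import numpy as np

def sample_environment(B, Gamma, n, zeta=0.0, psi=0.0, eps_var=0.5, seed=0):
    # delta_k ~ N(0, zeta + Unif) per coordinate; H^e ~ N(0, (1+psi) I_h)
    rng = np.random.default_rng(seed)
    p, h = Gamma.shape
    A_inv = np.linalg.inv(np.eye(p) - B)
    H = np.sqrt(1.0 + psi) * rng.standard_normal((n, h))
    eps = np.sqrt(eps_var) * rng.standard_normal((n, p))   # placeholder scale
    delta = rng.standard_normal((n, p)) * np.sqrt(zeta + rng.uniform(0, 1, p))
    return (H @ Gamma.T + eps + delta) @ A_inv.T            # rows are samples

The observational environment corresponds to zeta=0.0, psi=0.0; interventional environments use the setting-specific values of zeta and a drawn psi.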
6.2 Comparison with two-stage deconfounding and backShift

We compare the performance of DirectLikelihood with two methods: two-stage deconfounding and backShift. The two-stage deconfounding procedure first employs a sparse-plus-low-rank decomposition on the data from each environment to deconfound the hidden-variable effects, and then employs the DirectLikelihood procedure with $\Gamma \equiv 0$. We set $h = 3$ and consider the synthetic setup described earlier in this section with the following modifications: the columns of $\Gamma^\star \in \mathbb{R}^{p \times 3}$ consist of two standard basis elements, so that two particular observed variables receive sparse direct hidden effects, and a third column with entries sampled IID from the uniform distribution (small entries set to zero). We generate data from one observational environment $e = 1$ and four interventional environments $e = 2, 3, 4, 5$, with $\delta^e_k \sim \mathcal{N}(0, \zeta + U_k)$ for $k = 1, \ldots, p$ and $\zeta = 2$, where $U_k$ is a uniformly drawn offset. We generate the hidden perturbation coefficients $\psi^e \sim \mathrm{Unif}(0, \psi_{\max})$. We obtain $n_1 = 1000$ IID observational samples and $5t$ IID interventional samples for each interventional environment, with $t \in \{10, 200\}$.

The DirectLikelihood and two-stage deconfounding procedures both rely on choosing parameters that control the number of hidden variables ($\bar h$ in DirectLikelihood and two regularization parameters in the two-stage deconfounding procedure). Since we are interested in comparing the identifiability properties of these procedures, we choose $\bar h = 3$ in DirectLikelihood. Further, we choose the regularization parameters in the deconfounding step of the two-stage procedure to optimize prediction while ensuring that the number of latent variables is at most 3. Specifically, we take 80% of the data (at random) as training samples and the remaining 20% as test samples; after sweeping through the two regularization parameters, we choose the best-performing model on the test set among those with at most 3 latent variables. Both the DirectLikelihood procedure and the second stage of two-stage deconfounding score a set of candidate DAGs. Noticing that the sparseness of the hidden effects induces spurious edges between three particular pairs of observed variables, we generate the set of candidate DAGs by adding the three corresponding directed edges to the population DAG $\mathcal{D}^\star_X$, and consider 11 DAGs that add an additional edge at random. Further, we take the 8 members of the Markov equivalence class of the population DAG, so that the total number of candidate DAGs equals 20.

We consider the vanilla backShift output after thresholding the magnitudes of the entries of the estimated connectivity matrix at 0.25. As prescribed in [27], we also consider the output after stability selection (with stability parameter 0.75) on top of the joint diagonalization algorithm to select stable edges.

Table 2 compares the structural recovery (across 10 independent trials) of DirectLikelihood, the two-stage deconfounding procedure, and backShift (with and without stability selection). Evidently, the backShift procedure performs much more poorly than the other methods, which may be due to the perturbations on the hidden variables (not allowed for identifiability) as well as algorithmic instabilities that were also reported by the authors [27]. Next, we observe that, since the denseness assumption is violated, two-stage deconfounding produces a DAG with false positives even when the number of samples in each environment is large (i.e. 1000 samples). Furthermore, in the low-sample regime, the two-stage procedure yields fewer true discoveries than DirectLikelihood. It is worth noting that, in addition to its superior performance, the DirectLikelihood solution is faster to compute, since it does not involve tuning two regularization parameters (as with two-stage deconfounding).
Method                       t = 10                 t = 200
DirectLikelihood             TP = 10,  FP = 2.4     TP = 10,  FP = 0
two-stage deconfounding      TP = 7.4, FP = 2.6     TP = 10,  FP = 3
backShift                    TP = 7.4, FP = 53.5    TP = 6.8, FP = 50.1
backShift with stability     TP = 0,   FP = 0       TP = 0,   FP = 0

Table 2: Comparison of DirectLikelihood with the two-stage deconfounding and backShift procedures. The maximum possible number of true discoveries equals 10. There are 1000 samples in the observational environment and $5t$ samples in each of the interventional environments.

6.3 Model misspecification
We next explore the robustness of DirectLikelihood to model misspecification. We consider three types of misspecification: non-Gaussian noise terms in the linear SCM (2), so that the observed variables are non-Gaussian; non-IID hidden variables; and non-linear functional forms in the SCM. We consider the synthetic setup described earlier in this section, where the data is generated with two hidden variables (i.e. $h = 2$) in the first two settings and with one hidden variable (i.e. $h = 1$) in the non-linear setting. The specific modifications for each problem setting are as follows (a simulation sketch of the non-linear mechanism is given after this list):

• Non-Gaussian: the noise coordinates $\epsilon_k$ and the perturbation terms $\delta^e_k$ are drawn from centered uniform distributions, and $\psi^e \sim \mathrm{Unif}(0, 0.2)$, for $k = 1, 2, \ldots, p$ and $e = 2, 3, \ldots, m$.

• Non-IID hidden variables: $\epsilon$ has IID centered Gaussian coordinates and the two hidden variables are jointly Gaussian with correlated coordinates; in each interventional environment, $\delta^e_k$ is centered Gaussian and $H^e$ is Gaussian with an inflated, correlated covariance matrix, for $k = 1, 2, \ldots, p$ and $e = 2, 3, \ldots, m$.

• Non-linear SCM: $\epsilon$ has IID centered Gaussian coordinates, $H$ is standard Gaussian, $\delta^e_k$ is centered Gaussian, and $H^e \sim (1 + \mathrm{Unif}(0, \psi_{\max}))\,\mathcal{N}(0, 1)$ for $k = 1, 2, \ldots, p$ and $e = 2, 3, \ldots, m$. Further, for every $k = 1, 2, \ldots, p$:
$$X^e_k = B^\star_{k,\mathrm{pa}(k)} X^e_{\mathrm{pa}(k)} + \gamma_k^T H^e + \xi\, \tanh\big(B^\star_{k,\mathrm{pa}(k)} X^e_{\mathrm{pa}(k)} + \gamma_k^T H^e\big) + \epsilon_k + \delta^e_k.$$
We consider two values of the non-linearity strength $\xi$: a moderate value below 1 and $\xi = 1$.

For each setting, we obtain data for an observational environment and 6 interventional environments, for a total of $m = 7$ environments. We supply the perturbation data to the DirectLikelihood procedure with $\bar h = 3$ and a constraint $\psi^e \leq \psi_{\max}$ in the first two settings, and with $\bar h = 1$ and a corresponding constraint in the non-linear setting. Figure 3 demonstrates the robustness of the DirectLikelihood procedure to these model misspecifications. We observe that the DirectLikelihood procedure provides an accurate estimate under the non-Gaussian and non-IID hidden-variable settings. Further, our method appears to be robust to some amount of non-linearity. We remark that the empirical success in the non-Gaussian setting is supported by our theoretical results in Section 3. As also noted in Section 3, our theoretical results can be extended to the setting with non-IID hidden variables; however, we are unable to provide any guarantees for non-linear SCMs.
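The non-linear mechanism in the third setting can be simulated as follows (a sketch under placeholder noise scales; `order` is a topological ordering of the DAG, supplied by the caller, and the function name is ours):

def sample_nonlinear(B, Gamma, order, n, xi, eps_var=0.5, psi=0.0, seed=0):
    # Each coordinate adds a tanh distortion of strength xi to its own
    # structural equation; parents are visited before children via `order`.
    # Interventional perturbations delta can be folded into `noise`.
    rng = np.random.default_rng(seed)
    p, h = Gamma.shape
    H = np.sqrt(1.0 + psi) * rng.standard_normal((n, h))
    noise = np.sqrt(eps_var) * rng.standard_normal((n, p))
    X = np.zeros((n, p))
    for k in order:
        lin = X @ B[k] + H @ Gamma[k]       # linear part B_{k,pa} X_pa + g_k^T H
        X[:, k] = lin + xi * np.tanh(lin) + noise[:, k]
    return X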
6.4 Protein mass spectroscopy data

We evaluate our algorithm on the task of learning a causal network from a protein mass spectroscopy data set [28]. This data set contains a large number of measurements of the abundance of 11 phosphoproteins and phospholipids, recorded under different experimental conditions in primary human immune system cells. The different experimental conditions are characterized by associated reagents that inhibit or activate signaling nodes, corresponding to interventions at different points of the protein-signaling network. Following previous works [18, 17], we take 8 environments, consisting of an observational environment and 7 interventional environments. Partial knowledge of the "ground truth" is available, which helps in verifying the results.

To identify a set of candidate DAGs to score using our DirectLikelihood procedure, we consider all DAGs that are Markov equivalent to the ground-truth DAG reported in [28] (a total of 176 DAGs), as well as the Markov equivalence class of the DAG reported in [18] (22 DAGs). We leave one interventional environment out, supply the data to the DirectLikelihood estimator with $\bar h = 1$, and find the best-scoring connectivity matrix and hidden-variable coefficient matrix. We repeat this process while varying the interventional environment that is left out. Among the seven optimal connectivity matrices (across the seven settings), we keep the edges that appear at least five times and compute the average strength of those edges. Figure 4 and Table 3 show the estimated graph and edge strengths, respectively. Further, we examine the effect of the estimated hidden variable on the proteins. Specifically, following [30], we compute the top singular vector of the average projection matrix (computed across the seven settings) onto the 1-dimensional hidden-variable subspace. The magnitudes of the entries of this singular vector are below 0.015 for all proteins except PKC, P38, and JNK, whose magnitudes are all above 0.5. Thus, the majority of the influence of the hidden variable appears to be on the proteins PKC, P38, and JNK.

[Figure 3 about here: curves of true positives versus false positives under the four misspecification settings, with panels (a) non-Gaussian variables, (b) non-IID hidden variables, (c) non-linear SCM with moderate $\xi$, and (d) non-linear SCM with $\xi = 1$.]

Figure 3: Robustness of DirectLikelihood under model misspecification, including non-Gaussian data, non-IID hidden variables, and non-linear SCMs with different amounts of non-linearity. The total number of possible true discoveries equals 10. We consider $t \in \{1, 2, 4, 16, 64\}$ in the non-Gaussian and non-IID hidden settings and $t \in \{4, 16, 64\}$ in the non-linear SCM settings. In some problem settings, $t = 64$ behaves the same as $t = 16$ and thus cannot be seen. The accuracy of the estimated DAGs via DirectLikelihood is evaluated in a similar fashion as in Figure 2.
Finally, we compare our findings with the direct causal relations reported in the literature [10, 17, 18, 28] in Table 4. We observe that many of the strong edges reported by DirectLikelihood agree with the previous methods, with the exception of the edges Akt → PKA and Erk → Mek.
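The aggregation across the seven leave-one-environment-out fits can be sketched as follows (our illustration; function names are ours):

import numpy as np

def aggregate_edges(B_list, min_count=5):
    # Keep edges appearing in at least `min_count` of the fits and average
    # their strengths; zero out the remaining entries.
    stack = np.array(B_list)                        # shape (n_fits, p, p)
    keep = (np.abs(stack) > 0).sum(axis=0) >= min_count
    return np.where(keep, stack.mean(axis=0), 0.0)

def hidden_influence(Gamma_list):
    # Average the projection matrices onto each estimated 1-dimensional
    # hidden subspace and return the absolute top singular vector, whose
    # large entries indicate the proteins most influenced by the hidden
    # variable (cf. [30]).
    projs = [np.outer(g, g) / (g @ g) for g in (G[:, 0] for G in Gamma_list)]
    U, _, _ = np.linalg.svd(np.mean(projs, axis=0))
    return np.abs(U[:, 0])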
[Figure 4 about here: the estimated protein-signaling DAG over the nodes Raf, Mek, Erk, Akt, PKA, PKC, p38, JNK, PIP2, PIP3, and PLCg.]

Figure 4: Estimated DAG by DirectLikelihood.

Edge           Average strength
PKC → p38      1.71
Mek → Raf      0.88
Akt → Erk      0.67
PKC → JNK      0.57
Akt → PKA      0.26
PIP2 → PIP3    0.22
PLCg → PIP2    0.20
PLCg → PIP3    0.14
Erk → PKA      0.12
PKA → P38      0.04
PKA → PKC      0.03
PKA → JNK      0.02
Erk → Mek      0.01
PKA → Mek      0.004
PKA → Raf      0.001

Table 3: Average edge strengths obtained by DirectLikelihood.
Edge           Supported by (out of five prior networks)
PKC → p38      5
Mek → Raf      3
Akt → Erk      2
PKC → JNK      4
Akt → PKA      0
PIP2 → PIP3    1
PLCg → PIP2    4
PLCg → PIP3    2
Erk → PKA      1
PKA → P38      3
PKA → PKC      1
PKA → JNK      4
Erk → Mek      0
PKA → Mek      4
PKA → Raf      2

Table 4: Comparing the findings of DirectLikelihood (edges ordered by strength) with different causal discovery methods; for each edge we report the number of the five prior networks ([10], [17], [18], and the two networks of [28]) in which it appears. Here, we only include edges found by DirectLikelihood and note that additional edges have been identified by the other methods. The consensus network according to [28] is denoted by "[28](a)" and their reconstructed network by "[28](b)".
7 Discussion

In this paper, we proposed a framework to model perturbation data among a collection of Gaussian observed and hidden variables. It can be represented as a certain mixed-effects linear structural causal model in which the interventions are modeled as random effects that propagate dynamically through the structural equations. This framework allows for perturbations on all components of the system, including a target variable of interest or the hidden variables. In contrast to previous techniques that do not allow for such perturbations, our maximum-likelihood estimator DirectLikelihood identifies the population DAG structure among the observed variables under relatively mild assumptions. The most restrictive assumption (dubbed hidden spuriousness) requires that the hidden variables induce some confounding among the observed variables, and is far less stringent than the hidden denseness assumption needed for the two-stage deconfounding procedure [11]. We further specialized the theoretical guarantees to the settings where the hidden variables are unperturbed or where there are no hidden effects. Finally, we demonstrated the utility of DirectLikelihood for identifying causal relationships on synthetic and real data.

Several interesting directions for further investigation arise from our work; we outline them below.
Computational tools for DirectLikelihood: In Section 5, we proposed heuristics for searching over the space of DAGs and for solving the DirectLikelihood estimator (5) to score DAGs. While the empirical results in Section 6 support the utility of our heuristics, there is much room for more rigorous optimization techniques. Greedy approaches, for example, are common for causal structure identification and are sometimes supported by theoretical guarantees (e.g. GES [5]). A conceptual challenge in developing greedy techniques for DirectLikelihood is that the likelihood score does not decompose according to the DAG structure, due to the hidden confounding. One potential approach to overcome this challenge is to extend the sparsest-poset framework developed in [2] to incorporate data from multiple environments.
High-dimensional consistency analysis of DirectLikelihood: The theoretical results in Section 3 are based on an analysis in the large-data limit. However, our empirical results in Section 6 suggest that the DirectLikelihood procedure may provide accurate estimates at moderate sample sizes. As such, an exciting research direction is to develop high-dimensional consistency guarantees for DirectLikelihood. A possible approach is to extend the results of [31], developed in the context of causal learning from observational data, to the setting with multiple environments and hidden confounding.
Characterizing the interventional Markov equivalence class: In settings where the perturbations are limited, there may be multiple DAGs that are equally representative of the data, or equivalently, multiple DAGs that yield exactly the same likelihood score in the population case. The class of equally scoring DAGs forms an equivalence class, known as the Markov equivalence class in the observational setting and as the interventional Markov equivalence class in the setting with perturbation data. The characterization of these equivalence classes was central to developing structure learning algorithms [5, 12] as well as to constructing active learning strategies for maximally informative interventions [13]. An understanding of the interventional Markov equivalence class for our problem may enable similar advances.
Beyond linear models: The perturbation model (2) assumes a linear relationship between the observed and hidden variables. In certain applications, the linearity assumption may be restrictive, necessitating a richer model class to capture the underlying causal relations. The conceptual leap to non-linearity was considered in [15] by extending the Invariant Causal Prediction (ICP) methodology to non-linear models. As with the ICP technology, [15] assumes that there are no hidden confounders and that the target variable remains unperturbed. Given that our framework overcame these challenges in the linear setting, it would be of interest to explore extensions to the non-linear setting. Alternatively, one can characterize the extent to which linear models capture the causal effects [7].
Acknowledgements
A. Taeb and P. Bühlmann have received funding from the European Research Council under the Grant Agreement No 786461 (CausalStats - ERC-2017-ADG). They both acknowledge scientific interaction and exchange at "ETH Foundations of Data Science". The authors would like to thank Mona Azadkia, Yuansi Chen and Juan Gamella for helpful discussions and feedback on the manuscript.
References

[1] J. Angrist, G. Imbens, and D. Rubin, Identification of causal effects using instrumental variables, Journal of the American Statistical Association, 91 (1996), pp. 444-455.
[2] D. Bernstein, B. Saeed, C. Squires, and C. Uhler, Ordering-based causal structure learning in the presence of latent variables, International Conference on Artificial Intelligence and Statistics, (2020), pp. 4098-4108.
[3] P. Bühlmann, Invariance, causality and robustness, Statistical Science, 35 (2020), pp. 404-426.
[4] V. Chandrasekaran, P. Parrilo, and A. Willsky, Latent variable graphical model selection via convex optimization, Annals of Statistics, 40 (2012), pp. 1935-1967.
[5] D. Chickering, Optimal structure identification with greedy search, Journal of Machine Learning Research, 3 (2002), pp. 507-554.
[6] D. Chickering, C. Meek, and D. Heckerman, Large-sample learning of Bayesian networks is NP-hard, Journal of Machine Learning Research, 5 (2004), pp. 1287-1330.
[7] R. Christiansen, N. Pfister, M. Jakobsen, and N. Gnecco, The difficult task of distribution generalization in nonlinear models, arXiv:2006.07433, (2020).
[8] A. Dixit, O. Parnas, and B. Li, Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens, Cell, 167 (2016), pp. 1853-1866.
[9] M. Drton and M. Maathuis, Structure learning in graphical modeling, Annual Review of Statistics and Its Application, 4 (2017), pp. 365-393.
[10] D. Eaton and K. Murphy, Exact Bayesian structure learning from uncertain interventions, In Proceedings of the 10th International Conference on Artificial Intelligence and Statistics (AISTATS), (2007).
[11] B. Frot, P. Nandy, and M. Maathuis, Robust causal structure learning with hidden variables, Journal of the Royal Statistical Society, Series B, 81 (2019), pp. 459-487.
[12] A. Hauser and P. Bühlmann, Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs, Journal of Machine Learning Research, 13 (2012), pp. 2409-2464.
[13] A. Hauser and P. Bühlmann, Two optimal strategies for active learning of causal models from interventional data, International Journal of Approximate Reasoning, 4 (2014), pp. 926-939.
[14] A. Hauser and P. Bühlmann, Jointly interventional and observational data: estimation of interventional Markov equivalence classes of directed acyclic graphs, Journal of the Royal Statistical Society, Series B, 77 (2015), pp. 291-318.
[15] C. Heinze-Deml, J. Peters, and N. Meinshausen, Invariant causal prediction for nonlinear models, Journal of Causal Inference, 6 (2018), pp. 1-35.
[16] A. McLean, L. Sanders, and W. Walter, A unified approach to mixed linear models, Journal of the American Statistical Association, 45 (1991), pp. 54-64.
[17] N. Meinshausen, A. Hauser, J. Mooij, J. Peters, P. Versteeg, and P. Bühlmann, Methods for causal inference from gene perturbation experiments and validation, Proceedings of the National Academy of Sciences, 113 (2016), pp. 7361-7368.
[18] J. Mooij and T. Heskes, Cyclic causal discovery from continuous equilibrium data, In Proceedings of the 29th Annual Conference on Uncertainty in Artificial Intelligence, (2013), pp. 431-439.
[19] P. Nandy, A. Hauser, and M. Maathuis, High-dimensional consistency in score-based and hybrid structure learning, Annals of Statistics, 46 (2018), pp. 3151-3183.
[20] C. Nowzohour and P. Bühlmann, Score-based causal learning in additive noise models, Statistics, 50 (2016), pp. 471-485.
[21] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, 2nd edition, 2009.
[22] J. Peters and P. Bühlmann, Identifiability of Gaussian structural equation models with equal error variances, Biometrika, 101 (2014), pp. 219-228.
[23] J. Peters, P. Bühlmann, and N. Meinshausen, Causal inference using invariant prediction: identification and confidence intervals, Journal of the Royal Statistical Society, Series B, 78 (2016), pp. 947-1012.
[24] J. Robins, M. Hernan, and B. Brumback, Marginal structural models and causal inference in epidemiology, Epidemiology, 11 (2000), pp. 550-560.
[25] D. Rothenhäusler, P. Bühlmann, and N. Meinshausen, Causal Dantzig: fast inference in linear structural equation models with hidden variables under additive interventions, Annals of Statistics, 47 (2019), pp. 1688-1722.
[26] D. Rothenhäusler, N. Meinshausen, P. Bühlmann, and J. Peters, Anchor regression: heterogeneous data meets causality, arXiv:1801.06229, (2020).
[27] D. Rothenhäusler, C. Heinze, J. Peters, and N. Meinshausen, backShift: Learning causal cyclic graphs from unknown shift interventions, In Advances in Neural Information Processing Systems, (2016).
[28] K. Sachs, O. Perez, D. Lauffenburger, and G. Nolan, Causal protein-signaling networks derived from multiparameter single-cell data, Science, 308 (2005), pp. 523-529.
[29] P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search, Cambridge: MIT Press, 2000.
[30] A. Taeb, P. Shah, and V. Chandrasekaran, False discovery and its control in low-rank estimation, Journal of the Royal Statistical Society, Series B, 82 (2020), pp. 992-1027.
[31] S. van de Geer and P. Bühlmann, ℓ0-penalized maximum likelihood for sparse directed acyclic graphs, Annals of Statistics, 41 (2013), pp. 536-567.
[32] T. Verma and J. Pearl, Equivalence and synthesis of causal models, In Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence (UAI), (1991), pp. 255-270.
[33] Y. Wang, L. Solus, K. Yang, and C. Uhler, Permutation-based causal inference algorithms with interventions, In Advances in Neural Information Processing Systems, (2017), pp. 5822-5831.
Supplementary Material
The proofs of the theoretical results are based on the following population quantities, which we summarize here. Let $B^\star \in \mathbb{R}^{p \times p}$ be the population connectivity matrix, $\Gamma^\star \in \mathbb{R}^{p \times h}$ the hidden coefficient matrix, and $w^{1,\star} \in \mathbb{R}^p_{++}$ the vector encoding the variances of the coordinates of $\epsilon$. Let $\mathrm{diag}(\zeta^{e,\star}) \in \mathbb{S}^p_+$ be the non-negative diagonal matrix encoding the perturbations on the observed variables, and let $w^{e,\star} \in \mathbb{R}^p_{++}$ with $w^{e,\star}_k = w^{1,\star}_k + \zeta^{e,\star}_k$. Let $\{\psi^{e,\star}\}_{e=1}^m \subset \mathbb{R}_+$ be the perturbations on the hidden variables, and let $\kappa^\star = \frac{1 + \max_i \|B^\star_{:,i}\|_2^2}{1 + \min_i \|B^\star_{:,i}\|_2^2}$. Finally, for a matrix $M \in \mathbb{R}^{d \times d}$, we denote by $\|M\|_2$ the spectral norm (largest singular value) of $M$.

A.1 Proof of Theorem 1
For notational convenience, we restate the DirectLikelihood estimator (5) in the presence of perturbed IID hidden variables and observational data:
$$(\hat B, \hat\Gamma, \hat\psi, \{\hat w^e\}_{e=1}^m) = \arg\min_{B,\, \Gamma,\, \psi \in \mathbb{R}^m_+,\, \{w^e\}_{e=1}^m \subseteq \mathbb{R}^p_{++}} \sum_{e=1}^m \pi^{e,\star}\, \ell\big(B, \Gamma, (1 + \psi^e)I, w^e\big) \quad \text{subject to } B \text{ compatible with } \mathcal{D};\; \|\psi\|_\infty \leq C_\psi;\; \psi^1 = 0;\; w^e \succ w^1 \text{ for } e = 2, \ldots, m, \qquad (16)$$
with the score function defined as in (6) with $\lambda = 0$. Here the decision variable $\psi$ encodes the latent perturbations and consists of the coordinates $\psi = (\psi^1, \psi^2, \ldots, \psi^m)$. The optimal DAG is given by $\mathcal{D}^{\text{opt}} = \arg\min_{\mathcal{D}} \mathrm{score}_{\lambda=0}(\mathcal{D})$, with associated connectivity matrices $\mathcal{B}^{\text{opt}}$. The proof of Theorem 1 appeals to the following three lemmas.

Lemma 1. Optimal solutions of (16) satisfy the following equivalence: $(B, \Gamma, \psi, \{w^e\}_{e=1}^m)$ is an optimum of (16) if and only if $B$ is a DAG, $\|\psi\|_\infty < C_\psi$, $\psi^1 = 0$, and for every $e = 1, 2, \ldots, m$,
$$\Sigma^{e,\star} = (I - F_{do(e)^c}B)^{-1}\big(\mathrm{diag}(w^e) + (\psi^e + 1)[\Gamma\Gamma^T]_{do(e)^c}\big)(I - F_{do(e)^c}B)^{-T}.$$

Lemma 2. Let $(\tilde B, \tilde\Gamma, \tilde\psi, \{\tilde w^e\}_{e=1}^m)$ be an optimal solution of (16). The following statements hold:
1. Suppose $\tilde\psi^e \neq \psi^{e,\star}$ for some $e \in \{2, 3\}$. Under Assumptions 1-5 in (10) or Assumptions 6-10 in (11), $\mathrm{moral}(B^\star) \subsetneq \mathrm{moral}(\tilde B)$.
2. Suppose $\tilde\psi^e = \psi^{e,\star}$ for $e = 2, 3$. Under Assumptions 1-5 in (10) or Assumptions 6-10 in (11), $\mathrm{moral}(\tilde B) = \mathrm{moral}(B^\star)$.

Lemma 3. Let $(\tilde B, \tilde\Gamma, \tilde\psi, \{\tilde w^e\}_{e=1}^m)$ be an optimal solution of (16). Suppose $\tilde\psi^e = \psi^{e,\star}$ for $e = 2, 3$. Then $\tilde B = B^\star$.

Combining Lemmas 2 and 3 concludes the proof of Theorem 1, since the true parameter values attain an optimum of (16). We now prove each lemma.
A.1.1 Useful notation

We introduce some notation that will be used repeatedly. For environments $e \in \mathcal{E}$ with no do-interventions, define
$$\kappa^e_{\mathrm{cond}} \equiv \min_{k,l} \Big|\big[(I - B^\star)^T \mathrm{diag}(w^{e,\star})^{-1}(I - B^\star)\big]_{k,l}\Big| \quad \text{s.t. } \rho\big(X^e_k, X^e_l \mid X^e_{\setminus\{k,l\}}, H^e\big) \neq 0,$$
$$\kappa^e_{\mathrm{hidden}} \equiv \max_{k,l} \Big|\big[(I - B^\star)^T \mathrm{diag}(w^{e,\star})^{-1}\Gamma^\star\big(\Gamma^{\star T}\mathrm{diag}(w^{e,\star})^{-1}\Gamma^\star + \tfrac{1}{1+\psi^{e,\star}}I\big)^{-1}\Gamma^{\star T}\mathrm{diag}(w^{e,\star})^{-1}(I - B^\star)\big]_{k,l}\Big| \quad \text{s.t. } \rho\big(X^e_k, X^e_l \mid X^e_{\setminus\{k,l\}}, H^e\big) = 0.$$
The intuition behind the quantities $\kappa^e_{\mathrm{cond}}$ and $\kappa^e_{\mathrm{hidden}}$ is based on the decomposition of $(\Sigma^{e,\star})^{-1}$. Specifically, from the Woodbury inversion lemma:
$$(\Sigma^{e,\star})^{-1} = (I - B^\star)^T \mathrm{diag}(w^{e,\star})^{-1}(I - B^\star) - (I - B^\star)^T \mathrm{diag}(w^{e,\star})^{-1}\Gamma^\star\Big(\Gamma^{\star T}\mathrm{diag}(w^{e,\star})^{-1}\Gamma^\star + \frac{1}{1+\psi^{e,\star}}I\Big)^{-1}\Gamma^{\star T}\mathrm{diag}(w^{e,\star})^{-1}(I - B^\star).$$
Standard multivariate analysis states that for any pair of indices $(k, l)$ with $k \neq l$, $[(\Sigma^{e,\star})^{-1}]_{k,l} \neq 0$ if and only if $\rho(X^e_k, X^e_l \mid X^e_{\setminus\{k,l\}}) \neq 0$. Similarly, since the conditional precision matrix satisfies $\mathrm{cov}(X^e \mid H^e)^{-1} = (I - B^\star)^T\mathrm{diag}(w^{e,\star})^{-1}(I - B^\star)$, we have that $[(I - B^\star)^T\mathrm{diag}(w^{e,\star})^{-1}(I - B^\star)]_{k,l} \neq 0$ if and only if $\rho(X^e_k, X^e_l \mid X^e_{\setminus\{k,l\}}, H^e) \neq 0$. Thus, by definition, $\kappa^e_{\mathrm{cond}} > 0$, and under the hidden spuriousness assumption in Definition 1, $\kappa^e_{\mathrm{hidden}} > 0$.
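The Woodbury decomposition above can be verified numerically; the following illustrative check (ours, on randomly drawn parameters) confirms that the precision matrix splits into a term supported on the moral graph of $B^\star$ and a low-rank hidden correction:

import numpy as np

rng = np.random.default_rng(1)
p, h, psi = 5, 2, 0.3
B = np.triu(rng.normal(size=(p, p)), k=1) * 0.5      # upper-triangular DAG
Gamma = rng.normal(size=(p, h))
w = rng.uniform(1.0, 2.0, size=p)
A, Dinv = np.eye(p) - B, np.diag(1.0 / w)
Sigma = np.linalg.inv(A) @ (np.diag(w) + (1 + psi) * Gamma @ Gamma.T) \
        @ np.linalg.inv(A).T
cond_term = A.T @ Dinv @ A                            # moral-graph support
mid = np.linalg.inv(Gamma.T @ Dinv @ Gamma + np.eye(h) / (1 + psi))
hidden_term = A.T @ Dinv @ Gamma @ mid @ Gamma.T @ Dinv @ A
assert np.allclose(np.linalg.inv(Sigma), cond_term - hidden_term)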
Then,
$$\kappa^e_{\mathrm{cond}} \geq \frac{1 + \min_i \|B^\star_{:,i}\|_2^2}{2\max_k w^{e,\star}_k}. \qquad (17)$$
Similarly, since $\min_k w^{e,\star}_k \geq \|\Gamma^\star\|_2^2(1 + C_\psi)$ and $\min_k w^{e,\star}_k \geq \|w^{1,\star}\|_\infty$, we have
$$\kappa^e_{\mathrm{hidden}} \geq \frac{\big(1 + \min_i \|B^\star_{:,i}\|_2^2\big)(1 + \psi^{e,\star})}{9\big(\max_k w^{e,\star}_k\big)^2}. \qquad (18)$$

A.1.2 Proof of Lemma 1
Proof. Let $\mathcal{M}(B, \Gamma, \psi^e, w^e)$ denote a model associated with each equation in the SCM (2); for notational convenience, we use the shorthand $\mathcal{M}^e$. We let
$$\Sigma(\mathcal{M}^e) = (I - F_{do(e)^c}B)^{-1}\big(\mathrm{diag}(w^e) + (1 + \psi^e)[\Gamma\Gamma^T]_{do(e)^c}\big)(I - F_{do(e)^c}B)^{-T}$$
be the associated covariance model parameterized by $(B, \Gamma, \psi^e, w^e)$. The optimal solution of the population DirectLikelihood can be equivalently reformulated as
$$\arg\min_{\{\mathcal{M}^e\}_{e=1}^m} \sum_{e=1}^m \pi^{e,\star}\, \mathrm{KL}\big(\Sigma^{e,\star}, \Sigma(\mathcal{M}^e)\big). \qquad (19)$$
Notice that for the decision variables $\mathcal{M}^{e,\star} = (B^\star, \Gamma^\star, \psi^{e,\star}, w^{e,\star})$ for each $e = 1, 2, \ldots, m$, (19) achieves zero loss. Hence, any other optimal solution of (19) must also yield zero loss, or equivalently, $\Sigma(\mathcal{M}^e) = \Sigma^{e,\star}$ for any optimal collection $\{\mathcal{M}^e\}_{e=1}^m$.

A.1.3 Proof of Lemma 2
Proof. We first provide the proof of Lemma 2 under Assumptions 1-5 in (10); below, $\lesssim$ denotes an inequality up to an absolute constant. Since there are no do-interventions in the environments $e = 1, 2, 3$, Lemma 1 yields, for $e = 2, 3$:
$$\Sigma^{e,\star} - (1 + \tilde\psi^e)\Sigma^{1,\star} = (I - B^\star)^{-1}\Big(\mathrm{diag}\big(w^{e,\star} - (1 + \tilde\psi^e)w^{1,\star}\big) + (\psi^{e,\star} - \tilde\psi^e)\Gamma^\star\Gamma^{\star T}\Big)(I - B^\star)^{-T} = (I - \tilde B)^{-1}\,\mathrm{diag}\big(\tilde w^e - (1 + \tilde\psi^e)\tilde w^1\big)(I - \tilde B)^{-T}. \qquad (20)$$
Since $\min_k w^{e,\star}_k \geq C_\psi\big(\|w^{1,\star}\|_\infty + C_\psi\|\Gamma^\star\|_2^2\big)$ by Assumption 3 in (10), the matrix $\Sigma^{e,\star} - (1 + \tilde\psi^e)\Sigma^{1,\star}$ is invertible for $e = 2, 3$. Consider an environment $e \in \{2, 3\}$ for which $\psi^{e,\star} \neq \tilde\psi^e$. Inverting (20) for this environment, we obtain
$$(I - B^\star)^T(M^e + L^e)(I - B^\star) = (I - \tilde B)^T\,\mathrm{diag}\big(\tilde w^e - (1 + \tilde\psi^e)\tilde w^1\big)^{-1}(I - \tilde B), \qquad (21)$$
where $\Delta\psi^e = \psi^{e,\star} - \tilde\psi^e$ and
$$M^e = \mathrm{diag}\big(w^{e,\star} - (1 + \tilde\psi^e)w^{1,\star}\big)^{-1}, \qquad L^e = -\,M^e\Gamma^\star\Big[\Gamma^{\star T}M^e\Gamma^\star + \tfrac{1}{\Delta\psi^e}I\Big]^{-1}\Gamma^{\star T}M^e.$$
The nonzero entries of $(I - \tilde B)^T\mathrm{diag}(\tilde w^e - (1 + \tilde\psi^e)\tilde w^1)^{-1}(I - \tilde B)$ encode the moral graph induced by $\tilde B$. Our strategy is to use Assumptions 1-5 in (10) to show that $(I - B^\star)^T(M^e + L^e)(I - B^\star)$ is nonzero at every entry of the moral graph of $B^\star$ and at least one entry outside of it. To this end, consider the following intermediate matrices, close to $M^e$ and $L^e$:
$$\bar M^e = \mathrm{diag}(w^{e,\star})^{-1}, \qquad \bar L^e = \bar M^e\Gamma^\star\Big[\Gamma^{\star T}\bar M^e\Gamma^\star + \tfrac{1}{1+\psi^{e,\star}}I\Big]^{-1}\Gamma^{\star T}\bar M^e.$$
Notice that
$$\|\bar M^e - M^e\|_2 \lesssim \frac{(1 + C_\psi)\|w^{1,\star}\|_\infty}{(\min_k w^{e,\star}_k)^2}, \qquad \|M^e\|_2 \lesssim \frac{1}{\min_k w^{e,\star}_k}, \qquad (22)$$
where the inequalities follow by noting that $5(1 + C_\psi)\|w^{1,\star}\|_\infty \leq \min_k w^{e,\star}_k$ from Assumption 5 in (10). Now let $(k, l)$ be any pair of indices connected in the moral graph of $B^\star$. Then
$$\big|\big[(I - B^\star)^TM^e(I - B^\star)\big]_{k,l}\big| \geq \big|\big[(I - B^\star)^T\bar M^e(I - B^\star)\big]_{k,l}\big| - \big(1 + \max_i\|B^\star_{:,i}\|_2^2\big)\|\bar M^e - M^e\|_2 \geq \kappa^e_{\mathrm{cond}} - \big(1 + \max_i\|B^\star_{:,i}\|_2^2\big)\|\bar M^e - M^e\|_2 \geq \frac{1 + \min_i\|B^\star_{:,i}\|_2^2}{4\max_k w^{e,\star}_k}, \qquad (23)$$
where the second-to-last inequality follows from (17) and the last inequality uses $\min_k w^{e,\star}_k / \mathrm{cond}(\mathrm{diag}(w^{e,\star})) \geq \kappa^\star(1 + C_\psi)\|w^{1,\star}\|_\infty$. Next, we control $\|(I - B^\star)^TL^e(I - B^\star)\|_\infty$. Using the inequality $\big[\Gamma^{\star T}M^e\Gamma^\star + \tfrac{1}{\Delta\psi^e}I\big]^{-1} \preceq \Delta\psi^e\,I$, we have
$$\big\|(I - B^\star)^TL^e(I - B^\star)\big\|_\infty \lesssim \frac{C_\psi\big(1 + \max_i\|B^\star_{:,i}\|_2^2\big)\|\Gamma^\star\|_2^2}{(\min_k w^{e,\star}_k)^2}. \qquad (24)$$
Since $\min_k w^{e,\star}_k / \mathrm{cond}(\mathrm{diag}(w^{e,\star}))$ dominates $C_\psi\|\Gamma^\star\|_2^2$, comparing (24) and (23) shows that $\big|[(I - B^\star)^TM^e(I - B^\star)]_{k,l}\big| > \big\|(I - B^\star)^TL^e(I - B^\star)\big\|_\infty$ for any indices $(k, l)$ connected in the moral graph of $B^\star$.

To finish the proof of the first assertion of Lemma 2, we must show that for the indices $(k, l)$ attaining the optimum in $\kappa^e_{\mathrm{hidden}}$, which lie outside the moral graph of $B^\star$, we have $\big|[(I - B^\star)^TL^e(I - B^\star)]_{k,l}\big| > 0$. Notice that
$$\Big|\Big[(I - B^\star)^T\Big(\tfrac{1+\psi^{e,\star}}{\Delta\psi^e}L^e\Big)(I - B^\star)\Big]_{k,l}\Big| \geq \big|\big[(I - B^\star)^T\bar L^e(I - B^\star)\big]_{k,l}\big| - \Big\|\tfrac{1+\psi^{e,\star}}{\Delta\psi^e}L^e - \bar L^e\Big\|_2\big(1 + \max_i\|B^\star_{:,i}\|_2^2\big) \geq \frac{\big(1 + \min_i\|B^\star_{:,i}\|_2^2\big)(1 + \psi^{e,\star})}{9(\max_k w^{e,\star}_k)^2} - \Big\|\tfrac{1+\psi^{e,\star}}{\Delta\psi^e}L^e - \bar L^e\Big\|_2\big(1 + \max_i\|B^\star_{:,i}\|_2^2\big),$$
where the last inequality follows from (18). Thus, it suffices to show that
$$\frac{\big(1 + \min_i\|B^\star_{:,i}\|_2^2\big)(1 + \psi^{e,\star})}{9(\max_k w^{e,\star}_k)^2} > \Big\|\tfrac{1+\psi^{e,\star}}{\Delta\psi^e}L^e - \bar L^e\Big\|_2\big(1 + \max_i\|B^\star_{:,i}\|_2^2\big). \qquad (25)$$
To that end, we control the term $\big\|\tfrac{1+\psi^{e,\star}}{\Delta\psi^e}L^e - \bar L^e\big\|_2$ via the triangle inequality:
$$\Big\|\tfrac{1+\psi^{e,\star}}{\Delta\psi^e}L^e - \bar L^e\Big\|_2 \leq \underbrace{\Big\|(M^e - \bar M^e)\Gamma^\star\Big[\tfrac{\Delta\psi^e}{1+\psi^{e,\star}}\Gamma^{\star T}M^e\Gamma^\star + \tfrac{1}{1+\psi^{e,\star}}I\Big]^{-1}\Gamma^{\star T}M^e\Big\|_2}_{\text{Term 1}} + \underbrace{\Big\|(M^e - \bar M^e)\Gamma^\star\Big[\tfrac{\Delta\psi^e}{1+\psi^{e,\star}}\Gamma^{\star T}M^e\Gamma^\star + \tfrac{1}{1+\psi^{e,\star}}I\Big]^{-1}\Gamma^{\star T}(M^e - \bar M^e)\Big\|_2}_{\text{Term 2}} + \underbrace{\Big\|\bar M^e\Gamma^\star\Big\{\Big[\tfrac{\Delta\psi^e}{1+\psi^{e,\star}}\Gamma^{\star T}M^e\Gamma^\star + \tfrac{1}{1+\psi^{e,\star}}I\Big]^{-1} - \Big[\Gamma^{\star T}\bar M^e\Gamma^\star + \tfrac{1}{1+\psi^{e,\star}}I\Big]^{-1}\Big\}\Gamma^{\star T}\bar M^e\Big\|_2}_{\text{Term 3}}. \qquad (26)$$
We bound each of the individual terms in (26). Using $\big[\Gamma^{\star T}M^e\Gamma^\star + \tfrac{1}{\Delta\psi^e}I\big]^{-1} \preceq \Delta\psi^e\,I$, the bound on $\|M^e\|_2$, and relation (22), Terms 1 and 2 satisfy
$$\text{Term 1} \lesssim \frac{(1 + C_\psi)\|w^{1,\star}\|_\infty\|\Gamma^\star\|_2^2}{(\min_k w^{e,\star}_k)^2}, \qquad \text{Term 2} \lesssim \frac{(1 + C_\psi)^2\|w^{1,\star}\|_\infty^2\|\Gamma^\star\|_2^2}{(\min_k w^{e,\star}_k)^3}. \qquad (27)$$
To bound Term 3, we use the Taylor (Neumann) series $(A + E)^{-1} - A^{-1} = A^{-1}\sum_{k=1}^\infty(-EA^{-1})^k$. If $\|E\|_2\|A^{-1}\|_2 < 1$, we can bound the spectral norm of the difference as
$$\|(A + E)^{-1} - A^{-1}\|_2 \leq \|A^{-1}\|_2\sum_{k=1}^\infty \|E\|_2^k\|A^{-1}\|_2^k = \frac{\|A^{-1}\|_2^2\|E\|_2}{1 - \|E\|_2\|A^{-1}\|_2}. \qquad (28)$$
In the context of Term 3, $E = \tfrac{\Delta\psi^e}{1+\psi^{e,\star}}\Gamma^{\star T}M^e\Gamma^\star - \Gamma^{\star T}\bar M^e\Gamma^\star$ and $A = \Gamma^{\star T}\bar M^e\Gamma^\star + \tfrac{1}{1+\psi^{e,\star}}I \succeq \tfrac{1}{1+\psi^{e,\star}}I$, so that $\|A^{-1}\|_2 \leq 1 + \psi^{e,\star}$. One can check that $\|E\|_2 \lesssim (C_\psi + 1)\|\Gamma^\star\|_2^2 / \min_k w^{e,\star}_k$. Thus, employing the relation that $\min_k w^{e,\star}_k$ dominates $(C_\psi + 1)\|\Gamma^\star\|_2^2$, we have
$$\text{Term 3} \lesssim \frac{(C_\psi + 1)\|\Gamma^\star\|_2^2}{(\min_k w^{e,\star}_k)^2}. \qquad (29)$$
Combining the bounds in (27) and (29) with (26), we find that
$$\Big\|\tfrac{1+\psi^{e,\star}}{\Delta\psi^e}L^e - \bar L^e\Big\|_2 \leq \Big(\frac{10}{4} + \frac{25}{16} + \frac{5}{4}\Big)\frac{(1 + 2C_\psi)\max\{\|\Gamma^\star\|_2^2, \|\Gamma^\star\|_2^3\}\max\{1, \|w^{1,\star}\|_\infty\}}{(\min_k w^{e,\star}_k)^2},$$
where the second inequality also uses $\min_k w^{e,\star}_k \geq \|w^{1,\star}\|_\infty(1 + C_\psi)$. Thus, since $\min_k w^{e,\star}_k / \mathrm{cond}(\mathrm{diag}(w^{e,\star})) \geq \kappa^\star(1 + 2C_\psi)\max\{\|\Gamma^\star\|_2^2, \|\Gamma^\star\|_2^3\}\max\{1, \|w^{1,\star}\|_\infty\}$, the sufficient condition (25) is satisfied. This shows that if $\tilde\psi^e \neq \psi^{e,\star}$ for some $e \in \{2, 3\}$, then $(I - B^\star)^T(M^e + L^e)(I - B^\star)$ has a nonzero entry outside the moral graph of $B^\star$, and hence $\mathrm{moral}(B^\star) \subsetneq \mathrm{moral}(\tilde B)$. We have established the first assertion of Lemma 2; the second assertion (where $\tilde\psi^e = \psi^{e,\star}$ for $e = 2, 3$) follows directly from (20).

We next prove Lemma 2 under Assumptions 6-10 in (11). Lemma 1 implies the relations
$$(I - B^\star)^{-1}\Big(\mathrm{diag}\big(w^{2,\star} - (1 + \tilde\psi^2)w^{1,\star}\big) + (\psi^{2,\star} - \tilde\psi^2)\Gamma^\star\Gamma^{\star T}\Big)(I - B^\star)^{-T} = (I - \tilde B)^{-1}\mathrm{diag}\big(\tilde w^2 - (1 + \tilde\psi^2)\tilde w^1\big)(I - \tilde B)^{-T},$$
$$(I - B^\star)^{-1}\Bigg(\mathrm{diag}\Big(w^{3,\star} - \tfrac{1+\tilde\psi^3}{1+\tilde\psi^2}w^{2,\star}\Big) + \Big(\psi^{3,\star} - (1 + \tilde\psi^3)\tfrac{1+\psi^{2,\star}}{1+\tilde\psi^2}\Big)\Gamma^\star\Gamma^{\star T}\Bigg)(I - B^\star)^{-T} = (I - \tilde B)^{-1}\mathrm{diag}\Big(\tilde w^3 - \tfrac{1+\tilde\psi^3}{1+\tilde\psi^2}\tilde w^2\Big)(I - \tilde B)^{-T}. \qquad (30)$$
Using (30) and an analysis similar to the proof under Assumptions 1-5 in (10), one arrives at the conclusion of Lemma 2 under Assumptions 6-10 in (11).

A.1.4 Proof of Lemma 3
Proof. We consider the setup with Assumptions 1-5 in (10) and, for brevity, omit the proof under Assumptions 6-10 in (11). From (20), we have for $e = 2, 3$:
$$(I - B^\star)^{-1}\mathrm{diag}\big(w^{e,\star} - (1 + \psi^{e,\star})w^{1,\star}\big)(I - B^\star)^{-T} = (I - \tilde B)^{-1}\mathrm{diag}\big(\tilde w^e - (1 + \psi^{e,\star})\tilde w^1\big)(I - \tilde B)^{-T}. \qquad (31)$$
By Assumptions 4 and 5, the matrix $\big[w^{2,\star} - (1 + \psi^{2,\star})w^{1,\star},\; w^{3,\star} - (1 + \psi^{3,\star})w^{1,\star}\big]$ has full row rank and nonzero entries. From relation (31), we have for $e = 2, 3$:
$$(I - \tilde B)(I - B^\star)^{-1}\mathrm{diag}\big(w^{e,\star} - (1 + \psi^{e,\star})w^{1,\star}\big)(I - B^\star)^{-T}(I - \tilde B)^T = \mathrm{diag}\big(\tilde w^e - (1 + \psi^{e,\star})\tilde w^1\big).$$
Let $\phi^e_k := \big[(I - \tilde B)(I - B^\star)^{-1}\mathrm{diag}\big(w^{e,\star} - (1 + \psi^{e,\star})w^{1,\star}\big)\big]_{k,:}$ for $k = 1, 2, \ldots, p$, and let $\xi_k := \big[(I - \tilde B)(I - B^\star)^{-1}\big]_{k,:}$. Then, for every $k = 1, 2, \ldots, p$,
$$\phi^e_k \perp \xi_l \quad \text{for all } l \neq k. \qquad (32)$$
Notice that for any $k$, the vectors $\{\xi_l\}_{l \neq k}$ are linearly independent. The condition above means that $\phi^2_k$ and $\phi^3_k$ (neither of which is exactly a zero vector, because $w^{2,\star} - (1 + \psi^{2,\star})w^{1,\star} \neq 0$ and $w^{3,\star} - (1 + \psi^{3,\star})w^{1,\star} \neq 0$) live inside the one-dimensional null space of the matrix formed by concatenating the vectors $\{\xi_l\}_{l \neq k}$. In particular, for every $k$, there is a constant $c \neq 0$ such that $\xi_k\,\mathrm{diag}\big(w^{2,\star} - (1 + \psi^{2,\star})w^{1,\star}\big) = c\,\xi_k\,\mathrm{diag}\big(w^{3,\star} - (1 + \psi^{3,\star})w^{1,\star}\big)$. It is straightforward to check that the condition
$$\frac{w^{2,\star}_k - (1 + \psi^{2,\star})w^{1,\star}_k}{w^{3,\star}_k - (1 + \psi^{3,\star})w^{1,\star}_k} \neq \frac{w^{2,\star}_l - (1 + \psi^{2,\star})w^{1,\star}_l}{w^{3,\star}_l - (1 + \psi^{3,\star})w^{1,\star}_l} \quad \text{for } k, l \in S,\; k \neq l$$
implies that either $\Xi_{m,S} = 0$, or $\Xi_{m,S}$ has exactly one nonzero component and $\Xi_{m,S^c} = 0$, for every row $m$, (33), where $\Xi \in \mathbb{R}^{p \times p}$ is the matrix formed by concatenating the row vectors $\{\xi_i\}_{i=1}^p$, so that $I - \tilde B = \Xi(I - B^\star)$. Since $I - \tilde B$ and $I - B^\star$ are invertible, so is $\Xi$. Relation (33), the fact that $S = \{1, 2, \ldots, p\}$, and the invertibility of $\Xi$ imply that $\Xi$ is a diagonal matrix up to row permutations, so that $(I - \tilde B) = K_{\tilde\pi}D(I - B^\star)$, where $D$ is diagonal with nonzero diagonal entries and $\tilde\pi$ is a permutation. Since $I - \tilde B$ has ones on the diagonal, it is straightforward to check that $K_{\tilde\pi} = D = I$, and thus $\tilde B = B^\star$.

A.2 Role of $\bar h$ in identifiability

In this section, we consider the role of $\bar h$ (i.e. the number of latent variables in the model) for identifiability. Theorem 1 states that as long as Assumptions 1-5 in (10) or Assumptions 6-10 in (11) are satisfied, identifiability is possible for any $\bar h \geq h$. These assumptions rely on the existence of at least two interventional environments. We first show that this is a necessary condition in the setting $\bar h = p$. We then show that if $\bar h = h$ and under some incoherence conditions (e.g. dense latent effects and a sparse DAG structure), a single interventional environment is sufficient for identifiability.
A.2.1 $\bar h = p$

Suppose there is only a single interventional environment, satisfying Assumptions 1-5 as an example. We will show that, in addition to the population solution, the MLE has an additional minimizer $(\tilde B, \tilde\Gamma, \{(\tilde w^e, \tilde\psi^e)\}_{e=1}^m)$, by showing that these parameters satisfy the requirements for an optimal solution in Lemma 1. We let $\tilde\psi^e = \psi^{e,\star}$, take $\tilde B$ to be a connectivity matrix associated with a DAG in the Markov equivalence class of $\mathcal{D}^\star_X$ with $\tilde B \neq B^\star$, and choose the parameters $\tilde\Gamma, \tilde w^1, \tilde w^2$ appropriately. Specifically, let $\tilde B$ and $\tilde w^2 - (1 + \tilde\psi^2)\tilde w^1$ satisfy
$$(I - B^\star)^{-1}\mathrm{diag}\big(w^{2,\star} - (1 + \psi^{2,\star})w^{1,\star}\big)(I - B^\star)^{-T} = (I - \tilde B)^{-1}\mathrm{diag}\big(\tilde w^2 - (1 + \tilde\psi^2)\tilde w^1\big)(I - \tilde B)^{-T}. \qquad (34,\,35)$$
In particular, we let $\tilde B$ be any DAG Markov equivalent to $B^\star$, find $\bar w^1$ such that $(I - B^\star)^{-1}\mathrm{diag}(w^{1,\star})(I - B^\star)^{-T} = (I - \tilde B)^{-1}\mathrm{diag}(\bar w^1)(I - \tilde B)^{-T}$, and find $\bar w^2$ such that $(I - B^\star)^{-1}\mathrm{diag}(w^{2,\star})(I - B^\star)^{-T} = (I - \tilde B)^{-1}\mathrm{diag}(\bar w^2)(I - \tilde B)^{-T}$. We then let $\tilde w^1 = \bar w^1 - \mu$ and $\tilde w^2 = \bar w^2 - (1 + \psi^{2,\star})\mu$ for a non-negative vector $\mu$; for this choice of parameters, (35) is satisfied. It remains to check that
$$(I - B^\star)^{-1}\big(\mathrm{diag}(w^{1,\star}) + \Gamma^\star\Gamma^{\star T}\big)(I - B^\star)^{-T} = (I - \tilde B)^{-1}\big(\mathrm{diag}(\tilde w^1) + \tilde\Gamma\tilde\Gamma^T\big)(I - \tilde B)^{-T}.$$
Recalling that $\tilde w^1 = \bar w^1 - \mu$, it is straightforward to find a full-rank $\tilde\Gamma$ that satisfies this relation.

A.2.2 $\bar h = h$

We consider the setting with a single interventional environment satisfying Assumptions 1-5 in (10). We show that under some incoherence-type assumptions, the DirectLikelihood procedure combined with choosing the sparsest moral graph has a unique optimum, equal to $B^\star$. By Lemma 2, we have $\mathrm{moral}(B^\star) \subsetneq \mathrm{moral}(\tilde B)$ unless $\tilde\psi^2 = \psi^{2,\star}$. Since we seek the sparsest moral graph, we conclude that $\mathrm{moral}(B^\star) = \mathrm{moral}(\tilde B)$. By Lemma 1,
$$(I - B^\star)^{-1}\big(\mathrm{diag}(w^{e,\star}) + (1 + \psi^{e,\star})\Gamma^\star\Gamma^{\star T}\big)(I - B^\star)^{-T} = (I - \tilde B)^{-1}\big(\mathrm{diag}(\tilde w^e) + (1 + \tilde\psi^e)\tilde\Gamma\tilde\Gamma^T\big)(I - \tilde B)^{-T}.$$
By the Woodbury inversion lemma, we have for both $e = 1, 2$:
$$\Big[(I - B^\star)^T\mathrm{diag}(w^{e,\star})^{-1}(I - B^\star) - (I - \tilde B)^T\mathrm{diag}(\tilde w^e)^{-1}(I - \tilde B)\Big] + L^{e,\star} \quad \text{is of rank } h, \qquad (36)$$
where $L^{e,\star}$ is a rank-$h$ matrix whose row and column spaces equal those of $(I - B^\star)^T\mathrm{diag}(w^{e,\star})^{-1}\Gamma^\star\Gamma^{\star T}\mathrm{diag}(w^{e,\star})^{-1}(I - B^\star)$. Notice that the quantity inside the brackets in (36) is supported on the moral graph of $B^\star$. We now use rank-sparsity incoherence [4] to conclude that the term inside the brackets in (36) vanishes. In particular, if the tangent space of the sparse variety at the moral graph of $B^\star$ is transverse to the tangent space of the low-rank variety at $L^{e,\star}$, then (36) is satisfied if and only if, for $e = 1, 2$,
$$\Big[(I - B^\star)^T\mathrm{diag}(w^{e,\star})^{-1}(I - B^\star) - (I - \tilde B)^T\mathrm{diag}(\tilde w^e)^{-1}(I - \tilde B)\Big] = 0. \qquad (37)$$
The transversality of the tangent spaces is satisfied if the latent effects are dense and $\mathcal{D}^\star_X$ is sparse. Thus, following the same strategy as in the proof of Lemma 3, we conclude from (37) that $\tilde B = B^\star$.

A.3 Single parameter perturbation setting

As discussed in Section 2.1, one may fit to data the perturbation model (2) in which the perturbation magnitudes are equal across the coordinates, e.g. $\mathrm{var}(\delta^e) = \zeta^{e,\star}$ for the parameter vector $\zeta^\star \in \mathbb{R}^m$. Fitting such a model is achieved by the reparametrization $w^e = v + \zeta^e\mathbf{1}$ of the decision variable $w^e$ in (5), for parameter vectors $v \in \mathbb{R}^p$ and $\zeta \in \mathbb{R}^m$. We assume an observational environment $e = 1$ and two interventional environments $e = 2, 3$, and impose the following modified assumptions.

Modified Assumption 3 (heterogeneity among the perturbations): the vectors $\begin{pmatrix}\psi^{2,\star}\\ \psi^{3,\star}\end{pmatrix}$ and $\begin{pmatrix}\zeta^{2,\star}\\ \zeta^{3,\star}\end{pmatrix}$ are linearly independent.

Modified Assumption 5 (the perturbation is sufficiently strong for $e = 3$):
$$\zeta^{3,\star} \geq \kappa^\star(1 + 2C_\psi)\big(1 + \|v^{1,\star}\|_\infty\big)\big(1 + \|\Gamma^\star\|_2 + \|\Gamma^\star\|_2^2\big). \qquad (38)$$

With this modification, we have the following consistency guarantees.

Theorem 5 (Single parameter perturbation with perturbed hidden variables). Suppose Assumptions 1, 2, 4 in (11) and the modified Assumptions 3 and 5 in (38) are satisfied. The following assertions hold:
1. $B^\star \in \mathcal{B}^{\text{opt}}$, and any other optimum $B \in \mathcal{B}^{\text{opt}}$ satisfies $\mathrm{moral}(B^\star) \subseteq \mathrm{moral}(B)$.
2. The optimum of $\arg\min_{B \in \mathcal{B}^{\text{opt}}}\|\mathrm{moral}(B)\|_{\ell_0}$ is unique and equal to $B^\star$.

We next provide identifiability guarantees in the setting without hidden perturbations (i.e. $\psi^{e,\star} = 0$ for all $e$) with single parameter perturbations. Fitting such a model is achieved by the reparametrization $w^e = w^1 + \zeta^e\mathbf{1}$ for a parameter $\zeta^e \in \mathbb{R}_+$ and $\psi^e \equiv 0$. We then have the following identifiability result in this setting.
Theorem 6 (Single parameter perturbation with unperturbed hidden variables). Suppose Assumptions 1 and 2 in (10) are satisfied for the environments $e = 1, 2$. Then, if $\zeta^{2,\star} > 0$, $\mathcal{D}_{\mathrm{opt}} = \mathcal{D}^\star_X$ and $B_{\mathrm{opt}} = B^\star$.
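As a reference for the proofs below, the following sketch (with hypothetical parameter values; the helper name `single_param_cov` is ours) generates environment covariances under the single-parameter reparametrization and checks the structure of the differences used in (40):

```python
import numpy as np

def single_param_cov(B, Gamma, v, zeta_e, psi_e):
    """Environment covariance when every coordinate receives the same noise
    shift zeta_e and the latent effects are scaled by (1 + psi_e)."""
    p = B.shape[0]
    A = np.linalg.inv(np.eye(p) - B)
    return A @ (np.diag(v + zeta_e) + (1 + psi_e) * Gamma @ Gamma.T) @ A.T

rng = np.random.default_rng(1)
B = np.array([[0.0, 0.0, 0.0], [0.8, 0.0, 0.0], [0.0, -1.2, 0.0]])  # X1 -> X2 -> X3
Gamma = rng.normal(size=(3, 1))
v = np.ones(3)

# without hidden perturbations (psi = 0), differences follow the form in (40)
A = np.linalg.inv(np.eye(3) - B)
S1 = single_param_cov(B, Gamma, v, 0.0, 0.0)   # observational, zeta^{1*} = 0
S2 = single_param_cov(B, Gamma, v, 1.5, 0.0)   # zeta^{2*} = 1.5
assert np.allclose(S2 - S1, 1.5 * A @ A.T)
```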
A.3.1 Proof of Theorem 5

Proof.
The proof of the first part closely mirrors that of Theorem 1 and is left out for brevity. It concludes that $\tilde\psi^e = \psi^{e,\star}$ for $e = 1, 2, 3$. To prove the second part, suppose that in addition to the population parameters $(B^\star, \Gamma^\star, \{(\psi^{e,\star}, \zeta^{e,\star})\}_{e=1}^m)$, DirectLikelihood has another solution $(\tilde B, \tilde\Gamma, \{(\tilde\psi^e, \tilde\zeta^e)\}_{e=1}^m)$. Then, since the first environment does not consist of any perturbations, we find that:
\[\Sigma^{2,\star} - \Sigma^{1,\star} = (I - B^\star)^{-1}\big(\zeta^{2,\star} I + \psi^{2,\star}\Gamma^\star\Gamma^{\star T}\big)(I - B^\star)^{-T} = (I - \tilde B)^{-1}\big(\tilde\zeta^2 I + \psi^{2,\star}\tilde\Gamma\tilde\Gamma^T\big)(I - \tilde B)^{-T}\]
\[\Sigma^{3,\star} - \Sigma^{1,\star} = (I - B^\star)^{-1}\big(\zeta^{3,\star} I + \psi^{3,\star}\Gamma^\star\Gamma^{\star T}\big)(I - B^\star)^{-T} = (I - \tilde B)^{-1}\big(\tilde\zeta^3 I + \psi^{3,\star}\tilde\Gamma\tilde\Gamma^T\big)(I - \tilde B)^{-T}\]
Due to Assumption 3, there exists $a \in \mathbb{R}^2$ such that $a^T(\psi^{2,\star}, \psi^{3,\star})^T = 0$ but $a^T(\zeta^{2,\star}, \zeta^{3,\star})^T \neq 0$. Then,
\[a_1\big(\Sigma^{2,\star} - \Sigma^{1,\star}\big) + a_2\big(\Sigma^{3,\star} - \Sigma^{1,\star}\big) = \big[a^T(\zeta^{2,\star}, \zeta^{3,\star})^T\big](I - B^\star)^{-1}(I - B^\star)^{-T} = \big[a^T(\tilde\zeta^2, \tilde\zeta^3)^T\big](I - \tilde B)^{-1}(I - \tilde B)^{-T}.\]
Lastly, by appealing to the identifiability of DAGs under equal variances [22], we have that $\tilde B = B^\star$. We further note that asymptotic convergence results similar to Corollary 1 may be shown but are left out for brevity.
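The cancellation step can be checked numerically: any $a$ orthogonal to $(\psi^{2,\star}, \psi^{3,\star})$ removes the latent term from the combined covariance differences, leaving a multiple of $(I - B^\star)^{-1}(I - B^\star)^{-T}$. A minimal sketch with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
B = np.tril(rng.normal(size=(p, p)), k=-1)    # stands in for B*
Gamma = rng.normal(size=(p, 2))
v = rng.uniform(1.0, 2.0, size=p)
A = np.linalg.inv(np.eye(p) - B)

def cov(zeta_e, psi_e):
    return A @ (np.diag(v + zeta_e) + (1 + psi_e) * Gamma @ Gamma.T) @ A.T

zeta = np.array([0.0, 1.0, 2.0])   # zeta^{1*} = 0, zeta^{2*}, zeta^{3*}
psi = np.array([0.0, 0.5, 0.2])    # psi^{1*} = 0, psi^{2*}, psi^{3*}

# a is orthogonal to (psi^{2*}, psi^{3*}) but not to (zeta^{2*}, zeta^{3*})
a = np.array([psi[2], -psi[1]])
assert abs(a @ zeta[1:]) > 1e-8    # modified Assumption 3 holds here

lhs = a[0] * (cov(zeta[1], psi[1]) - cov(zeta[0], psi[0])) \
    + a[1] * (cov(zeta[2], psi[2]) - cov(zeta[0], psi[0]))
assert np.allclose(lhs, (a @ zeta[1:]) * A @ A.T)   # latent term has cancelled
```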
A.3.2 Proof of Theorem 6

Proof. We will show in Lemma 4 that for $e = 1, 2$:
\[\Sigma^{e,\star} = (I - B^\star)^{-1}\Big(\mathrm{diag}\big(w^\star + \zeta^{e,\star}\big) + \Gamma^\star\Gamma^{\star T}\Big)(I - B^\star)^{-T} = (I - \tilde B)^{-1}\Big(\mathrm{diag}\big(\tilde w + \tilde\zeta^e\big) + \tilde\Gamma\tilde\Gamma^T\Big)(I - \tilde B)^{-T}. \tag{39}\]
Taking the difference $\Sigma^{2,\star} - \Sigma^{1,\star}$, the relation (39) yields:
\[\Sigma^{2,\star} - \Sigma^{1,\star} = \zeta^{2,\star}(I - B^\star)^{-1}(I - B^\star)^{-T} = \tilde\zeta^2(I - \tilde B)^{-1}(I - \tilde B)^{-T}. \tag{40}\]
We can then appeal to the identifiability of DAGs with equal noise variances [22] to conclude that $\tilde B = B^\star$.
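Relation (40) reduces identification to a linear SCM with equal noise variances. The sketch below recovers $B^\star$ from $M = \zeta^{2,\star}(I - B^\star)^{-1}(I - B^\star)^{-T}$ by repeatedly peeling off the remaining node with the smallest conditional variance; this greedy strategy is one known route to equal-variance identifiability and is our illustrative choice, not necessarily the construction in [22]:

```python
import numpy as np

def recover_equal_variance_dag(M, tol=1e-8):
    """Recover B from M = c * (I - B)^{-1} (I - B)^{-T}, c > 0: a node whose
    parents have all been selected attains the minimal conditional variance,
    so we peel nodes off in topological order and regress on predecessors."""
    p = M.shape[0]
    B, order, rest = np.zeros((p, p)), [], list(range(p))
    for _ in range(p):
        cond = {}
        for j in rest:
            if order:
                S = np.ix_(order, order)
                cond[j] = M[j, j] - M[j, order] @ np.linalg.solve(M[S], M[order, j])
            else:
                cond[j] = M[j, j]
        k = min(cond, key=cond.get)
        if order:
            B[k, order] = np.linalg.solve(M[np.ix_(order, order)], M[order, k])
        order.append(k)
        rest.remove(k)
    B[np.abs(B) < tol] = 0.0     # zero out numerically negligible coefficients
    return B

# check on the difference matrix of (40), with hypothetical parameters
B_star = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, -0.7, 0.0]])
A = np.linalg.inv(np.eye(3) - B_star)
M = 2.0 * A @ A.T                # zeta^{2*} = 2
assert np.allclose(recover_equal_variance_dag(M), B_star)
```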
A.4 Proof of Theorem 2 and Theorem 3

We consider the proof of the case with unperturbed hidden confounders and leave out the proof for the setting without hidden confounders. For notational convenience, we state the extended population DirectLikelihood estimator (5) in the setting with unperturbed hidden variables as:
\[\arg\min_{\substack{B \in \mathbb{R}^{p \times p},\ \Gamma \in \mathbb{R}^{p \times \bar h},\\ \{w^e\}_{e=1}^m \subseteq \mathbb{R}^p_{++}}}\ \sum_{e=1}^m \pi^{e,\star}\,\ell^e\big(B, \Gamma, w^e; \Sigma^{e,\star}, do(e)\big) \quad \text{subject to } B \text{ DAG}, \tag{41}\]
where
\[\ell^e\big(B, \Gamma, w^e; \Sigma^{e,\star}, do(e)\big) = \log\det\big(\mathrm{diag}(w^e) + [\Gamma\Gamma^T]_{do(e)^c}\big) + \mathrm{trace}\Big(\big(\mathrm{diag}(w^e) + [\Gamma\Gamma^T]_{do(e)^c}\big)^{-1}\big(I - F_{do(e)^c}B\big)\,\Sigma^{e,\star}\,\big(I - F_{do(e)^c}B\big)^T\Big).\]
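For concreteness, a sketch of the loss $\ell^e$ under our reading of the notation, where $F_{do(e)^c}$ is the 0/1 diagonal matrix selecting coordinates not under hard intervention in environment $e$ and $[\Gamma\Gamma^T]_{do(e)^c}$ zeroes the rows and columns of the latent term at intervened coordinates (both readings, and the helper name, are assumptions on our part):

```python
import numpy as np

def loss_e(B, Gamma, w_e, Sigma_e, do_e):
    """Population DirectLikelihood loss ell^e in (41); do_e is the set of
    hard-intervened coordinates in environment e."""
    p = B.shape[0]
    F = np.diag([0.0 if i in do_e else 1.0 for i in range(p)])  # F_{do(e)^c}
    W = np.diag(w_e) + F @ Gamma @ Gamma.T @ F  # diag(w^e) + [Gamma Gamma^T]_{do(e)^c}
    R = np.eye(p) - F @ B                       # I - F_{do(e)^c} B
    _, logdet = np.linalg.slogdet(W)
    return logdet + np.trace(np.linalg.solve(W, R @ Sigma_e @ R.T))

# at the population parameters the trace term equals p, matching Lemma 4 below
rng = np.random.default_rng(3)
p, do_e = 4, {1}
B = np.tril(rng.normal(size=(p, p)), k=-1)
Gamma = rng.normal(size=(p, 2))
w = rng.uniform(1.0, 2.0, size=p)
F = np.diag([0.0 if i in do_e else 1.0 for i in range(p)])
R_inv = np.linalg.inv(np.eye(p) - F @ B)
Sigma_e = R_inv @ (np.diag(w) + F @ Gamma @ Gamma.T @ F) @ R_inv.T
print(loss_e(B, Gamma, w, Sigma_e, do_e))
```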
As with Lemma 1, we characterize the optimal solutions of (41) in the following lemma.

Lemma 4.
Optimal solutions of (41) satisfy the following equivalence: $(B, \Gamma, \{w^e\}_{e=1}^m)$ is an optimum of (41) $\iff$ $B$ is a DAG, $\{w^e\}_{e=1}^m \subseteq \mathbb{R}^p_{++}$, $\Gamma \in \mathbb{R}^{p \times \bar h}$ with $\bar h \geq h$, and
\[\Sigma^{e,\star} = \big(I - F_{do(e)^c}B\big)^{-1}\big(\mathrm{diag}(w^e) + [\Gamma\Gamma^T]_{do(e)^c}\big)\big(I - F_{do(e)^c}B\big)^{-T} \quad \text{for } e = 1, 2, \ldots, m.\]

The proof of Lemma 4 is similar to that of Lemma 1 and is left out for brevity. Based on the result of Lemma 4, any optimum $(B, \Gamma, \{w^e\}_{e=1}^m)$ of population DirectLikelihood must satisfy, for each $e = 1, 2, \ldots, m$,
\[\Sigma^{e,\star} = \big(I - F_{do(e)^c}B\big)^{-1}\big([\Gamma\Gamma^T]_{do(e)^c} + \mathrm{diag}(w^e)\big)\big(I - F_{do(e)^c}B\big)^{-T}. \tag{42}\]
Aside from $(B^\star, \{w^{e,\star}\}_{e=1}^m)$, suppose there is another solution $(\tilde B, \{\tilde w^e\}_{e=1}^m)$ satisfying (42). Thus, we have for $e = 2, 3$:
\[\Sigma^{e,\star} - \Sigma^{1,\star} = (I - B^\star)^{-1}\mathrm{diag}\big(w^{e,\star} - w^{1,\star}\big)(I - B^\star)^{-T} = (I - \tilde B)^{-1}\mathrm{diag}\big(\tilde w^e - \tilde w^1\big)(I - \tilde B)^{-T}. \tag{43}\]
Equation (43) yields, for $e = 2, 3$, the relation
\[(I - \tilde B)(I - B^\star)^{-1}\mathrm{diag}\big(w^{e,\star} - w^{1,\star}\big)(I - B^\star)^{-T}(I - \tilde B)^T = \mathrm{diag}\big(\tilde w^e - \tilde w^1\big).\]
Let $\phi^e_k := \big[(I - \tilde B)(I - B^\star)^{-1}\mathrm{diag}(w^{e,\star} - w^{1,\star})\big]_{k,:}$ and $\xi_k := \big[(I - \tilde B)(I - B^\star)^{-1}\big]_{k,:}$ for $k = 1, 2, \ldots, p$. Then, for any $k = 1, 2, \ldots, p$,
\[\phi^e_k \perp \xi_l \quad \text{for any } l \neq k. \tag{44}\]
Notice that for any $k = 1, 2, \ldots, p$, the vectors $\{\xi_l\}_{l \neq k}$ are linearly independent. The condition above means that $\phi^2_k$ and $\phi^3_k$ (neither of which is exactly the zero vector, because $w^{2,\star} - w^{1,\star} \neq 0$ and $w^{3,\star} - w^{1,\star} \neq 0$) live inside the one-dimensional null space of the matrix formed by concatenating the vectors $\{\xi_l\}_{l \neq k}$. As with the proof of part 2 of Theorem 1, the assumption that
\[\frac{w^{2,\star}_k - w^{1,\star}_k}{w^{2,\star}_l - w^{1,\star}_l} \neq \frac{w^{3,\star}_k - w^{1,\star}_k}{w^{3,\star}_l - w^{1,\star}_l} \quad \text{for } k, l \in S,\ k \neq l\]
implies that the matrix $\Xi$ formed by concatenating the row vectors $\{\xi_i\}_{i=1}^p$ satisfies relation (33).

Proof of part (a): The relation (33) and the invertibility of $\Xi$ imply that $\Xi$ is a diagonal matrix up to row permutations, so that
\[(I - \tilde B) = K_\pi D (I - B^\star),\]
where $D$ is diagonal with all nonzero entries on the diagonal and $K_\pi$ is the permutation matrix of a permutation $\pi$. We know that $(I - \tilde B)$ has ones on the diagonal. Hence, it is straightforward to check that $K_\pi = D = I$ and thus $\tilde B = B^\star$.

Proof of part (b): Suppose $B^\star$, $\tilde B$, and $\Xi$ are ordered according to the ancestors of $X_p$, then $X_p$, and then the remaining variables. Since the underlying graph is a DAG, there is an ancestor of $X_p$ that does not have any parents. We first consider this variable. Suppose $[\Xi_{1,:}]_S = 0$. Then, since $\Xi_{1,:}$ is zero on this variable and its children, $\Xi_{1,:}(I - B^\star)_{:,1}$ will be zero. This is a contradiction, since $(I - \tilde B)$ has diagonal elements equal to one. By condition (33) and the fact that $\Xi(I - B^\star) = I - \tilde B$ has unit diagonal, $\Xi_{1,:}$ must have one nonzero entry, at either this ancestor variable or one of its children. Suppose for purposes of contradiction that this nonzero value occurred at one of the children. Notice that if $\Xi_{j,1}$ is nonzero for some $j \neq 1$, then condition (33) implies that $\Xi_{j,:} = c\,e_1$ for some constant $c$. However, since the variable corresponding to index $j$ is not a parent of the variable corresponding to index 1, $\Xi_{j,:}(I - B^\star)_{:,j}$ will be zero. With this logic, $\Xi_{:,1}$ will be all zeros, leading to a contradiction since $\Xi$ must be invertible. Hence, $\Xi_{1,:}$ must be of the form $\Xi_{1,:} = c\,e_1$ for some constant $c$. Since the diagonal elements of $I - \tilde B$ are exactly one, $c = 1$. Repeating the same argument, and letting $\bar S$ denote the set consisting of $X_p$ and the ancestors of $X_p$, we find that $\Xi_{\bar S,:} = \big(I_{|\bar S|}\ \ 0_{|\bar S^c|}\big)$. Hence we have that $\tilde B_{k,:} = B^\star_{k,:}$ for all $k$ corresponding to the target variable $X_p$ and all ancestors of $X_p$. Now suppose that $S$ includes $X_p$ and descendants of $X_p$. Let $\tilde B, B^\star, \Xi$ be organized so that the descendants of $X_p$ come first, then $X_p$, and then everything else. Since the underlying graph is a DAG, there are one or more descendants of $X_p$ that do not have any children. Let $\bar S$ be this collection. Since $\Xi_{\bar S,:}(I - B^\star)_{:,\bar S}$ must have unit diagonal, and because of condition (33), $\Xi_{\bar S,\bar S} = I_{|\bar S|}$. Now consider any parent of these nodes that is a descendant of $X_p$. Since $\Xi_{|\bar S|+1,:}(I - B^\star)_{:,|\bar S|+1}$ must equal one, and by (33), $\Xi_{|\bar S|+1,:}$ must have only one nonzero entry on $S$, either at entries corresponding to its descendants or at the variable itself. If this nonzero entry is at the location of one of the descendants, then $\Xi$ will have two identical rows, meaning that it would not be invertible. This reasoning can be repeated until we arrive at the index corresponding to $X_p$, showing that $\Xi_{p,:} = e_p$. Hence, $\tilde B_{p,:} = B^\star_{p,:}$.

Proof of part (c): We prove that when the target variable and its parents all receive shift interventions, and the DAG $B^\star$ is faithful with respect to the underlying distribution, the sparsest optimum $\tilde B$ satisfies $\tilde B_{p,:} = B^\star_{p,:}$. Any of the sparsest optimal DAGs has the same v-structures and skeleton as the population DAG. From the discussion above, $\Xi$ satisfies the relation (33), where $S$ denotes the set of variables that have received a shift intervention. Suppose for the sake of contradiction that $\Xi_{p,:} \neq e_p$ (i.e. the estimated causal parents are not equal to the true causal parents).
Since $\Xi$ is invertible, by property (33) and the fact that $\Xi(I - B^\star)$ must have nonzero diagonal elements, it must be that for one of the parents of $X_p$, denoted by index $t$, $\Xi_{t,:} = e_p$. With respect to the graph, this means that we are considering a graph in which the edge between this parent of $X_p$ and $X_p$ is reversed. This edge reversal can of course be continued along the path of the descendants of $X_p$ as long as each such descendant has only a single parent. Suppose the edge reversal stops at one of the descendants, so that this descendant becomes a source node. Let $s$ be the index of this variable. Consider a node $s' \neq s$ that is not a parent or ancestor of $X_p$. Starting from the last descendant of this node, denoted by index $s'_l$, we have $\Xi_{s,s'_l} = 0$, since otherwise this would imply that $s'_l$ is a parent of $s$, contradicting that $s$ is a source node. Working upwards from this last descendant, we see that $\Xi_{s,s'} = 0$. Furthermore, for any parent of $X_p$, denoted by $k$, $\Xi_{s,k} = 0$, since otherwise, based on condition (33), $\Xi_{s,:} = c\,e_k$, which would mean that node $k$ is a parent of $s$, contradicting that $s$ is a source node. Following this logic upwards, we can also conclude that $\Xi_{s,k} = 0$ for $k$ an ancestor of $X_p$. Since $\Xi$ is invertible, it remains that $\Xi_{s,s} \neq 0$. This again leads to a contradiction with $s$ being a source node, since it would mean that $s$ in the estimated DAG has the same parents as in the population DAG, and this set of parents is non-empty since $s$ is a descendant of $X_p$. These contradictions imply that $\Xi_{p,:} = e_p$.
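As a sanity check on the key step of this proof: because the hidden effects are unperturbed, the latent term cancels in every difference $\Sigma^{e,\star} - \Sigma^{1,\star}$, so conjugating by $(I - B^\star)$ produces a diagonal matrix, as in (43). A minimal sketch with hypothetical parameters (shift-intervention setting, $do(e)$ empty):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4
B = np.tril(rng.normal(size=(p, p)), k=-1)          # stands in for B*
Gamma = rng.normal(size=(p, 2))                     # unperturbed hidden effects
A = np.linalg.inv(np.eye(p) - B)
w = [rng.uniform(1.0, 2.0, size=p) for _ in range(3)]   # w^{1*}, w^{2*}, w^{3*}

def cov(w_e):
    return A @ (np.diag(w_e) + Gamma @ Gamma.T) @ A.T

for e in (1, 2):                                    # environments e = 2, 3
    diff = cov(w[e]) - cov(w[0])
    # the latent term cancels, leaving the form in (43)
    assert np.allclose(diff, A @ np.diag(w[e] - w[0]) @ A.T)
    # hence conjugation by (I - B*) yields a diagonal matrix
    D = (np.eye(p) - B) @ diff @ (np.eye(p) - B).T
    assert np.allclose(D, np.diag(np.diag(D)))
```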
A.5 Proof of Theorem 4

Proof. For any connectivity matrix $B$, latent coefficients $\Gamma$, and noise variance $w$:
\[\mathrm{KL}\big(\Sigma^{e,\star},\, \hat\Sigma_{B,\Gamma,w}(\bar\zeta_e, \bar\psi_e)\big) \geq \mathrm{KL}\big(\Sigma^{e,\star},\, \hat\Sigma_{B^\star,\Gamma^\star,w^{1,\star}}(\bar\zeta_e, \bar\psi_e)\big) = 0.\]
Thus, if there exists an optimum $(\tilde B, \tilde\Gamma, \tilde w)$ of the max-risk optimization problem (14), it must satisfy, for all $P_e \in \mathcal{P}_{C_\zeta, C_\psi}$, the relation $\Sigma^{e,\star} = \hat\Sigma_{\tilde B,\tilde\Gamma,\tilde w}(\bar\zeta_e, \bar\psi_e)$. We take three environments: a first one corresponding to the observational setting $e = 1$ where none of the variables are intervened on, a second environment $e = 2$ corresponding to the setting where only the hidden variables are perturbed, and a last environment $e = 3$