Bayesian causal inference in probit graphical models
Federico Castelletti and Guido Consonni*

*Department of Statistical Sciences, Università Cattolica del Sacro Cuore, Milan. [email protected], [email protected]

Abstract
We consider a binary response which is potentially affected by a set of continuous variables. Of special interest is the causal effect on the response due to an intervention on a specific variable. The latter can be meaningfully determined on the basis of observational data through suitable assumptions on the data generating mechanism. In particular we assume that the joint distribution obeys the conditional independencies (Markov properties) inherent in a Directed Acyclic Graph (DAG), and the DAG is given a causal interpretation through the notion of interventional distribution. We propose a DAG-probit model where the response is generated by discretization, through a random threshold, of a continuous latent variable; the latter, jointly with the remaining continuous variables, has a distribution belonging to a zero-mean Gaussian model whose covariance matrix is constrained to satisfy the Markov properties of the DAG. Our model leads to a natural definition of causal effect conditionally on a given DAG. Since the DAG which generates the observations is unknown, we present an efficient MCMC algorithm whose target is the posterior distribution on the space of DAGs, the Cholesky parameters of the concentration matrix, and the threshold linking the response to the latent. Our end result is a Bayesian Model Averaging estimate of the causal effect which incorporates parameter, as well as model, uncertainty. The methodology is assessed using simulation experiments and applied to a gene expression data set originating from breast cancer stem cells.
1 Introduction

We consider a system of random quantities, which includes a binary response as well as a collection of continuous variables, and address the problem of determining the causal effect on the response due to an intervention on a given variable. A causal question involves the data generating mechanism after an intervention is applied to the system, and must be carefully distinguished from traditional conditioning of probability theory (Pearl, 2009, Section 2.4). The gold standard for addressing causal questions is represented by randomized controlled experiments; the latter however are not always available because they may be unethical, infeasible, time consuming or expensive (Maathuis & Nandy, 2016). By contrast, observational data, that is observations produced without exogenous perturbations of the system, are widely available and often plentiful. The challenge is then to infer causal effects based on observational data alone. To achieve this goal, it is crucial to set up a suitable conceptual framework which is able to address causal questions; in particular the notion of joint distribution for a collection of random variables can only address concepts linked to association, so much so that, conversely, "a causal concept is any relationship that cannot be defined from the distribution alone" (Pearl, 2009, Section 2).

A very useful framework to bridge the gap between the observational and the experimental domains is represented by the Directed Acyclic Graph (DAG), or its allied concept of Structural Equation Model (SEM); see Pearl (1995) and Pearl (2000). DAGs have been extensively used to construct statistical models embodying conditional independence relations (Lauritzen, 1996). Applications are numerous, especially in genomics; see for instance Friedman (2004) and Friedman & Koller (2003). With observational data, conditional independence relations will drive inference on DAG and parameter space. On the other hand, the additional syntax and semantics of causal DAGs (Pearl, 2000) will be instrumental to define the notion of causal effect.

As in standard probit regression (Albert & Chib, 1993), we assume that the observable binary response is the result of a discretization of a continuous latent variable. Next, for a given DAG, we model all continuous random variables, along with the latent, as a multivariate Gaussian family satisfying the corresponding Markov property. We call the resulting setup a
DAG-probit model, and provide a definition of causal effect on the response which is predicated on a given DAG through the notion of interventional distribution (Pearl, 2000). However the structure of the DAG is usually unknown, and this must be taken into account because different DAGs will typically induce distinct causal effects; see the review paper Maathuis & Nandy (2016) and Castelletti & Consonni (2020a) for a Bayesian approach.

In this work we extend the notion of interventional distribution and causal effect (Pearl, 2000; Maathuis et al., 2009) to DAG-probit models. Specifically, we propose a Bayesian method which jointly performs DAG-model determination as well as inference of causal effects in the presence of a binary response. From a computational viewpoint we introduce an MCMC scheme to sample from the joint posterior of models (DAGs) and model-dependent parameters (causal effects), which we implement through an efficient PAS algorithm (Godsill, 2012).

The rest of the paper is organized as follows. In Section 2 we review Gaussian DAG-models and define the DAG-probit model. In Section 3 we present the structure of the interventional distribution in its general form, then specialize it to the Gaussian case, and finally extend the definition of causal effect to DAG-probit models. Section 4 presents our Bayesian methodology with particular emphasis on priors for model parameters. An MCMC algorithm for posterior inference on models, parameters and hence causal effects is presented in Section 5. We evaluate the proposed methodology through simulation studies in Section 6, and then apply it to a data set on gene expression measurements derived from breast cancer stem cells (Section 7). Finally a few points for discussion are presented in Section 8. Some theoretical results as well as additional simulation outputs are reported in the Supplementary material (Castelletti & Consonni, 2020b).
2 Gaussian DAG-models and DAG-probit models

In this section we first provide some background material on Gaussian DAG-models with special emphasis on their parameterization (Section 2.1). Next we present our DAG-probit model (Section 2.2). Both sections deal with the likelihood, while choices of prior distributions are discussed in Section 4.
2.1 Gaussian DAG-models

Let $\mathcal{D} = (V, E)$ be a DAG, where $V = \{1, \ldots, q\}$ is a set of vertices (or nodes) and $E \subseteq V \times V$ a set of edges whose elements are $(u, v) \equiv u \to v$, such that if $(u, v) \in E$ then $(v, u) \notin E$. In addition, $\mathcal{D}$ contains no cycles, that is paths of the form $u_1 \to u_2 \to \cdots \to u_k$ where $u_1 \equiv u_k$. For a given node $v$, if $u \to v \in E$ we say that $u$ is a parent of $v$ (conversely, $v$ is a child of $u$). The parent set of $v$ in $\mathcal{D}$ is denoted by $\mathrm{pa}(v)$, the set of children by $\mathrm{ch}(v)$. Moreover, we denote by $\mathrm{fa}(v) = v \cup \mathrm{pa}(v)$ the family of $v$ in $\mathcal{D}$. For further theory and notation on DAGs we refer to Lauritzen (1996).

We consider a collection of random variables $(X_1, \ldots, X_q)$ and assume that their joint probability density function $f(x)$ is Markov w.r.t. $\mathcal{D}$, so that it admits the following factorization:

$$f(x_1, \ldots, x_q) = \prod_{j=1}^{q} f(x_j \mid x_{\mathrm{pa}(j)}). \qquad (1)$$

In this section, as well as in Section 3, we reason conditionally on a given DAG $\mathcal{D}$ without an explicit conditioning in the notation we use. In Section 4 we will instead deal with model (DAG) uncertainty and will reinstate $\mathcal{D}$ in our notation.

If the joint distribution is Gaussian with mean equal to zero, we write

$$X_1, \ldots, X_q \mid \Omega \sim \mathcal{N}_q(0, \Omega^{-1}), \quad \Omega \in \mathcal{P}_{\mathcal{D}}, \qquad (2)$$

where $\Omega = \Sigma^{-1}$ is the precision matrix, and $\mathcal{P}_{\mathcal{D}}$ is the space of symmetric positive definite (s.p.d.) precision matrices Markov w.r.t. $\mathcal{D}$. For a Gaussian DAG-model, factorization (1) becomes

$$f(x_1, \ldots, x_q \mid \Omega) = \prod_{j=1}^{q} d\mathcal{N}(x_j \mid \mu_j(x_{\mathrm{pa}(j)}), \sigma_j^2), \qquad (3)$$

where $d\mathcal{N}(\cdot \mid \mu, \sigma^2)$ denotes the normal density having mean $\mu$ and variance $\sigma^2$. The assumption of normality guarantees that the model is faithful to the DAG, that is, all and only those conditional independencies embodied in the joint distribution can be read off from the graph using the Markov property; see also Spirtes et al. (2000).

Without loss of generality, we assume a parent ordering of the nodes which numerically labels the variables so that $i > j$ whenever $j$ is a child of $i$. A parent ordering always exists, although it is not unique in general. We also remark that a parent ordering is specific to any given DAG under consideration and may change if alternative DAGs are entertained. Importantly, it only represents a convenient device to specify a prior on the parameter space; see Section 4, where we also show that the choice of the parent ordering does not affect the prior assigned to model parameters. Moreover, we declare node 1, which cannot have children, to be the (latent) outcome variable.

Equation (3) can also be written as a structural equation model

$$L^\top (X_1, \ldots, X_q)^\top = \varepsilon, \qquad (4)$$

where, because of the assumed parent ordering, $L$ is a $(q, q)$ lower-triangular matrix of coefficients, $L = \{L_{ij},\ i \ge j\}$, such that $L_{ij} \ne 0$ if and only if $i \to j$, and $L_{ii} = 1$. Moreover, $\varepsilon$ is a $(q, 1)$ vector of error terms, $\varepsilon \sim \mathcal{N}_q(0, D)$, where $D = \mathrm{diag}(\sigma^2)$ and $\sigma^2$ is the $(q, 1)$ vector of conditional variances whose $j$-th element is $\sigma_j^2 = \mathrm{Var}(X_j \mid x_{\mathrm{pa}(j)}, \Omega)$. From (4) it follows that

$$\Omega = L D^{-1} L^\top. \qquad (5)$$

We refer to Equation (5) as the modified Cholesky decomposition of $\Omega$. Let now $\prec j \succ \, = \mathrm{pa}(j)$ and $\prec j\,] = \mathrm{pa}(j) \times j$. Representation (5) induces a re-parametrization of $\Omega$ in terms of the Cholesky parameters $\{(\sigma_j^2, L_{\prec j]}),\ j = 1, \ldots, q\}$, where $L_{\prec j]} = -\Sigma_{\prec j \succ}^{-1} \Sigma_{\prec j]}$ and $\sigma_j^2 = \Sigma_{jj \mid \mathrm{pa}(j)}$; see also Cao et al. (2019). Accordingly, Equation (3) can be written as

$$f(x_1, \ldots, x_q \mid D, L) = \prod_{j=1}^{q} d\mathcal{N}(x_j \mid -L_{\prec j]}^\top x_{\mathrm{pa}(j)}, \sigma_j^2), \qquad (6)$$

with the understanding that the conditional expectation of $X_j$ in (6) is zero whenever $\mathrm{pa}(j)$ is the empty set.
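To fix ideas, the following minimal Python sketch (our illustration, not part of the original text; all numerical values are hypothetical) builds $\Omega = L D^{-1} L^\top$ for a toy DAG on $q = 3$ nodes and verifies that the conditional variance of the outcome node implied by $\Sigma = \Omega^{-1}$ matches $\sigma_1^2$, as the reparameterization requires.

```python
import numpy as np

# Toy DAG on q = 3 nodes with parent ordering: edges 3 -> 2, 3 -> 1, 2 -> 1,
# so pa(1) = {2, 3} and pa(2) = {3}; node 1 has no children.
q = 3
L = np.eye(q)                      # lower triangular, unit diagonal
L[1, 0] = 0.8                      # L_{21}: coefficient for edge 2 -> 1
L[2, 0] = -0.5                     # L_{31}: coefficient for edge 3 -> 1
L[2, 1] = 1.2                      # L_{32}: coefficient for edge 3 -> 2
sigma2 = np.array([1.0, 0.7, 1.3]) # conditional variances sigma^2_j (sigma^2_1 = 1)
D = np.diag(sigma2)

# Modified Cholesky decomposition, Equation (5): Omega = L D^{-1} L^T
Omega = L @ np.linalg.inv(D) @ L.T
Sigma = np.linalg.inv(Omega)

# Sanity check: Var(X_1 | X_pa(1)) computed from Sigma equals sigma^2_1
pa1 = [1, 2]  # 0-based indices of pa(1) = {2, 3}
cond_var = Sigma[0, 0] - Sigma[0, pa1] @ np.linalg.inv(Sigma[np.ix_(pa1, pa1)]) @ Sigma[pa1, 0]
print(np.isclose(cond_var, sigma2[0]))  # True
```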
2.2 DAG-probit models

We introduce in this section the general form of a DAG-probit model. We assume that the joint distribution of $(X_1, X_2, \ldots, X_q)$ is Gaussian and Markov w.r.t. $\mathcal{D}$, so that its density is as in (6). Recall that $X_1$ is latent and we are only allowed to observe the binary variable $Y \in \{0, 1\}$. Specifically, for a given threshold $\theta \in (-\infty, +\infty)$, we assume that

$$Y = \begin{cases} 1 & \text{if } X_1 \in [\theta, +\infty), \\ 0 & \text{if } X_1 \in (-\infty, \theta). \end{cases} \qquad (7)$$

Combining (6) with (7), the joint density of $(Y, X_1, \ldots, X_q)$ becomes

$$f(y, x_1, \ldots, x_q \mid \mathcal{D}, L, \theta) = f(x_1, \ldots, x_q \mid \mathcal{D}, L) \cdot \mathbb{1}(\theta_{y-1} < x_1 \le \theta_y) = \prod_{j=1}^{q} d\mathcal{N}(x_j \mid -L_{\prec j]}^\top x_{\mathrm{pa}(j)}, \sigma_j^2) \cdot \mathbb{1}(\theta_{y-1} < x_1 \le \theta_y), \qquad (8)$$

where $\theta_{-1} = -\infty$, $\theta_0 \equiv \theta$, $\theta_1 = +\infty$. Equation (8) defines a (Gaussian) DAG-probit model. A related expression appears in Guo et al. (2015), who model a multivariate distribution of ordered categorical variables through a collection of Gaussian random variables Markov with respect to an undirected graphical model. Now recall from (6) that the conditional distribution of the latent variable $X_1$ is $\mathcal{N}(-L_{\prec 1]}^\top x_{\mathrm{pa}(1)}, \sigma_1^2)$ and, as in standard probit regression, we set $\sigma_1^2 = 1$ for identifiability reasons.

Finally, by considering $n$ independent samples $(y_i, x_{i,2}, \ldots, x_{i,q})$, $i = 1, \ldots, n$, from (8), the augmented likelihood can be written as

$$f(y, X \mid \mathcal{D}, L, \theta) = \prod_{i=1}^{n} f(x_{i,1}, \ldots, x_{i,q} \mid \mathcal{D}, L) \cdot \mathbb{1}(\theta_{y_i - 1} < x_{i,1} \le \theta_{y_i}) = \prod_{j=1}^{q} d\mathcal{N}_n(X_j \mid -X_{\mathrm{pa}(j)} L_{\prec j]}, \sigma_j^2 I_n) \cdot \left\{\prod_{i=1}^{n} \mathbb{1}(\theta_{y_i - 1} < x_{i,1} \le \theta_{y_i})\right\}, \qquad (9)$$

where $y = (y_1, \ldots, y_n)^\top$, $X$ is the $(n, q)$ augmented data matrix, and $X_A$ is the sub-matrix of $X$ corresponding to the set $A$ of columns of $X$.
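Forward simulation from the DAG-probit model follows directly from (4) and (7): solve the triangular system for the latent vector, then threshold its first component. A minimal sketch (ours, with the hypothetical toy parameters of the previous snippet):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dag_probit(L, sigma2, theta, n, rng):
    """Draw n observations (y, x_2, ..., x_q) from the DAG-probit model:
    solve the SEM L^T x = eps (Eq. 4), then threshold the latent x_1 (Eq. 7)."""
    q = L.shape[0]
    eps = rng.normal(scale=np.sqrt(sigma2), size=(n, q))  # eps ~ N_q(0, D), rowwise
    X = np.linalg.solve(L.T, eps.T).T                     # x = (L^T)^{-1} eps
    y = (X[:, 0] >= theta).astype(int)                    # Y = 1 iff X_1 in [theta, +inf)
    return y, X[:, 1:]                                    # X_1 is latent: only y is observed

L = np.array([[1.0, 0.0, 0.0], [0.8, 1.0, 0.0], [-0.5, 1.2, 1.0]])
sigma2 = np.array([1.0, 0.7, 1.3])                        # sigma^2_1 = 1 for identifiability
y, X_obs = sample_dag_probit(L, sigma2, theta=0.0, n=500, rng=rng)
print(y.mean())                                           # empirical Pr(Y = 1)
```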
3 Interventions and causal effects

Consider the joint density of the random vector $(X_1, \ldots, X_q)$, Markov w.r.t. a DAG, which factorizes as in (1); the latter is referred to as the observational (or pre-intervention) distribution. We now introduce the notion of intervention. A deterministic intervention on variable $X_s$, $s \in \{2, \ldots, q\}$, is denoted by $\mathrm{do}(X_s = \tilde{x})$ and consists in setting $X_s$ to the value $\tilde{x}$. The post-intervention density of $(X_1, \ldots, X_q)$ is then obtained using the truncated factorization

$$f(x_1, \ldots, x_q \mid \mathrm{do}(X_s = \tilde{x})) = \begin{cases} \prod_{j=1, j \ne s}^{q} f(x_j \mid x_{\mathrm{pa}(j)}) \big|_{x_s = \tilde{x}} & \text{if } x_s = \tilde{x}, \\ 0 & \text{otherwise}, \end{cases} \qquad (10)$$

where, importantly, each term $f(x_j \mid \cdot)$ in (10) is the corresponding (pre-intervention) conditional distribution of Equation (1); see Pearl (2000). We emphasize that the post-intervention density $f(x_1, \ldots, x_q \mid \mathrm{do}(X_s = \tilde{x}))$ is conceptually distinct from the usual conditional density $f(x_1, \ldots, x_q \mid X_s = \tilde{x})$, which arises out of passive observation of $X_s = \tilde{x}$. An important feature of Equation (10) is that the data generating system is "stable" under exogenous interventions, in the sense that only the local component distribution $f(x_s \mid x_{\mathrm{pa}(s)})$ is affected by the intervention and effectively reduces to a point mass on $\tilde{x}$. All the remaining terms are immune to the intervention and thus remain the same. The post-intervention distribution of the (latent) response $X_1$ is then obtained by integrating (10) w.r.t. $x_2, \ldots, x_q$, which simplifies to

$$f(x_1 \mid \mathrm{do}(X_s = \tilde{x})) = \int f(x_1 \mid \tilde{x}, x_{\mathrm{pa}(s)}) f(x_{\mathrm{pa}(s)}) \, dx_{\mathrm{pa}(s)}; \qquad (11)$$

see also Pearl (2000, Theorem 3.2.2).

We now move back to the Gaussian setting of Section 2.1, and assume that $(X_1, X_2, \ldots, X_q) \mid \Sigma \sim \mathcal{N}_q(0, \Sigma)$, where the covariance matrix $\Sigma$ is Markov w.r.t. the underlying DAG. The post-intervention distribution of $X_1$ can thus be written as

$$f(x_1 \mid \mathrm{do}(X_s = \tilde{x}), \Sigma) = \int f(x_1 \mid \tilde{x}, x_{\mathrm{pa}(s)}, \Sigma) \cdot f(x_{\mathrm{pa}(s)} \mid \Sigma) \, dx_{\mathrm{pa}(s)} = \int d\mathcal{N}(x_1 \mid \gamma_s \tilde{x} + \gamma^\top x_{\mathrm{pa}(s)}, \delta^2) \cdot d\mathcal{N}(x_{\mathrm{pa}(s)} \mid 0, \Sigma_{\mathrm{pa}(s), \mathrm{pa}(s)}) \, dx_{\mathrm{pa}(s)}, \qquad (12)$$

where $\delta^2 = \mathrm{Var}(X_1 \mid X_s = \tilde{x}, x_{\mathrm{pa}(s)}, \Sigma)$. The following proposition gives the analytic form of the post-intervention distribution of $X_1$.
Proposition 3.1. Let $(X_1, X_2, \ldots, X_q) \mid \Sigma \sim \mathcal{N}_q(0, \Sigma)$ and consider the do operator $\mathrm{do}(X_s = \tilde{x})$, $s \in \{2, \ldots, q\}$. Then the post-intervention distribution of $X_1$ is

$$f(x_1 \mid \mathrm{do}(X_s = \tilde{x}), \Sigma) = d\mathcal{N}\!\left(x_1 \,\Big|\, \gamma_s \tilde{x},\ \frac{\delta^2}{1 - (\gamma^\top T^{-1} \gamma)/\delta^2}\right),$$

where $\delta^2 = \Sigma_{11 \mid \mathrm{fa}(s)}$, $(\gamma_s, \gamma^\top) = \Sigma_{1, \mathrm{fa}(s)} \left(\Sigma_{\mathrm{fa}(s), \mathrm{fa}(s)}\right)^{-1}$, $T = \left(\Sigma_{\mathrm{pa}(s), \mathrm{pa}(s)}\right)^{-1} + \frac{1}{\delta^2} \gamma \gamma^\top$, with the understanding that node $s$ occupies the first position in the set $\mathrm{fa}(s)$.

Proof. See Supplementary material (Castelletti & Consonni, 2020b).

The previous reasoning considered the intervention distribution of the latent response variable $X_1$ following $\mathrm{do}(X_s = \tilde{x})$. Typically, distribution (11) is summarized by its expected value $E(X_1 \mid \mathrm{do}(X_s = \tilde{x}))$. When $X_s$ is continuous, one can define the (total) causal effect as the derivative of $E(X_1 \mid \mathrm{do}(X_s = x))$ w.r.t. $x$ evaluated at $\tilde{x}$: this is especially convenient when the expectation is linear, as in the Gaussian case (12), because the causal effect admits a simple interpretation: it corresponds to the "regression parameter" $\gamma_s$ associated to variable $X_s$ (Maathuis et al., 2009). Our interest however lies in the observable response variable $Y$, and therefore we aim to evaluate $E(Y \mid \mathrm{do}(X_s = \tilde{x}), \Sigma, \theta)$. We thus obtain

$$E(Y \mid \mathrm{do}(X_s = \tilde{x}), \Sigma, \theta) = \Pr(Y = 1 \mid \mathrm{do}(X_s = \tilde{x}), \Sigma, \theta) = \Pr(X_1 \ge \theta \mid \mathrm{do}(X_s = \tilde{x}), \Sigma) = 1 - \Phi\!\left(\frac{\theta - \gamma_s \tilde{x}}{\tau}\right) \equiv \beta_s(\tilde{x}, \Sigma, \theta), \qquad (13)$$

where $\Phi(\cdot)$ denotes the c.d.f. of a standard normal and $\tau^2 = \delta^2 / \left(1 - (\gamma^\top T^{-1} \gamma)/\delta^2\right)$. One could then compute the partial derivative of $E(Y \mid \mathrm{do}(X_s = \tilde{x}), \Sigma, \theta)$ w.r.t. $x$ evaluated at $\tilde{x}$, and obtain $\phi\left((\theta - \gamma_s \tilde{x})/\tau\right)\, \gamma_s/\tau$, where $\phi(\cdot)$ is the density function of a standard normal. This however would still depend on $\tilde{x}$. For this reason, and because (13) enjoys an intuitive interpretation, being a probability, we will simply refer to $\Pr(Y = 1 \mid \mathrm{do}(X_s = \tilde{x}), \Sigma, \theta)$ as the causal effect on $Y$ due to an intervention $\mathrm{do}(X_s = \tilde{x})$. Finally, we remark that the causal effect of $\mathrm{do}(X_s = \tilde{x})$ on $Y$, besides being a function of the value $\tilde{x}$, depends on $\theta$ as well as on the covariance matrix $\Sigma$, where the latter is constrained to be Markov w.r.t. the underlying DAG.
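Proposition 3.1 and Equation (13) translate into a short numerical routine. The sketch below is our illustration (the function name `causal_effect` and the toy inputs are hypothetical); it computes $\beta_s(\tilde{x}, \Sigma, \theta)$ from a covariance matrix $\Sigma$, with the latent response indexed first.

```python
import numpy as np
from scipy.stats import norm

def causal_effect(Sigma, s, x_tilde, theta, pa_s):
    """Causal effect beta_s(x_tilde, Sigma, theta) of do(X_s = x_tilde) on Y, Eq. (13).
    Indices are 0-based: node 0 is the latent X_1; pa_s lists the parents of s."""
    fa = [s] + list(pa_s)                               # family of s, with s first
    Sff_inv = np.linalg.inv(Sigma[np.ix_(fa, fa)])
    coef = Sigma[0, fa] @ Sff_inv                       # (gamma_s, gamma^T)
    gamma_s, gamma = coef[0], coef[1:]
    delta2 = Sigma[0, 0] - Sigma[0, fa] @ Sff_inv @ Sigma[fa, 0]   # Sigma_{11|fa(s)}
    if len(pa_s) > 0:
        T = np.linalg.inv(Sigma[np.ix_(pa_s, pa_s)]) + np.outer(gamma, gamma) / delta2
        tau2 = delta2 / (1.0 - gamma @ np.linalg.inv(T) @ gamma / delta2)
    else:                                               # no parents: no marginalization step
        tau2 = delta2
    return 1.0 - norm.cdf((theta - gamma_s * x_tilde) / np.sqrt(tau2))

# Example with the toy Sigma of Section 2: do(X_2 = 1), where pa(2) = {3}
L = np.array([[1.0, 0.0, 0.0], [0.8, 1.0, 0.0], [-0.5, 1.2, 1.0]])
Sigma = np.linalg.inv(L @ np.diag(1 / np.array([1.0, 0.7, 1.3])) @ L.T)
print(causal_effect(Sigma, s=1, x_tilde=1.0, theta=0.0, pa_s=[2]))
```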
4 Bayesian inference

In this section we introduce priors for $(\Omega, \theta, \mathcal{D})$, which we structure as $p(\Omega, \theta, \mathcal{D}) = p(\Omega \mid \mathcal{D})\, p(\mathcal{D})\, p(\theta)$. Further distributional results useful for our MCMC scheme of Section 5 are also presented. We briefly preview here the essential features.

To start with, consider $p(\Omega \mid \mathcal{D})$, $\Omega \in \mathcal{P}_{\mathcal{D}}$. We first proceed to the re-parameterization $\Omega \mapsto (D, L)$ presented in Section 2.1, and specify a DAG-Wishart prior (Cao et al., 2019) on the Cholesky parameters $(D, L)$. We achieve this goal using a highly parsimonious elicitation procedure, which we detail in Section 4.1. For the unknown threshold $\theta \in (-\infty, +\infty)$, we assume a uniform prior, so that $p(\theta) \propto 1$ (Section 4.3). Finally, the prior on $\mathcal{D}$ is assigned through independent Bernoulli distributions on the elements of the skeleton of $\mathcal{D}$ (Section 4.2).

4.1 Prior on the Cholesky parameters

Consider first a DAG $\mathcal{D} = (V, E)$ which is complete, so that the precision matrix $\Omega$ is unconstrained. A standard conjugate prior is the Wishart distribution, $\Omega \sim \mathcal{W}_q(a, U)$, having expectation $a U^{-1}$, where $a > q - 1$ and $U$ is a s.p.d. matrix. The induced prior on the Cholesky parameters consistent with the DAG parent ordering is such that the node parameters $(\sigma_j^2, L_{\prec j]})$, $j = 1, \ldots, q$, are independent with distribution

$$\sigma_j^2 \sim \text{I-Ga}\!\left(\frac{a_j}{2} - \frac{|\mathrm{pa}(j)|}{2} - 1,\ \frac{U_{jj \mid \prec j \succ}}{2}\right), \quad L_{\prec j]} \mid \sigma_j^2 \sim \mathcal{N}_{|\mathrm{pa}(j)|}\!\left(-U_{\prec j \succ}^{-1} U_{\prec j]},\ \sigma_j^2\, U_{\prec j \succ}^{-1}\right), \qquad (14)$$

where $|A|$ is the number of elements in the set $A$ and $a_j = a + q - 2j + 3$; see Ben-David et al. (2015, Supplemental B). The symbol $\text{I-Ga}(a, b)$ stands for an Inverse-Gamma distribution with shape $a > 0$, rate $b > 0$ and expectation $b/(a-1)$ ($a > 1$).
A default choice for $U$, hereinafter adopted, is $U = g I_q$, where $g > 0$ and $I_q$ is the $(q, q)$ identity matrix. It is straightforward to show that (14) reduces to

$$\sigma_j^2 \sim \text{I-Ga}\!\left(\frac{a_j}{2} - \frac{|\mathrm{pa}(j)|}{2} - 1,\ \frac{g}{2}\right), \quad L_{\prec j]} \mid \sigma_j^2 \sim \mathcal{N}_{|\mathrm{pa}(j)|}\!\left(0,\ \frac{\sigma_j^2}{g}\, I_{|\mathrm{pa}(j)|}\right). \qquad (15)$$

In addition, to guarantee the identifiability of the DAG-probit model, we fix $\sigma_1^2 = 1$, so that instead of $p(\sigma_1^2, L_{\prec 1]})$ we need only consider $p(L_{\prec 1]})$, with $L_{\prec 1]} \sim \mathcal{N}_{|\mathrm{pa}(1)|}(0, g^{-1} I_{|\mathrm{pa}(1)|})$; see also Section 2.2. Recall that (15) applies only to a complete DAG $\mathcal{D}$.

Consider now the case in which $\mathcal{D}$ is not complete. The idea is to leverage (15) to construct a prior on $(D, L)$, the Cholesky parameters of $\Omega \in \mathcal{P}_{\mathcal{D}}$, which is valid for any $\mathcal{D}$. To this end we follow the procedure of Geiger & Heckerman (2002). Specifically, let $\mathcal{D}$ be an arbitrary DAG and assume a parent ordering of its nodes. For each node $j \in \{1, \ldots, q\}$, let $\{\sigma_j^2, L_{\prec j]}\}$ be the Cholesky parameters associated to node $j$, and identify a complete DAG $\mathcal{D}^{C(j)}$ such that $\mathrm{pa}_{\mathcal{D}^{C(j)}}(j) = \mathrm{pa}_{\mathcal{D}}(j)$, where $j$ in $\mathcal{D}^{C(j)}$ corresponds to the same variable as $j$ in $\mathcal{D}$. Because of the parent ordering, the position of node $j$ in $\mathcal{D}^{C(j)}$ is $q - |\mathrm{pa}_{\mathcal{D}}(j)|$, which is usually different from $j$. Let $\{\sigma_j^{2, C(j)}, L_{\prec j]}^{C(j)}\}$ be the Cholesky parameters of node $j$ under the complete DAG $\mathcal{D}^{C(j)}$. We then assign to $\{\sigma_j^2, L_{\prec j]}\}$ the same prior of $\{\sigma_j^{2, C(j)}, L_{\prec j]}^{C(j)}\}$, which can be gathered from Equation (15) in the complete DAG-Wishart version. In particular, $\sigma_j^2 \sim \text{I-Ga}\left((a + |\mathrm{pa}_{\mathcal{D}}(j)| - q + 3)/2 - 1,\ g/2\right)$
and $L_{\prec j]}$ is distributed as a zero-mean multivariate normal with covariance matrix $g^{-1} \sigma_j^2 I_{|\mathrm{pa}_{\mathcal{D}}(j)|}$. Therefore both distributions only depend on the cardinality of $\mathrm{pa}_{\mathcal{D}}(j)$, which is the same across alternative parent orderings. Finally, by assuming independence among node-parameters $(\sigma_j^2, L_{\prec j]})$, we can write

$$p(D, L) = \prod_{j=1}^{q} p(\sigma_j^2, L_{\prec j]}), \quad (D, L) \in \Theta_{\mathcal{D}}, \qquad (16)$$

where $\Theta_{\mathcal{D}}$ is the image of the space $\mathcal{P}_{\mathcal{D}}$ under the mapping $\Omega \mapsto (D, L)$.

4.2 Prior on DAG space

For a given DAG $\mathcal{D}$, let $A^{\mathcal{D}}$ be the (symmetric) 0-1 adjacency matrix of the skeleton of $\mathcal{D}$, whose $(u, v)$ element is denoted by $A^{\mathcal{D}}_{(u,v)}$. Conditionally on the edge inclusion probability $\pi$, we first assign a Bernoulli prior independently to each element $A^{\mathcal{D}}_{(u,v)}$ belonging to the lower-triangular part, that is $A^{\mathcal{D}}_{(u,v)} \mid \pi \overset{iid}{\sim} \mathrm{Ber}(\pi)$, $u > v$. As a consequence we get

$$p(A^{\mathcal{D}}) = \pi^{|A^{\mathcal{D}}|} (1 - \pi)^{\frac{q(q-1)}{2} - |A^{\mathcal{D}}|}, \qquad (17)$$

where $|A^{\mathcal{D}}|$ denotes the number of edges in the skeleton, equivalently the number of entries equal to one in the lower-triangular part of $A^{\mathcal{D}}$. Finally, we set $p(\mathcal{D}) \propto p(A^{\mathcal{D}})$, for $\mathcal{D} \in \mathcal{S}_q$, where $\mathcal{S}_q$ is the set of all DAGs on $q$ nodes.
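The prior (17) is straightforward to evaluate; the following small sketch (our hypothetical helper, with an arbitrary choice of $\pi$) computes its logarithm from the adjacency matrix of a DAG.

```python
import numpy as np

def log_prior_dag(A, pi):
    """Log of the prior (17): independent Bernoulli(pi) edge inclusion on the
    skeleton. A is the 0-1 adjacency matrix of the DAG (A[u, v] = 1 iff u -> v)."""
    q = A.shape[0]
    n_edges = int(A.sum())              # |A_D|: each directed edge gives one skeleton edge
    n_pairs = q * (q - 1) // 2          # number of possible skeleton edges
    return n_edges * np.log(pi) + (n_pairs - n_edges) * np.log(1.0 - pi)

A = np.zeros((4, 4)); A[2, 0] = A[3, 1] = 1.0    # a toy DAG with two edges
print(log_prior_dag(A, pi=0.1))                   # pi = 0.1 is an arbitrary example value
```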
4.3 Prior on the threshold θ

As mentioned, in the absence of substantive prior information, we assign a flat improper prior to the threshold $\theta \in (-\infty, +\infty)$, $p(\theta) \propto 1$. Accordingly, we need to prove that the posterior of $\theta$ is proper. The next proposition details conditions under which this is guaranteed.

Proposition 4.1.
Under the prior (15) for $(D, L)$, $p(\mathcal{D})$ as in Section 4.2 for DAG $\mathcal{D}$, and the improper prior $p(\theta) \propto 1$ for $\theta$, the posterior of $\theta$ is proper provided the sample contains at least two observations with distinct values for $Y$, that is $y_{i_1} = 1$, $y_{i_2} = 0$ ($i_1 \ne i_2$).

Proof. See Supplementary material (Castelletti & Consonni, 2020b).

Additionally, we prove in the Supplementary material that under the conditions of Proposition 4.1 the joint posterior of $(\mathcal{D}, L, D, \theta, X_1)$ is proper. Clearly, alternative priors for $\theta$ might have been used; yet the full conditional of $\theta$ would not be amenable to direct sampling. As a consequence, posterior inference on $\theta$ is performed through a Metropolis-Hastings step inside our MCMC scheme; see Section 5 for details.

5 MCMC scheme

In this section we detail the MCMC scheme that we adopt to target the posterior distribution

$$p(\mathcal{D}, L, D, \theta, X_1 \mid y, X_{-1}) \propto f(y, X \mid \mathcal{D}, L, D, \theta)\, p(D, L \mid \mathcal{D})\, p(\mathcal{D}), \qquad (18)$$

now emphasizing the dependence on DAG $\mathcal{D}$, where $X_{-1} = (X_2, \ldots, X_q)$, and the term $p(\theta)$ has been omitted because it is proportional to one.

5.1 Update of $(\mathcal{D}, L, D)$

From (18) the full conditional distribution of $(\mathcal{D}, L, D)$ is

$$p(\mathcal{D}, L, D \mid y, X, \theta) \propto p(X \mid \mathcal{D}, L, D)\, p(D, L \mid \mathcal{D})\, p(\mathcal{D}),$$

using (9), where $X = (X_1, X_2, \ldots, X_q)$ is the $(n, q)$ augmented data matrix.

To sample from $p(\mathcal{D}, L, D \mid X)$ we adopt a reversible jump MCMC algorithm which takes into account the partial analytic structure (PAS, Godsill 2012) of the DAG-Wishart distribution to sample DAG $\mathcal{D}$ and the Cholesky parameters $(D, L)$ from their full conditional. A similar approach was implemented in Wang & Li (2012) for Gaussian undirected graphical models using G-Wishart priors. Details about the PAS algorithm and its implementation in our DAG setting are reported in the Supplementary material (Castelletti & Consonni, 2020b).

Specifically, at each iteration of the MCMC scheme, we first propose a new DAG $\mathcal{D}'$ from a suitable proposal distribution $q(\mathcal{D}' \mid \mathcal{D})$; see again our Supplementary material. In particular, it is shown that when proposing a DAG $\mathcal{D}'$ which differs from the current graph $\mathcal{D}$ by one edge $(h, j)$, the acceptance probability for $\mathcal{D}'$ is given by

$$\alpha_{\mathcal{D}'} = \min\left\{1,\ \frac{m(X_j \mid X_{\mathrm{pa}_{\mathcal{D}'}(j)}, \mathcal{D}')}{m(X_j \mid X_{\mathrm{pa}_{\mathcal{D}}(j)}, \mathcal{D})} \cdot \frac{p(\mathcal{D}')}{p(\mathcal{D})} \cdot \frac{q(\mathcal{D} \mid \mathcal{D}')}{q(\mathcal{D}' \mid \mathcal{D})}\right\}, \qquad (19)$$

where, for $j \in \{2, \ldots, q\}$,

$$m(X_j \mid X_{\mathrm{pa}_{\mathcal{D}}(j)}, \mathcal{D}) = (2\pi)^{-\frac{n}{2}} \cdot \frac{|T_j|^{1/2}}{|\bar{T}_j|^{1/2}} \cdot \frac{\Gamma\!\left(\frac{a_j^* + n}{2}\right)}{\Gamma\!\left(\frac{a_j^*}{2}\right)} \cdot \left[\frac{g}{2}\right]^{\frac{a_j^*}{2}} \left[\frac{1}{2}\left(g + X_j^\top X_j - \hat{L}_j^\top \bar{T}_j \hat{L}_j\right)\right]^{-\frac{a_j^* + n}{2}}, \qquad (20)$$

with

$$T_j = g I_{|\mathrm{pa}_{\mathcal{D}}(j)|}, \quad \bar{T}_j = g I_{|\mathrm{pa}_{\mathcal{D}}(j)|} + X_{\mathrm{pa}_{\mathcal{D}}(j)}^\top X_{\mathrm{pa}_{\mathcal{D}}(j)}, \quad \hat{L}_j = \bar{T}_j^{-1} X_{\mathrm{pa}_{\mathcal{D}}(j)}^\top X_j, \quad a_j^* = a_j - |\mathrm{pa}_{\mathcal{D}}(j)| - 2,$$

and $X_{\mathrm{pa}_{\mathcal{D}}(j)}$ denotes the $(n, |\mathrm{pa}_{\mathcal{D}}(j)|)$ sub-matrix of $X$ whose columns belong to the set $\mathrm{pa}_{\mathcal{D}}(j)$. For $j = 1$, because we fixed $\sigma_1^2 = 1$, we have instead

$$m(X_1 \mid X_{\mathrm{pa}_{\mathcal{D}}(1)}, \mathcal{D}) = (2\pi)^{-\frac{n}{2}} \cdot \frac{|T_1|^{1/2}}{|\bar{T}_1|^{1/2}} \cdot \exp\left\{-\frac{1}{2}\left(X_1^\top X_1 - \hat{L}_1^\top \bar{T}_1 \hat{L}_1\right)\right\}, \qquad (21)$$

with $T_1$, $\bar{T}_1$, $\hat{L}_1$ defined in analogy with the expressions appearing after (20); see the Supplementary material (Castelletti & Consonni, 2020b) for details.
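Since the acceptance ratio (19) only involves the marginal likelihood of the node $j$ whose parent set changes, Equation (20) is the computational core of the DAG update. Below is a sketch of its log-scale evaluation (our transcription of (20), assuming $\mathrm{pa}_{\mathcal{D}}(j)$ is non-empty; function and variable names are hypothetical).

```python
import numpy as np
from scipy.special import gammaln

def log_marg_lik_node(Xj, Xpa, a_star, g):
    """Log of m(X_j | X_pa(j), D) in Eq. (20), for a node j >= 2.
    Xj: (n,) data column of node j; Xpa: (n, p) columns of its parents (p >= 1)."""
    n, p = Xj.shape[0], Xpa.shape[1]
    T_bar = g * np.eye(p) + Xpa.T @ Xpa            # posterior precision of L_{<j]}
    L_hat = np.linalg.solve(T_bar, Xpa.T @ Xj)
    quad = g + Xj @ Xj - L_hat @ T_bar @ L_hat     # residual quadratic form
    _, logdet_Tbar = np.linalg.slogdet(T_bar)
    logdet_T = p * np.log(g)                       # T_j = g I_p
    return (-0.5 * n * np.log(2 * np.pi)
            + 0.5 * (logdet_T - logdet_Tbar)
            + gammaln((a_star + n) / 2) - gammaln(a_star / 2)
            + (a_star / 2) * np.log(g / 2)
            - ((a_star + n) / 2) * np.log(quad / 2))
```

In (19) one would then use the difference of two such log-values, computed under the parent sets of node $j$ in $\mathcal{D}'$ and in $\mathcal{D}$ respectively.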
Moreover, given DAG $\mathcal{D}$ and $X_1$, the full conditional of $(D, L)$ reduces to the augmented posterior $p(D, L \mid X)$, which is conditional on the actual data $(X_2, \ldots, X_q)$ as well as the latent values $X_1$, and can be easily sampled from. Specifically, since

$$f(X \mid D, L) = \prod_{j=1}^{q} d\mathcal{N}_n(X_j \mid -X_{\mathrm{pa}_{\mathcal{D}}(j)} L_{\prec j]}, \sigma_j^2 I_n), \qquad (22)$$

and because of (16) and conjugacy of the Normal-Inverse-Gamma prior in (15) with the Normal density, the posterior distribution of the Cholesky parameters given $X$ is, for $j = 2, \ldots, q$,

$$\sigma_j^2 \mid X \sim \text{I-Ga}\!\left(\frac{a_j^* + n}{2},\ \frac{1}{2}\left(g + X_j^\top X_j - \hat{L}_j^\top \bar{T}_j \hat{L}_j\right)\right), \quad L_{\prec j]} \mid \sigma_j^2, X \sim \mathcal{N}_{|\mathrm{pa}_{\mathcal{D}}(j)|}\!\left(-\hat{L}_j,\ \sigma_j^2 \bar{T}_j^{-1}\right). \qquad (23)$$

Moreover, for node 1, where $\sigma_1^2 = 1$, we have

$$L_{\prec 1]} \mid X \sim \mathcal{N}_{|\mathrm{pa}_{\mathcal{D}}(1)|}\!\left(-\hat{L}_1,\ \bar{T}_1^{-1}\right). \qquad (24)$$
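Sampling from (23)-(24) amounts to an Inverse-Gamma draw followed by a multivariate normal draw. A sketch under the same assumptions and (hypothetical) notation as the previous snippet:

```python
import numpy as np

def sample_cholesky_node(Xj, Xpa, a_star, g, rng, fix_sigma=False):
    """Sample (sigma^2_j, L_{<j]}) from (23); for node 1 (fix_sigma=True) the
    conditional variance is fixed at 1 and only L_{<1]} is drawn, as in (24)."""
    n, p = Xj.shape[0], Xpa.shape[1]         # assumes p >= 1 parents
    T_bar = g * np.eye(p) + Xpa.T @ Xpa
    T_bar_inv = np.linalg.inv(T_bar)
    L_hat = T_bar_inv @ Xpa.T @ Xj
    if fix_sigma:
        sigma2 = 1.0
    else:
        shape = (a_star + n) / 2
        rate = 0.5 * (g + Xj @ Xj - L_hat @ T_bar @ L_hat)
        sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)     # Inverse-Gamma via Gamma
    L_j = rng.multivariate_normal(-L_hat, sigma2 * T_bar_inv)
    return sigma2, L_j
```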
5.2 Update of $X_1$ and $\theta$

Updating of $X_1 = (x_{1,1}, \ldots, x_{n,1})^\top$ can be performed by direct sampling from the full conditional distribution of each latent observation $x_{i,1}$,

$$f(x_{i,1} \mid y_i, x_{i,2}, \ldots, x_{i,q}, \mathcal{D}, L, D, \theta) \propto f(x_{i,1} \mid x_{i, \mathrm{pa}_{\mathcal{D}}(1)}, L_{\prec 1]}) \cdot \mathbb{1}(\theta_{y_i - 1} < x_{i,1} \le \theta_{y_i}),$$

which corresponds to a $\mathcal{N}(-L_{\prec 1]}^\top x_{i, \mathrm{pa}_{\mathcal{D}}(1)}, 1)$ truncated at the interval $(\theta_{y_i - 1}, \theta_{y_i}]$.

Finally, the cut-off $\theta_0$ is updated through a Metropolis-Hastings step where, given the current value $\theta_0$, a candidate value $g'$ is proposed from $q(g' \mid \theta_0) = d\mathcal{N}(g' \mid \theta_0, \sigma^2)$. We then set $\theta_0 = g'$ with probability $\alpha_\theta = \min\{1, r_\theta\}$, where

$$r_\theta = \frac{\prod_{i=1}^{n}\left[\Phi\left(g'_{y_i} \mid -L_{\prec 1]}^\top x_{i, \mathrm{pa}_{\mathcal{D}}(1)}, 1\right) - \Phi\left(g'_{y_i - 1} \mid -L_{\prec 1]}^\top x_{i, \mathrm{pa}_{\mathcal{D}}(1)}, 1\right)\right]}{\prod_{i=1}^{n}\left[\Phi\left(\theta_{y_i} \mid -L_{\prec 1]}^\top x_{i, \mathrm{pa}_{\mathcal{D}}(1)}, 1\right) - \Phi\left(\theta_{y_i - 1} \mid -L_{\prec 1]}^\top x_{i, \mathrm{pa}_{\mathcal{D}}(1)}, 1\right)\right]} \cdot \frac{d\mathcal{N}(\theta_0 \mid g', \sigma^2)}{d\mathcal{N}(g' \mid \theta_0, \sigma^2)}, \qquad (25)$$

and $g'_{-1} = -\infty$, $g'_0 = g'$, $g'_1 = +\infty$; here $\Phi(\cdot \mid \mu, \sigma^2)$ denotes the c.d.f. of a $\mathcal{N}(\mu, \sigma^2)$ distribution.

Algorithm 1 summarizes our MCMC scheme. The output is a collection of DAGs $\{\mathcal{D}^{(t)}\}_{t=1}^{T}$ and Cholesky parameters $\{(D^{\mathcal{D}^{(t)}}, L^{\mathcal{D}^{(t)}})\}_{t=1}^{T}$, approximately sampled from the target distribution (18). In particular we can compute posterior summaries of interest, such as the posterior probabilities of edge inclusion, namely

$$\hat{p}_{u \to v}(y, X_2, \ldots, X_q) \equiv \hat{p}_{u \to v} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}_{u \to v}\{\mathcal{D}^{(t)}\}, \qquad (26)$$

where $\mathbb{1}_{u \to v}\{\mathcal{D}^{(t)}\}$ takes value 1 if and only if $\mathcal{D}^{(t)}$ contains the edge $u \to v$. Moreover, we can reconstruct the covariance matrices $\{\Sigma^{\mathcal{D}^{(t)}}\}_{t=1}^{T}$ using (5). The latter can be subsequently retrieved to obtain, for a selected $s \in \{2, \ldots, q\}$ and intervention value $\tilde{x}$, the collection of causal effects $\{\beta_s^{(t)}(\tilde{x})\}_{t=1}^{T}$ defined in (13), where we set $\beta_s^{(t)}(\tilde{x}) \equiv \beta_s(\tilde{x}, \Sigma^{\mathcal{D}^{(t)}}, \theta_0^{(t)})$. An overall summary of the causal effect of $\mathrm{do}(X_s = \tilde{x})$ on $Y$ can be computed as

$$\hat{\beta}_s^{\mathrm{BMA}}(\tilde{x}) = \frac{1}{T} \sum_{t=1}^{T} \beta_s^{(t)}(\tilde{x}), \qquad (27)$$

which corresponds to a Bayesian Model Averaging (BMA) estimate where posterior (DAG) model probabilities are approximated through the MCMC frequencies of visits; see García-Donato & Martínez-Beneito (2013) for a discussion of the merits of frequency-based estimators in large model spaces.
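Before the full scheme is summarized in Algorithm 1 below, we sketch the two update steps of this subsection: the latent draws are univariate truncated normals, and the cut-off move is a random-walk Metropolis-Hastings step whose proposal terms in (25) cancel for the symmetric normal proposal. The sketch is ours; it assumes a SciPy version whose `truncnorm` accepts vectorized bounds and a NumPy `Generator`.

```python
import numpy as np
from scipy.stats import truncnorm, norm

def update_latents(y, mu1, theta, rng):
    """Sample each x_{i,1} from N(mu1[i], 1) truncated to (theta_{y_i-1}, theta_{y_i}],
    with theta_{-1} = -inf, theta_0 = theta, theta_1 = +inf."""
    lo = np.where(y == 1, theta, -np.inf)    # y_i = 1: (theta, +inf); y_i = 0: (-inf, theta]
    hi = np.where(y == 1, np.inf, theta)
    return truncnorm.rvs(lo - mu1, hi - mu1, loc=mu1, scale=1.0, random_state=rng)

def update_threshold(theta, y, mu1, prop_sd, rng):
    """Metropolis-Hastings step (25) for theta_0 with proposal N(theta, prop_sd^2)."""
    g_prop = rng.normal(theta, prop_sd)
    def loglik(t):                           # log of the product of Phi differences
        p1 = 1.0 - norm.cdf(t - mu1)         # Pr(x_{i,1} > t | mean mu1[i])
        return np.sum(np.where(y == 1, np.log(p1), np.log(1.0 - p1)))
    log_r = loglik(g_prop) - loglik(theta)   # symmetric proposal: q-terms cancel
    return g_prop if np.log(rng.uniform()) < log_r else theta
```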
Algorithm 1: MCMC scheme to sample from (18)

Input: a dataset $(y, X_2, \ldots, X_q)$.
Output: $T$ samples from the posterior (18).

Initialize $\mathcal{D}^{(0)}$ (e.g. the empty DAG), the cut-offs $\theta^{(0)}_{-1} = -\infty$, $\theta^{(0)}_0 = 0$, $\theta^{(0)}_1 = +\infty$, and the latent variables $X_1^{(0)}$, e.g. $x^{(0)}_{i,1} \overset{ind}{\sim} \mathcal{N}(0, 1)$ truncated at $(\theta^{(0)}_{y_i - 1}, \theta^{(0)}_{y_i}]$;
for $t = 1, \ldots, T$ do
  Sample $\mathcal{D}'$ from $q(\mathcal{D}' \mid \mathcal{D}^{(t-1)})$ and set $\mathcal{D}^{(t)} = \mathcal{D}'$ with probability $\alpha_{\mathcal{D}'}$ (19), otherwise $\mathcal{D}^{(t)} = \mathcal{D}^{(t-1)}$;
  Sample $(D^{\mathcal{D}^{(t)}}, L^{\mathcal{D}^{(t)}})$ from its augmented posterior distribution (23);
  For $i = 1, \ldots, n$, independently sample $x_{i,1}$ from $\mathcal{N}(-L^{\mathcal{D}^{(t)} \top}_{\prec 1]} x_{i, \mathrm{pa}_{\mathcal{D}^{(t)}}(1)}, 1)$ truncated at $(\theta^{(t)}_{y_i - 1}, \theta^{(t)}_{y_i}]$;
  Propose a cut-off $g'$ from $q(g' \mid \theta^{(t)}_0)$ and set $\theta^{(t)}_0 = g'$ with probability $\alpha_\theta$ (25), otherwise $\theta^{(t)}_0 = \theta^{(t-1)}_0$; set $\theta^{(t)}_{-1} = -\infty$, $\theta^{(t)}_1 = +\infty$.
end
6 Simulations

In this section we evaluate the performance of our method through simulation studies. Specifically, for each combination of number of nodes $q \in \{20, 40\}$ and sample size $n \in \{100, 200, 500\}$, which we call a simulation scenario, we generate 40 DAGs using a probability of edge inclusion equal to $p = 3/(2q - 2)$ to induce sparsity; see Peters & Bühlmann (2014). For each DAG $\mathcal{D}$ we then proceed as follows. We identify a parent ordering, fix $D^{\mathcal{D}} = I_q$, and randomly sample the non-zero entries of $L^{\mathcal{D}}$ uniformly in a fixed interval bounded away from zero. Next we generate $n$ i.i.d. $q$-dimensional observations from the structural equation model (4), which also includes the $(n, 1)$ vector of latent observations; we finally fix the threshold $\theta = 0$ and obtain the 0-1 vector of responses $y = (y_1, \ldots, y_n)^\top$ as in (7).
We apply Algorithm 1 to approximate the target distribution in (18) by setting the number of MCMC iterations $T = 25000$ for $q = 20$, and $T = 50000$ for $q = 40$. We also set $g = 1/n$ and $a = q + 1$ in the prior (15) on the Cholesky parameters of $\Omega$, and $\sigma^2 = 0.25$ in the proposal density for the cut-off $\theta_0$.
We begin by evaluating the global performance of our method in learning the graph structure. To this end, we first estimate the posterior probabilities of edge inclusion by computing $\hat{p}_{u \to v}(\cdot)$ in (26) for each pair of distinct nodes $u, v$. Next, we consider a threshold for edge inclusion $k \in [0, 1]$ and, for a given $k$, construct a DAG estimate by including those edges $(u, v)$ whose posterior probability exceeds $k$. The resulting graph is compared with the true DAG through the sensitivity (SEN) and specificity (SPE) indexes, respectively defined as

$$\mathrm{SEN} = \frac{TP}{TP + FN}, \quad \mathrm{SPE} = \frac{TN}{TN + FP},$$

where $TP$, $TN$, $FP$, $FN$ are the numbers of true positives, true negatives, false positives and false negatives, respectively.
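A sketch of the computation of the two indexes from a matrix of posterior edge probabilities (26) and the true adjacency matrix (our helper; names hypothetical):

```python
import numpy as np

def sen_spe(A_true, p_hat, k):
    """Sensitivity and specificity of the DAG estimate that includes edge (u, v)
    whenever the posterior probability p_hat[u, v] of Eq. (26) exceeds k."""
    off = ~np.eye(A_true.shape[0], dtype=bool)       # exclude the diagonal
    est, true = (p_hat > k)[off], (A_true == 1)[off]
    TP, TN = np.sum(est & true), np.sum(~est & ~true)
    FP, FN = np.sum(est & ~true), np.sum(~est & true)
    return TP / (TP + FN), TN / (TN + FP)
```

Sweeping $k$ over a grid of values and plotting SEN against $1 - \mathrm{SPE}$ yields the ROC curves described next.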
The two indexes are used to construct a receiver operating characteristic (ROC) curve. Specifically, for each scenario defined by $q$ and $n$, we present a ROC curve constructed as follows. For each threshold $k$, we compute SEN and $(1 - \mathrm{SPE})$ under each of the 40 DAGs used in the simulation. The point whose coordinates are the mean of each of the two measures corresponds to one dot in Figure 1. The collection of dots connected by lines represents an average ROC curve. We proceed similarly to compute the 5th and 95th percentiles and obtain the grey band.

Figure 1: Simulations. Receiver operating characteristic (ROC) curve obtained under varying thresholds for the posterior probabilities of edge inclusion, for each combination of number of nodes $q \in \{20, 40\}$ (first and second row, respectively) and sample size $n \in \{100, 200, 500\}$. Dots and connecting lines describe the (average over the simulated DAGs) ROC curve, while the grey area represents the 5th-95th percentile band.

To better appreciate Figure 1, we also compute, for each simulation scenario $(q, n)$, the area under the (average) ROC curve (AUC), whose values are reported in Table 1. They are close to or above 94% under the three sample sizes considered when $q = 20$. When $q = 40$, AUC exceeds 90% for $n = 100$ and rises to over 97% for $n = 500$.

         n = 100   n = 200   n = 500
q = 20    93.89     94.19     95.12
q = 40    90.94     94.91     97.19

Table 1: Simulations. Area under the curve (percentage values) computed from the average ROC curves in Figure 1, for number of nodes $q \in \{20, 40\}$ and sample sizes $n \in \{100, 200, 500\}$.

A more specific check on the ability of our method in recovering the structure of the underlying DAG can be considered. Since $Y$ is the response, interest centers on the causal effect on $Y$ following an intervention on a variable in the system. A natural group of intervention variables is represented by the set of parents of the latent node $X_1$, either because they directly influence $X_1$ (and hence $Y$) or because they act as intermediate nodes along a pathway originating from a variable upstream in the graph. To this end, under each simulation scenario, we fix the threshold for edge inclusion $k^* = 0.5$ and include edge $u \to 1$ whenever $\hat{p}_{u \to 1}(\cdot) \ge 0.5$. Accordingly, we construct the vector $a = (a_{1,1}, \ldots, a_{q,1})^\top$, where $a_{1,1} = 0$ while, for $u = 2, \ldots, q$, $a_{u,1} = 1$ if $u \to 1$ is included and $a_{u,1} = 0$ otherwise, and compute the proportion of correctly classified edges

$$p^* = \frac{1}{q - 1} \sum_{u=2}^{q} \mathbb{1}\left\{a_{u,1} = A^{\mathcal{D}}_{(u,1)}\right\},$$

where $A^{\mathcal{D}}_{(u,v)}$ denotes the $(u, v)$ element of the adjacency matrix of $\mathcal{D}$. The results are summarized in the box-plots of Figure 2, where we report the frequency distribution of $p^*$ computed over the 40 true DAGs. While for $n = 100$ the proportion of correctly classified edges presents some variability, with a median which is nevertheless around 80% ($q = 40$) and 90% ($q = 20$), the performance greatly improves as the sample size increases, with practically all values being close to 1.
We now focus on causal effect estimation. Under each simulated DAG $\mathcal{D}$ and parameters $(D^{\mathcal{D}}, L^{\mathcal{D}})$ we first compute the (true) covariance matrix $\Sigma^{\mathcal{D}}$ using (5). Now recall from (13) that the causal effect on $Y$ is a probability which also depends on the level $\tilde{x}$ assigned to the intervened variable $X_s$. For each intervened node $s \in \{2, \ldots, q\}$ we evaluate $\beta_s(\tilde{x}, \Sigma^{\mathcal{D}}, \theta) \equiv \beta_s^{\mathrm{true}}(\tilde{x})$ at each observed value of $X_s$ in the simulation scenario, $(x_{1,s}, \ldots, x_{n,s})$, and obtain the $(n, 1)$ vector of causal effects $\left(\beta_s^{\mathrm{true}}(x_{1,s}), \ldots, \beta_s^{\mathrm{true}}(x_{n,s})\right)^\top$. Next we produce the collection of BMA estimates $\hat{\beta}_s^{\mathrm{BMA}}(x_{1,s}), \ldots, \hat{\beta}_s^{\mathrm{BMA}}(x_{n,s})$ according to Equation (27). To evaluate the performance of our method in estimating the causal effect we consider the differences $\left(\beta_s^{\mathrm{true}}(x_{i,s}) - \hat{\beta}_s^{\mathrm{BMA}}(x_{i,s})\right)$ and compute the mean absolute error (MAE)

$$\mathrm{MAE}_s = \frac{1}{n} \sum_{i=1}^{n} \left|\beta_s^{\mathrm{true}}(x_{i,s}) - \hat{\beta}_s^{\mathrm{BMA}}(x_{i,s})\right|,$$

for each intervened node $s = 2, \ldots, q$.
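Both the BMA estimate (27) and the MAE are simple averages over the MCMC output; a minimal sketch (ours, with hypothetical names):

```python
import numpy as np

def bma_causal_effect(beta_draws):
    """BMA estimate (27): average of the causal effects beta_s^{(t)}(x) over the T
    MCMC draws (rows index t = 1, ..., T; columns index intervention values)."""
    return np.mean(np.asarray(beta_draws), axis=0)

def mae_s(beta_true, beta_bma):
    """Mean absolute error MAE_s between true and BMA-estimated causal effects,
    both evaluated at the n observed values of the intervened variable X_s."""
    return np.mean(np.abs(np.asarray(beta_true) - np.asarray(beta_bma)))
```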
Figure 2: Simulations. Distribution across 40 simulated datasets of the proportion of predictors $p^*$ that are correctly classified given a threshold for edge inclusion $k^* = 0.5$, for each combination of number of nodes $q \in \{20, 40\}$ and sample size $n \in \{100, 200, 500\}$.

Results are summarized in the box-plots of Figure 3, where we report the distribution of the MAE (constructed across the 40 DAGs and nodes $s = 2, \ldots, q$) as a function of $n$. As expected, MAE decreases and approaches 0 as the sample size $n$ grows, for both values of $q$. Notice that the median value of MAE in the worst case scenario ($q = 20$, $n = 100$) is about half of one percent.

Figure 3: Simulations. Distribution over 40 datasets and nodes $s \in \{2, \ldots, q\}$ of the mean absolute error (MAE) of BMA estimates of true causal effects. Results are presented for each combination of number of nodes $q \in \{20, 40\}$ and sample size $n \in \{100, 200, 500\}$.

Finally, we also explore settings where $n \le q$: in particular we include simulation results for $q = 40$ and smaller sample sizes $n \le q$. Again, we generate 40 DAGs and the allied parameters as in our first simulation study. Results are summarized in the box-plots of Figure 4, where we report the distribution of the MAE, constructed across the 40 DAGs and nodes $s \in \{2, \ldots, q\}$, as a function of $n$. It appears that, even if sample sizes are moderate, MAE decreases as $n$ grows.

Figure 4: Simulations. Distribution over 40 datasets and nodes $s \in \{2, \ldots, q\}$ of the mean absolute error (MAE) of BMA estimates of true causal effects. Results are presented for number of nodes $q = 40$ and sample sizes $n \le q$.
7 Application to gene expression data

In this section we apply our method to a gene expression dataset presented in Yin et al. (2014). The aim of the original study was to evaluate the ability of a gene signature derived from breast cancer stem cells to predict the risk of metastasis and survival in breast cancer patients. To this end, a collection of genes which are believed to be mainly responsible for tumor initiation, progression, and response to therapy was considered. The study was motivated by recent literature establishing the existence of a rare population of cells, called stem-like cells, which supposedly represent the cellular origin of cancer; see for instance O'Brien et al. (2006). Gene-expression levels were measured on $n = 198$ breast cancer patients, of which 62 manifested distant metastasis. In the following we consider the expression levels of $q = 28$ genes included in the original study and a binary response variable $Y$ indicating the occurrence (absence or presence, respectively $Y = 0$ and $Y = 1$) of distant metastasis. Evaluating the causal effect on $Y$ due to a hypothetical intervention on a specific gene which sets its expression level may help understand which genes are particularly relevant for determining distant metastasis. This in turn can be useful to identify external interventions which are known to induce variations in the expression of specific genes. For instance, epigenetic modifications of gene expressions may be induced by lifestyle and environmental factors such as nutrition-dietary components, exercise, physical activity and toxins; see Abdul et al. (2017). Similarly, Campbell et al. (2017) examine the effects of lifestyle interventions on proposed biomarkers of lifestyle and cancer risk at the level of adipose tissue in humans.

We apply Algorithm 1 by fixing the number of MCMC iterations $T = 120000$. Observations from the continuous variables $X_2, \ldots, X_q$ were standardized. We also set $g = 1/n$ and $a = q + 1$ in the prior on the Cholesky parameters of $\Omega$, as in the simulation scenarios of Section 6. We first use the MCMC output to estimate the posterior probability of inclusion of each directed edge $u \to v$, which we report in the heat map of Figure 5.

Figure 5: Gene expression data. Heat map with estimated marginal posterior probabilities of edge inclusion for each edge $u \to v$.

Results show a substantial degree of sparsity in the underlying graph structure, and only 48 edges have a posterior probability of inclusion exceeding 0.5.
Moreover, among the 28 genes, only gene IL8, for which $\hat{p}_{\mathrm{IL8} \to Y}(\cdot) = 0.70$, seems to directly affect the response variable.

To evaluate the incidence of each gene on the probability of recurrence, we compute the causal effect (13) on the response due to an intervention on a specific gene. To this end, starting from the MCMC output we produce a BMA estimate $\hat{\beta}_s^{\mathrm{BMA}}(\tilde{x})$ for each gene $s = 2, \ldots, q$, according to (27). Since the causal effect depends on the level $\tilde{x}$ assigned to the intervened variable $X_s$, we evaluate $\hat{\beta}_s^{\mathrm{BMA}}(\tilde{x})$ at each observed value of $X_s$, that is $x_{1,s}, \ldots, x_{n,s}$. The results are reported in Figure 6, where each box-plot refers to a gene $s \in \{2, \ldots, q\}$ and summarizes the distribution of the $n = 198$ BMA estimates $\{\hat{\beta}_s^{\mathrm{BMA}}(x_{i,s})\}_{i=1}^{n}$. Because the data were standardized, the ranges of $X_2, \ldots, X_q$ are similar and we can meaningfully compare results across genes.

Figure 6: Gene expression data. Box-plots of BMA estimates of causal effects. Each box-plot refers to one of the 28 genes $s$ and represents the $n = 198$ BMA estimates computed at each observed value $(x_{1,s}, \ldots, x_{n,s})$ of expression for gene $s$.

Recall from Proposition 3.1 that $\gamma_s$ is the covariance between $X_s$ and $X_1$. If $\gamma_s = 0$, Equation (13) shows that the causal effect on $Y$ due to an intervention on $X_s$ does not vary with $\tilde{x}$. If prior information is weak in relation to the sample size, the estimate of the causal effect will be close to the overall frequency of distant metastasis in the sample (0.31). This is the situation exhibited by most genes in Figure 6. On the other hand, if $\gamma_s$ is not zero, the collection of causal effects evaluated at $x_{i,s}$, $i = 1, \ldots, n$, will vary. Since the observations are centred, their average is zero and the causal effects will be spread around the value corresponding to the average $\bar{x}_s = 0$, whose estimate is 0.31 as indicated above. This is what happens for a few genes such as IL8, OAS2 and KRT6B, which exhibit a much greater variability of the causal effect across their measurements, implying that their regulation can influence the occurrence of distant metastasis. In particular, gene IL8 has also been identified as having a potential impact on cancer cells in several studies (Waugh & Wilson, 2008). Other genes which stand out in terms of variability are OAS2 and KRT6B, with the latter not directly linked to $Y$ (as one can see from the heat map of Figure 5) and exhibiting a moderate causal effect on $Y$ which is likely due to the strong association of KRT6B with IL8 (as it emerges from the posterior probabilities of edge inclusion in Figure 5).

For genes IL8 and OAS2 we also report in Figure 7 more detailed results for causal effect estimation. In particular, each plot reports the BMA estimates $\{\hat{\beta}_s^{\mathrm{BMA}}(x_{i,s})\}_{i=1}^{n}$ (represented by $n = 198$ dots), and the corresponding credible regions at level 95%, represented by the grey area.

Figure 7: Gene expression data. BMA estimates (dots) and credible regions at level 95% (grey area) for two selected genes, IL8 and OAS2.

Results show that increasing expression levels of IL8 are likely to increase the presence of distant metastasis, with BMA estimates of the probability of recurrence starting from 0.18 at the lowest expression levels and increasing thereafter, since most of the posterior mass of $\gamma_s$ for these genes is assigned to the positive half-line; see also the discussion after (13). Moreover, more extreme levels of IL8 are associated with larger credible regions. A similar behavior, although less pronounced, is observed for gene OAS2, with BMA estimates starting from 0.25.
8 Discussion

In this paper we deal with causal effect estimation from observational data. We consider a system of real-valued variables together with a binary response; interest lies in the evaluation of the causal effect on the response due to an external intervention on a variable. We assume that the response is generated by standard thresholding applied to a continuous latent variable, assume that the joint distribution of all continuous variables belongs to a zero-mean Gaussian Directed Acyclic Graph (DAG) model, or equivalently a Structural Equation Model (SEM), and name the resulting model DAG-probit.

Rather than constructing a prior distribution on the covariance (or precision) matrix constrained by a given DAG, we proceed by assigning to the corresponding Cholesky parameters a DAG-Wishart prior whose hyperparameters are deduced from a single unique Wishart distribution.
This technique not only drastically simplifies prior elicitation, but has the important advantage of producing score equivalence, meaning that marginal likelihoods of Markov equivalent DAG models are all equal (Geiger & Heckerman, 2002). This feature has a useful implication, as we discuss at the end of this section.

Because the structure of the data-generating DAG is unknown, we construct an MCMC sampler whose target is the joint posterior distribution of the DAG and the allied Cholesky parameters. This is achieved by carefully tailoring a Partial Analytic Structure (PAS) algorithm to our DAG setting. As a by-product, we recover the MCMC sequence of causal effects corresponding to each visited DAG; this represents the input to our final Bayesian Model Averaging (BMA) estimate, which naturally accounts for model uncertainty on the underlying graph structure.

The assumption of jointly normally distributed random variables can be a source of concern whenever one faces a concrete data analysis. With regard to our application, the Gaussian assumption has often been used to analyse gene expression data; see for instance Dobra et al. (2004) and Markowetz & Spang (2007). In addition, it allows us to easily incorporate the binary outcome through a latent component, and results in an efficient algorithm, because of closed-form expressions both for the posterior distribution of parameters and for the marginal likelihood of models.

Besides the assumption of normality, our model posits a unique graphical structure as the generating mechanism of all observations. Nevertheless, some problems may suggest partitioning the units into groups, each having a specific graphical structure which can however be related to the other ones, as in gene expressions collected on multiple tissues from the same individual (Xie et al., 2017). In this setting a multiple graphical model setup would be more appropriate to encourage similarities between group graphical structures; see for instance Peterson et al. (2015) for a Bayesian analysis of multiple Gaussian undirected graphical models. The latter could be a useful starting point for an extension of our DAG-probit model to multiple groups.

In this work we consider causal effects as obtained from interventions on single nodes. However, in practice an exogenous intervention may affect many variables (genes) simultaneously, and accordingly one may want to predict, for instance, the effect of a double or triple gene knockout on the response. Causal effect estimation from joint interventions is carried out in a Gaussian setting by Nandy et al. (2017) using a frequentist approach. Their results show that the causal effect of $X_s$ on the response in a joint intervention on a given set of variables can still be expressed as a function of the covariance matrix Markov w.r.t. $\mathcal{D}$. The same problem can be tackled by adopting a Bayesian methodology which combines DAG structural learning and causal effect estimation, and is currently under investigation by ourselves. In addition, an extension to DAG-probit models should be feasible along the lines of this paper.

The methodology adopted in this work revolves around DAGs.
However, it is known that in the Gaussian setting DAGs encoding the same conditional independencies (Markov equivalent DAGs) are not distinguishable using observational data (Verma & Pearl, 1990) and can be collected into Markov equivalence classes (MECs). Accordingly, when the goal of the analysis is structural learning (model selection), MECs represent the appropriate inferential object (Andersson et al., 1997). However, if the objective is causal effect estimation, this is no longer so, because Markov equivalent DAGs may return distinct causal effects. An inspection of (13) reveals the reason: a causal effect depends on the parent set of the intervened node, and this may differ among DAGs within the same MEC. Yet MECs can be exploited also for causal inference, as we now detail. In a frequentist setting, Maathuis et al. (2009) first estimate a MEC using the classic PC algorithm (Spirtes et al., 2000), and then provide an estimate of the causal effect under each DAG within the estimated equivalence class. Alternatively, a Bayesian methodology would first determine the posterior distribution on MEC space, and then, conditionally on a given MEC, compute the posterior of each causal effect within the class (one for each DAG). A single MEC causal effect estimate can be obtained by averaging effects across DAGs, using uniform weights on equivalent DAGs. Finally, an overall Bayesian Model Averaging (BMA) estimate can be obtained by averaging MEC-conditional estimates using posterior probabilities of MECs as weights; for details see Castelletti & Consonni (2020a). We remark that the above strategies require an exhaustive enumeration of all DAGs belonging to a MEC, which is not feasible even for a moderate number of nodes. Accordingly, one considers only the distinct causal effects within a given MEC, because these values can be efficiently recovered (Maathuis et al., 2009, Algorithm 3) even in high-dimensional settings. In this work we seemingly ignore the issue of DAG Markov equivalence, and propose a causal inference procedure which directly focuses on DAG space, rather than MEC space. However, as already remarked at the beginning of this section, our method for parameter prior construction across DAG models guarantees score equivalence for DAGs within the same MEC. This, together with a uniform prior on DAGs within the same MEC, ensures that causal effects associated to Markov equivalent DAGs will be assigned equal weights in the resulting BMA estimate.

References
Abdul, Q., Yu, B., Chung, H., Jung, H. & Choi, J. S. (2017). Epigenetic modifications of gene expression by lifestyle and environment. Archives of Pharmacal Research 40, 1219–1237. URL https://doi.org/10.1007/s12272-017-0973-3.

Albert, J. H. & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669–679.

Andersson, S. A., Madigan, D. & Perlman, M. D. (1997). A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics 25, 505–541. URL http://dx.doi.org/10.1214/aos/1031833662.

Barbieri, M. M. & Berger, J. O. (2004). Optimal predictive model selection. The Annals of Statistics 32, 870–897. URL https://doi.org/10.1214/009053604000000238.

Ben-David, E., Li, T., Massam, H. & Rajaratnam, B. (2015). High dimensional Bayesian inference for Gaussian directed acyclic graph models. arXiv pre-print. URL https://arxiv.org/abs/1109.4371.

Campbell, K. L., Landells, C. E., Fan, J. & Brenner, D. R. (2017). A systematic review of the effect of lifestyle interventions on adipose tissue gene expression: Implications for carcinogenesis. Obesity 25, S40–S51. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/oby.22010.

Cao, X., Khare, K. & Ghosh, M. (2019). Posterior graph selection and estimation consistency for high-dimensional Bayesian DAG models. The Annals of Statistics 47, 319–348. URL https://doi.org/10.1214/18-AOS1689.

Castelletti, F. & Consonni, G. (2020a). Bayesian inference of causal effects from observational data in Gaussian graphical models. Biometrics, in press.

Castelletti, F. & Consonni, G. (2020b). Supplementary material to "Bayesian causal inference in probit graphical models".

Dobra, A., Hans, C., Jones, B., Nevins, J. R., Yao, G. & West, M. (2004). Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis 90, 196–212.

Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science 303, 799–805. URL https://science.sciencemag.org/content/303/5659/799.

Friedman, N. & Koller, D. (2003). Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning 50, 95–125. URL https://doi.org/10.1023/A:1020249912095.

García-Donato, G. & Martínez-Beneito, M. A. (2013). On sampling strategies in Bayesian variable selection problems with large model spaces. Journal of the American Statistical Association 108, 340–352. URL https://doi.org/10.1080/01621459.2012.742443.

Geiger, D. & Heckerman, D. (2002). Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. The Annals of Statistics 30, 1412–1440. URL https://doi.org/10.1214/aos/1035844981.

Godsill, S. J. (2012). On the relationship between Markov chain Monte Carlo methods for model uncertainty. Journal of Computational and Graphical Statistics 10, 230–248. URL https://doi.org/10.1198/10618600152627924.

Guo, J., Levina, E., Michailidis, G. & Zhu, J. (2015). Graphical models for ordinal data. Journal of Computational and Graphical Statistics 24, 183–204. URL https://doi.org/10.1080/10618600.2014.889023.

Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.

Maathuis, M. & Nandy, P. (2016). A review of some recent advances in causal inference. In P. Bühlmann, P. Drineas, M. Kane & M. van der Laan, eds., Handbook of Big Data. Chapman and Hall/CRC, 387–408.

Maathuis, M. H., Kalisch, M. & Bühlmann, P. (2009). Estimating high-dimensional intervention effects from observational data. The Annals of Statistics 37, 3133–3164. URL https://doi.org/10.1214/09-AOS685.

Markowetz, F. & Spang, R. (2007). Inferring cellular networks - a review. BMC Bioinformatics 8. URL https://doi.org/10.1186/1471-2105-8-S6-S5.

Nandy, P., Maathuis, M. H. & Richardson, T. S. (2017). Estimating the effect of joint interventions from observational data in sparse high-dimensional settings. The Annals of Statistics 45, 647–674. URL https://doi.org/10.1214/16-AOS1462.

O'Brien, C. A., Pollett, A., Gallinger, S. & Dick, J. E. (2006). A human colon cancer cell capable of initiating tumour growth in immunodeficient mice. Nature 445, 106–110. URL https://doi.org/10.1038/nature05372.

Pearl, J. (1995). Causal diagrams for empirical research. Biometrika 82, 669–688.

Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge.

Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys 3, 96–146. URL https://doi.org/10.1214/09-SS057.

Peters, J. & Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika 101, 219–228.

Peterson, C., Stingo, F. C. & Vannucci, M. (2015). Bayesian inference of multiple Gaussian graphical models. Journal of the American Statistical Association 110, 159–174. URL https://doi.org/10.1080/01621459.2014.896806.

Spirtes, P., Glymour, C. & Scheines, R. (2000). Causation, Prediction and Search (2nd edition). Cambridge, MA: The MIT Press.

Verma, T. & Pearl, J. (1990). Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI '90. New York, NY, USA: Elsevier Science Inc., 255–270.

Wang, H. & Li, S. Z. (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics 6, 168–198. URL https://doi.org/10.1214/12-EJS669.

Waugh, D. J. & Wilson, C. (2008). The interleukin-8 pathway in cancer. Clinical Cancer Research 14, 6735–6741. URL https://doi.org/10.1158/1078-0432.CCR-07-4843.

Xie, Y., Liu, Y. & Valdar, W. (2017). Joint estimation of multiple dependent Gaussian graphical models with applications to mouse genomics. Biometrika 103, 493–511. URL https://doi.org/10.1093/biomet/asw035.

Yin, Z.-Q., Liu, J.-J., Xu, Y.-C., Yu, J., Ding, G.-H., Yang, F., Tang, L., Liu, B.-H., Ma, Y., Xia, Y.-W., Lin, X.-L. & Wang, H.-X. (2014). A 41-gene signature derived from breast cancer stem cells as a predictor of survival. Journal of Experimental & Clinical Cancer Research 33. URL https://doi.org/10.1186/1756-9966-33-49.

Supplement to "Bayesian causal inference in probit graphical models"
Federico Castelletti, Guido Consonni
1 Proof of Proposition 3.1

In this section we give a proof of Proposition 3.1. To this end we first introduce the following two lemmata.
Lemma 1.1.
Let $x, b \in \mathbb{R}^d$ be two vectors and $M \in \mathbb{R}^{d \times d}$ a symmetric and invertible matrix. Then

$$x^\top M x - 2\, b^\top x = (x - M^{-1} b)^\top M (x - M^{-1} b) - b^\top M^{-1} b.$$
Simply expand the right-hand side of the equation.
Lemma 1.2.
Let $V \mid (U = u) \sim \mathcal{N}(\mu + \gamma^\top u, \delta^2)$ and $U \sim \mathcal{N}_d(0, \Sigma)$. Then

$$V \sim \mathcal{N}\!\left(\mu,\ \frac{\delta^2}{1 - (\gamma^\top T^{-1} \gamma)/\delta^2}\right),$$

where $T = \Sigma^{-1} + \gamma\gamma^\top/\delta^2$.

Proof. We can write by definition

$$f(v) = \int f(v \mid u) f(u)\, du = \int_{\mathbb{R}^d} \frac{1}{\sqrt{2\pi\delta^2}} \exp\left\{-\frac{1}{2\delta^2}(v - \mu - \gamma^\top u)^2\right\} \cdot \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{-\frac{1}{2} u^\top \Sigma^{-1} u\right\} du.$$

Next, observe that

$$-\frac{1}{2\delta^2}(v - \mu - \gamma^\top u)^2 = -\frac{1}{2\delta^2}\left[(v - \mu)^2 + (\gamma^\top u)^2 - 2(v - \mu)\gamma^\top u\right] = -\frac{1}{2\delta^2}\left[(v - \mu)^2 + u^\top \Gamma u - 2(v - \mu)\gamma^\top u\right],$$

where $\Gamma = \gamma\gamma^\top$. Therefore we can write

$$f(v) = \frac{1}{\sqrt{2\pi\delta^2}} \exp\left\{-\frac{(v - \mu)^2}{2\delta^2}\right\} \cdot |\Sigma|^{-1/2} \int_{\mathbb{R}^d} \frac{1}{(2\pi)^{d/2}} \exp\left\{-\frac{1}{2}\left[u^\top\left(\Sigma^{-1} + \frac{\Gamma}{\delta^2}\right)u - \frac{2(v - \mu)}{\delta^2}\gamma^\top u\right]\right\} du.$$

Let now $T = \Sigma^{-1} + \Gamma/\delta^2$. By Lemma 1.1 we obtain

$$u^\top T u - 2\left[\frac{(v - \mu)}{\delta^2}\gamma^\top\right] u = \left[u - \frac{(v - \mu)}{\delta^2} T^{-1}\gamma\right]^\top T \left[u - \frac{(v - \mu)}{\delta^2} T^{-1}\gamma\right] - \left[\frac{(v - \mu)}{\delta^2}\right]^2 \gamma^\top T^{-1}\gamma.$$

Therefore,

$$f(v) = \frac{1}{\sqrt{2\pi\delta^2}} \exp\left\{-\frac{(v - \mu)^2}{2\delta^2}\right\} \cdot |\Sigma|^{-1/2} |T|^{-1/2} \exp\left\{\frac{1}{2}\left[\frac{(v - \mu)}{\delta^2}\right]^2 \gamma^\top T^{-1}\gamma\right\} \cdot \int_{\mathbb{R}^d} \frac{|T|^{1/2}}{(2\pi)^{d/2}} \exp\left\{-\frac{1}{2}\left(u - \frac{(v - \mu)}{\delta^2} T^{-1}\gamma\right)^\top T \left(u - \frac{(v - \mu)}{\delta^2} T^{-1}\gamma\right)\right\} du,$$

where the last integral equals one. Hence,

$$f(v) = \frac{1}{\sqrt{2\pi\delta^2}}\, |\Sigma|^{-1/2} |T|^{-1/2} \exp\left\{-\frac{1}{2\delta^2}\left[1 - \frac{1}{\delta^2}\gamma^\top T^{-1}\gamma\right](v - \mu)^2\right\},$$

and so $V \sim \mathcal{N}\left(\mu,\ \delta^2 / \left(1 - (\gamma^\top T^{-1}\gamma)/\delta^2\right)\right)$.
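The lemma can be checked numerically: by the law of total variance, $\mathrm{Var}(V) = \delta^2 + \gamma^\top \Sigma \gamma$, which the stated expression recovers via the Sherman-Morrison identity. A Monte Carlo sketch with arbitrary toy values (our illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, delta2, mu = 3, 0.8, 0.5
Sigma = np.diag([1.0, 2.0, 0.5]); gamma = np.array([0.3, -0.4, 0.2])

# Monte Carlo draw of V | U ~ N(mu + gamma' U, delta2), with U ~ N_d(0, Sigma)
U = rng.multivariate_normal(np.zeros(d), Sigma, size=200_000)
V = mu + U @ gamma + rng.normal(scale=np.sqrt(delta2), size=U.shape[0])

# Closed form from Lemma 1.2: Var(V) = delta2 / (1 - gamma' T^{-1} gamma / delta2)
T = np.linalg.inv(Sigma) + np.outer(gamma, gamma) / delta2
var_lemma = delta2 / (1.0 - gamma @ np.linalg.inv(T) @ gamma / delta2)
print(V.var(), var_lemma)   # the two values should agree closely
```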
Proof of Proposition 3.1. Let $(X_1, X_2, \ldots, X_q) \mid \Sigma \sim \mathcal{N}_q(0, \Sigma)$ and consider the do operator $\mathrm{do}(X_s = \tilde{x})$, where $s \in \{2, \ldots, q\}$ is the intervened node. The post-intervention distribution of $X_1$ is given by (Section 3)

$$f(x_1 \mid \mathrm{do}(X_s = \tilde{x}), \Sigma) = \int d\mathcal{N}(x_1 \mid \gamma_s\tilde{x} + \gamma^\top x_{\mathrm{pa}(s)}, \delta^2) \cdot d\mathcal{N}(x_{\mathrm{pa}(s)} \mid 0, \Sigma_{\mathrm{pa}(s), \mathrm{pa}(s)})\, dx_{\mathrm{pa}(s)},$$

where

$$X_1 \mid X_{\mathrm{pa}(s)} = x_{\mathrm{pa}(s)} \sim \mathcal{N}(\gamma_s\tilde{x} + \gamma^\top x_{\mathrm{pa}(s)}, \delta^2), \quad X_{\mathrm{pa}(s)} \sim \mathcal{N}(0, \Sigma_{\mathrm{pa}(s), \mathrm{pa}(s)}).$$

Applying Lemma 1.2 with $V = X_1$, $U = X_{\mathrm{pa}(s)}$ and $\mu = \gamma_s \tilde{x}$, we obtain

$$X_1 \mid \mathrm{do}(X_s = \tilde{x}), \Sigma \sim \mathcal{N}\!\left(\gamma_s\tilde{x},\ \frac{\delta^2}{1 - (\gamma^\top T^{-1}\gamma)/\delta^2}\right),$$

where $\delta^2 = \Sigma_{11 \mid \mathrm{fa}(s)}$, $(\gamma_s, \gamma^\top) = \Sigma_{1, \mathrm{fa}(s)}\left(\Sigma_{\mathrm{fa}(s), \mathrm{fa}(s)}\right)^{-1}$ and $T = \left(\Sigma_{\mathrm{pa}(s), \mathrm{pa}(s)}\right)^{-1} + \frac{1}{\delta^2}\gamma\gamma^\top$.

2 The PAS algorithm

We first summarize the main features of a
We first summarize the main features of a Partial Analytic Structure (PAS) algorithm (Godsill, 2012; Wang & Li, 2012). Let $X_1, \ldots, X_q$ be a collection of variables and $\mathbf{X}$ an $(n, q)$ data matrix collecting $n$ i.i.d. multivariate observations from $X_1, \ldots, X_q$. Consider $K$ distinct models $M_1, \ldots, M_K$, each one indexed by a parameter $\theta_k \in \Theta_k$, and assume model uncertainty, so that the true data generating model is one of the $K$ models. Under each model $M_k$ the likelihood function is $f(\mathbf{X} \mid \theta_k, M_k)$.

Consider now two models $M_h, M_k$, $h, k \in \{1, \ldots, K\}$, $h \neq k$, with parameters $\theta_h, \theta_k$, and let $(\theta_h)_u, (\theta_k)_u$ be two sub-vectors of $\theta_h, \theta_k$ respectively. A general PAS algorithm relies on the following two assumptions:

1. the full conditional distribution of $(\theta_k)_u$, $p((\theta_k)_u \mid (\theta_k)_{-u}, M_k, \mathbf{X})$, is available in closed form;
2. in $M_h$ there exists an "equivalent" set of parameters $(\theta_h)_u$ with the same dimension as $(\theta_k)_u$;

see also Wang & Li (2012, Section 5.2). A PAS reversible jump algorithm adopts a proposal distribution which sets $(\theta_h)_{-u} = (\theta_k)_{-u}$, draws $M_k$ from $q(M_k \mid M_h)$ and $(\theta_k)_u$ from $p((\theta_k)_u \mid (\theta_k)_{-u}, M_k, \mathbf{X})$. Specifically, the update of model $M_k$ and model parameter $\theta_k$ is performed in two steps. The first step concerns the model update and can be summarized as follows:

1. propose $M_k \sim q(M_k \mid M_h)$ and set $(\theta_h)_{-u} = (\theta_k)_{-u}$;
2. accept $M_k$ with probability $\alpha_k = \min\{1, r_k\}$, where
\[
r_k = \frac{p(M_k \mid (\theta_k)_{-u}, \mathbf{X})}{p(M_h \mid (\theta_h)_{-u}, \mathbf{X})} \cdot \frac{q(M_h \mid M_k)}{q(M_k \mid M_h)}, \tag{1}
\]
\[
p(M_k \mid (\theta_k)_{-u}, \mathbf{X}) = \int p(M_k, (\theta_k)_u \mid (\theta_k)_{-u}, \mathbf{X})\, d(\theta_k)_u; \tag{2}
\]
3. if $M_k$ is accepted, generate $(\theta_k)_u \sim p((\theta_k)_u \mid (\theta_k)_{-u}, M_k, \mathbf{X})$; otherwise generate $(\theta_h)_u \sim p((\theta_h)_u \mid (\theta_h)_{-u}, M_h, \mathbf{X})$.

In the second step we then update parameter $\theta_k$ (if $M_k$ was accepted in the model update step) or $\theta_h$ (otherwise) from its full conditional distribution using standard MCMC steps. Notice that in the case of Gaussian undirected graphical models with G-Wishart priors this requires specific MCMC techniques to avoid the computation of prior normalizing constants, which are not available in closed form; see Wang & Li (2012, Section 2.4).

We now apply the PAS algorithm to our DAG setting. We start by defining a suitable proposal distribution $q(\mathcal{D}' \mid \mathcal{D})$ on the DAG space. To this end we consider three types of operators that locally modify a given DAG $\mathcal{D}$: insert a directed edge (InsertD $a \rightarrow b$ for short), delete a directed edge (DeleteD $a \rightarrow b$) and reverse a directed edge (ReverseD $a \rightarrow b$). For each $\mathcal{D} \in \mathcal{S}_q$, $\mathcal{S}_q$ being the set of all DAGs on $q$ nodes, we then construct the set of valid operators $\mathcal{O}_\mathcal{D}$, that is, operators whose resulting graph is a DAG. Given the current $\mathcal{D}$, we propose $\mathcal{D}'$ by uniformly sampling an element of $\mathcal{O}_\mathcal{D}$. Because there is a one-to-one correspondence between each operator and the resulting DAG $\mathcal{D}'$, the probability of transition is $q(\mathcal{D}' \mid \mathcal{D}) = 1/|\mathcal{O}_\mathcal{D}|$; a schematic implementation is sketched below. Such a proposal can be easily adapted to account for sparsity constraints on the DAG space, for instance by limiting the model space to those DAGs having at most a given number of edges, typically a small multiple of the number of nodes.

Because of the structure of our proposal, at each step of our MCMC algorithm we will need to compare two DAGs $\mathcal{D}, \mathcal{D}'$ which differ by one edge only. Notice that the ReverseD $a \rightarrow b$ operator can be brought back to the same case, since it is equivalent to the consecutive application of the operators DeleteD $a \rightarrow b$ and InsertD $b \rightarrow a$.
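As a rough illustration of the proposal just described, the following sketch enumerates the valid operators on a graph and samples one uniformly. The adjacency-set representation and the depth-first acyclicity test are hypothetical implementation choices, not the authors' code; an efficient implementation would update $\mathcal{O}_\mathcal{D}$ incrementally rather than re-enumerating it at every step.
\begin{verbatim}
# Uniform proposal over valid InsertD / DeleteD / ReverseD operators.
import random

def is_acyclic(edges, q):
    children = {v: set() for v in range(q)}
    for a, b in edges:
        children[a].add(b)
    state = {}                               # 0 = visiting, 1 = done
    def visit(v):
        state[v] = 0
        for w in children[v]:
            if state.get(w) == 0 or (w not in state and not visit(w)):
                return False                 # back edge: a cycle exists
        state[v] = 1
        return True
    return all(v in state or visit(v) for v in range(q))

def valid_operators(edges, q):
    ops = []
    for a in range(q):
        for b in range(q):
            if a == b:
                continue
            if (a, b) in edges:
                ops.append(edges - {(a, b)})            # DeleteD: always a DAG
                rev = (edges - {(a, b)}) | {(b, a)}     # ReverseD a -> b
                if is_acyclic(rev, q):
                    ops.append(rev)
            elif (b, a) not in edges:
                ins = edges | {(a, b)}                  # InsertD a -> b
                if is_acyclic(ins, q):
                    ops.append(ins)
    return ops

def propose(edges, q):
    ops = valid_operators(edges, q)
    return random.choice(ops), 1.0 / len(ops)   # (D', q(D' | D))

D = {(0, 1), (1, 2)}
D_new, prob_forward = propose(D, 4)
\end{verbatim}
For the acceptance ratio one also needs the backward probability $q(\mathcal{D} \mid \mathcal{D}') = 1/|\mathcal{O}_{\mathcal{D}'}|$, obtained by enumerating the valid operators of the proposed graph.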
Therefore, consider two DAGs $\mathcal{D} = (V, E)$ and $\mathcal{D}' = (V, E')$ such that $E' = E \setminus \{(h, j)\}$. Notice that if a parent ordering is valid for $\mathcal{D}$, it is also valid for $\mathcal{D}'$, and we adopt this convention to simplify the analysis. At this stage, for better clarity of notation, we index each parameter with its own DAG-model and write accordingly $(\mathbf{D}^\mathcal{D}, \mathbf{L}^\mathcal{D})$ and $(\mathbf{D}^{\mathcal{D}'}, \mathbf{L}^{\mathcal{D}'})$, and $\Omega^\mathcal{D}$ and $\Omega^{\mathcal{D}'}$ when interest centers on the covariance matrix. Notice that the Cholesky parameters under the two DAGs differ only with regard to their $j$-th components, $\big((\sigma_j^\mathcal{D})^2, \mathbf{L}^\mathcal{D}_{\prec j]}\big)$ and $\big((\sigma_j^{\mathcal{D}'})^2, \mathbf{L}^{\mathcal{D}'}_{\prec j]}\big)$ respectively. Moreover the remaining parameters $\{(\sigma_r^\mathcal{D})^2, \mathbf{L}^\mathcal{D}_{\prec r]};\, r \neq j\}$ and $\{(\sigma_r^{\mathcal{D}'})^2, \mathbf{L}^{\mathcal{D}'}_{\prec r]};\, r \neq j\}$ are componentwise equivalent between the two graphs, because they refer to structurally equivalent conditional models; see (6). This is crucial for the correct application of the PAS algorithm.

The acceptance probability for $\mathcal{D}'$ under a PAS algorithm is given by $\alpha_{\mathcal{D}'} = \min\{1, r_{\mathcal{D}'}\}$, where
\[
r_{\mathcal{D}'} = \frac{p\big(\mathcal{D}' \mid \mathbf{D}^{\mathcal{D}'} \setminus (\sigma_j^{\mathcal{D}'})^2, \mathbf{L}^{\mathcal{D}'} \setminus \mathbf{L}^{\mathcal{D}'}_{\prec j]}, \mathbf{X}\big)}{p\big(\mathcal{D} \mid \mathbf{D}^{\mathcal{D}} \setminus (\sigma_j^{\mathcal{D}})^2, \mathbf{L}^{\mathcal{D}} \setminus \mathbf{L}^{\mathcal{D}}_{\prec j]}, \mathbf{X}\big)} \cdot \frac{q(\mathcal{D} \mid \mathcal{D}')}{q(\mathcal{D}' \mid \mathcal{D})}
= \frac{p\big(\mathbf{X}, \mathbf{D}^{\mathcal{D}'} \setminus (\sigma_j^{\mathcal{D}'})^2, \mathbf{L}^{\mathcal{D}'} \setminus \mathbf{L}^{\mathcal{D}'}_{\prec j]} \mid \mathcal{D}'\big)}{p\big(\mathbf{X}, \mathbf{D}^{\mathcal{D}} \setminus (\sigma_j^{\mathcal{D}})^2, \mathbf{L}^{\mathcal{D}} \setminus \mathbf{L}^{\mathcal{D}}_{\prec j]} \mid \mathcal{D}\big)} \cdot \frac{p(\mathcal{D}')}{p(\mathcal{D})} \cdot \frac{q(\mathcal{D} \mid \mathcal{D}')}{q(\mathcal{D}' \mid \mathcal{D})}. \tag{3}
\]
Therefore we need to evaluate, for DAG $\mathcal{D}$,
\[
p\big(\mathbf{X}, \mathbf{D} \setminus \sigma_j^2, \mathbf{L} \setminus \mathbf{L}_{\prec j]} \mid \mathcal{D}\big)
= \int_0^\infty \int_{\mathbb{R}^{|\mathrm{pa}(j)|}} p(\mathbf{X} \mid \mathbf{D}, \mathbf{L}, \mathcal{D})\, p(\mathbf{D}, \mathbf{L} \mid \mathcal{D})\, d\mathbf{L}_{\prec j]}\, d\sigma_j^2
\]
(and similarly for $\mathcal{D}'$), where we removed for simplicity the superscript $\mathcal{D}$ from the Cholesky parameters, now emphasizing the dependence on $\mathcal{D}$ through the conditioning sets. Moreover, because of the likelihood and prior factorizations in (8) and (16), we can write
\[
p\big(\mathbf{X}, \mathbf{D} \setminus \sigma_j^2, \mathbf{L} \setminus \mathbf{L}_{\prec j]} \mid \mathcal{D}\big)
= \prod_{r \neq j} f\big(\mathbf{X}_r \mid \mathbf{X}_{\mathrm{pa}(r)}, \sigma_r^2, \mathbf{L}_{\prec r]}, \mathcal{D}\big)\, p\big(\mathbf{L}_{\prec r]} \mid \sigma_r^2, \mathcal{D}\big)\, p\big(\sigma_r^2 \mid \mathcal{D}\big)
\cdot \int_0^\infty \int_{\mathbb{R}^{|\mathrm{pa}(j)|}} f\big(\mathbf{X}_j \mid \mathbf{X}_{\mathrm{pa}(j)}, \sigma_j^2, \mathbf{L}_{\prec j]}, \mathcal{D}\big)\, p\big(\mathbf{L}_{\prec j]} \mid \sigma_j^2, \mathcal{D}\big)\, p\big(\sigma_j^2 \mid \mathcal{D}\big)\, d\mathbf{L}_{\prec j]}\, d\sigma_j^2. \tag{4}
\]
In particular, because of the conjugacy of the Normal density with the Normal-Inverse-Gamma prior, the integral in (4) can be obtained in closed form as
\[
m\big(\mathbf{X}_j \mid \mathbf{X}_{\mathrm{pa}_\mathcal{D}(j)}, \mathcal{D}\big)
= (2\pi)^{-n/2}\, \frac{|T_j|^{1/2}}{|\bar{T}_j|^{1/2}}
\cdot \frac{\Gamma\big(\frac{a_j^* + n}{2}\big)}{\Gamma\big(\frac{a_j^*}{2}\big)}
\Big[\frac{g}{2}\Big]^{a_j^*/2}
\Big[\frac{1}{2}\big(g + \mathbf{X}_j^\top \mathbf{X}_j - \hat{\mathbf{L}}_j^\top \bar{T}_j \hat{\mathbf{L}}_j\big)\Big]^{-(a_j^* + n)/2}, \tag{5}
\]
where
\[
T_j = g\, I_{|\mathrm{pa}_\mathcal{D}(j)|}, \qquad
\bar{T}_j = g\, I_{|\mathrm{pa}_\mathcal{D}(j)|} + \mathbf{X}_{\mathrm{pa}_\mathcal{D}(j)}^\top \mathbf{X}_{\mathrm{pa}_\mathcal{D}(j)}, \qquad
\hat{\mathbf{L}}_j = \big(g\, I_{|\mathrm{pa}_\mathcal{D}(j)|} + \mathbf{X}_{\mathrm{pa}_\mathcal{D}(j)}^\top \mathbf{X}_{\mathrm{pa}_\mathcal{D}(j)}\big)^{-1} \mathbf{X}_{\mathrm{pa}_\mathcal{D}(j)}^\top \mathbf{X}_j,
\]
and $a_j^* = a_j - |\mathrm{pa}_\mathcal{D}(j)| - 1$. For $j = 1$, because we fixed $\sigma_1^2 = 1$, we instead obtain
\[
m\big(\mathbf{X}_1 \mid \mathbf{X}_{\mathrm{pa}_\mathcal{D}(1)}, \mathcal{D}\big)
= (2\pi)^{-n/2}\, \frac{|T_1|^{1/2}}{|\bar{T}_1|^{1/2}}
\cdot \exp\Big\{-\frac{1}{2}\big(\mathbf{X}_1^\top \mathbf{X}_1 - \hat{\mathbf{L}}_1^\top \bar{T}_1 \hat{\mathbf{L}}_1\big)\Big\}. \tag{6}
\]
Therefore, the PAS ratio in (3) reduces to
\[
r_{\mathcal{D}'} = \frac{m\big(\mathbf{X}_j \mid \mathbf{X}_{\mathrm{pa}_{\mathcal{D}'}(j)}, \mathcal{D}'\big)}{m\big(\mathbf{X}_j \mid \mathbf{X}_{\mathrm{pa}_\mathcal{D}(j)}, \mathcal{D}\big)}
\cdot \frac{p(\mathcal{D}')}{p(\mathcal{D})} \cdot \frac{q(\mathcal{D} \mid \mathcal{D}')}{q(\mathcal{D}' \mid \mathcal{D})}. \tag{7}
\]
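Equations (5)--(7) can be sketched in a few lines of code. The sketch below follows the reconstruction of (5) and (6) given above, so the placement of the factors of $1/2$ mirrors those displays; the default values of $g$ and $a_j$, and the helper names, are placeholders of ours, not the authors' settings.
\begin{verbatim}
# Node-wise log marginal likelihood (5)/(6) and log PAS ratio (7).
import numpy as np
from scipy.special import gammaln

def log_marginal(X_j, X_pa, g=1.0, a_j=None, sigma_fixed=False):
    # X_j: (n,) response column; X_pa: (n, p) parent columns, p >= 1 assumed
    n, p = X_j.shape[0], X_pa.shape[1]
    a_star = (a_j if a_j is not None else p + 3) - p - 1  # a*_j = a_j - |pa(j)| - 1
    T_bar = g * np.eye(p) + X_pa.T @ X_pa
    L_hat = np.linalg.solve(T_bar, X_pa.T @ X_j)
    _, logdet_Tbar = np.linalg.slogdet(T_bar)
    logdet_T = p * np.log(g)                              # |T_j| = g^p
    quad = X_j @ X_j - L_hat @ T_bar @ L_hat
    out = -0.5 * n * np.log(2 * np.pi) + 0.5 * (logdet_T - logdet_Tbar)
    if sigma_fixed:                                       # j = 1, sigma_1^2 = 1: Eq. (6)
        return out - 0.5 * quad
    return (out + gammaln((a_star + n) / 2) - gammaln(a_star / 2)
            + 0.5 * a_star * np.log(g / 2)
            - 0.5 * (a_star + n) * np.log((g + quad) / 2))

def log_pas_ratio(X_j, X_pa_new, X_pa_old, log_prior_ratio=0.0, log_q_ratio=0.0):
    # log r_{D'} in (7): marginal ratio times prior and proposal ratios
    return (log_marginal(X_j, X_pa_new) - log_marginal(X_j, X_pa_old)
            + log_prior_ratio + log_q_ratio)
\end{verbatim}
In the Metropolis--Hastings step, $\mathcal{D}'$ is then accepted with probability $\min\{1, \exp(\log r_{\mathcal{D}'})\}$.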
In this section we prove that the posterior distribution $p(\mathbf{D}, \mathbf{L}, \mathcal{D}, \theta, \mathbf{X}_1 \mid \mathbf{y}, \mathbf{X}_{-1})$ is proper. To this end we factorize it as
\[
p(\mathbf{D}, \mathbf{L}, \mathcal{D}, \theta, \mathbf{X}_1 \mid \mathbf{y}, \mathbf{X}_{-1})
\propto p(\theta \mid \mathbf{y}, \mathbf{X}_{-1})\, p(\mathbf{X}_1 \mid \theta, \mathbf{y}, \mathbf{X}_{-1})
\cdot p(\mathbf{D}, \mathbf{L} \mid \theta, \mathbf{X}_1, \mathbf{y}, \mathbf{X}_{-1})\, p(\mathcal{D} \mid \mathbf{D}, \mathbf{L}, \theta, \mathbf{X}_1, \mathbf{y}, \mathbf{X}_{-1}).
\]
Also, because $\mathbf{y}$ is deterministically set once $\mathbf{X}_1$ and $\theta$ are given, the latter simplifies to
\[
p(\mathbf{D}, \mathbf{L}, \mathcal{D}, \theta, \mathbf{X}_1 \mid \mathbf{y}, \mathbf{X}_{-1})
\propto p(\theta \mid \mathbf{y}, \mathbf{X}_{-1})\, p(\mathbf{X}_1 \mid \theta, \mathbf{y}, \mathbf{X}_{-1})
\cdot p(\mathbf{D}, \mathbf{L} \mid \theta, \mathbf{X})\, p(\mathcal{D} \mid \mathbf{D}, \mathbf{L}, \theta, \mathbf{X}),
\]
where $\mathbf{X}$ is the $(n, q)$ augmented data matrix, the column binding of $\mathbf{X}_1$ and $\mathbf{X}_{-1}$. To prove the propriety of the joint posterior we will show that each of the above terms corresponds to a proper distribution.

$p(\theta \mid \mathbf{y}, \mathbf{X}_{-1})$. First notice that $p(\theta \mid \mathbf{y}, \mathbf{X}_{-1}) \propto p(\mathbf{y}, \mathbf{X}_{-1} \mid \theta)$ since $p(\theta) \propto 1$. Recall the augmented likelihood in Equation (9) of our manuscript,
\[
f(\mathbf{y}, \mathbf{X} \mid \mathbf{D}, \mathbf{L}, \theta, \mathcal{D})
= \prod_{j=1}^{q} d\mathcal{N}_n\big(\mathbf{X}_j \mid -\mathbf{X}_{\mathrm{pa}(j)} \mathbf{L}_{\prec j]},\, \sigma_j^2 I_n\big)
\cdot \Big\{\prod_{i=1}^{n} \mathbb{1}\big(\theta_{y_i - 1} < x_{i,1} \le \theta_{y_i}\big)\Big\},
\]
where we set the $(j, j)$-element of $\mathbf{D}$ equal to $\sigma_j^2$. We first integrate out $\mathbf{X}_1$ to obtain the likelihood
\[
f(\mathbf{y}, \mathbf{X}_{-1} \mid \mathbf{D}, \mathbf{L}, \theta, \mathcal{D})
= \prod_{j=2}^{q} d\mathcal{N}_n\big(\mathbf{X}_j \mid -\mathbf{X}_{\mathrm{pa}(j)} \mathbf{L}_{\prec j]},\, \sigma_j^2 I_n\big)
\cdot \int_{\mathbb{R}^n} d\mathcal{N}_n\big(\mathbf{X}_1 \mid -\mathbf{X}_{\mathrm{pa}(1)} \mathbf{L}_{\prec 1]},\, \sigma_1^2 I_n\big) \prod_{i=1}^{n} \mathbb{1}\big(\theta_{y_i - 1} < x_{i,1} \le \theta_{y_i}\big)\, d\mathbf{X}_1
= f(\mathbf{X}_{-1} \mid \mathbf{D}, \mathbf{L})
\cdot \int_{O_n(\mathbf{y}, \theta)} d\mathcal{N}_n\big(\mathbf{X}_1 \mid -\mathbf{X}_{\mathrm{pa}(1)} \mathbf{L}_{\prec 1]},\, \sigma_1^2 I_n\big)\, d\mathbf{X}_1,
\]
where $\mathbf{X}_{-1} = (\mathbf{X}_2, \ldots, \mathbf{X}_q)$, $O_n(\mathbf{y}, \theta)$ is the $n$-dimensional orthant $\times_{i=1}^{n} (\theta_{y_i - 1}, \theta_{y_i}]$, and $\theta_{-1} = -\infty$, $\theta_0 = \theta$, $\theta_1 = +\infty$.

To obtain the marginal posterior of $\theta$ we first compute the integrated likelihood for $\theta$ by integrating the likelihood with respect to $(\mathbf{D}, \mathbf{L})$. Because of prior parameter independence across the $(\sigma_j^2, \mathbf{L}_{\prec j]})$'s, we obtain
\[
f(\mathbf{y}, \mathbf{X}_{-1} \mid \theta)
= \int f(\mathbf{y}, \mathbf{X}_{-1} \mid \mathbf{D}, \mathbf{L}, \theta)\, p(\mathbf{D}, \mathbf{L})\, d\mathbf{D}\, d\mathbf{L}
= \prod_{j=2}^{q} \Big\{\int d\mathcal{N}_n\big(\mathbf{X}_j \mid -\mathbf{X}_{\mathrm{pa}(j)} \mathbf{L}_{\prec j]},\, \sigma_j^2 I_n\big)\, p\big(\sigma_j^2, \mathbf{L}_{\prec j]}\big)\, d\sigma_j^2\, d\mathbf{L}_{\prec j]}\Big\}
\cdot \int_{\mathbb{R}^{|\mathrm{pa}(1)|}} \int_{O_n(\mathbf{y}, \theta)} d\mathcal{N}_n\big(\mathbf{X}_1 \mid -\mathbf{X}_{\mathrm{pa}(1)} \mathbf{L}_{\prec 1]},\, \sigma_1^2 I_n\big)\, p\big(\mathbf{L}_{\prec 1]}\big)\, d\mathbf{X}_1\, d\mathbf{L}_{\prec 1]},
\]
since $\sigma_1^2 = 1$. Notice that the first term in the integrated likelihood $f(\mathbf{y}, \mathbf{X}_{-1} \mid \theta)$ does not depend on $\theta$. We then write
\[
f(\mathbf{y}, \mathbf{X}_{-1} \mid \theta)
= K(\mathbf{X}_{-1}) \cdot \int_{O_n(\mathbf{y}, \theta)} \int_{\mathbb{R}^{|\mathrm{pa}(1)|}} d\mathcal{N}_n\big(\mathbf{X}_1 \mid -\mathbf{X}_{\mathrm{pa}(1)} \mathbf{L}_{\prec 1]},\, \sigma_1^2 I_n\big)\, p\big(\mathbf{L}_{\prec 1]}\big)\, d\mathbf{L}_{\prec 1]}\, d\mathbf{X}_1,
\]
upon interchanging the order of integration. Consider now the inner integral, where $p(\mathbf{L}_{\prec 1]}) = d\mathcal{N}_{|\mathrm{pa}(1)|}(0, g^{-1} I_{|\mathrm{pa}(1)|})$; see also Equation (15) in our manuscript. Because of conjugacy with the normal density $d\mathcal{N}_n(\mathbf{X}_1 \mid \cdot)$ we obtain
\[
\int_{\mathbb{R}^{|\mathrm{pa}(1)|}} d\mathcal{N}_n\big(\mathbf{X}_1 \mid -\mathbf{X}_{\mathrm{pa}(1)} \mathbf{L}_{\prec 1]},\, \sigma_1^2 I_n\big)\, p\big(\mathbf{L}_{\prec 1]}\big)\, d\mathbf{L}_{\prec 1]}
= d\mathcal{N}_n(\mathbf{X}_1 \mid 0, \Sigma_1),
\]
where $|\mathrm{pa}(1)|$ is the number of parents of node 1 and $\Sigma_1 = g^{-1} \mathbf{X}_{\mathrm{pa}(1)} \mathbf{X}_{\mathrm{pa}(1)}^\top + I_n$. The integrated likelihood thus becomes
\[
f(\mathbf{y}, \mathbf{X}_{-1} \mid \theta) = K(\mathbf{X}_{-1}) \cdot \int_{O_n(\mathbf{y}, \theta)} d\mathcal{N}_n(\mathbf{X}_1 \mid 0, \Sigma_1)\, d\mathbf{X}_1. \tag{8}
\]
Since we assumed $p(\theta) \propto 1$, the marginal posterior of $\theta$ is proportional to (8), and
\[
p(\theta \mid \mathbf{y}, \mathbf{X}_{-1}) \propto \int_{O_n(\mathbf{y}, \theta)} d\mathcal{N}_n(\mathbf{X}_1 \mid 0, \Sigma_1)\, d\mathbf{X}_1 := I(\theta, n).
\]
Therefore, to verify the propriety of $p(\theta \mid \mathbf{y}, \mathbf{X}_{-1})$ we need to prove that $I(\theta, n)$ is integrable over $\theta \in (-\infty, \infty)$.

Let now for simplicity $\mathbf{X}_1 = U = (U_1, \ldots, U_n)^\top$ and recall that
\[
U_i \in (-\infty, \theta] \iff U_i \le \theta \iff \theta - U_i \ge 0, \qquad
U_i \in (\theta, \infty) \iff U_i > \theta \iff U_i - \theta > 0.
\]
Hence, if we let
\[
U_i^* = \begin{cases} \theta - U_i & \text{if } y_i = 0, \\ U_i - \theta & \text{if } y_i = 1, \end{cases}
\]
we obtain that $U^* \sim \mathcal{N}_n(\mu^*, \Sigma^*)$, where $\mu^* = (\mu_1^*, \ldots, \mu_n^*)^\top$,
\[
\mu_i^* = \begin{cases} \theta & \text{if } y_i = 0, \\ -\theta & \text{if } y_i = 1, \end{cases}
\qquad
\Sigma^*(h, k) = \begin{cases} \Sigma_1(h, k) & \text{if } y_h = y_k, \\ -\Sigma_1(h, k) & \text{if } y_h \neq y_k, \end{cases}
\]
and $\Sigma^*(h, k)$ denotes the element in position $(h, k)$ of $\Sigma^*$. Therefore, we can write
\[
I(\theta, n) = \int_0^\infty \cdots \int_0^\infty d\mathcal{N}_n(u^* \mid \mu^*, \Sigma^*)\, du_1^* \cdots du_n^*. \tag{9}
\]
If $n = 1$, then (9) is not integrable, so that the posterior of $\theta$ is improper. To see why, notice that
\[
I(\theta, 1) = \int_0^\infty d\mathcal{N}(u^* \mid \mu^*, \sigma_*^2)\, du^*
= \begin{cases} \Phi(\theta/\sigma_*) & \text{if } y_1 = 0, \\ \Phi(-\theta/\sigma_*) & \text{if } y_1 = 1, \end{cases}
\]
which is not integrable over $(-\infty, \infty)$ in either case.

Consider now $n = 2$ and assume that the two observations have different values of $Y$, say $y_1 = 0$ and $y_2 = 1$. We then obtain
\[
\begin{pmatrix} U_1^* \\ U_2^* \end{pmatrix} \sim \mathcal{N}_2\left[\begin{pmatrix} \theta \\ -\theta \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\right],
\]
where $\rho = \mathrm{Corr}(U_1^*, U_2^*)$, and
\[
I(\theta, 2) = \int_0^\infty \int_0^\infty d\mathcal{N}_2(u^* \mid \mu^*, \Sigma^*)\, du_1^*\, du_2^* = P\{U_1^* > 0, U_2^* > 0\}.
\]
The posterior of $\theta$ is proper if and only if $\int_{-\infty}^{\infty} I(\theta, 2)\, d\theta < \infty$. We will show this result by providing an upper bound $I(\theta, 2) \le G(\theta, 2)$ with $G(\theta, 2)$ integrable over the real line. To this end, we first standardize $U_1^*$ and $U_2^*$,
\[
V_1^* = \frac{U_1^* - \theta}{\sigma_1}, \qquad V_2^* = \frac{U_2^* + \theta}{\sigma_2},
\]
so that
\[
\begin{pmatrix} V_1^* \\ V_2^* \end{pmatrix} \sim \mathcal{N}_2\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right]
\quad \text{and} \quad
P\{U_1^* > 0, U_2^* > 0\} = P\Big\{V_1^* > -\frac{\theta}{\sigma_1},\, V_2^* > \frac{\theta}{\sigma_2}\Big\}. \tag{10}
\]
We now distinguish two cases: $\rho \ge 0$ and $\rho < 0$.
Case 1. Consider first $0 \le \rho < 1$. Applying Equation (C.8) in Melchers & Beck (2017) to the right-hand side of (10) we obtain
\[
P\{U_1^* > 0, U_2^* > 0\} \le \Phi(-b)\,\Phi\Big(\frac{\theta}{\sigma_1}\Big) + \Phi(-a)\,\Phi\Big(-\frac{\theta}{\sigma_2}\Big) := H(\theta, 2),
\]
where
\[
a = \frac{-\theta\big[\tfrac{1}{\sigma_1} + \tfrac{\rho}{\sigma_2}\big]}{(1 - \rho^2)^{1/2}}, \qquad
b = \frac{\theta\big[\tfrac{1}{\sigma_2} + \tfrac{\rho}{\sigma_1}\big]}{(1 - \rho^2)^{1/2}},
\]
and $\Phi(t)$ is the c.d.f. of a standard normal distribution evaluated at $t$. It follows that $H(\theta, 2)$ is continuous and bounded, with
\[
\lim_{\theta \to -\infty} H(\theta, 2) = 0, \qquad \lim_{\theta \to \infty} H(\theta, 2) = 0.
\]
We now verify the integrability of $H(\theta, 2)$ in a right and left neighborhood of $-\infty$ and $\infty$ respectively.

Consider first $\theta \to -\infty$ and notice that
\[
H(\theta, 2) \le \Phi\Big(\frac{\theta}{\sigma_1}\Big) + \Phi(-a)
= \Phi(\theta A) + \Phi(\theta B)
= P\{Z \ge -\theta A\} + P\{Z \ge -\theta B\},
\]
where $Z \sim \mathcal{N}(0, 1)$ and
\[
A = \frac{1}{\sigma_1} > 0, \qquad B = \frac{\tfrac{1}{\sigma_1} + \tfrac{\rho}{\sigma_2}}{(1 - \rho^2)^{1/2}} > 0.
\]
Because $\theta \to -\infty$, we have $-\theta A > 0$ and $-\theta B > 0$. Applying the following inequality for the upper tail of $Z$,
\[
P\{Z > z\} \le \frac{\exp\{-z^2/2\}}{z\sqrt{2\pi}}, \qquad z > 0,
\]
we obtain, for $\theta \le -1$,
\[
P\{Z \ge -\theta A\} + P\{Z \ge -\theta B\}
\le \frac{\exp\{-\theta^2 A^2/2\}}{-\theta A \sqrt{2\pi}} + \frac{\exp\{-\theta^2 B^2/2\}}{-\theta B \sqrt{2\pi}}
\le \frac{\exp\{-\theta^2 A^2/2\}}{A\sqrt{2\pi}} + \frac{\exp\{-\theta^2 B^2/2\}}{B\sqrt{2\pi}} := G(\theta, 2).
\]
Clearly $\int_{-\infty}^{-1} G(\theta, 2)\, d\theta < \infty$, whence $I(\theta, 2)$ is integrable in a right neighborhood of $-\infty$.
Consider now $\theta \to \infty$. Similarly as before, notice that
\[
H(\theta, 2) \le \Phi(-b) + \Phi\Big(-\frac{\theta}{\sigma_2}\Big)
= \Phi(-\theta C) + \Phi(-\theta D)
= P\{Z \ge \theta C\} + P\{Z \ge \theta D\},
\]
where
\[
C = \frac{\tfrac{1}{\sigma_2} + \tfrac{\rho}{\sigma_1}}{(1 - \rho^2)^{1/2}} > 0, \qquad D = \frac{1}{\sigma_2} > 0.
\]
The corresponding bound on $I(\theta, 2)$ is now, for $\theta \ge 1$,
\[
G(\theta, 2) = \frac{\exp\{-\theta^2 C^2/2\}}{C\sqrt{2\pi}} + \frac{\exp\{-\theta^2 D^2/2\}}{D\sqrt{2\pi}},
\]
which is again integrable in a left neighborhood of $\infty$. Thus for $0 \le \rho < 1$ the posterior of $\theta$ is proper.
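A numeric spot-check of the Case 1 bound may be reassuring. The sketch below compares the exact orthant probability $I(\theta, 2)$, computed with SciPy's bivariate normal c.d.f., against the bound $H(\theta, 2)$ on a grid of $\theta$ values; the values of $\sigma_1, \sigma_2, \rho$ are hypothetical, and the comparison allows for a small numerical tolerance in the c.d.f. evaluation.
\begin{verbatim}
# Spot-check of I(theta, 2) <= H(theta, 2) for a positive correlation rho.
import numpy as np
from scipy.stats import norm, multivariate_normal

sigma1, sigma2, rho = 1.3, 0.8, 0.5
R = np.array([[1.0, rho], [rho, 1.0]])

for theta in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    v1, v2 = -theta / sigma1, theta / sigma2
    # survival probability via the symmetry P(V1 > v1, V2 > v2) = F(-v1, -v2)
    I_theta = multivariate_normal(cov=R).cdf([-v1, -v2])
    a = (v1 - rho * v2) / np.sqrt(1 - rho**2)
    b = (v2 - rho * v1) / np.sqrt(1 - rho**2)
    H_theta = norm.cdf(-v1) * norm.cdf(-b) + norm.cdf(-v2) * norm.cdf(-a)
    assert I_theta <= H_theta + 1e-6          # tolerance for numerical cdf error
    print(f"theta={theta:+.1f}  I={I_theta:.3e}  H={H_theta:.3e}")
\end{verbatim}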
Case 2. Consider now the case $-1 < \rho < 0$. We write $P_{\rho<0}\{\cdot\}$ to indicate that the probability is evaluated with respect to the joint distribution when the correlation coefficient is negative, and use $P_{\rho\ge 0}\{\cdot\}$ otherwise. Letting $v_1 = -\theta/\sigma_1$ and $v_2 = \theta/\sigma_2$, we have
\[
P_{\rho<0}\{U_1^* > 0, U_2^* > 0\}
= P_{\rho<0}\{V_1^* > v_1, V_2^* > v_2\}
= P_{\rho<0}\{V_1^* \le v_1, V_2^* \le v_2\} + P\{V_1^* > v_1\} + P\{V_2^* > v_2\} - 1
\le P_{\rho\ge 0}\{V_1^* \le v_1, V_2^* \le v_2\} + P\{V_1^* > v_1\} + P\{V_2^* > v_2\} - 1
= P_{\rho\ge 0}\{V_1^* > v_1, V_2^* > v_2\};
\]
see Equation (C.16) in Melchers & Beck (2017) for the inequality in the third step. It follows that $I(\theta, 2)$ for $-1 < \rho < 0$ is bounded above by $I(\theta, 2)$ for $0 \le \rho < 1$. Hence $I(\theta, 2)$ is integrable also for $\rho < 0$, and the propriety of the posterior of $\theta$ holds in this case too.

Having established that $p(\theta \mid \mathbf{y}, \mathbf{X}_{-1})$ is proper when $n = 2$ and $y_1 = 0$, $y_2 = 1$, it follows that propriety will hold for any sample size $n \ge 2$, provided there exist two observations with distinct values of $Y$. To see why, assume without loss of generality that the first two observations have distinct values; then
\[
p\big(\theta \mid \mathbf{y}^{(1:n)}, \mathbf{X}^{(1:n)}_{-1}\big)
\propto p\big(\theta \mid \mathbf{y}^{(1:2)}, \mathbf{X}^{(1:2)}_{-1}\big)
\cdot f\big(\mathbf{y}^{(3:n)}, \mathbf{X}^{(3:n)}_{-1} \mid \mathbf{y}^{(1:2)}, \mathbf{X}^{(1:2)}_{-1}, \theta\big),
\]
where $p(\theta \mid \mathbf{y}^{(1:2)}, \mathbf{X}^{(1:2)}_{-1})$ is proper and the conditional integrated likelihood $f(\cdot \mid \cdot)$ is not degenerate.

Results presented in the next sections rely on the following proposition.
Proposition 3.1. Let $p(\mathbf{X} \mid \vartheta, \lambda)$ be a proper statistical model for the data $\mathbf{X}$, where $\vartheta, \lambda$ are continuous parameters with joint prior $p(\vartheta, \lambda) = p(\lambda \mid \vartheta)\, p(\vartheta)$. If $p(\lambda \mid \vartheta)$ is proper, then
\[
p(\mathbf{X} \mid \vartheta) = \int p(\mathbf{X} \mid \vartheta, \lambda)\, p(\lambda \mid \vartheta)\, d\lambda
\]
is also a proper statistical model. If viewed as a function of $\vartheta$, $p(\mathbf{X} \mid \vartheta)$ is called an integrated likelihood.

Proof. The proof is immediate if one can interchange the order of integration between $\mathbf{X}$ and $\lambda$. In addition, it follows that if $p(\vartheta)$ is also proper, then the posterior of $\vartheta$, $p(\vartheta \mid \mathbf{X})$, will also be proper.

$p(\mathbf{X}_1 \mid \theta, \mathbf{y}, \mathbf{X}_{-1})$. We can first write
\[
p(\mathbf{X}_1 \mid \theta, \mathbf{y}, \mathbf{X}_{-1})
\propto p(\mathbf{X}_{-1} \mid \mathbf{y}, \theta, \mathbf{X}_1)\, p(\mathbf{y} \mid \theta, \mathbf{X}_1)\, p(\mathbf{X}_1 \mid \theta)
= p(\mathbf{X}_{-1} \mid \theta, \mathbf{X}_1)\, p(\mathbf{y} \mid \theta, \mathbf{X}_1)\, p(\mathbf{X}_1 \mid \theta),
\]
since $p(\mathbf{X}_{-1} \mid \mathbf{y}, \theta, \mathbf{X}_1) = p(\mathbf{X}_{-1} \mid \theta, \mathbf{X}_1)$ and $p(\theta) \propto 1$. Moreover,
\[
p(\mathbf{X}_{-1} \mid \theta, \mathbf{X}_1)
= \sum_{\mathcal{D}} \Big\{\int\!\!\int p(\mathbf{X}_{-1} \mid \mathbf{D}, \mathbf{L}, \mathcal{D}, \theta, \mathbf{X}_1)\, p(\mathbf{D}, \mathbf{L} \mid \mathcal{D}, \theta, \mathbf{X}_1)\, d\mathbf{D}\, d\mathbf{L}\Big\}\, p(\mathcal{D} \mid \theta, \mathbf{X}_1).
\]
Now notice that
\[
p(\mathbf{D}, \mathbf{L} \mid \mathcal{D}, \theta, \mathbf{X}_1)
\propto p(\mathbf{X}_1 \mid \mathbf{D}, \mathbf{L}, \mathcal{D}, \theta)\, p(\mathbf{D}, \mathbf{L} \mid \mathcal{D}, \theta)
\propto p(\mathbf{X}_1 \mid \mathbf{D}, \mathbf{L}, \mathcal{D}, \theta)\, p(\mathbf{D}, \mathbf{L} \mid \mathcal{D}),
\]
and that we can write
\[
p(\mathbf{X}_1 \mid \mathbf{D}, \mathbf{L}, \mathcal{D}, \theta)
= \sum_{\mathbf{y} \in \{0,1\}^n} \Big\{\int f(\mathbf{y}, \mathbf{X} \mid \mathbf{D}, \mathbf{L}, \theta)\, d\mathbf{X}_{-1}\Big\}
= \sum_{\mathbf{y} \in \{0,1\}^n} \Big\{\int f(\mathbf{X} \mid \mathbf{D}, \mathbf{L}, \mathcal{D})\, d\mathbf{X}_{-1} \prod_{i=1}^{n} \mathbb{1}\big(\theta_{y_i-1} < x_{i,1} \le \theta_{y_i}\big)\Big\}
= d\mathcal{N}_n(\mathbf{X}_1 \mid 0, \tau^2 I_n) \sum_{\mathbf{y} \in \{0,1\}^n} \Big\{\prod_{i=1}^{n} \mathbb{1}\big(\theta_{y_i-1} < x_{i,1} \le \theta_{y_i}\big)\Big\}
= d\mathcal{N}_n(\mathbf{X}_1 \mid 0, \tau^2 I_n),
\]
where $\tau^2 = [(\mathbf{L}^\top)^{-1} \mathbf{D} \mathbf{L}^{-1}]_{1,1}$, because for each fixed $\mathbf{X}_1 = (x_{1,1}, \ldots, x_{n,1})^\top$ only one of the $2^n$ products of indicators will hold true, and therefore $\sum_{\mathbf{y} \in \{0,1\}^n}\{\ldots\} = 1$. Hence, $p(\mathbf{X}_1 \mid \mathbf{D}, \mathbf{L}, \mathcal{D}, \theta)$ corresponds to a (proper) multivariate normal density. Also, $p(\mathbf{D}, \mathbf{L} \mid \mathcal{D})$ is a proper (prior) distribution by assumption. It follows that $p(\mathbf{X}_{-1} \mid \theta, \mathbf{X}_1)$ is proper, being an integrated likelihood obtained with the proper priors $p(\mathbf{D}, \mathbf{L} \mid \mathcal{D}, \theta, \mathbf{X}_1)$ and $p(\mathcal{D} \mid \theta, \mathbf{X}_1)$. Moreover, $p(\mathbf{y} \mid \theta, \mathbf{X}_1)$ is also proper because
\[
p(\mathbf{y} \mid \theta, \mathbf{X}_1) = \mathbb{1}\big(y_i = \mathbb{1}(x_{i,1} > \theta) \text{ for each } i = 1, \ldots, n\big). \tag{11}
\]
Finally we have
\[
p(\mathbf{X}_1 \mid \theta)
= \sum_{\mathcal{D}} \Big\{\int\!\!\int p(\mathbf{X}_1 \mid \mathbf{D}, \mathbf{L}, \mathcal{D}, \theta)\, p(\mathbf{D}, \mathbf{L} \mid \theta, \mathcal{D})\, d\mathbf{D}\, d\mathbf{L}\Big\}\, p(\mathcal{D} \mid \theta)
= \sum_{\mathcal{D}} \Big\{\int d\mathcal{N}_n(\mathbf{X}_1 \mid 0, \tau^2 I_n)\, p(\tau^2 \mid \mathcal{D})\, d\tau^2\Big\}\, p(\mathcal{D}),
\]
which is also proper because the family of densities for $\mathbf{X}_1$ is integrated with respect to the proper priors $p(\tau^2 \mid \mathcal{D})$ (induced by the proper prior on $(\mathbf{D}, \mathbf{L})$) and $p(\mathcal{D})$.

$p(\mathbf{D}, \mathbf{L} \mid \theta, \mathbf{X})$. Consider now
\[
p(\mathbf{D}, \mathbf{L} \mid \theta, \mathbf{X}) = \sum_{\mathcal{D}} p(\mathbf{D}, \mathbf{L} \mid \mathcal{D}, \theta, \mathbf{X})\, p(\mathcal{D} \mid \theta, \mathbf{X}).
\]
First notice that
\[
p(\mathbf{D}, \mathbf{L} \mid \mathcal{D}, \theta, \mathbf{X})
\propto p(\mathbf{X} \mid \mathbf{D}, \mathbf{L}, \mathcal{D}, \theta)\, p(\mathbf{D}, \mathbf{L} \mid \mathcal{D}, \theta)
= p(\mathbf{X} \mid \mathbf{D}, \mathbf{L}, \mathcal{D}, \theta)\, p(\mathbf{D}, \mathbf{L} \mid \mathcal{D}),
\]
where
\[
p(\mathbf{X} \mid \mathbf{D}, \mathbf{L}, \mathcal{D}, \theta)
= \sum_{\mathbf{y} \in \{0,1\}^n} p(\mathbf{y}, \mathbf{X} \mid \mathbf{D}, \mathbf{L}, \theta)
= \sum_{\mathbf{y} \in \{0,1\}^n} \Big\{p(\mathbf{X} \mid \mathbf{D}, \mathbf{L}, \mathcal{D}) \prod_{i=1}^{n} \mathbb{1}\big(\theta_{y_i-1} < x_{i,1} \le \theta_{y_i}\big)\Big\}
= p(\mathbf{X} \mid \mathbf{D}, \mathbf{L}, \mathcal{D}),
\]
arguing as in Section 3.2. Therefore we can write $p(\mathbf{D}, \mathbf{L} \mid \mathcal{D}, \theta, \mathbf{X}) = p(\mathbf{D}, \mathbf{L} \mid \mathcal{D}, \mathbf{X})$, which is a proper DAG-Wishart distribution. In addition, $p(\mathcal{D} \mid \theta, \mathbf{X}) \propto p(\mathbf{X} \mid \mathcal{D}, \theta)\, p(\mathcal{D} \mid \theta) = p(\mathbf{X} \mid \mathcal{D}, \theta)\, p(\mathcal{D})$, where
\[
p(\mathbf{X} \mid \theta, \mathcal{D})
= \int\!\!\int p(\mathbf{X} \mid \mathbf{D}, \mathbf{L}, \mathcal{D}, \theta)\, p(\mathbf{D}, \mathbf{L} \mid \mathcal{D}, \theta)\, d\mathbf{D}\, d\mathbf{L}
= \int\!\!\int p(\mathbf{X} \mid \mathbf{D}, \mathbf{L}, \mathcal{D})\, p(\mathbf{D}, \mathbf{L} \mid \mathcal{D})\, d\mathbf{D}\, d\mathbf{L}
\]
is again an integrated (augmented) likelihood with a proper prior $p(\mathbf{D}, \mathbf{L} \mid \mathcal{D})$, and therefore proper. Hence, $p(\mathcal{D} \mid \theta, \mathbf{X})$ is proper according to Proposition 3.1, because $p(\mathcal{D})$ is also a proper prior. Therefore, $p(\mathbf{D}, \mathbf{L} \mid \theta, \mathbf{X})$ is proper because it corresponds to a finite mixture of proper distributions.
$p(\mathcal{D} \mid \mathbf{D}, \mathbf{L}, \theta, \mathbf{X})$. Finally, we can write
\[
p(\mathcal{D} \mid \mathbf{D}, \mathbf{L}, \theta, \mathbf{X})
\propto p(\mathbf{X} \mid \mathbf{D}, \mathbf{L}, \mathcal{D}, \theta)\, p(\mathcal{D} \mid \theta)
= p(\mathbf{X} \mid \mathbf{D}, \mathbf{L}, \mathcal{D})\, p(\mathcal{D}),
\]
which is proper because $\mathcal{D}$ belongs to a finite space.

In order to fix the number of MCMC iterations in our simulation study, we ran a few pilot simulations that we used to perform convergence diagnostics. Specifically, for each pilot simulation we ran two independent chains of lengths $T_1$ and $T_2$. We fixed $T_1 = 25000$ and $T_2 = 50000$ in the $q = 20$ setting, and $T_1 = 50000$ and $T_2 = 100000$ for $q = 40$. We then compared the BMA causal effect estimates (as defined in Equation (27) of our paper) obtained under the two independent chains. Figure 1 shows the scatter plots of the BMA estimates obtained from the two chains for four randomly chosen intervened nodes in one of the pilot simulations for $q = 20$. By inspection, the agreement between the results is highly satisfactory, since points are clustered around the main diagonal of the plot. Therefore, $T_1$ iterations are sufficient to reach convergence. Similar results, not reported for brevity, were obtained in the $q = 40$ scenarios.

To emphasize the crucial role played by the set $\mathrm{fa}(s)$ in causal effect estimation, we compare our method with an alternative setup which does not adjust for confounders. Specifically, we perform inference on causal effects under a fixed naive DAG wherein all variables, save for the response, are disjoint and each is a parent of the response; see Figure 2 for an example on $q = 4$ nodes. The resulting DAG depicts a situation similar to that of a standard (Bayesian) probit regression model, wherein the covariates are not fixed but rather random and jointly independent, with marginal normal distributions $X_j \sim \mathcal{N}(0, \sigma_j^2)$, $j = 2, \ldots, q$.

We then construct simulation scenarios as detailed in Section 6 of our paper. Results are summarized in the box-plots of Figure 3, where we report the distribution of the Mean Absolute Error (MAE, computed across the 40 DAGs and nodes $s = 2, \ldots, q$) under the two strategies as a function of $n$. As expected, our original method, which fully accounts for the uncertainty on the DAG generating model, outperforms the alternative approach, which is instead based on a fixed DAG not accounting for dependencies among the variables. Indeed, under the naive approach the causal effect estimate significantly deviates from the true causal effect even for large sample sizes.

Figure 1: Scatter plots of the BMA estimates for four randomly chosen intervened nodes, obtained from two chains of lengths $T_1 = 25000$ and $T_2 = 50000$ (pilot simulation for the $q = 20$ setting).
Figure 2: An instance of the naive DAG with $q = 4$ nodes used in the alternative method under comparison.

Figure 3: Distribution over 40 datasets and nodes $s \in \{2, \ldots, q\}$ of the mean absolute error (MAE) of BMA estimates of the true causal effects. Results are presented for the two methods under comparison, our original DAG-probit method (light grey) and the naive DAG-based approach (dark grey), for each combination of number of nodes $q \in \{20, 40\}$ and sample size $n$.

In this section we investigate the computational time of our method as a function of the number of variables $q$ and the sample size $n$. Figure 4 summarizes the behavior of the running time (averaged over 40 replicates) per iteration, as a function of $q$ for fixed $n = 500$, and as a function of $n$ for fixed $q = 50$. Results were obtained on a PC with an Intel(R) Core(TM) i7-8550U 1.80 GHz processor.
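For completeness, a minimal sketch of how the per-iteration runtime can be measured; here mcmc_step is a hypothetical callable standing in for one sweep of the sampler described above, not a function from our codebase.
\begin{verbatim}
# Average per-iteration runtime of a sampler sweep.
import time

def average_iteration_time(mcmc_step, n_iter=1000):
    start = time.perf_counter()
    for _ in range(n_iter):
        mcmc_step()
    return (time.perf_counter() - start) / n_iter
\end{verbatim}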
References

Godsill, S. J. (2012). On the relationship between Markov chain Monte Carlo methods for model uncertainty. Journal of Computational and Graphical Statistics 10, 230–248. URL https://doi.org/10.1198/10618600152627924
Melchers, R. E. & Beck, A. T. (2017). Structural Reliability Analysis and Prediction. John Wiley & Sons, Ltd.
Wang, H. & Li, S. Z. (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics 6, 168–198. URL https://doi.org/10.1214/12-EJS669

Figure 4: Computational time (in seconds) per iteration, as a function of the number of variables $q$ for fixed $n = 500$ (upper plot) and as a function of the sample size $n$ for fixed $q = 50$ (lower plot), averaged over 40 simulated datasets.