Causal Discovery in the Presence of Missing Data
Ruibo Tu*, Cheng Zhang*, Paul Ackermann, Karthika Mohan, Clark Glymour, Hedvig Kjellström, and Kun Zhang*
KTH Royal Institute of Technology; Microsoft Research, Cambridge; Karolinska Institute; University of California, Berkeley; Carnegie Mellon University
Abstract
Missing data are ubiquitous in many domains such as healthcare. When these data entries are not missing completely at random, the (conditional) independence relations in the observed data may differ from those in the complete data generated by the underlying causal process. Consequently, simply applying existing causal discovery methods to the observed data may lead to wrong conclusions. In this paper, we aim to develop a causal discovery method to recover the underlying causal structure from observed data that are missing under different mechanisms, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). With missingness mechanisms represented by missingness graphs (m-graphs), we analyze conditions under which additional correction is needed to derive conditional independence/dependence relations in the complete data. Based on our analysis, we propose Missing Value PC (MVPC), which extends the PC algorithm to incorporate additional corrections. Our proposed MVPC is shown in theory to give asymptotically correct results even on data that are MAR or MNAR. Experimental results on both synthetic data and real healthcare applications illustrate that the proposed algorithm is able to find correct causal relations even in the general case of MNAR.
Determining causal relations plays a pivotal role in many disciplines of science, especially in healthcare. In particular, understanding causality in healthcare can facilitate effective treatments to improve quality of life. Traditional approaches (Domeij-Arverud et al., 2016) to identifying causal relations are usually based on randomized controlled trials, which are expensive or even impossible in certain domains. In contrast, owing to the availability of purely observational data and recent technological developments in computational and statistical analysis, causal discovery from observational data is potentially widely applicable (Spirtes et al., 2001; Peters et al., 2017). In recent years, causal discovery from observational data has become popular in medical research (Sokolova et al., 2017; Klasson et al., 2017).

* Authors contributed equally.

Most existing algorithms for causal discovery are designed for complete data (Pearl, 2000; Peters et al., 2017), such as the widely used PC algorithm (Spirtes et al., 2001). Unfortunately, missing data entries are common in many domains. For example, in healthcare, missing entries may come from imperfect data collection, compensatory medical instruments, fitness of the patients, etc. (Robins, 1986).

All missing data problems fall into one of the following three categories (Rubin, 1976): Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Data are MCAR if the cause of missingness is purely random, e.g., some entries are deleted due to a random computer error. Data are MAR when the direct cause of missingness is fully observed. For example, suppose a dataset consists of two variables, gender and income, where gender is always observed and income has missing entries. MAR missingness would occur when men are more reluctant than women to disclose their income (i.e., gender causes missingness). Data that are neither MAR nor MCAR fall under the MNAR category.
In the example above, MNAR would occur when gender also has missing entries. These missingness mechanisms can be represented by causal graphs, as introduced in Section 2. While it might be tempting to remove samples corrupted by missingness and perform analysis solely with complete cases, doing so reduces the sample size and, more importantly, biases the outcome, especially when data are MAR or MNAR (Rubin, 2004; Mohan et al., 2013; Shpitser, 2016).

This paper is concerned with how to find the underlying causal structure over observed data even in the situation of MAR or MNAR. For simplicity, we assume causal sufficiency in the paper, as assumed by many causal discovery methods including PC (Spirtes et al., 2001). Recoverability of the data distribution under missing data has been discussed in a number of contributions; see, e.g., (Mohan et al., 2013; Mohan and Pearl, 2014a). A straightforward solution is to recover all relevant distributions that are needed for the Conditional Independence (CI) tests involved in a CI-based causal search procedure, such as PC. But compared to the CI test on independent and identically distributed observations, the CI test on corrected distributions is generally harder because it involves simulating new data or importance reweighting with density ratios. Therefore, instead of correcting all CI tests of the PC algorithm, we aim to find under which conditions CI tests in the observed data produce erroneous edges, and then correct only such edges by further applying CI tests on corrected distributions. Our main contributions are:

• We provide a theoretical analysis of the errors that different missingness mechanisms introduce in the results given by traditional causal discovery methods, such as the PC algorithm (Section 3).
We show that naive deletion-based methods may lead to incorrect results due to the bias caused by missing data. One immediate way to extend constraint-based methods to handle the missing data issue is to correct all the involved CI tests; this approach is neither data-efficient nor computation-efficient. Therefore, we identify the possible errors that different missingness mechanisms lead to in the results given by deletion-based PC. We show that one needs to correct only a small number of CI tests in order to recover the true causal structure.

• We propose a novel, correction-based extension of the PC algorithm, Missing Value PC (MVPC), that handles all three types of missingness mechanisms: MCAR, MAR, and MNAR (Section 4).
Based on the results from Section 3, we identify where corrections are required and propose efficient correction methods for all three types of missingness mechanisms.

• MVPC demonstrates superior performance in different settings, including two real-life healthcare scenarios (Section 5).
We first evaluate the proposed MVPC on synthetic datasets under different settings. MVPC shows clear improvement over multiple baselines. We further apply MVPC to two real-world datasets: the US Cognition study and an Achilles Tendon Rupture study. The results are consistent with medical domain knowledge and demonstrate the efficacy of our method.
We discuss closely related work, including traditional causal discovery algorithms and approaches that deal with missing data from a causal perspective.
Causal discovery.
Causal discovery from observational data has been of great interest in various domains in the past decades (Pearl, 2000; Spirtes et al., 2001). In general, causal discovery methods comprise constraint-based methods, score-based methods, and methods based on functional causal models. Typical constraint-based methods include the PC algorithm and Fast Causal Inference (FCI). They assume that all CI relations are entailed by the causal Markov condition, according to the faithfulness assumption, and use CI constraints in the data to recover the causal structure. The PC algorithm assumes no confounders (hidden direct common causes of two variables) and outputs a Completed Partially Directed Acyclic Graph (CPDAG), which is easy to interpret and often used in biomedical applications (Neto et al., 2008; Le et al., 2016). FCI allows confounders and selection bias, and outputs a Partial Ancestral Graph (PAG). For simplicity, we use the PC algorithm in this paper, but it is straightforward to transfer our framework to other constraint-based methods. Score-based methods (e.g., Greedy Equivalence Search (Chickering, 2002)) find the best Markov equivalence class (which contains DAGs that have the same CI relations) under a certain score-based criterion, such as the Bayesian Information Criterion (BIC). Causal discovery based on functional causal models benefits from additional assumptions on the data distribution and/or the functional classes to further determine the causal direction between variables. Typical functional causal models include the linear non-Gaussian acyclic model (LiNGAM) (Shimizu et al., 2006), the post-nonlinear (PNL) causal model (Zhang and Hyvärinen, 2009), and the nonlinear additive noise model (ANM) (Peters et al., 2017).
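As a concrete illustration of the constraint-based idea, the skeleton phase of a PC-style search can be sketched as follows. This is a minimal sketch, not the paper's implementation: an oracle stands in for a statistical CI test, and the toy chain X → Z → Y is an illustrative assumption.

```python
from itertools import combinations

def pc_skeleton(nodes, ci_test):
    """Skeleton phase of a PC-style constraint-based search.

    ci_test(x, y, z) returns True when x is judged conditionally
    independent of y given the set z; here an oracle, in practice a
    statistical test such as partial correlation."""
    adj = {v: set(nodes) - {v} for v in nodes}
    depth = 0
    while any(len(adj[x]) - 1 >= depth for x in nodes):
        for x in nodes:
            for y in list(adj[x]):
                others = adj[x] - {y}
                if len(others) < depth:
                    continue
                for z in combinations(sorted(others), depth):
                    if ci_test(x, y, set(z)):
                        adj[x].discard(y)   # remove the edge x - y
                        adj[y].discard(x)
                        break
        depth += 1
    return {frozenset({x, y}) for x in nodes for y in adj[x]}

# Oracle CI relations of the toy chain X -> Z -> Y, where X is
# independent of Y given Z and no other independence holds.
def oracle(x, y, z):
    return {x, y} == {"X", "Y"} and "Z" in z

print(pc_skeleton(["X", "Y", "Z"], oracle))  # edges X - Z and Z - Y remain
```

With the oracle above, the search removes only the X-Y edge (found independent at conditioning-set size 1) and returns the chain's true skeleton.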
Dealing with data with missing values from a causalperspective.
Recent years have witnessed a growing interest in analysing the problem of missing data from a causal perspective. In particular, the notions of recoverability and testability have been studied by modeling the missingness process using causal graphs (called missingness graphs or m-graphs) (Mohan et al., 2013). Given an m-graph, a query (such as a conditional or joint distribution, or a causal effect) is deemed recoverable if it can be consistently estimated (Mohan and Pearl, 2014a). Testability, on the other hand, deals with finding testable implications, i.e., claims refutable by the (missing) data distribution (Mohan and Pearl, 2014b). Causal discovery, in contrast, aims to find the structure over the variables of interest rather than the missingness. Relations among the variables of interest can be testable under appropriate assumptions, although relations between the variables of interest and their missingness are untestable.

In causal discovery, there are few works for the MNAR case. FCI with test-wise deletion regards the missingness procedure as a particular type of selection bias in order to handle MNAR missingness (Strobl et al., 2017). It shows that FCI combined with test-wise deletion is still sound when one aims to estimate the PAG for the variables including the effect of missingness. Data missingness is usually different from selection bias, because in the selection bias case we only have the distribution of the selected samples and no clue about the population. In the missing data case, however, we may be able to check the (conditional) independence relation between two variables given others by making use of the available data for the involved variables. In the case where the missingness mechanisms are known, this problem is closely related to the recoverability of models with missing data. Gain and Shpitser (2018) utilize Inverse Probability Weighting (IPW) for each CI test, assuming the missing data model is known, which may not be realistic in many real-life applications.
When the missing data model is unknown, they choose the sparsest resulting graph over all possible missingness structures, which is usually computationally expensive.
We assume that there is no confounder or selection bias relative to the set of observed variables. When the available dataset has missing values, one may apply the PC algorithm for causal discovery by performing CI tests on those records which do not have missing values for the variables involved in the tests. We term this first proposal deletion-based PC. In this section, we discuss the influence of missing data on the results of deletion-based PC. Primarily, we investigate the situations where errors occur in the output of deletion-based PC due to the missingness. Firstly, we utilize m-graphs and summarize the assumptions that we need for properly dealing with missingness. We then present the aforementioned deletion-based PC algorithm. Our analysis focuses on the properties of the results given by this naive extension, and provides the conditions under which deletion-based PC produces erroneous edges.
Missingness graph.
We utilize the notation of the m-graph (Mohan et al., 2013). An m-graph is a causal DAG G(V, E) where V = V ∪ U ∪ V* ∪ R. U is the set of unobservable nodes; in this paper, we assume causal sufficiency, so U is an empty set. V is the set of substantive nodes (observable nodes), containing V_o and V_m. V_o ⊆ V is the set of fully observed variables, denoted by white nodes in our graphical representation. V_m ⊆ V is the set of partially observed variables that are missing in at least one record, shaded in gray. R is the set of missingness indicators, which represent the status of missingness and are responsible for the values of the proxy variables V*. For example, the proxy variable Y* ∈ V* is introduced as an auxiliary variable for the convenience of derivation: R_y = 1 if Y is missing, and Y* then corresponds to a missing entry; R_y = 0 if Y is observed, and Y* then takes the value of Y.

In this work we adopt the CI-based definitions of the missingness categories as stated in (Mohan et al., 2013). We denote an independence relation in a dataset by "⊥⊥" and d-separation in an m-graph by "⊥⊥_d". As shown in Figure 1, data are MCAR if {V_m, V_o} ⊥⊥_d R holds in the m-graph, MAR if V_m ⊥⊥_d R | V_o holds, and MNAR otherwise.

Assumptions for dealing with missingness.
Apart from the assumptions for the asymptotic correctness of the PC algorithm (including the causal Markov condition, faithfulness, and no confounding or selection bias), we introduce some additional assumptions that we make use of to deal with missingness.
Assumption 1 (Missingness indicators are not causes). No missingness indicator can be the cause of any substantive (observed) variable.
This assumption is employed in most related work using m-graphs (Mohan et al., 2013; Mohan and Pearl, 2014a). Consequently, under this assumption, if variables of interest X and Y are not d-separated by a variable set Z ⊆ V \ {X, Y}, they are not d-separated by Z together with their missingness indicators. Under the faithfulness assumption, this means that if they are conditionally independent given Z together with their missingness indicators, they are conditionally independent given only Z. The problem is that, generally speaking, we cannot directly verify whether they are conditionally independent given Z and their missingness indicators, because we do not have the records for the considered variables when their missingness indicators take value one. We then need the following assumptions.

Assumption 2 (Faithful observability). Any conditional independence relation in the observed data also holds in the unobserved data; formally, X ⊥⊥ Y | {Z, R_K = 0} ⇐⇒ X ⊥⊥ Y | {Z, R_K = 1}. Here R_K is the missingness indicator set {R_x, R_y, R_z}. R_K = 0 means all the missingness indicators in R_K take the value zero; R_K = 1 means at least one missingness indicator in R_K takes the value one.

This implies X ⊥⊥ Y | {Z, R_K = 0} ⇐⇒ X ⊥⊥ Y | {Z, R_K}, which means that conditional independence relations in the observed data also hold in the complete data, i.e., there is no accidental conditional independence relation caused by missingness.

Figure 1: Exemplar missingness graphs for (a) MCAR, (b) MAR, (c) MNAR, and (d) self-masking missingness. X, Y, Z, and W are random variables. In m-graphs, gray nodes are partially observed variables, and white nodes are fully observed variables. R_x, R_y, and R_w are the missingness indicators of X, Y, and W.

Assumption 3 (No causal interactions between missingness indicators).
No missingness indicator can be a deterministic function of any other missingness indicators.
Assumption 4 (No self-masking missingness). Self-masking missingness refers to missingness in a variable that is caused by itself. In an m-graph this is depicted by an edge from X to R_x, for X ∈ V_m (as shown in Figure 1d). We assume that there are no such edges in the m-graph.

Assumptions 3 and 4 guarantee the recoverability of the joint distribution of the substantive variables, as shown in (Mohan et al., 2013). As discussed in Appendix A.1, in the linear Gaussian case, self-masking only affects causal discovery results when R_x has direct causes other than X. Finally, we assume linear Gaussian causal models in this work. Thus, one can check CI relations with the partial correlation test, a simple CI test method. Note that our proposed algorithm also works in more general situations: in the non-linear case, one can use a suitable non-linear or non-parametric CI test (Zhang et al., 2011).

Effect of missing data on the deletion-based PC.
In the presence of missing data, the list-wise deletion PC algorithm deletes all records that have any missing value and then applies the PC algorithm to the remaining data. In contrast, the test-wise deletion PC algorithm only deletes records with missing values for the variables involved in the current CI test when performing the PC algorithm (which can be seen as the PC-algorithm analogue of (Strobl et al., 2017)). Test-wise deletion is more data-efficient than list-wise deletion. In this paper, we focus on the Test-wise Deletion PC algorithm (TD-PC).

TD-PC gives asymptotically correct results when data are MCAR, since {V_m, V_o} ⊥⊥_d R is satisfied. Consider Figure 1a as an example. R_y ⊥⊥_d {X, Y, Z} holds; thus, we have X ⊥⊥_d Y | Z ⇐⇒ X ⊥⊥_d Y | {Z, R_y}. With the faithfulness assumption on m-graphs, X ⊥⊥ Y | Z ⇐⇒ X ⊥⊥ Y | {Z, R_y}. Furthermore, with the faithful observability assumption, we conclude X ⊥⊥ Y | Z ⇐⇒ X ⊥⊥ Y* | {Z, R_y = 0}. When applying the CI test to the test-wise deleted data of the concerned variables X, Y, and Z, we test whether X ⊥⊥ Y* | {Z, R_y = 0} holds. Therefore, CI results imply d-separation/d-connection relations of the concerned variables in m-graphs when data are MCAR, which guarantees the asymptotic correctness of TD-PC.

In the cases of MAR and MNAR, TD-PC may produce erroneous edges because {V_m, V_o} ⊥⊥_d R does not hold. Therefore, in what follows in this section, we mainly address the problems of TD-PC in the cases of MAR and MNAR.

Erroneous edges produced by TD-PC.
Since TD-PC may produce erroneous edges when data are MAR or MNAR, in the following propositions (proofs are given in Appendix A.2) we first show that the causal skeleton (undirected graph) given by TD-PC has no missing edges, but may contain extraneous edges. We then determine the conditions under which extraneous edges occur in the output of TD-PC.
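The bias behind these extraneous edges can be illustrated numerically: in a structure in the spirit of Figure 1b, deleting records based on an indicator that descends from a direct common effect W of X and Y induces a spurious X-Y dependence. The exact graph, coefficients, and median-threshold missingness below are illustrative assumptions.

```python
import numpy as np

def pcorr(x, y, z):
    """Partial correlation of x and y given z, via regression residuals."""
    A = np.column_stack([z, np.ones_like(z)])
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(1)
n = 20000
# Assumed MAR graph in the spirit of Figure 1b: Z -> X, X -> W <- Y, W -> R_y.
Z = rng.normal(size=n)
X = 0.8 * Z + rng.normal(size=n)
Y = rng.normal(size=n)
W = 0.7 * X + 0.7 * Y + rng.normal(size=n)
kept = W < np.median(W)       # records with R_y = 0; Y is missing otherwise

full = pcorr(X, Y, Z)                        # near 0: X and Y independent given Z
deleted = pcorr(X[kept], Y[kept], Z[kept])   # clearly negative: spurious dependence
print(round(full, 3), round(deleted, 3))
```

Conditioning on a descendant of the collider W makes X and Y dependent within the retained records, so a CI test on the test-wise deleted data keeps the extraneous X-Y edge.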
Proposition 1.
Under Assumptions 1-4, the CI relation in test-wise deleted data, X ⊥⊥ Y | {Z, R_x = 0, R_y = 0, R_z = 0}, implies the CI relation in the complete data, X ⊥⊥ Y | Z, where X and Y are random variables and Z ⊆ V \ {X, Y}.

Proposition 1 shows that CI relations in test-wise deleted data imply the corresponding true d-separation relations in an m-graph. However, dependence relations in test-wise deleted data may imply the wrong corresponding relations in the m-graph, because X and Y being dependent given {Z, R_x = 0, R_y = 0, R_z = 0} does not imply that X and Y are dependent given Z. In other words, TD-PC may wrongly treat some variables that are d-separated as not d-separated in an m-graph. Thus, TD-PC produces extraneous edges in the causal skeleton result, rather than missing edges. For example, in Figure 1b, X and Y* are dependent given {Z, R_y = 0} in the test-wise deleted data, but the true relation is the d-separation X ⊥⊥_d Y | Z rather than d-connection. Thus, TD-PC produces an extraneous edge between X and Y. Fortunately, such extraneous edges appear only under special circumstances, as shown in the following proposition.

Proposition 2.
Suppose that X and Y are not adjacent in the true causal graph and that, for any variable set Z ⊆ V \ {X, Y} such that X ⊥⊥ Y | Z, it is always the case that X and Y are dependent given {Z, R_x = 0, R_y = 0, R_z = 0}. Then, under Assumptions 1-4, for at least one variable in {X} ∪ {Y} ∪ Z, its missingness indicator is either the direct common effect or a descendant of the direct common effect of X and Y.

Proposition 2 indicates that extraneous edges can be identified from the output of TD-PC. For example, in Figure 1b and Figure 1c, W is the direct common effect of X and Y, and the missingness indicator R_y is a descendant of W. Thus, an extraneous edge occurs between X and Y in the causal skeleton produced by TD-PC.

In this section, we present our proposed approach, Missing Value PC (MVPC), for causal discovery in the presence of missing data based on PC. We introduce the general MVPC framework in Section 4.1, and present our correction methods for removing extraneous edges in Section 4.2.
Algorithm 1 summarizes the framework of MVPC. We perform TD-PC on V (Step 1), and then involve R (Step 2). This is equivalent to performing TD-PC on V ∪ R under Assumptions 1, 3, and 4. More details of Step 2 are given in Appendix A.3. We then identify potential extraneous edges (Step 3): these are the edges between variables whose direct common effects are missingness indicators or ancestors of missingness indicators. Since we do not have orientation information at this stage, we cannot directly locate such extra edges; however, we can find the potentially incorrect edges, a superset of the incorrect edges. Next, we perform corrections for these candidate edges (Step 4). Finally, we orient the edges of the recovered causal skeleton with the same procedure as the PC algorithm.

As shown in Section 3, TD-PC produces extraneous edges in the causal skeleton in the situations of Proposition 2. In this section, we introduce our correction methods to remove the extraneous edges. We first introduce Permutation-based Correction (PermC) with an example. We then show that PermC handles most of the missingness cases. Next, we propose an alternative solution, named Density Ratio Weighted correction (DRW), for the cases which PermC does not cover.
Permutation-based correction.
We use an example to demonstrate how to remove extraneous edges with PermC. Suppose that we have a dataset with missing values whose underlying m-graph is shown in Figure 1b. As discussed in Section 3, applying TD-PC to this dataset produces an extraneous edge between X and Y. The problem is that data samples from the joint distribution P(X, Y, Z) are not available in the observed dataset. In this case, we test the CI relations in the test-wise deleted data from P(X, Y*, Z | R_y = 0), which leads to the extraneous edge.

Algorithm 1
Missing-Value PC

1. Skeleton search with deletion-based PC:
   (a) Graph initialization: Build a complete undirected graph G on the node set V.
   (b) Causal skeleton discovery: Remove edges in G with the same procedure as the PC algorithm (Spirtes et al., 2001), using the test-wise deleted data.
2. Detecting direct causes of missingness indicators: For each variable V_i ∈ V containing missing values and for each j ≠ i, test the CI relation of R_i and V_j. If they are independent given a subset of V \ {V_i, V_j}, then V_j is not a direct cause of R_i.
3. Detecting potential extraneous edges: For each i ≠ j, if V_i and V_j are adjacent and have at least one common adjacent variable or missingness indicator, the edge between V_i and V_j is potentially extraneous.
4. Recovering the true causal skeleton: Perform the correction methods for removing the extraneous edges in G, as described in Section 4.2.
5. Determining the orientation: Orient the edges in G with the same orientation procedure as the PC algorithm.

PermC solves this problem by testing the CI relations in a reconstructed virtual dataset built from the observed data:

P(X, Y, Z) = ∫_W P(X, Y, Z | W) P(W) dW = ∫_W P(X, Y*, Z | W, R_y = 0) P(W) dW,   (1)

such that the reconstructed data follow the joint distribution P(X, Y, Z). As shown in the first step of Equation 1, we introduce the random variable W, the direct cause of R_y in Figure 1b, to reconstruct the dataset and then marginalize it out. With W, the joint distribution P(X, Y, Z) is estimated by 1) learning a model for P(X, Y, Z | W) from the test-wise deleted data, 2) plugging in the values of W in the dataset as data samples from P(W), and 3) disregarding the input W and keeping the generated virtual data for {X, Y, Z}, which marginalizes W out.
Given virtual data of X, Y, and Z that follow the joint distribution P(X, Y, Z), one can test CI relations as in the complete data.

Now the issue is that data samples from P(X, Y, Z | W) are not directly available. Nevertheless, we can learn a model for P(X, Y*, Z | W, R_y = 0) to generate virtual data of X, Y, and Z from W, as shown in the second step of Equation 1. Under Assumptions 1-4, P(X, Y, Z | W) = P(X, Y*, Z | W, R_y = 0) because R_y ⊥⊥_d {X, Y, Z} | W; moreover, data samples from P(X, Y*, Z | W, R_y = 0) can be constructed by test-wise deletion. For simplicity, under the linear Gaussian assumption we apply linear regression to learn the model for P(X, Y*, Z | W, R_y = 0):

X = α_1 W + ε_1,  Y = α_2 W + ε_2,  Z = α_3 W + ε_3,   (2)

where α_i is the parameter of the corresponding linear regression model and ε_i is the residual.

Next, we sample the input values from the probability distribution P(W). Estimating P(W) for sampling input values is unnecessary in this case, because we have the complete data of W, which follow P(W). However, to generate virtual data with the linear regression models, we cannot directly input the test-wise deleted data of W and add the residuals from the linear regression models in Equation 2: the input values would then follow the conditional distribution P(W | R_y = 0) instead of P(W). Thus, we shuffle the values of W in the observed dataset such that P(W_S | R_y = 0) = P(W_S), where W_S denotes the shuffled W. We then feed the test-wise deleted values of W_S into the linear regression models:

X̂ := α_1 W_S + ε_1,  Ŷ := α_2 W_S + ε_2,  Ẑ := α_3 W_S + ε_3,   (3)

where we denote the random variables with generated virtual values by X̂, Ŷ, and Ẑ. Finally, we test the CI relations among X̂, Ŷ, and Ẑ. PermC for this example is summarized in Algorithm 2.

Algorithm 2
Permutation-based correction
Input: data of the concerned variables, such as X, Y, and Z in Figure 1b, and the direct causes of their corresponding missingness indicators, such as the direct cause W of R_y in Figure 1b.
Output:
The CI relations among the concerned variables, such as the CI relations among X, Y, and Z.

1. Delete the records containing any missing value. We denote the deleted dataset by D_d and the original dataset by D_o.
2. Regress X, Y, and Z on W in D_d, as in Equation 2.
3. Shuffle the data of W in D_o, denoted by W_S, and delete the records containing any missing value in D_o (including W_S).
4. Generate virtual data of X̂, Ŷ, and Ẑ from W_S and the residuals, according to Equation 3.
5. Test the CI relations among X̂, Ŷ, and Ẑ in the generated virtual data.
6. Return the CI relations among X, Y, and Z.

Without loss of generality, we now summarize the conditions under which PermC correctly removes extraneous edges. Suppose that we need to test the CI relation of X and Y given Z ⊆ V \ {X, Y} in the generated virtual data. We denote the direct causes of the missingness indicators by Pa(R). The conditions for the validity of PermC are as follows.

(i) {R_x, R_y, R_z, R_w} ⊥⊥_d {X, Y, Z} | W, where the variable set W is the set of direct causes of the missingness indicators R_x, R_y, and R_z; if variables in W also have missing values, the direct causes of their missingness indicators R_w are also included in W; formally, W = Pa(R_x, R_y, R_z, R_w).

(ii) In the m-graph, the missingness indicators of W satisfy X ⊥⊥_d Y | Z ⇐⇒ X ⊥⊥_d Y | {Z, R_w}.

Under Conditions (i) and (ii), we have

P(X, Y, Z | R_w = 0) = ∫_{W*} P(X*, Y*, Z* | W*, R_x = 0, R_y = 0, R_z = 0, R_w = 0) P(W* | R_w = 0) dW*.   (4)

To test the CI relation of X and Y given Z in data samples from P(X, Y, Z), it is valid to test the CI relation in the generated data samples from P(X, Y, Z | R_w = 0): under Condition (ii), the conditional independence/dependence relations in P(X, Y, Z) also hold in P(X, Y, Z | R_w = 0). Moreover, the linear regression models in PermC are valid.
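The regress-shuffle-regenerate steps of Algorithm 2 can be sketched as follows for a graph in the spirit of Figure 1b; the particular structural equations, coefficients, and median-threshold missingness are illustrative assumptions.

```python
import numpy as np

def pcorr(x, y, z):
    """Partial correlation of x and y given z, via regression residuals."""
    A = np.column_stack([z, np.ones_like(z)])
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

def fit(t, w):
    """Least-squares fit t = a * w + b; returns slope, intercept, residuals."""
    a = np.dot(w - w.mean(), t - t.mean()) / np.dot(w - w.mean(), w - w.mean())
    b = t.mean() - a * w.mean()
    return a, b, t - (a * w + b)

rng = np.random.default_rng(2)
n = 10000
# Assumed graph in the spirit of Figure 1b: Z -> X, X -> W <- Y, W -> R_y.
Z = rng.normal(size=n)
X = 0.8 * Z + rng.normal(size=n)
Y = rng.normal(size=n)
W = 0.7 * X + 0.7 * Y + rng.normal(size=n)
obs = W < np.median(W)                       # records with R_y = 0

# Step 2 / Equation 2: regress on W in the test-wise deleted data.
aX, bX, eX = fit(X[obs], W[obs])
aY, bY, eY = fit(Y[obs], W[obs])
aZ, bZ, eZ = fit(Z[obs], W[obs])

# Step 3: shuffle the fully observed W so the inputs follow P(W), not P(W | R_y = 0).
W_s = rng.permutation(W)[obs]

# Step 4 / Equation 3: generate virtual data, then re-run the CI test (Step 5).
Xv, Yv, Zv = aX * W_s + bX + eX, aY * W_s + bY + eY, aZ * W_s + bZ + eZ
print(round(pcorr(X[obs], Y[obs], Z[obs]), 3))   # test-wise deleted: spuriously nonzero
print(round(pcorr(Xv, Yv, Zv), 3))               # PermC virtual data: near zero
```

Because the residual triples keep their per-record joint dependence while the shuffled W is drawn from P(W), the virtual data approximate the complete-data joint, and the spurious X-Y partial correlation vanishes.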
Under Condition (i), we have P(X, Y, Z | W, R_x = 0, R_y = 0, R_z = 0, R_w = 0) = P(X, Y, Z | W), in which X, Y, and Z are conditionally Gaussian distributed given W. Thus, we can use linear regression to estimate P(X*, Y*, Z* | W*, R_x = 0, R_y = 0, R_z = 0, R_w = 0) and use the fitted models in the correction.

Density ratio weighted correction.
DRW removes extraneous edges in situations where Condition (i) or Condition (ii) is not satisfied (e.g., Figure 1c). In these cases, we consistently estimate the joint distribution P(V_a) of the concerned variables X, Y, and Z in a CI test and the direct causes W = Pa(R_x, R_y, R_z, R_w), based on Theorem 2 of (Mohan et al., 2013), as shown in the first line of Equation 5. Here, R represents the missingness indicators of V_a. Equation 5 provides a way to reconstruct the observed dataset:

P(V_a) = P(R = 0, V_a) / ∏_i P(R_i = 0 | Pa(R_i), R_{Pa(R_i)} = 0) = P(V_a | R = 0) × c × ∏_i β_{Pa(R_i)},   (5)

where c = P(R = 0) / ∏_i P(R_i = 0 | R_{Pa(R_i)} = 0) and β_{Pa(R_i)} = P(Pa(R_i) | R_{Pa(R_i)} = 0) / P(Pa(R_i) | R_i = 0, R_{Pa(R_i)} = 0). In the second line of Equation 5, every (conditional) probability distribution can be consistently estimated. We first apply test-wise deletion to the observed data of V_a. Then, we reweight these data with the density ratios ∏_i β_{Pa(R_i)} and the normalizing constant c. We estimate the density ratios ∏_i β_{Pa(R_i)} with kernel density estimation (Sheather and Jones, 1991) and compute the normalizing constant c. Finally, we test the CI relations of the concerned variables in the reweighted data samples from their corresponding joint distribution.

Figure 2: Performance comparison using structural Hamming distance (SHD); lower is better. Panel (a) shows the performance for MAR with 20 variables; panels (b) and (c) show the performance for MNAR with 20 and 50 variables, for sample sizes 100, 1000, 5000, and 10000. Methods compared: ideal, target, MVPC, TD-PC, LD-PC.

Figure 3: Precision and recall for adjacencies and orientation (higher is better), for (a) MAR adjacencies, (b) MAR orientation, (c) MNAR adjacencies, and (d) MNAR orientation. All experiments above use 20 nodes with 10000 data samples.
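The reweighting at the heart of DRW can be sketched for the simple case of a single missingness indicator R_y with direct cause W, where the correction reduces to the ratio β_W = P(W) / P(W | R_y = 0). The logistic missingness model and the Silverman-bandwidth Gaussian KDE are illustrative assumptions (the paper uses the bandwidth selector of Sheather and Jones).

```python
import numpy as np

def gauss_kde(samples):
    """1-D Gaussian kernel density estimate with Silverman's bandwidth."""
    h = 1.06 * samples.std() * len(samples) ** (-0.2)
    def pdf(x):
        u = (x[:, None] - samples[None, :]) / h
        return np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
    return pdf

def weighted_pcorr(data, w):
    """Partial correlation of columns 0 and 1 given the rest, from a
    weighted covariance matrix."""
    w = w / w.sum()
    mu = w @ data
    prec = np.linalg.inv((data - mu).T @ ((data - mu) * w[:, None]))
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

rng = np.random.default_rng(3)
n = 4000
Z = rng.normal(size=n)
X = 0.8 * Z + rng.normal(size=n)
Y = rng.normal(size=n)
W = 0.7 * X + 0.7 * Y + rng.normal(size=n)
obs = rng.random(n) < 1 / (1 + np.exp(1.5 * W))   # P(R_y = 0 | W) decreases in W

# beta_W = P(W) / P(W | R_y = 0), both densities estimated by KDE.
beta = gauss_kde(W)(W[obs]) / gauss_kde(W[obs])(W[obs])
D = np.column_stack([X, Y, Z])[obs]
print(round(weighted_pcorr(D, np.ones(obs.sum())), 3))  # unweighted: biased away from 0
print(round(weighted_pcorr(D, beta), 3))                # density-ratio weighted: near 0
```

Reweighting the test-wise deleted records by β_W restores the complete-data moments, so the CI test no longer reports a spurious X-Y dependence given Z.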
We evaluate our method, MVPC, on both synthetic and real-world datasets. We first show experimental results on synthetic data (Section 5.1) and the behavior of our method in a situation with ground truth. After that, we apply our method to two healthcare datasets where data entries are missing to a significant extent. The first is from the Cognition and Aging USA (CogUSA) study (McArdle et al., 2015) (Section 5.2), and the second is an Achilles Tendon Rupture (ATR) rehabilitation research study (Praxitelous et al., 2017; Domeij-Arverud et al., 2016). MVPC demonstrates superior performance compared to multiple baseline methods.
To best demonstrate the behavior of different causal discovery methods, we first perform the evaluation on synthetic data where the ground-truth causal graphs are known.
Baselines.
Our baseline methods include the deletion-based PC algorithms mentioned in Section 3: TD-PC and List-wise Deletion PC (LD-PC). Additionally, we apply the PC algorithm to the oracle data (without missing values), denoted by "ideal". Finally, to decouple the effect of sample size, we construct virtual datasets in MCAR with the same sample size as in each CI test of TD-PC; PC applied to such virtual MCAR data serves as a reference, denoted by "target".
Data Generation.
We follow the procedures in (Colombo et al., 2012; Strobl et al., 2017) to randomly generate Gaussian DAGs and sample data based on the given DAG. Additionally, we include at least two collider structures in each random Gaussian DAG so that deletion-based PC produces erroneous edges, as implied by Proposition 2. We generate two groups of synthetic data to show the scalability of our method: one group has 20 variables (with 6-10 partially observed variables), and the other has 50 variables (with 10-14 partially observed variables), for both MAR and MNAR. Note that in the MNAR case, we assume that the direct causes of the missingness indicators are partially observed. This is different from (Strobl et al., 2017), which assumes that the cause is a hidden variable. For each group of experiments, we generate 400 DAGs with sample sizes of 100, 1000, 5000, and 10000, respectively.
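A rough sketch of this generation scheme is given below; the edge probability, coefficient range, and median-based deletion rule are illustrative assumptions, not the paper's exact settings. A random upper-triangular weight matrix defines a linear Gaussian DAG, data are sampled column by column in topological order, and a MAR mechanism deletes entries of one variable based on a fully observed cause.

```python
import numpy as np

def sample_linear_gaussian_dag(n_vars, n_samples, edge_prob=0.2, seed=0):
    """Sample data from a random linear Gaussian DAG.
    An upper-triangular weight matrix guarantees acyclicity."""
    rng = np.random.default_rng(seed)
    mask = rng.random((n_vars, n_vars)) < edge_prob
    W = np.triu(mask * rng.uniform(0.5, 1.5, (n_vars, n_vars)), k=1)
    X = np.zeros((n_samples, n_vars))
    for j in range(n_vars):  # columns are already in topological order
        X[:, j] = X @ W[:, j] + rng.standard_normal(n_samples)
    return X, W

def make_mar(X, missing_col, cause_col):
    """MAR: the missingness of `missing_col` is caused by the fully
    observed `cause_col` (entries deleted when the cause is large)."""
    X = X.copy()
    X[X[:, cause_col] > np.median(X[:, cause_col]), missing_col] = np.nan
    return X
```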
Result.
In all experimental settings, we compare the results of the different algorithms using the structural Hamming distance from the ground truth, shown in Figure 2, and using the precision and recall of their adjacencies and orientations, given in Figure 3. Across both metrics, as seen from Figure 2 and Figure 3, our proposed algorithm consistently has superior performance compared to both TD-PC and LD-PC, and is very close to the "target" performance. Similar to (Strobl et al., 2017), TD-PC also performs better than LD-PC in the context of PC. Additionally, our proposed method benefits from larger volumes of data samples, as shown in Figure 2, in contrast to (Strobl et al., 2017).
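The structural Hamming distance used here counts, per variable pair, whether the learned graph differs from the truth by a missing edge, an extra edge, or a flipped orientation. A minimal sketch over adjacency matrices (our own illustration of the standard metric):

```python
import numpy as np

def structural_hamming_distance(est, true):
    """SHD between two DAG adjacency matrices (A[i, j] = 1 for i -> j).
    Each pair with a missing, extra, or reversed edge counts once."""
    est, true = np.asarray(est), np.asarray(true)
    diff = (est != true) | (est.T != true.T)  # symmetric mismatch mask
    return int(np.triu(diff, k=1).sum())      # count each pair once
```

Note the design choice that a reversed edge counts as a single error rather than two, which matches the usual convention for comparing causal skeletons and orientations.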
In this experiment, we aim to discover causal relations in the CogUSA study as in (Strobl et al., 2017). This is a typical survey-based healthcare dataset with a large number of variables with missing values. In this scenario, the missingness mechanism is unknown, and we can expect MCAR, MAR, and MNAR to occur.

Figure 4: Performance of different methods (MVPC, test-wise PC, list-wise PC, list-wise FCI, test-wise FCI) on the CogUSA study. Lower cost is better. The cost is the count of errors compared with known causal constraints from experts.

We use the same 16 variables of interest in the CogUSA study as in (Strobl et al., 2017). Since the missingness indicators of the 16 variables can be caused by other variables, we utilize the remaining variables when applying MVPC to the dataset. We use the BIC score for the CI test (a likelihood-ratio test with the BIC penalty as the threshold). Figure 4 shows the performance evaluated using the known causal constraints: 1) variables fall into two groups with no inter-group causal relation; 2) there are causal relations between two pairs of variables given by domain expertise. Each violation of these known causal relations adds 1 to the cost shown in Figure 4. Our proposed method obtains the best performance (lowest cost) compared with deletion-based PC and deletion-based FCI (Strobl et al., 2017). This demonstrates the capability of our method in real-life applications.
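The likelihood-ratio CI test with a BIC threshold mentioned above can be sketched for the Gaussian case via partial correlations; this is our own minimal reading of the idea, not the authors' implementation.

```python
import numpy as np

def bic_ci_test(data, x, y, z=()):
    """Judge X _||_ Y | Z for Gaussian data: the likelihood-ratio
    statistic -n * log(1 - r^2), with r the partial correlation of x
    and y given z, is compared against the BIC penalty log(n) for the
    single extra parameter."""
    n = data.shape[0]
    sub = data[:, [x, y, *z]]
    prec = np.linalg.inv(np.corrcoef(sub, rowvar=False))
    # Partial correlation from the precision matrix.
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    stat = -n * np.log(max(1.0 - r * r, 1e-12))
    return stat < np.log(n)  # True -> judged conditionally independent
```

Compared with a fixed-significance-level test, the log(n) threshold grows with the sample size, so spurious weak dependences are rejected more aggressively on large datasets.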
Finally, we perform causal discovery on an Achilles Tendon Rupture (ATR) study dataset (Praxitelous et al., 2017; Hamesse et al., 2018), collected in multiple hospitals. ATR is a type of soft tissue injury involving a long rehabilitation process. Understanding causal relations among various factors and healing outcomes is essential for practitioners. The list-wise deletion method is not applicable in this case because about 70% of the data entries are missing, which means that very few patients have complete records. Thus, we apply our method and TD-PC to this dataset. We ran experiments on the full dataset with more than 100 variables; Figure 5a shows part of the causal graph.

We find that age, gender, BMI (body mass index), and LSI (Limb Symmetry Index) in the causal graph given by MVPC do not affect the healing outcome measured by the Foot and Ankle Outcome Score (FAOS). This result is consistent with (Praxitelous et al., 2017; Domeij-Arverud et al., 2016). To test the effectiveness of MVPC under MNAR, we further introduce an auxiliary variable S, which is generated from two variables, operation time (OP time) and FAOS, and which in turn causes the missingness indicator of FAOS. Figures 5b and 5c show the results for these variables using TD-PC and our proposed method. Our proposed MVPC is able to correctly remove the extraneous edge between operation time and FAOS.
Figure 5: Causal discovery results in the ATR study. Experiments were run over all variables; we show only a part of the whole causal graph. Panel (a) shows the relations among five variables given by MVPC; the relations are consistent with medical studies. Panels (b) and (c) show an example where MVPC is able to correct the error of TD-PC.
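The auxiliary-variable construction used in the ATR experiment can be sketched as follows; the coefficients and the median thresholding rule are illustrative assumptions. S is generated from OP time and FAOS, and FAOS entries are then masked according to S, which makes FAOS missing not at random since FAOS is an ancestor of its own missingness indicator.

```python
import numpy as np

def add_mnar_via_auxiliary(op_time, faos, seed=0):
    """Generate S from OP time and FAOS, then let S cause the
    missingness indicator R_faos of FAOS (an MNAR mechanism)."""
    rng = np.random.default_rng(seed)
    s = 0.7 * op_time + 0.7 * faos + 0.3 * rng.standard_normal(len(faos))
    r_faos = s > np.median(s)            # R_faos = True means "missing"
    faos_observed = np.where(r_faos, np.nan, faos)
    return s, r_faos, faos_observed
```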
In this work, we address the problem of causal discovery in the presence of missing data. We first provide theoretical analysis to identify possible errors in the results given by a simple extension of PC. We then show that erroneous causal edges occur only in particular graph structures. Based on our analysis, we propose a novel algorithm, MVPC, which corrects erroneous edges under mild assumptions. We demonstrate the asymptotic correctness and the effectiveness of our method on both synthetic data and real-world applications. As future work, we will explore the possibility of further relaxing the assumptions in MVPC, as well as work jointly with practitioners on causal analysis of large-scale healthcare applications in the presence of missing data.

(In the ATR study experiment, only Paul Ackermann and Ruibo Tu had access to the ATR dataset.)
Acknowledgements
Kun would like to acknowledge the support from the United States Air Force under Contract No. FA8650-17-C-7715, from the National Institutes of Health under Contract Nos. NIH-1R01EB022858-01, FAIN-R01EB022858, NIH-1R01LM012087, NIH-5U54HG008540-02, and FAIN-U54HG008540, and from the National Science Foundation under EAGER Grant No. IIS-1829681. The National Institutes of Health, the U.S. Air Force, and the National Science Foundation are not responsible for the views reported in this article.
References
D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.

D. Colombo, M. H. Maathuis, M. Kalisch, and T. S. Richardson. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pages 294–321, 2012.

E. Domeij-Arverud, P. Anundsson, E. Hardell, G. Barreng, G. Edman, A. Latifi, F. Labruto, and P. Ackermann. Ageing, deep vein thrombosis and male gender predict poor outcome after acute achilles tendon rupture. Bone Joint J, 98(12):1635–1641, 2016.

A. Gain and I. Shpitser. Structure learning under missing data. In International Conference on Probabilistic Graphical Models, pages 121–132, 2018.

C. Hamesse, P. Ackermann, H. Kjellström, and C. Zhang. Simultaneous measurement imputation and outcome prediction for achilles tendon rupture rehabilitation. In ICML/IJCAI Joint Workshop on Artificial Intelligence in Health, 2018.

M. Klasson, K. Zhang, B. C. Bertilson, C. Zhang, and H. Kjellström. Causality refined diagnostic prediction. arXiv preprint arXiv:1711.10915, 2017.

T. Le, T. Hoang, J. Li, L. Liu, H. Liu, and S. Hu. A fast PC algorithm for high dimensional causal discovery with multi-core PCs. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016.

J. McArdle, W. Rodgers, and R. Willis. Cognition and aging in the USA (CogUSA) 2007–2009. Assessment, 2015.

K. Mohan and J. Pearl. Graphical models for recovering probabilistic and causal queries from missing data. In Advances in Neural Information Processing Systems, pages 1520–1528, 2014a.

K. Mohan and J. Pearl. On the testability of models with missing data. In Artificial Intelligence and Statistics, pages 643–650, 2014b.

K. Mohan, J. Pearl, and J. Tian. Graphical models for inference with missing data. In Advances in Neural Information Processing Systems, pages 1277–1285, 2013.

E. C. Neto, C. T. Ferrara, A. D. Attie, and B. S. Yandell. Inferring causal phenotype networks from segregating populations. Genetics, 2008.

J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000.

J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.

P. Praxitelous, G. Edman, and P. W. Ackermann. Microcirculation after achilles tendon rupture correlates with functional and patient-reported outcome. Scandinavian Journal of Medicine & Science in Sports, 2017.

J. Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9–12):1393–1512, 1986.

D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.

D. B. Rubin. Multiple Imputation for Nonresponse in Surveys, volume 81. John Wiley & Sons, 2004.

S. J. Sheather and M. C. Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B (Methodological), pages 683–690, 1991.

S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.

I. Shpitser. Consistent estimation of functions of data missing non-monotonically and not at random. In Advances in Neural Information Processing Systems, pages 3144–3152, 2016.

E. Sokolova, D. von Rhein, J. Naaijen, P. Groot, T. Claassen, J. Buitelaar, and T. Heskes. Handling hybrid and missing data in constraint-based causal discovery to study the etiology of ADHD. International Journal of Data Science and Analytics, 3(2):105–119, 2017.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2001.

E. V. Strobl, S. Visweswaran, and P. L. Spirtes. Fast causal inference with non-random missingness by test-wise deletion. International Journal of Data Science and Analytics, pages 1–16, 2017.

K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 647–655. AUAI Press, 2009.

K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), Barcelona, Spain, 2011.
Appendix
A.1 Violation of "no self-masking missingness"
Figure 6: Self-masking missingness indicator with multiple direct causes: TD-PC produces an extra edge between X and Y, but such self-masking missingness does not affect the other edges in the causal skeleton results, such as the edge between X and V_i ∈ V \ {X, Y}.

In this section, we discuss the challenges of self-masking missingness (SFM) and its influence on MVPC.

We note that in the linear Gaussian case SFM does not affect MVPC when the SFM indicator R_x has only one direct cause X, as in Figure 1d. In this case, the result of the CI test of X and Y in the test-wise deleted data implies the correct d-separation relation in the m-graph. With the faithfulness assumption on the m-graph, we have X ⊥⊥ Y ⇐⇒ X ⊥⊥ Y | R_x; furthermore, under the faithful observability assumption, X ⊥⊥ Y | R_x ⇐⇒ X ⊥⊥ Y | R_x = 0, which can be tested with the observed values of X and Y.

SFM affects the MVPC results when the SFM indicator R_x has multiple direct causes. For example, in the m-graph shown in Figure 6, conditioning on a missingness indicator that is the direct common effect of the two variables in a CI test produces an extraneous edge between them in the result given by MVPC. Removing such extraneous edges is challenging because our correction methods are not applicable in the self-masking missingness scenario. However, such a self-masking missingness indicator does not affect the other edges between X and the variables in V \ {X, Y} in the causal skeleton produced by MVPC. Therefore, we specify in the output that the edges between the self-masking variable and the other direct causes of the self-masking missingness indicator are uncertain.

A.2 Proofs of the propositions
Proof of Proposition 1. X ⊥⊥ Y | {Z, R_z = 0, R_x = 0, R_y = 0} ⇒ X ⊥⊥ Y | Z: We have X ⊥⊥ Y | {Z, R_z = 0, R_x = 0, R_y = 0}, where some of the involved missingness indicators may only take value 0 (i.e., the corresponding variables do not have missing values). With the faithful observability assumption, the above condition implies X ⊥⊥ Y | {Z, R_z, R_x, R_y}. Because of the faithfulness assumption on m-graphs, we know that X and Y are d-separated by {Z, R_z, R_x, R_y}; furthermore, with Assumptions 1, 3, and 4, the missingness indicators can only be leaf nodes in the m-graph. Therefore, conditioning on these nodes will not destroy the above d-separation relation. That is, in the m-graph, X and Y are d-separated by Z. Hence, we have X ⊥⊥ Y | Z. □

Proof of Proposition 2. The condition of Proposition 2 implies that for nodes X, Y and any node set Z ⊆ V \ {X, Y} in an m-graph, conditioning on Z and the missingness indicators R_x, R_y, and R_z, there always exists an undirected path U between X and Y that is not blocked. Furthermore, to satisfy such a constraint on U, at least one missingness indicator R_i ∈ {R_x, R_y, R_z} satisfies one of the following two conditions: (1) R_i is the only vertex on U; (2) a cause of R_i is the only vertex on U, as a collider. In Condition (1), if R_i is on U, it is a collider because under Assumptions 1∼4 missingness indicators are leaf nodes in m-graphs. Now suppose that R_i is not the only vertex on U, and that another node V_j ∈ V \ {X, Y, Z} is also on U. Conditioning on V_j and R_i, U is blocked, which does not satisfy the constraint on U. Thus, R_i must be the only vertex on U. The same reasoning applies to Condition (2). In summary, we conclude that under the condition of Proposition 2, there is at least one missingness indicator R_i ∈ {R_x, R_y, R_z} such that R_i is the direct common effect, or a descendant of the direct common effect, of X and Y. □

A.3 Detection of direct causes of missingness indicators
In Step 2 of Algorithm 1, the detection of the direct causes of missingness indicators is implemented with the causal skeleton discovery procedure of TD-PC. For each missingness indicator R_i, the causal skeleton discovery procedure checks all the CI relations between R_i and the variables in V \ V_i, and tests whether R_i is conditionally independent of a variable V_j ∈ V \ V_i given any variable or set of variables connected to R_i or V_j. If they are conditionally independent, the edge between R_i and V_j is removed. Under Assumptions 1∼4, R_i and V_j have at least one direct common effect. Therefore, all the variables adjacent to R_i are its direct causes, because R_i is either an effect or a cause, and we assume that R_i
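The detection step described above can be sketched as a PC-style skeleton search restricted to a single missingness indicator; the function names and the `ci_test` callback interface are our own illustration, not the paper's code.

```python
from itertools import combinations

def detect_parents_of_indicator(r_i, candidates, ci_test, max_cond=2):
    """Skeleton search around a missingness indicator R_i: remove the
    edge R_i - V_j whenever some conditioning set drawn from the
    remaining neighbors renders them independent. Under the paper's
    assumptions, the variables still adjacent at the end are direct
    causes of R_i. `ci_test(a, b, cond)` returns True when a and b
    are judged conditionally independent given the set `cond`."""
    adjacent = set(candidates)
    for v in sorted(adjacent):
        others = adjacent - {v}
        for size in range(max_cond + 1):
            if any(ci_test(r_i, v, set(c))
                   for c in combinations(sorted(others), size)):
                adjacent.discard(v)
                break
    return adjacent

# Toy example: R depends on 'a' and 'b' directly; 'c' only through 'a',
# so conditioning on 'a' separates R from 'c'.
def toy_ci_test(r, v, cond):
    return v == "c" and "a" in cond

parents = detect_parents_of_indicator("R", {"a", "b", "c"}, toy_ci_test)
```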