Stable Prediction via Leveraging Seed Variable
Kun Kuang, Bo Li, Peng Cui, Yue Liu, Jianrong Tao, Yueting Zhuang, Fei Wu
Kun Kuang ∗
Zhejiang University
Bo Li
Tsinghua University
Peng Cui
Tsinghua University
Yue Liu
Peking University
Jianrong Tao
Netease
Yueting Zhuang
Zhejiang University
Fei Wu
Zhejiang University
Abstract
In this paper, we focus on the problem of stable prediction across unknown test data, where the test distribution is agnostic and might be totally different from the training one. In such a case, previous machine learning methods might exploit subtly spurious correlations in training data induced by non-causal variables for prediction. Those spurious correlations are changeable across data, leading to instability of prediction across data. By assuming that the relationships between causal variables and the response variable are invariant across data, to address this problem, we propose a conditional independence test based algorithm to separate the causal variables with a seed variable as prior knowledge, and adopt them for stable prediction. By assuming the independence between causal and non-causal variables, we show, both theoretically and with empirical experiments, that our algorithm can precisely separate causal and non-causal variables for stable prediction across test data. Extensive experiments on both synthetic and real-world datasets demonstrate that our algorithm outperforms state-of-the-art methods for stable prediction.
Many machine learning algorithms have been shown to be very successful for prediction when the test data have the same distribution as the training data. In real scenarios, however, we cannot guarantee that unknown test data will have the same distribution as the training data. For example, different geographies, schools, or hospitals may draw from different demographics, and the correlation structure among demographics may also vary (e.g., one ethnic group may be more or less disadvantaged in different geographies). A model may exploit subtle statistical relationships among predictors present in the training data to improve prediction, resulting in instability of prediction on test data that lie outside the training distribution. Hence, how to learn a model for stable prediction across unknown test data is of paramount importance for both academic research and practical applications.

To address the stable/invariant prediction problem, many algorithms have recently been proposed, including domain generalization Muandet et al. [2013], causal transfer learning Rojas-Carulla et al. [2018], and invariant causal prediction Peters et al. [2016]. The motivation of these methods is to exploit the invariant or stable structure between predictors and the response variable across multiple training datasets for stable prediction. But they cannot handle test data whose distribution lies outside all training environments. Kuang et al. [2018, 2020] proposed to recover the causation between predictors and the response variable by global sample weighting, and to separate causal variables for stable prediction. However, they either assume all predictors are binary or base their analysis on a linear model, which is impractical in real scenarios.

In the stable prediction problem Kuang et al. [2018], all predictors X can be separated into two categories, causal variables C and non-causal variables N, according to whether they have a causal effect on the response variable Y or not, that is, X = {C, N}. For example, ears, noses, and whiskers are causal variables for identifying whether an image contains a cat, while the grass or other backgrounds are non-causal variables for recognizing the cat. The generation of the response variable Y can then be written as Y = f(X) + ε = f(C) + ε, where the non-causal variables N are independent of the response variable Y conditional on the full set of causal variables C. But they might be spuriously correlated with the causal variables, the response variable, or both, because of sample selection bias in the data. For example, the variable "grass" would be spuriously correlated with the label "cat" and become a powerful predictor if we select many images with "cat on the grass" as training data. Those spurious correlations between non-causal variables and the response variable vary and are unstable across datasets with different distributions, leading to unstable prediction across unknown test data. Hence, to address the stable prediction problem, one possible solution is to separate the causal and non-causal variables, and only adopt the causal variables for model training and prediction. However, in practice, the analyst usually has no prior knowledge of which variables are causal and which are non-causal.

Variable/feature selection plays a very important role in the machine learning field. Traditional correlation based feature selection methods utilize either correlation criteria Nie et al. [2010] or mutual information criteria Peng et al. [2005] without distinguishing spurious correlations, leading to unstable prediction on test data that lie outside the training distribution.

∗ [email protected]. Under review.
In the literature of causality, causal discovery and causal estimation techniques can be adopted for causal variable selection. PC Spirtes et al. [2000], FCI Spirtes et al. [2000], and CPC Ramsey et al. [2012] are three of the most prominent causal discovery methods based on conditional independence (CI) tests, but their complexity grows exponentially with the number of variables. Moreover, the PC method needs to assume causal sufficiency, i.e., that all common causes of the observed predictors are observed. Athey et al. [2018], Kuang et al. [2017] can approximately identify causal variables via estimating the causal effect of each variable, but they focus on binary predictors and require that all causal variables are observed.
Figure 1: SCM in our problem. Each causal variable C_i has a direct causal link to Y, but non-causal variable N_j does not. Under the sample selection Bareinboim and Pearl [2012] (indexed by variable S), some non-causal variables might be correlated with the response variable, the causal variables, or both.

Considering the practical scenario in which the causal sufficiency assumption is not met and part of the causal variables are unobserved or unmeasured, in this paper we propose a novel CI test based causal variable separation method for stable prediction. By assuming that the set of causal variables C and the set of non-causal variables N are independent, Fig. 1 illustrates the structural causal model (SCM) in our problem. We then provide a series of theorems to prove that one can separate the causal variables with a single CI test per variable. Specifically, as shown in Fig. 1, if we know a seed variable C that is one of the causal variables, then each causal variable C·,i satisfies C·,i ⊥̸⊥ C | Y, and each non-causal variable N·,j satisfies N·,j ⊥⊥ C | Y. With these theoretical analyses, we present a CI test based causal variable separation method for stable prediction. As a first step, we apply our causal variable separation method to synthetic data, which leads to high precision on causal variable separation, and the precisely separated causal variables bring stability for prediction across unknown test data. In real-world applications, we also demonstrate that our algorithm outperforms baseline algorithms in both the causal variable separation task and the stable prediction task. Compared with previous CI based causal discovery methods Spirtes et al. [2000], Ramsey et al. [2012], Bühlmann et al. [2010], Yu et al. [2019], our method does not rely on the assumption of causal sufficiency and remains unaffected even when some causal variables are unobserved.
Moreover, our algorithm separates the causal variables with a single CI test per variable, reducing the algorithmic complexity from exponential to linear in the number of variables. Compared with sample weighting based work on stable prediction Kuang et al. [2018, 2020], our method can be applied in continuous settings and separates the causal variables without assumptions on the regression model.

Let X, Y denote the spaces of observed predictors and the response variable, respectively. We define an environment e ∈ E to be a joint distribution P_XY on X × Y. In practice, the joint distribution can vary across environments: P^e_XY ≠ P^e′_XY for e, e′ ∈ E. Note that the distribution under sample selection is always conditioned on S.

In this paper, we consider a setting where a researcher has a single dataset (data from one environment) and wishes to train a model that can then be applied to other environments. This type of problem might arise when a firm creates an algorithm that is then provided to other organizations to apply; for example, medical researchers might train a model and incorporate it in a software product that is used by a range of hospitals, or academics might build a prediction model that is applied by governments in different locations. The researcher may not have access to the end users' data for confidentiality reasons. The problem can be formalized as a stable prediction problem Kuang et al. [2018] as follows:

Problem 1 (Stable Prediction).
Given one training environment e ∈ E with dataset D^e = {X^e, Y^e}, the task is to learn a predictive model that can stably predict across unknown test environments E.

In this problem, letting X = {C, N}, we define C as causal variables and N as non-causal variables with the following assumption Kuang et al. [2018]:

Assumption 1
There exists a stable probability function P(y|c) such that for all environments e ∈ E, P(Y^e = y | C^e = c, N^e = n) = P(Y^e = y | C^e = c) = P(y|c).

Thus, one can address the stable prediction problem by separating the causal variables C and learning the stable function P(y|c). But, in practice, we have no prior knowledge of which variables are causal and which are non-causal. In this work, we focus on stable prediction via separating causal variables.

Assumption 2
Causal variables C and non-causal variables N are independent. Formally, C ⊥⊥ N.

Assumptions 1 and 2 illuminate that the non-causal variables are independent of the response variable during the data generation process (i.e., Y = f(X) + ε = f(C) + ε), but they might be spuriously correlated with the response variable, the causal variables, or both, because of the sample selection bias shown in Fig. 1. These spurious correlations might vary across environments. Hence, to make a stable prediction, one should guarantee that the prediction depends only on the causal variables.

First, we revisit key concepts and theorems related to d-separation and CI in a causal graph. Let G = {V, E} represent a causal directed acyclic graph (DAG) with nodes V and edges E, where a node denotes a variable and an edge represents the direct dependence or causal direction between two variables. In a DAG, V_i → V_j means that V_i is a cause of V_j and V_j is an effect of V_i.

Definition 1 (d-separation Pearl [2009]) In a DAG G, a path π is said to be d-separated by a set of nodes Z if and only if (i) π contains a chain V_i → V_k → V_j or a fork V_i ← V_k → V_j such that the middle node V_k is in Z, or (ii) π contains a collider V_i → V_k ← V_j such that the middle node V_k is not in Z and no descendant of V_k is in Z.

Figure 2: Causal paths between a known causal variable C and other variables: (a) the path between C and N_i; (b) the path between C and C_i. The dashed line between two variables means the causal link/path between them is unknown.

Definition 2 (Conditional Independence)
Two distinct variables V_i, V_j ∈ V are said to be conditionally independent given a subset of variables Z ⊆ V \ {V_i, V_j} (i.e., V_i ⊥⊥ V_j | Z) if and only if P(V_i, V_j | Z) = P(V_i | Z) P(V_j | Z). Otherwise, V_i and V_j are conditionally dependent given Z (i.e., V_i ⊥̸⊥ V_j | Z).

The connection between d-separation and CI is established through the following lemma:

Lemma 1 (Probabilistic Implications of d-Separation Geiger et al. [1990], Pearl [2009]) If variables V_i and V_j are d-separated by Z in a DAG G, then V_i is independent of V_j conditional on Z in every distribution compatible with the DAG G. Conversely, if V_i and V_j are not d-separated by Z in a DAG G, then V_i and V_j are dependent conditional on Z in at least one distribution compatible with G.

Based on Lemma 1, we propose an elaborate but effective causal variable separation algorithm by combining the mechanisms of d-separation and causality with the following assumption.

Assumption 3
We have prior knowledge of one causal variable. Formally, we know C ∈ C.

Under Assumption 3, we have the following theorem to support precisely separating the sets of causal and non-causal variables. The set of causal variables can then be applied for stable prediction.
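Before stating the separation result formally, its mechanics can be checked numerically. The sketch below is only an illustration under assumptions we introduce here (a linear-Gaussian SCM with made-up coefficients, and partial correlation as a crude CI measure); it is not the paper's implementation. Conditioning on the collider Y couples the seed variable with another causal variable, while a non-causal variable stays conditionally independent of the seed.

```python
# Toy check of the seed-variable criterion on a linear-Gaussian SCM.
# All variable names and coefficients are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
seed_c = rng.normal(size=n)   # the known causal variable C (the seed)
c_i = rng.normal(size=n)      # another causal variable C_i
n_j = rng.normal(size=n)      # a non-causal variable N_j, independent of {C, C_i}
y = seed_c + c_i + 0.1 * rng.normal(size=n)   # Y = f(C) + noise; N_j plays no role

def partial_corr(x, z, cond):
    """Partial correlation of x and z given a single conditioning variable."""
    rxz = np.corrcoef(x, z)[0, 1]
    rxc = np.corrcoef(x, cond)[0, 1]
    rzc = np.corrcoef(z, cond)[0, 1]
    return (rxz - rxc * rzc) / np.sqrt((1 - rxc**2) * (1 - rzc**2))

# Conditioning on the collider Y induces dependence between C and C_i ...
print(abs(partial_corr(c_i, seed_c, y)))   # clearly non-zero
# ... while the non-causal N_j remains independent of C given Y.
print(abs(partial_corr(n_j, seed_c, y)))   # close to zero
```

This is exactly the asymmetry the theorem below exploits: a single conditional measure against the seed distinguishes the two classes of variables.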
Theorem 1
Given a causal variable C, observed variables X, and response variable Y, and under Assumptions 1, 2, and 3, if X_i ⊥̸⊥ C | Y, then X_i belongs to the set of causal variables; otherwise, it belongs to the set of non-causal variables.

Proof 1
Assumption 1 implies that the non-causal variables N are not direct causes of the response Y, while the causal variables C are its direct causes. Hence, in our causal DAG, there exists a directed edge from each causal variable C_i to the response Y, but N has no edges that point directly to Y. Assumption 2 guarantees that there is no causal link between any causal and non-causal variable, although the causal structure among the causal variables (or among the non-causal variables) might be very complex and unknown. Considering that the sample selection bias is generated based on the response Y and part of the non-causal variables N, the causal DAG in our problem is the one shown in Fig. 1.

From Fig. 1, the path between the seed causal variable C and any non-causal variable N_i can be represented as in Fig. 2a, where the causal links between N_i and N_j are unknown and could be very complex; N_j could even be exactly N_i if the sample selection is based on N_i and Y. By the definition of d-separation, C and N_i are d-separated by the variable Y. Hence, N_i ⊥⊥ C | Y for any N_i ∈ N, as guaranteed by Lemma 1.

On the other hand, the path between the seed causal variable C and any other causal variable C_i can be represented as in Fig. 2b, where the causal links between C and C_i are unknown. Similarly, by the definition of d-separation, the response variable Y is a collider and cannot d-separate C and C_i. Therefore, by Lemma 1, we have C_i ⊥̸⊥ C | Y for any C_i ∈ C.

Overall, we can separate causal and non-causal variables with a single CI test per variable: X_i belongs to the set of causal variables if X_i ⊥̸⊥ C | Y; otherwise, X_i is a non-causal variable.

Based on Theorem 1, we propose a causal variable separation algorithm using one single CI test per variable. The details of our algorithm are summarized in Algorithm 1. With the separated top-k causal variables, we can learn a predictive model for stable prediction.

Remark 1
From the proof of Theorem 1, we know that to identify whether a variable is causal or not, our algorithm only needs a single CI test of that variable and a known causal variable, conditional on the response variable, with no need to know the other causal variables or the common causes of the observed variables. We therefore conclude that (i) our algorithm is not affected by unobserved causal variables, although missing some causal variables would decrease the performance of the predictive model; and (ii) the causal sufficiency assumption is not necessary for our algorithm, but we need to assume the independence between causal and non-causal variables.

Algorithm 1 Top-k Causal Variable Separation/Selection
Require: X ∈ R^{n×p}, Y ∈ R^n, seed causal variable C, and parameter k
Ensure: top-k causal variables
  for each variable X_i ∈ X do
    Calculate the p-value of the CI test: pv_i = CI-test(X_i ⊥⊥ C | Y)
  end for
  X_ranking = Ranking(X, pv)    ▷ Rank X_i ∈ X by their p-values pv_i in ascending order
  return the top-k ranked variables in X_ranking

Complexity Analysis.
Note that our algorithm requires only a single CI test per variable. It therefore speeds up causal variable separation, scaling linearly with the number of variables: its complexity is O(cp), where p is the dimension of the observed variables and c is a constant denoting the complexity of a single CI test.

Discussions on assumptions.
Assumption 1 states that the underlying predictive mechanism is invariant across environments, which is the basic assumption for causal variable identification and stable/invariant prediction Peters et al. [2016], Kuang et al. [2018]. In Assumption 2, we assume the independence between causal variables and non-causal variables, which is critical to our method. In practice, one might adopt disentangled representations Thomas et al. [2018] or orthogonalization techniques Ahmed and Rao [2012] to make this assumption hold in a feature representation space; we leave this for future work. As for Assumption 3, we think it is reasonable and acceptable in real applications. For example, if we want to predict the crime rate, we may know that income is one causal variable. Moreover, one can identify a causal variable to use as the seed by estimating its causal effect Athey et al. [2018], Kuang et al. [2017].
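As a concrete sketch, Algorithm 1 fits in a few lines. The version below is our illustration, not the paper's code: it substitutes a Fisher-z partial-correlation test for the bnlearn/RCIT tests used in the experiments, so it assumes roughly linear-Gaussian data and a single conditioning variable Y; all function names are ours.

```python
# Sketch of Algorithm 1 (top-k causal variable selection) with a Fisher-z
# partial-correlation CI test standing in for the paper's bnlearn/RCIT tests.
import math
import numpy as np

def fisher_z_pvalue(x, c, y):
    """p-value of the CI test  x ⊥⊥ c | y  (one conditioning variable)."""
    rxc = np.corrcoef(x, c)[0, 1]
    rxy = np.corrcoef(x, y)[0, 1]
    rcy = np.corrcoef(c, y)[0, 1]
    r = (rxc - rxy * rcy) / math.sqrt((1 - rxy**2) * (1 - rcy**2))
    r = max(min(r, 0.999999), -0.999999)          # guard against |r| = 1
    z = 0.5 * math.log((1 + r) / (1 - r))         # Fisher z-transform
    stat = math.sqrt(len(x) - 1 - 3) * abs(z)     # sqrt(n - |Z| - 3), |Z| = 1
    return 2 * (1 - 0.5 * (1 + math.erf(stat / math.sqrt(2))))

def select_top_k(X, y, seed_idx, k):
    """Rank variables by ascending CI-test p-value against the seed; keep top-k."""
    seed = X[:, seed_idx]
    pvals = [fisher_z_pvalue(X[:, i], seed, y) for i in range(X.shape[1])]
    order = np.argsort(pvals)   # small p-value => dependent on seed given Y => causal
    return [int(i) for i in order if i != seed_idx][:k]

# Tiny usage example: variables 0 (the seed) and 1 are causal, 2 and 3 are not.
rng = np.random.default_rng(1)
n = 20_000
C = rng.normal(size=(n, 2))
N = rng.normal(size=(n, 2))
X = np.hstack([C, N])
y = C @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=n)
print(select_top_k(X, y, seed_idx=0, k=1))   # recovers the other causal variable
```

Ranking by ascending p-value mirrors the algorithm's output: dependent-given-Y variables (the causal ones, by Theorem 1) come first.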
We implement the following variable selection methods as baselines: (i) correlation based methods, including minimal Redundancy Maximal Relevance (mRMR) Peng et al. [2005], Random Forest (RF) Breiman [2001], and LASSO Tibshirani [1996]; they are affected by the spurious correlation between non-causal variables and the response variable, and may select non-causal variables for prediction; (ii) causation based methods, including PC-simple Bühlmann et al. [2010] and the causal effect (CE) estimator Athey et al. [2018], Kuang et al. [2017]; they need to assume that all causal variables are observed, and PC-simple additionally requires causal sufficiency and suffers from the curse of dimensionality; (iii) stable/invariant learning based methods, including invariant causal prediction (ICP) Peters et al. [2016] and the global balancing algorithm (GBA) Kuang et al. [2018, 2020]; ICP needs multiple training environments to reveal causation, and GBA requires a large amount of training data for global sample weighting.

In our algorithm, we employ the causal effect estimator Kuang et al. [2017] to identify one causal variable without Assumption 3. We then execute the CI test with the bnlearn method Scutari [2009], denoted as Our+BNCI, and with the RCIT method Strobl et al. [2019], denoted as Our+RCIT. We do not compare with a recent causal variable selection method Mastakouri et al. [2019], since it requires knowledge of a cause of each candidate causal variable, which is not applicable in our problem. Previous CI based methods either need to observe all causal variables or assume causal sufficiency, and moreover suffer from the curse of dimensionality; so we only compare with PC-simple, a prominent CI based method.

With the variables selected by each method, we make predictions to check their stability across unknown test data. To evaluate the performance of causal variable separation/selection, we use precision@k and the ranking index of the unstable non-causal variable as evaluation metrics. Precision@k refers to the proportion of the top-k selected variables that hit the true causal variable set, as follows:
Precision@k = |{x_i | x_i ∈ Ĉ, index(x_i) ≤ k, x_i ∈ C}| / k,   (1)

where Ĉ and C refer to the set of selected causal variables and the set of true causal variables, respectively, and index(x_i) is the ranking index of variable x_i among the selected variables Ĉ.

Similar to Kuang et al. [2018], we also adopt Average_Error and Stability_Error to measure the performance of stable prediction, with the following definitions:

Average_Error = (1/|E|) Σ_{e∈E} RMSE(D^e),
Stability_Error = sqrt( (1/(|E|−1)) Σ_{e∈E} (RMSE(D^e) − Average_Error)² ).   (2)

To generate the synthetic datasets, we consider sample size n = 2000 and dimension of observed variables p ∈ {10, 20, 40, 80}. We first generate the observed variables X = {C, N}. From Fig. 1 and Assumption 2, the causal variables C and the non-causal variables N should be independent, but the causal variables C may be dependent on each other, and the same holds for the non-causal variables N. Hence, we generate X = {C·,1, ..., C·,p_c, N·,1, ..., N·,p_n} with the help of auxiliary variables Z_C and Z_N with independent Gaussian distributions:

Z_{C,1}, ..., Z_{C,p_c+1} iid∼ N(0, 1),  C·,i = 0.8 · Z_{C,i} + 0.2 · Z_{C,i+1},  i = 1, 2, ..., p_c,   (3)
Z_{N,1}, ..., Z_{N,p_n+1} iid∼ N(0, 1),  N·,j = 0.8 · Z_{N,j} + 0.2 · Z_{N,j+1},  j = 1, 2, ..., p_n,   (4)

where the number of causal variables is p_c = 0.6 · p and the number of non-causal variables is p_n = 0.4 · p, and C·,i and N·,j represent the i-th and j-th variables in C and N, respectively. Then, we generate the response variable Y as:

Y = Σ_{i=1}^{p_c} α_i · C·,i + Σ_{j=1}^{p_c−2} β_j · C·,j C·,j+1 C·,j+2 + ε,   (5)

where α_i = (−1)^i · p_c / i, β_j = I(mod(j, 3) ≡ 0), and ε ∼ N(0, σ²) is Gaussian noise; I(·) is the indicator function and mod(x, y) returns the modulus after division of x by y.

From the generation of Y, we know that Y is affected only by the causal variables C, and is independent of the non-causal variables N.
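Equations (3)-(5) and the biased sample selection described below can be sketched as follows. This is our illustrative reconstruction, not the paper's generator: the 0.8/0.2 mixing weights, the α/β forms, the noise scale 0.3, and the standardization inside the selection step are assumptions made to obtain a runnable example.

```python
# Illustrative reconstruction of the synthetic design (Eqs. (3)-(5)) plus the
# r-biased sample selection; all constants here are assumptions.
import numpy as np

def make_environment(n=2000, p=20, r=2.5, seed=0):
    rng = np.random.default_rng(seed)
    p_c, p_n = int(0.6 * p), int(0.4 * p)
    # Eqs. (3)-(4): a moving average of iid Gaussians makes variables within
    # each block dependent while keeping C independent of N (Assumption 2).
    z_c = rng.normal(size=(n, p_c + 1))
    z_n = rng.normal(size=(n, p_n + 1))
    C = 0.8 * z_c[:, :-1] + 0.2 * z_c[:, 1:]
    N = 0.8 * z_n[:, :-1] + 0.2 * z_n[:, 1:]
    # Eq. (5): Y depends on the causal block only (noise scale 0.3 is assumed).
    alpha = np.array([(-1) ** i * p_c / i for i in range(1, p_c + 1)])
    y = C @ alpha + 0.3 * rng.normal(size=n)
    for j in range(3, p_c - 1, 3):   # sparse interactions where beta_j = I(mod(j,3)=0)
        y += C[:, j - 1] * C[:, j] * C[:, j + 1]
    # Biased selection on (Y, N_{.,p_n}): keep sample i with prob |r|^(-5 * D_i),
    # where D_i = |y_i - sign(r) * n_i|; we standardize both (our assumption, so
    # that a non-trivial fraction of the continuous samples survives).
    y_s = (y - y.mean()) / y.std()
    n_s = (N[:, -1] - N[:, -1].mean()) / N[:, -1].std()
    d = np.abs(y_s - np.sign(r) * n_s)
    keep = rng.random(n) < np.abs(r) ** (-5.0 * d)
    return np.hstack([C, N])[keep], y[keep]

X_tr, y_tr = make_environment(r=2.5, seed=0)    # training environment
X_te, y_te = make_environment(r=-2.5, seed=1)   # a distribution-shifted test environment
print(X_tr.shape[1], 0 < len(y_tr) < 2000)      # → 20 True
```

A positive r keeps samples where Y and N·,p_n agree (positive spurious correlation); a negative r keeps samples where they disagree, which is what makes models that rely on N·,p_n unstable at test time.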
In real applications, however, some non-causal variables might be spuriously correlated with Y because of sample selection bias, as shown in Fig. 1, and their correlation might vary across datasets. To check the stability of the algorithms under this practical setting, we generate a set of environments, each with a stable probability P(Y|C) but a distinct spurious correlation P(Y|N). For simplicity, we set only one non-causal variable, N·,p_n, as the unstable non-causal variable, and change its spurious correlation P(Y | N·,p_n) across environments. Specifically, we vary P(Y | N·,p_n) via biased sample selection with a bias rate r ∈ [−3, −1) ∪ (1, 3] based on N·,p_n and Y, as shown in Fig. 1. For each sample, we select it with probability P_r = |r|^{−5·D_i}, where D_i = |Y − sign(r) · N·,p_n|; if r > 0, sign(r) = 1, otherwise sign(r) = −1. Note that r > 1 corresponds to a positive spurious correlation between Y and N·,p_n, while r < −1 refers to a negative spurious correlation; the higher the value of |r|, the stronger the correlation between N·,p_n and Y. Different values of r correspond to different environments. All methods are trained with r_train = 2.5, but tested across environments with different r_test ∈ [−3, −1) ∪ (1, 3]. For simplicity, we use a linear model to evaluate the selected variables; other models can also be applied.

Figure 3: Prediction results (RMSE vs. r on test data) across unknown test data with n = 2000, p = 20. All methods are trained with r_train = 2.5 but tested across environments with different r_test ∈ [−3, −1) ∪ (1, 3].

Table 3: Results of Average_Error and Stability_Error with different dimension p.

Dimension    p=10              p=20              p=40              p=80
Metrics      Avg_Err Stab_Err  Avg_Err Stab_Err  Avg_Err Stab_Err  Avg_Err Stab_Err
mRMR         1.058   0.548     1.145   0.599     1.179   0.625     1.177   0.619
RF           0.994   0.506     1.110   0.576     1.174   0.622     1.177   0.619
LASSO        0.994   0.506     1.055   0.541     1.170   0.618     1.177   0.619
PC-simple    1.039   0.536     1.100   0.570     1.175   0.622     1.178   0.619

Table 1: Results of precision@k, where k equals the number of causal variables, i.e., k = 0.6 · p. The ICP method cannot be applied for selecting a variable subset of a specific size.

Dimension    p=10   p=20   p=40   p=80
mRMR         0.333  0.167  0.167  0.167
RF           0.667  0.500  0.250  0.333
LASSO        0.667  0.833  0.500  0.125
PC-simple    0.667  0.667  0.250  0.167
Table 2: Ranking index of the unstable non-causal variable N·,p_n, where "Y" denotes that the unstable non-causal variable is in the subset selected by the ICP method.

Dimension    p=10  p=20  p=40  p=80
mRMR         1     1     1     1
RF           1     1     1     1
LASSO        3     1     1     1
PC-simple    1     1     1     1
CE           4     2     2     3
ICP          Y     Y     Y     Y
GBA          4     2     3     1

We report the results on causal variable selection from two aspects: the ranking of the causal variables with precision@k in Tab. 1, and the ranking of the unstable non-causal variable in Tab. 2. The ranking of the causal variables determines the average error of prediction across environments (the closer precision@k is to 1, the better), while the ranking of the unstable non-causal variable determines the stability error of prediction across environments (the lower the ranking, the better). From Tab. 1 and 2, we conclude that: (i) Traditional correlation based variable selection methods, including mRMR, Random Forest, and LASSO, cannot precisely select the causal variables (lower precision@k) and rank the unstable non-causal variable highly. The main reason is that the spurious correlation is more significant than the causation under the sample selection bias. (ii) The performance of PC-simple is similar to the correlation based methods, since it is hard for PC-simple to find the optimal solution via naive random search; moreover, it relies on the causal sufficiency assumption and needs to observe all causal variables. (iii) The performance of the causation based methods, including CE and GBA, is better than that of the correlation based methods, with higher precision@k and a lower ranking of the unstable non-causal variable, since by revealing part of the causation among variables they can reduce the spurious correlations in the training data. But their performance is still worse than that of our methods in high dimensional settings, since they need enough training data for a better sample reweighting and, moreover, need to observe all causal variables.
(iv) Our methods achieve the best performance for the separation/selection of causal variables (highest precision@k) and for the ranking of the unstable non-causal variable.

Figure 4: Results of RMSE with the top-k selected variables on different environments: (a) RMSE on env. G1; (b) RMSE on env. G2; (c) RMSE on env. G3; (d) RMSE on env. G4. All algorithms are trained with data from environment G1, but tested on the data from each environment. When the test environment differs from the training one (e.g., G2, G3, and G4), our algorithm achieves better performance than the baselines.

Results on Stable Prediction.
With the variable ranking list from each algorithm, we select the top-k ranked variables to evaluate their performance on stable prediction across unknown test environments, where k is set to the number of causal variables (i.e., k = 0.6 · p). Fig. 3 and Tab. 3 show the experimental results on stable prediction. From Fig. 3, we find that (i) the performance of our methods is worse than the baselines when r_test > 1.5; this is because the spurious correlation between the unstable non-causal variable and the response variable is then highly similar between the training data (r_train = 2.5) and the test data, and that correlation can be exploited to improve predictive performance; (ii) the performance of our methods is much better than the baselines when r_test < −1.5, where the spurious correlation is totally different between the training and test data, leading to unstable prediction for the baselines; (iii) our methods achieve the most stable prediction across all test data, since our algorithm can precisely separate the causal variables and achieves the lowest ranking of the unstable non-causal variable, as reported in Tab. 1 and Tab. 2. To clearly demonstrate the advantages of our algorithm on stable prediction, we report detailed results under different synthetic settings in Tab. 3. From the results, we conclude that our algorithm can make stable predictions across unknown environments via causal variable separation.

To evaluate the performance of our algorithm on real-world data, we apply it to a Parkinson's telemonitoring dataset, which has been widely used for the problem of domain generalization Muandet et al. [2013], Blanchard et al. [2017] and other regression tasks Tsanas et al. [2009]. This dataset consists of biomedical voice measurements from 42 patients with early-stage Parkinson's disease recruited for a six-month trial of a telemonitoring device for remote symptom progression monitoring.
For each patient, there are about 200 recordings, which were automatically recorded in the patient's home. The task is to predict the clinician's motor UPDRS scoring of Parkinson's disease symptoms from the patients' features, including their age, gender, test time, and many other measures.

Experimental Settings.
In our experiments, we set the motor UPDRS scoring as the response variable Y. To test the stability of all methods, we generate different environments by biased data separation based on different patients. Specifically, we separate the 42 patients into 4 groups: group 1 (G1) with recordings from 21 patients, and three other groups (G2, G3, and G4), each with recordings from 7 different patients, where different groups correspond to different environments. Considering the practical setting where a researcher has a single dataset and wishes to train a model that can then be applied to other environments, we trained all models with data from environment G1, but tested them on all 4 groups.

Experimental Results.
We report the experimental results of RMSE with the top-k ranked variables in Figure 4. Fig. 4a shows that the correlation based methods (LASSO, mRMR, and RF) outperform the causation based methods (GBA and our method); this is because the training and test data have a similar distribution in env. G1, hence the spurious correlation between the non-causal variables and the response variable brings positive power for prediction. Moreover, we find that the ICP method achieves good performance in env. G1, since it cannot differentiate the spurious correlation from only one training environment. Figs. 4b, 4c, and 4d demonstrate that the causation based methods are better than the correlation based methods; moreover, the prediction performance might decrease seriously as more selected variables are input, since some selected variables could be spuriously correlated with the response and unstable across environments. The dataset is available at https://archive.ics.uci.edu/ml/datasets/parkinsons+telemonitoring.

In this paper, we focus on the problem of stable prediction via leveraging a seed variable for causal variable separation. We argue that most traditional prediction methods and variable selection methods are correlation based, resulting in instability of prediction across unknown environments. By assuming that the causal variables and non-causal variables are independent, we propose a causal variable separation algorithm with a single CI test per variable, and provide a series of theorems to prove that our algorithm can precisely separate the causal variables. We also demonstrate that the precisely separated causal variables from our algorithm bring stable prediction across unknown test data. The experimental results on both synthetic and real-world datasets show that our algorithm outperforms the baselines for causal variable separation and stable prediction.
References
Nasir Ahmed and Kamisetty Ramamohan Rao.
Orthogonal transforms for digital signal processing. Springer Science & Business Media, 2012.

Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623, 2018.

Elias Bareinboim and Judea Pearl. Controlling selection bias in causal inference. In Artificial Intelligence and Statistics, pages 100–108, 2012.

Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, and Clayton Scott. Domain generalization by marginal transfer learning. arXiv preprint arXiv:1711.07910, 2017.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

Peter Bühlmann, Markus Kalisch, and Marloes H Maathuis. Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika, 97(2):261–278, 2010.

Dan Geiger, Thomas Verma, and Judea Pearl. Identifying independence in Bayesian networks. Networks, 20(5):507–534, 1990.

Kun Kuang, Peng Cui, Bo Li, Meng Jiang, and Shiqiang Yang. Estimating treatment effect in the wild via differentiated confounder balancing. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 265–274. ACM, 2017.

Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Xiong, and Bo Li. Stable prediction across unknown environments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1617–1626. ACM, 2018.

Kun Kuang, Ruoxuan Xiong, Peng Cui, Susan Athey, and Bo Li. Stable prediction with model misspecification and agnostic distribution shift. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

Atalanti Mastakouri, Bernhard Schölkopf, and Dominik Janzing. Selecting causal brain features with a single conditional independence test per feature. In Advances in Neural Information Processing Systems, pages 12532–12543, 2019.

Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18, 2013.

Feiping Nie, Heng Huang, Xiao Cai, and Chris H Ding. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In Advances in Neural Information Processing Systems, pages 1813–1821, 2010.

Judea Pearl.
Causality . Cambridge university press, 2009.Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information: criteriaof max-dependency, max-relevance, and min-redundancy.
IEEE Transactions on Pattern Analysis& Machine Intelligence , (8):1226–1238, 2005.Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariantprediction: identification and confidence intervals.
Journal of the Royal Statistical Society: SeriesB (Statistical Methodology) , 78(5):947–1012, 2016.Joseph Ramsey, Jiji Zhang, and Peter L Spirtes. Adjacency-faithfulness and conservative causalinference. arXiv preprint arXiv:1206.6843 , 2012.Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models forcausal transfer learning.
The Journal of Machine Learning Research , 19(1):1309–1342, 2018.Marco Scutari. Learning bayesian networks with the bnlearn r package. arXiv preprintarXiv:0908.3817 , 2009.Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman.
Causation, prediction,and search . MIT press, 2000.Eric V Strobl, Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional in-dependence tests for fast non-parametric causal discovery.
Journal of Causal Inference , 7(1),2019.Valentin Thomas, Emmanuel Bengio, William Fedus, Jules Pondard, Philippe Beaudoin, HugoLarochelle, Joelle Pineau, Doina Precup, and Yoshua Bengio. Disentangling the independentlycontrollable factors of variation by interacting with the world. arXiv preprint arXiv:1802.09484 ,2018.Robert Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal StatisticalSociety: Series B (Methodological) , 58(1):267–288, 1996.Athanasios Tsanas, Max A Little, Patrick E McSharry, and Lorraine O Ramig. Accurate telemon-itoring of parkinson’s disease progression by noninvasive speech tests.
IEEE transactions onBiomedical Engineering , 57(4):884–893, 2009.Kui Yu, Xianjie Guo, Lin Liu, Jiuyong Li, Hao Wang, Zhaolong Ling, and Xindong Wu. Causality-based feature selection: Methods and evaluations. arXiv preprint arXiv:1911.07147arXiv preprint arXiv:1911.07147