Genome-wide Causation Studies of Complex Diseases
Rong Jiao, Xiangning Chen, Eric Boerwinkle & Momiao Xiong
Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas, USA; Nevada Institute of Personalized Medicine, University of Nevada, Las Vegas, Nevada, USA; Epidemiology, Human Genetics & Environmental Sciences, School of Public Health, University of Texas Health Science Center at Houston, Houston, Texas, USA
Key words: Causal inference, GWAS, GWCS, additive noise models, linkage disequilibrium, prediction

* Address for correspondence and reprints: Dr. Momiao Xiong, Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, P.O. Box 20186, Houston, Texas 77225. Phone: 713-500-9894; Fax: 713-500-0900; E-mail: [email protected].
ABSTRACT
Despite significant progress in dissecting the genetic architecture of complex diseases by genome-wide association studies (GWAS), the signals identified by association analysis may not have specific pathological relevance to diseases, so a large fraction of disease-causing genetic variants remains hidden. Association is used to measure dependence between two variables or two sets of variables. Genome-wide association studies test association between a disease and SNPs (or other genetic variants) across the genome. Association analysis may detect superficial patterns between disease and genetic variants, and association signals provide limited information on the causal mechanism of diseases. The use of association analysis as the major analytical platform for genetic studies of complex diseases is a key issue that hampers discovery of disease mechanisms and calls into question the ability of GWAS to identify loci underlying diseases. It is time to move beyond association analysis toward techniques that enable discovery of the underlying causal genetic structures of complex diseases. To achieve this, we propose the concept of genome-wide causation studies (GWCS) as an alternative to GWAS and develop additive noise models (ANMs) for genetic causation analysis. Type I error rates and power of the ANMs to test for causation are presented, and we conduct a GWCS of schizophrenia. Both simulation and real data analysis show that the proportion of overlapping association and causation signals is small. Thus, we hope that our analysis will stimulate discussion of GWAS and GWCS.
INTRODUCTION
Although significant progress has been made in dissecting the genetic architecture of complex diseases by GWAS, the overall contribution of the newly identified genetic variants is small and a large fraction of causal genetic variants remains hidden.
Association measures dependence between two variables or two sets of variables in the data, and these dependencies are used for prediction; it does not deal with causal problems (Altman and Krzywinski 2015; Lopez-Paz 2016). Association analysis may detect superficial patterns between disease and genetic variants, and its signals provide limited information on the causal mechanism of diseases (Kahrilas and Kahrilas 2019). Association analysis has been the major paradigm of genetic analysis of complex diseases for almost a century, yet understanding the etiology and mechanism of complex diseases through association analysis remains elusive. Most genetic questions about the mechanism of disease are causal in nature. Causation analysis is essential to the genetic analysis of complex diseases, yet it has been ignored for a long time (Lopez-Paz 2016; Kreif and DiazOrdaz 2019). It is well recognized that association analysis is not a direct method to discover the causal mechanism of complex diseases. Many investigators think that "association is essential to causation" and hope that we can successfully shift from association to causation (Jones et al. 2017). A current paradigm for making the transition from association to causation is through omics analysis (Clyde 2017; Ongen et al. 2017). However, such approaches have two limitations. First, most current omics analyses still detect association signals. For example, eQTL analysis, which tests for association of a discrete variable (genetic variant) with a continuous variable (gene expression), is still association analysis. An observed association may not support inferring a causal relationship (Orho-Melander 2015; Lee et al. 2018). A recent study found that association signals tend to be spread across most of the genome (Boyle et al. 2017). The paradigm of GWAS with eQTL may therefore still fail to identify the causal paths from genetic variants to disease. Second, the lack of association does not necessarily imply the absence of a causal relationship (Callaway et al. 2017).
The set of causal loci, including causal QTLs, causal eQTLs and causal mQTLs, is not a subset of the association loci identified in QTL, eQTL and mQTL analyses, simply because those analyses are based on regression. A large proportion of causal loci may not be discovered by association analysis, and finding causal SNPs only by searching the set of associated SNPs may miss many causal SNPs. In summary, the use of association analysis as the major analytical platform for genetic studies of complex diseases is a key issue that hampers identification of causal SNPs and discovery of the causal mechanisms of the diseases. Distinguishing causation from association is an age-old problem. Methods for causation analysis, one of the most challenging problems in science and technology, need to be developed as an alternative to association analysis (Zenil et al. 2019). Without a proper causal analysis, fully detecting causal SNPs is in general not possible. Intuitively, causation implies that changes in one variable will directly make changes in the other (Jaffe 2010). The essential distinction between association and causation relies on what the response will be if we intervene in the system (Lattimore and Ong 2018). There are two types of causal inference: interventional causal inference and observational causal inference (Kaplan 2018). Interventional causal inference learns the effect of taking an action directly via experiments, for example, randomized controlled trials. Interventional experiments are the gold standard for causal inference. However, since in human genetics we cannot change the genetic material of human subjects, experimental interventions are unethical and infeasible. Therefore, it is essential to develop statistical methods and algorithms to predict the outcomes of an intervention from passive observation (Spirtes et al. 2000; Lattimore and Ong 2018).
In this paper, we take an observational causal inference approach to identifying causal SNPs. Although we infer causation from observational data, our concept of causation is derived from intervention (Pearl 2019). In principle, causal inference is based on the interventional distribution. The do-calculus is an essential tool for causal inference that can simplify the expression for an interventional distribution: repeated applications of the do-calculus lead to an expression containing only observational quantities, which can then be used to estimate the interventional distribution from observational data (Lattimore and Ong 2018). The do-operation is therefore a key concept that makes observational causal inference feasible. Three essential frameworks have been developed for observational causal inference: causal Bayesian networks, structural equation models and counterfactuals (Rosenbaum and Rubin 1983; Pearl 2000; Peters et al. 2014; Peters et al. 2017; Xiong 2018; Lattimore and Ong 2018). Historically, causal Bayesian networks, structural equation models and counterfactuals developed relatively independently in different fields, but they can be unified using interventional queries with the do-calculus (Lattimore and Ong 2018). This allows methods and algorithms developed within one framework to be easily applied to another, allows predictions about the consequences of intervening upon (rather than merely observing) variables, and provides a method of evaluating counterfactual claims. Therefore, we will use the do-calculus as a unified framework for causal inference. Similar to GWAS, which investigates the dependence relationship between one SNP and disease at a time, GWCS investigates the causal relationship between one SNP and disease at a time, referred to as bivariate causal discovery. Traditional causal inference theory infers causal relationships among three or more variables and cannot be applied to bivariate causal discovery.
Only recently have the independence of cause and mechanism (ICM) principle and functional causal models, specifically additive noise models (ANMs) (Peters et al. 2017; Xiong 2018), been proposed. ICM and discrete ANMs can be applied to GWCS. For a long time, many genetic epidemiologists have held the view that causal inference from observational data is impossible, and views and concepts that misunderstand causation are widespread in genetic epidemiology. There is also a lack of algorithms for causal inference in genetics. The purpose of this paper is to rigorously define causation, clarify the concepts of causation and association, and develop effective causal models and algorithms that can be easily used to discover causal structure in genetic analysis. While there is increasing evidence that association signals provide limited information on the causes of disease, and some investigators call the future of GWAS into question (Callaway 2017), modern causal inference theory provides powerful tools for bivariate causal discovery. In the past two decades, causal theory has been well developed and is becoming an important component of artificial intelligence (AI). "Reasoning in causal terms is omnipresent, from fundamental physics to medicine, social sciences and economics, and in everyday life" (Barrett et al. 2019). It is urgent to develop concepts and theory showing that, under the right conditions and assumptions, the cause-effect relationship between two variables can be inferred from purely observational data. It is time to develop a new generation of genetic analysis that shifts the current paradigm from association analysis to causal inference. To make this shift feasible, we rigorously use the do-calculus to model interventions, define the concept of causation, unify counterfactuals, functional causal models and ICM, and investigate the connections and differences between association and causation. ANMs are easy-to-use causal models.
We use the ANM Y = f_Y(X) + N_Y, where Y represents the disease status, X represents the indicator variable for the genotype of a SNP, and N_Y represents a residual (noise) term, as a general framework to distinguish causal directions, and we develop a new ANM-based statistic to test causation of a SNP locus with the disease, which will be used for GWCS of complex diseases. Under the assumption of no confounders (causal models with confounders will be discussed elsewhere), we investigate the identifiability of the ANM-based statistics for bivariate (SNP and disease) causal discovery. Since an analytical form for the distribution of the causal test statistic is difficult to derive, permutation methods are used to compute the distribution of the causal test. To evaluate its performance for genetic causal analysis, we use large-scale simulations to calculate the type I error rates of the ANM-based causal test and to compute its power under various conditions. The proposed method is applied to the CATIE-MGS-SWD schizophrenia (SCZ) study dataset, with 8,421,111 common SNPs typed in 13,557 individuals, to perform a GWCS of SCZ. To further investigate the properties of the ANM-based causal test, we examine the prediction ability of causal SNPs and the impact of linkage disequilibrium (LD) on the causation analysis. Our purpose is to provide a detailed analysis of GWAS and GWCS as a response to comments about the ability of GWAS to identify disease-causing loci (Orho-Melander 2015).

Basic concepts of association and causation

In this section, we briefly introduce causal inference theory to make this section as self-contained as possible. We consider two variables x and y, with joint distribution denoted by P(x, y). Association between the two variables x and y is defined as dependence between them.
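As a small numerical illustration of association as dependence (the joint distribution below is a hypothetical assumption, not data from this study), dependence appears as a gap between conditional and marginal probabilities, symmetrically in both directions:

```python
# Hypothetical joint distribution P(x, y) over x in {0, 1} and y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

p_y1 = sum(p for (x, y), p in joint.items() if y == 1)  # marginal P(y = 1) = 0.4
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)  # marginal P(x = 0) = 0.5

p_y1_given_x0 = joint[(0, 1)] / p_x0  # P(y = 1 | x = 0) = 0.2, differs from P(y = 1)
p_x0_given_y1 = joint[(0, 1)] / p_y1  # P(x = 0 | y = 1) = 0.25, differs from P(x = 0)
```

Because P(y | x) differs from P(y), and equally P(x | y) differs from P(x), the two variables are dependent; but nothing in this computation distinguishes which variable, if either, causes the other.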
Statistically, association between x and y is defined as

P(y | x) ≠ P(y). (1)

Statistical dependence is a symmetric concept: if the variable x depends on the variable y, then the variable y also depends on the variable x. Classical machine learning and statistical methods, built on pattern recognition and association analysis, are insufficient for causal reasoning. The science of causal reasoning is developing in various disciplines, and different disciplines may use different definitions of causation. Four key approaches have emerged: structural equation models, causal Bayesian networks, counterfactuals and independence of cause and mechanism (ICM) (Lattimore and Ong 2018; Marsala 2015; Peters et al. 2017; Xiong 2018). The four schools of causality have recently been unified: intervention calculus (do-calculus) can be taken as a unifying language for causal inference.

Intervention calculus
The purpose of intervention calculus is to describe the mathematical conditions under which we can make causal inferences from observational data. Intuitively, causation is defined as the encoding of potential outcomes under intervention, and an intervention is a surgery on a mechanism. In other words, changes in one variable under intervention will affect the outcome of another variable and hence can be used to measure the effect of the intervention (action). We consider two variables X and Y. A causal model can be defined by intervention (action) as follows: if we do X (forcing the random variable X to take a specified value), then Y will be affected. Causation analysis investigates prediction of the effects of actions that perturb the observed system (Mooij et al. 2016). We use P(Y | do(X = x)) to denote the distribution of Y conditional on an intervention that sets X = x. Now "X causes Y" can be mathematically defined as

P(Y | do(X = x)) ≠ P(Y | do(X = x')) for some x, x' ∈ X. (2)

If X causes Y (X → Y), then in general we have

P(Y | X) = P(Y | do(X)) ≠ P(Y). (3)

However, P(X | do(Y)) = P(X) ≠ P(X | Y); conversely, if Y causes X (Y → X), then P(Y | do(X)) = P(Y) ≠ P(Y | X). Although the joint probability can always be factorized in terms of marginal and conditional distributions as P(X, Y) = P(X) P(Y | X) = P(Y) P(X | Y), if X causes Y (X → Y) we have the interventional factorization P(X, Y) = P(X) P(Y | do(X)); in this case (X → Y), we do not have P(X, Y) = P(Y) P(X | do(Y)), i.e., P(X, Y) ≠ P(Y) P(X | do(Y)): the joint probability of X and Y cannot be factorized in terms of the marginal distribution P(Y) and the interventional distribution P(X | do(Y)) unless X and Y are independent.
For the genetic problem, Y represents a disease status and X represents a genotype. The action do(X) means that the genotype X is changed (for human subjects this is impossible, but for animals it can be done by genome editing). Intervention calculus implies that if X causes disease, then P(Y | do(X)) = P(Y | X); otherwise, if X is not a disease locus, then P(Y | do(X)) = P(Y) ≠ P(Y | X). The do-calculus can also be expressed via E[Y | do(X)]; if the effect variable Y is binary, then E[Y | do(X)] = P(Y = 1 | do(X)). The various relationships between marginal, conditional and interventional distributions of X and Y under causation and association are summarized in Figure 1. Figure 1(d) clearly demonstrates the difference between association and causation: although the temperature in the room and the thermometer reading are associated, the temperature causes changes in the thermometer, while a change in the thermometer cannot change the temperature in the room, i.e., P(temperature | do(thermometer)) = P(temperature). In summary, association is studied through the observed conditional distribution, while causation is investigated through the interventional distribution, where the causal effect is determined by the effect of a hypothetical manipulation of an input on an output. In other words, association is investigated by seeing and causation by doing. To illustrate the difference between seeing and doing, we present the following example.

Example 1
Consider the structural assignments N_X ~ N(0, 1), X ← N_X, Y ← X + N_X². From the observational data generated by this model we can estimate E[Y | X = x] = x + x², since observing X = x reveals N_X = x. Next we perform the intervention do(X = x). The intervened generative model is N_X ~ N(0, 1), X ← x, Y ← X + N_X², from which we obtain E[Y | do(X = x)] = x + E[N_X²] = x + 1. The two expectations differ whenever x ≠ 1. This clearly demonstrates that seeing and doing are quite different.
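A minimal Monte Carlo sketch of Example 1, assuming the generative model N_X ~ N(0, 1), X ← N_X, Y ← X + N_X², evaluated at x = 2 where the observational and interventional expectations separate clearly:

```python
import random

random.seed(0)

# "Seeing": conditioning on X = 2 fixes N_X = 2 (since X <- N_X is deterministic),
# so E[Y | X = 2] = 2 + 2**2 = 6 exactly.
e_seeing = 2 + 2 ** 2

# "Doing": do(X = 2) replaces the assignment X <- N_X by X <- 2, while N_X keeps
# its N(0, 1) distribution, so E[Y | do(X = 2)] = 2 + E[N_X**2] = 3.
samples = [2 + random.gauss(0.0, 1.0) ** 2 for _ in range(200_000)]
e_doing = sum(samples) / len(samples)
```

With the seed above, e_doing lands near 3 while e_seeing is exactly 6, showing that the observational and interventional expectations disagree.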
Counterfactual
Causality can also be defined in terms of potential outcomes, or counterfactuals, in the Neyman-Rubin causal model (Rosenbaum and Rubin 1983). We can read the interventional distribution P(Y | do(X = x)) as a counterfactual question: "what would have been the distribution of Y had X = x?" (Lopez-Paz 2016). Intuitively, counterfactuals assume the presence of an alternative world where everything is the same as in the factual world, except for the alternative (hypothetical) intervention and its effects. For simplicity, consider a binary treatment T ∈ {0, 1} or potential intervention, where T = 1 indicates the treatment (intervention) and T = 0 indicates the control (no intervention). Each individual i has two potential outcomes, {Y_i1, Y_i0}, one for each value of the treatment: Y_i1, the potential outcome if the individual received the treatment (T_i = 1), and Y_i0, the potential outcome if the individual received no treatment (T_i = 0). In other words, a potential outcome is the outcome that would be realized if the individual received a specific value of the treatment (intervention). A SNP has two alleles; for example, we can let T indicate carrying one of the two alleles. The potential outcome is 1 if the individual is affected and 0 if the individual is normal. Let W denote the set of contexts (covariates). The "fundamental problem of causal inference" (Holland 1986) is that we can observe only one of the potential outcomes, never both. The unobserved (missing) potential outcome is called the "counterfactual" outcome. Similar to the do-calculus, a counterfactual can be stated as: Y would change to y if T were t; that is, we imagine the value that would be taken if we performed the hypothetical intervention on T. Causal effects are defined as differences in counterfactual variables; they measure the difference between what would have happened if we did one thing versus what would have happened if, alternatively, we did something else (Lattimore and Ong 2018). A brief overview of counterfactual theory is given in Supplementary A.

Structural equation model and independence of cause and mechanism (ICM)
The third language of causation which we introduce is structural equation models (SEMs). SEMs can be used to model causal relationships among given variables, where each variable is expressed as a function of some other variables (its causes or treatments) as well as some noise (Nowzohour and Bühlmann 2016; Xiong 2018). The model consists of three essential components: (1) the causal structure, (2) the functional dependence among the causal and effect variables, and (3) the joint distribution of the noises. We assume (1) that there are no unobserved variables, and hence that the noise terms are independent, and (2) that the difference between the effect variable and its noise term is a deterministic function of the causal variables. In this paper, we focus on bivariate causal discovery. The SEM for two variables is defined as (Lattimore and Ong 2018)

X = f_x(N_x), Y = f_y(X, N_y), N_x ⫫ N_y, (4)

where N_x and N_y are noises, or exogenous random variables. If the functions f_x and f_y are of free form, the SEMs are called nonparametric structural equation models. The structural equation model (4) encodes the assumption that the outcome Y_i for an individual i is caused by the cause (treatment) X_i that the individual receives and by other factors N_y that are independent of the cause X. The SEMs describe the causal effects of performing real-world interventions or experiments on their variables. Although conditional independences can be used to make causal inferences from the data under study, conditional independence cannot be applied to causal analysis with only two variables (Lopez-Paz 2016). Consider observational causal inference for two random variables X and Y: we want to infer whether X → Y or Y → X. Unfortunately, the absence of a third random variable prevents us from measuring conditional independences.
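A quick simulation sketch of a bivariate SEM of the form (4); the particular functions f_x, f_y and coefficients below are illustrative assumptions. Intervening on the effect Y leaves the cause X untouched, while conditioning on Y does not:

```python
import random

random.seed(2)

def sample(do_y=None):
    # SEM: X = f_x(N_x), Y = f_y(X, N_y) with independent noises.
    # Hypothetical choices: f_x(n) = 2n and f_y(x, n) = x + 0.5n.
    n_x, n_y = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    x = 2.0 * n_x
    y = x + 0.5 * n_y if do_y is None else do_y  # do(Y = y) overrides f_y
    return x, y

obs = [sample() for _ in range(100_000)]
intv = [sample(do_y=3.0) for _ in range(100_000)]

# Doing: E[X | do(Y = 3)] stays at 0, since intervening on the effect
# does not propagate back to the cause.
mean_x_do = sum(x for x, _ in intv) / len(intv)

# Seeing: conditioning on Y near 3 shifts X upward, because a large Y is
# observational evidence of a large X.
near3 = [x for x, y in obs if abs(y - 3.0) < 0.25]
mean_x_seen = sum(near3) / len(near3)
```

mean_x_do stays near 0 while mean_x_seen is far from 0, matching P(X | do(Y)) = P(X) ≠ P(X | Y).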
To overcome this limitation, in the past decade, observational causal inference methods that are not based on conditional independence have been developed.
One of them is the widely used independence of cause and mechanism (ICM) principle. ICM, which assumes that causes and mechanisms are chosen independently by nature, is a recently proposed principle for causal reasoning and causal learning (Janzing and Schölkopf 2010; Shajarisales et al. 2015; Peters et al. 2017). ICM assumes that the mechanism that generates the effect from its cause contains no information about the cause. Assume that X is a cause and Y is an effect. The joint distribution P(X, Y) can be decomposed as

P(X, Y) = P(X) P(Y | do(X)). (5)

The conditional distribution P(Y | do(X)) is the mechanism that generates the effect Y from the cause X, and it is independent of the distribution P(X) of the cause. If X causes Y, then P(Y | do(X)) = P(Y | X), and the conditional distribution P(Y | X) contains no information about the marginal distribution P(X) of the cause. Therefore, the ICM postulates that the conditional distribution of each variable given its causes contains no information about its cause. SEMs, ICM and counterfactuals developed relatively independently in different fields. However, it can be shown that they are unified under some assumptions using interventional queries with the do-calculus (Supplementary B). This allows methods and algorithms developed within one framework to be easily applied to another, and provides the foundation for the interpretation of ANMs and the justification of GWCS.

Additive noise models
In the previous section, we showed that the SEM, ICM and counterfactual approaches to causal inference are equivalent in general. To facilitate the application of causal inference to the real world, we need simpler methods that implement these general approaches. In this paper, we propose to use discrete additive noise models (ANMs), which are based on the ICM principle, as a tool for GWCS. We assume that there is no confounding, no selection bias and no feedback between cause and effect (Mooij et al. 2016); methods for causality analysis with confounding will be presented elsewhere. Let Y be a binary variable indicating disease status (Y = 1, presence of disease; Y = 0, normal), and let X be a genotype indicator variable taking the values 0, 1 and 2 for the genotypes dd, Dd and DD, where D is the disease allele. Let m be an integer and write X = km + r, r = 0, 1, …, m − 1; X is called an m-cyclic random variable if X takes the remainder r as its value. Now we define discrete ANMs for genetic causation analysis. Let X and Y be 3-cyclic and 2-cyclic random variables, respectively. An ANM from X to Y is defined as (Peters et al. 2011)

Y = f_Y(X) + N_Y, X ⫫ N_Y, (6)

where f_Y is an integer function, N_Y is a 2-cyclic noise variable, and the addition is modulo 2. An ANM is called reversible if there is also an ANM

X = f_X(Y) + N_X, Y ⫫ N_X, (7)

where N_X is a 3-cyclic noise variable and the addition is modulo 3. In practice, there may be multiple potential causes X, Z_1, …, Z_K; however, only the one cause X is explicitly considered in model equation (6). The other causes Z_1, …, Z_K are unobserved, and their causal effects on Y are accounted for by the residual. We can then show that the model

Y = f_Y(X) + Ñ_Y, X ⫫ Ñ_Y, (8)

where the effects of Z_1, …, Z_K on Y are included in Ñ_Y, still holds if we assume that X ⫫ Z_1, …, X ⫫ Z_K.
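A small simulation sketch of the forward model (6); the genotype frequencies, the integer function f_Y and the noise rate below are illustrative assumptions. Under the true direction X → Y, the residual (Y − f_Y(X)) mod 2 recovers the noise and is therefore independent of the genotype:

```python
import random

random.seed(7)

f_y = {0: 0, 1: 1, 2: 1}  # hypothetical integer function f_Y: {0, 1, 2} -> {0, 1}
n = 30_000
xs = random.choices([0, 1, 2], weights=[0.49, 0.42, 0.09], k=n)  # genotypes dd/Dd/DD
noise = [1 if random.random() < 0.1 else 0 for _ in range(n)]    # 2-cyclic noise ~ Bern(0.1)
ys = [(f_y[x] + e) % 2 for x, e in zip(xs, noise)]               # ANM: Y = f_Y(X) + N_Y mod 2

# Residuals of the forward direction: P(resid = 1 | X = x) should be close to
# the noise rate 0.1 for every genotype, i.e. the residual carries no
# information about X.
resid = [(y - f_y[x]) % 2 for x, y in zip(xs, ys)]
rates = []
for g in (0, 1, 2):
    sub = [r for x, r in zip(xs, resid) if x == g]
    rates.append(sum(sub) / len(sub))
```

The same check applied in the backward direction (regressing X on Y) would generally leave residuals that remain dependent on Y, which is what makes the direction identifiable.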
Its extension to multiple dependent causes is more complicated and will be presented elsewhere. It is well known that the set of joint distributions P(X, Y) that admit an ANM in both the forward and backward directions is very small; in other words, the direction of an ANM is, in general, identifiable (Peters et al. 2011). Assumptions for identifiability of the direction of the ANMs are summarized in Supplementary C. In our case, X is an indicator variable for genotypes and Y is a binary variable for disease status. In Supplementary C, we show that, in general, reversibility is impossible and hence the direction of the ANM is identifiable.

Numerical algorithms to implement ANMs for genetic causal analysis
To implement the ANMs to identify causal SNPs, we use the numerical algorithm presented in Peters et al. (2011) to test the causal relationship between a SNP and disease. The algorithm is summarized as follows (Hu et al. 2018).
Algorithm to implement the discrete ANMs for genetic causal analysis:
Assume that qualitative trait data Y and the indicator variable X for the genotypes of a SNP are available.
1. To infer the direction X → Y, regress the trait Y on the genotype indicator variable X: Y = f(X) + N_Y. Calculate the residuals Ñ_Y = Y − f̂(X).
2. To infer the potential causal direction Y → X, fit the nonlinear integer regression X = g(Y) + N_X to the data. Calculate the residuals Ñ_X = X − ĝ(Y).
3. Test for independence between the residuals and the potential cause. If Ñ_Y and X are independent (Ñ_Y ⫫ X) and Ñ_X and Y are not independent, then X causes Y (X → Y). If both Ñ_Y and X, and Ñ_X and Y, are dependent, or if both pairs are independent, then no causal conclusion can be made.

Nonlinear integer regressions implementing the ANMs have two important features. First, in general we do not have closed functional forms for nonlinear integer functions; instead, we investigate all possible mappings (functions) from X to Y and evaluate the loss of each. Second, ordinary regression minimizes the sum of squared errors, but in the above algorithm, in addition to evaluating the loss function, we must also test independence between the regressor and the residuals. Therefore, Peters et al. (2011) suggested using a dependence measure (DM) between regressor and residuals as the loss function. We adopt their discrete regression with dependence-measure minimization for genetic causal analysis (Peters et al. 2011).

Discrete nonlinear regression with dependence measure minimization for genetic causal analysis
Step 1:
Calculate the sampling distribution P̂(X, Y).
Step 2: Initialization: f^(0)(x_i) = argmax_y P̂(X = x_i, Y = y), t = 0.
Step 3: Repeat: t ← t + 1.
Step 4: For i = 1, …, n do
Step 5: f^(t)(x_i) = argmin_y DM(X, Y − f^(t−1)_{x_i→y}(X)), where f^(t−1)_{x_i→y} denotes the function f^(t−1) with its value at x_i replaced by y; end for.
Step 6: Until ||f^(t) − f^(t−1)|| < ε, or (Ñ_Y = Y − f^(t)(X)) ⫫ X, or t = T, where ε and T are pre-specified.

A χ² test statistic will be used as the dependence measure (DM). Specifically, we formulate a contingency table (Table 1). In the ANM equation (6), N_Y is assumed to be a 2-cyclic noise variable. Let n_1 and n_2 be the numbers of individuals with N_Y = 0 and N_Y = 1, respectively, and let n = n_1 + n_2. Consider the three genotypes dd, Dd and DD. For N_Y = 0, let m_11, m_12 and m_13 be the numbers of individuals with genotypes dd, Dd and DD, respectively; for N_Y = 1, let m_21, m_22 and m_23 be the corresponding numbers. Define the marginal frequencies as shown in Table 1. Then we obtain E[m_1j] = n_1(m_1j + m_2j)/n and E[m_2j] = n_2(m_1j + m_2j)/n, j = 1, 2, 3, and the test statistic for testing independence is defined as

DM = Σ_{j=1}^{3} [ (m_1j − E[m_1j])²/E[m_1j] + (m_2j − E[m_2j])²/E[m_2j] ]. (9)

Under the null hypothesis of independence, the test statistic DM is asymptotically distributed as a central χ² distribution with 2 degrees of freedom. If the SNPs involve rare variants, the expected counts of many cells will be small, and Fisher's exact test should be used to test for independence.

The statement that there is no causal relationship between the SNP and the disease implies that neither causation X → Y nor causation Y → X holds. Let DM_{X→Y} and DM_{Y→X} be the χ² statistics for testing causations X → Y and Y → X, respectively. The null hypothesis for testing the causal relationship between two random variables X and Y is H_0: no causation between X and Y. The statistic for testing the causal relationship between X and Y is defined as

T_C = |DM_{X→Y} − DM_{Y→X}|. (10)

When T_C is large, either DM_{Y→X} > DM_{X→Y}, which implies that X causes Y, or DM_{X→Y} > DM_{Y→X}, which implies that Y causes X. When T_C is small, no causal decision can be made. Since DM_{X→Y} and DM_{Y→X} may be dependent, a closed, analytic expression for the distribution of T_C is not yet known (Bausch 2012). Although a computational algorithm to numerically calculate the distribution of T_C is available, in this paper we use a permutation test to calculate the P-value of the test T_C.

Distance Correlation as a Causation Measure
In previous sections, we introduced the basic principle for assessing causation X → Y: the distribution P(X) of the cause X is independent of the causal mechanism, that is, of the conditional distribution P(Y|X) of the effect Y given the cause X. The question now is how to assess their independence. The Pearson correlation coefficient ρ(X, Y), the widely used classical measure of dependence, measures linear dependence between two random variables X and Y; in the bivariate normal case, ρ(X, Y) = 0 is equivalent to independence between X and Y. If the distributions of X and Y are not normal, then ρ(X, Y) = 0 may not imply independence between X and Y. Recently, the distance correlation, which applies to all distributions with finite first moments, was proposed to measure dependence between random vectors, allowing for both linear and nonlinear dependence (Székely et al. 2007, 2009). Distance correlation extends the traditional Pearson correlation in two remarkable directions: (1) it extends the correlation between two random variables to the correlation between two sets of variables of arbitrary dimensions; (2) zero distance correlation indicates independence of two random vectors.

Consider two vectors of random variables: a p-dimensional vector X and a q-dimensional vector Y. Let P(x) and P(y) be the density functions of X and Y, respectively, and let P(x, y) be their joint density function. There are two ways to define independence between two vectors of variables: (i) by density functions and (ii) by characteristic functions. In other words, if X and Y are independent, then either (i) P(x, y) = P(x)P(y) or (ii) f_{X,Y}(t, s) = f_X(t) f_Y(s), where f_{X,Y}(t, s) = E[e^{i(t^T x + s^T y)}], f_X(t) = E[e^{i t^T x}] and f_Y(s) = E[e^{i s^T y}] are the characteristic functions of (X, Y), X and Y, respectively. Therefore, we can use either the distance ||P(x, y) − P(x)P(y)|| or the distance ||f_{X,Y}(t, s) − f_X(t) f_Y(s)|| to measure the dependence between the two vectors X and Y. Since a characteristic function f is complex-valued, its norm is defined by |f|² = f f̄. The distance covariance dCov(X, Y) between two random vectors X and Y, the distance variance dVar(X), and algorithms for their calculation are briefly introduced in Supplementary D. The squared distance correlation R²(X, Y) is defined as

R²(X, Y) = dCov²(X, Y) / √(dVar(X) dVar(Y)) if dVar(X) dVar(Y) > 0, and R²(X, Y) = 0 if dVar(X) dVar(Y) = 0. (11)

Now we propose to use the distance correlation to measure the dependence between the distributions P(X) and P(Y|X). Assume that X takes m different values and Y takes m̃ different values. Define the two vectors P(X) = [P(x_1), …, P(x_m)]^T and P(Y|X) = [P(y_1|x_1), …, P(y_m̃|x_1), …, P(y_1|x_m), …, P(y_m̃|x_m)]^T. Measures for the causal directions X → Y and Y → X are defined as

C_{X→Y} = 1 − R(P(X), P(Y|X)) (12)

and

C_{Y→X} = 1 − R(P(Y), P(X|Y)), (13)

respectively. A measure quantifying the strength of the causal relationship between X (genetic variant) and Y (disease phenotype) can then be defined by

C_T = |C_{X→Y} − C_{Y→X}|. (14)

Using Theorem 3 of Székely et al. (2009), we can show that 0 ≤ C_{X→Y} ≤ 1 and that C_{X→Y} = 1 if and only if X → Y. Similarly, we can show that 0 ≤ C_{Y→X} ≤ 1 and that C_{Y→X} = 1 if and only if Y → X. Consider the linear transformations X₁ = a₁ + b C₁ X and Y₁ = a₂ + b C₂ Y, where C₁ and C₂ are orthonormal matrices; then C_{X₁→Y₁} = C_{X→Y}. In other words, linear transformation of the random variables does not change the strength of causality between the two variables X and Y.

RESULTS

Type 1 Error of Statistics for Testing Causation
To examine the validity of the statistic T_C for testing the causal relationship between a common SNP and disease, we performed a series of simulation studies to compare its empirical levels with the nominal ones. We considered two scenarios: (1) no causation in the absence of association and (2) no causation in the presence of association. We selected the top 100 common SNPs (MAF between 0.19 and 0.49) from the gene TEKT4P2 on chromosome 21 in the 1000 Genomes Project. In scenario (1), a binary trait Y was randomly generated, independent of the indicator variable X for SNP genotypes. In scenario (2), we first randomly generated X and Y, and then selected the associated pairs of data as our dataset (X, Y). We generated data with 100,000 subjects by resampling from the 99-individual CEU population in the 1000 Genomes Project. The number of permutations was 1,000, and the number of test replications was 1,000. The numbers of subjects sampled from the generated population for the type 1 error rate calculations were 500, 1,000, 2,000 and 5,000, respectively. We first consider scenario 1. Table 2 summarizes the average type I error rates of the test statistics for testing the causal relationship between SNP and disease, in the absence of association between SNP and disease, over all 100 SNPs at the nominal levels α = 0.05 and α = 0.01. To verify that there was no association in the data, Table 3 summarizes the average type 1 error rates of the association test over the 100 SNPs. These tables show that, in the absence of association, the type I error rates of the test statistics for testing the causal relationships between SNPs and disease were not appreciably different from the nominal levels. Next we consider scenario 2.
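Both scenarios rely on the same permutation test for T_C. The following Python sketch is our own illustrative implementation, not the paper's software. It assumes one plausible reading of equations (12)-(14): the distance correlation is computed over the m genotype values, pairing each P(x_i) with the conditional row [P(y_1|x_i), …, P(y_m̃|x_i)].

```python
import numpy as np

def dcor(u, v):
    """Sample distance correlation (Székely et al. 2007) between
    u (n samples x p dims) and v (n samples x q dims)."""
    u = np.asarray(u, float).reshape(len(u), -1)
    v = np.asarray(v, float).reshape(len(v), -1)
    def centered(m):
        d = np.linalg.norm(m[:, None, :] - m[None, :, :], axis=2)
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    a, b = centered(u), centered(v)
    dcov2 = (a * b).mean()                      # squared distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()     # product of distance variances
    return 0.0 if dvar2 <= 0 else np.sqrt(max(dcov2, 0.0)) / dvar2 ** 0.25

def causation_measures(x, y):
    """C_{X->Y} = 1 - R(P(X), P(Y|X)) and C_{Y->X}, equations (12)-(13),
    estimated from samples of two discrete variables."""
    def c(a, b):
        va, vb = np.unique(a), np.unique(b)
        pa = np.array([(a == s).mean() for s in va])             # P(A = s)
        pba = np.array([[(b[a == s] == t).mean() for t in vb]
                        for s in va])                            # rows: P(B | A = s)
        return 1.0 - dcor(pa, pba)
    return c(x, y), c(y, x)

def tc_pvalue(x, y, n_perm=1000, seed=0):
    """Permutation P-value for T_C = |C_{X->Y} - C_{Y->X}|, equation (10)."""
    rng = np.random.default_rng(seed)
    cxy, cyx = causation_measures(x, y)
    t_obs = abs(cxy - cyx)
    exceed = sum(abs(np.subtract(*causation_measures(x, rng.permutation(y))))
                 >= t_obs for _ in range(n_perm))
    return t_obs, (exceed + 1) / (n_perm + 1)
```

Under the null of no causation the permuted traits are exchangeable with the observed one, so the fraction of permuted T_C values at least as large as the observed T_C estimates the P-value.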
Table 4 presents the average type I error rates of the test statistics for testing the causal relationship between SNP and disease, in the presence of association between SNP and disease, over all 100 SNPs at the nominal levels α = 0.05 and α = 0.01. Again, these results demonstrate that, even in the presence of association, the type I error rates of the test statistics for testing the causal relationships between SNPs and disease were not appreciably different from the nominal levels.

Power Evaluation
To evaluate the performance of the ANMs for assessing the causal relationship between SNP and disease, simulated data were used to estimate their power to detect a true causation. First, we investigated power as a function of sample size with a fixed causal measure parameter. The data were generated by the following cyclic model:

Y = f(X) + ε_Y, ε_Y ⫫ X, (15)

where Y ∈ {0, 1} is a binary trait generated by model (15) with addition taken modulo 2, X ∈ {0, 1, 2} is the indicator variable for genotypes of a SNP selected from the 1000 Genomes Project (minor allele frequency 0.1), f is the integer function f(0) = 0, f(1) = 0, f(2) = 1, and ε_Y ∈ {0, 1} is a noise variable distributed as a binomial with probability parameter p. We used model (15) to generate a population of 100,000 individuals with X and Y. Sets of 500, 1,000, 2,000, 5,000, 10,000 and 20,000 individuals were sampled from this population. A total of 1,000 simulations were repeated for the power calculation. Three factors affect the power of the ANMs for testing causation: the probability parameter p of the binomial distribution, the significance level α and the sample size. We first fixed the parameter p and the significance level α. Figure 2 plots the power curves as a function of sample size under four scenarios: (1) p = 0.2, α = 0.05; (2) p = 0.2, α = 0.01; (3) p = 0.4, α = 0.05 and (4) p = 0.4, α = 0.01. We observed from Figure 2 that for p = 0.2, α = 0.01 we could reach 81% power even when the sample size was only 500, and for p = 0.4, α = 0.01 we could still reach 80% power when the sample size was 5,000. We then fixed the sample size n and the significance level α. Figures 3 and 4 show the power curves of the causation test as a function of the parameter p at significance levels α = 0.05 and α = 0.01, respectively.
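The data-generating process of model (15) can be sketched as follows. The Hardy-Weinberg genotype sampling and the modulo-2 ("cyclic") addition are our assumptions about details the text leaves implicit.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_cyclic_anm(n, maf=0.1, p=0.2):
    """Generate (x, y) under one reading of the discrete cyclic ANM of
    equation (15): y = (f(x) + eps) mod 2, with f(0) = f(1) = 0, f(2) = 1
    and eps ~ Bernoulli(p).  Genotypes are drawn as Binomial(2, maf),
    i.e. assuming Hardy-Weinberg proportions (our assumption)."""
    x = rng.binomial(2, maf, size=n)   # SNP genotype indicator 0/1/2
    f = np.array([0, 0, 1])            # the integer function f
    eps = rng.binomial(1, p, size=n)   # binary noise, independent of x
    y = (f[x] + eps) % 2               # cyclic (modulo-2) addition
    return x, y

x, y = simulate_cyclic_anm(100_000)
# Empirical P(Y = 1 | X = g): about p for g = 0, 1 and about 1 - p for g = 2.
for g in (0, 1, 2):
    print(g, round((y[x == g] == 1).mean(), 3))
```

As p approaches 0.5 the conditional distributions P(Y|X = g) become nearly uniform for every genotype, which is exactly the regime in which the power results below degrade.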
We observed that as the parameter p increased, the power of the causal tests decreased. Indeed, the parameter p determines the value of the residual ε_Y, which in turn influences the causality measure. When p was small, the values of the response variable Y were mainly determined by the cause X. As p increased, the impact of the noise ε_Y on Y increased, hence the causality measure decreased and, in turn, the power of the causal tests decreased. Finally, when p = 0.5, the noise ε_Y produced the values 1 and 0 with equal probability, Y was mainly determined by the noise ε_Y, and the ANMs had almost no power to detect causation.

Application to Real Data Example

GWCS of Schizophrenia
To further evaluate its performance, the ANMs for testing causation were applied to the CATIE-MGS-SWD schizophrenia (SCZ) study dataset with 8,421,111 common SNPs typed in 13,557 individuals. The same association test was used in both GWAS and GWCS. A Manhattan plot of GWAS and GWCS is shown in Figure 5. For viewing clarity, the Manhattan plot shows only the P-values of causal analysis (in green) and association analysis (in black and grey) for SNPs with P-values < 10⁻⁵. We observed that the associated SNPs were quite uniformly distributed across the genome, whereas the causal SNPs concentrated in only some genome regions. This may indicate that the causal SNPs contain more information than the associated SNPs. Owing to the computational time required by the permutations, the P-value threshold for declaring significant causation was 10⁻⁶. In total, 245 SNPs in 29 genes showed significant causation with SCZ. The results are summarized in Supplemental Table 1, where the P-values of both the causation and association tests are listed. The top 15 causal SNPs are listed in Table 5. Among the 245 causal SNPs, 62 can be confirmed from the literature, and four of them lie within the typical 108 schizophrenia-associated genetic loci (Schizophrenia Working Group 2014; Sullivan et al. 2007; Fatemi et al. 2011; Lei et al. 2013; Costas et al. 2013; Athanasiu et al. 2013; Misztak et al. 2018; Ren et al. 2011; Suzuki et al. 2003; Cho et al. 2015; Ide and Lewis 2010). We also conducted GWAS for this dataset. A total of 5,917 SNPs were associated with SCZ at the significance level of 10⁻⁶, and only 58 of them showed causation. These results exhibit several remarkable features. First, some SNPs showed both significant causation and association. For example, four SNPs, rs1324544, rs2829725, rs9931378 and rs12057989, showed both strong causation and association (Table 5).
Second, the number of causal SNPs was much smaller than the number of associated SNPs. Third, highly significantly associated SNPs may show no significant causation. Fourth, SNPs with strong causation signals may not demonstrate association. For example, SNP rs12739344 in the gene AKT3 showed strong causation (P-value < 10⁻⁶) but did not reach the P-value threshold for association. It is well known that genetic variation in AKT3 is a top risk signal in schizophrenia, and network analysis identified that AKT3 contributes to four of the pathways involved in SCZ (Howell et al. 2017). SNP rs10986439 in the gene GABBR2 showed significant causation (P-value < 10⁻⁶) but no association with SCZ (P-value 0.000458). Genetic-imaging analysis showed that GABBR2 participates in neuron development, synapse organization and axon pathways, which could affect cognition in schizophrenia (Luo et al. 2018). Fifth, the proportion of SNPs showing both causation and association was small (36.3% of causal SNPs showed association, and only 0.98% of associated SNPs showed causation).

Disease Prediction
Genomic predictors and risk estimates for a large number of diseases can be constructed from SNPs. Traditional methods for developing genomic risk scores (GRS) utilize small numbers of SNPs, typically those reaching genome-wide significant association (Abraham and Inouye 2015). To evaluate the predictive ability of causal versus associated SNPs, we selected the top 245 causal SNPs (all P-values < 10⁻⁶) and the top 245 associated SNPs for SCZ risk prediction. Logistic regression and 10-fold cross-validation were used to calculate prediction accuracy. Table 6 lists the ten-fold cross-validated accuracy for prediction of SCZ. Table 6 shows that, using the same number of SNPs, the sets of SNPs selected by causal analysis had higher prediction accuracy than the sets selected by association analysis. Specifically, the prediction accuracy of the top 245 causal SNPs was about 3% higher than that of the top 245 SNPs selected by association analysis. This may imply that the causal SNPs contain more biological information than the associated SNPs.

Impact of Linkage Disequilibrium
In this section, we investigate the impact of linkage disequilibrium (LD) on the causal analysis. It is well known that LD has a large impact on association analysis. A theoretical analysis of the impact of LD on the causal effect is given in Supplementary E. Next we use simulations to investigate the impact of LD on the causation analysis. Data for two markers, rs150012736 and rs376953511, were taken from the 1000 Genomes Project, in which the LD coefficient r between rs150012736 and rs376953511 was calculated as 0.5. We assumed that SNP1 was a causal SNP and made no assumption about whether or not SNP2 was causal. The trait values were generated by the discrete cyclic ANM

Y = f_d(X₁) + ε_Y, (16)

where f_d is a specified nonlinear integer function and ε_Y is a binomial variable. We fitted the ANMs to the data (Y, X₂), where X₂ is the indicator variable for genotypes of SNP2. The results of the causation and association tests are summarized in Tables S3 and S4, and in Tables 7 and 8. Tables S3 and S4 show that we could detect both association and causation between SNP1 and disease with high power when sample sizes were larger than 2,000. Table 7 shows that the type 1 error rate of the test for causation between SNP2 and disease was not very high and decreased as the sample size increased; in other words, we did not detect causation at SNP2. However, Table 8 shows that the association test detected association of SNP2 with disease with high power. The simulation results show that the impact of LD on the causal tests was much smaller than on the association tests. To further evaluate the impact of LD on the causation test, a real data analysis was conducted. From the results of the GWCS of SCZ, we selected SNP rs6578689, which had P-values < 10⁻⁶ and < 10⁻⁷ for the causation and association tests, respectively. Then, we selected 20 neighboring SNPs of the causal SNP rs6578689.
We tested their causation and association with SCZ. Table 9 summarizes the results of the causation and association tests. These results show that even neighboring SNPs with r > 0.44 demonstrated no causation with SCZ, yet showed strong associations with SCZ, with small P-values < 10⁻⁹. These real data results demonstrate that LD has a small impact on causation analysis but a large impact on association tests.

DISCUSSION
As an alternative to GWAS, the major goal of this paper is to propose the notion of GWCS and to address several important issues for GWCS. The standard approach to causal discovery is to use interventions or randomized experiments, and many genetic epidemiologists have long thought it impossible to detect causal SNPs from observational data. However, interventions or randomized experiments are unethical, time-consuming, expensive and in many cases infeasible. To address this critical barrier to GWCS, we focus on causal discovery methods developed for causal inference from observational data rather than from interventional or randomized experiments, and propose to use discrete ANMs as a major tool for GWCS. By large simulations and real data analysis, we demonstrate the feasibility and limitations of the proposed GWCS as a new paradigm of genetic analysis. Association measures dependence relationships, and association analysis can be done from observational data. Causal inference is inductive reasoning (Causal inference in AI, 2019); in other words, causal inference reasons from the observed part to the unobserved general. The goal of causal inference is to learn the response to taking an action, and it is usually carried out through interventions. However, as we pointed out before, it is infeasible to conduct intervention experiments in humans. Modern causal theory attempts to learn the outcome of an intervention from observed data. Whether causation can be inferred from observational data has been debated for more than a century. In this paper, we review the great progress that has been made in causal inference over the past several decades, and define causation as the effect of taking an action in some system, inferred from observational data in terms of interventions or counterfactuals (Lattimore and Ong 2018). We also review three emerging major approaches to bivariate causal discovery, "do" actions, counterfactuals and ICM, and show that these three approaches can be unified.
The ANMs, widely used algorithms that implement ICM, are explored for GWCS. In GWCS, we assume that there is no confounding and no selection bias; methods for causation analysis with confounders will be presented elsewhere. We thereby lay theoretical foundations for GWCS. The original ANMs are used to distinguish the cause-effect direction and do not provide a P-value for testing the causation of a SNP with disease. To overcome this limitation, we developed a test statistic and used permutations to calculate its P-value for testing the causation of a SNP with disease. This provides a practical approach to GWCS. Essential issues for performing GWCS in practice are the type 1 error rates and power of the test statistics and the feasibility of the computations. We showed that the type 1 error rates of the ANMs for testing causation, in both the presence and absence of association, did not deviate significantly from the nominal levels. In other words, large-scale simulation results demonstrated that the ANMs for causation analysis of genetic variants are valid. The power of the ANMs depends on the probability parameter p of the binomial distribution generating the noise ε_Y, the sample sizes and the significance levels. As we discussed in the text, the probability parameter p determines the strength of causation. We showed that even at significance level α = 0.01 and p = 0.4, when sample sizes were 5,000, the power of the ANMs was close to 80%. For sufficiently small p, with 500 samples the ANMs reached power greater than 90% at both α = 0.05 and α = 0.01. These results imply that the ANMs have high power to detect causation in many cases. Distinguishing causation from association is an age-old problem. Most classical causal inference theory focuses on inferring causal relationships among three or more variables.
Due to the lack of methods for bivariate causal discovery, very few GWCS, and very few significant causal genetic variants from GWCS, have been reported. In the past decade, rapid development in modern causal analysis theory has provided several efficient methods for bivariate causal discovery, including the ANMs. To promote the application of causal inference to genetic analysis, we applied the ANMs to a GWCS of SCZ, from which we made several important observations. First, the number of causal SNPs (245) was much smaller than the number of associated SNPs (5,917). The causal SNPs were mainly located on chromosomes 1, 4, 5, 6, 7, 8, 20, 11 and 12, and very few causal SNPs were located on other chromosomes; the associated SNPs, however, were located across the genome. The results of the GWCS of SCZ also challenge the "omnigenic" model, which assumes that "all genes affect every complex trait" (Greenwood 2018) and that most association signals, which tend to be spread across most of the genome, influence phenotype variation (Boyle et al. 2017). Most identified association signals may have nothing to do with causing phenotype variation. Second, the proportion of SNPs showing both causation and association was small (36.3% of causal SNPs showed association, and only 0.98% of associated SNPs showed causation). This implies that the majority of causal SNPs cannot be discovered by association analysis and that most associated SNPs are not involved in the mechanisms of disease. The results of the GWCS of SCZ strongly suggest that association analysis will miss the majority of causal SNPs, and that identifying and validating causal SNPs from the set of associated SNPs will be time-consuming and inefficient. Third, full genomic information and genomic risk prediction have enabled new insights into the etiology and genetic architecture of complex disease.
Although we cannot directly validate the causality of the SNPs identified by GWCS, evaluating the difference in disease risk prediction accuracy between the set of causal SNPs and the set of associated SNPs allows us to assess their relative biological relevance. The prediction accuracy of the top 245 causal SNPs was about 3% higher than that of the top 245 associated SNPs, which may suggest that the causal SNPs contain more biological information than the associated SNPs. Fourth, both simulation and real data analysis showed that LD has a strong impact on association analysis but, surprisingly, much less impact on causal analysis. It is well known that LD is a confounding factor for association analysis and often creates spurious associations. The presence of LD across the genome limits our ability to use association analysis to discover the mechanisms of disease. Given the limited impact of LD on causal analysis, we may expect GWCS to provide an alternative to association analysis for discovering the causal genetic structure of complex diseases. Although 62 of the 245 discovered causal SNPs can be confirmed from the literature, and four of them lie within the typical 108 schizophrenia-associated genetic loci (Schizophrenia Working Group 2014), the results are preliminary; functional studies of the causal SNPs should be pursued in the future. Causality is not only critical for understanding disease mechanisms but also particularly important for the development of efficient treatments. Much of the failure of previous drug development efforts is attributable to insufficient understanding of disease mechanisms. Whether we can infer causal relationships between genetic variants and disease from observational data has been debated for more than a century. Association and correlation analysis are the current paradigm of most genetic studies and have been used for more than a century.
Our study demonstrates that a large proportion of causal loci cannot be discovered by association analysis. Finding causal SNPs only by searching the set of associated SNPs may not be sufficient for unravelling the mechanisms of complex diseases. Causal analysis as an alternative to association analysis for genetic studies has never been systematically investigated. The main purpose of this paper is to stimulate discussion about causal versus association analysis, and both theoretical and practical research in genomic causal analysis. We hope that our results will increase confidence in applying causal inference to genetic analysis, that more intelligent methods for causal inference will be developed, and that more valid GWCS of complex diseases will be conducted.

DATA ACCESS
Software for implementing the proposed methods for GWCS can be downloaded from https://sph.uth.edu/research/centers/hgc/xiong/software.htm and Github ( https://github.com/jiaorong007?tab=repositories ) .
REFERENCES

Abraham G, Inouye M. 2015. Genomic risk prediction of complex human disease and its clinical application. Curr Opin Genet Dev.
Barrett J, Lorenz R, Oreshkov O. 2019. Quantum causal models. arXiv:1906.10726.
Bausch J. 2012. On the efficient calculation of a linear combination of chi-square random variables with an application in counting string vacua. arXiv:1208.2691.
Boyle EA, Li YI, Pritchard JK. 2017. An expanded view of complex traits: from polygenic to omnigenic. Cell.
Costas J, Suárez-Rama JJ, Carrera N, Paz E, Páramo M, Agra S, Brenlla J, Ramos-Ríos R, Arrojo M. 2013. Role of DISC1 interacting proteins in schizophrenia risk from genome-wide analysis of missense SNPs. Ann Hum Genet.
Holland PW. 1986. Statistics and causal inference. Journal of the American Statistical Association 81:945–960.
Howell KR, Floyd K, Law AJ. 2017. PKBγ/AKT3 loss-of-function causes learning and memory deficits and deregulation of AKT/mTORC2 signaling: relevance for schizophrenia. PLoS One.
Misztak P, Pałczyszyn-Trzewik P, Sowa-Kućma M. 2018. Histone deacetylases (HDACs) as therapeutic target for depressive disorders. Pharmacol Rep 70(2):398–408.
Mooij J, Peters J, Janzing D, Zscheischler J, Schölkopf B. 2016. Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research 17(32):1–102.
Nowzohour C, Bühlmann P. 2016. Score-based causal learning in additive noise models. Statistics 50(3):471–485.
Ongen H, Brown AA, Delaneau O, Panousis NI, Nica AC, GTEx Consortium, Dermitzakis ET. 2017. Estimating the causal tissues for complex traits and diseases. Nat Genet 49:1676–1683.
Orho-Melander M. 2015. Genetics of coronary heart disease: towards causal mechanisms, novel drug targets and more personalized prevention. J Intern Med.
Peters J, Janzing D, Schölkopf B. 2011. Causal inference on discrete data using additive noise models. IEEE Trans Pattern Anal Mach Intell 33:2436–2450.
Peters J, Janzing D, Schölkopf B. 2017. Elements of causal inference: foundations and learning algorithms. The MIT Press, Boston.
Peters J, Mooij J, Janzing D, Schölkopf B. 2014. Causal discovery with continuous additive noise models. Journal of Machine Learning Research 15:2009–2053.
Ren RJ, Wang LL, Fang R, Liu LH, Wang Y, Tang HD, Deng YL, Xu W, Wang G, Chen SD. 2011. The MTHFD1L gene rs11754661 marker is associated with susceptibility to Alzheimer's disease in the Chinese Han population. J Neurol Sci.
Rosenbaum PR, Rubin DB. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41–55.
Ross SM. 1985. Introduction to probability models. Third edition. Academic Press, London.
Schizophrenia Working Group of the Psychiatric Genomics Consortium. 2014. Biological insights from 108 schizophrenia-associated genetic loci. Nature.
Spirtes P, Glymour C, Scheines R. 2000. Constructing Bayesian networks models of gene expression networks from microarray data. In Proceedings of the Atlantic Symposium on Computational Biology.
Sullivan PF, Keefe RS, Lange LA, Lange EM, Stroup TS, Lieberman J, Maness PF. 2007. NCAM1 and neurocognition in schizophrenia. Biol Psychiatry 61(7):902–910.
Suzuki T, Iwata N, Kitamura Y, Kitajima T, Yamanouchi Y, Ikeda M, Nishiyama T, Kamatani N, Ozaki N. 2003. Association of a haplotype in the serotonin 5-HT4 receptor gene (HTR4) with Japanese schizophrenia. Am J Med Genet B Neuropsychiatr Genet.
Székely GJ, Rizzo ML, Bakirov NK. 2007. Measuring and testing dependence by correlation of distances. Ann Stat 35(6):2769–2794.
Székely GJ, Rizzo ML. 2009. Brownian distance covariance. Ann Appl Stat 3(4):1236–1265.
Zenil H, Kiani NA, Zea AA, Tegnér J. 2019. Causal deconvolution by algorithmic generative models. Nature Machine Intelligence.
Table 1. Genotype-by-phenotype table: counts of each genotype for Y = 0 and Y = 1, with row and column totals summing to n.

Table 2. Average type 1 error rates of the statistics for testing causal relationships between SNP and disease.

Nominal Level   n = 500   n = 1,000   n = 2,000   n = 5,000
0.05            0.044     0.046       0.048       0.051
0.01            0.005     0.006       0.007       0.009

Table 3. Type 1 error rates for the association test.

Nominal Level   n = 500   n = 1,000   n = 2,000   n = 5,000
0.05            0.05      0.05        0.049       0.049
0.01            0.01      0.01        0.01        0.01
Table 4. Average type 1 error rates of the statistics for testing causal relationships between SNP and disease in the presence of association.

Nominal Level   n = 500   n = 1,000   n = 2,000   n = 5,000
0.05            0.042     0.046       0.047       0.046
0.01            0.005     0.007       0.007       0.008

Table 5. P-values of the top 15 SNPs that had significant causal relationships with schizophrenia. Columns: RS number, chromosome, position, gene, related disease, causation P-value and association P-value (e.g., rs1324544 on chromosome 6, position 9181479).

Table 6. Ten-fold cross-validated accuracy and AUC for SCZ risk prediction using top causal SNPs and top associated SNPs.

Number of SNPs               7       8       9       10      11      12      13      14      15      245
Accuracy, causal SNPs        0.5511  0.5542  0.5542  0.5542  0.5542  0.5540  0.5534  0.5531  0.5521  0.5737
Accuracy, associated SNPs    0.5470  0.5457  0.5434  0.5423  0.5415  0.5410  0.5404  0.5401  0.5395  0.5430
AUC, associated SNPs         0.5204  0.5200  0.5191  0.5189  0.5178  0.5173  0.5168  0.5163  0.5158  0.5249

Table 7. Type I error rates of the causal test between SNP2 and disease.

Significance Level   n = 500   n = 1,000   n = 2,000   n = 5,000
0.05                 0.183     0.159       0.142       0.104
0.01                 0.105     0.118       0.105       0.093

Table 8. Power of the test for association between SNP2 and disease.

Significance Level   n = 500   n = 1,000   n = 2,000   n = 5,000
0.05                 0.918     0.979       0.992       0.994
0.01                 0.860     0.957       0.990       0.992

Table 9. P-values for causation and association tests of 20 neighboring SNPs of the causal SNP rs6578689 (chromosome 11). Columns: SNP, chromosome, causation and association P-values; neighbor SNP, position, r, causation and association P-values.

Figure 1. Several possible causal relationships between two observed variables X and Y: (a) association; (b) X causes Y; (c) Y causes X; (d) temperature change causes thermometer change.

Figure 2. Power curves of the ANMs for testing causation as a function of sample size, calculated under four scenarios: (1) p = 0.2, α = 0.05; (2) p = 0.2, α = 0.01; (3) p = 0.4, α = 0.05 and (4) p = 0.4, α = 0.01.

Figure 3.
Power curves of the ANMs for testing causation as a function of the parameter p of the binomial distribution, for sample sizes 500, 1,000, 5,000 and 10,000, assuming α = 0.05.

Figure 4. Power curves of the ANMs for testing causation as a function of the parameter p of the binomial distribution, for sample sizes 500, 1,000, 5,000 and 10,000, assuming α = 0.01.

Figure 5. A Manhattan plot of GWAS and GWCS.

Supplementary A. Counterfactuals for causal inference

This introduction focuses on how to use counterfactuals to investigate the causal effect. The average causal effect (ACE), or average treatment effect, is defined as

ACE = E[Y_{i1}] − E[Y_{i0}], (A1)

where E[·] is taken over the entire population (Elwert 2013). Since for each individual we can observe only one of Y_{i1} and Y_{i0}, we cannot estimate the ACE directly. The standard statistic to estimate the ACE is

τ = E[Y_{i1} | T_i = 1] − E[Y_{i0} | T_i = 0], (A2)

where the expectations are taken over the treatment group and the control group, respectively, not over the entire population. If the potential outcome is binary, then equations (A1) and (A2) can be rewritten as

ACE = P(Y_{i1} = 1) − P(Y_{i0} = 1) (A3)

and

τ = P(Y_{i1} = 1 | T_i = 1) − P(Y_{i0} = 1 | T_i = 0). (A4)

Since the quantity τ depends on the treatment assignment, it measures the association between the potential outcomes and the treatment assignment. Therefore, in general, the ACE is not equal to τ.
The sufficient conditions to make them equal are the following.

Condition I:

E[Y_{i1}] = E[Y_{i1} | T_i = 1] = E[Y_{i1} | T_i = 0], (A5)

or

P(Y_{i1} = 1) = P(Y_{i1} = 1 | T_i = 1) = P(Y_{i1} = 1 | T_i = 0), (A6)

which implies that the mean potential outcome (or the probability distribution) under treatment for those in the treatment group equals the mean potential outcome (or the probability distribution) under treatment for those in the control group.

Condition II:

E[Y_{i0}] = E[Y_{i0} | T_i = 1] = E[Y_{i0} | T_i = 0], (A7)

or

P(Y_{i0} = 1) = P(Y_{i0} = 1 | T_i = 1) = P(Y_{i0} = 1 | T_i = 0), (A8)

which implies that the mean potential outcome (or the probability distribution) under control for those in the treatment group equals that for those in the control group. Under Conditions I and II we obtain

τ = E[Y_{i1} | T_i = 1] − E[Y_{i0} | T_i = 0] = E[Y_{i1}] − E[Y_{i0}] = ACE, (A9)

or

τ = P(Y_{i1} = 1 | T_i = 1) − P(Y_{i0} = 1 | T_i = 0) = P(Y_{i1} = 1) − P(Y_{i0} = 1) = ACE. (A10)

In other words, Conditions I and II ensure that the association measure τ equals the average causal effect ACE. Conditions I and II assume that the average potential outcomes of people in the treatment group equal those of people in the control group; such equality of the association measure and the causal measure can be achieved by randomized treatment assignment. Randomized experiments, however, are often expensive, unethical or infeasible, so observational data must be used for causal inference. The assumption that ensures τ is an unbiased and consistent estimator of the ACE is the following ignorability condition:

(Y_{i1}, Y_{i0}) ⫫ T_i, (A11)

i.e., the potential outcomes must be jointly independent of treatment assignment.
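The gap between τ and the ACE when ignorability fails is easy to reproduce numerically. In the following sketch the covariate z, the outcome probabilities and the assignment rates are our own illustrative assumptions; because both potential outcomes are simulated, the ACE of equation (A3) and the naive contrast τ of equation (A4) can be computed directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical confounder: a binary covariate z drives both treatment
# uptake and the potential outcomes, violating ignorability (A11).
z = rng.binomial(1, 0.5, n)
y1 = rng.binomial(1, np.where(z == 1, 0.9, 0.5))  # potential outcome under treatment
y0 = rng.binomial(1, np.where(z == 1, 0.6, 0.2))  # potential outcome under control
t = rng.binomial(1, np.where(z == 1, 0.8, 0.2))   # confounded treatment assignment

ace = y1.mean() - y0.mean()                    # ACE, equation (A3): about 0.30
tau = y1[t == 1].mean() - y0[t == 0].mean()    # naive contrast, equation (A4): about 0.54

# Randomized assignment restores ignorability, and tau then matches the ACE.
t_rand = rng.binomial(1, 0.5, n)
tau_rand = y1[t_rand == 1].mean() - y0[t_rand == 0].mean()
print(round(ace, 2), round(tau, 2), round(tau_rand, 2))
```

Here τ overstates the causal effect because treated individuals disproportionately have z = 1, which raises both potential outcomes; randomizing the assignment removes that dependence.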
In observational studies, the ignorability assumption is, in general, difficult to satisfy. Therefore, we make further assumptions and extend ignorability to conditional ignorability:

(Y_1, Y_0) ⫫ T | Z, (A12)

where Z is a set of variables. Conditional ignorability in equation (A12) assumes that the potential outcomes Y_1 and Y_0 are jointly independent of the treatment assignment conditional on the groups defined by the values of Z.

Supplementary B Unification of the SEMs, ICM, counterfactuals and do-calculus methods for causal inference

In this supplementary, we briefly show that the SEM, ICM, counterfactual and do-calculus methods for causal inference with two random variables can be unified. Suppose that both X and Y are binary variables and consider the structural equation model (4): X = f_x(ε_x), Y = f_y(X, ε_y). Taking the action X = 1 implies that the equation X = f_x(ε_x) is replaced by X = 1. Setting X = 1 affects neither the function f_y nor the distribution of the noise ε_y. The interventional distribution is given by

P(Y = 1 | do(X = 1)) = Σ_{ε_y} P(ε_y) I{f_y(1, ε_y) = 1}. (B1)

Note that the conditional observational distribution of Y given X is

P(Y = 1 | X = 1) = Σ_{ε_y} Σ_{ε_x} P(ε_x | X = 1) P(ε_y | ε_x) I{f_y(1, ε_y) = 1}. (B2)

By the assumption ε_x ⫫ ε_y, we have

P(ε_y | ε_x) = P(ε_y). (B3)

Substituting equation (B3) into equation (B2) yields

P(Y = 1 | X = 1) = Σ_{ε_y} P(ε_y) I{f_y(1, ε_y) = 1}. (B4)

Combining equations (B1) and (B4), we obtain

P(Y = 1 | do(X = 1)) = P(Y = 1 | X = 1), (B5)

which shows that the interventional distribution P(Y = 1 | do(X = 1)) is equal to the observational distribution P(Y = 1 | X = 1). Similarly, we can prove

P(Y = 1 | do(X = 0)) = P(Y = 1 | X = 0).
(B6)

The causal effect of X on Y under the structural equation model is defined as

ACE_S = P(Y = 1 | do(X = 1)) − P(Y = 1 | do(X = 0)). (B7)

Equations (B5) and (B6) show that under the structural equation model (4) the interventional distribution is equal to the observational distribution. Under the ignorability assumption (Y_1, Y_0) ⫫ X, we obtain

P(Y_1 = 1) = P(Y_1 = 1 | X = 1) = P(Y = 1 | X = 1), (B8)

P(Y_0 = 1) = P(Y_0 = 1 | X = 0) = P(Y = 1 | X = 0). (B9)

Combining equations (B5), (B6), (B8) and (B9), we obtain

P(Y = 1 | do(X = 1)) = P(Y_1 = 1), (B10)

P(Y = 1 | do(X = 0)) = P(Y_0 = 1). (B11)

It follows from equations (A3), (B7), (B10) and (B11) that

ACE_S = P(Y = 1 | do(X = 1)) − P(Y = 1 | do(X = 0)) = P(Y_1 = 1) − P(Y_0 = 1) = ACE. (B12)

Equation (B12) shows that the causal effect under the SEMs is equal to the average causal effect under the counterfactual model with the ignorability assumption.

Next we discuss the equivalence between the ICM and the SEMs. Take the SEMs (4) as the transformation

x = f_x(ε_x), y = f_y(x, ε_y).

We need to show that the ICM for x → y implies ε_x ⫫ ε_y. The Jacobian matrix of this transformation is lower triangular,

J = [∂x/∂ε_x, 0; ∂y/∂ε_x, ∂y/∂ε_y].

Then, by the transformation theorem (Ross 1985), we obtain

P(x, y) = P(ε_x, ε_y) |∂ε_x/∂x| |∂ε_y/∂y|, (B13)

P(x) = P(ε_x) |∂ε_x/∂x|, (B14)

which implies that

P(y | x) = P(ε_x, ε_y) |∂ε_x/∂x| |∂ε_y/∂y| / (P(ε_x) |∂ε_x/∂x|) = (P(ε_x, ε_y) / P(ε_x)) |∂ε_y/∂y|. (B15)

The ICM states that the distributions P(x) and P(y | x) are independent, i.e., that P(y | x) contains no information about P(x).
Therefore, P(ε_x, ε_y) must equal P(ε_x)P(ε_y), i.e., ε_x ⫫ ε_y. Otherwise, from equation (B15), we know that P(y | x) involves both ε_x and ε_y, which implies that the distribution P(y | x) contains information about P(x), and the ICM will not hold. This shows that the ICM for x → y implies the structural equation model (4).

Supplementary C Identifiability of the direction of discrete ANMs

Conditions for identifiability were summarized in Theorem 4 of Peters et al. (2011). They assumed that

Y = f_y(X) + N_y, X ⫫ N_y, P(X = x) ≠ 0 and P(N_y = n) ≠ 0 for all x, n,

and considered three cases:

(1) Both the forward function f_y and the backward function are bijective. If the ANM X → Y is reversible, then X and N_y are uniformly distributed.

(2) f_y is bijective. Suppose that f_y(x) = f_y(x̃). If the ANM X → Y is reversible, then

P(N_y = y − f_y(x)) / P(N_y = y − f_y(x̃)) = P(X = x̃) / P(X = x), for all y;

in many cases, P(X = x) = P(X = x̃).

(3) The backward function is bijective. Suppose that f_y(x) = f_y(x̃). If the ANM X → Y is reversible, then

P(X = x) / P(X = x̃) = P(N_y = y − f_y(x̃)) / P(N_y = y − f_y(x)), for all y;

in many cases, P(X = x) = P(X = x̃).

In our case, X is an indicator variable for genotypes and Y is a binary variable for disease status. Therefore, in general, none of the three cases can occur, reversibility is impossible, and hence the direction of the ANMs is identifiable.

Supplementary D Distance covariance and correlation between two random vectors

The distance covariance dCov(X, Y) between two random vectors X ∈ R^p and Y ∈ R^q with finite first moments is defined through the weighted L2 distance between the joint characteristic function and the product of the marginal characteristic functions (Székely et al. 2007):

V²(X, Y) = ||f_{X,Y}(t, s) − f_X(t) f_Y(s)||² = (1/(c_p c_q)) ∫_{R^{p+q}} |f_{X,Y}(t, s) − f_X(t) f_Y(s)|² / (|t|_p^{1+p} |s|_q^{1+q}) dt ds, (D1)

where c_p = π^{(1+p)/2} / Γ((1+p)/2) and c_q = π^{(1+q)/2} / Γ((1+q)/2). Similarly, the distance variance dVar(X) is defined as

V²(X) = V²(X, X) = (1/c_p²) ∫_{R^{2p}} |f_{X,X}(t, s) − f_X(t) f_X(s)|² / (|t|_p^{1+p} |s|_p^{1+p}) dt ds. (D2)

The squared distance correlation R²(X, Y) is defined as

R²(X, Y) = V²(X, Y) / √(V²(X) V²(Y)) if V²(X) V²(Y) > 0, and R²(X, Y) = 0 if V²(X) V²(Y) = 0. (D3)

The distance covariance and correlation can easily be estimated as follows (Székely et al. 2007). Assume that n pairs (X_k, Y_k), k = 1, ..., n, are sampled. Calculate the Euclidean distances

a_{kl} = |X_k − X_l|_p, b_{kl} = |Y_k − Y_l|_q, k, l = 1, ..., n.

Define the row, column and grand means

ā_{k·} = (1/n) Σ_{l=1}^n a_{kl}, ā_{·l} = (1/n) Σ_{k=1}^n a_{kl}, ā_{··} = (1/n²) Σ_{k=1}^n Σ_{l=1}^n a_{kl},

and similarly b̄_{k·}, b̄_{·l} and b̄_{··}. Define the two matrices A = (A_{kl})_{n×n} and B = (B_{kl})_{n×n}, where

A_{kl} = a_{kl} − ā_{k·} − ā_{·l} + ā_{··}, B_{kl} = b_{kl} − b̄_{k·} − b̄_{·l} + b̄_{··}, k, l = 1, ..., n.

Finally, the sample distance covariance V_n(X, Y), variance V_n(X) and correlation R_n(X, Y) are defined by

V_n²(X, Y) = (1/n²) Σ_{k=1}^n Σ_{l=1}^n A_{kl} B_{kl}, (D4)

V_n²(X) = V_n²(X, X) = (1/n²) Σ_{k=1}^n Σ_{l=1}^n A_{kl}², V_n²(Y) = (1/n²) Σ_{k=1}^n Σ_{l=1}^n B_{kl}²,

R_n²(X, Y) = V_n²(X, Y) / √(V_n²(X) V_n²(Y)) if V_n²(X) V_n²(Y) > 0, and R_n²(X, Y) = 0 otherwise, (D5)

respectively.

Supplementary E Theoretical Analysis of the Impact of LD on the Causal Effects

For convenience of presentation, we first consider the true linear model for a quantitative trait (Xiong 2018):

Y = μ + Xα + ε_y, X ⫫ ε_y, (E1)

where X is an indicator variable for the genotype at the true causal locus and the distribution of ε_y is not normal. Suppose that X_M is an indicator variable for the genotype at a marker locus with marker allele frequencies P_M and Q_M, and let D_M be the LD measure between the marker and the true causal locus. Then we have the following linear regression model for the marker locus:

Y = μ + X_M α_M + ε_{My}. (E2)

It can then be shown (Xiong 2018) that

α_M ≈ (D_M / (P_M Q_M)) α. (E3)

Equation (E3) implies that, in the presence of LD, the marker locus still shows some association, with approximate genetic additive effect (D_M / (P_M Q_M)) α. Now we investigate the impact of LD on causal inference.
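The attenuation in equation (E3) can be checked numerically. The sketch below is our own illustration, not the paper's data: haplotypes for a biallelic causal locus and a marker are simulated from hypothetical allele frequencies P_A, P_M and LD coefficient D_M (so P(AM) = P_A P_M + D_M), genotypes use additive 0/1/2 coding, and the least-squares slope of the trait on the marker genotype is compared with D_M α / (P_M Q_M).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical haplotype model: causal allele A (frequency PA), marker
# allele M (frequency PM), LD coefficient D, so P(AM) = PA*PM + D, etc.
PA, PM, D = 0.3, 0.4, 0.08
hap_probs = np.array([PA * PM + D,               # haplotype A M
                      PA * (1 - PM) - D,         # haplotype A m
                      (1 - PA) * PM - D,         # haplotype a M
                      (1 - PA) * (1 - PM) + D])  # haplotype a m
assert (hap_probs >= 0).all() and abs(hap_probs.sum() - 1) < 1e-12

# Each subject carries two independent haplotypes; additive genotype coding.
h1 = rng.choice(4, size=n, p=hap_probs)
h2 = rng.choice(4, size=n, p=hap_probs)
X  = (h1 < 2).astype(int) + (h2 < 2).astype(int)            # copies of A
Xm = (h1 % 2 == 0).astype(int) + (h2 % 2 == 0).astype(int)  # copies of M

alpha = 0.5
Y = 1.0 + alpha * X + rng.normal(size=n)   # true model (E1)

# Least-squares slope of Y on the marker genotype, i.e. model (E2).
alpha_m = np.cov(Xm, Y)[0, 1] / np.var(Xm)
print(alpha_m, D / (PM * (1 - PM)) * alpha)  # (E3): both approx 0.167
```

With additive coding, cov(X_M, X) = 2D_M and var(X_M) = 2P_M Q_M, so the fitted marker slope converges to D_M α / (P_M Q_M), matching equation (E3).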
Substituting equation (E1) into equation (E2), we obtain

ε_{My} = ε_y + Xα − X_M α_M. (E4)

Define

Δ = Xα − X_M α_M ≈ (X − (D_M / (P_M Q_M)) X_M) α. (E5)

When Δ ≠ 0, the distance covariance dCov(X_M, ε_{My}) satisfies

dCov(X_M, ε_{My}) = dCov(X + (X_M − X), ε_y + Δ) ≤ dCov(X, ε_y) + dCov(X_M − X, Δ) = dCov(X_M − X, Δ), (E6)

where X and ε_y are independent by the ICM. Independence of X_M and ε_{My} would imply Δ = 0 (Székely and Rizzo 2009), i.e.,

X = (D_M / (P_M Q_M)) X_M. (E7)

Equation (E7) would indicate a deterministic causal relation between X_M and X. However, in general, SNPs do not have such causal relationships. Therefore, dCov(X_M, ε_{My}) ≠ 0 and X_M and ε_{My} are not independent, which implies that X_M does not cause Y.

Now we calculate the causal measure. Let C_{X→Y} = 1 − R(X, ε_y) be the causal measure of the causal SNP X. Then the causal measure of the marker X_M is given by

C_{X_M→Y} = C_{X→Y} − R(X_M − X, X − (D_M / (P_M Q_M)) X_M). (E8)

Since R(X_M − X, X − (D_M / (P_M Q_M)) X_M) ≥ 0, equation (E8) implies

C_{X→Y} ≥ C_{X_M→Y}. (E9)

The causation measure C_{X_M→Y} depends on the distance correlation between X_M − X and X − (D_M / (P_M Q_M)) X_M. For a qualitative trait, we can use a logistic function as the nonlinear link. After some algebraic operations, we have the model

Y = e^{Xα} / (1 + e^{Xα}) + ε_y (E10)

or

Y = h(Xα) + ε_y, (E11)

where h(·) is a nonlinear function. Equation (E11) can be approximated by

Y = h(0) + h′(0) Xα + ε_y. (E12)

Thus, the model (E11) is reduced to the model (E1). Using the same arguments as for the model (E1), we can define the causality measure for the marker X_M:

C_{X_M→Y} = C_{X→Y} − R(X_M − X, X − h′(0) (D_M / (P_M Q_M)) X_M). (E13)

For the discrete ANMs, we cannot find h′(0), and the causal measure for the marker may simply be written as

C_{X_M→Y} = C_{X→Y} − R(X_M − X, X − γ (D_M / (P_M Q_M)) X_M), (E14)

where γ is an appropriate constant.

Table S1. P-values of causation and association of 214 SNPs showing significant causal relationships with schizophrenia.

SNP         Chr   Gene   Related Disease   P-value (Causation)   P-value (Association)
rs1324544   6

Table S3. Power to detect association between SNP1 and disease.

Significance level   n = 500   n = 1000   n = 2000   n = 5000
0.05                 0.999     1          1          0.999
0.01                 0.992     0.992      0.993      0.992
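The sample quantities in equations (D4) and (D5) are straightforward to compute directly from the double-centered distance matrices. The following minimal sketch is our own helper (the function name dcov_dcor and the toy data are ours, not from the paper), using only NumPy.

```python
import numpy as np

def dcov_dcor(X, Y):
    """Sample distance covariance V_n^2 and squared distance correlation
    R_n^2, equations (D4)-(D5), following Szekely et al. (2007).
    X, Y: arrays of shape (n,), (n, p) or (n, q)."""
    X = np.atleast_2d(X.T).T  # promote 1-D samples to (n, 1)
    Y = np.atleast_2d(Y.T).T

    def centered(M):
        # pairwise Euclidean distances a_kl, then double centering:
        # A_kl = a_kl - row mean - column mean + grand mean
        a = np.linalg.norm(M[:, None, :] - M[None, :, :], axis=2)
        return a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()

    A, B = centered(X), centered(Y)
    V2_xy = (A * B).mean()                         # V_n^2(X, Y), (D4)
    V2_x, V2_y = (A * A).mean(), (B * B).mean()    # V_n^2(X), V_n^2(Y)
    R2 = V2_xy / np.sqrt(V2_x * V2_y) if V2_x * V2_y > 0 else 0.0  # (D5)
    return V2_xy, R2

rng = np.random.default_rng(3)
x = rng.normal(size=500)
_, r2_dep = dcov_dcor(x, x**2)                   # nonlinear dependence:
                                                 # R_n^2 well above 0
_, r2_ind = dcov_dcor(x, rng.normal(size=500))   # independence: R_n^2 near 0
print(r2_dep, r2_ind)
```

Unlike the Pearson correlation, which is near zero for the quadratic pair (x, x²), the distance correlation detects this nonlinear dependence, which is why the ANM tests in the paper rely on it.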
π
( ππ , ππ ππ ) be the causal measure of the causal SNP ππ . Then, the causal measure of the marker ππ ππ is given by πΆπΆ ππ ππ βππ = πΆπΆ ππβππ β π
π
( ππ ππ β ππ , ππ β π·π· ππ ππ ππ ππ ππ ππ ππ ) . (E8) β₯ π
π
( ππ ππ β ππ , ππ β π·π· ππ ππ ππ ππ ππ ππ ππ ) β₯ implies πΆπΆ ππβππ β₯ πΆπΆ ππ ππ βππ β₯ . (E9) Causation measure πΆπΆ ππ ππ βππ depends on the distance correlation between ππ ππ β ππ and ππ β π·π· ππ ππ ππ ππ ππ ππ ππ . For qualitative trait, we can use a logistic integer function as a nonlinear function. After some algebraic operations, we have the model: ππ = ππ ππππ ππππ + ππ ππ (E10) or ππ = ππ ( πππΌπΌ ) + ππ ππ , (E11) where ππ ( πππΌπΌ ) is a nonlinear function. Equation (E11) can be approximated by ππ = ππ (0) + ππ β² (0) πππΌπΌ + ππ ππ . (E12) Thus, the model (E11) is reduced to model equation (E1). Using the same arguments from the model equation (E1), we can define the causality measure for marker ππ ππ : πΆπΆ ππ ππ βππ = πΆπΆ ππβππ β π
π
( ππ ππ β ππ , ππ β ππ β² ( ) π·π· ππ ππ ππ ππ ππ ππ ππ ) . (E13) For the discrete ANMs, we cannot find ππ β² (0) , the causal measure for the marker may simply be written as πΆπΆ ππ ππ βππ = πΆπΆ ππβππ β π
π
( ππ ππ β ππ , ππ β πΎπΎπ·π· ππ ππ ππ ππ ππ ππ ππ ) , (E14) where πΎπΎ is a appropriate constant. able S1. P-values of causation and assication of 214 SNPs showing significant causal relationships with schizophrenia. SNPs Chr Gene Related Disease P-values Causation Association rs1324544 6 Table S3. Power to detect association between SNP1 and Disease. Sample Sizes 500 1000 2000 5000 0.05 0.999 1 1 0.999 0.01 0.992 0.992 0.993 0.992