Multiple-trait Adaptive Fisher's Method for Genome-wide Association Studies
MMultiple-trait Adaptive Fisher’s Method forGenome-wide Association Studies
Qiaolan Deng and Chi Song College of Public Health, Division of Biostatistics, The Ohio StateUniversitySeptember 22, 2020
Running Title: Adaptive Fisher’s Method for Multiple Traits a r X i v : . [ s t a t . M E ] S e p bstract In genome-wide association studies (GWASs), there is an increasing need for de-tecting the associations between a genetic variant and multiple traits. In studies ofcomplex diseases, it is common to measure several potentially correlated traits in asingle GWAS. Despite the multivariate nature of the studies, single-trait-based meth-ods remain the most widely-adopted analysis procedure, owing to their simplicity forstudies with multiple traits as their outcome. However, the association between agenetic variant and a single trait sometimes can be weak, and ignoring the actualcorrelation among traits may lose power. On the contrary, multiple-trait analysis, amethod analyzes a group of traits simultaneously, has been proven to be more powerfulby incorporating information from the correlated traits. Although existing methodshave been developed for multiple traits, several drawbacks limit their wide applicationin GWASs. First, many existing methods can only process continuous traits and failto allow for binary traits which are ubiquitous in the real-world problems. Second, asshown in our simulation study, the performance of many existing methods is unstableunder different scenarios where the correlation among traits and the signal proportionvary. In this paper, we propose a multiple-trait adaptive Fisher’s (MTAF) method totest associations between a genetic variant and multiple traits at once, by adaptivelyaggregating evidence from each trait. The proposed method can accommodate bothcontinuous and binary traits and it has reliable performance under various scenarios.Using a simulation study, we compared our proposed method with several existingmethods and demonstrated its competitiveness in terms of type I error control andstatistical power. By applying the method to the Study of Addiction: Genetics andEnvironment (SAGE) dataset, we successfully identified several genes associated withsubstance dependence.
Keywords:
Genome-wide association study; Multiple traits; Adaptive Fisher; SAGE;Permutation test; Principal component analysis2
Introduction
Genome-wide association studies (GWASs) explore the associations between genetic variants,called single-nucleotide polymorphisms (SNPs), and traits [Visscher et al., 2012]. They havebeen successfully applied to identify numerous genetic variants associated with complexhuman diseases [Buniello et al., 2019]. In GWASs, it is common to measure multiple traitsunderlying complex diseases, because due to pleiotropy one genetic variant can influencemultiple phenotypic traits [Solovieff et al., 2013]. For example, a number of genetic variantsare associated with both fasting glucose and fasting insulin of type 2 diabetes [Billings andFlorez, 2010]; GWASs found a variant in gene SLC39A8 that has an influence on the riskof schizophrenia and Parkinson disease.[Pickrell et al., 2016]. In the last decade, single-traitmethods has been widely adopted [Visscher et al., 2017] in which the association betweenthe genetic variant and each single trait is tested one at a time. However, this type ofmethods suffers from several disadvantages. First, sometimes the association between asingle SNP and a trait is too weak to be detected by itself. Second, it ignores the correlationstructure among the traits, which leads to the loss of statistical power when the traits aretruly correlated. Third, post-hoc combination of multiple tests without proper adjustmentmay lead to inflated type I error or compromised statistical power. As a result, there is anincreasing need to develop powerful statistical methods that are capable of testing multipletraits simultaneously and properly.Various statistical methods have been developed and applied to multiple-trait studies.Following an overview of multiple-trait methods by Yang and Wang [2012], we classify theexisting methods into three categories. The first category is to combine test statistics orp-values from univariate tests. The O’Brien method [O’Brien, 1984, Wei and Johnson,1985] combines the test statistics from the individual test on each trait weighted by inversevariance. The sum of powered score tests (SPU) [Pan et al., 2014] and adaptive SPU (aSPU)[Zhang et al., 2014] combines the score test statistics derived from generalized estimationequations (GEE). The Trait-based Association Test that uses Extended Simes procedure3TATES) [Sluis et al., 2013] exploits the correlation among p-values from each univariatetest and generate a new test statistic. Fisher’s method [Fisher, 1925, Yang et al., 2016] andCauchy’s method [Liu and Xie, 2020] combines p-values of single-trait analyses and get thefinal p-values from known probability distributions. The second category is to reduce thedimensions of multiple traits. Principal components of heritability (PCH) [Klei et al., 2008]collapses the multiple traits to a linear combination of traits which maximizes the heritabilityand then tests the associations based on transformed traits. Canonical correlation analysis[Ferreira and Purcell, 2009] finds the linear combination of traits maximizing the covariancebetween a SNP and all traits. It is equivalent to multivariate analysis of variance (MANOVA)when there is only one SNP [Sluis et al., 2013]. The third category relies on regression models.MultiPhen [O’Reilly et al., 2012] regresses genotypes on phenotypes via proportional oddslogistic model. As mentioned, GEE has been utilized to generate score test statistics in aSPU[Zhang et al., 2014]. Besides linear models, kernel regression models (KMRs) also play a rolein multiple-trait analysis, including multivariate kernel machine regression [Maity et al.,2012], multi-trait sequence kernel association test (MSKAT) [Wu and Pankow, 2016] andMultiSKAT [Dutta et al., 2019]. Davenport et al. [2018] extended KMRs to multiple binaryoutcomes.Currently, several limitations still exist in multiple-trait methods restricting their wideapplications. First, many existing methods are unable to simultaneously analyze binary andcontinuous traits. For KMRs, multiple non-continuous traits can be both theoretically andcomputationally challenging and it is unclear how to integrate multiple different datatypes(e.g., multi-omics) [Larson et al., 2019]. MANOVA is not applicable to non-Normal traits.Numerous studies in GWASs are case-control studies and incapability of processing binarytraits greatly limits their application. Second, the methods may have inconsistent perfor-mance under various scenarios depending on the number of traits, the strength of correlationand the number of true associations, which are largely unknown in practice. Hence, thereis a demand for methods with robust performance regardless of scenarios. Third, although4any methods claimed the capability of handling covariates and confounders, few of themconducted the relevant simulations or real-data applications to demonstrate the type I errorcontrol and performance. As a matter of fact, our simulation study indicates the claims ofsome methods are inaccurate (see section 3).In this paper, We propose a multiple-trait adaptive Fisher’s (MTAF) method for multipletraits based on adaptive Fisher’s (AF) method [Song et al., 2016]. The remainder of thispaper is organized as follows. In section 2, we elaborate the proposed method and introducevariations to tackle highly correlated traits. In section 3, we evaluate the performance ofMTAF using simulation and apply it to a real GWAS of substance addiction traits. Insection 4, we review the advantages and the limitations of the MTAF method and discussour future work.
Suppose there are n independent subjects. For each subject i = 1 , . . . , n , there are K traits Y i = ( y i , . . . , y iK ) (cid:48) and Y k = ( y k , . . . , y nk ) (cid:48) , and x i ∈ , , i . x = ( x , . . . , x n ) is the vector of all subjects’genotypes for this SNP. In terms of real applications, we suppose each subject i has M covariates z i , . . . , z iM and Z = { z im } n × M . After adjusting for covariates, We aim to testthe associations between the SNP and K traits under the null hypothesis H that none ofthe K traits associates with the SNP. To construct the test statistics, the MTAF methodcombines the marginal p-values from the single-trait score tests in an adaptive way. At last,because our test statistic has no closed form, we conduct permutations to get the empiricalp-values. 5 .1 Score Test Score test is one of the most popular test in GWASs of single traits, because it only requiresfitting the null model thus is computationally inexpensive. Suppose for each SNP and subject i , a generalized linear model of the following form is assumed for the k th trait and the SNPwith M covariates: g k ( E ( Y ik )) = x i · β k + M (cid:88) m =1 z im · α mk , where β k is the effect of the SNP and α ’s are the effects of the M covariates on the k th trait. g k ( · ) is the link function, which is identity function for continuous traits and logit functionfor binary traits. Different link functions make it possible to allow for both continuous traitsand binary traits. Under H : β k = 0, the test statistic of score test for the SNP is: U k = n (cid:88) i =1 ( x i − ˆ x i )( Y ik − ˆ Y ik ) , where ˆ x i is an estimate of x i and ˆ Y ik is an estimate of Y ik . Denote V as Fisher informationmatrix which is the covariance matrix of U = ( U , . . . , U k ) and Var( U k | H ) = V kk where V kk is the k th diagonal element of V . Asymptotically, we have U k / √ V kk ∼ N (0 ,
1) and thenthe p-values (either one-sided or two-sided) can be generated. We get the p-values of thesingle-trait score tests computationally by R package statmod [Giner and Smyth, 2016].
Denote p , ..., p K as the p-values of score tests between the K traits and the SNP. Let S k = − (cid:80) ki =1 log p ( i ) where p ( i ) is the i th smallest p-value and it is the sum of first k smallest p-values. Then the p-value of s k is p s k = P ( S k ≥ s k ) where s k is the observed value of S k . Inpractice, this p-value can be obtained by permutation.The proposed test statistic of the MTAF method is6 MT AF = min ≤ k ≤ K p s k . Since it is intractable to get the p-value of the test statistic analytically, we turn to applythe permutation procedure to get the empirical p-values. We intend to test the conditionalindependence between the SNP and the traits given the covariates, i.e., Y ⊥ x | Z . Thepermutation procedure should break the associations between Y and x while preserving theassociations between Y and Z and between x and Z . Simply permuting the genotype x leadsto inflated type I error rate, because the correlation between the genotype and covariatesare destroyed. Following Potter [2005] and Werft and Benner [2010], we permute residualsof regressions of x on Z for generalized regression models. In our method, we first regressthe genotype on the covariates, then we permute the residuals derived from the regression.We replace the original genotype with the permuted residuals to perform score tests for thepermuted data. It should be noted that even when no covariate is explicitly included in themode, we still have a constant as our covariate.Specifically, we denote the vector of residuals of regressing x on Z as e x and permute itfor B times. In the b th permutation, we regress Y k on e ( b ) x and get the score tests p-values p ( b ) k for the coefficient of e ( b ) x . After B permutations, we get a ( B + 1) × K matrix P = { p ( b ) k } .Each element p ( b ) k is the p-value measuring the k th trait in the b th permutation for 1 ≤ b ≤ B and p (0) k is the observed p-value. Based on P , we can construct the MTAF method’s teststatistics for both the observed data and permuted data.For the matrix P , we can calculate the empirical p-values of the MTAF method for theobserved data and permuted data with the following steps:Suppose we have a ( B + 1) × K matrix of p-values P .1. For each row b ∈ { , , ..., B } , we calculate s ( b ) k and7 ( b ) s k = 1 B + 1 B (cid:88) j =0 { s ( j ) k ≥ s ( b ) k } , where is the indicator function.2. Then we can get a vector t = ( t (0) MT AF , t (1)
MT AF , . . . , t ( B ) MT AF ) where t ( b ) MT AF = min ≤ k ≤ K p ( b ) s k .3. The p-values of MTAF test statistics are approximated by p ( b ) MT AF = 1 B + 1 B (cid:88) j =0 { t ( j ) MT AF ≤ t ( b ) MT AF } , where p ( b ) MT AF is the empirical p-value of the MTAF method for the permuted data for1 ≤ b ≤ B and p (0) MT AF is the empirical p-value for the observed data.To simplify the following discussion of the variation of MTAF method, we define thesteps above as an AF operator AF {·} mapping P to p = ( p (0) MT AF , p (1)
MT AF , . . . , p ( B ) MT AF ). In practice, traits of a complex disease tend to be positively correlated intrinsically. Or,according to the prior knowledge, we can manually change the direction of effects to makethem in the same direction. In the situation that effects are in the same direction, com-bining one-sided p-values aggregates evidence for the effects which tend to have the samesignal and enjoys higher statistical power than combining two-sided p-values. Therefore,we recommend always combine one-sided p-values when it is appropriate. We separatelycombine the lower-tail p-values and the upper-tail p-values, and then unify these two resultsusing another round of MTAF permutation. Specifically, we get p lower = AF { P lower } and p upper = AF { P upper } and then the empirical p-value of the observed data is the first elementof p combo = AF { [ p lower p upper ] } . 8 .5 PCA of continuous traits Alternatively, a SNP may not have strong associations with observed traits but hiddencomponents, which can be difficult to detect those associations. PCA is widely used toreduce dimensions and it generates orthogonal linear combinations of variables maximizingthe variability. We introduce PCA into the MTAF method with the purpose of uncoveringthe hidden components. In the MTAF method, PCA generates K independent principalcomponents and we detect the associations between the SNP and the principal components.Specifically, for continuous traits, we first regress Y k on the covariates Z and denote theresiduals e k for 1 ≤ k ≤ K . PCA conducted on e , . . . , e K leads to K principal componentswhich substitute for y , . . . , y K . In the simulation study, the power of the MTAF methodincreases dramatically given correlated traits after applying PCA. Unlike its common use inpractice that selects only several top principal components, it claims that using all principalcomponents can have greater power [Aschard et al., 2014]. Therefore, the MTAF methodkeeps all principal components and it proves to be powerful. The MTAF method itself ispowerful when signals are sparse, but usually the number of traits truly associated with theSNP and underlying correlation structure are unknown. Hence, when analyze continuoustraits, initially we apply the original MTAF method and the MTAF method with PCArespectively, then combine the results from the two to get the final p-value. Specifically, wehave p original = AF ( P original ) and p pca = AF ( P pca ). Then combine two vectors p original and p pca into a matrix P continuous . At last, we have p continuous = AF ( P continuous ) and the p-valueis the first element of p continuous . To process a mixture of binary traits and continuous traits,we first apply the MTAF method to binary traits to get p binary and then get p continuous bythe procedure above. Then we combine two p-value vectors to get P mix = [ p binary p continuous ]and the empirical p-value is the first element of p mix = AF ( P mix ).In the MTAF method, PCA is used to handle continuous traits rather than binary traits.Although some literature refers to generalized PCA [Landgraf and Lee, 2019], our methodonly applies PCA to continuous traits at this moment.9 .6 Simulation Setup To evaluate the performance of the MTAF method, we conduct a simulation study. Ineach dataset, we simulate 1000 subjects based on various parameters such as the numberand types of traits, as well as the proportion of traits associated with the genotype andthe strength of the association. We consider 10, 50, and 100 traits and we simulate threescenarios: continuous traits only, binary traits only, and mixture of the two. We assume acompound symmetry (CS) structure underlying traits with either weak correlation ( ρ = 0 . ρ = 0 . x i ∈ { , , } for each subject i , such that x i ∼ Bin(2 , . . Y i = ( Y i , . . . , Y iK ) (cid:48) are simulated via a linear model: Y i = x i β + (cid:15) i , (1)where β = ( β , . . . , β K ) are the coefficients. The non-zero β k ’s are drawn from independentuniform distributions and we select the parameters of the uniform distributions to makethe differences among methods obvious. (cid:15) i ’s are independently drawn from a multivariateGaussian distribution N ( , Σ ). To simulate correlated traits, Σ is simulated such that thevariances are sampled independently from an inverse gamma distributions Inv-Gamma(4 , ρ .In addition, we consider simulation scenarios with two binary covariates Z i and Z i
10o investigate the performance of the MTAF method in the presence of confounders, suchas gender and race in the real datasets. These covariates are simulated by dichotomizingunderlying covariates that are linearly associated with the genotype. Specifically, Z i issimulated by dichotomizing x i η + ω i , and Z i are simulated by dichotomizing x i η + ω i ,where η and η are randomly drawn from uniform distributions U (0 . , ω i and ω i follow N (0 , Y i issimulated based on a linear model conditional on both the genotype and the covariates: Y i = x i β + Z i Γ + (cid:15) i , (2)where Z i = ( Z i , Z i ) and Γ K × has coefficients drawn from iid uniform distribution U (0 . , Y ik = 1 by replacing thecorresponding Y ik with logit( E ( Y ik )) in 1 and 2 and then we draw the binary traits based onthe simulated odds.To evaluate the performance of the MTAF method, competitor methods including MSKAT,aSPU, TATES, MANOVA, MultiPhen, and minP are also applied on the simulated datasetsfor comparison. Data availability
The authors state that all data necessary for confirming the conclusions are represented fullywithin the article. The SAGE data was downloaded from the dbGAP using accession numberphs000092.v1.p1. The R software package for the MTAF method and our simulation codeare available at https://github.com/songbiostat/MTAF .11
Results
First, to assess whether the MTAF method and other methods can appropriately controltype I error at the nominal level, we perform simulations under the null hypothesis where β = · · · = β K = 0. The empirical p-values were calculated based on 1 ,
000 permutationsand the type I error rate is evaluated at the 0 .
05 nominal level. In addition to aSPU withdefault independent structure, we evaluated aSPU equipped with exchangeable correlationstructure (aSPU-ex).Table 1 shows that, when all traits were continuous, the empirical Type I error of mostmethods were well controlled allowing for different number of traits and strength of correla-tion. We found that MultiPhen had inflated Type I error, especially after adding covariatesinto the models. Thus, we decided not to include MultiPhen in the corresponding simulationstudies. The similar phenomenon was reported by Konigorski et al.[Konigorski et al., 2020]that MultiPhen led to inflated or highly inflated type I errors and they did not include themethod in the power study. Table 2 and 3 show that type I error were well controlled forthe compared methods when all traits are binary, or when the traits are half binary and halfcontinuous. Please be noted that only methods that can be applied on binary or mixed traitscenarios are included in tables 2 and 3.
The power of these compared methods were evaluated under different scenarios at signif-icance level of 0 .
05. The effect sizes for the associated traits were randomly drawn fromuniform distributions. We include the original MTAF method (MTAF original ) and its PCAexpansion (MTAF
PCA ). The results show that the statistical power under dense scenarioswas greatly improved by introducing PCA. Table 4 summarizes the power of the compared12able 1:
Type I error: continuous traits
Table 2:
Type I error: binary traits
Type I error: mixed traits with covariates
Correlation
To further demonstrate the usage of the proposed method in real studies, we applied MTAFto The Study of Addiction: Genetics and Environment (SAGE) [Bierut et al., 2010] data from14able 4:
Power: continuous traits without covariates
Sparsity Correlation
PCA
MTAF original
MTAF MSKAT aSPU aSPU-ex MultiPhen TATES MANOVA minPsparse 0.3 10 U(0.15,0.25) 0.762 0.791 0.793 0.796 0.563 0.702 0.796 0.785 0.798 0.78850 U(0.2,0.4) 0.797 0.922 0.916 0.825 0.688 0.807 0.827 0.904 0.836 0.906100 U(0.15,0.3) 0.770 0.924 0.919 0.843 0.400 0.665 0.845 0.895 0.852 0.8940.6 10 U(0.15,0.25) 0.907 0.858 0.905 0.918 0.590 0.859 0.918 0.799 0.918 0.90750 U(0.2,0.4) 0.916 0.957 0.949 0.944 0.730 0.926 0.944 0.908 0.950 0.924100 U(0.15,0.3) 0.935 0.948 0.968 0.960 0.413 0.833 0.960 0.886 0.966 0.914dense 0.3 10 U(0.05,0.15) 0.769 0.690 0.765 0.784 0.390 0.507 0.784 0.651 0.786 0.63950 U(0.05,0.12) 0.808 0.629 0.812 0.865 0.134 0.422 0.866 0.534 0.873 0.551100 U(0.02,0.1) 0.615 0.418 0.637 0.716 0.082 0.266 0.716 0.334 0.742 0.3430.6 10 U(0.05,0.15) 0.92 0.729 0.907 0.933 0.301 0.701 0.933 0.627 0.933 0.64150 U(0.05,0.12) 0.969 0.654 0.964 0.987 0.119 0.640 0.987 0.457 0.988 0.526100 U(0.02,0.1) 0.929 0.437 0.916 0.970 0.074 0.338 0.970 0.273 0.974 0.346
Table 5:
Power: continuous traits with covariates
Sparsity Correlation
PCA
MTAF original
MTAF MSKAT aSPU aSPU-ex TATES minPsparse 0.3 10 U(0.15,0.3) 0.763 0.787 0.803 0.795 0.602 0.715 0.773 0.77950 U(0.2,0.4) 0.684 0.889 0.871 0.757 0.578 0.722 0.871 0.876100 U(0.15,0.3) 0.646 0.872 0.862 0.754 0.273 0.543 0.835 0.8400.6 10 U(0.15,0.3) 0.908 0.862 0.900 0.920 0.620 0.865 0.778 0.79450 U(0.2,0.4) 0.877 0.935 0.939 0.917 0.619 0.878 0.880 0.891100 U(0.15,0.3) 0.889 0.922 0.948 0.934 0.307 0.739 0.828 0.859dense 0.3 10 U(0.05,0.2) 0.833 0.792 0.842 0.875 0.506 0.649 0.759 0.74850 U(0.05,0.13) 0.685 0.517 0.695 0.775 0.123 0.387 0.468 0.470100 U(0.03,0.12) 0.460 0.301 0.478 0.560 0.064 0.211 0.258 0.2680.6 10 U(0.05,0.2) 0.957 0.833 0.947 0.963 0.422 0.821 0.732 0.73750 U(0.05,0.13) 0.934 0.544 0.928 0.963 0.105 0.567 0.400 0.457100 U(0.03,0.12) 0.779 0.292 0.756 0.862 0.057 0.247 0.192 0.269
Table 6:
Power: binary traits without covariates
Sparsity Correlation
Table 7:
Power: binary traits with covariates
Sparsity Correlation
Table 8:
Power: mixed traits with covariates and dense signals
Correlation . SAGEis a case-control GWAS of addiction with unrelated individuals where cases are defined as in-dividuals with Diagnostic and Statistical Manual of Mental Disorders 4th Edition (DSM-IV)alcohol dependence (lifetime) and potentially other illicit drug dependence. The individualswere selected from three large studies including the Collaborative Study on the Genetics ofAlcoholism (COGA), the Family Study of Cocaine Dependence (FSCD), and the Collabo-rative Genetic Study of Nicotine Dependence (COGEND).From dbGaP, We downloaded the data of 3 ,
847 individuals who consented to providetheir data for health research. Quality control was performed using PLINK 1.9 [Purcell et al.,2007]. We filtered data based on the genotyping rate (0.01), missingness (0.01), minor allelefrequency (0.01), and Hardy-Weinberg equilibrium (p-value ≤ ,
557 individuals and 560 ,
218 SNPs. In order to detect SNPs associated withaddiction, we selected 18 traits (summarized in Table 9) that account for the addiction toalcohol, nicotine, marijuana, cocaine, opiates, and any other drug. Among these traits, 12are binary and 6 are continuous. In addition, gender and race (black or white) are includedin our analysis as confounding covariates.We inspected the Pearson’s correlation among six continuous traits and Table 10 showsthat all six traits are positively correlated. Then we applied the MTAF method to detectthe associations between the SNPs and the 18 traits. To get accurate p-values at extremelysmall significance level (usually lower than 10 − ) given limited computational resource, weperformed the tests by adaptively increasing the number of permutations. We first set thenumber of permutation to be B and filtered out insignificant SNPs with p-values greaterthan 5 /B and updated the number of permutation to 10 × B . We started with B = 100 andrepeat the above process until B = 10 . By doing this, we managed to avoided permuting10 times for most of the SNPs, and saved computation resource only for the most significantSNPs. Figure 1 shows the QQ-plot of the − log (p-values) of all the SNPs based on the16able 9: SAGE data variables summary variable name descriptionnic sx tot number of nicotine symptoms endorsednic sx1 tolerance to nicotinenic sx2 withdrawal from nicotinecig daily Has participant ever smoked cigarettes daily for a month or more?mj sx tot number of marijuana dependence symptoms endorsedmj sx1 tolerance to marijuanamj sx2 withdrawal to marijuanacoc sx tot number of cocaine dependence symptoms endorsedcoc sx1 tolerance to cocainecoc sx2 withdrawal to cocaineop sx tot number of opiates dependence symptoms endorsedop sx1 tolerance to opiatesop sx2 withdrawal to opiatesalc sx tot number of alcohol dependence symptoms endorsedalc sx1 tolerance to opiatesalc sx2 withdrawal to opiatesmax drinks largest number of alcoholic drinks consumed in 24 hoursever oth Has participant ever used drugs other than marijuana, cocaine or opiates?
Table 10:
Correlation of continuous traits in SAGE data nicotine marijuana cocaine alcohol opiate max drinksnicotine 1 0.390 0.392 0.534 0.218 0.345marijuana 0.390 1 0.534 0.472 0.331 0.287cocaine 0.392 0.534 1 0.519 0.377 0.346alcohol 0.537 0.472 0.519 1 0.298 0.574opiate 0.218 0.331 0.377 0.298 1 0.175max drinks 0.350 0.287 0.346 0.574 0.175 1 × − , we identified 11 significant SNPs belonging to sixgenes as shown in Table 11. Most of these genes are related to the biological functions ofnerve and brain, which is plausible considering the fact that addictions are considered to berelated to mental health. Among these genes, EVI5 is a risk gene of multiple sclerosis, adisease which causes damage to the nervous system [Hoppenbrouwers et al., 2008]. Gouveia18t al. [Gouveia et al., 2019] finds that ZNF385D is associated with cognitive function, andit is also reported to be associated with language impairment in previous literature. TPK1produces thiamine pyrophosphate and its mutation can cause neurological disorder [Bankaet al., 2014]. According to GWAS catlog [Buniello et al., 2019], LINC02008, MIR4495 andCNTN1 are all linked to Alzheimer’s disease, and CNTN1 is reported to be associated withParkinson’s disease and schizophrenia.Table 11: significant SNPs at × − level rsid chromosone position p-value geners1556562 1 92568466 4 . × − EVI5rs1408916 1 92527070 3 . × − EVI5rs4847377 1 92557593 2 . × − EVI5rs4970712 1 92527990 3 . × − EVI5rs9310661 3 21855648 1 . × − ZNF385Drs7614064 3 82239346 1 . × − LINC02008rs7645576 3 82274160 3 . × − LINC02008rs9852219 3 82327624 5 . × − LINC02008rs10224675 7 144595669 3 . × − TPK1rs11178982 12 40907530 2 . × − CNTN1rs2020139 12 97953656 3 . × − MIR4495
Although single-trait analysis methods have been widely used in multiple-trait studies, thesemethods may be insufficient when traits are truly correlated with each other. Hence, themultiple-trait association tests can increase statistical power by incorporating the informa-tion among traits. In this paper, we propose the MTAF method for multiple traits, byadaptively combining p-values from the single-trait analyses.The MTAF method is very versatile for the multiple-trait association testing. First,because the MTAF method only requires the p-values as inputs, it can process both continu-ous traits and binary traits simultaneous whenever p-values are provided. Second, we applyPCA on continuous traits to uncover hidden components underlying traits, which greatlyimproves the performance of MTAF when signals are weak and dense. Third, the MTAF19igure 2: Manhattan plot of the MTAF method testing the association between SNPs andmultiple traits of substance dependence at a significance level of 5 × − B = 100,the chance that a SNP gets removed in the first round is 0 .
05. If the SNP remains, wewould start the second round B = 1000 and the chance that the SNP gets removed in thesecond round would be 0 .
045 which is the difference between the chance of remaining inthe first round 0 .
05 and the chance of advancing to the third round 0 . B = 10 , the expected permutation times for a SNP would be0 . ·
100 + 0 . · . . . + 4 . · − · ≈ · log (10 ), which is logarithmic timelog ( B ). On the contrary, if we fix the permutation times at 10 , the expected permutationtimes would be linear time B . Therefore, we reduce the time complexity from linear time tologarithmic time.A SNP may not always influence the traits directly. Instead, it may indirectly affectcorrelated traits through an unobserved component, in which case, uncovering the hiddencomponent can enhance the statistical power given correlated traits. In the MTAF method,we use PCA to uncover this potentially hidden component. By detecting the associations21etween the SNP and hidden components, we may increase the statistical power. This ideawas supported by our simulation study in which PCA largely improved the statistical powerunder dense scenarios. Thus, depending on the correlation structure, PCA may increase thepower of testing when traits are correlated.Despite the advantages of the MTAF method, several limitations can be addressed in thefuture work. First, the MTAF method can only analyze the single variant at this moment,the set-based analysis is common in GWASs though. Cai et al. [2020] proposed a set-basedAF method and the MTAF method might be extended to the set-based analysis or pathwayanalysis in the future. Second, the p-values are adaptively combined in the MTAF methodand we do not aim at selecting the most related traits for the identified SNPs. Thus, in ourfuture work, we can develop a method to provide a list of traits that are most related for eachidentified SNP. At last, because we need to permute the residuals of genotypes to control fortype I error while adjusting for confounders or covariates, the MTAF method requires theindividual level genotype data to perform the test. However, the individual level data are notalway available or often require special permissions since they are considered as identifiabledata. In the future work, we would explore whether the MTAF method can be extended touse the GWAS summary statistics without requiring the individual level data.The R software package for the MTAF method and our simulation code are available at https://github.com/songbiostat/MTAF . Acknowledgments
Funding support for the Study of Addiction: Genetics and Environment (SAGE) was pro-vided through the NIH Genes, Environment and Health Initiative [GEI] (U01 HG004422).SAGE is one of the genome-wide association studies funded as part of the Gene EnvironmentAssociation Studies (GENEVA) under GEI. Assistance with phenotype harmonization andgenotype cleaning, as well as with general study coordination, was provided by the GENEVA22oordinating Center (U01 HG004446). Assistance with data cleaning was provided bythe National Center for Biotechnology Information. Support for collection of datasets andsamples was provided by the Collaborative Study on the Genetics of Alcoholism (COGA;U10 AA008401), the Collaborative Genetic Study of Nicotine Dependence (COGEND; P01CA089392), and the Family Study of Cocaine Dependence (FSCD; R01 DA013423). Fund-ing support for genotyping, which was performed at the Johns Hopkins University Centerfor Inherited Disease Research, was provided by the NIH GEI (U01HG004438), the Na-tional Institute on Alcohol Abuse and Alcoholism, the National Institute on Drug Abuse,and the NIH contract ”High throughput genotyping for studying the genetic contributionsto human disease” (HHSN268200782096C). The datasets used for the analyses described inthis manuscript were obtained from dbGaP at through dbGaP accession numberphs000092.v1.p.
References
Hugues Aschard, Bjarni J Vilhj´almsson, Nicolas Greliche, Pierre-Emmanuel Morange, David-Alexandre Tr´egou¨et, and Peter Kraft. Maximizing the power of principal-component anal-ysis of correlated phenotypes in genome-wide association studies.
The American Journalof Human Genetics , 94(5):662–676, 2014.Siddharth Banka, Christian de Goede, Wyatt W Yue, Andrew AM Morris, Beate Von Bre-men, Kate E Chandler, Ren´e G Feichtinger, Claire Hart, Nasaim Khan, Verena Lunzer,et al. Expanding the clinical and molecular spectrum of thiamine pyrophosphokinase de-ficiency: a treatable neurological disorder caused by tpk1 mutations.
Molecular geneticsand metabolism , 113(4):301–306, 2014.Laura J Bierut, Arpana Agrawal, Kathleen K Bucholz, Kimberly F Doheny, Cathy Lau-rie, Elizabeth Pugh, Sherri Fisher, Louis Fox, William Howells, Sarah Bertelsen, et al.23 genome-wide association study of alcohol dependence.
Proceedings of the NationalAcademy of Sciences , 107(11):5082–5087, 2010.Liana K. Billings and Jose C. Florez. The genetics of type 2 diabetes: What have we learnedfrom GWAS?
Annals of the New York Academy of Sciences , 1212:59–77, 2010. ISSN17496632. doi: 10.1111/j.1749-6632.2010.05838.x.Annalisa Buniello, Jacqueline A L MacArthur, Maria Cerezo, Laura W Harris, James Hay-hurst, Cinzia Malangone, Aoife McMahon, Joannella Morales, Edward Mountjoy, ElliotSollis, et al. The nhgri-ebi gwas catalog of published genome-wide association studies, tar-geted arrays and summary statistics 2019.
Nucleic acids research , 47(D1):D1005–D1012,2019.Xiaoyu Cai, Lo-Bin Chang, Jordan Potter, and Chi Song. Adaptive fisher method detectsdense and sparse signals in association analysis of snv sets.
BMC medical genomics , 13(5):1–10, 2020.Clemontina A Davenport, Arnab Maity, Patrick F Sullivan, and Jung-Ying Tzeng. A pow-erful test for snp effects on multivariate binary outcomes using kernel machine regression.
Statistics in biosciences , 10(1):117–138, 2018.Diptavo Dutta, Laura Scott, Michael Boehnke, and Seunggeun Lee. Multi-skat: Generalframework to test for rare-variant association with multiple phenotypes.
Genetic epidemi-ology , 43(1):4–23, 2019.Manuel A R Ferreira and Shaun M Purcell. A multivariate test of association. 25(1):132–133,2009. doi: 10.1093/bioinformatics/btn563.Ronald A Fisher. Statistical methods for research workers oliver and boyd.
Edinburgh,Scotland , 6, 1925. 24knur Giner and Gordon K. Smyth. statmod: Probability Calculations for the InverseGaussian Distribution.
The R Journal , 8(1):339–351, 2016. doi: 10.32614/RJ-2016-024.URL https://doi.org/10.32614/RJ-2016-024 .Mateus H Gouveia, Cibele C Cesar, Meddly L Santolalla, Hanaisa P Sant Anna, Marilia OScliar, Thiago P Leal, Nathalia M Ara´ujo, Giordano B Soares-Souza, Wagner CS Ma-galh˜aes, Ignacio F Mata, et al. Genetics of cognitive trajectory in brazilians: 15 yearsof follow-up from the bambu´ı-epigen cohort study of aging.
Scientific Reports , 9(1):1–8,2019.IA Hoppenbrouwers, YS Aulchenko, George Cornell Ebers, SV Ramagopalan, BA Oostra,CM Van Duijn, and RQ Hintzen. Evi5 is a risk gene for multiple sclerosis.
Genes &Immunity , 9(4):334–337, 2008.Lambertus Klei, Diana Luca, B Devlin, and Kathryn Roeder. Pleiotropy and PrincipalComponents of Heritability Combine to Increase Power for Association Analysis.
GeneticEpidemiology , 32:9–19, 2008. doi: 10.1002/gepi.20257. URL .Stefan Konigorski, Yildiz E Yilmaz, J¨urgen Janke, Manuela M Bergmann, Heiner Boeing,and Tobias Pischon. Powerful rare variant association testing in a copula-based jointanalysis of multiple phenotypes.
Genetic Epidemiology , 44(1):26–40, 2020.Andrew J Landgraf and Yoonkyung Lee. Generalized principal component analysis: Projec-tion of saturated model parameters.
Technometrics , pages 1–14, 2019.Nicholas B Larson, Jun Chen, and Daniel J Schaid. A review of kernel methods for geneticassociation studies.
Genetic epidemiology , 43(2):122–136, 2019.Yaowu Liu and Jun Xie. Cauchy combination test: a powerful test with analytic p-valuecalculation under arbitrary dependency structures.
Journal of the American StatisticalAssociation , 115(529):393–402, 2020. 25atthew D Mailman, Michael Feolo, Yumi Jin, Masato Kimura, Kimberly Tryka, RinatBagoutdinov, Luning Hao, Anne Kiang, Justin Paschall, Lon Phan, et al. The ncbi dbgapdatabase of genotypes and phenotypes.
Nature genetics , 39(10):1181–1186, 2007.Arnab Maity, Patrick F Sullivan, and Jun-ing Tzeng. Multivariate phenotype associationanalysis by marker-set kernel machine regression.
Genetic epidemiology , 36(7):686–695,2012.Peter C O’Brien. Procedures for comparing samples with multiple endpoints.
Biometrics ,pages 1079–1087, 1984.Paul F. O’Reilly, Clive J. Hoggart, Yotsawat Pomyen, Federico C.F. Calboli, Paul Elliott,Marjo Riitta Jarvelin, and Lachlan J.M. Coin. MultiPhen: Joint model of multiple phe-notypes can increase discovery in GWAS.
PLoS ONE , 7(5), 2012. ISSN 19326203. doi:10.1371/journal.pone.0034861.Wei Pan, Junghi Kim, Yiwei Zhang, Xiaotong Shen, and Peng Wei. A powerful and adaptiveassociation test for rare variants.
Genetics , 197(4):1081–1095, 2014. ISSN 19432631. doi:10.1534/genetics.114.165035.Joseph K Pickrell, Tomaz Berisa, Jimmy Z Liu, Laure S´egurel, Joyce Y Tung, and David AHinds. Detection and interpretation of shared genetic influences on 42 human traits.
Nature genetics , 48(7):709, 2016.Douglas M Potter. A permutation test for inference in logistic regression with small-andmoderate-sized data sets.
Statistics in medicine , 24(5):693–708, 2005.Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel AR Ferreira,David Bender, Julian Maller, Pamela Sklar, Paul IW De Bakker, Mark J Daly, et al.Plink: a tool set for whole-genome association and population-based linkage analyses.
The American journal of human genetics , 81(3):559–575, 2007.26ophie Van Der Sluis, Danielle Posthuma, and Conor V Dolan. TATES : Efficient Multi-variate Genotype-Phenotype Analysis for Genome-Wide Association Studies. 9(1), 2013.doi: 10.1371/journal.pgen.1003235.Nadia Solovieff, Chris Cotsapas, Phil H Lee, Shaun M Purcell, and Jordan W Smoller.Pleiotropy in complex traits: challenges and strategies.
Nature Reviews Genetics , 14(7):483–495, 2013.Chi Song, Xiaoyi Min, and Heping Zhang. The screening and ranking algorithm for change-points detection in multiple samples.
Annals of Applied Statistics , 10(4):2102–2129, 2016.ISSN 19417330. doi: 10.1214/16-AOAS966.Peter M Visscher, Matthew A Brown, Mark I McCarthy, and Jian Yang. Five years of gwasdiscovery.
The American Journal of Human Genetics , 90(1):7–24, 2012.Peter M Visscher, Naomi R Wray, Qian Zhang, Pamela Sklar, Mark I McCarthy, Matthew ABrown, and Jian Yang. 10 years of gwas discovery: biology, function, and translation.
TheAmerican Journal of Human Genetics , 101(1):5–22, 2017.LJ Wei and Wayne E Johnson. Combining dependent tests with incomplete repeated mea-surements.
Biometrika , 72(2):359–364, 1985.Wiebke Werft and Axel Benner. glmperm: A permutation of regressor residuals test forinference in generalized linear models.
The R Journal , 2(1):39–43, 2010.Baolin Wu and James S. Pankow. Sequence Kernel Association Test of Multiple ContinuousPhenotypes.
Genetic Epidemiology , 40(2):91–100, 2016. ISSN 10982272. doi: 10.1002/gepi.21945.James J Yang, Jia Li, L Keoki Williams, and Anne Buu. An efficient genome-wide asso-ciation test for multivariate phenotypes based on the fisher combination function.
BMCbioinformatics , 17(1):19, 2016. 27iong Yang and Yuanjia Wang. Methods for analyzing multivariate phenotypes in geneticassociation studies.
Journal of Probability and Statistics , 2012, 2012. ISSN 1687952X. doi:10.1155/2012/652569.Yiwei Zhang, Zhiyuan Xu, Xiaotong Shen, and Wei Pan. Testing for association with mul-tiple traits in generalized estimation equations, with application to neuroimaging data.