Gene-level pharmacogenetic analysis on survival outcomes using gene-trait similarity regression
aa r X i v : . [ s t a t . A P ] A ug The Annals of Applied Statistics (cid:13)
Institute of Mathematical Statistics, 2014
GENE-LEVEL PHARMACOGENETIC ANALYSIS ON SURVIVALOUTCOMES USING GENE-TRAIT SIMILARITY REGRESSION
By Jung-Ying Tzeng ∗ , ‡ , , , Wenbin Lu ∗ , , and Fang-Chi Hsu † , North Carolina State University ∗ , Wake Forest University † and National Cheng-Kung University ‡ Gene/pathway-based methods are drawing significant attentiondue to their usefulness in detecting rare and common variants thataffect disease susceptibility. The biological mechanism of drug re-sponses indicates that a gene-based analysis has even greater poten-tial in pharmacogenetics. Motivated by a study from the VitaminIntervention for Stroke Prevention (VISP) trial, we develop a gene-trait similarity regression for survival analysis to assess the effect of agene or pathway on time-to-event outcomes. The similarity regressionhas a general framework that covers a range of survival models, suchas the proportional hazards model and the proportional odds model.The inference procedure developed under the proportional hazardsmodel is robust against model misspecification. We derive the equiv-alence between the similarity survival regression and a random effectsmodel, which further unifies the current variance component-basedmethods. We demonstrate the effectiveness of the proposed methodthrough simulation studies. In addition, we apply the method to theVISP trial data to identify the genes that exhibit an association withthe risk of a recurrent stroke. The
TCN2 gene was found to be asso-ciated with the recurrent stroke risk in the low-dose arm. This genemay impact recurrent stroke risk in response to cofactor therapy.
1. Introduction.
Genetic variations play a significant role in drug re-sponses. A gene that participates in a particular physiological mechanismmight influence the response to a specific therapeutic agent that targetsthe mechanism. Identifying these influential genes may help to clarify if an
Received March 2012; revised January 2014. Equal contribution. Supported by National Institutes of Health Grants R01 MH084022 and P01 CA142538. Supported by National Institutes of Health Grants R01 CA140632 and P01 CA142538. Supported by National Institutes of Health Grant U01 HG005160.
Key words and phrases.
Association study, gene/pathway, pharmacogenetics, similar-ity regression, survival data, proportional odds model, proportional hazards model.
This is an electronic reprint of the original article published by theInstitute of Mathematical Statistics in
The Annals of Applied Statistics ,2014, Vol. 8, No. 2, 1232–1255. This reprint differs from the original in paginationand typographic detail. 1
J.-Y. TZENG, W. LU AND F.-C. HSU individual might benefit from or be harmed by the therapy. Understand-ing the genetic diversity of drug responses can help to identify medicationsthat maximize treatment effectiveness and minimize the risk of adverse ef-fects for individuals. Such an understanding will also lead to improved riskstratification, prevention and treatment strategies for human diseases. Phar-macogenetics studies show how an adverse reaction or positive response topharmaceutical treatment is affected by an individual’s genetic makeup andhas the potential to deliver both public health and economic benefits rapidly.With the recent advancements in high-throughput technologies, it is becom-ing common for pharmacogenetic researches to systematically investigategenetic markers across the genome. Nevertheless, appropriate and efficientanalysis of the data remains a challenge.Gene- or pathway-based analyses can assess pharmacogenetic effects moreeffectively than single-marker based analyses [Goldstein, Tate and Sisodiya(2003); Goldstein (2005)]. First, there often exist obvious candidate genesand pathways that metabolize the drug and carry variants that are relevantto the drug responses. Responses to therapies usually involve complex re-lationships between gene variants within the same molecular pathway orfunctional gene set. When applied to pharmacogenetic studies, gene- orpathway-based methods might identify multiple variants of subtle effectsthat are missed by single marker-based methods. Second, pharmacogeneticstudies typically enroll only a moderate number of patients, which limits thepower of the association detection. Gene-based analyses have been shownto yield higher power than standard single marker and haplotype analyses.This type of analysis can particularly facilitate studies on rare-event drug re-sponses, such as adverse reactions, where it could take many years to collecta sufficient number of samples to obtain adequate power for standard anal-yses. In gene-based analyses, the association signals are aggregated acrossvariants, and the total number of tests is reduced; the amplification of theassociation signals and the alleviation of the multiple testing burdens resultin improved power.Our study was motivated by the need for a gene-based analysis of the time-to-event data of the Vitamin Intervention for Stroke Prevention (VISP) trial[Toole, Malinow and Chambless (2004); Hsu, Sides and Mychalecky (2011)].Our goal is to assess the association between the recurrent stroke risk and the9 candidate genes involved in the homocystein (Hcy) metabolic pathway (seeData section for more details). In our preliminary analysis, we used the Coxproportional hazards (PH) model [Cox (1972)] to perform single-SNP screen-ing on 69 SNPs across 9 genes from 969 individuals. There were no SNPs pastthe significance threshold after accounting for multiple testing. However, thetop 6 hits, thresholding at unadjusted p -values < TCN2 (i.e., rs1544468, rs731991,rs2301955 and rs2301957 have Wald’s test p -values of 0.0065, 0.0072, 0.0346 IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Fig. 1.
The Kaplan–Meier survival curves for the top 6 SNPs identified from the singleSNP association analysis with risk of recurrent stroke. and 0.0346, resp.) and 2 SNPs are from
CTH (i.e., rs648743 and rs663465each have a Wald’s test p -value of 0.0115). The Kaplan–Meier curves of these6 SNPs are shown in Figure 1 and indicate the potential for different riskpatterns among different variants at these loci. The clustering within thetwo genes suggests that it would be more efficient to combine the individualsignal strengths and model the joint effect of multiple loci in a gene.We perform the gene-based analysis using a gene-trait similarity regres-sion inspired by Haseman–Elston regression from linkage analysis [Elstonet al. (2000); Haseman and Elston (1972)] and haplotype similarity testsfor regional association [Beckmann et al. (2005); Qian and Thomas (2001);Tzeng et al. (2003)]. First, we quantify the genetic and trait similarities for J.-Y. TZENG, W. LU AND F.-C. HSU each pair of individuals. The genetic similarity is determined using identityby state (IBS) methods. The trait similarity is obtained from the covarianceof the transformed survival time conditional on the covariates. We thenregress the trait similarity on the genetic similarity and test the regressioncoefficient to detect the genetic association. There are several gene-based ap-proaches for censored time-to-event phenotypes in the literature, includingGoeman et al. (2005) and Lin and colleagues [Cai, Tonini and Lin (2011);Lin et al. (2011)]. In these approaches, the multimarker effects were modeledunder the Cox PH model using linear random effects [Goeman et al. (2005)]or a nonparametric function induced by a kernel machine [Cai, Tonini andLin (2011); Lin et al. (2011)]. The global effect of a gene was detected bytesting for the corresponding genetic variance component. These approacheswere found to be superior in identifying pathways or genes that are associ-ated with survival.For many years, similarity-based methods have been successfully used toevaluate gene-based associations in quantitative and binary traits [Beck-mann et al. (2005); Lin and Schaid (2009); Qian and Thomas (2001); Tzenget al. (2003); Wessel and Schork (2006)]. Our work makes such approachesavailable for survival phenotypes. In addition, our similarity regression cov-ers a variety of risk models, including the commonly used PH model and theproportional odds (PO) model. Furthermore, we show that the coefficient ofthe similarity regression obtained for survival phenotypes can be reexpressedas a variance component of a certain working random effects model. Suchresults facilitate the derivation of the test statistic and unify the similar-ity model and previous variance component methods [Goeman et al. (2005);Cai, Tonini and Lin (2011); Lin et al. (2011)]. Specifically, under the Cox PHmodel, our test statistic is equivalent to the test statistic defined by a kernelmachine approach [Lin et al. (2011)]. We also show that the test statisticcan be robust to model misspecification. Specifically, the proposed test givesthe correct type I error even if the true risk model is misspecified. However,the correct specification of the true risk model generally leads to a test withbetter power. Finally, we demonstrate the utility of the similarity regressionby identifying the important
TCN2 gene in the VISP study. The signifi-cance of
TCN2 to stroke risk has been reported by other association studies[Giusti et al. (2010); Low et al. (2011)] and has been supported by molecularbiology evidence [Afman et al. (2003); von Castel-Dunwoody et al. (2005)].Our findings further suggest potential interactions between
TCN2 and B12supplementation. This new information furthers the possibility that
TCN2 could be utilized to predict recurrent strokes, identify at-risk individuals andidentify therapeutic targets for ischemic stroke.
2. Data.
The VISP study was a prospective, double-blind, randomizedclinical trial [Toole, Malinow and Chambless (2004)]. The trial was designedto study if high doses of folic acid, vitamin B6 and vitamin B12 reduce the
IMILARITY REGRESSION FOR SURVIVAL OUTCOMES risk of a recurrent stroke as compared to low doses of these vitamins. Thetrial enrolled patients who were 35 or older, had a nondisabling cerebralinfarction within 120 days of randomization, and had Hcy levels in the topquartile of the U.S. population. Subjects were randomly assigned to receivedaily doses of either a high-dose formulation (containing 25 mg vitamin B6,0.4 mg vitamin B12 and 2.5 mg folic acid) or a low-dose formulation (con-taining 200 µ g vitamin B6, 6 µ g vitamin B12 and 20 µ g folic acid). Patientrecruitment began in August 1997 and was completed in December 2001. Atotal of 3680 participants were enrolled at 56 clinical sites across the UnitedStates, Canada and Scotland. The patients were followed for a maximum oftwo years, and the average follow-up duration was 1.7 years. In the VISPgenetic study, 2206 participants provided informed consent and blood sam-ples. The SNP genotypes of 9 genes related to the enzymes and cofactors inthe Hcy metabolic pathway were collected: BHMT1, BHMT2, CBS, CTH,MTHFR, MTR, MTRR, TCN1 and
TCN2 [Hsu, Sides and Mychalecky(2011)]. In a previous study, Hsu, Sides and Mychalecky (2011) conductedsingle-SNP analyses on targeted loci (e.g., Hcy-associated variants) to ex-amine the genetic association with the recurrent stroke risk. In the low-dosearm, the authors found that
TCN2
SNP rs731991 under a recessive modewas associated with the risk of a recurrent stroke with an unadjusted log-rank test p -value of 0.009. The associations for the remaining SNPs withinthe 9 genes in the low-dose arm were not studied. We extend this previousanalysis to all 9 genes using a gene-based approach. After quality controlscreening of the data (e.g., removing loci with >
99% missing proportion orHardy–Weinberg disequilibrium under additive mode and removing individ-uals with missing genotypes), the analysis included 969 individuals in thelow-dose arm with 69 recessively coded SNPs.
3. Gene-trait similarity regression for survival traits.
The model.
For individual i ( i = 1 , , . . . , n ), let T i denote the sur-vival time of interest and C i denote the censoring time. We observe ˜ T i =min( T i , C i ) and the censoring indicator δ i = I ( T i ≤ C i ). In addition, let X i denote the K × g mi denote the allele count vectorof marker m for person i , where the length of g mi , ℓ m is the number ofdistinct alleles at marker m , m = 1 , , . . . , M . For example, for a tri-alleliclocus m , g mi = (1 , , T if person i has genotype “ A A ” and (0 , , T ifperson i has genotype “ A A .”For each pair of individuals i and j , we measure the genetic similarity, S ij , of the targeted gene and the trait similarity, Z ij . The genetic simi-larity is quantified using the weighted IBS sum across the M markers inthe gene, that is, S ij = P Mm =1 w m S mij , where S mij = 2 if | g m,i − g m,j | is azero vector, S mij = 1 if | g m,i − g m,j | contains exactly two 1’s (and if ℓ m > J.-Y. TZENG, W. LU AND F.-C. HSU the remaining entries are 0), and S mij = 0 otherwise. The weights, w m , arespecified to up-weight or down-weight a variant based on certain features.Examples include weights that are based on allele frequencies, the degreeof evolutionary conservation or the functionality of the variations [Wesseland Schork (2006); Schaid (2010a, 2010b); Price et al. (2010)]. We can usethe minor allele frequency of marker m , denoted as q m , to up-weight sim-ilarities that are contributed by rare variants. Specifically, one can set amoderate weight, such as w m = q − / m [Pongpanich, Neely and Tzeng (2012)]or w m = q − m [Tzeng et al. (2011)], to promote similarity attributed by rarealleles, or use a more extreme weight, such as w m = (1 − q m ) [Wu et al.(2011)], to target rare variants only.The trait similarity, Z ij , is quantified as follows. First, we define Y i = H ( T i ), where H ( · ) is an (unspecified) monotonic increasing transformationfunction, such as the logarithm transformation Y i = log( T i ). Assume thatthe conditional mean of Y i given the covariates and genes is E ( Y i | X i , g i ) = θ + X Ti γ + g i , where θ is the intercept, g i is the multi-locus genetic effectof person i , and γ is the K -dimensional covariate effect. Further, define µ i = θ + X Ti γ . The trait similarity is defined as the product of the pairedresiduals adjusting for the covariate effects, that is, Z ij = ( Y i − µ i )( Y j − µ j ).The expected value of the trait similarity is the covariance between thetransformed survival times of subjects i and j .The gene-trait similarity regression has the form E ( Z ij | X i , X j ) = b × S ij , i = j. (3.1)Just as in Tzeng et al. (2009, 2011), the regression has a zero intercept anddoes not have the covariate term X i X j because the baseline and covariateeffects have been adjusted when defining Z ij . This argument will becomemore obvious from the viewpoint of variance components in the followingsubsection. Under model (3.1), the overall association of a gene can be eval-uated by testing the null hypothesis: b = 0.3.2. Score test for the gene-level effect.
We derive the score test statisticbased on the equivalence between the similarity regression and a mixedmodel. This equivalence is demonstrated as follows. Consider a workingmixed model for the transformed survival time: Y i = H ( T i ) = X Ti γ + g i + θ + ε i , (3.2)where ( g , . . . , g n ) T ∼ N (0 , τ S ) with S = { S ij } ni,j =1 , that is, the covariancebetween g i and g j depends on the genetic similarity between subjects i and j , and ε ∗ i ≡ θ + ε i , i = 1 , . . . , n are independently and identically distributedwith a known distribution that is independent of X i and g i . Given X i and g i ,model (3.2) specifies a general class of linear transformation models [Cheng, IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Wei and Ying (1995)], which contains many popular survival models as spe-cial cases. For example, when ε ∗ i follows the standard extreme value distribu-tion, the linear transformation model becomes the PH model [Cox (1972)].When ε ∗ i follows the standard logistic distribution, the linear transformationmodel becomes the PO model [Bennett (1983)].Under (3.2), the conditional expectation of the trait similarity betweenindividuals i and j ( i = j ) is E ( Z ij | X i , X j ) = cov( Y i , Y j | X i , X j ) = cov( g i + ε ∗ i , g j + ε ∗ i )= cov( g i , g j )= τ × S ij . Therefore, we have b = τ , that is, the regression coefficient in the similarityregression (3.1) is the genetic variance component in the mixed model (3.2).This motivates us to develop a score test for the variance component in theworking model. As shown in the Appendix, the score test statistics for τ = 0can be written as Q n = 1 n (ˆ r , . . . , ˆ r n ) S (ˆ r , . . . , ˆ r n ) T , where ˆ r i = Z ∞ ˆ ω i ( t ) dM i ( t ; ˆ γ, ˆ H )= δ i ˙ λ { ˆ H ( ˜ T i ) − ˆ γ T X i } λ { ˆ H ( ˜ T i ) − ˆ γ T X i } − λ { ˆ H ( ˜ T i ) − ˆ γ T X i } ,M i ( t ; γ, H ) = δ i I ( ˜ T i ≤ t ) − Z t I ( ˜ T i ≥ s ) d Λ { H ( s ) − γ T X i } , ˆ ω i ( t ) = ˙ λ { ˆ H ( t ) − ˆ γ T X i } /λ { ˆ H ( t ) − ˆ γ T X i } , and S is as defined after equa-tion (3.2). Here, λ ( · ) and Λ( · ) are the hazard and cumulative hazard func-tions of ε ∗ i , respectively, ˙ λ ( · ) is the first derivative of λ ( · ), and ˆ γ and ˆ H ( · )are the estimates of γ and H ( · ), respectively, in model (3.2) under the nullhypothesis: τ = 0. For example, if the PH model is imposed, that is, λ ( u ) =˙ λ ( u ) = e u , the estimators ˆ γ and ˆΓ( · ) ≡ e ˆ H ( · ) can be taken as the maximumpartial likelihood estimator and Breslow’s estimator, respectively. Under thiscase, ˆ ω i ( t ) ≡ r i = δ i − ˆΓ( ˜ T i ) exp( − ˆ γ T X i ), that is, the martingale resid-ual for the null model. If the PO model is used, that is, λ ( u ) = e u / (1+ e u ) and˙ λ ( u ) = e u / (1 + e u ) , we have ˆ ω i ( t ) = 1 / [1 + exp { ˆ H ( t ) − ˆ γ T X i } ]. In general, γ and H ( · ) can be estimated using the martingale-based estimating equa-tions [Chen, Jing and Ying (2002)] or the nonparametric maximum likeli-hood estimation method [Zeng and Lin (2006)] for the semiparametric linear J.-Y. TZENG, W. LU AND F.-C. HSU transformation model. In the Appendix, we show that under the null hypoth-esis the test statistic, Q n , asymptotically follows a weighted χ distributionwhere the weights can be estimated consistently. The p -values can then becalculated numerically using a resampling method or moment-matching ap-proximations [Pearson (1959); Duchesne and Lafaye De Micheaux (2010)].We note that the proposed score test can be robust to the misspecifica-tion of the true survival model. As an illustration, consider the test derivedunder the PH model. Based on the results for the robust inference of thePH model [Lin and Wei (1989)], when the PH model is misspecified, themaximum partial likelihood estimator, ˆ γ , and Breslow’s estimator, ˆΓ( · ), donot converge to the true parameters but converge to some deterministicvalues γ ∗ and Γ ∗ ( · ) under certain regularity conditions. The corresponding r ∗ i = δ i − Γ ∗ ( ˜ T i ) exp( − X Ti γ ∗ ) is not a martingale residual but has a mean of0 under the null hypothesis. Therefore, it can be shown that Q n convergesin distribution to a weighted χ distribution under the null hypothesis evenwith model misspecification. As shown in our simulation studies, the pro-posed score test derived under the PH model still gives correct type-I errorwhen the true model is the PO model. However, the power of the test de-pends on the assumed working model. In general, the test derived under thetrue risk model may have better power.
4. Results of the analysis of the VISP trial data.
We now return tothe VISP trial to evaluate the association between the recurrent stroke riskand the 9 candidate genes studied in Hsu, Sides and Mychalecky (2011).In our analysis, we conducted a gene-based screening on the 9 genes us-ing the low-dose samples. After removing loci with >
1% of missingnessand subjects with missing genotypes, there were 969 individuals with 69polymorphic SNPs under recessive coding. Of the 969 individuals, 86 ex-perienced a recurrent stroke (i.e., 91.1% censoring). We used the proposedsimilarity regression (referred to as SimReg) with inverse allele frequencyweights w m = q − / m , that is, the weight recommended in Pongpanich, Neelyand Tzeng (2012) when analyzing a mixture of common and rare variants.We calculated the p -values of the SimReg statistics using the resamplingmethod. Specifically, we computed the nonzero eigenvalues, ˆ ξ , . . . , ˆ ξ d , of ˆΣas defined in the Appendix. We generated 10 sets of ( χ , , . . . , χ ,d ). Eachset consisted of d independently and identically distributed χ random vari-ables. For each set, we calculated the value P dk =1 ˆ ξ k χ ,k , and the 10 valuesformed an empirical null distribution of the SimReg statistics. The Sim-Reg p -value was the proportion of the generated null statistics that weregreater than the observed statistic. We performed SimReg analyses underthe PH model (referred to as SimReg-PH) and the PO model (referred to asSimReg-PO). The performances of the SimReg methods were benchmarked IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Table 1
Results of the VISP genetic study
BHMT1 BHMT2 CBS CTH MTHFR MTR MTRR TCN1 TCN2
Numbers of SNPs 5 3 6 10 7 20 5 3 15minP 0.3399 0.5968 0.3354 0.0918 0.9105 0.8933 0.6183 0.9764 0.0704( K eff ) (3.99) (2.83) (4.41) (8.35) (4.64) (10.64) (4.78) (2.77) (11.28)Global 0.6457 0.7391 0.2669 0.0518 0.7819 0.9154 0.7363 0.9689 0.0457KM 0.4863 0.6142 0.2386 0.0078 0.7094 0.8289 0.7289 1.0000 0.0075SimReg-PH 0.5794 0.6402 0.1889 0.0073 0.6833 0.8835 0.5988 0.9845 0.0040SimReg-PO 0.6136 0.6942 0.2877 0.0075 0.7807 0.9011 0.6422 0.9922 0.0052The adjusted p -values for the gene were obtained using 1 − (1 − minimum raw p -value) K eff .Significance is concluded by comparing the p -values with the Bonferroni threshold 0 . / . against three approaches: (a) the single SNP minimum p -value method usingthe Cox PH model (referred to as minP), (b) the multi-SNP method usingthe global test for survival under the PH model [Goeman et al. (2005)] asimplemented in the R-package globaltest (referred to as Global), and (c)the multi-SNP method using the kernel machine [Lin et al. (2011)] as imple-mented in the R-package KMTest.surv (referred to as KM) with 10 resam-plings. Although the SimReg-PH test statistic is identical to the KM teststatistic, the results may be slightly different due to the different resamplingmethods adopted to obtain the p -values. Specifically, in the KM method, thescore statistic was perturbed by multiplying i.i.d. normal random variablesto achieve the same limiting distribution of the test statistic. In SimReg-PH,the weights in the limiting weighted χ distribution, that is, ξ , . . . , ξ d , wereconsistently estimated and then the samples were directly generated fromthe estimated weighted χ distribution based on a large number of i.i.d. χ random variables. The different resampling approaches also lead to differentcomputational burden. For example, using a 3.6 GHz Xeon Processor with60 GB RAM with 10 resamplings, the system run-time of SimReg-PH was < / p -value andcalculated the adjusted p -value of a gene to correct for the multiple SNPsusing 1 − (1 − minimum raw p -value) K eff . The effective number of indepen-dent tests, K eff , was estimated using the method of Moskvina and Schmidt(2008) and accounts for the correlations in recessive coding of loci. As stud-ied by Hsu, Sides and Mychalecky (2011), all analyses were considered underthe recessive mode and were adjusted for age, sex and race.The p -values for each of the methods are shown in Table 1, and the p -values are compared with the Bonferroni threshold adjusted for the 9 gene J.-Y. TZENG, W. LU AND F.-C. HSU analyses, that is, 0 . / . TCN2 (i.e., p -value = 0 . CTH had the second small-est p -value (0.0073), which did not pass the Bonferroni threshold but wasnear the cutoff. These results also agree with the findings in the single SNPanalysis. The results of SimReg-PO are similar to those of SimReg-PH ex-cept that the p -values are slightly larger, that is, p -value = 0 . TCN2 and 0.0075 for
CTH . On the other hand, the KM, Global and minP meth-ods did not yield any significant findings. However, the smallest p -valuesof the 9 genes were obtained for TCN2 (i.e., p -values of TCN2 are 0.0075for KM, 0.0457 for Global and 0.0704 for minP). As expected, the resultsof KM were very similar to those of SimReg-PH, except that the p -value of TCN2 was slightly above the 0.0056 threshold. The smallest p -values for theminP method were from the TCN2
SNP rs731991 with a raw Wald’s test p -value of 0.0065. However, neither the raw minimum p -value (0.0065) northe adjusted p -value (0.0704) survived the significance threshold correctedfor multiple testing (0.0056). All methods also had CTH as the gene withthe second smallest p -value (i.e., p -values 0.0078 for KM, 0.0518 for Global,and 0.00918 for minP).Next, we assessed the prediction performance of Cox PH models builtwith and without TCN2 using the procedure described in Li and Luan(2005). Specifically, we randomly divided the samples into a training set( n = 646) and a testing set ( n = 323). Based on the training set, we fittedthe Cox PH regression with two models: Model 1 included only the baselinecovariates (age, sex and race), that is, no genetic information, and model 2included the baseline covariates plus the top 7 principal components (PCs)of TCN2
SNPs that explained 95% of the variations. The PCs were used in-stead of the 15 original genotypes because of the high linkage disequilibriumamong them. Based on the fits of the PH model from the training set, weobtained the risk scores of every subject under each model and computedthe medians of the risk scores. We also computed the risk scores for thetesting set using the estimated coefficients from the training set. Next, wedivided the subjects into 2 risk groups: high-risk and low-risk. Individualswith a risk score higher/lower than the median risk scores obtained fromthe training data comprised the high-risk/low-risk group. Finally, we plot-ted the Kaplan–Meier curves for the 2 risk groups in the training data andthe two risk groups in the testing data separately and obtained the p -valuesof the corresponding log-rank tests. The results are given in Figure 2. Asexpected, the p -values for both models under the training set were very sig-nificant. However, only model 2 is significant ( p -value is 0.048) under thetesting set. This result implies that TCN2 gives a more accurate predictionof the risk for recurrent stroke.
IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Fig. 2.
The Kaplan–Meier survival curves for the two risk groups of patients.
Vitamin supplements have been identified as a potential treatment forvascular diseases. The beneficial effects of vitamin supplements on strokerecurrence are not yet fully understood. In VISP, vitamin supplements didnot show an effect on the recurrent stroke risk during the 2 years of follow-up. However, we found that genetic variants, such as SNPs in
TCN2 , wereassociated with the recurrent stroke risk in the low-dose arm. This findingis consistent with the literature.
TCN2 was previously found to be asso-ciated with ischemic stroke risk [Low et al. (2011); Giusti et al. (2010)]and premature ischemic stroke risk [Giusti et al. (2010)]. It has been re-ported that
TCN2 interferes with the intracellular availability of vitaminB12 [von Castel-Dunwoody et al. (2005)]. The gene is associated with plasmahomocysteine levels and affects the proportion of vitamin B12 bound totranscobalamin [Afman et al. (2003)]. It is suspected that SNPs on thegenes coding for enzymes involved in the methionine metabolism have beensuspected to be associated with hyperhomocysteinaemia, which can resultin occurrence of stroke [Giusti et al. (2010)]. Our significant findings in thelow-dose arm suggest that there may be an interaction between
TCN2 and J.-Y. TZENG, W. LU AND F.-C. HSU
B12 supplementation, a finding that warrants further studies. The findingslead to a hypothesis that there may be one specific combination of genotypesof
TCN2 that is more efficient at transporting B12 and thus impacts theeffectiveness of cofactor therapy on recurrent stroke risk. A functional studyis being planned to localize possibly independent regions of association anddetermine their function.Besides
TCN2 , CTH is marginally associated with recurrent stroke riskin the low-dose arm. It encodes cystathionine gamma-lyase, which is an en-zyme that converts cystathionine to cysteine in the trans-sulfuration path-way [Wang et al. (2004)]. It may be a determinant of plasma Hcy concen-trations, which may increase the risk of recurrent stroke because of arterialdisease. It is worth further study as well.
5. Simulation studies.
We performed simulations to assess the validityand effectiveness of the proposed SimReg methods based on the 15 SNPsin
TCN2 of the 969 VISP low-arm samples. The rarest minor allele fre-quency (MAF) is approximately 3%. We generated genotypes of n indi-viduals by randomly sampling with replacement from the 15-SNP geno-types of the 969 samples. For individual i , we generated the covariate, X i , from N (0 , T i , based on the ge-netic and covariate information under two models: the PH model and thePO model. Specifically, for the PH model, log( T i ) = − ( X i + P ℓ γ ℓ G ℓi ) + ε ∗ i ,where ε ∗ i follows the standard extreme value distribution; for the PO model,log { exp( T i ) − } = − ( X i + P ℓ γ ℓ G ℓi ) + ε ∗ i , where ε ∗ i follows the standard lo-gistic distribution. The value of G ℓi is determined by the genotypes at thecausal locus ℓ and the mode of inheritance. For example, if A is the causalallele at locus ℓ , then G ℓi = 2 , AA, Aa and aa , respec-tively, under an additive mode. Under a dominant mode, G · i = 1 , G · i = 1 , γ ℓ was set to be 0for all ℓ . For power analysis, we selected 3 SNPs with different MAFs andLD patterns as causal loci from the 15 SNPs in TCN2 and referred to themas SNP R, SNP U and SNP C. The MAFs are 0.036 (rare) for SNP R, 0.132(uncommon) for SNP U and 0.419 (common) for SNP C. The average R ’sbetween a causal locus and the remaining loci are 0.002 (low) for SNP R,0.003 (low) for SNP U and 0.216 (high) for SNP C. The specific values of γ ℓ ’sare given in Table 2 for each scenario under different inheritance modes andcensoring rates. The values were set to consider 1, 2 and 3 causal loci in thegene and to consider causal loci that have either the same or different effectsizes (e.g., rarer variants with larger effect sizes). All of the power scenariosassumed linear additive effects of the causal loci, which favors the linearrandom effects model (e.g., Global) and can be used to examine the utilityof using the nonlinear IBS function to capture the multi-marker effects. IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Table 2
Effect sizes for power analyses
Additive and dominant Recessive15% ∗
40% 90% 15% 40% 90% n = 500 n = 500 n = 1000 n = 1000 n = 1000 n = 1000 Scenario Causal SNPs (MAF) Effect size ( γ R , γ U , γ C ) Effect size ( γ R , γ U , γ C ) ∗ censoring proportion. We generated the censoring time, C i , from Unif(0 , c ), where c is uniquelychosen for each of 3 censoring rates: 15%, 40% and 90%. Specifically, we set c = 6 .
7, 2.0 and 0.2 for censoring rates of 15%, 40% and 90%, respectively.The sample sizes, n , were 500 for the 15% and 40% censoring rates under theadditive and dominant modes. For the 90% censoring rate under the additiveand dominant modes and all censoring rates under the recessive mode, n =1000. Each scenario was analyzed using SimReg (PH and/or PO), minP,Global and KM. SimReg-PH and KM have identical test statistics but useddifferent resampling approaches to obtain p -values. Because both resamplingapproaches are asymptotically equivalent, we expect minor differences infinite sample performance between SimReg-PH and KM. In all analyses, thecausal loci were excluded.5.1. Results of type I error analyses.
We first examined the performanceof the proposed SimReg-PH model. Table 3 displays the type I error ratesof different methods when the survival times were generated from the PHmodel. The results were based on 10 replications except that KM was basedon 5 × replications due to computational cost. In each replication, the p -values of SimReg-PH were obtained from 5 × resamplings. We reportthe type I error rates evaluated at the nominal levels of 5 × − , 5 × − and 5 × − . The type I error rates obtained by SimReg-PH remainedaround the nominal levels. However, the deviations were larger for α = 5 × − , mainly due to fewer resampled statistics observed on the extremetail. In particular, for the low censoring proportion (i.e., 15%), the type I J . - Y . T Z E N G , W . L UAN D F . - C . H S U Table 3
Type I error rates for survival time generated from the PH model
Additive Dominant RecessiveAnalyzed under 15% 40% 90% 15% 40% 90% 15% 40% 90%PH model ( n = 500) ( n = 500) ( n = 1000) ( n = 500) ( n = 500) ( n = 1000) ( n = 1000) ( n = 1000) ( n = 1000) α = 5 × − (Rates shown on the scale of 10 − )minP 4 .
89 4 .
85 4 .
44 4 .
92 4 .
84 4 .
16 7 .
63 7 .
46 7 . .
73 2 .
71 2 .
72 3 .
01 2 .
99 2 .
96 2 .
75 2 .
74 2 . .
07 5 .
11 4 .
73 5 .
10 5 .
17 4 .
81 5 .
12 4 .
97 4 . .
11 5 .
10 4 .
79 5 .
15 5 .
11 4 .
83 5 .
21 5 .
04 5 . α = 5 × − (Rates shown on the scale of 10 − )minP 6 .
52 6 .
22 5 .
50 6 .
13 5 .
44 4 .
39 17 .
09 17 .
62 19 . .
82 2 .
65 2 .
67 2 .
98 2 .
74 2 .
86 2 .
61 2 .
50 2 . .
18 4 .
86 4 .
02 5 .
26 4 .
86 4 .
28 5 .
40 5 .
54 4 . .
18 4 .
91 4 .
15 5 .
22 5 .
03 4 .
38 5 .
84 5 .
20 5 . α = 5 × − (Rates shown on the scale of 10 − )minP 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . , 10 , and 10 for nominal level at 0.05, 0.005, and 0.0005, respectively. The survivaltimes were generated from the PH model and were analyzed using different approaches under the PH model. The results were based on10 replications except that the results for KM were based on 5 × replications.IMILARITY REGRESSION FOR SURVIVAL OUTCOMES error rates were slightly inflated, while for the high censoring proportion(i.e., 90%), the type I error rates became a little conservative when α =5 × − . Nevertheless, the overall results suggest that the SimReg-PH testmaintained an appropriate size, which confirms the validity of the derivednull distribution of the test statistic, Q n . As expected, the type I error ratesobtained by KM were very similar to SimReg-PH. The type I error ratesobtained by the Global test are overly conservative; similar behavior hasbeen reported in the literature [Zhong and Chen (2011)]. The minP methodhad correct type I error rates under additive and dominant modes. However,the method had inflated type I error rates under the recessive mode, andthe inflation was more severe with smaller α . Under the recessive mode,the Bonferroni corrected p -values obtained by replacing K eff with the totalnumber of SNPs also yielded inflated type I error rates. Specifically, theempirical type I error rates for 15%, 40% and 90% censoring are (0.0627,0.0687, 0.0608) for α = 5 × − , (0.0145, 0.0152, 0.0163) for α = 5 × − and(0.0042, 0.0041, 0.0053) for α = 5 × − , respectively. The anti-conservationappears to be somewhat related to the rare recessive loci; when the rare lociare excluded from the analysis, the empirical type I error rate became closerto the nominal level (data not shown). However, such an exclusion strategymight give uninformative results because the relevant signals were excludedfrom the analysis.Table 4 displays the type I error rates when the survival times were gen-erated from the PO model. Both SimReg-PH and SimReg-PO were imple-mented. The results were based on 5000 replications, and the type I errorrates were calculated at the nominal level of 0.05. The type I error ratesfor both SimReg-PO and SimReg-PH were close to 0.05 independent of theinheritance mode and the censoring proportions. These results show the va-lidity of the SimReg-PO tests and the robustness of the SimReg-PH tests.Though not performed, KM is expected to have the same robustness asSimReg-PH. As seen previously, the Global test had conservative type I er-ror rates, but the magnitude of conservation is less than that seen in Table 3.The minP method yielded slightly inflated type I error rates for the additiveand dominant modes with low censoring proportion (i.e., 15%). As before,it yielded inflated type I error rates for the recessive mode.5.2. Results of power analyses.
The power analyses were performed us-ing the settings specified in Table 2. The results were based on 100 replica-tions under each scenario. Figure 3 shows the power when the survival timeswere generated from the PH model. We first consider the additive mode.When one large-effect causal locus has low MAF and low LD with the othermarkers (e.g., Scenarios 1, 5 and 11), minP tends to have the highest powerindependent of the censoring proportion. The good performance of minPis not unexpected in these scenarios because the overall association of the J . - Y . T Z E N G , W . L UAN D F . - C . H S U Table 4
Type I error rates for survival time generated from the PO model
Additive Dominant Recessive15% 40% 90% 15% 40% 90% 15% 40% 90% ( n = 500) ( n = 500) ( n = 1000) ( n = 500) ( n = 500) ( n = 1000) ( n = 1000) ( n = 1000) ( n = 1000) minP 0.0584 0.0548 0.0444 0.0560 0.0492 0.0410 0.0798 0.0774 0.0740Global 0.0358 0.0320 0.0256 0.0352 0.0332 0.0266 0.0342 0.0312 0.0250SimReg-PH 0.0488 0.0496 0.0464 0.0492 0.0504 0.0478 0.0526 0.0518 0.0480SimReg-PO 0.0446 0.0486 0.0468 0.0452 0.0496 0.0508 0.0544 0.0492 0.0490The survival times were generated from the PO model and were analyzed using different approaches under the PO model or the PHmodel (to examine the impact of model misspecification). The results were based on 5000 replications.IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Fig. 3.
Power when the survival times are generated from the PH model. gene was driven by a single large-effect locus, for which the majority of theother SNPs did not carry much information. As a result, there is no powergain when borrowing information from other SNPs, which is what SimReg-PH does. However, the power gain of minP over other methods generallydiminishes as the number of causal loci increases (e.g., Scenarios 5 to 10). Inscenarios where the marker set is not dominated by a single causal locus oflow MAF and low LD, SimReg-PH showed comparable or higher power. Asexpected, KM had near identical power as SimReg-PH in all scenarios. Inmost scenarios, the Global test produces the least amount of power largely J.-Y. TZENG, W. LU AND F.-C. HSU
Fig. 4.
Power when the survival times are generated from the PO model. due to the over-conservative test size. The overall performance under thedominant mode has a similar trend to that of the additive mode. However,the power of SimReg-PH is comparable to or better power than the power ofminP in more cases under the dominant mode than under the additive mode.For the recessive mode, SimReg-PH appears to have better power than theGlobal test. We did not perform a power analysis for minP because the typeI error rates were inflated.Figure 4 shows the power performance when the survival times were gen-erated from the PO model. The power obtained using SimReg-PO is similar
IMILARITY REGRESSION FOR SURVIVAL OUTCOMES to or better than the power obtained using SimReg-PH, indicating that effi-ciency is gained when the correct model is used. The power gain of SimReg-PO is more substantial when the censoring proportion is low to medium,and the power is comparable to or better power than the minP power insome of those difficult cases, such as Scenario 1.
6. Discussion.
In this work we extended the similarity regression (Sim-Reg) approaches, which have shown effectiveness in modeling marker-seteffects on binary and continuous outcomes, to survival models to facilitatethe assessment of gene or pathway effects on drug responses. The geneticeffect is evaluated by assessing the association between the IBS status of apair of individuals and the covariance of their survival times. We derived theequivalence between the similarity survival regression and a random effectsmodel. The equivalence facilitates the derivation of the score test statisticsand unifies the current variance component-based methods. Specifically, theKM approach [Lin et al. (2011)] is equivalent to SimReg-PH when the samekernel function is used to quantify the genetic similarity, S ij . The Globaltest [Goeman et al. (2005)] can be viewed as a special case of SimReg-PH with S ij = P m g Tm,i g m,j (i.e., the linear kernel). However, the results ofGlobal and SimReg-PH with linear kernels may be different because differ-ent approaches were used to derive the asymptotic distributions of the teststatistics. Compared to these existing gene-based approaches, our proposedmethod has the generality to incorporate a variety of risk models in the classof linear transformation models, and we explicitly constructed the SimRegtests under the PH model and the PO model in this work. We also proposeda resampling approach to obtain the p -values that improves the computa-tional efficiency. Finally, we showed that the derived inference procedure isrobust against the misspecification of the risk model, which is an attractivefeature because the underlying risk model is often unknown. Through sim-ulations, we showed that the power of the SimReg method is comparable toor higher than the power of the minP and Global methods across variousscenarios. We also verified that the SimReg-PH test statistics remain valideven if the risk model is misspecified.In the data application on the VISP study, we illustrated how SimRegcan be used to search for genes or pathways that are associated with a time-to-event outcome and confirmed previous findings using this gene-based ap-proach with statistical significance. Although we focused our method devel-opment and demonstrated its utility based on pharmacogenetics studies, theproposed method is applicable to other genetic clinical researches or obser-vational studies with time-to-event outcomes. For a pharmacogenetic studywith sample sizes 1000, such as the VISP trial, 1000 runs of the SimReg anal-yses took ≤ J.-Y. TZENG, W. LU AND F.-C. HSU analysis on ∼
20K genes should be completed in a day using a comparablecomputing facility. We implemented the proposed methods in R and madeit available at the authors’ websites. We are incorporating the software intothe
SimReg
R package.Motivated by the data application where the risk variant acted reces-sively, we further investigated the behavior of the gene-based approachesunder different modes of inheritance. We found that all of the studied gene-based methods performed appropriately under the additive and dominantmodes. However, caution should be used when performing the minimal p -value approaches under the recessive mode; the minimal p -value approacheshad severely inflated type I error rates. The inflation might be related to theextremely rare recessive loci. However, excluding those rare recessive loci issuboptimal because the important signals can be artificially removed andlead to power loss. In contrast, the global test and the similarity regressionare not vulnerable to such a situation and appear to be more suitable optionsgiven their reasonable performance under the recessive mode.This work focused on assessing the genetic main effect on drug responsein an effort to understand how individual variation affects drug efficacy andtoxicity. In pharmacogenetics and personalized medicine, one major topic isto study if the genetic effects are modified by treatments and how the effectsdiffer across treatment options. As observed in the VISP genetic studies, theeffect of TCN2 on recurrent stroke risk is restricted to the low-dose treat-ment. An analysis stratified by treatments allows for the evaluation of suchheterogeneous effects between different treatment groups, but its efficiencycan be further improved by incorporating the gene-treatment interactionin the regression model. Such an extension is not straightforward becausethe calculation of the score test requires the variance component estimatesfor the genetic main effect under a mixed-effects survival model. We aredeveloping further extensions of SimReg to incorporate interaction effects.APPENDIX
Derivation of the score statistic Q n : Given the working mixed model H ( T i ) = γ T X i + g i + ε ∗ i , the log-likelihood function of the observed datacan be written as l n ( γ, H, τ )= log Z · · · Z n Y i =1 [ λ { H ( ˜ T i ) − γ T X i − g i } ] δ i × e − Λ { H ( ˜ T i ) − γ T X i − g i } f G ( g , . . . , g n ) dg · · · dg n , where λ and Λ are the specified hazard and cumulative hazard functionsof ε ∗ i , and f G ( g , . . . , g n ) is the joint density of g , . . . , g n , that is, a mul-tivariate normal density with mean 0 and variance–covariance matrix τ S . IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Consider the variable transformation ( g ∗ , . . . , g ∗ n ) ′ = τ − / S − / ( g , . . . , g n ) ′ ,where S = S / S / . Then ( g ∗ , . . . , g ∗ n ) ′ follows a standard multivariate nor-mal distribution. The result leads to l n ( γ, H, τ )= log Z · · · Z n Y i =1 [ λ { H ( ˜ T i ) − γ T X i − τ / S / g ∗ i } ] δ i × e − Λ { H ( ˜ T i ) − γ T X i − τ / S / g ∗ i } × f ∗ G ( g ∗ , . . . , g ∗ n ) dg ∗ · · · dg ∗ n , where f ∗ G is the density for the standard multivariate normal distribution.After some algebra, we have1 n ∂l n ( γ, H, τ ) ∂τ (cid:12)(cid:12)(cid:12)(cid:12) τ =0 = 12 n ( r , . . . , r n ) S ( r , . . . , r n ) T + 12 n n X i =1 (cid:20) ¨ λ ( e i ) λ ( e i ) − { ˙ λ ( e i ) } λ ( e i ) − ˙ λ ( e i ) (cid:21) s Ti s i , where e i = H ( ˜ T i ) − γ T X i , ¨ λ ( · ) is the second derivative of λ ( · ), s i is the i throw of the matrix S / and r i = R ∞ ˙ λ { H ( t ) − γ T X i } /λ { H ( t ) − γ T X i } dM i ( t ; γ, H ). The equality in the above equation is obtained by first taking thederivative of l n ( γ, H, τ ) with respect to τ and then deriving its limit as τ → n goes to infinity. In addition, under the nullhypothesis, r i ’s have expectation of 0 at the true values of γ and H becausethey are martingale integrations. Therefore, if the null hypothesis is correct,the first term in the summation should be close to 0. This result motivatesus to consider a score test and reject the null hypothesis when the score(1 /n ) ∂l n (ˆ γ, ˆ H, τ ) /∂τ | τ =0 is bigger than some value, where ˆ γ and ˆ H are theestimates of γ and H , respectively, under the null model. It is asymptoticallyequivalent to consider the test statistic Q n = n − (ˆ r , . . . , ˆ r n ) S (ˆ r , . . . , ˆ r n ) T and reject the null hypothesis when Q n > c α , where c α is the critical valuefor a level- α test. Null distribution of the score statistic Q n : Here, we consider the estimatorsˆ γ and ˆ H ( · ) obtained via the martingale-based estimating equations for thestandard linear transformation model [Chen, Jin and Ying (2002)] under thenull hypothesis. Note that Q n = ( n − / P ni =1 ˆ r i s i ) T ( n − / P ni =1 ˆ r i s i ). Let γ and H denote the true values of γ and H , respectively, in the null model. J.-Y. TZENG, W. LU AND F.-C. HSU
Based on the derivations given in Chen, Jin and Ying (2002), we have thefollowing asymptotic representations: √ n (ˆ γ − γ ) = − A − √ n n X i =1 Z ∞ { X i − µ X ( t ) } dM i ( t ; γ , H ) + o p (1) , √ n { ˆ H ( t ) − H ( t ) } = − b ( t ) T A − √ n n X i =1 Z ∞ { X i − µ X ( t ) } dM i ( t ; γ , H )+ 1 √ n n X i =1 Z t φ ( t, s ) dM i ( s ; γ , H ) + o p (1) . The definitions of A , µ X ( · ), b ( · ) and φ ( · , · ) can be found in Chen, Jin andYing (2002). By the Taylor expansion, we can show that1 √ n n X i =1 ˆ r i s i → d N (0 , Σ) , as n → ∞ , where Σ = E ( ψ i ψ Ti ) and ψ i = (cid:20) δ i ˙ λ { H ( ˜ T i ) − γ T X i } λ { H ( ˜ T i ) − γ T X i } − λ { H ( ˜ T i ) − γ T X i } (cid:21) s i − ( B − B ) A − Z ∞ { X i − µ X ( t ) } dM i ( t ; γ , H ) − Z ∞ b ( t ) dM i ( t ; γ , H ) . Here B = lim n →∞ n n X i =1 s i X Ti [ ˙ λ { H ( ˜ T i ) − γ T X i }− δ i (¨ λ { H ( ˜ T i ) − γ T X i } λ { H ( ˜ T i ) − γ T X i }− ( ˙ λ { H ( ˜ T i ) − γ T X i } ) ) / ( λ { H ( ˜ T i ) − γ T X i } ) ] ,B = lim n →∞ n n X i =1 s i b ( ˜ T i ) T [ ˙ λ { H ( ˜ T i ) − γ T X i }− δ i (¨ λ { H ( ˜ T i ) − γ T X i } λ { H ( ˜ T i ) − γ T X i }− ( ˙ λ { H ( ˜ T i ) − γ T X i } ) ) / ( λ { H ( ˜ T i ) − γ T X i } ) ] , IMILARITY REGRESSION FOR SURVIVAL OUTCOMES b ( t ) = lim n →∞ n n X i =1 s i I ( ˜ T i ≥ t ) × φ ( ˜ T i , t )[ ˙ λ { H ( ˜ T i ) − γ T X i }− δ i (¨ λ { H ( ˜ T i ) − γ T X i } λ { H ( ˜ T i ) − γ T X i }− ( ˙ λ { H ( ˜ T i ) − γ T X i } ) ) / ( λ { H ( ˜ T i ) − γ T X i } ) ] . Therefore, Q n converges in distribution to a weighted χ distribution: P dk =1 ξ k χ ,k , where χ , , . . . , χ ,d are d independently and identically dis-tributed χ random variables with degree freedom of 1, and ξ , . . . , ξ d arethe d nonzero eigenvalues of the matrix Σ. To obtain the critical value, c α ,of the limiting weighted χ distribution, we use a numerical method. Specif-ically, we first obtain a consistent estimator, ˆΣ, of Σ using the usual plug-inmethod and compute the nonzero eigenvalues ˆ ξ , . . . , ˆ ξ d of the matrix ˆΣ.Next we generate a large set (e.g., 10,000 sets) of independent and identi-cally distributed random variables χ , , . . . , χ ,d . For each set of χ randomvariables, we compute P dk =1 ˆ ξ k χ ,k . We can then estimate c α by the upper α -quantile of P dk =1 ˆ ξ k χ ,k ’s and the associated p -value of the score test canbe computed accordingly. Acknowledgments.
The authors thank Dr. Mich`ele M. Sale and Dr. Brad-ford B. Worrall for providing the data. They thank Dr. Chen-Hsin Chen forhis insightful discussion and helpful suggestions to improve the work. Theyalso thank Dr. Shannon Holloway for her constructive input and proofread-ing to improve the manuscript and assistance in creating the software pack-age. REFERENCES
Afman, L. A. , Lievers, K. J. A. , Kluijtmans, L. A. J. , Trijbels, F. J. M. and
Blom, H. J. (2003). Gene–gene interaction between the cystathionine beta-synthase31 base pair variable number of tandem repeats and the methylenetetrahydrofolate re-ductase 677C > T polymorphism on homocysteine levels and risk for neural tube defects.
Mol. Genet. Metab. Beckmann, L. , Thomas, D. C. , Fischer, C. and
Chang-Claude, J. (2005). Haplotypesharing analysis using mantel statistics.
Hum. Hered. Bennett, S. (1983). Analysis of survival data by the proportional odds model.
Stat. Med. Cai, T. , Tonini, G. and
Lin, X. (2011). Kernel machine approach to testing thesignificance of multiple genetic markers for risk prediction.
Biometrics Chen, K. , Jin, Z. and
Ying, Z. (2002). Semiparametric analysis of transformation modelswith censored data.
Biometrika J.-Y. TZENG, W. LU AND F.-C. HSU
Cheng, S. C. , Wei, L. J. and
Ying, Z. (1995). Analysis of transformation models withcensored data.
Biometrika Cox, D. R. (1972). Regression models and life-tables.
J. R. Stat. Soc. Ser. B Stat.Methodol. Duchesne, P. and
Lafaye De Micheaux, P. (2010). Computing the distribution ofquadratic forms: Further comparisons between the Liu–Tang–Zhang approximation andexact methods.
Comput. Statist. Data Anal. Elston, R. C. , Buxbaum, S. , Jacobs, K. B. and
Olson, J. M. (2000). Haseman andElston revisited.
Genet. Epidemiol. Giusti, B. , Saracinim, C. , Bolli, P. , Magi, A. , Martinelli, I. , Peyvandi, F. , Ra-sura, M. , Volpe, M. , Lotta, L. A. , Rubattu, S. , Mannucci, P. M. and
Abbate, R. (2010). Early-onset ischaemic stroke: Analysis of 58 polymorphisms in 17 genes involvedin methionine metabolism.
Thrombosis and Haemostasis
Goeman, J. J. , Oosting, J. , Cleton-Jansen, A.-M. , Anninga, J. K. and vanHouwelingen, H. C. (2005). Testing association of a pathway with survival usinggene expression data.
Bioinformatics Goldstein, D. B. (2005). The genetics of human drug response.
Philosophical Transac-tions of the Royal Society B: Biological Sciences
Goldstein, D. B. , Tate, S. K. and
Sisodiya, S. M. (2003). Pharmacogenetics goesgenomic.
Nat. Rev. Genet. Haseman, J. K. and
Elston, R. C. (1972). The investigation of linkage between aquantitative trait and a marker locus.
Behav. Genet. Hsu, F. C. , Sides, E. G. and
Mychaleckyj, J. C.et al. (2011). A Transcobalamin 2gene variant associated with post-stroke homocysteine modifies recurrent stroke risk.
Neurology Li, H. and
Luan, Y. (2005). Boosting proportional hazards models using smoothing spline,with application to high-dimensional microarray data.
Biostatistics Lin, D. Y. and
Wei, L. J. (1989). The robust inference for the Cox proportional hazardsmodel.
J. Amer. Statist. Assoc. Lin, W.-Y. and
Schaid, D. J. (2009). Power comparisons between similarity-based mul-tilocus association methods, logistic regression, and score tests for haplotypes.
Genet.Epidemiol. Lin, X. , Cai, T. , Wu, M. C. , Zhou, Q. , Liu, G. , Christiani, D. C. and
Lin, X. (2011). Kernel machine SNP-set analysis for censored survival outcomes in genome-wide association studies.
Genet. Epidemiol. Low, H.-Q. , Chen, C. P. L. H. , Kasiman, K. , Thalamuthu, A. , Ng, S.-S. , Foo, J.-N. , Chang, H.-M. , Wong, M.-C. , Tai, E.-S. and
Liu, J. (2011). A comprehensiveassociation analysis of homocysteine metabolic pathway genes in Singaporean Chinesewith ischemic stroke.
PLoS ONE e24757. Moskvina, V. and
Schmidt, K. M. (2008). On multiple-testing correction in genome-wide association studies.
Genet. Epidemiol. Pearson, E. S. (1959). Note on an approximation to the distribution of non-central χ . Biometrika Pongpanich, M. , Neely, M. and
Tzeng, J. Y. (2012). On the aggregation of multi-marker information for marker-set and sequencing data analysis: Genotype collapsingvs. similarity collapsing.
Frontiers in Statistical Genetics and Methodology Price, A. L. , Kryukov, G. V. , de Bakker, P. I. W. , Purcell, S. M. , Staples, J. , Wei, L.-J. and
Sunyaev, S. R. (2010). Pooled association tests for rare variants inexon-resequencing studies.
Am. J. Hum. Genet. Qian, D. and
Thomas, D. (2001). Genome scan of complex traits by haplotype sharingcorrelation.
Genetic Epidemiology S582–S587.
Schaid, D. J. (2010a). Genomic similarity and kernel methods I: Advancements by build-ing on mathematical and statistical foundations.
Human Heredity Schaid, D. J. (2010b). Genomic similarity and kernel methods II: Methods for genomicinformation.
Human Heredity Toole, J. F. , Malinow, M. R. , Chambless, L. E. et al. (2004). Lowering homocysteinein patients with ischemic stroke to prevent recurrent stroke, myocardial infarction, anddeath: The Vitamin Intervention for Stroke Prevention (VISP) randomized controlledtrial.
Journal of American Medical Association
Tzeng, J. Y. , Devlin, D. , Wasserman, L. and
Roeder, K. (2003). On the identificationof disease mutations by the analysis of haplotype similarity and goodness of fit.
TheAmerican Journal of Human Genetics Tzeng, J.-Y. , Zhang, D. , Chang, S.-M. , Thomas, D. C. and
Davidian, M. (2009).Gene-trait similarity regression for multimarker-based association analysis.
Biometrics Tzeng, J.-Y. , Zhang, D. , Pongpanich, M. , Smith, C. , McCarthy, M. I. , Sale, M. M. , Worrall, B. B. , Hsu, F.-C. , Thomas, D. C. and
Sullivan, P. F. (2011). Studying gene and gene-environment effects of uncommon and common vari-ants on continuous traits: A marker-set approach using gene-trait similarity regression.
Am. J. Hum. Genet. von Castel-Dunwoody, K. M. , Kauwell, G. P. A. , Shelnutt, K. P. , Vaughn, J. D. , Griffin, E. R. , Maneval, D. R. , Theriaque, D. W. and
Bailey, L. B. (2005).Transcobalamin 776C → G polymorphism negatively affects vitamin B-12 metabolism.
Am. J. Clin. Nutr. Wang, J. , Huff, A. M. , Spence, J. D. and
Hegele, R. A. (2004). Single nucleotidepolymorphism in CTH associated with variation in plasma homocysteine concentration.
Clin. Genet. Wessel, J. and
Schork, N. J. (2006). Generalized genomic distance-based regressionmethodology for multilocus association analysis.
Am. J. Hum. Genet. Wu, M. C. , Lee, S. , Cai, T. , Li, Y. , Boehnke, M. and
Lin, X. (2011). Rare-variantassociation testing for sequencing data with the sequence kernel association test.
Am.J. Hum. Genet. Zeng, D. and
Lin, D. Y. (2006). Efficient estimation of semiparametric transformationmodels for counting processes.
Biometrika Zhong, P.-S. and
Chen, S. X. (2011). Tests for high-dimensional regression coefficientswith factorial designs.
J. Amer. Statist. Assoc.
J.-Y. TzengDepartment of StatisticsNorth Carolina State UniversityRaleigh, North Carolina 27695-8203USAandDepartment of StatisticsNational Cheng Kung UniversityTainan CityTaiwan (R.O.C.)E-mail: [email protected]
W. LuDepartment of StatisticsNorth Carolina State UniversityRaleigh, North Carolina 27695-8203USAE-mail: [email protected] J.-Y. TZENG, W. LU AND F.-C. HSU