[PDF] Gene-level pharmacogenetic analysis on survival outcomes using gene-trait similarity regression

Abstract

Gene/pathway-based methods are drawing significant attention due to their usefulness in detecting rare and common variants that affect disease susceptibility. The biological mechanism of drug responses indicates that a gene-based analysis has even greater potential in pharmacogenetics. Motivated by a study from the Vitamin Intervention for Stroke Prevention (VISP) trial, we develop a gene-trait similarity regression for survival analysis to assess the effect of a gene or pathway on time-to-event outcomes. The similarity regression has a general framework that covers a range of survival models, such as the proportional hazards model and the proportional odds model. The inference procedure developed under the proportional hazards model is robust against model misspecification. We derive the equivalence between the similarity survival regression and a random effects model, which further unifies the current variance component-based methods. We demonstrate the effectiveness of the proposed method through simulation studies. In addition, we apply the method to the VISP trial data to identify the genes that exhibit an association with the risk of a recurrent stroke. The TCN2 gene was found to be associated with the recurrent stroke risk in the low-dose arm. This gene may impact recurrent stroke risk in response to cofactor therapy.

Full PDF

aa r X i v : . [ s t a t . A P ] A ug The Annals of Applied Statistics (cid:13)

Institute of Mathematical Statistics, 2014

GENE-LEVEL PHARMACOGENETIC ANALYSIS ON SURVIVALOUTCOMES USING GENE-TRAIT SIMILARITY REGRESSION

By Jung-Ying Tzeng ∗ , ‡ , , , Wenbin Lu ∗ , , and Fang-Chi Hsu † , North Carolina State University ∗ , Wake Forest University † and National Cheng-Kung University ‡ Gene/pathway-based methods are drawing signiﬁcant attentiondue to their usefulness in detecting rare and common variants thataﬀect disease susceptibility. The biological mechanism of drug re-sponses indicates that a gene-based analysis has even greater poten-tial in pharmacogenetics. Motivated by a study from the VitaminIntervention for Stroke Prevention (VISP) trial, we develop a gene-trait similarity regression for survival analysis to assess the eﬀect of agene or pathway on time-to-event outcomes. The similarity regressionhas a general framework that covers a range of survival models, suchas the proportional hazards model and the proportional odds model.The inference procedure developed under the proportional hazardsmodel is robust against model misspeciﬁcation. We derive the equiv-alence between the similarity survival regression and a random eﬀectsmodel, which further uniﬁes the current variance component-basedmethods. We demonstrate the eﬀectiveness of the proposed methodthrough simulation studies. In addition, we apply the method to theVISP trial data to identify the genes that exhibit an association withthe risk of a recurrent stroke. The

TCN2 gene was found to be asso-ciated with the recurrent stroke risk in the low-dose arm. This genemay impact recurrent stroke risk in response to cofactor therapy.

1. Introduction.

Genetic variations play a signiﬁcant role in drug re-sponses. A gene that participates in a particular physiological mechanismmight inﬂuence the response to a speciﬁc therapeutic agent that targetsthe mechanism. Identifying these inﬂuential genes may help to clarify if an

Received March 2012; revised January 2014. Equal contribution. Supported by National Institutes of Health Grants R01 MH084022 and P01 CA142538. Supported by National Institutes of Health Grants R01 CA140632 and P01 CA142538. Supported by National Institutes of Health Grant U01 HG005160.

Key words and phrases.

Association study, gene/pathway, pharmacogenetics, similar-ity regression, survival data, proportional odds model, proportional hazards model.

This is an electronic reprint of the original article published by theInstitute of Mathematical Statistics in

The Annals of Applied Statistics ,2014, Vol. 8, No. 2, 1232–1255. This reprint diﬀers from the original in paginationand typographic detail. 1

J.-Y. TZENG, W. LU AND F.-C. HSU individual might beneﬁt from or be harmed by the therapy. Understand-ing the genetic diversity of drug responses can help to identify medicationsthat maximize treatment eﬀectiveness and minimize the risk of adverse ef-fects for individuals. Such an understanding will also lead to improved riskstratiﬁcation, prevention and treatment strategies for human diseases. Phar-macogenetics studies show how an adverse reaction or positive response topharmaceutical treatment is aﬀected by an individual’s genetic makeup andhas the potential to deliver both public health and economic beneﬁts rapidly.With the recent advancements in high-throughput technologies, it is becom-ing common for pharmacogenetic researches to systematically investigategenetic markers across the genome. Nevertheless, appropriate and eﬃcientanalysis of the data remains a challenge.Gene- or pathway-based analyses can assess pharmacogenetic eﬀects moreeﬀectively than single-marker based analyses [Goldstein, Tate and Sisodiya(2003); Goldstein (2005)]. First, there often exist obvious candidate genesand pathways that metabolize the drug and carry variants that are relevantto the drug responses. Responses to therapies usually involve complex re-lationships between gene variants within the same molecular pathway orfunctional gene set. When applied to pharmacogenetic studies, gene- orpathway-based methods might identify multiple variants of subtle eﬀectsthat are missed by single marker-based methods. Second, pharmacogeneticstudies typically enroll only a moderate number of patients, which limits thepower of the association detection. Gene-based analyses have been shownto yield higher power than standard single marker and haplotype analyses.This type of analysis can particularly facilitate studies on rare-event drug re-sponses, such as adverse reactions, where it could take many years to collecta suﬃcient number of samples to obtain adequate power for standard anal-yses. In gene-based analyses, the association signals are aggregated acrossvariants, and the total number of tests is reduced; the ampliﬁcation of theassociation signals and the alleviation of the multiple testing burdens resultin improved power.Our study was motivated by the need for a gene-based analysis of the time-to-event data of the Vitamin Intervention for Stroke Prevention (VISP) trial[Toole, Malinow and Chambless (2004); Hsu, Sides and Mychalecky (2011)].Our goal is to assess the association between the recurrent stroke risk and the9 candidate genes involved in the homocystein (Hcy) metabolic pathway (seeData section for more details). In our preliminary analysis, we used the Coxproportional hazards (PH) model [Cox (1972)] to perform single-SNP screen-ing on 69 SNPs across 9 genes from 969 individuals. There were no SNPs pastthe signiﬁcance threshold after accounting for multiple testing. However, thetop 6 hits, thresholding at unadjusted p -values < TCN2 (i.e., rs1544468, rs731991,rs2301955 and rs2301957 have Wald’s test p -values of 0.0065, 0.0072, 0.0346 IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Fig. 1.

The Kaplan–Meier survival curves for the top 6 SNPs identiﬁed from the singleSNP association analysis with risk of recurrent stroke. and 0.0346, resp.) and 2 SNPs are from

CTH (i.e., rs648743 and rs663465each have a Wald’s test p -value of 0.0115). The Kaplan–Meier curves of these6 SNPs are shown in Figure 1 and indicate the potential for diﬀerent riskpatterns among diﬀerent variants at these loci. The clustering within thetwo genes suggests that it would be more eﬃcient to combine the individualsignal strengths and model the joint eﬀect of multiple loci in a gene.We perform the gene-based analysis using a gene-trait similarity regres-sion inspired by Haseman–Elston regression from linkage analysis [Elstonet al. (2000); Haseman and Elston (1972)] and haplotype similarity testsfor regional association [Beckmann et al. (2005); Qian and Thomas (2001);Tzeng et al. (2003)]. First, we quantify the genetic and trait similarities for J.-Y. TZENG, W. LU AND F.-C. HSU each pair of individuals. The genetic similarity is determined using identityby state (IBS) methods. The trait similarity is obtained from the covarianceof the transformed survival time conditional on the covariates. We thenregress the trait similarity on the genetic similarity and test the regressioncoeﬃcient to detect the genetic association. There are several gene-based ap-proaches for censored time-to-event phenotypes in the literature, includingGoeman et al. (2005) and Lin and colleagues [Cai, Tonini and Lin (2011);Lin et al. (2011)]. In these approaches, the multimarker eﬀects were modeledunder the Cox PH model using linear random eﬀects [Goeman et al. (2005)]or a nonparametric function induced by a kernel machine [Cai, Tonini andLin (2011); Lin et al. (2011)]. The global eﬀect of a gene was detected bytesting for the corresponding genetic variance component. These approacheswere found to be superior in identifying pathways or genes that are associ-ated with survival.For many years, similarity-based methods have been successfully used toevaluate gene-based associations in quantitative and binary traits [Beck-mann et al. (2005); Lin and Schaid (2009); Qian and Thomas (2001); Tzenget al. (2003); Wessel and Schork (2006)]. Our work makes such approachesavailable for survival phenotypes. In addition, our similarity regression cov-ers a variety of risk models, including the commonly used PH model and theproportional odds (PO) model. Furthermore, we show that the coeﬃcient ofthe similarity regression obtained for survival phenotypes can be reexpressedas a variance component of a certain working random eﬀects model. Suchresults facilitate the derivation of the test statistic and unify the similar-ity model and previous variance component methods [Goeman et al. (2005);Cai, Tonini and Lin (2011); Lin et al. (2011)]. Speciﬁcally, under the Cox PHmodel, our test statistic is equivalent to the test statistic deﬁned by a kernelmachine approach [Lin et al. (2011)]. We also show that the test statisticcan be robust to model misspeciﬁcation. Speciﬁcally, the proposed test givesthe correct type I error even if the true risk model is misspeciﬁed. However,the correct speciﬁcation of the true risk model generally leads to a test withbetter power. Finally, we demonstrate the utility of the similarity regressionby identifying the important

TCN2 gene in the VISP study. The signiﬁ-cance of

TCN2 to stroke risk has been reported by other association studies[Giusti et al. (2010); Low et al. (2011)] and has been supported by molecularbiology evidence [Afman et al. (2003); von Castel-Dunwoody et al. (2005)].Our ﬁndings further suggest potential interactions between

TCN2 and B12supplementation. This new information furthers the possibility that

TCN2 could be utilized to predict recurrent strokes, identify at-risk individuals andidentify therapeutic targets for ischemic stroke.

2. Data.

The VISP study was a prospective, double-blind, randomizedclinical trial [Toole, Malinow and Chambless (2004)]. The trial was designedto study if high doses of folic acid, vitamin B6 and vitamin B12 reduce the

IMILARITY REGRESSION FOR SURVIVAL OUTCOMES risk of a recurrent stroke as compared to low doses of these vitamins. Thetrial enrolled patients who were 35 or older, had a nondisabling cerebralinfarction within 120 days of randomization, and had Hcy levels in the topquartile of the U.S. population. Subjects were randomly assigned to receivedaily doses of either a high-dose formulation (containing 25 mg vitamin B6,0.4 mg vitamin B12 and 2.5 mg folic acid) or a low-dose formulation (con-taining 200 µ g vitamin B6, 6 µ g vitamin B12 and 20 µ g folic acid). Patientrecruitment began in August 1997 and was completed in December 2001. Atotal of 3680 participants were enrolled at 56 clinical sites across the UnitedStates, Canada and Scotland. The patients were followed for a maximum oftwo years, and the average follow-up duration was 1.7 years. In the VISPgenetic study, 2206 participants provided informed consent and blood sam-ples. The SNP genotypes of 9 genes related to the enzymes and cofactors inthe Hcy metabolic pathway were collected: BHMT1, BHMT2, CBS, CTH,MTHFR, MTR, MTRR, TCN1 and

TCN2 [Hsu, Sides and Mychalecky(2011)]. In a previous study, Hsu, Sides and Mychalecky (2011) conductedsingle-SNP analyses on targeted loci (e.g., Hcy-associated variants) to ex-amine the genetic association with the recurrent stroke risk. In the low-dosearm, the authors found that

TCN2

SNP rs731991 under a recessive modewas associated with the risk of a recurrent stroke with an unadjusted log-rank test p -value of 0.009. The associations for the remaining SNPs withinthe 9 genes in the low-dose arm were not studied. We extend this previousanalysis to all 9 genes using a gene-based approach. After quality controlscreening of the data (e.g., removing loci with >

99% missing proportion orHardy–Weinberg disequilibrium under additive mode and removing individ-uals with missing genotypes), the analysis included 969 individuals in thelow-dose arm with 69 recessively coded SNPs.

3. Gene-trait similarity regression for survival traits.

The model.

For individual i ( i = 1 , , . . . , n ), let T i denote the sur-vival time of interest and C i denote the censoring time. We observe ˜ T i =min( T i , C i ) and the censoring indicator δ i = I ( T i ≤ C i ). In addition, let X i denote the K × g mi denote the allele count vectorof marker m for person i , where the length of g mi , ℓ m is the number ofdistinct alleles at marker m , m = 1 , , . . . , M . For example, for a tri-alleliclocus m , g mi = (1 , , T if person i has genotype “ A A ” and (0 , , T ifperson i has genotype “ A A .”For each pair of individuals i and j , we measure the genetic similarity, S ij , of the targeted gene and the trait similarity, Z ij . The genetic simi-larity is quantiﬁed using the weighted IBS sum across the M markers inthe gene, that is, S ij = P Mm =1 w m S mij , where S mij = 2 if | g m,i − g m,j | is azero vector, S mij = 1 if | g m,i − g m,j | contains exactly two 1’s (and if ℓ m > J.-Y. TZENG, W. LU AND F.-C. HSU the remaining entries are 0), and S mij = 0 otherwise. The weights, w m , arespeciﬁed to up-weight or down-weight a variant based on certain features.Examples include weights that are based on allele frequencies, the degreeof evolutionary conservation or the functionality of the variations [Wesseland Schork (2006); Schaid (2010a, 2010b); Price et al. (2010)]. We can usethe minor allele frequency of marker m , denoted as q m , to up-weight sim-ilarities that are contributed by rare variants. Speciﬁcally, one can set amoderate weight, such as w m = q − / m [Pongpanich, Neely and Tzeng (2012)]or w m = q − m [Tzeng et al. (2011)], to promote similarity attributed by rarealleles, or use a more extreme weight, such as w m = (1 − q m ) [Wu et al.(2011)], to target rare variants only.The trait similarity, Z ij , is quantiﬁed as follows. First, we deﬁne Y i = H ( T i ), where H ( · ) is an (unspeciﬁed) monotonic increasing transformationfunction, such as the logarithm transformation Y i = log( T i ). Assume thatthe conditional mean of Y i given the covariates and genes is E ( Y i | X i , g i ) = θ + X Ti γ + g i , where θ is the intercept, g i is the multi-locus genetic eﬀectof person i , and γ is the K -dimensional covariate eﬀect. Further, deﬁne µ i = θ + X Ti γ . The trait similarity is deﬁned as the product of the pairedresiduals adjusting for the covariate eﬀects, that is, Z ij = ( Y i − µ i )( Y j − µ j ).The expected value of the trait similarity is the covariance between thetransformed survival times of subjects i and j .The gene-trait similarity regression has the form E ( Z ij | X i , X j ) = b × S ij , i = j. (3.1)Just as in Tzeng et al. (2009, 2011), the regression has a zero intercept anddoes not have the covariate term X i X j because the baseline and covariateeﬀects have been adjusted when deﬁning Z ij . This argument will becomemore obvious from the viewpoint of variance components in the followingsubsection. Under model (3.1), the overall association of a gene can be eval-uated by testing the null hypothesis: b = 0.3.2. Score test for the gene-level eﬀect.

We derive the score test statisticbased on the equivalence between the similarity regression and a mixedmodel. This equivalence is demonstrated as follows. Consider a workingmixed model for the transformed survival time: Y i = H ( T i ) = X Ti γ + g i + θ + ε i , (3.2)where ( g , . . . , g n ) T ∼ N (0 , τ S ) with S = { S ij } ni,j =1 , that is, the covariancebetween g i and g j depends on the genetic similarity between subjects i and j , and ε ∗ i ≡ θ + ε i , i = 1 , . . . , n are independently and identically distributedwith a known distribution that is independent of X i and g i . Given X i and g i ,model (3.2) speciﬁes a general class of linear transformation models [Cheng, IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Wei and Ying (1995)], which contains many popular survival models as spe-cial cases. For example, when ε ∗ i follows the standard extreme value distribu-tion, the linear transformation model becomes the PH model [Cox (1972)].When ε ∗ i follows the standard logistic distribution, the linear transformationmodel becomes the PO model [Bennett (1983)].Under (3.2), the conditional expectation of the trait similarity betweenindividuals i and j ( i = j ) is E ( Z ij | X i , X j ) = cov( Y i , Y j | X i , X j ) = cov( g i + ε ∗ i , g j + ε ∗ i )= cov( g i , g j )= τ × S ij . Therefore, we have b = τ , that is, the regression coeﬃcient in the similarityregression (3.1) is the genetic variance component in the mixed model (3.2).This motivates us to develop a score test for the variance component in theworking model. As shown in the Appendix, the score test statistics for τ = 0can be written as Q n = 1 n (ˆ r , . . . , ˆ r n ) S (ˆ r , . . . , ˆ r n ) T , where ˆ r i = Z ∞ ˆ ω i ( t ) dM i ( t ; ˆ γ, ˆ H )= δ i ˙ λ { ˆ H ( ˜ T i ) − ˆ γ T X i } λ { ˆ H ( ˜ T i ) − ˆ γ T X i } − λ { ˆ H ( ˜ T i ) − ˆ γ T X i } ,M i ( t ; γ, H ) = δ i I ( ˜ T i ≤ t ) − Z t I ( ˜ T i ≥ s ) d Λ { H ( s ) − γ T X i } , ˆ ω i ( t ) = ˙ λ { ˆ H ( t ) − ˆ γ T X i } /λ { ˆ H ( t ) − ˆ γ T X i } , and S is as deﬁned after equa-tion (3.2). Here, λ ( · ) and Λ( · ) are the hazard and cumulative hazard func-tions of ε ∗ i , respectively, ˙ λ ( · ) is the ﬁrst derivative of λ ( · ), and ˆ γ and ˆ H ( · )are the estimates of γ and H ( · ), respectively, in model (3.2) under the nullhypothesis: τ = 0. For example, if the PH model is imposed, that is, λ ( u ) =˙ λ ( u ) = e u , the estimators ˆ γ and ˆΓ( · ) ≡ e ˆ H ( · ) can be taken as the maximumpartial likelihood estimator and Breslow’s estimator, respectively. Under thiscase, ˆ ω i ( t ) ≡ r i = δ i − ˆΓ( ˜ T i ) exp( − ˆ γ T X i ), that is, the martingale resid-ual for the null model. If the PO model is used, that is, λ ( u ) = e u / (1+ e u ) and˙ λ ( u ) = e u / (1 + e u ) , we have ˆ ω i ( t ) = 1 / [1 + exp { ˆ H ( t ) − ˆ γ T X i } ]. In general, γ and H ( · ) can be estimated using the martingale-based estimating equa-tions [Chen, Jing and Ying (2002)] or the nonparametric maximum likeli-hood estimation method [Zeng and Lin (2006)] for the semiparametric linear J.-Y. TZENG, W. LU AND F.-C. HSU transformation model. In the Appendix, we show that under the null hypoth-esis the test statistic, Q n , asymptotically follows a weighted χ distributionwhere the weights can be estimated consistently. The p -values can then becalculated numerically using a resampling method or moment-matching ap-proximations [Pearson (1959); Duchesne and Lafaye De Micheaux (2010)].We note that the proposed score test can be robust to the misspeciﬁca-tion of the true survival model. As an illustration, consider the test derivedunder the PH model. Based on the results for the robust inference of thePH model [Lin and Wei (1989)], when the PH model is misspeciﬁed, themaximum partial likelihood estimator, ˆ γ , and Breslow’s estimator, ˆΓ( · ), donot converge to the true parameters but converge to some deterministicvalues γ ∗ and Γ ∗ ( · ) under certain regularity conditions. The corresponding r ∗ i = δ i − Γ ∗ ( ˜ T i ) exp( − X Ti γ ∗ ) is not a martingale residual but has a mean of0 under the null hypothesis. Therefore, it can be shown that Q n convergesin distribution to a weighted χ distribution under the null hypothesis evenwith model misspeciﬁcation. As shown in our simulation studies, the pro-posed score test derived under the PH model still gives correct type-I errorwhen the true model is the PO model. However, the power of the test de-pends on the assumed working model. In general, the test derived under thetrue risk model may have better power.

4. Results of the analysis of the VISP trial data.

We now return tothe VISP trial to evaluate the association between the recurrent stroke riskand the 9 candidate genes studied in Hsu, Sides and Mychalecky (2011).In our analysis, we conducted a gene-based screening on the 9 genes us-ing the low-dose samples. After removing loci with >

1% of missingnessand subjects with missing genotypes, there were 969 individuals with 69polymorphic SNPs under recessive coding. Of the 969 individuals, 86 ex-perienced a recurrent stroke (i.e., 91.1% censoring). We used the proposedsimilarity regression (referred to as SimReg) with inverse allele frequencyweights w m = q − / m , that is, the weight recommended in Pongpanich, Neelyand Tzeng (2012) when analyzing a mixture of common and rare variants.We calculated the p -values of the SimReg statistics using the resamplingmethod. Speciﬁcally, we computed the nonzero eigenvalues, ˆ ξ , . . . , ˆ ξ d , of ˆΣas deﬁned in the Appendix. We generated 10 sets of ( χ , , . . . , χ ,d ). Eachset consisted of d independently and identically distributed χ random vari-ables. For each set, we calculated the value P dk =1 ˆ ξ k χ ,k , and the 10 valuesformed an empirical null distribution of the SimReg statistics. The Sim-Reg p -value was the proportion of the generated null statistics that weregreater than the observed statistic. We performed SimReg analyses underthe PH model (referred to as SimReg-PH) and the PO model (referred to asSimReg-PO). The performances of the SimReg methods were benchmarked IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Table 1

Results of the VISP genetic study

BHMT1 BHMT2 CBS CTH MTHFR MTR MTRR TCN1 TCN2

Numbers of SNPs 5 3 6 10 7 20 5 3 15minP 0.3399 0.5968 0.3354 0.0918 0.9105 0.8933 0.6183 0.9764 0.0704( K eﬀ ) (3.99) (2.83) (4.41) (8.35) (4.64) (10.64) (4.78) (2.77) (11.28)Global 0.6457 0.7391 0.2669 0.0518 0.7819 0.9154 0.7363 0.9689 0.0457KM 0.4863 0.6142 0.2386 0.0078 0.7094 0.8289 0.7289 1.0000 0.0075SimReg-PH 0.5794 0.6402 0.1889 0.0073 0.6833 0.8835 0.5988 0.9845 0.0040SimReg-PO 0.6136 0.6942 0.2877 0.0075 0.7807 0.9011 0.6422 0.9922 0.0052The adjusted p -values for the gene were obtained using 1 − (1 − minimum raw p -value) K eff .Signiﬁcance is concluded by comparing the p -values with the Bonferroni threshold 0 . / . against three approaches: (a) the single SNP minimum p -value method usingthe Cox PH model (referred to as minP), (b) the multi-SNP method usingthe global test for survival under the PH model [Goeman et al. (2005)] asimplemented in the R-package globaltest (referred to as Global), and (c)the multi-SNP method using the kernel machine [Lin et al. (2011)] as imple-mented in the R-package KMTest.surv (referred to as KM) with 10 resam-plings. Although the SimReg-PH test statistic is identical to the KM teststatistic, the results may be slightly diﬀerent due to the diﬀerent resamplingmethods adopted to obtain the p -values. Speciﬁcally, in the KM method, thescore statistic was perturbed by multiplying i.i.d. normal random variablesto achieve the same limiting distribution of the test statistic. In SimReg-PH,the weights in the limiting weighted χ distribution, that is, ξ , . . . , ξ d , wereconsistently estimated and then the samples were directly generated fromthe estimated weighted χ distribution based on a large number of i.i.d. χ random variables. The diﬀerent resampling approaches also lead to diﬀerentcomputational burden. For example, using a 3.6 GHz Xeon Processor with60 GB RAM with 10 resamplings, the system run-time of SimReg-PH was < / p -value andcalculated the adjusted p -value of a gene to correct for the multiple SNPsusing 1 − (1 − minimum raw p -value) K eﬀ . The eﬀective number of indepen-dent tests, K eﬀ , was estimated using the method of Moskvina and Schmidt(2008) and accounts for the correlations in recessive coding of loci. As stud-ied by Hsu, Sides and Mychalecky (2011), all analyses were considered underthe recessive mode and were adjusted for age, sex and race.The p -values for each of the methods are shown in Table 1, and the p -values are compared with the Bonferroni threshold adjusted for the 9 gene J.-Y. TZENG, W. LU AND F.-C. HSU analyses, that is, 0 . / . TCN2 (i.e., p -value = 0 . CTH had the second small-est p -value (0.0073), which did not pass the Bonferroni threshold but wasnear the cutoﬀ. These results also agree with the ﬁndings in the single SNPanalysis. The results of SimReg-PO are similar to those of SimReg-PH ex-cept that the p -values are slightly larger, that is, p -value = 0 . TCN2 and 0.0075 for

CTH . On the other hand, the KM, Global and minP meth-ods did not yield any signiﬁcant ﬁndings. However, the smallest p -valuesof the 9 genes were obtained for TCN2 (i.e., p -values of TCN2 are 0.0075for KM, 0.0457 for Global and 0.0704 for minP). As expected, the resultsof KM were very similar to those of SimReg-PH, except that the p -value of TCN2 was slightly above the 0.0056 threshold. The smallest p -values for theminP method were from the TCN2

SNP rs731991 with a raw Wald’s test p -value of 0.0065. However, neither the raw minimum p -value (0.0065) northe adjusted p -value (0.0704) survived the signiﬁcance threshold correctedfor multiple testing (0.0056). All methods also had CTH as the gene withthe second smallest p -value (i.e., p -values 0.0078 for KM, 0.0518 for Global,and 0.00918 for minP).Next, we assessed the prediction performance of Cox PH models builtwith and without TCN2 using the procedure described in Li and Luan(2005). Speciﬁcally, we randomly divided the samples into a training set( n = 646) and a testing set ( n = 323). Based on the training set, we ﬁttedthe Cox PH regression with two models: Model 1 included only the baselinecovariates (age, sex and race), that is, no genetic information, and model 2included the baseline covariates plus the top 7 principal components (PCs)of TCN2

SNPs that explained 95% of the variations. The PCs were used in-stead of the 15 original genotypes because of the high linkage disequilibriumamong them. Based on the ﬁts of the PH model from the training set, weobtained the risk scores of every subject under each model and computedthe medians of the risk scores. We also computed the risk scores for thetesting set using the estimated coeﬃcients from the training set. Next, wedivided the subjects into 2 risk groups: high-risk and low-risk. Individualswith a risk score higher/lower than the median risk scores obtained fromthe training data comprised the high-risk/low-risk group. Finally, we plot-ted the Kaplan–Meier curves for the 2 risk groups in the training data andthe two risk groups in the testing data separately and obtained the p -valuesof the corresponding log-rank tests. The results are given in Figure 2. Asexpected, the p -values for both models under the training set were very sig-niﬁcant. However, only model 2 is signiﬁcant ( p -value is 0.048) under thetesting set. This result implies that TCN2 gives a more accurate predictionof the risk for recurrent stroke.

IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Fig. 2.

The Kaplan–Meier survival curves for the two risk groups of patients.

Vitamin supplements have been identiﬁed as a potential treatment forvascular diseases. The beneﬁcial eﬀects of vitamin supplements on strokerecurrence are not yet fully understood. In VISP, vitamin supplements didnot show an eﬀect on the recurrent stroke risk during the 2 years of follow-up. However, we found that genetic variants, such as SNPs in

TCN2 , wereassociated with the recurrent stroke risk in the low-dose arm. This ﬁndingis consistent with the literature.

TCN2 was previously found to be asso-ciated with ischemic stroke risk [Low et al. (2011); Giusti et al. (2010)]and premature ischemic stroke risk [Giusti et al. (2010)]. It has been re-ported that

TCN2 interferes with the intracellular availability of vitaminB12 [von Castel-Dunwoody et al. (2005)]. The gene is associated with plasmahomocysteine levels and aﬀects the proportion of vitamin B12 bound totranscobalamin [Afman et al. (2003)]. It is suspected that SNPs on thegenes coding for enzymes involved in the methionine metabolism have beensuspected to be associated with hyperhomocysteinaemia, which can resultin occurrence of stroke [Giusti et al. (2010)]. Our signiﬁcant ﬁndings in thelow-dose arm suggest that there may be an interaction between

TCN2 and J.-Y. TZENG, W. LU AND F.-C. HSU

B12 supplementation, a ﬁnding that warrants further studies. The ﬁndingslead to a hypothesis that there may be one speciﬁc combination of genotypesof

TCN2 that is more eﬃcient at transporting B12 and thus impacts theeﬀectiveness of cofactor therapy on recurrent stroke risk. A functional studyis being planned to localize possibly independent regions of association anddetermine their function.Besides

TCN2 , CTH is marginally associated with recurrent stroke riskin the low-dose arm. It encodes cystathionine gamma-lyase, which is an en-zyme that converts cystathionine to cysteine in the trans-sulfuration path-way [Wang et al. (2004)]. It may be a determinant of plasma Hcy concen-trations, which may increase the risk of recurrent stroke because of arterialdisease. It is worth further study as well.

5. Simulation studies.

We performed simulations to assess the validityand eﬀectiveness of the proposed SimReg methods based on the 15 SNPsin

TCN2 of the 969 VISP low-arm samples. The rarest minor allele fre-quency (MAF) is approximately 3%. We generated genotypes of n indi-viduals by randomly sampling with replacement from the 15-SNP geno-types of the 969 samples. For individual i , we generated the covariate, X i , from N (0 , T i , based on the ge-netic and covariate information under two models: the PH model and thePO model. Speciﬁcally, for the PH model, log( T i ) = − ( X i + P ℓ γ ℓ G ℓi ) + ε ∗ i ,where ε ∗ i follows the standard extreme value distribution; for the PO model,log { exp( T i ) − } = − ( X i + P ℓ γ ℓ G ℓi ) + ε ∗ i , where ε ∗ i follows the standard lo-gistic distribution. The value of G ℓi is determined by the genotypes at thecausal locus ℓ and the mode of inheritance. For example, if A is the causalallele at locus ℓ , then G ℓi = 2 , AA, Aa and aa , respec-tively, under an additive mode. Under a dominant mode, G · i = 1 , G · i = 1 , γ ℓ was set to be 0for all ℓ . For power analysis, we selected 3 SNPs with diﬀerent MAFs andLD patterns as causal loci from the 15 SNPs in TCN2 and referred to themas SNP R, SNP U and SNP C. The MAFs are 0.036 (rare) for SNP R, 0.132(uncommon) for SNP U and 0.419 (common) for SNP C. The average R ’sbetween a causal locus and the remaining loci are 0.002 (low) for SNP R,0.003 (low) for SNP U and 0.216 (high) for SNP C. The speciﬁc values of γ ℓ ’sare given in Table 2 for each scenario under diﬀerent inheritance modes andcensoring rates. The values were set to consider 1, 2 and 3 causal loci in thegene and to consider causal loci that have either the same or diﬀerent eﬀectsizes (e.g., rarer variants with larger eﬀect sizes). All of the power scenariosassumed linear additive eﬀects of the causal loci, which favors the linearrandom eﬀects model (e.g., Global) and can be used to examine the utilityof using the nonlinear IBS function to capture the multi-marker eﬀects. IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Table 2

Eﬀect sizes for power analyses

Additive and dominant Recessive15% ∗

40% 90% 15% 40% 90% n = 500 n = 500 n = 1000 n = 1000 n = 1000 n = 1000 Scenario Causal SNPs (MAF) Eﬀect size ( γ R , γ U , γ C ) Eﬀect size ( γ R , γ U , γ C ) ∗ censoring proportion. We generated the censoring time, C i , from Unif(0 , c ), where c is uniquelychosen for each of 3 censoring rates: 15%, 40% and 90%. Speciﬁcally, we set c = 6 .

7, 2.0 and 0.2 for censoring rates of 15%, 40% and 90%, respectively.The sample sizes, n , were 500 for the 15% and 40% censoring rates under theadditive and dominant modes. For the 90% censoring rate under the additiveand dominant modes and all censoring rates under the recessive mode, n =1000. Each scenario was analyzed using SimReg (PH and/or PO), minP,Global and KM. SimReg-PH and KM have identical test statistics but useddiﬀerent resampling approaches to obtain p -values. Because both resamplingapproaches are asymptotically equivalent, we expect minor diﬀerences inﬁnite sample performance between SimReg-PH and KM. In all analyses, thecausal loci were excluded.5.1. Results of type I error analyses.

We ﬁrst examined the performanceof the proposed SimReg-PH model. Table 3 displays the type I error ratesof diﬀerent methods when the survival times were generated from the PHmodel. The results were based on 10 replications except that KM was basedon 5 × replications due to computational cost. In each replication, the p -values of SimReg-PH were obtained from 5 × resamplings. We reportthe type I error rates evaluated at the nominal levels of 5 × − , 5 × − and 5 × − . The type I error rates obtained by SimReg-PH remainedaround the nominal levels. However, the deviations were larger for α = 5 × − , mainly due to fewer resampled statistics observed on the extremetail. In particular, for the low censoring proportion (i.e., 15%), the type I J . - Y . T Z E N G , W . L UAN D F . - C . H S U Table 3

Type I error rates for survival time generated from the PH model

Additive Dominant RecessiveAnalyzed under 15% 40% 90% 15% 40% 90% 15% 40% 90%PH model ( n = 500) ( n = 500) ( n = 1000) ( n = 500) ( n = 500) ( n = 1000) ( n = 1000) ( n = 1000) ( n = 1000) α = 5 × − (Rates shown on the scale of 10 − )minP 4 .

89 4 .

85 4 .

44 4 .

92 4 .

84 4 .

16 7 .

63 7 .

46 7 . .

73 2 .

71 2 .

72 3 .

01 2 .

99 2 .

96 2 .

75 2 .

74 2 . .

07 5 .

11 4 .

73 5 .

10 5 .

17 4 .

81 5 .

12 4 .

97 4 . .

11 5 .

10 4 .

79 5 .

15 5 .

11 4 .

83 5 .

21 5 .

04 5 . α = 5 × − (Rates shown on the scale of 10 − )minP 6 .

52 6 .

22 5 .

50 6 .

13 5 .

44 4 .

39 17 .

09 17 .

62 19 . .

82 2 .

65 2 .

67 2 .

98 2 .

74 2 .

86 2 .

61 2 .

50 2 . .

18 4 .

86 4 .

02 5 .

26 4 .

86 4 .

28 5 .

40 5 .

54 4 . .

18 4 .

91 4 .

15 5 .

22 5 .

03 4 .

38 5 .

84 5 .

20 5 . α = 5 × − (Rates shown on the scale of 10 − )minP 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . , 10 , and 10 for nominal level at 0.05, 0.005, and 0.0005, respectively. The survivaltimes were generated from the PH model and were analyzed using diﬀerent approaches under the PH model. The results were based on10 replications except that the results for KM were based on 5 × replications.IMILARITY REGRESSION FOR SURVIVAL OUTCOMES error rates were slightly inﬂated, while for the high censoring proportion(i.e., 90%), the type I error rates became a little conservative when α =5 × − . Nevertheless, the overall results suggest that the SimReg-PH testmaintained an appropriate size, which conﬁrms the validity of the derivednull distribution of the test statistic, Q n . As expected, the type I error ratesobtained by KM were very similar to SimReg-PH. The type I error ratesobtained by the Global test are overly conservative; similar behavior hasbeen reported in the literature [Zhong and Chen (2011)]. The minP methodhad correct type I error rates under additive and dominant modes. However,the method had inﬂated type I error rates under the recessive mode, andthe inﬂation was more severe with smaller α . Under the recessive mode,the Bonferroni corrected p -values obtained by replacing K eﬀ with the totalnumber of SNPs also yielded inﬂated type I error rates. Speciﬁcally, theempirical type I error rates for 15%, 40% and 90% censoring are (0.0627,0.0687, 0.0608) for α = 5 × − , (0.0145, 0.0152, 0.0163) for α = 5 × − and(0.0042, 0.0041, 0.0053) for α = 5 × − , respectively. The anti-conservationappears to be somewhat related to the rare recessive loci; when the rare lociare excluded from the analysis, the empirical type I error rate became closerto the nominal level (data not shown). However, such an exclusion strategymight give uninformative results because the relevant signals were excludedfrom the analysis.Table 4 displays the type I error rates when the survival times were gen-erated from the PO model. Both SimReg-PH and SimReg-PO were imple-mented. The results were based on 5000 replications, and the type I errorrates were calculated at the nominal level of 0.05. The type I error ratesfor both SimReg-PO and SimReg-PH were close to 0.05 independent of theinheritance mode and the censoring proportions. These results show the va-lidity of the SimReg-PO tests and the robustness of the SimReg-PH tests.Though not performed, KM is expected to have the same robustness asSimReg-PH. As seen previously, the Global test had conservative type I er-ror rates, but the magnitude of conservation is less than that seen in Table 3.The minP method yielded slightly inﬂated type I error rates for the additiveand dominant modes with low censoring proportion (i.e., 15%). As before,it yielded inﬂated type I error rates for the recessive mode.5.2. Results of power analyses.

The power analyses were performed us-ing the settings speciﬁed in Table 2. The results were based on 100 replica-tions under each scenario. Figure 3 shows the power when the survival timeswere generated from the PH model. We ﬁrst consider the additive mode.When one large-eﬀect causal locus has low MAF and low LD with the othermarkers (e.g., Scenarios 1, 5 and 11), minP tends to have the highest powerindependent of the censoring proportion. The good performance of minPis not unexpected in these scenarios because the overall association of the J . - Y . T Z E N G , W . L UAN D F . - C . H S U Table 4

Type I error rates for survival time generated from the PO model

Additive Dominant Recessive15% 40% 90% 15% 40% 90% 15% 40% 90% ( n = 500) ( n = 500) ( n = 1000) ( n = 500) ( n = 500) ( n = 1000) ( n = 1000) ( n = 1000) ( n = 1000) minP 0.0584 0.0548 0.0444 0.0560 0.0492 0.0410 0.0798 0.0774 0.0740Global 0.0358 0.0320 0.0256 0.0352 0.0332 0.0266 0.0342 0.0312 0.0250SimReg-PH 0.0488 0.0496 0.0464 0.0492 0.0504 0.0478 0.0526 0.0518 0.0480SimReg-PO 0.0446 0.0486 0.0468 0.0452 0.0496 0.0508 0.0544 0.0492 0.0490The survival times were generated from the PO model and were analyzed using diﬀerent approaches under the PO model or the PHmodel (to examine the impact of model misspeciﬁcation). The results were based on 5000 replications.IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Fig. 3.

Power when the survival times are generated from the PH model. gene was driven by a single large-eﬀect locus, for which the majority of theother SNPs did not carry much information. As a result, there is no powergain when borrowing information from other SNPs, which is what SimReg-PH does. However, the power gain of minP over other methods generallydiminishes as the number of causal loci increases (e.g., Scenarios 5 to 10). Inscenarios where the marker set is not dominated by a single causal locus oflow MAF and low LD, SimReg-PH showed comparable or higher power. Asexpected, KM had near identical power as SimReg-PH in all scenarios. Inmost scenarios, the Global test produces the least amount of power largely J.-Y. TZENG, W. LU AND F.-C. HSU

Fig. 4.

Power when the survival times are generated from the PO model. due to the over-conservative test size. The overall performance under thedominant mode has a similar trend to that of the additive mode. However,the power of SimReg-PH is comparable to or better power than the power ofminP in more cases under the dominant mode than under the additive mode.For the recessive mode, SimReg-PH appears to have better power than theGlobal test. We did not perform a power analysis for minP because the typeI error rates were inﬂated.Figure 4 shows the power performance when the survival times were gen-erated from the PO model. The power obtained using SimReg-PO is similar

IMILARITY REGRESSION FOR SURVIVAL OUTCOMES to or better than the power obtained using SimReg-PH, indicating that eﬃ-ciency is gained when the correct model is used. The power gain of SimReg-PO is more substantial when the censoring proportion is low to medium,and the power is comparable to or better power than the minP power insome of those diﬃcult cases, such as Scenario 1.

6. Discussion.

In this work we extended the similarity regression (Sim-Reg) approaches, which have shown eﬀectiveness in modeling marker-seteﬀects on binary and continuous outcomes, to survival models to facilitatethe assessment of gene or pathway eﬀects on drug responses. The geneticeﬀect is evaluated by assessing the association between the IBS status of apair of individuals and the covariance of their survival times. We derived theequivalence between the similarity survival regression and a random eﬀectsmodel. The equivalence facilitates the derivation of the score test statisticsand uniﬁes the current variance component-based methods. Speciﬁcally, theKM approach [Lin et al. (2011)] is equivalent to SimReg-PH when the samekernel function is used to quantify the genetic similarity, S ij . The Globaltest [Goeman et al. (2005)] can be viewed as a special case of SimReg-PH with S ij = P m g Tm,i g m,j (i.e., the linear kernel). However, the results ofGlobal and SimReg-PH with linear kernels may be diﬀerent because diﬀer-ent approaches were used to derive the asymptotic distributions of the teststatistics. Compared to these existing gene-based approaches, our proposedmethod has the generality to incorporate a variety of risk models in the classof linear transformation models, and we explicitly constructed the SimRegtests under the PH model and the PO model in this work. We also proposeda resampling approach to obtain the p -values that improves the computa-tional eﬃciency. Finally, we showed that the derived inference procedure isrobust against the misspeciﬁcation of the risk model, which is an attractivefeature because the underlying risk model is often unknown. Through sim-ulations, we showed that the power of the SimReg method is comparable toor higher than the power of the minP and Global methods across variousscenarios. We also veriﬁed that the SimReg-PH test statistics remain valideven if the risk model is misspeciﬁed.In the data application on the VISP study, we illustrated how SimRegcan be used to search for genes or pathways that are associated with a time-to-event outcome and conﬁrmed previous ﬁndings using this gene-based ap-proach with statistical signiﬁcance. Although we focused our method devel-opment and demonstrated its utility based on pharmacogenetics studies, theproposed method is applicable to other genetic clinical researches or obser-vational studies with time-to-event outcomes. For a pharmacogenetic studywith sample sizes 1000, such as the VISP trial, 1000 runs of the SimReg anal-yses took ≤ J.-Y. TZENG, W. LU AND F.-C. HSU analysis on ∼

20K genes should be completed in a day using a comparablecomputing facility. We implemented the proposed methods in R and madeit available at the authors’ websites. We are incorporating the software intothe

SimReg

R package.Motivated by the data application where the risk variant acted reces-sively, we further investigated the behavior of the gene-based approachesunder diﬀerent modes of inheritance. We found that all of the studied gene-based methods performed appropriately under the additive and dominantmodes. However, caution should be used when performing the minimal p -value approaches under the recessive mode; the minimal p -value approacheshad severely inﬂated type I error rates. The inﬂation might be related to theextremely rare recessive loci. However, excluding those rare recessive loci issuboptimal because the important signals can be artiﬁcially removed andlead to power loss. In contrast, the global test and the similarity regressionare not vulnerable to such a situation and appear to be more suitable optionsgiven their reasonable performance under the recessive mode.This work focused on assessing the genetic main eﬀect on drug responsein an eﬀort to understand how individual variation aﬀects drug eﬃcacy andtoxicity. In pharmacogenetics and personalized medicine, one major topic isto study if the genetic eﬀects are modiﬁed by treatments and how the eﬀectsdiﬀer across treatment options. As observed in the VISP genetic studies, theeﬀect of TCN2 on recurrent stroke risk is restricted to the low-dose treat-ment. An analysis stratiﬁed by treatments allows for the evaluation of suchheterogeneous eﬀects between diﬀerent treatment groups, but its eﬃciencycan be further improved by incorporating the gene-treatment interactionin the regression model. Such an extension is not straightforward becausethe calculation of the score test requires the variance component estimatesfor the genetic main eﬀect under a mixed-eﬀects survival model. We aredeveloping further extensions of SimReg to incorporate interaction eﬀects.APPENDIX

Derivation of the score statistic Q n : Given the working mixed model H ( T i ) = γ T X i + g i + ε ∗ i , the log-likelihood function of the observed datacan be written as l n ( γ, H, τ )= log Z · · · Z n Y i =1 [ λ { H ( ˜ T i ) − γ T X i − g i } ] δ i × e − Λ { H ( ˜ T i ) − γ T X i − g i } f G ( g , . . . , g n ) dg · · · dg n , where λ and Λ are the speciﬁed hazard and cumulative hazard functionsof ε ∗ i , and f G ( g , . . . , g n ) is the joint density of g , . . . , g n , that is, a mul-tivariate normal density with mean 0 and variance–covariance matrix τ S . IMILARITY REGRESSION FOR SURVIVAL OUTCOMES Consider the variable transformation ( g ∗ , . . . , g ∗ n ) ′ = τ − / S − / ( g , . . . , g n ) ′ ,where S = S / S / . Then ( g ∗ , . . . , g ∗ n ) ′ follows a standard multivariate nor-mal distribution. The result leads to l n ( γ, H, τ )= log Z · · · Z n Y i =1 [ λ { H ( ˜ T i ) − γ T X i − τ / S / g ∗ i } ] δ i × e − Λ { H ( ˜ T i ) − γ T X i − τ / S / g ∗ i } × f ∗ G ( g ∗ , . . . , g ∗ n ) dg ∗ · · · dg ∗ n , where f ∗ G is the density for the standard multivariate normal distribution.After some algebra, we have1 n ∂l n ( γ, H, τ ) ∂τ (cid:12)(cid:12)(cid:12)(cid:12) τ =0 = 12 n ( r , . . . , r n ) S ( r , . . . , r n ) T + 12 n n X i =1 (cid:20) ¨ λ ( e i ) λ ( e i ) − { ˙ λ ( e i ) } λ ( e i ) − ˙ λ ( e i ) (cid:21) s Ti s i , where e i = H ( ˜ T i ) − γ T X i , ¨ λ ( · ) is the second derivative of λ ( · ), s i is the i throw of the matrix S / and r i = R ∞ ˙ λ { H ( t ) − γ T X i } /λ { H ( t ) − γ T X i } dM i ( t ; γ, H ). The equality in the above equation is obtained by ﬁrst taking thederivative of l n ( γ, H, τ ) with respect to τ and then deriving its limit as τ → n goes to inﬁnity. In addition, under the nullhypothesis, r i ’s have expectation of 0 at the true values of γ and H becausethey are martingale integrations. Therefore, if the null hypothesis is correct,the ﬁrst term in the summation should be close to 0. This result motivatesus to consider a score test and reject the null hypothesis when the score(1 /n ) ∂l n (ˆ γ, ˆ H, τ ) /∂τ | τ =0 is bigger than some value, where ˆ γ and ˆ H are theestimates of γ and H , respectively, under the null model. It is asymptoticallyequivalent to consider the test statistic Q n = n − (ˆ r , . . . , ˆ r n ) S (ˆ r , . . . , ˆ r n ) T and reject the null hypothesis when Q n > c α , where c α is the critical valuefor a level- α test. Null distribution of the score statistic Q n : Here, we consider the estimatorsˆ γ and ˆ H ( · ) obtained via the martingale-based estimating equations for thestandard linear transformation model [Chen, Jin and Ying (2002)] under thenull hypothesis. Note that Q n = ( n − / P ni =1 ˆ r i s i ) T ( n − / P ni =1 ˆ r i s i ). Let γ and H denote the true values of γ and H , respectively, in the null model. J.-Y. TZENG, W. LU AND F.-C. HSU

Based on the derivations given in Chen, Jin and Ying (2002), we have thefollowing asymptotic representations: √ n (ˆ γ − γ ) = − A − √ n n X i =1 Z ∞ { X i − µ X ( t ) } dM i ( t ; γ , H ) + o p (1) , √ n { ˆ H ( t ) − H ( t ) } = − b ( t ) T A − √ n n X i =1 Z ∞ { X i − µ X ( t ) } dM i ( t ; γ , H )+ 1 √ n n X i =1 Z t φ ( t, s ) dM i ( s ; γ , H ) + o p (1) . The deﬁnitions of A , µ X ( · ), b ( · ) and φ ( · , · ) can be found in Chen, Jin andYing (2002). By the Taylor expansion, we can show that1 √ n n X i =1 ˆ r i s i → d N (0 , Σ) , as n → ∞ , where Σ = E ( ψ i ψ Ti ) and ψ i = (cid:20) δ i ˙ λ { H ( ˜ T i ) − γ T X i } λ { H ( ˜ T i ) − γ T X i } − λ { H ( ˜ T i ) − γ T X i } (cid:21) s i − ( B − B ) A − Z ∞ { X i − µ X ( t ) } dM i ( t ; γ , H ) − Z ∞ b ( t ) dM i ( t ; γ , H ) . Here B = lim n →∞ n n X i =1 s i X Ti [ ˙ λ { H ( ˜ T i ) − γ T X i }− δ i (¨ λ { H ( ˜ T i ) − γ T X i } λ { H ( ˜ T i ) − γ T X i }− ( ˙ λ { H ( ˜ T i ) − γ T X i } ) ) / ( λ { H ( ˜ T i ) − γ T X i } ) ] ,B = lim n →∞ n n X i =1 s i b ( ˜ T i ) T [ ˙ λ { H ( ˜ T i ) − γ T X i }− δ i (¨ λ { H ( ˜ T i ) − γ T X i } λ { H ( ˜ T i ) − γ T X i }− ( ˙ λ { H ( ˜ T i ) − γ T X i } ) ) / ( λ { H ( ˜ T i ) − γ T X i } ) ] , IMILARITY REGRESSION FOR SURVIVAL OUTCOMES b ( t ) = lim n →∞ n n X i =1 s i I ( ˜ T i ≥ t ) × φ ( ˜ T i , t )[ ˙ λ { H ( ˜ T i ) − γ T X i }− δ i (¨ λ { H ( ˜ T i ) − γ T X i } λ { H ( ˜ T i ) − γ T X i }− ( ˙ λ { H ( ˜ T i ) − γ T X i } ) ) / ( λ { H ( ˜ T i ) − γ T X i } ) ] . Therefore, Q n converges in distribution to a weighted χ distribution: P dk =1 ξ k χ ,k , where χ , , . . . , χ ,d are d independently and identically dis-tributed χ random variables with degree freedom of 1, and ξ , . . . , ξ d arethe d nonzero eigenvalues of the matrix Σ. To obtain the critical value, c α ,of the limiting weighted χ distribution, we use a numerical method. Specif-ically, we ﬁrst obtain a consistent estimator, ˆΣ, of Σ using the usual plug-inmethod and compute the nonzero eigenvalues ˆ ξ , . . . , ˆ ξ d of the matrix ˆΣ.Next we generate a large set (e.g., 10,000 sets) of independent and identi-cally distributed random variables χ , , . . . , χ ,d . For each set of χ randomvariables, we compute P dk =1 ˆ ξ k χ ,k . We can then estimate c α by the upper α -quantile of P dk =1 ˆ ξ k χ ,k ’s and the associated p -value of the score test canbe computed accordingly. Acknowledgments.

The authors thank Dr. Mich`ele M. Sale and Dr. Brad-ford B. Worrall for providing the data. They thank Dr. Chen-Hsin Chen forhis insightful discussion and helpful suggestions to improve the work. Theyalso thank Dr. Shannon Holloway for her constructive input and proofread-ing to improve the manuscript and assistance in creating the software pack-age. REFERENCES

Afman, L. A. , Lievers, K. J. A. , Kluijtmans, L. A. J. , Trijbels, F. J. M. and

Blom, H. J. (2003). Gene–gene interaction between the cystathionine beta-synthase31 base pair variable number of tandem repeats and the methylenetetrahydrofolate re-ductase 677C > T polymorphism on homocysteine levels and risk for neural tube defects.

Mol. Genet. Metab. Beckmann, L. , Thomas, D. C. , Fischer, C. and

Chang-Claude, J. (2005). Haplotypesharing analysis using mantel statistics.

Hum. Hered. Bennett, S. (1983). Analysis of survival data by the proportional odds model.

Stat. Med. Cai, T. , Tonini, G. and

Lin, X. (2011). Kernel machine approach to testing thesigniﬁcance of multiple genetic markers for risk prediction.

Biometrics Chen, K. , Jin, Z. and

Ying, Z. (2002). Semiparametric analysis of transformation modelswith censored data.

Biometrika J.-Y. TZENG, W. LU AND F.-C. HSU

Cheng, S. C. , Wei, L. J. and

Ying, Z. (1995). Analysis of transformation models withcensored data.

Biometrika Cox, D. R. (1972). Regression models and life-tables.

J. R. Stat. Soc. Ser. B Stat.Methodol. Duchesne, P. and

Lafaye De Micheaux, P. (2010). Computing the distribution ofquadratic forms: Further comparisons between the Liu–Tang–Zhang approximation andexact methods.

Comput. Statist. Data Anal. Elston, R. C. , Buxbaum, S. , Jacobs, K. B. and

Olson, J. M. (2000). Haseman andElston revisited.

Genet. Epidemiol. Giusti, B. , Saracinim, C. , Bolli, P. , Magi, A. , Martinelli, I. , Peyvandi, F. , Ra-sura, M. , Volpe, M. , Lotta, L. A. , Rubattu, S. , Mannucci, P. M. and

Abbate, R. (2010). Early-onset ischaemic stroke: Analysis of 58 polymorphisms in 17 genes involvedin methionine metabolism.

Thrombosis and Haemostasis

Goeman, J. J. , Oosting, J. , Cleton-Jansen, A.-M. , Anninga, J. K. and vanHouwelingen, H. C. (2005). Testing association of a pathway with survival usinggene expression data.

Bioinformatics Goldstein, D. B. (2005). The genetics of human drug response.

Philosophical Transac-tions of the Royal Society B: Biological Sciences

Goldstein, D. B. , Tate, S. K. and

Sisodiya, S. M. (2003). Pharmacogenetics goesgenomic.

Nat. Rev. Genet. Haseman, J. K. and

Elston, R. C. (1972). The investigation of linkage between aquantitative trait and a marker locus.

Behav. Genet. Hsu, F. C. , Sides, E. G. and

Mychaleckyj, J. C.et al. (2011). A Transcobalamin 2gene variant associated with post-stroke homocysteine modiﬁes recurrent stroke risk.

Neurology Li, H. and

Luan, Y. (2005). Boosting proportional hazards models using smoothing spline,with application to high-dimensional microarray data.

Biostatistics Lin, D. Y. and

Wei, L. J. (1989). The robust inference for the Cox proportional hazardsmodel.

J. Amer. Statist. Assoc. Lin, W.-Y. and

Schaid, D. J. (2009). Power comparisons between similarity-based mul-tilocus association methods, logistic regression, and score tests for haplotypes.

Genet.Epidemiol. Lin, X. , Cai, T. , Wu, M. C. , Zhou, Q. , Liu, G. , Christiani, D. C. and

Lin, X. (2011). Kernel machine SNP-set analysis for censored survival outcomes in genome-wide association studies.

Genet. Epidemiol. Low, H.-Q. , Chen, C. P. L. H. , Kasiman, K. , Thalamuthu, A. , Ng, S.-S. , Foo, J.-N. , Chang, H.-M. , Wong, M.-C. , Tai, E.-S. and

Liu, J. (2011). A comprehensiveassociation analysis of homocysteine metabolic pathway genes in Singaporean Chinesewith ischemic stroke.

PLoS ONE e24757. Moskvina, V. and

Schmidt, K. M. (2008). On multiple-testing correction in genome-wide association studies.

Genet. Epidemiol. Pearson, E. S. (1959). Note on an approximation to the distribution of non-central χ . Biometrika Pongpanich, M. , Neely, M. and

Tzeng, J. Y. (2012). On the aggregation of multi-marker information for marker-set and sequencing data analysis: Genotype collapsingvs. similarity collapsing.

Frontiers in Statistical Genetics and Methodology Price, A. L. , Kryukov, G. V. , de Bakker, P. I. W. , Purcell, S. M. , Staples, J. , Wei, L.-J. and

Sunyaev, S. R. (2010). Pooled association tests for rare variants inexon-resequencing studies.

Am. J. Hum. Genet. Qian, D. and

Thomas, D. (2001). Genome scan of complex traits by haplotype sharingcorrelation.

Genetic Epidemiology S582–S587.

Schaid, D. J. (2010a). Genomic similarity and kernel methods I: Advancements by build-ing on mathematical and statistical foundations.

Human Heredity Schaid, D. J. (2010b). Genomic similarity and kernel methods II: Methods for genomicinformation.

Human Heredity Toole, J. F. , Malinow, M. R. , Chambless, L. E. et al. (2004). Lowering homocysteinein patients with ischemic stroke to prevent recurrent stroke, myocardial infarction, anddeath: The Vitamin Intervention for Stroke Prevention (VISP) randomized controlledtrial.

Journal of American Medical Association

Tzeng, J. Y. , Devlin, D. , Wasserman, L. and

Roeder, K. (2003). On the identiﬁcationof disease mutations by the analysis of haplotype similarity and goodness of ﬁt.

TheAmerican Journal of Human Genetics Tzeng, J.-Y. , Zhang, D. , Chang, S.-M. , Thomas, D. C. and

Davidian, M. (2009).Gene-trait similarity regression for multimarker-based association analysis.

Biometrics Tzeng, J.-Y. , Zhang, D. , Pongpanich, M. , Smith, C. , McCarthy, M. I. , Sale, M. M. , Worrall, B. B. , Hsu, F.-C. , Thomas, D. C. and

Sullivan, P. F. (2011). Studying gene and gene-environment eﬀects of uncommon and common vari-ants on continuous traits: A marker-set approach using gene-trait similarity regression.

Am. J. Hum. Genet. von Castel-Dunwoody, K. M. , Kauwell, G. P. A. , Shelnutt, K. P. , Vaughn, J. D. , Griffin, E. R. , Maneval, D. R. , Theriaque, D. W. and

Bailey, L. B. (2005).Transcobalamin 776C → G polymorphism negatively aﬀects vitamin B-12 metabolism.

Am. J. Clin. Nutr. Wang, J. , Huff, A. M. , Spence, J. D. and

Hegele, R. A. (2004). Single nucleotidepolymorphism in CTH associated with variation in plasma homocysteine concentration.

Clin. Genet. Wessel, J. and

Schork, N. J. (2006). Generalized genomic distance-based regressionmethodology for multilocus association analysis.

Am. J. Hum. Genet. Wu, M. C. , Lee, S. , Cai, T. , Li, Y. , Boehnke, M. and

Lin, X. (2011). Rare-variantassociation testing for sequencing data with the sequence kernel association test.

Am.J. Hum. Genet. Zeng, D. and

Lin, D. Y. (2006). Eﬃcient estimation of semiparametric transformationmodels for counting processes.

Biometrika Zhong, P.-S. and

Chen, S. X. (2011). Tests for high-dimensional regression coeﬃcientswith factorial designs.

J. Amer. Statist. Assoc.

J.-Y. TzengDepartment of StatisticsNorth Carolina State UniversityRaleigh, North Carolina 27695-8203USAandDepartment of StatisticsNational Cheng Kung UniversityTainan CityTaiwan (R.O.C.)E-mail: [email protected]

W. LuDepartment of StatisticsNorth Carolina State UniversityRaleigh, North Carolina 27695-8203USAE-mail: [email protected] J.-Y. TZENG, W. LU AND F.-C. HSU