Variable Selection for Survival Data with a Class of Adaptive Elastic Net Techniques
By Md Hasinur Rahaman Khan
University of Dhaka and
By J. Ewart H. Shaw
University of Warwick
The accelerated failure time (AFT) models have proved useful in many contexts, though heavy censoring (as for example in cancer survival) and high dimensionality (as for example in microarray data) cause difficulties for model fitting and model selection. We propose new approaches to variable selection for censored data, based on AFT models optimized using regularized weighted least squares. The regularized technique uses a mixture of $\ell_1$ and $\ell_2$ norm penalties under two proposed elastic net type approaches. One is the adaptive elastic net and the other is the weighted elastic net. The approaches extend the original approaches proposed by Ghosh [Technical Report (2007) PR no. 07-01], and Hong and Zhang [Math. Model. of Natu. Phen. (2010) 115-133] respectively. We also extend the two proposed approaches by adding censored observations as constraints in their model optimization frameworks. The approaches are evaluated on microarray data and by simulation. We compare the performance of these approaches with six other variable selection techniques: three are generally used for censored data and the other three are correlation-based greedy methods used for high-dimensional data.

Keywords and phrases: Adaptive elastic net, AFT, variable selection, Stute's weighted least squares, weighted elastic net.
1. Introduction.
The practical importance of variable selection is huge and well recognized in many disciplines, and has been the focus of much research. Several variable selection techniques have been developed for linear regression models; some of these have been extended to censored survival data. The methods include stepwise selection [Peduzzi, Hardy and Holford (1980)], and penalized likelihood based techniques, such as Akaike's information criterion (AIC) [Akaike (1973)], bridge regression [Frank and Friedman (1993)], the least absolute shrinkage and selection operator (lasso) [Tibshirani (1996)], smoothly clipped absolute deviation (SCAD) [Fan and Li (2001)], least angle regression selection (LARS) [Efron et al. (2004)], the elastic net [Zou and Hastie (2005)], MM algorithms [Hunter and Li (2005)]
that are based on extensions of the well-known class of EM algorithms, the group lasso [Yuan and Lin (2006)], the Dantzig selector [Candes and Tao (2007)] that is based on a selector that minimizes the $\ell_1$ norm of the coefficients subject to a constraint on the error terms, and MC+ [Zhang (2010)] that is based on a minimax concave penalty and penalized linear unbiased selection. Stability selection as proposed in Meinshausen and Bühlmann (2010) is a variable selection technique that is based on subsampling in combination with (high-dimensional) selection algorithms. It is also used as a technique to improve variable selection performance for a range of selection methods.

Recently there has been a surge of interest in variable selection with ultra-high dimensional data. By ultra-high dimension, Fan and Lv (2008) meant that the dimensionality grows exponentially in the sample size, i.e., $\log(p) = O(n^a)$ for some $a \in (0, 1/2)$. Cho and Fryzlewicz (2012) proposed a tilting procedure that provides an adaptive choice between the use of marginal correlation and tilted correlation for each variable, where the choice is made depending on the values of the hard-thresholded sample correlation of the design matrix.

The Cox model [Cox (1972)] with high-dimensional data has been the focus of many variable selection studies. For example, Tibshirani (1997) developed a regularized Cox regression by applying an $\ell_1$ lasso penalty to the partial likelihood, Faraggi and Simon (1998) proposed a Bayesian variable selection method, Fan and Li (2002) developed a non-concave penalized likelihood approach, Li and Luan (2003) used kernel transformations, Gui and Li (2005) introduced a threshold gradient descent regularization estimation method, and Antoniadis, Fryzlewicz and Letue (2010) developed a variable selection approach for the Cox model based on the Dantzig selector.

There are also some variable selection studies for AFT models. For example, Huang et al. (2006) used lasso regularization for estimation and variable selection in the AFT model based on the inverse probability of censoring. The lasso regularized Buckley-James method for the AFT model is investigated by Huang and Harrington (2005) and Datta, Le-Rademacher and Datta (2007). Sha, Tadesse and Vannucci (2006) developed a Bayesian variable selection approach. Variable selection using the elastic net is investigated in Wang et al. (2008). Engler and Li (2009), and Cai, Huang and Tian (2009), proposed variable selection using the lasso regularized rank based estimator. Huang and Ma (2010) used a bridge method for variable selection. Hu and Rao (2010) proposed a sparse penalization technique with censoring constraints. Recently, Khan and Shaw (2013c) proposed a variable selection technique for the AFT model that is based on a synthesis of the Buckley-James method and the Dantzig selector.

In this paper, we consider variable selection methods for the AFT modeling of censored data, and propose new regularized Stute's Weighted Least Squares (SWLS) approaches. We introduce classes of elastic net type regularized variable selection techniques based on SWLS. The classes include an adaptive elastic net, a weighted elastic net and two extended versions that are carried out by introducing censoring constraints into the optimization function.

The rest of the paper is structured as follows. Section 2 provides the regularized framework of SWLS.
Section 3 presents the proposed variable selection methods, including variable selection criteria and a prediction formula. All the methods are demonstrated with two simulated examples in Section 4 and with one real microarray data example in Section 5. In Sections 4 and 5 we also compare the performance of the proposed approaches with three other variable selection approaches (the typical elastic net implemented for weighted data, the adaptive elastic net for censored data [Engler and Li (2009)], and a Bayesian approach [Sha, Tadesse and Vannucci (2006)]) and three correlation-based greedy variable selection methods generally used for high-dimensional data: sure independence screening [Fan and Lv (2008)], tilted correlation screening [Cho and Fryzlewicz (2012)], and PC-simple [Bühlmann, Kalisch and Maathuis (2010)].
2. Methodology.
2.1. Regularized SWLS (Stute's Weighted Least Squares) and Censoring Constraints.
The objective function of Stute's weighted least squares for a typical AFT model is given by

(2.1)    \arg\min_{(\alpha,\beta)} \Big[ \sum_{i=1}^{n} w_i \big( Y_{(i)} - \alpha - X_{(i)}^T \beta \big)^2 \Big],

where $Y_{(i)}$ is the log survival time for the $i$-th ordered observation, $X_{(i)}$ is the corresponding covariate vector, $\alpha$ is the intercept term, $\beta$ is the unknown $p \times 1$ vector of regression coefficients, and the $w_i$ are the Kaplan-Meier (K-M) weights, obtained by

(2.2)    w_1 = \frac{\delta_{(1)}}{n}, \qquad w_i = \frac{\delta_{(i)}}{n-i+1} \prod_{j=1}^{i-1} \Big( \frac{n-j}{n-j+1} \Big)^{\delta_{(j)}}, \quad i = 2, \cdots, n.

As discussed in Huang et al. (2006), the SWLS method is computationally more amenable to high-dimensional covariates than the B-J estimator [Buckley and James (1979)] or a rank based estimator [e.g., Ying (1993)]. This is because the OLS structure makes it computationally efficient to apply a regularized method in the AFT model. The method has rigorous theoretical justifications under reasonable assumptions.

In matrix notation the objective function of SWLS (2.1) is given by

(2.3)    \frac{1}{2} (Y - \alpha - X\beta)^T w (Y - \alpha - X\beta),

where $w$ is the $n \times n$ diagonal weight matrix. Let the uncensored and censored data be subscripted by $u$ and $\bar{u}$ respectively. Thus the numbers of uncensored and censored observations are denoted by $n_u$ and $n_{\bar u}$, the predictor and response observations for censored data by $X_{\bar u}$ and $Y_{\bar u}$, and the unobserved true failure times for censored observations by $T_{\bar u}$. Under right censoring $Y_{\bar u} < \log(T_{\bar u})$, which is equivalent to the constraints $Y_{\bar u} \le \alpha + X_{\bar u}\beta$; these can be added to the SWLS objective function (2.3). These constraints are called censoring constraints [Hu and Rao (2010)], and may be too stringent due to random noise. This suggests modifying the constraints to $Y_{\bar u} \le \alpha + X_{\bar u}\beta + \xi$, where $\xi$ is a vector of non-negative values that measure the severities of violations of the constraints. The SWLS objective function now can be defined by

(2.4)    L(\alpha, \beta) = \frac{1}{2}(Y_u - \alpha - X_u\beta)^T w_u (Y_u - \alpha - X_u\beta) + \frac{\lambda}{n}\, \xi^T \xi, \quad \text{subject to } Y_{\bar u} \le \alpha + X_{\bar u}\beta + \xi,

where $\lambda$ is a positive value that accounts for the penalties of violations of constraints, and $n$ is included for scaling to match the $w_u$.

An intercept term $\alpha$ typically is included in the AFT model. However, for notational convenience, we can remove $\alpha$ by (weighted) standardisation of the predictors and response. The weighted means are defined by

\bar{X}_w = \frac{\sum_{i=1}^n w_i X_{(i)}}{\sum_{i=1}^n w_i}, \qquad \bar{Y}_w = \frac{\sum_{i=1}^n w_i Y_{(i)}}{\sum_{i=1}^n w_i}.

Then the adjusted predictors and responses are defined by

X_{w(i)} = w_i^{1/2} (X_{(i)} - \bar{X}_w), \qquad Y_{w(i)} = w_i^{1/2} (Y_{(i)} - \bar{Y}_w).

For simplicity, we still use $X_{(i)}$ and $Y_{(i)}$ to denote the weighted and centered values and $(Y_{(i)}, \delta_{(i)}, X_{(i)})$ to denote the weighted data. The objective function of SWLS (2.3) therefore becomes

(2.5)    L(\beta) = \frac{1}{2}(Y_u - X_u\beta)^T (Y_u - X_u\beta).

So, it is easy to show that the SWLS in Equation (2.1) is equivalent to the OLS estimator without intercept on the weighted data with K-M weights. Unfortunately, OLS estimation does not perform well with variable selection, and is simply infeasible when $p > n$. Hence the need to introduce various regularized methods that improve OLS, such as the lasso [Tibshirani (1996)], the elastic net [Zou and Hastie (2005)], and the Dantzig selector [Candes and Tao (2007)]. Many of these regularized methods are developed for data where $p > n$ and the coefficient vector is sparse.
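For concreteness, the K-M weights of Equation (2.2) can be computed directly; the following is a minimal R sketch (the function name and interface are our own illustration, not the AdapEnetClass package API):

```r
# Kaplan-Meier (Stute) weights of Equation (2.2).
# y: observed log-times; delta: censoring indicators (1 = uncensored).
km_weights <- function(y, delta) {
  n <- length(y)
  ord <- order(y)                       # work with the ordered sample
  d <- delta[ord]
  w <- numeric(n)
  w[1] <- d[1] / n
  prod_term <- 1
  for (i in 2:n) {
    # multiply in the factor for j = i - 1 of the product in (2.2)
    prod_term <- prod_term * ((n - i + 1) / (n - i + 2))^d[i - 1]
    w[i] <- d[i] / (n - i + 1) * prod_term
  }
  w[order(ord)]                         # return to the original ordering
}
```

The weighted centering above then follows directly, e.g. `Xw <- sqrt(w) * sweep(X, 2, colSums(w * X) / sum(w))` for the predictors, and analogously for the response.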
The general form of the regularized WLS objective function is therefore defined by

(2.6)    L(\beta, \lambda) = \frac{1}{2}(Y_u - X_u\beta)^T (Y_u - X_u\beta) + \lambda\, \text{pen}(\beta),

where $\lambda$ is the (scalar or vector) penalty parameter and the penalty quantity pen($\beta$) is typically set so that it controls the complexity of the model. For example, the penalties pen($\beta$) for ridge, lasso and elastic net are defined as

\sum_{j=1}^p \beta_j^2, \qquad \sum_{j=1}^p |\beta_j|, \qquad \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2,

respectively. This type of regularized WLS with the lasso penalty was studied recently by Huang et al. (2006). Hu and Rao (2010) proposed a regularized WLS method called CCLASSO that uses a combination of the ridge and lasso penalties. The objective function of the regularized WLS method with censoring constraints then becomes

(2.7)    L(\beta, \lambda_1, \lambda_2) = \frac{1}{2}(Y_u - X_u\beta)^T (Y_u - X_u\beta) + \lambda_1\, \text{pen}(\beta) + \lambda_2\, \xi^T \xi, \quad \text{subject to } Y_{\bar u} \le X_{\bar u}\beta + \xi.
3. Proposed Model Framework.
3.1. The Regularized WLS: Adaptive Elastic Net (AEnet).
The elastic net [Zou and Hastie (2005)] has proved useful when analysing data with very many correlated covariates. The $\ell_1$ part of the penalty for the elastic net generates a sparse model. On the other hand, the quadratic part of the penalty removes the limitation on the number of selected variables when $p \gg n$. The quadratic part of the penalty also stabilizes the $\ell_1$ regularization path and shrinks the coefficients of correlated predictors towards each other, allowing them to borrow strength from each other. The elastic net cannot be applied directly to AFT models because of censoring, but the regularized WLS (2.6) with the elastic net penalty overcomes this problem. The naive elastic net estimator $\hat\beta$ for censored data is obtained as

(3.1)    \arg\min_\beta \Big\{ \frac{1}{2}(Y_u - X_u\beta)^T (Y_u - X_u\beta) + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2\, \beta^T\beta \Big\}.

With some algebra this naive elastic net can be transformed into a lasso type problem in an augmented space as below:

\arg\min_\beta \Big\{ \frac{1}{2}(Y_u^* - X_u^*\beta)^T (Y_u^* - X_u^*\beta) + \lambda_1 \sum_{j=1}^p |\beta_j| \Big\}, \quad \text{where } X_u^* = \begin{pmatrix} X_u \\ \sqrt{\lambda_2}\, I \end{pmatrix} \text{ and } Y_u^* = \begin{pmatrix} Y_u \\ 0 \end{pmatrix}.

The original elastic net estimator is now defined by

\hat\beta(\text{elastic net}) = (1 + \lambda_2)\, \hat\beta(\text{naive elastic net}).

It is established in Zou (2006) that the lasso does not exhibit the oracle properties. These properties include that the method selects the correct subset of predictors with probability tending to one, and estimates the non-zero parameters as efficiently as would be possible if we knew which variables were uninformative ahead of time. A modification that does satisfy the oracle properties, the adaptive elastic net, was studied in Ghosh (2007) and Zou and Zhang (2009), building on the adaptive lasso of Zou (2006). The adaptive elastic net is a convex combination of the adaptive lasso penalty and the ridge penalty. Here we present an adaptive elastic net approach designed for censored data that is referred to as the AEnet approach throughout the paper. We introduce the adaptive elastic net penalty terms, including coefficient weights, to the regularized WLS objective function (2.6):

(3.2)    \arg\min_\beta \Big\{ \frac{1}{2}(Y_u - X_u\beta)^T (Y_u - X_u\beta) + \lambda_1 \sum_{j=1}^p \hat{w}_j |\beta_j| + \lambda_2\, \beta^T\beta \Big\},

where $\hat{w} = 1/|\hat\beta|^\gamma$ is the adaptive weight based on the initial estimator $\hat\beta$ for some $\gamma$. For the rest of this paper $\hat{w}$ will denote weights obtained from an initial estimator. For the initial estimator $\hat\beta$, the OLS estimator as suggested in Ghosh (2007) or the elastic net estimator as suggested in Zou and Zhang (2009) can be used. For this study we use $\gamma = 1$ and the elastic net estimator on the weighted data as given by Equation (3.1) as the initial estimator $\hat\beta$.

The adaptive elastic net can be transformed into an adaptive lasso type problem in an augmented space in a similar way as for the naive elastic net:

(3.3)    \hat\beta^*_{a\text{-nenet}} = \arg\min_\beta \Big\{ \frac{1}{2}(Y_u^* - X_u^*\beta)^T (Y_u^* - X_u^*\beta) + \lambda_1 \sum_{j=1}^p \hat{w}_j |\beta_j| \Big\},

where $X_{u\,(n_u+p)\times p}^* = \begin{pmatrix} X_u \\ \sqrt{\lambda_2}\, I \end{pmatrix}$ and $Y_{u\,(n_u+p)\times 1}^* = \begin{pmatrix} Y_u \\ 0 \end{pmatrix}$. So, for fixed $\lambda_2$ the adaptive elastic net is equivalent to an adaptive lasso in the augmented space. The adaptive lasso estimates in (3.3) can be solved for fixed $\lambda_2$ by the LARS algorithm [Efron et al. (2004)]. According to Theorem 2 in Zou (2006) and Theorem 3.2 in Ghosh (2007), the estimator $\hat\beta^*_{a\text{-nenet}}$ is asymptotically normal. The adaptive elastic net estimate can then be obtained by rescaling the estimate found in Equation (3.3):

\hat\beta^*_{a\text{-enet}} = (1 + \lambda_2)\, \hat\beta^*_{a\text{-nenet}}.

3.1.1. AEnet Algorithm.
The algorithm for the proposed adaptive elastic net approach, shown below, is referred to as the AEnet algorithm.
Input:
Design matrix $X_u^*$, response $Y_u^*$, a fixed grid for $\lambda_2$, and $\hat{w}$.
1. Define $X_{j(u)}^{**} = X_{j(u)}^* / \hat{w}_j$, $j = 1, \cdots, p$.
2. Solve the lasso problem for all $\lambda_1$ and a fixed $\lambda_2$:
$\hat\beta^{**}_{a\text{-nenet}} = \arg\min_\beta (Y_u^* - X_u^{**}\beta)^T (Y_u^* - X_u^{**}\beta) + \lambda_1 \sum_{j=1}^p |\beta_j|$.
3. Calculate $\hat\beta_{j\,a\text{-enet}}^* = (1 + \lambda_2)\, \hat\beta_{j\,a\text{-nenet}}^{**} / \hat{w}_j$.

To find the optimal values of the tuning parameters $(\lambda_1, \lambda_2)$, $\lambda_2$ is typically assumed to take values in a relatively small grid, say (0, 0.5, 1.0, 1.5, 2.0, $\cdots$, 5). For each $\lambda_2$, the LARS algorithm produces the entire solution path. This gives the optimal equivalent specification for the lasso in terms of the fraction of the $\ell_1$ norm ($t$). Then the optimal pair $(t, \lambda_2)$ is obtained using $k$-fold cross-validation.
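Steps 1-3 map directly onto the LARS implementation in the R package lars; the sketch below is illustrative only (the helper name aenet_fit and the choice of reporting the path fit at a given $\ell_1$ fraction are our assumptions, not the AdapEnetClass interface):

```r
library(lars)  # LARS algorithm of Efron et al. (2004)

# One AEnet fit for a fixed lambda2, following steps 1-3 above.
# Xu, Yu: weighted, centred uncensored data; w_hat: adaptive weights.
aenet_fit <- function(Xu, Yu, w_hat, lambda2, frac) {
  p <- ncol(Xu)
  # augmented design and response that turn the elastic net into a lasso
  Xs <- rbind(Xu, sqrt(lambda2) * diag(p))
  Ys <- c(Yu, rep(0, p))
  # step 1: rescale columns by the adaptive weights
  Xss <- sweep(Xs, 2, w_hat, "/")
  # step 2: LARS gives the whole lasso path; pick the fit at fraction 'frac'
  fit <- lars(Xss, Ys, type = "lasso", intercept = FALSE, normalize = FALSE)
  beta_nenet <- coef(fit, s = frac, mode = "fraction")
  # step 3: undo the rescaling and apply the (1 + lambda2) correction
  (1 + lambda2) * beta_nenet / w_hat
}
```

In practice one would loop this over the $\lambda_2$ grid and the $\ell_1$ fractions, choosing the pair by $k$-fold cross-validation as described above.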
3.2. The Regularized WLS: Adaptive Elastic Net with Censoring Constraints (AEnetCC).
Here we present an extension of the above adaptive elastic net approach that allows the censoring constraints to be implemented in the optimization framework. The adaptive elastic net estimator for censored data given by (3.2) can be rewritten with censoring constraints as

(3.4)    \tilde\beta^*_{a\text{-enet}} = \arg\min_{\beta, \xi} \frac{1}{2}(Y_u - X_u\beta)^T (Y_u - X_u\beta) + \lambda_2\, \beta^T\beta + \lambda_3\, \xi^T\xi, \quad \text{subject to } \sum_{j=1}^p \hat{w}_j |\beta_j| \le t \text{ and } Y_{\bar u} \le X_{\bar u}\beta + \xi,

where $t$ is the lasso tuning parameter. We use a quadratic programming (QP) approach to solve the minimization problem (3.4). The QP cannot handle $|\beta_j|$ directly because the lasso constraint ($\sum_{j=1}^p |\beta_j| \le t$) makes the QP solutions nonlinear in the $Y_i$. Further modification is needed to use $|\beta_j|$ in the QP framework. Following Tibshirani (1996) we use a modified design matrix $\tilde{X} = [X, -X]$ and represent the coefficients $\beta$ as the difference between two non-negative coefficient vectors $\beta^+$ and $\beta^-$. Although the technique doubles the number of variables in the problem, it requires only a (known and bounded) linear number of constraints, and only requires the solution to one QP problem. Now Equation (3.4) becomes

(3.5)    \tilde\beta^*_{a\text{-enet}} = \arg\min_{\beta^+, \beta^-, \xi} \frac{1}{2}[Y_u - X_u(\beta^+ - \beta^-)]^T [Y_u - X_u(\beta^+ - \beta^-)] + \lambda_2\, \beta^{+T}\beta^+ + \lambda_2\, \beta^{-T}\beta^- + \lambda_3\, \xi^T\xi,
subject to $\sum_{j=1}^p \hat{w}_j (\beta_j^+ + \beta_j^-) \le t$, $Y_{\bar u} \le X_{\bar u}(\beta^+ - \beta^-) + \xi$, $\beta^+ \ge 0$, $\beta^- \ge 0$.

According to Ghosh (2007), the estimator $\tilde\beta^*_{a\text{-enet}}$ is asymptotically normal.

3.2.1. AEnetCC Algorithm.
The algorithm for the proposed adaptive elastic net with censoring constraints approach is referred to as the AEnetCC algorithm.
Input: design matrix $X_u$, response $Y_u$, tuning parameters $(t, \lambda_2, \lambda_3)$, and $\hat{w}$.
1. Define $X_{j(u)}^{**} = X_{j(u)} / \hat{w}_j$, $j = 1, \cdots, p$.
2. Solve the elastic net problem

\tilde\beta^{**}_{a\text{-nenet}} = \arg\min_{\beta^+, \beta^-, \xi} \frac{1}{2}[Y_u - X_u^{**}(\beta^+ - \beta^-)]^T [Y_u - X_u^{**}(\beta^+ - \beta^-)] + \lambda_2\, \beta^{+T}\beta^+ + \lambda_2\, \beta^{-T}\beta^- + \lambda_3\, \xi^T\xi,
subject to $\sum_{j=1}^p (\beta_j^+ + \beta_j^-) \le t$, $Y_{\bar u} \le X_{\bar u}(\beta^+ - \beta^-) + \xi$, $\beta^+ \ge 0$, $\beta^- \ge 0$.

3. Calculate $\tilde\beta_{j\,a\text{-enet}}^* = (1 + \lambda_2)\, \tilde\beta_{j\,a\text{-nenet}}^{**} / \hat{w}_j$.

The AEnetCC approach has three tuning parameters $(\lambda_1, \lambda_2, \lambda_3)$. For this method we use the same optimal pair $(\lambda_1, \lambda_2)$ as found for AEnet. Then $\lambda_3$ is typically allowed to take values in a grid such as (0, 0.5, 1, 1.5, $\cdots$, 10), and the optimal value for $\lambda_3$ is obtained by 5-fold cross-validation. Here the value of $\lambda_3$ typically depends upon how stringently one wants the model to satisfy the censoring constraints compared to how good the prediction is for uncensored data.
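The QP in step 2 can be posed directly in the form required by the R package quadprog; below is a minimal sketch assuming $\lambda_2, \lambda_3 > 0$ (so the quadratic form is positive definite) and working with Equation (3.5) on the original scale. The function name aenetcc_fit is our own, not the paper's implementation:

```r
library(quadprog)

# One AEnetCC fit by quadratic programming, per Equation (3.5).
# Xu, Yu: uncensored part; Xc, Yc: censored part; all on the weighted scale.
aenetcc_fit <- function(Xu, Yu, Xc, Yc, w_hat, t, lambda2, lambda3) {
  p  <- ncol(Xu); nc <- nrow(Xc)
  # variables z = (beta_plus, beta_minus, xi)
  M  <- cbind(Xu, -Xu, matrix(0, nrow(Xu), nc))
  D  <- crossprod(M) +
        diag(c(rep(2 * lambda2, 2 * p), rep(2 * lambda3, nc)))
  d  <- c(crossprod(Xu, Yu), -crossprod(Xu, Yu), rep(0, nc))
  # constraints, written so that A %*% z >= b:
  #   -sum_j w_j (beta+_j + beta-_j) >= -t     (lasso constraint)
  #    Xc (beta+ - beta-) + xi       >= Yc     (censoring constraints)
  #    beta+, beta-, xi              >= 0
  A  <- rbind(c(-w_hat, -w_hat, rep(0, nc)),
              cbind(Xc, -Xc, diag(nc)),
              diag(2 * p + nc))
  b  <- c(-t, Yc, rep(0, 2 * p + nc))
  z  <- solve.QP(Dmat = D, dvec = d, Amat = t(A), bvec = b)$solution
  beta <- z[1:p] - z[(p + 1):(2 * p)]
  (1 + lambda2) * beta                 # step-3 rescaling
}
```

The objective ½z'Dz − d'z expands to ½||Yu − Xu(β⁺ − β⁻)||² plus the two ridge terms and λ₃ξ'ξ, up to a constant, so the QP solution coincides with (3.5).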
3.3. The Regularized WLS: Weighted Elastic Net (WEnet). In this section we present a weighted elastic net for censored data. This is an extension of the adaptive elastic net where, for suitable weights $w$, the ridge penalty term is expressed as $\sum_{j=1}^p (w_j\beta_j)^2$ instead of $\sum_{j=1}^p \beta_j^2$. This is a doubly adaptive type model. This type of regularized technique for uncensored data was first studied in Hong and Zhang (2010). They established the model consistency and its oracle property under some regularity conditions. Following the regularized WLS given in (2.6), the weighted elastic net for censored data can be defined by

(3.6)    \arg\min_\beta \frac{1}{2}(Y_u - X_u\beta)^T (Y_u - X_u\beta) + n_u \lambda_1 \sum_{j=1}^p w_j |\beta_j| + n_u \lambda_2 \sum_{j=1}^p (w_j \beta_j)^2,

where $w_j > 0$, $j = 1, \cdots, p$ are the weighted penalty coefficients. The weights are typically chosen as the standard deviations of the associated estimators [Hong and Zhang (2010)]. Since standard deviations are unknown in practice, we use the standard errors of an initial consistent estimator. For estimating standard errors under high-dimensional data we use a bootstrap procedure based on the elastic net model on the weighted data (3.1). For data where $n > p$, as in Jin et al. (2003) and Jin, Lin and Ying (2006), we choose the Gehan type rank estimator as the initial estimator. This is defined [Gehan (1965)] as the solution to the system of estimating equations $0 = U_G(\beta)$, where

(3.7)    U_G(\beta) = n^{-2} \sum_{i=1}^n \sum_{i'=1}^n \delta_i (X_i - X_{i'})\, I\{\xi_i(\beta) \le \xi_{i'}(\beta)\},

and $\xi_i(\beta) = Y_i - X_i^T \beta$. Note that Equation (3.7) can be expressed as the $p$-dimensional gradient of the convex loss function $n^{-2} L_G(\beta)$, where

L_G(\beta) = \sum_{i=1}^n \sum_{i'=1}^n \delta_i \{\xi_i(\beta) - \xi_{i'}(\beta)\}^-, \qquad a^- = 1\{a < 0\}\, |a|.

Similarly to the adaptive elastic net, the weighted elastic net can be transformed into a weighted lasso type problem on an augmented data set. We rewrite Equation (3.6) with a scaled coefficient as

(3.8)    \hat\beta_{w\text{-enet}} = \arg\min_\beta (Y_u - X_u\beta)^T (Y_u - X_u\beta) + \lambda_1 \sum_{j=1}^p w_j |\beta_j| + \lambda_2 \sum_{j=1}^p (w_j \beta_j)^2

(3.9)    = \arg\min_\beta \bigg[ \begin{pmatrix} Y_u \\ 0 \end{pmatrix} - (1+\lambda_2)^{-1/2} \begin{pmatrix} X_u \\ \sqrt{\lambda_2}\, W \end{pmatrix} \sqrt{1+\lambda_2}\,\beta \bigg]^T \bigg[ \begin{pmatrix} Y_u \\ 0 \end{pmatrix} - (1+\lambda_2)^{-1/2} \begin{pmatrix} X_u \\ \sqrt{\lambda_2}\, W \end{pmatrix} \sqrt{1+\lambda_2}\,\beta \bigg] + \frac{\lambda_1}{\sqrt{1+\lambda_2}} \sum_{j=1}^p w_j \sqrt{1+\lambda_2}\, |\beta_j|,

where $W = \text{diag}[w_1, \cdots, w_p]$. Now assume that

X_{u\,(n_u+p)\times p}^* = (1+\lambda_2)^{-1/2} \begin{pmatrix} X_u \\ \sqrt{\lambda_2}\, W \end{pmatrix}, \quad Y_{u\,(n_u+p)\times 1}^* = \begin{pmatrix} Y_u \\ 0 \end{pmatrix}, \quad \tilde\lambda = \frac{\lambda_1}{\sqrt{1+\lambda_2}}, \quad \beta^* = \sqrt{1+\lambda_2}\; \beta.

Then the estimator in (3.9) with the new notation becomes

(3.10)    \hat\beta^*_{w\text{-enet}} = \arg\min_{\beta^*} (Y_u^* - X_u^*\beta^*)^T (Y_u^* - X_u^*\beta^*) + \tilde\lambda \sum_{j=1}^p w_j |\beta^*_j|.

So, for fixed $\lambda_2$, the weighted elastic net can be transformed into an adaptive lasso problem (3.10) in an augmented space. The estimator

(3.11)    \hat\beta_{j\,w\text{-enet}} = \hat\beta^*_{j\,w\text{-enet}} / \sqrt{1+\lambda_2}

is a naive elastic net estimator. The original weighted elastic net estimator $\hat\beta_{j\,w\text{-enet}}$ is therefore obtained by

(3.12)    \hat\beta_{j\,w\text{-enet}} = \sqrt{1+\lambda_2}\; \hat\beta^*_{j\,w\text{-enet}}.
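As a concrete illustration of the $n > p$ initial estimator, the Gehan loss $L_G(\beta)$ of Equation (3.7) can be evaluated directly; a minimal R sketch (function name ours), which could then be minimized numerically, e.g. with optim, to obtain the initial estimate:

```r
# Gehan convex loss L_G(beta), whose gradient (up to scaling) is U_G of
# Equation (3.7); {a}^- = |a| 1{a < 0} = max(-a, 0).
gehan_loss <- function(beta, X, Y, delta) {
  xi <- as.vector(Y - X %*% beta)   # residuals xi_i(beta)
  d  <- outer(xi, xi, "-")          # d[i, i'] = xi_i - xi_i'
  sum(delta * pmax(-d, 0))          # rows weighted by delta_i
}
```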
3.3.1. WEnet Algorithm. The algorithm for the proposed weighted elastic net approach is referred to as the WEnet algorithm.
Input:
Design matrix $X_u^*$, response $Y_u^*$, $\tilde\lambda$, and $\hat{w}_j$ for $j = 1, \cdots, p$.
1. Define $X_{j(u)}^{**} = X_{j(u)}^* / \hat{w}_j$, $j = 1, \cdots, p$.
2. Solve the lasso problem for all $\lambda_1$ and a fixed $\lambda_2$:
$\hat\beta^{**}_{w\text{-nenet}} = \arg\min_\beta (Y_u^* - X_u^{**}\beta)^T (Y_u^* - X_u^{**}\beta) + \tilde\lambda \sum_{j=1}^p |\beta_j|$.
3. Calculate $\hat\beta^*_{j\,w\text{-enet}} = \hat\beta^{**}_{j\,w\text{-nenet}} / \hat{w}_j$.

Output: $\hat\beta_{w\text{-enet}} = \sqrt{1+\lambda_2}\; \hat\beta^*_{w\text{-enet}}$.

To find the optimal values of the tuning parameters $(\lambda_1, \lambda_2)$, $\lambda_2$ is typically assumed to take values in a relatively small grid, similar to the grid used for the AEnet algorithm. To optimize the tuning parameters $(\lambda_1, \lambda_2)$ for WEnet we follow exactly the same procedure as described in the previous section for the AEnet.
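The penalty weights above require standard errors of an initial estimator; for $p > n$ the paper bootstraps the elastic net fit (3.1). A rough sketch of that step, assuming the glmnet package as the elastic net solver and a 0.632-type subsample as used later for variance estimation (the function name, the mixing value alpha = 0.5, and the choice of the smallest path lambda are all our assumptions):

```r
library(glmnet)

# Bootstrap standard errors of an initial elastic net fit, used as the
# WEnet penalty weights w_j; subsamples of size ~0.632*n are drawn
# without replacement.
wenet_weights <- function(Xu, Yu, B = 200) {
  m <- ceiling(0.632 * nrow(Xu))
  betas <- replicate(B, {
    idx <- sample(nrow(Xu), size = m)              # without replacement
    fit <- glmnet(Xu[idx, ], Yu[idx], alpha = 0.5) # elastic net path
    as.vector(coef(fit, s = min(fit$lambda))[-1])  # drop the intercept
  })
  apply(betas, 1, sd)                              # SE of each beta_j
}
```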
3.4. The Regularized WLS: Weighted Elastic Net with Censoring Constraints (WEnetCC). Following the adaptive elastic net with censoring constraints, the weighted elastic net model with censoring constraints can be defined by

\tilde\beta^*_{w\text{-enet}} = \arg\min_{\beta, \xi} \frac{1}{2}(Y_u - X_u\beta)^T (Y_u - X_u\beta) + \lambda_2\, (W\beta)^T (W\beta) + \lambda_3\, \xi^T\xi, \quad \text{subject to } \sum_{j=1}^p \hat{w}_j |\beta_j| \le t \text{ and } Y_{\bar u} \le X_{\bar u}\beta + \xi.

Now, after representing $\beta^*$ as the difference between two non-negative coefficient vectors $\beta^{*+}$ and $\beta^{*-}$, the above minimization becomes

(3.13)    \tilde\beta^{**}_{w\text{-enet}} = \arg\min_{\beta^{*+}, \beta^{*-}, \xi} (Y_u^* - X_u^*(\beta^{*+} - \beta^{*-}))^T (Y_u^* - X_u^*(\beta^{*+} - \beta^{*-})) + \lambda_3\, \xi^T\xi,
subject to $\sum_{j=1}^p w_j (\beta_j^{*+} + \beta_j^{*-}) \le t$ and $Y_{\bar u} \le X_{\bar u}(\beta^{*+} - \beta^{*-}) + \xi$.

3.4.1. WEnetCC Algorithm.
Below we present the algorithm for the proposed weighted elastic net with censoring constraints approach, which is referred to as the WEnetCC algorithm.
Input:
Design matrix $X_u^*$, response $Y_u^*$, $\tilde\lambda$, and $\hat{w}_j$ for $j = 1, \cdots, p$.
1. Define $X_{j(u)}^{**} = X_{j(u)}^* / \hat{w}_j$, $j = 1, \cdots, p$.
2. Solve the lasso problem

\tilde\beta^{**}_{w\text{-enet}} = \arg\min_{\beta^{*+}, \beta^{*-}, \xi} (Y_u^* - X_u^{**}(\beta^{*+} - \beta^{*-}))^T (Y_u^* - X_u^{**}(\beta^{*+} - \beta^{*-})) + \lambda_3\, \xi^T\xi,
subject to $\sum_{j=1}^p (\beta_j^{*+} + \beta_j^{*-}) \le t$ and $Y_{\bar u} \le X_{\bar u}(\beta^{*+} - \beta^{*-}) + \xi$.

3. Calculate $\tilde\beta^*_{j\,w\text{-enet}} = \tilde\beta^{**}_{j\,w\text{-enet}} / \hat{w}_j$.

Output: $\tilde\beta_{w\text{-enet}} = \sqrt{1+\lambda_2}\; \tilde\beta^*_{w\text{-enet}}$.

The estimator in the second step is obtained by optimizing the QP problem. To obtain the optimal tuning parameters $(\lambda_1, \lambda_2, \lambda_3)$ of the WEnetCC we follow exactly the same procedure as for the AEnetCC algorithm. According to Theorem 2 in Zou (2006) and
Theorem 3.2 in Ghosh (2007), both estimators $\hat\beta_{j\,w\text{-enet}}$ and $\tilde\beta_{j\,w\text{-enet}}$ are asymptotically normal.

Remark. Under some regularity conditions, the WLS estimator with K-M weights is consistent and asymptotically normal [Stute (1993, 1996)]. The proof is not directly applicable to the lasso penalty since the lasso penalty is not differentiable. In Huang et al. (2006) it is shown that the regularized WLS estimator with the lasso penalty has the asymptotic normality property. In their proof they added two conditions additional to the regularity conditions mentioned in Stute (1993, 1996). The two additional conditions are (i) the regularized WLS lasso estimator has finite variance, and (ii) the bias of the K-M integrals is of the order $O(n^{-1/2})$, which is related to the level of censoring and to the tail behavior of the K-M estimator.
3.5. Variable Selection Criteria.
For AEnet and WEnet we use LARS, which produces exact zero coefficients in the solution paths and hence does parsimonious variable selection. For the remaining two methods, AEnetCC and WEnetCC, we use an approach that simultaneously removes several variables that have very small coefficients.

We use an AIC$_c$ type score based on the weighted $k$-fold cross-validation error CV-S, which is the sum of squared residuals of the uncensored data multiplied by the K-M weights, i.e. $(Y_u - X_u\hat\beta)^T w_u (Y_u - X_u\hat\beta)$. The AIC$_c$ score is defined by

(3.14)    \text{AIC}_c\ \text{score} = n_u \log(\text{CV-S}) + 2k \Big( \frac{n_u}{n_u - k - 1} \Big).
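In code, the score of Equation (3.14) is a one-liner; here k denotes the number of predictors in the candidate model (a minimal sketch, function name ours):

```r
# AICc-type model score of Equation (3.14).
# cvs: weighted k-fold cross-validation error CV-S;
# n_u: number of uncensored observations; k: model size.
aicc_score <- function(cvs, n_u, k) {
  n_u * log(cvs) + 2 * k * n_u / (n_u - k - 1)
}
```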
3.5.1. Variable Selection Algorithm for AEnetCC and WEnetCC. The above AIC$_c$ score is used as the variable selection criterion.
1: Get the optimal pair $(\lambda_1, \lambda_2)$ from fitting AEnet or WEnet.
2: Fix a set of $\lambda_3$ values. Then for each $\lambda_3$:
(a) Fit the AFT model by the computational procedure (3.2.1) or (3.4.1). Find the predictor set PS by using $|\hat\beta| > \varsigma$.
(b) Use the PS and divide the dataset into $k$ parts. Leave out one part at a time and fit the AFT model by the computational procedure (3.2.1) or (3.4.1).
(c) Combine the $k$ fitted models built in step 2(b) by averaging their coefficients. Compute the CV-S and then the AIC$_c$ score.
3: Repeat step 2 until all $\lambda_3$ are exhausted. Return the model with the lowest AIC$_c$ score and the corresponding $\lambda_3$.

We choose a very small value for the precision parameter $\varsigma$ as a default, but any other suitable value can be chosen. Alternatively, $\varsigma$ could be considered as an additional tuning parameter in the above variable selection algorithm. It is also possible either to adapt an existing efficient optimization algorithm such as SCAD [Fan and Li (2001)] or LARS-EN [a modified LARS developed for the adaptive elastic net by Ghosh (2007)], or to develop a new algorithm that avoids the $\varsigma$ parameter completely.
3.6. Measures of Fit and Measures of Prediction.
The following MSE is computed to measure the fit in the training data:

(3.15)    \text{MSE}_{TR} = \frac{1}{n_u} \sum_{i=1}^n \delta_i (\hat{Y}_i - Y_i)^2,

where $n_u$ is the number of uncensored observations. This measure compares the fitted values with the true values corresponding to the uncensored observations. We first generate a training dataset, such as $Y$ in (3.15), and then a test dataset $Y_{new}$ (say) of the same size using the same design parameters. All the methods are fitted using the training data. Then, in order to get predicted values $\hat{Y}_{new}$, the fitted model is used with the $X$ matrix of the test data. We measure the prediction accuracy by

(3.16)    \text{MSE}_{TE} = \frac{1}{n} \sum_{i=1}^n (\hat{Y}_{new,i} - Y_{new,i})^2.

We estimate the variance of the regression parameters using the nonparametric 0.632 bootstrap [Efron and Tibshirani (1993)], in which one samples $\tilde{n} \approx 0.632 \times n$ from the $n$ observations without replacement. We use $\tilde{n} \approx 0.632 \times n$ as the expected number of distinct bootstrap observations is about $0.632 \times n$. We use $B \ge 500$, and then the sample variance of the bootstrap estimates provides an estimate of the variance of $\hat\beta$.
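Both measures are direct to compute; a minimal R sketch (function names ours):

```r
# Fit measure of Equation (3.15): weighted by the censoring indicator,
# so only uncensored observations contribute; n_u = sum(delta).
mse_train <- function(Y_hat, Y, delta) sum(delta * (Y_hat - Y)^2) / sum(delta)

# Prediction measure of Equation (3.16) on an independent test set.
mse_test <- function(Y_hat_new, Y_new) mean((Y_hat_new - Y_new)^2)
```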
4. Simulation Studies.
The purpose of this section is to evaluate and compare the performance of the proposed approaches using simulation studies and a real data example. We use six existing model selection approaches for comparison purposes. However, the aim is not to address which approach is superior; rather we identify the similarities among the methods and accordingly provide some suggestions about the situations where one approach may outperform the others. The six approaches are used for the last simulation example and also for the real data example. For the AEnet and WEnet methods we use {0, 0.6, 1.1, 1.7, 2.2, 2.8, 3.3, 3.9, 4.4, 5.0} as the grid for $\lambda_2$. For the two censoring constraint based methods AEnetCC and WEnetCC the set {0, 1.0, 1.4, 1.8, 2.2, 2.6, 3.0} is used for the penalties of violations of constraints $\lambda_3$.

Several alternative penalized regression and Bayesian approaches to variable selection for high-dimensional censored data have been developed. We use the simple elastic net (Enet) approach as defined in Equation (3.1) on the weighted data. Another approach is the adaptive elastic net for AFT (Enet-AFT) in Engler and Li (2009). There is also an MCMC selection
The logarithm of the survival time is generatedfrom the true AFT model(4.1) Y i = α + X Ti β + σε i , i = 1 , · · · , n with ε i ∼ f ( · ), any suitable probability density function and σ , the sig-nal to noise ratio. We use correlated datasets. Censoring time is generatedfrom particular distributions maintaining a desired censoring level, P % . Weconsider three P % : 30, 50, and 70 that are indicated as low, medium andhigh respectively. We maintain random censoring except for the case thatif the largest observation is found censored ( i.e. Y +( n ) ) then we reclassify itas uncensored according to Efron’s tail correction [Efron (1967)]. This isnecessary since the WLS method involves the K − M weights that are basedon the K − M distribution function [Khan and Shaw (2013)].4.1.1.
4.1.1. Simulation I: $n = 100$, $p = 40$. We consider 40 covariates in three blocks: the $\beta$ coefficients of the first informative block are set to 5 and those of the second informative block are set to 2. We treat these two blocks as informative blocks, i.e. containing the potential covariates (say, $p_\gamma = 10$). The remaining $\beta$ coefficients (the third block) are set to zero, and we set $X \sim U(0, 1)$. Data are generated using (4.1) with $\varepsilon_i \sim N(0, 1)$ for the log-normal AFT model and using (4.1) with $\varepsilon_i \sim \log(\text{Weibull})$ for the Weibull AFT model. More precisely, in Equation (4.1) the error is $\sigma\varepsilon$ where $\sigma = 1$ and

\varepsilon = \frac{\log[W] - E \log[W]}{\sqrt{\text{Var}\{\log[W]\}}}, \qquad W \sim \text{Weibull}(5, 1).

For both AFT models the censoring time is generated using the log-normal distribution $\exp[N(c_0\sqrt{\sigma^2},\, (1+\sigma^2))]$. Here $c_0$ is calculated analytically to produce the chosen $P\%$. We fit all four methods, and for both AFT models 100 runs are simulated. For each covariate we record the frequency of being selected among the 100 simulation runs, and report the minimum, mean and maximum. The summary statistics for each block are shown in Table 1.
Table 1
Variable selection frequency percentages for the methods for both log-normal and Weibull AFT models. Here (*) stands for the noninformative block with zero coefficients. Entries are (Min, Mean, Max).

P% | Methods | Parameters | r_ij = 0, log-normal | r_ij = 0, Weibull | r_ij = 0.5, log-normal | r_ij = 0.5, Weibull
30 | AEnet | Block1 | (99, 99, 99) | (95, 95, 95) | (62, 69.6, 79) | (62, 66.6, 74)
30 | AEnet | Block2 | (66, 74.6, 83) | (73, 94.8, 97) | (39, 43.8, 51) | (35, 41.8, 48)
30 | AEnet | Block3* | (00, 1.6, 05) | (00, 2.8, 07) | (11, 18.5, 25) | (12, 18.9, 24)
30 | AEnetCC | Block1 | (100, 100, 100) | (100, 100, 100) | (76, 80, 82) | (80, 87, 95)
30 | AEnetCC | Block2 | (48, 53.6, 57) | (48, 51.8, 54) | (44, 46.8, 50) | (51, 54.4, 59)
30 | AEnetCC | Block3* | (00, 3.1, 06) | (02, 6.3, 12) | (17, 25.8, 33) | (22, 30, 38)
30 | WEnet | Block1 | (99, 99, 99) | (99, 99, 99) | (100, 100, 100) | (100, 100, 100)
30 | WEnet | Block2 | (63, 73.4, 82) | (74, 77.2, 79) | (51, 62.8, 75) | (55, 62.2, 68)
30 | WEnet | Block3* | (00, 1.3, 3) | (00, 1.9, 06) | (00, 2.3, 06) | (00, 03, 07)
30 | WEnetCC | Block1 | (100, 100, 100) | (99, 99.8, 100) | (93, 95.4, 97) | (96, 97, 98)
30 | WEnetCC | Block2 | (30, 36, 39) | (24, 32.4, 36) | (44, 47.6, 52) | (43, 49, 57)
30 | WEnetCC | Block3* | (00, 2.2, 5) | (00, 2.5, 06) | (04, 9.6, 15) | (03, 9.2, 14)
50 | AEnet | Block1 | (99, 99.6, 100) | (98, 98.8, 99) | (77, 81.8, 85) | (77, 81.6, 84)
50 | AEnet | Block2 | (63, 69.8, 75) | (61, 66.6, 76) | (48, 52.2, 55) | (48, 53.6, 59)
50 | AEnet | Block3* | (09, 18.2, 25) | (10, 19, 27) | (22, 31.5, 38) | (26, 31.7, 41)
50 | AEnetCC | Block1 | (97, 98.6, 100) | (98, 99.2, 100) | (53, 56.4, 59) | (52, 57.8, 65)
50 | AEnetCC | Block2 | (27, 33.6, 39) | (34, 38.2, 41) | (19, 22.6, 28) | (16, 24, 29)
50 | AEnetCC | Block3* | (00, 3.5, 08) | (01, 4.4, 10) | (05, 10.5, 18) | (04, 8.7, 13)
50 | WEnet | Block1 | (99, 99.8, 100) | (100, 100, 100) | (98, 99, 100) | (99, 99.6, 100)
50 | WEnet | Block2 | (63, 68.6, 74) | (62, 65.8, 71) | (55, 59.4, 62) | (46, 58.2, 68)
50 | WEnet | Block3* | (05, 9.9, 18) | (05, 9.1, 14) | (01, 5.3, 10) | (02, 4.9, 09)
50 | WEnetCC | Block1 | (91, 95.8, 100) | (91, 95.4, 98) | (71, 76.2, 81) | (71, 74.2, 81)
50 | WEnetCC | Block2 | (15, 21.6, 26) | (22, 23.4, 24) | (20, 25.4, 29) | (20, 26, 29)
50 | WEnetCC | Block3* | (00, 2.8, 7) | (00, 2.8, 07) | (02, 6.6, 12) | (02, 6.8, 12)
70 | AEnet | Block1 | (96, 97.4, 99) | (93, 94.8, 97) | (71, 77, 82) | (70, 73.8, 77)
70 | AEnet | Block2 | (52, 59.2, 62) | (56, 58.2, 62) | (52, 55, 58) | (42, 48.6, 55)
70 | AEnet | Block3* | (15, 23.5, 31) | (17, 24.1, 34) | (27, 34.9, 43) | (21, 28.7, 36)
70 | AEnetCC | Block1 | (85, 89.8, 94) | (85, 88.6, 92) | (35, 37.2, 41) | (36, 41.6, 48)
70 | AEnetCC | Block2 | (19, 27, 33) | (24, 29.4, 36) | (16, 17, 18) | (13, 19.8, 26)
70 | AEnetCC | Block3* | (03, 10, 15) | (05, 9.5, 14) | (05, 8.4, 12) | (05, 10, 16)
70 | WEnet | Block1 | (96, 97.2, 99) | (92, 96, 99) | (90, 93.8, 96) | (93, 94.6, 97)
70 | WEnet | Block2 | (41, 49.4, 57) | (42, 50.2, 55) | (36, 44.8, 52) | (43, 49.2, 57)
70 | WEnet | Block3* | (07, 16.9, 22) | (13, 17.8, 24) | (08, 12.3, 18) | (05, 10.6, 18)
70 | WEnetCC | Block1 | (82, 85.8, 89) | (83, 86, 89) | (50, 57.8, 63) | (47, 57.6, 65)
70 | WEnetCC | Block2 | (14, 23.4, 31) | (21, 25.2, 31) | (15, 19.6, 24) | (19, 22.2, 28)
70 | WEnetCC | Block3* | (03, 8.8, 15) | (03, 6.3, 12) | (03, 7.3, 11) | (02, 7.4, 15)

As expected, the covariates of the informative blocks (blocks 1 and 2) should be selected. For both AFT models Table 1 shows that when the data are uncorrelated all the methods tend to select most of the informative covariates from block 1, and a very high percentage of informative covariates from block 2. The inclusion rate of the noninformative covariates into the final models of the methods is shown to be very low, particularly with low censoring. When the covariates are correlated, the two methods WEnet and WEnetCC outperform the other two methods. The methods AEnet and AEnetCC tend to select informative covariates with low selection frequencies, and tend to include more irrelevant predictors in their models.
However, as censoring increases, the overall performance, in terms of both selecting informative covariates and excluding noninformative covariates, slightly decreases. The prediction performance of the methods on the datasets under the log-normal AFT model is given in Figure 1.
Fig 1. Predicted vs observed log survival time under the log-normal AFT model for the methods AEnet and WEnet for datasets with P% = 30 (first row panel) and P% = 50 (second row panel), and for the methods AEnetCC and WEnetCC for datasets with P% = 30 (third row panel) and P% = 50 (fourth row panel). Each panel plots observed log(time) against predicted log(time), distinguishing censored and uncensored points.

We similarly obtained graphs for the Weibull AFT model; they are not reported here since the results were almost identical to those for the log-normal model. According to those figures, all the methods fit the test data reasonably well for all levels of censoring. For the AEnet on the correlated dataset with 30% censoring, the predicted times are considerably biased compared with the corresponding observed times (Figure 1, row 1, column 3). However, this does not happen with 50% censoring (row 2, column 3), so it is hard to draw a general conclusion. Given this performance of AEnet, all the other methods select the correct set of nonzero coefficients, although the coefficients may be poorly estimated.
4.1.2. Simulation II: $n = 100$, $p = 120$. We fix $\alpha = 1$ and set the first 20 coefficients of $\beta$ to 4 (i.e. $p_\gamma$ is 20) and the remaining coefficients of $\beta$ to zero. We keep everything else similar to Simulation I. We fit all four proposed methods together with the six other existing methods and compare them in terms of variable selection. The Bayesian-AFT MCMC sampler was run for 200,000 iterations, with the first 100,000 iterations used as burn-in. A starting model with 40 randomly selected variables is considered. For the Bayesian-AFT, we choose 0.01 to be the cut-off for the marginal posterior probabilities.
Table 2
Variable selection frequency percentages for the 20 relevant ($p_\gamma$) variables and the 100 non-relevant ($p - p_\gamma$) variables for the methods, for both log-normal and Weibull AFT models. Entries are (Min, Mean, Max).

P% | Methods | Parameters | r_ij = 0, log-normal | r_ij = 0, Weibull | r_ij = 0.5, log-normal | r_ij = 0.5, Weibull
30 | AEnet | p_γ | (77, 84.8, 94) | (82, 87.9, 94) | (42, 50.7, 58) | (44, 54.2, 63)
30 | AEnet | p − p_γ | (02, 9.4, 19) | (02, 9.8, 17) | (11, 21.1, 29) | (17, 23.8, 31)
30 | AEnetCC | p_γ | (77, 89.8, 98) | (83, 89.5, 94) | (49, 62, 70) | (60, 67, 77)
30 | AEnetCC | p − p_γ | (08, 14.3, 21) | (04, 13.3, 21) | (30, 40.3, 52) | (35, 44.67, 53)
30 | WEnet | p_γ | (72, 80, 90) | (73, 80, 89) | (36, 43.3, 50) | (36, 43.2, 55)
30 | WEnet | p − p_γ | (05, 12.3, 21) | (05, 13.1, 22) | (00, 1.4, 05) | (00, 1.5, 05)
30 | WEnetCC | p_γ | (82, 87.3, 94) | (82, 86.5, 91) | (52, 61.8, 73) | (55, 65, 75)
30 | WEnetCC | p − p_γ | (06, 12.4, 19) | (03, 11.6, 20) | (03, 11, 22) | (03, 10.5, 21)
30 | Enet | p_γ | (96, 97.2, 98) | (97, 97.3, 98) | (96, 98.2, 100) | (96, 98.8, 100)
30 | Enet | p − p_γ | (05, 6.5, 8) | (4, 6.5, 9) | (01, 4.3, 10) | (00, 4.1, 10)
30 | Enet-AFT | p_γ | (50, 53.2, 56) | (50, 53.3, 55) | (36, 40.3, 48) | (32, 39.5, 50)
30 | Enet-AFT | p − p_γ | (03, 4.3, 06) | (03, 4.4, 06) | (05, 16.7, 27) | (08, 16.7, 27)
30 | Bayesian-AFT | p_γ | (45, 64, 75) | - | (45, 72, 85) | -
30 | Bayesian-AFT | p − p_γ | (14, 29, 41) | - | (36, 48.3, 62) | -
30 | SIS | p_γ | (11, 20.5, 29) | (13, 21.3, 29) | (06, 11.9, 18) | (06, 11.5, 18)
30 | SIS | p − p_γ | (00, 0.9, 03) | (00, 0.8, 04) | (00, 2.6, 06) | (00, 2.7, 07)
30 | TCS | p_γ | (38, 45.9, 53) | (44, 47.4, 52) | (42, 52.8, 61) | (44, 55.1, 64)
30 | TCS | p − p_γ | (00, 3.4, 08) | (00, 3.7, 09) | (05, 15.1, 26) | (08, 15.0, 27)
30 | PC-simple | p_γ | (11, 23.5, 30) | (16, 25.2, 33) | (11, 18.1, 26) | (13, 19.7, 29)
30 | PC-simple | p − p_γ | (00, 1.1, 04) | (00, 0.9, 05) | (00, 5.6, 13) | (01, 5.4, 15)
50 | AEnet | p_γ | (59, 68.1, 75) | (60, 66.2, 73) | (41, 50.7, 56) | (43, 50.6, 60)
50 | AEnet | p − p_γ | (03, 8.8, 16) | (03, 9.7, 18) | (16, 26, 35) | (15, 25.6, 38)
50 | AEnetCC | p_γ | (69, 76.4, 85) | (68, 72.7, 80) | (47, 57.2, 67) | (41, 52.5, 64)
50 | AEnetCC | p − p_γ | (16, 23.6, 31) | (12, 22.2, 33) | (26, 37.9, 50) | (25, 35.2, 50)
50 | WEnet | p_γ | (45, 56.8, 65) | (46, 52.2, 60) | (21, 28.4, 36) | (21, 28.6, 37)
50 | WEnet | p − p_γ | (04, 10.6, 18) | (03, 11.5, 20) | (00, 1.8, 06) | (00, 02, 06)
50 | WEnetCC | p_γ | (69, 75.4, 87) | (68, 72.8, 80) | (49, 55.5, 66) | (46, 55.8, 64)
50 | WEnetCC | p − p_γ | (12, 21.1, 29) | (11, 21.2, 30) | (06, 12.4, 19) | (03, 12.3, 21)
50 | Enet | p_γ | (64, 67.4, 70) | (66, 68.1, 70) | (70, 76.2, 85) | (71, 80.4, 85)
50 | Enet | p − p_γ | (11, 12.7, 15) | (10, 12.7, 15) | (03, 10.3, 18) | (03, 8.8, 16)
50 | Enet-AFT | p_γ | (33, 35.3, 38) | (33, 35.8, 38) | (25, 17.8, 41) | (26, 33.6, 41)
50 | Enet-AFT | p − p_γ | (03, 04, 05) | (03, 04, 06) | (09, 17.8, 27) | (08, 17.9, 26)
50 | Bayesian-AFT | p_γ | (40, 53.5, 70) | - | (30, 51.5, 60) | -
50 | Bayesian-AFT | p − p_γ | (27, 50, 67) | - | (32, 45.1, 56) | -
50 | SIS | p_γ | (11, 17.4, 25) | (09, 16.8, 23) | (05, 10.5, 15) | (06, 09.9, 17)
50 | SIS | p − p_γ | (00, 1.5, 06) | (00, 1.6, 05) | (00, 2.9, 08) | (00, 3.0, 09)
50 | TCS | p_γ | (48, 56.9, 64) | (46, 55.6, 64) | (40, 47.5, 55) | (38, 45.5, 59)
50 | TCS | p − p_γ | (25, 37.6, 48) | (27, 37.9, 47) | (28, 39.5, 53) | (28, 39.9, 51)
50 | PC-simple | p_γ | (13, 19.4, 27) | (12, 18.8, 26) | (07, 14.8, 21) | (10, 15.9, 23)
50 | PC-simple | p − p_γ | (00, 1.9, 07) | (00, 1.9, 05) | (00, 5.3, 16) | (00, 5.1, 13)

Table 2 shows the results from 100 simulation runs. We evaluate the frequency of being selected among the 100 simulations and then compute the minimum, mean, and maximum of those frequencies. The results are presented in the table for two censoring levels, 30% and 50%. In terms of the mean selection frequencies of informative variables, all proposed methods outperform
the Enet-AFT, SIS, TCS, and PC-simple methods. Their performances are very close to the performance of the Enet method at the higher censoring level, although they show slightly poorer performances with lower censoring. However, for the uncorrelated dataset and both AFT models, the two methods AEnet and AEnetCC tend to exclude fewer noninformative covariates. The three greedy approaches tend to select far fewer variables (both informative and spurious) in the final model. As expected, for all approaches, the number of variables selected from $p_\gamma$ decreases as the censoring increases.
5. Real Data Example.
5.1. Mantle cell lymphoma data.
Rosenwald et al. (2003) reported a study using microarray expression analysis of mantle cell lymphoma (MCL). The primary goal of the study was to discover gene expression signatures that correlate with survival in MCL patients. MCL accounts for 6% of all non-Hodgkin's lymphomas and a higher fraction of deaths from lymphoma, given that it is an incurable malignancy [Swerdlow and Williams (2002)]. Among 101 untreated patients with no history of previous lymphoma included in the study, 92 were classified as having MCL, based on established morphologic and immunophenotypic criteria. Survival times of 64 patients were available and the remaining 28 patients were censored (i.e. the censoring rate $P\%$ is 30). The median survival time was 2.8 years (range 0.02 to 14.05 years). The length of survival of MCL patients following diagnosis is quite heterogeneous (Figure 2(a)). Many patients died within the first 2 years following diagnosis, yet 15% (14/92) of the patients survived more than 5 years and 3 patients survived more than 10 years. Lymphochip DNA microarrays were used to quantify mRNA expression in the lymphoma samples from the 92 patients. The gene expression dataset, which contains expression values of 8810 cDNA elements, is available at http://llmpp.nih.gov/MCL/. The data do not provide any further relevant covariates for the MCL patients.

We apply the AFT model with all methods to this dataset. Although these methods have in principle no limit to the number of genes that can be used, we pre-process the data in a simple way. Pre-processing is important to gain further stability by reducing potential noise from the dataset. The pre-processing steps can be summarized as follows: (1) First, missing values of the original dataset are imputed by their sample means (for example, for a particular gene the missing gene expression value for a patient is replaced by the mean of the gene expression values for the observed patients). (2) Secondly, we compute correlation coefficients of the uncensored survival times with the gene expressions. (3) Finally, a reasonable number of genes are selected based on their correlation with the response.

Fig 2. (a) K-M plots of overall survival of patients for MCL data (left panel). (b) Histogram of number of genes with missingness (right panel).

After pre-processing,
574 genes with the largest absolute correlation are retained.

Table 3
Number of genes selected by the methods (diagonal elements) and number of common genes found between the methods (off-diagonal elements).
Methods | AEnet | AEnetCC | WEnet | WEnetCC | Enet | Enet-AFT | Bayesian-AFT | SIS | TCS | PC-simple
AEnet | 45 | 12 | 10 | 03 | 10 | 07 | 05 | 05 | 01 | 02
AEnetCC | 12 | 18 | 02 | 01 | 02 | 05 | 04 | 05 | 01 | 01
WEnet | 10 | 02 | 39 | 09 | 03 | 01 | 01 | 00 | 00 | 00
WEnetCC | 03 | 01 | 09 | 40 | 02 | 01 | 02 | 01 | 00 | 00
Enet | 10 | 02 | 03 | 02 | 68 | 08 | 03 | 01 | 00 | 01
Enet-AFT | 07 | 05 | 01 | 01 | 08 | 25 | 05 | 03 | 01 | 00
Bayesian-AFT | 05 | 04 | 01 | 02 | 03 | 05 | 25 | 02 | 01 | 00
SIS | 05 | 05 | 00 | 01 | 01 | 03 | 02 | 05 | 01 | 01
TCS | 01 | 01 | 00 | 00 | 00 | 01 | 01 | 01 | 02 | 00
PC-simple | 02 | 01 | 00 | 00 | 01 | 00 | 00 | 01 | 00 | 03
In the final model, out of 574 genes, the Bayesian-AFT method finds only
25 genes that have the largest marginal posterior probabilities (we choose 0.38 to be the cut-off) (see also Figure 3). The left panel of Figure 3 shows that the MCMC chains mostly visited models with 20 to 40 genes. The right panel of Figure 3 shows that not many genes have high marginal probabilities (only 25 genes with marginal probabilities greater than 0.38).

Fig 3. Number of included genes (left) in each iteration and marginal posterior probabilities of inclusion (right).
There are four genes, with uniqid 24761, 27434, 27844 and 29888, that are identified by all four proposed methods, and there are another five genes, with uniqid 22136, 24383, 29876, 30034 and 33704, that are identified by three of the proposed methods. The overall analysis of the MCL data suggests that all four proposed methods are capable of identifying sets of genes that are potentially related to the response variable. In the analysis the AEnetCC, as in the simulations with $r_{ij} = 0.5$, selects a smaller number of genes than do the other methods. However, with gene expression data, a smaller number of identified genes means a more focused hypothesis for future confirmation studies, and is thus usually preferred.

We evaluate the predictive performance using the four proposed methods. We use the obtained models to predict the risk of death in the MCL test dataset. We first partition the data randomly into two equal parts, called the training and test datasets. We then implement the methods on the training dataset and compute the risk scores ($X^T\hat\beta$) based on the model estimates and the test dataset. The subjects are classified into the high risk group or the low risk group based on whether the risk score exceeds the median survival time in the training dataset. We compare the K-M curves between the two groups, and then a log-rank test is used to identify the difference between the two K-M curves (see Figure 4). The corresponding predictive MSEs for the methods AEnet, AEnetCC, WEnet, and WEnetCC are 15.7, 19.6, 29.4, and 15.0 respectively. The log-rank test suggests that the high and low risk groups are significantly different from each other under almost all the methods. So it seems the methods can group the subjects' survival times into two risk sets very well.
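The grouping and the log-rank comparison can be reproduced with the survival package; a minimal sketch (the risk_group helper is our own illustration, with X_test, beta_hat, cutoff, time and status supplied from the fitted model and test data):

```r
library(survival)

# Split test subjects into risk groups by their risk score X^T beta-hat.
risk_group <- function(X_test, beta_hat, cutoff) {
  score <- as.vector(X_test %*% beta_hat)
  factor(ifelse(score > cutoff, "high", "low"))
}

# Log-rank test between the two groups, as summarized in Figure 4;
# time and status are the test-set follow-up times and death indicators.
grp <- risk_group(X_test, beta_hat, cutoff)
survdiff(Surv(time, status) ~ grp)
```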
Fig 4. Survival comparison between the high risk group and the low risk group using different methods: AEnet (p = 0.018), AEnetCC (p = 0.016), WEnet (p = 0.069), WEnetCC (p = 0.02). Each panel plots survival probability against survival in days for the two groups.
Table 4
Number of genes selected by SIS followed by the methods (diagonal elements) and number of common genes found between the methods (off-diagonal elements).
Methods | SIS+AEnet | SIS+AEnetCC | SIS+WEnet | SIS+WEnetCC | SIS+Enet | SIS+Enet-AFT | SIS+Bys-AFT | SIS+SCAD | SIS+TCS | SIS+PC
SIS+AEnet | 05 | 02 | 02 | 01 | 02 | 04 | 02 | 04 | 00 | 02
SIS+AEnetCC | 02 | 12 | 05 | 03 | 07 | 07 | 02 | 03 | 00 | 02
SIS+WEnet | 02 | 05 | 10 | 02 | 05 | 07 | 02 | 02 | 00 | 02
SIS+WEnetCC | 01 | 03 | 02 | 08 | 05 | 02 | 01 | 01 | 00 | 01
SIS+Enet | 02 | 07 | 05 | 05 | 32 | 13 | 02 | 03 | 01 | 02
SIS+Enet-AFT | 04 | 07 | 07 | 02 | 13 | 30 | 02 | 04 | 00 | 02
SIS+Bys-AFT | 02 | 02 | 02 | 01 | 02 | 02 | 02 | 02 | 00 | 02
SIS+SCAD | 04 | 03 | 02 | 01 | 03 | 04 | 02 | 05 | 00 | 02
SIS+TCS | 00 | 00 | 00 | 00 | 01 | 00 | 00 | 00 | 01 | 00
SIS+PC | 02 | 02 | 02 | 01 | 02 | 02 | 02 | 02 | 00 | 02
5.2. Mantle cell lymphoma data under adaptive preprocessing.
A challenge with the MCL dataset (with ultra-high dimensionality, $p \gg n$) is that the important genes might be highly correlated with some unimportant ones; such correlation usually increases with dimensionality. The maximum spurious correlation between a gene and the survival time also grows with dimensionality. Here we focus on a smarter preprocessing technique for the MCL data that addresses this issue and also reduces circularity bias [Kriegeskorte et al. (2009)] by reducing the false discovery rate. Fan and Lv (2008) introduced the SIS idea, which reduces the ultra-high dimensionality to a relatively large scale $d_n$, where $d_n < n$. In Fan and Lv (2008) asymptotic theory is proved to show that, with high probability, SIS keeps all important variables with vanishing false discovery rate. Then lower dimension methods such as SCAD [Fan and Li (2001)] can be used to estimate the sparse model. This procedure is referred to as SIS+SCAD. For the MCL data we first apply SIS to reduce the dimensionality from 8810 to $d_n = [3n^{2/3}] = 61$ and then fit the data using all ten methods (our four proposed methods and six competitors, including the three greedy methods). We call this procedure SIS+methods. The results are reported in Table 4.
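The SIS screening step itself amounts to ranking marginal correlations on the weighted data; a minimal sketch (function name ours), keeping the $d_n = [3n^{2/3}]$ top-ranked genes:

```r
# SIS-type marginal screening: rank covariates by absolute correlation
# with the response and keep the top d_n = floor(3 * n^(2/3)) of them.
sis_screen <- function(X, Y) {
  d_n <- floor(3 * nrow(X)^(2/3))
  marginal <- abs(cor(X, Y))        # |correlation| of each gene with Y
  order(marginal, decreasing = TRUE)[seq_len(d_n)]
}
```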
Table 5
Number of genes selected in common between the methods with and without the SIS implementation.
Methods | AEnet | AEnetCC | WEnet | WEnetCC | Enet | Enet-AFT | Bayesian-AFT | SIS | TCS | PC-simple
SIS+AEnet | 01 | 01 | 00 | 00 | 00 | 00 | 00 | 01 | 00 | 00
SIS+AEnetCC | 01 | 01 | 00 | 00 | 01 | 00 | 00 | 01 | 00 | 01
SIS+WEnet | 01 | 01 | 00 | 00 | 00 | 01 | 00 | 01 | 00 | 00
SIS+WEnetCC | 01 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00
SIS+Enet | 06 | 06 | 00 | 01 | 01 | 02 | 00 | 03 | 01 | 01
SIS+Enet-AFT | 04 | 04 | 00 | 01 | 01 | 02 | 00 | 04 | 00 | 01
SIS+Bys-AFT | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00
SIS+SCAD | 01 | 01 | 00 | 00 | 01 | 00 | 00 | 01 | 00 | 01
SIS+TCS | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00
SIS+PC | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00
From Table 4 we see that the proposed methods select a considerable number of genes in common with the other methods. The three greedy methods tend to return final models with fewer genes. TCS selects the lowest number of genes (1) while Enet selects the largest number of genes (32). Among the four proposed methods, AEnet selects the lowest number of genes (5). The results also suggest that the implementation of SIS followed by the methods (proposed and competitors) picks smaller sets of genes, most of which are not in the set of genes found by the methods without SIS. Table 5 shows the number of common genes between the methods with and without the SIS implementation. The predictive performance for the four proposed methods with SIS implementation has been evaluated (see Figure 5), similarly to what was done before for the methods without SIS (Figure 4). The predictive MSEs for the methods SIS+AEnet, SIS+AEnetCC, SIS+WEnet, and SIS+WEnetCC are 1.2, 1.1, 2.8, and 1.4 respectively. It is clear from the predictive performance graph (Figure 5; also Figure 4 for the methods without SIS) and the predictive MSEs that the predictive performance improves considerably after implementation of SIS.

Fig 5. Survival comparison between the high risk group and the low risk group using different methods with SIS implementation: SIS+AEnet (p = 0.000), SIS+AEnetCC (p = 0.000), SIS+WEnet (p = 0.001), SIS+WEnetCC (p = 0.001). Each panel plots survival probability against survival in days for the two groups.
6. Discussion.
In this study we propose adaptive elastic net and weighted elastic net regularized variable selection approaches for the AFT model. We conjecture that the proposed approaches enjoy oracle properties under some regularity assumptions, by analogy with the adaptive elastic net [Ghosh (2007)] and the weighted elastic net [Hong and Zhang (2010)]. They produce sparse solutions. We propose another two variable selection algorithms, where censored observations are used as extra constraints in the optimization function of the methods. The censoring constraints in the optimization equations limit the model space using the right censored data. It is shown how all the methods apart from the AEnetCC can be optimized after transforming them into an adaptive lasso problem in some augmented space.

The analysis of both simulated and MCL gene expression data shows that the regularized SWLS approach for variable selection, with its four implementations (AEnet, AEnetCC, WEnet and WEnetCC), can be used for selecting important variables. They can also be used for future prediction of survival time under AFT models. The MCL gene expression data analysis also suggests that sure independence screening improves the performance of all the proposed methods in the AFT model. It is observed that the methods AEnetCC and WEnetCC seem only to perform well on moderately high-dimensional censored datasets, such as those with at most four or five times more variables than the sample size. However, the two methods with SIS implementation have no such limits. A variable selection strategy such as ours, which allows the adaptive elastic net [Ghosh (2007)] and weighted elastic net [Hong and Zhang (2010)] to be used for censored data, is new. The extensions of these methods, which use right censored data to
limit the model space and improve the parameter estimation, are also new. Both the adaptive and weighted elastic nets, together with the two extensions, enjoy the computational advantages of the lasso. The methods show that various regularized techniques will continue to be important tools in variable selection with survival data. Our new variable selection techniques for high-dimensional censored data are promising alternatives to the existing methods. For implementing all the proposed methods we have provided a publicly available package, AdapEnetClass [Khan and Shaw (2013a)], implemented in the R programming system.
Acknowledgements.
The first author is grateful to the Centre for Research in Statistical Methodology (CRiSM), Department of Statistics, University of Warwick, UK, for offering research funding for his PhD study.
References.
Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (B. N. Petrov and F. Csaki, eds.).

Antoniadis, A., Fryzlewicz, P. and Letue, F. (2010). The Dantzig selector in Cox's proportional hazards model. Scandinavian Journal of Statistics.

Buckley, J. and James, I. (1979). Linear regression with censored data. Biometrika.

Bühlmann, P., Kalisch, M. and Maathuis, M. H. (2010). Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika.

Cai, T., Huang, J. and Tian, L. (2009). Regularized estimation for the accelerated failure time model. Biometrics.

Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics.

Cho, H. and Fryzlewicz, P. (2012). High dimensional variable selection via tilting. J. Roy. Statist. Soc. Ser. B.

Cox, D. R. (1972). Regression models and life-tables. J. Roy. Statist. Soc. Ser. B.

Datta, S., Le-Rademacher, J. and Datta, S. (2007). Predicting patient survival from microarray data by accelerated failure time modeling using partial least squares and LASSO. Biometrics.

Efron, B. (1967). The two sample problem with censored data. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.

Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist.

Engler, D. and Li, Y. (2009). Survival analysis with high-dimensional covariates: an application in microarray studies. Statistical Applications in Genetics and Molecular Biology, Article 14.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Stat. Assoc.

Fan, J. and Li, R. (2002). Variable selection for Cox's proportional hazards model and frailty model. Ann. Stat.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. Roy. Statist. Soc. Ser. B.

Faraggi, D. and Simon, R. (1998). Bayesian variable selection method for censored survival data. Biometrics.

Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics.

Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily single-censored samples. Biometrika.

Ghosh, S. (2007). Adaptive elastic net: an improvement of elastic net to achieve oracle properties. Technical Report PR no. 07-01, Indiana University-Purdue University, Indianapolis, USA.

Gui, J. and Li, H. (2005). Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics.

Hong, D. and Zhang, F. (2010). Weighted elastic net model for mass spectrometry imaging processing. Mathematical Modelling of Natural Phenomena.

Hu, S. and Rao, J. S. (2010). Sparse penalization with censoring constraints for estimating high dimensional AFT models with applications to microarray data analysis. Technical Report, University of Miami.

Huang, J. and Harrington, D. (2005). Iterative partial least squares with right-censored data analysis: a comparison to other dimension reduction techniques. Biometrics.

Huang, J., Ma, S. and Xie, H. (2006). Regularized estimation in the accelerated failure time model with high-dimensional covariates. Biometrics.

Huang, J. and Ma, S. (2010). Variable selection in the accelerated failure time model via the bridge method. Lifetime Data Analysis.

Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. The Annals of Statistics.

Jin, Z., Lin, D. Y. and Ying, Z. (2006). On least-squares regression with censored data. Biometrika.

Jin, Z., Lin, D., Wei, L. J. and Ying, Z. L. (2003). Rank-based inference for the accelerated failure time model. Biometrika.

Khan, M. H. R. and Shaw, J. E. H. (2013a). AdapEnetClass: a class of adaptive elastic net methods for censored data. R package version 1.0.

Khan, M. H. R. and Shaw, J. E. H. (2013b). On dealing with censored largest observations under weighted least squares. CRiSM Working Paper No. 13-07, Department of Statistics, University of Warwick.

Khan, M. H. R. and Shaw, J. E. H. (2013c). Variable selection with the modified Buckley-James method and the Dantzig selector for high-dimensional survival data. Submitted.

Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. F. and Baker, C. I. (2009). Circular analysis in systems neuroscience: the dangers of double dipping. Nature Neuroscience.

Li, H. Z. and Luan, Y. H. (2003). Kernel Cox regression models for linking gene expression profiles to censored survival data. Pacific Symposium on Biocomputing.

Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. Roy. Statist. Soc. Ser. B.

Peduzzi, P. N., Hardy, R. J. and Holford, T. R. (1980). A stepwise variable selection procedure for nonlinear regression models. Biometrics.

Radchenko, P. and James, G. M. (2011). Improved variable selection with forward-lasso adaptive shrinkage. Annals of Applied Statistics.

Rosenwald, A., Wright, G., Wiestner, A., Chan, W. C., Connors, J. M., Campo, E., Gascoyne, R. D., Grogan, T. M., Muller-Hermelink, H. K., Smeland, E. B., Chiorazzi, M., Giltnane, J. M., Hurt, E. M., Zhao, H., Averett, L., Henrickson, S., Yang, L., Powell, J., Wilson, W. H., Jaffe, E. S., Simon, R., Klausner, R. D., Montserrat, E., Bosch, F., Greiner, T. C., Weisenburger, D. D., Sanger, W. G., Dave, B. J., Lynch, J. C., Vose, J., Armitage, J. O., Fisher, R. I., Miller, T. P., LeBlanc, M., Ott, G., Kvaloy, S., Holte, H., Delabie, J. and Staudt, L. M. (2003). The proliferation gene expression signature is a quantitative integrator of oncogenic events that predicts survival in mantle cell lymphoma. Cancer Cell.

Sha, N., Tadesse, M. G. and Vannucci, M. (2006). Bayesian variable selection for the analysis of microarray data with censored outcome. Bioinformatics.

Stute, W. (1993). Consistent estimation under random censorship when covariables are available. Journal of Multivariate Analysis.

Stute, W. (1996). Distributional convergence under random censorship when covariables are present. Scandinavian Journal of Statistics.

Swerdlow, S. H. and Williams, M. E. (2002). From centrocytic to mantle cell lymphoma: a clinicopathologic and molecular review of 3 decades. Hum. Pathol.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B.

Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine.

Wang, S., Nan, B., Zhu, J. and Beer, D. G. (2008). Doubly penalized Buckley-James method for survival data with high-dimensional covariates. Biometrics.

Ying, Z. (1993). A large sample study of rank estimation for censored regression data. Annals of Statistics.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. Roy. Statist. Soc. Ser. B.

Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics.

Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Stat. Assoc.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B.

Zou, H. and Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics.

Applied Statistics
Institute of Statistical Research and Training
University of Dhaka
Dhaka 1000
E-mail: [email protected]