pcoxtime: Penalized Cox Proportional Hazard Model for Time-dependent Covariates
A Preprint
Steve Cygu
McMaster University
Benjamin M. Bolker
McMaster University
Abstract
The penalized Cox proportional hazard model is a popular analytical approach for survival data with a large number of covariates. Such problems are especially challenging when covariates vary over follow-up time (i.e., the covariates are time-dependent). The standard R packages for fully penalized Cox models cannot currently incorporate time-dependent covariates. To address this gap, we implement a variant of the gradient descent algorithm (proximal gradient descent) for fitting penalized Cox models. We apply our implementation to real and simulated data sets.

Survival analysis studies event times — for example, time to cancer recurrence or time to death due to cancer. The goal is to predict the time-to-event (survival time) using a set of covariates, and to estimate the effect of different covariates on survival. Survival models typically attempt to estimate the hazard, the probability (density) of the event of interest occurring within a specific small time interval. Binary classification methods from machine learning (Kvamme, Borgan, and Scheel, 2019) can be used in problems that focus on predicting whether an event occurs within a specified time window. However, while binary classifiers can predict outcomes for a specified time window, they fail to account for one of the unique characteristics of survival data — censoring. In survival data, it is common that some of the individuals are event-free by the end of the follow-up time, resulting in censored times rather than observed event times. Binary classifiers typically ignore such observations; one of the main goals of survival analysis is to make full use of the information they provide.

Cox proportional hazard (CPH) models are predominantly used to model survival data. The CPH model is usually applied in problems where the number of observations, $n$, is much larger than the number of covariates, $p$.
In the modern era of big data, researchers often encounter cases where $p \approx n$ (or $p \gg n$). In cancer research, for example, rapid advancement in genomics technologies combined with the low cost of sequencing has led to the generation of vast amounts of cancer data (Cagan and Meyer, 2017) — presenting inherent challenges for effective and efficient data handling and analysis. Penalized regression methods such as lasso, ridge, or elastic net offer a statistically convenient way of handling high-dimensional data, especially when building predictive models. Depending on the objectives of the analysis, for example when one of the goals is to identify a small subset of prognostic factors associated with cancer mortality, the subclass of penalized methods that are sparsity-inducing (e.g., lasso and elastic net) can simultaneously estimate and select useful predictive features.

The standard CPH model (i.e., with no time-dependent covariates) assumes that the hazard ratio is constant over the entire follow-up period, or equivalently that each covariate has a constant multiplicative effect on the hazard function over time. This assumption is violated in some situations. For example, in cancer prognosis, patients' treatment plans or use of health services may change over time. Some implementations of CPH models allow such time-dependent covariates. However, their use requires more attention than fixed (time-independent) covariates (Hochstein, Ahn, Leung, and Denesuk, 2013; Therneau, Crowson, and Atkinson, 2017; Austin, Latouche, and Fine, 2020).
Many authors have implemented CPH models with penalization (Gui and Li, 2005; Park and Hastie, 2007; Sohn, Kim, Jung, and Park, 2009; Goeman, 2010), but all of these implementations share a common limitation — poor computational performance due to the use of the Newton-Raphson algorithm for estimation (Gorst-Rasmussen and Scheike, 2012). In the last decade, however, some researchers have developed more efficient implementations of penalized CPH models. For instance, Simon, Friedman, Hastie, and Tibshirani (2011) describe and implement an impressively fast algorithm, coxnet, implemented in the glmnet package, for fitting regularized CPH models via weighted cyclic coordinate descent. This method is computationally efficient in handling high-dimensional problems. Yang and Zou (2013) proposed and implemented the cocktail algorithm — a mixture of coordinate descent, the majorization-minimization principle, and the strong rule — for solving penalized CPH models in high-dimensional data. The cocktail algorithm always converges to the correct solution and is slightly faster than the coxnet algorithm; it is implemented in the R package fastcox. However, these implementations, the benchmark R packages for penalized Cox models, have some specific limitations. The implementations of Simon et al. (2011) and Yang and Zou (2013) currently do not support time-dependent covariates, while the implementation of Goeman (2010) incorporates time-dependent covariates but does not fully implement the elastic net.

Other, non-CPH-based approaches have also incorporated time-dependent covariates in penalized models for time-to-event data. Most of these studies have used generalized additive models to implement semiparametric regression methods in the context of survival models (Gorst-Rasmussen and Scheike, 2012; Bender, Groll, and Scheipl, 2018).
Gorst-Rasmussen and Scheike (2012) used a cyclic coordinate descent algorithm to develop a penalized semiparametric additive hazards model, implemented in the R package ahaz. The model defines the hazard function as the sum of the baseline hazard and the regression function of the covariates — it is intrinsically linear, which theoretically guarantees convergence, and it can handle time-dependent covariates. However, it currently implements only lasso penalization.

In this paper, we describe and implement an algorithm and R package (pcoxtime) for penalized CPH models that handles both time-independent and time-dependent covariates. The general ability of penalized methods to handle high-dimensional data makes this algorithm a candidate tool for relatively high-dimensional problems. We describe how existing computational approaches for CPH modeling can be adapted to obtain penalized methods for time-dependent covariates in time-to-event data. To solve the optimization problem, we exploit a variant of the gradient descent algorithm known as proximal gradient descent (as outlined in Parikh and Boyd (2014)) with Barzilai-Borwein step-size adjustment (Barzilai and Borwein, 1988). One of the important aspects of high-dimensional data analysis is computational speed; coordinate descent based methods are usually much faster than gradient descent based methods, which are slower by design (Simon et al., 2011). In the present case, we argue that capability is also important, i.e., handling time-dependent covariates and fully implementing the elastic net. We also provide examples of the package's usage on real data and compare its performance with that of the penalized package on simulated data with time-dependent covariates.
Consider standard survival data of the form $\{y_i, \delta_i, x_i\}_{i=1}^{n}$, where $y_i$ is the observed event time (mortality or failure time, or censoring time), $\delta_i$ is an indicator variable which is 1 if the event has occurred during the follow-up period or 0 if the event is censored, and $x_i$ is a vector of covariates $(x_{i,1}, x_{i,2}, \cdots, x_{i,p})$. Let $t_1 < t_2 < \cdots < t_m$ be the distinct failure times, and $j(i)$ be the index of the observation failing at time $t_i$. The Cox proportional hazard model defines the hazard function at time $t$ as

$$h_i(t) = h_0(t) \exp(x_i^T \beta) \qquad (1)$$

where $h_0(t)$ is the non-parametric baseline hazard and $\beta$ is the coefficient vector of length $p$.

Also, let $R_i$ denote the risk set at time $t_i$ (i.e., $\{j : y_j \geq t_i\}$), and for simplicity, assume there are no ties in $y_i$ (Section 2.2 will show how to handle ties). Inference for a Cox model uses the partial likelihood function

$$L(\beta) = \prod_{i=1}^{m} \frac{\exp(x_{j(i)}^T \beta)}{\sum_{j \in R_i} \exp(x_j^T \beta)}. \qquad (2)$$

The Cox model in Equation 1 is fitted in two steps — first, the parametric part is fitted by maximizing the partial likelihood in Equation 2, and then the non-parametric baseline hazard is estimated based on the parametric results.
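For concreteness, the partial likelihood in Equation 2 can be evaluated directly from its definition. The sketch below (illustrative Python, not the package's implementation, which is written in R and C++) computes the partial log-likelihood, identifying each event's contribution with its own row (i.e., taking $j(i) = i$, which is valid when there are no ties):

```python
import numpy as np

def cox_partial_loglik(y, delta, X, beta):
    """Log of the Cox partial likelihood (Equation 2), assuming no ties.

    y     : (n,) observed times (event or censoring)
    delta : (n,) event indicator (1 = event observed, 0 = censored)
    X     : (n, p) covariate matrix
    beta  : (p,) coefficient vector
    """
    eta = X @ beta                      # linear predictors x_i^T beta
    ll = 0.0
    for i in range(len(y)):
        if delta[i] == 1:               # only observed events contribute
            at_risk = y >= y[i]         # risk set R_i = {j : y_j >= t_i}
            ll += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return ll
```

With $\beta = 0$ every subject contributes $-\log |R_i|$, which is a convenient sanity check for any implementation.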
When a covariate changes over time during the follow-up period, the observed survival data is of the form $\{y_i^{start}, y_i^{stop}, \delta_i, x_i\}_{i=1}^{n}$, where $(y_i^{start}, y_i^{stop}]$ is the time interval over which covariate information about individual $i$ is observed, $\delta_i$ indicates whether the event has occurred for individual $i$ at time $y_i^{stop}$, and the description of $x_i$ remains the same as in Section 2.1 but is now a (piecewise) function of time. The Cox proportional hazard model partial likelihood, with Breslow's method for handling tied event times (Breslow, 1972), is given by

$$L(\beta) = \prod_{i=1}^{N} \left( \frac{\exp(x_i(t)^T \beta)}{\sum_{j \in R(t_i)} \exp(x_j(t)^T \beta)} \right)^{\delta_i} \qquad (3)$$

where the risk set at time $t_i$ is now defined as $R(t_i) = \{j : (y_j^{stop} \geq t_i) \wedge (y_j^{start} < t_i)\}$. The first condition, $(y_j^{stop} \geq t_i)$, ensures that individual $j$ either experienced the event or was censored at a later time point than $t_i$, while the second condition, $(y_j^{start} < t_i)$, ensures that the start time was observed before the event.

From Equation 3, the partial log-likelihood is defined as

$$\ell(\beta) = \sum_{i=1}^{N} \delta_i \left[ x_i(t)^T \beta - \log \left( \sum_{j \in R(t_i)} \exp(x_j(t)^T \beta) \right) \right] \qquad (4)$$

and $\hat{\beta}$ is obtained by minimizing the negative of Equation 4 (the loss function).

Our algorithm extends the partial log-likelihood in Equation 4 by adding a penalty term. Let $P_{\alpha,\lambda}(\beta)$ be a mixture of $\ell_1$ (lasso — non-differentiable at $\beta = 0$) and $\ell_2$ (ridge) penalties. The penalized Cox partial log-likelihood is defined as $\mathrm{pen}\,\ell(\beta)_{\alpha,\lambda} = -\ell(\beta) + P_{\alpha,\lambda}(\beta)$, where

$$P_{\alpha,\lambda}(\beta) = \lambda \left( \alpha \sum_{i=1}^{p} |\beta_i| + 0.5\,(1 - \alpha) \sum_{i=1}^{p} \beta_i^2 \right) \qquad (5)$$

as proposed by Zou and Hastie (2005), with $\lambda > 0$ and $0 < \alpha \leq 1$. The lasso penalty ($\sum_{i=1}^{p} |\beta_i|$) induces sparsity by selecting a subset of nonzero coefficients. It works well in high-dimensional applications, but will eliminate all but one of any set of strongly multicollinear terms. On the other hand, the ridge penalty ($\sum_{i=1}^{p} \beta_i^2$) shrinks coefficients towards, but never all the way to, zero; hence it gives non-sparse estimates and gives correlated predictors approximately equal weights. The elastic net penalty combines the strengths of the lasso and ridge penalties for improved predictive performance (Simon et al., 2011). As $\alpha$ changes from 0 to 1 for a fixed value of $\lambda$, the sparsity and the magnitude of the non-zero coefficients increase, i.e., the solution becomes more lasso-like and less ridge-like.

Using Equation 5, our problem becomes

$$\hat{\beta} = \arg\min_{\beta} \left\{ -\sum_{i=1}^{N} \delta_i \left[ x_i(t)^T \beta - \log \left( \sum_{j \in R(t_i)} \exp(x_j(t)^T \beta) \right) \right] + \lambda \left( \alpha \sum_{i=1}^{p} |\beta_i| + 0.5\,(1 - \alpha) \sum_{i=1}^{p} \beta_i^2 \right) \right\}. \qquad (6)$$

To obtain parameter estimates, we apply proximal gradient descent (Parikh and Boyd, 2014) to solve Equation 6. Consider a composite function $f(\beta) = g(\beta) + h(\beta)$, such that $g(\beta)$ is differentiable and convex and $h(\beta)$ is convex but not necessarily differentiable. The proximal gradient descent approximation, based on the proximal operator, is defined as

$$\mathrm{prox}_h(x) = \arg\min_u \left( \frac{1}{2} \| u - x \|_2^2 + h(u) \right) = S_h(x)$$

where $S_h(x) = \mathrm{sgn}(x)(|x| - h)_+$ is the soft thresholding operator.
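The two building blocks introduced above — the counting-process risk set and the elastic net penalty — are both straightforward to compute from (start, stop] records. A minimal sketch (illustrative Python only; the function names are ours, not the package's API):

```python
import numpy as np

def risk_set(t_i, start, stop):
    """Time-dependent risk set R(t_i) = {j : stop_j >= t_i and start_j < t_i},
    where row j records one (start_j, stop_j] interval of observation."""
    start, stop = np.asarray(start), np.asarray(stop)
    return np.where((stop >= t_i) & (start < t_i))[0]

def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty P_{alpha,lambda}(beta) of Equation 5:
    lambda * (alpha * ||beta||_1 + 0.5 * (1 - alpha) * ||beta||_2^2)."""
    return lam * (alpha * np.sum(np.abs(beta))
                  + 0.5 * (1.0 - alpha) * np.sum(beta ** 2))
```

For example, a subject with intervals $(0, 5]$ and $(5, 8]$ is at risk at $t = 4$ through the first row only, which is exactly why the start-stop encoding lets a covariate change value mid-follow-up without double-counting the subject.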
We can decompose the objective function (Equation 6) as $f(\beta) = g(\beta) + h(\beta)$ with

$$g(\beta) = -\ell(\beta) + 0.5\,\lambda (1 - \alpha) \sum_{i=1}^{p} \beta_i^2 \quad \text{and} \quad h(\beta) = \lambda \alpha \sum_{i=1}^{p} |\beta_i|.$$

Hence, the proximal operator to update $\beta$ is given by

$$\beta^{(k)} = S_{\gamma_k h} \left( \beta^{(k-1)} - \gamma_k \nabla g(\beta^{(k-1)}) \right), \qquad k = 1, 2, 3, \cdots \qquad (7)$$

where $\gamma_k$ is the step size determined via Barzilai-Borwein step-size adjustment (Barzilai and Borwein, 1988) and

$$\nabla g(\beta) = -\nabla \ell(\beta) + \lambda (1 - \alpha) \beta = -X^T (\delta - P \delta) + \lambda (1 - \alpha) \beta.$$

For simplicity, denote $\eta_i = x_i^T \beta$, $w_j = \exp(x_j(t)^T \beta)$ and $W_i = \sum_{j \in R(t_i)} w_j$. Let $y_i(t_j) = I(i \in R(t_j))$ be the at-risk indicator, equal to 1 if individual $i$ is still at risk at time $t_j$. Also, let $\pi_{ij} = y_i(t_j) w_i / W_j$. By evaluating the partial derivative of Equation 3 with respect to the linear predictor $\eta$ and applying the chain rule,

$$\nabla \ell(\beta) = \frac{\partial \eta}{\partial \beta}^T \frac{\partial \ell(\beta)}{\partial \eta} = X^T (\delta - P \delta) \qquad (8)$$

where $P$ is an $n \times n$ matrix with entries $\pi_{ij}$.

In our particular case, the proximal operator reduces to the soft thresholding operator

$$S_{\gamma_k \lambda \alpha}(x_i) = \begin{cases} x_i - \gamma_k \lambda \alpha & x_i \geq \gamma_k \lambda \alpha \\ 0 & -\gamma_k \lambda \alpha \leq x_i \leq \gamma_k \lambda \alpha \\ x_i + \gamma_k \lambda \alpha & x_i \leq -\gamma_k \lambda \alpha \end{cases}$$

which allows us to perform updates via Equation 7.

To train an optimal model, we need to choose the optimal value of $\lambda$ — a large value of $\lambda$ makes the penalty terms dominate Equation 6, setting coefficients to zero; on the other hand, a small value of $\lambda$ will overfit the model. An ideal approach is cross-validation, but we first need to decide which values of $\lambda$ to use in cross-validation. At $\lambda = 0$, the model fits an unpenalized Cox model, while at some sufficiently large $\lambda$, all the coefficients are set to zero. To compute the regularization path, we need a sequence of $\lambda$s falling in between these extremes, $\lambda_1 < \lambda_2 < \cdots < \lambda_{\max}$.
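The update in Equation 7 can be sketched compactly: soft-threshold the gradient step, with the step size chosen by the Barzilai-Borwein rule. The Python below is an illustration of these generic operators, not the package's C++ code; the `grad_g` callable stands in for the gradient of the smooth part $g$:

```python
import numpy as np

def soft_threshold(x, thresh):
    """Soft-thresholding operator S_h(x) = sgn(x) * (|x| - h)_+, elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def bb_step(beta_new, beta_old, grad_new, grad_old):
    """Barzilai-Borwein step size gamma = (s^T s) / (s^T u), with
    s = beta_new - beta_old and u = grad_new - grad_old."""
    s = beta_new - beta_old
    u = grad_new - grad_old
    return float(s @ s) / float(s @ u)

def prox_update(beta, grad_g, gamma, lam, alpha):
    """One proximal gradient update (Equation 7):
    beta <- S_{gamma*lam*alpha}(beta - gamma * grad_g(beta))."""
    z = beta - gamma * grad_g(beta)
    return soft_threshold(z, gamma * lam * alpha)
```

Because the $\ell_2$ part of the penalty is folded into $g$, only the lasso part enters the threshold, which is why the threshold is $\gamma_k \lambda \alpha$ rather than $\gamma_k \lambda$.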
We start with a sufficiently large $\lambda_{\max}$, which sets $\beta = 0$, and decrease $\lambda$ until near the unpenalized solution (this is the warm-start approach employed in glmnet). By the KKT conditions (Yang and Zou, 2013; Wan, Nagorski, Allen, Li, and Liu, 2013; Tibshirani et al., 2013), if $\frac{1}{N} |X^T (\delta - P \delta)| \leq \alpha \lambda \gamma_k$ then $\hat{\beta} = 0$. Thus, by unpacking $P$, we notice that

$$\lambda_{\max} = \frac{1}{N \alpha \gamma_k} \max \left\{ \left| \sum_{i=1}^{N} \left( x_i \delta_i - \frac{\sum_{j \in R(t_i)} x_j \delta_i}{|R(t_i)|} \right) \right| \right\} \qquad (9)$$

where $|R(t_i)|$ denotes the cardinality of the risk set $R(t_i)$.

In our implementation, we set $\lambda_{\min} = \epsilon \lambda_{\max}$, and compute solutions over a grid of $k$ values of $\lambda$ decreasing from $\lambda_{\max}$ to $\lambda_{\min}$, where $\lambda_i = \lambda_{\max} (\lambda_{\min} / \lambda_{\max})^{i/(k-1)}$ for $i = 0, \cdots, k - 1$; the default value of $k$ is 100. The default value of $\epsilon$ is smaller when $n \geq p$ than when $n < p$.
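The geometric $\lambda$ grid described above can be sketched in one line (illustrative Python; the argument names are ours, and $\epsilon$ is left as an explicit argument rather than a default):

```python
import numpy as np

def lambda_sequence(lam_max, eps, k=100):
    """Regularization path lam_i = lam_max * (lam_min/lam_max)^(i/(k-1)),
    i = 0..k-1, with lam_min = eps * lam_max: k values log-spaced and
    decreasing from lam_max down to eps * lam_max."""
    i = np.arange(k)
    return lam_max * eps ** (i / (k - 1))
```

Log-spacing concentrates grid points near zero on the log scale, so the path resolves the heavily penalized (sparse) end and the nearly unpenalized end with comparable relative precision.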
To select the optimal $\lambda$, we perform $k$-fold cross-validation — splitting the training data into $k$ parts, the model is trained on $k - 1$ parts and validated on the $k$th left-out part via some predictive performance measure. Here, we adopt two approaches — the cross-validated partial likelihood and the cross-validated $C$-index (Dai and Breheny, 2019). In the first case, the goodness of fit across the $k$ splits for each $\lambda$ is

$$\widehat{\mathrm{CVE}}(\lambda) = -\sum_{k=1}^{K} \left[ \ell(\hat{\beta}_{-k}(\lambda)) - \ell_{-k}(\hat{\beta}_{-k}(\lambda)) \right] \qquad (10)$$

where $\ell(\hat{\beta}_{-k}(\lambda))$ is the log partial likelihood evaluated at $\hat{\beta}_{-k}$ on the whole dataset and $\ell_{-k}(\hat{\beta}_{-k}(\lambda))$ is the log partial likelihood evaluated at $\hat{\beta}_{-k}$ on the non-left-out data. We choose the $\lambda$ which minimizes Equation 10. In the second case, we implement an alternative approach to cross-validation using the concordance statistic for Cox models, also known as the cross-validated $C$-index, based on Harrell's concordance index (Harrell Jr, Lee, and Mark, 1996). The $C$-index computes the probability that, for a random pair of individuals, the predicted survival times of the pair have the same ordering as their true survival times. Our implementation is similar to that of the R package survival (see Therneau (2020) for a detailed description).

Once $\hat{\beta}$ is estimated, we can estimate the baseline hazard function ($\hat{h}_0(t)$), and hence the survival function ($\hat{S}(t)$) — fitting a Cox model does not directly yield an estimate of $\hat{S}_i(t)$, since the partial likelihood is independent of the choice of the baseline hazard $h_0(t)$. We first compute the (Breslow-type) cumulative baseline hazard function

$$\hat{H}_0(t) = \sum_{i : y_i \leq t} \frac{\delta_i}{\sum_{j \in R(t_i)} \exp(x_j^T \hat{\beta})}.$$

We use the sorlie gene expression data set (Sorlie and Tibshirani, 2003), which contains 549 gene expression measurements together with the survival times for 115 females diagnosed with cancer. This data set was also used by Gorst-Rasmussen and Scheike (2012) to demonstrate the performance of the R package ahaz. We perform an elastic net penalized survival regression by varying both $\alpha$ and $\lambda$.
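Before turning to the example, the Harrell concordance measure used for cross-validation above can be sketched naively as follows — an $O(n^2)$ Python illustration of the pair-counting idea, not the survival package's implementation (which handles weights and more elaborate tie conventions):

```python
def harrell_c_index(time, event, risk_score):
    """Naive Harrell C-index. A pair (i, j) is comparable when the earlier
    time is an observed event; it is concordant when the earlier failure
    has the higher predicted risk. Ties in risk count one half."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:           # censored earlier time: pair not comparable
            continue
        for j in range(n):
            if time[j] > time[i]:  # j outlives i's event time
                comparable += 1
                if risk_score[i] > risk_score[j]:
                    concordant += 1.0
                elif risk_score[i] == risk_score[j]:
                    concordant += 0.5
    return concordant / comparable
```

A perfectly anti-ranked model scores 0, a random one 0.5, and a perfectly ranked one 1, which is why the cross-validated $C$-index is maximized rather than minimized.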
In pcoxtime, multiple $\alpha$ values can be specified for cross-validation. We suggest first running the analysis for an intermediate range of $\alpha$ values. If the optimal $\alpha$ based on 10-fold cross-validation (over all $\lambda$ values considered) is at the lower bound of the range considered, then extend the range of the $\alpha$ vector to lower (positive) values; if it is at the upper bound, extend the range upward. In this example, based on our preliminary runs, we restrict $\alpha$ to $\alpha = \{0.4, 0.5\}$. For each $\alpha$ value in this final set, we analyze a solution path of $\lambda$ values and use 10-fold cross-validation to choose the optimal values of $\alpha$ and $\lambda$.

We first load the data as follows:

R> data("sorlie", package = "ahaz")
R> dat <- sorlie
R> dim(dat)
[1] 115 551

It is common practice to standardize the predictors before applying penalized methods. In pcoxtime, predictors are scaled internally (but the user can choose to output coefficients on the original scale [the default] or to output standardized coefficients). We make the following call to pcoxtimecv to perform 10-fold cross-validation to choose the optimal $\alpha$ and $\lambda$.

R> cv_fit1 <- pcoxtimecv(Surv(time, status) ~ ., data = dat,
+    alphas = c(0.4, 0.5), lambdas = NULL, devtype = "vv",
+    lamfract = 0.6, refit = TRUE, nclusters = 4
+ )
Progress: Refitting with optimal lambdas...

In order to reduce the computation time, lamfract is set to 0.6 to avoid $\lambda$ values near zero (Simon et al., 2011). Setting lamfract in this way specifies that only a subset of the full sequence of 100 $\lambda$ values is used. Once cross-validation is performed, we can report the optimal $\lambda$ ($= 0.28$) and $\alpha$ ($= 0.4$) and view the cross-validated error plot (Figure 1) and the regularization path plot (Figure 3).
R> print(cv_fit1)
Call:
pcoxtimecv(formula = Surv(time, status) ~ ., data = dat, alphas = c(0.4,
    0.5), lambdas = NULL, lamfract = 0.6, devtype = "vv", refit = TRUE,
    nclusters = 4)

Optimal parameter values
lambda.min lambda.1se alpha.optimal
 0.2756254  0.6077901           0.4

R> cv_error1 <- plot(cv_fit1)
R> print(cv_error1)
R> solution_path1 <- plot(cv_fit1, type = "fit")
R> print(solution_path1)

Next, we fit the penalized model using the optimal $\alpha$ and $\lambda$ and plot the estimated survival function.

R>

We then plot the predicted survival function for each patient and the average survival function in Figure 4.

R> surv_avg <- pcoxsurvfit(fit1)
R> surv_df <- with(surv_avg, data.frame(time, surv))
R> surv_ind <- pcoxsurvfit(fit1, newdata = dat)
R> splot_sorlie <- plot(surv_ind, lsize = 0.05, lcol = "grey") +
+    geom_line(data = surv_df, aes(x = time, y = surv, group = 1),
+      col = "red"
+    )
R> print(splot_sorlie)

Figure 1: Plots of the cross-validated error rates for a sequence of $\lambda$ values at different $\alpha$ values — pcoxtimecv automatically chooses the optimal $\alpha$. The left dotted line indicates the minimum error while the right indicates the largest value of $\lambda$ such that the error is within one standard deviation. In this case, the optimal $\lambda$ is obtained when $\alpha = 0.4$ (left plot).

We now consider survival data with time-dependent covariates and repeat the analysis outlined above to demonstrate how to use pcoxtime in this context. We consider the chronic granulotomous disease (cgd) data set from the R package survival (Therneau, 2020).
It contains data on time to serious infection observed through the end of the study for 128 unique patients (because some patients were observed more than once, there are 203 observations in total). We load the data and perform cross-validation as follows:

R> data("cgd", package = "survival")
R> dat <- cgd
R> dim(dat)
[1] 203 16

R> cv_fit2 <- pcoxtimecv(Surv(tstart, tstop, status) ~ treat + sex +
+    ns(age, 3) + height + weight + inherit + steroids + propylac +
+    hos.cat, data = dat, alphas = c(0.2, 0.5, 0.8), lambdas = NULL,
+    devtype = "vv", lamfract = 0.6, refit = TRUE, nclusters = 4
+ )
Progress: Refitting with optimal lambdas...

Here, we choose $\alpha = \{0.2, 0.5, 0.8\}$ for 10-fold cross-validation.

R> print(cv_fit2)
Call:
pcoxtimecv(formula = Surv(tstart, tstop, status) ~ treat + sex +
    ns(age, 3) + height + weight + inherit + steroids + propylac +
    hos.cat, data = dat, alphas = c(0.2, 0.5, 0.8), lambdas = NULL,
    lamfract = 0.6, devtype = "vv", refit = TRUE, nclusters = 4)

Optimal parameter values
lambda.min lambda.1se alpha.optimal
0.01477729  0.3834742           0.5

The optimal $\lambda$ and $\alpha$ in this case are 0.015 and 0.5, respectively, and the cross-validation error plot is shown in Figure 2.

R> cv_error2 <- plot(cv_fit2)
R> print(cv_error2)

We plot the solution path (Figure 3) and fit the penalized model based on the optimal $\lambda$ and $\alpha$:

R> solution_path2 <- plot(cv_fit2, type = "fit")
R> print(solution_path2)
R> alp <- cv_fit2$alpha.optimal
R> lam <- cv_fit2$lambda.min
R>

Again, we use the fit2 object to plot the predicted individual and average survival curves as shown in Figure 4.
R> surv_avg <- pcoxsurvfit(fit2)
R> surv_df <- with(surv_avg, data.frame(time, surv))
R> surv_ind <- pcoxsurvfit(fit2, newdata = dat)
R> splot_cgd <- plot(surv_ind, lsize = 0.05, lcol = "grey") +
+    geom_line(data = surv_df, aes(x = time, y = surv, group = 1),
+      col = "red"
+    )
R> print(splot_cgd)

Figure 2: Plots of the cross-validated error rates for a sequence of $\lambda$ values at different $\alpha$ values for the cgd data.

We provide a user-friendly wrapper, simtdc, for the permalgorithm function in the R package PermAlgo, which implements an extended permutational algorithm for simulating time-dependent covariates. The algorithm is described and validated in Sylvestre and Abrahamowicz (2008). We simulated a data set for 100 individuals with a follow-up time of up to 10 years, with 100 time-dependent and 900 time-fixed covariates. The survival times $T_i$ (generated independently of the covariates (Sylvestre and Abrahamowicz, 2008)) and censoring times $C_i$ are drawn from exponential and uniform distributions, respectively. The resulting start-stop data set has $n = 378$ rows and $p = 1000$ covariates, with 63% of the subjects having experienced the event of interest by the end of the follow-up time.

Our proximal gradient descent algorithm, pcoxtime, took 10 times as long to compute the complete solution path as the combined gradient descent-Newton-Raphson method from penalized (Goeman, 2010) (357 seconds vs. 34 seconds). To make the algorithms comparable, we only performed lasso penalization, since penalized does not currently fully implement the elastic net, and made penalized solve for the solution path generated by pcoxtime, with $\lambda_{\min}$ set to the same fixed fraction of $\lambda_{\max}$.
Our current implementation has a downside — although the likelihood and coefficient estimations are performed in C++ for a single $\lambda$, the solution paths are computed in R, which is likely to be slower than an implementation that is completely coded in C++. On the other hand, penalized performs most of its computation in C++. Table 1 summarizes specific strengths of our algorithm. In terms of prediction accuracy, as measured by Harrell's $C$-statistic (Therneau, 2020), the two packages are essentially equivalent.

Figure 3: Plot of the regularization path for the optimal $\alpha$ with the corresponding $\lambda$ values. The values at the top of the plot give the number of nonzero coefficients (size of the model) at various $\lambda$ values. The left plot is for the sorlie data set while the right plot is for the cgd data set.

In this section, we compare the capabilities of pcoxtime to some of the most general and widely used R packages for penalized CPH models — glmnet, fastcox and penalized. One of the important aspects of high-dimensional data analysis is computational speed. Coordinate descent based methods (glmnet and fastcox) are usually much faster than gradient descent based methods (pcoxtime and penalized), which are slower by design (Simon et al., 2011). Table 1 summarizes some important capabilities of the packages.

Figure 4: Predicted individual and average (red) survival curves for the sorlie (left) and cgd (right) data sets.
                                      pcoxtime  glmnet  fastcox  penalized
Supported models
  Time-dependent covariates           yes       no      no       yes
  Full elastic net                    yes       yes     yes      no
Post model predictions
  Survival and hazard functions       yes       no      no       yes
  Model validation and calibration    yes       no      no       no

Table 1: Capabilities of the pcoxtime, glmnet, fastcox and penalized packages.

We have shown how the penalized CPH model can be extended to handle time-dependent covariates. To achieve this, we have employed a proximal gradient descent algorithm. This paper provides a general overview of the R package pcoxtime and serves as a good starting point for further exploring the capabilities of the package. In future, we plan to improve the functionality of pcoxtime. In particular, we plan to move to an implementation using the coordinate descent algorithm, which has proven to be efficient for high-dimensional data.

Acknowledgments

We would like to thank Mr. Erik Drysdale and Mr. Brian Kiprop, who provided valuable feedback on earlier versions of the package.

References

Peter C Austin, Aurélien Latouche, and Jason P Fine. A review of the use of time-varying covariates in the Fine-Gray subdistribution hazard competing risk regression model. Statistics in Medicine, 39(2):103–113, 2020.

Jonathan Barzilai and Jonathan M Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.

Andreas Bender, Andreas Groll, and Fabian Scheipl. A generalized additive model approach to time-to-event analysis. Statistical Modelling, 18(3-4):299–321, 2018.

Norman E Breslow. Contribution to discussion of paper by Dr Cox. J. Roy. Statist. Soc., Ser. B, 34:216–217, 1972.

Ross Cagan and Pablo Meyer. Rethinking cancer: Current challenges and opportunities in cancer research, 2017.

Biyue Dai and Patrick Breheny. Cross validation approaches for penalized Cox regression. ArXiv Preprint ArXiv:1905.10432, 2019.

Jelle J Goeman.
L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52(1):70–84, 2010.

Anders Gorst-Rasmussen and Thomas H Scheike. Coordinate descent methods for the penalized semiparametric additive hazards model. Journal of Statistical Software, 47(1):1–17, 2012.

Jiang Gui and Hongzhe Li. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics, 21(13):3001–3008, 2005.

Frank E Harrell Jr, Kerry L Lee, and Daniel B Mark. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15(4):361–387, 1996.

Axel Hochstein, Hyung-Il Ahn, Ying Tat Leung, and Matthew Denesuk. Survival analysis for HDLSS data with time dependent variables: Lessons from predictive maintenance at a mining service provider. In Proceedings of 2013 IEEE International Conference on Service Operations and Logistics, and Informatics, pages 372–381. IEEE, 2013.

Håvard Kvamme, Ørnulf Borgan, and Ida Scheel. Time-to-event prediction with neural networks and Cox regression. Journal of Machine Learning Research, 20(129):1–30, 2019.

Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

Mee Young Park and Trevor Hastie. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):659–677, 2007.

Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for Cox's proportional hazards model via coordinate descent. Journal of Statistical Software, 39(5):1, 2011.

Insuk Sohn, Jinseog Kim, Sin-Ho Jung, and Changyi Park. Gradient lasso for Cox proportional hazards model. Bioinformatics, 25(14):1775–1781, 2009.

T Sorlie and R Tibshirani. Repeated observation of breast tumor subtypes in independent gene expression data sets.
Proc Natl Acad Sci USA, 100:8418–8423, 2003.

Marie-Pierre Sylvestre and Michal Abrahamowicz. Comparison of algorithms to generate event times conditional on time-dependent covariates. Statistics in Medicine, 27(14):2618–2634, 2008.

Terry Therneau, Cindy Crowson, and Elizabeth Atkinson. Using time dependent covariates and time dependent coefficients in the Cox model. Survival Vignettes, 2017.

Terry M Therneau. A Package for Survival Analysis in R, 2020. URL https://CRAN.R-project.org/package=survival. R package version 3.2-7.

Ryan J Tibshirani et al. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.

Ying-Wooi Wan, John Nagorski, Genevera I Allen, Zhaohui Li, and Zhandong Liu. Identifying cancer biomarkers through a network regularized Cox model. In , pages 36–39. IEEE, 2013.

Yi Yang and Hui Zou. A cocktail algorithm for solving the elastic net penalized Cox's regression in high dimensions. Statistics and its Interface, 6(2):167–173, 2013.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.