# pcoxtime: Penalized Cox Proportional Hazard Model for Time-dependent Covariates


A Preprint

Steve Cygu

McMaster University

Benjamin M. Bolker

McMaster University

Abstract

The penalized Cox proportional hazard model is a popular analytical approach for survival data with a large number of covariates. Such problems are especially challenging when covariates vary over follow-up time (i.e., the covariates are time-dependent). The standard R packages for fully penalized Cox models cannot currently incorporate time-dependent covariates. To address this gap, we implement a variant of the gradient descent algorithm (proximal gradient descent) for fitting penalized Cox models. We apply our implementation to real and simulated data sets.

Survival analysis studies event times — for example, time to cancer recurrence or time to death due to cancer. The goal is to predict the time-to-event (survival time) using a set of covariates, and to estimate the effect of different covariates on survival. Survival models typically attempt to estimate the hazard, the probability (density) of the event of interest occurring within a specific small time interval. Binary classification methods from machine learning (Kvamme, Borgan, and Scheel, 2019) can be used in problems that focus on predicting whether an event occurs within a specified time window. However, while binary classifiers can predict outcomes for a specified time window, they fail to account for one of the unique characteristics of survival data — censoring. In survival data, it is common that some of the individuals are event-free by the end of the follow-up time, resulting in censored times rather than observed event times. Binary classifiers typically ignore such observations; one of the main goals of survival analysis is to make full use of the information they provide.

Cox proportional hazard (CPH) models are predominantly used to model survival data. The CPH model is usually applied in problems where the number of observations, $n$, is much larger than the number of covariates, $p$.
In the modern era of big data, researchers often encounter cases where $p \approx n$ (or $p \gg n$). In cancer research, for example, rapid advancement in genomics technologies combined with the low cost of sequencing have led to the generation of vast amounts of cancer data (Cagan and Meyer, 2017) — presenting inherent challenges for effective and efficient data handling and analysis. Penalized regression methods such as lasso, ridge, or elastic net offer a statistically convenient way of handling high-dimensional data, especially when building predictive models. Depending on the objectives of the analysis — for example, when one of the goals is to identify a small subset of prognostic factors associated with cancer mortality — the subclass of penalized methods that are sparsity-inducing (e.g. lasso and elastic net) can simultaneously estimate and select useful predictive features.

The standard CPH model (i.e., with no time-dependent covariates) assumes that the hazard ratio is constant over the entire follow-up period, or that each covariate has a constant multiplicative effect over time on the hazard function. This assumption is violated in some situations. For example, in cancer prognosis, patients' treatment plans or health services use may change over time. Some implementations of CPH models allow such time-dependent covariates. However, their use requires more attention than the fixed (time-independent) covariates (Hochstein, Ahn, Leung, and Denesuk, 2013; Therneau, Crowson, and Atkinson, 2017; Austin, Latouche, and Fine, 2020).


Many authors have implemented CPH models with penalization (Gui and Li, 2005; Park and Hastie, 2007; Sohn, Kim, Jung, and Park, 2009; Goeman, 2010), but all of these implementations share a common limitation — poor computational performance due to the use of the Newton-Raphson algorithm for estimation (Gorst-Rasmussen and Scheike, 2012). In the last decade, however, some researchers have developed more efficient implementations of penalized CPH models. For instance, Simon, Friedman, Hastie, and Tibshirani (2011) describe and implement an impressively fast algorithm, coxnet, implemented in the glmnet package, for fitting regularized CPH models via weighted cyclic coordinate descent. This method is computationally efficient in handling high-dimensional problems. Yang and Zou (2013) proposed and implemented the cocktail algorithm — a mixture of coordinate descent, the majorization-minimization principle, and the strong rule — for solving penalized CPH models in high-dimensional data. The cocktail algorithm always converges to the correct solution and is slightly faster than the coxnet algorithm; it is implemented in the R package fastcox. However, these implementations, the benchmark R packages for penalized Cox models, have some specific limitations. The implementations by Simon et al. (2011) and Yang and Zou (2013) currently do not support time-dependent covariates, whereas the implementation by Goeman (2010) does incorporate time-dependent covariates but does not fully implement the elastic net.

Other, non-CPH-based approaches have also incorporated time-dependent covariates in penalized models for time-to-event data. Most of these studies have used generalized additive models to implement semiparametric regression methods in the context of survival models (Gorst-Rasmussen and Scheike, 2012; Bender, Groll, and Scheipl, 2018).
Gorst-Rasmussen and Scheike (2012) used a cyclic coordinate descent algorithm to develop a penalized semiparametric additive hazard model, implemented in the R package ahaz. The model defines a hazard function as the sum of the baseline hazard and the regression function of the covariates — it is intrinsically linear, thus theoretically guarantees convergence, and can handle time-dependent covariates. However, it currently implements only lasso penalization.

In this paper, we describe and implement an algorithm and R package (pcoxtime) for penalized CPH models that handles both time-independent and time-dependent covariates. The general ability of penalized methods to handle high-dimensional data makes this algorithm a candidate tool for relatively high-dimensional problems. We describe how existing computational approaches for CPH modeling can be adapted to obtain penalized methods for time-dependent covariates in time-to-event data. To solve the optimization problem, we exploit a variant of the gradient descent algorithm known as proximal gradient descent (as outlined in Parikh and Boyd (2014)) with Barzilai-Borwein step-size adjustment (Barzilai and Borwein, 1988). One of the important aspects of high-dimensional data analysis is computational speed; coordinate descent based methods are usually much faster than gradient descent based methods, which are slower by design (Simon et al., 2011). In the present case, we argue that capability is also important, i.e., handling time-dependent covariates and a full implementation of the elastic net. We also provide examples of its usage on real data and compare its performance with that of the penalized package on simulated data with time-dependent covariates.
Consider the standard survival data of the form $\{y_i, \delta_i, x_i\}_{i=1}^{n}$, where $y_i$ is the observed event time (mortality or failure time, or censoring time), $\delta_i$ is an indicator variable which is 1 if the event has occurred during the follow-up period or 0 if the event is censored, and $x_i$ is a vector of covariates $(x_{i,1}, x_{i,2}, \cdots, x_{i,p})$. Let $t_1 < t_2 < \cdots < t_m$ be the distinct failure times, and $j(i)$ be the index of the observation failing at time $t_i$. The Cox proportional hazard model defines the hazard function at time $t$ as

$$h_i(t) = h_0(t) \exp(x_i^T \beta) \qquad (1)$$

where $h_0(t)$ is the non-parametric baseline hazard and $\beta$ is the coefficient vector of length $p$. Also, let $R_i$ denote the risk set at time $t_i$ (i.e., $\{j : y_j \geq t_i\}$), and for simplicity, assume there are no ties in $y_i$ (ties are handled below via Breslow's method). Inference for a Cox model uses the partial likelihood function

$$L(\beta) = \prod_{i=1}^{m} \frac{\exp(x_{j(i)}^T \beta)}{\sum_{j \in R_i} \exp(x_j^T \beta)}. \qquad (2)$$

The Cox model in Equation 1 is fitted in two steps — first, the parametric part is fitted by maximizing the partial likelihood in Equation 2, and then the non-parametric baseline hazard is estimated based on the parametric results.
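To make Equation 2 concrete, here is a minimal numerical sketch of the partial log-likelihood (assuming no ties; the data and the function name `cox_partial_loglik` are our own illustrations, not part of pcoxtime):

```python
import numpy as np

def cox_partial_loglik(beta, X, time, status):
    """Log partial likelihood for the Cox model (no ties assumed).

    Each observed failure contributes the log of its relative hazard
    within the risk set R_i = {j : y_j >= t_i}.
    """
    eta = X @ beta                       # linear predictors x_i^T beta
    ll = 0.0
    for i in np.where(status == 1)[0]:   # sum over failures only
        at_risk = time >= time[i]        # risk set R_i
        ll += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return ll

# Tiny example: 4 subjects, 2 covariates; subject 2 is censored.
X = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 1.0], [1.5, 0.5]])
time = np.array([2.0, 5.0, 3.0, 8.0])
status = np.array([1, 0, 1, 1])
print(cox_partial_loglik(np.array([0.1, -0.2]), X, time, status))
```

A useful sanity check: at $\beta = 0$ every subject has relative hazard 1, so the log partial likelihood reduces to $-\sum_i \log |R_i|$.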


When a covariate changes over time during the follow-up period, the observed survival data is of the form $\{y_i^{\text{start}}, y_i^{\text{stop}}, \delta_i, x_i\}_{i=1}^{n}$, where $y_i^{\text{start}}$ and $y_i^{\text{stop}}$ delimit the time period in which covariate information about individual $i$ is observed, $\delta_i$ indicates whether the event has occurred for individual $i$ at time $y_i^{\text{stop}}$, and the description of $x_i$ remains the same as before but is now a (piecewise) function of time. The Cox proportional hazard model partial likelihood, with Breslow's method for handling tied event times (Breslow, 1972), is given by

$$L(\beta) = \prod_{i=1}^{N} \left( \frac{\exp(x_i(t)^T \beta)}{\sum_{j \in R(t_i)} \exp(x_j(t)^T \beta)} \right)^{\delta_i} \qquad (3)$$

where the risk set at time $t_i$ is now defined as $R(t_i) = \{j : (y_j^{\text{stop}} \geq t_i) \wedge (y_j^{\text{start}} < t_i)\}$. The first condition, $(y_j^{\text{stop}} \geq t_i)$, ensures that individual $j$ either experienced the event or was censored at a later time point than $t_i$, while the second condition, $(y_j^{\text{start}} < t_i)$, ensures the start time was observed before the event. From Equation 3, the partial log-likelihood is defined as

$$\ell(\beta) = \sum_{i=1}^{N} \delta_i \left[ x_i(t)^T \beta - \log \sum_{j \in R(t_i)} \exp(x_j(t)^T \beta) \right] \qquad (4)$$

and $\hat{\beta}$ is obtained by minimizing the negative of Equation 4.

Our algorithm extends the partial log-likelihood in Equation 4 by adding a penalty term. Let $P_{\alpha,\lambda}(\beta)$ be a mixture of $\ell_1$ (lasso — non-differentiable at $\beta = 0$) and $\ell_2$ (ridge) penalties. The penalized Cox partial log-likelihood is defined as $\mathrm{pen}\,\ell(\beta)_{\alpha,\lambda} = -\ell(\beta) + P_{\alpha,\lambda}(\beta)$, where

$$P_{\alpha,\lambda}(\beta) = \lambda \left( \alpha \sum_{i=1}^{p} |\beta_i| + 0.5\,(1-\alpha) \sum_{i=1}^{p} \beta_i^2 \right) \qquad (5)$$

as proposed by Zou and Hastie (2005), with $\lambda > 0$ and $0 < \alpha \leq 1$. The lasso penalty ($\sum_{i=1}^{p} |\beta_i|$) induces sparsity by selecting a subset of nonzero coefficients. It works well in high-dimensional applications, but will eliminate all but one of any set of strongly multicollinear terms. On the other hand, the ridge penalty ($\sum_{i=1}^{p} \beta_i^2$) shrinks coefficients towards, but never all the way to, zero; hence it gives non-sparse estimates and gives correlated predictors approximately equal weights. The elastic net penalty combines the strengths of the lasso and ridge penalties for improved predictive performance (Simon et al., 2011). As $\alpha$ changes from 0 to 1 for a fixed value of $\lambda$, the sparsity of the solution increases, i.e., the solution becomes more lasso-like and less ridge-like.

Using Equation 5, our problem becomes

$$\hat{\beta} = \arg\min_{\beta} \left\{ -\sum_{i=1}^{N} \delta_i \left[ x_i(t)^T \beta - \log \sum_{j \in R(t_i)} \exp(x_j(t)^T \beta) \right] + \lambda \left( \alpha \sum_{i=1}^{p} |\beta_i| + 0.5\,(1-\alpha) \sum_{i=1}^{p} \beta_i^2 \right) \right\}. \qquad (6)$$

To obtain parameter estimates, we apply proximal gradient descent (Parikh and Boyd, 2014) to solve Equation 6. Consider a composite function $f(\beta) = g(\beta) + h(\beta)$, such that $g(\beta)$ is differentiable and convex and $h(\beta)$ is convex but not necessarily differentiable. The proximal gradient descent approximation, based on the proximal operator, is defined as

$$\mathrm{prox}_h(x) = \arg\min_{u} \left( \frac{1}{2} \| u - x \|^2 + h(u) \right) = S_h(x)$$

where $S_h(x) = \mathrm{sgn}(x)(|x| - h)_{+}$ is the soft thresholding operator.
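As a sketch of how the start-stop risk set $R(t)$ and the penalized objective of Equation 6 fit together, consider the following minimal illustration (the function names and toy data are ours, purely for exposition, not pcoxtime's internals):

```python
import numpy as np

def risk_set(start, stop, t):
    """R(t) = {j : y_j^stop >= t and y_j^start < t} for start-stop data."""
    return np.where((stop >= t) & (start < t))[0]

def penalized_negloglik(beta, X, start, stop, status, lam, alpha):
    """Objective of Equation 6: negative log partial likelihood
    plus the elastic-net penalty."""
    eta = X @ beta
    nll = 0.0
    for i in np.where(status == 1)[0]:           # events only
        rs = risk_set(start, stop, stop[i])      # risk set at event time
        nll -= eta[i] - np.log(np.exp(eta[rs]).sum())
    penalty = lam * (alpha * np.abs(beta).sum()
                     + 0.5 * (1.0 - alpha) * (beta ** 2).sum())
    return nll + penalty

# Three intervals; the third subject enters the study at t = 2.
start = np.array([0.0, 0.0, 2.0])
stop = np.array([3.0, 2.0, 5.0])
status = np.array([1, 1, 1])
X = np.array([[0.2], [-0.1], [0.4]])
print(penalized_negloglik(np.array([0.3]), X, start, stop, status,
                          lam=0.5, alpha=0.9))
```

Note that delayed entry matters: the third subject, entering at $t = 2$, is excluded from the risk set of the event at $t = 2$ because its start time is not strictly before $t$.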


We can decompose the objective function (Equation 6) as $f(\beta) = g(\beta) + h(\beta)$ with

$$g(\beta) = -\ell(\beta) + 0.5\,\lambda(1-\alpha) \sum_{i=1}^{p} \beta_i^2 \quad \text{and} \quad h(\beta) = \lambda\alpha \sum_{i=1}^{p} |\beta_i|.$$

Hence, the proximal operator to update $\beta$ is given by

$$\beta^{(k)} = S_{\gamma_k h}\left( \beta^{(k-1)} - \gamma_k \nabla g(\beta^{(k-1)}) \right), \qquad k = 1, 2, 3, \cdots \qquad (7)$$

where $\gamma_k$ is the step size determined via Barzilai-Borwein step-size adjustment (Barzilai and Borwein, 1988) and

$$\nabla g(\beta) = -\nabla \ell(\beta) + \lambda(1-\alpha)\beta = -X^T(\delta - P\delta) + \lambda(1-\alpha)\beta.$$

For simplicity, denote $\eta_i = x_i^T\beta$, $w_j = \exp(x_j(t)^T\beta)$ and $W_i = \sum_{j \in R(t_i)} w_j$. Let $y_i(t_j) = I(i \in R(t_j))$ be the at-risk indicator, equal to 1 if individual $i$ is still at risk at time $t_j$. Also, let $\pi_{ij} = y_i(t_j)\, w_i / W_j$. By evaluating the partial derivative of Equation 3 with respect to $\eta_k$ (the linear predictor at the $k$th iteration) and applying the chain rule,

$$\nabla \ell(\beta) = \left(\frac{\partial \eta}{\partial \beta}\right)^T \frac{\partial \ell(\beta)}{\partial \eta} = X^T(\delta - P\delta) \qquad (8)$$

where $P$ is an $n \times n$ matrix with entries $\pi_{ij}$. In our particular case, the proximal operator reduces to the soft thresholding operator

$$S_{\gamma_k \lambda\alpha}(x_i) = \begin{cases} x_i - \gamma_k \lambda\alpha & x_i \geq \gamma_k \lambda\alpha \\ 0 & -\gamma_k \lambda\alpha \leq x_i \leq \gamma_k \lambda\alpha \\ x_i + \gamma_k \lambda\alpha & x_i \leq -\gamma_k \lambda\alpha \end{cases}$$

which allows us to perform updates via Equation 7.

To train an optimal model, we need to choose the optimal value of $\lambda$ — a large value of $\lambda$ will make the penalty terms in Equation 6 dominate, setting all coefficients to zero; on the other hand, a small $\lambda$ value will overfit the model. An ideal approach is to perform cross-validation, but we need to decide which values of $\lambda$ to use in cross-validation. At $\lambda = 0$, the model fits an unpenalized Cox model, while at some sufficiently large $\lambda$, all the coefficients are set to zero. To compute the regularization path, we need a sequence of $\lambda$s falling in between these extremes, such that $\lambda_1 < \lambda_2 < \cdots < \lambda_{\max}$.
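The soft-thresholding operator and the proximal update of Equation 7 with a Barzilai-Borwein step can be sketched as follows. For brevity this hedged illustration applies the update to a lasso least-squares problem rather than the Cox loss; the names are ours, and the step-size safeguard is a simplification:

```python
import numpy as np

def soft_threshold(x, thresh):
    """S_h(x) = sgn(x) * max(|x| - h, 0), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def proximal_gradient(grad_g, beta0, lam, alpha, step=0.1, iters=200):
    """Iterate Equation 7: beta <- S_{step*lam*alpha}(beta - step*grad g(beta)),
    adjusting the step with a Barzilai-Borwein rule."""
    beta = beta0.copy()
    g = grad_g(beta)
    for _ in range(iters):
        beta_new = soft_threshold(beta - step * g, step * lam * alpha)
        g_new = grad_g(beta_new)
        db, dg = beta_new - beta, g_new - g
        if dg @ dg > 0:                  # BB step: |<db, dg>| / <dg, dg>
            step = max(abs(db @ dg) / (dg @ dg), 1e-10)
        beta, g = beta_new, g_new
    return beta

# Demo: g(b) = 0.5 * ||b - y||^2 (so grad g(b) = b - y) with a pure lasso
# penalty; the exact minimizer is soft_threshold(y, lam * alpha).
y = np.array([3.0, 0.1, -2.0])
beta_hat = proximal_gradient(lambda b: b - y, np.zeros(3),
                             lam=1.0, alpha=1.0)
print(beta_hat)  # close to [2.0, 0.0, -1.0]
```

The same loop applies to the penalized Cox objective once `grad_g` is replaced by the gradient in Equation 8 plus the ridge term.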
We start with a sufficiently large $\lambda_{\max}$, which sets $\beta = 0$, and decrease $\lambda$ until we are near the unpenalized solution (this is the warm-start approach employed in glmnet). By the KKT conditions (Yang and Zou, 2013; Wan, Nagorski, Allen, Li, and Liu, 2013; Tibshirani et al., 2013), if $\frac{1}{N}|X^T(\delta - P\delta)| \leq \alpha\lambda\gamma_k$ then $\hat{\beta} = 0$. Thus, by unpacking $P$, we notice that

$$\lambda_{\max} = \frac{1}{N\alpha\gamma_k} \max\left\{ \left| \sum_{i=1}^{N} \left( x_i \delta_i - \frac{\delta_i \sum_{j \in R(t_i)} x_j}{|R(t_i)|} \right) \right| \right\} \qquad (9)$$

where the maximum is taken over the covariates and $|R(t_i)|$ denotes the cardinality of the risk set $R(t_i)$. In our implementation, we set $\lambda_{\min} = \epsilon\lambda_{\max}$, and compute solutions over a grid of $k$ values of $\lambda$ decreasing from $\lambda_{\max}$ to $\lambda_{\min}$, where $\lambda_i = \lambda_{\max}(\lambda_{\min}/\lambda_{\max})^{i/(k-1)}$ for $i = 0, \cdots, k-1$; the default $k$ is 100. The default value of $\epsilon$ depends on the shape of the data: a smaller $\epsilon$ is used when $n \geq p$ and a larger $\epsilon$ when $n < p$.
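The $\lambda$ grid described above can be computed in a few lines (a sketch mirroring the formula; the function name and defaults below are our own illustration, not pcoxtime's):

```python
import numpy as np

def lambda_sequence(lam_max, eps=1e-4, k=100):
    """Geometric grid lam_i = lam_max * (lam_min/lam_max)^{i/(k-1)},
    decreasing from lam_max down to lam_min = eps * lam_max."""
    i = np.arange(k)
    return lam_max * eps ** (i / (k - 1))

grid = lambda_sequence(2.0, eps=0.01, k=5)
print(grid)  # [2.0, 0.632..., 0.2, 0.0632..., 0.02]
```

A geometric (log-spaced) grid spends more of its budget near $\lambda_{\max}$, where coefficients enter the model one by one, which is where the solution path changes fastest.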


To select the optimal $\lambda$, we perform $k$-fold cross-validation — splitting the training data into $k$ parts, the model is trained on $k-1$ parts and evaluated on the $k$th left-out part via some predictive performance measure. Here, we adopt two approaches — the cross-validated partial likelihood and the cross-validated $C$-index (Dai and Breheny, 2019). In the first case, the goodness of fit for each split $k$ and each $\lambda$ is

$$\widehat{\mathrm{CVE}}(\lambda) = -\sum_{k=1}^{K} \left[ \ell(\hat{\beta}_{-k}(\lambda)) - \ell_{-k}(\hat{\beta}_{-k}(\lambda)) \right] \qquad (10)$$

where $\ell(\hat{\beta}_{-k}(\lambda))$ is the log partial likelihood evaluated at $\hat{\beta}_{-k}$ on the whole data set and $\ell_{-k}(\hat{\beta}_{-k}(\lambda))$ is the log partial likelihood evaluated at $\hat{\beta}_{-k}$ on the non-left-out data. We choose the $\lambda$ which minimizes Equation 10. In the second case, we implement an alternative approach to cross-validation using the concordance statistic for Cox models, also known as the cross-validated $C$-index, based on Harrell's concordance index (Harrell Jr, Lee, and Mark, 1996). The $C$-index computes the probability that, for a random pair of individuals, the predicted survival times of the pair have the same ordering as their true survival times. Our implementation is similar to that of the R package survival (see Therneau (2020) for a detailed description).

Once $\hat{\beta}$ is estimated, we can estimate the baseline hazard function ($\hat{h}_0(t)$), and hence the survival function ($\hat{S}(t)$) — fitting a Cox model does not directly yield an estimate of $\hat{S}_i(t)$, since the partial likelihood is independent of the choice of the baseline hazard $h_0(t)$. We first compute the (Breslow-type) cumulative baseline hazard function

$$\hat{H}_0(t) = \sum_{i : y_i \leq t} \frac{\delta_i}{\sum_{j \in R(t_i)} \exp(x_j^T \hat{\beta})}.$$
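The baseline hazard step can be sketched with a Breslow-type estimator (a standard construction for time-independent covariates; the function names and toy data are our own illustration):

```python
import numpy as np

def breslow_cumhaz(beta_hat, X, time, status, eval_times):
    """Cumulative baseline hazard H0(t): each event time y_i <= t adds
    1 / sum_{j in R(t_i)} exp(x_j^T beta_hat)."""
    haz = np.exp(X @ beta_hat)
    H0 = np.zeros(len(eval_times))
    for k, t in enumerate(eval_times):
        for i in np.where((status == 1) & (time <= t))[0]:
            H0[k] += 1.0 / haz[time >= time[i]].sum()
    return H0

def survival_curve(beta_hat, X, time, status, x_new, eval_times):
    """S(t | x_new) = exp(-H0(t) * exp(x_new^T beta_hat))."""
    H0 = breslow_cumhaz(beta_hat, X, time, status, eval_times)
    return np.exp(-H0 * np.exp(x_new @ beta_hat))

# With beta = 0 and no censoring this reduces to the Nelson-Aalen
# estimator: H0(3) = 1/3 + 1/2 + 1/1.
time = np.array([1.0, 2.0, 3.0])
status = np.array([1, 1, 1])
X = np.zeros((3, 1))
print(breslow_cumhaz(np.zeros(1), X, time, status, np.array([3.0])))
```

The resulting step-function estimate of $\hat{H}_0(t)$ then yields the survival curve $\hat{S}(t \mid x) = \exp(-\hat{H}_0(t) e^{x^T \hat{\beta}})$ for any covariate profile.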

We use the sorlie gene expression data set (Sorlie and Tibshirani, 2003), which contains 549 gene expression measurements together with the survival times for 115 females diagnosed with cancer. This data set was also used by Gorst-Rasmussen and Scheike (2012) to demonstrate the performance of the R package ahaz.

We perform an elastic net penalized survival regression by varying both α and λ. In pcoxtime, multiple α values can be specified for cross-validation. We suggest first running the analysis for an intermediate range of α values. If the optimal α based on 10-fold cross-validation (over all λ values considered) is at the lower bound of the range considered, then extend the range of the α vector to lower (positive) values; if it is at the upper bound, extend the range upward. In this example, based on our preliminary runs, we restrict α to α = {0.4, 0.5}. For each α value in this final set, we analyze a solution path of λ values and use 10-fold cross-validation to choose the optimal values of α and λ.

We first load the data as follows:

```r
R> data("sorlie", package = "ahaz")
R> dat <- sorlie
R> dim(dat)
[1] 115 551
```


It is common practice to standardize the predictors before applying penalized methods. In pcoxtime, predictors are scaled internally (but the user can choose to output coefficients on the original scale [the default] or to output standardized coefficients). We make the following call to pcoxtimecv to perform 10-fold cross-validation to choose the optimal α and λ.

```r
R> cv_fit1 <- pcoxtimecv(Surv(time, status) ~ ., data = dat,
+   alphas = c(0.4, 0.5), lambdas = NULL, devtype = "vv",
+   lamfract = 0.6, refit = TRUE, nclusters = 4
+ )
Progress: Refitting with optimal lambdas...
```

In order to reduce the computation time, lamfract is set to 0.6 so that the algorithm does not compute solutions for λ values near zero (Simon et al., 2011). Setting lamfract in this way specifies that only a subset of the full sequence of 100 λ values is used.

Once cross-validation is performed, we can report the optimal λ (= 0.28) and α (= 0.4) and view the cross-validated error plot (Figure 1) and the regularization path plot (Figure 3).

```r
R> print(cv_fit1)

Call:
pcoxtimecv(formula = Surv(time, status) ~ ., data = dat, alphas = c(0.4,
    0.5), lambdas = NULL, lamfract = 0.6, devtype = "vv", refit = TRUE,
    nclusters = 4)

Optimal parameter values
lambda.min lambda.1se alpha.optimal
 0.2756254  0.6077901           0.4

R> cv_error1 <- plot(cv_fit1)
R> print(cv_error1)
R> solution_path1 <- plot(cv_fit1, type = "fit")
R> print(solution_path1)
```

Next, we fit the penalized model using the optimal α and λ and plot the estimated survival function.

We then plot the predicted survival function for each patient and the average survival function in Figure 4.

```r
R> surv_avg <- pcoxsurvfit(fit1)
R> surv_df <- with(surv_avg, data.frame(time, surv))
R> surv_ind <- pcoxsurvfit(fit1, newdata = dat)
R> splot_sorlie <- plot(surv_ind, lsize = 0.05, lcol = "grey") +
+   geom_line(data = surv_df, aes(x = time, y = surv, group = 1),
+     col = "red"
+   )
R> print(splot_sorlie)
```

Figure 1: Plots of the cross-validated error rates for a sequence of λ values at different α values — pcoxtimecv automatically chooses the optimal α. The left dotted line indicates the minimum error while the right indicates the largest value of λ such that the error is within one standard deviation. In this case, the optimal λ is obtained when α = 0.4 (left plot).

We now consider survival data with time-dependent covariates and repeat the analysis outlined above to demonstrate how to use pcoxtime in this context.

We consider the chronic granulomatous disease (cgd) data set from the R package survival (Therneau, 2020). It contains data on time to serious infection observed through the end of the study for 128 unique patients (because some patients were observed more than once, there are 203 observations in total).

We load the data and perform cross-validation as follows:

```r
R> data("cgd", package = "survival")
R> dat <- cgd
R> dim(dat)
[1] 203 16
R> cv_fit2 <- pcoxtimecv(Surv(tstart, tstop, status) ~ treat + sex +
+   ns(age, 3) + height + weight + inherit + steroids + propylac +
+   hos.cat, data = dat, alphas = c(0.2, 0.5, 0.8), lambdas = NULL,
+   devtype = "vv", lamfract = 0.6, refit = TRUE, nclusters = 4
+ )
Progress: Refitting with optimal lambdas...
```

Here, we choose α = {0.2, 0.5, 0.8} for 10-fold cross-validation.

```r
R> print(cv_fit2)

Call:
pcoxtimecv(formula = Surv(tstart, tstop, status) ~ treat + sex +
    ns(age, 3) + height + weight + inherit + steroids + propylac +
    hos.cat, data = dat, alphas = c(0.2, 0.5, 0.8), lambdas = NULL,
    lamfract = 0.6, devtype = "vv", refit = TRUE, nclusters = 4)

Optimal parameter values
 lambda.min lambda.1se alpha.optimal
 0.01477729  0.3834742           0.5
```

The optimal λ and α in this case are 0.015 and 0.5, respectively, and the cross-validation error plot is shown in Figure 2.

```r
R> cv_error2 <- plot(cv_fit2)
R> print(cv_error2)
```

We plot the solution path (Figure 3) and fit the penalized model based on the optimal λ and α:

```r
R> solution_path2 <- plot(cv_fit2, type = "fit")
R> print(solution_path2)
R> alp <- cv_fit2$alpha.optimal
R> lam <- cv_fit2$lambda.min
```

Again, we use the fit2 object to plot the predicted individual and average survival curves as shown in Figure 4.

```r
R> surv_avg <- pcoxsurvfit(fit2)
R> surv_df <- with(surv_avg, data.frame(time, surv))
R> surv_ind <- pcoxsurvfit(fit2, newdata = dat)
R> splot_cgd <- plot(surv_ind, lsize = 0.05, lcol = "grey") +
+   geom_line(data = surv_df, aes(x = time, y = surv, group = 1),
+     col = "red"
+   )
R> print(splot_cgd)
```

Figure 2: Plots of the cross-validated error rates for a sequence of λ values at different α values for the cgd data.

We provide a user-friendly wrapper, simtdc, for the permalgorithm function in the R package PermAlgo, which implements an extended permutational algorithm for simulating time-dependent covariates. The algorithm is described and validated in Sylvestre and Abrahamowicz (2008).

We simulated a data set for 100 individuals with a follow-up time of up to 10 years, with 100 time-dependent and 900 time-fixed covariates. The survival times, $T_i$ (independent of the covariates (Sylvestre and Abrahamowicz, 2008)), are generated as $T_i := \mathrm{Exponential}(100, 0.2)$ and the censoring times, $C_i$, are drawn from a uniform distribution. The resulting data set has $n = 378$ and $p = 1000$, with 63% of the subjects having experienced the event of interest at the end of the follow-up time.

Our proximal gradient descent algorithm, pcoxtime, took 10 times as long to compute the complete solution path as the combination gradient descent/Newton-Raphson method from penalized (Goeman, 2010) (357 seconds vs. 34 seconds). To make the algorithms comparable, we only performed lasso, since penalized does not currently fully implement the elastic net, and made penalized solve for the solution path generated by pcoxtime, with $\lambda_{\min}$ set to a fixed fraction of $\lambda_{\max}$. Our current implementation has a downside — although the likelihood and coefficient estimations are performed in C++ for a single λ, the solution paths are computed in R, which is likely to be slower than an implementation that is completely coded in C++. On the other hand, penalized performs most of its computation in C++. Table 1 summarizes specific strengths of our algorithm. In terms of prediction accuracy, as measured by Harrell's C-statistic (Therneau, 2020), the two packages are equivalent.

Figure 3: Plot of regularization path for the optimal α with the corresponding λ values. The values at the top of the plot give the number of nonzero coefficients (size of the model) at various λ values. The left plot is for the sorlie data set while the right plot is for the cgd data set.

In this section, we compare the capabilities of pcoxtime to some of the most general and widely used R packages for penalized CPH models — glmnet, fastcox and penalized. One of the important aspects in high-dimensional data is computational speed: coordinate descent based methods (glmnet and fastcox) are usually much faster than gradient descent based methods (pcoxtime and penalized), which are slower by design (Simon et al., 2011). Table 1 summarizes some important capabilities of the packages.


Figure 4: Predicted individual and average (red) survival curves for the sorlie (left) and cgd (right) data sets.

|  | pcoxtime | glmnet | fastcox | penalized |
|---|---|---|---|---|
| **Supported models** | | | | |
| Time-dependent covariates | yes | no | no | yes |
| Full elastic net | yes | yes | yes | no |
| **Post model predictions** | | | | |
| Survival and hazard functions | yes | no | no | yes |
| Model validation and calibration | yes | no | no | no |

Table 1: Capabilities of pcoxtime, glmnet, fastcox and penalized packages.

We have shown how the penalized CPH model can be extended to handle time-dependent covariates. To achieve this, we have employed the proximal gradient descent algorithm. This paper provides a general overview of the R package pcoxtime and serves as a good starting point to further explore the capabilities of the package. In future, we plan to improve the functionality of pcoxtime. In particular, we plan to move to an implementation using the coordinate descent algorithm, which has proven to be efficient in high-dimensional data.


Acknowledgments

We would like to thank Mr. Erik Drysdale and Mr. Brian Kiprop, who provided valuable feedback on earlier versions of the package.

References

Peter C Austin, Aurélien Latouche, and Jason P Fine. A review of the use of time-varying covariates in the Fine-Gray subdistribution hazard competing risk regression model. Statistics in Medicine, 39(2):103–113, 2020.

Jonathan Barzilai and Jonathan M Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.

Andreas Bender, Andreas Groll, and Fabian Scheipl. A generalized additive model approach to time-to-event analysis. Statistical Modelling, 18(3-4):299–321, 2018.

Norman E Breslow. Contribution to discussion of paper by Dr Cox. J. Roy. Statist. Soc., Ser. B, 34:216–217, 1972.

Ross Cagan and Pablo Meyer. Rethinking cancer: Current challenges and opportunities in cancer research, 2017.

Biyue Dai and Patrick Breheny. Cross validation approaches for penalized Cox regression. ArXiv Preprint ArXiv:1905.10432, 2019.

Jelle J Goeman. L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52(1):70–84, 2010.

Anders Gorst-Rasmussen and Thomas H Scheike. Coordinate descent methods for the penalized semiparametric additive hazards model. Journal of Statistical Software, 47(1):1–17, 2012.

Jiang Gui and Hongzhe Li. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics, 21(13):3001–3008, 2005.

Frank E Harrell Jr, Kerry L Lee, and Daniel B Mark. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15(4):361–387, 1996.

Axel Hochstein, Hyung-Il Ahn, Ying Tat Leung, and Matthew Denesuk. Survival analysis for HDLSS data with time dependent variables: Lessons from predictive maintenance at a mining service provider. In Proceedings of 2013 IEEE International Conference on Service Operations and Logistics, and Informatics, pages 372–381. IEEE, 2013.

Håvard Kvamme, Ørnulf Borgan, and Ida Scheel. Time-to-event prediction with neural networks and Cox regression. Journal of Machine Learning Research, 20(129):1–30, 2019.

Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

Mee Young Park and Trevor Hastie. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):659–677, 2007.

Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for Cox's proportional hazards model via coordinate descent. Journal of Statistical Software, 39(5):1, 2011.

Insuk Sohn, Jinseog Kim, Sin-Ho Jung, and Changyi Park. Gradient lasso for Cox proportional hazards model. Bioinformatics, 25(14):1775–1781, 2009.

T Sorlie and R Tibshirani. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA, 100:8418–8423, 2003.

Marie-Pierre Sylvestre and Michal Abrahamowicz. Comparison of algorithms to generate event times conditional on time-dependent covariates. Statistics in Medicine, 27(14):2618–2634, 2008.

Terry Therneau, Cindy Crowson, and Elizabeth Atkinson. Using time dependent covariates and time dependent coefficients in the Cox model. Survival Vignettes, 2017.

Terry M Therneau. A Package for Survival Analysis in R, 2020. URL https://CRAN.R-project.org/package=survival. R package version 3.2-7.

Ryan J Tibshirani et al. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.

Ying-Wooi Wan, John Nagorski, Genevera I Allen, Zhaohui Li, and Zhandong Liu. Identifying cancer biomarkers through a network regularized Cox model. Pages 36–39. IEEE, 2013.

Yi Yang and Hui Zou. A cocktail algorithm for solving the elastic net penalized Cox's regression in high dimensions. Statistics and its Interface, 6(2):167–173, 2013.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.