Analysis of survival data with non-proportional hazards: A comparison of propensity score weighted methods
Elizabeth A. Handorf*, Marc Smaldone, Sujana Movva, Nandita Mitra
September 3, 2020

*Correspondence to: Elizabeth Handorf, 333 Cottman Ave., Reimann 383, Philadelphia, PA 19111. [email protected]
Abstract
One of the most common ways researchers compare survival outcomes across treatments when confounding is present is using Cox regression. This model is limited by its underlying assumption of proportional hazards; in some cases, substantial violations may occur. Here we present and compare approaches which attempt to address this issue, including Cox models with time-varying hazard ratios; parametric accelerated failure time models; Kaplan-Meier curves; and pseudo-observations. To adjust for differences between treatment groups, we use Inverse Probability of Treatment Weighting based on the propensity score. We examine clinically meaningful outcome measures that can be computed and directly compared across each method, namely, survival probability at time T, median survival, and restricted mean survival. We conduct simulation studies under a range of scenarios, and determine the biases, coverages, and standard errors of the Average Treatment Effects for each method. We then apply these approaches to two published observational studies of survival after cancer treatment. The first examines chemotherapy in sarcoma, where survival is very similar initially, but after two years the chemotherapy group shows a benefit. The other study is a comparison of surgical techniques for kidney cancer, where survival differences are attenuated over time.

Introduction

Many health care studies use observational databases to compare censored survival times between different treatment or exposure groups. A common statistical approach used to model such data is the Cox proportional hazards regression model. However, the standard Cox model makes a strong assumption that the hazards are proportional between treatment groups of interest over the course of follow-up. This assumption may not hold in real-world studies. Hazards are proportional between two groups when the ratio of their hazard functions is constant over time.
[20] Although Cox regression is often used to model survival times, particularly when adjusting for covariates, there are a variety of models for censored outcomes which do not rely on the assumption of proportional hazards. Non-parametric methods, including the ubiquitous Kaplan-Meier method, by definition do not make assumptions about the functional form of the survival curves, and therefore accommodate non-proportionality. [19] An alternative non-parametric method is the Nelson-Aalen estimator, [11] and a recent method proposes using pseudo-observations, which can also be used without parametric assumptions. [2, 3, 1] However, standard implementations of non-parametric approaches do not readily incorporate covariates. This motivates use of regression-based parametric and semi-parametric approaches which accommodate non-proportionality. The Cox model can be modified to include time-dependent hazards, relaxing the assumption of proportionality. [11] Accelerated Failure Time (AFT) models are a different class of regression methods for censored data, which assume a linear model for log-time. They are often modeled fully parametrically, and for many choices of distribution for the error term, the models are non-proportional. Semi-parametric AFT methods are also available. [18] Proportional odds models can also be used in the setting of non-proportionality. [11]

Given the number of choices, a natural question is which method performs best in real-world conditions, particularly when analyzing observational data where confounding is an issue. In this paper, we focus on approaches which can use propensity score weighting to address confounding. The propensity score is commonly used in the analysis of clinical studies and has many advantages, as we discuss in detail in section 2. We then describe several methods for analysis of survival data where the proportional hazards assumption is violated.
We focus on meaningful estimands and compare the performance of parametric, semi-parametric, and non-parametric approaches, some of which are commonly used while others are less familiar. We focus on methods which 1) can be used with Inverse Probability of Treatment Weighting (IPTW), 2) identify the estimands of interest, and 3) are readily implemented using currently available software. We provide insight on the performance of these methods in scenarios often involved in clinical studies.

This work is motivated by two of our recently published studies in cancer in which the treatment effect exhibited non-proportional hazards. In our study of chemotherapy for soft-tissue sarcoma, there was a negligible survival benefit until about two years after the start of treatment. [25] In another study comparing two different surgical techniques in early-stage renal cancer, we found that one approach had a strong benefit immediately after surgery, but the benefit decreased over time. [27] The data for both of these studies came from an observational source, the National Cancer Database. [9] Therefore, both were likely subject to confounding by indication; that is, the choice of treatment is informed by factors that are also associated with survival.

The outline of our paper is as follows. In section 2, we provide a brief review of estimating the average treatment effect using propensity score weighting. In section 3, we present several methods to analyze survival outcomes under non-proportional hazards that utilize IPTW. In section 4, we conduct extensive simulation studies to compare the methods presented in section 3. In section 5, we use our motivating studies of soft-tissue sarcoma and renal cancer to demonstrate implementation and interpretation of the various methods described earlier. We conclude with a brief discussion and thoughts on next steps.
The propensity score, first proposed by Rosenbaum and Rubin, [29, 30] is defined as the probability of receiving treatment given a set of covariates,

e = P(Z = 1 | X)

where e is the propensity score, Z is a binary indicator for treatment, and X is a vector of covariates. The propensity score is useful for causal inference under the potential outcomes framework. Here, each subject has two potential outcomes under control and treatment conditions (Y_0, Y_1), one of which is unobserved. [32, 16] The estimand of interest is often the Average Treatment Effect (ATE), defined as

ATE = E(Y_1 − Y_0),

which is usefully interpreted as the expected value of the outcome if every subject is given the treatment, compared with the expected value if every subject receives the control. Strongly ignorable treatment assignment occurs under the conditions

(Y_0, Y_1) ⊥ Z | X and 0 < e < 1,

that is, when the potential outcomes are independent of treatment assignment conditional on the covariates, and each subject has a nonzero probability of receiving both the treatment and the control (positivity). When strong ignorability holds, the ATE based on the propensity score is identifiable from the observed data: [29, 31, 28]

E(Y_1 − Y_0) = E(Y | Z = 1, e) − E(Y | Z = 0, e).

Propensity score methods have a number of benefits over traditional regression adjustment. First, they provide an easy way to check for sufficient covariate overlap and balance between the treated and control groups. [4] Second, one can use flexible and modern approaches, such as ensemble models, [36] to estimate the propensity scores, which make the estimates less vulnerable to bias from an incorrect functional form. Propensity scores also allow for the use of doubly robust methods. [14] Third, propensity score weighting approaches provide marginal, not conditional, estimates, which are more interpretable and useful to clinicians and policymakers.
[6, 21, 7, 22] There are several ways to use propensity scores in analyses of survival data. Here we focus on weighting, specifically inverse probability of treatment weighting (IPTW). The idea underlying IPTW is to create synthetic samples where covariates are unrelated to treatment assignment. The weighted samples can then be summarized and compared directly. One nice property of using IPTW is that the estimand is the ATE. [4] To implement an IPTW analysis, one must first obtain the estimated propensity scores for each subject (ê_i). This can be accomplished via a number of approaches, including logistic regression, random forests, or ensemble methods. The IPTW for subject i is then defined as

ŵ_i = Z_i / ê_i + (1 − Z_i) / (1 − ê_i).

In subsequent analyses, these weights are used to estimate the expected effects under the treatment and control conditions.
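As a concrete illustration of the weight formula, here is a minimal Python/numpy sketch (the paper's own analyses use R; the function name `iptw_weights` and the toy values are ours):

```python
import numpy as np

def iptw_weights(z, e_hat):
    """ATE-targeting IPTW weights: w_i = Z_i/e_hat_i + (1 - Z_i)/(1 - e_hat_i).
    Treated subjects are weighted by 1/e_hat_i, controls by 1/(1 - e_hat_i)."""
    z = np.asarray(z, dtype=float)
    e_hat = np.asarray(e_hat, dtype=float)
    return z / e_hat + (1.0 - z) / (1.0 - e_hat)

# Toy example: two treated and two control subjects.
z = np.array([1, 1, 0, 0])
e_hat = np.array([0.8, 0.5, 0.5, 0.2])
w = iptw_weights(z, e_hat)  # approximately [1.25, 2.0, 2.0, 1.25]
```

Subjects whose treatment was unlikely given their covariates receive large weights, which is what makes covariates unrelated to treatment in the weighted sample (and also why extreme propensity scores inflate the variance).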
In the setting of survival outcomes with non-proportionality, we must first decide what our estimand will be. It should be readily interpretable by the intended audience, meaningful given the observed data features, and estimable using available software. If we fit a proportional hazards model to non-proportional data, we would obtain an estimate of a single HR, but this would be a misleading measure because it would essentially be a summary of the different hazard ratios at each observed failure time. Alternative hazard ratios have been proposed, including the average hazard ratio, which is interpreted as the average of the hazard ratios weighted by the number of patients at risk. [34]

The difficulties inherent in reporting hazard ratios in the setting of non-proportionality can be avoided by focusing on measures that can be defined directly from the survival functions for the treated and control arms, S_1(t) and S_0(t), respectively. The simplest approach compares differences in survival probabilities at a particular time t = T:

ΔS_T = S_1(t = T) − S_0(t = T).

Quantiles of the survival functions can also be compared. For instance, if the survival curves are defined for S(t) = 0.5, median survival times can be compared:

ΔT = (t : S_1(t) = 0.5) − (t : S_0(t) = 0.5).

Restricted Mean Survival (RMS) is defined for times T where S(t) is defined (i.e., mean survival up through the maximum observation time). [12] RMS can be found using the area under the survival curve, where the difference in RMS between the treated and control conditions is

ΔRMS_T = ∫_0^T S_1(t) dt − ∫_0^T S_0(t) dt.

Each of these measures is estimated directly from the survival curve, so they are always meaningful regardless of proportionality. However, not every method for the analysis of survival data explicitly estimates the survival curve, and not all methods can accommodate weighting. In the subsequent sections, we focus on methods which can 1) accommodate IPTW, 2) allow the estimation of the measures of interest (ΔS_T, ΔT, and ΔRMS_T), and 3) are readily implemented using published software.

Although the Cox proportional hazards model does rely on the assumption of proportionality, the model tends to be robust to small deviations from this assumption. However, if n is large, as is common in registry studies, small deviations from proportionality may be statistically significant when tested using standard diagnostic tests (e.g. Cox-Snell residuals, Schoenfeld residuals). [15] Therefore, the Cox model may still be useful in some cases where deviations from proportionality are demonstrated. The Cox model is defined as follows:

h_i(t | Z_i) = h_0(t) exp(βZ_i).

IPTW can be readily accommodated in the Cox model by weighting the contribution of each observation to the partial likelihood function:

L(β) = ∏_{i=1}^N [ exp(βZ_i) / Σ_{j ∈ R(t_i)} ŵ_j exp(βZ_j) ]^{ŵ_i}

where R(t_i) is the risk set at time t_i. The parameter β is the average of the log-hazard ratios at each failure time.
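The three contrasts ΔS_T, ΔT, and ΔRMS_T can be read off any pair of estimated survival curves, however those curves were obtained. A minimal numpy sketch, assuming the curves are right-continuous step functions given at their event times (the paper's analyses were done in R; the helper names here are ours):

```python
import numpy as np

def surv_at(times, surv, t):
    """Value S(t) of a right-continuous step survival curve; S = 1 before
    the first event time."""
    idx = np.searchsorted(times, t, side="right") - 1
    return 1.0 if idx < 0 else surv[idx]

def median_surv(times, surv):
    """First time the curve drops to 0.5 or below (NaN if it never does)."""
    below = np.nonzero(surv <= 0.5)[0]
    return times[below[0]] if below.size else np.nan

def rms(times, surv, tau):
    """Restricted mean survival: area under the step curve on [0, tau]."""
    grid = np.concatenate(([0.0], times[times <= tau], [tau]))
    vals = np.array([surv_at(times, surv, t) for t in grid[:-1]])
    return np.sum(np.diff(grid) * vals)

# Toy treated (1) and control (0) curves.
t1, s1 = np.array([1.0, 3.0]), np.array([0.8, 0.4])
t0, s0 = np.array([1.0, 2.0]), np.array([0.6, 0.3])
delta_S2 = surv_at(t1, s1, 2.0) - surv_at(t0, s0, 2.0)  # 0.8 - 0.3 = 0.5
delta_med = median_surv(t1, s1) - median_surv(t0, s0)   # 3.0 - 2.0 = 1.0
delta_rms3 = rms(t1, s1, 3.0) - rms(t0, s0, 3.0)        # 2.6 - 1.9 = 0.7
```

This separation between curve estimation and estimand computation is what lets the same three measures be compared across Cox, AFT, Kaplan-Meier, and pseudo-observation methods.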
Importantly, one can find Ŝ(t) based on the estimates ĥ_0(t) and β̂.

We can also extend the standard Cox model, relaxing the assumption of proportionality, by allowing the hazard ratio to vary as a function of time:

h_i(t | Z_i) = h_0(t) exp(βZ_i + f(Z_i, t)).

In our simulations, we consider two particular cases: 1) allowing the hazard to vary by log-time,

f(Z_i, t) = κZ_i log(t),

and 2) using a piecewise constant treatment effect,

f(Z_i, t) = 0 for 0 ≤ t < C_1; κ_1 Z_i for C_1 ≤ t < C_2; κ_2 Z_i for C_2 ≤ t.

These models can all be implemented with the widely-used survival package in R. [35]
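Fitting the piecewise-constant model typically requires restructuring the data into counting-process (start, stop, event) episodes split at the cutpoints, which is what survival::survSplit does in R. A simplified Python sketch of that splitting step for a single subject (the function name and data layout are ours, and covariates are omitted):

```python
def split_episodes(time, event, cuts):
    """Split one subject's follow-up (0, time] at the given cutpoints,
    returning (start, stop, event) episodes in counting-process form.
    Analogous in spirit to survival::survSplit in R."""
    bounds = [c for c in sorted(cuts) if c < time]
    starts = [0.0] + bounds
    stops = bounds + [time]
    # The event indicator can only be 1 on the final episode.
    return [(a, b, event if b == time else 0) for a, b in zip(starts, stops)]

# Follow-up of 7 years ending in an event, split at 2 and 5 years:
episodes = split_episodes(7.0, 1, cuts=[2.0, 5.0])
# -> [(0.0, 2.0, 0), (2.0, 5.0, 0), (5.0, 7.0, 1)]
```

Once the data are in this form, a separate treatment coefficient per interval (the κ_1, κ_2 above) can be fit with any Cox implementation that accepts start-stop data, with the IPTW weights carried along on each episode.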
Accelerated Failure Time (AFT) models assume a linear model for log(T): [11, 18, 38]

log(T_i) = μ + βZ_i + σε_i.

Here, we consider parametric models, where the error term ε is assumed to take on a given distribution. We have many choices for the distribution of ε. A common choice is the Gumbel distribution, which yields a Weibull hazard model; however, the Weibull model is also a proportional hazards model. We can allow for non-proportionality by allowing the hazard ratio to change over time:

h_i(t | Z) = exp[−(βZ_i + f(Z_i, t))γ] λγt^{γ−1},  (1)

where λ is the scale and γ is the shape parameter. If we let f(Z_i, t) = κZ_i log(t), then (1) can be re-written as

log(T_i) = (1 / (γ + κZ_i)) (−log(λ) + βZ_i + ε_i), ε_i ~ Gumbel.

Therefore, if we include an additional parameter in the Weibull model, where the shape can vary by treatment, the model specification allows the hazard ratio to vary by log-time.

Another flexible alternative is the generalized gamma model, which is also a three-parameter model. [23] The distribution of the errors has an additional parameter Q:

log(T_i) = μ + βZ_i + σε_i, with error density f(ε_i = v) = (|Q| / Γ(1/Q²)) (exp(Qv)/Q²)^{1/Q²} exp(−exp(Qv)/Q²).

This model encompasses other distributions as special cases; when Q = 1 we have a Weibull AFT model, and if Q = 0 it results in a log-normal model.

For any AFT model, the weights ŵ_i can be applied during estimation of parameters. Here, we focus on the 3-parameter Weibull and generalized gamma AFT models, but a wide range of AFT models can be fit in R using the flexsurv package. [17]

Another strategy is to avoid parametric assumptions entirely, and use Kaplan-Meier curves with IPTW to directly estimate the survival functions.
[10, 39]

Ŝ(t) = ∏_{t_j ≤ t} (1 − d̂_j^w / n̂_j^w)

where d̂_j^w = Σ_{i: T_i = t_j} ŵ_i δ_i is the weighted number of events and n̂_j^w = Σ_{i: T_i ≥ t_j} ŵ_i is the weighted number of people at risk at time t_j. Weighted curves are easily estimated in R via the survival package.

Another non-parametric method which can be used to estimate the survival curve is the recently-developed pseudo-observations approach. [2, 3, 1] This framework uses a missing data approach to account for censoring. If no missing data or confounders are present, the ATE can be found using a simple average over the (possibly transformed) outcomes for each subject:

θ = E(f(Y)) = (1/n) Σ_i f(Y_i).

But if f(Y_i) is unknown for some subjects, we cannot calculate E(f(Y)) directly. Instead, we can define the pseudo-observation for subject i as

θ̂_i = n θ̂ − (n − 1) θ̂_{−i}

where θ̂ is a consistent estimate for θ, and θ̂_{−i} is the estimator applied to all observations excluding subject i. The pseudo-observations are then used in place of all observations, not only the ones which are missing due to censoring.

In a survival context, our outcome of interest, θ, is the probability of survival at time t, and θ_i is a binary indicator for the status (alive/dead) of subject i at time t. We can use a non-parametric method, such as Kaplan-Meier, to estimate S_Z(t), which is then used to find θ_i for all observations in arm Z:

θ̂_{Zi} = n_Z Ŝ_Z(t) − (n_Z − 1) Ŝ_Z(t)^{−i}.

Heuristically, n_Z Ŝ_Z(t) is the expected number of patients alive at time t, and (n_Z − 1) Ŝ_Z(t)^{−i} is the expected number of patients other than patient i alive at time t. θ̂_{Zi} is used in place of (possibly unobserved) survival status at time t. Therefore, the probability of being alive at time t in arm Z is

P̂(Y | Z, t) = (1/n_Z) Σ_i θ̂_{Zi}.

Incorporating propensity-score weights to account for confounding is straightforward:

P̂(Y | Z, t) = (1/n_Z) Σ_i ŵ_i θ̂_{Zi}.

There are several benefits of using the pseudo-observations approach. It has been demonstrated to provide a valid estimate of the ATE in a causal framework, and requires weaker assumptions than regression-based models. It also supports estimation of many outcomes of interest. We can estimate P̂(Y | Z, t) at each failure time, defining the survival curve, which in turn allows us to estimate the measures of interest described in section 3.1. Pseudo-observations can also be used to estimate RMS. This method can be implemented using the R package pseudo. [26]
To conduct inference for the outcomes of interest, we need to quantify uncertainty. Some methods have closed-form variance estimators for our estimands of interest, but others do not. This motivates the use of a non-parametric bootstrap to estimate variances and find confidence intervals. [13] Bootstrap-based estimates are available for every combination of model and outcome we discuss above. Further, to correctly account for the variance of ê, the estimated propensity scores, we can re-estimate the propensity score within each bootstrap iteration. [5]

We tested the performance of the methods described above using simulation studies. We assessed the bias, coverage, and variance of each method's estimates of our outcomes of interest: median survival, restricted mean survival, and survival probability at time T. We based our covariate and outcome distributions, and effect sizes, on our NCDB cancer studies to mimic real-world scenarios.

First, we simulated a set of covariates based on the observed multivariate distribution of the NCDB covariates in the renal dataset. We used the R package genOrd [8] to draw variables representing gender, age, stage, histology, tumor grade, Charlson comorbidity score, race, ethnicity, insurance status, facility type, income, and education. As our goal here is to evaluate the survival analysis methods, and not the propensity score estimation methods, we held this set of covariates constant for each simulation; we employ simulated covariates instead of the actual covariates so that our data and simulation code can be shared. (Available online at https://github.com/BethHandorf/NonPH_IPTW)

We used a simple logistic model with linear covariate effects to determine each subject's probability of being assigned to the treatment condition (versus control). The effect size (log-scale) for each covariate ranged from 0.8 (effect of high vs low stage) to 0.03 (for each additional year of age).
This corresponds roughly to the effect sizes estimated when modeling actual treatment allocation in the NCDB sample. This model defined the probability that each individual would be allocated to the treatment condition. We set the intercept such that, on average, the population's probability of receiving the treatment was 0.5.

Survival times were drawn based on a Weibull model with a time-varying treatment effect:

h(t) = γλt^{γ−1} exp(β_1 Z + f(Z, t) + Xψ).  (2)

In our base case, the hazard ratio varied by log-time, f(Z, t) = β_2 Z log(t); we set β_1 = −0.69 and β_2 = 0.25, with a sample size of 5000 total cases. For each individual, survival times under Z = 1 and Z = 0 were both drawn using the R package simsurv. [33] These true potential outcomes were used to calculate the true ATE for the estimands of interest. Observed treatment status was drawn from the binomial based on each individual's probability of receiving the treatment, as defined by our propensity score model.

For each set of simulated outcomes, we first estimated the propensity scores using a logistic regression model, and then calculated respective IPTWs for each individual. Using these estimated weights, we fit each of the models described above in Section 3 using standard R packages, including flexsurv for parametric survival models and pseudo for pseudo-observations methods.

We repeated this simulation framework with various changes to Equation 2, for a total of five scenarios:

1. Base case: β_1 = −0.69; f(Z, t) = 0.25 Z log(t)
2. Piecewise constant (PWC) hazard ratio: β_1 = 0; f(Z, t) = −κ Z I(t ≥ C), where I(·) is the indicator function
3. Modest non-proportionality: β_1 = −0.69; f(Z, t) = κ Z log(t), with κ smaller than the base-case value of 0.25
4. Modest treatment effect: β_1 = −0.41; f(Z, t) = κ Z log(t)
5. Base case, with a smaller sample size (N = 500 per arm)

Scenario 1 (base case) had a large treatment effect with substantial non-proportionality. Scenarios 2-4 were chosen to approximate the parameters and non-proportionality found in the clinical examples, and scenario 5 was used to assess how the methods perform with a more modest sample size.

For each method, we assessed performance based on the mean bias, standard error, and coverage for the ATE for treatment versus the control. For the base case, the bias was often largest for the standard Cox model, as expected given the substantial degree of non-proportionality in the simulated data. Bias in the Cox model tended to be larger (relative to that of other methods) for the point estimates of survival. The generalized gamma AFT model also had large bias relative to other methods; at 10 years, its bias was actually greater than that of the Cox model. Even though this three-parameter model is flexible and allows for non-proportionality, fitting this mis-specified parametric model still introduced substantial biases into the results, comparable in size to those found using the standard Cox model. In the base case, the Cox model with a treatment effect varying by log-time was correctly specified, but to our surprise, it was often outperformed by the piecewise constant time-varying Cox model. The fully parametric AFT model with variable location and scale parameters was also correctly specified, and it had comparable bias to that of the time-varying Cox models. The non-parametric Kaplan-Meier and pseudo-observation methods generally had the lowest bias. Finally, we note that the RMS was less sensitive to the choice of method than the other outcomes of interest; the relative differences in the sizes of the biases were smallest for this survival outcome. (See Figure 1.)

As shown in Table 1, coverage was worst for the standard Cox model, followed by the generalized gamma AFT model.
The time-varying Cox models did better, although coverage was surprisingly low for the log-time model at the 2-year outcome.

Table 1: Coverage for base case

Method       2y    5y    10y   Median  RMS
Cox          0.00  0.07  0.61  0.81    0.71
CTV LT       0.82  0.91  0.93  0.94    0.93
CTV PWC      0.96  0.96  0.95  0.95    0.93
AFT GG       0.00  0.37  0.44  0.94    0.90
AFT WBL LS   0.89  0.93  0.93  0.95    0.93
Pseudo       0.97  0.97  0.96  0.95    0.94
Wtd KM       0.97  0.96  0.96  0.95    0.94
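Performance figures like those in Table 1 come from checking, across simulation replicates, how often each method's bootstrap interval contains the true ATE. A generic helper, with a hypothetical data layout (one estimate and interval per replicate):

```python
import numpy as np

def sim_performance(estimates, lcl, ucl, truth):
    """Bias, empirical SE, and CI coverage across simulation replicates."""
    estimates, lcl, ucl = map(np.asarray, (estimates, lcl, ucl))
    bias = estimates.mean() - truth            # mean bias of the estimator
    se = estimates.std(ddof=1)                 # empirical standard error
    coverage = np.mean((lcl <= truth) & (truth <= ucl))
    return bias, se, coverage

# Toy check with three replicates and true ATE = 0.10.
bias, se, cov = sim_performance(
    estimates=[0.12, 0.08, 0.10],
    lcl=[0.02, -0.02, 0.05],
    ucl=[0.22, 0.18, 0.15],
    truth=0.10,
)
# All three intervals contain 0.10, so coverage is 1.0.
```

With a nominal 95% interval, coverage well below 0.95 (such as the 0.00 entries for the Cox and generalized gamma models at 2 years) signals that the interval is centered on a biased estimate.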
The computational time and resources required to fit each model varied substantially. Depending on the size of the dataset, computational considerations may become important, especially given our use of the bootstrap to compute confidence intervals. The fastest method was weighted KM. To fit a Cox model with a time-varying effect of log-time, we found it infeasibly slow to split the dataset at each observed failure time given our large sample size. Instead, we limited the splits to occur at one-month intervals; we tested this approach and found that the results were nearly identical to those obtained by splitting at each failure time. The pseudo-observations approach was also relatively slow, and highly memory intensive, even when we estimated S(t) at one-month intervals. Table 2 shows the time required to fit each model once, and the total time required to calculate 500 bootstrap samples.

[Figure 1: Bias of each method (Cox, CTV LT, CTV PWC, AFT GG, AFT WBL LS, Pseudo, Wtd KM) for each outcome (2, 5, and 10 years; median; RMS) by scenario (Base, PWC, Modest NPH, Modest TE, Small SS)]
CTV=Cox Time-Varying; LT=Log-Time; PWC=Piece-Wise Constant; AFT=Accelerated Failure Time; GG=Generalized Gamma; WBL LS=Weibull Location-Scale; Pseudo=Pseudo-Observations; Wtd KM=Weighted Kaplan-Meier; NPH=Non-Proportional Hazards; TE=Treatment Effect; SS=Sample Size
Table 2: Performance comparison: time required for N=5,000

Method          Per bootstrap iteration (sec)  Total time (min)
Cox             0.98                           8.21
CTV LT*         7.60                           63.44
CTV PWC         0.41                           3.40
AFT GG          13.70                          114.39
AFT Weibull LS  5.26                           43.90
Pseudo**        2.17                           18.14
Wtd KM          0.14                           1.18

*28 times higher if split at each failure
**60 times higher if estimated at each failure

Comparative effectiveness of Cancer Therapies
We applied the previously described methods to estimate the effect of treatments on survival in two clinical studies using the National Cancer Database (NCDB). The NCDB is a large registry which encompasses approximately 70% of newly diagnosed cancer cases in the United States. It contains variables describing patient and tumor characteristics, and the treatments they receive. [9] Here, we describe two studies we conducted using the NCDB that featured two different forms of non-proportional hazards. The first is a moderately sized study where two treatment groups had similar survival initially, but differences in long-term outcomes. The second example is a much larger study where the benefits of one treatment were attenuated over time.
Soft-tissue sarcoma is a rare cancer of the connective tissue (e.g. fat, muscle, blood vessel), usually arising in the extremities. Stage III disease (high grade, large, or deep tumor) is typically treated surgically or with radiation therapy. It remains uncertain whether chemotherapy provides an overall survival benefit for these patients, and its use is optional according to national guidelines. [37] We studied this question using the NCDB, creating a cohort of Stage III sarcoma patients (with various histologies) treated with definitive surgical resection of the primary tumor, with or without chemotherapy pre/post-operatively. We identified 5,337 cases with recorded overall survival data, of whom 28% were treated with chemotherapy. [25]

Due to the non-randomized nature of the data, there were substantial differences between treatment groups. For example, patients treated with chemotherapy were younger and had fewer recorded co-morbid conditions. Furthermore, the hazards for the treatment groups exhibited substantial non-proportionality, as is clearly evident upon examining the complementary log-log survival curves (see Figure 2). Testing the Schoenfeld residuals provides further evidence of non-proportionality (P=0.047). We used IPTW to account for covariate imbalance between the treatment groups. The propensity score was estimated using all available covariates (including age, gender, insurance status, income, comorbidity score, tumor histology, grade, size, anatomic site, treating facility type, and travel distance). The IPTW Kaplan-Meier curves are shown in Figure 2. In our original analysis, we also fit a time-varying Cox proportional hazards model, assuming a piecewise constant hazard ratio, which could change at two years. We found that there was little effect of chemotherapy during the first two years after diagnosis, but that patients treated with chemotherapy did better in the long term.
These results motivated simulation scenario 2 (PWC).

Here, we apply each of the methods discussed in the sections above to the sarcoma data. Results for RMS, 2-year survival, and 5-year survival are shown in Table 3. Median and 10-year survival were not estimable from the observed data, so we excluded these outcomes. We found that the different methods yielded different effect size estimates; however, these differences were often modest and in some cases would not change statistical inferences. For example, ΔRMS varied from a minimum of 0.409 (AFT with a generalized gamma distribution) to a maximum of 0.468, but all models showed an improvement in RMS for patients treated with chemotherapy. In other cases the inferences would differ based on the model chosen. At 2 years, the standard Cox model showed a difference of 3.6% between the treated
[Figure 2: Unadjusted Kaplan-Meier, complementary log-log, and IPTW Kaplan-Meier survival curves (months), no chemo vs chemo]
Table 3: Chemotherapy in sarcoma: results from each model
            RMS            2 years        5 years
Method      ∆   LCL  UCL   ∆   LCL  UCL   ∆   LCL  UCL
*Centered weights
∆ = Difference between ATEs for chemo vs no chemo; boldface denotes significance
LCL = Lower confidence limit; UCL = Upper confidence limit

and control arms, which was statistically significant. The results of both AFT models also found significantly higher survival in the chemotherapy group. However, the weighted Kaplan-Meier method showed a non-significant difference of -0.7%. Both time-varying Cox models and the pseudo-observations method also produced non-significant differences.

We would like to note that in this data analysis example, we found that the pseudo-observations method was sensitive to deviations in the mean of the estimated weights from the expected value of 1. This can be explained by both how the weights are applied and how the survival function is estimated. Each patient's (binary) contribution to the survival function is weighted, and the weighted indicators are then averaged. Therefore, if the average weight is not equal to 1, the survival function will not start at 1 at baseline. To address this issue in the estimation of the survival function when using pseudo-observations, we centered the weights so their mean was exactly 1. However, we note that this was done heuristically, and not theoretically driven. Further exploration of this issue is warranted.

The second motivating clinical example for this work was a study of surgical options for early-stage renal cancer. These small tumors can be treated with Radical Nephrectomy (RN) or Partial Nephrectomy (PN). PN preserves more kidney tissue, but surgery/recovery is more difficult and it is more likely to be given to healthier patients. [24, 27] Here, we focus on a subgroup from the original study: patients aged 51-60 with T1a tumors, which gives a sample size of 28,973 patients (61.1% treated with PN). In unadjusted analysis, patients given PN had improved survival over those given RN. However, these results are subject to confounding by indication.
Examination of the complementary log-log plots shows a notable deviation from proportionality, which is confirmed by testing the Schoenfeld residuals (P < 0.001). Fitting a Cox model with a treatment effect varying by log-time,

h(t) = λ_0(t) exp(β_1 Z + β_2 Z log(t)),

the maximum likelihood estimate for the treatment effect is β̂_1 = −0.468, with a positive estimate β̂_2 for the interaction with log-time, consistent with a benefit that attenuates over follow-up.
[Figure 3: Unadjusted Kaplan-Meier, complementary log-log, and IPTW Kaplan-Meier survival curves (months), radical vs partial nephrectomy]
Table 4: Effect of PN vs RN in renal cancer: Results from each model
            RMS            2 years        5 years
Method      ∆   LCL  UCL   ∆   LCL  UCL   ∆   LCL  UCL
*Centered weights
∆ = Difference between ATEs for PN vs RN; boldface denotes significance
LCL = Lower confidence limit; UCL = Upper confidence limit

Discussion
When analyzing survival outcomes, we often see non-proportional hazards. Based on our findings, it seems that non-proportionality is readily addressable in practice with weighted Kaplan-Meier curves, the simplest method we assessed. It performed quite well across a variety of scenarios and outcome measures of interest. The IPTW Kaplan-Meier method does not require specialized software or methods, and it also had the best computational performance. There is a penalty in terms of efficiency, but we found the increase in the size of the standard errors (compared to parametric methods) to be small in practice. We believe that the larger standard errors are a minor drawback when compared to the benefits of fewer assumptions and reduced bias.

In this work, we used simple IPTW weights. Alternative propensity score weights have also been proposed; [6] notably, variance-stabilized weights have good properties, especially when one treatment is given to a small proportion of the patients. In our simulation studies, we used a simple functional form to model the covariates' relationship with the probability of receiving treatment. Therefore, the simple logistic regression model used to estimate the propensity scores was correctly specified. In practice, there may be much more complex covariate effects, motivating more flexible procedures to estimate the propensity scores, such as ensemble machine learning methods. [36] Such methods have been shown to improve prediction, but are often more resource-intensive to implement.

Our simulation studies were directly informed by the two cancer studies discussed in Section 5. We sought to better understand how choice of method may affect results of these real-world analyses, and to learn broader lessons about how best to accommodate non-proportionality in large, observational studies. Although both clinical examples had large sample sizes, we found in simulations that the results held up even when sample sizes were small.
Taken together, our simulationresults and clinical examples show that one can easily protect against incorrectinferences using IPTW Kaplan-Meier curves to estimate treatment effects.
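To make the simplicity of the approach concrete, the following sketch (illustrative only, not the code used in our analyses) implements the two ingredients in plain Python: IPTW weights computed from propensity scores, and the weighted product-limit (Kaplan-Meier) estimator. The propensity scores `ps` are assumed to have been estimated beforehand, e.g. by logistic regression of treatment on covariates; all function names here are our own.

```python
def iptw_weights(treated, ps, stabilized=False):
    """Inverse probability of treatment weights from propensity scores.

    treated: 0/1 treatment indicators; ps: estimated propensity scores.
    Stabilized weights multiply by the marginal treatment probability,
    as recommended when treatment is rare (cf. reference [6]).
    """
    p_treat = sum(treated) / len(treated)  # marginal probability of treatment
    w = []
    for z, e in zip(treated, ps):
        base = 1.0 / e if z else 1.0 / (1.0 - e)
        if stabilized:
            base *= p_treat if z else (1.0 - p_treat)
        w.append(base)
    return w


def weighted_km(times, events, weights):
    """Weighted Kaplan-Meier estimate of the survival function.

    Returns (distinct event times, survival probabilities), using
    weighted counts of events and at-risk subjects in the
    product-limit formula at each distinct event time.
    """
    data = sorted(zip(times, events, weights))
    surv, out_t, out_s = 1.0, [], []
    i, n = 0, len(data)
    while i < n:
        t = data[i][0]
        # weighted number at risk just before time t
        at_risk = sum(w for (ti, _, w) in data if ti >= t)
        d = 0.0
        while i < n and data[i][0] == t:
            if data[i][1]:  # event observed (not censored)
                d += data[i][2]
            i += 1
        if d > 0:
            surv *= 1.0 - d / at_risk
            out_t.append(t)
            out_s.append(surv)
    return out_t, out_s
```

Applying `weighted_km` separately within each treatment group, with weights from `iptw_weights`, yields the adjusted survival curves from which survival probability at time T, median survival, and restricted mean survival can be read off or integrated.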
Acknowledgement
Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Numbers P30CA006927 and U54CA221705. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The data used in the study are derived from a de-identified NCDB file. The American College of Surgeons and the Commission on Cancer have not verified and are not responsible for the analytic or statistical methodology employed, or the conclusions drawn from these data by the investigator.
References

[1] Per K. Andersen, Elisavet Syriopoulou, and Erik T. Parner. Causal inference in survival analysis using pseudo-observations. Statistics in Medicine, 36(17):2669–2681, 2017.
[2] Per Kragh Andersen, Mette Gerster Hansen, and John P. Klein. Regression Analysis of Restricted Mean Survival Time Based on Pseudo-Observations. Lifetime Data Analysis, 10(4):335–350, December 2004.
[3] Per Kragh Andersen and Maja Pohar Perme. Pseudo-observations in survival analysis. Statistical Methods in Medical Research, 19(1):71–99, February 2010.
[4] Peter C. Austin. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behavioral Research, 46(3):399–424, May 2011.
[5] Peter C. Austin. Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis. Statistics in Medicine, 35(30):5642–5655, 2016.
[6] Peter C. Austin and Elizabeth A. Stuart. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in Medicine, 34(28):3661–3679, 2015.
[7] Peter C. Austin and Elizabeth A. Stuart. The performance of inverse probability of treatment weighting and full matching on the propensity score in the presence of model misspecification when estimating the effect of treatment on survival outcomes. Statistical Methods in Medical Research, 26(4):1654–1670, August 2017.
[8] Alessandro Barbiero and Pier Alda Ferrari. GenOrd: Simulation of Discrete Random Variables with Given Correlation Matrix and Marginal Distributions, 2015.
[9] Daniel J. Boffa, Joshua E. Rosen, Katherine Mallin, Ashley Loomis, Greer Gay, Bryan Palis, Kathleen Thoburn, Donna Gress, Daniel P. McKellar, Lawrence N. Shulman, Matthew A. Facktor, and David P. Winchester. Using the National Cancer Database for Outcomes Research: A Review. JAMA Oncology, 3(12):1722–1728, December 2017.
[10] Stephen R. Cole and Miguel A. Hernán. Adjusted survival curves with inverse probability weights. Computer Methods and Programs in Biomedicine, 75(1):45–49, July 2004.
[11] David Collett. Modelling Survival Data in Medical Research. Chapman and Hall/CRC, 2015.
[12] Sarah C. Conner, Lisa M. Sullivan, Emelia J. Benjamin, Michael P. LaValley, Sandro Galea, and Ludovic Trinquart. Adjusted restricted mean survival times in observational studies. Statistics in Medicine, 38(20):3832–3860, 2019.
[13] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall/CRC, 1993.
[14] Michele Jonsson Funk, Daniel Westreich, Chris Wiesen, Til Stürmer, M. Alan Brookhart, and Marie Davidian. Doubly Robust Estimation of Causal Effects. American Journal of Epidemiology, 173(7):761–767, April 2011.
[15] Patricia M. Grambsch and Terry M. Therneau. Proportional Hazards Tests and Diagnostics Based on Weighted Residuals. Biometrika, 81(3):515–526, 1994.
[16] Paul W. Holland. Statistics and Causal Inference. Journal of the American Statistical Association.
[17] Journal of Statistical Software, 70(8):1–33, 2016.
[18] Zhezhen Jin, D. Y. Lin, and Zhiliang Ying. On Least-Squares Regression with Censored Data. Biometrika, 93(1):147–161, 2006.
[19] E. L. Kaplan and Paul Meier. Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association, 53(282):457–481, June 1958.
[20] J. P. Klein and M. L. Moeschberger. Survival Analysis: Techniques for Censored and Truncated Data. Springer, second edition, 2003.
[21] Jared K. Lunceford and Marie Davidian. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19):2937–2960, 2004.
[22] Huzhang Mao, Liang Li, Wei Yang, and Yu Shen. On the propensity score weighting analysis with survival outcome: Estimands, estimation, and inference. Statistics in Medicine, 37(26):3745–3763, 2018.
[23] Albert W. Marshall and Ingram Olkin. A new method for adding a parameter to a family of distributions with application to the exponential and Weibull families. Biometrika, 84(3):641–652, September 1997.
[24] Robert J. Motzer. NCCN Clinical Practice Guidelines in Oncology (NCCN Guidelines): Kidney Cancer, version 1.2021, July 2020.
[25] Sujana Movva, Margaret von Mehren, Eric A. Ross, and Elizabeth Handorf. Patterns of Chemotherapy Administration in High-Risk Soft Tissue Sarcoma and Impact on Overall Survival. Journal of the National Comprehensive Cancer Network, 13(11):1366–1374, November 2015.
[26] Maja Pohar Perme and Mette Gerster. pseudo: Computes Pseudo-Observations for Modeling, 2017.
[27] Benjamin T. Ristau, Elizabeth A. Handorf, David B. Cahn, Alexander Kutikov, Robert G. Uzzo, and Marc C. Smaldone. Partial nephrectomy is not associated with an overall survival advantage over radical nephrectomy in elderly patients with stage Ib–II renal masses: An analysis of the National Cancer Data Base. Cancer, 124(19):3839–3848, 2018.
[28] P. R. Rosenbaum. Observational Studies. Springer, second edition, 2002.
[29] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
[30] P. R. Rosenbaum and D. B. Rubin. Bias in Observational Studies Using Subclassification on the Propensity Score. Journal of the American Statistical Association, 79:516–524, 1984.
[31] Paul R. Rosenbaum. From Association to Causation in Observational Studies: The Role of Tests of Strongly Ignorable Treatment Assignment. Journal of the American Statistical Association, 79(385):41–48, 1984.
[32] Donald B. Rubin. Causal Inference Using Potential Outcomes. Journal of the American Statistical Association, 100(469):322–331, March 2005.
[33] Sam Brilleman. simsurv: Simulate Survival Data, 2019.
[34] Michael Schemper, Samo Wakounig, and Georg Heinze. The estimation of average hazard ratios by weighted Cox regression. Statistics in Medicine, 28(19):2473–2489, 2009.
[35] Terry M. Therneau. A Package for Survival Analysis in R, 2020.
[36] Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard. Super Learner. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007.
[37] Margaret von Mehren. NCCN Clinical Practice Guidelines in Oncology (NCCN Guidelines): Soft Tissue Sarcoma, version 2.2020, May 2020.
[38] L. J. Wei. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Statistics in Medicine, 11(14-15):1871–1879, 1992.
[39] Jun Xie and Chaofeng Liu. Adjusted Kaplan–Meier estimator and log-rank test with inverse probability of treatment weighting for survival data. Statistics in Medicine, 24(20):3089–3110, 2005.
Appendices
[Appendix figure: standard errors (SE) of the estimated treatment effects, shown by outcome measure (survival probability at selected years, median survival, restricted mean survival), method (Cox, CTV LT, CTV PWC, AFT GG, AFT WBL, LS Pseudo, Wtd KM), and scenario (Base, PWC, Modest NPH, Modest TE, Small SS).]
CTV = Cox Time-Varying; LT = Log-Time; PWC = Piece-Wise Constant; AFT = Accelerated Failure Time; GG = Generalized Gamma; WBL LS = Weibull Location-Scale; Pseudo = Pseudo-Observations; Wtd KM = Weighted Kaplan-Meier; NPH = Non-Proportional Hazards; TE = Treatment Effect; SS = Sample Size