Survival analysis under non-proportional hazards: investigating non-inferiority or equivalence in time-to-event data
Kathrin Möllenhoff
Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands and University of Cologne, Faculty of Medicine and University Hospital, Cologne, Germany
Achim Tresch
University of Cologne, Faculty of Medicine and University Hospital, Cologne, Germany and CECAD, University of Cologne, Cologne, Germany and Center for Data and Simulation Science, University of Cologne, Germany
September 16, 2020
Abstract
The classical approach to analyze time-to-event data, e.g. in clinical trials, is to fit Kaplan-Meier curves yielding the treatment effect as the hazard ratio between treatment groups. Afterwards, commonly a log-rank test is performed in order to investigate whether there is a difference in survival or, depending on additional covariates, a Cox proportional hazards model is used. However, in numerous trials these approaches fail due to the presence of non-proportional hazards, resulting in difficulties of interpreting the hazard ratio and a loss of power. When considering equivalence or non-inferiority trials, the commonly performed log-rank based tests are similarly affected by a violation of this assumption. Here we propose a parametric framework to assess equivalence or non-inferiority for survival data. We derive pointwise confidence bands for both the hazard ratio and the difference of the survival curves. Further, we propose a test procedure addressing non-inferiority and equivalence by directly comparing the survival functions at certain time points or over an entire range of time. We demonstrate the validity of the methods by a clinical trial example and by numerous simulation results.

Keywords and Phrases: equivalence, non-inferiority, survival analysis, time-to-event data, non-proportional hazards, confidence band, hazard ratio, log-rank test

1 Introduction

Time-to-event outcomes are frequently observed in medical research, for instance in the area of oncology or cardiovascular diseases. A commonly addressed issue is the comparison of a test to a reference treatment with regard to survival. For this purpose an analysis based on Kaplan-Meier curves (see Kaplan and Meier (1958)), followed by a log-rank test (see, for example, Kalbfleisch and Prentice (2011)) is still the most popular approach.
Additionally, adjusting for multiple covariates, Cox's proportional hazards model (Cox (1972)) has been extensively used in the last decades (for some examples see Cox and Oakes (1984); Klein and Moeschberger (2006), among many others). In case of addressing non-inferiority or equivalence, extensions of the log-rank test investigating the vertical distance between the survival curves have been proposed by Wellek (1993) and Com-Nougue et al. (1993). These approaches owe much of their popularity to the fact that they do not rely on assumptions on the distribution of event times. Moreover, a direct interpretation is obtained by summarizing the treatment effect in one single parameter, given by the hazard ratio of the two treatments, assumed to be constant over time. However, this assumption has been heavily criticized (see for example Hernán (2010); Uno et al. (2014)), and it is in practice rarely assessed or even obviously violated (see Li et al. (2015) and Jachno et al. (2019) for an overview on the awareness of this assumption). In particular, if short- and long-term benefits of the two treatments differ, for instance in situations where a surgical treatment is compared to a non-surgical one (see for example Howard et al. (1997) and the references therein), the assumption of proportional hazards is questionable and should be carefully investigated. The most obvious sign of a violation of this assumption are survival curves crossing each other. However, often other techniques are required, for example graphical methods (see Grambsch and Therneau (1994) for an overview) or statistical tests (see for example Gill and Schumacher (1987)). One of the advantages of the standard methodology based on Kaplan-Meier curves and the log-rank test is that equivalence hypotheses can be formulated using one parameter, that is the hazard ratio. If this ratio changes over time, both an alternative measure of treatment effect and a suitable definition of equivalence have to be found.
For instance, Royston and Parmar (2011) introduce the restricted mean survival time to overcome this issue. Moreover, an alternative to commonly used log-rank based tests of equivalence has been proposed by Martinez et al. (2017) under the assumption of proportional odds. Further, these authors show that type I errors of a log-rank based test exceed the nominal level if the assumption of proportional hazards does not hold. Finally, in a recent paper Shen (2020) proposes an alternative test for equivalence based on two other non-parametric models, which can also be used if neither hazards nor odds are proportional. Methods that employ parametric survival models are less common than the above-mentioned semi- or non-parametric methods. However, a correctly specified parametric survival model provides numerous advantages, as for instance more precise estimates (see, for example, Klein and Moeschberger (2006)) or the possibility of making predictions. Inference based on parametric models can be very precise even in case of misspecification, as demonstrated by Subramanian and Zhang (2013), who develop simultaneous confidence bands for parametric survival curves and compare them to non-parametric approaches based on Kaplan-Meier estimates. In this paper we develop new methodology in two directions. We address the issue of non-inferiority and equivalence testing in the presence of non-proportional hazards by presenting a parametric alternative to the classical methodology, without the assumption of a constant hazard ratio. We first present an approach to derive pointwise confidence bands for the difference of two survival curves and for the hazard ratio over time, respectively, both by making inference on the asymptotic distribution of these curves and by using a bootstrap approach. Second, we use these confidence intervals to assess equivalence or non-inferiority of the two treatments. Finally, all our methods are illustrated by a clinical trial example and by means of a simulation study.
2 Methods

Consider two samples of sizes $n_1$ and $n_2$, respectively, resulting in a total sample size of $n = n_1 + n_2$. Let $Y_{1,1},\ldots,Y_{1,n_1}$ and $Y_{2,1},\ldots,Y_{2,n_2}$ denote independent random variables representing survival times for individuals allocated to two (treatment) groups, observed over a time range given by $\mathcal{T} = [0, t_{\max}]$, where $0$ denotes the start of the observations. Assume that the distribution functions $F_1$ and $F_2$ of the $Y_{1,j}$, $j = 1,\ldots,n_1$, and $Y_{2,j}$, $j = 1,\ldots,n_2$, respectively, are absolutely continuous with densities $f_1$ and $f_2$. Consequently, the probability of experiencing an event before time $t$ for an individual $j$ of the $\ell$-th sample can be written as
$$F_\ell(t) = P(Y_{\ell,j} < t) = \int_0^t f_\ell(u)\,du, \qquad \ell = 1,2.$$
Further, denote the corresponding survival functions by $S_\ell := 1 - F_\ell$ and the hazard rates by $h_\ell := f_\ell / S_\ell$. The cumulative hazard function is given by $H_\ell(t) = -\log(S_\ell(t))$, $\ell = 1,2$. Denote the censoring times by $C_{1,1},\ldots,C_{1,n_1}$ and $C_{2,1},\ldots,C_{2,n_2}$ and the corresponding distribution functions by $G_1$ and $G_2$, respectively. Note that these distributions can differ from each other and are assumed to be independent of the $Y_{\ell,j}$, $\ell = 1,2$, $j = 1,\ldots,n_\ell$. We define $\Delta_{\ell,j} = I\{Y_{\ell,j} \le C_{\ell,j}\}$, indicating whether an individual is censored ($\Delta_{\ell,j} = 0$) or experiences an event ($\Delta_{\ell,j} = 1$), where $I$ denotes the indicator function. Each observation $(t_{\ell,j}, \delta_{\ell,j})$ is a realization of the bivariate random variable $(T_{\ell,j}, \Delta_{\ell,j})$, where $T_{\ell,j} = \min(Y_{\ell,j}, C_{\ell,j})$, $\ell = 1,2$, $j = 1,\ldots,n_\ell$.
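For concreteness, the quantities $S_\ell$, $f_\ell$, $h_\ell$ and $H_\ell$ can be written down for a Weibull model, the parametric family used in the simulation study below. The following sketch is illustrative only; the function and parameter names are our own assumptions, not the authors' implementation.

```python
import math

# Minimal sketch of S, f, h and H for a Weibull model with
# F(t) = 1 - exp{-(t/scale)^shape}; names are illustrative assumptions.

def survival(t, shape, scale):
    """S(t) = 1 - F(t)."""
    return math.exp(-((t / scale) ** shape))

def density(t, shape, scale):
    """f(t) = F'(t)."""
    return (shape / scale) * (t / scale) ** (shape - 1) * survival(t, shape, scale)

def hazard(t, shape, scale):
    """h(t) = f(t) / S(t)."""
    return density(t, shape, scale) / survival(t, shape, scale)

def cumulative_hazard(t, shape, scale):
    """H(t) = -log S(t), here equal to (t/scale)^shape."""
    return -math.log(survival(t, shape, scale))
```

Note that for two Weibull models sharing the shape parameter, $\log(h_2/h_1) = \text{shape} \cdot \log(\text{scale}_1/\text{scale}_2)$ is constant in $t$, i.e. hazards are proportional; different shapes yield a time-dependent hazard ratio.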
In order to make inference on the underlying distributions we consider the likelihood function for group $\ell$, given by
$$L_\ell(F_\ell, G_\ell) = \prod_{j=1}^{n_\ell} \big\{ f_\ell(t_{\ell,j})^{\delta_{\ell,j}} \, (1 - F_\ell(t_{\ell,j}))^{1-\delta_{\ell,j}} \big\} \cdot \prod_{j=1}^{n_\ell} \big\{ (1 - G_\ell(t_{\ell,j}))^{\delta_{\ell,j}} \, g_\ell(t_{\ell,j})^{1-\delta_{\ell,j}} \big\}, \qquad (2.1)$$
as censoring times and survival times are assumed to be independent. Hence we can obtain estimates for the densities $f_\ell(t) = f_\ell(\theta_\ell, t)$ and $g_\ell(t) = g_\ell(\psi_\ell, t)$ by deriving the parameters $\hat\theta_\ell$ and $\hat\psi_\ell$ maximizing $\log L_\ell$, $\ell = 1,2$. Note that if one is not interested in estimating the underlying distribution of the censoring times and $\theta_\ell$ and $\psi_\ell$ have no common parameters, this optimization procedure can be further simplified by considering only the first part of (2.1), resulting in an objective function given by
$$\tilde L_\ell(\theta_\ell) = \prod_{j=1}^{n_\ell} \big\{ f_\ell(\theta_\ell, t_{\ell,j})^{\delta_{\ell,j}} \, (1 - F_\ell(\theta_\ell, t_{\ell,j}))^{1-\delta_{\ell,j}} \big\}, \qquad \ell = 1,2. \qquad (2.2)$$

In the following we construct pointwise confidence bands for the difference of the survival functions and for the hazard ratio. First we derive an asymptotic approach using the Delta method (see Oehlert (1992)); second, we propose an alternative based on a bootstrap procedure, where the latter can also be used when samples are very small or if asymptotic inference cannot be made due to the lack of concrete expressions for the variance. In order to simplify calculations we consider the log-ratio, and therefore the two measures of interest are given by
$$\Delta(t, \theta_1, \theta_2) := S_1(t, \theta_1) - S_2(t, \theta_2) \quad \text{and} \quad r(t, \theta_1, \theta_2) := \log\frac{h_2(t, \theta_2)}{h_1(t, \theta_1)}. \qquad (2.3)$$
Under certain regularity conditions (see Bradley and Gart (1962)) the maximum likelihood estimates (MLE) obtained by maximizing (2.1) or (2.2) are asymptotically normally distributed, that is
$$\sqrt{n_\ell}\,\big(\hat\theta_\ell - \theta_\ell\big) \xrightarrow{\;D\;} \mathcal{N}\big(0, I^{-1}_{\theta_\ell}\big), \qquad \ell = 1,2,$$
where $I^{-1}_{\theta_\ell}$ denotes the inverse of the Fisher information matrix. This result can be used to make inference about the asymptotic distribution of the estimated survival curves. More precisely, using the Delta method we obtain for every $t > 0$
$$\sqrt{n_\ell}\,\big(S_\ell(t, \hat\theta_\ell) - S_\ell(t, \theta_\ell)\big) \xrightarrow{\;D\;} \mathcal{N}\Big(0,\; \tfrac{\partial}{\partial\theta_\ell} S_\ell(t, \theta_\ell)^T \, I^{-1}_{\theta_\ell} \, \tfrac{\partial}{\partial\theta_\ell} S_\ell(t, \theta_\ell)\Big), \qquad \ell = 1,2.$$
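As an illustration of the simplified likelihood (2.2), the following sketch evaluates the censored Weibull log-likelihood and maximizes it over the scale parameter by a crude grid search with the shape held fixed. The toy data, the grid and all names are invented for the example; a real analysis would optimize over both parameters with a proper numerical optimizer.

```python
import math

# Sketch of the simplified likelihood (2.2) for a Weibull model:
# events (delta = 1) contribute log f(t), censored observations
# (delta = 0) contribute log S(t). Data and names are assumptions.

def log_density(t, shape, scale):
    return (math.log(shape / scale)
            + (shape - 1) * math.log(t / scale)
            - (t / scale) ** shape)

def log_survival(t, shape, scale):
    return -((t / scale) ** shape)

def log_likelihood(data, shape, scale):
    """log L~(theta) from (2.2) for data = [(time, event indicator), ...]."""
    return sum(log_density(t, shape, scale) if delta == 1
               else log_survival(t, shape, scale)
               for t, delta in data)

# Toy right-censored sample: (observed time, delta)
data = [(1.2, 1), (3.4, 0), (2.8, 1), (5.0, 0), (0.9, 1)]

# Crude grid search over the scale parameter, shape fixed at 1.5
scale_hat = max((s / 10 for s in range(10, 100)),
                key=lambda s: log_likelihood(data, 1.5, s))
```

A useful check: for fixed shape $k$ the scale MLE has the closed form $(\sum_j t_j^k / d)^{1/k}$, where $d$ is the number of events and the sum runs over all observed times, censored or not.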
From the Delta method above, an estimate of the variance of $\Delta(t, \theta_1, \theta_2)$ is given by
$$\hat\sigma^2_\Delta := \widehat{\operatorname{Var}}(\Delta(t, \hat\theta_1, \hat\theta_2)) = \frac{1}{n_1}\, \tfrac{\partial}{\partial\theta_1} S_1(t, \hat\theta_1)^T \, I^{-1}_{\hat\theta_1} \, \tfrac{\partial}{\partial\theta_1} S_1(t, \hat\theta_1) + \frac{1}{n_2}\, \tfrac{\partial}{\partial\theta_2} S_2(t, \hat\theta_2)^T \, I^{-1}_{\hat\theta_2} \, \tfrac{\partial}{\partial\theta_2} S_2(t, \hat\theta_2), \qquad (2.4)$$
where $I_{\hat\theta_\ell}$ denotes the observed information matrix, $\ell = 1,2$. For sufficiently large samples this asymptotic result can be used to construct pointwise lower and upper $(1-\alpha)$-confidence bounds, respectively, by
$$L_\Delta(t, \hat\theta_1, \hat\theta_2) := \Delta(t, \hat\theta_1, \hat\theta_2) - z_{1-\alpha}\,\hat\sigma_\Delta \quad \text{and} \quad U_\Delta(t, \hat\theta_1, \hat\theta_2) := \Delta(t, \hat\theta_1, \hat\theta_2) + z_{1-\alpha}\,\hat\sigma_\Delta, \qquad (2.5)$$
where $z_{1-\alpha}$ denotes the $(1-\alpha)$-quantile of the standard normal distribution. More precisely, if $L_\Delta(t)$ and $U_\Delta(t)$ denote the $(1-\alpha)$ pointwise lower and upper confidence bounds, respectively, it holds that
$$\lim_{n_1,n_2 \to \infty} P\big(L_\Delta(t, \hat\theta_1, \hat\theta_2) \le \Delta(t, \theta_1, \theta_2)\big) \ge 1 - \alpha, \qquad \lim_{n_1,n_2 \to \infty} P\big(\Delta(t, \theta_1, \theta_2) \le U_\Delta(t, \hat\theta_1, \hat\theta_2)\big) \ge 1 - \alpha \qquad (2.6)$$
for all $t > 0$, where $\alpha$ denotes the prespecified significance level. The construction of pointwise confidence bands for the log hazard ratio works similarly, and the asymptotic variance of $r(t)$ is estimated by
$$\hat\sigma^2_r := \widehat{\operatorname{Var}}(r(t, \hat\theta_1, \hat\theta_2)) = \frac{1}{n_1}\, \tfrac{\partial}{\partial\theta_1} \log h_1(t, \hat\theta_1)^T \, I^{-1}_{\hat\theta_1} \, \tfrac{\partial}{\partial\theta_1} \log h_1(t, \hat\theta_1) + \frac{1}{n_2}\, \tfrac{\partial}{\partial\theta_2} \log h_2(t, \hat\theta_2)^T \, I^{-1}_{\hat\theta_2} \, \tfrac{\partial}{\partial\theta_2} \log h_2(t, \hat\theta_2). \qquad (2.7)$$
Consequently, $L_d$ and $U_d$ are given by
$$L_d(t, \hat\theta_1, \hat\theta_2) := r(t, \hat\theta_1, \hat\theta_2) - z_{1-\alpha}\,\hat\sigma_r, \qquad U_d(t, \hat\theta_1, \hat\theta_2) := r(t, \hat\theta_1, \hat\theta_2) + z_{1-\alpha}\,\hat\sigma_r \qquad (2.8)$$
and it holds for all $t > 0$ that
$$\lim_{n_1,n_2 \to \infty} P\big(L_d(t, \hat\theta_1, \hat\theta_2) \le r(t, \theta_1, \theta_2)\big) \ge 1 - \alpha \quad \text{and} \quad \lim_{n_1,n_2 \to \infty} P\big(r(t, \theta_1, \theta_2) \le U_d(t, \hat\theta_1, \hat\theta_2)\big) \ge 1 - \alpha.$$
A concrete numerical example for calculating the variances in (2.4) and (2.7) and the corresponding confidence bounds assuming a Weibull distribution is deferred to Section 2 of the Supplemental Material, and the finite sample properties for this scenario are presented in Section 3.

If sample sizes are rather small or the variability in the data is high, we propose to obtain estimates for the variances $\hat\sigma^2_\Delta$ and $\hat\sigma^2_r$ by using a bootstrap approach, taking the right-censoring into account. This approach has the advantage that it can also be used if a formula for the asymptotic variance is not obtainable, for instance due to numerical difficulties. The following algorithm explains the procedure for $\Delta(t_0, \theta_1, \theta_2)$; it can directly be adapted to $r(t_0, \theta_1, \theta_2)$.

Algorithm 2.1. (Parametric) Bootstrap Confidence Intervals for $\Delta(t_0, \theta_1, \theta_2)$.

1. Calculate the MLE $\hat\theta_\ell$ and $\hat\psi_\ell$, $\ell = 1,2$, from the data by maximizing (2.1).

2(a). Generate survival times $y^*_{\ell,1},\ldots,y^*_{\ell,n_\ell}$ from $F_\ell(\hat\theta_\ell)$, $\ell = 1,2$. Further generate the corresponding censoring times $c^*_{\ell,1},\ldots,c^*_{\ell,n_\ell}$ by sampling from the distributions $G_\ell(\hat\psi_\ell)$, $\ell = 1,2$. If $y^*_{\ell,j} > c^*_{\ell,j}$, the observation is censored (i.e. $\delta^*_{\ell,j} = 0$), $j = 1,\ldots,n_\ell$. The observed data are given by $(t^*_{\ell,j}, \delta^*_{\ell,j})$, $t^*_{\ell,j} = \min(y^*_{\ell,j}, c^*_{\ell,j})$, $j = 1,\ldots,n_\ell$, $\ell = 1,2$.

2(b). Calculate the MLE $\hat\theta^*_\ell$ for the bootstrap sample from the $t^*_{\ell,j}$, $j = 1,\ldots,n_\ell$, $\ell = 1,2$, and calculate the difference of the corresponding survival functions at $t_0$, that is
$$\Delta^* := \Delta(t_0, \hat\theta^*_1, \hat\theta^*_2) = S_1(t_0, \hat\theta^*_1) - S_2(t_0, \hat\theta^*_2). \qquad (2.9)$$

3. Repeat steps 2(a) and 2(b) $n_{boot}$ times, yielding $\Delta^*_1,\ldots,\Delta^*_{n_{boot}}$, and calculate an estimate of the variance $\hat\sigma^2_\Delta$ by
$$\hat\sigma^2_\Delta = \frac{1}{n_{boot} - 1} \sum_{k=1}^{n_{boot}} \big(\Delta^*_k - \bar\Delta^*\big)^2, \qquad (2.10)$$
where $\bar\Delta^*$ denotes the mean of the $\Delta^*_i$, $i = 1,\ldots,n_{boot}$.

Finally, the estimate $\hat\sigma^2_\Delta$ in (2.10) is used to calculate the confidence band in (2.5). Note that the procedure described in Algorithm 2.1 is a parametric bootstrap, as it is based on estimating the parameters $\hat\theta_\ell$, $\hat\psi_\ell$, $\ell = 1,2$. A non-parametric alternative, given by resampling the observations (see Efron and Tibshirani (1994)), could also be implemented.
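A minimal sketch of Algorithm 2.1 is given below. For simplicity it assumes exponential survival and censoring distributions, for which the censored MLE of the rate is the number of events divided by the total follow-up time; the function names, the seed and the chosen rates are assumptions made for this illustration, not the authors' implementation.

```python
import math
import random

def fit_rate(data, event_value):
    """Censored exponential MLE: #events / total follow-up time.
    event_value = 1 fits the survival rate, 0 the censoring rate."""
    events = sum(1 for _, d in data if d == event_value)
    return events / sum(t for t, _ in data)

def simulate_group(n, event_rate, censor_rate, rng):
    """Step 2(a): draw survival and censoring times, keep the minimum."""
    out = []
    for _ in range(n):
        y = rng.expovariate(event_rate)   # survival time
        c = rng.expovariate(censor_rate)  # censoring time
        out.append((min(y, c), 1 if y <= c else 0))
    return out

def bootstrap_sd(data1, data2, t0, n_boot, rng):
    """Steps 1-3: bootstrap standard deviation of Delta(t0) = S1(t0) - S2(t0)."""
    r1, r2 = fit_rate(data1, 1), fit_rate(data2, 1)   # step 1: theta-hat
    c1, c2 = fit_rate(data1, 0), fit_rate(data2, 0)   # step 1: psi-hat
    deltas = []
    for _ in range(n_boot):
        b1 = simulate_group(len(data1), r1, c1, rng)  # step 2(a)
        b2 = simulate_group(len(data2), r2, c2, rng)
        s1 = math.exp(-fit_rate(b1, 1) * t0)          # step 2(b)
        s2 = math.exp(-fit_rate(b2, 1) * t0)
        deltas.append(s1 - s2)                        # (2.9)
    mean = sum(deltas) / n_boot
    return math.sqrt(sum((d - mean) ** 2 for d in deltas) / (n_boot - 1))  # (2.10)

rng = random.Random(1)
data1 = simulate_group(50, 0.2, 0.1, rng)
data2 = simulate_group(50, 0.3, 0.1, rng)
sigma_hat = bootstrap_sd(data1, data2, t0=3.0, n_boot=200, rng=rng)
```

The resulting estimate plugs into (2.5) to give the pointwise bounds $\hat\Delta(t_0) \pm z_{1-\alpha}\,\hat\sigma_\Delta$.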
2.2 Testing non-inferiority and equivalence

We aim to compare the survival functions of two (treatment) groups, which is commonly addressed by testing the null hypothesis that the two survival functions are identical versus the alternative that they differ at some time point (for an overview see, for example, Klein and Moeschberger (2006)). More precisely, the classical hypotheses are given by
$$H_0: S_1(t, \theta_1) = S_2(t, \theta_2) \text{ for all } t \in \mathcal{T} \quad \text{against} \quad H_1: S_1(t, \theta_1) \ne S_2(t, \theta_2) \text{ for some } t \in \mathcal{T}.$$
In some situations one might be more interested in establishing non-inferiority of one treatment compared to another, or equivalence of the two treatments, meaning that we allow a deviation of the survival curves up to a prespecified threshold instead of testing for equality. This can be done for a particular point in time, say $t_0$, or over an entire interval, say $[t_1, t_2]$. We first consider the difference in survival at a particular point in time $t_0$. The corresponding hypotheses are then given by
$$H_0: S_1(t_0, \theta_1) - S_2(t_0, \theta_2) \ge \delta \quad \text{against} \quad H_1: S_1(t_0, \theta_1) - S_2(t_0, \theta_2) < \delta \qquad (2.11)$$
for a non-inferiority trial investigating whether a test treatment is non-inferior to the reference treatment (which is stated in the alternative hypothesis). Considering equivalence, the hypotheses are given by
$$H_0: |S_1(t_0, \theta_1) - S_2(t_0, \theta_2)| \ge \delta \quad \text{against} \quad H_1: |S_1(t_0, \theta_1) - S_2(t_0, \theta_2)| < \delta. \qquad (2.12)$$
The same questions can be addressed considering the (log) hazard ratio, resulting in hypotheses analogous to (2.11), given by
$$H_0: \log\frac{h_2(t_0, \theta_2)}{h_1(t_0, \theta_1)} \ge \varepsilon \quad \text{against} \quad H_1: \log\frac{h_2(t_0, \theta_2)}{h_1(t_0, \theta_1)} < \varepsilon \qquad (2.13)$$
for a non-inferiority trial and
$$H_0: \Big|\log\frac{h_2(t_0, \theta_2)}{h_1(t_0, \theta_1)}\Big| \ge \varepsilon \quad \text{against} \quad H_1: \Big|\log\frac{h_2(t_0, \theta_2)}{h_1(t_0, \theta_1)}\Big| < \varepsilon \qquad (2.14)$$
for addressing equivalence. The margins $\delta > 0$ and $\varepsilon > 0$ for the survival difference are frequently chosen between 0.1 and 0.2; for detailed discussions on this topic we refer to D'Agostino Sr et al. (2003), Da Silva et al. (2009) and Wellek (2010), who also investigate the correspondence between $\delta$ and the hazard ratio threshold $\varepsilon$.

We now use the confidence bounds derived in (2.5) to construct an asymptotic $\alpha$-level test for (2.11). More precisely, the null hypothesis concerning non-inferiority in (2.11) is rejected if the upper bound of the confidence interval lies below the non-inferiority margin, that is
$$U_\Delta(t_0, \hat\theta_1, \hat\theta_2) \le \delta. \qquad (2.15)$$
Further, rejecting $H_0$ whenever
$$U_\Delta(t_0, \hat\theta_1, \hat\theta_2) \le \delta \quad \text{and} \quad L_\Delta(t_0, \hat\theta_1, \hat\theta_2) \ge -\delta \qquad (2.16)$$
yields an equivalence test for the hypotheses in (2.12). According to the intersection-union principle (Berger (1982)), the same $(1-\alpha)$-confidence bounds $L_\Delta(t_0, \hat\theta_1, \hat\theta_2)$ and $U_\Delta(t_0, \hat\theta_1, \hat\theta_2)$ are used for both the non-inferiority and the equivalence test. The following theorem states that this yields an asymptotic $\alpha$-level test.

Theorem 2.2. The test described in (2.16) yields an asymptotic $\alpha$-level equivalence test for the hypotheses in (2.12). More precisely, it holds for all $t_0 \in \mathcal{T}$ that
$$\lim_{n_1,n_2 \to \infty} P_{H_0}\big(U_\Delta(t_0, \hat\theta_1, \hat\theta_2) \le \delta,\; L_\Delta(t_0, \hat\theta_1, \hat\theta_2) \ge -\delta\big) \le \alpha.$$

The proof is deferred to Section 1 of the Supplemental Material; a similar procedure can be applied for testing the hypotheses in (2.13) and (2.14).

2.2.1 Comparing survival over entire time intervals
In some situations it might be interesting to compare survival not only at one particular point in time but over an entire period of time $[t_1, t_2]$. For instance, extending the equivalence test stated in (2.12) to this situation yields the hypotheses
$$H_0: \max_{t \in [t_1,t_2]} |S_1(t, \theta_1) - S_2(t, \theta_2)| \ge \delta \quad \text{against} \quad H_1: \max_{t \in [t_1,t_2]} |S_1(t, \theta_1) - S_2(t, \theta_2)| < \delta, \qquad (2.17)$$
and a similar extension can be formulated for the non-inferiority test (2.11) and for the tests on the hazard ratio (2.13) and (2.14), respectively. In this case we reject the null hypothesis in (2.17) if, for all $t$ in $[t_1, t_2]$, the confidence bounds $L_\Delta(t)$ and $U_\Delta(t)$ derived in (2.5) are included in the equivalence region $[-\delta, \delta]$.

3 Simulation study

In the following we investigate the finite sample properties of the proposed methods by means of a simulation study. We consider two scenarios corresponding to the choice of a Weibull distribution of survival times. In a third case we investigate the robustness of our approach by generating data according to a log-logistic distribution but still fitting a Weibull model. We assume (randomly) right-censored observations in combination with an administrative censoring time in both scenarios, meaning a fixed time point of last follow-up, which we denote by $t_{\max}$. All results are obtained by running $n_{sim} = 1000$ simulations and $n_{boot} = 500$ bootstrap repetitions. For all three scenarios we calculate confidence bands for both the difference of the survival curves and the log hazard ratio and observe their coverage probabilities. For the difference of the survival curves we investigate the tests on non-inferiority and equivalence proposed in (2.15) and (2.16), respectively. For this purpose we vary both the particular time point under consideration $t_0$ and the non-inferiority/equivalence margin $\delta$. More precisely, we consider three different choices for this margin ranging from rather conservative to liberal, namely $\delta = 0.1$, $0.15$ and $0.2$. The test concerning the hypotheses formulated in terms of the log hazard ratio (see (2.13) and (2.17)) is conducted in the same manner; these results are omitted here. In addition to our new method we also evaluate all scenarios using a non-parametric approach as described in Section 5.2 of Com-Nougue et al. (1993). More precisely, we construct confidence bounds for the difference of two Kaplan-Meier curves by estimating the variance using the formula by Greenwood (1926). For the sake of brevity the latter results are deferred to Section 3 of the Supplemental Material.

For the first two scenarios we assume the data in both treatment groups to follow a Weibull distribution, that is
$$F_\ell(t, \theta_\ell) = 1 - \exp\big\{-(t/\theta_{\ell,2})^{\theta_{\ell,1}}\big\}, \qquad \ell = 1,2,$$
where $\theta_{\ell,1}$ denotes the shape parameter and $\theta_{\ell,2}$ the scale parameter corresponding to treatment group $\ell = 1,2$. We consider a time range (in months) given by $\mathcal{T} = [0, t_{\max}]$, where $t_{\max} = 9$ is the latest point of follow-up.

Figure 1: The three scenarios under consideration, used for simulating type I error rates and coverage probabilities. (a): Survival curves for Scenarios (3.18) and (3.19) with $\theta_2 \in \{(1.5, 4.9), (2, 2.5)\}$, together with the reference model. (b): Corresponding hazard rates. (c): Survival curves for Scenario (3.20) with $\theta_1 = (1.5, 2.6)$ and $\theta_2 = (2.1, 3.9)$.

For the first configuration the reference model $\theta_1$ has shape parameter 1.5, and the second model shares this shape and is varied through its scale parameter,
$$\theta_2 = (1.5, 4.9), \qquad \theta_3 = (1.5, 3.7), \qquad (3.18)$$
where $\theta_2$ is used for investigating the type I errors and coverage probabilities and $\theta_3$ for simulating the power, respectively. As an example, Figure 1(a) displays the survival curves for a choice of $\theta_2$. Both configurations result in a constant log hazard ratio of $\log(1.5) \approx 0.4$, representing the situation of proportional hazards. We assume the censoring times to be exponentially distributed and choose the rates of the two groups such that a censoring rate of approximately 25% results (rates of 0.09 and 0.05 for $\theta_2$ and $\theta_3$, respectively).

In order to investigate the effect of non-proportional hazards we consider a second scenario of intersecting survival curves, where we keep the reference model $\theta_1$ and all other configurations as above, but vary the parameters of the second model, resulting in
$$\theta_2 = (2, 2.5), \qquad \theta_3 = (2, 3.4). \qquad (3.19)$$
Here the choice of $\theta_2$ is used for investigating the type I errors and coverage probabilities and $\theta_3$ for simulating the power, respectively. Again we consider censoring rates of approximately 25% for both treatment groups (e.g. a rate of 0.14 for $\theta_2$).

For the third scenario we generate the survival times according to a log-logistic distribution. More precisely, we choose
$$F_\ell(t, \theta_\ell) = 1 - \big(1 + (t/\theta_{\ell,2})^{\theta_{\ell,1}}\big)^{-1}, \qquad \ell = 1,2.$$
We now assume the censoring times to be uniformly distributed on an interval $[0, c_\ell]$, where $c_\ell$ is chosen such that a censoring rate of approximately 20% results, $\ell = 1,2$.

Figure 2: The two scenarios used for simulating the power. (a): Survival curves for Scenarios (3.18) and (3.19) with $\theta_3 \in \{(1.5, 3.7), (2, 3.4)\}$, together with the reference model. (b): Corresponding hazard rates.

We consider a time range (in months) given by $\mathcal{T} = [0, 12]$ and define the scenario by the set of parameters
$$\theta_1 = (1.5, 2.6), \qquad \theta_2 = (2.1, 3.9), \qquad (3.20)$$
where $\theta_1$ corresponds to the reference model, see Figure 1(c).

In order to investigate the performance of the confidence bands derived in (2.5) and (2.8) we consider the scenarios described above for three different sample sizes, that is $(n_1, n_2) = (20, 20)$, $(n_1, n_2) = (50, 50)$ and $(n_1, n_2) = (100, 100)$, corresponding to total sample sizes of $n = 40$, $100$ and $200$. We choose $\alpha = 0.05$.
We calculate both the asymptotic (two-sided) confidence bands obtained by using the Delta method and the bands based on the bootstrap described in Algorithm 2.1, in the following called asymptotic bands and bootstrap bands, respectively. Confidence bands were constructed on an equidistant grid of 23 time points covering the observation period. Considering the bands for $S_1 - S_2$ and $\log(h_2/h_1)$, respectively, the approximation is very precise when sample sizes increase, as the coverage probabilities are very close to the desired value of 0.95 in this case. Further, it becomes obvious that the confidence bands obtained by estimating the variance by bootstrap (2.10) are always slightly more conservative than their asymptotic versions (2.4) and (2.7), respectively. However, considering the bands on $S_1 - S_2$ for very small sample sizes, that is $n_1 = n_2 = 20$, the coverage probability lies between 0.91 and 0.94, and hence these bands are slightly anti-conservative. The bootstrap bands perform slightly better, but still have coverage probabilities around 0.93 instead of 0.95; see the first column of Figure 3. This effect already disappears for $n_1 = n_2 = 50$, where a highly improved accuracy can be observed. We observe the same qualitative results for the asymptotic bands for $\log(h_2/h_1)$, whereas the bootstrap bands show a different behaviour, being rather conservative for small sample sizes but also getting more precise with increasing sample sizes.

For smaller sample sizes, all confidence bands under consideration vary in their behaviour over time. This effect becomes particularly visible when considering the non-proportional hazards scenario (3.19); see the second row of Figure 3. The coverage probability of the bands for $S_1 - S_2$ starts with a very accurate approximation during the first two months but then decreases to 0.93. This effect can be explained by the fact that in the setting of a very small sample, that is $n_1 = n_2 = 20$, after this period of time only very few patients remain, as the median survival of the reference model is only about two months. For the scenario of misspecification (3.20) the coverage probabilities vary between 0.85 and 0.95. However, considering $S_1 - S_2$, in a majority of the cases the coverage is above 0.9, even for a small sample size of $n = 40$, where the bootstrap approach performs slightly better than the asymptotic analogue. For increasing sample sizes the coverage, which varies over time, approximates 0.95, whereas the bands for $\log(h_2/h_1)$ still do not come sufficiently close to this desired value, even for the largest sample size of $n = 200$. Consequently, we conclude that these bands suffer more from misspecification than the ones for $S_1 - S_2$, where the latter prove to be robust if sample sizes are sufficiently large.

In the following we investigate type I error rates for the non-inferiority test (2.15) and the equivalence test (2.16) on the difference of survival curves.
We consider a nominal level of $\alpha = 0.05$ and different sample sizes, i.e. $(n_1, n_2) = (20, 20)$, $(50, 50)$, $(100, 100)$ and $(150, 150)$. We choose $\theta_2 = (1.5, 4.9)$ such that the difference curve $S_1(t) - S_2(t)$ attains the values 0.1, 0.15 and 0.2 at the three time points displayed in Table 1, matching the three choices of the margin $\delta$. Table 1 displays the simulated type I errors; configurations not falling under the null hypothesis are omitted and marked with "−". The first number always corresponds to the equivalence test, the number in brackets to the type I error of the non-inferiority test, respectively. It turns out that the approximation of the level is very precise, for the non-inferiority test (2.15) in general and for the equivalence test (2.16) for sufficiently large sample sizes. In particular, for settings on the margin of the null hypothesis, that is $S_1(t_0) - S_2(t_0) = \delta$, we obtain type I errors very close to 0.05. Only for very small samples, that is $n_1 = n_2 = 20$, the equivalence test (2.16) is conservative, as the obtained type I errors are zero in almost all scenarios under consideration. The same arguments hold for the scenario of non-proportional hazards (3.19) with $\theta_2 = (2, 2.5)$, shown in Table 2, where the equivalence test is conservative in particular for $\delta = 0.2$, whereas for the other choices of $\delta$ the type I error is controlled properly. Finally, we note that the non-parametric non-inferiority test also yields a precise approximation of the level if sample sizes are sufficiently large. However, the corresponding equivalence test is conservative, in particular for $\delta = 0.1$. The results and a detailed interpretation are deferred to Section 3.1 of the Supplemental Material.

Figure 3: Simulated coverage probabilities for scenarios (3.18) (first row), (3.19) (second row) and the scenario of misspecification (3.20) (third row) at different time points for sample sizes of $n_1 = n_2 = 20, 50, 100$ (left, middle, right column). The dashed lines correspond to the confidence bands for the difference of the survival functions (2.5), the dotted lines to the confidence bands for the log hazard ratio (2.8). Black lines display the asymptotic bands, red lines the confidence bands based on bootstrap, respectively.

(n1, n2)     (t0, S1(t0) − S2(t0))   δ = 0.1         δ = 0.15        δ = 0.2
(20, 20)     (1.6, 0.1)              0.000 (0.049)   −               −
             (2.·, 0.15)             0.000 (0.025)   0.000 (0.051)   −
             (4, 0.2)                0.000 (0.014)   0.000 (0.030)   0.001 (0.061)
(50, 50)     (1.6, 0.1)              0.001 (0.049)   −               −
             (2.·, 0.15)             0.000 (0.014)   0.017 (0.045)   −
             (4, 0.2)                0.000 (0.005)   0.010 (0.012)   0.047 (0.048)
(100, 100)   (1.6, 0.1)              0.037 (0.037)   −               −
             (2.·, 0.15)             0.000 (0.009)   0.049 (0.041)   −
             (4, 0.2)                0.000 (0.001)   0.012 (0.006)   0.050 (0.051)
(150, 150)   (1.6, 0.1)              0.054 (0.044)   −               −
             (2.·, 0.15)             0.001 (0.002)   0.037 (0.040)   −
             (4, 0.2)                0.000 (0.000)   0.003 (0.003)   0.053 (0.048)

Table 1: Simulated type I errors of the non-inferiority test (2.15) (numbers in brackets) and the equivalence test (2.16) for the scenario of proportional hazards (3.18) with $\theta_2 = (1.5, 4.9)$ at three different time points for different sample sizes and equivalence margins. The nominal level is chosen as $\alpha = 0.05$. Scenarios not falling under the null hypothesis are marked with "−".
For investigations on the power we consider the same configurations as given above. We now observe the scenario of proportional hazards (3.18) with $\theta_3 = (1.5, 3.7)$, such that the difference curve $S_1(t) - S_2(t)$ attains the values 0.01, 0.02 and 0.04 at time points 0.7, 1.· and 2.3, respectively. Hence all chosen configurations belong to the alternatives in (2.15) and (2.16). Table 3 displays the simulated power. It turns out that both tests achieve a reasonable power in most settings under consideration. The power clearly increases with increasing sample sizes and a wider equivalence/non-inferiority margin $\delta$. For the non-proportional hazards configuration $\theta_3 = (2, 3.4)$ the difference curve attains values of 0.01 and 0.04 at later time points, by which only few patients still experience an event in this scenario, reducing the power there.
20) (1 . , .
1) 0.000 (0.060) - -(2 . , .
15) 0.000 (0.029) 0.000 (0.049) -(3 , .
2) 0.000 (0.016) 0.000 (0.021) 0.002 (0.061)(50 ,
50) (1 . , .
1) 0.000 (0.053) - -(2 . , .
15) 0.000 (0.011) 0.008 (0.052) -(3 , .
2) 0.000 (0.003) 0.007 (0.014) 0.048 (0.048)(100 , . , .
1) 0.002 (0.057) - -(2 . , .
15) 0.000 (0.008) 0.055 (0.055) -(3 , .
2) 0.000 (0.001) 0.006 (0.006) 0.038 (0.038)(150 , . , .
1) 0.043 (0.053) - -(2 . , .
15) 0.004 (0.006) 0.041 (0.041) -(3 , .
2) 0.000 (0.000) 0.005 (0.005) 0.049 (0.049)
Table 2: Simulated type I errors of the non-inferiority test (2.15) (numbers in brackets) and the equivalence test (2.16) for the scenario of non-proportional hazards (3.19) with θ = (2, ….5) at three different time points for different sample sizes and equivalence margins. The nominal level is chosen as α = 0.05. Scenarios not falling under the null hypothesis are marked with "−".

… event in this scenario. In summary, we observe a reasonably high power for both tests. The power increases with increasing sample sizes and a wider equivalence margin. Further, the presence of non-proportional hazards does not affect the power, meaning that in both scenarios the results are qualitatively the same. Finally, we also observe that the power of the non-parametric approach is much smaller than that of our new method, in particular for small sample sizes and for the scenario of non-proportional hazards, which underlines the theoretical findings. Detailed results can be found in Section 3.1 of the Supplemental Material.

In the following we investigate a well-known benchmark dataset for survival analysis. The dataset veteran from the Veterans' Administration Lung Cancer Trial, published in Kalbfleisch and Prentice (2011) and implemented in the R package survival by Therneau (2020), describes a two-treatment, randomized trial for lung cancer. In this trial, male patients with advanced inoperable lung cancer were allocated to either a standard therapy (reference treatment, ℓ = 1) or a chemotherapy (test treatment, ℓ = 2). Numerous covariates were documented, including the time to death of each patient, which is the primary endpoint of our analysis.
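The non-parametric starting point of such an analysis, the Kaplan-Meier estimator mentioned above, can be sketched in a few lines. The snippet below is for illustration only (the paper's own analysis is carried out in R; all names here are ours):

```python
import numpy as np

def kaplan_meier(time, status):
    """Kaplan-Meier estimate of the survival function.

    time   : observed times (event or censoring)
    status : 1 = event observed, 0 = right-censored
    Returns the distinct event times and S(t) just after each of them.
    """
    time = np.asarray(time, dtype=float)
    status = np.asarray(status)
    event_times = np.unique(time[status == 1])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)              # subjects still under observation
        d = np.sum((time == t) & (status == 1))  # events exactly at time t
        s *= 1.0 - d / at_risk                   # multiplicative update
        surv.append(s)
    return event_times, np.array(surv)

# Toy example: three subjects, the second one censored at time 2
t_ev, s_hat = kaplan_meier([1.0, 2.0, 3.0], [1, 0, 1])
# s_hat is [2/3, 0]: one of three at risk fails at t=1, the last one at t=3
```

Fitting one curve per treatment group in this way yields the non-parametric analogue of the parametric survival models compared below.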
(n1, n2)    (t, S1(t) − S2(t))   δ = 0.1         δ = 0.15        δ = 0.2
(20, 20)    (0.7, 0.01)          0.145 (0.436)   0.512 (0.723)   0.795 (0.880)
            (1.…, 0.02)          0.002 (0.227)   0.057 (0.410)   0.303 (0.595)
            (2.3, 0.04)          0.000 (0.114)   0.000 (0.208)   0.000 (0.365)
(50, 50)    (0.7, 0.01)          0.553 (0.696)   0.929 (0.943)   0.995 (0.995)
            (1.…, 0.02)          0.025 (0.360)   0.501 (0.645)   0.833 (0.870)
            (2.3, 0.04)          0.000 (0.187)   0.085 (0.335)   0.460 (0.601)
(100, 100)  (0.7, 0.01)          0.887 (0.906)   0.998 (0.998)   1.000 (1.000)
            (1.…, 0.02)          0.388 (0.548)   0.858 (0.875)   0.988 (0.998)
            (2.3, 0.04)          0.017 (0.274)   0.486 (0.557)   0.854 (0.861)
(150, 150)  (0.7, 0.01)          0.973 (0.974)   1.000 (1.000)   1.000 (1.000)
            (1.…, 0.02)          0.639 (0.703)   0.971 (0.971)   0.999 (0.999)
            (2.3, 0.04)          0.236 (0.364)   0.728 (0.736)   0.949 (0.950)
Table 3: Simulated power of the non-inferiority test (2.15) (numbers in brackets) and the equivalence test (2.16) for the scenario of proportional hazards (3.18) with θ = (1.…, ….7) at three different time points for different sample sizes and equivalence margins. The nominal level is chosen as α = 0.05.

In total 137 observations are given, allocated to n1 = 69 patients in the reference group and n2 = 68 in the test group. The R code reproducing the results presented in the following can be found online at https://github.com/kathrinmoellenhoff/survival. As our analysis is model-based, we start with a model selection step. More precisely, we split the data into the reference group and the test group and assume six different distributions, that is a Weibull, an exponential, a Gaussian, a logistic, a log-normal and a log-logistic distribution, respectively. We fit the corresponding models separately per treatment group, resulting in 12 models in total. Finally, we compare the six different models within each group using Akaike's Information Criterion (AIC, see Sakamoto et al. (1986)). It turns out that for the group receiving the reference treatment the Weibull and the exponential model provide the best fits (the AICs are given by 749.1, 747.1, 799.9, 794.7, 755.…, ….1, 750.…, ….7, respectively). Therefore we decide to base our analyses on Weibull models for both groups. Note, however, that all tests could also be performed assuming different distributions for each treatment.

For the analysis we now consider the complete likelihood function (2.1), incorporating the distribution of the censoring times. We assume the censoring times to be exponentially distributed; maximizing the likelihood (2.1) yields θ̂1 = (4.…, …) and ψ̂1 = 0.… for the reference group, and θ̂2 = (4.…, …) and ψ̂2 = 0.… for the test group. Figure 4(a) displays the corresponding Weibull models and the non-parametric analogues given by the Kaplan-Meier curves.
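The model-selection step described above can be illustrated as follows. This is not the authors' R code (that is linked above), but a self-contained Python sketch on synthetic data: it fits a right-censored Weibull model and its exponential submodel by maximum likelihood and compares their AICs. All function names and parameter values are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def weibull_negloglik(log_params, time, status):
    """Negative log-likelihood of right-censored Weibull data.

    log_params = (log shape, log scale); events contribute log f(t),
    censored observations contribute log S(t).
    """
    a, b = np.exp(log_params)
    z = (time / b) ** a
    log_f = np.log(a / b) + (a - 1.0) * np.log(time / b) - z  # log density
    log_s = -z                                                # log survival
    return -np.sum(np.where(status == 1, log_f, log_s))

def aic_weibull(time, status, fix_exponential=False):
    """Fit by maximum likelihood and return the AIC = 2k + 2 * negloglik.

    fix_exponential=True restricts the shape to 1 (exponential model, k = 1).
    """
    if fix_exponential:
        obj = lambda p: weibull_negloglik(np.array([0.0, p[0]]), time, status)
        res = minimize(obj, x0=np.array([0.0]), method="Nelder-Mead")
        return 2.0 * 1 + 2.0 * res.fun
    res = minimize(weibull_negloglik, x0=np.array([0.0, 0.0]),
                   args=(time, status), method="Nelder-Mead")
    return 2.0 * 2 + 2.0 * res.fun

# Synthetic data: shape-2 Weibull event times with exponential censoring
rng = np.random.default_rng(0)
t_event = 2.0 * rng.weibull(2.0, 300)
t_cens = rng.exponential(10.0, 300)
time = np.minimum(t_event, t_cens)
status = (t_event <= t_cens).astype(int)

aic_wei = aic_weibull(time, status)
aic_exp = aic_weibull(time, status, fix_exponential=True)
# here the (correct) Weibull model should attain the smaller AIC
```

Applying such fits per treatment group and per candidate distribution reproduces the kind of AIC comparison reported above.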
(n1, n2)    (t, S1(t) − S2(t))   δ = 0.1         δ = 0.15        δ = 0.2
(20, 20)    (0.…, 0.01)          0.964 (0.964)   0.996 (0.999)   1.000 (1.000)
            (0.…, 0.04)          0.372 (0.448)   0.712 (0.712)   0.891 (0.893)
            (2.…, 0.04)          0.000 (0.124)   0.000 (0.219)   0.000 (0.297)
            (3.…, 0.01)          0.000 (0.179)   0.000 (0.284)   0.000 (0.374)
(50, 50)    (0.…, 0.01)          1.000 (1.000)   1.000 (1.000)   1.000 (1.000)
            (0.…, 0.04)          0.637 (0.640)   0.938 (0.938)   0.997 (0.997)
            (2.…, 0.04)          0.000 (0.151)   0.047 (0.339)   0.436 (0.563)
            (3.…, 0.01)          0.000 (0.258)   0.054 (0.468)   0.481 (0.684)
(100, 100)  (0.…, 0.01)          1.000 (1.000)   1.000 (1.000)   1.000 (1.000)
            (0.…, 0.04)          0.818 (0.841)   0.995 (0.995)   1.000 (1.000)
            (2.…, 0.04)          0.001 (0.246)   0.442 (0.555)   0.803 (0.833)
            (3.…, 0.01)          0.006 (0.406)   0.558 (0.728)   0.869 (0.921)
(150, 150)  (0.…, 0.01)          1.000 (1.000)   1.000 (1.000)   1.000 (1.000)
            (0.…, 0.04)          0.927 (0.927)   1.000 (1.000)   1.000 (1.000)
            (2.…, 0.04)          0.206 (0.326)   0.678 (0.701)   0.922 (0.928)
            (3.…, 0.01)          0.267 (0.553)   0.797 (0.859)   0.903 (0.984)
Table 4: Simulated power of the non-inferiority test (2.15) (numbers in brackets) and the equivalence test (2.16) for the scenario of non-proportional hazards (3.19) with θ = (2, ….4) at four different time points for different sample sizes and equivalence margins. The nominal level is chosen as α = 0.05.

It turns out that for both treatment groups the parametric and the non-parametric curves are very close to each other. Further, we observe that the survival curves of the two treatment groups cross each other, which indicates that the assumption of proportional hazards is not justified here. Indeed, the hazard ratio ranges from 0.55 to 1.93 between the first event time (3 days) and the end of the observational period (999 days), and therefore an analysis using a proportional hazards model is actually not applicable here. The p-value of the log-rank test is 0.928, and thus the test does not detect any difference between the two groups.

We now perform a similar analysis using the parametric models and the theory derived in Section 2. For the sake of brevity we only consider the difference in survival; analyses concerning the (log) hazard ratio can be conducted in the same manner. We consider the first 600 days of the trial. We set α = 0.05 and calculate lower and upper (1 − α)-pointwise confidence bands according to (2.5) at several points 0 ≤ t ≤ 600, where the variances σ̂Δ are obtained by the asymptotic approach and by the bootstrap as described in Algorithm 2.1, respectively. Figure 4(b) displays the estimated difference curve Δ(t, θ̂1, θ̂2) = S1(t, θ̂1) − S2(t, θ̂2) and the pointwise confidence bands on the interval [0, 600].

Figure 4: (a): Survival curves (Kaplan-Meier curves and Weibull models) for the veteran data. Solid lines correspond to the reference group, dashed lines to the test treatment. (b): Difference in survival and pointwise confidence bands obtained by the asymptotic approach (dashed) and by the bootstrap (dotted), respectively, on the interval [40, 600]; the shaded area corresponds to the margin δ = 0.15.

We first consider t = 80, which is close to the median survival of both treatment groups. The difference in survival is Δ(80, θ̂1, θ̂2) = 0.047; the asymptotic confidence interval is given by [−0.…, 0.…] and the bootstrap interval by [−0.…, 0.…]. Consequently, non-inferiority can only be claimed for margins δ > 0.163 (0.… for the bootstrap). For δ = 0.15, which is indicated by the shaded area in Figure 4(b), H0 cannot be rejected in both cases, meaning in particular that the treatments cannot be considered equivalent with respect to the survival at day 80. Figure 4(b) further displays these investigations for δ = 0.15 over the entire range of time. The same conclusion can be made concerning equivalence, as the confidence bands are not contained between −δ = −0.15 and δ = 0.15 for all t in [0, 600]. Concluding, we observe that when considering for instance δ = 0.2, equivalence would be claimed at all time points under consideration, demonstrating the impact of the choice of the margin.
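The bootstrap confidence bands used above can be illustrated in a strongly simplified form. The sketch below is not the paper's Algorithm 2.1: it uses exponential working models (whose censored MLE is available in closed form) instead of the Weibull models, and it ignores censoring when resampling, purely for brevity; all data and parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)

def exp_rate_mle(time, status):
    """Censored exponential MLE of the rate: events / total time at risk."""
    return status.sum() / time.sum()

def bootstrap_band(time1, status1, time2, status2, t_grid, B=500, alpha=0.05):
    """Parametric-bootstrap pointwise (1 - alpha) confidence band for
    Delta(t) = S1(t) - S2(t), with exponential working models S(t) = exp(-r t).
    """
    r1 = exp_rate_mle(time1, status1)
    r2 = exp_rate_mle(time2, status2)
    n1, n2 = len(time1), len(time2)
    diffs = np.empty((B, len(t_grid)))
    for b in range(B):
        # resample event times from the fitted models and re-estimate the rates
        rb1 = 1.0 / rng.exponential(1.0 / r1, n1).mean()
        rb2 = 1.0 / rng.exponential(1.0 / r2, n2).mean()
        diffs[b] = np.exp(-rb1 * t_grid) - np.exp(-rb2 * t_grid)
    return (np.quantile(diffs, alpha / 2, axis=0),
            np.quantile(diffs, 1 - alpha / 2, axis=0))

# Illustrative two-arm data and a decision at margin delta = 0.15
time1 = rng.exponential(10.0, 80)
status1 = np.ones(80, dtype=int)
time2 = rng.exponential(11.0, 80)
status2 = np.ones(80, dtype=int)
t_grid = np.linspace(1.0, 30.0, 30)
lower, upper = bootstrap_band(time1, status1, time2, status2, t_grid)
noninferior_everywhere = bool(np.all(upper < 0.15))
```

Non-inferiority over a whole period then amounts to checking that the upper band stays below the margin at every grid point, which is exactly the comparison carried out for the veteran data above.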
In this paper we addressed the problem of survival analysis in the presence of non-proportional hazards. In this setting, commonly used methods such as Cox's proportional hazards model or the log-rank test are not optimal and suffer from a loss of power. Therefore we proposed an alternative approach for investigating equivalence or non-inferiority of time-to-event outcomes, based on the construction of (pointwise) confidence intervals and applicable irrespective of the shape of the hazard ratio. We constructed confidence bands for both the (log) hazard ratio and the difference of the survival curves, proposing two alternatives: one based on asymptotic investigations and the other on a parametric bootstrap procedure. Both approaches show a similar performance, where the latter has the advantage that it can also be used for very small samples and does not require calculating the asymptotic variance.

Our approach provides a framework for investigating survival in many ways. Apart from specific time points, entire periods of time can be considered. We demonstrated that the presence of non-proportional hazards has no effect on the performance of the confidence bands and of the non-inferiority and equivalence tests, respectively, which means that they do not depend on this assumption, providing a much more flexible tool compared to the standard methodology.

Our methods are based on the estimation of parametric survival functions, which can be a very powerful tool once the suitability of the model has been established. This has to be assessed in a preliminary step, using for instance model selection criteria such as the AIC. We demonstrated the robustness of our approach even in the case of misspecification. An interesting extension of the methodology could be given by model averaging, which might further improve the performance of the procedures.
References
Berger, R. L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24:295–300.

Bradley, R. and Gart, J. (1962). The asymptotic properties of ML estimators when sampling from associated populations. Biometrika, 49:205–214.

Com-Nougue, C., Rodary, C., and Patte, C. (1993). How to establish equivalence when data are censored: a randomized trial of treatments for B non-Hodgkin lymphoma. Statistics in Medicine, 12(14):1353–1364.

Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202.

Cox, D. R. and Oakes, D. (1984). Analysis of Survival Data, volume 21. CRC Press.

Da Silva, G. T., Logan, B. R., and Klein, J. P. (2009). Methods for equivalence and noninferiority testing. Biology of Blood and Marrow Transplantation, 15(1):120–127.

D'Agostino Sr, R. B., Massaro, J. M., and Sullivan, L. M. (2003). Non-inferiority trials: design concepts and issues - the encounters of academic consultants in statistics. Statistics in Medicine, 22(2):169–186.

Efron, B. and Tibshirani, R. J. (1994). An Introduction to the Bootstrap. CRC Press.

European Medicines Agency (EMA) (2014). Committee for Medicinal Products for Human Use (CHMP): Guideline on the choice of the non-inferiority margin. CPMP/EWP/2158/99.

Gill, R. and Schumacher, M. (1987). A simple test of the proportional hazards assumption. Biometrika, 74(2):289–300.

Grambsch, P. M. and Therneau, T. M. (1994). Proportional hazards tests and diagnostics based on weighted residuals. Biometrika, 81(3):515–526.

Greenwood, M. (1926). The natural duration of cancer. London: Her Majesty's Stationery Office. Reports on Public Health and Medical Subjects, (33).

Hernán, M. A. (2010). The hazards of hazard ratios. Epidemiology (Cambridge, Mass.), 21(1):13.

Howard, G., Chambless, L. E., and Kronmal, R. A. (1997). Assessing differences in clinical trials comparing surgical vs nonsurgical therapy: using common (statistical) sense. JAMA, 278(17):1432–1436.

Jachno, K., Heritier, S., and Wolfe, R. (2019). Are non-constant rates and non-proportional treatment effects accounted for in the design and analysis of randomised controlled trials? A review of current practice. BMC Medical Research Methodology, 19(1):103.

Kalbfleisch, J. D. and Prentice, R. L. (2011). The Statistical Analysis of Failure Time Data, volume 360. John Wiley & Sons.

Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457–481.

Klein, J. P. and Moeschberger, M. L. (2006). Survival Analysis: Techniques for Censored and Truncated Data. Springer Science & Business Media.

Li, H., Han, D., Hou, Y., Chen, H., and Chen, Z. (2015). Statistical inference methods for two crossing survival curves: a comparison of methods. PLoS One, 10(1):e0116774.

Martinez, E. E., Sinha, D., Wang, W., Lipsitz, S. R., and Chappell, R. J. (2017). Tests for equivalence of two survival functions: Alternative to the tests under proportional hazards. Statistical Methods in Medical Research, 26(1):75–87.

Oehlert, G. (1992). A note on the delta method. The American Statistician, 46:27–29.

Royston, P. and Parmar, M. K. (2011). The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt. Statistics in Medicine, 30(19):2409–2421.

Sakamoto, Y., Ishiguro, M., and Kitagawa, G. (1986). Akaike Information Criterion Statistics. Dordrecht, The Netherlands: D. Reidel, 81.

Shen, P.-S. (2020). Tests for equivalence of two survival functions: alternatives to the PH and PO models. Journal of Biopharmaceutical Statistics, pages 1–12.

Subramanian, S. and Zhang, P. (2013). Model-based confidence bands for survival functions. Journal of Statistical Planning and Inference, 143(7):1166–1185.

Therneau, T. M. (2020). A Package for Survival Analysis in R. R package version 3.1-12.

Uno, H., Claggett, B., Tian, L., Inoue, E., Gallo, P., Miyata, T., Schrag, D., Takeuchi, M., Uyama, Y., Zhao, L., et al. (2014). Moving beyond the hazard ratio in quantifying the between-group difference in survival analysis. Journal of Clinical Oncology, 32(22):2380.

Wellek, S. (1993). A log-rank test for equivalence of two survivor functions. Biometrics, pages 877–881.

Wellek, S. (2010).