Survival analysis under non-proportional hazards: investigating non-inferiority or equivalence in time-to-event data
Kathrin Möllenhoff
Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands and University of Cologne, Faculty of Medicine and University Hospital, Cologne, Germany
Achim Tresch
University of Cologne, Faculty of Medicine and University Hospital, Cologne, Germany and CECAD, University of Cologne, Cologne, Germany and Center for Data and Simulation Science, University of Cologne, Germany
September 16, 2020
Abstract
The classical approach to analyze time-to-event data, e.g. in clinical trials, is to fit Kaplan-Meier curves yielding the treatment effect as the hazard ratio between treatment groups. Afterwards, commonly a log-rank test is performed in order to investigate whether there is a difference in survival or, depending on additional covariates, a Cox proportional hazards model is used. However, in numerous trials these approaches fail due to the presence of non-proportional hazards, resulting in difficulties of interpreting the hazard ratio and a loss of power. When considering equivalence or non-inferiority trials, the commonly performed log-rank based tests are similarly affected by a violation of this assumption. Here we propose a parametric framework to assess equivalence or non-inferiority for survival data. We derive pointwise confidence bands for both the hazard ratio and the difference of the survival curves. Further, we propose a test procedure addressing non-inferiority and equivalence by directly comparing the survival functions at certain time points or over an entire range of time. We demonstrate the validity of the methods by a clinical trial example and by numerous simulation results.

Keywords and Phrases: equivalence, non-inferiority, survival analysis, time-to-event data, non-proportional hazards, confidence band, hazard ratio, log-rank test

1 Introduction

Time-to-event outcomes are frequently observed in medical research, for instance in the area of oncology or cardiovascular diseases. A commonly addressed issue is the comparison of a test to a reference treatment with regard to survival. For this purpose an analysis based on Kaplan-Meier curves (see Kaplan and Meier (1958)), followed by a log-rank test (see, for example, Kalbfleisch and Prentice (2011)) is still the most popular approach.
Additionally, adjusting for multiple covariates, Cox's proportional hazards model (Cox (1972)) has been extensively used in the last decades (for some examples see Cox and Oakes (1984); Klein and Moeschberger (2006), among many others). In case of addressing non-inferiority or equivalence, extensions of the log-rank test investigating the vertical distance between the survival curves have been proposed by Wellek (1993) and Com-Nougue et al. (1993). These approaches owe much of their popularity to the fact that they do not rely on assumptions on the distribution of event times. Moreover, a direct interpretation is obtained by summarizing the treatment effect in one single parameter, given by the hazard ratio of the two treatments, assumed to be constant over time. However, this assumption has been heavily criticized (see for example Hernán (2010); Uno et al. (2014)), and it is in practice rarely assessed or even obviously violated (see Li et al. (2015) and Jachno et al. (2019) for an overview on the awareness of this assumption). In particular, if short- and long-term benefits of the two treatments differ, for instance in situations where a surgical treatment is compared to a non-surgical one (see for example Howard et al. (1997) and the references therein), the assumption of proportional hazards is questionable and should be carefully investigated. The most obvious sign of a violation of this assumption are survival curves crossing each other. However, often other techniques are required, for example graphical methods (see Grambsch and Therneau (1994) for an overview) or statistical tests (see for example Gill and Schumacher (1987)). One of the advantages of the standard methodology based on Kaplan-Meier curves and the log-rank test is that equivalence hypotheses can be formulated using one parameter, that is the hazard ratio. If this ratio changes over time, both an alternative measure of treatment effect and a suitable definition of equivalence have to be found.
For instance, Royston and Parmar (2011) introduce the restricted mean survival time to overcome this issue. Moreover, an alternative to commonly used log-rank based tests of equivalence has been proposed by Martinez et al. (2017) under the assumption of proportional odds. Further, these authors show that type I errors of a log-rank based test exceed the nominal level if the assumption of proportional hazards does not hold. Finally, in a recent paper Shen (2020) proposes an alternative test for equivalence based on two other non-parametric models, which can also be used if neither hazards nor odds are proportional. Methods that employ parametric survival models are less common than the above-mentioned semi- or non-parametric methods. However, a correctly specified parametric survival model provides numerous advantages, as for instance more precise estimates (see, for example, Klein and Moeschberger (2006)) or the possibility of making predictions. Inference based on parametric models can be very precise even in case of misspecification, as demonstrated by Subramanian and Zhang (2013), who develop simultaneous confidence bands for parametric survival curves and compare them to non-parametric approaches based on Kaplan-Meier estimates. In this paper we develop new methodology in two directions. We address the issue of non-inferiority and equivalence testing in the presence of non-proportional hazards by presenting a parametric alternative to the classical methodology, without the assumption of a constant hazard ratio. We first present an approach to derive pointwise confidence bands for the difference of two survival curves and for the hazard ratio over time, respectively, both by making inference on the asymptotic distribution of these curves and by using a bootstrap approach. Second, we use these confidence intervals to assess equivalence or non-inferiority of the two treatments. Finally, all our methods are illustrated by a clinical trial example and by means of a simulation study.
2 Methods

Consider two samples of sizes $n_1$ and $n_2$, respectively, resulting in a total sample size of $n = n_1 + n_2$. Let $Y_{1,1},\ldots,Y_{1,n_1}$ and $Y_{2,1},\ldots,Y_{2,n_2}$ denote independent random variables representing survival times for individuals allocated to two (treatment) groups, observed over a time range given by $\mathcal{T} = [0, t_{\max}]$, where $0$ denotes the start of the observations. Assume that the distribution functions $F_1$ and $F_2$ of the $Y_{1,j}$, $j = 1,\ldots,n_1$, and $Y_{2,j}$, $j = 1,\ldots,n_2$, respectively, are absolutely continuous with densities $f_1$ and $f_2$. Consequently, the probability of experiencing an event before time $t$ for an individual $j$ of the $\ell$-th sample can be written as
$$F_\ell(t) = P(Y_{\ell,j} < t) = \int_0^t f_\ell(u)\,du, \qquad \ell = 1,2.$$
Further, denote the corresponding survival functions by $S_\ell := 1 - F_\ell$ and the hazard rates by $h_\ell := f_\ell / S_\ell$. The cumulative hazard function is given by $H_\ell(t) = -\log(S_\ell(t))$, $\ell = 1,2$. Denote the censoring times by $C_{1,1},\ldots,C_{1,n_1}$ and $C_{2,1},\ldots,C_{2,n_2}$ and the corresponding distribution functions by $G_1$ and $G_2$, respectively. Note that these distributions can differ from each other and are assumed to be independent of the $Y_{\ell,j}$, $\ell = 1,2$, $j = 1,\ldots,n_\ell$. We define $\Delta_{\ell,j} = I\{Y_{\ell,j} \le C_{\ell,j}\}$, indicating whether an individual is censored ($\Delta_{\ell,j} = 0$) or experiences an event ($\Delta_{\ell,j} = 1$), where $I$ denotes the indicator function. Each observation $(t_{\ell,j}, \delta_{\ell,j})$ is a realization of the bivariate random variable $(T_{\ell,j}, \Delta_{\ell,j})$, where $T_{\ell,j} = \min(Y_{\ell,j}, C_{\ell,j})$, $\ell = 1,2$, $j = 1,\ldots,n_\ell$.
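For concreteness, the quantities $S_\ell$, $f_\ell$, $h_\ell$ and $H_\ell$ can be written down for a Weibull model, the parametric family used in the simulation study below. The following sketch is illustrative only; the function and parameter names are our own assumptions, not the authors' implementation.

```python
import math

# Minimal sketch of S, f, h and H for a Weibull model with
# F(t) = 1 - exp{-(t/scale)^shape}; names are illustrative assumptions.

def survival(t, shape, scale):
    """S(t) = 1 - F(t)."""
    return math.exp(-((t / scale) ** shape))

def density(t, shape, scale):
    """f(t) = F'(t)."""
    return (shape / scale) * (t / scale) ** (shape - 1) * survival(t, shape, scale)

def hazard(t, shape, scale):
    """h(t) = f(t) / S(t)."""
    return density(t, shape, scale) / survival(t, shape, scale)

def cumulative_hazard(t, shape, scale):
    """H(t) = -log S(t), here equal to (t/scale)^shape."""
    return -math.log(survival(t, shape, scale))
```

Note that for two Weibull models sharing the shape parameter, $\log(h_2/h_1) = \text{shape} \cdot \log(\text{scale}_1/\text{scale}_2)$ is constant in $t$, i.e. hazards are proportional; different shapes yield a time-dependent hazard ratio.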
In order to make inference on the underlying distributions we consider the likelihood function for group $\ell$, given by
$$L_\ell(F_\ell, G_\ell) = \prod_{j=1}^{n_\ell} \big\{ f_\ell(t_{\ell,j})^{\delta_{\ell,j}} \, (1 - F_\ell(t_{\ell,j}))^{1-\delta_{\ell,j}} \big\} \cdot \prod_{j=1}^{n_\ell} \big\{ (1 - G_\ell(t_{\ell,j}))^{\delta_{\ell,j}} \, g_\ell(t_{\ell,j})^{1-\delta_{\ell,j}} \big\}, \qquad (2.1)$$
as censoring times and survival times are assumed to be independent. Hence we can obtain estimates for the densities $f_\ell(t) = f_\ell(\theta_\ell, t)$ and $g_\ell(t) = g_\ell(\psi_\ell, t)$ by deriving the parameters $\hat\theta_\ell$ and $\hat\psi_\ell$ maximizing $\log L_\ell$, $\ell = 1,2$. Note that if one is not interested in estimating the underlying distribution of the censoring times and $\theta_\ell$ and $\psi_\ell$ have no common parameters, this optimization procedure can be further simplified by considering only the first part of (2.1), resulting in an objective function given by
$$\tilde L_\ell(\theta_\ell) = \prod_{j=1}^{n_\ell} \big\{ f_\ell(\theta_\ell, t_{\ell,j})^{\delta_{\ell,j}} \, (1 - F_\ell(\theta_\ell, t_{\ell,j}))^{1-\delta_{\ell,j}} \big\}, \qquad \ell = 1,2. \qquad (2.2)$$

In the following we construct pointwise confidence bands for the difference of the survival functions and for the hazard ratio. First we derive an asymptotic approach using the Delta method (see Oehlert (1992)); second, we propose an alternative based on a bootstrap procedure, where the latter can also be used when samples are very small or if asymptotic inference cannot be made due to the lack of concrete expressions for the variance. In order to simplify calculations we consider the log-ratio, and therefore the two measures of interest are given by
$$\Delta(t, \theta_1, \theta_2) := S_1(t, \theta_1) - S_2(t, \theta_2) \quad \text{and} \quad r(t, \theta_1, \theta_2) := \log\frac{h_2(t, \theta_2)}{h_1(t, \theta_1)}. \qquad (2.3)$$
Under certain regularity conditions (see Bradley and Gart (1962)) the maximum likelihood estimates (MLE) obtained by maximizing (2.1) or (2.2) are asymptotically normally distributed, that is
$$\sqrt{n_\ell}\,\big(\hat\theta_\ell - \theta_\ell\big) \xrightarrow{\;D\;} \mathcal{N}\big(0, I^{-1}_{\theta_\ell}\big), \qquad \ell = 1,2,$$
where $I^{-1}_{\theta_\ell}$ denotes the inverse of the Fisher information matrix. This result can be used to make inference about the asymptotic distribution of the estimated survival curves. More precisely, using the Delta method we obtain for every $t > 0$
$$\sqrt{n_\ell}\,\big(S_\ell(t, \hat\theta_\ell) - S_\ell(t, \theta_\ell)\big) \xrightarrow{\;D\;} \mathcal{N}\Big(0,\; \tfrac{\partial}{\partial\theta_\ell} S_\ell(t, \theta_\ell)^T \, I^{-1}_{\theta_\ell} \, \tfrac{\partial}{\partial\theta_\ell} S_\ell(t, \theta_\ell)\Big), \qquad \ell = 1,2.$$
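As an illustration of the simplified likelihood (2.2), the following sketch evaluates the censored Weibull log-likelihood and maximizes it over the scale parameter by a crude grid search with the shape held fixed. The toy data, the grid and all names are invented for the example; a real analysis would optimize over both parameters with a proper numerical optimizer.

```python
import math

# Sketch of the simplified likelihood (2.2) for a Weibull model:
# events (delta = 1) contribute log f(t), censored observations
# (delta = 0) contribute log S(t). Data and names are assumptions.

def log_density(t, shape, scale):
    return (math.log(shape / scale)
            + (shape - 1) * math.log(t / scale)
            - (t / scale) ** shape)

def log_survival(t, shape, scale):
    return -((t / scale) ** shape)

def log_likelihood(data, shape, scale):
    """log L~(theta) from (2.2) for data = [(time, event indicator), ...]."""
    return sum(log_density(t, shape, scale) if delta == 1
               else log_survival(t, shape, scale)
               for t, delta in data)

# Toy right-censored sample: (observed time, delta)
data = [(1.2, 1), (3.4, 0), (2.8, 1), (5.0, 0), (0.9, 1)]

# Crude grid search over the scale parameter, shape fixed at 1.5
scale_hat = max((s / 10 for s in range(10, 100)),
                key=lambda s: log_likelihood(data, 1.5, s))
```

A useful check: for fixed shape $k$ the scale MLE has the closed form $(\sum_j t_j^k / d)^{1/k}$, where $d$ is the number of events and the sum runs over all observed times, censored or not.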
From the Delta method above, an estimate of the variance of $\Delta(t, \theta_1, \theta_2)$ is given by
$$\hat\sigma^2_\Delta := \widehat{\operatorname{Var}}(\Delta(t, \hat\theta_1, \hat\theta_2)) = \frac{1}{n_1}\, \tfrac{\partial}{\partial\theta_1} S_1(t, \hat\theta_1)^T \, I^{-1}_{\hat\theta_1} \, \tfrac{\partial}{\partial\theta_1} S_1(t, \hat\theta_1) + \frac{1}{n_2}\, \tfrac{\partial}{\partial\theta_2} S_2(t, \hat\theta_2)^T \, I^{-1}_{\hat\theta_2} \, \tfrac{\partial}{\partial\theta_2} S_2(t, \hat\theta_2), \qquad (2.4)$$
where $I_{\hat\theta_\ell}$ denotes the observed information matrix, $\ell = 1,2$. For sufficiently large samples this asymptotic result can be used to construct pointwise lower and upper $(1-\alpha)$-confidence bounds, respectively, by
$$L_\Delta(t, \hat\theta_1, \hat\theta_2) := \Delta(t, \hat\theta_1, \hat\theta_2) - z_{1-\alpha}\,\hat\sigma_\Delta \quad \text{and} \quad U_\Delta(t, \hat\theta_1, \hat\theta_2) := \Delta(t, \hat\theta_1, \hat\theta_2) + z_{1-\alpha}\,\hat\sigma_\Delta, \qquad (2.5)$$
where $z_{1-\alpha}$ denotes the $(1-\alpha)$-quantile of the standard normal distribution. More precisely, if $L_\Delta(t)$ and $U_\Delta(t)$ denote the $(1-\alpha)$ pointwise lower and upper confidence bounds, respectively, it holds that
$$\lim_{n_1,n_2 \to \infty} P\big(L_\Delta(t, \hat\theta_1, \hat\theta_2) \le \Delta(t, \theta_1, \theta_2)\big) \ge 1 - \alpha, \qquad \lim_{n_1,n_2 \to \infty} P\big(\Delta(t, \theta_1, \theta_2) \le U_\Delta(t, \hat\theta_1, \hat\theta_2)\big) \ge 1 - \alpha \qquad (2.6)$$
for all $t > 0$, where $\alpha$ denotes the prespecified significance level. The construction of pointwise confidence bands for the log hazard ratio works similarly, and the asymptotic variance of $r(t)$ is estimated by
$$\hat\sigma^2_r := \widehat{\operatorname{Var}}(r(t, \hat\theta_1, \hat\theta_2)) = \frac{1}{n_1}\, \tfrac{\partial}{\partial\theta_1} \log h_1(t, \hat\theta_1)^T \, I^{-1}_{\hat\theta_1} \, \tfrac{\partial}{\partial\theta_1} \log h_1(t, \hat\theta_1) + \frac{1}{n_2}\, \tfrac{\partial}{\partial\theta_2} \log h_2(t, \hat\theta_2)^T \, I^{-1}_{\hat\theta_2} \, \tfrac{\partial}{\partial\theta_2} \log h_2(t, \hat\theta_2). \qquad (2.7)$$
Consequently, $L_d$ and $U_d$ are given by
$$L_d(t, \hat\theta_1, \hat\theta_2) := r(t, \hat\theta_1, \hat\theta_2) - z_{1-\alpha}\,\hat\sigma_r, \qquad U_d(t, \hat\theta_1, \hat\theta_2) := r(t, \hat\theta_1, \hat\theta_2) + z_{1-\alpha}\,\hat\sigma_r \qquad (2.8)$$
and it holds for all $t > 0$ that
$$\lim_{n_1,n_2 \to \infty} P\big(L_d(t, \hat\theta_1, \hat\theta_2) \le r(t, \theta_1, \theta_2)\big) \ge 1 - \alpha \quad \text{and} \quad \lim_{n_1,n_2 \to \infty} P\big(r(t, \theta_1, \theta_2) \le U_d(t, \hat\theta_1, \hat\theta_2)\big) \ge 1 - \alpha.$$
A concrete numerical example for calculating the variances in (2.4) and (2.7) and the corresponding confidence bounds assuming a Weibull distribution is deferred to Section 2 of the Supplemental Material, and the finite sample properties for this scenario are presented in Section 3.

If sample sizes are rather small or the variability in the data is high, we propose to obtain estimates for the variances $\hat\sigma^2_\Delta$ and $\hat\sigma^2_r$ by using a bootstrap approach, taking the right-censoring into account. This approach has the advantage that it can also be used if a formula for the asymptotic variance is not obtainable, for instance due to numerical difficulties. The following algorithm explains the procedure for $\Delta(t_0, \theta_1, \theta_2)$; it can directly be adapted to $r(t_0, \theta_1, \theta_2)$.

Algorithm 2.1. (Parametric) Bootstrap Confidence Intervals for $\Delta(t_0, \theta_1, \theta_2)$.

1. Calculate the MLE $\hat\theta_\ell$ and $\hat\psi_\ell$, $\ell = 1,2$, from the data by maximizing (2.1).

2(a). Generate survival times $y^*_{\ell,1},\ldots,y^*_{\ell,n_\ell}$ from $F_\ell(\hat\theta_\ell)$, $\ell = 1,2$. Further generate the corresponding censoring times $c^*_{\ell,1},\ldots,c^*_{\ell,n_\ell}$ by sampling from the distributions $G_\ell(\hat\psi_\ell)$, $\ell = 1,2$. If $y^*_{\ell,j} > c^*_{\ell,j}$, the observation is censored (i.e. $\delta^*_{\ell,j} = 0$), $j = 1,\ldots,n_\ell$. The observed data are given by $(t^*_{\ell,j}, \delta^*_{\ell,j})$, $t^*_{\ell,j} = \min(y^*_{\ell,j}, c^*_{\ell,j})$, $j = 1,\ldots,n_\ell$, $\ell = 1,2$.

2(b). Calculate the MLE $\hat\theta^*_\ell$ for the bootstrap sample from the $t^*_{\ell,j}$, $j = 1,\ldots,n_\ell$, $\ell = 1,2$, and calculate the difference of the corresponding survival functions at $t_0$, that is
$$\Delta^* := \Delta(t_0, \hat\theta^*_1, \hat\theta^*_2) = S_1(t_0, \hat\theta^*_1) - S_2(t_0, \hat\theta^*_2). \qquad (2.9)$$

3. Repeat steps 2(a) and 2(b) $n_{boot}$ times, yielding $\Delta^*_1,\ldots,\Delta^*_{n_{boot}}$, and calculate an estimate of the variance $\hat\sigma^2_\Delta$ by
$$\hat\sigma^2_\Delta = \frac{1}{n_{boot} - 1} \sum_{k=1}^{n_{boot}} \big(\Delta^*_k - \bar\Delta^*\big)^2, \qquad (2.10)$$
where $\bar\Delta^*$ denotes the mean of the $\Delta^*_i$, $i = 1,\ldots,n_{boot}$.

Finally, the estimate $\hat\sigma^2_\Delta$ in (2.10) is used to calculate the confidence band in (2.5). Note that the procedure described in Algorithm 2.1 is a parametric bootstrap, as it is based on estimating the parameters $\hat\theta_\ell$, $\hat\psi_\ell$, $\ell = 1,2$. A non-parametric alternative, given by resampling the observations (see Efron and Tibshirani (1994)), could also be implemented.
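A minimal sketch of Algorithm 2.1 is given below. For simplicity it assumes exponential survival and censoring distributions, for which the censored MLE of the rate is the number of events divided by the total follow-up time; the function names, the seed and the chosen rates are assumptions made for this illustration, not the authors' implementation.

```python
import math
import random

def fit_rate(data, event_value):
    """Censored exponential MLE: #events / total follow-up time.
    event_value = 1 fits the survival rate, 0 the censoring rate."""
    events = sum(1 for _, d in data if d == event_value)
    return events / sum(t for t, _ in data)

def simulate_group(n, event_rate, censor_rate, rng):
    """Step 2(a): draw survival and censoring times, keep the minimum."""
    out = []
    for _ in range(n):
        y = rng.expovariate(event_rate)   # survival time
        c = rng.expovariate(censor_rate)  # censoring time
        out.append((min(y, c), 1 if y <= c else 0))
    return out

def bootstrap_sd(data1, data2, t0, n_boot, rng):
    """Steps 1-3: bootstrap standard deviation of Delta(t0) = S1(t0) - S2(t0)."""
    r1, r2 = fit_rate(data1, 1), fit_rate(data2, 1)   # step 1: theta-hat
    c1, c2 = fit_rate(data1, 0), fit_rate(data2, 0)   # step 1: psi-hat
    deltas = []
    for _ in range(n_boot):
        b1 = simulate_group(len(data1), r1, c1, rng)  # step 2(a)
        b2 = simulate_group(len(data2), r2, c2, rng)
        s1 = math.exp(-fit_rate(b1, 1) * t0)          # step 2(b)
        s2 = math.exp(-fit_rate(b2, 1) * t0)
        deltas.append(s1 - s2)                        # (2.9)
    mean = sum(deltas) / n_boot
    return math.sqrt(sum((d - mean) ** 2 for d in deltas) / (n_boot - 1))  # (2.10)

rng = random.Random(1)
data1 = simulate_group(50, 0.2, 0.1, rng)
data2 = simulate_group(50, 0.3, 0.1, rng)
sigma_hat = bootstrap_sd(data1, data2, t0=3.0, n_boot=200, rng=rng)
```

The resulting estimate plugs into (2.5) to give the pointwise bounds $\hat\Delta(t_0) \pm z_{1-\alpha}\,\hat\sigma_\Delta$.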
2.2 Testing non-inferiority and equivalence

We aim to compare the survival functions of two (treatment) groups, which is commonly addressed by testing the null hypothesis that the two survival functions are identical versus the alternative that they differ at some time point (for an overview see, for example, Klein and Moeschberger (2006)). More precisely, the classical hypotheses are given by
$$H_0: S_1(t, \theta_1) = S_2(t, \theta_2) \text{ for all } t \in \mathcal{T} \quad \text{against} \quad H_1: S_1(t, \theta_1) \ne S_2(t, \theta_2) \text{ for some } t \in \mathcal{T}.$$
In some situations one might be more interested in establishing non-inferiority of one treatment compared to another, or equivalence of the two treatments, meaning that we allow a deviation of the survival curves up to a prespecified threshold instead of testing for equality. This can be done for a particular point in time, say $t_0$, or over an entire interval, say $[t_1, t_2]$. We first consider the difference in survival at a particular point in time $t_0$. The corresponding hypotheses are then given by
$$H_0: S_1(t_0, \theta_1) - S_2(t_0, \theta_2) \ge \delta \quad \text{against} \quad H_1: S_1(t_0, \theta_1) - S_2(t_0, \theta_2) < \delta \qquad (2.11)$$
for a non-inferiority trial investigating whether a test treatment is non-inferior to the reference treatment (which is stated in the alternative hypothesis). Considering equivalence, the hypotheses are given by
$$H_0: |S_1(t_0, \theta_1) - S_2(t_0, \theta_2)| \ge \delta \quad \text{against} \quad H_1: |S_1(t_0, \theta_1) - S_2(t_0, \theta_2)| < \delta. \qquad (2.12)$$
The same questions can be addressed considering the (log) hazard ratio, resulting in hypotheses analogous to (2.11), given by
$$H_0: \log\frac{h_2(t_0, \theta_2)}{h_1(t_0, \theta_1)} \ge \varepsilon \quad \text{against} \quad H_1: \log\frac{h_2(t_0, \theta_2)}{h_1(t_0, \theta_1)} < \varepsilon \qquad (2.13)$$
for a non-inferiority trial and
$$H_0: \Big|\log\frac{h_2(t_0, \theta_2)}{h_1(t_0, \theta_1)}\Big| \ge \varepsilon \quad \text{against} \quad H_1: \Big|\log\frac{h_2(t_0, \theta_2)}{h_1(t_0, \theta_1)}\Big| < \varepsilon \qquad (2.14)$$
for addressing equivalence. The margins $\delta > 0$ and $\varepsilon > 0$ for the survival difference are frequently chosen between 0.1 and 0.2; for detailed discussions on this topic we refer to D'Agostino Sr et al. (2003), Da Silva et al. (2009) and Wellek (2010), who also investigate the correspondence between $\delta$ and the hazard ratio threshold $\varepsilon$.

We now use the confidence bounds derived in (2.5) to construct an asymptotic $\alpha$-level test for (2.11). More precisely, the null hypothesis concerning non-inferiority in (2.11) is rejected if the upper bound of the confidence interval lies below the non-inferiority margin, that is
$$U_\Delta(t_0, \hat\theta_1, \hat\theta_2) \le \delta. \qquad (2.15)$$
Further, rejecting $H_0$ whenever
$$U_\Delta(t_0, \hat\theta_1, \hat\theta_2) \le \delta \quad \text{and} \quad L_\Delta(t_0, \hat\theta_1, \hat\theta_2) \ge -\delta \qquad (2.16)$$
yields an equivalence test for the hypotheses in (2.12). According to the intersection-union principle (Berger (1982)), the same $(1-\alpha)$-confidence bounds $L_\Delta(t_0, \hat\theta_1, \hat\theta_2)$ and $U_\Delta(t_0, \hat\theta_1, \hat\theta_2)$ are used for both the non-inferiority and the equivalence test. The following theorem states that this yields an asymptotic $\alpha$-level test.

Theorem 2.2. The test described in (2.16) yields an asymptotic $\alpha$-level equivalence test for the hypotheses in (2.12). More precisely, it holds for all $t_0 \in \mathcal{T}$ that
$$\lim_{n_1,n_2 \to \infty} P_{H_0}\big(U_\Delta(t_0, \hat\theta_1, \hat\theta_2) \le \delta,\; L_\Delta(t_0, \hat\theta_1, \hat\theta_2) \ge -\delta\big) \le \alpha.$$

The proof is deferred to Section 1 of the Supplemental Material; a similar procedure can be applied for testing the hypotheses in (2.13) and (2.14).

2.2.1 Comparing survival over entire time intervals
In some situations it might be interesting to compare survival not only at one particular point in time but over an entire period of time $[t_1, t_2]$. For instance, extending the equivalence test stated in (2.12) to this situation yields the hypotheses
$$H_0: \max_{t \in [t_1,t_2]} |S_1(t, \theta_1) - S_2(t, \theta_2)| \ge \delta \quad \text{against} \quad H_1: \max_{t \in [t_1,t_2]} |S_1(t, \theta_1) - S_2(t, \theta_2)| < \delta, \qquad (2.17)$$
and a similar extension can be formulated for the non-inferiority test (2.11) and for the tests on the hazard ratio (2.13) and (2.14), respectively. In this case we reject the null hypothesis in (2.17) if, for all $t$ in $[t_1, t_2]$, the confidence bounds $L_\Delta(t)$ and $U_\Delta(t)$ derived in (2.5) are included in the equivalence region $[-\delta, \delta]$.

3 Simulation study

In the following we investigate the finite sample properties of the proposed methods by means of a simulation study. We consider two scenarios corresponding to the choice of a Weibull distribution of survival times. In a third case we investigate the robustness of our approach by generating data according to a log-logistic distribution but still fitting a Weibull model. We assume (randomly) right-censored observations in combination with an administrative censoring time in both scenarios, meaning a fixed time point of last follow-up, which we denote by $t_{\max}$. All results are obtained by running $n_{sim} = 1000$ simulations and $n_{boot} = 500$ bootstrap repetitions. For all three scenarios we calculate confidence bands for both the difference of the survival curves and the log hazard ratio and observe their coverage probabilities. For the difference of the survival curves we investigate the tests on non-inferiority and equivalence proposed in (2.15) and (2.16), respectively. For this purpose we vary both the particular time point under consideration $t_0$ and the non-inferiority/equivalence margin $\delta$. More precisely, we consider three different choices for this margin ranging from rather conservative to liberal, namely $\delta = 0.1$, $0.15$ and $0.2$. The test concerning the hypotheses formulated in terms of the log hazard ratio (see (2.13) and (2.17)) is conducted in the same manner; these results are omitted here. In addition to our new method we also evaluate all scenarios using a non-parametric approach as described in Section 5.2 of Com-Nougue et al. (1993). More precisely, we construct confidence bounds for the difference of two Kaplan-Meier curves by estimating the variance using the formula by Greenwood (1926). For the sake of brevity the latter results are deferred to Section 3 of the Supplemental Material.

For the first two scenarios we assume the data in both treatment groups to follow a Weibull distribution, that is
$$F_\ell(t, \theta_\ell) = 1 - \exp\big\{-(t/\theta_{\ell,2})^{\theta_{\ell,1}}\big\}, \qquad \ell = 1,2,$$
where $\theta_{\ell,1}$ denotes the shape parameter and $\theta_{\ell,2}$ the scale parameter corresponding to treatment group $\ell = 1,2$. We consider a time range (in months) given by $\mathcal{T} = [0, t_{\max}]$, where $t_{\max} = 9$ is the latest point of follow-up.

Figure 1: The three scenarios under consideration, used for simulating type I error rates and coverage probabilities. (a): Survival curves for Scenarios (3.18) and (3.19) with $\theta_2 \in \{(1.5, 4.9), (2, 2.5)\}$, together with the reference model. (b): Corresponding hazard rates. (c): Survival curves for Scenario (3.20) with $\theta_1 = (1.5, 2.6)$ and $\theta_2 = (2.1, 3.9)$.

For the first configuration the reference model $\theta_1$ has shape parameter 1.5, and the second model shares this shape and is varied through its scale parameter,
$$\theta_2 = (1.5, 4.9), \qquad \theta_3 = (1.5, 3.7), \qquad (3.18)$$
where $\theta_2$ is used for investigating the type I errors and coverage probabilities and $\theta_3$ for simulating the power, respectively. As an example, Figure 1(a) displays the survival curves for a choice of $\theta_2$. Both configurations result in a constant log hazard ratio of $\log(1.5) \approx 0.4$, representing the situation of proportional hazards. We assume the censoring times to be exponentially distributed and choose the rates of the two groups such that a censoring rate of approximately 25% results (rates of 0.09 and 0.05 for $\theta_2$ and $\theta_3$, respectively).

In order to investigate the effect of non-proportional hazards we consider a second scenario of intersecting survival curves, where we keep the reference model $\theta_1$ and all other configurations as above, but vary the parameters of the second model, resulting in
$$\theta_2 = (2, 2.5), \qquad \theta_3 = (2, 3.4). \qquad (3.19)$$
Here the choice of $\theta_2$ is used for investigating the type I errors and coverage probabilities and $\theta_3$ for simulating the power, respectively. Again we consider censoring rates of approximately 25% for both treatment groups (e.g. a rate of 0.14 for $\theta_2$).

For the third scenario we generate the survival times according to a log-logistic distribution. More precisely, we choose
$$F_\ell(t, \theta_\ell) = 1 - \big(1 + (t/\theta_{\ell,2})^{\theta_{\ell,1}}\big)^{-1}, \qquad \ell = 1,2.$$
We now assume the censoring times to be uniformly distributed on an interval $[0, c_\ell]$, where $c_\ell$ is chosen such that a censoring rate of approximately 20% results, $\ell = 1,2$.

Figure 2: The two scenarios used for simulating the power. (a): Survival curves for Scenarios (3.18) and (3.19) with $\theta_3 \in \{(1.5, 3.7), (2, 3.4)\}$, together with the reference model. (b): Corresponding hazard rates.

We consider a time range (in months) given by $\mathcal{T} = [0, 12]$ and define the scenario by the set of parameters
$$\theta_1 = (1.5, 2.6), \qquad \theta_2 = (2.1, 3.9), \qquad (3.20)$$
where $\theta_1$ corresponds to the reference model, see Figure 1(c).

In order to investigate the performance of the confidence bands derived in (2.5) and (2.8) we consider the scenarios described above for three different sample sizes, that is $(n_1, n_2) = (20, 20)$, $(n_1, n_2) = (50, 50)$ and $(n_1, n_2) = (100, 100)$, corresponding to total sample sizes of $n = 40$, $100$ and $200$. We choose $\alpha = 0.05$.
We calculate both the asymptotic (two-sided) confidence bands obtained by using the Delta method and the bands based on the bootstrap described in Algorithm 2.1, in the following called asymptotic bands and bootstrap bands, respectively. Confidence bands were constructed on an equidistant grid of 23 time points covering the observation period. Considering the bands for $S_1 - S_2$ and $\log(h_2/h_1)$, respectively, the approximation is very precise when sample sizes increase, as the coverage probabilities are very close to the desired value of 0.95 in this case. Further, it becomes obvious that the confidence bands obtained by estimating the variance by bootstrap (2.10) are always slightly more conservative than their asymptotic versions (2.4) and (2.7), respectively. However, considering the bands on $S_1 - S_2$ for very small sample sizes, that is $n_1 = n_2 = 20$, the coverage probability lies between 0.91 and 0.94, and hence these bands are slightly anti-conservative. The bootstrap bands perform slightly better, but still have coverage probabilities around 0.93 instead of 0.95; see the first column of Figure 3. This effect already disappears for $n_1 = n_2 = 50$, where a highly improved accuracy can be observed. We observe the same qualitative results for the asymptotic bands for $\log(h_2/h_1)$, whereas the bootstrap bands show a different behaviour, being rather conservative for small sample sizes but also getting more precise with increasing sample sizes.

For smaller sample sizes, all confidence bands under consideration vary in their behaviour over time. This effect becomes particularly visible when considering the non-proportional hazards scenario (3.19); see the second row of Figure 3. The coverage probability of the bands for $S_1 - S_2$ starts with a very accurate approximation during the first two months but then decreases to 0.93. This effect can be explained by the fact that in the setting of a very small sample, that is $n_1 = n_2 = 20$, after this period of time only very few patients remain, as the median survival of the reference model is only about two months. For the scenario of misspecification (3.20) the coverage probabilities vary between 0.85 and 0.95. However, considering $S_1 - S_2$, in a majority of the cases the coverage is above 0.9, even for a small sample size of $n = 40$, where the bootstrap approach performs slightly better than the asymptotic analogue. For increasing sample sizes the coverage, which varies over time, approximates 0.95, whereas the bands for $\log(h_2/h_1)$ still do not come sufficiently close to this desired value, even for the largest sample size of $n = 200$. Consequently, we conclude that these bands suffer more from misspecification than the ones for $S_1 - S_2$, where the latter prove to be robust if sample sizes are sufficiently large.

In the following we investigate type I error rates for the non-inferiority test (2.15) and the equivalence test (2.16) on the difference of survival curves.
We consider a nominal level of $\alpha = 0.05$ and different sample sizes, i.e. $(n_1, n_2) = (20, 20)$, $(50, 50)$, $(100, 100)$ and $(150, 150)$. We choose $\theta_2 = (1.5, 4.9)$ such that the difference curve $S_1(t) - S_2(t)$ attains the values 0.1, 0.15 and 0.2 at the three time points displayed in Table 1, matching the three choices of the margin $\delta$. Table 1 displays the simulated type I errors; configurations not falling under the null hypothesis are omitted and marked with "−". The first number always corresponds to the equivalence test, the number in brackets to the type I error of the non-inferiority test, respectively. It turns out that the approximation of the level is very precise, for the non-inferiority test (2.15) in general and for the equivalence test (2.16) for sufficiently large sample sizes. In particular, for settings on the margin of the null hypothesis, that is $S_1(t_0) - S_2(t_0) = \delta$, we obtain type I errors very close to 0.05. Only for very small samples, that is $n_1 = n_2 = 20$, the equivalence test (2.16) is conservative, as the obtained type I errors are zero in almost all scenarios under consideration. The same arguments hold for the scenario of non-proportional hazards (3.19) with $\theta_2 = (2, 2.5)$, shown in Table 2, where the equivalence test is conservative in particular for $\delta = 0.2$, whereas for the other choices of $\delta$ the type I error is controlled properly. Finally, we note that the non-parametric non-inferiority test also yields a precise approximation of the level if sample sizes are sufficiently large. However, the corresponding equivalence test is conservative, in particular for $\delta = 0.1$. The results and a detailed interpretation are deferred to Section 3.1 of the Supplemental Material.

Figure 3: Simulated coverage probabilities for scenarios (3.18) (first row), (3.19) (second row) and the scenario of misspecification (3.20) (third row) at different time points for sample sizes of $n_1 = n_2 = 20, 50, 100$ (left, middle, right column). The dashed lines correspond to the confidence bands for the difference of the survival functions (2.5), the dotted lines to the confidence bands for the log hazard ratio (2.8). Black lines display the asymptotic bands, red lines the confidence bands based on bootstrap, respectively.

(n1, n2)     (t0, S1(t0) − S2(t0))   δ = 0.1         δ = 0.15        δ = 0.2
(20, 20)     (1.6, 0.1)              0.000 (0.049)   −               −
             (2.·, 0.15)             0.000 (0.025)   0.000 (0.051)   −
             (4, 0.2)                0.000 (0.014)   0.000 (0.030)   0.001 (0.061)
(50, 50)     (1.6, 0.1)              0.001 (0.049)   −               −
             (2.·, 0.15)             0.000 (0.014)   0.017 (0.045)   −
             (4, 0.2)                0.000 (0.005)   0.010 (0.012)   0.047 (0.048)
(100, 100)   (1.6, 0.1)              0.037 (0.037)   −               −
             (2.·, 0.15)             0.000 (0.009)   0.049 (0.041)   −
             (4, 0.2)                0.000 (0.001)   0.012 (0.006)   0.050 (0.051)
(150, 150)   (1.6, 0.1)              0.054 (0.044)   −               −
             (2.·, 0.15)             0.001 (0.002)   0.037 (0.040)   −
             (4, 0.2)                0.000 (0.000)   0.003 (0.003)   0.053 (0.048)

Table 1: Simulated type I errors of the non-inferiority test (2.15) (numbers in brackets) and the equivalence test (2.16) for the scenario of proportional hazards (3.18) with $\theta_2 = (1.5, 4.9)$ at three different time points for different sample sizes and equivalence margins. The nominal level is chosen as $\alpha = 0.05$. Scenarios not falling under the null hypothesis are marked with "−".
For investigations on the power we consider the same configurations as given above. We now observe the scenario of proportional hazards (3.18) with $\theta_3 = (1.5, 3.7)$, such that the difference curve $S_1(t) - S_2(t)$ attains the values 0.01, 0.02 and 0.04 at time points 0.7, 1.· and 2.3, respectively. Hence all chosen configurations belong to the alternatives in (2.15) and (2.16). Table 3 displays the simulated power. It turns out that both tests achieve a reasonable power in most settings under consideration. The power clearly increases with increasing sample sizes and a wider equivalence/non-inferiority margin $\delta$. For the non-proportional hazards configuration $\theta_3 = (2, 3.4)$ the difference curve attains values of 0.01 and 0.04 at later time points, by which only few patients still experience an event in this scenario, reducing the power there.
20) (1 . , .
1) 0.000 (0.060) - -(2 . , .
15) 0.000 (0.029) 0.000 (0.049) -(3 , .
2) 0.000 (0.016) 0.000 (0.021) 0.002 (0.061)(50 ,
50) (1 . , .
1) 0.000 (0.053) - -(2 . , .
15) 0.000 (0.011) 0.008 (0.052) -(3 , .
2) 0.000 (0.003) 0.007 (0.014) 0.048 (0.048)(100 , . , .
1) 0.002 (0.057) - -(2 . , .
15) 0.000 (0.008) 0.055 (0.055) -(3 , .
2) 0.000 (0.001) 0.006 (0.006) 0.038 (0.038)(150 , . , .
1) 0.043 (0.053) - -(2 . , .
15) 0.004 (0.006) 0.041 (0.041) -(3 , .
2) 0.000 (0.000) 0.005 (0.005) 0.049 (0.049)
Table 2: Simulated type I errors of the non-inferiority test (2.15) (numbers in brackets) and the equivalence test (2.16) for the scenario of non-proportional hazards (3.19) with θ = (2, ….5) at three different time points for different sample sizes and equivalence margins. The nominal level is chosen as α = 0.05. Scenarios not falling under the null hypothesis are marked with "−".

… event in this scenario. In summary, we observe a reasonably high power for both tests. The power increases with increasing sample sizes and a wider equivalence margin. Further, the presence of non-proportional hazards does not affect the power, meaning that in both scenarios the results are qualitatively the same. Finally, we also observe that the power of the non-parametric approach is much smaller than that of our new method, in particular for small sample sizes and for the scenario of non-proportional hazards, which underlines the theoretical findings. Detailed results can be found in Section 3.1 of the Supplemental Material.

In the following we investigate a well-known benchmark dataset for survival analysis. The dataset veteran from the Veterans' Administration Lung Cancer Trial, published in Kalbfleisch and Prentice (2011) and implemented in the R package survival by Therneau (2020), describes a two-treatment, randomized trial for lung cancer. In this trial, male patients with advanced inoperable lung cancer were allocated to either a standard therapy (reference treatment, ℓ = 1) or a chemotherapy (test treatment, ℓ = 2). Numerous covariates were documented, including the time to death of each patient, which is the primary endpoint of our analysis.
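The non-parametric starting point of such an analysis, the Kaplan-Meier estimator mentioned above, can be sketched in a few lines. The snippet below is for illustration only (the paper's own analysis is carried out in R; all names here are ours):

```python
import numpy as np

def kaplan_meier(time, status):
    """Kaplan-Meier estimate of the survival function.

    time   : observed times (event or censoring)
    status : 1 = event observed, 0 = right-censored
    Returns the distinct event times and S(t) just after each of them.
    """
    time = np.asarray(time, dtype=float)
    status = np.asarray(status)
    event_times = np.unique(time[status == 1])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)              # subjects still under observation
        d = np.sum((time == t) & (status == 1))  # events exactly at time t
        s *= 1.0 - d / at_risk                   # multiplicative update
        surv.append(s)
    return event_times, np.array(surv)

# Toy example: three subjects, the second one censored at time 2
t_ev, s_hat = kaplan_meier([1.0, 2.0, 3.0], [1, 0, 1])
# s_hat is [2/3, 0]: one of three at risk fails at t=1, the last one at t=3
```

Fitting one curve per treatment group in this way yields the non-parametric analogue of the parametric survival models compared below.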
(n1, n2)    (t, S1(t) − S2(t))   δ = 0.1         δ = 0.15        δ = 0.2
(20, 20)    (0.7, 0.01)          0.145 (0.436)   0.512 (0.723)   0.795 (0.880)
            (1.…, 0.02)          0.002 (0.227)   0.057 (0.410)   0.303 (0.595)
            (2.3, 0.04)          0.000 (0.114)   0.000 (0.208)   0.000 (0.365)
(50, 50)    (0.7, 0.01)          0.553 (0.696)   0.929 (0.943)   0.995 (0.995)
            (1.…, 0.02)          0.025 (0.360)   0.501 (0.645)   0.833 (0.870)
            (2.3, 0.04)          0.000 (0.187)   0.085 (0.335)   0.460 (0.601)
(100, 100)  (0.7, 0.01)          0.887 (0.906)   0.998 (0.998)   1.000 (1.000)
            (1.…, 0.02)          0.388 (0.548)   0.858 (0.875)   0.988 (0.998)
            (2.3, 0.04)          0.017 (0.274)   0.486 (0.557)   0.854 (0.861)
(150, 150)  (0.7, 0.01)          0.973 (0.974)   1.000 (1.000)   1.000 (1.000)
            (1.…, 0.02)          0.639 (0.703)   0.971 (0.971)   0.999 (0.999)
            (2.3, 0.04)          0.236 (0.364)   0.728 (0.736)   0.949 (0.950)
Table 3: Simulated power of the non-inferiority test (2.15) (numbers in brackets) and the equivalence test (2.16) for the scenario of proportional hazards (3.18) with θ = (1.…, ….7) at three different time points for different sample sizes and equivalence margins. The nominal level is chosen as α = 0.05.

In total 137 observations are given, allocated to n1 = 69 patients in the reference group and n2 = 68 in the test group. The R code reproducing the results presented in the following can be found online at https://github.com/kathrinmoellenhoff/survival. As our analysis is model-based, we start with a model selection step. More precisely, we split the data into the reference group and the test group and assume six different distributions, that is a Weibull, an exponential, a Gaussian, a logistic, a log-normal and a log-logistic distribution, respectively. We fit the corresponding models separately per treatment group, resulting in 12 models in total. Finally, we compare the six different models within each group using Akaike's Information Criterion (AIC, see Sakamoto et al. (1986)). It turns out that for the group receiving the reference treatment the Weibull and the exponential model provide the best fits (the AICs are given by 749.1, 747.1, 799.9, 794.7, 755.…, ….1, 750.…, ….7, respectively). Therefore we decide to base our analyses on Weibull models for both groups. Note, however, that all tests could also be performed assuming different distributions for each treatment.

For the analysis we now consider the complete likelihood function (2.1), incorporating the distribution of the censoring times. We assume the censoring times to be exponentially distributed; maximizing the likelihood (2.1) yields θ̂1 = (4.…, …) and ψ̂1 = 0.… for the reference group, and θ̂2 = (4.…, …) and ψ̂2 = 0.… for the test group. Figure 4(a) displays the corresponding Weibull models and the non-parametric analogues given by the Kaplan-Meier curves.
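The model-selection step described above can be illustrated as follows. This is not the authors' R code (that is linked above), but a self-contained Python sketch on synthetic data: it fits a right-censored Weibull model and its exponential submodel by maximum likelihood and compares their AICs. All function names and parameter values are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def weibull_negloglik(log_params, time, status):
    """Negative log-likelihood of right-censored Weibull data.

    log_params = (log shape, log scale); events contribute log f(t),
    censored observations contribute log S(t).
    """
    a, b = np.exp(log_params)
    z = (time / b) ** a
    log_f = np.log(a / b) + (a - 1.0) * np.log(time / b) - z  # log density
    log_s = -z                                                # log survival
    return -np.sum(np.where(status == 1, log_f, log_s))

def aic_weibull(time, status, fix_exponential=False):
    """Fit by maximum likelihood and return the AIC = 2k + 2 * negloglik.

    fix_exponential=True restricts the shape to 1 (exponential model, k = 1).
    """
    if fix_exponential:
        obj = lambda p: weibull_negloglik(np.array([0.0, p[0]]), time, status)
        res = minimize(obj, x0=np.array([0.0]), method="Nelder-Mead")
        return 2.0 * 1 + 2.0 * res.fun
    res = minimize(weibull_negloglik, x0=np.array([0.0, 0.0]),
                   args=(time, status), method="Nelder-Mead")
    return 2.0 * 2 + 2.0 * res.fun

# Synthetic data: shape-2 Weibull event times with exponential censoring
rng = np.random.default_rng(0)
t_event = 2.0 * rng.weibull(2.0, 300)
t_cens = rng.exponential(10.0, 300)
time = np.minimum(t_event, t_cens)
status = (t_event <= t_cens).astype(int)

aic_wei = aic_weibull(time, status)
aic_exp = aic_weibull(time, status, fix_exponential=True)
# here the (correct) Weibull model should attain the smaller AIC
```

Applying such fits per treatment group and per candidate distribution reproduces the kind of AIC comparison reported above.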
(n1, n2)    (t, S1(t) − S2(t))   δ = 0.1         δ = 0.15        δ = 0.2
(20, 20)    (0.…, 0.01)          0.964 (0.964)   0.996 (0.999)   1.000 (1.000)
            (0.…, 0.04)          0.372 (0.448)   0.712 (0.712)   0.891 (0.893)
            (2.…, 0.04)          0.000 (0.124)   0.000 (0.219)   0.000 (0.297)
            (3.…, 0.01)          0.000 (0.179)   0.000 (0.284)   0.000 (0.374)
(50, 50)    (0.…, 0.01)          1.000 (1.000)   1.000 (1.000)   1.000 (1.000)
            (0.…, 0.04)          0.637 (0.640)   0.938 (0.938)   0.997 (0.997)
            (2.…, 0.04)          0.000 (0.151)   0.047 (0.339)   0.436 (0.563)
            (3.…, 0.01)          0.000 (0.258)   0.054 (0.468)   0.481 (0.684)
(100, 100)  (0.…, 0.01)          1.000 (1.000)   1.000 (1.000)   1.000 (1.000)
            (0.…, 0.04)          0.818 (0.841)   0.995 (0.995)   1.000 (1.000)
            (2.…, 0.04)          0.001 (0.246)   0.442 (0.555)   0.803 (0.833)
            (3.…, 0.01)          0.006 (0.406)   0.558 (0.728)   0.869 (0.921)
(150, 150)  (0.…, 0.01)          1.000 (1.000)   1.000 (1.000)   1.000 (1.000)
            (0.…, 0.04)          0.927 (0.927)   1.000 (1.000)   1.000 (1.000)
            (2.…, 0.04)          0.206 (0.326)   0.678 (0.701)   0.922 (0.928)
            (3.…, 0.01)          0.267 (0.553)   0.797 (0.859)   0.903 (0.984)
Table 4: Simulated power of the non-inferiority test (2.15) (numbers in brackets) and the equivalence test (2.16) for the scenario of non-proportional hazards (3.19) with θ = (2, ….4) at four different time points for different sample sizes and equivalence margins. The nominal level is chosen as α = 0.05.

It turns out that for both treatment groups the parametric and the non-parametric curves are very close to each other. Further, we observe that the survival curves of the two treatment groups cross each other, which indicates that the assumption of proportional hazards is not justified here. Indeed, the hazard ratio ranges from 0.55 to 1.93 between the first event time (3 days) and the end of the observational period (999 days), and therefore an analysis using a proportional hazards model is actually not applicable here. The p-value of the log-rank test is 0.928, and thus the test does not detect any difference between the two groups.

We now perform a similar analysis using the parametric models and the theory derived in Section 2. For the sake of brevity we only consider the difference in survival; analyses concerning the (log) hazard ratio can be conducted in the same manner. We consider the first 600 days of the trial. We set α = 0.05 and calculate lower and upper (1 − α)-pointwise confidence bands according to (2.5) at several points 0 ≤ t ≤ 600, where the variances σ̂Δ are obtained by the asymptotic approach and by the bootstrap as described in Algorithm 2.1, respectively. Figure 4(b) displays the estimated difference curve Δ(t, θ̂1, θ̂2) = S1(t, θ̂1) − S2(t, θ̂2) and the pointwise confidence bands on the interval [0, 600].

Figure 4: (a): Survival curves (Kaplan-Meier curves and Weibull models) for the veteran data. Solid lines correspond to the reference group, dashed lines to the test treatment. (b): Difference in survival and pointwise confidence bands obtained by the asymptotic approach (dashed) and by the bootstrap (dotted), respectively, on the interval [40, 600]; the shaded area corresponds to the margin δ = 0.15.

We first consider t = 80, which is close to the median survival of both treatment groups. The difference in survival is Δ(80, θ̂1, θ̂2) = 0.047; the asymptotic confidence interval is given by [−0.…, 0.…] and the bootstrap interval by [−0.…, 0.…]. Consequently, non-inferiority can only be claimed for margins δ > 0.163 (0.… for the bootstrap). For δ = 0.15, which is indicated by the shaded area in Figure 4(b), H0 cannot be rejected in both cases, meaning in particular that the treatments cannot be considered equivalent with respect to the survival at day 80. Figure 4(b) further displays these investigations for δ = 0.15 over the entire range of time. The same conclusion can be made concerning equivalence, as the confidence bands are not contained between −δ = −0.15 and δ = 0.15 for all t in [0, 600]. Concluding, we observe that when considering for instance δ = 0.2, equivalence would be claimed at all time points under consideration, demonstrating the impact of the choice of the margin.
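The bootstrap confidence bands used above can be illustrated in a strongly simplified form. The sketch below is not the paper's Algorithm 2.1: it uses exponential working models (whose censored MLE is available in closed form) instead of the Weibull models, and it ignores censoring when resampling, purely for brevity; all data and parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)

def exp_rate_mle(time, status):
    """Censored exponential MLE of the rate: events / total time at risk."""
    return status.sum() / time.sum()

def bootstrap_band(time1, status1, time2, status2, t_grid, B=500, alpha=0.05):
    """Parametric-bootstrap pointwise (1 - alpha) confidence band for
    Delta(t) = S1(t) - S2(t), with exponential working models S(t) = exp(-r t).
    """
    r1 = exp_rate_mle(time1, status1)
    r2 = exp_rate_mle(time2, status2)
    n1, n2 = len(time1), len(time2)
    diffs = np.empty((B, len(t_grid)))
    for b in range(B):
        # resample event times from the fitted models and re-estimate the rates
        rb1 = 1.0 / rng.exponential(1.0 / r1, n1).mean()
        rb2 = 1.0 / rng.exponential(1.0 / r2, n2).mean()
        diffs[b] = np.exp(-rb1 * t_grid) - np.exp(-rb2 * t_grid)
    return (np.quantile(diffs, alpha / 2, axis=0),
            np.quantile(diffs, 1 - alpha / 2, axis=0))

# Illustrative two-arm data and a decision at margin delta = 0.15
time1 = rng.exponential(10.0, 80)
status1 = np.ones(80, dtype=int)
time2 = rng.exponential(11.0, 80)
status2 = np.ones(80, dtype=int)
t_grid = np.linspace(1.0, 30.0, 30)
lower, upper = bootstrap_band(time1, status1, time2, status2, t_grid)
noninferior_everywhere = bool(np.all(upper < 0.15))
```

Non-inferiority over a whole period then amounts to checking that the upper band stays below the margin at every grid point, which is exactly the comparison carried out for the veteran data above.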
In this paper we addressed the problem of survival analysis in the presence of non-proportional hazards. In this setting, commonly used methods such as Cox's proportional hazards model or the log-rank test are not optimal and suffer from a loss of power. Therefore we proposed an alternative approach for investigating equivalence or non-inferiority of time-to-event outcomes, based on the construction of (pointwise) confidence intervals and applicable irrespective of the shape of the hazard ratio. We constructed confidence bands for both the (log) hazard ratio and the difference of the survival curves, proposing two alternatives: one based on asymptotic investigations and the other on a parametric bootstrap procedure. Both approaches show a similar performance, where the latter has the advantage that it can also be used for very small samples and does not require calculating the asymptotic variance.

Our approach provides a framework for investigating survival in many ways. Apart from specific time points, entire periods of time can be considered. We demonstrated that the presence of non-proportional hazards has no effect on the performance of the confidence bands and of the non-inferiority and equivalence tests, respectively, which means that they do not depend on this assumption, providing a much more flexible tool compared to the standard methodology.

Our methods are based on the estimation of parametric survival functions, which can be a very powerful tool once the suitability of the model has been established. This has to be assessed in a preliminary step, using for instance model selection criteria such as the AIC. We demonstrated the robustness of our approach even in the case of misspecification. An interesting extension of the methodology could be given by model averaging, which might further improve the performance of the procedures.
References
Berger, R. L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24:295–300.

Bradley, R. and Gart, J. (1962). The asymptotic properties of ML estimators when sampling from associated populations. Biometrika, 49:205–214.

Com-Nougue, C., Rodary, C., and Patte, C. (1993). How to establish equivalence when data are censored: a randomized trial of treatments for B non-Hodgkin lymphoma. Statistics in Medicine, 12(14):1353–1364.

Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202.

Cox, D. R. and Oakes, D. (1984). Analysis of Survival Data, volume 21. CRC Press.

Da Silva, G. T., Logan, B. R., and Klein, J. P. (2009). Methods for equivalence and noninferiority testing. Biology of Blood and Marrow Transplantation, 15(1):120–127.

D'Agostino Sr, R. B., Massaro, J. M., and Sullivan, L. M. (2003). Non-inferiority trials: design concepts and issues - the encounters of academic consultants in statistics. Statistics in Medicine, 22(2):169–186.

Efron, B. and Tibshirani, R. J. (1994). An Introduction to the Bootstrap. CRC Press.

European Medicines Agency (EMA) (2014). Committee for Medicinal Products for Human Use (CHMP): Guideline on the choice of the non-inferiority margin. CPMP/EWP/2158/99.

Gill, R. and Schumacher, M. (1987). A simple test of the proportional hazards assumption. Biometrika, 74(2):289–300.

Grambsch, P. M. and Therneau, T. M. (1994). Proportional hazards tests and diagnostics based on weighted residuals. Biometrika, 81(3):515–526.

Greenwood, M. (1926). The natural duration of cancer. London: Her Majesty's Stationery Office. Reports on Public Health and Medical Subjects, (33).

Hernán, M. A. (2010). The hazards of hazard ratios. Epidemiology (Cambridge, Mass.), 21(1):13.

Howard, G., Chambless, L. E., and Kronmal, R. A. (1997). Assessing differences in clinical trials comparing surgical vs nonsurgical therapy: using common (statistical) sense. JAMA, 278(17):1432–1436.

Jachno, K., Heritier, S., and Wolfe, R. (2019). Are non-constant rates and non-proportional treatment effects accounted for in the design and analysis of randomised controlled trials? A review of current practice. BMC Medical Research Methodology, 19(1):103.

Kalbfleisch, J. D. and Prentice, R. L. (2011). The Statistical Analysis of Failure Time Data, volume 360. John Wiley & Sons.

Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457–481.

Klein, J. P. and Moeschberger, M. L. (2006). Survival Analysis: Techniques for Censored and Truncated Data. Springer Science & Business Media.

Li, H., Han, D., Hou, Y., Chen, H., and Chen, Z. (2015). Statistical inference methods for two crossing survival curves: a comparison of methods. PLoS One, 10(1):e0116774.

Martinez, E. E., Sinha, D., Wang, W., Lipsitz, S. R., and Chappell, R. J. (2017). Tests for equivalence of two survival functions: Alternative to the tests under proportional hazards. Statistical Methods in Medical Research, 26(1):75–87.

Oehlert, G. (1992). A note on the delta method. The American Statistician, 46:27–29.

Royston, P. and Parmar, M. K. (2011). The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt. Statistics in Medicine, 30(19):2409–2421.

Sakamoto, Y., Ishiguro, M., and Kitagawa, G. (1986). Akaike Information Criterion Statistics. Dordrecht, The Netherlands: D. Reidel, 81.

Shen, P.-S. (2020). Tests for equivalence of two survival functions: alternatives to the PH and PO models. Journal of Biopharmaceutical Statistics, pages 1–12.

Subramanian, S. and Zhang, P. (2013). Model-based confidence bands for survival functions. Journal of Statistical Planning and Inference, 143(7):1166–1185.

Therneau, T. M. (2020). A Package for Survival Analysis in R. R package version 3.1-12.

Uno, H., Claggett, B., Tian, L., Inoue, E., Gallo, P., Miyata, T., Schrag, D., Takeuchi, M., Uyama, Y., Zhao, L., et al. (2014). Moving beyond the hazard ratio in quantifying the between-group difference in survival analysis. Journal of Clinical Oncology, 32(22):2380.

Wellek, S. (1993). A log-rank test for equivalence of two survivor functions. Biometrics, pages 877–881.

Wellek, S. (2010).