On null hypotheses in survival analysis
MATS J. STENSRUD, KJETIL RØYSLAND AND PÅL C. RYALEN
Department of Biostatistics, University of Oslo, Domus Medica Gaustad, Sognsvannsveien 9, 0372 Oslo, Norway
Abstract.
The conventional nonparametric tests in survival analysis, such as the log-rank test, assess the null hypothesis that the hazards are equal at all times. However, hazards are hard to interpret causally, and other null hypotheses are more relevant in many scenarios with survival outcomes. To allow for a wider range of null hypotheses, we present a generic approach to define test statistics. This approach utilizes the fact that a wide range of common parameters in survival analysis can be expressed as solutions of differential equations. Thereby we can test hypotheses based on survival parameters that solve differential equations driven by cumulative hazards, and it is easy to implement the tests on a computer. We present simulations suggesting that our tests perform well for several hypotheses in a range of scenarios. Finally, we use our tests to evaluate the effect of adjuvant chemotherapies in patients with colon cancer, using data from a randomised controlled trial.
KEY WORDS:
Causal inference; Hazards; Hypothesis testing; Failure time analysis.
Date: January 29, 2019.

Introduction
The notion of hazards has been crucial for the development of modern survival analysis. Hazards are perhaps the most natural parameters to use when fitting statistical models to time-to-event data subject to censoring, and hazard functions were essential for the development of popular methods like Cox regression and rank tests, which are routinely used in practice.

In the field of causal inference, however, there is concern that many statisticians just do advanced 'curve fitting' without being careful about the interpretation of the parameters that are reported [1, 2, 3]. This criticism can be directed to several areas in statistics. In this spirit, we think that statisticians in general should pay particular attention to effect measures with clear-cut causal interpretations.

In survival analysis, it has been acknowledged that interpreting hazards as effect measures is delicate, see e.g. [4] and [5]. This contrasts with the more traditional opinion, in which the proportional hazards model is motivated by the 'simple and easily understood interpretation' of hazard ratios [6, 4.3.a]. A key issue arises because the hazard, by definition, is conditioned on previous survival. If we consider causal diagrams [3, 2], it is clear that we condition on a 'collider' that opens a non-causal pathway from the exposure through any unobserved heterogeneity into the event of interest, see [4, 7, 8]. Since unobserved heterogeneity is present in most practical scenarios, even in randomized trials, the conditioning means that the hazards are fundamentally hard to interpret causally [4, 5].

Although we must be careful about assigning causal interpretations to hazards, we do not claim that hazards are worthless. On the contrary, hazards are key elements in the modelling of other parameters that are easier to interpret, serving as building blocks.
This point of view is also found in [2, 17.1]: "..., the survival analyses in this book privilege survival/risk over hazard. However, that does not mean that we should ignore hazards. The estimation of hazards is often a useful intermediate step for the estimation of survivals and risks." Indeed, we have recently suggested a generic method to estimate a range of effect measures in survival analysis, utilizing differential equations driven by cumulative hazards [9].

Nevertheless, the conventional hypothesis tests in survival analysis are still based on hazards. In particular the rank tests [10], including the log-rank test, are based on the null hypothesis

H_0 : α_{1t} = α_{2t} for all t ∈ [0, T],    (1)

where α_{it} is the hazard in group i. Formulating such hypotheses in a practical setting will often imply that we assign causal interpretations to these hazard functions. In the simplest survival setting this is not a problem, as there is a one-to-one relationship between hazards and survival curves, and a null hypothesis comparing two or more survival curves is straightforward. In more advanced settings, e.g. scenarios with competing risks, hypotheses like (1) are less transparent, leading to issues with interpretation [11]. For example, in competing risks settings where competing events are treated as censoring events, the null hypothesis in (1) is based on cause-specific hazards, which are often not the target of inference [11].

We aimed to develop new hypothesis tests for time-to-event outcomes with two key characteristics: First, the tests should be rooted in explicit null hypotheses that are easy to interpret. Second, the testing strategy should be generic, such that the scientist can apply the test to their estimand of interest.
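For reference, the two-sample log-rank statistic targeting (1) accumulates observed-minus-expected events over the distinct event times. A minimal Python sketch, written by us for illustration (it is not the authors' code, and the function name is ours):

```python
def logrank_statistic(times1, events1, times2, events2):
    """Two-sample log-rank chi-square statistic for the null hypothesis (1),
    i.e. equal hazards at all times. events: 1 = event observed, 0 = censored."""
    event_times = sorted({t for t, d in zip(times1, events1) if d} |
                         {t for t, d in zip(times2, events2) if d})
    obs_minus_exp, var = 0.0, 0.0
    for t in event_times:
        y1 = sum(1 for s in times1 if s >= t)   # at risk in group 1 just before t
        y2 = sum(1 for s in times2 if s >= t)   # at risk in group 2 just before t
        d1 = sum(1 for s, d in zip(times1, events1) if d and s == t)
        d2 = sum(1 for s, d in zip(times2, events2) if d and s == t)
        y, d = y1 + y2, d1 + d2
        if y < 2:
            continue  # the hypergeometric variance is undefined when only one is at risk
        obs_minus_exp += d1 - y1 * d / y        # observed minus expected events, group 1
        var += y1 * y2 * d * (y - d) / (y ** 2 * (y - 1))
    return obs_minus_exp ** 2 / var             # ~ chi-square with 1 df under (1)
```

Large values are evidence against equality of the hazards at some time during follow-up, not against equality of a chosen parameter at a chosen time; this distinction is the theme of the present paper.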
1. Survival parameters as solutions of differential equations
We will consider survival parameters that are functions solving differential equations on the form

X_t = X_0 + ∫_0^t F(X_s) dA_s,    (2)

where A is a q-dimensional vector of cumulative hazards, and F = (F_1, ..., F_q) : R^p → R^{p×q} is Lipschitz continuous with bounded and continuous first and second derivatives, and satisfies a linear growth bound. The class of parameters also includes several quantities that are Lebesgue integrals, such that dA_{it} = dt for some i. Here, X is a vector that includes our estimand of interest, but X may also contain additional nuisance parameters that are needed to formulate the estimand of interest.

Many parameters in survival analysis solve equations on the form (2). In particular, the survival function can be expressed on the form (2) as S_t = 1 − ∫_0^t S_s dA_s, where A is the cumulative hazard for death. Other examples include the cumulative incidence function, the restricted mean survival function, and the prevalence function. We will present these parameters in detail in Section 3. Nonparametric plugin estimators have been thoroughly studied in [9]. The strategy assumes that A can be consistently estimated by

Â_t = ∫_0^t G_{s−} dN_s,    (3)

where G is a q × l-dimensional predictable process, and N is an l-dimensional counting process. Furthermore, we assume that Â, the residuals W^n = √n(Â − A), and its quadratic variation [W^n] are so-called predictably uniformly tight. When the estimator is a counting process integral, a relatively simple condition ensures predictable uniform tightness [9, Lemma 1]. Moreover, we suppose that √n(Â − A) converges weakly (wrt the Skorohod metric) to a mean zero Gaussian martingale with independent increments, see [9, Lemma 1, Theorem 1 & 2] for details.
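As a concrete instance of (3), taking G_s = 1/Y_s, where Y is the at-risk process and N counts observed deaths, gives the Nelson-Aalen estimator. A minimal Python sketch (illustrative only; variable names are ours):

```python
def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard A_t: the
    counting-process integral (3) with G_s = 1/Y_s. Returns the jump
    times paired with the cumulative hazard value from that time on."""
    path, a = [], 0.0
    for t in sorted({s for s, d in zip(times, events) if d}):
        y = sum(1 for s in times if s >= t)                          # Y_t, number at risk
        dn = sum(1 for s, d in zip(times, events) if d and s == t)   # dN_t, events at t
        a += dn / y                                                  # increment dA_t = dN_t / Y_t
        path.append((t, a))
    return path
```

The returned step function is the driving process Â that the plugin estimators below integrate against.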
Examples of estimators on the form (3) that satisfy these criteria are the Nelson-Aalen estimator, or more generally Aalen's additive hazard estimator; if Aalen's additive hazards model is a correct model for the hazard A, then Aalen's additive hazard estimator satisfies these criteria, in particular predictable uniform tightness.

Our suggested plugin estimator of X is obtained by replacing A with Â, giving estimators that solve the system

X̂_t = X̂_0 + ∫_0^t F(X̂_{s−}) dÂ_s,    (4)

where X̂_0 is a consistent estimator of X_0. When the estimand is the survival function, this plugin estimator reduces to the Kaplan-Meier estimator. Ryalen et al. [9] identified the asymptotic distribution of √n(X̂_t − X_t) to be a mean zero Gaussian martingale with covariance V solving a linear differential equation [9, eq. (17)]. The covariance V can also be consistently estimated by inserting the estimates Â, giving rise to the system

V̂_t = V̂_0 + Σ_{j=1}^q ∫_0^t ( V̂_{s−} ∇F_j(X̂_{s−})^T + ∇F_j(X̂_{s−}) V̂_{s−} ) dÂ_{js} + n ∫_0^t F(X̂_{s−}) d[B]_s F(X̂_{s−})^T,    (5)

where {∇F_j}_{j=1}^q are the Jacobian matrices of the columns of F = (F_1, ..., F_q) from (2), and [B]_t is a q × q matrix defined by

([B]_t)_{i,j} = 0 if dA_{it} = dt or dA_{jt} = dt, and ([B]_t)_{i,j} = Σ_{s≤t} ΔÂ_{is} ΔÂ_{js} otherwise.

The variance estimator (5), as well as the parameter estimator (4), can be expressed as difference equations, and therefore they are easy to calculate generically in computer programs. To be explicit, let τ_1, τ_2, ... denote the ordered jump times of Â. Then

X̂_t = X̂_{τ_k−} + F(X̂_{τ_k−}) ΔÂ_{τ_k}, as long as τ_k ≤ t < τ_{k+1}.

Similarly, the plugin variance equation may be written as a difference equation,

V̂_t = V̂_{τ_k−} + Σ_{j=1}^q ( V̂_{τ_k−} ∇F_j(X̂_{τ_k−})^T + ∇F_j(X̂_{τ_k−}) V̂_{τ_k−} ) ΔÂ_{jτ_k} + n F(X̂_{τ_k−}) Δ[B]_{τ_k} F(X̂_{τ_k−})^T.
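These difference equations translate directly into code. For the survival function we have F(x) = −x and ∇F = −1, so (4) reduces to the Kaplan-Meier recursion and (5) to a scalar variance recursion. A Python sketch under these assumptions, written by us for illustration (the authors' implementation is the R package transform.hazards):

```python
def plugin_survival(times, events, n=None):
    """Solve the difference equations for (4)-(5) with X = survival, F(x) = -x.

    Returns step-function jumps [(t, X_hat, V_hat)], where X_hat is the
    Kaplan-Meier estimate and V_hat the plugin variance of sqrt(n)(X_hat - X)."""
    n = n or len(times)
    x, v = 1.0, 0.0                  # X_0 = 1 (everyone alive at time 0), V_0 = 0
    out = []
    for t in sorted({s for s, d in zip(times, events) if d}):
        y = sum(1 for s in times if s >= t)                          # at risk Y_t
        dn = sum(1 for s, d in zip(times, events) if d and s == t)
        da = dn / y                                                  # increment of A_hat
        db = da * da                                                 # increment of [B]
        # the V update uses the left limits x, v, so update v before x
        v = v - 2 * v * da + n * (x ** 2) * db
        x = x + (-x) * da                                            # Kaplan-Meier step
        out.append((t, x, v))
    return out
```

The same loop works for any F by swapping in the appropriate function and Jacobians; only the two update lines change.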
2. Hypothesis testing
The null hypothesis is not explicitly expressed in many research reports. On the contrary, the null hypothesis is often stated informally, e.g. vaguely indicating that a difference between two groups is assessed. Even if the null hypothesis is perfectly clear to a statistician, this is a problem: the applied scientist, who frames the research question based on subject-matter knowledge, may not have a formal understanding of the null hypothesis.

In particular, we are not convinced that scientists faced with time-to-event outcomes profoundly understand null hypotheses based on hazard functions. Hence, using null
hypotheses based on hazard functions, such as (1), may be elusive: in many scenarios, the scientist's primary interest is not to assess whether the hazard functions are equal at all follow-up times. Indeed, the research question is often more focused, and the scientist's main concern can be contrasts of other survival parameters at a prespecified time t_0 or in a prespecified time interval [12]. For example, we may aim to assess whether cancer survival differs between two treatment regimens five years after diagnosis. Or we may aim to assess whether a drug increases the average time to relapse in subjects with a recurring disease. We will highlight that the rank tests are often not suitable for such hypotheses.

Hence, instead of assessing hazards, let us study tests of (survival) parameters X_1 and X_2 in groups 1 and 2 at a prespecified time t_0. The null hypothesis is

H_X : X_{1,it_0} = X_{2,it_0} for i = 1, ..., p,    (6)

where p is the dimension of X_1. We emphasize that the null hypothesis in (6) is different from the null hypothesis in (1), as (6) is defined for any parameter X_t at a prespecified time t_0. We will consider parameters X_1 and X_2 that solve (2); this is a broad class of important parameters, including (but not limited to) the survival function, the cumulative incidence function, the time-dependent sensitivity and specificity functions, and the restricted mean survival function [9].

2.1. Test statistics.
We consider two groups 1 and 2 with population sizes n_1 and n_2, and let n = n_1 + n_2. We can estimate parameters X_1, X_2 and covariance matrices V_1, V_2 using the plugin method described in Section 1. The contrast √n(X̂_{1t} − X̂_{2t}) has an asymptotic mean zero normal distribution under the null hypothesis. If the groups are independent, we may then use the statistic

(X̂_{1t_0} − X̂_{2t_0}) V̂_{t_0}^{−1} (X̂_{1t_0} − X̂_{2t_0})^T,    (7)

to test for differences at t_0, where V̂_t = V̂_{1t}/n_1 + V̂_{2t}/n_2, and where V̂_{1t} and V̂_{2t} are calculated using the covariance matrix estimator (5). Then, the quantity (7) is asymptotically χ² distributed with p degrees of freedom under the null hypothesis, which is a corollary of the results in [9], as we know from [9, Theorem 2] that √n_i(X̂_i − X_i) converges weakly to a mean zero Gaussian martingale whose covariance matrix V_i can be consistently estimated using (5). Therefore, under the null hypothesis (6), the root-n difference of the estimates, √n(X̂_{1t} − X̂_{2t}), will converge to a multivariate mean zero normal distribution with covariance matrix that can be estimated by n(V̂_{1t}/n_1 + V̂_{2t}/n_2). Due to the continuous mapping theorem, the statistic (7) has an asymptotic χ² distribution.

Sometimes we may be interested in testing e.g. the r first components of X_1 and X_2, under the null hypothesis X_{1,it_0} = X_{2,it_0} for i = 1, ..., r < p. It is straightforward to adjust the hypothesis (6) and the test statistic, yielding the same asymptotic distribution with r degrees of freedom.

3. Examples of test statistics
We derive test statistics for some common effect measures in survival and event history analysis. By expressing the test statistics explicitly, our tests may be compared with the tests based on conventional approaches.

3.1. Survival at t_0. In clinical trials, the primary outcome may be survival at a prespecified time t_0, e.g. cancer survival 5 years after diagnosis. Testing if survival at t_0 is equal in two independent groups can be done in several ways [12], e.g. by estimating the variance of Kaplan-Meier curves using Greenwood's formula. However, we will highlight that our generic tests also immediately deal with this scenario: it is straightforward to use the null hypothesis in (6), where S_{1t_0} and S_{2t_0} are the survival functions in groups 1 and 2 at time t_0. Using the results in Section 2.1, we find that the plugin estimators of S_1 and S_2 are the standard Kaplan-Meier estimators. The plugin variance in group i solves

V̂_{it} = V̂_{i0} − 2 ∫_0^t V̂_{is−} dÂ_{is} + n_i ∫_0^t (Ŝ_{is−}/Y_{is})² dN_{is},    (8)

for i ∈ {1, 2}, where Y_{is} is the number at risk in group i just before time s. Assuming that the groups are independent, the final variance estimator can be expressed as V̂_t = V̂_{1t}/n_1 + V̂_{2t}/n_2, and the statistic (7) becomes (Ŝ_{1t_0} − Ŝ_{2t_0})²/V̂_{t_0}, which is approximately χ² distributed with 1 degree of freedom.

3.2. Restricted mean survival until t_0. As an alternative to the hazard ratio, the restricted mean survival has been advocated: it can be calculated without parametric assumptions and it has a clear causal interpretation [13, 14, 15]. The plugin estimator of the restricted mean survival difference between groups 1 and 2 is

R̂_{1t} − R̂_{2t} = Σ_{τ_k≤t} (Ŝ_{1τ_{k−1}} − Ŝ_{2τ_{k−1}}) Δτ_k,

where Δτ_k = τ_k − τ_{k−1}. The plugin estimator for the variance is

V̂_t^{R_i} = V̂_0^{R_i} + 2 Σ_{τ_k≤t} V̂_{τ_{k−1}}^{R_i,S_i} Δτ_k,
V̂_t^{R_i,S_i} = V̂_0^{R_i,S_i} − ∫_0^t V̂_{s−}^{R_i,S_i} dÂ_{is} + Σ_{τ_k≤t} V̂_{τ_{k−1}}^{S_i} Δτ_k,

where V̂^{S_i} is the plugin variance for √n_i Ŝ_i, given in (8). The statistic (7) can be used to perform a test, with V̂_{t_0} = V̂_{t_0}^{R_1}/n_1 + V̂_{t_0}^{R_2}/n_2.

3.3. Cumulative incidence at t_0.
Many time-to-event outcomes are subject to competing risks. The Gray test is a competing risk analogue to the log-rank test: the null hypothesis is that the subdistribution hazards λ_{it}, defined by λ_{it} = −(d/dt) log[1 − C_{it}] where C_{it} is the cumulative incidence of the event of interest, are equal at all t [16]. Analogous to the log-rank test, the Gray test has low power if the subdistribution hazard curves are crossing [17]. However, we are often interested in evaluating the cumulative incidence at a time t_0, without making assumptions about the subdistribution hazards, which are even harder to interpret causally than standard hazard functions. By expressing the cumulative incidence on the form (2), we use our transformation procedure to obtain a test statistic for the cumulative incidence at t_0. The plugin estimator for the cumulative incidence difference is

Ĉ_{1t} − Ĉ_{2t} = ∫_0^t Ŝ_{1s−} dÂ_{1,js} − ∫_0^t Ŝ_{2s−} dÂ_{2,js},

where Â_{i,j} is the cumulative cause-specific hazard for the event j of interest, and Ŝ_i is the Kaplan-Meier estimate within group i. The groupwise plugin variances solve

V̂_{it} = V̂_{i0} + 2 ∫_0^t V̂_{is−}^{C_i,S_i} dÂ_{i,js} + n_i ∫_0^t (Ŝ_{is−}/Y_{is})² dN_{i,js},

where V̂^{C_i,S_i} is the plugin covariance between the cumulative incidence and survival estimators, and N_{i,j} counts the event of interest.

3.4. Frequency of recurrent events.
Many time-to-event outcomes are recurrent events. For example, time to hospitalization is a common outcome in medical studies, such as trials on cardiovascular disease. Often recurrent events are analysed with conventional methods, in particular the Cox model, restricting the analysis to only include the first event in each subject. A better solution may be to study the mean frequency function,
i.e. the marginal expected number of events until time t, acknowledging that the subject cannot experience events after death [18]. We let A_{i,E} and A_{i,D} be the cumulative hazards for the recurrent event and death in group i, respectively, and let K_i and S_i be the mean frequency function and survival, respectively. Then, the plugin estimator of the difference is

K̂_{1t} − K̂_{2t} = ∫_0^t Ŝ_{1s−} dÂ_{1,Es} − ∫_0^t Ŝ_{2s−} dÂ_{2,Es}.

The plugin variances solve

V̂_t^{K_i} = V̂_0^{K_i} + 2 ∫_0^t V̂_{s−}^{K_i,S_i} dÂ_{i,Es} + n_i ∫_0^t (Ŝ_{is−}/Y_{is})² dN_{i,Es},
V̂_t^{K_i,S_i} = V̂_0^{K_i,S_i} − ∫_0^t V̂_{s−}^{K_i,S_i} dÂ_{i,Ds} + ∫_0^t V̂_{s−}^{S_i} dÂ_{i,Es},

where N_{i,E} counts the recurrent event, and where V̂^{S_i} is the survival plugin variance in group i, as displayed in (8).

3.5. Prevalence in an illness-death model.
The prevalence denotes the number of individuals with a condition at a specific time, which is e.g. useful for evaluating the burden of a disease. We consider a simple Markovian illness-death model with three states: healthy (0), ill (1), and dead (2). The population is assumed to be healthy initially, but individuals may get ill or die as time goes on. We aim to study the prevalence P^1_{i,t} of the illness in group i as a function of time. Here, we assume that the illness is irreversible, but we could extend this to a scenario in which recovery from the illness is possible, similar to Bluhmki [19]. Let A_{i,kj} be the cumulative hazard for transitioning from state k to j in group i. Then, P^1_i solves the system

P^0_{i,t} = 1 − ∫_0^t P^0_{i,s−} d(A_{i,01,s} + A_{i,02,s}),
P^1_{i,t} = ∫_0^t P^0_{i,s−} dA_{i,01,s} − ∫_0^t P^1_{i,s−} dA_{i,12,s}.

The plugin estimator for the difference P^1_1 − P^1_2 is

P̂^1_{1,t} − P̂^1_{2,t} = ∫_0^t P̂^0_{1,s−} dÂ_{1,01,s} − ∫_0^t P̂^1_{1,s−} dÂ_{1,12,s} − ∫_0^t P̂^0_{2,s−} dÂ_{2,01,s} + ∫_0^t P̂^1_{2,s−} dÂ_{2,12,s}.

The variance estimator for group i reads

V̂_{i,t}^{P^1} = V̂_{i,0}^{P^1} + 2 ∫_0^t V̂_{i,s−}^{P^1,P^0} dÂ_{i,01,s} − 2 ∫_0^t V̂_{i,s−}^{P^1} dÂ_{i,12,s} + n_i ( ∫_0^t (P̂^0_{i,s−}/Y^0_{i,s})² dN_{i,01,s} + ∫_0^t (P̂^1_{i,s−}/Y^1_{i,s})² dN_{i,12,s} ),
V̂_{i,t}^{P^1,P^0} = V̂_{i,0}^{P^1,P^0} + ∫_0^t ( V̂_{i,s−}^{P^0} − V̂_{i,s−}^{P^1,P^0} ) dÂ_{i,01,s} − ∫_0^t V̂_{i,s−}^{P^1,P^0} dÂ_{i,02,s} − ∫_0^t V̂_{i,s−}^{P^1,P^0} dÂ_{i,12,s} − n_i ∫_0^t (P̂^0_{i,s−}/Y^0_{i,s})² dN_{i,01,s},
V̂_{i,t}^{P^0} = V̂_{i,0}^{P^0} − 2 ∫_0^t V̂_{i,s−}^{P^0} d(Â_{i,01,s} + Â_{i,02,s}) + n_i ∫_0^t (P̂^0_{i,s−}/Y^0_{i,s})² d(N_{i,01,s} + N_{i,02,s}).

Here, Y^0_{i,s} and Y^1_{i,s} are the numbers of individuals at risk in states 0 and 1, while N_{i,kj} counts the transitions from state k to j in group i. By calculating V̂_{i,t_0}^{P^1} for i ∈ {1, 2}, we can find the statistic (7).
Here, the prevalence is measured as the proportion of affected individuals relative to the population at t = 0. We could use a similar approach to consider the proportion of affected individuals relative to the surviving population at t, or we could record the cumulative prevalence until t to evaluate the cumulative disease burden.

4. Performance
In this section, we present power functions under several scenarios for the test statistics that were presented in Section 3. The scenarios were simulated by defining different relations between the hazard functions in the two exposure groups: (i) constant hazards in both groups, (ii) hazards that were linearly crossing, and (iii) hazards that were equal initially before diverging after a time t.

For each hazard relation (i)-(iii), we defined several κ's such that the true parameter difference was equal to κ at the prespecified time point t_0, i.e. X_{1t_0} − X_{2t_0} = κ. For each combination of target parameter, difference κ, and hazard scenario, we replicated the simulations m times to obtain m realizations of (7), and we artificially censored 10% of the subjects in each simulation. In the Supplementary material, we show additional simulations with different sample sizes (50, 100 and 500) and fractions of censoring (10%-40%). We have provided an overview of the simulations in figure 2, in which parameters of interest (solid lines) and hazard functions (dashed lines) are displayed in scenarios with fixed κ = −0.05 at t_0 = 1.5, i.e. X_{1,1.5} − X_{2,1.5} = −0.05. For each replication, we tested H_X at the 5% significance level. Thus, we obtained m Bernoulli trials in which the success probability is the power function evaluated at κ.

The estimated power functions, i.e. the estimates of the Bernoulli probabilities, are displayed in figure 3 (solid lines). The power functions are not affected by the structure of the underlying hazards, as desired: our tests are only defined at t_0, and the particular parameterization of the hazard has minor impact on the power function.

The dashed lines in figure 3 show power functions of alternative nonparametric test statistics that are already found in the literature, tailor-made for the scenario of interest. In particular, for the survival at t_0, we obtained a test statistic using Greenwood's variance formula (and a cloglog transformation in the Supplementary Material) [12]. For the restricted mean survival at t_0, we used the statistic suggested in [15]. For the cumulative incidence at t_0, we used the statistic suggested in [20]. For the mean frequency function we used the estimators derived in [18], and in the prevalence example we used the variance formula in [21, p. 295], as implemented in the etm package in R. Our generic strategy gave similar power compared to the conventional methods for each particular scenario.

4.1. Comparisons with the log-rank test.
We have argued that our tests are fundamentally different from the rank tests, as the null hypotheses are different. Nevertheless, since rank tests are widely used in practice, also when the primary interest seems to be other hypotheses than (1), we aimed to compare the power of our test statistics with the log-rank test under different scenarios. In table 3, we compared the power of our test statistic and the rank test, using the scenarios in figure 2. In the first column, the proportional hazards assumption is satisfied (constant), and therefore the power of the log-rank test is expected to be optimal (assuming no competing risks). Our tests of the survival function and the restricted mean survival function show only slightly reduced power compared to the log-rank test. For the cumulative incidence function at a time t_0, our test is less powerful than the Gray test and the log-rank test of the cause-specific hazards. However, the type one error rate of the cause-specific hazard test is not nominal, which we will return to at the end of this section.
The second column displays power of tests under scenarios with crossing hazards. For the survival function, it may seem surprising that the log-rank test got higher power than our test, despite the crossing hazards. However, in this particular scenario the hazards are crossing close to the end of the study (dashed lines in Figure 2), and therefore the crossing has little impact on the power of the log-rank test. In contrast, the power of the log-rank test is considerably reduced in the scenarios where we study the restricted mean survival function and the cumulative incidence functions, in which the hazards are crossing at earlier points in time.

The third column shows the power under hazards that are deviating. For the survival function, our test provides higher power. Intuitively, the log-rank test has less power in this scenario because the hazards are equal or close to equal during a substantial fraction of the follow-up time. For the restricted mean survival, however, the log-rank test has more power. This is not surprising [15], and it is due to the particular simulation scenario: late events have relatively little impact on the restricted mean survival time, and in this scenario a major difference between the hazards was required to obtain κ. Since the log-rank test is sensitive to major differences in the hazards, it has more power in this scenario. For the cumulative incidence, in contrast, the power of the log-rank test is lower than the power of our test.

The results in table 3 illustrate that power depends strongly on the hazard scenario for the log-rank test, but this dependence is not found for our tests.

To highlight the basic difference between the log-rank test and our tests, we have studied scenarios where H_X in (6) is true (figure 4). That is, at t_0 = 1.5 we have X_{1t_0} − X_{2t_0} = 0, but for earlier times the equality does not hold. In these scenarios, the log-rank test got high rates of type 1 errors.
Heuristically, this is expected because the hazards are different at most (if not all) times t ∈ [0, t_0]. Nevertheless, figure 4 confirms that the log-rank test does not have the correct type 1 error rate under null hypotheses as in (6), and should not be used for such tests, even if the power sometimes is adequate (as in table 3).

5. Example: Adjuvant chemotherapy in patients with colon cancer
To illustrate how our tests can be applied in practice, we assessed the effectiveness of two adjuvant chemotherapy regimes in patients with stage III colon cancer, using data that are available to anyone [22, 23]. The analysis is performed as a worked example in the supplementary data using the R package transform.hazards. After undergoing surgery, 929 patients were randomly assigned to observation only, levamisole (Lev), or levamisole plus fluorouracil (Lev+5FU) [22]. We restricted our analysis to the 614 subjects who got Lev or Lev+5FU. All-cause mortality and the cumulative incidence of cancer recurrence were lower in subjects receiving Lev+5FU, as displayed in Figure 1.

We formally assessed the comparative effectiveness of Lev and Lev+5FU after 1 and 5 years of follow-up, using the parameters from Section 3. After 1 year, both the overall survival and the restricted mean survival were similar in the two treatment arms (Table 1). However, the cumulative incidence of recurrence was reduced in the Lev+5FU group, and the number of subjects alive with recurrent disease was lower in the Lev+5FU group. Also, the mean time spent alive and without recurrence was longer in the Lev+5FU group (Table 1, Restricted mean recurrence-free survival). These results suggest that Lev+5FU has a beneficial effect on disease recurrence after 1 year of follow-up compared to Lev.

After 5 years of follow-up, overall survival and restricted mean survival were improved in the Lev+5FU group (Table 2). Furthermore, the cumulative incidence of recurrence was reduced, and the prevalence of patients with recurrence was lower in the Lev+5FU group. These results suggest that Lev+5FU improves overall survival and reduces recurrence after 5 years compared to Lev.

In conclusion, our analysis indicates that treatment with Lev+5FU improves several clinically relevant outcomes in patients with stage III colon cancer.
We also emphasise that a conventional analysis using a proportional hazards model would not be ideal here, as the plots in Figure 1 indicate violations of the proportional hazards assumption.

6. Covariate adjustments
Our approach allows us to conduct tests conditional on baseline covariates, using the additive hazard model; by letting the cumulative hazard integrand in (2) be conditional on specific covariates, we can test for conditional differences between groups, assuming that the underlying hazards are additive.

In more detail, we can test for differences between groups 1 and 2 at the covariate level Z = z by evaluating the cumulative hazards in each group at that level, yielding A_{1,z} and A_{2,z}. Estimates Â_{1,z} and Â_{2,z} can be found using standard software. This allows us to estimate parameters with covariances using (4) and (5), and test the null hypothesis of no group difference within covariate level z using the test statistic (7), again assuming that the groups are independent.

7. Discussion
By expressing survival parameters as solutions of differential equations, we provide generic hypothesis tests for survival analysis. In contrast to the conventional approaches that are based on hazard functions [24, Section 3.3], our null hypotheses are defined with respect to explicit parameters, defined at a time t_0. Our strategy also allows for covariate adjustment under additive hazard models.

We have presented some examples of parameters, and our simulations suggest that the test statistics are well-behaved in a range of scenarios. Indeed, for common parameters such as the survival function, the restricted mean survival function and the cumulative incidence function, our tests obtain similar power to conventional tests that are tailor-made for a particular parameter. Importantly, our examples do not comprise a comprehensive set of relevant survival parameters, and several other effect measures for event histories may be described on the differential equation form (2), allowing for immediate implementation of hypothesis tests, for example using the R package transform.hazards [9, 8]. The fact that our derivations are generic and easy to implement for customized parameters is a major advantage.

Our tests differ from the rank tests, as the rank tests are based on assessing the equality of the hazards during the entire follow-up. However, our strategy is intended to be different: we aimed to provide tests that apply to scenarios where the null hypothesis of the rank tests is not the primary interest.

Restricting the primary parameter to a single time t_0 is sometimes considered to be a caveat. In particular, we ignore the time-dependent profile of the parameters before and after t_0. For some parameters, such as the survival function or the cumulative incidence function, this may be a valid objection in principle. However, even if our primary parameter is assessed at t_0, this parameter may account for the whole history of events until t_0.
One example is the restricted mean survival, which considers the history of events until t_0. Indeed, the restricted mean survival has been suggested as an alternative effect measure to the hazard ratio, because it is easier to interpret causally, and it does not rely on the proportional hazards assumption [14]. An empirical analysis of RCTs showed that tests of the restricted mean survival function yield results that are concordant with log-rank tests under the conventional null hypothesis in (1) [14], and similar results were found in an extensive simulation study [15].

Furthermore, the time-dependent profile before t_0 is not our primary interest in many scenarios. In medicine, for example, we may be interested in comparing different treatment regimes, such as radiation and surgery for a particular cancer. Then, time to treatment failure is expected to differ in the shorter term due to the fundamental difference between the treatment regimes, but the study objective is to assess longer-term treatment differences [25]. Similarly, in studies of cancer screening, it is expected that more cancers are detected early in the screening arm, but the scientist's primary aim is often to assess long-term differences between screened and non-screened. In such scenarios, testing at a prespecified t_0 is more desirable than the null hypothesis of the rank tests.

Nevertheless, we must ensure that cherry-picking of t_0 is avoided. In practice, there will often be a natural value of t_0. For example, t_0 (or multiple t_1, t_2, ..., t_k) can be prespecified in the protocol of clinical trials. In cancer studies, a common outcome is e.g. five-year survival. Alternatively, t_0 can be selected based on when a certain proportion is lost to censoring. Furthermore, using confidence bands, rather than pointwise confidence intervals and p-values, is an appealing alternative when considering multiple points in time.
There exist methods to estimate confidence bands based on the wild bootstrap in competing risks settings [26], which were recently extended to reversible multistate models, allowing for illness-death scenarios with recovery [19]. We aim to develop confidence bands for our estimators in future research.

Finally, we are often interested in assessing causal parameters using observational data, under explicit assumptions that ensure no confounding and no selection bias. Such parameters may be estimated after weighting the original data [8, 27]. Indeed, weighted point estimators are consistent when our approach is used [8], but we would also like to identify the asymptotic root-n residual distribution, allowing us to estimate covariance matrices that are appropriate for the weighted parameters. We consider this to be an important direction for future work. Currently, such covariance matrices can only be obtained from bootstrap samples.

8. Software
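The idea behind such wild bootstrap bands can be sketched as follows: each Nelson-Aalen increment is perturbed by an independent standard normal multiplier, and the supremum of the perturbed processes calibrates a band, in the spirit of [26]. This is a simplified illustration with our own function names, not the construction of [19], and the flat sup-based band below is the crudest variant.

```python
import numpy as np

def nelson_aalen_increments(time, event):
    """Ordered times with Nelson-Aalen increments dN_i / Y_i."""
    order = np.argsort(time, kind="stable")
    time = np.asarray(time, float)[order]
    event = np.asarray(event, float)[order]
    at_risk = len(time) - np.arange(len(time))  # Y_i after sorting
    return time, event / at_risk

def wild_bootstrap_band(time, event, n_boot=2000, level=0.95, seed=1):
    """Sup-based confidence band for the cumulative hazard via the
    multiplier (wild) bootstrap: each increment gets an N(0,1) weight."""
    rng = np.random.default_rng(seed)
    t, inc = nelson_aalen_increments(time, event)
    cumhaz = np.cumsum(inc)
    weights = rng.standard_normal((n_boot, len(inc)))
    resampled = np.cumsum(weights * inc, axis=1)   # perturbed residual paths
    q = np.quantile(np.max(np.abs(resampled), axis=1), level)
    return t, cumhaz, cumhaz - q, cumhaz + q
```

Conditionally on the data, the perturbed processes mimic the limiting distribution of the estimation error, so the chosen quantile of their supremum gives a band covering the whole time range simultaneously.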
We have implemented a generic procedure for estimating parameters and covariance matrices in an R package, available for anyone to use at github.com/palryalen/transform.hazards. It allows for hypothesis testing at prespecified time points using (6). Worked examples can be found in the package vignette, or in the supplementary material here.

9. Funding
The authors were all supported by the research grant NFR239956/F20 - Analyzing clinical health registries: Improved software and mathematics of identifiability.
References

[1] Judea Pearl. The new challenge: From a century of statistics to the age of causation. In Computing Science and Statistics, pages 415-423. Citeseer, 1997.
[2] Miguel A Hernán and James M Robins. Causal Inference. CRC Press, Boca Raton, FL, 2018. Forthcoming.
[3] Judea Pearl. Causality: Models, Reasoning and Inference, 2nd Edition. Cambridge University Press, 2000.
[4] Odd O Aalen, Richard J Cook, and Kjetil Røysland. Does Cox analysis of a randomized survival study yield a causal treatment effect? Lifetime Data Analysis, 21(4):579-593, 2015.
[5] Miguel A Hernán. The hazards of hazard ratios. Epidemiology, 21(1):13, 2010.
[6] David R Cox and David Oakes. Analysis of Survival Data. Chapman & Hall, London, 1984.
[7] Mats J Stensrud, Morten Valberg, Kjetil Røysland, and Odd O Aalen. Exploring selection bias by causal frailty models: The magnitude matters. Epidemiology, 28(3):379-386, 2017.
[8] Pål C Ryalen, Mats J Stensrud, and Kjetil Røysland. The additive hazard estimator is consistent for continuous time marginal structural models. arXiv preprint arXiv:1802.01946, 2018.
[9] Pål C Ryalen, Mats J Stensrud, and Kjetil Røysland. Transforming cumulative hazard estimates. Accepted in Biometrika; arXiv preprint arXiv:1710.07422, 2017.
[10] Odd Aalen. Nonparametric inference for a family of counting processes. The Annals of Statistics, pages 701-726, 1978.
[11] Jessica G Young, Eric J Tchetgen Tchetgen, and Miguel A Hernán. The choice to define competing risk events as censoring events and implications for causal inference. arXiv preprint arXiv:1806.06136, 2018.
[12] John P Klein, Brent Logan, Mette Harhoff, and Per Kragh Andersen. Analyzing survival curves at a fixed point in time. Statistics in Medicine, 26(24):4505-4519, 2007.
[13] Patrick Royston and Mahesh KB Parmar. The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt. Statistics in Medicine, 30(19):2409-2421, 2011.
[14] Ludovic Trinquart, Justine Jacot, Sarah C Conner, and Raphaël Porcher. Comparison of treatment effects measured by the hazard ratio and by the ratio of restricted mean survival times in oncology randomized controlled trials. Journal of Clinical Oncology, 34(15):1813-1819, 2016.
[15] Lu Tian, Haoda Fu, Stephen J Ruberg, Hajime Uno, and Lee-Jen Wei. Efficiency of two sample tests via the restricted mean survival time for analyzing event time observations. Biometrics, 2017.
[16] Robert J Gray. A class of k-sample tests for comparing the cumulative incidence of a competing risk. The Annals of Statistics, pages 1141-1154, 1988.
[17] A Latouche and R Porcher. Sample size calculations in the presence of competing risks. Statistics in Medicine, 26(30):5370-5380, 2007.
[18] Debashis Ghosh and DY Lin. Nonparametric analysis of recurrent events and death. Biometrics, 56(2):554-562, 2000.
[19] Tobias Bluhmki, Claudia Schmoor, Dennis Dobler, Markus Pauly, Juergen Finke, Martin Schumacher, and Jan Beyersmann. A wild bootstrap approach for the Aalen-Johansen estimator. Biometrics, 2018.
[20] Mei-Jie Zhang and Jason Fine. Summarizing differences in cumulative incidence functions. Statistics in Medicine, 27(24):4939-4949, 2008.
[21] Per Kragh Andersen, Ørnulf Borgan, Richard D. Gill, and Niels Keiding. Statistical Models Based on Counting Processes. Springer Series in Statistics. Springer-Verlag, New York, 1993.
[22] Charles G Moertel, Thomas R Fleming, John S Macdonald, Daniel G Haller, John A Laurie, Tangen, et al. Fluorouracil plus levamisole as effective adjuvant therapy after resection of stage III colon carcinoma: a final report. Annals of Internal Medicine, 122(5):321-326, 1995.
[23] T. M. Therneau and P. M. Grambsch. Modeling Survival Data: Extending the Cox Model. Springer, New York, 2000.
[24] Odd Aalen, Ørnulf Borgan, and Håkon Gjessing. Survival and Event History Analysis: A Process Point of View. Springer Science & Business Media, 2008.
[25] Pål C Ryalen, Mats J Stensrud, and Kjetil Røysland. Causal inference in continuous time: An example on prostate cancer therapy. Accepted in Biostatistics, 2018.
[26] DY Lin. Non-parametric inference for cumulative incidence functions in competing risks studies. Statistics in Medicine, 16(8):901-910, 1997.
[27] Miguel Angel Hernán, Babette Brumback, and James M Robins. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology, 11(5):561-570, 2000.
Table 1.
Estimates, 95% confidence intervals and p-values after 1 year.

Parameter                                  Lev                 Lev+5FU             p-value
Survival                                   0.91 (0.87, 0.94)   0.92 (0.89, 0.95)   0.62
Restricted mean survival                   0.96 (0.95, 0.98)   0.97 (0.95, 0.98)   0.86
Cumulative incidence                       0.27 (0.22, 0.32)   0.15 (0.11, 0.19)   0
Prevalence                                 0.19 (0.14, 0.23)   0.09 (0.05, 0.12)   0
Restricted mean recurrence-free survival   0.85 (0.82, 0.88)   0.90 (0.87, 0.92)   0.01
Table 2.
Estimates, 95% confidence intervals and p-values after 5 years.

Parameter                                  Lev                 Lev+5FU             p-value
Survival                                   0.54 (0.48, 0.59)   0.63 (0.58, 0.69)   0.01
Restricted mean survival                   3.62 (3.44, 3.81)   3.97 (3.79, 4.15)   0.01
Cumulative incidence                       0.47 (0.42, 0.51)   0.34 (0.29, 0.39)   0
Prevalence                                 0.07 (0.05, 0.10)   0.03 (0.02, 0.05)   0.02
Restricted mean recurrence-free survival   2.29 (2.07, 2.51)   2.95 (2.73, 3.18)   0
Table 3.
Power comparisons of our tests and rank tests (our/rank) for the scenarios displayed in Figure 2, comparing two groups of 1500 individuals (based on 400 replications). In the lower row, we also display the power of Gray's test for competing risks (our/rank/Gray). The power of the rank tests is sensitive to the shape of the underlying hazards, while the power of our tests varies little across the scenarios. In particular, the power of the rank tests is very sensitive to the rate of change of the hazards when they are crossing or deviating; see also the third column of Figure 2.

Parameter \ Hazard         Constant         Crossing         Deviating
Survival                   0.81/0.88        0.79/0.96        0.83/0.70
Restricted mean survival   0.77/0.87        0.78/0.21        0.80/1.00
Cumulative incidence       0.85/0.94/0.88   0.86/0.80/0.70   0.86/0.83/0.76
Figure 1.
Survival curves (left) and the cumulative incidence of recurrence (right), with 95% pointwise confidence intervals (shaded), from the colon cancer trial.
Figure 2.
Simulation scenarios in which the true parameter difference was fixed to κ = -0.05 at t = 1.5, i.e. the difference between the two groups' parameters at t = 1.5 equals -0.05. The upper row shows survival, the middle row shows restricted mean survival, and the lower row shows cumulative incidences. The hazards are displayed as dotted lines: constant in the left column, linearly crossing in the middle column, and deviating in the right column. The parameters and hazards of the two groups are drawn in black and green, respectively. See Table 3 for a power comparison of our tests and the rank tests for the scenarios that are displayed. In the cumulative incidence panels, the cause-specific hazards for the competing event are held constant, and the parameter difference of -0.05 at t = 1.5 is imposed under the restrictions that the hazards are constant (left), linearly crossing (middle), and equal before deviating (right).

Figure 3.
Estimated power functions for the survival, restricted mean survival, cumulative incidence, mean frequency function, and prevalence parameters, plotted against the parameter difference κ, under constant (black), crossing (green), and deviating (blue) hazards, based on 250 subjects and 400 replications. The dashed lines show test statistics derived from existing methods in the literature that are tailor-made for the particular scenario. The confidence level is indicated by the gray lines.
[Figure 4 panels include survival, restricted mean survival, and cumulative incidence as functions of t, alongside the corresponding hazard configurations.]

Figure 4.
In the upper row, we display hazard functions in scenarios in which the hazard in group 1 is fixed (black line) and the hazards in group 2 vary (gray lines). The hazards are optimized such that the null hypothesis is true, i.e. X_t^1 = X_t^2 for each combination of black/gray hazards at t = 1.