Treatment Effect Quantification for Time-to-event Endpoints -- Estimands, Analysis Strategies, and beyond
arXiv [stat.ME]
Kaspar Rufibach ∗ October 8, 2018
Abstract
A draft addendum to ICH E9 has been released for public consultation in August 2017. The addendum focuses on two topics particularly relevant for randomized confirmatory clinical trials: estimands and sensitivity analyses. The need to amend ICH E9 grew out of the realization of a lack of alignment between the objectives of a clinical trial stated in the protocol and the accompanying quantification of the “treatment effect” reported in a regulatory submission. We embed time-to-event endpoints in the estimand framework, and discuss how the four estimand attributes described in the addendum apply to time-to-event endpoints. We point out that if the proportional hazards assumption is not met, the estimand targeted by the most prevalent methods used to analyze time-to-event endpoints, logrank test and Cox regression, depends on the censoring distribution. We discuss for a large randomized clinical trial how the analyses for the primary and secondary endpoints as well as the sensitivity analyses actually performed in the trial can be seen in the context of the addendum. To the best of our knowledge, this is the first attempt to do so for a trial with a time-to-event endpoint. Questions that remain open with the addendum for time-to-event endpoints and beyond are formulated, and recommendations for planning of future trials are given. We hope that this will provide a contribution to developing a common framework based on the final version of the addendum that can be applied to design, protocols, statistical analysis plans, and clinical study reports in the future.
Keywords:
Estimand; Time-to-Event; Sensitivity Analysis; Censoring; Randomized Clinical Trial; Causal Inference
When evaluating interventions in clinical trials, time-to-event (T2E) endpoints such as overall survival (OS) or time to a severe cardiovascular event play a prominent role.
∗ Methods, Collaboration, and Outreach Group (MCO), Department of Biostatistics, Hoffmann-La Roche Ltd, Basel, Switzerland
Statistical Principles for Clinical Trials) [9]. This ICH E9 draft addendum ([10], henceforth simply referred to as “addendum”) has been released for public consultation in August 2017. It focuses on two topics particularly relevant for confirmatory randomized clinical trials (RCT): estimands and sensitivity analyses. According to the addendum, an estimand describes what is to be estimated based on the question of interest and can be defined through the population of interest, endpoint of interest, specification of how intercurrent events are handled, and summary measure. In what follows, we refer to the specification of how intercurrent events are handled as intervention effect. A sensitivity analysis “...can help to investigate and understand the robustness of estimates; the sensitivity of the overall conclusions to various limitations of the data, assumptions, and approaches to data analysis.” [11].
The need to amend E9 with a discussion on estimands grew out of the realization of an apparent lack of alignment between the objectives stated in a clinical trial protocol and the accompanying quantification and interpretation of the “treatment effect” reported in a regulatory submission. While the estimand framework has been developed with different clinical trial settings and endpoints in mind, the examples discussed in publications, at scientific meetings, and in the addendum have largely focused on symptomatic studies and continuous, longitudinal endpoints ([12], [13], [9]). Given the realization of lack of common definition for T2E endpoints, and the introduction of the estimand concept in the addendum, we would like to connect these two aspects and discuss estimand considerations as they apply to outcome trials that focus on T2E endpoints. We hope that this, together with further efforts discussed in Section 5, will lead to alignment in endpoint, and ultimately, estimand definitions.
While this paper was under review, Unkel et al. [14] have also looked into estimands for T2E endpoints, as they apply to safety analyses with different follow-up times in treatment arms. This illustrates that considerations for T2E endpoints within the estimand framework do not only apply to efficacy, but also safety or quality-of-life T2E endpoints.
The key messages we want to convey in this paper are:
• Embed T2E endpoints in the estimand framework.
• Illustrate that censoring should be considered part of the estimator (not estimand) definition.
• We discuss the (often implicit) assumptions made when censoring at an intercurrent event. The decision on how an estimator deals with an intercurrent event, e.g. whether it censors it or not, should be derived from the estimand which in turn has to be defined first.
• If we have non-proportional hazards (NPH) and censoring, the estimand that is estimated by the hazard ratio from Cox regression depends on the censoring distribution. This is a rather undesirable feature and we raise the question whether, within the estimand framework, the Cox regression hazard ratio should remain the method of choice to quantify a treatment effect for T2E endpoints. We discuss some potential alternatives.
• Related to this, we raise the question whether hypothesis testing and effect estimation need to be tied to the same estimand, or whether these can be considered separately.
• Based on several examples from the literature, we illustrate that seemingly clearly defined endpoints in clinical trials are subject to substantial heterogeneity in how they are specified. We anticipate that the estimand framework will help aligning these definitions.
• We retrospectively embed the analysis specification of a large Phase 3 oncology RCT into the estimand framework. We hope this can serve as a basis for trial teams to specify estimands moving forward and to inform data collection strategies.
• The addendum uses causal inference language and concepts, e.g. the principal strata strategy. However, it is not clear whether post-addendum estimands need to be causally interpretable. We would welcome clarification of this aspect in the final version of the addendum.
The paper is structured as follows: In Section 2 we review the estimand attributes proposed in the addendum and how they apply to T2E endpoints while Section 3 briefly summarizes how we see the role of censoring within the estimand definition. The connection between logrank test, Cox proportional hazards regression (CPHR), and the proportional hazards (PH) assumption, especially if the latter is not met, is discussed in Section 4. Section 5 is devoted to a brief review of examples from the literature that illustrate how heterogeneous seemingly “clearly defined” endpoints actually are between trials and the implications of this heterogeneity. In Section 6 we illustrate how intercurrent events can play a different role and thus necessitate a different estimand strategy depending on the indication the trial is run in. As an illustration, the endpoints and sensitivity analyses (“sensitivity” with the pre-ICH E9 addendum meaning) of a large Phase 3 oncology trial, that has recently been reported, are then discussed in view of the new estimand framework in Section 7. We conclude with a discussion in Section 8.
Estimand attributes for T2E endpoints
In order to provide a framework for the following sections we discuss how T2E endpoints can be viewed in the estimand framework. According to the addendum, an estimand describes what is to be estimated based on the question of interest and can be defined through the population, variable, intervention effect, and summary measure. Among these, the intervention effect, which specifies how intercurrent events are reflected in the scientific question, may be considered the attribute that in general adds most novelty to the way how we describe trial objectives. Intercurrent events are clinical events that occur between randomisation and before the endpoint, such as non-adherence, discontinuation of intervention, study withdrawal, treatment switching, use of rescue medication, second line treatment, transplantation, or death. The summary measure specifies the quantity on which the treatment comparison will be based.
Note that the four attributes of an estimand as described in the addendum should not be considered independently, but in relation to each other ([10], Section A.3.1). In what follows, we discuss them as they apply to a T2E endpoint.
The population to be sampled from when generating clinical trial data is usually characterized through a comprehensive list of in- and exclusion criteria. We do not identify anything specific to T2E endpoints concerning the definition of the population under study. However, as mentioned by a reviewer, the rate of intercurrent events in the population, as defined in the intervention effect attribute of the addendum, will have a direct bearing on the value of the estimand. As a consequence, the estimand, and thus also the estimate of interest, is becoming more dependent on the characteristics of the population. These considerations apply not only to T2E but any type of endpoint.
When defining an estimand, the attribute variable is intended to summarize the quantities required to address the scientific question. For a T2E endpoint this would be, in line with Altman and Bland [15]:
• Starting date: Typically, in a clinical trial this is either the date of registration into a (single-arm or non-randomized) trial or the date of randomisation. For simplicity, we will generally use “randomisation” to refer to the starting date.
• Event defining the endpoint: As discussed e.g. in Chapter 2 of Klein and Moeschberger [16], the event of interest may be death, appearance of tumor, development of some disease, recurrence of a disease, and so forth.
In the simplest case, e.g. when looking at OS, the starting date and event date are precisely defined up to the exact day. On the other hand, when the event date is determined through pre-specified assessments, e.g. regular tumor imaging in oncology, then the resulting T2E might have a density that has peaks around these assessment dates if one simply assumes that the data is right-censored at the actual assessment date (instead of interval-censored with the interval between two consecutive assessment dates). This manifests through “steps” in the survival curve when nonparametrically estimated through, e.g., Kaplan-Meier. An example for this pattern is PFS in the CLEOPATRA trial [17].
The definition of the endpoint might involve more than one timepoint. We discuss an example from Multiple Sclerosis in Section 5. Interestingly, this also implies that intercurrent events can not only happen between randomisation and the endpoint, but also between the initial and final assessment of the endpoint itself.
The intervention effect defines how intercurrent events, i.e. potentially treatment-related clinical events that occur between randomisation and before the endpoint, are reflected in the scientific question. As a matter of fact, what constitutes an intercurrent event for a given variable depends on the variable itself and the disease indication, i.e. the population attribute. A key difficulty when defining the summary measure in Section 2.4 is that an observed intercurrent event typically depends on the treatment received. We provide a list of intercurrent events in a standard oncology setting in the case study in Section 7.
In what follows, we discuss the strategies proposed in the addendum (Section A.3.2) to handle intercurrent events as they apply to a T2E endpoint.
• “Treatment policy”: The value for the variable is used regardless of whether or not the intercurrent event occurs,
• “Composite”: make the intercurrent event part of a composite endpoint by counting it an event defining the endpoint,
• “Hypothetical”: a scenario is envisaged in which the intercurrent event would not occur, e.g. because the patient switched treatment,
• “While on treatment”: we are interested in the response to treatment prior to the occurrence of the intercurrent event, e.g. start of second line treatment in the absence of observing the endpoint,
• “Principal stratum”: restrict the population of interest to the stratum of patients in which an intercurrent event would not have happened.
Note that a dedicated strategy for each intercurrent event needs to be defined, implying that in case of multiple intercurrent events, more than one of the above strategies may be needed to define a single estimand.
Now, assume our trial objective would be “measuring the time between randomisation and progressive disease (PD)”. The variable attribute of the corresponding estimand would then be “time from randomisation until PD” and death would be an intercurrent event.
For a T2E endpoint, the intercurrent event of death can be embedded in the addendum strategies as follows:
• If we counted death as an event we made the intercurrent event part of a composite, and we get PFS as the variable attribute of our estimand.
• Death could be considered an intercurrent event competing with progression. In order to embed this case into the addendum language, recall the addendum definition of the “while on treatment strategy”: “...we are interested in the response to treatment prior to the occurrence of the intercurrent event. If a variable is measured repeatedly, its values up to the time of the intercurrent event may be considered to account for the intercurrent event, rather than the value at the same fixed timepoint for all subjects.” So, for a longitudinal endpoint, which is referred to in this definition, we would simply impute the value of the variable (e.g. numerical score that defines the variable) at the intercurrent event (when patient dies) as the value at the fixed timepoint. We propose to fit a competing intercurrent event for a T2E endpoint in the “while on treatment strategy” of the addendum, as we are equally interested in the response to treatment prior to either the event of interest or the competing event. On the estimator level this would mean we would simply censor at death when performing inference on time-to-progression (TTP). As discussed in Section 3 one then needs to make sure to align the summary measure with this censoring strategy.
• If we are willing to entertain the assumption that those who die have the same momentary risk of an event as those that remain in the risk set, we then estimate with this the hypothetical estimand “time from randomisation until PD, assuming that the time until PD of patients who died is imputed using the longer term outcomes of other patients who survived and remained under observation” [18], and we retrieve the familiar definition of TTP [19], [20]. The assumption e.g. holds in case of random censoring, i.e. if the censoring time and T2E are stochastically independent [21].
So, interestingly, both the hypothetical and the “while on treatment” strategy can be estimated through censoring at the intercurrent event of death.
However, the estimand the resulting estimators are targeting is different, and this manifests itself in the summary measure attribute of the estimand. We revisit a few points around censoring and summary measures in the case of competing risks in Section 3.
As becomes clear, the choice between TTP, a competing risk analysis, and PFS is ultimately a decision about which estimand to look at, and this estimand should be inferred from the trial objective. Note that compared with TTP, PFS is generally the preferred regulatory endpoint [19].
The PFS example can also be used to discuss the general concept of a treatment policy estimand, where the following considerations apply to any type of endpoint, not just time-to-event. Treatment policy is defined by encompassing the intercurrent event. The rationale for doing so is that if the intercurrent event impacts the endpoint, this will be reflected in the resulting treatment effect, i.e. the causal link is preserved. This boils down to “ignoring”, or not censoring at, the intercurrent event, and that is often considered the “analogue” to the “intention-to-treat” (ITT) principle introduced in the original ICH E9 guideline, although much ambiguity is involved around what is considered to constitute the ITT principle [22]. We would like to caution against the use of such a broad strategy when defining a treatment policy estimand. If any intercurrent event, whether pre-specified in the protocol or not, is ignored and a treatment policy estimand is postulated, the implied estimand is difficult to interpret. This is because allowing any intercurrent event is not really a strategy and causality remains unclear. We thus recommend that any intercurrent event that is ignored for a treatment policy strategy should still be pre-specified in the protocol and be systematically collected, somehow leading to a “protocol-defined treatment policy”.
Related, and also not specific to T2E endpoints, Hernan and Robins address issues that may arise from uncritical reliance on the intention-to-treat principle in pragmatic trials [23]. They provide guidance on how per-protocol analyses can be used to overcome these issues.
Finally, if applicable, a principal stratum strategy resolves the problem that the occurrence of an intercurrent event is generally associated with treatment. This advantage of the principal stratum strategy comes at the cost of the difficulty of identifying the patients in each stratum. This membership can typically only be inferred through suitable covariates, unless the intercurrent event is completely unrelated to treatment, but then the “random censoring” assumption would also be met, making a hypothetical estimand straightforward to estimate. An example is discussed by Shepherd et al. [24] and an estimator of that estimand is proposed by Chiba [25]. An introduction into principal stratification and how it allows for causal inference is provided by Rubin [26]. Rubin discusses the issues that come with intercurrent events using the term “intermediate outcome variables” for intercurrent events, and the concepts are not just limited to T2E endpoints. He specifically looks into “truncation by death”, so precisely the scenario we discuss above.
In general, the choice made for each potential intercurrent event for a T2E endpoint, i.e. making it part of a composite, treating it as competing, being interested in a hypothetical world where it would not happen, ignoring it, or applying the principal strata strategy, requires intensive discussions about the actual precise trial objective and the implied estimand.
We recommend documenting these choices and the resulting estimands, as well as the data analysis strategy, in trial protocols, statistical analysis plans (SAP), and publications, e.g. in tabular form similar to Table 2 in Bellera et al. [27].
For illustration, we provide an oncology case study in Section 7.
In this section, we first discuss our view on the connection between statistical hypothesis testing and effect quantification in the estimand context. The addendum specifies in Section A.1. as part of its scope that it “...presents a structured framework to link trial objectives to a suitable trial design and tools for estimation and hypothesis testing.” However, in what follows the addendum does not explicitly discuss the connection between testing and estimation, and the paragraph on the summary measure clearly focuses on estimation, not testing. This may be because for the endpoint type that initiated the addendum, continuous data collected over time with intercurrent events, the statistic used for testing and the effect estimate are closely linked and allow for a consistent interpretation. Depending on how the population survival functions relate to each other, this link may break down for the most commonly used methods to analyze T2E endpoints, namely the logrank test and CPHR: If we have non-proportional hazards, the estimand implied by the estimate based on CPHR depends on the censoring distribution, and might thus vary from one trial to the next [28, 29]. Furthermore, since the logrank test is intimately connected with CPHR, the estimand implied by the logrank test suffers from the same deficiency. We will provide a discussion of these aspects in Section 4. The potential disconnect between testing and estimation in certain cases offers two distinct ways of making the summary measure component of the estimand more specific:
1. Either one insists that testing and estimation are consistent. The implication of this approach would be that for T2E endpoints, methods such as logrank test and CPHR would potentially need to be replaced by alternative tests and estimators that are robust against violation of the PH assumption, see Section 4.
2. Or one allows for a two-stage procedure: In a first stage, a valid hypothesis test is performed which serves as a gatekeeper.
“Valid” always refers to maintaining type I error. If the null hypothesis under consideration is rejected and the effect estimate points in the right direction, then the trial is considered a success. The analysis then goes on in a second stage to quantify the effect by which the experimental treatment is “better” than the control. This effect quantifier would not necessarily have to be connected to the test statistic of the gatekeeper test.
Based on our experience with several Health Authorities (HA), once the null hypothesis has been rejected using a valid hypothesis test, it is possible to choose one or more summary measures which might differ from the one that corresponds to the performed hypothesis test, even if the chosen summary measure would not reject its associated null hypothesis. Akacha et al. [30] briefly discuss this aspect in Section 5.7 as well. So it appears that at least implicitly, the second approach above is acceptable to HAs.
In what follows, we discuss effect quantifiers for a T2E endpoint that are simple, well-known, correspond to a clear estimand, and can easily be computed from available data. A discussion of alternatives to the hazard ratio under more general assumptions is provided in Section 4.
Recall that in a two-arm RCT with a T2E endpoint, the necessary number of events is typically determined such that a two-sided logrank test of the null hypothesis $H: S_{\text{control}} = S_{\text{experimental}}$ has the pre-specified power for an assumed alternative hypothesis, if all assumptions are met [31]. Here, $S_i(t) = P(T_i > t)$ is the survival function in treatment arm $i \in \{\text{control}, \text{experimental}\}$, with $T_i$ the corresponding survival time, a non-negative random variable, and $t \geq 0$. If $H$ is rejected using a valid test and the effect estimate points in the right direction, the trial is considered a success and the effect has to be quantified.
One candidate to use is an estimate of the difference $(S_{\text{experimental}} - S_{\text{control}})(t)$ at some timepoint $t$ (“milestone survival”) as our summary measure after passing the gatekeeper. Further potential effect quantifiers that describe and summarize the observed data are median or any other quantile difference, or difference in restricted mean survival time [32], the latter being especially relevant in pharmacoeconomic applications, see e.g. [33]. Instead of differences, ratios or even odds ratios can be based on $S_{\text{experimental}}$ and $S_{\text{control}}$. All these estimands rely on an estimate of the survival function, such as Kaplan-Meier, or one based on a parametric assumption. The assumptions for the chosen estimator, e.g. about censoring and the suitability of the parametric model, need to be checked and any summary measure should be accompanied by a quantification of uncertainty, most likely a confidence interval. Whether and how some of these estimands are amenable to a causal interpretation is discussed in [34].
The example discussed in Section 6 allows for an illustration of the interplay between intercurrent event and summary measure in general. We thus defer the discussion on this aspect of the summary measure to that section.
When developing a clinical trial, in line with the recommendations by the National Research Council [35], the order of what needs to be discussed is
1. trial objective,
2. estimand,
3. study design,
4. data collection and handling strategy,
5. estimator.
As a reviewer pointed out, if an intercurrent event is not terminal, then one may choose to truncate the T2E variable, i.e. the estimand, at this intercurrent event, and this might often be referred to as “censoring”.
However, we propose to leave the word “censoring” to activities related to the estimator, and use “truncation” instead for the estimand definition.
Censoring is thus part of the estimator definition and describes how to handle data that is only partially observed for the T2E variable, either because an intercurrent event happened before the event defining the variable or the trial ended.
The event end of trial plays a special role in the context of censoring. Note that with “end of trial” we imply analysis at any pre-specified clinical cutoff, e.g. also when performing an interim analysis. This cutoff is typically pre-specified at baseline [36], which implies that the time until the clinical cutoff date is definitely independent of the time-to-event, i.e. we have random censoring. Using an estimator based on simple right-censoring, generally referred to as administrative censoring, is thus uncritical [21].
An estimator involving censoring at an intercurrent event can be foreseen for two of the five strategies described in Section 2.3:
• If the estimand specifies a “while on treatment strategy” we consider the situation of an event of interest and a competing event. It is important that in this case, the estimator is aligned with the summary measure attribute of the estimand: hazard-based inference is unbiased for each event-specific hazard separately when censoring at the intercurrent event(s), see e.g. Beyersmann et al. [37]. A simple Kaplan-Meier type estimator for the survival curve of the event of interest that simply censors at the competing intercurrent event(s) is biased though [21]. To get probability statements, one has to look at cumulative incidence functions, or risk quantifiers based on subdistribution hazards [38]. We refer to Geskus [39] and Unkel et al. [14] for a discussion of these aspects.
• If the estimand specifies a hypothetical strategy, simple right-censoring at the intercurrent event provides unbiased estimates if we e.g. assume random censoring, as discussed in Section 2.3.
Andersen [40] and Allignol et al. [21] discuss the subtle difference between random and independent censoring. Andersen explains that precise mathematical formulations of independent censoring may be given and that frequently used models for the generation of right-censored data satisfy these conditions. However, the conditions may be impossible to verify for actual data. Interpreted within the addendum discussion, this re-iterates the need for sensitivity analyses when assumptions on censoring mechanisms, and thus estimators to estimate the above estimands, are specified, as also pointed out by Unkel et al. [14].
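The difference between a Kaplan-Meier type estimator that censors at a competing intercurrent event and the cumulative incidence function can be illustrated with a minimal sketch. The code below is not taken from the paper: the function names `one_minus_km` and `cumulative_incidence`, the cause coding (0 = censored, 1 = progression, 2 = death without progression), and the ten-patient data set are hypothetical and chosen only for illustration. The cumulative incidence is computed with the standard Aalen-Johansen type estimator.

```python
import numpy as np


def one_minus_km(time, cause, cause_of_interest=1):
    """1 - Kaplan-Meier for one cause, censoring at every other cause."""
    time = np.asarray(time, dtype=float)
    cause = np.asarray(cause, dtype=int)
    event = cause == cause_of_interest
    times = np.unique(time[event])
    surv, out = 1.0, []
    for t in times:
        at_risk = np.sum(time >= t)
        d = np.sum(event & (time == t))
        surv *= 1.0 - d / at_risk
        out.append(1.0 - surv)
    return times, np.array(out)


def cumulative_incidence(time, cause, cause_of_interest=1):
    """Aalen-Johansen type estimate of the cumulative incidence function."""
    time = np.asarray(time, dtype=float)
    cause = np.asarray(cause, dtype=int)
    times = np.unique(time[cause > 0])
    surv_prev, cif, out = 1.0, 0.0, []  # surv_prev: free of any event just before t
    for t in times:
        at_risk = np.sum(time >= t)
        d_any = np.sum((cause > 0) & (time == t))
        d_k = np.sum((cause == cause_of_interest) & (time == t))
        cif += surv_prev * d_k / at_risk
        surv_prev *= 1.0 - d_any / at_risk
        out.append(cif)
    return times, np.array(out)


# Hypothetical toy data without censoring: progression (1) and death (2) alternate.
time = np.arange(1, 11)
cause = np.array([1, 2] * 5)

naive_final = one_minus_km(time, cause)[1][-1]
cif_final = cumulative_incidence(time, cause)[1][-1]
print(naive_final, cif_final)
```

In this toy data set half of the patients progress, so the cumulative incidence of progression ends at 0.5, whereas one minus the Kaplan-Meier estimate that censors at death ends at about 0.75. The latter is therefore not a probability of progression in a world where death can occur.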
For T2E endpoints, the logrank test is overwhelmingly used as a gatekeeper to test H. For the logrank test to be valid, the following assumptions about the data are made: independent censoring, the survival probabilities are the same for subjects recruited early and late in the trial, and the events happened at the times specified [41]. While its power is maximal if the hazard functions corresponding to the assumed survival curves are indeed proportional, the logrank test maintains type I error, and is thus valid, also under NPH. This is because under H we assume identical survival curves, so that the PH assumption trivially holds. The PH assumption is only used to define the alternative and thus the nature of the test statistic. All this justifies the important role of the logrank test as a gatekeeper for testing H for a T2E endpoint in a regulatory context.
However, while the logrank test can be developed as a hypothesis test comparing the standardized difference between expected and observed number of events, the fact that it corresponds to the score test in a CPHR, see e.g. Section 3.9 in Collett [42], implies that the estimand connected to the logrank test is the same as the one for CPHR, i.e. relies on the PH assumption if we deviate from H. So, under NPH, it is a priori not clear to what estimand the logrank test corresponds.
Now, under NPH and if there is no censoring, Xu and O’Quigley [43] derive that the estimate from a CPHR can still be interpreted as a regression effect suitably averaged over time, i.e. has an interpretable estimand. However, if there is right-censoring, Struthers and Kalbfleisch [44] already showed that the estimand targeted by CPHR is defined implicitly through an estimating equation that depends on the censoring distribution.
As discussed in Nguyen and Gillen [45], this implies that in the presence of right censoring, the estimand that is estimated by CPHR under NPH depends on the censoring pattern of the actual data, which leads to an estimand that varies with the censoring distribution. The dependence of the estimand value on the censoring distribution is nicely illustrated by Nguyen and Gillen ([45], Figure 2). As a consequence, the estimand targeted by CPHR becomes trial-specific, as it is virtually impossible to replicate the censoring distribution from one trial to the next, even if only administrative censoring is present. This is because the censoring distribution at least depends on the accrual pattern and the trial length, which are typically outside of the trial sponsor’s control. Or, as Boyd et al. [28] put it, “...the usual unweighted Cox estimator will be consistent for a parameter that is dependent upon patient accrual and dropout patterns that bear no relevance to the scientific objectives of a clinical trial.” This quote nicely relates to the estimand discussion, as the primary goal of the addendum is to align scientific objective, estimand, and estimator.
As discussed in Section 2.3, even in the absence of intercurrent events, we typically have to deal with at least administrative censoring in a trial with a T2E endpoint. Xu and O’Quigley [43] and Boyd et al. [28] provide estimators of an average regression effect for the continuous time CPHR model, the latter under more general assumptions on censoring. Extending the work of Struthers and Kalbfleisch [44], these estimators target an estimand that is again only implicitly defined. However, Xu and O’Quigley [43] nicely show that the estimator approximates the population average effect $\int \beta(t)\, dF(t)$, where $\beta(t)$ is the time-varying regression coefficient, and $F$ is the marginal distribution of the failure times. Nguyen and Gillen [45] provide similar results for a discrete hazard model.
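The dependence of the CPHR estimand on the censoring distribution under NPH can be made concrete with a small simulation sketch. Everything in it is an assumption made for illustration and is not taken from the paper or from the cited references: the piecewise constant hazard in the experimental arm (0.4 before time 0.5, 1.2 afterwards, against a constant hazard of 1 in the control arm), the two administrative cutoffs, the sample size, the seed, and the helper `cox_log_hr`, a plain Newton-Raphson maximizer of the Cox partial likelihood for a single binary covariate that assumes no tied event times.

```python
import numpy as np


def cox_log_hr(time, event, z, n_iter=25):
    """Cox partial-likelihood estimate of the log hazard ratio for binary z."""
    order = np.argsort(time)
    event = np.asarray(event, dtype=float)[order]
    z = np.asarray(z, dtype=float)[order]
    beta = 0.0
    for _ in range(n_iter):
        w = np.exp(beta * z)
        # Reverse cumulative sums give the sums over each risk set {j: time_j >= time_i}.
        s0 = np.cumsum(w[::-1])[::-1]
        s1 = np.cumsum((w * z)[::-1])[::-1]
        zbar = s1 / s0
        score = np.sum(event * (z - zbar))
        info = np.sum(event * zbar * (1.0 - zbar))  # valid because z is 0/1
        beta += score / info
    return beta


rng = np.random.default_rng(2018)
n = 2000  # patients per arm (hypothetical)
z = np.repeat([0.0, 1.0], n)
e = rng.exponential(size=2 * n)  # unit-exponential draws, inverted below

t_control = e[:n]  # constant hazard 1
e_exp = e[n:]
# Experimental arm: hazard 0.4 on [0, 0.5), hazard 1.2 afterwards (non-proportional).
t_experimental = np.where(e_exp < 0.2, e_exp / 0.4, 0.5 + (e_exp - 0.2) / 1.2)
t = np.concatenate([t_control, t_experimental])


def hr_with_cutoff(cutoff):
    """Hazard ratio estimate when all patients are administratively censored at cutoff."""
    observed = np.minimum(t, cutoff)
    event = t <= cutoff
    return float(np.exp(cox_log_hr(observed, event, z)))


hr_short = hr_with_cutoff(0.5)  # early clinical cutoff
hr_long = hr_with_cutoff(3.0)   # late clinical cutoff
print(hr_short, hr_long)
```

Both estimates are computed from the same simulated event times, and only the administrative cutoff differs. With the early cutoff only the period with hazard ratio 0.4 is observed, whereas the late cutoff also includes the period with hazard ratio 1.2, so the two Cox estimates target visibly different quantities.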
The key properties of these estimators are that they are equal to the one fromCPHR under PH ([43], Section 5), but they are robust against the censoring distribution.This means they are consistent for the same estimand as CPHR if the PH assumptionsis true, and they estimate a quantity that is independent of the censoring distributionin case of NPH.In conclusion, by replacing the logrank test and the estimator based on the CPHR bye.g. the methods introduced by Xu and O’Quigley [43] or Boyd et al. [28], it wouldbe possible to remove the conceptual problems discussed above, i.e. one would get ahypothesis test and effect quantifier that • have a clear estimand • which would be (asymptotically) independent of the censoring distributionindependently of the PH assumption. In this section we focus on heterogeneity of endpoint definitions across studies, andillustrate the implications of this heterogeneity.In the terminology of the addendum, estimand heterogeneity for a T2E endpoint pri-marily affects the interplay between intervention effect and variable , i.e. which clinicalevents are considered for one or the other, or ignored, make the endpoint incompletelyobserved, or treated as competing. The lack of common definitions for T2E endpointshas previously been recognized in the literature, see Peto et al. [6] for the earliest account11o the best of our knowledge, including recommendations how to define and report T2Eendpoints. Still, the review by Altman and Bland [7] finds frequent failure to specifywhether non-cancer deaths were treated as events or censored, or how deaths withoutrelapse were considered in the definition of the endpoint. They re-iterate the recom-mendation to give a clear definition of the “time origin, the event of interest and thecircumstances where survival times are censored” for each endpoint considered. Morethan another decade later, Mathoulin-Pelissier et al. 
still found that about half of the reviewed articles published in major clinical journals failed to provide a clear definition of the T2E endpoint, and 68% reported insufficient information on the survival analysis [8].

Bellera et al. state that “Most of these T2E endpoints currently lack standardised definition enabling a cross comparison of results from different clinical trials” [5], clearly pointing to the need of a discussion of estimands also in this context, although the authors of the latter paper do not use the term. Even for a binary endpoint, Kahan and Jairath [46] also find marked differences in estimated odds ratios, depending on endpoint definition.

An example illustrating the lack of a clear endpoint definition is discussed by Birgisson et al. [47]. For colorectal cancer, these authors find that the inclusion of second primary cancer as an event in the definition of DFS relevantly alters effect estimates. They recommend that researchers and journals must clearly define T2E endpoints in all trial protocols and published manuscripts. A similar analysis has been performed for colon cancer [48].

DFS is also the primary end point for many large adjuvant breast cancer trials. The results of such trials, if positive, will likely change clinical practice, refer to Minckwitz et al. [49] for a recent example. Historically, which of the five addendum strategies is applied to a given clinical event has been inconsistent. The typical definition involved local, regional, and distant recurrence of the tumor, as well as death as events for the variable, DFS. Often inconsistently handled were

• initiation of treatment for contralateral breast cancer,
• second primary cancers, further depending on whether they were contralateral or nonbreast,
• and death not due to breast cancer.

Hudis et al. ([50], Table 1) give an overview and the paper provides recommendations reached by an expert group.
Building on this, the DATECAN initiative (see below, [51]) applied a formal international consensus method, to increase the use and acceptability in current practice of common definitions. This illustrates that work is still necessary and is being done to align the definition of the variable attribute in the estimand context. We hope that with the addendum this alignment will also carry over to the other estimand attributes.

The aim of the Definition for the Assessment of Time-to-event Endpoints in CANcer trials (DATECAN) initiative is to provide recommendations for standardised definitions of T2E endpoints [5] not only for breast but for many cancer indications: those for pancreatic [52], sarcomas and gastrointestinal stromal [27], breast [51], and renal [53] tumors have been published so far. Note that all these efforts are currently made without explicit referencing to the estimand framework in general, or the framework put forward by the addendum. The consensus defines how intercurrent events and the variable are recommended to be handled, but the other estimand attributes, population and summary measure, are left open.

The Steering Committee of the DATECAN project considered defining censoring rules to be a statistical rather than a clinical question [53]. As a result, censoring rules were not discussed during the DATECAN consensus process. Disentangling the estimand from the estimator is in line with our recommendation in Section 3, i.e. the clinical question is the definition of the estimand and it remains then a statistical challenge to estimate the targeted quantity. However, from the description in Bellera et al. [5], it remains unclear how much the decisions on the actual estimand made in the DATECAN initiative are actually derived from a clearly defined scientific objective.

Another initiative with the goal of improving reporting of clinical trials is the CONsolidated Standards Of Reporting Trials (CONSORT) statement [54]. Although CONSORT is aimed at structuring reporting of RCTs in general, many of the items on the CONSORT checklist aim at improving the definition and description of trial objectives, population, endpoints, and outcomes and estimation. So, the CONSORT statement actually covers a substantial portion of the four attributes that make up an estimand in the addendum, for any type of endpoint, not only time-to-event. The addendum can thus be seen as detailing the aspects pertaining to these CONSORT items further, and connecting them more directly to the trial objective.

Acknowledging the current state of heterogeneity in the definition of T2E endpoints, in what follows we would like to point out what this potentially implies in RCTs.

As discussed in Section 2.4, whether an RCT is considered a “success” generally depends on rejection or non-rejection of the considered null hypothesis. So the question of “statistical significance” carries substantial weight, and pronouncedly so for trials potentially leading to regulatory approval. Montalban et al. [3] provide an example for Multiple Sclerosis: The primary variable was time-to-initial disability progression (time-to-IDP), which had to be confirmed 12 weeks after IDP, and the associated logrank test was statistically significant. However, since not all patients come back 12 weeks after their initial assessment to have their progression confirmed, for the primary analysis it was assumed that all these patients progressed (Table S10 in [3]). An alternative definition of the endpoint, imputing half of the patients randomly as event and the other half as censored, was analyzed as part of a range of sensitivity analyses, yielding a non-significant logrank test. This example also provides a case study of a variable attribute that is assessed at different timepoints, implying that intercurrent events can also happen within the variable attribute itself.

Similarly, van Cutsem et al.
discuss colon cancer, with endpoint DFS in the PETACC-3 trial [55]. The primary definition in the trial protocol counted second primary cancer other than colon as an event, and with this definition, the trial is reported to be “non-significant” [55]. However, the evaluation of relapse-free survival (RFS) as defined in the PETACC-3 protocol (which corresponds to the definition of DFS in the MOSAIC trial, [56]!) showed a p-value of 0.02 and was called “statistically significant” [55]. Not surprisingly, estimates of event-free probabilities at 3 and 5 years are different for different variable definitions as well [57], see Bellera et al. for further details [5]. Note that the trial objective of “improving DFS” is not nearly specific enough to imply unambiguous definitions.

Meta-analyses will not provide reliable summary estimates if studies with different definitions of variables and/or intercurrent events are being compared [47].

“Dynamic borrowing” of historical data has gained some popularity recently [58], but that also raises the question of “compatibility” of the historical data used in this kind of trial, not only from a statistical perspective but also in terms of estimand definition.

In an analysis combining data from several trials in colon cancer to assess surrogacy of DFS for OS after various times of follow-up [59], among them PETACC-3 [57] and MOSAIC [56] for which DFS definition was different in terms of handling second tumors [55], the DFS definition given is: “DFS was defined as the time from randomisation to the first event of either disease recurrence or death due to any cause.” No discussion of how second tumors were handled can be found in this surrogacy assessment, thus the reader is left uncertain which estimand was actually targeted by the analysis. The authors further found that the previously established surrogacy of DFS after 2- and 3-year median follow-up for OS in Sargent et al. [60] was now modest to poor in this follow-up analysis based on six new trials not included in the previous analysis. The authors attribute this reduction in the amount of surrogacy based on the same amount of follow-up (2 and 3 years for DFS, 5 and 6 years for OS) as in Sargent et al. [60] to generally longer DFS and OS achieved in colon cancer over time through improved therapy and combinations.
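How much the handling of second tumors in the DFS definition can matter is easy to reproduce on toy data. The sketch below uses six entirely hypothetical patients (not data from PETACC-3, MOSAIC, or any other trial) and a plain Kaplan-Meier estimator. The dictionary keys and the helper names `derive_dfs` and `kaplan_meier` are assumptions made for illustration, and it is assumed that all recorded events occur no later than the end of follow-up.

```python
def derive_dfs(patient, second_primary_is_event):
    """Return (time, event) for DFS under one of two variable definitions."""
    components = [patient.get("recurrence"), patient.get("death")]
    if second_primary_is_event:
        components.append(patient.get("second_primary"))
    observed = [t for t in components if t is not None]
    if observed:
        return min(observed), True
    return patient["follow_up"], False  # administratively censored


def kaplan_meier(data, t_eval):
    """Kaplan-Meier estimate of the event-free probability at t_eval."""
    surv = 1.0
    for t in sorted({time for time, event in data if event and time <= t_eval}):
        at_risk = sum(1 for time, _ in data if time >= t)
        events = sum(1 for time, event in data if event and time == t)
        surv *= 1.0 - events / at_risk
    return surv


# Hypothetical patients; times in years since randomisation.
patients = [
    {"recurrence": 1.0, "follow_up": 4.0},
    {"second_primary": 2.0, "follow_up": 5.0},
    {"death": 2.5, "follow_up": 2.5},
    {"second_primary": 1.5, "recurrence": 4.0, "follow_up": 5.0},
    {"follow_up": 5.0},
    {"follow_up": 2.0},
]

for flag in (True, False):
    data = [derive_dfs(p, flag) for p in patients]
    print("second primary counted as event:", flag,
          "3-year event-free probability:", round(kaplan_meier(data, 3.0), 3))
```

For these toy records the estimated 3-year event-free probability is 0.25 when second primary cancers count as events and 0.625 when they are ignored, although both quantities would be reported as "DFS".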
One can only speculate on the possible contribution of heterogeneity in DFS variable definition to the lack of surrogacy in this case.

Effect estimates depending on the definition of an estimand may also complicate planning of future trials, as assumptions for the control arm in a new trial are often based on estimates of the treatment arm in previous trials.

To summarize, imprecise definitions may impact statistical significance, effect estimates, feasibility of meta-analysis, dynamic borrowing, and surrogacy assessments, and can bias assumptions for planning of future trials.

The addendum (Section A.4) emphasizes the need for construction of a suitable estimand when summarising data across trials. This means that to allow cross-trial summaries, the estimand in each single trial needs to be described in sufficient detail, to enable an assessment of which studies to combine and to quantify the “estimand heterogeneity”.

In this section, we illustrate how the same intercurrent event, initiation of new therapy in lymphoma in the absence of PD, might need to be considered differently for the same T2E endpoint, depending on the disease indication considered.

PFS is a commonly accepted regulatory endpoint in Non-Hodgkin Lymphoma (NHL), for both the indolent (follicular, FL) and aggressive (diffuse large B-cell, DLBCL) subtype [61]. FL is incurable and DLBCL is potentially curable, but patients failing to achieve a complete remission (CR) after initial induction treatment have a dismal outcome. This means that physicians might treat such a DLBCL patient with a new anti-lymphoma therapy (NALT) of their choice already prior to formal imaging-based determination of progression to the first-line therapy in case of non-CR (i.e. if they have either stable disease or partial response), in order to maximize the odds to still get the patient to CR.
So, NALT can be considered a potential intercurrent event for the endpoint of PFS.

For FL, in line with regulatory guidelines [19], PFS is generally defined as time from randomisation to the earlier of disease progression or death [1], and NALT is ignored for PFS, implying a treatment-policy estimand. Ideally, those NALTs that are allowed to be administered within the trial are protocol-specified, see the discussion in Section 2.3.

In DLBCL, PFS is often defined as for FL (e.g. in the GOYA trial, [62]), although the lymphoma-specific harmonization effort by Cheson et al. [61] indicates that “...in studies in which failure to respond without progression is considered an indication for another therapy, such patients should be censored at that point for the progression analysis.” The estimand corresponding to the definition of the variable PFS in this case would be a hypothetical one. The censoring rule implies that for patients who fail to respond, their PFS is imputed (terminology used by Fleming et al., [18]) based on patients that did respond, but had longer follow-up. Interestingly, in another highly cited paper it is recommended that “...patients should not be censored at the time other treatments are initiated when analyzing the PFS end point...” [18]. Here, the interest focuses on a treatment-policy estimand strategy. Given the conflicting recommendations given by regulatory guidelines and key opinion leaders, how should a researcher designing a trial using PFS as the primary endpoint proceed to define this endpoint? In the framework of estimands for a T2E endpoint outlined in Section 2, the definition of the variable that defines PFS appears unambiguously accepted: starting date is randomisation and the event defining the endpoint is the earlier of PD and death. To define the potential estimand we identify two potential intercurrent events:

• Failure to respond to induction therapy, but no PD (IC1).
• Initiation of NALT, but no PD (IC2).

Note that it is not entirely clear from the statement in Cheson et al. [61] whether these authors really think of two different intercurrent events when they write “...failure to respond” or whether they implicitly assume IC1 = IC2. In any case, to be as precise as possible, we consider both these events separately in our discussion, as it is not entirely implausible that a patient with partial response only at end of induction will not immediately receive NALT. A number of estimands that could be defined by treating IC1 and IC2 are summarized in Table 1. For the estimation of the estimands in Table 1 we propose to administratively censor at the last non-PD tumor assessment for the clinical cutoff, and also to censor at the intercurrent event whenever we consider a hypothetical estimand.

As for the summary measure, the choice depends on which of the two approaches described in Section 2.4 is preferred and whether the PH assumption is sensible in this indication. The preferred approach will likely evolve after the addendum has been finalized, and substantive knowledge of the DLBCL indication is needed to decide on the plausibility of the PH assumption. Whether the PH assumption seems appropriate or not depends on the choices made for the other three estimand attributes, most specifically the intervention effect. For simplicity, in Table 1 we make the PH assumption.

| Estimand | Population | Variable | Intervention effect | Summary measure | Comment |
|---|---|---|---|---|---|
| Option 1 | DLBCL patients, defined by list of in- and exclusion criteria | PFS | NALT: **treatment policy**; Failure to respond: **treatment policy**; Death: **composite** | logrank test and hazard ratio | Corresponds to definition in Fleming et al. [18]. Used in large RCT [62]. |
| Option 2 | as Option 1 | as Option 1 | NALT: **hypothetical**; Failure to respond: **treatment policy**; Death: **composite** | as Option 1 | |
| Option 3 | as Option 1 | as Option 1 | NALT: **treatment policy**; Failure to respond: **hypothetical**; Death: **composite** | as Option 1 | Corresponds to definition in Cheson et al. [61]. |
| Option 4 | as Option 1 | as Option 1 | NALT: **hypothetical**; Failure to respond: **hypothetical**; Death: **composite** | as Option 1 | |
| Option 5 (Event-free survival, EFS) | as Option 1 | PFS becomes EFS | NALT: **composite**; Failure to respond: **hypothetical**; Death: **composite** | as Option 1 | Similar to PFS, may be useful in evaluation of highly toxic therapies, only acceptable for HAs if NALT is supported by some “objective” evaluation of treatment failure in the absence of PD. Used as part of a surrogacy assessment [63]. |

Table 1: Potential PFS estimands for DLBCL, depending on how intercurrent events are considered. Estimand strategies according to the addendum are emphasized in bold.

Not all estimands in Table 1 may be of practical interest. Furthermore, Options 2-5 all contain at least one intercurrent event for which a hypothetical estimand strategy is proposed. For an estimator based on simple censoring at IC1 and/or IC2 to be unbiased, assumptions on the censoring mechanism, e.g. random or independent censoring, have to be made, see Section 3. The need to adjust for the potential dependence of the effect estimate on the intercurrent events via methods that can deal with, what they call, “informative censoring” led Fleming et al. to actually advocate Option 1 in this context [18]. However, the estimand framework in the addendum might potentially reverse that thinking: ideally, one first determines a trial objective which is then translated into estimand strategies to handle the intercurrent events. The last step is then to define an estimation strategy that is able to estimate the selected estimand, potentially at the price of added assumptions or the need for advanced methodology. In any case, we find that depending on how we consider the intercurrent events IC1 and IC2, we get different estimands for an endpoint that is commonly called “PFS”, and with Option 5 even one that is considered a different endpoint. EFS has lately been considered an endpoint for DLBCL as well [63].

For FL, the situation is different, as in this indolent disease, failure to achieve response after induction therapy does not necessarily trigger NALT.
Rather, physicians indeed wait with initiating NALT until the patient experiences progression, so that PFS as an endpoint is indeed more reflective of the actual clinical treatment of this disease.

The comparison between DLBCL and FL illustrates that even for two quite related diseases, for which often the same treatment is applied, defining a relevant estimand might not be straightforward and needs careful assessment of what a therapy under study should actually achieve, and what the precise trial objective is. Irrespective of which estimand is chosen, we strongly recommend that trial developers try to identify all potential intercurrent events upfront, and clearly indicate for the finally chosen estimand and each intercurrent event whether the latter is considered within a treatment policy, in the context of a hypothetical estimand, as part of a composite endpoint, or as competing. We recommend to make that transparent in a table similar to Table 1.

To conclude this section, and to finish the discussion on summary measure in Section 2.4, we would like to comment on the interplay between intervention effect and summary measure. The above debate on how to treat the intercurrent event “initiation of NALT” for PFS was settled by a general recommendation [18] to use a “treatment policy” strategy for this intercurrent event. One, if not the major, justification to favour the treatment policy estimand is to avoid having to censor at NALT, i.e. the insight that a hypothetical estimand would potentially be difficult to estimate. With this approach, the question of NALT is dealt with in the definition of the intervention effect. Now, instead of changing the estimand from hypothetical to treatment policy to avoid having to deal with assumptions for censoring when defining the estimator, one could equally well consider adjusting the summary measure, e.g. by applying inverse probability of censoring weighting (IPCW), see e.g. [64] or [65].
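The idea behind IPCW can be sketched in a few lines. The example below is a deliberately simplified illustration and not the procedure of [64] or [65]: it assumes a single binary baseline factor `x` (hypothetical, e.g. non-CR at end of induction) that explains initiation of NALT, estimates the probability of remaining NALT-free by a Kaplan-Meier estimator within each level of `x`, and ignores time-varying confounders. The data and the function names `nalt_free_prob` and `ipcw_pfs` are invented for illustration.

```python
def nalt_free_prob(records, stratum, t):
    """Kaplan-Meier probability, within one stratum of the baseline factor,
    of not having started NALT strictly before time t."""
    rows = [r for r in records if r["x"] == stratum]
    prob = 1.0
    for u in sorted({r["time"] for r in rows if r["status"] == "nalt" and r["time"] < t}):
        at_risk = sum(1 for r in rows if r["time"] >= u)
        started = sum(1 for r in rows if r["status"] == "nalt" and r["time"] == u)
        prob *= 1.0 - started / at_risk
    return prob


def ipcw_pfs(records, t_eval):
    """IPCW Kaplan-Meier estimate of P(PFS > t_eval) had nobody started NALT.

    Patients are censored at NALT; those still at risk are up-weighted by
    the inverse probability of remaining NALT-free in their stratum.
    """
    surv = 1.0
    event_times = sorted({r["time"] for r in records
                          if r["status"] == "pfs_event" and r["time"] <= t_eval})
    for t in event_times:
        weighted_at_risk = weighted_events = 0.0
        for r in records:
            if r["time"] < t:
                continue
            weight = 1.0 / nalt_free_prob(records, r["x"], t)
            weighted_at_risk += weight
            if r["status"] == "pfs_event" and r["time"] == t:
                weighted_events += weight
        surv *= 1.0 - weighted_events / weighted_at_risk
    return surv


# Hypothetical data: time to first of PFS event, NALT, or administrative
# censoring; x = 1 marks a poor-prognosis baseline factor.
records = [
    {"time": 1, "status": "nalt", "x": 1},
    {"time": 2, "status": "pfs_event", "x": 1},
    {"time": 3, "status": "nalt", "x": 1},
    {"time": 4, "status": "pfs_event", "x": 1},
    {"time": 5, "status": "pfs_event", "x": 0},
    {"time": 6, "status": "admin", "x": 0},
    {"time": 7, "status": "admin", "x": 0},
    {"time": 8, "status": "pfs_event", "x": 0},
]
pooled = [dict(r, x=0) for r in records]  # one stratum: weights cancel

print("IPCW estimate at t = 4.5:", round(ipcw_pfs(records, 4.5), 3))
print("Simple censoring at NALT:", round(ipcw_pfs(pooled, 4.5), 3))
```

With a single stratum the weights cancel and the estimator reduces to the ordinary Kaplan-Meier estimate with censoring at NALT (24/35 in this toy data set). Once NALT is concentrated in the poor-prognosis stratum, the weighted estimate drops to 0.5, because the remaining poor-prognosis patients also stand in for those censored at NALT.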
This method allows, again under some but now different assumptions (no unmeasured confounders, [66]), to estimate the survival function and consequently the treatment effect if no patient had started NALT, so again a hypothetical estimand strategy. Alternatively, an estimand defined through a rank-preserving structural failure time model (RPSFT, [67]) assesses the counterfactual event time of a patient, i.e. again the T2E if no NALT were received [65]. This can equally well be considered a hypothetical estimand. These considerations illustrate that various estimators can be defined for the same estimand. Interestingly, methodology to estimate a hypothetical estimand, such as IPCW and RPSFT, was developed long ago in the context of causal inference and has been successfully applied in many instances, especially in epidemiological applications. We anticipate that the addendum provides a framework and a common language that will facilitate and foster implementation of such advanced estimation methodology, because these are made necessary by the trial objective and the estimand derived from it. Furthermore, it is important to note that the “no unmeasured confounder” assumption for IPCW requires that (as much as possible) data on prognostic factors that explain initiation of NALT are collected, by pre-specifying these factors in the data collection and handling strategy, according to the list in the National Research Council’s report [35], see Section 2.3. Having a hypothetical estimand defined upfront will shift focus of a sponsor to collecting necessary information during the trial, another advantage of the estimand framework, as outlined in the addendum (Section A.1). Already Watkins et al. [65] emphasize that “...Often, only limited data are collected after the patient experiences the primary regulatory endpoint, which can mean important time-varying confounders are missed or that switch dates or time on/off treatment cannot be accurately defined. Careful
Carefulupfront planning is required.” We expect such planning to become more standard afterthe addendum is in place. The GALLIUM trial [1] assessed whether replacing the anti-CD20 antibody rituximabby a second generation anti-CD20 antibody, obinutuzumab, increases PFS. The trial wasunblinded after it crossed the pre-specified significance level at a pre-planned efficacyinterim analysis and was fully analyzed. The trial randomized 1202 FL patients andan additional 195 marginal-zone lymphoma (MZL) patients. The latter cohort can beconsidered a Phase 2 trial within the same protocol, whereas the first constituted theprimary analysis population. Due to regional heterogeneity in standard of care, trial siteshad to select one of three chemotherapy backbone therapies. Chemotherapy as well as
Follicular Lymphoma International Prognostic Index (FLIPI1, [68]), a prognostic score used in FL, were used as stratification factors in the analysis.

Table 2 provides details on the primary analysis and a list of all sensitivity analyses (with term “sensitivity” referring to the pre-addendum meaning in the GALLIUM protocol) reported in the clinical study report (CSR) for the primary endpoint, PFS as assessed by the investigator (Inv-PFS). We provide columns for each of the four estimand attributes introduced in the addendum. As almost all the analyses in Table 2, and also Table 3, are targeting different estimands than the primary by varying at least one of the four attributes, in the post-addendum language these are in fact all supplementary analyses. The only exceptions are variations of the primary analysis by considering an unstratified instead of a stratified logrank test and a re-randomisation test. These truly vary the underlying assumptions of the primary estimand. We have still added these variations to the summary measure column for simplicity, but they target the primary estimand, as indicated in the first column of the tables.

Depending on the endpoint, the intercurrent events we consider are death, NALT, progression, withdrawal, drop-out, discontinuation of trial treatment, and missed scheduled response assessment. While all these were systematically collected on the electronic case report form, the list of “allowed” NALTs was not pre-specified in the protocol, because NALT was not foreseen prior to observing the actual endpoint, PFS. So, as discussed in Section 2.3, one can argue that it is not entirely clear what precise objective, or effect quantification, corresponds to a treatment policy estimand strategy applied to the intercurrent event NALT. However, for illustrative purposes we ignore this aspect in our case study, i.e.
if in the tables below we discuss ignoring NALT for an estimand we call this treatment policy.

Table 3 then continues to describe all the T2E endpoints listed on https://clinicaltrials.gov/ct2/show/NCT01332968.

In the original protocol, SAP, and CSR, these analyses were provided as a long list and have been selected based on substantive considerations, regulatory guidelines [19], experiences from previous trials, and feedback from HAs on studies in the same development program. In the estimand framework, these are supplementary analyses. We would like to emphasize that this is an after-the-fact exercise with the aim of learning how to structure an estimand and analysis plan in the future based on the addendum.

Note that the primary and some of the secondary T2E endpoints were evaluated in the FL and overall, i.e. FL + MZL, population. For simplicity, we focus on the FL population only in both Tables. Adding the overall population to the lists would be a simple variation of the population attribute.

The estimand definitions in Table 2 are complemented by the following data analysis strategy:

• Clinical cutoff for PFS: administratively censor at the last non-PD tumor assessment.
• Primary: PDs were collected after NALT, so the treatment policy strategy could be applied. However, data on PD was not routinely collected after drop-out and withdrawal. As a consequence, estimation was based on censoring at these intercurrent events, implying a hypothetical estimand.
• Supplementary 1: Withdrawals prior to PD: consider event at next scheduled disease assessment date in obinutuzumab arm, censored at last disease assessment for rituximab.
• Supplementary 2: Missed assessment prior to PD or clinical cutoff date: considered event at day after last response assessment.
• Supplementary 3: NALT prior to PD: censor at NALT.
• Supplementary 4: Discontinuation of trial treatment for other reasons than PD/death: consider event at time of discontinuation.
• Supplementary 5: Death ≥

| Analysis | Population | Variable | Intervention effect | Summary measure | Rationale |
|---|---|---|---|---|---|
| Primary | FL patients, defined by list of in- and exclusion criteria | Inv-PFS | NALT: **treatment policy**; Drop-out, withdrawal: **hypothetical**; Death: **composite** | unadjusted hazard ratio and logrank test, stratified by chemotherapy and FLIPI1 | Composite: interest would be in time-to-progression, but we make “death” part of a composite. |
| “Primary” | as primary | IRC-PFS | as primary | as primary | |
| Sensitivity 1 | as primary | as primary | as primary | unadjusted hazard ratio and logrank test, unstratified | |
| Sensitivity 2 | as primary | as primary | as primary | unadjusted hazard ratio and logrank test, stratified, using re-randomisation | Assesses sensitivity of stratified logrank test to dynamic randomisation procedure, see e.g. [69]. |
| Supplementary 1 | as primary | as primary | Withdrawals prior to PD: **composite** for obinutuzumab, **hypothetical** for rituximab | as primary | Assess impact of loss to follow-up. |
| Supplementary 2 | as primary | as primary | Missed assessment prior to PD or clinical cutoff: **composite** | as primary | Assess impact of missed assessments. |
| Supplementary 3 | as primary | as primary | NALT prior to PD: **hypothetical** | as primary | Assess potential confounding of the treatment effect estimates by subsequent therapy. See also Section 6. |
| Supplementary 4 | as primary | as primary | Discontinuation of trial treatment for other reasons than PD/death: **composite** | as primary | |
| Supplementary 5 | as primary | as primary | Death ≥ **hypothetical** | as primary | |

Table 2: List of original sensitivity analyses from GALLIUM protocol. “IRC-PFS” stands for PFS as assessed by independent review committee, see the discussion below. The column Analysis uses the post-addendum terms. Estimand strategies according to the addendum are emphasized in bold.

The primary analysis yielded an estimated hazard ratio of 0.66, with p-value 0.0012 and 95% confidence interval from 0.51 to 0.85 [1]. All but Supplementary Analysis 1 were sufficiently consistent with this result.

Assembling Table 2 provided the following insights that may be helpful in designing future studies or analysis plans:

• It is not entirely obvious how to position IRC-PFS. The primary endpoint of the trial was Inv-PFS, meaning that the timing of the clinical cutoffs for interim analyses was based on the number of events for Inv-PFS, and the p-value for this endpoint was primarily considered by the independent data monitoring committee (iDMC) when making their recommendation to either stop or continue the trial at any interim analysis. However, the iDMC charter also asked the iDMC to ascertain that point estimates for Inv- and IRC-PFS were “consistent”. Furthermore, the protocol stated that “Although the primary efficacy endpoint is the investigator-assessed PFS, PFS based on IRC assessments will also be analyzed to support the primary analysis. In the United States, IRC-assessed PFS will be the basis for regulatory decisions.”
For that purpose, all the analyses in Table 2 had been repeated for IRC-PFS and, given their importance for regulatory purposes and since we only alter one aspect of the primary estimand (assessment method for PD), we consider these still sensitivity analyses for Inv-PFS. However, one could argue that these should rather be deemed supplementary in post-addendum language, as they could be considered targeting an alternative estimand.
• Each sensitivity analysis only modifies one aspect of the primary analysis at a time, in line with the recommendation in the addendum.
• For Supplementary Analysis 1, a different strategy for the intercurrent event is used depending on which arm the patient who withdrew was randomized to, so this can be considered a “worst case” approach. In hindsight, the scientific objective, the implied estimand, and the estimated treatment effect based on it are thus difficult to interpret, to say the least.
• For the hypothetical estimand in Supplementary Analysis 3, the analysis strategy specifies censoring at initiation of NALT, as in Section 6. We thus estimate PFS assuming that those patients who received NALT are comparable to those that did not need NALT. As discussed in Section 3, if the corresponding effect estimate should be unbiased we need to make e.g. the assumption of random censoring, but based on the discussion by Fleming et al. [18], this assumption is unlikely to hold. As discussed in Section 6, IPCW or RPSFT would be (complex) options to estimate a hypothetical estimand.
• In Section 2.4 we discuss that one could argue to split the summary measure in two separate attributes, one for the hypothesis test and one for the effect estimate. In this context it is noteworthy that Sensitivity Analysis 2 indeed only varies the hypothesis test, but not the effect estimate.
• Finally, the choice of the summary measures implies that PH was assumed for all the estimands in Table 2, and also Table 3. For the primary endpoint, Inv-PFS, this assumption was based on results of the predecessor trial PRIMA [70]. The treatment arm from PRIMA became part of the control arm of GALLIUM, and estimates of the survival function of the treatment arm in PRIMA revealed a remarkably constant hazard of an Inv-PFS event over time, a feature regularly observed in FL. The trial team thus assumed the same shape for the hazard function in the experimental arm in GALLIUM, leading to the PH assumption. And in fact, the analysis of GALLIUM indeed showed quite constant estimated hazard functions and thus the PH assumption seemed justified also in hindsight. Interestingly, the average regression effect as proposed by Xu and O’Quigley [43] was estimated to be 0.68 (computed using the R [71] package coxphw [72]), so quite close to the estimated primary analysis hazard ratio.

Strictly speaking, the above rationale justifying the PH assumption based on an earlier trial exclusively applies to Inv-PFS in GALLIUM. Similar considerations for all the other analyses in Table 2 and 3 had not been made at the time.

The GALLIUM trial also reported a set of results on secondary endpoints [1]. These are listed in Table 3 and, in post-addendum language, these would be “supplementary”. It is remarkable that all these endpoints, i.e. OS, Time to NALT (TTNALT), EFS, DFS, and duration of response (DOR) are basically obtained by varying handling of clinical events for the variable and intervention effect, as well as the population attribute for DFS and DOR. The analysis strategies for the estimands in Table 3 are the same as for Table 2 and in addition:

• Clinical cutoff for OS and TTNALT: administratively censor at date last known alive.
• Supplementary 7: The protocol specified that after PD, patients were to be followed up for NALT and death, allowing to use a treatment policy strategy for PD for TTNALT.
Analysis | Population | Variable | Intervention effect | Summary measure | Rationale
Primary | FL patients, defined by list of in- and exclusion criteria | Inv-PFS | NALT: treatment policy; Drop-out, withdrawal: hypothetical; Death: composite | unadjusted hazard ratio and logrank test, stratified by chemotherapy and FLIPI1 | Composite: interest would be in time-to-progression, but we make “death” part of a composite.
Supplementary 6: Overall survival | as primary | OS | PD, NALT: treatment policy; Drop-out, withdrawal: hypothetical | as primary |
Supplementary 7: Time to NALT | as primary | TTNALT, time from randomisation to death or NALT | PD: treatment policy; Drop-out, withdrawal: hypothetical; NALT, death: made part of a composite | as primary |
Supplementary 8: Event-free survival | as primary | EFS, time from randomisation to Inv-PFS event or NALT | Drop-out, withdrawal: hypothetical; NALT, death: made part of a composite | as primary |
Supplementary 9: Disease-free survival | Patients with CR prior to NALT | DFS, time from first occurrence of CR to Inv-PFS event | as primary | unadjusted hazard ratio, stratified by chemotherapy and FLIPI1 | Non-randomized comparison between arms, no hypothesis test performed.
Supplementary 10: Duration of response | Patients with PR or CR prior to NALT | DOR, time from first occurrence of PR or CR to Inv-PFS event | as primary | unadjusted hazard ratio, stratified by chemotherapy and FLIPI1 | Non-randomized comparison between arms, no hypothesis test performed.
Table 3: List of supplementary analyses from GALLIUM protocol. “PR” stands for “partial response”. The column Analysis uses the post-addendum terms. Estimand strategies according to the addendum are emphasized in bold.

Discussion
In this paper, we have outlined an estimand framework for a T2E endpoint and discussed all four estimand attributes defined in the addendum as they apply to a T2E endpoint. We illustrate that these attributes cannot be seen independently, but have to be considered inter-related. Intercurrent events often make measurements of a T2E endpoint incomplete, or even impossible to observe in the case of an intercurrent event of death. Depending on the targeted estimand, various approaches can be used to estimate it. For a hypothetical estimand, simple censoring at the intercurrent event may often yield a potential estimator, but at the price of the strong assumption that the endpoint for those patients experiencing the intercurrent event can be “imputed” using data from patients with longer follow-up who did not experience the intercurrent event. Alternatively, methods developed in the causal inference literature can be used to estimate a hypothetical estimand, making alternative assumptions and necessitating data collection on factors that are prognostic and predict the intercurrent event, e.g. a treatment switch [73].
As summary measures, the logrank test and the hazard ratio based on the CPHR are overwhelmingly used today in the routine analysis of trials with T2E endpoints. However, both these methods are intimately connected to the PH assumption. While the PH assumption may often be made based on previous or external knowledge, as e.g. in GALLIUM in Section 7, there are indications where it is clear that it generally does not apply, e.g. immuno-oncology [74]. Still, trials are often powered based on the logrank test and the pre-specified effect quantifier is based on CPHR. This calls for a choice between the two potential ways of moving forward in the context of the addendum as outlined in Section 2.3.
• Making testing and estimation consistent even under NPH would methodologically be possible, using e.g. one of the estimators discussed in Section 4.
The trade-off would be a major logistical and educational effort for all parties involved, i.e. rewriting of programming and reporting templates and education of statisticians, clinicians, HAs, and the broader scientific community on the application and interpretation of these methods. Also, if, as an example, the average regression effect approach by Xu and O’Quigley [43] were adopted, methodology would first have to be developed e.g. for sample size computation, sequential monitoring of a trial, estimation of this average effect from interval censored data, etc.
• The other option to move forward would be the gatekeeper approach, which is at least informally accepted by HAs today.
If this second approach remains valid moving forward, it would be crucial for trial sponsors to understand:
• Even if the hypothesis test can be different from the effect quantifier, is validity of the chosen hypothesis test enough to justify its application, or does the hypothesis test still need to refer to a clearly defined estimand? The latter would make the use of the logrank test questionable under NPH.
• What are the criteria that will make HAs accept an effect quantifier? As suggested by a reviewer, could it be an option to base the hypothesis test on an analysis method that makes a set of plausible but minimal assumptions, but then use one or more effect quantifiers in the package insert making different and/or stronger assumptions, e.g. proportional hazards or even parametric?
A discussion of these aspects, specifically the role and interplay of testing and estimation, in the ICH E9 final addendum or general guidance by HAs on these aspects would certainly help sponsors to set up RCTs with T2E endpoints in the future, especially when NPH is anticipated.
Irrespective of which approach is favoured, a general comment on the causal interpretation of T2E data and hazard ratios is in order. As discussed by Akacha [75], while the addendum does not explicitly mention the word “causal”, reference to “causal thinking” is made implicitly via referencing potential outcomes (Section A.3.1) and adoption of the principal stratum strategy. However, as e.g. Hernan [76] and Aalen et al. [29] point out, the use of the hazard ratio for causal inference is not straightforward, even in the ideal situation of PH and absence of unmeasured confounding and measurement error. Furthermore, as discussed by Hernan ([36], Fine point 17.1), truncation of a T2E variable by competing events raises logical questions about the meaning of a causal estimand, and these issues cannot be bypassed by statistical techniques. So, even when perfectly adopting the addendum framework for T2E endpoints in the future and with PH being fulfilled, validity of causal conclusions for a T2E endpoint might remain unclear, even in an RCT, and needs further research.
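The dependence of the Cox hazard ratio on the censoring distribution under NPH, which motivates the questions on testing and estimation above, can be illustrated numerically. The sketch below assumes 1:1 randomization, purely administrative censoring at a common time `tau` in both arms, and hypothetical hazards with a delayed treatment effect. It solves the large-sample score equation of the partial likelihood for a binary arm indicator, in the spirit of Struthers and Kalbfleisch [44], by bisection. The function name, the grid-based integration, and all numerical values are illustrative assumptions.

```python
import math
from typing import Callable


def limiting_cox_hr(h0: Callable[[float], float],
                    h1: Callable[[float], float],
                    tau: float, p: float = 0.5, n_grid: int = 2000) -> float:
    """Large-sample limit of the Cox hazard ratio for a binary arm indicator.

    h0, h1: hazard functions of the control and experimental arm.
    tau:    administrative censoring time, identical in both arms, so the
            at-risk probability in arm z at time t < tau is S_z(t).
    p:      randomization probability for the experimental arm.
    """
    dt = tau / n_grid
    grid = []
    cum0 = cum1 = 0.0
    for i in range(n_grid):
        t = (i + 0.5) * dt                       # midpoint of the interval
        a, b = h0(t), h1(t)
        s0 = math.exp(-(cum0 + 0.5 * a * dt))    # S_0 at the midpoint
        s1 = math.exp(-(cum1 + 0.5 * b * dt))    # S_1 at the midpoint
        grid.append((a, b, s0, s1))
        cum0 += a * dt
        cum1 += b * dt

    def score(beta: float) -> float:
        # Limit of the partial likelihood score, up to the factor p * (1 - p):
        # integral of w(t) * (h1(t) - exp(beta) * h0(t)) dt over [0, tau].
        e = math.exp(beta)
        total = 0.0
        for a, b, s0, s1 in grid:
            w = s0 * s1 / ((1 - p) * s0 + p * e * s1)
            total += w * (b - e * a) * dt
        return total

    lo, hi = -5.0, 5.0                           # score is decreasing in beta
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))


def control(t: float) -> float:
    return 0.10


def delayed(t: float) -> float:
    # NPH: no effect in the first 6 months, hazard halved afterwards.
    return 0.10 if t < 6 else 0.05


def proportional(t: float) -> float:
    # PH: hazard ratio 0.7 at all times.
    return 0.07


hr_short = limiting_cox_hr(control, delayed, tau=6.0)
hr_mid = limiting_cox_hr(control, delayed, tau=12.0)
hr_long = limiting_cox_hr(control, delayed, tau=36.0)
hr_ph_short = limiting_cox_hr(control, proportional, tau=6.0)
hr_ph_long = limiting_cox_hr(control, proportional, tau=36.0)
print(hr_short, hr_mid, hr_long, hr_ph_short, hr_ph_long)
```

In the delayed-effect scenario the limiting hazard ratio moves away from 1 towards 0.5 as the administrative cutoff is pushed out, so the quantity being estimated changes with follow-up. Under PH the limit stays at 0.7 whatever the cutoff.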
We would also welcome it if the addendum in its final form could be clearer on whether estimands have to be constructed in a way that allows for causal statements, or alternatively, if the addendum can be interpreted more in the sense of an “operational guidance” on how to set up RCTs well and transparently.
Interestingly, as discussed by Aalen et al. [29], accelerated failure time models could potentially provide both: effect estimates that are robust against the censoring distribution and that allow for a causal interpretation.
We have discussed the current heterogeneity in variable definitions and the DATECAN initiative that aims at unifying these definitions in many indications in oncology. On top of that, harmonization efforts are underway for response definitions, e.g. extending the
Response Evaluation Criteria In Solid Tumours (RECIST) guideline to testing drug agents in immunotherapy [77]. We are definitely supportive of all these efforts and hope that these will be extended to other disease areas as well. However, with a precisely defined endpoint and a strategy for handling intercurrent events, only part of what constitutes an estimand according to the addendum is specified. We believe that the estimand framework will further help to align trial objectives and quantification and interpretation of effects. In addition, having the estimand discussion within teams at trial onset will help to define the data collection strategy. If a hypothetical estimand is targeted e.g. for OS, the data collection strategy becomes especially important: data on prognostic factors for experiencing an intercurrent event, e.g. initiation of NALT or a switch from control to treatment arm, need to be collected. Hudis et al. provide an illustrative discussion on these aspects in the penultimate paragraph of their paper [50]. One aspect is important to note in this context: for potentially lethal diseases like many cancers, patients remain under treatment also after progression, often even within the initial trial protocol, so that follow-up for OS is feasible. Such comprehensive follow-up for OS might be more difficult to achieve in other indications, e.g. Alzheimer’s disease. Experience shows that if patients stop treatment, the odds that they completely drop out of the trial are higher than in many oncology indications, because a hospital visit with all the trial assessments places a high burden on patients and caretakers.
We illustrate our points with two case studies from oncology, and we show that familiar endpoints such as PFS, TTP, or EFS can simply be interpreted as different estimands. Strict compliance with the estimand methodology outlined in the addendum is desirable moving forward.
However, as remarked by a reviewer, this may increase costs, e.g. because of power loss when using the treatment policy strategy for an intercurrent event or diluting endpoints when using the composite strategy.
Regulatory guidelines, e.g. the
FDA Guidance for Industry: Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics [19], specify definitions of endpoints such as PFS or TTP, and also outline potential sensitivity analyses. This guidance is reflected in the list of analyses for GALLIUM in Tables 2 and 3. However, the guidance currently neither discusses the assumptions that are (implicitly) made on censoring nor specifies estimands from which the proposed analyses, or estimators, are derived. We anticipate that regulatory guidelines will have to be updated to reflect the changes brought to ICH E9 by the addendum.
Based on our experience and working on the tables in Section 7, and similar to the recommendations by Hernan and Robins [23] in the context of pragmatic trials, we recommend that when defining an estimand for a trial with a T2E endpoint, the sponsor should
• involve all relevant stakeholders of the trial, i.e. statisticians, clinicians, internal regulatory colleagues, health authorities, payer organizations if the trial is industry-sponsored, and maybe even patients,
• identify all endpoint-defining and potential intercurrent events,
• define for each intercurrent event the estimand strategy to be applied, ideally by inferring this strategy from a precisely formulated trial objective,
• tabulate all these events and determine for each of them an estimand strategy,
• and finally apply or develop methodology that can estimate the chosen estimand, with a precise definition of the estimator and a description and discussion of the assumptions made, e.g. with respect to censoring.
By doing so, the sponsor should make sure to take into account aspects specific to the disease under study and make an effort to match the definition with either guidelines in the field, such as DATECAN, CONSORT, or RECIST, or other clinical trials.
This will facilitate discussions on regulatory approval, allow for easier comparison of trials now and in the future, and allow the use of trial data as historical controls, in meta-analyses, or surrogacy assessments.

Acknowledgments
I would like to thank the reviewers for their generous and constructive comments that substantially improved this paper. Specifically, I would like to thank one reviewer for pointing out the lack of a clear estimand implied by CPHR. This feedback made me extend Section 2.4 in the revision.
I would like to acknowledge numerous discussions on the topic with Mouna Akacha, Hans Ulrich Burger, James Roger, and Marcel Wolbers and thank Chris Harbron and Tina Nielson for proofreading an earlier version of this paper.
References
[1] R. Marcus, A. Davies, K. Ando, W. Klapper, S. Opat, C. Owen, E. Phillips, R. Sangha, R. Schlag, J. F. Seymour, W. Townsend, M. Trneny, M. Wenger, G. Fingerle-Rowson, K. Rufibach, T. Moore, M. Herold, and W. Hiddemann, “Obinutuzumab for the First-Line Treatment of Follicular Lymphoma,” N. Engl. J. Med., vol. 377, pp. 1331–1344, Oct 2017.
[2] D. J. McGhee, C. W. Ritchie, J. P. Zajicek, and C. E. Counsell, “A review of clinical trial designs used to detect a disease-modifying effect of drug therapy in Alzheimer’s disease and Parkinson’s disease,” BMC Neurol, vol. 16, p. 92, Jun 2016.
[3] X. Montalban, S. L. Hauser, L. Kappos, D. L. Arnold, A. Bar-Or, G. Comi, J. de Seze, G. Giovannoni, H. P. Hartung, B. Hemmer, F. Lublin, K. W. Rammohan, K. Selmaj, A. Traboulsee, A. Sauter, D. Masterman, P. Fontoura, S. Belachew, H. Garren, N. Mairon, P. Chin, and Wolinsky, “Ocrelizumab versus Placebo in Primary Progressive Multiple Sclerosis,” N. Engl. J. Med., vol. 376, pp. 209–220, 01 2017.
[4] J. J. McMurray, M. Packer, A. S. Desai, et al., and G. Vergara, “Angiotensin-neprilysin inhibition versus enalapril in heart failure,” N. Engl. J. Med., vol. 371, pp. 993–1004, Sep 2014.
[5] C. A. Bellera, M. Pulido, S. Gourgou, et al., “Protocol of the Definition for the Assessment of Time-to-event Endpoints in CANcer trials (DATECAN) project: formal consensus method for the development of guidelines for standardised time-to-event endpoints’ definitions in cancer clinical trials,” Eur. J. Cancer, vol. 49, no. 4, pp. 769–781, 2013.
[6] R. Peto, M. C. Pike, P. Armitage, N. E. Breslow, D. R. Cox, S. V. Howard, N. Mantel, K. McPherson, J. Peto, and P. G. Smith, “Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II. analysis and examples,” Br. J. Cancer, vol. 35, pp. 1–39, Jan 1977.
[7] D. G. Altman, B. L. De Stavola, S. B. Love, and K. A. Stepniewska, “Review of survival analyses published in cancer journals,” Br. J. Cancer, vol. 72, pp. 511–518, Aug 1995.
[8] S. Mathoulin-Pelissier, S. Gourgou-Bourgade, F. Bonnetain, and A. Kramar, “Survival end point reporting in randomized cancer clinical trials: a review of major journals,” J. Clin. Oncol., vol. 26, pp. 3721–3726, Aug 2008.
[9] D. V. Mehrotra, R. J. Hemmings, and E. Russek-Cohen, “Seeking harmony: estimands and sensitivity analyses for confirmatory clinical trials,” Clin Trials, vol. 13, pp. 456–458, Aug 2016.
[10] ICH E9 working group, “ICH E9 (R1): addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials,” 8 2017.
[11] ICH E9 working group, “E9(R1): Addendum to Statistical Principles for Clinical Trials on Choosing Appropriate Estimands and Defining Sensitivity Analyses in Clinical Trials.” 10 2014.
[12] M. Akacha, F. Bretz, D. Ohlssen, G. Rosenkranz, and H. Schmidli, “Estimands and their role in clinical trials,” Summer issue of the American Statistical Association’s Biopharmaceutical Section Report, vol. 22, pp. 1–4, 2015.
[13] B. Holzhauer, M. Akacha, and G. Bermann, “Choice of estimand and analysis methods in diabetes trials with rescue medication,” Pharm. Stat., vol. 14, no. 6, pp. 433–447, 2015.
[14] S. Unkel, M. Amiri, N. Benda, J. Beyersmann, D. Knoerzer, K. Kupas, F. Langer, F. Leverkus, A. Loos, C. Ose, T. Proctor, C. Schmoor, C. Schwenke, G. Skipka, K. Unnebrink, F. Voss, and T. Friede, “On estimands and the analysis of adverse events in the presence of varying follow-up times within the benefit assessment of therapies,” 2018.
[15] D. G. Altman and J. M. Bland, “Statistics notes: Absence of evidence is not evidence of absence,” BMJ, vol. 311, no. 7003, p. 485, 1995.
[16] J. P. Klein and M. L. Moeschberger,
Survival Analysis. Springer-Verlag, 2nd ed., 2003.
[17] J. Baselga, J. Cortes, S. B. Kim, S. A. Im, R. Hegg, Y. H. Im, L. Roman, J. L. Pedrini, T. Pienkowski, A. Knott, E. Clark, M. C. Benyunes, G. Ross, S. M. Swain, M. A. Bartoli, M. I. Bianconi, M. Kotliar, J. A. Lacava, M. Matwiejuk, P. E. Price, M. Varela, J. Andrade, R. Araujo, S. Azevedo, E. Cortes, E. Costa e Silva, D. Cubero, G. Delgado, M. d. e. l. P. Diz, B. Eyll, F. Franke, R. Hegg, G. Ismael, D. Jendiroba, J. L. Pedrini, R. Pereira, H. Pinczowski, P. Tokunaga, C. Tosello, C. Brezden-Masley, Y. Cheng, X. Ouyang, Z. Shen, X. Wang, L. Wang, T. K. Yau, W. Yeo, D. Otero, Z. Soldic, D. Vrbanec, T. Soria, P. Kellokumpu-Lehtinen, S. Pyrhonen, M. Campone, B. Coudert, J. M. Ferrero, F. Priou, B. Aktas, W. Aulitzky, M. Clemens, E. M. Grischke, M. Hauschild, M. Just, A. Kirsch, S. Kuemmel, C. Maintz, A. Marme, V. Mueller, M. Schmidt, A. Schneeweiss, C. Schumacher, C. Thomssen, B. Wesenberg, H. Castro-Salguero, C. Hernandez Monroy, L. M. Zelina-Toache, D. Amadori, C. Angiolini, L. Biganzoli, S. Cinieri, . Gamucci, S. Iacobelli, L. Latini, F. Montemurro, E. Simoncini, K. Aogi, H. Fuji, J. Horiguchi, K. Inoue, Y. Ito, H. Iwata, M. Kashiwaba, N. Kohno, K. Kuroi, N. Masuda, K. Nakagami, T. Nakayama, R. Nishimura, S. Saji, Y. Sasaki, N. Sato, K. Takeda, Y. Tokuda, K. Tsugawa, T. Ueno, J. Watanabe, R. Yoshiaki, S. A. Im, Y. H. Im, S. B. Kim, Y. W. Moon, J. S. Ro, J. H. Sohn, E. Grincuka, I. Kudaba, G. Purkalne, L. Kostovska-Maneva, P. Stefanovski, G. Martinez, G. Tellez, P. Caguioa, V. Chan, D. Tudtud, M. Foszczynska-Kloda, T. Pienkowski, W. Polkowski, E. Starowslawska, P. Tomczak, V. Gorbunova, E. Gotovkin, I. Kiselev, M. Kopp, M. Lichinitser, V. Merkulov, L. Roman, V. Semiglazov, V. Shirinkin, S. C. Lee, Z. W. Wong, E. Alba Conejo, J. Baselga, N. Batista, L. Calvo, E. Ciruelos, J. Cortes, M. Gil i Gil, A. Gonzalez, J. Hornedo Muguiro, S. Morales, N. Ribelles Entrena, S. d. e. L. Sanchez, P. Sanchez Rovira, W. Arpornwirat, T. Dejthevaporn, J. Maneechavakajorn, V. Srimuninnimit, V. Sriuranpong, P. Sunpaweravong, R. Ahmad, L. Assersohn, I. Boiangiu, N. Davidson, C. Gallagher, A. Jones, D. Miles, S. O’Reilly, A. Robinson, D. Wheatley, R. Agajanian, J. F. Armor, M. W. Audeh, A. S. Behairy, R. Birhiray, R. Blachly, K. Blackwell, R. Blanchard, P. Blanchet, B. J. Bowers, A. Brufsky, L. Budde, R. R. Carroll, V. Charu, S. Dakhil, B. Daniel, J. A. Ellerton, L. Fehrenbacher, P. Flynn, S. Franco, N. Green, V. Hansen, J. Hargis, C. Hendricks, R. C. Hermann, A. Kallab, M. Karwal, G. Kato, P. Kaufman, P. S. Kennedy, P. Klein, E. P. Lester, C. F. Lobo, R. A. Michaelson, J. A. Neidhart, J. D. Neidhart, A. D. Nguyen, T. O’Rourke, R. Patel, T. Patel, A. Perez, J. A. Quackenbush, C. E. Peterson, J. D. Polikoff, S. J. Prill, R. Robles, G. Rodriguez, F. Senecal, P. Sharma, R. Smith, D. Spicer, S. A. Swain, J. A. Taguchi, C. L. Vogel, D. M. Waterhouse, S. Yadav, and D. A. Yardley, “Pertuzumab plus trastuzumab plus docetaxel for metastatic breast cancer,”
N. Engl. J. Med., vol. 366, pp. 109–119, Jan 2012.
[18] T. R. Fleming, M. D. Rothmann, and H. L. Lu, “Issues in using progression-free survival when evaluating oncology products,” J. Clin. Oncol., vol. 27, pp. 2874–2880, Jun 2009.
[19] U.S. Food and Drug Administration, Guidance for Industry: Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics, 2007.
[20] R. Pazdur, “Endpoints for assessing drug activity in clinical trials,” Oncologist, vol. 13 Suppl 2, pp. 19–21, 2008.
[21] A. Allignol, J. Beyersmann, and C. Schmoor, “Statistical issues in the analysis of adverse events in time-to-event data,” Pharmaceutical Statistics, vol. 15, no. 4, pp. 297–305, 2016.
[22] A. K. Leuchs, A. Brandt, J. Zinserling, and N. Benda, “Disentangling estimands and the intention-to-treat principle,” Pharm Stat, vol. 16, pp. 12–19, Jan 2017.
[23] M. A. Hernan and J. M. Robins, “Per-Protocol Analyses of Pragmatic Trials,” N. Engl. J. Med., vol. 377, pp. 1391–1398, 10 2017.
[24] B. E. Shepherd, P. B. Gilbert, and T. Lumley, “Sensitivity analyses comparing time-to-event outcomes existing only in a subset selected postrandomization,” Journal of the American Statistical Association, vol. 102, no. 478, pp. 573–582, 2007.
[25] Y. Chiba, “Kaplan-Meier curves for survivor causal effects with time-to-event outcomes,” Clin Trials, vol. 10, pp. 515–521, Aug 2013.
[26] D. B. Rubin, “Causal inference through potential outcomes and principal stratification: Application to studies with censoring due to death,” Statist. Sci., vol. 21, pp. 299–309, 08 2006.
[27] C. A. Bellera, N. Penel, M. Ouali, S. Bonvalot, P. G. Casali, O. S. Nielsen, M. Delannes, S. Litiere, F. Bonnetain, T. S. Dabakuyo, R. S. Benjamin, J. Y. Blay, B. N. Bui, F. Collin, T. F. Delaney, F. Duffaud, T. Filleron, M. Fiore, H. Gelderblom, S. George, R. Grimer, P. Grosclaude, A. Gronchi, R. Haas, P. Hohenberger, R. Issels, A. Italiano, V. Jooste, A. Krarup-Hansen, C. Le Pechoux, C. Mussi, O. Oberlin, S. Patel, S. Piperno-Neumann, C. Raut, I. Ray-Coquard, P. Rutkowski, S. Schuetze, S. Sleijfer, E. Stoeckle, M. Van Glabbeke, P. Woll, S. Gourgou-Bourgade, and S. Mathoulin-Pelissier, “Guidelines for time-to-event end point definitions in sarcomas and gastrointestinal stromal tumors (GIST) trials: results of the DATECAN initiative (Definition for the Assessment of Time-to-event Endpoints in CANcer trials),” Ann. Oncol., vol. 26, pp. 865–872, May 2015.
[28] A. P. Boyd, J. M. Kittelson, and D. L. Gillen, “Estimation of treatment effect under non-proportional hazards and conditionally independent censoring,” Stat Med, vol. 31, pp. 3504–3515, Dec 2012.
[29] O. O. Aalen, R. J. Cook, and K. Røysland, “Does Cox analysis of a randomized survival study yield a causal treatment effect?,” Lifetime Data Anal, vol. 21, pp. 579–593, Oct 2015.
[30] M. Akacha, F. Bretz, and S. Ruberg, “Estimands in clinical trials - broadening the perspective,” Stat Med, vol. 36, pp. 5–19, Jan 2017.
[31] T. D. Cook and D. L. DeMets, Introduction to Statistical Methods for Clinical Trials. Chapman & Hall, 2008.
[32] P. Royston and M. K. Parmar, “Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome,” BMC Med Res Methodol, vol. 13, p. 152, Dec 2013.
[33] N. Demiris, D. Lunn, and L. D. Sharples, “Survival extrapolation using the poly-Weibull model,” Stat Methods Med Res, vol. 24, pp. 287–301, Apr 2015.
[34] H. Mao, L. Li, W. Yang, and Y. Shen, “On the propensity score weighting analysis with survival outcome: Estimands, estimation, and inference,” Stat Med, May 2018.
[35] National Research Council, The prevention and treatment of missing data in clinical trials. Panel on handling missing data in clinical trials. Committee on National Statistics, Division of Behavioural and Social Sciences and Education. The National Academies Press, 2010.
[36] M. A. Hernan and J. Robins, “Causal inference.” Book draft, accessed on 24th October 2017, 2017.
[37] J. Beyersmann, A. Allignol, and M. Schumacher, Competing Risks and Multistate Models with R. Springer, 2012.
[38] J. P. Fine and R. J. Gray, “A proportional hazards model for the subdistribution of a competing risk,”
J. Amer. Statist. Assoc., vol. 94, no. 446, pp. 496–509, 1999.
[39] R. B. Geskus, Data Analysis with Competing Risks and Intermediate States. Chapman & Hall, 2015.
[40] P. K. Andersen, Censored Data. John Wiley & Sons, Ltd, 2005.
[41] J. M. Bland and D. G. Altman, “The logrank test,” BMJ, vol. 328, p. 1073, May 2004.
[42] D. Collett, Modelling survival data in medical research. Chapman & Hall, 2 ed., 2003.
[43] R. Xu and J. O’Quigley, “Estimating average regression effect under non-proportional hazards,” Biostatistics, vol. 1, pp. 423–439, Dec 2000.
[44] C. A. Struthers and J. D. Kalbfleisch, “Misspecified proportional hazard models,” Biometrika, vol. 73, no. 2, pp. 363–369, 1986.
[45] V. Q. Nguyen and D. L. Gillen, “Robust inference in discrete hazard models for randomized clinical trials,” Lifetime Data Anal, vol. 18, pp. 446–469, Oct 2012.
[46] B. C. Kahan and V. Jairath, “Outcome pre-specification requires sufficient detail to guard against outcome switching in clinical trials: a case study,” Trials, vol. 19, p. 265, May 2018.
[47] H. Birgisson, U. Wallin, L. Holmberg, and B. Glimelius, “Survival endpoints in colorectal cancer and the effect of second primary other cancer on disease free survival,” BMC Cancer, vol. 11, p. 438, Oct 2011.
[48] C. J. Punt, M. Buyse, C. H. Kohne, P. Hohenberger, R. Labianca, H. J. Schmoll, L. Pahlman, A. Sobrero, and J. Y. Douillard, “Endpoints in adjuvant treatment trials: a systematic review of the literature in colon cancer and proposed definitions for future trials,” J. Natl. Cancer Inst., vol. 99, pp. 998–1003, Jul 2007.
[49] G. von Minckwitz, M. Procter, E. de Azambuja, et al., and W. Edenfield, “Adjuvant Pertuzumab and Trastuzumab in Early HER2-Positive Breast Cancer,” N. Engl. J. Med., vol. 377, pp. 122–131, 07 2017.
[50] C. A. Hudis, W. E. Barlow, J. P. Costantino, R. J. Gray, K. I. Pritchard, J. A. Chapman, J. A. Sparano, S. Hunsberger, R. A. Enos, R. D. Gelber, and J. A. Zujewski, “Proposal for standardized definitions for efficacy end points in adjuvant breast cancer trials: the STEEP system,” J. Clin. Oncol., vol. 25, pp. 2127–2132, May 2007.
[51] S. Gourgou-Bourgade, D. Cameron, P. Poortmans, B. Asselain, D. Azria, F. Cardoso, R. A’Hern, J. Bliss, J. Bogaerts, H. Bonnefoi, E. Brain, M. J. Cardoso, B. Chibaudel, R. Coleman, T. Cufer, L. Dal Lago, F. Dalenc, E. De Azambuja, M. Debled, S. Delaloge, T. Filleron, J. Gligorov, M. Gutowski, W. Jacot, C. Kirkove, G. MacGrogan, S. Michiels, I. Negreiros, B. V. Offersen, F. Penault Llorca, G. Pruneri, H. Roche, N. S. Russell, F. Schmitt, V. Servent, B. Thurlimann, M. Untch, J. A. van der Hage, G. van Tienhoven, H. Wildiers, J. Yarnold, F. Bonnetain, S. Mathoulin-Pelissier, C. Bellera, and T. S. Dabakuyo-Yonli, “Guidelines for time-to-event end point definitions in breast cancer trials: results of the DATECAN initiative (Definition for the Assessment of Time-to-event Endpoints in CANcer trials),” Ann. Oncol., vol. 26, pp. 2505–2506, Dec 2015.
[52] F. Bonnetain, B. Bonsing, T. Conroy, A. Dousseau, B. Glimelius, K. Haustermans, F. Lacaine, J. L. Van Laethem, T. Aparicio, D. Aust, C. Bassi, V. Berger, E. Chamorey, B. Chibaudel, L. Dahan, A. De Gramont, J. R. Delpero, C. Dervenis, M. Ducreux, J. Gal, E. Gerber, P. Ghaneh, P. Hammel, A. Hendlisz, V. Jooste, R. Labianca, A. Latouche, M. Lutz, T. Macarulla, D. Malka, M. Mauer, E. Mitry, J. Neoptolemos, P. Pessaux, A. Sauvanet, J. Tabernero, J. Taieb, G. van Tienhoven, S. Gourgou-Bourgade, C. Bellera, S. Mathoulin-Pelissier, and L. Collette, “Guidelines for time-to-event end-point definitions in trials for pancreatic cancer. Results of the DATECAN initiative (Definition for the Assessment of Time-to-event End-points in CANcer trials),” Eur. J. Cancer, vol. 50, pp. 2983–2993, Nov 2014.
[53] A. Kramar, S. Negrier, R. Sylvester, S. Joniau, P. Mulders, T. Powles, A. Bex, F. Bonnetain, A. Bossi, S. Bracarda, R. Bukowski, J. Catto, T. K. Choueiri, S. Crabb, T. Eisen, M. El Demery, J. Fitzpatrick, V. Flamand, P. J. Goebell, G. Gravis, N. Houede, D. Jacqmin, R. Kaplan, B. Malavaud, C. Massard, B. Melichar, L. Mourey, P. Nathan, D. Pasquier, C. Porta, D. Pouessel, D. Quinn, A. Ravaud, F. Rolland, M. Schmidinger, B. Tombal, D. Tosi, E. Vauleon, A. Volpe, P. Wolter, B. Escudier, T. Filleron, A. Kramar, R. Sylvester, T. Filleron, S. Negrier, S. Joniau, P. Mulders, T. Powles, B. Escudier, A. Bex, F. Bonnetain, A. Bossi, S. Braccarda, R. Bukowski, J. Catto, T. Choueiri, S. Crabb, T. Eisen, M. El Demery, J. Fitzpatrick, V. Flamand, P. J. Goebell, G. Gravis, N. Houede, D. Jacqmin, R. Kaplan, B. Malavaud, C. Massard, B. Melichar, L. Mourey, P. Nathan, D. Pasquier, C. Porta, D. Pouessel, D. Quinn, A. Ravaud, F. Rolland, M. Schmidinger, B. Tombal, D. Tosi, E. Vauleon, A. Volpe, and P. Wolter, “Guidelines for the definition of time-to-event end points in renal cell cancer clinical trials: results of the DATECAN project,” Ann. Oncol., vol. 26, pp. 2392–2398, Dec 2015.
[54] K. F. Schulz, D. G. Altman, D. Moher, D. G. Altman, V. Barbour, I. Boutron, P. J. Devereaux, K. Dickersin, D. Elbourne, S. Ellenberg, V. Gebski, S. Goodman, P. C. Gotzsche, T. Groves, S. Grunberg, B. Haynes, S. Hopewell, P. Juhn, P. Middleton, D. Minckler, D. Moher, V. M. Montori, C. Mulrow, S. Pocock, D. Rennie, D. L. Schriger, K. F. Schulz, I. Simera, and E. Wager, “CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials,”
BMJ , vol. 340,p. c332, 2010.[55] E. Van Cutsem, R. Labianca, D. Hossfeld, and et al, “Randomized phase iii trialcomparing biweekly infusional fluorouracil/leucovorin alone or with irinotecan inthe adjuvant treatment of stage iii colon cancer: Petacc-3,” vol. 41st Annual Meet-ing of the American Society of Clinical Oncology, 2005.[56] T. Andre, C. Boni, L. Mounedji-Boudiaf, M. Navarro, J. Tabernero, T. Hickish,C. Topham, M. Zaninelli, P. Clingan, J. Bridgewater, I. Tabah-Fisch, and A. de Gra-mont, “Oxaliplatin, fluorouracil, and leucovorin as adjuvant treatment for coloncancer,”
N. Engl. J. Med. , vol. 350, pp. 2343–2351, Jun 2004.[57] E. Van Cutsem, R. Labianca, G. Bodoky, C. Barone, E. Aranda, B. Nordlinger,C. Topham, J. Tabernero, T. Andre, A. F. Sobrero, E. Mini, R. Greil,F. Di Costanzo, L. Collette, L. Cisar, X. Zhang, D. Khayat, C. Bokemeyer, A. D.Roth, and D. Cunningham, “Randomized phase III trial comparing biweekly infu-sional fluorouracil/leucovorin alone or with irinotecan in the adjuvant treatment ofstage III colon cancer: PETACC-3,”
J. Clin. Oncol. , vol. 27, pp. 3117–3125, Jul2009.[58] K. Viele, S. Berry, B. Neuenschwander, B. Amzal, F. Chen, N. Enas, B. Hobbs, J. G.Ibrahim, N. Kinnersley, S. Lindborg, S. Micallef, S. Roychoudhury, and L. Thomp-son, “Use of historical control data for assessing treatment effects in clinical trials,”
Pharm Stat , vol. 13, no. 1, pp. 41–54, 2014.[59] D. Sargent, Q. Shi, G. Yothers, E. Van Cutsem, J. Cassidy, L. Saltz, N. Wolmark,B. Bot, A. Grothey, M. Buyse, A. de Gramont, D. J. Sargent, E. Green, A. Grothey,S. R. Alberts, B. Bot, M. Campbell, Q. Shi, G. Yothers, M. J. O’Connell, N. Wol-mark, A. de Gramont, R. Gray, D. Kerr, D. G. Haller, J. Benedetti, M. Buyse,R. Labianca, J. F. Seitz, C. J. O’Callaghan, G. Francini, P. J. Catalano, C. D.Blanke, T. Andre, R. M. Goldberg, A. Benson, C. Twelves, J. Cassidy, F. Sirzen,L. Cisar, E. Van Cutsem, and L. Saltz, “Two or three year disease-free survival(DFS) as a primary end-point in stage III adjuvant colon cancer trials with fluo-ropyrimidines with or without oxaliplatin or irinotecan: data from 12,676 patientsfrom MOSAIC, X-ACT, PETACC-3, C-06, C-07 and C89803,”
Eur. J. Cancer ,vol. 47, pp. 990–996, May 2011.[60] D. J. Sargent, H. S. Wieand, D. G. Haller, R. Gray, J. K. Benedetti, M. Buyse,R. Labianca, J. F. Seitz, C. J. O’Callaghan, G. Francini, A. Grothey, M. O’Connell,P. J. Catalano, C. D. Blanke, D. Kerr, E. Green, N. Wolmark, T. Andre, R. M.Goldberg, and A. De Gramont, “Disease-free survival versus overall survival as aprimary end point for adjuvant colon cancer studies: individual patient data from340,898 patients on 18 randomized trials,”
J. Clin. Oncol. , vol. 23, pp. 8664–8670,Dec 2005.[61] B. D. Cheson, B. Pfistner, M. E. Juweid, R. D. Gascoyne, L. Specht, S. J. Horning,B. Coiffier, R. I. Fisher, A. Hagenbeek, E. Zucca, S. T. Rosen, S. Stroobants,T. A. Lister, R. T. Hoppe, M. Dreyling, K. Tobinai, J. M. Vose, J. M. Connors,M. Federico, and V. Diehl, “Revised response criteria for malignant lymphoma,”
J.Clin. Oncol. , vol. 25, pp. 579–586, Feb 2007.[62] U. Vitolo, M. Trneny, D. Belada, J. M. Burke, A. M. Carella, N. Chua, P. Abris-queta, J. Demeter, I. Flinn, X. Hong, W. S. Kim, A. Pinto, Y. K. Shi, Y. Tat-sumi, M. Z. Oestergaard, M. Wenger, G. Fingerle-Rowson, O. Catalani, T. Nielsen,M. Martelli, and L. H. Sehn, “Obinutuzumab or Rituximab Plus Cyclophos-phamide, Doxorubicin, Vincristine, and Prednisone in Previously Untreated DiffuseLarge B-Cell Lymphoma,”
J. Clin. Oncol. , p. JCO2017733402, Aug 2017.[63] M. J. Maurer, H. Ghesquieres, J. P. Jais, T. E. Witzig, C. Haioun, C. A. Thompson,R. Delarue, I. N. Micallef, F. Peyrade, W. R. Macon, T. Jo Molina, N. Ketterer,S. I. Syrbu, O. Fitoussi, P. J. Kurtin, C. Allmer, E. Nicolas-Virelizier, S. L. Slager,T. M. Habermann, B. K. Link, G. Salles, H. Tilly, and J. R. Cerhan, “Event-freesurvival at 24 months is a robust end point for disease-related outcome in diffuselarge B-cell lymphoma treated with immunochemotherapy,”
J. Clin. Oncol. , vol. 32,pp. 1066–1073, Apr 2014.[64] J. M. Robins, “Information recovery and bias adjustment in proportional hazardsregression analysis of randomized trials using surrogate markers,”
Proceedings ofthe Biopharmaceutical Section, American Statistical Association , pp. 24–33, 1993.[65] C. Watkins, X. Huang, N. Latimer, Y. Tang, and E. J. Wright, “Adjusting overallsurvival for treatment switches: commonly used methods and practical applica-tion,”
Pharm. Stat., vol. 12, no. 6, pp. 348–357, 2013.
[66] J. Robins, Causal inference from complex longitudinal data. Latent variable modeling and applications to causality, vol. 120, pp. 69–117 of Lecture Notes in Statistics. Springer, New York, 1997.
[67] J. Robins and A. Tsiatis, “Correcting for noncompliance in randomized trials using rank preserving structural failure time models,”
Communications in Statistics – Theory and Methods, vol. 20, pp. 2609–2631, 1991.
[68] P. Solal-Celigny, P. Roy, P. Colombat, J. White, J. O. Armitage, R. Arranz-Saez, W. Y. Au, M. Bellei, P. Brice, D. Caballero, B. Coiffier, E. Conde-Garcia, C. Doyen, M. Federico, R. I. Fisher, J. F. Garcia-Conde, C. Guglielmi, A. Hagenbeek, C. Haioun, M. LeBlanc, A. T. Lister, A. Lopez-Guillermo, P. McLaughlin, N. Milpied, P. Morel, N. Mounier, S. J. Proctor, A. Rohatiner, P. Smith, P. Soubeyran, H. Tilly, U. Vitolo, P. L. Zinzani, E. Zucca, and E. Montserrat, “Follicular lymphoma international prognostic index,”
Blood, vol. 104, pp. 1258–1265, Sep 2004.
[69] L. D. Kaiser, “Dynamic randomization and a randomization model for clinical trials data,”
Stat Med, vol. 31, pp. 3858–3873, Dec 2012.
[70] G. Salles, J. F. Seymour, F. Offner, A. Lopez-Guillermo, D. Belada, L. Xerri, P. P. Feugier, R. Bouabdallah, J. V. Catalano, P. Brice, D. Caballero, C. Haioun, L. M. Pedersen, A. Delmer, D. Simpson, S. Leppa, P. Soubeyran, A. Hagenbeek, O. Casasnovas, T. Intragumtornchai, C. Ferme, M. G. da Silva, C. Sebban, A. Lister, J. A. Estell, G. G. Milone, A. Sonet, M. Mendila, B. Coiffier, and H. Tilly, “Rituximab maintenance for 2 years in patients with high tumour burden follicular lymphoma responding to rituximab plus chemotherapy (PRIMA): a phase 3, randomised controlled trial,”
Lancet, vol. 377, no. 9759, pp. 42–51, 2011.
[71] R Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2017.
[72] G. Heinze, M. Ploner, and D. Dunkler, coxphw: Weighted Estimation in Cox Regression, 2017. R package version 4.0.0.
[73] S. Dodd, I. R. White, and P. Williamson, “A framework for the design, conduct and interpretation of randomised controlled trials in the presence of treatment changes,”
Trials, vol. 18, p. 498, Oct 2017.
[74] T. T. Chen, “Statistical issues and challenges in immuno-oncology,”
J Immunother Cancer, vol. 1, p. 18, 2013.
[75] M. Akacha, “Estimands in clinical trials - broadening the perspective,” Royal Statistical Society Conference 2017, Glasgow, 2017.
[76] M. A. Hernan, “The hazards of hazard ratios,”
Epidemiology, vol. 21, pp. 13–15, Jan 2010.
[77] L. Seymour, J. Bogaerts, A. Perrone, R. Ford, L. H. Schwartz, S. Mandrekar, N. U. Lin, S. Litiere, J. Dancey, A. Chen, F. S. Hodi, P. Therasse, O. S. Hoekstra, L. K. Shankar, J. D. Wolchok, M. Ballinger, C. Caramella, and E. G. de Vries, “iRECIST: guidelines for response criteria for use in trials testing immunotherapeutics,”