Designing group sequential clinical trials when a delayed effect is anticipated: A practical guidance
Dominic Magirr
Advanced Methodology and Data Science, Novartis Pharma AG, Basel, Switzerland
[email protected]

José L. Jiménez
Biostatistical Sciences and Pharmacometrics, Novartis Pharma AG, Basel, Switzerland
[email protected]
February 15, 2021

ABSTRACT
A common feature of many recent trials evaluating the effects of immunotherapy on survival is that non-proportional hazards can be anticipated at the design stage. This raises the possibility of using a statistical method tailored towards testing the purported long-term benefit, rather than applying the more standard log-rank test and/or Cox model. Many such proposals have been made in recent years, but there remains a lack of practical guidance on implementation, particularly in the context of group-sequential designs. In this article, we aim to fill this gap. We discuss how the POPLAR trial, which compared immunotherapy versus chemotherapy in non-small-cell lung cancer, might have been re-designed to be more robust to the presence of a delayed effect. We then provide step-by-step instructions on how to analyse a hypothetical realisation of the trial, based on this new design. Basic theory on weighted log-rank tests and group-sequential methods is covered, and an accompanying R package (including vignette) is provided.

Introduction

For a homogeneous patient population, the primary analysis of a randomized controlled trial with a time-to-event endpoint is nothing more than a comparison of two cumulative distribution functions. Statistical analysis is made difficult, however, by right censoring, which precludes a simple comparison of means. The addition of one or more interim analyses complicates matters further. A standard solution is a group-sequential log-rank test, typically complemented with Kaplan-Meier estimates and a Cox proportional hazards model. Although successful in general, this strategy works less well for immuno-oncology trials, where the proportional hazards assumption is untenable. In this context, it is unlikely that the experimental drug will lead to an immediate improvement in survival.
Rather, the survival curves are expected to be similar, or possibly favour the control arm, for a number of months, before diverging. The log-rank test, although valid, may have low power if the component of the test statistic corresponding to early timepoints is contributing noise without contributing signal. In addition, the estimated beta coefficient corresponding to the treatment term in the Cox model will no longer have a straightforward interpretation.

Numerous proposals have been made to replace the log-rank test with a weighted version that is tailored towards testing purported long-term improvements in survival [1, 2, 3, 4, 5]. Uptake has been slow, however, in part due to concerns that such tests could produce counter-intuitive results when the hazard functions on the two arms cross [6]. To address such concerns, a "modestly-weighted" log-rank test has been proposed [7], with the key property that if survival on the experimental drug is truly lower than (or equal to) survival on control at all timepoints, then the probability of claiming a statistically significant improvement is less than α. The modestly-weighted test also has considerably greater power than the standard log-rank test when there is a delayed treatment effect, as well as being straightforward to implement [8].

In this paper, we aim to provide researchers with the guidance and tools necessary to use a modestly-weighted test in the context of a group-sequential design. Our emphasis will be on the practical side, since, from a methodological perspective, no new concepts are required. The modestly-weighted log-rank test belongs to the class of weighted log-rank statistics studied by Fleming & Harrington [9], which, as shown by Tsiatis [10], satisfy the standard independent increments assumption of group-sequential theory [11, 12]. We refer to Gillen & Emerson [13] for a detailed account of the methodology.
We shall use the POPLAR trial [14] as an example throughout. POPLAR was an open-label phase 2 randomized controlled trial of atezolizumab versus docetaxel for patients with previously-treated non-small-cell lung cancer. The key design assumptions, as well as a de-identified data set [15], are publicly available. The sample size was calculated assuming a median OS of 8 months for the control arm and a HR of 0.65, which translated into an assumed median OS of approximately 12.3 months for the atezolizumab arm, under an exponential model. Recruitment lasted 8 months. Three interim analyses were planned, with (two-sided) alpha levels of 0.0001, 0.0001, and 0.001. The final analysis of OS was performed when 173 deaths had occurred in the intention-to-treat (ITT) population, using a two-sided α level of 4.88%. The trial enrolled a total of 287 patients.

A Kaplan-Meier estimate derived from the published data set [15] is shown in Figure 1. The curves display the typical late-separation pattern often seen with immunotherapy agents. With the benefit of hindsight, but also based on observations from similar studies [16], we will show how the trial might have been designed more robustly and efficiently, taking into account the potential for a delayed treatment effect.

Figure 1: Kaplan-Meier curves (overall survival) from the POPLAR trial, atezolizumab versus docetaxel, with number-at-risk table.
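The exponential design assumptions above can be verified with a one-line conversion: under an exponential model the median equals ln 2 divided by the hazard rate, so a control median of 8 months combined with a hazard ratio of 0.65 implies the quoted experimental-arm median. A minimal Python check (our own illustration, not part of the gsdelayed package):

```python
import math

lam_ctrl = math.log(2) / 8            # control hazard rate, median OS 8 months
lam_exp = 0.65 * lam_ctrl             # proportional hazards with HR = 0.65
median_exp = math.log(2) / lam_exp    # experimental median = 8 / 0.65
print(round(median_exp, 1))           # 12.3
```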
To perform a weighted log-rank test, we scan over the ordered event times t_1, ..., t_k, and take a weighted sum of the observed minus expected events on one of the treatment arms, where the expectation is taken assuming that the survival distributions on the two arms are identical. Let n_{i,j} denote the number of patients at risk on treatment i = 0, 1 just prior to time t_j, and let O_{i,j} denote the observed number of events on treatment i = 0, 1 at time t_j, with the expected number of events given by E_{i,j} = O_j × n_{i,j} / n_j, where n_j = n_{0,j} + n_{1,j} and O_j = O_{0,j} + O_{1,j}. Then the weighted log-rank test statistic is

U_W := \sum_j w_j (O_{1,j} - E_{1,j}) \sim N(0, V_W), where V_W = \sum_j w_j^2 \frac{n_{0,j} n_{1,j} O_j (n_j - O_j)}{n_j^2 (n_j - 1)}.

Intuitively, if the treatment is beneficial, we will tend to see fewer events on the experimental arm than would be expected assuming the curves are identical. We are hoping to see that U_W ≪ 0, and, in particular, that the one-sided p-value, p := Φ(U_W / \sqrt{V_W}), is less than α. Weights are pre-specified to boost the chances that p < α, given the anticipated treatment effect. The standard log-rank test uses w_j = 1, which is the most powerful choice under proportional hazards. Under a delayed-treatment-effect scenario, a popular alternative is the Fleming-Harrington-(0,1) test, which uses w_j = 1 - \hat{S}(t_{j-}), where \hat{S}(t_{j-}) is the Kaplan-Meier estimate of the pooled sample just prior to time t_j. Considerable care is necessary, however: although the Fleming-Harrington-(0,1) test controls the type 1 error rate when survival curves are identical, it offers no guarantees regarding the direction of the effect [8, 17]. To put it another way, it offers a valid α-level test when the null hypothesis is identical survival, H_0: S_1(t) = S_0(t) for all t, but not when the null hypothesis is inferior (or identical) survival, \tilde{H}_0: S_1(t) ≤ S_0(t) for all t. A safer choice that controls α also under \tilde{H}_0 is a "modestly-weighted" log-rank test [7], which uses

w_j = 1 / \max\{\hat{S}(t_{j-}), \hat{S}(t^*)\}.

Heuristically, the modestly-weighted test can be thought of as similar to an average landmark analysis from time t^* to the end of follow-up [8]. This interpretation is helpful at the design stage when pre-specifying t^*.

For a group-sequential version of the weighted log-rank test, we must consider the joint distribution of U_W^{(1)}, ..., U_W^{(K)}, where U_W^{(k)} denotes the test statistic at analysis k. As shown by Tsiatis [10], under H_0,

(U_W^{(1)}, ..., U_W^{(K)})^T \sim N(0, Σ), where Σ_{k,l} = V_W^{(\min\{k,l\})},   (1)
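The scan described above is easy to code directly. The paper's accompanying package gsdelayed is written in R; the following standalone Python sketch (function name `mw_logrank` is ours) computes U_W, V_W and the one-sided p-value for the modestly-weighted test. Setting t_star = 0 makes every weight equal to 1, recovering the standard log-rank test.

```python
import math

def mw_logrank(time, event, group, t_star):
    """Modestly-weighted log-rank test (illustrative sketch, not the
    gsdelayed implementation). event: 1 = death, 0 = censored;
    group: 0 = control, 1 = experimental. Returns (U_W, V_W, one-sided p)."""
    data = sorted(zip(time, event, group))
    event_times = sorted({t for t, e, _ in data if e == 1})
    km = 1.0        # pooled Kaplan-Meier estimate just prior to current time
    s_star = None   # S-hat(t*), captured once the scan passes t*
    U = V = 0.0
    for t in event_times:
        if s_star is None and t > t_star:
            s_star = km                      # KM is flat between event times
        n_j = sum(1 for ti, _, _ in data if ti >= t)            # at risk
        n1j = sum(1 for ti, _, g in data if ti >= t and g == 1)
        n0j = n_j - n1j
        O_j = sum(e for ti, e, _ in data if ti == t)            # deaths at t
        O1j = sum(e for ti, e, g in data if ti == t and g == 1)
        # w_j = 1 / max{S(t_j-), S(t*)}: before t* this equals 1/S(t_j-);
        # after t* the weight is capped at 1/S(t*)
        w = 1.0 / (km if s_star is None else s_star)
        U += w * (O1j - O_j * n1j / n_j)
        if n_j > 1:
            V += w * w * n0j * n1j * O_j * (n_j - O_j) / (n_j**2 * (n_j - 1))
        km *= 1.0 - O_j / n_j                # update pooled KM at time t
    z = U / math.sqrt(V)
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # Phi(z)
    return U, V, p
```

On a toy data set with all control deaths preceding all experimental deaths, increasing t_star up-weights the later event times and makes U_W more extreme, as the heuristic above suggests.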
A group-sequential test can be defined via the K critical values c_1, ..., c_K such that

P\left( \bigcap_{k ≤ K} \left\{ U_W^{(k)} / \sqrt{V_W^{(k)}} > c_k \right\}; H_0 \right) = 1 - α.   (2)

There are many different ways to choose such critical values [12, 11]. One flexible approach is to use a Hwang-Shih-DeCani alpha-spending function [18]. In this case, we must pre-specify an anticipated variance of the final test statistic, \tilde{V}_W^{(K)}. Then, at analysis k, for k = 1, ..., K - 1, we find the cumulative alpha spend

α^*_k = α × \frac{1 - \exp\left(-γ \min\{V_W^{(k)} / \tilde{V}_W^{(K)}, 1\}\right)}{1 - \exp(-γ)}.   (3)

Further defining α^*_K := α, the critical value c_k (k = 1, ..., K) is found via numerical integration, such that

P\left( \bigcap_{l ≤ k} \left\{ U_W^{(l)} / \sqrt{V_W^{(l)}} > c_l \right\}; H_0 \right) = 1 - α^*_k.

The parameter γ can be chosen such that the stopping boundary resembles an O'Brien-Fleming boundary (γ = -4), a Pocock boundary (γ = 1), or something in between.
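To make the boundary calculation concrete, here is a standard-library Python sketch for a two-stage design (our own illustration, not the gsdelayed implementation): it evaluates the Hwang-Shih-DeCani spend at the interim information fraction, then solves for the second critical value so that the total rejection probability equals α. For readability the code uses the upper-tail convention Z > c; the paper's one-sided test rejects in the lower tail, which is symmetric.

```python
import math
from statistics import NormalDist

N = NormalDist()

def hsd_spend(alpha, gamma, t):
    """Cumulative Hwang-Shih-DeCani alpha spend at information fraction t."""
    t = min(t, 1.0)
    return alpha * (1.0 - math.exp(-gamma * t)) / (1.0 - math.exp(-gamma))

def bvn_cdf(c1, c2, rho, steps=2000):
    """P(Z1 <= c1, Z2 <= c2) for a standard bivariate normal with
    correlation rho, via trapezoid rule on z1 over [-8, c1]."""
    lo = -8.0
    h = (c1 - lo) / steps
    denom = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        f = N.pdf(z) * N.cdf((c2 - rho * z) / denom)
        total += f * (h / 2 if i in (0, steps) else h)
    return total

def two_stage_boundary(alpha, gamma, t1):
    """Z-scale critical values (c1, c2) for a two-stage design with
    information fraction t1 at the interim (upper-tail convention)."""
    a1 = hsd_spend(alpha, gamma, t1)
    c1 = N.inv_cdf(1.0 - a1)
    rho = math.sqrt(t1)              # corr(Z1, Z2) = sqrt(V1 / V2)
    lo, hi = 0.0, 6.0
    for _ in range(50):              # bisection on c2
        c2 = 0.5 * (lo + hi)
        reject = a1 + (N.cdf(c1) - bvn_cdf(c1, c2, rho))
        if reject > alpha:
            lo = c2
        else:
            hi = c2
    return c1, c2
```

For α = 0.025, γ = -4 and t1 = 0.5 this gives an O'Brien-Fleming-like shape: a stringent interim boundary near 2.75 and a final boundary only slightly above the fixed-sample critical value.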
We now consider the alternative hypothesis, denoted by H_1. Figure 2 shows two potential alternative hypotheses that may have been considered for the POPLAR trial design. Our challenge is to find a design such that

P\left( U_W / \sqrt{V_W} < Φ^{-1}(α); H_1 \right) = 1 - β.   (4)

In time-to-event settings, power is driven by the number of events rather than the number of patients. The number of events is a function of the recruitment assumptions, time-to-event distributions, and the duration of follow-up. Thus we have considerable flexibility, in theory at least, in how we design the trial to meet objective (4). If the sponsor of the study has large resources, it may be feasible to fix the duration of recruitment and follow-up to ensure that the study is completed in a timely manner. In this case, we adjust the recruitment rate, or, equivalently, the total number of patients, until (4) is satisfied. For example, the POPLAR trial specified 8 months of recruitment, plus a minimum follow-up time of 13 months, bringing the total trial duration to 21 months. Given these assumptions, as well as the time-to-event distributions in Figure 2, the corresponding power of the standard log-rank test is shown in the first two columns of Table 1 for a series of potential sample sizes. We see that under the proportional hazards (PH) alternative, a sample size of 165 per arm would be sufficient to achieve 90% power. However, under the non-proportional hazards (NPH) assumption, 180 per arm would be required. If, instead of the standard log-rank test, we use the modestly-weighted log-rank test with t^* = 6, then the corresponding required sample size per arm is 165 under PH and 150 under NPH. A reason for choosing t^* = 6 here is that the modestly-weighted log-rank test is similar to an average landmark analysis from t^* to the end of follow-up [8], and under NPH we anticipate the curves to have started diverging at this point. The closer t^* is to zero, the more similar the test is to a standard log-rank test.

Figure 2: Two potential alternative hypotheses for the POPLAR trial. Survival curves for the control arm and for experimental arms with delay = 0 months and delay = 4 months.
Table 1: Relationship between number of patients per arm (n) and power using the standard log-rank test (LR) and the modestly-weighted log-rank test (MWLR). Assuming uniform recruitment over 8 months, time-to-event distributions as given in Figure 2, with analysis performed 21 months after the start of the trial.

          Power LR          Power MWLR (t* = 6)
  n       PH      NPH       PH      NPH
  150     0.87    0.84      0.87    0.91
  155     0.88    0.85      0.88    0.91
  160     0.89    0.86      0.89    0.92
  165     0.90    0.87      0.90    0.93
  170     0.91    0.88      0.90    0.94
  175     0.92    0.89      0.91    0.94
  180     0.92    0.90      0.92    0.95

To summarize, if we are confident in the delayed effect assumption, and require 90% power under the NPH alternative, then there is an approximate 20% saving in sample size from using the modestly-weighted log-rank test instead of a standard log-rank test. Even if we are not certain about the delayed effect, and would prefer to choose the sample size
such that there is at least 90% power under both PH and NPH alternatives, there is still an approximate 10% sample-size reduction from using the modestly-weighted test.

Note that the power has been calculated assuming a fixed data cut-off time, rather than a fixed number of events that triggers a data cut-off. It is straightforward, however, to calculate the expected number of events, given the design assumptions, and this can then be considered as the fixed quantity in the final definition of the trial design.

Calculations have been performed via numerical integration using the R package gsdelayed, which we developed specifically to illustrate the steps presented in this article. Details of the approximations involved have been described elsewhere [7]. Numerical integration is useful for fast evaluation of various design options. If necessary, it is also straightforward to simulate a chosen design to confirm its operating characteristics.

We now consider adding an interim analysis for efficacy. Two choices are necessary: the timing of the interim analysis, and the amount of alpha to spend. In making these choices, we must consider our goal. For our example based on the POPLAR study, recruitment lasts 8 months, with a maximum trial length of 21 months. Unless the interim analysis is very early, all patients will have already been recruited, and most of the costs of the study will have already been incurred. The only incentive to stop early for efficacy is a reduction in the expected time until a decision. We could, for example, make choices that minimize the expected duration of the trial under the alternative hypothesis. Typically, however, there is a trade-off: the more we reduce the expected duration of the trial, the more we reduce the overall power.
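Such a simulation check of the single-stage design can be sketched as follows. This is our own standalone Python illustration (not the gsdelayed implementation), and the post-delay hazard ratio of 0.6 is assumed purely for illustration, since the exact value used for the NPH alternative is not stated here: control survival is exponential with median 8 months, the experimental arm shares the control hazard for the first 4 months and then has a reduced hazard, recruitment is uniform over 8 months, and data are cut off 21 months after the trial starts.

```python
import math, random

def mw_logrank_z(times, events, groups, t_star):
    """Z-statistic of the modestly-weighted log-rank test (sketch)."""
    data = sorted(zip(times, events, groups))
    n1 = sum(g for _, _, g in data)
    n0 = len(data) - n1
    km, s_star, U, V = 1.0, None, 0.0, 0.0
    i, n = 0, len(data)
    while i < n:
        t = data[i][0]
        j, d, d1 = i, 0, 0
        while j < n and data[j][0] == t:          # gather ties at time t
            if data[j][1] == 1:
                d += 1
                d1 += data[j][2]
            j += 1
        nj = n0 + n1
        if d > 0 and nj > 1:
            if s_star is None and t > t_star:
                s_star = km                       # S-hat(t*), weight cap
            w = 1.0 / (km if s_star is None else s_star)
            U += w * (d1 - d * n1 / nj)
            V += w * w * n0 * n1 * d * (nj - d) / (nj * nj * (nj - 1))
            km *= 1.0 - d / nj
        for k in range(i, j):                     # drop from the risk set
            if data[k][2] == 1:
                n1 -= 1
            else:
                n0 -= 1
        i = j
    return U / math.sqrt(V)

def sim_trial(n_per_arm=150, delay=4.0, hr=0.6, med_ctrl=8.0,
              recruit=8.0, cutoff=21.0):
    """One trial under a delayed-effect alternative; administrative
    censoring at the data cut-off, 21 months after the trial starts."""
    lam0 = math.log(2) / med_ctrl
    times, events, groups = [], [], []
    for g in (0, 1):
        for _ in range(n_per_arm):
            r = random.uniform(0.0, recruit)      # recruitment time
            t = random.expovariate(lam0)
            if g == 1 and t > delay:              # effect kicks in after delay
                t = delay + random.expovariate(hr * lam0)
            follow = cutoff - r
            times.append(min(t, follow))
            events.append(1 if t <= follow else 0)
            groups.append(g)
    return times, events, groups

random.seed(2021)
crit = -1.959964                                  # Phi^{-1}(0.025), lower tail
n_sim = 300
power = sum(mw_logrank_z(*sim_trial(), t_star=6.0) < crit
            for _ in range(n_sim)) / n_sim
```

With these assumed inputs the estimated power is well above one half; more simulation replicates would of course be used in practice.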
Or, if we decide to increase the maximum sample size to recover 90% power, we must trade off a shorter expected duration against a longer maximum duration.

In Table 2, expected duration and power are displayed for 10 potential designs. Perhaps the three-stage design with interim analyses at 11 and 16 months stands out as an appealing option, based on a Hwang-Shih-DeCani spending function with γ = -4. This design reduces the expected duration of the study by 3.4 months with barely any reduction in power compared to a single-stage design.

Table 2: Expected duration and power of various design options, calculated under the NPH alternative. Based on a modestly-weighted log-rank test with t* = 6, sample size of 150 per arm, and uniform recruitment over 8 months.

                                 Expected duration (months)    Power
                                 under the NPH alternative     under the NPH alternative
  Design        Analysis times   γ = -4   γ = -.   γ = 1       γ = -4   γ = -.   γ = 1
  Single-stage  21               21.0                          0.91
  Two-stage     11, 21           20.1     19.4     18.8        0.90     0.89     0.86
  Two-stage     16, 21           17.9     17.6     17.4        0.90     0.88     0.86
  Three-stage   11, 16, 21       17.6     17.0     16.7        0.90     0.88     0.83
Regulatory guidance generally steers towards futility stopping rules that are non-binding [19]. This means that we do not consider the futility stopping rule when we calculate the efficacy boundary that guarantees an α-level test. If non-binding futility rules are subsequently added, this has the effect of reducing both the type 1 error probability and the power.

There are several ways that a futility rule could be specified [20]. We could, for example, consider a beta-spending function [21]. We could calculate the conditional power [22], or the predictive power [23]. Or we could specify a cut-off directly, either on the z-statistic scale or on the average-hazard-ratio scale. The latter has been implemented in gsdelayed.

In the special case of time-to-event trials with an anticipated delayed effect, it should be recognised that a formal futility analysis may have limited value. As mentioned above, unless the interim analysis occurs very early, most patients will have been recruited, and most of the costs of the study already incurred. In addition, a stringent rule would risk stopping inappropriately before a late treatment effect has been given a chance to emerge. This is not to say that the trial would never be stopped early. All such trials will be monitored by an independent data safety and monitoring board (DSMB). The DSMB will stop the trial promptly if the experimental drug is clearly harmful [24].
We shall now walk through a hypothetical realization of the three-stage trial design from Table 2. So far, we have specified the calendar times of the interim and final analyses, relative to the start of the trial. Figure 3 shows how the expected number of events corresponds to calendar time under the NPH alternative. We see that the first interim, second interim and final analyses at months 11, 16 and 21 correspond to 122, 170 and 203 events, respectively. We can now specify the design in terms of the number of events that trigger each analysis. Having done so, the planned stopping boundaries are shown in Figure 5.

Figure 3: Switching from a study-time perspective to an expected-number-of-events perspective, based on the NPH alternative. Expected numbers of recruited patients and events against time since the start of the study.
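The conversion from calendar time to expected events is a one-dimensional integral: with uniform recruitment over 8 months, a patient recruited at time r has been followed for τ - r months at calendar time τ, so the expected number of events per arm is (n/8) ∫ F(τ - r) dr. The following Python sketch is our own illustration; the post-delay hazard ratio of 0.6 is assumed purely for demonstration, so the resulting counts only roughly track the 122/170/203 quoted above.

```python
import math

LAM0 = math.log(2) / 8.0      # control hazard rate, median 8 months
DELAY, HR = 4.0, 0.6          # post-delay HR: assumed value for illustration

def surv_ctrl(t):
    return math.exp(-LAM0 * t)

def surv_exp(t):
    if t <= DELAY:
        return math.exp(-LAM0 * t)           # same hazard during the delay
    return math.exp(-LAM0 * DELAY - HR * LAM0 * (t - DELAY))

def expected_events(tau, n_per_arm=150, recruit=8.0, steps=2000):
    """Expected events by calendar time tau, uniform recruitment:
    sum over arms of (n/recruit) * integral_0^min(recruit,tau) F(tau - r) dr."""
    upper = min(recruit, tau)
    total = 0.0
    for S in (surv_ctrl, surv_exp):
        h = upper / steps
        acc = 0.0
        for i in range(steps + 1):           # trapezoid rule in r
            r = i * h
            acc += (1.0 - S(tau - r)) * (h / 2 if i in (0, steps) else h)
        total += n_per_arm * acc / recruit
    return total

for tau in (11, 16, 21):
    print(tau, round(expected_events(tau)))
```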
As already mentioned, at the design stage we must pre-specify the anticipated variance of the U statistic at the final analysis, denoted by \tilde{V}_W^{(K)}. In our case, using numerical integration, we find that \tilde{V}_W^{(3)} ≈ 103. This parameter will be used when adjusting the critical values to account for any deviations from the planned design assumptions. After the selection of \tilde{V}_W^{(3)}, we proceed with the hypothetical realization of the trial:

• Trial recruitment begins.

• We conduct the first interim analysis after 122 events. Suppose we observe the data shown in Figure 4A. Applying the modestly-weighted test, we obtain U_W^{(1)} and V_W^{(1)} ≈ 49. The next step is to find the interim alpha spend, given the information fraction t = V_W^{(1)} / \tilde{V}_W^{(3)} ≈ 0.48, and the alpha-spending function (3). In our case, we specified a Hwang-Shih-DeCani alpha-spending function with γ = -4, giving a cumulative spend α^*_1 and a corresponding critical value on the Z-statistic scale of Φ^{-1}(α^*_1). Since the observed Z-statistic U_W^{(1)} / \sqrt{V_W^{(1)}} lies above this critical value, the decision at the first interim would be to continue. This is represented graphically in Figure 5 by the blue "x" at 122 events.

• We conduct the second interim analysis after 170 events. Suppose we observe the data shown in Figure 4B. Again applying the modestly-weighted test, we obtain U_W^{(2)} and V_W^{(2)} ≈ 76, and hence Z = U_W^{(2)} / \sqrt{V_W^{(2)}}. The information fraction is now V_W^{(2)} / \tilde{V}_W^{(3)} ≈ 0.74, which determines the cumulative alpha spend α^*_2 at the second interim. To convert this into a critical value on the Z-scale, we must solve equation (2) for c_2 using the bivariate normal distribution (1). Since the observed Z falls below c_2, we could now reject the null hypothesis and stop the trial.
This is represented graphically in Figure 5 by the blue "x" at 170 events.

This hypothetical realization highlights the danger of a stringent futility analysis, as there is little separation along most of the Kaplan-Meier curves shown in Figure 4A.

Figure 4: Kaplan-Meier curves (overall survival, experimental versus control) with number-at-risk tables. A) Interim analysis 1. B) Interim analysis 2.

Figure 5: Planned stopping boundaries on the Z-statistic scale, plotted against the observed number of events (122, 170, 203).

For group-sequential designs, there is no single natural way to define a p-value. A popular approach is the so-called "stage-wise ordering p-value", where earlier stops for efficacy are always considered more extreme evidence against the null hypothesis than later stops for efficacy. In the hypothetical trial described above, where the trial stops at the second interim with observed statistic z_2, the stage-wise (one-sided) p-value would be

p = P\left( \{ Z_1 ≤ c_1 \} \cup \{ Z_1 > c_1 \cap Z_2 ≤ z_2 \}; H_0 \right),   (5)

where Z_k = U_W^{(k)} / \sqrt{V_W^{(k)}}.

Testing the null hypothesis is only one part of the analysis. Equally important is how to describe the treatment effect. Perhaps the most important tool is a Kaplan-Meier plot, which has the advantage of describing the entire survival curve. As has been noted by many authors [8, 25, 26, 5], in the setting of non-proportional hazards, there is no single-number summary measure that can adequately capture the full information from the survival curves. Rather, it is considered helpful to report a range of single-number summary measures, including the difference in survival at fixed time points, differences in quantiles of the survival distributions, and differences in restricted mean survival times.

In our hypothetical realization, where the trial was stopped for efficacy after the second interim analysis, we might focus on the survival probabilities at 12 months (0.54 on experimental versus 0.40 on control), the median survival times (15.5 versus 9.3 months), or the restricted mean survival times up to 12 months (8.6 versus 7.9 months).

For all of these summary measures, the group-sequential design will introduce some bias, owing to the possibility of stopping early on a random high. Various methods have been proposed that attempt to account for this bias [27, 28, 29]. They are rarely used in practice, however, the justification often being that the size of the bias is small, particularly if the interim analyses occur late [30].
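The stage-wise p-value in (5) reduces to a bivariate-normal probability, which can be evaluated with a standard-library sketch (our own illustration; the inputs below, such as c1 = -2.75 and z2 = -3.0, are hypothetical values, since the exact interim statistics are not reproduced here):

```python
import math
from statistics import NormalDist

N = NormalDist()

def bvn_cdf(c1, c2, rho, steps=2000):
    """P(Z1 <= c1, Z2 <= c2), standard bivariate normal with correlation rho,
    via trapezoid rule on z1 over [-8, c1]."""
    lo = -8.0
    h = (c1 - lo) / steps
    denom = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        f = N.pdf(z) * N.cdf((c2 - rho * z) / denom)
        total += f * (h / 2 if i in (0, steps) else h)
    return total

def stagewise_p(c1, z2_obs, rho):
    """Stage-wise ordering p-value after stopping for efficacy at analysis 2,
    lower-tail rejection: P(Z1 <= c1) + P(Z1 > c1, Z2 <= z2_obs) under H0."""
    return N.cdf(c1) + (N.cdf(z2_obs) - bvn_cdf(c1, z2_obs, rho))

# hypothetical inputs: first-stage boundary, observed Z at the second
# interim, and corr(Z1, Z2) = sqrt(V1 / V2) with V1 ~ 49 and V2 ~ 76
rho = math.sqrt(49.0 / 76.0)
p_sw = stagewise_p(-2.75, -3.0, rho)
```

By construction the stage-wise p-value is at least P(Z1 <= c1), and it increases as the observed second-stage statistic becomes less extreme.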
Discussion

Immunotherapy treatments often have delayed effects. We could use this knowledge to make phase 3 clinical trials more efficient, by focusing the test statistic on the purported long-term survival benefit, rather than using the standard log-rank test as a default.

One potential barrier to realizing this increase in efficiency is a lack of guidance and software for implementing the more efficient methods in the context of group-sequential trials. In this paper, we have described in detail how to design and analyse a phase 3 trial in immuno-oncology using a group-sequential modestly-weighted log-rank test. We have also discussed the scope for a formal futility analysis in the special case of a time-to-event endpoint with an anticipated delayed effect. Lastly, we have illustrated how a range of single-number summary measures together help to quantify the treatment effect, which is important given that the hazard ratio lacks interpretability in this setting.
Data availability
The data used to produce the Kaplan-Meier curves in Figure 1 is publicly available in [15]. The R code used throughout the article is part of the package gsdelayed, which includes a vignette, and is available at github.com/dominicmagirr/gsdelayed.

References

[1] David P Harrington and Thomas R Fleming. A class of rank test procedures for censored survival data. Biometrika, 69(3):553-566, 1982.

[2] Song Yang and Ross Prentice. Improved logrank-type tests for survival data using adaptive weights. Biometrics, 66(1):30-38, 2010.

[3] Valérie Garès, Sandrine Andrieu, Jean-François Dupuy, Nicolas Savy, et al. A comparison of the constant piecewise weighted logrank and Fleming-Harrington tests. Electronic Journal of Statistics, 8(1):841-860, 2014.

[4] Theodore G Karrison. Versatile tests for comparing survival curves based on weighted log-rank statistics. The Stata Journal, 16(3):678-690, 2016.

[5] Satrajit Roychoudhury, Keaven M Anderson, Jiabu Ye, and Pralay Mukhopadhyay. Robust design and analysis of clinical trials with non-proportional hazards: a straw man guidance from a cross-pharma working group. Statistics in Biopharmaceutical Research, pages 1-37, 2021.

[6] Boris Freidlin and Edward L Korn. Methods for accommodating nonproportional hazards in clinical trials: ready for the primary analysis? Journal of Clinical Oncology, 37(35):3455, 2019.

[7] Dominic Magirr and Carl-Fredrik Burman. Modestly weighted logrank tests. Statistics in Medicine, 38(20):3782-3790, 2019.

[8] Dominic Magirr. Non-proportional hazards in immuno-oncology: is an old perspective needed? Pharmaceutical Statistics, 2020.

[9] Thomas R Fleming and David P Harrington. Counting Processes and Survival Analysis, volume 169. John Wiley & Sons, 2011.

[10] Anastasios A Tsiatis. Repeated significance testing for a general class of statistics used in censored survival analysis. Journal of the American Statistical Association, 77(380):855-861, 1982.

[11] Christopher Jennison and Bruce W Turnbull. Group Sequential Methods with Applications to Clinical Trials. CRC Press, 1999.

[12] John Whitehead. The Design and Analysis of Sequential Clinical Trials. John Wiley & Sons, 1997.

[13] Daniel L Gillen and Scott S Emerson. Information growth in a family of weighted logrank statistics under repeated analyses. Sequential Analysis, 24(1):1-22, 2005.

[14] Louis Fehrenbacher, Alexander Spira, Marcus Ballinger, Marcin Kowanetz, Johan Vansteenkiste, Julien Mazieres, Keunchil Park, David Smith, Angel Artal-Cortes, Conrad Lewanski, et al. Atezolizumab versus docetaxel for patients with previously treated non-small-cell lung cancer (POPLAR): a multicentre, open-label, phase 2 randomised controlled trial. The Lancet, 387(10030):1837-1846, 2016.

[15] David R Gandara, Sarah M Paul, Marcin Kowanetz, Erica Schleifman, Wei Zou, Yan Li, Achim Rittmeyer, Louis Fehrenbacher, Geoff Otto, Christine Malboeuf, et al. Blood-based tumor mutational burden as a predictor of clinical benefit in non-small-cell lung cancer patients treated with atezolizumab. Nature Medicine, 24(9):1441-1448, 2018.

[16] Rifaquat Rahman, Geoffrey Fell, Steffen Ventz, Andrea Arfé, Alyssa M Vanderbeek, Lorenzo Trippa, and Brian M Alexander. Deviation from the proportional hazards assumption in randomized phase 3 clinical trials in oncology: prevalence, associated factors, and implications. Clinical Cancer Research, 25(21):6339-6345, 2019.

[17] José L Jiménez, Viktoriya Stalbovskaya, and Byron Jones. Properties of the weighted log-rank test in the design of confirmatory studies with delayed effects. Pharmaceutical Statistics, 18(3):287-303, 2019.

[18] Irving K Hwang, Weichung J Shih, and John S De Cani. Group sequential designs using a family of type I error probability spending functions. Statistics in Medicine, 9(12):1439-1445, 1990.

[19] Food and Drug Administration. Adaptive designs for clinical trials of drugs and biologics - guidance for industry, 2019. [Online; accessed 2-February-2021].

[20] Paul Gallo, Lu Mao, and Vivian H Shih. Alternative views on setting clinical trial futility criteria. Journal of Biopharmaceutical Statistics, 24(5):976-993, 2014.

[21] Sandro Pampallona, Anastasios A Tsiatis, and KyungMann Kim. Interim monitoring of group sequential trials using spending functions for the type I and type II error probabilities. Drug Information Journal, 35(4):1113-1121, 2001.

[22] John M Lachin. A review of methods for futility stopping based on conditional power. Statistics in Medicine, 24(18):2747-2764, 2005.

[23] David J Spiegelhalter, Laurence S Freedman, and Patrick R Blackburn. Monitoring clinical trials: conditional or predictive power? Controlled Clinical Trials, 7(1):8-17, 1986.

[24] David L Demets and KK Gordon Lan. Interim analysis: the alpha spending function approach. Statistics in Medicine, 13(13-14):1341-1352, 1994.

[25] Kaspar Rufibach. Treatment effect quantification for time-to-event endpoints - estimands, analysis strategies, and beyond. Pharmaceutical Statistics, 18(2):145-165, 2019.

[26] José L Jiménez. Quantifying treatment differences in confirmatory trials under non-proportional hazards. Journal of Applied Statistics, pages 1-19, 2020.

[27] Jose C Pinheiro and David L DeMets. Estimating and reducing bias in group sequential designs with Gaussian independent increment structure. Biometrika, 84(4):831-845, 1997.

[28] Xiaoyin Fan, David L DeMets, and KK Gordon Lan. Conditional bias of point estimates following a group sequential test. Journal of Biopharmaceutical Statistics, 14(2):505-530, 2004.

[29] SD Walter, GH Guyatt, D Bassler, M Briel, T Ramsay, and HD Han. Randomised trials with provision for early stopping for benefit (or harm): the impact on the estimated treatment effect. Statistics in Medicine, 38(14):2524-2543, 2019.

[30] Boris Freidlin and Edward L Korn. Stopping clinical trials early for benefit: impact on estimation.