A pragmatic adaptive enrichment design for selecting the right target population for cancer immunotherapies

Abstract

One of the challenges in the design of confirmatory trials is to deal with uncertainties regarding the optimal target population for a novel drug. Adaptive enrichment designs (AED) which allow for a data-driven selection of one or more pre-specified biomarker subpopulations at an interim analysis have been proposed in this setting but practical case studies of AEDs are still relatively rare. We present the design of an AED with a binary endpoint in the highly dynamic setting of cancer immunotherapy. The trial was initiated as a conventional trial in early triple-negative breast cancer but amended to an AED based on emerging data external to the trial suggesting that PD-L1 status could be a predictive biomarker. Operating characteristics are discussed including the concept of a minimal detectable difference, that is, the smallest observed treatment effect that would lead to a statistically significant result in at least one of the target populations at the interim or the final analysis, respectively, in the setting of AED.


Anh Nguyen Duc, Dominik Heinzmann, Claude Berge and Marcel Wolbers

09 June, 2020


Cancer immunotherapy (CIT) has revolutionized the treatment of cancer patients. CIT is able to stimulate and promote the immune system and to engage it in the fight against cancer. The immune system normally recognizes and eliminates most early tumor cells, but immunological checkpoints (e.g. PD-L1) constitute a significant obstacle to effective antitumor immune responses. An important class of CITs are PD-1 and PD-L1 inhibitors. PD-L1 protein expression on tumor or immune cells has emerged as a potential predictive biomarker for sensitivity to such CITs. However, uncertainty remains about the value of PD-L1 as a predictive biomarker, which may vary by cancer type or stage, as immune-based interactions are dynamic and complex in nature.

A major topic in research and development of CITs is thus the identification and confirmation of subgroups of patients in whom a treatment is (most) effective, and dealing with uncertainties regarding the optimal target population is an important consideration in the design of pivotal CIT trials. If there is confidence at the time of the design of the pivotal trial that the biomarker has the ability to identify patients who will benefit from the treatment, then the trial can be limited to the biomarker-positive patients (that is, the trial is enriched at the beginning).
If the biomarker is appropriately developed but confidence in its ability to fully identify the correct biomarker population is lacking, then a separate Phase II study investigating the biomarker population could be conducted to inform the Phase III trial design. Alternatively, a single confirmatory adaptive enrichment design (AED) could be conducted, which allows a data-driven selection of one or more pre-specified biomarker subpopulations at an interim analysis and the confirmatory proof of efficacy in the selected subset at the end of the trial.

Regulatory guidance documents for confirmatory adaptive designs are available and stress the importance of prospective planning of adaptations and strong type I error control. Moreover, an ICH guideline on adaptive trials is in preparation. However, although the methodological foundation for adaptive designs was established more than 30 years ago, their impact in drug development has not been as high as anticipated. In particular, practical case studies of AEDs are still rare. For example, according to a recent review of 59 medicines for which an adaptive clinical trial had been submitted to the EMA Scientific Advice, only 5/59 (8%) concerned AEDs, and for only one of them could it be established that the corresponding trial was actually initiated.

In this article, we present a case study of a confirmatory AED in CIT. This trial compares a CIT (atezolizumab) plus chemotherapy versus chemotherapy alone in early triple-negative breast cancer (TNBC) with pathological complete response as the primary (binary) endpoint. The trial was originally planned as a conventional randomized trial in all-comers but then converted to an AED based on emerging data external to the trial suggesting that PD-L1 could be a predictive biomarker for the atezolizumab treatment effect in TNBC. The remainder of this article is structured as follows. Section 2 outlines the general methodological framework for an AED with a binary clinical endpoint.
This section also contains a discussion of the minimal detectable difference (MDD), an important concept for the design of a trial. In Section 3, the methodology is applied to our case study. The article concludes with a discussion.

At the design stage of a trial, it is often uncertain whether all patients or only a targeted subgroup will benefit from the experimental treatment. Adaptive enrichment designs (AED), which allow for a data-driven population selection at interim analyses, have been proposed in this setting. The general methodology to control for multiple testing in such designs via $p$-value combination tests and the closed testing principle was described in Brannath et al. In this section, we summarize the theory of AEDs with a focus on methods relevant to our case study, which is a two-stage AED with two sub-populations defined by a dichotomized biomarker and a binary endpoint (responder vs non-responder). We refer to Section 11.2 of Wassmer and Brannath for a detailed discussion of the general case with multiple stages and sub-populations and general endpoints; a systematic review of methods for identification and confirmation of targeted subgroups in clinical trials is also available in the literature. In the last part of this section, we discuss the calculation of the minimal detectable difference (MDD), which translates the local significance levels to the clinically more interpretable treatment effect scale. To our knowledge, the MDD has not been discussed in the AED setting previously.

Let $F$ denote the full population, $S$ the sub-population of subjects who tested positive for a binary biomarker of interest, and $C$ the subgroup of biomarker-negative subjects. The true response probability in the experimental arm in population $q \in \{F, S\}$ is denoted by $\pi_{1,q}$ and the corresponding probability in the control arm by $\pi_{0,q}$. The two one-sided null hypotheses of interest are $H_{0,q}: \pi_{1,q} - \pi_{0,q} \le 0$, tested against $H_{A,q}: \pi_{1,q} - \pi_{0,q} > 0$, for $q \in \{F, S\}$.

A flow chart of a pivotal AED in this setting is shown in Figure 1.
In stage 1 of the trial, $n_1$ patients from the full population are randomized to the experimental treatment or control, and $F$ and $S$ are co-primary populations. After data from stage 1 are available, a pre-specified interim analysis is conducted and one of the following decisions is taken based on decision criteria as discussed in Section 2.3:

1. Stop early for efficacy in $F$ or $S$ or both.
2. Stop early for futility in both $F$ and $S$.
3. Continue to stage 2 with $S$ as the target population, that is, randomize an additional $n_2$ patients from $S$ in stage 2, and only test $H_{0,S}$ at the final analysis.
4. Continue to stage 2 with $F$ as the target population, that is, randomize an additional $n_2$ patients from $F$ in stage 2, and only test $H_{0,F}$ at the final analysis.
5. Continue to stage 2 with $F$ and $S$ as co-primary populations, that is, randomize an additional $n_2$ patients from $F$ in stage 2, and test both $H_{0,S}$ and $H_{0,F}$ at the final analysis.

Figure 1: Flow chart of adaptive enrichment design as defined above

Decision 1 provides an early opportunity to declare efficacy in case an overwhelming benefit is observed in $F$ or $S$ (or both). Decision 2 avoids exposing additional patients to a potentially inefficacious experimental treatment. The other three decisions are applicable if a promising signal is seen but is not pronounced enough for an early read-out.

Decision 3 is chosen when the biomarker is strongly predictive of treatment benefit; it maximizes the power to reject $H_{0,S}$ and avoids exposing patients in $C$ to an ineffective treatment with potential safety risks. Decisions 4 and 5 are appropriate if a promising signal is observed in both $S$ and $C$ (or $F$). Dropping $S$ as a target population in Decision 4 is sensible when, for example, the predictive value of the relevant biomarker is weak: $H_{0,S}$ is then dropped to maximize the probability of rejecting $H_{0,F}$.

In an AED, type I error can be controlled by combining closed testing with adaptive $p$-value combination.
Denote the unadjusted stage-wise one-sided $p$-values (testing for superiority of the intervention arm) corresponding to the null hypotheses $H_{0,q}$ ($q \in \{F, S\}$) based on data from stage $i$ ($i \in \{1, 2\}$) by $p^{q}_{i}$.

Multiplicity in target populations is handled by using a closed testing procedure. For two target populations, the closed testing principle implies that significance in $S$ or $F$ can only be declared if the intersection null hypothesis $H_{0,S} \cap H_{0,F}$ can also be rejected. Several choices for the test of the intersection hypothesis are possible, see Section 11.2 of Wassmer and Brannath. We use the Simes test because it protects the type I error without requiring strong assumptions and is more powerful than the Bonferroni test. For two hypotheses, the Simes $p$-value of the intersection test at stage $i$ is $p^{(F,S)}_{i} = \min\{2\min(p^{F}_{i}, p^{S}_{i}), \max(p^{F}_{i}, p^{S}_{i})\}$.

The design of the second stage of a two-stage AED, especially the selection of the target population(s) and the associated hypotheses, is driven by trial data from the first stage, which prohibits "naive" analyses that pool data across stages. To remedy this, stage-wise $p$-values can be combined using inverse normal combination tests. Define the $Z$-values corresponding to the stage 1 $p$-values by $Z_1 = \Phi^{-1}(1 - p_1)$ for $p_1 \in \{p^{F}_{1}, p^{S}_{1}, p^{(F,S)}_{1}\}$, and the corresponding values $Z_2$ for stage 2 accordingly. A combined $Z$-value across both stages is then defined as $\tilde{Z}_2 = w_1 Z_1 + w_2 Z_2$ with non-negative weights $w_1$ and $w_2$ satisfying $w_1^2 + w_2^2 = 1$. The $p$-value corresponding to $\tilde{Z}_2$, that is, $\tilde{p}_2 = 1 - \Phi(\tilde{Z}_2)$, is referred to as the combined $p$-value.

The weights have to be pre-specified at the design stage. In our case study, we use the common definition of the weights according to the pre-planned stage-wise sample sizes $n_1$ and $n_2$: $w_1 = \sqrt{n_1/(n_1 + n_2)}$ and $w_2 = \sqrt{n_2/(n_1 + n_2)}$. These weights are optimal if the actual stage-wise sample sizes are proportional to the planned stage-wise sample sizes in each population. Note that discrepancies from this proportionality are expected in case the trial is enriched at stage 2 and only subjects from $S$ are recruited.
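The Simes intersection test and the inverse normal combination described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code from the authors' repository. The function names and the stage-wise $p$-values are hypothetical, the sample sizes 205 and 120 are the ones reported for the case study later in the paper, and the weights use the standard square-root form so that their squares sum to one.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def simes_p(p_f: float, p_s: float) -> float:
    """Simes p-value for the intersection of two null hypotheses."""
    smaller, larger = min(p_f, p_s), max(p_f, p_s)
    return min(2.0 * smaller, larger)


def inverse_normal_combination(p1: float, p2: float, n1: int, n2: int) -> float:
    """Combined p-value from two stage-wise one-sided p-values.

    The weights depend only on the pre-planned stage sizes n1 and n2.
    Both p-values must lie strictly between 0 and 1.
    """
    w1 = sqrt(n1 / (n1 + n2))
    w2 = sqrt(n2 / (n1 + n2))
    z1 = _STD_NORMAL.inv_cdf(1.0 - p1)
    z2 = _STD_NORMAL.inv_cdf(1.0 - p2)
    return 1.0 - _STD_NORMAL.cdf(w1 * z1 + w2 * z2)


# Hypothetical stage-wise p-values, for illustration only.
p_int_1 = simes_p(p_f=0.03, p_s=0.02)  # stage 1 intersection p-value
p_int_2 = simes_p(p_f=0.04, p_s=0.01)  # stage 2 intersection p-value
p_comb = inverse_normal_combination(p_int_1, p_int_2, n1=205, n2=120)
print(p_int_1, p_int_2, round(p_comb, 5))
```

In this sketch the weights depend only on the planned sizes `n1` and `n2`, never on the realized stage 2 sample.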
However, it has been shown that the associated power loss is rather limited in all but extreme cases, which are unlikely in our setting.

Regardless of the adaptations after stage 1, $Z_1$ and $\tilde{Z}_2$ follow the same bivariate distribution as the test statistics of a standard group sequential test with two analyses at information fractions $t_1 = w_1^2$ and $t_2 = 1$. Thus, standard statistical software for group sequential designs can be used to determine local significance levels $\alpha_1$ and $\alpha_2$ after each stage, which allow for early stopping for efficacy and protect the overall significance level across both $Z$-tests.

Following the aforementioned closed testing principle and $p$-value combination approach, overall test decisions which control the family-wise type I error in the strong sense are then defined as follows:

• $H_{0,F}$ is rejected after stage 1 if $p^{(F,S)}_{1} \le \alpha_1$ and $p^{F}_{1} \le \alpha_1$.
• $H_{0,S}$ is rejected after stage 1 if $p^{(F,S)}_{1} \le \alpha_1$ and $p^{S}_{1} \le \alpha_1$.
• $H_{0,F}$ is rejected after stage 2 if $F$ is a target population in stage 2, $\tilde{p}^{(F,S)}_{2} \le \alpha_2$, and $\tilde{p}^{F}_{2} \le \alpha_2$.
• $H_{0,S}$ is rejected after stage 2 if $S$ is a target population in stage 2, $\tilde{p}^{(F,S)}_{2} \le \alpha_2$, and $\tilde{p}^{S}_{2} \le \alpha_2$.

Stopping for futility is discussed in Section 2.3.

If the AED cannot be stopped for compelling efficacy after stage 1, a decision must be made whether to stop for futility or to continue to stage 2 with one or both populations. Decision criteria may be based on the observed treatment effect in $S$ and $C$ (and/or $F$) after stage 1 (as in our case study), conditional or predictive power, or Bayesian decision theory.
While type I error control is guaranteed regardless of how these choices are made, the decision criteria affect the probability of correct decisions after stage 1 and the study power.

In general, the design parameters for an AED include (1) the sample sizes $n_1$ and $n_2$ of stages 1 and 2, respectively, where $n_2$ could be re-calculated at the interim analysis, (2) the $\alpha$-spending approach for early stopping for efficacy, and (3) the exact decision criteria (thresholds) used for population selection as well as for early stopping for futility.

In practice, these design parameters are usually determined by running simulations across a range of plausible scenarios and evaluating design characteristics such as: the overall power of the study, that is, the probability of a statistically significant result for either of the target populations or both at either the interim or final analysis; the conditional power, that is, the probability of a significant result conditional on continuing to stage 2; and the probability of making the "correct" decision(s) at the interim analysis. Overall power measures the success probability of the entire AED, whereas conditional power assesses the probability of success of the additional investment into stage 2. Usually, a trade-off between overall and conditional power needs to be made because maximizing the latter leads to aggressive thresholds for futility stopping and population selection, which may reduce overall power.

Of note, "biomarker status" could often be used as a stratification factor for randomization and analysis. We adopt the common practice of neglecting this in the sample size calculation for the case study in this paper.

In Section 2.2, it was outlined how to determine the local significance levels $\alpha_1$ and $\alpha_2$, that is, the boundary values on which the test decisions described at the end of Section 2.2 are based. To support clinical interpretation of these local significance levels, it is useful to express them on the treatment effect scale.
We denote this as the minimal detectable difference (MDD), that is, the smallest observed absolute risk difference between the two groups which would lead to rejection of the corresponding null hypothesis after stage 1 or 2, respectively. MDDs (also called boundary values on the treatment effect scale) are routinely provided for single-stage or group-sequential trials by standard software such as rpact (version 2.0.5), and an extension to AEDs is described below.

We first consider a single-stage trial and a statistical test of the null hypothesis $H_0: \Delta = \pi_1 - \pi_0 \le 0$ versus $H_A: \Delta = \pi_1 - \pi_0 > 0$ at significance level $\alpha^*$. The observed proportion $\hat{\pi}_0$ in the control arm serves as a nuisance parameter in this setting and, typically, it is assumed to correspond to the hypothesized control proportion from the sample size calculation. Given $\hat{\pi}_0$, the MDD $\delta^*$ corresponds to the observed difference between the two arms which would lead to a $p$-value of exactly $\alpha^*$ using e.g. a signed (one-sided) chi-squared test for hypothesis testing. Numerically, the MDD can be calculated with any one-dimensional root-finding algorithm. If there is substantial uncertainty regarding the true response probability in the control arm, the MDD should be calculated for a range of plausible values.

In an AED, MDDs which lead to the rejection of the respective population null hypothesis after stage 1 can be calculated in the same way as for a single-stage trial. The only additional complication is that the corresponding null hypothesis for each population can only be rejected if the intersection hypothesis is also rejected. If the Simes test is used to test the intersection hypothesis, the intersection hypothesis and the null hypothesis for $S$ can both be rejected after stage 1 if either $2p^{S}_{1} \le \alpha_1$ (that is, $S$ alone is responsible for the rejection of the intersection null hypothesis) or if both $p^{S}_{1} \le \alpha_1$ and $p^{F}_{1} \le \alpha_1$ (that is, $F$ contributes to the rejection of the intersection null hypothesis).
Consequently, two MDDs can be calculated for $S$: first, a conservative MDD which assumes that $S$ alone is responsible for rejection of the intersection null hypothesis and corresponds to the MDD for a single-stage trial based on the "adjusted" $p$-value $2p^{S}_{1}$; second, a more liberal MDD which is only applicable if $F$ contributes to the rejection of the intersection null hypothesis (and hence is also significant with a smaller $p$-value than $S$), and corresponds to the MDD for a single-stage trial based on the raw $p$-value $p^{S}_{1}$. Both the conservative and the liberal MDD will be useful in design discussions with the drug development team. In an analogous way, two MDDs can be calculated for $F$.

MDD calculations after stage 2 are based on the combination test and require additional consistency assumptions to achieve a unique solution. First, we assume that the estimated proportions in each arm are identical for stage 1 and stage 2 data.

If only $S$ continues to stage 2, the intersection test is only relevant for the stage 1 data and it is plausible to additionally assume consistency in the driver for the treatment effect, that is, that the stage 1 $p$-value for the Simes test is driven by $S$ alone and given by $2p^{S}_{1}$. Thus, the MDD for $S$ after stage 2 corresponds to the observed difference between the two arms which would lead to a combined $p$-value of exactly $\alpha_2$ using the combination test based on the adjusted stage 1 $p$-value $2p^{S}_{1}$ and the raw stage 2 $p$-value $p^{S}_{2}$.
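The MDD calculations described so far can be sketched as a root-finding problem. The following is a minimal illustration using only the Python standard library and makes several assumptions that are not taken from the paper: the signed chi-squared test is implemented as a pooled-variance z-test without continuity correction, bisection stands in for "any one-dimensional root-finding algorithm", and all function names and numerical inputs (significance levels, control proportion, sample sizes per arm) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()
_EPS = 1e-15  # keeps probabilities strictly inside (0, 1) for inv_cdf


def p_value(delta, pi0, n_per_arm):
    """One-sided p-value of the pooled-variance z-test for an observed
    risk difference delta, given the observed control proportion pi0."""
    pooled = pi0 + delta / 2.0
    se = sqrt(2.0 * pooled * (1.0 - pooled) / n_per_arm)
    return _N.cdf(-delta / se)


def z_score(p):
    """Z-value Phi^{-1}(1 - p), with 1 - p clipped away from 0 and 1."""
    return _N.inv_cdf(min(max(1.0 - p, _EPS), 1.0 - _EPS))


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f changes sign exactly once."""
    sign_lo = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def mdd_single_stage(alpha, pi0, n_per_arm, factor=1.0):
    """Smallest observed difference with factor * p <= alpha.
    factor=1 gives the liberal MDD, factor=2 the conservative one."""
    return bisect(lambda d: factor * p_value(d, pi0, n_per_arm) - alpha,
                  1e-6, 1.0 - pi0 - 1e-6)


def mdd_stage2_s_only(alpha2, pi0, n1_arm, n2_arm, w1, w2):
    """MDD for S after stage 2 when only S continues: adjusted stage 1
    p-value 2*p1 combined with the raw stage 2 p-value, assuming the same
    observed proportions in both stages."""
    def excess(d):
        p1 = min(2.0 * p_value(d, pi0, n1_arm), 1.0)
        p2 = p_value(d, pi0, n2_arm)
        z = w1 * z_score(p1) + w2 * z_score(p2)
        return _N.cdf(-z) - alpha2
    return bisect(excess, 1e-6, 1.0 - pi0 - 1e-6)


# Hypothetical inputs, for illustration only.
w1, w2 = sqrt(200 / 300), sqrt(100 / 300)
liberal = mdd_single_stage(0.0125, pi0=0.48, n_per_arm=100)
conservative = mdd_single_stage(0.0125, pi0=0.48, n_per_arm=100, factor=2.0)
final = mdd_stage2_s_only(0.02, pi0=0.48, n1_arm=50, n2_arm=60, w1=w1, w2=w2)
print(round(liberal, 3), round(conservative, 3), round(final, 3))
```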
If only $F$ continues to stage 2, the MDD can be calculated in an analogous way.

If both $S$ and $F$ continue to stage 2, we propose to calculate a conservative MDD for $S$ using an adjusted stage 1 $p$-value $2p^{S}_{1}$ and an adjusted stage 2 $p$-value $2p^{S}_{2}$ for the combination test (assuming that the intersection test is driven by $S$ alone in both stages) and a more liberal MDD for $S$ based on the raw stage 1 $p$-value $p^{S}_{1}$ and the raw stage 2 $p$-value $p^{S}_{2}$ (which is only valid if $F$ contributes to the rejection of the intersection test in both stages). In the same way, a conservative and a liberal MDD can be calculated for $F$.

As described above, MDDs can only be derived under additional assumptions. While these assumptions will often be approximately true, it is important to recognize that formal test decisions should be based on the local significance levels $\alpha_1$ and $\alpha_2$ and not on the MDD. However, the MDD is on a clinically relevant scale and as such is extremely helpful for discussing the trial design with cross-functional clinical development teams. For example, a very small MDD indicates that the trial could be statistically significant even though the observed treatment effect is not clinically relevant. This would make it very difficult to market the drug, and hence such a finding may lead to a re-consideration of the trial sample size.

IMpassion031 is a global Phase III, double-blind, 1:1 randomized, multicenter, placebo-controlled study which is conducted to evaluate the efficacy and safety of neoadjuvant treatment with nab-paclitaxel + doxorubicin + cyclophosphamide and either atezolizumab or placebo in invasive stage II/III early triple-negative breast cancer (TNBC).
The CIT atezolizumab is an anti-programmed death-ligand 1 (PD-L1) monoclonal antibody that blocks the binding of PD-L1 to PD-1 and B7.1 receptors, thereby restoring tumor-specific immunity. The primary efficacy endpoint of IMpassion031 is pathological complete response (pCR), a binary clinical endpoint evaluated at surgery, which takes place approximately 6 months after randomization for patients in both arms. pCR in this study is defined as absence of residual invasive cancer in the complete resected breast specimen and all sampled regional lymph nodes following completion of neoadjuvant therapy.

The original IMpassion031 trial had a fixed non-adaptive design with a one-sided significance level of 2.5% and a target sample size of 204 subjects from the overall population $F$, randomized 1:1 to receive either atezolizumab in combination with chemotherapy or chemotherapy alone. This yields a power of 79% to detect an absolute increase of 20% in the pCR rate in the combination arm over a true pCR rate of 48% in the control arm, accounting for a 5% drop-out rate in both treatment arms, whereby drop-outs are considered non-responders. A chi-square test for two proportions was assumed.

The trial was initiated on 24th July 2017 and recruitment of the original 204 subjects independent of PD-L1 status was completed on 12th June 2018, with 205 patients actually enrolled. During patient follow-up and prior to study unblinding, data external to the trial emerged suggesting that PD-L1 could be a predictive biomarker for the treatment effect of atezolizumab in metastatic or locally advanced TNBC (IMpassion130). Although the results of IMpassion130 were compelling with respect to the predictive nature of the PD-L1 biomarker, the extent to which this finding would apply to the early TNBC setting was uncertain. Therefore, the original fixed design was changed to an AED to address this potential predictive biomarker population hypothesis.
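The 79% power quoted above for the original fixed design can be checked approximately with a normal approximation. The sketch below is not the authors' calculation: it assumes a pooled-variance z-test without continuity correction, 102 patients per arm, and drop-outs handled by multiplying both pCR rates by 0.95. The function name is hypothetical, and the result should land close to, but not necessarily exactly on, the reported value.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()


def power_two_proportions(pi0, pi1, n_per_arm, alpha=0.025):
    """Approximate power of the one-sided pooled-variance z-test."""
    pooled = (pi0 + pi1) / 2.0
    se_null = sqrt(2.0 * pooled * (1.0 - pooled) / n_per_arm)
    se_alt = sqrt((pi0 * (1.0 - pi0) + pi1 * (1.0 - pi1)) / n_per_arm)
    z_crit = _N.inv_cdf(1.0 - alpha)
    return _N.cdf(((pi1 - pi0) - z_crit * se_null) / se_alt)


retained = 0.95  # 5% drop-outs, counted as non-responders
power = power_two_proportions(0.48 * retained, 0.68 * retained, n_per_arm=102)
print(f"approximate power: {power:.3f}")
```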
PD-L1 status was dichotomized according to a pre-specified cut-point which had previously been used and validated in the pivotal trial in advanced TNBC.

By the time of the protocol amendment, that is, when transforming the trial to an AED, the target sample size for the original design had already been fully enrolled and follow-up was ongoing. Thus, this phase of the trial was designated as stage 1 of the new AED with $n_1 = 205$ subjects.

The new AED has the following features. The overall one-sided type I error level is $\alpha = 2.5\%$ and both stages use a 1:1 randomization ratio. At the interim analysis at the end of stage 1, to safeguard the scientific integrity of the study, an independent data monitoring committee will look at unblinded data to make recommendations to the trial sponsor regarding early stopping for efficacy or futility or continuing into stage 2 with the target population(s) selected, following the general framework laid down in Section 2 and displayed in Figure 1. The sample size of $n_2 = 120$ subjects for stage 2 was chosen for its favorable design characteristics such as power (see below) without adding too much operational complication such as an extension of the study duration. Importantly, the protocol amendment was completed and submitted to regulatory authorities while the treatment assignment remained double-blinded, hence protecting the integrity of the trial.

Type I error control is implemented by combining closed testing via the Simes test with $p$-value combination using inverse normal combination tests as described in Section 2. In order to allow for an acceptably high probability that the trial stops after stage 1, that is, with the sample size of the original trial, 50% of the overall type I error is spent at stage 1, which determines the local significance levels $\alpha_1$ and $\alpha_2$. Population selection at the interim analysis is based on the observed treatment effects in the PD-L1 positive population $S$ and its counterpart, the PD-L1 negative population $C$.

Based on clinical trial simulations (reported below) and discussion with clinicians, futility decision thresholds $d_S = 12\%$ and $d_C = 10\%$ are chosen. Thus, if the observed treatment effect in the PD-L1 positive population is $\ge 12\%$ after stage 1, it is included as a target population in stage 2. Similarly, if the observed treatment effect in the PD-L1 negative population is $\ge 10\%$ after stage 1, the full population $F$ is included as a target population in stage 2. If both treatment effects exceed their thresholds, both populations are included and tested as co-primary populations at the final analysis. If neither treatment effect is above its threshold, the trial stops for futility. Decisions concerning $F$ are predicated on the observed treatment effect in the PD-L1 negative population instead of the full population to avoid cases where the benefit in $F$ is strongly driven by $S$, hence minimizing the risk of exposing patients in $C$ to a potentially futile treatment independent of its activity in $S$.

Simulations are used to investigate the properties of the chosen AED and the decision thresholds $d_S$ and $d_C$. Three scenarios are reported here which all assume a 47% prevalence of the PD-L1 positive subgroup (based on internal and external information such as Kwa et al.), a pCR rate of 48% in the control arm, and a treatment effect of 20% in the PD-L1 positive subgroup. The treatment effect in PD-L1 negative subjects is varied between 4%, 12%, and 20% (Table 1). The original trial design assumed a homogeneous treatment effect of 20% in the full population, but scenarios with a reduced treatment effect in PD-L1 negative patients are also plausible in view of the external information that triggered the amendment to IMpassion031. Of note, all simulations assumed a drop-out rate of 5% in both treatment arms and stages, independent of the pCR outcome, and that drop-outs were considered non-responders.
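The interim decision rule and the simulation set-up described above can be sketched as follows. This is a simplified illustration using only the Python standard library, not the simulation code from the authors' repository. It assumes a pooled-variance z-test in place of the chi-square test, alternating treatment allocation as a stand-in for 1:1 randomization, and a stage 1 local level of 0.0125, read here as half of the overall 2.5% level. The function names are hypothetical, and the second, stricter pair of thresholds (15% and 12%) is included only to show the direction of the effect on futility stopping.

```python
import random
from math import sqrt
from statistics import NormalDist

_N = NormalDist()


def one_sided_p(x1, n1, x0, n0):
    """One-sided p-value of the pooled-variance z-test (arm 1 vs arm 0)."""
    if n1 == 0 or n0 == 0:
        return 1.0
    pooled = (x1 + x0) / (n1 + n0)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n0))
    return 1.0 if se == 0.0 else _N.cdf(-(x1 / n1 - x0 / n0) / se)


def risk_difference(x1, n1, x0, n0):
    """Observed difference in response rates; -inf if an arm is empty."""
    return x1 / n1 - x0 / n0 if n1 and n0 else float("-inf")


def stage1_decision(rng, d_s, d_c, effect_c, effect_s=0.20, pcr0=0.48,
                    prevalence=0.47, n1=205, retained=0.95, alpha1=0.0125):
    """Simulate stage 1 once and return the interim decision as a string."""
    # cell[(biomarker_positive, treated)] = [responders, patients]
    cell = {(b, t): [0, 0] for b in (0, 1) for t in (0, 1)}
    for i in range(n1):
        b = int(rng.random() < prevalence)
        t = i % 2  # alternating allocation as a stand-in for randomization
        effect = effect_s if b else effect_c
        rate = (pcr0 + t * effect) * retained  # drop-outs are non-responders
        cell[(b, t)][0] += rng.random() < rate
        cell[(b, t)][1] += 1
    (xs1, ns1), (xs0, ns0) = cell[(1, 1)], cell[(1, 0)]
    (xc1, nc1), (xc0, nc0) = cell[(0, 1)], cell[(0, 0)]
    p_s = one_sided_p(xs1, ns1, xs0, ns0)
    p_f = one_sided_p(xs1 + xc1, ns1 + nc1, xs0 + xc0, ns0 + nc0)
    # Simes intersection test; rejecting it implies min(p_s, p_f) <= alpha1,
    # so at least one of the two population hypotheses is rejected as well.
    if min(2.0 * min(p_s, p_f), max(p_s, p_f)) <= alpha1:
        return "efficacy"
    keep_s = risk_difference(xs1, ns1, xs0, ns0) >= d_s
    keep_f = risk_difference(xc1, nc1, xc0, nc0) >= d_c
    if keep_s and keep_f:
        return "continue F and S"
    if keep_s:
        return "continue S only"
    return "continue F only" if keep_f else "futility"


def simulate(n_trials, seed, **kwargs):
    """Relative frequency of each interim decision over simulated trials."""
    rng = random.Random(seed)
    freq = {}
    for _ in range(n_trials):
        decision = stage1_decision(rng, **kwargs)
        freq[decision] = freq.get(decision, 0.0) + 1.0 / n_trials
    return freq


# Homogeneous 20% effect; same seed, so both runs see identical data.
base = simulate(2000, seed=1, d_s=0.12, d_c=0.10, effect_c=0.20)
strict = simulate(2000, seed=1, d_s=0.15, d_c=0.12, effect_c=0.20)
print(base)
print(strict)
```

In the sketch, drop-outs are handled by scaling every response probability by the retained fraction of 95%.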
Therefore, as an example, an underlying control arm pCR rate of 48% corresponds to an effective rate of 0.48 × 0.95 = 0.456 in the simulations. In addition to the decision thresholds $d_S = 12\%$ and $d_C = 10\%$, results for an alternative set of more aggressive decision thresholds $d'_S = 15\%$ and $d'_C = 12\%$ are also displayed for comparison purposes. Reported results for each scenario represent averages over 100,000 simulated trials.

Under the original trial assumptions (scenario 1), the overall probability to stop for efficacy (that is, to reject either $H_{0,S}$ or $H_{0,F}$ or both) after stage 1 is 64% compared to a power of 79% for the original trial design (Table 2). This quantifies the price for the additional flexibility of two co-primary populations, the choice of the target population for stage 2, and a second opportunity to declare efficacy after stage 2. The futility decision thresholds $d_S$ and $d_C$ do not impact the estimated probabilities of stopping for efficacy at stage 1. However, the more aggressive (higher) decision thresholds almost double the likelihood of futility stopping in all scenarios, whereas they reduce the likelihood of continuing to stage 2 with only patients from $S$. Similarly, using the more aggressive thresholds reduces the chance of continuing to stage 2 with both $S$ and $F$, whereas the chance of continuing only in $F$ is slightly increased in all scenarios. This latter finding is mainly due to the fact that many of the simulated trials which would continue to stage 2 with both $F$ and $S$ for the less aggressive thresholds continue with only $F$ for the more aggressive thresholds, because the increase from $d_S$ to $d'_S$ is larger than the increase from $d_C$ to $d'_C$.
Hence, $d_S = 12\%$ and $d_C = 10\%$ are considered the better choice due to the lower chance of stopping for futility as well as the higher chance of continuing in either $S$ or both $F$ and $S$, which provides more confidence to clinicians based on the mode of action of our active treatment.

The overall power for the revised IMpassion031 design is 88% (versus 79% for the original design) for scenario 1, 76% (versus 57%) for scenario 2, and 67% (versus 35%) for scenario 3 (Table 3). While the power is still below 80% for scenarios with a reduced treatment benefit in PD-L1 negative patients, it is substantially increased compared to the original design. An even larger increase in power would have required a larger stage 2 sample size, which would have put challenges on recruitment feasibility and timelines, in particular if one would only continue with $S$ in stage 2.

More aggressive thresholds for assessing futility and population selection would lead to a general drop in overall power (Table 3) but increase the conditional power, that is, the conditional rejection probabilities if stage 2 is activated (Table 4). However, one needs to take into account the greater risk of not activating stage 2 with the more aggressive thresholds.

Finally, MDDs for IMpassion031 are displayed in Table 5. An observed difference in response rates of at least 17% in $F$ or at least 25% in $S$ (or a difference of both at least 16% in $F$ and at least 22% in $S$ jointly) is required for an efficacy stop after stage 1. After stage 2, the smallest MDD is 12% for $F$, which applies if $F$ and $S$ are both tested at stage 2 and the treatment effect in $S$ alone is also at least 17%. These MDDs were all considered to correspond to clinically relevant effect sizes.

Of note, the futility boundaries at the interim were $d_S = 12\%$ and $d_C = 10\%$ (consistent with a treatment effect of $\approx 11\%$ in $F$). These values are smaller than the calculated MDDs after stage 2. If this were not the case, the same observed treatment effect might lead to a futility stop at the interim analysis but also to a rejection of the null hypothesis at the final analysis. This would be incoherent and should lead to a re-assessment of the planned futility boundaries.

The world of cancer immunotherapies (CITs) is highly dynamic and data external to an ongoing trial are evolving quickly. In this paper, we showed that it is possible to incorporate emerging data regarding the most appropriate target population into an ongoing trial by amending it to an adaptive enrichment design (AED). The implemented two-stage AED allows for both efficacy and futility stopping after stage 1, as well as for population selection for stage 2. Importantly, the protocol was amended and submitted to regulatory agencies while stage 1 was still ongoing and prior to any unblinding of the double-blind trial, hence protecting the integrity of the design.

This paper also introduced the concept of the minimal detectable difference (MDD) to AEDs, which is relevant for the discussion of the design with stakeholders beyond biostatistics.

The proposed AED in CIT addresses several important considerations highlighted in the recent adaptive trials guidance by the FDA for enrichment designs. First, the design combines established statistical methods to protect the familywise type I error in the strong sense. Second, the data external to the trial provided a strong rationale that the benefit-risk profile may be more favorable in the PD-L1 positive subgroup. Third, the PD-L1 assay and the threshold used to define PD-L1 positivity had been previously validated. Fourth, stage 1 of the trial was large enough to characterize the treatment effect in the complement population, that is, the PD-L1 negative subgroup, even in the situation when only PD-L1 positive subjects would be included in stage 2.

However, our proposal also has several limitations.
The amendment was promptly implemented following the release of the external data, but this occurred only after recruitment to stage 1 of the trial had already been completed. This prevented optimization of some design parameters of the AED such as the stage 1 sample size. Moreover, our derivation of the MDD relied on the simple dependence of the Simes test on the $p$-values from each population. If more complex intersection tests are employed, e.g. the test proposed by Spiessens and Debois, this would further complicate the quantification of the dependence of test decisions for each population on the intersection test. In addition, we focused on hypothesis testing but did not cover estimation and inference in AEDs, which is an important area of current research. We refer to Chapter 8 of Wassmer and Brannath for a general discussion of estimation in adaptive trials and to Kunzmann et al. for a discussion of estimation in the context of AEDs. Finally, the current setting of a binary endpoint is methodologically easier than the time-to-event setting, which is frequent in oncology and where additional complications arise.

In summary, this case study demonstrated that AEDs are an efficient way to address emerging uncertainties about the target population for cancer immunotherapy. AEDs are still relatively rarely used in clinical development, and we hope that this paper promotes their use and increases the confidence that such designs are feasible in practice.

The code and the results of all computations described in this paper are available as a GitHub repository: https://github.com/nguyenducanhvn101087/Enrichment_Adaptive_Design

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

References

1. Daniel S. Chen and Ira Mellman. Oncology Meets Immunology: The Cancer-Immunity Cycle. Immunity, 39(1):1–10, July 2013. doi: 10.1016/j.immuni.2013.07.012. URL https://doi.org/10.1016/j.immuni.2013.07.012.

2. Andrew A. Davis and Vaibhav G. Patel. The role of PD-L1 expression as a predictive biomarker: an analysis of all US food and drug administration (FDA) approvals of immune checkpoint inhibitors. Journal for ImmunoTherapy of Cancer, 7(1), October 2019. doi: 10.1186/s40425-019-0768-9. URL https://doi.org/10.1186/s40425-019-0768-9.

3. Anil P. George, Timothy M. Kuzel, Yi Zhang, and Bin Zhang. The Discovery of Biomarkers in Cancer Immunotherapy. Computational and Structural Biotechnology Journal, 17:484–497, 2019. doi: 10.1016/j.csbj.2019.03.015. URL https://doi.org/10.1016/j.csbj.2019.03.015.

4. Sue-Jane Wang, Robert T. O'Neill, and H. M. James Hung. Approaches to evaluation of treatment effect in randomized clinical trials with genomic subset. Pharmaceutical Statistics, 6(3):227–244, 2007. doi: 10.1002/pst.300. URL https://doi.org/10.1002/pst.300.

5. Kaspar Rufibach, Meng Chen, and Hoa Nguyen. Comparison of different clinical development plans for confirmatory subpopulation selection. Contemporary Clinical Trials [...]

[...] Statistics in Medicine, 35(3):325–347, March 2015. doi: 10.1002/sim.6472. URL https://doi.org/10.1002/sim.6472.

10. Olivier Collignon, Franz Koenig, Armin Koch, Robert James Hemmings, Frank Pétavy, Agnès Saint-Raymond, Marisa Papaluca-Amati, and Martin Posch. Adaptive designs in clinical trials: from scientific advice to marketing authorisation to the European Medicine Agency. Trials, 19(1), November 2018. doi: 10.1186/s13063-018-3012-x. URL https://doi.org/10.1186/s13063-018-3012-x.

11. Martin Jenkins, Andrew Stone, and Christopher Jennison. An adaptive seamless phase II/III design for oncology trials with subpopulation selection using correlated survival endpoints. Pharmaceutical Statistics, 10(4):347–356, December 2010. doi: 10.1002/pst.472. URL https://doi.org/10.1002/pst.472.

12. Werner Brannath, Emmanuel Zuber, Michael Branson, Frank Bretz, Paul Gallo, Martin Posch, and Amy Racine-Poon. Confirmatory adaptive designs with Bayesian decision tools for a targeted therapy in oncology. Statistics in Medicine, 28(10):1445–1463, May 2009. doi: 10.1002/sim.3559. URL https://doi.org/10.1002/sim.3559.

13. Gernot Wassmer and Werner Brannath. Group sequential and confirmatory adaptive designs in clinical trials. Springer, Switzerland, 2016. ISBN 978-3-319-32560-6.

14. Thomas Ondra, Alex Dmitrienko, Tim Friede, Alexandra Graf, Frank Miller, Nigel Stallard, and Martin Posch. Methods for identification and confirmation of targeted subgroups in clinical trials: A systematic review. Journal of Biopharmaceutical Statistics, 26(1):99–119, September 2015. doi: 10.1080/10543406.2015.1092034. URL https://doi.org/10.1080/10543406.2015.1092034.

15. Ruth Marcus, Eric Peritz, and K. R. Gabriel. On Closed Testing Procedures with Special Reference to Ordered Analysis of Variance. Biometrika [...]

[...] Journal of the American Statistical Association, 92(440):1601–1608, December 1997. doi: 10.1080/01621459.1997.10473682. URL https://doi.org/10.1080/01621459.1997.10473682.

17. Walter Lehmacher and Gernot Wassmer. Adaptive Sample Size Calculations in Group Sequential Trials. Biometrics, 55(4):1286–1290, December 1999. doi: 10.1111/j.0006-341x.1999.01286.x. URL https://doi.org/10.1111/j.0006-341x.1999.01286.x.

18. Deepak L. Bhatt and Cyrus Mehta. Adaptive Designs for Clinical Trials. New England Journal of Medicine, 375(1):65–74, July 2016. doi: 10.1056/nejmra1510061. URL https://doi.org/10.1056/nejmra1510061.

19. Heiko Götte, Margarita Donica, and Giacomo Mordenti. Improving Probabilities of Correct Interim Decision in Population Enrichment Designs. Journal of Biopharmaceutical Statistics, 25(5):1020–1038, June 2014. doi: 10.1080/10543406.2014.929583. URL https://doi.org/10.1080/10543406.2014.929583.

20. Johannes Krisam and Meinhard Kieser. Optimal Decision Rules for Biomarker-Based Subgroup Selection for a Targeted Therapy in Oncology. International Journal of Molecular Sciences, 16(12):10354–10375, May 2015. doi: 10.3390/ijms160510354. URL https://doi.org/10.3390/ijms160510354.

21. Thomas Ondra, Sebastian Jobjörnsson, Robert A Beckman, Carl-Fredrik Burman, Franz König, Nigel Stallard, and Martin Posch. Optimized adaptive enrichment designs. Statistical Methods in Medical Research, 28(7):2096–2111, December 2017. doi: 10.1177/0962280217747312. URL https://doi.org/10.1177/0962280217747312.

22. Wassmer G and Pahlke F. rpact: Confirmatory Adaptive Clinical Trial Design and Analysis [...]

[...] Cancer, 124(10):2086–2103, February 2018. doi: 10.1002/cncr.31272. URL https://doi.org/10.1002/cncr.31272.

27. Bart Spiessens and Muriel Debois. Adjusted significance levels for subgroup analyses in clinical trials. Contemporary Clinical Trials, 31(6):647–656, November 2010. doi: 10.1016/j.cct.2010.08.011. URL https://doi.org/10.1016/j.cct.2010.08.011.

28. Kevin Kunzmann, Laura Benner, and Meinhard Kieser. Point estimation in adaptive enrichment designs. Statistics in Medicine, 36(25):3935–3947, August 2017. doi: 10.1002/sim.7412. URL https://doi.org/10.1002/sim.7412.

29. Dominic Magirr, Thomas Jaki, Franz Koenig, and Martin Posch. Sample Size Reassessment and Hypothesis Testing in Adaptive Survival Trials. PLOS ONE, 11(2):e0146465, February 2016. doi: 10.1371/journal.pone.0146465. URL https://doi.org/10.1371/journal.pone.0146465.

Tables

Table 1: Scenarios investigated in the clinical trial simulations
Columns: Scenario, π_S − π, π_C − π, π_F − π, π
π_q is the pCR rate in the intervention arm for population q, q ∈ (F, S, C). π is the pCR rate in the control arm, assumed to be the same for S and C. Numbers in the table refer to true pCR rates.

Table 2: Relative frequencies of decisions at stage 1 for IMpassion031 for the different scenarios and different futility decision thresholds
Columns: six pairs of futility decision thresholds (d_S, d_C)