Learning for Dose Allocation in Adaptive Clinical Trials with Safety Constraints

Abstract

Phase I dose-finding trials are increasingly challenging as the relationship between efficacy and toxicity of new compounds (or combination of them) becomes more complex. Despite this, most commonly used methods in practice focus on identifying a Maximum Tolerated Dose (MTD) by learning only from toxicity events. We present a novel adaptive clinical trial methodology, called Safe Efficacy Exploration Dose Allocation (SEEDA), that aims at maximizing the cumulative efficacies while satisfying the toxicity safety constraint with high probability. We evaluate performance objectives that have operational meanings in practical clinical trials, including cumulative efficacy, recommendation/allocation success probabilities, toxicity violation probability, and sample efficiency. An extended SEEDA-Plateau algorithm that is tailored for the increase-then-plateau efficacy behavior of molecularly targeted agents (MTA) is also presented. Through numerical experiments using both synthetic and real-world datasets, we show that SEEDA outperforms state-of-the-art clinical trial designs by finding the optimal dose with higher success rate and fewer patients.



Cong Shen Zhiyang Wang Sofía S. Villar Mihaela van der Schaar


1. Introduction

An adaptive clinical trial utilizes the accumulated results to dynamically modify its future trajectory for better efficiency and ethics, while preserving the integrity and validity of the study. Studies such as the phase I trial in Acute Myeloid Leukaemia in (Yap et al., 2013) and Cancer Research UK study CR0720-11 in (Whitehead et al., 2012) have suggested that even some simple forms of adaptive design lead to better usage of resources and require fewer participants. These promising results have spawned interest in developing adaptive clinical trial methodologies in recent years (Villar et al., 2015a; Pallmann et al., 2018; Atan et al., 2019; Lee et al., 2020), which is of great importance because running an actual clinical trial on human subjects is expensive and ethically sensitive. A well-designed trial methodology with thorough theoretical and simulated investigation is widely acknowledged as a crucial first step.

University of Virginia, USA; University of Pennsylvania, USA; University of Cambridge, United Kingdom; University of California, Los Angeles, USA. Correspondence to: Cong Shen.

Proceedings of the th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

Traditionally, the goal of phase I clinical trials is to identify the Maximum Tolerated Dose (MTD) of a cytotoxic (CTX) or therapeutic agent, which is then used for subsequent studies (Storer, 1989). However, modern cancer phase I trials test antineoplastic agents in patients with advanced cancer stages, who have often exhausted all other available treatment options (Roberts et al., 2004). These participants usually expect therapeutic benefit from participating in the trial, which has motivated the trial design to include efficacy as a co-primary end point of phase I dose-finding studies (Yan et al., 2017; Paoletti & Postel-Vinay, 2018). In addition, the monotonic assumption for the dose-efficacy relationship is widely adopted in state-of-the-art designs, which is reasonable for cytotoxic agents but may not apply to the new molecularly targeted agents (MTA) such as monoclonal antibodies (see (Postel-Vinay et al., 2009) for an exemplary trial that illustrates this issue). Designing adaptive clinical trials that can properly address the intrinsic conflict between learning and treatment effectiveness for general dose-response models has become an important task for phase I clinical trials.

In addition to the well-known 3+3 design (Storer, 1989) and continual reassessment method (CRM) (O'Quigley et al., 1990) (and its many variants), Bayesian approaches such as Thompson Sampling (TS) (Aziz et al., 2019) and Gittins index (Villar et al., 2015a;b) have been proposed in the literature for dose-finding studies. However, these methods were originally designed for simplified models that do not capture some of the unique characteristics of clinical trials, often leading to lack of randomization (Villar et al., 2015b), inefficient use of side information (Villar & Rosenberger, 2018), and reduced power levels and estimation issues. Notably, for cases where the best dose for combination therapies is to be found, unknown synergistic/antagonist effects are likely to exist and naive designs will fail to identify them. For MTA, the existence of a plateau of efficacy has been discussed in (Zang et al., 2014) and (Riviere et al., 2018), which indicates that the toxicity constraint must be jointly studied with the dose-efficacy relationship for certain new compounds. This is also confirmed by the real-world trial result; see (Tighiouart et al., 2014). Last but not least, safety constraints such as minimizing the adverse events (AE) (Petroni et al., 2017) have not been properly evaluated with theoretical guarantees. Table 1 summarizes some representative studies in this direction.

Table 1: Representative adaptive clinical trial studies

Study | Treatment | Category | Methodology | Evaluation
(Tighiouart et al., 2014) | Veliparib | CTX | EWOC-PH | simulated trial
(Whitehead et al., 2012) | MK-0752 | CTX | joint phase I and II design | simulated trial
(Lee et al., 2017) | Erlotinib | MTA | extended TITE-CRM | simulated trial
(Thiessen et al., 2010) | Lapatinib | MTA | escalation to DLT | real-world trial data

In this paper, we address these challenges by developing new dose-finding methods that explicitly impose safety constraints on the allocation and recommendation of dose levels in a phase I clinical trial. Through the lens of multi-armed bandits (MAB), we propose the Safe Efficacy Exploration Dose Allocation (SEEDA) algorithm that adaptively updates the admissible set of dose levels satisfying the safety constraints, thus limiting the exploration of doses with harmful effects. Performance analysis for SEEDA is carried out with respect to several measures that have operational meanings in clinical trials, including the probability of safety constraint violation, the average efficacy for patients, and the recommendation and allocation probabilities. Noting that SEEDA only leverages the dose-toxicity logistic model and makes no assumptions on the efficacy, we then show that, by considering the increase-then-plateau feature of the dose-efficacy relationship for MTA, SEEDA-Plateau leads to better performance by leveraging the unimodal structure. Experiments on simulated datasets as well as clinical trials built from real-world datasets show that the proposed methods are capable of finding the optimal dose with higher success rate and fewer patients in most cases, compared to other state-of-the-art designs.

2. Model and problem formulation

In a phase I dose-finding clinical trial, a total of $K$ doses are given, where the $k$-th dose is denoted as $d_k \in \mathcal{D}$, $k \in \mathcal{K} = \{1, 2, ..., K\}$. The performance is characterized by both efficacy and toxicity. We model the efficacy $X$ and toxicity $Y$ for dose $d_k$ as Bernoulli random variables with unknown probabilities $q_k$ and $p_k$, respectively, where $X = 1$ ($X = 0$) indicates that the dose level is effective (not effective), and $Y = 1$ ($Y = 0$) suggests that the dose is harmful (not harmful) to the patient. (Harm is typically measured by the presence or absence of a dose-limiting toxicity (DLT) reported in a fixed evaluation window after administering the drug.)

We consider adaptive clinical trials where information learned from previous trial patients can be used in allocating doses to subsequent patients (Atan et al., 2019; Villar et al., 2015a; Aziz et al., 2019). For the $t$-th patient, dose $I(t)$ is selected based on a policy that uses past observations, and administered to the patient. The efficacy outcome $X_t$ and toxicity response $Y_t$ are realized based on their distributions $X_t \sim \mathrm{Ber}(q_{I(t)})$ and $Y_t \sim \mathrm{Ber}(p_{I(t)})$, and observed by the trialist.

We adopt a well-known dose-toxicity logistic model proposed in (O'Quigley et al., 1990) to describe the toxicity probability for different dose levels:

$$p_k(a) = \left(\frac{\tanh d_k + 1}{2}\right)^{a}, \tag{1}$$

where $a$ is a global parameter for all the dose levels. It can be verified that Eqn. (1) satisfies the assumption that the toxicity monotonically increases with dose $d_k$. The unsafe dose levels are defined as those whose toxicity probabilities $p_k$ are above a pre-determined target toxicity probability $\theta$, which is referred to as the MTD threshold. Hence the toxicities of all doses can be written as $p_1 \le p_2 \le \cdots \le p_M < \theta < p_{M+1} \le \cdots \le p_K$, where the (unknown) $M$ denotes the number of safe doses. The efficacy-dose relationship is not modeled, to allow for the development of a general algorithm. The specific increase-then-plateau efficacy behavior of MTA will be exploited in Section 4.

Several objectives are often desired for a successful dose-finding study, which are summarized as follows.

• Successful recommendation. At the end of the trial ($n$ patients) a dose recommendation $\hat{k}_n$ is made, which is desired to match the optimal dose $k^*$, that is, the lowest safe dose that achieves the highest efficacy (Zang et al., 2014): $k^* = \min\{k : q_k = \max_{l : l \in \mathcal{K},\, p_l \le \theta} q_l\}$.

• Effective treatment. The cumulative treatment effect for trial participants, $\sum_{t=1}^{n} X_t$, is desired to be maximized.

• Minimal violation of the safety constraint. There are different formulations for the safety constraint. One is to minimize $\mathbb{E}\left[\sum_{k \in \mathcal{K},\, p_k > \theta} N_k(n)/n\right]$, where $N_k(t)$ denotes the number of times dose $k$ is allocated to the first $t$ patients. Another formulation is to minimize the probability that the average toxicity exceeds the MTD threshold.

• Small sample size. Most phase I trials have a pre-determined $n$, which is decided as the minimum number of trial participants needed to achieve a pre-defined confidence level of successful recommendation. It is desirable to have a small $n$ for cost and efficiency considerations.

Proposing a learning model that explicitly guarantees all of the above objectives is elusive and non-constructive in developing the dose-allocation policy. We thus formulate dose-finding clinical trials as an online efficacy learning problem with an explicit safety constraint, and subsequently provide performance analysis on the metrics of interest. Specifically, we aim at maximizing the cumulative efficacy over a finite number of patients $n$ while simultaneously guaranteeing that the average toxicity observed from the $n$ dose allocations is kept under the probability threshold $\theta$ with high probability. This can be written as:

$$\text{maximize } \mathbb{E}\left[\sum_{t=1}^{n} X_t\right] \quad \text{subject to } \mathbb{P}\left[\frac{1}{n}\sum_{t=1}^{n} Y_t \le \theta\right] \ge 1 - \delta. \tag{2}$$

Essentially, problem formulation (2) focuses on safe exploration among all the dose levels to maximize cumulative efficacies. Clinical trial designs for (2) thus need to pursue both objectives of toxicity and efficacy.
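The toxicity model (1) and the definition of the optimal dose $k^*$ can be illustrated with a small sketch. The dose labels, the parameter value, the efficacy vector, and the threshold below are hypothetical values chosen only for illustration; they are not taken from the experiments in this paper.

```python
import math


def toxicity_prob(dose, a):
    """Eqn. (1): p_k(a) = ((tanh(d_k) + 1) / 2) ** a."""
    return ((math.tanh(dose) + 1.0) / 2.0) ** a


def optimal_dose(q, p, theta):
    """0-based index of the lowest safe dose with the highest efficacy.

    Returns None when no dose satisfies p_k <= theta.
    """
    safe = [k for k in range(len(p)) if p[k] <= theta]
    if not safe:
        return None
    best = max(q[k] for k in safe)
    return min(k for k in safe if q[k] == best)


# Hypothetical setting (assumed, not from the paper).
doses = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0]
a_true = 1.0
theta = 0.3
p = [toxicity_prob(d, a_true) for d in doses]   # increasing in the dose label
q = [0.2, 0.4, 0.6, 0.6, 0.6, 0.6]              # increase-then-plateau efficacy
k_star = optimal_dose(q, p, theta)
```

In this assumed setting only the first three doses are safe, so the sketch returns the third dose (index 2) even though higher doses reach the same efficacy.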

3. The SEEDA algorithm

The proposed Safe Efficacy Exploration Dose Allocation (SEEDA) design is completely described in Algorithm 1. In particular, $\hat{p}_k(t)$ and $\hat{q}_k(t)$ are the estimated toxicity and efficacy, respectively, after the $t$-th patient has been treated. The principle of dose selection is to first dynamically construct the admissible set $\mathcal{D}(t)$ using the Upper Confidence Bound (UCB) principle (Auer et al., 2002), where the confidence interval $\alpha(t)$ is constructed as

$$\alpha(t) = \bar{C} K \left(\frac{\log \frac{K}{\delta}}{t}\right)^{\bar{\gamma}}, \tag{3}$$

where $\bar{C}$ and $\bar{\gamma}$ are algorithm parameters (see Section B in the supplementary material for a discussion on how to select these algorithm parameters). Note that the admissible set consists of doses that, with high confidence, satisfy the toxicity constraint.
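As a minimal sketch of this step, the code below evaluates the confidence width using the symbols of Eqn. (3) as they appear in the text, and then builds the admissible set $\{d_k : p_k(\hat{a}(t) + \alpha(t)) \le \theta\}$ from Algorithm 1. The function names and all numeric values are assumptions made for illustration. Because the base in model (1) lies in $(0, 1)$, $p_k(a)$ decreases as $a$ grows, so a wider interval can only enlarge the set.

```python
import math


def toxicity_prob(dose, a):
    """Toxicity model (1)."""
    return ((math.tanh(dose) + 1.0) / 2.0) ** a


def confidence_width(t, K, delta, C_bar, gamma_bar):
    """alpha(t), following the symbols of Eqn. (3) as printed in the text."""
    return C_bar * K * (math.log(K / delta) / t) ** gamma_bar


def admissible_set(doses, a_hat, alpha, theta):
    """0-based indices k with p_k(a_hat + alpha) <= theta."""
    return [k for k, d in enumerate(doses)
            if toxicity_prob(d, a_hat + alpha) <= theta]


# Hypothetical values (assumed for illustration only).
doses = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0]
early = confidence_width(t=50, K=6, delta=0.05, C_bar=0.1, gamma_bar=0.5)
late = confidence_width(t=500, K=6, delta=0.05, C_bar=0.1, gamma_bar=0.5)
tight = admissible_set(doses, a_hat=1.0, alpha=0.0, theta=0.3)
loose = admissible_set(doses, a_hat=1.0, alpha=1.0, theta=0.3)
```

The width shrinks as more patients are observed, and the set built with the larger width contains the one built with no width at all.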

Then, limiting to those in the admissible set $\mathcal{D}(t)$, the algorithm again applies the UCB principle (UCB-1 from (Auer et al., 2002)) to select a dose with the largest $F(p, s, n)$ for the efficacy estimate:

$$F(p, s, n) = p + \sqrt{\frac{c \log(n)}{s}}, \tag{4}$$

with $c$ denoting the UCB-1 coefficient. It should be noted that (4) can be replaced by other UCB principles, e.g., KL-UCB (Garivier & Cappé, 2011).

Algorithm 1

The Safe Efficacy Exploration Dose Allocation (SEEDA) Algorithm

Input: $p_k(a)$ for each $k \in \mathcal{K}$; MTD threshold $\theta$; total number of patients $n$.
Initialize: $N_k(1) = 0$, $\hat{p}_k(1) = 0$, $\hat{q}_k(1) = 0$, $\forall k \in \mathcal{K}$;
Sample each dose once and set: $I(t) = t$, $\hat{q}_{I(t)}(K) = X_t$, $\hat{p}_{I(t)}(K) = Y_t$, $N_{I(t)}(K) = 1$, for $t = 1$ to $K$; $t = K + 1$.
while $t \le n$ do
    Compute the estimated parameter: $\hat{a}(t) = \sum_{k=1}^{K} w_k(t-1)\, \hat{a}_k(t-1)$;
    Set the admissible set: $\mathcal{D}(t) = \{d_k \in \mathcal{D} : p_k(\hat{a}(t) + \alpha(t)) \le \theta\}$;
    Select dose: $I(t) = \arg\max_{d_k \in \mathcal{D}(t)} F(\hat{q}_k(t), N_k(t), t)$;
    Observe the revealed outcomes $X_t$ and $Y_t$;
    Update estimations: $\hat{q}_{I(t)}(t) = \frac{\hat{q}_{I(t)}(t-1)\, N_{I(t)}(t-1) + X_t}{N_{I(t)}(t-1) + 1}$, $\hat{p}_{I(t)}(t) = \frac{\hat{p}_{I(t)}(t-1)\, N_{I(t)}(t-1) + Y_t}{N_{I(t)}(t-1) + 1}$, $N_{I(t)}(t) = N_{I(t)}(t-1) + 1$;
    Update parameter estimation: $\hat{a}_{I(t)}(t) = \arg\min_{a \in \mathcal{A}} |p_{I(t)}(a) - \hat{p}_{I(t)}(t)|$;
    Update weights: $w_k(t) = N_k(t)/t$, $\forall d_k \in \mathcal{D}$; $t = t + 1$.
end while
Output: $\hat{d}(n) = \arg\max_{d_k : p_k(\hat{a}(n)) \le \theta} p_k(\hat{a}(n))$.

The SEEDA algorithm is developed with the aim to solve problem (2). It is thus important to analyze (a) whether the cumulative efficacy is maximized, and (b) how often the toxicity constraint is violated. Metric (a) can be equivalently formulated as regret minimization, i.e., minimizing the cumulative efficacy difference between the oracle policy with full information and the learning algorithm. Formally, the efficacy regret is defined as

$$R(n) = q^* n - \mathbb{E}\left[\sum_{t=1}^{n} q_{I(t)}\right], \tag{5}$$

where $q^* = q_{k^*}$ denotes the efficacy associated with the optimal dose defined in Section 2.2, and $a^*$ denotes the true parameter in (1). As for metric (b), we need to evaluate

$$e(n) = \mathbb{P}\left[\frac{1}{n}\sum_{t=1}^{n} p_{I(t)}(a^*) > \theta\right],$$

in conjunction with (5), i.e., whether the proposed SEEDA algorithm minimizes $R(n)$ and satisfies $e(n) \le \delta$ at the same time. In addition, other performance measures such as successful recommendation probability and sample efficiency are of practical interest, and we provide theoretical guarantees for them as well. Due to space limitations, all proofs are provided in the supplementary material.

3.2.1. Cumulative efficacy

We start the theoretical analysis by showing that for each patient $t$ in SEEDA, the dose levels whose toxicities are below the MTD threshold are included in the admissible set with high probability. This corresponds to the type I error event that is of interest in clinical trials.

Lemma 1. $\mathbb{P}[p_k(\hat{a}(t) + \alpha(t)) > \theta] \le \delta$, $\forall\, p_k(a^*) \le \theta$.

Next we prove that with sufficient patients, the dose levels exceeding the toxicity threshold are excluded from the admissible set with high probability. This corresponds to the type II error event in clinical trials.

Lemma 2. If $t > t_0 = \left(\frac{\bar{C} K}{|\Delta - \epsilon|}\right)^{\gamma} \log\frac{K}{\delta}$, $\Delta = \min_{k \in \mathcal{K}} \Delta_k$, where $\Delta_k = |a^* - p_k^{-1}(\theta)|$ represents the gap between $a^*$ and the parameter when the toxicity is at $\theta$, then:

$$\mathbb{P}[p_k(\hat{a}(t) + \alpha(t)) \le \theta] \le \exp(-t\epsilon), \quad \forall\, p_k(a^*) > \theta. \tag{6}$$

Combining Lemmas 1 and 2 leads to the main result on cumulative efficacy regret.

Theorem 1. With $t_0$ defined in Lemma 2, the regret of SEEDA can be upper bounded as:

$$R(n) \le \sum_{d_k : p_k(a^*) \le \theta} \frac{c \log(n)}{q^* - q_k} + \left(n \delta Q + \frac{1}{2} t_0 + \frac{K - M}{\epsilon}\right), \tag{7}$$

where $Q = \max_{i \in \mathcal{K}} |q_i - q_{k^*}|$ denotes the maximal single-step regret, and $\epsilon > 0$ is a constant. Furthermore, if $\delta = O(1/n)$, we have that $R(n) \le O(\log n)$.

Theorem 1 indicates that the efficacy regret is bounded by $O(\log n)$. A closer look at this scaling reveals that it consists of two parts. The first is due to the structureless model for efficacy: we impose no assumption on the efficacy of different dose levels. The second part, which is reflected through $t_0$, is determined by the structured model for toxicity, which affects the admissible set. As will be shown in Section 4, with the increase-then-plateau efficacy assumption, the first $\log n$ component can be further improved.

3.2.2. Safety constraint violation

We now move on to analyzing the safety constraint violation. The first result verifies whether the SEEDA algorithm indeed satisfies the safety constraint in problem (2).

Theorem 2. For any given $n$, the average toxicity observed from the SEEDA algorithm satisfies

$$\mathbb{P}\left[\frac{1}{n}\sum_{t=1}^{n} p_{I(t)} - \theta \le C \epsilon^{\gamma}\right] \ge 1 - \delta,$$

for an arbitrary $\epsilon > 0$. $C$ and $\gamma$ are problem-dependent parameters defined in Section A of the supplementary material.

The safety constraint in problem (2) is formulated based on the average toxicity exceeding the MTD threshold. In practice, we are often interested in minimizing the number of patients that have been exposed to unsafe dose levels, $\mathbb{E}\left[\sum_{k \in \mathcal{K},\, p_k > \theta} N_k(n)/n\right]$. Corollary 1 analyzes this metric.

Corollary 1. The number of unsafe dose allocations from SEEDA, i.e., allocations in which the selected dose level exceeds the MTD threshold, can be bounded as:

$$\mathbb{E}\left[\sum_{d_k : p_k > \theta} N_k(n)\right] \le t_0 + \frac{K - M}{\epsilon}.$$

Interestingly, Corollary 1 indicates that unsafe dose allocations in SEEDA are upper bounded by a constant, which is linear in the number of unsafe doses $K - M$ regardless of the number of trial participants $n$.

3.2.3. Recommendation accuracy

Finally, we analyze the recommendation accuracy of SEEDA at the end of the $n$-th dose allocation.

Corollary 2. The probability that SEEDA recommends the MTD satisfies:

$$\mathbb{P}\left[\hat{d}(n) = \arg\max_{d_k : p_k \le \theta} p_k\right] \ge 1 - \delta, \tag{8}$$

where $\delta = 2K \exp\left(-\left(\frac{\Delta_M}{C K}\right)^{\gamma} n\right)$.

Corollary 2 guarantees the finding of the MTD with high probability. The recommendation error rate decays exponentially with the number of trial participants, which is a nice property. It is worth noting that a lower bound on the minimal number of trial participants for a given accuracy requirement can be inferred from the upper bound of the recommendation error rate (8). This is a practically important result, as sample efficiency directly relates to the cost and ethical constraints of a trial. This is further illustrated in the numerical experiments in Section 5.1.3.
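To make the steps of Algorithm 1 concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes a bounded parameter range `a_range` for the search set $\mathcal{A}$, falls back to the lowest dose when the admissible set is empty (a case Algorithm 1 does not specify), and uses hypothetical values for `C_bar`, `gamma_bar`, `c`, and `delta`. The confidence width follows the symbols of Eqn. (3) as they appear in the text.

```python
import math
import random


def tox(dose, a):
    """Toxicity model (1): ((tanh(d) + 1) / 2) ** a."""
    return ((math.tanh(dose) + 1.0) / 2.0) ** a


def fit_a(dose, p_hat, a_min, a_max):
    """argmin over a in [a_min, a_max] of |p_k(a) - p_hat|.

    p_k(a) is monotone in a, so the clipped closed-form inverse is the minimizer.
    """
    if p_hat <= 0.0:
        return a_max
    base = (math.tanh(dose) + 1.0) / 2.0
    return min(a_max, max(a_min, math.log(p_hat) / math.log(base)))


def seeda(doses, p_true, q_true, theta, n, delta=0.05, c=2.0,
          C_bar=0.1, gamma_bar=0.5, a_range=(0.1, 5.0), seed=0):
    rng = random.Random(seed)
    K = len(doses)
    assert n > K, "need at least one patient after the initial round"
    N = [0] * K
    p_hat = [0.0] * K
    q_hat = [0.0] * K
    a_hat_k = [0.0] * K
    alloc = []

    def treat(k):
        # Draw Bernoulli outcomes and update the running means for dose k.
        x = 1 if rng.random() < q_true[k] else 0
        y = 1 if rng.random() < p_true[k] else 0
        q_hat[k] = (q_hat[k] * N[k] + x) / (N[k] + 1)
        p_hat[k] = (p_hat[k] * N[k] + y) / (N[k] + 1)
        N[k] += 1
        a_hat_k[k] = fit_a(doses[k], p_hat[k], *a_range)
        alloc.append(k)

    for k in range(K):              # sample each dose once
        treat(k)

    a_hat = 0.0
    for t in range(K + 1, n + 1):
        # Weighted parameter estimate with w_k(t-1) = N_k(t-1) / (t-1).
        a_hat = sum(N[k] / (t - 1) * a_hat_k[k] for k in range(K))
        alpha = C_bar * K * (math.log(K / delta) / t) ** gamma_bar
        D = [k for k in range(K) if tox(doses[k], a_hat + alpha) <= theta]
        if not D:
            D = [0]                 # assumed fallback, not part of Algorithm 1
        chosen = max(D, key=lambda k: q_hat[k] + math.sqrt(c * math.log(t) / N[k]))
        treat(chosen)

    # Recommend the most toxic dose that the final estimate still deems safe.
    safe = [k for k in range(K) if tox(doses[k], a_hat) <= theta]
    rec = max(safe, key=lambda k: tox(doses[k], a_hat)) if safe else None
    return rec, alloc


# Hypothetical scenario (assumed for illustration only).
doses = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0]
p_true = [tox(d, 1.0) for d in doses]
q_true = [0.2, 0.4, 0.6, 0.6, 0.6, 0.6]
rec, alloc = seeda(doses, p_true, q_true, theta=0.3, n=300)
```

The sketch returns the recommended dose index together with the full allocation sequence, from which the allocation percentages or the fraction of unsafe allocations can be tallied.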

4. Extension to the increase-then-plateau efficacy model

The proposed SEEDA dose allocation policy is general in the sense that no efficacy model is assumed. In practice, however, efficacy often exhibits certain structure which, if utilized correctly, may further improve the performance. For conventional cytotoxic agents, efficacy monotonically increases with dose levels. The same is not true for MTAs, for which the dose-efficacy curve increases initially and then plateaus after reaching the level of saturation (Zang et al., 2014; Riviere et al., 2018). In this section, we modify the SEEDA algorithm to handle the increase-then-plateau efficacy model, and analyze its performance.

Formally, we introduce the following increase-then-plateau efficacy assumption, which holds for MTA.

Assumption 1. $q_k$, $k \in \mathcal{K}$, satisfies $q_1 \le q_2 \le q_3 \le \cdots \le q_N = q_{N+1} = \cdots = q_K$.

The SEEDA-Plateau algorithm is given in Algorithm 2. With Assumption 1, the efficacy has an inherent non-decreasing structure. The key idea is to combine the selection rule of OSUB in (Combes & Proutière, 2014) and reform step 4 in Algorithm 1. Note that step 4 calculates $L(t)$ as the estimated dose level with the optimal efficacy and safe toxicity at $t$. Algorithm 2 not only selects this dose level frequently enough, but also keeps exploring its neighboring dose levels.

Algorithm 2

The SEEDA-Plateau Algorithm

Input: $p_k(a)$ for each $k \in \mathcal{K}$; MTD threshold $\theta$; total number of patients $n$.
Initialize: $N_k(1) = 0$, $\hat{p}_k(1) = 0$, $\hat{q}_k(1) = 0$, $\forall k \in \mathcal{K}$; $L(1) = K$; $\eta = 2$; $l_k = 0$, $\forall k \in \mathcal{K}$;
Sample each dose once and set: $I(t) = t$, $\hat{q}_{I(t)}(K) = X_t$, $\hat{p}_{I(t)}(K) = Y_t$, $N_{I(t)}(K) = 1$, for $t = 1$ to $K$; $t = K + 1$.
while $t \le n$ do
    Compute the estimated parameter: $\hat{a}(t) = \sum_{k=1}^{K} w_k(t-1)\, \hat{a}_k(t-1)$;
    Set the admissible set: $\mathcal{D}(t) = \{d_k \in \mathcal{D} : p_k(\hat{a}(t) + \alpha(t)) \le \theta\}$;
    Set $L(t) = \arg\max_{d_k \in \mathcal{D}(t)} \hat{q}_k(t)$ and increase $l_{L(t)}$ by 1;
    If $\frac{l_{L(t)} - 1}{\eta + 1} \in \mathbb{N}$, $I(t) = L(t)$; otherwise $I(t) = \arg\max_{\{L(t)-1,\, L(t),\, L(t)+1\} \cap \mathcal{D}(t)} F(\hat{q}_k(t), N_k(t), t)$;
    Observe the revealed outcomes $X_t$ and $Y_t$;
    Update estimations: $\hat{q}_{I(t)}(t) = \frac{\hat{q}_{I(t)}(t-1)\, N_{I(t)}(t-1) + X_t}{N_{I(t)}(t-1) + 1}$, $\hat{p}_{I(t)}(t) = \frac{\hat{p}_{I(t)}(t-1)\, N_{I(t)}(t-1) + Y_t}{N_{I(t)}(t-1) + 1}$, $N_{I(t)}(t) = N_{I(t)}(t-1) + 1$;
    Update parameter estimation: $\hat{a}_{I(t)}(t) = \arg\min |p_{I(t)}(a) - \hat{p}_{I(t)}(t)|$;
    Update weights: $w_k(t) = N_k(t)/t$, $\forall d_k \in \mathcal{D}$; $t = t + 1$.
end while
Estimate the turning point of efficacy as:
    $L_1(n) = \min_{k : d_k \in \mathcal{D}(n)} \left\{ m \ge k : |\hat{q}_m(n) - \hat{q}_{m+1}(n)| \le \sqrt{\frac{c \log(n)}{N_m(n)}} + \sqrt{\frac{c \log(n)}{N_{m+1}(n)}},\ \hat{q}_m(n) \le \hat{q}_{m+1}(n) \right\}$,
    $L_2(n) = \arg\max_{d_k : p_k(\hat{a}(n)) \le \theta} p_k(\hat{a}(n))$.
Output: $\hat{d}(n) = \min\{L_1(n), L_2(n)\}$.

We now analyze the regret of SEEDA-Plateau and present the result in Theorem 3. Compared to Theorem 1 for SEEDA without the increase-then-plateau efficacy model, one can see that the first $\log(n)$ coefficient improves from $c \sum_{d_k : p_k(a^*) \le \theta} (q^* - q_k)^{-1}$ to $c\, (q^* - q_{N-1})^{-1}$. This gain comes precisely from the increase-then-plateau efficacy model, as exploiting the unimodal structure leads to $\log(n)$ regret only from the neighboring arm.

Theorem 3. The regret of SEEDA-Plateau satisfies:

$$R(n) \le \frac{c \log(n)}{q^* - q_{N-1}} + O(\log\log(n)) + \left(n \delta Q + t_0 + \frac{K - M}{\epsilon}\right). \tag{9}$$

Furthermore, if $\delta = O(1/n)$, we have that $R(n) \le O(\log n)$.

The optimal dose level we have defined before can be rewritten as $k^* = \min\{M, N\}$. The recommendation accuracy of SEEDA-Plateau is given in Theorem 4.

Theorem 4. With c set as < c < , the probability that SEEDA-Plateau fails to recommend the optimal dose can be bounded as:

P[ $\hat{d}(n) \ne k^*$ ] ≤ n c + δ. (10)

Compared to Corollary 2, the error probability of SEEDA-Plateau is increased by n c. This is due to the ambiguity between the efficacy-optimal dose and the toxicity-optimal one, which leads to the two candidate doses $L_1(n)$ and $L_2(n)$. In practice, however, this ambiguity can be eliminated via preliminary experiments.
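The dose selection step that distinguishes SEEDA-Plateau from SEEDA can be sketched as follows. The sketch assumes the leader test is the OSUB-style condition $(l_{L(t)} - 1)/(\eta + 1) \in \mathbb{N}$ with 0 counted as a natural number; the function name `plateau_select` and the numeric inputs are hypothetical, and the admissible set is passed in rather than computed.

```python
import math


def plateau_select(q_hat, N, t, admissible, leader_counts, eta=2, c=2.0):
    """One selection step: play the leader periodically, else explore its neighbors.

    leader_counts[k] tracks how many times dose k has been the leader and is
    updated in place.
    """
    leader = max(admissible, key=lambda k: q_hat[k])
    leader_counts[leader] += 1
    if (leader_counts[leader] - 1) % (eta + 1) == 0:
        return leader
    # UCB-1 index (4), restricted to the leader and its admissible neighbors.
    neighbors = [k for k in (leader - 1, leader, leader + 1) if k in admissible]
    return max(neighbors,
               key=lambda k: q_hat[k] + math.sqrt(c * math.log(t) / N[k]))


# Hypothetical state after some allocations (assumed for illustration only).
q_hat = [0.2, 0.5, 0.6, 0.55]
N = [10, 10, 10, 2]
counts = [0, 0, 0, 0]
picks = [plateau_select(q_hat, N, 40, [0, 1, 2, 3], counts) for _ in range(4)]
```

With these inputs the leader is dose index 2, it is played on every third time it leads, and the sparsely sampled neighbor (index 3) wins the UCB comparison in between. Restricting exploration to the leader's neighbors is what removes the sum over all safe doses from the leading regret term.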

5. Experiments

To investigate the operational characteristics and evaluate the performance of the proposed adaptive designs, we present an experimental study with $K = 6$ dose levels and $n = 300$ trial cohorts, each consisting of 3 patients. The estimation is updated after observing all individual outcomes from a cohort. All experiment results are obtained with 1000 trial repetitions. The MTD threshold is set as θ = 0 . .

The trial setup is the same as (Riviere et al., 2018) and (Zang et al., 2014), and we have simulated eight different efficacy and toxicity scenarios. Due to the space limitation, we only report the results of the first scenario, where efficacy reaches its maximal value (the optimal dose) before toxicity hits the MTD threshold. Additional results for this setting as well as the other seven scenarios are reported in Sections K to M in the supplementary material.

The following baseline designs are used for comparison (whenever appropriate), whose details can be found in the supplementary material: 3+3, CRM, MCRM, Independent TS, KL-UCB, UCB-1, and multi-objective bandits. Note that MTA-RA and other TS variants in (Riviere et al., 2018) are not included because they assume a different truncated efficacy model, which needs to be perfectly known to the algorithm. For algorithms that require prior information of toxicity and efficacy, they are set as [0 . , . , . , . , . , . and [0 . , . , . , . , . , . , respectively.

5.1.1. Recommendation and allocation accuracy

We report the allocation and recommendation percentages of each dose for all considered designs in Table 2. Dose 3 (in bold font) is the optimal biological dose for this scenario. However, we comment that dose 4 also satisfies the optimality condition without violating the safety constraint. Nevertheless, it has a higher toxicity probability (although still below MTD) without increasing efficacy, and is thus less preferable to dose 3. We note that for all the considered designs, the recommendation rule is $\hat{d}(n) = \arg\max_{k : \hat{p}_k(n) \le \theta} \hat{q}_k(n)$, where $\hat{q}_k(n)$ and $\hat{p}_k(n)$ are the final estimations of efficacy and toxicity for dose level $d_k$, respectively. This means that the safety constraint is considered in recommendation.

We can see from the results that SEEDA almost equally recommends dose 3 and 4 with a total probability of 94.6%. This is because the algorithm cares about maximizing efficacy without violating the safety constraint, and both dose 3 and 4 satisfy such conditions. As a result, SEEDA treats both equally as the optimal solution. However, by leveraging the increase-then-plateau model assumption, SEEDA-Plateau can further break the "tie" between dose 3 and 4, and correctly recognize that dose 3 is the optimal biological dose: it chooses dose 3 at 86.6% while dose 4 only 10.4%. We see that the gain of SEEDA-Plateau is significant over all the other designs (even compared to SEEDA). For a more detailed understanding of the recommendation accuracy, the corresponding type I and type II error rates (definitions are given in Section J in the supplementary material) are plotted in Fig. 1, and we observe that both SEEDA and SEEDA-Plateau outperform other baseline methods over the range of cohorts.

Figure 1: Type I (left) and type II (right) error rates as a function of number of cohorts. Curves: SEEDA, SEEDA-Plateau, Indep TS, KL-UCB, UCB, Multi obj.

As for allocation, we observe that both SEEDA and SEEDA-Plateau concentrate at dose 3 and 4, while spending very little budget on both tail ends of the dosage. In particular, SEEDA-Plateau allocates the smallest percentage (1%) of patients to the most toxic dose 6 among all designs.

We remark that although no real-world trial data is utilized in this simulated experiment, this approach is commonly accepted in clinical trials as the first-step study for a new methodology; see (Whitehead et al., 2012; Yap et al., 2013; Zang et al., 2014; Riviere et al., 2018).

5.1.2. Convergence and safety violation

Figure 2: Comparison of efficacy per patient (left) and the safety violation percentage (right), both as a function of number of cohorts. Curves: SEEDA, SEEDA-Plateau, Indep TS, KL-UCB, UCB, 3+3, CRM, MCRM, Multi obj.

To have a deeper understanding of the tradeoff between efficacy and toxicity, we plot side-by-side the convergence of efficacy and toxicity as $t$ increases in Fig. 2. KL-UCB, UCB and Independent TS have good convergence but suffer from significant safety violation in the process since they do not consider the safety constraint during exploration. CRM has higher efficacy at the cost of bad safety constraint violation, while 3+3 performs poorly in efficacy but has the lowest safety violation probability; this behavior is similarly observed for multi-objective bandits. The SEEDA(-Plateau) algorithm, in comparison, converges to the optimal efficacy at a slower rate, but the exploration process is carefully controlled so that the safety violation is minimized, which is evident from the right subplot of Fig. 2.

Table 2: Recommendation & allocation percentages of different designs. Column groups: Recommended, Allocated; toxicity probabilities 0.01, 0.05 (remaining entries not recoverable).

5.1.3. Sample efficiency

Sample efficiency is measured by the minimum number of trial participants to achieve a pre-specified recommendation accuracy (also known as early stopping (Montori et al., 2005)). We start the trial with a minimum of 6 patients, and continue recruiting patients until the stopping condition is triggered. Fig. 3 plots the average minimum number of patients to achieve a given recommendation accuracy for different algorithms. We see that SEEDA-Plateau outperforms all other algorithms by a large margin, thanks to the "double dipping" of the model assumptions, which gives the most accurate estimation of the optimal dose. In comparison, SEEDA performs similarly to the baseline algorithms. The reason is that the goal of SEEDA is to recommend the efficacy-maximal dose that satisfies the safety constraint. In this particular setting, both dose 3 and 4 satisfy this condition, and SEEDA does not have the mechanism to further minimize toxicity. This leads to a recommendation error that is similar to other baseline designs.

Figure 3: The minimum number of trial participants to achieve a given recommendation accuracy. Axes: Number of Patients versus Recommendation Accuracy. Curves: SEEDA, SEEDA-Plateau, Indep TS, KL-UCB, UCB, Multi-obj.

The sample efficiency advantage of SEEDA-Plateau is of critical importance in practice, as the significant cost associated with clinical trials is mostly proportional to the number of trial participants. Furthermore, reducing the number of patients while achieving the same level of accuracy reduces the safety and ethical concerns in the trial, which is another important consideration.

We evaluate the SEEDA algorithms on two real-world datasets, neurodeg and IBScovars, based on (Biesheuvel & Hothorn, 2002). We first extract the dose and resp variables from the observations reported in each dataset. We then fit these samples to a commonly used Emax dose-response model, as in (Bornkamp et al., 2011), with an R […]

neurodeg: resp = 169.94 + (12.[…] · dose) / ([…].85 + dose),

IBScovars: resp = 0.26 + (0.[…] · dose) / ([…].01 + dose).

As for the toxicity event, since it is not reported in the datasets, we resort to simulations with model (1).

The allocation and recommendation percentages of each dose for all the algorithms are shown in Table 3 and Table 4 for the two datasets. As in the synthetic experiment, SEEDA and SEEDA-Plateau recommend the correct doses the majority of the time, while the suboptimal recommendations are mostly safe, in that the doses immediately below the MTD are recommended second most often. The same is true for allocation.

Table 3: Recommendation & allocation percentages of the neurodeg dataset. In each cell the first row reports the mean value over 1000 repetitions, and the second row reports the (standard deviation).

[Table 3 body omitted; surviving header: Recommended, Allocated; toxicity: 0.01, 0.08, …]
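To make the shape of the Emax dose-response model concrete, here is a minimal Python sketch of the curve resp = E0 + Emax · dose / (ED50 + dose). The function name `emax_response` and the parameter values are hypothetical and chosen only for illustration; they are not the coefficients fitted to neurodeg or IBScovars.

```python
def emax_response(dose: float, e0: float, emax: float, ed50: float) -> float:
    """Emax dose-response curve: e0 + emax * dose / (ed50 + dose)."""
    if dose < 0:
        raise ValueError("dose must be non-negative")
    if ed50 <= 0:
        raise ValueError("ed50 must be positive")
    return e0 + emax * dose / (ed50 + dose)


# Hypothetical parameter values, used only to illustrate the curve's shape.
E0, EMAX, ED50 = 1.0, 2.0, 0.5
doses = [0.0, 0.5, 1.0, 2.0, 4.0]
responses = [emax_response(d, E0, EMAX, ED50) for d in doses]
```

The response starts at E0, reaches E0 + Emax/2 at dose = ED50, and approaches E0 + Emax from below as the dose grows, which resembles the increase-then-plateau behavior discussed for MTAs.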

6. Related works

This work is concerned with adaptive phase I clinical trials, whose uptake in practice is starting to increase considerably. See (Bretz et al., 2017; Pallmann et al., 2018) for recent comprehensive surveys. The main motivation for using these adaptive designs is to learn as the trial progresses and to use this learning to deliver more efficient or more ethically appealing trials. Adaptive clinical trials with sequential patient recruitment are considered in (Atan et al., 2019), but that work does not address the subsequent dose allocation. The 3+3 and the CRM designs, or their variations, remain the de facto adaptive designs in practice for dose-finding studies (Petroni et al., 2017; Pallmann et al., 2018), although new methodologies that aim at better safety protection have also been proposed (Lee et al., 2017). In recent years, there has been growing interest in adaptive trial designs for MTAs because of their different dose-response relationships (Zang et al., 2014; Riviere et al., 2018), but these studies do not explicitly enforce the safety constraints during the trial; neither do they provide theoretical guarantees on the trial performance.

Multi-armed bandits have long been considered an important tool for learning in clinical trials, dating back to the earliest papers of (Thompson, 1933; Robbins, 1952). Developing bandit models and algorithms that better suit the specific requirements of adaptive clinical trials has attracted some attention in recent years. Villar et al. (Villar et al., 2015b; Villar & Rosenberger, 2018) adopted the (modified) forward-looking Gittins index rule for multi-arm clinical trials. The authors of (Wang et al., 2018) propose a regional bandit model that can be applied to learning the relationship between drug dosage and patient response. The sample complexity of the thresholding bandit is analyzed in (Garivier et al., 2017), which matches MTD identification. Furthermore, dose-finding clinical trials with heterogeneous groups are investigated in (Lee et al., 2020) from a MAB perspective. Probably the closest work to ours is (Aziz et al., 2019), which also considers both toxicity and efficacy. However, the safety constraint, which is an essential constraint of real-world phase I trials, has not been explicitly considered in these papers.

Table 4: Recommendation & allocation percentages of the IBScovars dataset. In each cell the first row reports the mean value over 1000 repetitions, and the second row reports the (standard deviation).

[Table 4 body omitted; surviving header: Recommended, Allocated; toxicity probabilities: 0.01, 0.10, …]

On the other hand, the problem of safe exploration has attracted a lot of attention recently, albeit often in control (Koller et al., 2018) and general reinforcement learning (Berkenkamp et al., 2017). The authors of (Sui et al., 2015) propose the SAFEOPT algorithm for safe exploration in Gaussian processes, and (Kazerouni et al., 2017) presents a variant of the linear UCB method for the contextual linear bandit problem. A different line of work (Maillard, 2013; Galichet et al., 2013) considers minimizing risk in MAB, but it is mostly cast in the mean-variance framework with respect to the reward distribution.

7. Conclusions

Learning in adaptive clinical trials faces several unique challenges that have not been well addressed, which may have contributed to their lack of adoption in actual clinical trials. In particular, the safety constraints resulting from ethical and societal considerations have been insufficiently researched, which has motivated us to develop the SEEDA algorithm that explicitly imposes safety constraints (in terms of toxicity) while also aiming for maximum patient response in a dose-finding study. Theoretical analysis of SEEDA is carried out, and the proposed algorithm is further extended to the increase-then-plateau efficacy model and shown to have smaller regret thanks to the unimodal structure. The performance advantages over state-of-the-art adaptive clinical trial designs are illustrated with experiments on both synthetic and real-world datasets.

8. Acknowledgements

CS acknowledges the funding support from Kneron, Inc. SSV thanks the funding received from the National Institute for Health Research Cambridge Biomedical Research Centre at the Cambridge University Hospitals NHS Foundation Trust and the UK Medical Research Council (grant number: MC_UU_00002/3). The research of MV has been supported by ONR and NSF 1524417 and 1722516.

References

Atan, O., Zame, W. R., and van der Schaar, M. Sequential patient recruitment and allocation for adaptive clinical trials. In Proceedings of The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1891–1900, Apr 2019.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, May 2002.

Aziz, M., Kaufmann, E., and Riviere, M.-K. On multi-armed bandit designs for phase I clinical trials. arXiv e-prints, art. arXiv:1903.07082, March 2019.

Berkenkamp, F., Turchetta, M., Schoellig, A. P., and Krause, A. Safe model-based reinforcement learning with stability guarantees. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 908–919, Long Beach, California, USA, December 2017.

Biesheuvel, E. and Hothorn, L. Many-to-one comparisons in stratified designs. Biometrical Journal, 44:101–116, 2002.

Bornkamp, B., Bretz, F., Dette, H., and Pinheiro, J. C. Response-adaptive dose-finding under model uncertainty. Annals of Applied Statistics, 5:1611–1631, 2011.

Bretz, F., Gallo, P., and Maurer, W. Adaptive designs: The Swiss army knife among clinical trial designs? Clinical Trials, 14(5):417–424, 2017.

Combes, R. and Proutière, A. Unimodal bandits: Regret lower bounds and optimal algorithms. In Proceedings of the 31st International Conference on Machine Learning, pp. 521–529, Beijing, China, June 2014.

Galichet, N., Sebag, M., and Teytaud, O. Exploration vs exploitation vs safety: risk-aware multi-armed bandits. In Proceedings of the 5th Asian Conference on Machine Learning, pp. 245–260, November 2013.

Garivier, A. and Cappè, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of Conference On Learning Theory (COLT), 2011.

Garivier, A., Ménard, P., and Rossi, L. Thresholding bandit for dose-ranging: The impact of monotonicity. arXiv e-prints, art. arXiv:1711.04454, November 2017.

Kazerouni, A., Ghavamzadeh, M., Abbasi, Y., and Van Roy, B. Conservative contextual linear bandits. In Proceedings of Advances in Neural Information Processing Systems, pp. 3910–3919, 2017.

Koller, T., Berkenkamp, F., Turchetta, M., and Krause, A. Learning-based model predictive control for safe exploration. In IEEE Conference on Decision and Control (CDC), pp. 6059–6066, December 2018.

Lee, H.-S., Shen, C., Jordon, J., and van der Schaar, M. Contextual constrained learning for dose-finding clinical trials. In Proceedings of The 23rd International Conference on Artificial Intelligence and Statistics, Aug. 2020.

Lee, S. M., Ursino, M., Cheung, Y. K., and Zohar, S. Dose-finding designs for cumulative toxicities using multiple constraints. Biostatistics, 20(1):17–29, Nov. 2017.

Maillard, O.-A. Robust risk-averse stochastic multi-armed bandits. In Proceedings of the 24th International Conference on Algorithmic Learning Theory, pp. 218–233, Singapore, 2013.

Montori, V. M. et al. Randomized trials stopped early for benefit: A systematic review. JAMA, 294(17):2203–2209, Nov. 2005.

Neuenschwander, B., Branson, M., and Gsponer, T. Critical aspects of the Bayesian approach to phase I cancer trials. Statistics in Medicine, 27(13):2420–2439, 2008.

O’Quigley, J., Pepe, M., and Fisher, L. Continual reassessment method: a practical design for phase 1 clinical trials in cancer. Biometrics, 46(1):33–48, 1990.

Pallmann, P. et al. Adaptive designs in clinical trials: why use them, and how to run and report them. BMC Medicine, 16(1):29, Feb 2018.

Paoletti, X. and Postel-Vinay, S. Phase I–II trial designs: how early should efficacy guide the dose recommendation process? Annals of Oncology, 29(3):540–541, Feb. 2018.

Petroni, G. R., Wages, N. A., Paux, G., and Dubois, F. Implementation of adaptive methods in early-phase clinical trials. Statistics in Medicine, 36(2):215–224, 2017.

Postel-Vinay, S. et al. Clinical benefit in phase-I trials of novel molecularly targeted agents: does dose matter? British Journal of Cancer, 100(9):1373–1378, May 2009.

Riviere, M.-K., Yuan, Y., Jourdan, J.-H., Dubois, F., and Zohar, S. Phase I/II dose-finding design for molecularly targeted agent: Plateau determination using adaptive randomization. Statistical Methods in Medical Research, 27(2):466–479, 2018.

Robbins, H. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58:527–535, 1952.

Roberts, T. G. et al. Trends in the risks and benefits to patients with cancer participating in phase 1 clinical trials. JAMA, 292(17):2130–2140, Nov. 2004.

Storer, B. E. Design and analysis of phase I clinical trials. Biometrics, 45:925–937, 1989.

Sui, Y., Gotovos, A., Burdick, J. W., and Krause, A. Safe exploration for optimization with Gaussian processes. In Proceedings of the 32nd International Conference on Machine Learning, pp. 997–1005, 2015.

Thiessen, B. et al. A phase I/II trial of GW572016 (lapatinib) in recurrent glioblastoma multiforme: clinical outcomes, pharmacokinetics and molecular correlation. Cancer Chemotherapy and Pharmacology, 65(2):353–361, Jan 2010.

Thompson, W. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, December 1933.

Tighiouart, M., Liu, Y., and Rogatko, A. Escalation with overdose control using time to toxicity for cancer phase I clinical trials. PLOS ONE, 9:1–13, 03 2014.

Villar, S. S. and Rosenberger, W. F. Covariate-adjusted response-adaptive randomization for multi-arm clinical trials using a modified forward looking Gittins index rule. Biometrics, 74(1):49–57, 2018.

Villar, S. S., Bowden, J., and Wason, J. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science, 30(2):199–215, May 2015a.

Villar, S. S., Wason, J., and Bowden, J. Response-adaptive randomization for multi-arm clinical trials using the forward looking Gittins index rule. Biometrics, 71(4):969–978, 2015b.

Wang, Z., Zhou, R., and Shen, C. Regional multi-armed bandits. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 510–518, Playa Blanca, Lanzarote, Canary Islands, Apr. 2018.

Whitehead, J. et al. A novel phase I/IIa design for early phase oncology studies and its application in the evaluation of MK-0752 in pancreatic cancer. Statistics in Medicine, 31(18):1931–1943, 2012.

Yahyaa, S. and Manderick, B. Thompson sampling for multi-objective multi-armed bandits problem. In Proceedings of European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 47–52, Bruges, Belgium, April 2015.

Yan, F., Thall, P. F., Lu, K. H., Gilbert, M. R., and Yuan, Y. Phase I–II clinical trial design: a state-of-the-art paradigm for dose finding. Annals of Oncology, 29(3):694–699, Dec. 2017.

Yap, C. et al. Implementation of adaptive dose-finding designs in two early phase haematological trials: clinical, operational, and methodological challenges. Trials, 14(1):O75, Nov 2013.

Yoshida, K. Emax Model Analysis with ’Stan’. Columbia University, New York, USA, 2019. URL https://cran.r-project.org/web/packages/rstanemax.

Zang, Y., Lee, J. J., and Yuan, Y. Adaptive designs for identifying optimal biological dose for molecularly targeted agents. Clinical Trials, 11(3):319–327, 2014.

Supplementary Material: Learning for Dose Allocation in Adaptive Clinical Trials with Safety Constraints

Cong Shen, Zhiyang Wang, Sofía S. Villar, Mihaela van der Schaar

A. Preliminaries

Before presenting the technical proofs, we introduce some notation and regularity assumptions on the dose-toxicity model, which can be verified to hold for Eqn. (1). For a general toxicity function $p_k(a)$ of an unknown parameter $a \in \mathcal{A}$, the following regularity conditions are imposed:

Assumption 2

1) Monotonicity: For each $k \in \mathcal{K}$ and $a, a' \in \mathcal{A}$, there exist $C_{1,k} > 0$ and $1 < \gamma_{1,k}$ such that

$$|p_k(a) - p_k(a')| \ge C_{1,k}\,|a - a'|^{\gamma_{1,k}}.$$

2) Hölder continuity: For each $k \in \mathcal{K}$ and $a, a' \in \mathcal{A}$, there exist $C_{2,k} > 0$ and $0 < \gamma_{2,k} \le 1$ such that

$$|p_k(a) - p_k(a')| \le C_{2,k}\,|a - a'|^{\gamma_{2,k}}.$$

We note that both the monotonicity and continuity assumptions are mild and standard in the literature; see (Wang et al., 2018). Proposition 1 follows immediately from Assumption 2.
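As a small illustration of why invertibility matters here, the following Python sketch implements the one-parameter toxicity model $p_k(a) = \left(\frac{\tanh(d_k)+1}{2}\right)^a$ used in the paper, together with its closed-form inverse in $a$. The function names and the example values of `d_k` and `a_true` are hypothetical and serve only to show the round trip.

```python
import math


def skeleton(d_k: float) -> float:
    """Base of the power model: (tanh(d_k) + 1) / 2, a value in (0, 1)."""
    return (math.tanh(d_k) + 1.0) / 2.0


def toxicity(d_k: float, a: float) -> float:
    """One-parameter toxicity model p_k(a) = ((tanh(d_k) + 1) / 2) ** a."""
    return skeleton(d_k) ** a


def toxicity_inverse(d_k: float, p: float) -> float:
    """Solve p_k(a) = p for a; well defined because p_k is strictly monotone in a."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p) / math.log(skeleton(d_k))


# Hypothetical dose label and parameter value, for illustration only.
d_k, a_true = -0.5, 1.3
p = toxicity(d_k, a_true)
a_recovered = toxicity_inverse(d_k, p)
```

Because the base lies strictly between 0 and 1, the modeled toxicity decreases strictly as $a$ grows, so a toxicity level such as the threshold $\theta$ maps back to a unique parameter value.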

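Proposition 1

For functions $p_k(a)$, $\forall k \in \mathcal{K}$, that satisfy Assumption 2, we have:

1) $p_k(a)$ is invertible;

2) For each $k \in \mathcal{K}$ and $d, d' \in \mathcal{P}$, we have

$$|p_k^{-1}(d) - p_k^{-1}(d')| \le \bar{C}_{1,k}\,|d - d'|^{\bar{\gamma}_{1,k}},$$

where $\bar{\gamma}_{1,k} = \frac{1}{\gamma_{1,k}}$ and $\bar{C}_{1,k} = \left(\frac{1}{C_{1,k}}\right)^{\frac{1}{\gamma_{1,k}}}$.

For ease of exposition, we denote $C_1 = \min_k C_{1,k}$, $C_2 = \max_k C_{2,k}$, $\gamma_1 = \max_k \gamma_{1,k}$, $\gamma_2 = \min_k \gamma_{2,k}$, $\bar{\gamma}_1 = 1/\gamma_1$, and $\bar{C}_1 = C_1^{-\bar{\gamma}_1}$.

B. Select Design Parameters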

The parameters that appear in Assumption 2 collectively determine the confidence interval in Eqn. (3). We take function (1) as an example to show how to select these parameters. We have

$$|p_k(a) - p_k(a')| \ge C_{1,k}\,|a - a'|^{\gamma_{1,k}},$$

$$\frac{|p_k(a) - p_k(a')|}{|a - a'|} \ge C_{1,k}\,|a - a'|^{\gamma_{1,k}-1},$$

$$\min_{a \in \mathcal{A}} p'_k(a) \ge C_{1,k}\,|\mathcal{A}|^{\gamma_{1,k}-1},$$

$$\log\left(\frac{\tanh(d_k) + 1}{2}\right) \ge C_{1,k}\,|\mathcal{A}|^{\gamma_{1,k}-1}.$$

Therefore, we can first set $\gamma_{1,k}$ to a chosen value and then find the corresponding $C_{1,k}$. Then, with the known function $p_k(a)$, the parameters can be approximately calculated.

C. Proof of Lemma 1

$$\begin{aligned}
P\left[\hat{a}(t) + \alpha(t) < p_i^{-1}(\theta)\right] &\le P\left[\hat{a}(t) + \alpha(t) < a^*\right] \\
&\le P\left[|a^* - \hat{a}(t)| > \alpha(t)\right] \\
&\le P\left[\sum_{k=1}^{K} w_k(t-1)\,\bar{C}_1\,|\hat{p}_k(t) - p_k(a^*)|^{\bar{\gamma}_1} > \alpha(t)\right] \\
&\le \sum_{k=1}^{K} P\left[|\hat{p}_k(t) - p_k(a^*)| > \left(\frac{\alpha(t)}{w_k(t-1)\,\bar{C}_1 K}\right)^{\gamma_1}\right] \\
&\le \sum_{k=1}^{K} 2\exp\left(-2 N_k(t)\left(\frac{\alpha(t)}{w_k(t)\,\bar{C}_1 K}\right)^{2\gamma_1}\right) \qquad (11) \\
&\le 2K\exp\left(-2\left(\frac{\alpha(t)}{\bar{C}_1 K}\right)^{2\gamma_1} t\right) = \delta. \qquad (12)
\end{aligned}$$

Inequality (11) follows from Hoeffding’s inequality, and (12) is derived from the definition $N_k(t) = t\,w_k(t)$ and Assumption 2 with $\gamma_1 > 1$.

D. Proof of Lemma 2

From Hoeffding’s inequality and Eqn. (6), we have:

$$\alpha(t) \le p_k^{-1}(\theta) - a^* - \epsilon = \Delta_k - \epsilon,$$

where $\Delta_k = |a^* - p_k^{-1}(\theta)|$ denotes the gap between the true value of the parameter $a$ and the parameter value at which the toxicity of dose level $d_k$ is exactly at the MTD threshold $\theta$. When $t > t_0$, and with the definition of $\alpha(t)$ in Eqn. (3), the lemma follows immediately.

E. Proof of Theorem 1

Depending on whether the optimal dose level is included in the admissible set or not, we can decompose the regret into two parts:

$$R(n) = \sum_{t=1}^{n} P[k^* \notin \mathcal{D}(t)]\,Q + P[k^* \in \mathcal{D}(t)]\,R_1(n) \le n\delta Q + R_1(n).$$

The probability of the first error event $\{k^* \notin \mathcal{D}(t)\}$ can be bounded by Lemma 1, which indicates that at each step $t$ the probability of a safe dose level being excluded from the admissible set is bounded by $\delta$. For the second part, $R_1(n)$ represents the regret when the optimal dose is included in the admissible set. In this case, the error event is due to the inaccuracy of parameter estimation at the beginning as well as the limited efficacy information provided by each sample. Using Lemma 2, we have:

$$\begin{aligned}
R_1(n) &\le t_0 + (K - M)\sum_{t=1}^{n} \exp(-2t\epsilon^2) + \sum_{t=t_0+1}^{n} \sum_{d_k:\, p_k \le \theta} \mathbb{1}\{I(t) = k\} \\
&\le t_0 + \frac{K - M}{2\epsilon^2} + \sum_{d_k:\, p_k \le \theta} \frac{c\log(n)}{q^* - q_k}.
\end{aligned}$$

Putting the regret from both error events together leads to (7), which completes the proof.
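The role of $\delta$ in this bound can be made concrete by solving a Hoeffding-type tail bound for the confidence width. The sketch below assumes the bound of Lemma 1 has the form $2K\exp\left(-2\left(\alpha/(\bar{C}_1 K)\right)^{2\gamma_1} t\right) = \delta$; the constants 2 and the exponent $2\gamma_1$ are our reading of that proof and should be checked against Eqn. (3) of the main paper. The function names and numerical values are hypothetical.

```python
import math


def tail_bound(alpha: float, t: int, K: int, c_bar: float, gamma1: float) -> float:
    """Assumed Hoeffding-type bound: 2K * exp(-2 * (alpha / (c_bar * K))**(2*gamma1) * t)."""
    return 2 * K * math.exp(-2.0 * (alpha / (c_bar * K)) ** (2 * gamma1) * t)


def confidence_width(t: int, K: int, delta: float, c_bar: float, gamma1: float) -> float:
    """Width alpha(t) that makes the assumed tail bound equal to delta."""
    if not 0.0 < delta < 2 * K:
        raise ValueError("delta must lie in (0, 2K) for a positive width")
    return c_bar * K * (math.log(2 * K / delta) / (2.0 * t)) ** (1.0 / (2 * gamma1))


# Hypothetical values: 6 dose levels, delta = 0.05, c_bar = 1, gamma1 = 1.5.
alpha_50 = confidence_width(t=50, K=6, delta=0.05, c_bar=1.0, gamma1=1.5)
alpha_500 = confidence_width(t=500, K=6, delta=0.05, c_bar=1.0, gamma1=1.5)
```

Under this assumed form the width shrinks as more patients are observed, which is what lets the admissible set grow toward the true set of safe doses while each step's exclusion probability stays at $\delta$.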

F. Proof of Theorem 2

First we note:

$$p_{I(t)}(a^*) - \theta \le p_{I(t)}(a^*) - \theta + \theta - p_{I(t)}(a^* - \alpha(t)) \le C_2\,|a^* - \hat{a}(t) + \alpha(t)|^{\gamma_2}.$$

Thus, the probability can be upper bounded as:

$$P\left[\hat{a}(t) - a^* > \alpha(t) + \epsilon\right] \le \exp\left(-2t(\alpha(t) + \epsilon)^2\right).$$

Reorganizing the terms, we finally have

$$P\left[\frac{1}{n}\sum_{t=1}^{n} p_{I(t)}(a^*) - \theta < C_2\,\epsilon^{\gamma_2}\right] \ge 1 - \exp\left(-2t(\alpha(t) + \epsilon)^2\right) \ge 1 - \delta.$$

G. Proof of Corollary 2

$$\begin{aligned}
P\left[|\hat{a}(n) - a^*| \ge \Delta_M\right] &\le \sum_{k=1}^{K} P\left[|\hat{p}_k(t) - p_k(a^*)| > \left(\frac{\Delta_M}{w_k(t)\,\bar{C}_1 K}\right)^{\gamma_1}\right] \\
&\le \sum_{k=1}^{K} 2\exp\left(-2 N_k(n)\left(\frac{\Delta_M}{w_k(t)\,\bar{C}_1 K}\right)^{2\gamma_1}\right) \\
&\le 2K\exp\left(-2\left(\frac{\Delta_M}{\bar{C}_1 K}\right)^{2\gamma_1} n\right).
\end{aligned}$$

H. Proof of Theorem 3

We first establish Lemma 3, whose proof directly follows Theorem C.1 in (Combes & Proutière, 2014).

Lemma 3: $E[l_k(n)] = O(\log(\log(n)))$ for each $k \ne k^*$.

Then, following proof steps similar to those of Theorem 1, we have the bound in (9).

I. Proof of Theorem 4

Since $k^* = \min\{M, N\}$, and $L_1(n)$ and $L_2(n)$ are the estimates of $N$ and $M$ respectively, we have $\{\hat{d}_r(n) \ne k^*\} \subseteq E_1 \cup E_2$, where $E_1 = \{L_1(n) \ne N\}$ and $E_2 = \{L_2(n) \ne M\}$. The latter can be bounded by Corollary 2. With the notation $\beta_k(n) = \sqrt{\frac{c\log(n)}{N_k(n)}}$, the probability of $E_1$ can be bounded as follows:

$$\begin{aligned}
P[L_1(n) < N] &\le P\left[|\hat{q}_N(n) - \hat{q}_{N-1}(n)| \le \beta_{N-1}(n) + \beta_N(n)\right] \\
&\le P\left[\hat{q}_{N-1}(n) - q_{N-1} + q_N - \hat{q}_N(n) \ge q_N - q_{N-1} - \beta_{N-1}(n) - \beta_N(n)\right] \\
&\le 2\exp\left(-2 N_{N-1}(n)\left(\frac{q_N - q_{N-1} - \beta_{N-1}(n) - \beta_N(n)}{2}\right)^2\right) \\
&\le 2\exp\left(-2 f(N-1)\log(n)\left(\frac{\Delta_{N-1,N} - \beta_{N-1}(n) - \beta_N(n)}{2}\right)^2\right) = o\left(n^{-1}\right).
\end{aligned}$$

Furthermore,

$$\begin{aligned}
P[L_1(n) > N] &\le P\left[|\hat{q}_N(n) - \hat{q}_{N+1}(n)| > \beta_N(n) + \beta_{N+1}(n)\right] \\
&\le P\left[|\hat{q}_N(n) - q_N| + |q_{N+1} - \hat{q}_{N+1}(n)| > \beta_N(n) + \beta_{N+1}(n)\right] \\
&\le P\left[|\hat{q}_N(n) - q_N| > \beta_N(n)\right] + P\left[|\hat{q}_{N+1}(n) - q_{N+1}| > \beta_{N+1}(n)\right] \\
&\le \frac{4}{n^{2c}}.
\end{aligned}$$

Lastly, $f(N-1)$ is the coefficient of the lower bound of $N_{N-1}(n)$, and can be written as (see Theorem 4.1 in (Combes & Proutière, 2014))

$$f(N-1) = \frac{1}{I(q_{N-1}, q_N)}.$$

This completes the proof.
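To give a feel for the coefficient $f(N-1)$, the sketch below evaluates $1/I(q_{N-1}, q_N)$ under the assumption that $I(\cdot,\cdot)$ denotes the Kullback-Leibler divergence between two Bernoulli distributions, as is common in the unimodal bandit literature. This reading is not stated in the text above, and the function names and efficacy values are hypothetical.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p and q in (0, 1)."""
    if not (0.0 < p < 1.0 and 0.0 < q < 1.0):
        raise ValueError("p and q must lie strictly between 0 and 1")
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def f_coefficient(q_prev: float, q_plateau: float) -> float:
    """Assumed form f(N-1) = 1 / I(q_{N-1}, q_N), with I taken as Bernoulli KL."""
    return 1.0 / bernoulli_kl(q_prev, q_plateau)


# Hypothetical efficacies just below the plateau and on the plateau.
f_close = f_coefficient(0.4, 0.5)
f_far = f_coefficient(0.2, 0.5)
```

Under this assumption, the closer the efficacy of dose $N-1$ is to the plateau value, the larger the coefficient becomes.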

J. Baseline designs in the experiments

The following baseline designs are used for comparison with SEEDA and SEEDA-Plateau in the experiments.

• KL-UCB (Garivier & Cappè, 2011): This approach ignores the safety constraint and focuses entirely on efficacy during allocation: for each patient, it allocates the dose level with the highest efficacy index, where the efficacy of each dose level is characterized by the KL-UCB index. At the end of the experiment, however, a dose level is recommended according to $\hat{d}(n) = \arg\max_{k:\,\hat{p}_k(n) \le \theta} \hat{q}_k(n)$, where $\hat{q}_k(n)$ and $\hat{p}_k(n)$ are the final empirical estimates of the efficacy and toxicity of dose level $d_k$, respectively. The safety constraint is thus considered in the recommendation. Accordingly, type I and type II errors are defined as:

$$e_1 = \sum_{k \in \mathcal{K}} \mathbb{1}\{p_k \le \theta\}\,\mathbb{1}\{\hat{p}_k(n) > \theta\}, \qquad e_2 = \sum_{k \in \mathcal{K}} \mathbb{1}\{p_k > \theta\}\,\mathbb{1}\{\hat{p}_k(n) \le \theta\}.$$

• UCB-1 (Auer et al., 2002): The allocation and recommendation rules are similar to those of KL-UCB above, with the only difference being that the dose level with the highest UCB-1 index of efficacy is allocated to the patient.

• Independent Thompson Sampling (TS) (Thompson, 1933; Aziz et al., 2019): Toxicity and efficacy are estimated with Bayesian indices:

$$\tilde{p}_k(t) \sim \mathrm{Beta}\left(S_k^p(t) + 1,\; N_k(t) - S_k^p(t) + 1\right), \qquad \tilde{q}_k(t) \sim \mathrm{Beta}\left(S_k^q(t) + 1,\; N_k(t) - S_k^q(t) + 1\right),$$

where $S_k^p(t)$ counts the number of toxic outcomes of dose level $k$ among the first $t$ patients and $S_k^q(t)$ counts the number of effective responses. The dose with the maximum $\tilde{q}_k(t)$ is allocated to the $t$-th patient, and $\hat{d}(n) = \arg\max_{k:\,\tilde{p}_k(n) \le \theta} \tilde{q}_k(n)$ is recommended. The definitions of type I and type II errors are slightly modified to:

$$e_1 = \sum_{k \in \mathcal{K}} \mathbb{1}\{p_k \le \theta\}\,\mathbb{1}\{\tilde{p}_k(n) > \theta\}, \qquad e_2 = \sum_{k \in \mathcal{K}} \mathbb{1}\{p_k > \theta\}\,\mathbb{1}\{\tilde{p}_k(n) \le \theta\}.$$

• CRM (O’Quigley et al., 1990): We employ the CRM algorithm with the same one-parameter toxicity model as in our paper:

$$p_k(a) = \left(\frac{\tanh(d_k) + 1}{2}\right)^a.$$

We choose a typical prior distribution, a ∼ exp(0.[…]). Therefore, $d_k$ can be solved from the prior toxicity and the prior mean of $a$. $\pi_t(a)$ denotes the posterior distribution of $a$ after observing the outcomes of the first $t$ patients. The allocation rule is a greedy one:

$$I_t^{\mathrm{CRM}} = \arg\min_{k \in \mathcal{K}} |\theta - p_k(\hat{a}(t))|, \qquad \hat{a}(t) = \int_0^{\infty} a \, d\pi_t(a),$$

where $\hat{a}(t)$ is the posterior mean. With this estimate, the final recommendation rule can be written as $\hat{d}(n) = \arg\min_{k \in \mathcal{K}} |\theta - p_k(\hat{a}(n))|$.

• 3+3 (Storer, 1989): The lowest dose is first given to 3 patients. If none reports a toxic outcome, the next lowest dose level is given to the next 3 patients. If fewer than 2 among these 6 patients report a toxic outcome, the next lowest dose level is given to the next 3 patients; otherwise the experiment is stopped and the dose level used before stopping is recommended as the MTD.

• MCRM (Neuenschwander et al., 2008): This algorithm classifies the probability of toxicity into four categories. For our simulated setting, the categories are set as:

Under-dosing: $\pi_a(d) \in$ (0, 0.[…]]
Targeted toxicity: $\pi_a(d) \in$ (0.[…], 0.[…]]
Excessive toxicity: $\pi_a(d) \in$ (0.[…], 0.[…]]
Unacceptable toxicity: $\pi_a(d) \in$ (0.[…], […]]

The recommendation and allocation rules are to maximize the probability of targeted toxicity while controlling the probability of excessive or unacceptable toxicity at $P_{thre} = 25\%$. Based on the posterior distribution of the toxicity, the probability that the toxicity falls in each of the above four categories can be calculated. The probability that it falls in the Targeted category is denoted as $P_i^t$, and the probability that it falls in the Excessive or Unacceptable categories as $P_i^e$. The selection rule is therefore $I_t = \arg\max_{P_i^e \le P_{thre}} P_i^t$.

• Multi-objective Bandits (Yahyaa & Manderick, 2015): We implement the Pareto Thompson Sampling algorithm of (Yahyaa & Manderick, 2015) in our experiments. Specifically, after obtaining the estimates of toxicity and efficacy of each dose from running the Independent TS design, the algorithm computes the Pareto optimal dose level set $I^*$, which means $\forall i \in I^*, \forall j \notin I^*$, $\tilde{p}_i(t) \le \tilde{p}_j(t)$ or $\tilde{q}_i(t) \ge \tilde{q}_j(t)$.

Other policies designed for MTA, such as MTA-RA, depend on a different truncated two-parameter logistic efficacy model (Riviere et al., 2018). In our setting, the exact efficacy model is assumed to be unknown; we only make the increase-then-plateau assumption.

K. Additional experiment results under the same setting as in Section 5
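For reference when reading the comparisons, the allocation and recommendation rules of the Independent Thompson Sampling baseline described in Section J, together with its type I and type II error counts, can be sketched in Python as follows. This is a minimal sketch rather than the code used in the experiments: the function names, the 0-based dose indexing, and the choice to return `None` when no dose is estimated to be safe are assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple


def ts_indices(successes: List[int], pulls: List[int], rng: random.Random) -> List[float]:
    """Draw one Beta(S + 1, N - S + 1) sample per dose level."""
    return [rng.betavariate(s + 1, n - s + 1) for s, n in zip(successes, pulls)]


def allocate(eff_samples: List[float]) -> int:
    """Allocate the dose with the largest sampled efficacy (safety is ignored here)."""
    return max(range(len(eff_samples)), key=lambda k: eff_samples[k])


def recommend(tox_est: List[float], eff_est: List[float], theta: float) -> Optional[int]:
    """Pick the highest-efficacy dose among those with estimated toxicity <= theta."""
    safe = [k for k, p in enumerate(tox_est) if p <= theta]
    if not safe:
        return None  # assumption: no recommendation if no dose looks safe
    return max(safe, key=lambda k: eff_est[k])


def type_errors(true_tox: List[float], tox_est: List[float], theta: float) -> Tuple[int, int]:
    """e1 counts safe doses estimated unsafe; e2 counts unsafe doses estimated safe."""
    e1 = sum(1 for p, ph in zip(true_tox, tox_est) if p <= theta and ph > theta)
    e2 = sum(1 for p, ph in zip(true_tox, tox_est) if p > theta and ph <= theta)
    return e1, e2
```

Replacing the Beta samples with empirical means and an upper confidence index would give the analogous KL-UCB and UCB-1 baselines, which share the same safety-filtered recommendation step.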

Due to space limitations, we were not able to include all the experiment results for the setting in Section 5. These additional results are provided here.

In particular, Table 2 only reports the recommendation and allocation percentages for a given n = 100. It is of interest to see how these metrics change with n. We plot the mean allocation and recommendation probabilities as a function of n in Fig. 4. It can be seen that SEEDA-Plateau outperforms all other methods across a large range of n.

L. Experiment of a new setting and its comprehensive results

In the main paper, a setting in which the efficacy reaches its maximal value (the optimal dose) before the toxicity hits the MTD threshold is used. A different setting can be considered, in which the maximum-efficacy dose exceeds the MTD threshold. The experiment results for this setting (called “setting 2”) are reported in this section. Unless otherwise stated, the parameters are the same as in Section 5 of the main paper.

[Figure 4 plots: mean allocation probabilities (%) and mean recommendation probabilities (%) versus number of cohorts, for SEEDA, SEEDA-Plateau, Indep TS, KLUCB, UCB, 3+3, CRM, MCRM, and Multi obj.]

Figure 4: Mean allocation (left) and recommendation (right) probabilities versus number of patients n.

Table 5 presents the setting as well as the allocation and recommendation percentages of each dose for all considered algorithms. For this scenario, dose level 3 is the optimal one. We note that a large portion of the previous conclusions in the main paper still hold. However, the gain of SEEDA-Plateau over SEEDA is less significant, although SEEDA-Plateau still outperforms all the compared designs. The corresponding Type I and Type II error rates are similarly plotted in Fig. 5.

Table 5: Recommendation & allocation percentages of different designs for setting 2.

[Table 5 body omitted; surviving header: Recommended, Allocated; toxicity probabilities: 0.1, 0.2, …]

[Figure 5 plots: Type I error rate (%) and Type II error rate (%) versus number of cohorts, for SEEDA, SEEDA-Plateau, Indep TS, KLUCB, UCB, and Multi obj.]

Figure 5: Type I and type II error rates in setting 2.

An in-depth look at the mean allocation and recommendation probabilities versus the number of patients n for this new setting is given in Fig. 6. The same observation as in Section K holds.

[Figure 6 plots: mean allocation probabilities (%) and mean recommendation probabilities (%) versus number of cohorts, for SEEDA, SEEDA-Plateau, Indep TS, KLUCB, UCB, 3+3, CRM, MCRM, and Multi obj.]

Figure 6: Mean allocation (left) and recommendation (right) probabilities versus number of patients n in setting 2.

[Figure 7 plots: efficacy per patient and safety violation percentage (%) versus number of cohorts, for SEEDA, SEEDA-Plateau, Indep TS, KL-UCB, UCB, 3+3, CRM, MCRM, and Multi obj.]

Figure 7: Comparison of efficacy per patient and the safety violation percentage in setting 2.

The convergence of efficacy and toxicity as t increases for setting 2 is plotted in Fig. 7. There is a notable difference from the previous result in Fig. 2: SEEDA and SEEDA-Plateau now converge to a different (but correct) dose than the other considered designs, which only emphasize maximum efficacy. With such an aggressive pursuit of efficacy, those designs obtain a better treatment effect than SEEDA(-Plateau), but at the significant cost of frequent violation of the safety constraint: as opposed to a safety violation percentage hovering between […] and […] in Fig. 2, we now face a violation in the range of […] to […], as shown in Fig. 7.

[Figure 8 plot: number of patients versus recommendation accuracy for SEEDA, SEEDA-Plateau, Indep TS, KL-UCB, UCB, and Multi-obj.]

Figure 8: Sample size comparison in setting 2.

Lastly, the sample efficiency is evaluated. Fig. 8 plots the minimum number of patients needed to achieve a given recommendation accuracy for the different algorithms.

M. Experiment settings 3 to 8 with evaluation of allocation and recommendation percentages

This section reports the allocation and recommendation percentages of each dose for all considered algorithms under different toxicity/efficacy probabilities. We reuse the same 6 scenarios as those in the experiments of (Zang et al., 2014). See Tables 6 to 11 for the detailed results. They are in line with the conclusions of the main paper.

Table 6: Recommended & allocated percentages for Scenario 1 of (Zang et al., 2014).

[Table 6 body omitted; surviving header: Recommended, Allocated; toxicity probability: 0.08, 0.12, 0.2, …]

Table 8: Recommended & allocated percentages for Scenario 3 of (Zang et al., 2014).

[Table 8 body omitted; surviving header: Recommended, Allocated; toxicity probability: 0.06, 0.08, 0.14, …]

Table 10: Recommended & allocated percentages for Scenario 5 of (Zang et al., 2014).

[Table 10 body omitted; surviving header: Recommended, Allocated; toxicity probability: 0.1, …]

Table 11: Recommended & allocated percentages for Scenario 6 of (Zang et al., 2014).

[Table 11 body omitted; surviving header: Recommended, Allocated; surviving entries: 0.01, 0.03, 0.05, 0.6, …]