Improving the assessment of the probability of success in late stage drug development
Lisa V Hampson, Björn Bornkamp, Björn Holzhauer, Joseph Kahn, Markus R Lange, Wen-Lin Luo, Giovanni Della Cioppa, Kelvin Stott, Steffen Ballerstedt
Analytics, Novartis Pharma AG, Basel, Switzerland; Analytics, Novartis Pharmaceuticals Corporation, New Jersey, US; Portfolio Analytics, Novartis Pharma AG, Basel, Switzerland; Clinical R&D Consultants srl, Rome, Italy

*Corresponding author: Lisa Hampson, Novartis Pharma AG, Postfach 4002, Basel, Switzerland. [email protected]

February 5, 2021
Abstract
There are several steps to confirming the safety and efficacy of a new medicine. A sequence of trials, each with its own objectives, is usually required. Quantitative risk metrics can be useful for informing decisions about whether a medicine should transition from one stage of development to the next. To obtain an estimate of the probability of regulatory approval, pharmaceutical companies may start with industry-wide success rates and then apply subjective adjustments to these to reflect program-specific information. However, this approach lacks transparency and fails to make full use of data from previous clinical trials. We describe a quantitative Bayesian approach for calculating the probability of success (PoS) at the end of phase II which incorporates internal clinical data from one or more phase IIb studies, industry-wide success rates, and expert opinion or external data if needed. Using an example, we illustrate how PoS can be calculated accounting for differences between the phase IIb data and future phase III trials, and discuss how the methods can be extended to accommodate accelerated drug development pathways.
Keywords: Bayesian methods; expert elicitation; meta-analysis; quantitative decision making
Before a new medicine can be licensed, a sequence of trials, each with its own objectives, is required to confirm the medicine's efficacy and safety. Clinical research generally begins with small-scale phase I studies which focus on the tolerability and safety of the medicine. Phase II is then often subdivided into two distinct stages: phase IIa, intended to demonstrate proof-of-concept, and phase IIb, which is used to identify the dose and dosing schedule to be taken forward to phase III. The final stage of development involves performing large-scale, confirmatory, phase III trials. A program like this will be punctuated by a series of milestones at which the sponsor reviews all of the available evidence and decides whether or not to transition to the next stage of development. A key milestone occurs at the end of phase IIb, since continuation requires investment in large scale pivotal studies. Metrics such as a program's probability of success (PoS) are routinely used to quantify and communicate risks.

We adopt a Bayesian approach and define PoS as the unconditional probability of success, averaging across our uncertainty about unknown parameters such as key treatment effects. Therefore, the PoS evolves as new information becomes available, such as interim results from an on-going pivotal study. Of course, success can mean different things to different stakeholders. At the trial-level, success is often taken to mean achieving statistical significance on the primary endpoint, in which case PoS simplifies to the Bayesian predictive power of the trial, also referred to as 'assurance'. In this paper, we focus on program-level success, defined as obtaining regulatory approval with effects on key endpoints sufficient to secure market access, that is, getting a newly approved drug used and reimbursed.

Figure 1: Three hurdles for success at the end of Phase II.
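The trial-level notion of PoS as Bayesian predictive power ('assurance') mentioned above can be sketched as follows. This is an illustrative sketch only: the prior, standard error, and significance level are invented assumptions, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

# Trial-level PoS ("assurance"): average the frequentist power of a planned
# trial over a prior distribution for the true treatment effect.
rng = np.random.default_rng(0)

se = 0.25                  # assumed standard error of the effect estimate at the planned n
z_crit = norm.ppf(0.975)   # one-sided significance at alpha = 0.025

# Illustrative normal prior for the true effect theta.
theta = rng.normal(loc=0.5, scale=0.2, size=100_000)

# Power given theta is P{z-statistic > z_crit | theta}; average over prior draws.
assurance = norm.sf(z_crit - theta / se).mean()
print(0.0 < assurance < 1.0)
```

Unlike power at a single assumed effect, assurance accounts for uncertainty about the effect itself, which is why it evolves as new evidence updates the prior.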
PoS can be used to inform trial design discussions, such as whether to include a futility interim analysis. It is also essential for calculating a program's expected net present value (eNPV), defined approximately as eNPV = PoS × NPV(Rewards) − NPV(Costs). The eNPV metric is important for informing investment decisions and has also been used as an objective function to optimize various aspects of program design, including sample size and dose-selection strategies.

Several methods of increasing complexity have been proposed for calculating PoS. The first, simplest, approach summarizes industry data by aggregate success rates: so-called industry 'benchmarks' are available for the probability of regulatory approval from a particular milestone, or the probability of successfully transitioning out of a development phase. Typically, benchmarks are disaggregated across a limited set of covariates, such as therapeutic area or lifecycle class. The project team can then apply adjustments to the benchmark based on a subjective assessment of program-specific risks to arrive at a final, more tailored, PoS. While this approach is quick and simple, there are several drawbacks. Most importantly, subjective adjustments are open to heuristic biases and are likely to be applied inconsistently across programs, thus resulting in PoS assessments lacking in consistency and transparency. Recently, more advanced multivariable modelling and machine learning techniques have been applied to industry datasets to generate tailored industry benchmarks, which are estimates of the probability of approval adjusting for several (and sometimes hundreds of) program characteristics.
However, when using this approach, the analyst must be careful to avoid leaking information by conditioning on prophetic variables which would not typically be known at the time an investment decision is made.

An alternative approach for calculating PoS is to present the relevant evidence to a group of experts and elicit from them a prior distribution for the treatment effect, which is then used to drive the PoS evaluation. Evidence may comprise data from early phase trials, studies of the drug in related indications, or trials of drugs with a similar mechanism of action, as well as more broadly relevant information, such as the scientific rationale for the mechanism of action.

A third, data-based, approach to the PoS assessment is to take the phase II data at face value, without adjusting for any selection that may have occurred at the end of phase II, and use a Bayesian approach to combine these with prior information. The resulting posterior for the treatment effect is then used to drive the PoS evaluation. The choice of prior will have an important impact on the PoS. One major limitation of this approach is that it can be applied only when there are no design differences between phases.

In this paper, we extend existing methodology to propose a new quantitative approach for evaluating the PoS of a drug development program at the end of phase II. Referring to adverse events that would prompt abandonment of a development program despite positive efficacy data as safety showstopper events (SSEs), at the end of phase II there are three hurdles for success, as shown in Figure 1.
Specifically, we must: 1) meet statistical significance on the one or two efficacy endpoints needed for approval in all phase III trials without observing a SSE; 2) obtain regulatory approval; and 3) for all efficacy endpoints considered essential for market access (which is typically a larger set than the group of endpoints needed for approval), observe treatment effect estimates which are in excess of the minimum thresholds believed to be sufficient to secure access. We refer to these thresholds collectively as the 'target product profile' (TPP), because this is the document which is used in many pharmaceutical companies to outline the desired profile of a new therapeutic, including efficacy and safety. We begin by focusing on traditional development programs with data available from at least one phase IIb trial. Then, we can calculate the probability of clearing all three of the hurdles for success in Figure 1 by combining several sources of information, including the tailored industry benchmark, the phase IIb efficacy data, the design of the phase III studies, and a qualitative assessment of remaining unaccounted risks. Frequently, differences between phases will preclude a purely data-based PoS evaluation. In these cases, we propose leveraging expert opinion in order to relate the phase IIb data to the quantities of interest in phase III.

The remainder of this paper proceeds as follows. In Section 2, we begin by giving an overview of the PoS calculation. Section 3 describes how we use program-specific efficacy data and industry benchmarks for reasons of attrition to estimate the probability of running a positive phase III program. In Section 4, we take a step back and discuss how prior distributions for parameters of the Bayesian meta-analytic model used to combine early phase efficacy data can be informed by tailored industry benchmarks.
In Section 5, we propose a semi-quantitative approach to account for any remaining risks which are not accounted for in previous steps of the calculation, while in Section 6 we discuss how to bridge across differences between phase IIb and phase III trials using expert opinion. We illustrate the proposed framework with an example in Section 7 and conclude by outlining further work in Section 8.
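Before turning to the details, the three-hurdle PoS decomposition and the eNPV formula introduced above can be sketched numerically. All figures below are invented for illustration only; they are not estimates from the paper.

```python
# Three hurdles for success at the end of phase II (Figure 1),
# with invented illustrative probabilities:
p_phase3 = 0.55    # 1) significance on key endpoints in all phase III trials, no SSE
p_approval = 0.88  # 2) regulatory approval, given a positive phase III program
p_tpp = 0.90       # 3) effect estimates meet the TPP thresholds for market access

# Program-level PoS is the product of the (conditional) hurdle probabilities.
pos = p_phase3 * p_approval * p_tpp

# Expected net present value: eNPV = PoS x NPV(Rewards) - NPV(Costs).
npv_rewards = 900.0  # discounted rewards if successful, in $m (invented)
npv_costs = 150.0    # discounted development costs, in $m (invented)
enpv = pos * npv_rewards - npv_costs

print(round(pos, 3), round(enpv, 1))
```

The paper's contribution is in estimating the first factor from phase IIb data and the remaining factors from benchmarks and expert judgement, rather than in this arithmetic itself.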
This section provides an overview of the PoS calculation; a schematic diagram can be found in Supplementary Materials A. The evaluation begins by considering the risks associated with the phase III studies. The probability of a positive phase III program, in which we demonstrate efficacy on key endpoints with no SSE, is:

P{Efficacy success in phase III on 1-2 key endpoints} × P{No SSE in phase III | Efficacy success in phase III on 1-2 key endpoints}.   (1)

Efficacy success in phase III has two components. Firstly, we must achieve statistical significance on the key endpoints in all phase III trials. Secondly, we must observe average treatment effect estimates for these endpoints which are at least in line with the TPP. Whilst the first component is a minimum requirement for regulatory approval, the second is a prerequisite for market access. We begin by calculating the probability of efficacy success assuming data are available from J ≥ 1 phase IIb trials. The program-level PoS can then be written as:

P{Approval & TPP | Efficacy success on 1-2 key endpoints & no SSE in phase III} × P{Efficacy success on 1-2 key endpoints & no SSE in phase III}.   (2)

Section 5 describes the semi-quantitative approach taken to calculate the left hand side term of (2), referred to as the conditional PoS.

Suppose two endpoints will be key to efficacy success in phase III; the case for a single endpoint follows naturally. We refer to them as the primary endpoint P and secondary endpoint S, but the same approach can be applied if they are, for example, co-primary endpoints. We define θ_j = (θ_Pj, θ_Sj) as the study-specific treatment effects underpinning the j-th phase IIb trial, for j = 1, ..., J. Without loss of generality, we assume that the null effect consistent with no advantage versus control is 0 for each endpoint, and larger effects are consistent with greater efficacy. The TPP thresholds for endpoints P and S are δ_P and δ_S. Suppose the j-th phase IIb trial provides an estimate θ̂_j of θ_j.
In many cases, such as when estimates are obtained from fitting a generalized linear model using maximum likelihood estimation or a Cox proportional hazards model using maximum partial likelihood, θ̂_j will follow, at least approximately after suitable transformation, a bivariate normal distribution:

(θ̂_Pj, θ̂_Sj) | θ_j ~ N( (θ_Pj, θ_Sj), [ I_Pj^{-1}, κ√(I_Pj^{-1} I_Sj^{-1}); κ√(I_Pj^{-1} I_Sj^{-1}), I_Sj^{-1} ] ),   (3)

where κ represents the within-patient correlation of responses on endpoints P and S, and I_Pj and I_Sj are the Fisher information levels for θ_Pj and θ_Sj. We treat κ as known and set it equal to the estimate from phase IIb. For many types of data, information levels will depend on one or more 'nuisance' parameters. For example, for normal data information levels will depend on the response variance, while for binary data (under the null hypothesis of no treatment effect) they will depend on the common response probability. One approach would be to stipulate prior distributions for all unknown nuisance parameters and incorporate this uncertainty into the PoS calculation. However, for simplicity, we prefer to set information levels equal to the values obtained assuming nuisance parameters are equal to their estimates based on the phase IIb data.

We assume that study-specific treatment effects in phase IIb are exchangeable, so that

θ_1, ..., θ_J | µ, τ_P, τ_S, ρ ~ N( (µ_P, µ_S), [ τ_P², ρ τ_P τ_S; ρ τ_P τ_S, τ_S² ] ).   (4)

We interpret τ_P and τ_S as the standard deviations of the phase IIb study-specific effects on endpoints P and S, while ρ is the within-study correlation of treatment effects on these two endpoints. For simplicity, we treat ρ as a fixed constant supplied by the analyst; its specification could be based on a meta-regression of pairs of treatment effect estimates obtained from trials of drugs with a similar mechanism of action to the novel drug.
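A minimal sketch of the sampling model (3), simulating a phase IIb effect estimate given study-specific effects, Fisher information levels, and a within-patient correlation κ. All input values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative inputs (not values from the paper).
theta_j = np.array([0.4, 0.3])   # true study-specific effects (theta_Pj, theta_Sj)
info_P, info_S = 50.0, 40.0      # Fisher information levels I_Pj, I_Sj
kappa = 0.6                      # within-patient correlation of P and S responses

# Covariance matrix from model (3): diagonal entries are inverse information
# levels; the off-diagonal entry is kappa * sqrt(I_Pj^{-1} * I_Sj^{-1}).
var_P, var_S = 1.0 / info_P, 1.0 / info_S
cov = kappa * np.sqrt(var_P * var_S)
Sigma = np.array([[var_P, cov], [cov, var_S]])

# Draw one simulated bivariate-normal estimate (theta_hat_Pj, theta_hat_Sj).
theta_hat = rng.multivariate_normal(theta_j, Sigma)
print(theta_hat.shape)  # (2,)
```

The same sampling step, with design rather than observed information levels, is what drives the phase III simulations described later in the section.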
The Bayesian meta-analytic model for the phase IIb data is completed by stipulating priors for the average treatment effect vector µ = (µ_P, µ_S), and for τ_P and τ_S. Discussion of the prior for µ is postponed to Section 4. We follow others in stipulating weakly informative half-normal priors for the heterogeneity parameters, with τ_P ~ HN(z_P) and τ_S ~ HN(z_S), where HN(z) is the distribution of |X| if X ~ N(0, z²). Neuenschwander and Schmidli [24, Table 3] characterize different degrees of heterogeneity (large, substantial, moderate, small) in terms of multiples of the 'unit-information standard deviation', which in this context is the standard error of the effect estimate based on two patient responses (one on each arm) or a single event. We expanded this categorization to introduce a 'very small' level of heterogeneity, as shown in Supplementary Materials B. We choose z_P (z_S) to ensure that the prior median of τ_P (τ_S) is equal to the multiple of the unit-information standard deviation corresponding to the stated degree of between-study heterogeneity. Adopting the nomenclature of Neuenschwander and Schmidli, the examples presented in this paper will assume 'small' between-trial heterogeneity in phase IIb for all key endpoints. The hyperparameters z_P and z_S will take different values if endpoints P and S follow different distributions or, more generally, if different levels of heterogeneity are attributed to each.

Given the phase IIb data θ̂_1, ..., θ̂_J, we fit the model defined in (3)-(4) using Markov chain Monte Carlo (MCMC). We label the L samples from the posterior distribution for µ as (µ_P^(1), µ_S^(1)), ..., (µ_P^(L), µ_S^(L)). We now describe how we can use these to generate L samples from the meta-analytic-predictive (MAP) prior for the study-specific treatment effects in the K planned phase III trials, denoted by θ_3k = (θ_3Pk, θ_3Sk), for k = 1, ..., K. If there are no differences between the target estimands of the phase IIb and phase III studies, the long-run averages of the study-specific effects in each phase should be identical. However, the degree of between-study heterogeneity is expected to be smaller in phase III than in phase IIb, because it is common for pivotal studies to run concurrently with one another and follow similar (if not identical) protocols.
Let τ_3P and τ_3S denote the standard deviations of the phase III study-specific treatment effects on endpoints P and S. For the purposes of the examples described in this paper, we use the method described in the previous paragraph to specify priors τ_3P ~ HN(z_3P) and τ_3S ~ HN(z_3S) with medians corresponding to 'very small' heterogeneity. Taking L independent samples from the prior distributions of τ_3P and τ_3S, and assuming that study-specific treatment effects in phases IIb and III are partially exchangeable, we can then sample

θ_31^(ℓ), ..., θ_3K^(ℓ) | µ^(ℓ), ρ, τ_3P^(ℓ), τ_3S^(ℓ) ~ N( (µ_P^(ℓ), µ_S^(ℓ)), [ (τ_3P^(ℓ))², ρ τ_3P^(ℓ) τ_3S^(ℓ); ρ τ_3P^(ℓ) τ_3S^(ℓ), (τ_3S^(ℓ))² ] ) for ℓ = 1, ..., L.   (5)

For each ℓ = 1, ..., L, given the study-specific treatment effects θ_31^(ℓ), ..., θ_3K^(ℓ), we can simulate the outcome of the ℓ-th phase III program, assuming that treatment effect estimators follow the canonical joint distribution (3) and setting information levels equal to their design values. Statistical significance in a trial is declared if the simulated treatment effect estimate exceeds the critical value of the planned hypothesis test. A TPP threshold for an endpoint is deemed to have been met if it is less than the weighted mean of the K simulated effect estimates, weighting by the inverse variances. The predictive probability of efficacy success in phase III is given by

(1/L) Σ_{ℓ=1}^{L} 1{ Meet efficacy success criteria in ℓ-th phase III program | θ_31^(ℓ), ..., θ_3K^(ℓ) }.

We need to proceed slightly differently when a key efficacy endpoint is binary and the treatment effect is a difference in proportions.
This is because the assumption of normality in (4) could lead us to place probability mass on effects outside the interval [−1, 1].

The probability of a positive phase III program in (1) depends on the conditional probability of no SSE in phase III given efficacy is demonstrated. We use industry benchmarks, rather than project-specific clinical data, to quantify the risk of a SSE. Several authors have reviewed the reasons for attrition in drug development and how these vary across phases. However, since only the primary reason for termination is typically reported in industry datasets, we cannot estimate the joint distribution of different causes for failure. In addition, failure attribution may not always be explicit: 'strategic reasons' are commonly cited for termination, but we speculate this may be a coded version of poor efficacy or safety.

We simplify to assume that a program can only fail due to either inadequate efficacy or safety. Furthermore, we assume these two causes for failure are independent. Under the latter assumption, the conditional probability of no SSE in phase III given efficacy success simplifies to the unconditional probability of no SSE in phase III. This approximation is likely to be conservative because higher rates of serious adverse events on the novel drug would be expected to result in higher rates of study discontinuations or treatment switching, which would in turn dilute efficacy: if we were told efficacy had been demonstrated, the risk of a SSE would therefore decrease. The unconditional probability of no SSE in phase III is given by:

P{No SSE in phase III} = 1 − P{Fail in phase III} × P{SSE in phase III | Fail in phase III}
                       = 1 − (1 − P{Succeed in phase III}) × P{SSE in phase III | Fail in phase III}.
(6)

We estimate P{Succeed in phase III} in (6) using a tailored industry benchmark, which is obtained by evaluating a simple predictive model fitted to an industry dataset according to the approach described in Appendix B. The conditional risk P{SSE in phase i | Fail in phase i} in (6) is also estimated using an industry benchmark. To derive this, we used the Clarivate Global R&D performance metric program 'CMR' (Centre for Medicines Research) database, which provides aggregate summaries of transition rates and reasons for failure by phase. Assuming all failures that are not attributed to safety are due to lack of efficacy, restricting attention to programs entering a phase between 2012-2018, and excluding vaccines and biosimilars, we obtained estimates:

• Non-oncology: P{SSE | Fail in phase II} = 0. ; P{SSE | Fail in phase III} = 0.
• Oncology: P{SSE | Fail in phase II} = 0. ; P{SSE | Fail in phase III} = 0.

We assume P{SSE | Fail in phase IIa} = P{SSE | Fail in phase IIb}, which we set equal to our estimate of P{SSE | Fail in phase II}.

We have yet to comment on what prior we will place on µ. Using an 'off the shelf' weakly informative normal distribution would neglect the information we have from the industry benchmark, as well as the potential impact of selection bias on the phase II effect estimate(s). Figure 2(a) compares the probability density function (pdf) of the maximum likelihood estimate (MLE) of the treatment effect from a phase III trial with the conditional pdf of the MLE from phase II given statistical significance is achieved. Figure 2(b) shows how the magnitude of the selection bias in the phase II MLE from a statistically significant trial varies with the treatment effect and phase II sample size.
The impact of the selection bias is highest when the phase II trial is poorly powered for its primary objective and when the drug has little benefit versus control. We could try to account for the selection bias by explicitly modelling the phase III go/no-go criteria when analyzing the phase IIb data although, in practice, this is not straightforward as investment decisions are influenced by multiple factors. Alternatively, based on a small review of the Pfizer portfolio, Kirby et al. propose discounting the phase II estimate by 10%. However, applying a fixed discount factor ignores the influence of the phase IIb sample size and of the drug's efficacy on the selection bias. With these factors in mind, we try to ameliorate the impact of selection bias by using a prior for µ satisfying the following requirements:

1. It should incorporate some degree of skepticism.

2. The degree of skepticism should reflect the historical success rates of similar projects at the same stage of development. Since µ measures the efficacy of the new drug, only the benchmark probability of efficacy success in phases II and III (given we start phase II) is relevant for informing the prior. The benchmark conditional probability of approval given we submit a new drug application (NDA) is not considered informative for µ, since approval outcomes may be influenced by many factors beyond efficacy; see Section 5.

3. The influence of the benchmark should decrease as the phase II sample size and/or the phase IIb effect estimate increases.

Figure 2: Results are for the case that the primary endpoint P is normally distributed with a known standard deviation of 2, where the difference in average responses on the new drug vs control is identical across phases, i.e. θ_2P = θ_3P = θ_P. The phase IIb and phase III studies are designed to test H_0: θ_P ≤ 0 against H_1: θ_P > 0 at one-sided level α = 0.025, with power specified at θ_P = δ_P, setting δ_P = 1.5. The phase III trial is designed to have power 0.9 at θ_P = δ_P. The figures plot: (a) Comparison of the unconditional pdf of θ̂_3P with the conditional pdf of θ̂_2P given we achieve statistical significance in phase IIb, when the phase IIb trial has power 0.8 at θ_P = δ_P. (b) Conditional bias in θ̂_2P given statistical significance is achieved in phase IIb at level α.

After motivating our choice of prior for µ, Section 4.2 provides more details on its specification. We specify a mixture prior for µ placing probability ω on a 'null' component consistent with the hypothesis that the new treatment offers no clinically relevant advantage over control on either endpoint, and probability (1 − ω) on a 'TPP' component consistent with the hypothesis that drug effects on both endpoints are close to the TPP thresholds. To capture these beliefs, we set:

f(µ_P, µ_S) = ω f_1(µ_P, µ_S) + (1 − ω) f_2(µ_P, µ_S).   (7)

For c = 1, 2, we define f_c(µ_P, µ_S) as the pdf of a bivariate normal random variable with mean η_c and variance matrix Σ_c, where:

η_1 = (0, 0), Σ_1 = [ σ_1P², ρ σ_1P σ_1S; ρ σ_1P σ_1S, σ_1S² ]   and   η_2 = (δ_P, δ_S), Σ_2 = [ σ_2P², ρ σ_2P σ_2S; ρ σ_2P σ_2S, σ_2S² ].

We assume that the correlation between µ_P and µ_S is the same as the linear correlation between study-specific effects in (4). We find σ_1P as the solution to P{µ_P ≥ δ_P; η_1, Σ_1} = 0.01, and σ_1S is defined similarly: placing 1% probability in the upper tail is consistent with the interpretation of f_1(µ_P, µ_S) as the 'null' component of the mixture. Meanwhile, we find σ_2P as the solution to P{µ_P ≤ 0; η_2, Σ_2} = 0.01, consistent with the interpretation of f_2(µ_P, µ_S) as the 'TPP' component; σ_2S is defined similarly. In the examples we have considered, choosing the prior standard deviations in this way implies that as ω tends towards 0.5, the marginal priors f(µ_P) and f(µ_S) roughly approximate uniform densities on the intervals (0, δ_P) and (0, δ_S), capturing our prior equipoise about whether or not the drug has a meaningful benefit. Figure 3 shows one such example with a single primary efficacy endpoint.

Figure 3: Mixture prior for µ_P when δ_P = 10 and ω = 0.5. In this case, σ_1P = −δ_P/Φ^{-1}(0.01) and σ_2P = −δ_P/Φ^{-1}(0.01).

We calibrate ω such that the unconditional probability of efficacy success in a 'standard' phase IIb and phase III program equals the corresponding tailored industry benchmark, the derivation of which is given in Appendix B. We characterize a 'standard' development program as comprising one phase IIb study and either one or two phase III studies, depending on the disease area (one if oncology; two, otherwise). The unconditional probability of efficacy success is

∫ P{Efficacy success in 'standard' phase IIb and III | µ_P, µ_S} f(µ_P, µ_S) dµ.

For the purposes of prior calibration, we define efficacy success as observing a (one-sided) p-value < 0.05 in phase IIb for the endpoint associated with the smallest Fisher information for any given sample size, and observing p-values < 0.025 on both endpoints P and S in all phase III studies. This is because, based on our experience, success on one endpoint is typically considered sufficient in phase II. We assume a standard phase IIb (phase III) study is designed to have power 0.8 (0.9) to meet its objectives when treatment effects equal their TPP thresholds. When there is one key efficacy endpoint, a closed form expression for ω exists, which is presented in Appendix C.

So far we have restricted attention to traditional development pathways, where a phase III program is preceded by one or more phase IIb trials. However, accelerated pathways which skip phases are common in highly competitive research spaces, in conditions where there is a high level of unmet medical need, or where there is an abundance of existing relevant evidence. For accelerated development programs, phase labels can also become somewhat arbitrary. For example, pivotal studies may be labelled phase II rather than phase III, although they are still intended to support registration. It is also common in the oncology space for phase Ib expansion-cohort studies to collect efficacy data in the target patient population, and these trials play a similar role to that of phase IIa studies in other disease areas. If efficacy data are available from early phase Ib or phase IIa studies, it is straightforward to extend our approach to evaluate the PoS of these abbreviated programs, the principal methodological question being which industry benchmark we should use to calibrate the prior for µ. If early-phase efficacy data come from a phase Ib or phase IIa study, we calibrate f(µ_P, µ_S) in (7) so that the unconditional probability of efficacy success in a standard phase IIa, IIb and III program equals the corresponding industry benchmark.
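For a single endpoint, the calibration of the mixture prior can be sketched numerically. The sketch below is an illustration under simplifying assumptions, not the closed form of Appendix C: it ignores between-study heterogeneity and the TPP condition on the pooled estimate, and the benchmark value is invented. It computes the component standard deviations from the 1% tail conditions and then solves for ω.

```python
import numpy as np
from scipy.stats import norm

delta = 1.0     # TPP threshold delta_P (illustrative)
p_bench = 0.45  # hypothetical tailored benchmark P{efficacy success in IIb + III}

# Component standard deviations from the 1% tail conditions: the null
# component N(0, s1^2) puts 1% mass above delta; the TPP component
# N(delta, s2^2) puts 1% mass below 0.
s1 = -delta / norm.ppf(0.01)
s2 = -delta / norm.ppf(0.01)

# Standard-program design: phase IIb tested at one-sided alpha = 0.05 with
# power 0.8 at delta; two phase III trials at alpha = 0.025 with power 0.9.
se2 = delta / (norm.ppf(0.95) + norm.ppf(0.80))   # implied phase IIb standard error
se3 = delta / (norm.ppf(0.975) + norm.ppf(0.90))  # implied phase III standard error

def success_prob(mu):
    """P{significance in phase IIb and in both phase III trials | mu}."""
    p2 = norm.sf(norm.ppf(0.95) - mu / se2)
    p3 = norm.sf(norm.ppf(0.975) - mu / se3)
    return p2 * p3**2

# Average the success probability over each mixture component by Monte Carlo.
rng = np.random.default_rng(0)
a = success_prob(rng.normal(0.0, s1, 200_000)).mean()    # null component
b = success_prob(rng.normal(delta, s2, 200_000)).mean()  # TPP component

# The mixture success probability is omega*a + (1 - omega)*b; solve for omega.
omega = (b - p_bench) / (b - a)
print(round(omega, 3))
```

A benchmark close to the TPP-component success probability yields a small ω (little skepticism), while a low benchmark pushes ω towards 1, which is the qualitative behaviour required of the prior.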
We assume a standard phase IIa program consists of a single study with (one-sided) type I error rate 0.1 and power 0.8; standard phase IIb and phase III programs remain as above.

Note that our focus is on calculating the PoS of pivotal trial(s) intended from the outset to support regulatory approval. Programs granted conditional approval based on overwhelmingly positive results from an early phase trial will typically fall outside the scope of the current work, unless these studies were pre-specified as registrational.

We retrospectively evaluated the PoS of a Novartis program which, at the time of the PoS assessment, had started phase III but had not reported the results of the pivotal trials. To protect confidentiality, some details have been anonymised. We began by recording the important program characteristics listed in Table 3 associated with a project's probability of success in phases II and III. The drug (T) was a small molecule, orally administered, targeting an enzyme to treat a condition in the cardiovascular/metabolic/renal therapeutic area, and was already approved for other indications. Drug T had not been granted a breakthrough designation for the indication in question. Prior to phase III, T had been studied in a single phase II trial, which we label as phase IIa and index as study j = 1. Table 1 lists the industry benchmarks given these characteristics, obtained from the predictive models described in Appendix B. The primary endpoint P of the phase IIa trial was change from baseline at week 12 in a log2-transformed continuous biomarker, which is normally distributed with standard deviation 0.91. The objective was to demonstrate superiority of T versus control; larger reductions in the biomarker by week 12 reflect an advantage of T, and the TPP threshold was log2(0.75) = −0.42, interpreted as a 25% relative reduction in the geometric mean biomarker ratio to baseline at week 12.

Phase        Prob. of success    Prob. of efficacy     Prob. of no SSE
             in phase            success in phase      in phase
IIa          0.68                0.72                  0.97
IIb          0.68                0.72                  0.97
III          0.70                0.76                  0.96
Submission   0.88                NA                    NA

Table 1: Tailored benchmark probabilities of overall, efficacy and safety success by phase for our example. Success in the submission phase means obtaining regulatory approval.

Figure 4(a) shows the calibrated mixture prior for µ_P. Figure 4(b) plots the posterior median of µ_P after fitting the Bayesian hierarchical model from Section 3.1 to data from the phase IIa study, placing the mixture prior in Figure 4(a) on µ_P and stipulating τ_P ~ HN(0. ), which has a median consistent with small between-trial heterogeneity in this context. All calculations were performed in R 3.6.1 using JAGS. The posterior medians are attenuated towards the null if we observe a point estimate θ̂_1P ≤ −0.42; otherwise, they are shrunk away from the naïve MLE. We compare these Bayesian estimates with the approach of discounting θ̂_1P by 10% and using this to update a non-informative prior for θ_1P.

Figure 4: (a) Calibrated mixture prior for µ_P for the example of Section 4.4. (b) Posterior medians of θ_1P and µ_P as a function of the phase IIa estimate θ̂_1P.

We now discuss how to assess

P{Approval & TPP | Efficacy success on 1-2 key endpoints & no SSE in phase III},

which we refer to as a program's conditional PoS. This probability should capture risks known at the end of phase IIb which have not yet been accounted for in previous steps of the PoS evaluation described in Sections 3-4. These risks fall into five categories:

• Regulatory alignment (R1): The phase III design may not be aligned with regulatory expectations.

• Unaccounted safety (R2): Safety risks recently emerging from within the program (e.g. pre-clinical studies) and/or beyond the program (e.g. safety signals from clinical trials of a compound with the same mechanism of action) point towards an increased risk of a rare AE which, while unlikely to be detected in phase III, may raise concerns during submission. Such risks would not be captured in the SSE calculation.

• Unaccounted TPP (R3): Risk of not meeting the TPP on endpoints other than P and S which are necessary for approval and/or market access.

• Quality and compliance (R4): Known risks in quality and compliance that could jeopardize approval despite positive results on P and S, e.g. a poor internal audit outcome on the phase II trial, issues with assay validation for a key biomarker, or a phase III program run in areas with poor infrastructure and inexperienced investigators.
• Technical development (R5): Known issues with the formulation and/or device that could create uncertainties about dose selection or manufacturing.

Appendix B describes how we used industry data to fit a logistic model for the conditional probability of regulatory approval given NDA submission, adjusting for the lifecycle class of the drug and the disease area. Evaluating this model for a program yields a tailored benchmark p̂_BS. However, the logistic model does not capture the impact of R1-R5. We propose a semi-quantitative approach to integrating these risks which begins by asking a team to score their program on a three-point risk scale (low; medium; high) for each risk R1 to R5: the scorecard is included in Supplementary Materials C. A program's risk profile, defined as the configuration of the five low-med-high ratings, is then used to adjust the benchmark odds p̂_BS/(1 − p̂_BS).

In order to translate a program's qualitative risk profile into a number that we can use to adjust the benchmark odds of approval given NDA submission, we need to understand the impact of R1-R5 on a program's PoS. While there are no readily-available data on the impact of R1-R5, this does not mean there is no relevant evidence. To quantify what is known about the effect of R1-R5 on PoS, we elicited the judgements of senior Novartis colleagues with experience of several submissions and market access negotiations. Each expert was asked to complete a survey listing 15 configurations of the low-med-high risk ratings: for each one, the expert was asked to state how many out of 100 hypothetical programs with the same risk profile would fail to gain approval and access despite having run a positive phase III program meeting statistical significance and the TPP on the 1-2 key efficacy endpoints without a SSE. Each survey was accompanied by a cover sheet providing background information which cited a crude historical regulatory approval rate of 90% after NDA submission.
Since elicited conditional success rates were expected to be very high for some risk profiles, we preferred to ask experts for opinions on failure rates and deduce success rates from these. Three different versions of the survey were circulated and are included in Supplementary Materials D. All experts were asked a common set of 11 questions to identify the main effects of R1-R5. The remaining four questions were then tailored to explore one of the three pairwise interactions between R1, R2 and R3. In total, 46 experts spanning seven line functions were invited to participate in the survey, split between the three versions of the questionnaire (16:16:14). Experts were assigned to versions using purposeful sampling where appropriate, e.g. to ensure experts in regulatory affairs received versions of the questionnaire relevant for understanding potential pairwise interactions involving R1; a similar strategy was applied to assign experts in safety and patient access to questionnaires.

In total, 31 of 46 experts responded. One completed survey was discarded due to a misunderstanding of the questions, meaning results are based on a denominator of 30. We model experts' individual opinions using a linear mixed effects model, linking the average opinion on the conditional PoS to the main effects of R1-R5 using a logit link function, and assuming a Gaussian random error term. We fit the model with a random expert intercept term, treating all other model terms as fixed effects. We represent R1-R5 as categorical variables to avoid the need for assumptions about how the conditional odds of success change across levels of the risk factors. Let p̂_e(r_1, ..., r_5) denote the fitted conditional PoS of a program with risk profile (R1 = r_1, ..., R5 = r_5) obtained from the mixed effects model. Figure 5 compares fitted values with elicited opinions; fitted values are also listed in Supplementary Materials E.

Figure 5: Comparing the fit of a linear mixed effects model adjusting only for the main effects of risk factors R1-R5 (red points) with experts' stated opinions summarized by their mean. A risk profile (R1 = low, R2 = med, R3 = low, R4 = low, R5 = med) appears as ABAAB, where A denotes low and B denotes medium risk.

Recall that the tailored benchmark p̂_BS incorporates information on a program's disease area and lifecycle class. Assuming the effects on the conditional PoS of R1-R5, lifecycle class and disease area are additive on the logit scale, we can leverage p̂_e(r_1, ..., r_5) to derive a multiplicative adjustment to p̂_BS/(1 − p̂_BS). We denote this adjustment factor by C(r_1, ..., r_5), which will capture the impact of R1-R5 on the conditional odds of success. For ease of presentation, we will henceforth drop the risk arguments of p̂_e and C.

As expert opinions were elicited with a crude benchmark conditional PoS of 0.9 in mind, the adjustment factor C must satisfy:

p̂_e / (1 − p̂_e) = C × 0.9 / (1 − 0.9).   (8)

Applying this adjustment factor to p̂_BS/(1 − p̂_BS), our estimate of the conditional odds of success reflecting information on R1-R5, disease area and lifecycle class is given by

P{Approval & TPP | Positive phase III on key endpoints} / (1 − P{Approval & TPP | Positive phase III on key endpoints}) = C × p̂_BS / (1 − p̂_BS).   (9)

Substituting our expression for C from (8) into (9), we obtain

P{Approval & TPP | Positive phase III on key endpoints} = 0.1 × p̂_e × p̂_BS / (0.9(1 − p̂_e) + p̂_BS(p̂_e − 0.9)).   (10)

The final PoS estimate is then obtained as the product of P{Positive phase III program} in (1) and the conditional PoS in (10).

6 Accommodating differences between phase IIb and phase III
So far, we have restricted attention to the relatively simple scenario where similar treatment effects are measured in phase IIb and phase III. However, differences between early phase and pivotal trials are common. Examples include: pushing out the time-point at which the primary endpoint is measured; switching from measuring a biomarker to a clinical outcome; broadening the patient population; or refining the drug formulation in a manner which impacts efficacy. In disease areas where the treatment landscape is rapidly evolving, we may also find that the control arm used in phase IIb has been replaced as standard of care by the time the phase III studies launch. Failure to examine the impact of these differences on the effect of treatment will make it difficult to interpret just how predictive statistical significance in a phase IIb trial is of success in phase III. An FDA report [35] highlighting 22 case studies of phase II and III trials with divergent results included projects where positive phase II results on a short-term endpoint turned out to be inconsistent with the lack of long-term benefit subsequently found in phase III.

When short- and long-term endpoints are chosen consistently across an indication, one could perform a Bayesian meta-regression of data from trials reporting pairs of effect estimates on these two endpoints; the association between treatment effects can then be used to bridge from the phase IIb data to derive a MAP prior for the long-term treatment effect in phase III. Alternatively, a network meta-analytic approach could be used to bridge across phases when there are differences between control arms. Both meta-analytic approaches rely on the availability of relevant historical data. However, these data are often unavailable, rendering a purely data-driven PoS evaluation impossible.
This does not necessarily imply, however, that we are in complete ignorance about the relationship between the quantities of interest in phases II and III. Expert elicitation is a scientific approach to quantifying knowledge about unknown parameters which can be adopted in this situation. There are several examples of elicited prior distributions being used to inform the design and analysis of clinical trials and drug development decisions. SHELF (the Sheffield Elicitation Framework) is a package of templates, software and methods intended to facilitate a systematic approach to prior elicitation, minimising the scope for heuristics and bias. We propose using the SHELF extension method [40] to elicit a functional relationship between effects on different endpoints, including experts' uncertainty.

To illustrate how this would proceed, suppose different efficacy endpoints are studied in phases II and III and this is the only difference; applications to other scenarios follow directly. For simplicity, suppose the phase III program comprises one trial indexed by k = 1 and let θ_P denote the effect of treatment on endpoint P in this study. Furthermore, let θ_P⋆ denote the treatment effect on the phase II primary endpoint if this were to be measured in the new phase III trial. In this setting, we can use the phase IIb data to derive a MAP prior for θ_P⋆ and then follow the SHELF extension method to elicit experts' conditional judgements on θ_P given θ_P⋆. We summarize the experts' beliefs by asking them to consider what a rational impartial observer (RIO) would believe after listening to their discussions. Then, by repeatedly sampling first from the MAP prior for θ_P⋆ and then from RIO's conditional prior distribution for θ_P | θ_P⋆, we obtain a set of Monte Carlo samples from the marginal MAP prior for θ_P, which can be used to simulate the phase III trial.
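The two-step Monte Carlo scheme just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: the normal MAP prior for θ_P⋆ and the linear conditional relationship for θ_P | θ_P⋆ (slope and residual sd) are invented stand-ins, not elicited values.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 10_000

# Step 1: sample from the MAP prior for theta_P* (a stand-in normal here).
theta_p_star = rng.normal(-0.40, 0.07, size=L)

# Step 2: sample from RIO's conditional prior theta_P | theta_P*; the linear
# mean (slope 0.5) and residual sd (0.12) are hypothetical, for illustration.
theta_p = rng.normal(0.5 * theta_p_star, 0.12)

# The draws are samples from the marginal MAP prior for theta_P, ready to
# drive phase III trial simulations, e.g. the prior probability of benefit:
prob_benefit = (theta_p < 0).mean()
```

Reusing the same θ_P⋆ samples for several endpoints is what induces the between-endpoint correlation exploited later in the example.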
It is straightforward to extend this process to the case when the phase III program comprises K trials. More details on the SHELF extension method are given elsewhere [40]. In Section 7, we describe how we applied this approach to evaluate the probability of success for the example described in Section 4.4 when there was a change in endpoint in phase III.

We revisit the example of Section 4.4 assessing the PoS of a cardiovascular drug in lifecycle management. A single phase III trial, which we index by k = 1, was planned. While the primary endpoint of the phase IIa trial was a biomarker, the phase III trial would compare drug T with control on the basis of two long-term clinical outcomes: the primary endpoint (P) was the number of occurrences of a composite recurrent event endpoint, while the key secondary endpoint (S) was the time to cause-specific mortality. Let θ_P, a log rate-ratio, and θ_S, a log hazard ratio (HR), represent the treatment effects on endpoints P and S in the phase III study. Negative effects θ_P < 0 and θ_S < 0 favour drug T. For efficacy success on P, the trial would need to reject H0: θ_P ≥ 0; for S, a positive trend (estimated HR < 1) would be sufficient. The team considered success on both endpoints to be essential.

We analysed the phase IIa data on the biomarker in Section 4.4, where the estimated effect was θ̂_P = log(0.77) with 95% CI from log(0.64) to log(0.93). We related the phase IIa data to θ_P⋆, the biomarker treatment effect in a new phase III study, by assuming θ_P⋆ ∼ N(µ_P, τ_P⋆²), with a half-normal prior on τ_P⋆ centered at very small between-study heterogeneity. Figure 7 shows the MAP prior for θ_P⋆, which places a high predictive probability of 0.998 on the event that drug T would have a beneficial effect on the biomarker in the phase III study.

We convened an elicitation workshop to quantify what was currently understood about the association between the effect of T (vs C) on the biomarker and endpoints P and S. Four experts from Novartis were invited, three from the program team (2 statisticians, 1 clinician) and one independent clinician with knowledge of the disease area.

The elicitation process largely followed the SHELF extension method [40]. However, there were some small deviations from that procedure, as the elicitation workshop was run as an internal pilot of the elicitation process, meaning the team had the opportunity to test a modified version of the approach. Based on our learnings, we plan to adopt the SHELF extension method for forthcoming workshops, but for transparency we describe what was actually done in the pilot meeting. Prior to the workshop, we circulated an evidence dossier summarising the key data as well as their limitations. During the meeting, we first elicited experts' conditional judgements on the rate ratio for the primary composite recurrent event, exp(θ_P), given the biomarker treatment effect. We then elicited conditional judgements on the HR for the secondary endpoint, exp(θ_S), given the biomarker treatment effect. This strategy assumes beliefs about treatment effects on P and S are conditionally independent given θ_P⋆.
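One simple way to capture such conditional judgements and turn them into a parametric prior is a chips-and-bins (roulette) exercise. A sketch with invented chip allocations follows; the paper fitted distributions using the SHELF package in R, whereas here we just match moments on the log scale, and the bin edges and counts are purely illustrative.

```python
import numpy as np

# Hypothetical roulette grid for exp(theta_P) given one conditioning value of
# theta_P*: 25 chips, each worth 4% probability, allocated to bins. The bin
# edges and chip counts below are invented for illustration.
edges = np.array([0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90])
chips = np.array([2, 5, 8, 6, 3, 1])    # sums to 25 chips
probs = chips / chips.sum()             # each chip represents 4% probability

# Convert to the log scale and fit a normal conditional prior by matching
# moments against the bin midpoints (SHELF offers more refined fitting).
mids = np.log((edges[:-1] + edges[1:]) / 2.0)
mean = float((probs * mids).sum())
sd = float(np.sqrt((probs * (mids - mean) ** 2).sum()))
```

The log transformation mirrors the step in the text where elicited quantiles for exp(θ_P) are mapped to quantiles for θ_P.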
We used the roulette method [41] to elicit from each expert a sequence of three conditional priors for exp(θ_P) and exp(θ_S). Experts were asked to condition their judgements on:

• θ_P⋆ = −0.47, interpreted as a 28% relative reduction of geometric means between baseline and week 12

• θ_P⋆ = −0.40, corresponding to a 24% reduction

• θ_P⋆ = −0.30, corresponding to a 19% reduction.

These conditioning values correspond to the 22nd, 45th and 74th percentiles of the MAP prior for θ_P⋆. We implemented the roulette method by asking an expert to allocate a total of 25 chips, each representing a probability of 4%, to bins covering their plausible range for the treatment effect given a particular value of θ_P⋆. To determine conditional priors for θ_P and θ_S given θ_P⋆, we took log-transformations of the quantiles elicited for exp(θ_P) and exp(θ_S). For example, if an expert stated that P{exp(θ_P) ≤ q} = p, we took this to imply she/he believed P{θ_P ≤ log(q)} = p. We then fitted parametric distributions to an expert's conditional opinions on θ_P and θ_S using the SHELF package in R [42]. Due to time constraints, we used mathematical, rather than behavioral, aggregation to derive 'consensus' conditional priors, assigning equal weights to each expert. Figures 7(a)-(c) and 8(a)-(c) compare individual and pooled conditional priors.

Figure 7: Individual and pooled density functions for the log rate ratio for endpoint P: (a) given θ_P⋆ = −0.47; (b) given θ_P⋆ = −0.40; (c) given θ_P⋆ = −0.30; (d) marginal. The 10th, 50th and 90th percentiles of the marginal prior for the log rate ratio were -0.44, -0.23 and -0.07, respectively.

To determine a marginal prior for θ_P, we began by calculating the 10th, 50th and 90th percentiles of each of the three pooled conditional prior distributions. Let F_p(a) denote the pth percentile of the pooled conditional prior for θ_P given θ_P⋆ = a. For each p = 10, 50, 90, we assumed a piecewise linear relationship connecting F_p(θ_P⋆) and θ_P⋆, meaning we can interpolate to deduce F_p(θ_P⋆) for any θ_P⋆ ∈ [−0.47, −0.30]. Outside this range, we extrapolated by extending the straight line connecting F_p(−0.47) and F_p(−0.40) to the left, and extending the straight line connecting F_p(−0.40) and F_p(−0.30) to the right. We then sampled from the marginal prior for θ_P by following four steps:

1. Sample θ_P⋆^(1), ..., θ_P⋆^(L) from the MAP prior for θ_P⋆.

2. Using linear interpolation, calculate F_p(θ_P⋆^(1)) for p = 10, 50, 90.

3. Find the best fitting statistical distribution for these percentiles, and sample θ_P^(1) from it.

4. Repeat Steps 2-3 to generate L samples from the marginal prior distribution of θ_P.

A similar process was used to generate L samples from the marginal prior for θ_S. We set L = 40,000. The resulting joint MAP prior for (θ_P, θ_S) is shown in Figure 9. Using a common set of samples for θ_P⋆ in Step 1 for both endpoints induces a Spearman correlation of 0.4 between pairs of trial-specific treatment effects. For each pair of samples, we simulated one phase III trial and recorded whether we achieved the efficacy success criteria on endpoints P and S, and overall.

Figure 8: Individual and pooled density functions for the log-HR for endpoint S: (a) given θ_P⋆ = −0.47; (b) given θ_P⋆ = −0.40; (c) given θ_P⋆ = −0.30; (d) marginal. The 10th, 50th and 90th percentiles of the marginal prior for the log-HR were -0.30, -0.14 and 0.02, respectively.

From Table 1, we see that for this program the probability of no SSE in phase III is 0.96, while the benchmark probability of regulatory approval after a positive phase III program (given the disease area and lifecycle class) is 0.88. The project team also completed the risk scorecard described in Section 5: they scored the program low risk on all factors except 'Unaccounted TPP risks', which they considered medium risk. Incorporating this information, we calculated that the conditional probability of obtaining approval and meeting the TPP on all endpoints needed for access, given a positive phase III program (succeeding on P and S without a SSE), is 0.80.

On the basis of the simulations of the phase III trial and information on beyond-phase III risks, we estimated that the probability of:

• Statistical significance on P in the phase III trial was 0.57

• Meeting the above criterion and meeting the TPP for P and a positive trend on S was 0.50

• Meeting the above criterion and seeing no SSE was 0.48

• Meeting the above criterion and obtaining approval and meeting the TPP on all remaining endpoints was 0.38.

In conclusion, the PoS of T before entering pivotal trials was retrospectively estimated at 38%.
In this paper, we have presented a comprehensive approach for calculating the PoS of a program at the end of phase II. Our approach has several advantages. Firstly, it makes use of all available evidence, including industry benchmarks, early phase data within the project and relevant data outside the project. Secondly, it makes use of expert knowledge to bridge different outcomes across phases and to assess risks beyond the key phase III outcomes. Finally, the new approach is transparent, granular and standardized, allowing identification of the "pain points" of the project and improving comparisons across projects. Our experiences piloting the framework lead us to believe that it produces more accurate PoS estimates which can help program teams to assess the adequacy of the phase III design and evaluate whether TPP targets are too ambitious. The structured process for assessing risks beyond phase III can also lead teams to propose modifications to mitigate risks. If applied early, prior to phase IIb, the process can even help teams to rethink their phase II design, for example, by considering whether the knowledge generated by measuring the phase III endpoint in phase II would offset the time and cost required to do so.

Figure 9: Joint MAP prior for the study-specific treatment effects on endpoints P and S in the phase III trial.

Despite the flexibility of the proposed approach, not all development programs will fit perfectly into the PoS framework, and further adaptations beyond those discussed in Section 4.3 can and will be needed. For example, lifecycle management programs which skip straight to phase III can be accommodated if it is feasible to follow Section 6 and use expert opinion to bridge phase III data from the approved indication to the effect of treatment in the new indication.
Phase III data from the approved indication could be combined using a meta-analytic approach stipulating a weakly informative, rather than a calibrated mixture, prior for the mean of the random effects distribution. This is because selection bias is likely to be less of a concern for these data due to the size of the previous phase III studies.

In some programs we have encountered, no relevant clinical data are available at the time of the PoS assessment. In these cases, we propose calculating PoS based on the calibrated prior for the efficacy treatment effects described in Section 4. Another challenge is that for some highly innovative medicines (e.g. novel gene therapies), existing industry benchmarks may be deemed irrelevant. Further work is needed to identify how best to proceed in this scenario, although one idea would be to elicit expert opinion directly on the treatment effect parameter and use this (uncalibrated) prior to drive the PoS assessment.

Subgroup selection is common at the end of phase II and, if unaccounted for, may introduce additional selection bias into the phase II effect estimate. While there are a number of approaches to correct for selection bias from a given set of subgroups (see Thomas and Bornkamp [43] for a review or Guo and He [44] for recent developments), the subgroup selection process may not always be fully quantitative, nor driven only by the data in the observed study. A pragmatic approach in line with the overall procedure proposed here would be to downweight the TPP component of the phase II prior according to how plausible the selected subgroup is (e.g. if the subgroup is considered to be among the top three hypotheses before the start of phase II, a multiplier of 1/3 would be applied to the TPP component). This approach will be investigated in future applications.
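The pragmatic down-weighting above amounts to a one-line adjustment of the mixture weights. A minimal sketch follows; the two-component weight structure here is a simplification of the phase II prior, introduced only for illustration.

```python
# Multiply the TPP component's mixture weight by the subgroup's prior
# plausibility (e.g. 1/3 if the selected subgroup was one of three candidate
# hypotheses pre-specified before phase II), renormalising the remainder.
def downweight_tpp_weight(w_tpp, plausibility):
    w = w_tpp * plausibility
    return w, 1.0 - w  # (TPP component weight, remaining prior weight)

# e.g. a 50/50 mixture with a one-in-three subgroup hypothesis:
print(downweight_tpp_weight(0.5, 1 / 3))  # TPP weight drops from 1/2 to 1/6
```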
Acknowledgements

The authors would like to thank Günther Müller-Velten, Jim Gong, Claudio Gimplewicz, Victor Shi, Wolfgang Kothny, Pritibha Singh and Michael Wittpoth for helpful discussions during the development of this work. We would also like to thank Professor Anthony O'Hagan who facilitated the Bayesian expert elicitation meeting described in Section 7.
References

[1] DJ Spiegelhalter and LS Freedman. A predictive approach to selecting the size of a clinical trial based on subjective clinical opinion. Statistics in Medicine, 5:1-13, 1986.

[2] K Rufibach, P Jordan, and M Abt. Sequentially updating the likelihood of success of a phase 3 pivotal time-to-event trial based on interim analyses or external information. Journal of Biopharmaceutical Statistics, 26:191-201, 2016.

[3] A O'Hagan, JW Stevens, and MJ Campbell. Assurance in clinical trial design. Pharmaceutical Statistics, 4:187-201, 2005.

[4] A Crisp, S Miller, D Thompson, and N Best. Practical experiences of adopting assurance as a quantitative framework to support decision making in drug development. Pharmaceutical Statistics, 17:317-328, 2018.

[5] N Patel, J Bolognese, C Chuang-Stein, D Hewitt, A Gammaitoni, and J Pinheiro. Designing phase 2 trials based on program-level considerations: a case for neuropathic pain. Therapeutic Innovation & Regulatory Science, 46(4):439-454, 2012.

[6] O Marchenko, J Miller, T Parke, I Perevozskaya, J Qian, and Y Wang. Improving oncology clinical programs by use of innovative designs and comparing them via simulations. Therapeutic Innovation & Regulatory Science, 47(5):602-612, 2013.

[7] Z Antonijevic, M Kimber, D Manner, C-F Burman, J Pinheiro, and K Bergenheim. Optimizing drug development programs: type 2 diabetes case study. Therapeutic Innovation & Regulatory Science, 47(3):363-374, 2013.

[8] M Hay, DW Thomas, JL Craighead, C Economides, and J Rosenthal. Clinical development success rates for investigational drugs. Nature Biotechnology, 32:40-51, 2014.

[9] CH Wong, KW Siah, and AW Lo. Estimation of clinical trial success rates and related parameters. Biostatistics, 20:273-286, 2019.

[10] A O'Hagan, CE Buck, A Daneshkhah, JR Eiser, PH Garthwaite, DJ Jenkinson, JE Oakley, and T Rakow. Uncertain Judgements: eliciting experts' probabilities. Wiley & Sons, Chichester, 2006.

[11] D Kahneman. Thinking, fast and slow. Penguin, London, 2011.

[12] AW Lo, KW Siah, and CH Wong. Machine learning with statistical imputation for predicting drug approvals, 2019.

[13] F Feijoo, M Palopoli, J Bernstein, S Siddiqui, and TE Albright. Key indicators of phase transition for clinical trials through machine learning. Drug Discovery Today, 25(2):414-421, 2020.

[14] N Dallow, N Best, and TH Montague. Better decision making in drug development through adoption of formal prior elicitation. Pharmaceutical Statistics, 17:301-316, 2018.

[15] A O'Hagan. Expert knowledge elicitation: Subjective but scientific. The American Statistician, 73(S1):69-81, 2019.

[16] KJ Carroll. Decision making from Phase II to Phase III and the probability of success: reassured by "assurance"? Journal of Biopharmaceutical Statistics, 23:1188-1200, 2013.

[17] K Rufibach, HU Burger, and M Abt. Bayesian predictive power: choice of prior and some recommendations for its use as probability of success in drug development. Pharmaceutical Statistics, 15:438-446, 2016.

[18] C Jennison and BW Turnbull. Group-sequential analysis incorporating covariate information. Journal of the American Statistical Association, 92(405):1330-1341, 1997.

[19] DO Scharfstein, AA Tsiatis, and JM Robins. Semiparametric efficiency and its implication on the design and analysis of group-sequential studies. Journal of the American Statistical Association, 92(405):1342-1350, 1997.

[20] ZA Alhussain and JE Oakley. Assurance for clinical trial design with normally distributed outcomes: eliciting uncertainty about variances. Pharmaceutical Statistics, 19(6):827-839, 2020.

[21] DJ Spiegelhalter, KR Abrams, and JP Myles. Bayesian approaches to clinical trials and health-care evaluation. Wiley & Sons, Chichester, 2004.

[22] B Neuenschwander, G Capkun-Niggli, M Branson, and DJ Spiegelhalter. Summarizing historical information on controls in clinical trials. Clinical Trials, 7:5-18, 2010.

[23] T Friede, C Röver, S Wandel, and B Neuenschwander. Meta-analyses of few small studies in orphan diseases. Research Synthesis Methods, 8:79-91, 2017.

[24] B Neuenschwander and H Schmidli. Bayesian Methods in Pharmaceutical Research, chapter Use of historical data, pages 1-27. Chapman & Hall/CRC, New York, first edition, 2020.

[25] A Whitehead. Meta-analysis of controlled clinical trials. John Wiley & Sons, Chichester, 2002.

[26] M Borenstein, LV Hedges, JPT Higgins, and HR Rothstein. Introduction to meta-analysis. Wiley, Chichester, 2009.

[27] I Kola and J Landis. Can the pharmaceutical industry reduce attrition rates? Nature Reviews Drug Discovery, 3:711-715, 2004.

[28] MJ Waring, J Arrowsmith, AR Leach, PD Leeson, S Mandrell, RM Owen, G Pairaudeau, WD Pennie, SD Pickett, J Wang, O Wallance, and A Weir. An analysis of the attrition of drug candidates from four major pharmaceutical companies. Nature Reviews Drug Discovery, 14:475-486, 2015.

[29] RK Harrison. Phase II and phase III failures: 2013-2015. Nature Reviews Drug Discovery, 15:817-818, 2016.

[30] A Gelman and J Carlin. Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9:641-651, 2014.

[31] MJ Bayarri and J Berger. Robust Bayesian analysis of selection models. Annals of Statistics, 26:645-659, 1998.

[32] S Kirby, J Burke, C Chuang-Stein, and C Sin. Discounting phase 2 results when planning phase 3 clinical trials. Pharmaceutical Statistics, 11:373-385, 2012.

[33] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019.

[34] M Plummer. JAGS Version 4.3.0 user manual, 2017.

[35] Food and Drug Administration. 22 Case Studies where Phase 2 and Phase 3 Trials had Divergent Results. U.S. Department of Health and Human Services, 2017.

[36] G Saint-Hilary, V Barboux, M Pannaux, M Gasparini, V Robert, and G Mastrantonio. Predictive probability of success using surrogate endpoints. Statistics in Medicine, 38:1753-1774, 2019.

[37] LV Hampson, J Whitehead, D Eleftheriou, and P Brogan. Bayesian methods for the design and interpretation of clinical trials in very rare diseases. Statistics in Medicine, 24:4186, 2014.

[38] AV Ramanan, LV Hampson, H Lythgoe, AP Jones, B Hardwick, H Hind, B Jacobs, D Vasileiou, I Wadsworth, N Ambrose, J Davidson, PJ Ferguson, T Herlin, A Kavirayani, OG Killen, S Compeyrot-Lacassagne, RM Laxer, M Roderik, JF Swart, CM Hedrich, and MW Beresford. Defining consensus opinion to develop randomised controlled trials in rare diseases using Bayesian design: An example of a proposed trial of adalimumab versus pamidronate for children with CNO/CRMO. PLOS ONE, 2019.

[39] JE Oakley and A O'Hagan. SHELF: the Sheffield Elicitation Framework (version 4). School of Mathematics and Statistics, Sheffield, UK, 2019.

[40] B Holzhauer, LV Hampson, JP Gosling, B Bornkamp, J Kahn, MR Lange, W-L Luo, C Brindicci, D Lawrence, S Ballerstedt, and A O'Hagan. Eliciting judgements about dependent quantities of interest: The SHELF extension and copula methods illustrated using an asthma case study. arXiv e-prints, page arXiv:TBD, February 2021.

[41] JE Oakley, A Daneshkhah, and A O'Hagan. Nonparametric prior elicitation using the roulette method, 2020.

[42] JE Oakley. SHELF: Tools to Support the Sheffield Elicitation Framework, 2019. R package version 1.6.0.

[43] M Thomas and B Bornkamp. Comparing approaches to treatment effect estimation for subgroups in clinical trials. Statistics in Biopharmaceutical Research, 9(2):160-171, 2017.

[44] X Guo and X He. Inference on selected subgroups in clinical trials. Journal of the American Statistical Association, pages 1-19, 2020.

[45] B Neuenschwander, S Roychoudhury, and H Schmidli. On the use of co-data in clinical trials. Statistics in Biopharmaceutical Research, 8(3):345-354, 2016.
Appendix A: Handling a binary endpoint where the treatment effect summary is a risk difference
For the reasons outlined in Section 3.1, when synthesizing the phase IIb data we need to proceed slightly differently when a key efficacy endpoint is binary and the treatment effect is a difference in proportions. To outline how we proceed in this special case, suppose a single endpoint P is of interest and estimates of the response probabilities on the new drug and control are available from each phase IIb study. Rather than perform a Bayesian meta-analysis of the risk difference estimates, the analyst is instead asked to provide the sample size and number of responders per arm and study. We then use these data to run two analyses. Firstly, we derive estimates of the study-specific log-odds ratios and combine these using a Bayesian meta-analysis based on a normal-normal hierarchical model. Secondly, we perform a Bayesian meta-analysis of the total number of responders on control in each phase IIb study, assuming that these data follow a binomial distribution and that the study-specific log-odds of response on control are samples from a normal random-effects distribution.

Let pT_3k, pC_3k and η_3k denote the study-specific probabilities of response on the new drug and control, and the log-odds ratio, in the kth phase III trial. From the two analyses described above, we can obtain samples η_3k^(1), ..., η_3k^(L) and pC_3k^(1), ..., pC_3k^(L) from the MAP priors for η_3k and pC_3k, respectively. The ℓth pair of samples (η_3k^(ℓ), pC_3k^(ℓ)) is transformed to obtain (pT_3k^(ℓ), pC_3k^(ℓ)), which is used to simulate the outcome of the kth phase III trial.
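The transformation from (η, pC) pairs to (pT, pC) pairs can be sketched as follows. The MAP priors here are stand-in distributions (a normal for the log-odds ratio and a logit-normal for the control rate), not fitted models; only the transformation itself reflects the appendix.

```python
import numpy as np

rng = np.random.default_rng(7)
L = 10_000

# Stand-in MAP-prior samples for one phase III trial (k fixed). The paper
# obtains these from the two Bayesian meta-analyses described above.
eta = rng.normal(0.5, 0.2, size=L)                    # log-odds ratio
pC = 1.0 / (1.0 + np.exp(-rng.normal(-1.0, 0.3, L)))  # control response rate

# Transform each (eta, pC) pair to (pT, pC) by applying the odds ratio to
# the control odds, then mapping back to the probability scale.
odds_T = np.exp(eta) * pC / (1.0 - pC)
pT = odds_T / (1.0 + odds_T)
risk_diff = pT - pC  # available to simulate the trial on the risk-difference scale
```

Working on the log-odds scale for the meta-analysis and transforming afterwards guarantees that every sampled pT lies in (0, 1), which a direct meta-analysis of risk differences would not.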
Program Feature               Type of variable              Levels
Disease Area                  Categorical                   Allergy / Respiratory; Autoimmune / Immunology / Dermatology / Rheumatology; Cardiovascular / Metabolic / Renal; Endocrine; Haematology; Infectious Diseases; Neurology; Oncology; Ophthalmology; Psychiatry; Others
Molecule                      Categorical                   Small molecule, Protein-Antibody, Protein-Other, Other
Target                        Coded as 3 dummy variables    Receptor, Enzyme, Other
Route of Administration       Coded as 5 dummy variables    Oral, Intramuscular, Intravenous, Subcutaneous, Topical, Other
Size of Sponsor               Binary                        Yes = Sponsor is in top 20 R&D spend
Lifecycle Class               Categorical                   New Molecular Entity, Lifecycle Management, Biosimilar
Breakthrough Designation      Binary                        Yes / No
Special Protocol Assessment   Binary                        Yes / No

Table 2: Measured program characteristics available in the industry benchmark dataset. Missing values on the variables Molecule, Target and RoA were imputed using random sampling with replacement.

Appendix B: Deriving tailored industry benchmarks
We describe below how we derived tailored benchmarks for the probability of success in phase IIa, IIb, III and submission. Tailored benchmarks are obtained from predictive models fitted to industry data. The commercial dataset we had access to contained records on 7956 programs reporting clinical trial results between 2007 and 2018: 4652 programs started phase II; 1846 started phase III; and a NDA was submitted for 1308 programs. The dataset did not distinguish between phase IIa and phase IIb trials. However, under the assumption that risks are discharged equally across stages IIa and IIb, the benchmark probability of success in phase IIb is given by the square root of the phase II benchmark. It was sufficient to use the industry data to fit logistic models for:

(a) P{Success in phase II}: conditional probability of success in phase II given we start phase II

(b) P{Success in phase III}: conditional probability of success in phase III given we start phase III

(c) P{Success in submission}: conditional probability of regulatory approval given we submit a NDA

Table 2 shows the program characteristics available in the database and how these were coded. For both models (a) and (b), forward variable selection was used to identify important predictors of success from the following options: disease area; lifecycle class; drug molecule class; drug target; route of administration; size of sponsor; breakthrough designation; and special protocol assessment status. The last two regulatory characteristics were only considered for inclusion in model (b) because these designations can be granted at any time prior to the start of phase III and may be influenced by the phase II data.
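The square-root rule for splitting the phase II benchmark can be sketched directly; note that Table 1's IIa and IIb benchmarks of 0.68 imply an overall phase II benchmark of 0.68² ≈ 0.46.

```python
import math

# If risks are discharged equally across stages IIa and IIb, each sub-phase
# benchmark is the square root of the overall phase II benchmark, so the two
# sub-phase probabilities multiply back to the phase II value.
def subphase_benchmark(p_phase2):
    return math.sqrt(p_phase2)

print(subphase_benchmark(0.68 ** 2))  # -> 0.68
```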
Table 3 lists the predictors that were actually selected for inclusion in each model.

Phase        Program Characteristics
II           Disease, Lifecycle, Molecule, Target (Receptor, Enzyme, Other), RoA (IV)
III          Disease, Lifecycle, Molecule, RoA (SQ, IM, Other), Sponsor, Breakthrough
Submission   Disease, Lifecycle

Table 3: Program characteristics used to derive tailored benchmarks for the success probability within a development phase given a program starts that stage. Where a covariate is coded as dummy variables, the selected dummy variables are listed in parentheses.

We needed to take a slightly different approach to fit model (c) since only 164 of the 1308 submitted programs failed to obtain regulatory approval, and this small number of events limited model complexity. We identified a limited set of predictors from discussions with drug development experts, and fitted a logistic model adjusting only for disease area and lifecycle class. Fitted values of the parameters in logistic models (a)-(c) can be found in Supplementary Materials F.

Tailored benchmarks for the probability of success in a phase are used to calculate the probability of not seeing a SSE in phase III in Equation (6). They are also used to calculate tailored benchmarks for the probability of efficacy success in phase IIb and phase III, which themselves are needed to calibrate the prior for µ in Section 4.2.
The probability of efficacy success in phase i, for i ∈ {IIb, III}, is given by:

    P{Efficacy success in phase i} = 1 − (1 − P{Success in phase i}) P{Fail on efficacy in phase i | Fail in phase i},   (11)

where P{Fail on efficacy in phase i | Fail in phase i} is 1 minus the conditional probability of failing due to a SSE in phase i, under the assumption that failures due to poor efficacy and failures due to poor safety are mutually exclusive; the latter conditional risk is estimated using the aggregate statistics presented in Section 3.2. We take the square root of the benchmark chance of efficacy success in phase II as the phase IIb benchmark, assuming that risks are discharged equally across stages IIa and IIb.

Appendix C: Calibrating the mixture prior when there is a single efficacy endpoint
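A short numeric sketch of equation (11) and the phase IIb square-root adjustment. All inputs below (benchmark success probabilities of 0.55 and 0.36, and conditional efficacy-failure shares of 0.70 and 0.80) are made-up values for illustration, not figures from the industry dataset.

```python
import math

def efficacy_success(p_success, p_fail_on_efficacy_given_fail):
    """Equation (11): P{efficacy success in phase i} =
    1 - (1 - P{success in phase i}) * P{fail on efficacy | fail in phase i}."""
    return 1 - (1 - p_success) * p_fail_on_efficacy_given_fail

# Hypothetical phase III: benchmark success 0.55, and 70% of failures are
# efficacy failures (efficacy and safety failures assumed mutually exclusive).
p3_eff = efficacy_success(0.55, 0.70)  # 1 - 0.45*0.70 = 0.685

# Hypothetical phase II: benchmark success 0.36, 80% of failures on efficacy.
# The phase IIb benchmark is the square root of the phase II value, assuming
# risks are discharged equally across stages IIa and IIb.
p2b_eff = math.sqrt(efficacy_success(0.36, 0.80))  # sqrt(0.488) ≈ 0.70
```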
Suppose a new drug is being developed in a therapeutic area outside oncology, so that the standard phase II and phase III program comprises:

• A single Phase II trial designed to test H_0: θ_P ≤ 0 against H_1: θ_P > 0 with one-sided type I error rate α_2 at θ_P = 0 and power 1 − β_2 at θ_P = δ_P.
• Two Phase III trials, each designed to test H_0: θ_P ≤ 0 against H_1: θ_P > 0 with one-sided type I error rate α_3 at θ_P = 0 and power 1 − β_3 at θ_P = δ_P.

For the purposes of prior calibration, we assume there is no between-study heterogeneity, so that all trials are underpinned by a common treatment effect, denoted by µ_P, with prior

    f(µ_P) = (ω/σ_{P0}) φ(µ_P/σ_{P0}) + ((1 − ω)/σ_{P1}) φ((µ_P − δ_P)/σ_{P1}),

where φ(·) is the pdf of a standard normal random variate, σ_{P0} = −δ_P/Φ^{-1}(0.01) and σ_{P1} = −δ_P/Φ^{-1}(0.01). For i = 2, 3, let c_i = Φ^{-1}(1 − α_i) and I_i = {Φ^{-1}(1 − α_i) + Φ^{-1}(1 − β_i)}²/δ_P². Then, we demonstrate efficacy in Phase II if and only if Z_2 ≥ c_2, where Z_2 | µ_P ∼ N(µ_P √I_2, 1), and in the k-th study of the Phase III program if and only if Z_{3k} ≥ c_3, where Z_{3k} | µ_P ∼ N(µ_P √I_3, 1). Letting η_2 (η_3) denote the benchmark probability of efficacy success within Phase II (III), ω is given by:

    ω = (η_2 η_3 − B)/(A − B), where

    A = ∫_{−∞}^{∞} Φ(µ_P √I_2 − c_2) {Φ(µ_P √I_3 − c_3)}² (1/σ_{P0}) φ(µ_P/σ_{P0}) dµ_P   (12)

    B = ∫_{−∞}^{∞} Φ(µ_P √I_2 − c_2) {Φ(µ_P √I_3 − c_3)}² (1/σ_{P1}) φ((µ_P − δ_P)/σ_{P1}) dµ_P   (13)

We interpret A as the unconditional probability of demonstrating efficacy in phase II and phase III given µ_P ∼ N(0, σ²_{P0}), and B as the unconditional probability of efficacy success given µ_P ∼ N(δ_P, σ²_{P1}). The single-fold integrals in equations (12)-(13) can be evaluated numerically, for example using the integrate function in R. The expression for ω
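The text points to R's integrate for these one-dimensional integrals; as an illustrative alternative, the following self-contained Python sketch evaluates A, B and ω using composite Simpson's rule over a wide finite range (rather than adaptive quadrature). The design parameters and benchmarks passed in at the bottom are made up for illustration.

```python
import math
from statistics import NormalDist

nd = NormalDist()
Phi, phi, Phinv = nd.cdf, nd.pdf, nd.inv_cdf  # standard normal cdf, pdf, quantile

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with an even number of intervals n."""
    h = (b - a) / n
    s = f(a) + f(b) + sum((4 if k % 2 else 2) * f(a + k * h) for k in range(1, n))
    return s * h / 3

def calibrate_omega(delta, alpha2, beta2, alpha3, beta3, eta2, eta3):
    """Evaluate A and B from equations (12)-(13) and solve for the mixture
    weight omega = (eta2*eta3 - B)/(A - B), taking sigma_P0 = sigma_P1 =
    -delta / Phinv(0.01) as in Appendix C."""
    c2, c3 = Phinv(1 - alpha2), Phinv(1 - alpha3)
    I2 = ((Phinv(1 - alpha2) + Phinv(1 - beta2)) / delta) ** 2
    I3 = ((Phinv(1 - alpha3) + Phinv(1 - beta3)) / delta) ** 2
    sigma = -delta / Phinv(0.01)

    def success(mu):
        # P{efficacy success in Phase II and both Phase III trials | mu}
        return Phi(mu * math.sqrt(I2) - c2) * Phi(mu * math.sqrt(I3) - c3) ** 2

    lo, hi = -8 * sigma, delta + 8 * sigma  # covers both mixture components
    A = simpson(lambda m: success(m) * phi(m / sigma) / sigma, lo, hi)
    B = simpson(lambda m: success(m) * phi((m - delta) / sigma) / sigma, lo, hi)
    omega = (eta2 * eta3 - B) / (A - B)
    return A, B, omega

# Illustrative inputs (all hypothetical): delta = 0.3; Phase II run at
# alpha2 = 0.05, 80% power; each Phase III trial at alpha3 = 0.025, 90% power;
# benchmark efficacy-success probabilities eta2 = 0.5, eta3 = 0.6.
A, B, omega = calibrate_omega(0.3, 0.05, 0.20, 0.025, 0.10, 0.5, 0.6)
```

Since the prior centred at δ_P is stochastically larger than the one centred at 0, B exceeds A, and ω falls in (0, 1) whenever the benchmark product η_2 η_3 lies between them.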