A review of Bayesian perspectives on sample size derivation for confirmatory trials
Kevin Kunzmann, Michael J. Grayling, Kim May Lee, David S. Robertson, Kaspar Rufibach, James M. S. Wason
A Preprint
Kevin Kunzmann ∗
MRC Biostatistics Unit, University of Cambridge
Michael J. Grayling
Population Health Sciences Institute, Newcastle University
Kim May Lee
Pragmatic Clinical Trials Unit, Queen Mary University of London
David S. Robertson
MRC Biostatistics Unit, University of Cambridge
Kaspar Rufibach
Methods, Collaboration, and Outreach Group (MCO), Department of Biostatistics, F. Hoffmann-La Roche, Basel
James M. S. Wason
Population Health Sciences Institute, Newcastle University, and MRC Biostatistics Unit, University of Cambridge
June 30, 2020
Abstract
Sample size derivation is a crucial element of the planning phase of any confirmatory trial. A sample size is typically derived based on constraints on the maximal acceptable type I error rate and a minimal desired power. Here, power depends on the unknown true effect size. In practice, power is typically calculated either for the smallest relevant effect size or a likely point alternative. The former might be problematic if the minimal relevant effect is close to the null, thus requiring an excessively large sample size. The latter is dubious since it does not account for the a priori uncertainty about the likely alternative effect size. A Bayesian perspective on the sample size derivation for a frequentist trial naturally emerges as a way of reconciling arguments about the relative a priori plausibility of alternative effect sizes with ideas based on the relevance of effect sizes. Many suggestions as to how such ‘hybrid’ approaches could be implemented in practice have been put forward in the literature. However, key quantities such as assurance, probability of success, or expected power are often defined in subtly different ways in the literature. Starting from the traditional and entirely frequentist approach to sample size derivation, we derive consistent definitions for the most commonly used ‘hybrid’ quantities and highlight connections, before discussing and demonstrating their use in the context of sample size derivation for clinical trials.

Keywords: Assurance · Expected power · Probability of success · Power · Sample size derivation

∗ MRC Biostatistics Unit, University of Cambridge, Cambridge Institute of Public Health, Forvie Site, Robinson Way, Cambridge Biomedical Campus, Cambridge CB2 0SR, United Kingdom

Randomised Controlled Trials (RCTs) are the gold-standard study design for evaluating the effectiveness and safety of new interventions.
Despite their successful history of demonstrating the benefits and uncovering the risks of new treatments, they face substantial challenges. Recent evidence has shown that the success rate of RCTs is low [Wong et al., 2019]. This low success rate and the increasing cost of conducting RCTs are resulting in large real-term increases in the cost of drug development [DiMasi et al., 2016]. Since the sample size of a trial is a key determinant of the chances of detecting a treatment effect (if it is present) and an important cost factor, the choice of an adequate sample size enables more economic drug development. Purely economic arguments would suggest a utility based approach as discussed in Lindley [1997], for example. However, specifying utility functions, particularly in investigator-sponsored studies concerned with non-drug interventions, can be hard in practice. These practical problems, the desire to maintain minimal ethical standards, as well as recommendations of health authority guidelines result in the vast majority of confirmatory clinical trials deriving their sample size based on desired conditional type I and type II error rates. For instance, a randomised clinical trial with an unnecessarily large sample size (‘overpowered’) is unethical if the treatment shows a substantial effect and the consequences of being randomised to the control arm are severe. Thus the issue of selecting an appropriate sample size is a vitally important part of conducting a clinical trial.

The traditional approach to determining the required sample size for a trial is to choose a point alternative and derive a sample size such that the probability to reject the null hypothesis exceeds a certain threshold (typically 80% or 90%) while maintaining a certain maximal type I error rate (typically 2.5% one-sided). The maximal type I error rate is usually realised at the boundary of the null hypothesis and thus available immediately.
The type II error rate, however, critically depends on the choice of the (point) alternative.

There are at least two ways of justifying the choice of point alternative. The first is based on a relevance argument, which requires the specification of a minimal clinically relevant difference (MCID). Since the probability to reject the null hypothesis is typically monotonic in the effect size, determining the sample size such that the desired probability to reject the null is exceeded at the MCID implies that the power for all other relevant differences will be even larger. Guidance has recently been published on choosing the MCID [Cook et al., 2018], but making this choice may still be difficult in practice.

The second perspective is based on a priori considerations about the likelihood of the treatment effect. Here, an a priori likely effect size is used as the point alternative. This implies that the resulting sample size might be too small to detect smaller but still relevant differences reliably. On the other hand, the potential savings in terms of sample size might outweigh the risk of ending up with an underpowered study. The core difference between these approaches is that basing the required sample size on the MCID might be ineffective if prior evidence for a larger effect is available, but the MCID is chosen based on relevance arguments and is thus not subject to uncertainty. In contrast, choosing the point alternative based on considerations about the relative a priori likelihood of effect sizes implies that there is an inherent uncertainty about the effect size that can be expected; otherwise no trial would be needed in the first place.

Depending on the study objective, the sample size can also be derived based on entirely different considerations. For example, studies that aim to establish new diagnostic tools or biomarkers would rather target a certain width of the confidence interval for the AUC [Obuchowski, 1998].
Similarly, studies aimed at estimating population parameters with sufficient precision should target the standard error of the estimate, rather than deriving a sample size based on power arguments [Thompson, 2012, Grouin et al., 2007]. These approaches are beyond the scope of this manuscript and we will only discuss sample size derivation based on error rate considerations.

To make things more tangible, consider the case of a simple one-stage, one-arm $Z$-test. Let $X_i$, $i = 1, \ldots, n$, be iid observations with mean $\theta$ and known standard deviation $\sigma$. Under suitable regularity conditions, their mean is asymptotically normal and $Z_n := \sqrt{n}\,(\bar{X} - \theta_0)/\sigma \mathrel{\dot\sim} \mathcal{N}(\theta_n, 1)$, where $\theta_n := \sqrt{n}\,(\theta - \theta_0)/\sigma$. Assume that interest lies in testing the null hypothesis $H_0: \theta \leq \theta_0 = 0$ at a one-sided significance level of $\alpha$. The critical value for rejecting $H_0$ is given by the $(1-\alpha)$-quantile of the standard normal distribution, $z_{1-\alpha}$, and is independent of $n$. The probability of rejecting the null hypothesis for given $n$ and $\theta$ is

$$\Pr_{\theta}[Z_n > z_{1-\alpha}] = 1 - \Phi(z_{1-\alpha} - \theta_n) = \Phi(\theta_n - z_{1-\alpha}), \tag{1}$$

where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution. Often, $\Pr_{\theta}[Z_n > z_{1-\alpha}]$ is seen as a function of $\theta$ and termed the ‘power function’. This terminology may lead to confusion when considering parameter values $\theta \leq \theta_0$ and $\theta \geq \theta_{\mathrm{MCID}}$, since the probability to reject the null hypothesis corresponds to the type I error rate in the former case and classical ‘power’ in the latter. For the sake of clarity we will thus use the neutral term ‘probability to reject’.
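As a rough illustration of equation (1), the probability to reject can be evaluated with a few lines of Python. This is only a sketch: the function name `probability_to_reject` and the default values ($\sigma = 1$, $\theta_0 = 0$, one-sided $\alpha = 0.025$) are assumptions made for the example and are not part of the manuscript.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf gives quantiles


def probability_to_reject(theta, n, sigma=1.0, theta_null=0.0, alpha=0.025):
    """Pr_theta[Z_n > z_{1-alpha}] = Phi(theta_n - z_{1-alpha}), cf. equation (1).

    Hypothetical helper; defaults are illustrative assumptions.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)           # critical value, independent of n
    theta_n = sqrt(n) * (theta - theta_null) / sigma   # mean of Z_n under theta
    return STD_NORMAL.cdf(theta_n - z_crit)


# At the boundary of the null hypothesis the value equals alpha (type I error rate);
# for theta > theta_null it increases with both theta and n.
for theta in (0.0, 0.1, 0.3):
    print(theta, probability_to_reject(theta, n=100))
```

Evaluating the function at $\theta = \theta_0$ and at values above it makes the dual reading of the ‘power function’ concrete: the same curve yields the type I error rate at the null boundary and classical power at relevant effects.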
Assume that a point alternative $\theta_{\mathrm{alt}} > \theta_0$ is given. A sample size can then be chosen as the smallest sample size that results in a probability to reject of at least $1 - \beta$ at $\theta_{\mathrm{alt}}$:

$$n^{*}_{\theta_{\mathrm{alt}}} := \operatorname*{argmin}_{n} \; n \quad \text{subject to:} \quad \Pr_{\theta_{\mathrm{alt}}}[Z_n > z_{1-\alpha}] \geq 1 - \beta. \tag{2}$$

Since $\Pr_{\theta}[Z_n > z_{1-\alpha}]$ is monotone in $\theta$, the probability to reject the null hypothesis for $\theta > \theta_{\mathrm{alt}}$ is also greater than $1 - \beta$. Thus, if $\theta_{\mathrm{alt}} = \theta_{\mathrm{MCID}}$, the null hypothesis can be rejected for all clinically relevant effect sizes with a probability of at least $1 - \beta$. This approach requires no assumptions about the a priori likelihood of the value of $\theta$ but only about $\theta_{\mathrm{MCID}}$ and the desired minimal power level (see also Chuang-Stein et al. [2011, Section 3] and Chuang-Stein [2006]). However, the required sample size increases quickly as $\theta_{\mathrm{MCID}}$ approaches $\theta_0$. In almost all practically relevant situations, a maximal feasible sample size is given (e.g., due to budget constraints), which might not be sufficient to achieve the desired power if $\theta_{\mathrm{MCID}}$ is close to $\theta_0$. This problem might arise when considering overall survival in oncology trials, for example. However, it may then be hard to justify a value for $\theta_{\mathrm{MCID}}$ much larger than $\theta_0$ since almost any improvement in overall survival can be considered relevant. The problem becomes even more pressing if the null hypothesis is defined as $H_0: \theta \leq \theta_{\mathrm{MCID}}$, i.e., if the primary study objective is to demonstrate a clinically important effect. In either case it is clearly impossible to derive a feasible sample size based on the minimal clinically important difference alone [Chuang-Stein et al., 2011]. Formulating a principled approach to eliciting a sample size in situations like these is difficult, and in practice trialists may resort to back-calculating an effect size in order to achieve the desired power given the maximum feasible sample size [Lenth, 2001, Lan and Wittes, 2012, Grouin et al., 2007].
One way of justifying a smaller sample size is to consider a point alternative $\theta_{\mathrm{alt}} > \theta_{\mathrm{MCID}}$ based on a priori likelihood arguments: if there is prior evidence for effect sizes larger than $\theta_{\mathrm{MCID}}$, determining the sample size based on $\theta_{\mathrm{MCID}}$ might well be inefficient and lead to an unnecessarily large trial. Therefore, planning of the required sample size is often based on a single point alternative $\theta_{\mathrm{alt}}$, $\theta_{\mathrm{alt}} > \theta_{\mathrm{MCID}}$. Yet, this pragmatic approach is unsatisfactory in that it ignores any uncertainty about the effect size [Lenth, 2001].

In the following, we review approaches to sample size derivation that do account for a priori uncertainty via a prior density for the effect size. We propose a framework encompassing the most relevant quantities discussed in this context, give precise definitions of the terms, and highlight connections between individual items. Where necessary, we refine existing definitions to improve overall consistency. Note that we exclusively focus on what is usually termed a ‘hybrid’ Bayesian-frequentist approach to sample size derivation [Spiegelhalter et al., 2004]. This means that, although Bayesian arguments are used to derive a sample size under uncertainty about the true effect size, the final analysis is strictly frequentist. After introducing the individual concepts in detail, a structured overview of all quantities considered is provided in Figure 1. We then present a review of the literature on the subject, showcasing the confusing diversity of terminology used in the field and relating our definitions back to the existing literature. Finally, we present some numerical examples and conclude with a discussion.
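The traditional derivation in equation (2) has a closed form for the one-arm $Z$-test, since $\Phi(\theta_n - z_{1-\alpha}) \geq 1 - \beta$ is equivalent to $\sqrt{n} \geq (z_{1-\alpha} + z_{1-\beta})\,\sigma / (\theta_{\mathrm{alt}} - \theta_0)$. The following Python sketch illustrates this; the function name and the chosen effect sizes (0.3 and 0.1 with $\sigma = 1$) are hypothetical and serve only as an example.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sample_size_point_alternative(theta_alt, sigma=1.0, theta_null=0.0,
                                  alpha=0.025, beta=0.2):
    """Smallest n with probability to reject >= 1 - beta at theta_alt, cf. equation (2).

    Hypothetical helper; relies on the closed form for the one-arm Z-test.
    """
    if theta_alt <= theta_null:
        raise ValueError("the point alternative must lie above the null boundary")
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_beta = STD_NORMAL.inv_cdf(1.0 - beta)
    root_n = (z_alpha + z_beta) * sigma / (theta_alt - theta_null)
    return ceil(root_n ** 2)


# The required sample size grows quickly as the alternative approaches the null.
print(sample_size_point_alternative(0.3))  # illustrative 'likely' effect
print(sample_size_point_alternative(0.1))  # illustrative MCID close to the null
```

Under these assumed values the two calls return 88 and 785 observations respectively, which illustrates how sharply the required sample size increases as the point alternative moves towards $\theta_0$.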
One way of incorporating planning uncertainty is to make assumptions about the relative a priori likelihood of the unknown effect size. This approach can be formalised within a Bayesian framework by seeing the true effect size $\theta$ as the realisation of a random variable $\Theta$ with prior density $\phi(\theta)$. This means that the CDF of $\Theta$ is given by $\Pr[\Theta \leq x] = \int_{-\infty}^{x} \phi(\theta)\, d\theta$. At the planning stage, the probability to reject the null hypothesis is then given by the random variable $\mathrm{RPR}(n)$, the ‘random probability to reject’:

$$\mathrm{RPR}(n) := \Pr_{\Theta}[Z_n > z_{1-\alpha}]. \tag{3}$$

We explicitly denote this quantity as ‘random’ to emphasise the distinction between the (conditional on $\Theta = \theta$) probability to reject given in equation (1) and the unconditional ‘random’ probability to reject. The variation of the random variable $\mathrm{RPR}(n)$ reflects the a priori uncertainty about the unknown underlying effect size that is encoded in the prior density $\phi(\cdot)$ of the random variable $\Theta$. We prefer the term ‘random probability to reject’ over ‘random power’ since $\mathrm{RPR}(n)$ is unconditional on the effect size, and consequently does not distinguish between rejections under the null hypothesis and under relevant effect sizes. Instead, we define the conditional random variable ‘random power’ as

$$\mathrm{RPow}(n) := \Pr_{\Theta \mid \Theta \geq \theta_{\mathrm{MCID}}}[Z_n > z_{1-\alpha}] \tag{4}$$
$$= \mathrm{RPR}(n) \mid \Theta \geq \theta_{\mathrm{MCID}}. \tag{5}$$

This definition more closely resembles the concept of frequentist power since it conditions on a relevant effect size.

Footnote: It is important to distinguish between the (unknown) true effect size and the observed effect size. It might still be reasonable to additionally require a certain deviation of the observed effect from the null, e.g. a hazard ratio of less than 0.85.

Determining the required sample size based on a point alternative as outlined in the introduction evaluates the probability to reject the null hypothesis solely on $\theta_{\mathrm{alt}}$.
This can be understood as conditioning the random probability to reject, or the random power, on $\Theta = \theta_{\mathrm{alt}}$, i.e., to consider $\big(\mathrm{RPR}(n) \mid \Theta = \theta_{\mathrm{alt}}\big) = \big(\mathrm{RPow}(n) \mid \Theta = \theta_{\mathrm{alt}}\big)$. Due to conditioning on a single parameter value $\theta_{\mathrm{alt}} \geq \theta_{\mathrm{MCID}}$, under any prior density with $\phi(\theta_{\mathrm{alt}}) > 0$, this random variable reduces to the probability to reject at $\theta_{\mathrm{alt}}$, $\Pr_{\theta_{\mathrm{alt}}}[Z_n > z_{1-\alpha}]$, which is often termed ‘power’ in a frequentist context. Basing the sample size derivation on this quantity means that the probability to reject the null hypothesis for relevant values of $\theta \geq \theta_{\mathrm{MCID}}$, $\theta \neq \theta_{\mathrm{alt}}$, is completely ignored and the a priori evidence encoded in $\phi(\cdot)$ is not used.

Spiegelhalter and Freedman [1986] have pointed out that a power constraint for sample size derivation could be computed based on “[...] a somewhat arbitrarily chosen location parameter of the [prior] distribution (for example the mean, the median or the 70th percentile).” Using a location parameter of the unconditional prior distribution of $\Theta$, however, might lead to situations where no sample size can be determined if the location parameter lies within the null hypothesis. Here, we follow a similar idea but motivate the choice of location parameter in terms of the a priori distribution of random power. To this end, let

$$Q_{1-\gamma}[\mathrm{RPow}(n)] := \inf_{x} \; \Pr_{\phi(\cdot)}\big[\mathrm{RPow}(n) \geq x\big] \geq \gamma \tag{6}$$

be the $(1-\gamma)$-quantile of the random power ($\mathrm{RPow}(n)$). Furthermore, let

$$\phi(\theta \mid \Theta \geq \theta_{\mathrm{MCID}}) := \frac{\mathbf{1}_{\theta \geq \theta_{\mathrm{MCID}}}\, \phi(\theta)}{\int_{\theta_{\mathrm{MCID}}}^{\infty} \phi(x)\, dx} \tag{7}$$

be the conditional prior density of $\big(\Theta \mid \Theta \geq \theta_{\mathrm{MCID}}\big)$. We choose to make the dependency of

$$\Pr_{\phi(\cdot)}\big[\mathrm{RPow}(n) \geq x\big] = \int_{-\infty}^{\infty} \phi(\theta \mid \Theta \geq \theta_{\mathrm{MCID}})\, \mathbf{1}_{\Pr_{\theta}[Z_n > z_{1-\alpha}] \geq x}\, d\theta \tag{8}$$

on the prior density explicit by using the index ‘$\phi(\cdot)$’ since the random parameter $\Theta$ does not appear directly in the description of the event ‘$\mathrm{RPow}(n) \geq x$’. Whenever $\Theta$ appears explicitly, we omit the index since the dependency on $\phi(\cdot)$ is then clear from the context.
The expression $\Pr_{\phi(\cdot)}\big[\mathrm{RPow}(n) \geq x\big]$ is a real number in $[0, 1]$, in contrast to $\mathrm{RPR}(n) = \Pr_{\Theta}[Z_n > z_{1-\alpha}]$, which is a random variable.

If a sample size was then chosen such that $Q_{1-\gamma}[\mathrm{RPow}(n)] \geq 1 - \beta$, the a priori probability of exceeding a probability to reject of $1 - \beta$ given a relevant effect would be, by definition, at least $\gamma$. The required sample size for this approach is the solution of

$$\operatorname*{argmin}_{n} \; n \quad \text{subject to:} \quad Q_{1-\gamma}[\mathrm{RPow}(n)] \geq 1 - \beta. \tag{9}$$

Since $\Pr_{\theta}[Z_n > z_{1-\alpha}]$ is monotonic in $\theta$, this problem is equivalent to solving

$$n^{*}_{\gamma} := \operatorname*{argmin}_{n} \; n \quad \text{subject to:} \quad \Pr_{Q_{1-\gamma}[\Theta \mid \Theta \geq \theta_{\mathrm{MCID}}]}\big[Z_n > z_{1-\alpha}\big] \geq 1 - \beta, \tag{10}$$

where $Q_{1-\gamma}[\Theta \mid \Theta \geq \theta_{\mathrm{MCID}}]$ is the $(1-\gamma)$-quantile of the prior distribution of the random variable $\Theta$ conditional on a relevant effect. Consequently, this ‘prior quantile approach’ can be used with any existing frequentist sample size derivation formula. It is merely a formal Bayesian justification for determining the sample size of a trial based on a point alternative $\theta_{\mathrm{alt}} := Q_{1-\gamma}[\Theta \mid \Theta \geq \theta_{\mathrm{MCID}}] \geq \theta_{\mathrm{MCID}}$. The prior quantile approach reduces to powering on $\theta_{\mathrm{MCID}}$ whenever the target power needs to be met with absolute certainty for all relevant effect sizes ($\gamma = 1$). One may thus see the prior quantile approach as a principled relaxation of powering on $\theta_{\mathrm{MCID}}$.

The approach differs from Spiegelhalter and Freedman’s suggestions in two key aspects. Firstly, the point alternative naturally emerges as a quantile of the prior conditional on a relevant effect by imposing a lower boundary on the a priori probability to undershoot the target power. This intuitively makes sense since a large probability to reject is only desirable when the underlying $\theta$ is relevant. This also ensures that $Q_{1-\gamma}[\Theta \mid \Theta \geq \theta_{\mathrm{MCID}}] > \theta_{\mathrm{MCID}} \geq \theta_0$ for $\gamma < 1$. Secondly, [...] ($\gamma > 0.5$) than the conditional median ($\gamma = 0.5$), which was discussed by Spiegelhalter and Freedman.
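The prior quantile approach of equation (10) can be sketched in Python by combining a conditional prior quantile with the standard closed-form sample size for a point alternative. The normal prior with mean 0.3 and standard deviation 0.2, the value $\theta_{\mathrm{MCID}} = 0.1$, and all function names are assumptions made purely for illustration; they do not come from the manuscript.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sample_size_point_alternative(theta_alt, sigma=1.0, alpha=0.025, beta=0.2):
    """Smallest n with probability to reject >= 1 - beta at theta_alt (theta_0 = 0)."""
    z_sum = STD_NORMAL.inv_cdf(1.0 - alpha) + STD_NORMAL.inv_cdf(1.0 - beta)
    return ceil((z_sum * sigma / theta_alt) ** 2)


def conditional_prior_quantile(prior, theta_mcid, gamma):
    """(1 - gamma)-quantile of Theta given Theta >= theta_mcid for a normal prior."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    if gamma == 1.0:
        return theta_mcid  # certainty for all relevant effects: power on the MCID
    p_irrelevant = prior.cdf(theta_mcid)  # prior mass below the MCID
    # Conditional CDF: (F(q) - F(mcid)) / (1 - F(mcid)) = 1 - gamma, solved for q.
    return prior.inv_cdf(p_irrelevant + (1.0 - gamma) * (1.0 - p_irrelevant))


def sample_size_prior_quantile(prior, theta_mcid, gamma, **kwargs):
    """n*_gamma from equation (10): power on the conditional prior quantile."""
    theta_alt = conditional_prior_quantile(prior, theta_mcid, gamma)
    return sample_size_point_alternative(theta_alt, **kwargs)


prior = NormalDist(mu=0.3, sigma=0.2)  # assumed prior, illustration only
for gamma in (0.5, 0.9, 1.0):
    print(gamma, sample_size_prior_quantile(prior, theta_mcid=0.1, gamma=gamma))
```

Larger values of $\gamma$ move the point alternative towards $\theta_{\mathrm{MCID}}$ and therefore increase the sample size, with $\gamma = 1$ recovering the classical derivation at the MCID.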
Spiegelhalter and Freedman also proposed the use of the “probability of concluding that the new treatment is superior and of this being correct” ($P_{Ss}$ in their notation) to derive a required sample size [Spiegelhalter and Freedman, 1986]. The quantity has also been referred to as ‘prior adjusted power’ [Spiegelhalter et al., 2004, Shao et al., 2008]. This definition of probability of success is also discussed in Liu [2010] and Ciarleglio et al. [2015]. In the situation at hand, it reduces to

$$\mathrm{PoS}(n) := \Pr[Z_n > z_{1-\alpha},\; \Theta \geq \theta_{\mathrm{MCID}}] \tag{11}$$
$$= \int_{\theta_{\mathrm{MCID}}}^{\infty} \int_{z_{1-\alpha}}^{\infty} \varphi(z - \theta_n)\, \phi(\theta)\, dz\, d\theta, \tag{12}$$

where $\varphi$ is the PDF of the standard normal distribution. Here, we are slightly more general than previous authors in that we allow $\theta_{\mathrm{MCID}} > 0$ [...] and the effect is relevant. Whenever $\theta_{\mathrm{MCID}} = 0$ this coincides with the definitions used previously in the literature.

The definition of $\mathrm{PoS}(n)$ critically relies on what is being considered a ‘success’. The original proposal of Spiegelhalter and Freedman only considers a significant study result a success if the underlying effect is also non-null (i.e., the joint probability of non-null and detection). In more recent publications, a majority of authors tend to follow O’Hagan et al., who consider a slightly different definition of the probability of success by integrating the probability to reject over the entire parameter range [O’Hagan and Stevens, 2001, O’Hagan et al., 2005] and term this ‘assurance’. For a more comprehensive overview of the terms used in the literature, see Section 5. The alternative definition for probability of success introduced by O’Hagan et al. corresponds to the marginal probability of rejecting the null hypothesis irrespective of the corresponding parameter value:

$$\mathrm{PoS}'(n) := \Pr[Z_n > z_{1-\alpha}] \tag{13}$$
$$= \int_{-\infty}^{\infty} \int_{z_{1-\alpha}}^{\infty} \varphi(z - \theta_n)\, \phi(\theta)\, dz\, d\theta \tag{14}$$
$$= \mathrm{PoS}(n) + \underbrace{\Pr[Z_n > z_{1-\alpha},\; 0 < \Theta < \theta_{\mathrm{MCID}}]}_{\text{probability of rejection and irrelevant effect}} + \underbrace{\Pr[Z_n > z_{1-\alpha},\; \Theta \leq 0]}_{\text{probability of a type I error}}. \tag{15}$$

The decomposition in equation (15) makes it clear that the implicit definition of ‘success’ underlying $\mathrm{PoS}'(n)$ is at least questionable [Liu, 2010]. The marginal probability of rejecting the null hypothesis includes rejections under irrelevant or even null values of $\theta$, and is thus inflated by type I errors and rejections under irrelevant values of $\theta$. This issue was first raised by Spiegelhalter et al. [2004] for simple (point) null and alternative hypotheses. The degree to which $\mathrm{PoS}(n)$ and $\mathrm{PoS}'(n)$ differ numerically is discussed in more detail in Section 7.2. Which definition of ‘success’ is preferred mainly depends on perspective: a short-term oriented pharmaceutical company might just be interested in rejecting the null hypothesis to monetise a new drug, irrespective of it actually showing a relevant effect. This view would then correspond to $\mathrm{PoS}'(n)$. Regulators and companies worried about the longer-term consequences of potentially having to retract ineffective drugs might tend towards the joint probability of correctly rejecting the null, i.e., $\mathrm{PoS}(n)$. We take the latter perspective and focus on $\mathrm{PoS}(n)$.

$\mathrm{PoS}(n)$ is an unconditional quantity and must therefore implicitly depend on the a priori probability of a relevant effect. To see this, consider the following decomposition:

$$\mathrm{PoS}(n) = \Pr[Z_n > z_{1-\alpha},\; \Theta \geq \theta_{\mathrm{MCID}}] \tag{16}$$
$$= \int_{\theta_{\mathrm{MCID}}}^{\infty} \Pr_{\theta}[Z_n > z_{1-\alpha}]\, \phi(\theta)\, d\theta \tag{17}$$
$$= \Pr[Z_n > z_{1-\alpha} \mid \Theta \geq \theta_{\mathrm{MCID}}]\; \Pr[\Theta \geq \theta_{\mathrm{MCID}}] \tag{18}$$
$$= \underbrace{\mathrm{E}\big[\Pr_{\Theta \mid \Theta \geq \theta_{\mathrm{MCID}}}[Z_n > z_{1-\alpha}]\big]}_{=\, \mathrm{E}[\mathrm{RPow}(n)] \,=:\, \mathrm{EP}(n)}\; \Pr[\Theta \geq \theta_{\mathrm{MCID}}]. \tag{19}$$

This means that the probability of success can be expressed as the product of the ‘expected power’, $\mathrm{EP}(n)$, and the a priori probability of a relevant effect (see again Spiegelhalter et al. [2004] for the situation with point hypotheses). Expected power was implicitly mentioned in Spiegelhalter and Freedman [1986] ($P_{Ss}/P_{\cdot s}$ in their notation) as a way to characterise the properties of a design.
The use of expected power as a means to derive the required sample size of a design under uncertainty was then proposed in Brown et al. [1987] by solving

$$n^{*}_{\mathrm{EP}} := \operatorname*{argmin}_{n} \; n \quad \text{subject to:} \quad \mathrm{EP}(n) \geq 1 - \beta. \tag{20}$$

Since the power function is monotonically increasing in $\theta$, expected power is strictly larger than power at the minimal relevant value whenever $\Pr[\Theta > \theta_{\mathrm{MCID}}] > 0$. This implies that a constraint on expected power instead of a constraint on the probability to reject the null hypothesis at $\theta_{\mathrm{MCID}}$ is less restrictive, and consequently $n^{*}_{\mathrm{EP}} < n^{*}_{\theta_{\mathrm{MCID}}}$.

The terms ‘expected power’ and ‘probability of success’ are sometimes used interchangeably in the literature (see Section 5). In the following, we take a closer look at their connection to clarify their characteristic differences. Expected power is merely a weighted average of the probability to reject in the relevance region $\theta \geq \theta_{\mathrm{MCID}}$, where the weight function is given by the conditional prior density $\phi(\cdot \mid \Theta \geq \theta_{\mathrm{MCID}})$ defined in equation (7):

$$\mathrm{EP}(n) = \mathrm{E}\big[\Pr_{\Theta}[Z_n > z_{1-\alpha}] \mid \Theta \geq \theta_{\mathrm{MCID}}\big] \tag{21}$$
$$= \int_{\theta_{\mathrm{MCID}}}^{\infty} \Pr_{\theta}[Z_n > z_{1-\alpha}]\, \phi(\theta \mid \Theta \geq \theta_{\mathrm{MCID}})\, d\theta. \tag{22}$$

$\mathrm{PoS}(n)$, on the other hand, integrates the probability to reject over the same region using the unconditional prior density (see equations (17) and (22)). Thus, in contrast to $\mathrm{PoS}(n)$, expected power does not depend on the a priori probability of a relevant effect size but only on the relative magnitude of the prior density (‘a priori likelihood’) of relevant parameter values. Since the conditional prior density differs from the unconditional one only by normalisation via the a priori probability of a relevant effect, it follows from equation (19) that $\mathrm{EP}(n)$ and $\mathrm{PoS}(n)$ differ only by the constant factor $\Pr[\Theta \geq \theta_{\mathrm{MCID}}]$. Consequently, any constraint on probability of success can be reformulated as a constraint on expected power and vice versa:
$$\mathrm{PoS}(n) \geq 1 - \beta \;\Leftrightarrow\; \mathrm{EP}(n) \geq (1 - \beta) \,/\, \Pr[\Theta \geq \theta_{\mathrm{MCID}}]. \tag{23}$$

Furthermore, since $\mathrm{PoS}(n) = \mathrm{EP}(n)\, \Pr[\Theta \geq \theta_{\mathrm{MCID}}]$ and $\mathrm{EP}(n) \leq 1$, $\mathrm{PoS}(n)$ can never exceed the a priori probability of a relevant effect, $\Pr[\Theta \geq \theta_{\mathrm{MCID}}]$. This implies that the usual conventions on the choice of $\beta$ as the maximal type II error rate for a point alternative cannot be meaningful in terms of the unconditional $\mathrm{PoS}(n)$, since the maximum attainable probability of success is the situation-specific a priori probability of a relevant effect. The need to recalibrate typical benchmark thresholds when considering probability of success was previously discussed in the literature. For instance, O’Hagan and Stevens [2001] state that “[t]he assurance figure is often much lower [than the power], because there is an appreciable prior probability that the treatment difference is less than $\delta^{*}$”, where $\delta^{*}$ in their notation corresponds to $\theta_{\mathrm{MCID}}$ in ours. A similar argument is put forward in Rufibach et al. [2016, Section 2] for $\mathrm{PoS}'(n)$. The key issue is thus whether one is interested in the joint probability of rejecting the null hypothesis and the effect being relevant, $\mathrm{PoS}(n)$, or the conditional probability of rejecting the null hypothesis given a relevant effect, $\mathrm{EP}(n)$. While the interpretation of both quantities is different, in any particular situation, they only differ by a constant factor.

Since expected power and probability of success are proportional, it suffices to compare expected power and the quantile-based approach outlined in Section 2 with respect to sample size derivation. Consider $\theta_1 := Q_{1-\gamma}[\Theta \mid \Theta \geq \theta_{\mathrm{MCID}}]$ and an arbitrary but fixed parameter value $\theta_2 > \theta_1 > \theta_{\mathrm{MCID}}$. Clearly, under the quantile-based approach, the rejection probability at any $\theta_2 > \theta_1$ does not contribute towards the fulfilment of the power constraint since the probability to reject is only evaluated at $\theta_1$. For expected power, however, the total functional derivative with respect to changes in the probability to reject at $\theta_1$ and $\theta_2$ is

$$d\,\mathrm{EP}(n) = \frac{\partial\, \mathrm{EP}(n)}{\partial \Pr_{\theta_1}[Z_n > z_{1-\alpha}]}\, d\Pr_{\theta_1}[Z_n > z_{1-\alpha}] + \frac{\partial\, \mathrm{EP}(n)}{\partial \Pr_{\theta_2}[Z_n > z_{1-\alpha}]}\, d\Pr_{\theta_2}[Z_n > z_{1-\alpha}] \tag{24}$$
$$= \phi(\theta_1 \mid \Theta \geq \theta_{\mathrm{MCID}})\, d\Pr_{\theta_1}[Z_n > z_{1-\alpha}] + \phi(\theta_2 \mid \Theta \geq \theta_{\mathrm{MCID}})\, d\Pr_{\theta_2}[Z_n > z_{1-\alpha}]. \tag{25}$$

Keeping expected power constant, i.e., setting $d\,\mathrm{EP}(n) = 0$ and solving for $d\Pr_{\theta_2}[Z_n > z_{1-\alpha}]$, yields

$$0 = \phi(\theta_1 \mid \Theta \geq \theta_{\mathrm{MCID}})\, d\Pr_{\theta_1}[Z_n > z_{1-\alpha}] + \phi(\theta_2 \mid \Theta \geq \theta_{\mathrm{MCID}})\, d\Pr_{\theta_2}[Z_n > z_{1-\alpha}] \tag{26}$$
$$\Leftrightarrow\; d\Pr_{\theta_2}[Z_n > z_{1-\alpha}] = -\frac{\phi(\theta_1 \mid \Theta \geq \theta_{\mathrm{MCID}})}{\phi(\theta_2 \mid \Theta \geq \theta_{\mathrm{MCID}})}\, d\Pr_{\theta_1}[Z_n > z_{1-\alpha}] \tag{27}$$
$$\Leftrightarrow\; d\Pr_{\theta_2}[Z_n > z_{1-\alpha}] = -\frac{\phi(\theta_1)}{\phi(\theta_2)}\, d\Pr_{\theta_1}[Z_n > z_{1-\alpha}]. \tag{28}$$

A reduction in the probability to reject at $\theta_1$ by one percentage point can thus be compensated by an increase in the probability to reject at $\theta_2$ by $\phi(\theta_1)/\phi(\theta_2)$ percentage points. This demonstrates that the core difference between the prior quantile-based approach and the expected power approach is whether a trade-off between power at different parameter values is deemed permissible (expected power) or not (quantile-based approach). A structured overview of the terms introduced so far and the respective connections between them is given in Figure 1.

In a regulatory environment, and most scientific fields, the choice of the significance level, $\alpha$, is a pre-determined quality criterion. In the life sciences a one-sided $\alpha$ of 2.5% is common. Yet, the exact choice of the threshold $1 - \beta$ is much more arbitrary. In clinical trials, $1 - \beta = 0.$[...] $1 - \beta = 0.$[...] $1 - \beta$ that is independent of the specific context of a trial only makes sense with conditional approaches like the (conditional prior) quantile approach or when using $\mathrm{EP}(n)$ to derive a required sample size.
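The quantities in equations (14), (17), (19), and (20) can be approximated numerically. The Python sketch below does so for an assumed normal prior (mean 0.3, standard deviation 0.2) with $\theta_0 = 0$, $\sigma = 1$, and $\theta_{\mathrm{MCID}} = 0.1$; these values, the Simpson-rule integration, and all function names are illustrative assumptions rather than the authors' implementation.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def probability_to_reject(theta, n, sigma=1.0, alpha=0.025):
    """Equation (1) with theta_0 = 0."""
    return STD_NORMAL.cdf(sqrt(n) * theta / sigma - STD_NORMAL.inv_cdf(1.0 - alpha))


def simpson(f, lower, upper, steps=1000):
    """Composite Simpson rule on [lower, upper]; steps must be even."""
    h = (upper - lower) / steps
    total = f(lower) + f(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lower + i * h)
    return total * h / 3.0


def success_quantities(n, prior, theta_mcid, sigma=1.0, alpha=0.025):
    """Return (EP(n), PoS(n), marginal probability to reject) for a normal prior."""
    lower = prior.mean - 8.0 * prior.stdev  # prior mass beyond 8 sd is negligible
    upper = prior.mean + 8.0 * prior.stdev

    def integrand(theta):
        return probability_to_reject(theta, n, sigma, alpha) * prior.pdf(theta)

    pos = simpson(integrand, theta_mcid, upper)             # equation (17)
    marginal = simpson(integrand, lower, upper)             # equation (14), inner integral solved
    expected_power = pos / (1.0 - prior.cdf(theta_mcid))    # equation (19)
    return expected_power, pos, marginal


def sample_size_expected_power(prior, theta_mcid, beta=0.2, n_max=10_000):
    """Smallest n with EP(n) >= 1 - beta, equation (20).

    Bisection is valid because EP(n) increases in n when theta_mcid >= theta_0 = 0.
    """
    if success_quantities(n_max, prior, theta_mcid)[0] < 1.0 - beta:
        raise ValueError("target expected power not reachable within n_max")
    low, high = 1, n_max
    while low < high:
        mid = (low + high) // 2
        if success_quantities(mid, prior, theta_mcid)[0] >= 1.0 - beta:
            high = mid
        else:
            low = mid + 1
    return low


prior = NormalDist(mu=0.3, sigma=0.2)  # assumed prior, illustration only
print(success_quantities(100, prior, theta_mcid=0.1))
print(sample_size_expected_power(prior, theta_mcid=0.1))
```

Because $\mathrm{PoS}(n)$ is bounded by $\Pr[\Theta \geq \theta_{\mathrm{MCID}}]$, a search for a $\mathrm{PoS}(n)$ threshold should first be rescaled to an $\mathrm{EP}(n)$ threshold via equation (23); the sketch therefore searches on expected power only.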
In principle, the unconditional $\mathrm{PoS}(n)$ should be easier to interpret by non-statisticians. Equation (23) allows the transformation of an $\mathrm{EP}(n)$-based sample size derivation, which can readily use any of the established values for $1 - \beta$, into a $\mathrm{PoS}(n)$-based sample size derivation by re-calibrating the threshold with the proportionality factor linking $\mathrm{EP}(n)$ and $\mathrm{PoS}(n)$. This only transforms the conditional criteria (minimum $\mathrm{EP}(n)$) of the classical sample size derivation to the unconditional domain (minimum $\mathrm{PoS}(n)$) without affecting the derived sample size in any way. For instance, if $1 - \beta = 0.$[...] $\Pr[\Theta \geq \theta_{\mathrm{MCID}}] = 0.$[...] $\mathrm{PoS}(n)$ would be $0.$[...]

[...] $\phi(\cdot)$ give a natural answer to this issue of trial-specific thresholds via the concept of utility maximisation or maximal expected utility (MEU). An in-depth discussion of the MEU concept is beyond the scope of this manuscript and we refer the reader to, for example, Lindley [1997]. We merely want to highlight the fact that the choice of the constraint threshold $1 - \beta$ can be justified by making the link to MEU principles. To this end we consider a particularly simple utility function.

Assume that the maximal type I error rate is still to be controlled at level $\alpha$. For the sake of simplicity, further assume that a correct rejection of the null hypothesis yields an expected return of $\lambda$. Here the return is given in terms of the average per-patient costs within the trial. Ignoring fixed costs, the expected trial utility (in units of average per-patient costs) is then given by

$$U(n) := \lambda\, \mathrm{PoS}(n) - n \tag{29}$$

and the utility-maximising sample size is $n^{*}_{U}(\lambda) := \operatorname*{argmax}_{n}\, U(n)$. Obviously, the same sample size would be obtained by solving problem (20), given

$$1 - \beta = \mathrm{EP}\big(n^{*}_{U}(\lambda)\big) = \Pr[\Theta \geq \theta_{\mathrm{MCID}}]^{-1}\; \mathrm{PoS}\big(n^{*}_{U}(\lambda)\big). \tag{30}$$

The right hand side, $\Pr[\Theta \geq \theta_{\mathrm{MCID}}]^{-1}\, \mathrm{PoS}\big(n^{*}_{U}(\lambda)\big)$, is the utility-maximising expected power threshold given the utility parameter $\lambda$.
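The link between the utility function (29) and the implied power threshold (30) can be explored with a small grid search. The Python sketch below assumes a normal prior (mean 0.3, standard deviation 0.2), $\theta_0 = 0$, $\sigma = 1$, $\theta_{\mathrm{MCID}} = 0.1$, and hypothetical reward values; the function names and the brute-force search over $n$ are illustrative choices, not the authors' method.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(1.0 - 0.025)  # one-sided alpha of 2.5%


def prior_grid(prior, theta_mcid, steps=400):
    """Simpson nodes and weights (including the prior density) on theta >= theta_mcid."""
    upper = prior.mean + 8.0 * prior.stdev  # prior mass beyond 8 sd is negligible
    h = (upper - theta_mcid) / steps
    nodes = [theta_mcid + i * h for i in range(steps + 1)]
    coef = [1 if i in (0, steps) else (4 if i % 2 else 2) for i in range(steps + 1)]
    weights = [c * h / 3.0 * prior.pdf(t) for c, t in zip(coef, nodes)]
    return nodes, weights


def probability_of_success(n, nodes, weights, sigma=1.0):
    """PoS(n) as in equation (17), with theta_0 = 0."""
    return sum(w * STD_NORMAL.cdf(sqrt(n) * t / sigma - Z_CRIT)
               for t, w in zip(nodes, weights))


def utility_maximising_design(reward, prior, theta_mcid, n_max=1000):
    """Return (n*_U, implied 1 - beta) for U(n) = reward * PoS(n) - n.

    The search is restricted to 1..n_max, so an optimum at n_max may be constrained.
    """
    nodes, weights = prior_grid(prior, theta_mcid)
    pos = {n: probability_of_success(n, nodes, weights) for n in range(1, n_max + 1)}
    n_opt = max(pos, key=lambda n: reward * pos[n] - n)           # equation (29)
    implied_power = pos[n_opt] / (1.0 - prior.cdf(theta_mcid))    # equation (30)
    return n_opt, implied_power


prior = NormalDist(mu=0.3, sigma=0.2)  # assumed prior, illustration only
for reward in (500, 2000):             # hypothetical returns in per-patient cost units
    print(reward, utility_maximising_design(reward, prior, theta_mcid=0.1))
```

Comparing the implied thresholds for several reward values gives a feel for which returns correspond to conventional expected power levels under the assumed prior.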
Similarly, one could start with $n^{*}_{\mathrm{EP}}$ for a given $\beta$ and derive the corresponding $\lambda$ such that $n^{*}_{U}(\lambda) = n^{*}_{\mathrm{EP}}$. This value of $\lambda$ would then correspond to the implied expected reward upon successful rejection of the null for given $\beta$. Under the assumption of a utility function of the form (29), $\lambda$ and $\beta$ can thus be matched such that the corresponding utility maximisation problem and the constrained minimisation of the sample size under a power constraint both lead to the same required sample size. Consequently, practitioners are free to either define an expected return upon successful rejection, $\lambda$, or a threshold on the minimal expected power, $1 - \beta$.

While it is theoretically attractive to derive the sample size directly based on a utility function, an informed choice of $\lambda$ is often hard to justify in practice. In these situations one may instead reverse the perspective and determine the value of $\lambda$ under which the utility-maximising design would coincide with the design obtained under a standard (expected) power threshold of, say, 80% or 90%. The implied reward parameter might then be used to communicate the consequences of different choices of power thresholds to decision makers and to inform the final choice of $1 - \beta$. This approach can, of course, be generalised to more detailed utility functions. Note, however, that for utility functions with more than one free parameter there is no longer a one-to-one correspondence between power level and utility parameters. Rather, for a given power level, there will be a level-set of values for the utility parameters that match the specified power. We give a practical example of this process in Section 7.5.

Table 1: Selected publications on ‘hybrid’ sample size derivation based on error rates. Structured by concepts as defined in Figure 1.
Concept: Marginal probability to reject $H_0$

Crook and Good [1982]: Termed ‘strength’; application in multinomial contingency tables.
Spiegelhalter and Freedman [1986]: Only implicitly mentioned; discussing close relation to $\mathrm{PoS}(n)$, termed ‘expected/average power’ in Spiegelhalter et al. [2004].
Gillett [1994]: Termed ‘average power’; focus on replication.
O’Hagan and Stevens [2001]: Termed ‘assurance’ or ‘expected power’; different from our notion of expected power which is conditional on a relevant effect, see also [O’Hagan et al., 2005].
Chuang-Stein [2006]: Termed ‘average probability of success’; discusses other definitions of ‘success’ based on additional criteria for the observed point estimates; discusses how basing the sample size on relevance arguments alone is theoretically correct but ineffective if evidence for larger effect sizes is available, see also Chuang-Stein et al. [2011].
Grouin et al. [2007]: Termed ‘predictive power’ and ‘predictive probability to reject $H_0$’; review of regulatory aspects, discussion of interval-based sample size calculation, and utility considerations.
Daimon [2008]: Termed ‘hybrid Neyman–Pearson–Bayesian (hNPB) probability’; application in non-inferiority setting.
Shao et al. [2008]: Termed ‘adjusted power’; review of regulatory aspects, discussion of interval-based sample size calculation, and utility considerations.
Liu [2010]: Termed ‘extended Bayesian expected power 1’; extended by treating variance as unknown, also consider $\mathrm{PoS}(n)$ and $\mathrm{EP}(n)$.
Lan and Wittes [2012]: Termed ‘average power’; discusses upper limit of ‘average power’ depending on prior choice and suggest truncated priors which would be very close to conditioning on a relevant effect.
Carroll [2013]: Termed ‘assurance’ and ‘probability of success’ (PoS); discusses other definitions of success but all definitions are also exclusively based on observed quantities (minimum threshold on point estimate), see also Chuang-Stein [2006].
Brutti et al. [2014]: Termed ‘predictive frequentist power’; also discusses sample size derivation based on Bayesian decision criteria.
Ren and Oakley [2014]: Termed ‘assurance’; discusses ideas of O’Hagan et al. [2005] in time-to-event setting.
Hu [2014]: Termed ‘probability of success’; considers priors on mean and standard deviation; discuss upper limit on probability of success in the more complex two-parameter situation.
Ibrahim et al. [2015]: Termed ‘average probability of success’; discussed in context of historical data integration.
Walley et al. [2015]: Termed ‘assurance’ or ‘probability of success’; extension to multi-parameter situations.
Ciarleglio et al. [2015]: Termed ‘expected power’; also consider $\mathrm{EP}(n)$ and $\mathrm{PoS}(n)$, very similar settings considered in Ciarleglio et al. [2016], Ciarleglio and Arendt [2017].
Rufibach et al. [2016]: Termed ‘assurance’ or ‘probability of success’; in-depth discussion of the distribution of the probability to reject the null hypothesis.
Saint-Hilary et al. [2018]: Termed ‘predictive probability of success’; consider both ‘statistical success’ ($p$-value $\leq \alpha$) and ‘clinical relevance’ (observed effect above relevance threshold), see also Saint-Hilary et al. [2019].
Chen and Ho [2017]: Termed ‘assurance’ and ‘expected power’; discusses conditional nature of the (frequentist) probability to reject the null hypothesis from a Bayesian perspective.
Jiang [2011], Kirby et al. [2012], Zhang and Zhang [2013], Wang et al. [2015], Götte et al. [2017]: Termed ‘probability of statistical success’, ‘probability of success’, ‘assurance’, ‘predictive power’; discusses extensions to multiple studies or entire drug development programs.
Ambrosius et al. [2012], Wang et al. [2013], Wang [2015], Crisp et al. [2018], Chen and Chen [2018]: Termed ‘assurance’, ‘probability of success’, ‘probability of study success’; practical applications in various settings.

Concept: Probability of success

Spiegelhalter and Freedman [1986]: Only implicitly mentioned, termed ‘prior adjusted power’ in Spiegelhalter et al. [2004]; discusses close relation to marginal probability to reject $H_0$ (suggesting the latter as practical approximation).
Brown et al. [1987]: Termed ‘expected power’; also discusses ‘conditional expected power’ which corresponds to our definition of $\mathrm{EP}(n)$.
Shao et al. [2008]: Termed ‘adjusted power’; application of the ideas of Spiegelhalter et al. [2004] to binary setting, define probability of success but approximate it with the marginal probability to reject $H_0$.
Liu [2010]: Termed ‘extended Bayesian expected power 2’; extended by treating variance as unknown, also considers $\mathrm{PoS}'(n)$ and $\mathrm{EP}(n)$.
Ciarleglio et al. [2015]: Termed ‘prior-adjusted power’; also considers $\mathrm{EP}(n)$ and $\mathrm{PoS}'(n)$, very similar settings considered in Ciarleglio et al. [2016], Ciarleglio and Arendt [2017].

Concept: Expected power

Brown et al. [1987]: Termed ‘conditional expected power’; also discusses unconditional expected power which corresponds to our definition of $\mathrm{PoS}(n)$.
Spiegelhalter et al. [2004]: Not named; referencing Brown et al. [1987].
Liu [2010]: Termed ‘extended Bayesian expected power 3’; extended by treating variance as unknown, also consider $\mathrm{PoS}'(n)$ and $\mathrm{PoS}(n)$.
Ciarleglio et al. [2015]: Termed ‘conditional expected power’; also considers $\mathrm{PoS}(n)$ and $\mathrm{PoS}'(n)$, very similar settings considered in Ciarleglio et al. [2016], Ciarleglio and Arendt [2017].

A structured overview of the literature on ‘hybrid’ Bayesian sample size derivation in the context of clinical trials is given in Table 1. The table relates publications in the field to the terms defined in Figure 1. Publications with a similar take on the matter are grouped. In the following, we highlight a few particularly interesting contributions and how they relate to the definitions used in this manuscript.

The majority of the manuscripts only consider the marginal probability to reject $H_0$ ($\mathrm{PoS}'(n)$). Many publications refer to O’Hagan and Stevens [2001] or O’Hagan et al. [2005], where this quantity was introduced as ‘assurance’.
The range of names for what we call the ‘marginal probability to reject $H_0$’ is, however, quite diverse: ‘assurance’, ‘probability of success’, ‘predictive probability of success’, ‘average probability of success’, ‘probability of statistical success’, ‘probability of study success’, ‘predictive power’, ‘predictive frequentist power’, ‘expected power’, ‘average power’, ‘strength’, ‘extended Bayesian expected power 1’, and ‘hybrid Neyman-Pearson-Bayesian probability’.

However, only a handful of authors elaborate on the intricacies of defining what exactly constitutes a ‘success’ and whether to consider an unconditional measure of success or to condition on the presence of a relevant effect for sample size derivation [Spiegelhalter and Freedman, 1986, Brown et al., 1987, Shao et al., 2008, Liu, 2010, Ciarleglio et al., 2015]. Most publications fail to define explicitly what exactly constitutes a ‘success’. Yet, the use of $\mathrm{PoS}'(n)$ implies that rejection of the null hypothesis, irrespective of its truth, must be considered a success. Our analysis in Section 7.2 confirms the statement in Spiegelhalter et al. [2004] that $\mathrm{PoS}'(n)$ can be used as a practical approximation to $\mathrm{PoS}(n)$ in many situations. The exact definition of ‘probability of success’ becomes more interesting when allowing for $\theta_{\mathrm{MCID}} > \theta_0$, a potential extension rarely considered in the literature (see, e.g., Brown et al. [1987] for the binary case). We revisit the distinction between $\mathrm{PoS}'(n)$ and $\mathrm{PoS}(n)$ in a concrete example in Section 7.2.

Probability to reject $H_0$
Sym.: $\Pr_\theta[Z_n > z_{1-\alpha}]$
Def.: $\Phi\big(\sqrt{n}/\sigma\,(\theta - \theta_0) - z_{1-\alpha}\big)$
Int.: real number; probability to reject the null hypothesis given a fixed value of $\theta$

Random probability to reject $H_0$
Sym.: $\mathrm{RPR}(n)$
Def.: $\Pr_\Theta[Z_n > z_{1-\alpha}] = \Phi\big(\sqrt{n}/\sigma\,(\Theta - \theta_0) - z_{1-\alpha}\big)$
Int.: random variable; realisations correspond to the probability to reject the null hypothesis for $\Theta = \theta$

Random power
Sym.: $\mathrm{RPow}(n)$
Def.: $\mathrm{RPR}(n) \mid \Theta \geq \theta_{\mathrm{MCID}}$
Int.: random variable; realisations correspond to the probability to reject the null hypothesis for $\Theta = \theta$ given a relevant effect

Expected power
Sym.: $\mathrm{EP}(n)$
Def.: $\Pr[Z_n > z_{1-\alpha} \mid \Theta \geq \theta_{\mathrm{MCID}}] = \mathrm{E}[\mathrm{RPow}(n)] = \int_{\theta_{\mathrm{MCID}}}^{\infty} \Pr_\theta[Z_n > z_{1-\alpha}]\, \varphi(\theta \mid \Theta \geq \theta_{\mathrm{MCID}})\, \mathrm{d}\theta$
Int.: real number; average probability to reject the null weighted with prior density conditional on $\Theta \geq \theta_{\mathrm{MCID}}$

Probability of success
Sym.: $\mathrm{PoS}(n)$
Def.: $\Pr[Z_n > z_{1-\alpha},\, \Theta \geq \theta_{\mathrm{MCID}}] = \int_{\theta_{\mathrm{MCID}}}^{\infty} \Pr_\theta[Z_n > z_{1-\alpha}]\, \varphi(\theta)\, \mathrm{d}\theta$
Int.: real number; joint probability to reject the null and have a relevant effect; average probability to reject on $\Theta \geq \theta_{\mathrm{MCID}}$ weighted with unconditional prior density

Marginal probability to reject $H_0$
Sym.: $\mathrm{PoS}'(n)$
Def.: $\Pr_{\varphi(\cdot)}[Z_n > z_{1-\alpha}] = \mathrm{E}[\mathrm{RPR}(n)] = \int_{-\infty}^{\infty} \Pr_\theta[Z_n > z_{1-\alpha}]\, \varphi(\theta)\, \mathrm{d}\theta$
Int.: real number; marginal probability to reject the null irrespective of underlying effect

Arrow labels: replace fixed parameter $\theta$ with random variable $\Theta \sim \varphi(\cdot)$; condition on $\Theta \geq \theta_{\mathrm{MCID}}$; integrate with respect to prior conditional on relevant effect, $\varphi(\cdot \mid \Theta \geq \theta_{\mathrm{MCID}})$; integrate over $\theta \geq \theta_{\mathrm{MCID}}$ with unconditional prior $\varphi(\cdot)$; integrate over entire parameter range with unconditional prior $\varphi(\cdot)$; form expected value (three arrows); $\cdot\,\Pr[Z_n > z_{1-\alpha}]$; add expected type one error rate and probability of rejection under irrelevant values.

Figure 1: Structured overview of all quantities related to ‘power’ that are introduced in Sections 1 to 3. The symbols used in the text (Sym.), their exact definitions (Def.), and verbal interpretation (Int.) are summarised in the respective boxes. The relationships between the individual quantities are given as labelled arrows. For an overview of previous mentions and synonyms used in the literature, see Table 1.
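The relationships summarised in Figure 1 can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a one-sided test with $Z_n \sim N(\sqrt{n}\,\theta, 1)$ (that is, $\sigma = 1$ and $\theta_0 = 0$), a normal prior truncated to $[-0.3, 0.7]$, and $\theta_{\mathrm{MCID}} = 0.1$. All function names, the Simpson integration rule, and the parameter values are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

# Illustrative assumptions (not taken from the manuscript's code):
# sigma = 1, theta_0 = 0, one-sided level ALPHA, normal prior truncated to [LOWER, UPPER].
ALPHA = 0.025
THETA_MCID = 0.1
LOWER, UPPER = -0.3, 0.7
STD = NormalDist()
Z_CRIT = STD.inv_cdf(1 - ALPHA)


def power(theta, n):
    """Probability to reject H0 for fixed theta, assuming Z_n ~ N(sqrt(n) * theta, 1)."""
    return STD.cdf(sqrt(n) * theta - Z_CRIT)


def simpson(f, a, b, steps=400):
    """Composite Simpson rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def hybrid_quantities(n, mean, sd):
    """Return (EP(n), PoS(n), marginal probability to reject) for the truncated prior."""
    base = NormalDist(mean, sd)
    mass = base.cdf(UPPER) - base.cdf(LOWER)  # normalising constant of the truncation

    def weighted(theta):
        # power curve weighted with the unconditional (truncated) prior density
        return power(theta, n) * base.pdf(theta) / mass

    pos = simpson(weighted, THETA_MCID, UPPER)             # reject and relevant effect
    marginal = simpson(weighted, LOWER, THETA_MCID) + pos  # add rejections under irrelevant effects
    p_relevant = (base.cdf(UPPER) - base.cdf(THETA_MCID)) / mass
    return pos / p_relevant, pos, marginal


def required_n_ep(mean, sd, target=0.8):
    """Smallest n with EP(n) >= target; relies on EP(n) increasing in n for THETA_MCID > 0."""
    hi = 1
    while hybrid_quantities(hi, mean, sd)[0] < target:
        hi *= 2
    lo = hi // 2  # EP(lo) < target, or lo == 0 if n = 1 already suffices
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if hybrid_quantities(mid, mean, sd)[0] >= target:
            hi = mid
        else:
            lo = mid
    return hi


ep, pos, marginal = hybrid_quantities(100, mean=0.3, sd=0.2)
print(f"EP={ep:.3f}  PoS={pos:.3f}  marginal={marginal:.3f}")
print("n for EP >= 0.8:", required_n_ep(0.3, 0.2))
```

In this sketch, expected power is obtained by dividing the probability of success by the prior probability of a relevant effect, and the marginal probability to reject is the probability of success plus the rejections under irrelevant or null effects, mirroring the arrows in Figure 1.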
The exact choice of wording should not be given too much weight. However, we feel that any notion of power in the ‘hybrid’ Bayesian/frequentist setting should be conditional on a relevant effect (or at least a non-null effect) to preserve the conditional nature of the purely frequentist power. Using the term ‘power’ to refer to a joint probability like the ‘expected power’ of Brown et al. [1987] and Ciarleglio et al. [2015] (our $\mathrm{PoS}(n)$) or the ‘average/expected power’ of Spiegelhalter et al. [2004] (our $\mathrm{PoS}'(n)$) is potentially misleading. Others suggest ‘conditional expected power’ for $\mathrm{EP}(n)$ to distinguish it from ‘expected power’ (our $\mathrm{PoS}'(n)$) [Brown et al., 1987, Ciarleglio et al., 2015]. This wording, however, may lead to confusion when also considering interim analyses where ‘conditional power’ is a well-established term for the probability of rejecting the null hypothesis given $\theta_{\mathrm{alt}}$ and partially observed data [Bauer et al., 2016].

A particularly interesting publication is Liu [2010]. They extend hybrid sample size derivation in the normal case to also incorporate uncertainty about the variance and clearly distinguish between $\mathrm{PoS}'(n)$ = ‘extended Bayesian expected power 1’, $\mathrm{PoS}(n)$ = ‘extended Bayesian expected power 2’, and $\mathrm{EP}(n)$ = ‘extended Bayesian expected power 3’. Apart from nomenclature, our definitions of these three quantities only differ in that we assume the standard deviation to be fixed and in that we accommodate the optional notion of a relevant effect via $\theta_{\mathrm{MCID}}$. The former makes explicit formulas more manageable, the latter is important to keep sample sizes small in situations with vague or conservative prior information but substantial relevance thresholds. Liu [2010] and Rufibach et al. [2016] are also the only publications we found that study the distribution of the quantities that are averaged over ($\mathrm{RPR}(n)$ and $\mathrm{RPow}(n)$ in our notation, see Figures 4 and 5). In Ciarleglio et al. [2015], the distinction between all three quantities is also made explicit (‘expected power’ is our $\mathrm{PoS}'(n)$, ‘prior-adjusted power’ is our $\mathrm{PoS}(n)$, and ‘conditional expected power’ is our $\mathrm{EP}(n)$).

A major issue in modelling uncertainty and computing sample size via prior densities is the elicitation of an adequate prior. At first glance, non-informative or ‘objective’ priors seem to be a viable choice. As illustrated in Rufibach et al. [2016], the prior crucially impacts the properties and interpretability of PoS and likewise any other quantity depending on $\varphi(\cdot)$ in Figure 1, so careful selection is paramount. Often in clinical research, there is no direct prior knowledge on the effect size of interest, e.g., overall survival in a phase III trial, as no randomised trials comparing these same treatments using the same endpoint have been run previously. Researchers are then often tempted to use a vague prior, typically a normal prior with large variance, as, e.g., advocated in Saint-Hilary et al. [2019].

Assuming a non-informative, improper prior for $\Theta$ would imply that arbitrarily large effect sizes are just as likely as small ones. Yet, in clinical trials, the standardised effect size rarely exceeds 0.5 [Lamberink et al., 2018]. We thus illustrate the characteristics of the different approaches to defining power constraints under uncertainty using a convenient truncated Gaussian prior. The truncated Gaussian is conjugate to a Gaussian likelihood and allows us to restrict the plausible range of effect sizes to, e.g., liberally [ − , Let θ MCID = 0 . γ = 0 . , . α = 0 .
025 and β = 0 . a priori probability of a relevant effect and thus the required sample sizes explode for large prior standard deviations (in relation to the prior mean). For very large standard deviations, the constraint on probability of success becomes infeasible.

[Figure 2: four panels (EP, PoS, quantile 0.5, quantile 0.9); x-axis: prior mean; y-axis: prior standard deviation; legend: required sample size.]

Figure 2: Required sample size plotted against prior parameters (Normal truncated to [-0.3, 0.7], with varying mean and standard deviation); $\theta_{\mathrm{MCID}} = 0.1$; EP = Expected Power, PoS = Probability of Success, quantile = quantile approach with $\gamma = 0.5$ and $\gamma = 0.9$, respectively.

The expected power criterion leads to a completely different sample size pattern. Since expected power is defined conditional on a relevant effect, large prior uncertainty increases the weight in the upper tails of the power curve where power quickly approaches one. Consequently, for small prior means, larger uncertainty decreases the required sample size. For large prior means, however, smaller prior uncertainty leads to smaller sample sizes since again more weight is concentrated in the tails of the power curve.

The characteristics of the prior-quantile approach very much depend on the choice of $\gamma$. When using the conditional prior median ($\gamma = 0.5$) the approach is qualitatively similar to the expected power approach. This is due to the fact that computing power on the conditional median of the prior is close to computing power on the conditional prior mean. Since the power function is locally linear around the centre of mass of the conditional prior, this approximates computing expected power by interchanging forming the expected value and computing power (i.e., first average the prior and then compute power or average over power with weights given by the conditional prior). For a stricter criterion ($\gamma = 0.9$) the required sample sizes are much larger. This is due to the fact that the quantile approach does not allow a trade-off between power in the upper tails of the power curve and regions with low power. Higher uncertainty then decreases the $(1 - \gamma)$-quantile towards the minimal relevant effect and thus increases the required sample size.

In theory, one might be inclined to derive a sample size based on the probability of success instead of using expected power. Consider a situation in which the a priori probability of $\Theta \geq \theta_{\mathrm{MCID}}$ is 0.51. The probability of success is then only 41% (for 80% expected power, since $0.51 \cdot 0.8 \approx 0.41$) or 46% (for 90% expected power, since $0.51 \cdot 0.9 \approx 0.46$). A sponsor might want to increase these relatively low unconditional success probabilities by deriving a sample size based on a minimal $\mathrm{PoS}(n)$ of $1 - \beta$ instead. The choice of $1 - \beta$ is limited by the a priori probability of a relevant effect (0.51 in this case). Using equation (23) a minimal probability of success of 0.5 is equivalent to requiring an expected power of more than 98% ($0.5 / 0.51 \approx 0.98$). In essence, the attempt to increase $\mathrm{PoS}(n)$ via a more stringent threshold on $\mathrm{EP}(n)$ implies that low a priori chances of success are to be offset with an almost certain detection ($\mathrm{EP}(n) \approx 1$) in the unlikely event of an effect actually being present. The ethical implication of this approach to sample size derivation is that an extremely large number of patients would be exposed to treatment although the sponsor more or less expects it to be ineffective.

This example demonstrates that the (situation agnostic) conventions on power levels cannot be transferred to thresholds for probability of success without adjustment for the situation-specific a priori probability of a relevant effect. It thus seems much easier to directly impose a constraint on expected power, which implicitly adjusts for the a priori probability of a relevant effect via equation (23).

To investigate the difference between the probability of success, $\mathrm{PoS}(n)$, and the marginal probability of rejecting the null hypothesis, $\mathrm{PoS}'(n)$, Figure 3 visualises the proportion of the individual components of $\mathrm{PoS}'(n)$ for varying prior standard deviation and prior means. The sample size is fixed at $n = 150$, $\theta_0 = 0$, the maximal type I error rate is α = 0 . θ MCID = 0 . ( n ) is mostly negligible unless the prior is sharply peaked at an effect size slightly smaller than the null. The a priori probability of a relevant effect size is close to zero in these cases and so is $\mathrm{PoS}'(n)$. For the more practically relevant scenarios with prior mean greater than $\theta_0 = 0$, the contribution of the average type I error rate to $\mathrm{PoS}'(n)$ is almost negligible. Still, if $\theta_{\mathrm{MCID}} > \theta_0$, $\mathrm{PoS}'(n)$ might be inflated substantially by rejections under parameter values that are non-null but also not clinically relevant. This phenomenon evidently depends on the magnitude of $\theta_{\mathrm{MCID}}$; the more of the prior mass is concentrated in $[0, \theta_{\mathrm{MCID}}]$, the larger the contribution towards $\mathrm{PoS}'(n)$. If $\theta_{\mathrm{MCID}} = 0$, the numeric difference between $\mathrm{PoS}$ and $\mathrm{PoS}'$ is negligible since the maximal type I error rate is controlled at level $\alpha$ and the power curve quickly approaches zero on the interior of the null hypothesis. This was already pointed out by Spiegelhalter et al. [2004], who argue that $\mathrm{PoS}'(n)$ can be used as approximation to $\mathrm{PoS}(n)$ in many practically relevant situations.

[Figure 3: grid of pie charts; x-axis: prior mean; y-axis: prior standard deviation; segments A, B, C.]

Figure 3: Components of $\mathrm{PoS}'(n)$ for $n = 150$, $\theta_0 = 0$, α = 0 . θ MCID = 0 . ( n ); proportions in individual pie charts correspond to: A = probability to reject and null effect (type I error), B = probability to reject and irrelevant but non-null effect, C = probability to reject and relevant effect (PoS).

To further investigate the properties of the random variable $\Pr_\Theta[Z_n > z_{1-\alpha}]$, we consider three example prior configurations with means $-0.25$, $0.3$, $0.5$ and standard deviations $0.4$, $0.125$, $0.05$, respectively. The corresponding sample sizes to reach an expected power of at least 80% are 854, 126, and 32. Figure 4 shows the unconditional and conditional (on a relevant effect) priors, the corresponding probability of rejecting the null as a function of $\theta$, and histograms of the distributions of random power ($\mathrm{RPow}(n)$), and the unconditional probability to reject the null hypothesis ($\mathrm{RPR}(n)$).

Clearly, the distributions of the conditional and unconditional rejection probabilities (random power and random probability to reject, respectively) are qualitatively very different in the three situations. In the first case (mean $-0.25$, standard deviation 0.4), the prior mass is mostly concentrated on the null hypothesis and the normalising factor that links the unconditional and the conditional prior is clearly noticeable. The conditional prior then assigns most weight to values of $\theta$ close to the relevance threshold leading to a large required sample size. The large sample size then implies a steep power curve and a distribution of $\mathrm{RPow}(n)$ that is highly right-skewed towards 1 since $\mathrm{RPow}(n)$ is conditional on $\Theta > \theta_{\mathrm{MCID}}$. If the unconditional distribution of the rejection probability is considered instead ($\mathrm{RPR}(n)$), the characteristic u-shape discussed in Rufibach et al. [2016] is recovered.

For the intermediate setting (mean 0.3, standard deviation 0.125), most prior mass is already concentrated on relevant values. The difference between conditional and unconditional prior is less pronounced (the normalising factor is closer to 1) and even the unconditional distribution of the rejection probability is no longer u-shaped.

[Figure 4: panels titled mean=−0.25, sd=0.40, n=172; mean=0.30, sd=0.12, n=109; mean=0.50, sd=0.05, n=33. Top: conditional and unconditional prior PDF and probability to reject against $\theta$; bottom: histograms of random power and random probability to reject.]

Figure 4: Top: conditional and unconditional prior PDF and power curves corresponding to $n^*_{\mathrm{EP}}$ for 1 − β = 0 . θ is almost certainly highly relevant. Since the normalising factor is thus close to 1, there is no discernible difference between the conditional and the unconditional prior densities. Not surprisingly, the assumption of a highly relevant effect with high certainty only requires a small sample size (32). This leads to a relatively flat curve of the rejection probability and to a peaked distribution of $\mathrm{RPow}(n)$ and $\mathrm{RPR}(n)$.
Flat power curves and high certainty about the effect size tend to result in peaked distributions of $\mathrm{RPow}(n)$ and $\mathrm{RPR}(n)$ because the power curve is almost linear in the region of the parameter where the prior mass is concentrated. The distribution of $\mathrm{RPow}(n)$ and $\mathrm{RPR}(n)$ is thus well approximated by a linear transformation of the (conditional) prior, which is a peaked truncated normal distribution. Since conditioning has almost no effect, the unconditional distribution of the probability to reject is the same in this case.

Interestingly, both settings with higher a priori uncertainty lead to a high chance of exceeding a power of 80%. This is due to the fact that the rare occurrence of very low rejection probabilities needs to be compensated to achieve an overall expected power of 80%.

To compare the results in the previous section with the prior quantile-based approach, we consider the intermediate example with prior mean 0.3 and prior standard deviation 0.2 again. For this situation, the

[Figure 5: panels titled gamma = 0.50, 1 − beta = 0.70, n = 56; gamma = 0.50, 1 − beta = 0.80, n = 71; gamma = 0.90, 1 − beta = 0.70, n = 244; gamma = 0.90, 1 − beta = 0.80, n = 311. Top: conditional and unconditional prior PDF and probability to reject against $\theta$; bottom: histograms of random power and random probability to reject.]

Figure 5: Top: conditional and unconditional prior PDF for a truncated normal prior on $[-0.3, 0.7]$ with mean 0.3, standard deviation 0.2 and power curves corresponding to $n^*_\gamma$ for 1 − β = 0 . . θ MCID = 0 . γ = 0 . , . . . 8, the corresponding curves of the rejection probability, and histograms of the distribution of the rejection probability are given in Figure 5.

The required sample sizes depend heavily on the choice of $\gamma$. The crucial difference between the quantile-based and the expected power approach is that for the prior quantile approach the exact distribution of power below the target value of 80% is irrelevant; only the total mass of the distribution below this critical point matters. This means that the sample size for the γ = 0 . the exact amount by which the target power is undershot is that there is a relatively high chance of ending up with a severely underpowered study in these cases. Increasing the certainty to exceed a power of 80% or 70% by setting $\gamma = 0.9$, however, leads to substantially larger required sample sizes than under the expected power approach.

The example demonstrates the problems arising from having to specify both $\gamma$ and $1 - \beta$. While this allows more fine-grained control over the distribution of the (conditional) rejection probability, there seems to be no canonical choice for $\gamma$, which is critical in determining the required sample size.
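The quantile approach discussed above can be sketched in a few lines. Because the power curve is increasing in $\theta$, requiring the random power to exceed $1 - \beta$ with prior probability $\gamma$ amounts to powering the trial at the $(1 - \gamma)$-quantile of the conditional prior. The sketch below assumes a one-sided test with $Z_n \sim N(\sqrt{n}\,\theta, 1)$ (that is, $\sigma = 1$ and $\theta_0 = 0$), $\alpha = 0.025$, and the Figure 5 prior (normal with mean 0.3 and standard deviation 0.2, truncated above at 0.7) with $\theta_{\mathrm{MCID}} = 0.1$ as in the Figure 2 caption. The function names are hypothetical and the code is not the authors' implementation.

```python
from math import ceil
from statistics import NormalDist

STD = NormalDist()


def conditional_quantile(p, mean, sd, theta_mcid, upper):
    """p-quantile of a N(mean, sd^2) prior conditioned on theta_mcid <= Theta <= upper."""
    base = NormalDist(mean, sd)
    lo, hi = base.cdf(theta_mcid), base.cdf(upper)
    return base.inv_cdf(lo + p * (hi - lo))


def required_n_quantile(gamma, target_power, mean, sd, theta_mcid, upper, alpha=0.025):
    """Smallest n whose power at the (1 - gamma)-quantile of the conditional prior
    reaches target_power, assuming Z_n ~ N(sqrt(n) * theta, 1) (illustrative assumption)."""
    theta = conditional_quantile(1 - gamma, mean, sd, theta_mcid, upper)
    z_sum = STD.inv_cdf(1 - alpha) + STD.inv_cdf(target_power)
    return ceil((z_sum / theta) ** 2)


for gamma in (0.5, 0.9):
    for target in (0.7, 0.8):
        n = required_n_quantile(gamma, target, mean=0.3, sd=0.2, theta_mcid=0.1, upper=0.7)
        print(f"gamma={gamma}, 1-beta={target}: n={n}")
```

Under these assumptions the sketch yields n = 56 and 71 for $\gamma = 0.5$ and n = 244 and 311 for $\gamma = 0.9$, which agrees with the panel titles of Figure 5.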
[Figure 6: panels showing the prior PDF against $\theta$ (conditional prior, unconditional prior), the probability to reject against $\theta$ for the designs EP (n = 218), MCID (n = 3140), quantile, 0.5 (n = 120), and quantile, 0.9 (n = 834), and the CDF of random power.]

Figure 6: Left panel: prior PDF and conditional prior PDF (on Θ > θ MCID = 0 . θ for the expected power design (EP), the design powered for $\theta_{\mathrm{MCID}}$ (MCID), the design based on power at the conditional prior median (quantile 0.5), and the design using the conditional 0.1-quantile, i.e. γ = 0 . > θ MCID = 0 . 05) for the four different design choices.
To make things more tangible, consider the case of a clinical trial designed to demonstrate superiority of an intervention over a historical control group with respect to the endpoint of overall survival. To stay within the framework of (approximately) normally distributed test statistics, we assume that effect sizes are given on a standardised log hazard scale, i.e., $\theta = 0$ corresponds to no difference in overall survival and θ > − . , . . . θ MCID = 0 . 05. This setting corresponds to an a priori probability of a relevant effect of approximately 0 . − β = 0 . θ MCID (MCID), at $Q_{0.5}[\Theta \geq \theta_{\mathrm{MCID}}] \approx 0.26$ (quantile, 0.5), at $Q_{0.1}[\Theta \geq \theta_{\mathrm{MCID}}] \approx 0.10$ (quantile, 0.9), or a minimal expected power of 1 − β = 0 . n = 3140. The quantile approach (with $\gamma = 0.9$) already reduces this to $n = 834$ while still maintaining an a priori chance of 90% to exceed the target power of 80%. The quantile approach with γ = 0 . n = 120 at the cost of only having a 50% chance to exceed the target power of 80%. The EP approach is more liberal than the quantile approach ($\gamma = 0.9$) with $n = 218$ but still guarantees a chance of exceeding the target power of roughly 75%. A sample size based on PoS( n ) ≥ − β = 0 . a priori probability of a relevant value is lower than 0.8. The large spread between the derived sample sizes shows how sensitive the required sample size is to the changes in the power constraint. Clearly, the MCID approach is highly ineffective, as accepting a small chance to undershoot the target power with the quantile approach ($\gamma = 0.9$) reduces the required sample size from $n = 3140$ to roughly a quarter ($n = 834$). At the other extreme, constraining power only on the conditional prior median (quantile approach, $\gamma = 0.5$) leads to a rather unattractive a priori distribution of the random power: by definition, the probability to exceed a rejection probability of 0 . . a priori chance of ending up with a severely underpowered study is non-negligible.

These considerations leave the trial-sponsor with essentially two options. Either a range of scenarios for the quantile approach with values of γ between 0 . . γ could be reached by considering the corresponding distributions of $\mathrm{RPow}(n)$, or the intermediate EP approach could be used. We assume that the trial-sponsor accepts the implicit trade-off inherent to expected power and decides to base the sample size derivation on the EP approach. The required sample size for an (expected) power of 80% is then $n = 218$. Note that this still means that there is a roughly one-in-five a priori probability to end up in a situation with less than 50% power (see Figure 6, CDF panel).

[Figure 7: x-axis: expected power; y-axis: implied reward [average per-patient costs].]

Figure 7: Utility-maximising implied reward $\lambda$ for varying expected power levels in the situation discussed in Section 7.5.

In a situation where 1 − β = 0 . λ directly. One may then guide decision making by computing the values of $\lambda$ that lead to the same required sample size for a range of values of $\beta$. Figure 7 shows this ‘implied reward’ as function of the minimal expected power constraint. An expected power of 0 . . · . · $US. The utility-maximising reward for an expected power of 0 . . · $US. Even without committing to a fixed value of $\lambda$, these considerations can be used to guide the decision as to which of the ‘standard’ power levels (0.8 or 0.9) might be more appropriate in the situation at hand.

Of course, one might also directly optimise utility if the reward upon successful rejection of the null hypothesis can be specified. To that end, assume that a reward of 100 · $US is expected. Under the same assumption about average per-patient costs, this translates to λ ≈ n = 329 and the corresponding utility-maximising expected power is 0 .

The concept of ‘hybrid’ sample size derivations based on Bayesian priors for planning and the design’s frequentist error rate properties is well-established in the literature on clinical trial design. Nevertheless, the substantial variation in the terminology used and small differences in the exact definition of the terms used can be confusing. We have tried our best to formulate a consistent naming scheme, to be explicit about the exact definitions, highlight connections between the different quantities (see Figure 1), and to relate back to previous authors (see Table 1). Any naming scheme necessarily has a subjective element to it and ours is by no means exempt from this problem (see also https://xkcd.com/927/). We do hope, however, that our review encourages a clearer separation between terminology for joint probabilities (avoiding the use of the word ‘power’) and for probabilities that condition on the presence of an effect (‘power’ seems more appropriate here), as well as a more transparent distinction between arguments based on a priori likelihood of effects and their relevance. We also strongly believe that an explicit definition (in formulae) of any quantities used should be given when discussing the subject. Merely referring to terms like ‘expected power’ or ‘probability of success’ is too ambiguous given their inconsistent use in the literature.

Often, the main argument for a ‘hybrid’ approach to sample size derivation is the fact that the uncertainty about the true underlying effect can be incorporated in the planning of a design.
This is certainly a major advantage but it is equally important that the ‘hybrid’ approach allows a very natural distinction between arguments relating to the (relative) a priori likelihood of different parameter values (encoded in the prior density) and relevance arguments (encoded in the choice of $\theta_{\mathrm{MCID}}$). The fact that these two components can be represented naturally within the ‘hybrid’ approach has the potential to make sample size derivation much more transparent.

The ‘hybrid’ quantity considered most commonly in the literature is the marginal probability to reject $H_0$. Often, it is not clear whether the authors are aware of the fact that this quantity includes the error of rejecting the null hypothesis incorrectly, i.e. when $\theta < \theta_{\mathrm{MCID}}$. In many practical situations this problem is numerically negligible and $\mathrm{PoS}'(n) \approx \mathrm{PoS}(n)$, i.e., the marginal probability to reject the null hypothesis is approximately the same as the joint probability of a non-null effect and the rejection of the null hypothesis. If, however, the definition of ‘success’ also takes into account a non-trivial relevance threshold $\theta_{\mathrm{MCID}} > \theta_0$, the distinction becomes more important in practice. Given the great emphasis on strict type I error rate control in the clinical trials community it seems at least strange to implicitly consider type I errors as ‘successful’ trial outcomes. Beyond these principled considerations, a practical advantage of $\mathrm{PoS}(n)$ over $\mathrm{PoS}'(n)$ is the direct and simple connection to $\mathrm{EP}(n) = \mathrm{PoS}(n) / \Pr[\Theta \geq \theta_{\mathrm{MCID}}]$. While $\mathrm{EP}(n)$ is independent of the a priori probability of a relevant effect and only depends on the relative a priori likelihood of different effects through the conditional prior, $\mathrm{PoS}(n)$ does directly depend on $\Pr[\Theta \geq \theta_{\mathrm{MCID}}]$. Although Spiegelhalter et al. [2004] see this as a disadvantage of $\mathrm{EP}(n)$, it is actually a necessary property to use it for sample size derivation without re-calibrating the conventional values for $1 - \beta$ (see also Brown et al., 1987). If one tried to derive a sample size such that PoS( n ) = 0 . ≥ θ MCID ] < . 8. In a situation where $\Pr[\Theta \geq \theta_{\mathrm{MCID}}]$ exceeds 0.8 only slightly, the expected power ($\mathrm{EP}(n)$) would have to be close to 1 to compensate for the a priori probability of a relevant effect. In essence, one would thus increase the sample size in situations where the efficacy of the new treatment is still uncertain. This would put more study participants at risk just to make sure that the treatment effect is detected almost certainly if it is indeed present. The use of $\mathrm{PoS}(n)$ for sample size derivation thus only makes sense in a setting where the threshold $1 - \beta$ is adapted to the a priori probability of a relevant effect. The simplest way to do so is by using $\mathrm{PoS}(n) \geq \Pr[\Theta \geq \theta_{\mathrm{MCID}}]\,(1 - \beta)$ which is, however, entirely equivalent to $\mathrm{EP}(n) \geq 1 - \beta$. Another option to derive situation-specific thresholds is via utility maximisation, and $\mathrm{PoS}(n)$ is a key term in the simple expected utility function proposed in Section 4. Ultimately, $\mathrm{PoS}(n)$ and $\mathrm{EP}(n)$ can be used interchangeably once the prior distribution is fixed as long as the respective multiplicative factor is taken into account. The main advantage of $\mathrm{PoS}(n)$ is that it is an unconditional probability which might be easier to interpret by practitioners, while $\mathrm{EP}(n)$ can be readily used in conjunction with an already established power threshold in a research field.

A slightly different concept to sample size derivation via expected power is what we call the ‘quantile approach’. This approach uses a different functional of the probability to reject the null hypothesis given a relevant effect. Instead of the mean, we propose to use a $\gamma$ quantile of this distribution. Compared to expected power, this allows direct control of the left-tail of the a priori distribution of the probability to reject the null hypothesis given a relevant effect. This can be desirable since a sample size derived via a threshold for expected power might still lead to a substantial chance of ending up with an underpowered study.
This can be avoided with the quantile approach and a higher value for γ (see Figure 5). The quantile approach is also relatively easy to implement in practice, since it is just a Bayesian justification for powering on a point alternative. This flexibility comes at the price of having to specify an additional parameter, γ, which governs the acceptable risk of ending up with an underpowered study. Theoretically, both expected power and the prior quantile approach are perfectly viable ways to determine a sample size. Whichever approach is preferred, it is certainly advisable to plot not only the corresponding power curves but also the resulting distribution of RPow(n) (see Figure 4). In essence, the problem of defining a ‘hybrid’ power constraint boils down to finding a summary functional of the power curve that reflects the planning objectives. Ideally, one would like to control the a priori distribution of RPow(n) such that it is sharply peaked around a certain target value, avoiding both over- and underpowered studies. Yet, controlling both location (e.g., mean) and spread (e.g., standard deviation) of the distribution of RPow(n) is impossible. A second constraint on the standard deviation of RPow(n) in addition to the mean constraint (expected power) would lead to an over-determined problem since there is only one free parameter, n. To increase expected power, the sample size must be increased. The standard deviation of RPow(n), however, decreases as the sample size is lowered since this flattens the power curve of the resulting test (the standard deviation would be 0 if the power curve was constant).
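The quantile approach and the interplay between mean and standard deviation of RPow(n) described above can be explored with a short Python sketch. It is a hedged illustration only: it assumes a one-sided one-sample z-test with Z_n ~ N(√n θ, 1), a hypothetical normal prior, a hypothetical θ_MCID = 0.1, and it reads γ as the required a priori probability, given a relevant effect, that the power reaches at least 1 − β. Under these assumptions, power is increasing in θ, so the constraint reduces to powering on the (1 − γ) quantile of the prior conditional on a relevant effect. All function names are illustrative.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def power(theta, n, alpha=0.025):
    """Pr_theta[Z_n > z_{1-alpha}] for Z_n ~ N(sqrt(n) * theta, 1) (assumed test)."""
    return STD.cdf(math.sqrt(n) * theta - STD.inv_cdf(1 - alpha))


def conditional_quantile(p, prior, mcid):
    """p-quantile of Theta given Theta >= mcid."""
    f_mcid = prior.cdf(mcid)
    return prior.inv_cdf(f_mcid + p * (1 - f_mcid))


def quantile_sample_size(prior, mcid, gamma, alpha=0.025, beta=0.2):
    """Smallest n with Pr[RPow(n) >= 1 - beta | Theta >= mcid] >= gamma."""
    theta_q = conditional_quantile(1 - gamma, prior, mcid)  # point alternative
    z_sum = STD.inv_cdf(1 - alpha) + STD.inv_cdf(1 - beta)
    return math.ceil((z_sum / theta_q) ** 2)


def prob_adequate_power(n, prior, mcid, alpha=0.025, beta=0.2):
    """Pr[RPow(n) >= 1 - beta | Theta >= mcid]."""
    theta_star = (STD.inv_cdf(1 - alpha) + STD.inv_cdf(1 - beta)) / math.sqrt(n)
    return (1 - prior.cdf(max(theta_star, mcid))) / (1 - prior.cdf(mcid))


def rpow_mean_sd(n, prior, mcid, alpha=0.025, steps=2000):
    """Mean (i.e. EP(n)) and standard deviation of RPow(n) given Theta >= mcid."""
    upper = max(mcid, prior.mean) + 10 * prior.stdev  # prior mass beyond is negligible
    h = (upper - mcid) / steps
    mass = 1 - prior.cdf(mcid)
    m1 = m2 = 0.0
    for i in range(steps):
        theta = mcid + (i + 0.5) * h
        weight = prior.pdf(theta) * h / mass  # conditional prior weight (midpoint rule)
        p = power(theta, n, alpha)
        m1 += weight * p
        m2 += weight * p * p
    return m1, math.sqrt(max(m2 - m1 * m1, 0.0))


# Hypothetical planning scenario (all values are assumptions for illustration).
prior = NormalDist(mu=0.3, sigma=0.2)
mcid = 0.1

for gamma in (0.5, 0.75, 0.9):
    n = quantile_sample_size(prior, mcid, gamma)
    attained = prob_adequate_power(n, prior, mcid)
    print(f"gamma = {gamma:4.2f}: n = {n:4d}, Pr[RPow(n) >= 0.8 | relevant] = {attained:.3f}")

for n in (25, 50, 100, 200, 400):
    mean, sd = rpow_mean_sd(n, prior, mcid)
    print(f"n = {n:3d}: mean of RPow(n) = {mean:.3f}, sd of RPow(n) = {sd:.3f}")
```

Because n is the only free parameter, each row of the second table fixes the mean and the standard deviation of RPow(n) simultaneously; tabulating both side by side is one way to inspect the resulting distribution alongside the power curves.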
Both conflicting objectives (high expected power, low standard deviation of power) thus cannot be fulfilled at the same time.

Finally, it should be stressed again that the key frequentist property of strict type I error rate control of the designs is not affected by the fact that the arguments for calculating a required sample size are Bayesian. In fact, at no point is Bayes’ theorem invoked (i.e., the posterior distribution of the effect size is not required). The Bayesian perspective is merely a principled and insightful way of specifying a weight function (prior density) that can then be used to guide the choice of the power level of the design, or as Brown et al. [1987] put it: “This proposed use of Bayesian methods should not be criticised by frequentists in that these methods do not replace any current statistical techniques, but instead offer additional guidance where current practice is mute”.

Supplemental Materials
The code required to reproduce the figures is available at https://github.com/kkmann/sample-size-calculation-under-uncertainty. A permanent backup of the exact version of the repository used for this manuscript is available under the digital object identifier 10.5281/zenodo.3899943 (release 0.2.1). An interactive version of the repository at the time of publication is hosted at https://mybinder.org/v2/gh/kkmann/sample-size-calculation-under-uncertainty/0.2.1?urlpath=lab/tree/notebooks/figures-for-manuscript.ipynb using Binder [Jupyter et al., 2018]. A simple shiny app implementing the sample size calculation procedures is available at https://mybinder.org/v2/gh/kkmann/sample-size-calculation-under-uncertainty/0.2.1?urlpath=shiny/apps/sample-size-calculation-under-uncertainty/.

Funding
DSR was funded by the Biometrika Trust and the Medical Research Council (MC_UU_00002/6).
Conflicts of interest
None to declare.
References
Walter T Ambrosius, Tamar S Polonsky, Philip Greenland, David C Goff Jr, Letitia H Perdue, Stephen P Fortmann, Karen L Margolis, and Nicholas M Pajewski. Design of the Value of Imaging in Enhancing the Wellness of Your Heart (VIEW) trial and the impact of uncertainty on power. Clinical Trials, 9(2):232–246, 2012.
Peter Bauer, Frank Bretz, Vladimir Dragalin, Franz König, and Gernot Wassmer. Twenty-five years of confirmatory adaptive designs: opportunities and pitfalls. Statistics in Medicine, 35(3):325–347, 2016.
Barry W Brown, Jay Herson, E Neely Atkinson, and M Elizabeth Rozell. Projection from previous studies: a Bayesian and frequentist compromise. Controlled Clinical Trials, 8(1):29–44, 1987.
Pierpaolo Brutti, Fulvio De Santis, and Stefania Gubbiotti. Bayesian-frequentist sample size determination: a game of two priors. Metron, 72(2):133–151, 2014.
Kevin J Carroll. Decision making from phase II to phase III and the probability of success: reassured by “assurance”? Journal of Biopharmaceutical Statistics, 23(5):1188–1200, 2013.
Ding-Geng Chen and Jenny K Chen. Statistical power and Bayesian assurance in clinical trial design. In New Frontiers of Biostatistics and Bioinformatics, pages 193–200. Springer, 2018.
Ding-Geng Chen and Shuyen Ho. From statistical power to statistical assurance: It’s time for a paradigm change in clinical trial design. Communications in Statistics-Simulation and Computation, 46(10):7957–7971, 2017.
Christy Chuang-Stein. Sample size and the probability of a successful trial. Pharmaceutical Statistics, 5(4):305–309, 2006.
Christy Chuang-Stein, Simon Kirby, Ian Hirsch, and Gary Atkinson. The role of the minimum clinically important difference and its impact on designing a trial. Pharmaceutical Statistics, 10(3):250–256, 2011.
Maria M Ciarleglio and Christopher D Arendt. Sample size determination for a binary response in a superiority clinical trial using a hybrid classical and Bayesian procedure. Trials, 18(1):83, 2017.
Maria M Ciarleglio, Christopher D Arendt, Robert W Makuch, and Peter N Peduzzi. Selection of the treatment effect for sample size determination in a superiority clinical trial using a hybrid classical and Bayesian procedure. Contemporary Clinical Trials, 41:160–171, 2015.
Maria M Ciarleglio, Christopher D Arendt, and Peter N Peduzzi. Selection of the effect size for sample size determination for a continuous response in a superiority clinical trial using a hybrid classical and Bayesian procedure. Clinical Trials, 13(3):275–285, 2016.
Jonathan A Cook, Steven A Julious, William Sones, Lisa V Hampson, Catherine Hewitt, Jesse A Berlin, Deborah Ashby, Richard Emsley, Dean A Fergusson, Stephen J Walters, Edward C F Wilson, Graeme MacLennan, Nigel Stallard, Joanne C Rothwell, Martin Bland, Louise Brown, Craig R Ramsay, Andrew Cook, David Armstrong, Doug Altman, and Luke D Vale. DELTA2 guidance on choosing the target difference and undertaking and reporting the sample size calculation for a randomised controlled trial. BMJ, 363:k3750, 2018.
Adam Crisp, Sam Miller, Douglas Thompson, and Nicky Best. Practical experiences of adopting assurance as a quantitative framework to support decision making in drug development. Pharmaceutical Statistics, 17(4):317–328, 2018.
James Flinn Crook and Irving John Good. The powers and strengths of tests for multinomials and contingency tables. Journal of the American Statistical Association, 77(380):793–802, 1982.
Takashi Daimon. Bayesian sample size calculations for a non-inferiority test of two proportions in clinical trials. Contemporary Clinical Trials, 29(4):507–516, 2008.
Nigel Dallow, Nicky Best, and Timothy H Montague. Better decision making in drug development through adoption of formal prior elicitation. Pharmaceutical Statistics, 17(4):301–316, July 2018. ISSN 1539-1612. doi: 10.1002/pst.1854.
Joseph A DiMasi, Henry G Grabowski, and Ronald W Hansen. Innovation in the pharmaceutical industry: new estimates of R&D costs. Journal of Health Economics, 47:20–33, 2016.
Raphael Gillett. An average power criterion for sample size estimation. Journal of the Royal Statistical Society: Series D (The Statistician), 43(3):389–394, 1994.
Heiko Götte, Marietta Kirchner, and Martin Oliver Sailer. Probability of success for phase III after exploratory biomarker analysis in phase II. Pharmaceutical Statistics, 16(3):178–191, 2017.
Jean-Marie Grouin, Maylis Coste, Pierre Bunouf, and Bruno Lecoutre. Bayesian sample size determination in non-sequential clinical trials: Statistical aspects and some regulatory considerations. Statistics in Medicine, 26(27):4914–4924, 2007.
Peter Hu. Probability of success: estimation framework, properties and applications. Stat, 3(1):158–171, 2014.
Joseph G Ibrahim, Ming-Hui Chen, Mani Lakshminarayanan, Guanghan F Liu, and Joseph F Heyse. Bayesian probability of success for clinical trials using historical data. Statistics in Medicine, 34(2):249–264, 2015.
Kaihong Jiang. Optimal sample sizes and go/no-go decisions for phase II/III development programs based on probability of success. Statistics in Biopharmaceutical Research, 3(3):463–475, 2011.
Jupyter, M Bussonnier, J Forde, J Freeman, B Granger, T Head, C Holdgraf, K Kelley, G Nalvarte, A Osheroff, et al. Binder 2.0-reproducible, interactive, sharable environments for science at scale. In Proceedings of the 17th Python in Science Conference, volume 113, page 120, 2018.
Nelson Kinnersley and Simon Day. Structured approach to the elicitation of expert beliefs for a Bayesian-designed clinical trial: a case study. Pharmaceutical Statistics, 12(2):104–113, 2013.
S Kirby, J Burke, C Chuang-Stein, and C Sin. Discounting phase 2 results when planning phase 3 clinical trials. Pharmaceutical Statistics, 11(5):373–385, 2012.
Herm J Lamberink, Willem M Otte, Michel RT Sinke, Daniël Lakens, Paul P Glasziou, Joeri K Tijdinke, and Christiaan H Vinkers. Statistical power of clinical trials increased while effect size remained stable: an empirical analysis of 136,212 clinical trials between 1975 and 2014. Journal of Clinical Epidemiology, 102:123–128, 2018.
KK Gordan Lan and Janet T Wittes. Some thoughts on sample size: a Bayesian-frequentist hybrid approach. Clinical Trials, 9(5):561–569, 2012.
Russell V Lenth. Some practical guidelines for effective sample size determination. The American Statistician, 55(3):187–193, 2001.
Dennis V Lindley. The choice of sample size. Journal of the Royal Statistical Society: Series D (The Statistician), 46(2):129–138, 1997.
Fang Liu. An extension of Bayesian expected power and its application in decision making. Journal of Biopharmaceutical Statistics, 20(5):941–953, 2010.
J Oakley and A O’Hagan. SHELF: the Sheffield Elicitation Framework, 2019. URL .
Nancy A Obuchowski. Sample size calculations in studies of test accuracy. Statistical Methods in Medical Research, 7:392, 1998.
Anthony O’Hagan and John W Stevens. Bayesian assessment of sample size for clinical trials of cost-effectiveness. Medical Decision Making, 21(3):219–230, 2001.
Anthony O’Hagan, John W Stevens, and Michael J Campbell. Assurance in clinical trial design. Pharmaceutical Statistics, 4(3):187–201, 2005.
Shijie Ren and Jeremy E Oakley. Assurance calculations for planning clinical trials with time-to-event outcomes. Statistics in Medicine, 33(1):31–45, 2014.
K Rufibach, HU Burger, and M Abt. Bayesian predictive power: choice of prior and some recommendations for its use as probability of success in drug development. Pharmaceutical Statistics, 15(5):438–446, 2016.
Gaelle Saint-Hilary, Veronique Robert, and Mauro Gasparini. Decision-making in drug development using a composite definition of success. Pharmaceutical Statistics, 17(5):555–569, 2018.
Gaelle Saint-Hilary, Valentine Barboux, Matthieu Pannaux, Mauro Gasparini, Veronique Robert, and Gianluca Mastrantonio. Predictive probability of success using surrogate endpoints. Statistics in Medicine, 38(10):1753–1774, 2019.
Yongzhao Shao, Vandana Mukhi, and Judith D Goldberg. A hybrid Bayesian-frequentist approach to evaluate clinical trial designs for tests of superiority and non-inferiority. Statistics in Medicine, 27(4):504–519, 2008.
D J Spiegelhalter and L S Freedman. A predictive approach to selecting the size of a clinical trial, based on subjective clinical opinion. Statistics in Medicine, 5(1):1–13, 1986.
David J Spiegelhalter, Keith R Abrams, and Jonathan P Myles. Bayesian approaches to clinical trials and health-care evaluation, volume 13. John Wiley & Sons, 2004.
Stephen K Thompson. Sampling. John Wiley & Sons, Hoboken, New Jersey, 3rd edition, 2012.
Rosalind J Walley, Claire L Smith, Jeremy D Gale, and Phil Woodward. Advantages of a wholly Bayesian approach to assessing efficacy in early drug development: a case study. Pharmaceutical Statistics, 14(3):205–215, 2015.
Meihua Wang, G Frank Liu, and Jerald Schindler. Evaluation of program success for programs with multiple trials in binary outcomes. Pharmaceutical Statistics, 14(3):172–179, 2015.
Ming-Dauh Wang. Applications of probability of study success in clinical drug development. In Applied Statistics in Biomedicine and Clinical Trials Design, pages 185–196. Springer, 2015.
Yanping Wang, Haoda Fu, Pandurang Kulkarni, and Christopher Kaiser. Evaluating and utilizing probability of study success in clinical development. Clinical Trials, 10(3):407–413, 2013.
Chi Heem Wong, Kien Wei Siah, and Andrew W Lo. Estimation of clinical trial success rates and related parameters. Biostatistics, 20(2):273–286, 2019.
Jianliang Zhang and Jenny J Zhang. Joint probability of statistical success of multiple phase III trials.