A hypothesis test of feasibility for external pilot trials assessing recruitment, follow-up and adherence rates
Duncan T. Wilson, Rebecca E. A. Walwyn, Julia Brown, Amanda J. Farrin
Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, UK ([email protected])
Abstract
The power of a large clinical trial can be adversely affected by low recruitment, follow-up and adherence rates. External pilot trials estimate these rates and use them, via pre-specified decision rules, to determine if the definitive trial is feasible and should go ahead. There is little methodological research underpinning how these decision rules, or the sample size of the pilot, should be chosen. In this paper we propose a hypothesis test of the feasibility of a definitive trial, to be applied to the external pilot data and used to make progression decisions. We quantify feasibility by the power of the planned trial, as a function of recruitment, follow-up and adherence rates. We use this measure to define hypotheses to test in the pilot, propose a test statistic, and show how the error rates of this test can be calculated for the common scenario of a two-arm parallel group definitive trial with a single normally distributed primary endpoint. We use our method to re-design TIGA-CUB, an external pilot trial comparing a psychotherapy with treatment as usual for children with conduct disorders. We then extend our formulation to include using the pilot data to estimate the standard deviation of the primary endpoint, and incorporate this into the progression decision.
Randomised Controlled Trials (RCTs) often fail to recruit to target [1], leading to an analysis with less power than intended. Power can also be adversely affected by large rates of attrition, poor adherence to the allocated treatment [2], and incorrect specification of a nuisance parameter such as the standard deviation of a continuous endpoint [3]. A common approach to anticipate these problems is to run a small version of the intended trial, known as an external pilot, to obtain estimates of the relevant parameters and decide if, and how, the definitive trial should be conducted [4, 5]. This decision is typically made with reference to so-called
Progression Criteria (PCs), a set of conditions which must be satisfied for the definitive trial to be considered feasible [6]. In the case of quantitative PCs, parameter estimates are computed using the pilot data and then compared against threshold values. If all estimates exceed their respective thresholds, it is recommended that the definitive trial should be conducted. A recent workshop identified that recruitment, follow-up, and intervention adherence have emerged as common targets of progression criteria [6].

The key role of PCs in pilot trials has led to the CONSORT extension for randomised pilots requiring their reporting [7], and the NIHR, one of the major funders of pilot trials in the UK, requiring their pre-specification in the research plan [8]. There are ethical and economic imperatives to ensuring PCs are appropriate. If they are too lenient, we increase the risk of proceeding to the definitive trial only to discover it is infeasible, failing to answer the scientific question and in the process wasting resources and subjecting patients to research unnecessarily. If they are too strict, we increase the risk of failing to conduct a definitive trial which would in fact have been feasible, and as a result withholding a promising intervention from patients. Despite all this, there is little methodological research available to help researchers determine optimal PCs [6].

Another important aspect of external pilot trial design is the choice of sample size. Here, methods have generally focussed on obtaining a sufficiently precise estimate of the standard deviation of a continuous primary outcome to inform the sample size calculation of the definitive trial [3, 9, 10, 11, 12, 13]. These methods, often reduced to simple rules-of-thumb such as requiring 35 participants in each arm [3], are nevertheless widely used to choose pilot sample size when the estimation of the standard deviation is not the only, or primary, objective.
When the goal of the pilot is to provide estimates of recruitment, follow-up and adherence rates and use these in PCs, imprecise estimates will be more likely to meet or miss a PC threshold by chance alone [12, 14] and thus lead to incorrect progression decisions. This issue will be compounded whenever there are several PCs, with progression to the definitive trial permitted only if all are satisfied. As the number of PCs grows, so does the probability of missing at least one threshold through bad luck, even when the precision around each individual estimate appears adequate: the so-called 'reverse-multiplicity' problem seen in multiple testing procedures [15, 16]. Although the sample size of pilots is often justified by reporting the anticipated precision in the estimates of feasibility parameters (e.g. the widths of confidence intervals), there is no clear guidance on how precise they should be, or how to weigh precision against the cost of sampling.

A more formal approach to the design and analysis of external pilot trials could employ a hypothesis testing framework. Taking this view, progression to the definitive trial would be determined by comparing an appropriate statistic to a critical value rather than using several independent PCs. To implement this approach we must be able to identify levels of recruitment, follow-up and adherence which would correspond to feasible or infeasible definitive trials, and identify an appropriate statistic that can differentiate between them. Given these, the critical value and the pilot sample size can then be chosen with regards to the long-run error rates they lead to. Although many authors have advised against conducting formal hypothesis tests in pilot trials [17, 18, 19, 12], these warnings have been in the context of tests of effectiveness.
Assuming the effect size of interest is the same in the pilot and the definitive trial and that conventional type I error rates are used, such a test will have low power. Tests assessing rates of recruitment, follow-up and adherence will not necessarily suffer from the same restriction. Moreover, it should be emphasised that the conventional method of pre-specifying several PCs and using these to map pilot data to progression decisions is mathematically equivalent to a multivariate hypothesis test, but without hypotheses being defined and therefore without any understanding of the statistical properties of the resulting procedure. A formal approach is therefore of interest, both to gain an understanding of current practice and to investigate if, and how, progression decisions can be improved.

The remainder of the paper is structured as follows. First, we will define the specific problem under consideration in Section 2. In Section 3 we will describe a formal hypothesis test of feasibility based on recruitment, follow-up and adherence rates. We will show how null and alternative hypotheses can be defined in terms of the power which will be obtained in the definitive trial, define an appropriate test statistic, and use the statistic's sampling distribution to define and calculate type I and II error rates. We then use the method to re-design an external pilot trial comparing a psychotherapy with treatment as usual for children with conduct disorders in Section 4. In Section 5 we study the properties of the proposed test in a range of scenarios and compare its performance against the conventional approach to choosing pilot sample size and PCs. We extend the method in Section 6 to allow for the additional goal of estimating the standard deviation of the primary outcome, before concluding with a discussion of implications and limitations in Section 7.
Consider an external pilot planned in advance of a large two-arm parallel group trial which will compare an intervention with control based on a normally distributed primary endpoint. We assume that the definitive trial has a target sample size n_t to be recruited from a pool of n_e eligible patients. Each of the n_e eligible patients will agree to take part in the trial with probability φ_r, and recruitment will continue until either the target n_t has been reached or the pool of eligible patients has been exhausted. We denote by N the total number of participants in the definitive trial, a random variable with a truncated binomial distribution of size n_e, probability φ_r, and constrained to be less than or equal to n_t.

We assume that the N recruited participants of the definitive trial will be randomised equally between the two arms. In the control arm, F_0 participants will be successfully followed up with probability φ_f. Thus, F_0 | N ∼ Bin(N/2, φ_f). In the intervention arm, participants may or may not be followed up, and may or may not adhere to the intervention. We allow for the possibility that these binary outcomes will be correlated at the participant level by using a multinomial distribution, such that participants will both adhere and be followed up with probability p_11; adhere, but be lost to follow-up with probability p_10; not adhere, but be successfully followed up with probability p_01; and neither adhere nor be followed up with probability p_00. We can parametrise the model with the marginal rate of follow-up φ_f, the rate of adherence conditional on being followed up, φ_a, and an odds ratio φ_or:

p_11 = φ_a φ_f,
p_01 = φ_f − p_11,
p_10 = φ_or p_11 (1 − p_11 − p_01) / (p_01 + φ_or p_11),
p_00 = 1 − p_11 − p_01 − p_10.

Note that we have assumed a constant marginal rate of follow-up in the intervention and control arms.
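The four equations above can be sketched directly in code. The following is an illustrative Python translation (the paper's own implementation is in R and C++; the function name and example values here are ours):

```python
def joint_probs(phi_f, phi_a, phi_or):
    """Joint probabilities of the four (adherence, follow-up) outcomes in the
    intervention arm, given the marginal follow-up rate phi_f, the adherence
    rate conditional on follow-up phi_a, and the odds ratio phi_or."""
    p11 = phi_a * phi_f                      # adhere and followed up
    p01 = phi_f - p11                        # do not adhere, followed up
    # Odds of adherence among those not followed up are phi_or times the
    # odds of adherence among those followed up:
    p10 = phi_or * p11 * (1 - p11 - p01) / (p01 + phi_or * p11)
    p00 = 1 - p11 - p01 - p10                # neither adhere nor followed up
    return p11, p10, p01, p00

# Hypothetical values: 90% follow-up, 70% adherence given follow-up.
p11, p10, p01, p00 = joint_probs(phi_f=0.9, phi_a=0.7, phi_or=1.5)
assert abs(p11 + p10 + p01 + p00 - 1) < 1e-12   # probabilities sum to one
assert abs(p11 + p01 - 0.9) < 1e-12             # marginal follow-up recovered
```

Note that when φ_or = 1 the parametrisation reduces to independent follow-up and adherence, with p_10 = φ_a(1 − φ_f).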
We denote the number followed up in the intervention arm by F_1, where F_1 | N ∼ Bin(N/2, φ_f), and the number of those followed up who also adhere by A, where A | F_1 ∼ Bin(F_1, φ_a).

We assume that non-adherence will be absolute in the sense that no treatment effect will be gained. We model the outcome for patient i in arm j as

y_ij = t_j a_ij µ + e_ij,

where t_j and a_ij are binary indicators of treatment arm and adherence respectively, µ is the difference in mean outcome between the two groups, e_ij ∼ N(0, σ²) is the residual term, and we have omitted the usual constant intercept for notational simplicity and without loss of generality. We will assume initially that σ² is known, although this will be relaxed in Section 6. For simplicity the primary analysis of the definitive trial is assumed to be a complete-case intention-to-treat z-test of the observed mean difference Ȳ_1 − Ȳ_0 with a null hypothesis of H_0: µ = 0, where Ȳ_j = (1/F_j) Σ_{i=1}^{F_j} y_ij.

We assume that the external pilot trial will take the same form as the definitive trial, but on a smaller scale. The model of recruitment, follow-up and adherence in the pilot trial is assumed to be the same as for the definitive trial apart from one aspect: for the small external pilot trial we will continue recruiting until the target pilot sample size of n_p has been reached. Thus, we consider the achieved sample size of the pilot to be known and fixed, with the number of eligible patients approached but declining to participate following a negative binomial distribution S ∼ NB(n_p, φ_r). Denoting the parameters by φ = (φ_r, φ_f, φ_a), the external pilot trial will provide an estimate φ̂. Our goal is to show how a decision rule mapping φ̂ to the set {stop, go} of progression decisions can be defined, and how the parameters of the rule and the pilot sample size n_p can be chosen.

3 A hypothesis test of feasibility
A decision rule mapping pilot estimates φ̂ to the set {stop, go} of progression decisions can be specified and evaluated under a framework of hypothesis testing. Specifically, given a model of the pilot trial and thus of the sampling distribution of φ̂, we can calculate the probabilities of a given decision rule leading to 'stop' and 'go' decisions conditional on the true parameter values φ. If we can identify those values φ for which we would like to make a 'stop' decision, and likewise those for which we would like to make a 'go' decision, we can calculate long-run error rates of the first and second kind and use these to inform our choice of decision rule and of pilot trial sample size.

In this section we first propose that a planned definitive trial can be classified as either feasible or infeasible according to the power it will have in its final analysis. Power will be determined by the parameter φ, and so this provides a means to define null and alternative hypotheses to be tested in the pilot trial. We begin by deriving an expression for the definitive trial power as a function of φ, and then go on to define hypotheses. We then propose a test statistic to be used in the pilot trial and provide its sampling distribution, allowing long-run error rates to be calculated. Finally, we show how these error rates can be used to determine an optimal design for the pilot trial, encompassing both its sample size and the decision rule to be used at its conclusion.

The power of the definitive trial is determined by the sampling distribution of the difference in group means, Ȳ_1 − Ȳ_0. We assume that the definitive trial per-arm sample size will be greater than 30, allowing the sampling distributions of the group means to be approximated by normal distributions. The power of the definitive trial to detect a difference µ is then

Φ( E[Ȳ_1 − Ȳ_0 | µ] / √Var(Ȳ_1 − Ȳ_0 | µ) − z_{1−α} ),
where α denotes the (one-sided) type I error rate and Φ denotes the standard normal distribution function. The expectation and variance of the mean difference will depend on the values of µ, φ, n_t and n_e. For clarity, we will drop the terms µ, n_t and n_e from our notation as they can be considered fixed, and focus on power as a function of φ, denoted g(φ). Then,

g(φ) = Φ( x(φ) − z_{1−α} ).

The appendix shows that

x(φ) = φ_a µ √( φ_f E[N | φ_r] ) / √( 4σ² + 2µ² φ_a (1 − φ_a) ),

where E[N | φ_r] is the expected number of participants recruited into the definitive trial when the recruitment rate is φ_r,

E[N | φ_r] = Σ_{k=0}^{n_t − 1} k (n_e choose k) φ_r^k (1 − φ_r)^{n_e − k} + n_t Pr(C ≥ n_t),

and C ∼ Bin(n_e, φ_r).

We propose to define null and alternative hypotheses by considering the power which would be obtained in the definitive trial, which will aim to recruit n_t participants from a pool of n_e eligible patients. We require a power threshold p_0 to be identified, such that if the power of the definitive trial was known to be less than or equal to p_0 then it would be considered infeasible and we would like to minimise the chance of mistakenly proceeding to it (a type I error). Similarly, we require a second threshold p_1 such that if the definitive trial had a power of at least p_1 we would consider it to be feasible and would like to minimise the chance of mistakenly concluding otherwise (a type II error).

Given the thresholds p_0 and p_1, we can define the null (alternative) hypothesis as those parameter values which would lead to a definitive trial with power less than or equal to p_0 (greater than or equal to p_1). We denote these hypotheses by Φ_0 and Φ_1 respectively, and can express them in terms of the previously defined x(φ):

Φ_0 = { φ ∈ Φ | x(φ) ≤ x_0 },
Φ_1 = { φ ∈ Φ | x(φ) ≥ x_1 },

where x_i = Φ^{−1}(p_i) + z_{1−α}.
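The quantities E[N | φ_r], x(φ) and g(φ) defined above are straightforward to compute. A minimal Python sketch (the paper's implementation is in R and C++; the function names and illustrative inputs here are ours):

```python
from math import comb, sqrt
from statistics import NormalDist

def expected_n(phi_r, n_t, n_e):
    """E[N | phi_r]: mean of a Bin(n_e, phi_r) count truncated above at n_t."""
    pmf = [comb(n_e, k) * phi_r**k * (1 - phi_r)**(n_e - k)
           for k in range(n_e + 1)]
    return sum(k * pmf[k] for k in range(n_t)) + n_t * sum(pmf[n_t:])

def x_stat(phi, mu, sigma, n_t, n_e):
    """x(phi) for phi = (phi_r, phi_f, phi_a)."""
    phi_r, phi_f, phi_a = phi
    return (phi_a * mu * sqrt(phi_f * expected_n(phi_r, n_t, n_e))
            / sqrt(4 * sigma**2 + 2 * mu**2 * phi_a * (1 - phi_a)))

def power(phi, mu, sigma, n_t, n_e, alpha=0.025):
    """g(phi) = Phi(x(phi) - z_{1 - alpha}), with one-sided alpha."""
    z = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(x_stat(phi, mu, sigma, n_t, n_e) - z)

# Illustrative values: with perfect recruitment, follow-up and adherence,
# mu = 0.3, sigma = 1, n_t = 468 and n_e = 1000 give roughly 90% power.
```

Worse adherence both shrinks the expected mean difference and inflates the outcome variance, so x(φ), and hence power, falls on both counts.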
Testing feasibility in the external pilot trial will involve calculating a statistic based on the pilot estimate φ̂ and proceeding to the definitive trial if and only if it exceeds some pre-specified critical value c. A natural choice of statistic is x(φ̂). We can then define the type I and II error rates of the pilot trial, which will be determined by the pilot sample size n_p and the critical value c:

α(n_p, c) = max_{φ ∈ Φ_0} Pr[ x(φ̂) > c | φ, n_p ],
β(n_p, c) = max_{φ ∈ Φ_1} Pr[ x(φ̂) < c | φ, n_p ].

To obtain type I and II error rates for a particular pilot trial of sample size n_p per arm and critical value c, we require an expression for the probability that the pilot test statistic will be greater than c conditional on some true parameter value φ, denoted

h(n_p, c, φ) ≡ Pr[ x(φ̂) > c | n_p, φ ].

Denote by n_af the value taken by the multinomial variable recording follow-up and adherence outcomes, and by N_af the set of possible realisations. The power of the pilot trial can be calculated by considering each value in N_af and calculating the probability that the observed recruitment rate will be sufficient for the test x(φ̂) > c to pass. Given adherence and follow-up estimates φ̂_a and φ̂_f, this condition will be met when the estimated recruitment rate φ̂_r is such that

E[N | φ̂_r] > c² (4σ² + 2µ² φ̂_a (1 − φ̂_a)) / (µ² φ̂_a² φ̂_f) = ñ.

Since φ̂_r = 2n_p / (2n_p + s), we can find the largest value of s such that E[N | φ̂_r] > ñ. Noting this will be a function of n_af and denoting it by s̃(n_af), the power of the pilot trial is then

h(n_p, c, φ) = Σ_{n_af ∈ N_af} Pr[ S ≤ s̃(n_af) | φ ] p_af(n_af | φ).   (1)

Here, p_af(·) is the multinomial density describing follow-up and adherence outcomes.

Equation 1 allows for follow-up and adherence outcomes to be correlated, which may be appropriate if we expect that a participant who does not adhere to the intervention delivery is also less likely to adhere to other aspects of the trial protocol, including data collection. If this is not thought to be the case, we can assume follow-up and adherence are independent and equation 1 simplifies to

h(n_p, c, φ) = Σ_{a=0}^{n_p} Σ_{f=0}^{n_p} Pr[ S ≤ s̃(a, f) | φ ] p_f(f | φ) p_a(a | φ),

where p_f(·) and p_a(·) are the densities describing the follow-up and adherence outcomes respectively. By avoiding a summation over the multinomial space N_af, which can be very large for moderate n_p, this reduces the number of terms to be evaluated and thus decreases computation time.

Recall that the type I (II) error rate of the pilot trial is defined as the largest probability of proceeding (failing to proceed) to the definitive trial when the true parameter is in the null (alternative) hypothesis. That is,

α(n_p, c) = max_{φ ∈ Φ_0} h(n_p, c, φ),
β(n_p, c) = max_{φ ∈ Φ_1} 1 − h(n_p, c, φ).

One approach is to specify a set of candidate sample sizes n_p and critical values, solving the above optimisation problems for each, and then choosing the (n_p, c) pair which is deemed to give the best balance between the costs of sampling and the two types of errors. This approach has the drawbacks of requiring an appropriate range of c a priori, and no sharing of information between the discrete optimisation problems. An alternative formulation which avoids these drawbacks is to (for fixed n_p) cast the problem as one of constrained bi-objective optimisation over the space ℝ × Φ × Φ, simultaneously searching for a critical value c and two points in the parameter space, φ_0 and φ_1, which maximise the type I and II error rates whilst satisfying the constraints that φ_0 ∈ Φ_0 and φ_1 ∈ Φ_1.
Formally, we solve

max ( h(n_p, c, φ_0), 1 − h(n_p, c, φ_1) )   (2)
subject to c ∈ ℝ, φ_0 ∈ Φ_0, φ_1 ∈ Φ_1.

A problem of this nature can be solved numerically using the NSGA-II algorithm [20], as implemented in the R package 'mco' [21]. It will provide a set of critical values and corresponding points in the null and alternative hypotheses offering different balances between type I and type II error rates. These error rates can then be plotted for a number of choices of n_p, and an appropriate design selected from them. To ensure a reasonably fast computation time, we implemented the function h(n_p, c, φ) in C++ via the 'Rcpp' package [22]. Full details, including all code used to produce the results described throughout the paper, are provided in the supplementary materials.

An alternative to the proposed test is to follow the conventional approach and make the stop/go decision based on several independent progression criteria. Decision rules are then defined by three critical values c = (c_f, c_a, c_r), where we proceed to the definitive trial only when φ̂_f > c_f, φ̂_a > c_a and φ̂_r > c_r. Assuming independence between the parameter estimates, the probability of this event is

f(n_p, c, φ) = Pr[φ̂_r > c_r | φ] × Pr[φ̂_f > c_f | φ] × Pr[φ̂_a > c_a | φ]
             = F_S(2n_p/c_r − 2n_p | φ) × [1 − F_F(2n_p c_f | φ)] × [1 − F_A(n_p c_a | φ)],

where F_S(·), F_F(·) and F_A(·) are the cumulative distribution functions of the random variables S, F and A in the pilot, respectively. For any given choice of pilot sample size n_p and progression criteria c, type I (II) error rates are defined as before by maximising f(n_p, c, φ) over the null (alternative) hypothesis. To find a set of progression criteria which offer different trade-offs between minimising type I and II error rates, we solve the following bi-objective optimisation problem:

min_{c ∈ [0,1]³} ( max_{φ ∈ Φ_0} f(n_p, c, φ), max_{φ ∈ Φ_1} 1 − f(n_p, c, φ) ).
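The probability f(n_p, c, φ) that all three independent progression criteria are met can be computed directly from the three cumulative distribution functions. A minimal Python sketch (ours, not the authors' implementation), taking F ∼ Bin(2n_p, φ_f), A ∼ Bin(n_p, φ_a) (treating the adherence denominator as fixed), and S as the negative binomial number of decliners:

```python
from math import ceil, comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(min(int(x), n) + 1))

def negbinom_cdf(x, r, p):
    """P(S <= x) for S counting failures before the r-th success."""
    return sum(comb(k + r - 1, k) * p**r * (1 - p)**k
               for k in range(int(x) + 1))

def f_go(n_p, c, phi):
    """Probability that all three independent PCs are met, for
    c = (c_f, c_a, c_r) and phi = (phi_r, phi_f, phi_a)."""
    c_f, c_a, c_r = c
    phi_r, phi_f, phi_a = phi
    # phi_hat_r = 2 n_p / (2 n_p + S) > c_r  <=>  S < 2 n_p (1 - c_r) / c_r
    pr_r = negbinom_cdf(ceil(2 * n_p * (1 - c_r) / c_r) - 1, 2 * n_p, phi_r)
    pr_f = 1 - binom_cdf(2 * n_p * c_f, 2 * n_p, phi_f)   # phi_hat_f > c_f
    pr_a = 1 - binom_cdf(n_p * c_a, n_p, phi_a)           # phi_hat_a > c_a
    return pr_r * pr_f * pr_a
```

With hypothetical values n_p = 30, c = (0.75, 0.5, 0.5) and φ = (0.9, 0.9, 0.9), the 'go' probability is close to one; degrading adherence to φ_a = 0.5 reduces it substantially, since the rule is conjunctive and any one criterion can block progression.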
We use NSGA-II again to solve the outer bi-objective minimisation problem. For the inner optimisation problems, we assume that the solutions will lie on the boundaries of the hypotheses and therefore have a constraint, φ ∈ Φ_i, meaning that the search is over two dimensions. For example, we can search over (φ_r, φ_a) ∈ [0,1]², since φ_f will then be determined by

φ_f = x_i² (4σ² + 2µ² φ_a (1 − φ_a)) / (φ_a² µ² E[N | φ_r]).

Given the low dimension of the search space, we obtain approximate solutions by searching over a fine grid of values for (φ_r, φ_a), in increments of 0.005. This procedure is computationally efficient since the error rates at each point in the grid can be calculated in a vectorised manner, with very little overhead in comparison to a sequential evaluation. Again, full details are provided in the supplementary materials.

TIGA-CUB [23] was a two-arm, parallel group, individually-randomised external pilot trial aiming to determine the feasibility of a definitive trial comparing second-line, short-term, manualised psychoanalytic child psychotherapy with treatment as usual for children with conduct disorders. The trial's objectives included estimation of the rate at which eligible dyads (where a child-carer dyad was the unit of randomisation and observation) consented to take part in the trial; the level of missing data in the primary outcome; the rate of adherence to the intervention; and the parameters required for the sample size calculation of the main study. Progression criteria relating to recruitment, follow-up and adherence rates were specified, with the definitive trial to go ahead only if:

1. Recruitment was to target;
2. Attendance was at more than 50% of sessions in the intervention arm;
3. At least 75% of follow-up data was collected.

The sample size was determined by using the rule-of-thumb that 30 participants per arm is sufficient to estimate a standard deviation of a continuous primary outcome which is common across arms [17].
Assuming 90% of pilot participants are followed up, a sample size of 60 participants in total was chosen and justified in terms of the expected width of 95% confidence intervals around estimates of the standard deviation (0.39 multiplied by the SD), the follow-up rate, and the recruitment rate (± 8% to 18%).

To revisit the choice of sample size and progression criteria, we first need to define some aspects of the planned definitive trial. We begin by setting the minimal clinically important effect size to be detected and the outcome standard deviation, here assumed to be µ = 0.3 and σ = 1 respectively. We then decide on the planned design of the definitive trial, in terms of the number of eligible patients who will be approached and the target total sample size. For illustrative purposes we suppose these are equal to n_e = 1000 and n_t = 514. Note that this target sample size, after allowing for 10% attrition, would lead to a definitive trial with 90% power to detect µ = 0.3. We consider that the definitive trial would be feasible if it would provide a power of at least 80%, and so we set p_1 = 0.8. Similarly, we consider that if the trial could only provide a power of 65% or less then we would like to obtain a 'stop' decision from the pilot, and so we set p_0 = 0.65. These parameters specify the null and alternative hypotheses which are to be tested in the pilot trial. They can be visualised by considering pairs of recruitment and adherence rates and finding the corresponding value of the follow-up rate that would lead to a definitive trial power of exactly p_0 = 0.65 for the null, or p_1 = 0.8 for the alternative (Figure 1). As the recruitment rate increases beyond around 50% there are no opportunities for trade-offs with the remaining parameters, because it becomes certain that the recruitment target of n_t will be met and so the sample size will not increase any further.

Given the hypotheses, and assuming that follow-up and adherence rates are independent, we can now consider different sample sizes and progression criteria and calculate the error rates they lead to. For illustration, we consider pilot trial sample sizes of n_p = 30, 50, 70 per arm and solve optimisation problem 2 for each case. The resulting operating characteristics are plotted in Figure 2 (solid lines). In this case we see that error rates with the original sample size of n_p = 30 are quite poor in comparison to the nominal values typically seen in early phase drug trials. For example, a type I error rate of 0.09 corresponds to a type II error rate of 0.44. As we would expect, increasing the external pilot sample size leads to improved error rates. A pilot sample size of 50 participants per arm, for example, reduces the type II error rate to around 0.23 whilst maintaining the type I error rate.
Figure 1: Values of recruitment, follow-up and adherence parameters for leadingto a definitive trial power of p = 0 .
65 (i.e. the boundary of the null hypothesis,left panel), or p = 0 . n p = 50 , α = 0 .
09 and β = 0 .
23. The correspondingcritical value is c = 2 . x ( ˆ φ ) > . φ , isgreater than Φ(2 . − z − . ) = 0 . n p = 30 , ,
70 we searched over the full set of possible PC thresh-olds c to identify a non-dominated subset, where a threshold c is non-dominatedif there does not exist another threshold c ′ which leads to lower type I and lowertype II error rates. In this example we find that setting several independent PCswill lead to very poor error rates. For example, when n p = 30 we must accept atype II error rate of 0.9 (i.e. a power of 0.1) to obtain a type I error rate of 0.1.In fact, regardless of the specific thresholds used, error rates will be no betterthan what would be obtained when making progression decisions by tossing a11 .00.20.40.60.81.0 0.0 0.2 0.4 0.6 0.8 1.0 a b n p Method
Feasibility testStandard practice
Figure 2: Type I ( α ) and type II ( β ) error rates obtained for a range of criticalvalues c and pilot sample sizes n p when using the proposed method (solid lines).Error rates available when using conventional progression criteria are shown forcomparison (dashed lines).(possibly biased) coin. Increasing the pilot sample size does not improve errorrates, but actually makes them worse. These counter-intuitive results can beexplained by the fact that independent PCs are of an intrinsically conjunctive nature, requiring all three endpoints to show a positive result. In contrast, ourhypotheses have been defined in a more disjunctive manner, where a positiveresult in only one endpoint may be enough to indicate overall feasibility due tothe acknowledged trade-offs involved. To illustrate the point further, considerthe case of a pilot trial sample size tending to infinity so that ˆ φ = φ and piloterror rates will be either 0 or 1. In this idealised scenario, a type II error rateof β = 0 will be obtained only if c f < φ f , c a < φ a , c r < φ r ∀ φ ∈ Φ . For thisinfinite sample size to also lead to α = 0, a necessary condition is therefore that φ ′ = (cid:18) min Φ φ f , min Φ φ a , min Φ φ r (cid:19) Φ . For example, in this case we find φ ′ = ( φ ′ f = 0 . , φ ′ a = 0 . , φ ′ r = 0 . x ( φ ′ ) = 1 . < x and so φ ′ is in the null hypothesis. Thus, a pilot sample sizetending to infinity and with independent PC thresholds chosen to give β = 0will have α = 1.Aside from the overall inefficiency of independent PCs, we found that thereare many possible choices for independent PC thresholds which are dominatedby others. For example, for n p = 30 we find that c = ( c f = 0 . , c a =0 . , c r = 0 . c ′ = ( c f = 0 . , c a = 0 . , c r = 0 .
4) will give a worse type I error (0.74) and a worse type II error (0.88). This suggests that if independent PCs areto be used under the conventional approach, as they were in TIGA-CUB, thechoice of threshold values should be informed by a careful statistical analysis oftheir properties if we are to avoid needlessly inefficient decision rules. p and β An alternative approach to pilot trial design results from the observation that,if the alternative hypothesis threshold p is fixed, we can find a critical value c which will lead to some desired pilot type II error rate independently of p .Keeping c fixed at this value, we can then look at a range of values for p andplot the corresponding type I error rate. The evaluation of a pilot with samplesize n p and critical value c can then be based on the probability of making a godecision as a function of p , rather than focussing on a specific point.For example, suppose we take p = 0 . c = 2 .
46 will achieve this error rate. We can then keep c fixed at this value andcalculate the corresponding type I error rate as we vary p . Doing this againfor pilot sample sizes of 50 and 70, we plot the results in Figure 3. We see thatwhile a large pilot sample of n p = 70 will ensure there is only a 0.03 probabilityof proceeding to a definitive trial with a true power of p = 0 .
6, it will also meanthat when p ≈ .
725 there will only be a 0.5 chance of proceeding. This might be considered rather low, when we have defined the alternative hypothesis using p_1 = 0.8. In comparison, the associated values when n_p = 30 are 0.24 and 0.78.

To understand to what extent the results given in the above illustrative example will be found more widely, we applied both the proposed test and the conventional method to a number of different scenarios. Recall that a scenario is defined by: the parameters describing the primary outcome of the definitive trial (mean μ and variance σ²); its planned design (target sample size n_t and number of eligible patients n_e); and the power thresholds used to define our null and alternative hypotheses (p_0 and p_1 respectively). Scenarios are distinguished from one another in terms of the null and alternative hypotheses they imply, which are determined by x(φ) and the thresholds p_0, p_1. We will fix p_1 = 0.8 and consider p_0 = 0.5, 0.6, 0.7. We note that the denominator in x(φ) will be dominated by its first term for typical standardised effect sizes (i.e. less than around 0.5). The expected definitive trial sample size in the numerator depends on n_e, n_t and φ_r and so, for fixed φ, the terms μ, σ, n_e and n_t reduce to one factor when determining x(φ). As such, we focus on varying one of these parameters while keeping the others fixed. We choose to fix μ = 0.3, σ = 1 and n_e = 1000 whilst varying n_t. We consider n_t = 468, 514 and 562, the sample sizes obtained when asking for 90% power and allowing for attrition of 0, 10 and 20% respectively.

Figure 3: Type I error rates for pilot trials of different sample sizes n_p, as a function of the null hypothesis parameter p_0, with the type II error rate maintained at 0.1.

For each scenario we calculated the operating characteristics of an external pilot with sample size n_p = 30, 50 and 70 per arm. The resulting error rates are illustrated in Figure 4.

We see that inflating the target sample size leads to a small increase in pilot error rates. For example, for n_p = 30, n_t = 468 and p_0 = 0.7, the error rates are α = 0.08 and β = 0.58. Increasing the target sample size to n_t = 562 gives an increased type II error rate of β = 0.77 whilst maintaining type I error. This general behaviour can be explained by noting that recruiting more participants will mean we can tolerate worse follow-up and/or adherence whilst maintaining power. As a result the hypotheses will be larger, and in general we can expect error rates to increase as the size of the hypotheses they are defined over increases. Error rates are also sensitive to the power used to define the null hypothesis. For example, with pilot sample size n_p = 30 and definitive trial target sample size n_t = 468, reducing p_0 from 0.7 to 0.6 reduces the type II error rate from 0.58 to 0.20 at the cost of a small increase in the type I error rate, from 0.08 to 0.11. In all scenarios we see a considerable improvement in error rates when we increase the pilot sample size from 30 to 50 per arm, with less of a subsequent improvement moving from 50 to 70. Although the choice of sample size for a given pilot trial will depend on the costs of sampling and the consequences of type I and II errors, our results suggest that a sample size of around 50 per arm should be sufficient whenever p_1 = 0.8 and p_0 ≤ 0.65.

Figure 4: Type I (α) and type II (β) error rates obtained for a range of critical values c and pilot sample sizes n_p (30, 50, 70) when using the proposed method ('Feasibility test', solid lines) and the conventional approach ('Standard practice', dashed lines). The power used to define the null hypothesis increases from left to right, p_0 = 0.5, 0.6, 0.7, while the target sample size of the definitive trial increases from top to bottom, n_t = 468, 514, 562.

The error rates of the conventional approach similarly depend on n_t and p_0. As with the proposed method, error rates tend to increase as n_t increases, but in this case at a higher rate. Reducing p_0 leads to improved error rates. Even in the best scenario we considered, with n_t = 468 and p_0 = 0.6, the error rates which can be obtained using independent PCs are still no better than those of a random coin toss. We can conclude from these results that, when the proposed method of defining hypotheses is deemed appropriate, independent PCs should not be used to guide progression decisions in external pilot trials.
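To make these operating-characteristic calculations concrete, here is a minimal Python sketch of the expected definitive trial sample size E[N] and the power function g(φ) = Φ(x(φ) − z_{1−α}) used to define the hypotheses. The paper's own implementation is in R and is supplied in its supplementary material; the function and variable names below are our own.

```python
import math
from math import comb
from statistics import NormalDist

def expected_n(n_e, n_t, phi_r):
    """E[N]: expected definitive trial sample size when each of n_e eligible
    patients consents with probability phi_r, and recruitment stops once the
    target n_t has been reached."""
    e, p_ge = 0.0, 1.0
    for k in range(n_t):
        p_k = comb(n_e, k) * phi_r**k * (1 - phi_r)**(n_e - k)
        e += k * p_k
        p_ge -= p_k  # accumulate Pr(C >= n_t) by complement
    return e + n_t * p_ge

def power_g(phi_r, phi_f, phi_a, mu, sigma, n_e, n_t, alpha=0.025):
    """Definitive trial power g(phi) = Phi(x(phi) - z_{1-alpha})."""
    e_n = expected_n(n_e, n_t, phi_r)
    x = phi_a * mu * math.sqrt(phi_f * e_n) / math.sqrt(
        4 * sigma**2 + 2 * mu**2 * phi_a * (1 - phi_a))
    z = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(x - z)

# With perfect recruitment, follow-up and adherence, n_t = 468 recovers the
# nominal 90% power for a standardised effect of 0.3:
print(round(power_g(1.0, 1.0, 1.0, 0.3, 1.0, 1000, 468), 3))  # prints 0.901
```

This also illustrates why the scenarios above inflate n_t for attrition: lowering φ_f or φ_a reduces x(φ) and hence the power that the pilot test is trying to protect.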
We have considered how recruitment, follow-up and adherence rates all affect the power of the definitive trial, and have shown how these can be assessed in an external pilot trial as part of a hypothesis test. Another parameter which affects the power of the definitive trial is the standard deviation of the outcome measure: recall from Section 3.1 that the conditional power is

\[ g(\phi, \sigma) = \Phi\left( x(\phi, \sigma) - z_{1-\alpha} \right), \quad \text{where} \quad x(\phi, \sigma) = \frac{\phi_a \mu \sqrt{\phi_f E[N \mid \phi_r]}}{\sqrt{4\sigma^2 + 2\mu^2 \phi_a (1 - \phi_a)}}. \]

The methods described so far have assumed σ is known. When this is not the case we can extend our approach by including the true value of σ in the definition of the hypotheses Φ_0 and Φ_1, and the pilot estimate σ̂² in the test statistic x(φ̂, σ̂). This allows high variability in the outcome measure, which may lead to a trial with infeasibly low power, to be identified at the pilot stage.

For notational simplicity and ease of computation we will focus on the case where follow-up and adherence are independent, but the method extends naturally to the more general case. The power of the pilot trial is now approximately

\[ h(n_p, c, \phi, \sigma) = \sum_{a=0}^{n_p} \sum_{f=0}^{n_p} \left( \sum_{s=0}^{s_{\max}} \left[ \int_0^y p_{\hat\sigma^2}(\hat\sigma^2 \mid s, a, f, \phi) \, d\hat\sigma^2 \right] p_s(s \mid \phi) \right) p_f(f \mid \phi) \, p_a(a \mid \phi), \]

where p_{σ̂²}(·) is the density function of the sample variance; specifically,

\[ \hat\sigma^2 \mid f \sim \sigma^2 \frac{\chi^2_{f-1}}{f-1}. \]

The approximation comes from the fact that we would like to sum over the full range of possible values s ∈ [0, ∞) but must limit the summation to some value s_max. We set s_max to the upper 0.999 quantile of the negative binomial distribution with parameter φ_r, ensuring an accurate approximation whilst avoiding excessive computation. The upper limit of the integral, y, is the largest value that the sample variance could take and still lead to the test x(φ̂, σ̂) > c passing, given s, a and f. It is:

\[ y = \frac{\hat\phi_a^2 \mu^2 \hat\phi_f E[N \mid \hat\phi_r]}{4c^2} - \frac{\mu^2 \hat\phi_a (1 - \hat\phi_a)}{2}. \]
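The relationship between the integral's upper limit y and the critical value c can be sanity-checked numerically: solving x(φ̂, σ̂) = c for the sample variance recovers y. The sketch below uses our own function names and hypothetical pilot estimates, with E[N | φ̂_r] passed in as a precomputed number.

```python
import math

def x_stat(phi_a, phi_f, mu, e_n, sigma2):
    # Test statistic x(phi_hat, sigma_hat), written in terms of the
    # sample variance sigma2 = sigma_hat^2; e_n = E[N | phi_hat_r].
    return phi_a * mu * math.sqrt(phi_f * e_n) / math.sqrt(
        4 * sigma2 + 2 * mu**2 * phi_a * (1 - phi_a))

def var_upper_limit(phi_a, phi_f, mu, e_n, c):
    # y: the largest sample variance for which x(phi_hat, sigma_hat) > c.
    return (phi_a**2 * mu**2 * phi_f * e_n) / (4 * c**2) \
        - mu**2 * phi_a * (1 - phi_a) / 2

# Illustrative (hypothetical) pilot estimates:
phi_a, phi_f, mu, e_n, c = 0.9, 0.85, 0.3, 468, 1.5
y = var_upper_limit(phi_a, phi_f, mu, e_n, c)
assert abs(x_stat(phi_a, phi_f, mu, e_n, y) - c) < 1e-9  # statistic equals c at y
assert x_stat(phi_a, phi_f, mu, e_n, 0.9 * y) > c        # smaller variance: test passes
assert x_stat(phi_a, phi_f, mu, e_n, 1.1 * y) < c        # larger variance: test fails
```

The check confirms the algebra: the test passes exactly when the pilot sample variance falls below y.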
Empirically, we find that both type I and II error rates increase as the standard deviation decreases. To avoid the trivial situation where error rates tend to 1 as σ tends to zero, we include a lower limit σ* in the definitions of the hypotheses Φ_0 and Φ_1. These are now defined as

\[ \Phi_0 = \{ (\phi, \sigma) : \phi \in \Phi, \sigma > \sigma^*, g(\phi, \sigma) \le p_0 \}, \qquad \Phi_1 = \{ (\phi, \sigma) : \phi \in \Phi, \sigma > \sigma^*, g(\phi, \sigma) \ge p_1 \}. \]

To illustrate the effect of allowing for uncertainty in σ, we calculated the error rates available when n_p = 50, n_t = 514, p_0 = 0.65 and p_1 = 0.8, with the lower limit taken to be σ* = 0.8. We then compared these error rates with those previously obtained when it was assumed that σ = 1. For example, when σ was assumed known, a pilot trial type I error rate of 0.12 allowed a type II error rate of 0.23. Maintaining the same type I error but now estimating σ in the pilot increases the type II error to 0.39. The detriment stems from two issues. Firstly, using the pilot variance estimate in the test statistic x(φ̂, σ̂) increases its sampling variability. Secondly, allowing σ to vary (down to the lower limit σ* = 0.8) alongside φ_r, φ_a and φ_f in our hypotheses enlarges them, thus allowing more extreme error rates to be located. Error rates can be improved to some extent by increasing the sample size. For example, increasing n_p from 50 to 70 will lead to a type II error rate of 0.34 whilst maintaining the type I error rate at 0.12.

We have proposed a statistical test which simultaneously assesses recruitment, follow-up and adherence rates in order to anticipate and mitigate related problems which would render the planned definitive trial infeasible. We have shown that, without increasing the sample size of the pilot trial beyond typical values, the test can limit the probability of mistakenly running an underpowered trial whilst reliably ensuring well-powered trials will progress. We have described how the test can be extended to include the common pilot objective of estimating an unknown standard deviation nuisance parameter, and found that for this to be incorporated the sample size of the pilot trial will have to be increased if operating characteristics are to be maintained.

The proposed method leads to reasonable error rates when the pilot sample size is around 50 patients per arm. While this is more than some common rules-of-thumb for choosing an external pilot sample size, such as 25 [13] or 35 [3] per arm, it remains reasonable in the context of a pilot preceding a much larger trial.
In comparison, our study of the conventional approach to setting progression criteria has shown that, at least in the scenarios we considered, no amount of increase to the pilot sample size will lead to reasonable error rates. Although the proposed method is more complicated than the conventional approach, it is flexible in the sense that it can be applied in its full generality when the aim of the pilot is to estimate recruitment, adherence, follow-up and the outcome standard deviation, but could equally be used in the special case where only a subset of these parameters are of interest. We have produced a robust and efficient implementation in R and have provided this in the supplementary material, where we also show how to reproduce the results given in this paper.

Some assumptions have been made for simplicity and ease of explanation. In particular, we assumed that the rate of follow-up will be constant across the intervention and control arms of the trial. Relaxing this assumption is possible, although it would add another dimension when searching for the maximum error rates of a given pilot sample size and critical value. We also assumed that adherence is absolute, with non-adherers receiving no treatment effect. Violation of this assumption will lead to a general under-estimation of definitive trial power, which will affect the construction of the pilot hypotheses and subsequently the pilot error rates. Finally, we note that the assumptions of a complete-case ITT analysis (as we have made for the definitive trial) and of the successful recruitment of the target sample size (as we have made for the pilot trial) are both standard when carrying out power calculations.

We have focussed on a specific problem where the definitive trial will have a randomised two-arm design with a single, normally distributed primary outcome measure. Our method could be applied directly to any problem with an outcome that can be approximately modelled as normal.
For example, a binary endpoint can be incorporated using a normal approximation if the definitive trial sample size is sufficiently large; or a cluster randomised trial could be addressed if the endpoints are all measured at the cluster level and the variance parameters of the primary outcome were known. To extend the method to other problems we would require an expression for the power of the definitive trial (as a function of parameters φ), a statistic to use in the pilot trial, and an expression for the power of the pilot trial using that statistic. When analytic forms of the power functions cannot be found, they can be approximated using simulation [24] and, in principle, the optimisation problem in Section 3.4 could be solved as before. In practice, evaluating two objectives and two constraints using simulation would be computationally demanding and the NSGA-II algorithm would take an infeasibly long time to converge. Future work could explore how efficient global optimisation algorithms [25] could be used to solve these problems in a timely manner. A further extension could consider internal pilot trials, where the pilot data is combined with the definitive data in the final analysis. This would, however, induce a correlation between the pilot trial progression decision and the final test of treatment effectiveness. As a result, appropriate adjustments would be needed to ensure the type I error rate of the final analysis can be controlled at the nominal level.

Our method requires the specification of the number of eligible patients n_e and the target sample size n_t for the definitive trial. In practice, the exact values of these parameters may not be known at the pilot design stage, and in particular the results of the pilot may influence how they are set. However, note that n_e and n_t are only used, in conjunction with the power thresholds p_0, p_1, to specify the null and alternative hypotheses Φ_0, Φ_1.
Thus, we are free to use some hypothetical choices of n_e and n_t, providing we accept that any point φ ∈ Φ_i, i = 0, 1, remains in that hypothesis even if n_e or n_t are changed. For example, we may have defined the null hypothesis so that φ ∈ Φ_0 if and only if a definitive trial with design n_t, n_e would have less than 60% power under φ. For that same φ, we could increase the definitive trial sample size to some n'_e, n'_t such that its power would increase to 90%. Under our formulation we would still consider φ to be in the null, and would want to avoid running a definitive trial with sample size n'_e, n'_t and power g(φ, n'_e, n'_t) = 0.9. This corresponds to a judgement that the improvement in power obtained by moving from n_e, n_t to n'_e, n'_t is not sufficient to justify the increased cost of sampling.

Our method leads to a binary stop/go decision. Increasingly, progression criteria in pilot trials incorporate three outcomes, adding an intermediate 'amber' decision between the 'red' stop and 'green' go [6]. The intention is that if the pilot estimate is neither clearly good nor bad, but somewhere between, then it may be possible to make some modifications to the intervention or the trial protocol which would ensure the definitive trial will be feasible. A possible extension to the testing framework outlined here could consider defining three hypotheses where the decisions 'stop', 'modify' or 'go' would be optimal. In practice, the pilot trial could be designed using the proposed method and assuming no modifications will be possible, and, in the event that we wish to make modifications, another pilot could be done to assess their effect. This would appear to be in line with the iterative process of complex intervention development and evaluation suggested in the MRC framework [4]. Alternatively, if the modifications are made and we proceed directly to the definitive trial, we should be clear that the error rates associated with the pilot trial as designed will no longer apply.

In addition to the recruitment, follow-up and adherence rates and the standard deviation we have considered in this paper, another parameter of interest when deciding to run a definitive trial is the treatment effect μ [26]. In particular, if the true treatment effect is very small, we would ideally like to recognise this at the pilot stage and declare futility. To incorporate such an assessment into our method we could consider a bivariate test in the pilot, simultaneously assessing the power of the definitive trial and the true treatment effect. Defining the null and alternative hypotheses for such a test may be challenging, as it would require specifying the trade-offs we would be willing to make between definitive trial power and treatment effect. Future research could consider how methods for testing two outcomes in phase II drug trials [27, 28] could be applied in this context.

The proposed method could benefit from further methodological research. For example, the pilot test statistic was defined in a pragmatic manner and is not guaranteed to be optimal, and so further research could consider alternative statistics which might be more powerful when testing the hypotheses we have defined. A further criticism of the proposed method is the simplicity of the decision rule it suggests. Particularly in the case of external pilot trials of complex interventions, we would expect the decision of if and how the definitive trial should be conducted to be informed by many factors beyond the pilot estimates of a handful of parameters, not least qualitative outcomes. However, we believe the method will still provide a useful guide to decision making, and allows the choice of pilot sample size to be considered more fully. This view is supported by the continued prevalence of hypothesis testing in the design and analysis of all types of clinical trials, where the final decision is rarely made by rigidly following the result of the test.

Acknowledgements
We would like to thank Alex Wright-Hughes and the TIGA-CUB trial team fordiscussions which helped shape the scope of this paper.
Data availability statement
Data sharing is not applicable to this article as no new data were created oranalysed in this study. The code used to implement the methods and generatethe results of this paper is freely available at
Funding
This work was supported by the Medical Research Council [grant number MR/N015444/1].
References

[1] Ben G. O. Sully, Steven A. Julious, and Jon Nicholl. A reinvestigation of recruitment to randomised, controlled, multicenter trials: a review of trials funded by two UK funding agencies. Trials, 14(1):166, Jun 2013.

[2] Michael P. Fay, M. Elizabeth Halloran, and Dean A. Follmann. Accounting for variability in sample size estimation with applications to nonadherence and estimation of variance and effect size. Biometrics, 63(2):465–474, Dec 2006.

[3] M. Teare, Munyaradzi Dimairo, Neil Shephard, Alex Hayman, Amy Whitehead, and Stephen Walters. Sample size requirements to estimate key design parameters from external pilot randomised controlled trials: a simulation study. Trials, 15(1):264, 2014.

[4] Peter Craig, Paul Dieppe, Sally Macintyre, Susan Michie, Irwin Nazareth, and Mark Petticrew. Developing and evaluating complex interventions: the new Medical Research Council guidance. BMJ: British Medical Journal, 337, Sep 2008.

[5] Sandra M. Eldridge, Gillian A. Lancaster, Michael J. Campbell, Lehana Thabane, Sally Hopewell, Claire L. Coleman, and Christine M. Bond. Defining feasibility and pilot studies in preparation for randomised controlled trials: Development of a conceptual framework. PLOS ONE, 11(3):e0150205, Mar 2016.

[6] Kerry N. L. Avery, Paula R. Williamson, Carrol Gamble, Elaine O'Connell Francischetto, Chris Metcalfe, Peter Davidson, Hywel Williams, and Jane M. Blazeby. Informing efficient randomised controlled trials: exploration of challenges in developing progression criteria for internal pilot studies. BMJ Open, 7(2):e013537, Feb 2017.

[7] Sandra M. Eldridge, Claire L. Chan, Michael J. Campbell, Christine M. Bond, Sally Hopewell, Lehana Thabane, and Gillian A. Lancaster. CONSORT 2010 statement: extension to randomised pilot and feasibility trials. BMJ, page i5239, Oct 2016.

[8] National Institute for Health Research. Research for Patient Benefit (RfPB) programme guidance on applying for feasibility studies, 2017.

[9] Richard H. Browne. On the use of a pilot sample for sample size determination. Statistics in Medicine, 14(17):1933–1940, 1995.

[10] Steven A. Julious. Sample size of 12 per group rule of thumb for a pilot study. Pharmaceutical Statistics, 4(4):287–291, 2005.

[11] Julius Sim and Martyn Lewis. The size of a pilot study for a clinical trial should be calculated in relation to considerations of precision and efficiency. Journal of Clinical Epidemiology, 65(3):301–308, Mar 2012.

[12] Sandra M. Eldridge, Ceire E. Costelloe, Brennan C. Kahan, Gillian A. Lancaster, and Sally M. Kerry. How big should the pilot study for my cluster randomised trial be? Statistical Methods in Medical Research, 2015.

[13] Amy L. Whitehead, Steven A. Julious, Cindy L. Cooper, and Michael J. Campbell. Estimating the sample size for a pilot randomised trial to minimise the overall trial sample size for the external pilot and main trial for a continuous outcome variable. Statistical Methods in Medical Research, 2015.

[14] Cindy L. Cooper, Amy Whitehead, Edward Pottrill, Steven A. Julious, and Stephen J. Walters. Are pilot trials useful for predicting randomisation and attrition rates in definitive studies: A review of publicly funded trials. Clinical Trials, 0(0):1740774517752113, 2018. PMID: 29361833.

[15] Stephen Senn and Frank Bretz. Power and sample size when multiple endpoints are considered. Pharmaceutical Statistics, 6(3):161–170, 2007.

[16] Christy Chuang-Stein, Paul Stryszak, Alex Dmitrienko, and Walter Offen. Challenge of multiple co-primary endpoints: a new approach. Statistics in Medicine, 26(6):1181–1192, 2007.

[17] Gillian A. Lancaster, Susanna Dodd, and Paula R. Williamson. Design and analysis of pilot studies: recommendations for good practice. Journal of Evaluation in Clinical Practice, 10(2):307–312, 2004.

[18] Mubashir Arain, Michael Campbell, Cindy Cooper, and Gillian Lancaster. What is a pilot or feasibility study? A review of current practice and editorial policy. BMC Medical Research Methodology, 10(1):67, 2010.

[19] Lehana Thabane, Jinhui Ma, Rong Chu, Ji Cheng, Afisi Ismaila, Lorena Rios, Reid Robson, Marroon Thabane, Lora Giangregorio, and Charles Goldsmith. A tutorial on pilot studies: the what, why and how. BMC Medical Research Methodology, 10(1):1, 2010.

[20] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, Apr 2002.

[21] Olaf Mersmann. mco: Multiple Criteria Optimization Algorithms and Related Functions, 2014. R package version 1.0-15.1.

[22] Dirk Eddelbuettel and Romain François. Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8):1–18, 2011.

[23] Elizabeth Edginton, Rebecca Walwyn, Kayleigh Burton, Robert Cicero, Liz Graham, Sadie Reed, Sandy Tubeuf, Maureen Twiddy, Alex Wright-Hughes, Lynda Ellis, Dot Evans, Tom Hughes, Nick Midgley, Paul Wallis, and David Cottrell. TIGA-CUB – manualised psychoanalytic child psychotherapy versus treatment as usual for children aged 5–11 years with treatment-resistant conduct disorders and their primary carers: study protocol for a randomised controlled feasibility trial. Trials, 18(1), Sep 2017.

[24] Sabine Landau and Daniel Stahl. Sample size and power calculations for medical studies by simulation when closed form expressions are not available. Statistical Methods in Medical Research, 22(3):324–345, 2013.

[25] Donald R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001.

[26] D. T. Wilson, R. E. Walwyn, J. Brown, A. J. Farrin, and S. R. Brown. Statistical challenges in assessing potential efficacy of complex interventions in pilot or feasibility studies. Statistical Methods in Medical Research, 25(3):997–1009, Jun 2015.

[27] Mark R. Conaway and Gina R. Petroni. Designs for phase II trials allowing for a trade-off between response and toxicity. Biometrics, 52(4):1375–1386, 1996.

[28] Peter F. Thall. Some geometric methods for constructing decision criteria based on two-dimensional parameters. Journal of Statistical Planning and Inference, 138(2):516–527, Feb 2008.
Appendix
Recall that the sample means are

\[ \bar{Y}_1 = \frac{A}{F_1}\mu + \frac{1}{F_1}\sum_{i=1}^{F_1} e_{i,1}, \qquad \bar{Y}_2 = \frac{1}{F_2}\sum_{i=1}^{F_2} e_{i,2}, \]

where F_j denotes the number of participants in arm j who are followed up, and A denotes the number of participants in the intervention arm who both adhere and are followed up. These sample means have expectations E[Ȳ_1 | φ] = E[(A/F_1)μ | φ] = φ_a μ and E[Ȳ_2 | φ] = 0. For the variance of the intervention group mean, we have by the law of total variance that

\[ Var(\bar{Y}_1) = E[Var(\bar{Y}_1 \mid F_1, A)] + E[Var(E[\bar{Y}_1 \mid F_1, A] \mid F_1)] + Var(E[\bar{Y}_1 \mid F_1]). \tag{3} \]

Taking each of these terms in turn:

\[ E[Var(\bar{Y}_1 \mid F_1, A)] = E\left[\frac{\sigma^2}{F_1}\right] = \frac{2\sigma^2}{\phi_f E[N]}, \]

where E[N] is the expected number of participants recruited to the trial and randomised equally between arms. Denoting the number of eligible participants who consent by C, then

\[ E[N] = E[N \mid C < n_t] \Pr(C < n_t) + E[N \mid C \ge n_t] \Pr(C \ge n_t). \]

Note that C follows a binomial distribution with size n_e and probability φ_r. Because recruitment will stop once the target has been reached, E[N | C ≥ n_t] = n_t. We also have

\[ E[N \mid C < n_t] = \frac{\sum_{k=0}^{n_t - 1} k \binom{n_e}{k} \phi_r^k (1 - \phi_r)^{n_e - k}}{\Pr(C < n_t)}, \]

and so

\[ E[N] = \sum_{k=0}^{n_t - 1} k \binom{n_e}{k} \phi_r^k (1 - \phi_r)^{n_e - k} + n_t \Pr(C \ge n_t). \]

For the second term of (3),

\[ E\left[Var(E[\bar{Y}_1 \mid F_1, A] \mid F_1)\right] = E\left[Var\left(\frac{A\mu}{F_1} \,\Big|\, F_1\right)\right] = E\left[\frac{\mu^2}{F_1^2} Var(A \mid F_1)\right] = E\left[\frac{\mu^2}{F_1^2} F_1 \phi_a (1 - \phi_a)\right] = \frac{2\mu^2 \phi_a (1 - \phi_a)}{\phi_f E[N]}. \]

Finally,

\[ Var(E[\bar{Y}_1 \mid F_1]) = Var\left(E\left[\frac{A\mu}{F_1} \,\Big|\, F_1\right]\right) = Var\left(\frac{E[A \mid F_1]\mu}{F_1}\right) = Var\left(\frac{F_1 \phi_a \mu}{F_1}\right) = 0. \]

This then gives

\[ Var(\bar{Y}_1) = \frac{2\sigma^2}{\phi_f E[N]} + \frac{2\mu^2 \phi_a (1 - \phi_a)}{\phi_f E[N]}. \]

The variance of the control group sample mean is

\[ Var(\bar{Y}_2) = E[Var(\bar{Y}_2 \mid F_2)] + Var(E[\bar{Y}_2 \mid F_2]) = E\left[\frac{\sigma^2}{F_2}\right] + 0 = \frac{2\sigma^2}{\phi_f E[N]}. \]

The power of the trial can then be obtained by substituting the expectation and variance of the sample means into the usual formula (which, again, assumes they are normally distributed):

\[ g(\phi) = \Phi\left( \frac{E[\bar{Y}_1 - \bar{Y}_2]}{\sqrt{Var(\bar{Y}_1 - \bar{Y}_2)}} - z_{1-\alpha} \right) = \Phi\left( \frac{\phi_a \mu}{\sqrt{ \frac{2\sigma^2}{\phi_f E[N]} + \frac{2\mu^2 \phi_a (1 - \phi_a)}{\phi_f E[N]} + \frac{2\sigma^2}{\phi_f E[N]} }} - z_{1-\alpha} \right) = \Phi\left( \frac{\phi_a \mu \sqrt{\phi_f E[N]}}{\sqrt{4\sigma^2 + 2\mu^2 \phi_a (1 - \phi_a)}} - z_{1-\alpha} \right) = \Phi\left( x(\phi) - z_{1-\alpha} \right). \]
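As a numerical check on the expression for E[N] derived above, it can be compared against a direct enumeration of E[min(C, n_t)]. This is a short Python sketch with our own function names; the paper's implementation is in R.

```python
from math import comb

def expected_n_formula(n_e, n_t, phi_r):
    # E[N] as derived above: sum_{k=0}^{n_t-1} k Pr(C=k) + n_t Pr(C >= n_t),
    # with C ~ Binomial(n_e, phi_r).
    pmf = [comb(n_e, k) * phi_r**k * (1 - phi_r)**(n_e - k)
           for k in range(n_e + 1)]
    return sum(k * pmf[k] for k in range(n_t)) + n_t * sum(pmf[n_t:])

def expected_n_direct(n_e, n_t, phi_r):
    # E[min(C, n_t)] by brute-force enumeration over all values of C;
    # recruitment stops once the target n_t is reached.
    return sum(min(k, n_t) * comb(n_e, k) * phi_r**k * (1 - phi_r)**(n_e - k)
               for k in range(n_e + 1))

# The two agree to floating-point precision, e.g. with (hypothetical)
# n_e = 200 eligible patients, target n_t = 80 and consent rate 0.3:
assert abs(expected_n_formula(200, 80, 0.3) - expected_n_direct(200, 80, 0.3)) < 1e-9
```

When the consent rate is high enough that the target is almost surely met, both expressions collapse to n_t, as the derivation requires.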