A web application for the design of multi-arm clinical trials

Abstract

Multi-arm designs provide an effective means of evaluating several treatments within the same clinical trial. Given the large number of treatments now available for testing in many disease areas, it has been argued that their utilisation should increase. However, for any given clinical trial there are numerous possible multi-arm designs that could be used, and choosing between them can be a difficult task. This task is complicated further by a lack of available easy-to-use software for designing multi-arm trials. To aid the wider implementation of multi-arm clinical trial designs, we have developed a web application for sample size calculation when using a variety of popular multiple comparison corrections. Furthermore, the application supports sample size calculation to control several varieties of power, as well as the determination of optimised arm-wise allocation ratios. It is built using the Shiny package in the R programming language, is free to access on any device with an internet browser, and requires no programming knowledge to use. The application provides the core information required by statisticians and clinicians to review the operating characteristics of a chosen multi-arm clinical trial design. We hope that it will assist with the future utilisation of such designs in practice.


Michael J. Grayling, James M. S. Wason

1. Institute of Health & Society, Newcastle University, Newcastle, UK.
2. Hub for Trials Methodology Research, MRC Biostatistics Unit, Cambridge, UK.
*Address correspondence to M. J. Grayling, Institute of Health & Society, Baddiley Clark Building, Richardson Road, Newcastle upon Tyne NE2 4AX, UK. E-mail: [email protected].

Keywords: False discovery rate; Familywise error-rate; Multiple comparisons; Optimal design; Power; Sample size.

Background

Drug development is becoming an increasingly expensive process, with the estimated average cost per approved new compound now standing at over $1 bn [1]. In no small part this is due to the high failure rate of clinical trials, in particular in phases II and III. This is particularly true in the field of oncology, where the likelihood of approval from phase I is only 5.1% [2]. Consequently, the clinical research community is constantly seeking new methods that may improve the efficiency of the drug development process.

One possible method, which has received substantial attention in recent years, is to make use of multi-arm designs that compare several experimental treatments to a shared control group. Several desirable, inter-related, features of such designs have now been described. For example, the number of patients on the control treatment is typically reduced compared to conducting separate two-arm trials, and simultaneously patients are more likely to be randomized to an experimental treatment, which may help with recruitment [3, 4]. Furthermore, the overall required sample size, for the same level of power, will typically be smaller than that which would be required if multiple two-arm trials were conducted [5]. Finally, multi-arm designs offer a fair head-to-head comparison of experimental treatments in the same study [3, 4], and the cost of assessing a treatment in a multi-arm trial is often around half of that for a separate two-arm trial [3].

Based upon these advantages, and their experiences of utilising such designs in several oncology trials, Parmar et al. [3] make a compelling case for the need for more multi-arm designs to be used in clinical research. We are not aware of any systematic evidence on whether this has now permeated through to practice, but a simple search of PubMed Central suggests it may be the case: 859 articles have included the phrases “multi-arm” and “clinical trial” since 2015, as opposed to just 273 in all years prior to this. Considering this result in combination with the findings of Baron et al. [6], who determined 17.9% of trials published in 2009 were multi-arm, as well as the recent publication of a key guidance document on reporting results from multi-arm trials [7], it is clear that there is now much interest within the trials community in such designs.

However, whilst there are numerous advantages of multi-arm trials, it is important to recognise that determining a suitable design for a multi-arm clinical trial can be a substantially more complex process than for a two-arm trial. In particular, a decision must be made on how to account for the multiple comparisons that will be made. Indeed, whether the final analysis should adjust for multiplicity has been a topic of much debate within the literature. In brief, presented arguments primarily revolve around the fact that failing to account for multiplicity can substantially increase the probability of committing a type-I error. Yet, if a series of two-arm trials were conducted, no adjustment would be made to the significance level used in each trial. For brevity, we will not repeat all further arguments on this issue here, and instead refer the reader to several key discussions on multiplicity [5, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18].

For the purposes of what follows in this article, the more important consideration is that when a multiple comparison correction (MCC) is to be used, one of a wide selection must actually be chosen (see, e.g., [19, 20, 21] for an overview). MCCs vary widely in their complexity, with Bonferroni’s correction often recommended because of its simplicity [7]. However, other MCCs often perform better in terms of the operating characteristics they impart, as Bonferroni’s correction is known to be conservative [10, 18, 20, 22]. A recent review found that amongst those multi-arm trials that did adjust for multiplicity, 50% used one of the comparatively simple Bonferroni or Dunnett corrections [5]. Thus, there arguably remains the potential for increased efficiency gains to be made in multi-arm trials, if more advanced MCCs can be employed.

Furthermore, regardless of whether a MCC is utilised, there are other complications that must also be addressed in multi-arm trial design, including how to power the trial, and what the allocation ratio to each experimental arm relative to the control arm will be. Indeed, power is not a simple quantity in a multi-arm trial, whilst the literature on how to choose the allocation ratios in an optimal manner is extensive (see, e.g., [23] for an overview), and deciding whether to specify allocation ratios absolutely, or whether they can be optimised to improve trial efficiency, may not be an easy decision.

These considerations imply that user-friendly software for designing multi-arm clinical trials would be a valuable tool in the trials community. It is unfortunate therefore that little software is available to assist with such studies. The principal exception to this is the MULTIARM module for East [24], which allows users to compare the operating characteristics of many multi-arm designs with respect to numerous important quantities. However, the cost of this package may be prohibitive to many working within academia. For this reason, we have developed a web application for multi-arm clinical trial design. We hope that the availability of this application will assist with the utilization of more advanced multi-arm designs in future clinical trials.

Implementation

The web application is written using the Shiny package [25] in the R programming language [26]. It is available as a function in (for off-line local use), and is built using other functions from, the R package multiarm [27]. A vignette is provided for multiarm that gives great detail on its formal statistical specifications. A less technical summary is provided here.

Design setting

It is assumed that outcomes $X_{ik}$ will be accrued from patients $i \in \{1, \dots, n_k\}$ on treatment arms $k \in \{0, \dots, K\}$, with arm $k = 0$ corresponding to a shared control arm, and arms $k \in \{1, \dots, K\}$ to several experimental arms. Later, we provide more information on the precise types of outcome that are currently supported by the web application. The hypotheses of interest are assumed to be $H_k : \tau_k \le 0$ for $k \in \{1, \dots, K\}$. Here, $\tau_k$ corresponds to a treatment effect for experimental arm $k \in \{1, \dots, K\}$ relative to the control arm. Thus, we assume one-sided tests for superiority. Note that in the app, reference is also made to the global null hypothesis, $H_G$, which we define to be the scenario with $\tau_1 = \dots = \tau_K = 0$.

To test hypothesis $H_k$, we assume that a Wald test statistic, $z_k$, is computed

$$z_k = \frac{\hat{\tau}_k}{\sqrt{\mathrm{Var}(\hat{\tau}_k)}} = \hat{\tau}_k I_k^{1/2}, \quad k \in \{1, \dots, K\}.$$

In what follows, we use the notation $\boldsymbol{z}_k = (z_1, \dots, z_k)^\top \in \mathbb{R}^k$. With this, note that our app supports design in particular scenarios where $\boldsymbol{Z}_k$, the random pre-trial value of $\boldsymbol{z}_k$, has (at least asymptotically) a $k$-dimensional multivariate normal (MVN) distribution, with

$$E(Z_l) = \tau_l I_l^{1/2}, \quad l = 1, \dots, k,$$
$$\mathrm{Cov}(Z_l, Z_l) = 1, \quad l \in \{1, \dots, k\},$$
$$\mathrm{Cov}(Z_{l_1}, Z_{l_2}) = I_{l_1}^{1/2} I_{l_2}^{1/2} \mathrm{Cov}(\hat{\tau}_{l_1}, \hat{\tau}_{l_2}), \quad l_1 \ne l_2, \ l_1, l_2 \in \{1, \dots, k\}.$$

As is discussed further later, this includes normally distributed outcome variable scenarios and, for large sample sizes, other parametric distributions such as Bernoulli outcome data.

Ultimately, to test the hypotheses, $\boldsymbol{z}_K$ is converted to a vector of $p$-values, $\boldsymbol{p} = (p_1, \dots, p_K)^\top \in [0, 1]^K$, via $p_k = 1 - \Phi_1(z_k, 0, 1)$, $k \in \{1, \dots, K\}$. Here, $\Phi_n\{(a_1, \dots, a_n)^\top, \boldsymbol{\lambda}, \Sigma\}$ is the cumulative distribution function of an $n$-dimensional MVN distribution, with mean $\boldsymbol{\lambda}$ and covariance matrix $\Sigma$. Precisely

$$\Phi_n\{(a_1, \dots, a_n)^\top, \boldsymbol{\lambda}, \Sigma\} = \int_{-\infty}^{a_1} \cdots \int_{-\infty}^{a_n} \phi_n\{\boldsymbol{x}, \boldsymbol{\lambda}, \Sigma\} \, \mathrm{d}x_n \dots \mathrm{d}x_1,$$

where $\phi_n\{\boldsymbol{x}, \boldsymbol{\lambda}, \Sigma\}$ is the probability density function of an $n$-dimensional MVN distribution with mean $\boldsymbol{\lambda}$ and covariance matrix $\Sigma$, evaluated at vector $\boldsymbol{x} = (x_1, \dots, x_n)^\top$.

Then, which null hypotheses are rejected is determined by comparing the $p_k$ to a set of significance thresholds specified based on a chosen MCC, in combination with a nominated significance level $\alpha \in (0, 1)$.

Operating characteristics

Our app returns a wide selection of statistical operating characteristics that may be of interest when choosing a multi-arm trial design. Specifically, it can compute the following quantities for any nominated multi-arm design and true set of treatment effects
• The conjunctive power ($P_{con}$): The probability that all of the null hypotheses are rejected, irrespective of whether they are true or false.
• The disjunctive power ($P_{dis}$): The probability that at least one of the null hypotheses is rejected, irrespective of whether they are true or false.
• The marginal power for arm $k \in \{1, \dots, K\}$ ($P_k$): The probability that $H_k$ is rejected, irrespective of whether it is true or false.
• The per-hypothesis error-rate ($PHER$): The expected value of the number of type-I errors divided by the number of hypotheses.
• The $a$-generalised type-I familywise error-rate ($FWER_{Ia}$): The probability that at least $a \in \{1, \dots, K\}$ type-I errors are made. Note that $FWER_{I1}$ is the conventional familywise error-rate ($FWER$); the probability of making at least one type-I error.
• The $a$-generalised type-II familywise error-rate ($FWER_{IIa}$): The probability that at least $a \in \{1, \dots, K\}$ type-II errors are made.
• The false discovery rate ($FDR$): The expected proportion of type-I errors amongst the rejected hypotheses.
• The false non-discovery rate ($FNDR$): The expected proportion of type-II errors amongst the hypotheses that are not rejected.
• The positive false discovery rate ($pFDR$): The rate that rejections are type-I errors.
• The sensitivity ($Sensitivity$): The expected proportion of the number of correct rejections of the hypotheses to the number of false null hypotheses.
• The specificity ($Specificity$): The expected proportion of the number of correctly not rejected hypotheses to the number of true null hypotheses.
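To make several of these definitions concrete, the following minimal Python sketch estimates them from a set of simulated rejection decisions. It is an illustration only, not the app's method (the app evaluates these quantities analytically via MVN integration); the function name `operating_characteristics` and its arguments are hypothetical, and $FNDR$, $pFDR$, and $FWER_{IIa}$ are omitted for brevity.

```python
def operating_characteristics(rejections, null_is_true, a=1):
    """Estimate operating characteristics from simulated rejection decisions.

    rejections   : list of K-tuples of bools, one per simulated trial;
                   entry k is True if H_k was rejected in that trial.
    null_is_true : K-tuple of bools; entry k is True if H_k is actually true.
    a            : number of type-I errors used for the generalised FWER.
    """
    n_sims = len(rejections)
    K = len(null_is_true)
    n_true = sum(null_is_true)      # number of true null hypotheses
    n_false = K - n_true            # number of false null hypotheses
    con = dis = fwer = 0
    marginal = [0] * K
    pher = fdr = sens = spec = 0.0
    for rej in rejections:
        n_rej = sum(rej)
        # A type-I error is the rejection of a true null hypothesis.
        type_1 = sum(r and t for r, t in zip(rej, null_is_true))
        correct_rej = n_rej - type_1
        con += n_rej == K           # all hypotheses rejected
        dis += n_rej >= 1           # at least one hypothesis rejected
        fwer += type_1 >= a         # at least a type-I errors
        for k, r in enumerate(rej):
            marginal[k] += r
        pher += type_1 / K
        fdr += type_1 / n_rej if n_rej else 0.0   # taken as 0 with no rejections
        if n_false:
            sens += correct_rej / n_false
        if n_true:
            spec += (n_true - type_1) / n_true
    return {
        "P_con": con / n_sims,
        "P_dis": dis / n_sims,
        "P_k": [m / n_sims for m in marginal],
        "PHER": pher / n_sims,
        "FWER_Ia": fwer / n_sims,
        "FDR": fdr / n_sims,
        # Undefined (None) when there are no false / no true null hypotheses.
        "sensitivity": sens / n_sims if n_false else None,
        "specificity": spec / n_sims if n_true else None,
    }


# Toy input: K = 2, H_1 true and H_2 false, four simulated trials.
toy = [(False, True), (True, True), (False, False), (False, True)]
summary = operating_characteristics(toy, null_is_true=(True, False))
```

In this toy input only the second trial commits a type-I error, so the estimated $FWER_{I1}$ is one in four, while the disjunctive power counts the three trials with at least one rejection.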

Multiple comparison corrections

Per-hypothesis error-rate control

The simplest method for selecting the significance thresholds against which to compare the $p_k$ is to compare each to the chosen significance level $\alpha$. That is, to reject $H_k$ for $k \in \{1, \dots, K\}$ if $p_k \le \alpha$. This controls the $PHER$ to $\alpha$.

A potential problem with this, however, can be that the statistical operating characteristics of the resulting design may not be desirable (e.g., in terms of $FWER_{I1}$). As discussed earlier, it is for this reason that we may wish to make use of a MCC. Currently, the web application supports the use of a variety of such MCCs, which aim to control either (a) the conventional familywise error-rate, $FWER_{I1}$ (with these techniques sub-divided into single-step, step-down, and step-up corrections) or (b) the $FDR$.

Single-step familywise error-rate control

These MCCs test each of the $H_k$ against a common significance level, $\gamma \in (0, 1)$ say, rejecting $H_k$ if $p_k \le \gamma$. The currently supported single-step corrections are
• Bonferroni’s correction: This sets $\gamma = \alpha / K$ [28].
• Sidak’s correction: This sets $\gamma = 1 - (1 - \alpha)^{1/K}$ [29].
• Dunnett’s correction: This sets $\gamma = 1 - \Phi_1\{z_D, 0, 1\}$, where $z_D$ is the solution of the following equation
$$\alpha = 1 - \Phi_K\{(z_D, \dots, z_D)^\top, \boldsymbol{0}_K, \mathrm{Cov}(\boldsymbol{Z}_K)\},$$
with $\boldsymbol{0}_n = (0, \dots, 0)^\top \in \mathbb{R}^n$ an $n$-dimensional vector of zeroes [30].

Note that each of the above specifies a $\gamma$ such that the maximum probability of incorrectly rejecting at least one of the null hypotheses $H_k$, $k \in \{1, \dots, K\}$, over all possible values of $\boldsymbol{\tau} \in \mathbb{R}^K$ is at most $\alpha$. This is referred to as strong control of $FWER_{I1}$.

Step-down familywise error-rate control

Step-down MCCs work by ranking the $p$-values from smallest to largest. We will refer to these ranked $p$-values by $p_{(1)}, \dots, p_{(K)}$, with associated hypotheses $H_{(1)}, \dots, H_{(K)}$. The $p_{(k)}$ are compared to a vector of significance levels $\boldsymbol{\gamma} = (\gamma_1, \dots, \gamma_K) \in (0, 1)^K$. Precisely, the minimal index $k$ such that $p_{(k)} > \gamma_k$ is identified, and then $H_{(1)}, \dots, H_{(k-1)}$ are rejected and $H_{(k)}, \dots, H_{(K)}$ are not rejected. If $k = 1$ then we do not reject any of the null hypotheses, and if no such $k$ exists then we reject all of the null hypotheses. The currently supported step-down corrections are
• Holm-Bonferroni correction: This sets $\gamma_k = \alpha / (K + 1 - k)$ [31].
• Holm-Sidak correction: This sets $\gamma_k = 1 - (1 - \alpha)^{1/(K + 1 - k)}$.
• Step-down Dunnett correction: This can only currently be used when the $\mathrm{Cov}(Z_{k_1}, Z_{k_2})$ are equal for all $k_1 \ne k_2$, $k_1, k_2 \in \{1, \dots, K\}$. In this case, it sets $\gamma_k = 1 - \Phi_1\{z_{Dk}, 0, 1\}$, where $z_{Dk}$ is the solution to
$$\alpha = 1 - \Phi_{K+1-k}\{(z_{Dk}, \dots, z_{Dk})^\top, \boldsymbol{0}_{K+1-k}, \mathrm{Cov}(\boldsymbol{Z}_{K+1-k})\}.$$

Note that all of the above methods provide strong control of $FWER_{I1}$.

Step-up familywise error-rate control

Step-up MCCs also work by ranking the $p$-values from smallest to largest, and similarly utilise a vector of significance levels $\boldsymbol{\gamma}$. However, here, the largest $k$ such that $p_{(k)} \le \gamma_k$ is identified. Then, the hypotheses $H_{(1)}, \dots, H_{(k)}$ are rejected, and $H_{(k+1)}, \dots, H_{(K)}$ are not rejected. Currently, one such correction is supported: Hochberg’s correction [32], which sets $\gamma_k = \alpha / (K + 1 - k)$. This method also provides strong control of $FWER_{I1}$.

False discovery rate control

It may be of interest to instead control the $FDR$, which can offer a compromise between strict $FWER_{I1}$ control and $PHER$ control, especially when we expect a large proportion of the experimental treatments to be effective. Currently, two methods that will control the $FDR$ to at most $\alpha$ over all possible $\boldsymbol{\tau} \in \mathbb{R}^K$ are supported. They function in the same way as the step-up corrections discussed above, with
• Benjamini-Hochberg correction: This sets $\gamma_k = k\alpha / K$ [33].
• Benjamini-Yekutieli correction: This sets [34]
$$\gamma_k = \frac{k\alpha}{K\left(1 + \frac{1}{2} + \dots + \frac{1}{K}\right)}.$$

Sample size determination

The sample size required by a design to control several types of power to a specified level $1 - \beta$, under certain specific scenarios, can be computed. Precisely, following for example [35], values for ‘interesting’ and ‘uninteresting’ treatment effects, $\delta_1 \in \mathbb{R}^+$ and $\delta_0 \in (-\infty, \delta_1)$ respectively, are specified and the following definitions are made
• The global alternative hypothesis, $H_A$, is given by $\tau_1 = \dots = \tau_K = \delta_1$.
• The least favourable configuration for experimental arm $k \in \{1, \dots, K\}$, $LFC_k$, is given by $\tau_k = \delta_1$, $\tau_1 = \dots = \tau_{k-1} = \tau_{k+1} = \dots = \tau_K = \delta_0$.

Then, the following types of power can be controlled to level $1 - \beta$ by designs determined using the app
• The conjunctive power under $H_A$.
• The disjunctive power under $H_A$.
• The minimum marginal power under the respective $LFC_k$.

Allocation ratios

One of the primary goals of the app is to aid the choice of values for $n_0, \dots, n_K$. The app specifically supports the determination of values for these parameters by searching for a suitable $n_0$ via a one-dimensional root solving algorithm, and then sets $n_k = r_k n_0$, $r_k \in (0, \infty)$, for $k \in \{1, \dots, K\}$. Here, $r_k$ is the allocation ratio for experimental arm $k$ relative to the control arm.

For this reason, the app also allows the allocation ratios to be specified in a variety of ways: they can be defined explicitly, or alternatively can be determined in an optimal manner. For this optimality problem, many possible optimality criteria have been defined, each with their own merits. We therefore refer the reader to Atkinson (2007) [23] for further details of optimal allocation in multi-arm designs, and simply note here that in the web application, the allocation ratios can currently be determined for three such criteria
• $A$-optimality: Minimizes the trace of the inverse of the information matrix of the design. This results in the minimization of the average variance of the treatment effect estimates.
• $D$-optimality: Maximizes the determinant of the information matrix of the design. This results in the minimization of the volume of the confidence ellipsoid for the treatment effect estimates.
• $E$-optimality: Maximizes the minimum eigenvalue of the information matrix. This results in the minimization of the maximum variance of the treatment effect estimates.

The optimal allocation ratios are identified in the app using available closed-form solutions where possible (see [36] for a summary of these), otherwise non-linear programming is employed.

Other design specifications

Finally, the web application also supports the following options
• Plot production: Plots can be produced of (a) all of the operating characteristics quantities listed earlier when $\tau_1 = \dots = \tau_K = \theta$, as well as (b) the $P_k$ when $\tau_k = \theta$ and $\tau_l = \theta - (\delta_1 - \delta_0)$ for $l \ne k$. If these are selected for rendering, the quality of the plots, in terms of the number of values of $\theta$ used for line-graph production, can also be controlled.
• Require $n_k \in \mathbb{N}$ for $k \in \{0, \dots, K\}$: By default, the sample size determined for each arm will only be required to be a positive number. In practice, such values need to be integers. This can thus be enforced if desired, with the integer $n_k$ specified by rounding up their determined continuous values.

Supported outcome variables

Normally distributed outcome variables

Currently, the app supports multi-arm trial design for scenarios in which the outcome variables are assumed to be either normally or Bernoulli distributed.

Precisely, for the normal case, it assumes that $X_{ik} \sim N(\mu_k, \sigma_k^2)$, and that $\sigma_k^2$ is known for $k \in \{0, \dots, K\}$. Then, for each $k \in \{1, \dots, K\}$

$$\tau_k = \mu_k - \mu_0,$$
$$\hat{\tau}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_{ik} - \frac{1}{n_0} \sum_{i=1}^{n_0} x_{i0},$$
$$I_k = \frac{1}{\frac{\sigma_0^2}{n_0} + \frac{\sigma_k^2}{n_k}},$$

where $x_{ik}$ is the realised value of $X_{ik}$.

Note that in this case, $\boldsymbol{Z}_K$ has a MVN distribution, and thus the operating characteristics can be computed exactly and efficiently using MVN integration [37]. Furthermore, the distribution of $\boldsymbol{Z}_K$ does not depend upon the values of the $\mu_k$, $k \in \{0, \dots, K\}$. Consequently, these parameters play no part in the inputs or outputs of the app.

Bernoulli distributed outcome variables

In this case, $X_{ik} \sim \mathrm{Bern}(\pi_k)$ for response rates $\pi_k$, and for each $k \in \{1, \dots, K\}$

$$\tau_k = \pi_k - \pi_0,$$
$$\hat{\tau}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_{ik} - \frac{1}{n_0} \sum_{i=1}^{n_0} x_{i0},$$
$$I_k = \frac{1}{\frac{\pi_0(1 - \pi_0)}{n_0} + \frac{\pi_k(1 - \pi_k)}{n_k}}.$$

Thus, a problem for design determination becomes that the $I_k$ are dependent on the unknown response rates. In practice, this is handled at the analysis stage of a trial by setting

$$I_k = \frac{1}{\frac{\hat{\pi}_0(1 - \hat{\pi}_0)}{n_0} + \frac{\hat{\pi}_k(1 - \hat{\pi}_k)}{n_k}},$$

for $\hat{\pi}_k = \sum_{i=1}^{n_k} x_{ik} / n_k$, $k \in \{0, \dots, K\}$. This is the assumption made where required by the app. With this, $\boldsymbol{Z}_K$ is only asymptotically MVN. Thus, in general it would be important to validate operating characteristics evaluated using MVN integration via simulation.

In addition, note that the above problem also means that the operating characteristics under $H_G$, $H_A$, and the $LFC_k$ are not unique without further restriction. Thus, to achieve uniqueness, the app requires a value be specified for $\pi_0$ for use in the definition of these scenarios. Moreover, for this reason, the inputs and outputs of functions supporting Bernoulli outcomes make no reference to the $\tau_k$, and work instead directly in terms of the $\pi_k$. Finally, note that this problem also means that to determine $A$-, $D$-, or $E$-optimised allocation ratios, a specific set of values for the $\pi_k$ must be assumed.

In this case, we should also ensure that $\delta_1 \in (0, 1)$ and $\delta_0 \in (-\pi_0, \delta_1)$, for the assumed value of $\pi_0$, since $\pi_k \in [0, 1]$ for $k \in \{0, \dots, K\}$.

Results

Support

The web application is freely available from https://mjgrayling.shinyapps.io/multiarm/. The R code for the application can also be downloaded from https://github.com/mjg211/multiarm. Furthermore, as noted earlier, the app is built in to the package multiarm [27], as the function gui(), for ease-of-use without internet access. The application has a simple interface, and has the capability to
• Determine the sample size required in each arm in a specified multi-arm clinical trial design scenario;
• Summarise and plot the operating characteristics of the identified design;
• Produce a report describing the chosen design scenario, the identified design, and a summary of its operating characteristics.

Inputs

The outputs (i.e., the identified design and its operating characteristics) are determined based upon the following set of user-specified inputs (Figure 1)
1. The number of experimental treatment arms, $K$.
2. The chosen multiple comparison correction (e.g., Dunnett’s correction).
3. The significance level, $\alpha$.
4. The type of power to control (e.g., the conjunctive power under $H_A$).
5. The desired power, $1 - \beta$.
6. For Bernoulli distributed data, the control arm response rate $\pi_0$.
7. The interesting treatment effect, $\delta_1$.
8. The uninteresting treatment effect, $\delta_0$.
9. For normally distributed data, the standard deviations, $\sigma_0, \dots, \sigma_K$. These are allocated by first selecting the type of standard deviations (e.g., that they are assumed to be equal across all arms), and then the actual values for the parameters.
10. The allocation ratios (e.g., $A$-optimal).
11. For Bernoulli distributed data, when searching for optimal allocation ratios, the response rates to assume in the search.
12. Whether the sample size in each arm should be required to be an integer.
13. Whether plots should be produced, and if so the plot quality.

Note that a Reset inputs button is provided to simplify returning the inputs to their default values. Once the inputs have been specified as desired, the outputs can be generated by clicking the Update outputs button.
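As a rough illustration of how a subset of these inputs maps to a design, the Python sketch below computes arm-wise sample sizes for normally distributed outcomes when Bonferroni's correction is used and the minimum marginal power under the $LFC_k$ is controlled. It is a simplification rather than the app's algorithm: the app uses a one-dimensional root solver, whereas here a closed form is available because, for a single-step correction, the marginal power depends only on $Z_k \sim N(\delta_1 I_k^{1/2}, 1)$. A common experimental-arm standard deviation and allocation ratio are assumed, and the function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def single_step_sample_size(K, alpha, beta, delta1, sigma0, sigma_k, ratio=1.0):
    """Continuous (n_0, n_k) giving marginal power 1 - beta under Bonferroni.

    Assumes normal outcomes with known standard deviations sigma0 (control)
    and sigma_k (each experimental arm), and n_k = ratio * n_0.
    """
    gamma = alpha / K                              # Bonferroni threshold
    z_gamma = _STD_NORMAL.inv_cdf(1 - gamma)
    z_beta = _STD_NORMAL.inv_cdf(1 - beta)
    # With n_k = ratio * n_0, I_k = n_0 / (sigma0^2 + sigma_k^2 / ratio).
    # Marginal power 1 - beta needs delta1 * sqrt(I_k) = z_gamma + z_beta.
    n0 = (z_gamma + z_beta) ** 2 * (sigma0 ** 2 + sigma_k ** 2 / ratio) / delta1 ** 2
    return n0, ratio * n0


def marginal_power(n0, nk, gamma, delta1, sigma0, sigma_k):
    """P(p_k <= gamma) when tau_k = delta1, for normal outcomes."""
    info = 1.0 / (sigma0 ** 2 / n0 + sigma_k ** 2 / nk)
    return 1 - _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(1 - gamma) - delta1 * info ** 0.5)


# Illustrative inputs only (not taken from the paper).
n0, nk = single_step_sample_size(K=2, alpha=0.05, beta=0.2,
                                 delta1=0.5, sigma0=1.0, sigma_k=1.0)
# Integer sample sizes are obtained by rounding the continuous values up.
n0_int, nk_int = ceil(n0), ceil(nk)
```

Rounding up can only increase the information $I_k$, so the integer design retains at least the targeted marginal power.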

Example

Here, we demonstrate specification of the input parameters (Figure 1), and then subsequent output generation (Figures 2-4), for parameters motivated by a three-arm phase II randomized controlled trial of treatments for myelodysplastic syndrome patients, described in [38]. This trial compared, via a binary primary outcome, two experimental treatments with conventional azacitidine treatment. The trial was designed with $\alpha = 0.15$, $\beta = 0.2$, $\delta_1 = 0.15$, and $\pi_0 = 0.3$. For simplicity, we assume that the familiar Dunnett correction will be used, that $\delta_0 = 0$, and that allocation will be equal across the arms ($r_1 = \dots = r_K = 1$). Finally, we assume it is the minimum marginal power that should be controlled.

Each input widget in Figure 1 can be seen to have been allocated accordingly based on the description above, whilst we have additionally elected to produce plots (of medium quality), and to not require the arm-wise sample sizes to be integers. Note that in Figure 1 we can see that the input widgets are supported by help boxes that can be opened by clicking on the small question marks beside them.

Figure 2 then depicts the output to the Design summary box once the user clicks on Update outputs. Specifically, a summary of the chosen inputs and the identified design is rendered. Furthermore, in Figure 3 we can see the tables that provide the various statistical quantities under $H_G$, $H_A$, the $LFC_k$, as well as the various treatment effect scenarios that are considered for plot production.

Finally, in Figure 4 the plots discussed earlier are shown. Observe that horizontal and vertical lines are added at the values $\alpha$, $1 - \beta$, $\delta_1$, and $\delta_0$ respectively. Note that these plots are outputted in a manner to allow the user to zoom in on a particular sub-component if desired.

In all, Figures 2-4 provide a set of outputs with a variety of features that should be anticipated given the chosen input parameters. Firstly, the specification that the allocation to all arms should be equal means that $n_0 = \dots = n_K$. In addition, $FWER_{I1}$ is equal to 0.15 under $H_G$, and the minimum marginal power is 0.799, as is approximately desired. Moreover, the specification that $r_1 = \dots = r_K$ means that $P_{con}$ and $P_{dis}$ are equal for each of the $LFC_k$, and $P_1 = P_2$.

Finally, as noted above, and as can be seen in Figure 1, a Generate report button is provided that can produce a copy of the outputs in either PDF (.pdf), HTML (.html), or Word (.docx) format. The user can also nominate a name for this file in the Report filename input widget. This allows a record of designs to be stored, presented, and compared to other designs if required.
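To give a feel for how the single-step corrections differ in a setting like this example ($K = 2$, $\alpha = 0.15$), the Python sketch below compares the Bonferroni, Sidak, and Dunnett thresholds $\gamma$. It rests on assumptions that are not part of the app: the pairwise correlation between the test statistics is taken to be 0.5 (as arises under $H_G$ with equal allocation), the MVN probability is evaluated through a one-dimensional Simpson's rule integral that is only valid for equicorrelated statistics, and $z_D$ is found by bisection. The app instead uses general MVN integration [37], and all function names here are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def prob_all_below(z, K, rho, half_width=8.0, steps=2000):
    """P(Z_1 <= z, ..., Z_K <= z) for equicorrelated standard normals.

    Uses Z_k = sqrt(1 - rho) * E_k + sqrt(rho) * W with independent standard
    normal E_k and W, integrating over W with composite Simpson's rule.
    """
    h = 2 * half_width / steps
    total = 0.0
    for j in range(steps + 1):
        w = -half_width + j * h
        weight = 1 if j in (0, steps) else (4 if j % 2 else 2)
        density = exp(-0.5 * w * w) / sqrt(2 * pi)
        inner = _STD_NORMAL.cdf((z - sqrt(rho) * w) / sqrt(1 - rho))
        total += weight * density * inner ** K
    return total * h / 3


def bonferroni_gamma(alpha, K):
    return alpha / K


def sidak_gamma(alpha, K):
    return 1 - (1 - alpha) ** (1 / K)


def dunnett_gamma(alpha, K, rho=0.5):
    """Return (gamma, z_D), with z_D solving alpha = 1 - P(all Z_k <= z_D)."""
    lo, hi = 0.0, 8.0   # assumes the root lies in this bracket
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - prob_all_below(mid, K, rho) > alpha:
            lo = mid    # familywise error still too large: raise the boundary
        else:
            hi = mid
    z_d = (lo + hi) / 2
    return 1 - _STD_NORMAL.cdf(z_d), z_d


alpha, K = 0.15, 2
gammas = {
    "Bonferroni": bonferroni_gamma(alpha, K),
    "Sidak": sidak_gamma(alpha, K),
    "Dunnett": dunnett_gamma(alpha, K, rho=0.5)[0],
}
```

Because the test statistics share a control arm and are therefore positively correlated, the Dunnett threshold is less stringent than both the Bonferroni and Sidak thresholds while still giving a familywise error-rate of $\alpha$ under $H_G$.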

Conclusions

A possible barrier to previous calls for increased use of multi-arm clinical trial designs is a lack of available easy-to-access user-friendly software that facilitates associated sample size calculations. For this reason, we have created an online web application that supports multi-arm trial design determination for a wide selection of possible input parameters. Its use requires no knowledge of statistical programming languages and is facilitated via a simple user interface. Furthermore, we have made the application available on the internet, so that it is readily accessible, and have also made it freely available for download for remote use without an internet connection. Like similar applications that have been released recently for phase I clinical trial design [39, 40], we hope that the availability of this application will assist with the design of future multi-arm studies.

Before we conclude, however, it is important to acknowledge the limitations of our work. Firstly, MVN integration is utilised in all instances to determine the statistical operating characteristics of potential multi-arm designs. This makes the execution time for returning outputs with many possible input parameters fast. However, there is an unavoidable complexity in certain multi-arm designs, which may make execution time long. This is particularly true of scenarios with $K \ge 5$. It can also be true of designs that utilise the more complex step-wise MCCs. It is for this reason that the application places an upper cap in the inputs of $K = 5$, and also returns a warning in scenarios for which a lengthy execution time would be anticipated. Nonetheless, users may have to wait several minutes in certain situations to identify their desired design.

Furthermore, it is crucial that all software for clinical trial design be validated. This is challenging for multi-arm designs because of the aforementioned limited freely available software for designing such studies. We compared the output of our application to that of PASS [41], a validated software package, for a variety of supported input parameters, but output for many possible inputs remained difficult to corroborate because of a lack of equivalent available functionality. For this reason, we have carefully followed recommended good-programming practices and perform all statistical calculations within the application by calling functions from the R package multiarm, in which the code has been modularised [27]. Furthermore, in this package we have created a function that simulates multi-arm clinical trials using a given design. This allows us to perform an additional check on our analytical computations. Specifically, we generated 1000 random combinations of possible input parameters for trials assuming normally distributed outcomes, thus covering an extremely wide range of supported design scenarios. The analytical operating characteristics returned by the web application in the Operating characteristics summary for $H_G$, $H_A$, and the $LFC_k$ were then compared to those based on trial simulation, using 100,000 replicate simulations in each of the 1000 designs. Across all considered scenarios, the maximum absolute difference between the analytical and simulated operating characteristics was just $5 \times 10^{-\dots}$, which is within what would be anticipated due to simulation error. Consequently, it does appear that our application is functioning as desired. Code to replicate this work is available upon request from the corresponding author.

Finally, we note one primary possible avenue for future development of the web application: numerous papers have now provided designs for adaptive multi-arm trials (e.g., [42, 43]), and software for their determination in certain settings [44, 45]. Given the evident increased interest in such designs [46], allowing for their determination would be a valuable extension to our application.

Funding

This work was supported by the Medical Research Council [grant number MC_UU_00002/6 to JMSW]. The funding body did not have any role in the design of this study, collection, analysis, and interpretation of data, nor in the writing of the manuscript.

References

[1] DiMasi, J.A., Grabowski, H.G., Hansen, R.W.: Innovation in the pharmaceutical industry: new estimates of R&D costs. Journal of Health Economics, 20–33 (2016)
[2] Biotechnology Innovation Organization (BIO), Biomedtracker, AMPLION: Clinical development success rates 2006-2015 (2016)
[3] Parmar, M.K.B., Carpenter, J., Sydes, M.R.: More multiarm randomised trials of superiority are needed. Lancet (9940), 283–4 (2014)
[4] Jaki, T., Wason, J.M.S.: Multi-arm multi-stage trials can improve the efficiency of finding effective treatments for stroke: a case study. BMC Cardiovascular Disorders (1), 215 (2018)
[5] Wason, J.M.S., Stecher, L., Mander, A.P.: Correcting for multiple-testing in multi-arm trials: is it necessary and is it done? Trials, 364 (2014)
[6] Baron, G., Perrodeau, E., Boutron, I., Ravaud, P.: Reporting of analyses from randomized controlled trials with multiple arms: a systematic review. BMC Medicine, 84 (2013)
[7] Juszczak, E., Altman, D.G., Hopewell, S., Schulz, K.: Reporting of multi-arm parallel-group randomized trials: extension of the CONSORT 2010 statement. JAMA (16), 1610–1620 (2019)
[8] Rothman, K.J.: No adjustments are needed for multiple comparisons. Epidemiology (1), 43–46 (1990)
[9] Cook, R.J., Farewell, V.T.: Multiplicity considerations in the design and analysis of clinical trials. Journal of the Royal Statistical Society (Series A) (1), 93–110 (1996)
[10] Proschan, M.A., Waclawiw, M.A.: Practical guidelines for multiplicity adjustment in clinical trials. Controlled Clinical Trials (6), 527–539 (2000)
[11] Bender, R., Lange, S.: Adjusting for multiple testing - when and how? Journal of Clinical Epidemiology (4) (2001)
[12] Feise, R.J.: Do multiple outcome measures require p-value adjustment? BMC Medical Research Methodology, 8 (2002)
[13] Hughes, M.D.: Multiplicity in clinical trials. Encyclopedia of Biostatistics, 3446–3451 (2005)
[14] Freidlin, B., Korn, E.L., Gray, R., Martin, A.: Multi-arm clinical trials of new agents: some design considerations. Clinical Cancer Research (2008)
[15] Li, G., Taljaard, M., Van den Heuvel, E.R., Levine, M.A.H., Cook, D.J., Wells, G.A., Devereaux, P.J., Thabane, L.: An introduction to multiplicity issues in clinical trials: the what, why, when and how. International Journal of Epidemiology (2), 746–755 (2016)
[16] European Medicines Agency: Guideline on Multiplicity Issues in Clinical Trials. (2017).
[17] U.S. Food and Drug Administration: Multiple Endpoints in Clinical Trials Guidance for Industry. (2017).
[18] Howard, D.R., Brown, J.M., Todd, S., Gregory, W.M.: Recommendations on multiple testing adjustment in multi-arm trials with a shared control group. Statistical Methods in Medical Research (5), 1513–1530 (2018)
[19] Hochberg, Y., Tamhane, A.C.: Multiple Comparison Procedures. John Wiley & Sons, New York, NY (1987)
[20] Hsu, J.C.: Multiple Comparisons. Chapman & Hall, London (1996)
[21] Bretz, F., Hothorn, T., Westfall, P.: Multiple Comparisons using R. CRC Press, Boca Raton, FL (2010)
[22] Sankoh, A.J., D’Agostino, R.B.S., Huque, M.F.: Efficacy endpoint selection and multiplicity adjustment methods in clinical trials with inherent multiple endpoint issues. Statistics in Medicine (20), 3133–3150 (2003)
[23] Atkinson, A., Donev, A., Tobias, R.: Optimum Experimental Designs, with SAS. Oxford University Press, Oxford (2007)
[24] East. Accessed: 2019-05-04
[25] Chang, W., Cheng, J., Allaire, J.J., Xie, Y., McPherson, J.: shiny: Web Application Framework for R. (2019). https://CRAN.R-project.org/package=shiny
[27] Grayling, M.J.: multiarm: Design and analysis of fixed-sample multi-arm clinical trials (2019).
[28] Bonferroni, C.E.: Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze (1936)
[29] Šidák, Z.: Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association (318), 626–633 (1967)
[30] Dunnett, C.W.: A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association (272), 1096–1121 (1955)
[31] Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics (2), 65–70 (1979)
[32] Hochberg, Y.: A sharper Bonferroni procedure for multiple tests of significance. Biometrika (4), 800–802 (1988)
[33] Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society (Series B) (1), 289–300 (1995)
[34] Benjamini, Y., Yekutieli, D.: The control of the false discovery rate in multiple testing under dependency. Annals of Statistics (4), 1165–1188 (2001)
[35] Wason, J., Magirr, D., Law, M., Jaki, T.: Some recommendations for multi-arm multi-stage trials. Statistical Methods in Medical Research (2), 716–727 (2016)
[36] Sverdlov, O., Rosenberger, W.F.: On recent advances in optimal allocation designs in clinical trials. Journal of Statistical Theory and Practice (4), 753–773 (2013)
[37] Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., Hothorn, T.: mvtnorm: Multivariate normal and t distributions. R package version 1.0-10. (2019). http://CRAN.R-project.org/package=mvtnorm
[38] Jacob, L., Uvarova, M., Boulet, S., Begaj, I., Chevret, S.: Evaluation of a multi-arm multi-stage Bayesian design for phase II drug selection trials - an example in hemato-oncology. BMC Medical Research Methodology, 67 (2016)
[39] Wheeler, G.M., Sweeting, M.J., Mander, A.P.: AplusB: A Web Application for Investigating A + B Designs for Phase I Cancer Clinical Trials.
PLoS ONE (7),0159026 (2016)[40] Wages, N.A., Petroni, G.R.: A web tool for designing and conducting phase I trialsusing the continual reassessment method. BMC Cancer , 133 (2018)[41] PASS. . Accessed: 2019-05-04[42] Magirr, D., Jaki, T., Whitehead, J.: A generalized Dunnett test for multi-arm multi-stage clinical studies with treatment selection. Biometrika (2), 494–501 (2012)[43] Wason, J., Stallard, N., Bowden, J., Jennison, C.: A multi-stage drop-the-losersdesign for multi-arm clinical trials. Statistical Methods in Medical Research (1),508–524 (2017)[44] Barthel, F.M.S., Royston, P., Parmar, M.K.B.: A menu-driven facility for sample-size calculation in novel multiarm, multistage randomized controlled trials with atime-to-event outcome. Stata Journal (4), 505–523 (2009)[45] Jaki, T., Pallmann, P., Magirr, D.: The R package MAMS for designing multi-armmulti-stage clinical trials. Journal of Statistical Software (4), 1–25 (2019)[46] Dimairo, M., Coates, E., Pallmann, P., Todd, S., Julious, S.A., Jaki, T., Wason, J.,Mander, A.P., Weir, C.J., Koenig, F., Walton, M.K., Biggs, K., Nicholl, J., Hamasaki,T., Proschan, M.A., Scott, J.A., Ando, Y., Hind, D., Altman, D.G.: Development21rocess of a consensus-driven CONSORT extension for randomised trials using anadaptive design. BMC Medicine16