Bayesian Sequentially Monitored Multi-arm Experiments with Multiple Comparison Adjustments
Andrew W. Correia
SessionM, Inc.
2 Seaport Lane, 11th Floor
Boston, MA
[email protected]
Abstract
Randomized experiments play a major role in data-driven decision making across many different fields and disciplines. In medicine, for example, randomized controlled trials (RCTs) are the backbone of clinical trial methodology for testing the efficacy of new drugs and therapies versus existing treatments or placebo. In business and marketing, randomized experiments are typically referred to as A/B tests when there are only two arms, or variants, in the experiment, and as multivariate A/B tests when there are more than two arms. Typical applications of A/B tests include comparing the effectiveness of different ad campaigns, evaluating how people respond to different website layouts, or comparing different customer subpopulations to each other.

This paper focuses on multivariate A/B testing from a digital marketing perspective, and presents a method for the sequential monitoring of such experiments while accounting for the issue of multiple comparisons. In adapting and combining the methods of two previous works, the method presented herein is straightforward to implement using standard statistical software and performs quite well in various simulation studies, exhibiting better power and smaller average sample sizes than comparable methods.
Introduction
In traditional A/B testing, a researcher conducts a randomized experiment with two arms, or variants, A and B, which can be thought of as the control and treatment group, respectively. The objective of the experiment is to determine if the people randomized to group B performed better towards some outcome or goal than did the people randomized to the control group, A. One simple example of this would be a clinical trial for patients with high blood pressure testing the efficacy of some new high blood pressure medication versus placebo. Researchers would be interested in knowing whether the group receiving the new medication, group B, saw its members experience a greater reduction in blood pressure levels than did the placebo group, group A. In perhaps a more familiar marketing setting, a typical example might be assigning one group of people, group B, to be shown a certain advertisement while the control group, group A, is not shown that ad. The marketer might then be interested in, say, knowing if group B exhibited an increase in some type of purchasing behavior compared to the control group, group A.

A multivariate A/B test extends the standard A/B test to include more than one treatment group, but still has only one control group. Let the control group continue to be denoted as group A; then the different arms, or variants, in a multivariate A/B test can be thought of as: A, $B_i$, $i = 1, 2, \ldots, m$, where $m$ is some arbitrary number of variants (excluding the control group) to be included in the experiment. As before, an example of a multivariate A/B test might be the same clinical trial for high blood pressure medication, but with three new candidate treatments each being compared against baseline, instead of just one.∗ Similarly, in the context of marketing, a multivariate A/B test would look very much like the previous marketing example, but with, say, four or five different creative types each being compared against the control group, A.

When there are multiple arms in an experiment there are a number of questions a researcher might hope to answer. Some of those questions could include the following:

• What is the best performing variant?
• What are the top three performing variants?
• From a statistical testing standpoint, is each variant significantly better than baseline?
• How do the variants compare to each other?
• For each variant, what is the probability that it is the best performing variant?

The above questions come in addition to the usual questions that arise when undertaking a classical two-arm A/B test, such as:

• How many people do I need in each group to detect a statistically significant difference?
• When I'm calculating my sample size, how do I know how big of a difference I want to detect?
• Can I stop as soon as there's a significant difference between groups?
• How soon can I look at the data from this experiment?

∗ Throughout this paper, baseline and control will be used interchangeably and are taken to mean the same thing. The baseline success rate, for example, is equal to the success rate in the control group.
In the rest of this paper, I outline my proposed methodology for multivariate A/B testing and how it addresses each of the questions outlined above. I begin with a brief overview of statistical hypothesis testing and the different study designs that can be implemented in hypothesis testing under a simple two-arm A/B test. I then present the proposed multivariate A/B testing methodology followed by results for that methodology applied to simulated multivariate A/B experiments, and close with a discussion of possible extensions and additions to this methodology.
Methodology
Overview
At its core, a standard two-arm A/B test is a simple statistical hypothesis test. Denote the null hypothesis by $H_0$ and the alternative hypothesis by $H_1$. The hypotheses being tested are then typically either:

1. $H_0$: Groups A and B perform identically towards the same goal of interest vs. $H_1$: Groups A and B do not perform identically towards the same goal of interest, or
2. $H_0$: Group A performs no worse than group B towards the same goal of interest vs. $H_1$: Group B performs better than group A towards the same goal of interest.

For clarity, assume a digital marketer is interested in converting customers to the latest upgrade of his company's product. To do this, he wants to launch a new campaign aimed at generating conversions. But before launching this campaign to all customers he wants to be sure it will work, so he first decides to conduct an A/B test wherein some customers are randomly assigned to see the new ad (group B) and some are not (group A). Let $\mu_A$ denote the conversion rate in the control group, A, and $\mu_B$ the conversion rate in the exposed group, B. Then the marketer's A/B test will look to test either:

1. $H_0: \mu_A = \mu_B$ vs. $H_1: \mu_A \neq \mu_B$, or
2. $H_0: \mu_A \geq \mu_B$ vs. $H_1: \mu_A < \mu_B$.

To formally test either of these hypotheses there are a few analytical approaches to consider. First note that in the marketing example provided, customer conversion rates are being analyzed. This means that the outcome variable is dichotomous (yes or no, 0 or 1, etc.). For each individual in the experiment, the outcome variable of interest is the variable capturing whether this person installed the latest upgrade, yes or no. To compare the two groups, the analysis could take the form of a Z-test for the comparison of two proportions (assuming the sample size in each group is large enough), Fisher's exact test for the comparison of two proportions, a simple Bayesian analysis comparing the posterior distributions of two Beta distributions, or, if there are other variables that should be considered in the model, perhaps a logistic regression wherein a statistical test is conducted on the coefficient of interest via, e.g., a Wald test or a likelihood ratio test.
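As a concrete illustration, here is a minimal R sketch of the first three of these approaches; the counts are invented for illustration, and the Beta(1, 1) priors are an assumption of the sketch, not a recommendation from the paper:

```r
# Comparing two conversion rates three ways; counts are illustrative.
conv <- c(A = 120, B = 151)   # conversions in each arm
n    <- c(A = 2000, B = 2000) # customers randomized to each arm

# 1. Large-sample Z-test (chi-squared) for two proportions
prop.test(conv, n)

# 2. Fisher's exact test on the 2x2 table of conversions vs. non-conversions
fisher.test(rbind(conv, n - conv))

# 3. Simple Bayesian comparison of two Beta posteriors
draws_A <- rbeta(1e5, 1 + conv["A"], 1 + n["A"] - conv["A"])
draws_B <- rbeta(1e5, 1 + conv["B"], 1 + n["B"] - conv["B"])
mean(draws_B > draws_A)  # posterior P(mu_B > mu_A)
```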
Study Design
In addition to the analytic method, the researcher also needs to decide on a study design. The study design could take the form of a fixed-sample size design, or a sequential sample size design (with or without adaptive randomization) [1, Ch. 87]. In a fixed-sample design, the number of individuals, or customers, to be enrolled in the experiment is fixed a priori based on a sample size calculation prior to the start of the experiment [1, Ch. 87]. That calculation will depend on a number of factors, including: an estimate of the treatment effect (i.e., how much better/worse the exposed group will perform compared to the control group), an estimate of the baseline success (conversion) rate in the control group, the desired Type I error rate for the test ($\alpha$), the desired power to detect the estimated treatment effect (power = 1 − Type II error rate), and the proportion of individuals to be randomized to each group (e.g., 50% in each group, 60%/40%, etc.) [2]. In a fixed-sample size design, no conclusions regarding the hypotheses being tested can be drawn until the full sample has been accrued in each group, otherwise the Type I error will be greatly inflated beyond the pre-specified level [3, 4]. Additionally, the proportion of individuals randomized to each arm of the experiment will typically remain fixed throughout the experiment. The fixed design is often referred to as a traditional [4] or conventional [2] study design, and it is how most experimental trials are carried out [1, Ch. 87].
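To make the fixed-sample calculation concrete, the following base R sketch turns the inputs listed above into a per-arm sample size; the baseline rate, treatment effect, significance level, and power used here are illustrative assumptions, not values from the paper:

```r
# Fixed-sample design: required sample size per arm for two proportions.
power.prop.test(
  p1 = 0.05,        # assumed baseline (control) conversion rate
  p2 = 0.06,        # postulated conversion rate in the treatment arm
  sig.level = 0.05, # desired Type I error rate (two-sided)
  power = 0.80      # desired power (1 - Type II error rate)
)
# The returned n is the sample size per arm, assuming 50/50 allocation.
```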
A sequential sample size design, sometimes also called an adaptive [2] or flexible [4] design, will monitor the data accrued throughout the experiment either continuously, or at administratively convenient intervals, in order to make decisions about the future course of the study while the study is underway [4]. At each interim point, a decision can be made regarding whether the experiment should be stopped because the treatment arm has shown itself to be either a clear winner or loser with respect to the control group. If there are multiple treatment arms in the study, decisions can also be made regarding the re-allocation of individuals to different arms of the experiment by altering the assignment probabilities so that more people are randomized to better performing arms as the study is ongoing, or by completely dropping arms that are exhibiting a very high probability that they are ineffective.† Under such a design, the sample size of the study is typically not determined a priori, though in many cases a maximum sample size will be defined a priori in order to keep the size of the experiment manageable. Similarly, in some cases the postulated treatment effect need not be provided either. However, the desired power and Type I error rate will play a role in study design, and adjustments need to be made to classical test statistics to avoid inflating the Type I error of the test. For studies in which the experimenter has pre-determined when and how many interim looks at the data there will be, a common frequentist approach to controlling the Type I error of a sequentially-designed study is to specify an alpha spending function [5], examples of which include the Pocock [6], Peto [7], and O'Brien-Fleming [8] spending functions. Several sequential and adaptive Bayesian approaches also exist [e.g. 9, 10, 11, 12], though many Bayesian-designed studies will ultimately use thresholds corresponding to frequentist characteristics, making the results of the experiment difficult to interpret [13].

Among sequential tests that do not require a sample size to be calculated a priori, perhaps the most well-known is Wald's sequential probability ratio test (SPRT) [14], which allows for continuous monitoring of a test between two pre-specified hypotheses, with the experiment terminating when the test statistic crosses one of two thresholds - one corresponding to accepting the null hypothesis, and the other the alternative. The test, however, only considers two simple hypotheses, and is therefore challenging to apply in the more commonplace scenarios of composite hypotheses presented above, for example. Several extensions to the SPRT have been developed [e.g. 15, 16, 17], and many of these do extend the SPRT to allow for one-sided hypothesis testing. An alternative, but very similar, approach was developed by Cheng and Shen [18] under a Bayesian decision-theoretic framework, which is very appealing for the problem presented here, and which I discuss in more detail below.
Sequential Monitoring in A/B Testing
While fixed-sample designs have historically been the most popular approach, sequential designs have been around since the first half of the last century - for roughly 70 years dating back to Wald's SPRT [14] in 1945 - and there are several other examples of this type of methodology and/or extensions of the original SPRT in the literature [see, e.g., 14, 15, 16, 17, 19, 20, 21, to name a few]. The appeal of sequential sample size designs is that, on average, they allow researchers to answer questions with smaller sample sizes than do fixed sample size designs [1, Ch. 87]. The reason, however, that fixed sample designs have been most popular is that sequential sample size designs are limited to settings where outcomes can be categorized as successes or failures, and where that categorization can be made shortly after enrollment in the study. The categorization is important, because continuing the experiment to include the next person or group of persons is dependent on being able to observe the outcomes of those already enrolled.

† Note that throughout the rest of this document, unless otherwise stated, the term adaptive will only be used with respect to adaptive randomization, i.e., continuously updating the randomization of individuals to treatment groups based on how well each group has performed up until that point - better performing groups would receive more subjects, poorly performing groups fewer. This is known as outcome-adaptive randomization [1, Ch. 88]. More on this in the Discussion section.
A Bayesian Sequential Design for A/B Testing
The title of this section plays on the title of Cheng and Shen's 2005 paper, Bayesian Adaptive Designs for Clinical Trials [18]. Despite their use of the word "adaptive" in the title, they do not develop nor implement adaptive procedures in the way defined here. Rather, they develop a Bayesian decision-theoretic approach to a sequential two-arm clinical trial, or A/B test, in which the total sample size in the experiment is not fixed a priori, nor is a maximum sample size imposed. Their method is as follows [18].

Consider a one-sided hypothesis test similar to those defined earlier, in which the researcher wishes to test:

$H_0: \mu_A + \theta_0 \geq \mu_B$ vs. $H_1: \mu_A + \theta_0 < \mu_B$.

Define $\theta = \mu_B - \mu_A$, where, as before, $\mu_B$ represents the average response rate in the exposed group, B; $\mu_A$ represents the average response rate in the control group, A; and $\theta_0$ is some constant $\geq 0$. For simplicity let $\theta_0 = 0$ throughout, so that the hypothesis test can be rewritten as:

$H_0: \theta \leq 0$ vs. $H_1: \theta > 0$.

Let $A$ and $R$ denote the actions of either accepting or rejecting the null hypothesis, respectively. The authors then define the following loss functions:

$L(\theta, A) = \begin{cases} 0, & \text{if } \theta \leq 0 \\ K_1, & \text{if } \theta > 0 \end{cases}$ and $L(\theta, R) = \begin{cases} K_0, & \text{if } \theta \leq 0 \\ 0, & \text{if } \theta > 0 \end{cases}$,

where $K_0$ and $K_1$ are the losses for making a Type I and Type II error, respectively.

The goal is to devise a stopping rule to minimize the loss. Define $X_j = \{X_1, \ldots, X_j\}$ as the cumulative data collected up to step $j$ of the experiment (in other words, after $j$ looks at the data), and define the corresponding information set at that time as the $\sigma$-algebra $\mathcal{F}_j = \sigma(X_j)$. The total cost (or loss) of stopping the experiment after the $j$-th look at the data is the cost of enrolling everyone in the study up to and including that point, plus the loss of either accepting or rejecting the null hypothesis at that point, whichever is smaller:

$L_{stop}(X_j) = 2K_2 \sum_{i=1}^{j} B_i + \min\left[ E\{L(\theta, A) \mid \mathcal{F}_j\}, E\{L(\theta, R) \mid \mathcal{F}_j\} \right]$, (1)

where $K_2$ is the cost of enrolling one person into the study, and $B_i$ is the number of additional people enrolled in each arm of the study between looks $i - 1$ and $i$ of the experiment. Further, note that

$E\{L(\theta, A) \mid \mathcal{F}_j\} = K_1 P(\theta > 0 \mid \mathcal{F}_j)$ and $E\{L(\theta, R) \mid \mathcal{F}_j\} = K_0 P(\theta \leq 0 \mid \mathcal{F}_j)$.

Similarly, the total cost of continuing the experiment through the $(j+1)$-th look at the data can be expressed as:

$L_{cont}(X_j) = 2K_2 \sum_{i=1}^{j+1} B_i + E\left( \min\left[ E\{L(\theta, A) \mid \mathcal{F}_{j+1}\}, E\{L(\theta, R) \mid \mathcal{F}_{j+1}\} \right] \,\Big|\, \mathcal{F}_j \right)$, (2)

where the outside expectation in (2) is taken with respect to the posterior predictive distribution of $\theta$.

The decision to reject or accept the null hypothesis at the $j$-th look would proceed as follows: if the expected loss of accepting the null hypothesis at step $j$ is less than or equal to the expected loss of rejecting the null hypothesis at step $j$, then accept the null; otherwise, reject. Notationally, the rejection region, $R_j$, can be defined as:

$R_j = \left\{ X_j : \frac{P(\theta > 0 \mid \mathcal{F}_j)}{P(\theta \leq 0 \mid \mathcal{F}_j)} \geq \frac{K_0}{K_1} \right\}, \quad j = 1, 2, \ldots$ (3)

The authors show that the Type I and Type II error rates can be controlled by judicious selection of the parameters $(K_0, K_1, K_2)$. Specifically, they show that for a desired Type I error rate of $\alpha$, setting the ratio $K_0/K_1 = (1 - \alpha)/\alpha$ will place an upper bound on the Type I error of their test at $\alpha$, while a grid search over the value of the ratio $K_2/K_1$ can be performed to obtain the desired Type II error rate.

To select the optimal study design that minimizes the expected loss, the following two-step procedure is followed:

1. If $L_{stop}(X_j) \leq L_{cont}(X_j)$, the experiment is stopped at interim look $j$. If at that point $X_j$ is in the rejection region, $R_j$, reject the null; otherwise accept.
2. If $L_{stop}(X_j) > L_{cont}(X_j)$, then continue the experiment for one more look at the data and repeat steps 1 and 2.

Because of the iterative nature of this study design, the stopping time of the trial, denoted $M$, is not pre-fixed but rather is a random integer. The authors show in their Theorem 1 that for $K_2 > 0$, $M$ is a stopping time satisfying $P(M < \infty \mid \theta) = 1$. In other words, if $K_2 > 0$, then the experiment is guaranteed to terminate at some point and not go on forever.
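For binary outcomes, the posterior probabilities entering (3) are easy to approximate by Monte Carlo. The sketch below, with illustrative counts and independent Beta(1, 1) priors (assumptions of this sketch, not choices made in [18]), checks whether an interim look falls in the rejection region:

```r
# Checking the rejection region (3) at interim look j for binary outcomes.
alpha     <- 0.05
threshold <- (1 - alpha) / alpha    # K0 / K1 = (1 - alpha) / alpha

x_A <- 95;  n_A <- 1800             # successes / enrolled, control arm (illustrative)
x_B <- 130; n_B <- 1800             # successes / enrolled, treatment arm (illustrative)

draws_A <- rbeta(1e5, 1 + x_A, 1 + n_A - x_A)
draws_B <- rbeta(1e5, 1 + x_B, 1 + n_B - x_B)

p_gt <- mean(draws_B - draws_A > 0) # P(theta > 0 | F_j), with theta = mu_B - mu_A
bayes_factor <- p_gt / (1 - p_gt)   # test statistic (4) below
reject <- bayes_factor >= threshold # is X_j in R_j?
```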
Extensions
In this section I discuss my adaptation of the Cheng and Shen methodology outlined above. First, note the form of the rejection region, $R_j$, defined in (3). The test statistic checked at every interim look is:

$\frac{P(\theta > 0 \mid \mathcal{F}_j)}{P(\theta \leq 0 \mid \mathcal{F}_j)}$, (4)

which is easily identified as the Bayes factor for a model with $\theta > 0$ compared to a model with $\theta \leq 0$, under no a priori assumption with respect to favoring one hypothesis over the other. Note, also, the striking similarity between this test statistic and that of the classic Wald SPRT [14], which can also be interpreted as a Bayes factor, although one comparing two simple hypotheses. It should not be surprising, then, that the critical values for those two test statistics are so similar. For Wald's SPRT, the null hypothesis is rejected once the Bayes factor is greater than $(1 - \beta)/\alpha$, and is accepted if the Bayes factor is less than $\beta/(1 - \alpha)$, where $\alpha$ and $\beta$ are the Type I and Type II errors, respectively. If at any given interim check of the data either of those thresholds is crossed, the experiment is stopped; otherwise more subjects are enrolled until a decision can be made.

Second, I examine the stopping rule derived by Cheng and Shen; specifically, that the study is stopped once $L_{stop}(X_j) \leq L_{cont}(X_j)$, or equivalently, when

$2K_2 B_{j+1} + E\left( \min\left[ E\{L(\theta, A) \mid \mathcal{F}_{j+1}\}, E\{L(\theta, R) \mid \mathcal{F}_{j+1}\} \right] \,\Big|\, \mathcal{F}_j \right) \geq \min\left[ E\{L(\theta, A) \mid \mathcal{F}_j\}, E\{L(\theta, R) \mid \mathcal{F}_j\} \right]$. (5)

Letting

$Y_j = \min\left[ E\{L(\theta, A) \mid \mathcal{F}_j\}, E\{L(\theta, R) \mid \mathcal{F}_j\} \right]$,

the authors show, by Jensen's inequality, that:

$E(Y_{j+1} \mid \mathcal{F}_j) \leq \min\left( E\left[ E\{L(\theta, A) \mid \mathcal{F}_{j+1}\} \mid \mathcal{F}_j \right], E\left[ E\{L(\theta, R) \mid \mathcal{F}_{j+1}\} \mid \mathcal{F}_j \right] \right) = Y_j$. (6)

Following the notation in (6), the stopping rule (5) can be rewritten as

$2K_2 B_{j+1} \geq Y_j - E(Y_{j+1} \mid \mathcal{F}_j)$. (7)

By the proof of Theorem 1 in [18] (or (6) above), $0 \leq Y_j - E(Y_{j+1} \mid \mathcal{F}_j)$, with equality coming by taking the limit as $j \to \infty$ [18]. Taking $K_2 > 0$ therefore guarantees $M < \infty$. However, I argue that, in the setting of digital marketing, it is not unreasonable to set $K_2 = 0$.

First, consider that the primary concern here is that of a mobile or digital marketer serving ads to his customers. If that marketer has the ability to send an ad to as many as 500,000 customers, the cost of sending a campaign out to 200,000 people versus 100,000, for example, is practically negligible; even if $K_2$ is not truly zero, it should be very, very close. Secondly, a marketer will likely be operating from the standpoint of wanting to run a campaign for some number of pre-specified weeks, but with the option to stop if results show themselves to be significant before that point. In practice, therefore, the experiment would never run for an infinite amount of time. The sequential monitoring procedure simply allows the marketer to monitor the campaign continuously and stop the experiment at any time, with all of the results maintaining statistical validity with respect to the Type I error rate whenever the campaign is halted.

Further, by recognizing that the test statistic (4) can be interpreted as a Bayes factor, I propose that the test can easily be extended to two-sided hypotheses of the form:

$H_0: \theta = 0$ vs. $H_1: \theta \neq 0$,

by altering the test statistic to:

$\frac{P(\theta \neq 0 \mid \mathcal{F}_j)}{P(\theta = 0 \mid \mathcal{F}_j)}$, (8)

with a rejection region of

$R_j = \left\{ X_j : \frac{P(\theta \neq 0 \mid \mathcal{F}_j)}{P(\theta = 0 \mid \mathcal{F}_j)} \geq c_{\alpha/2} \right\}, \quad j = 1, 2, \ldots$ (9)
where $c_{\alpha/2}$ is a threshold cutoff, the value of which is selected to ensure a desired Type I error rate, $\alpha$. Because I set $K_2 = 0$, the rejection region above also doubles as the stopping rule: if $X_j \in R_j$, stop the experiment and reject the null hypothesis; otherwise, continue enrolling customers into the study. Such a stopping rule is very similar to that of the Wald SPRT. Via simulation (results in a later section), I show that setting $c_{\alpha/2} = (1 - \alpha/2)/(\alpha/2)$ controls the Type I error of the test at $\alpha$. Additionally, because the hypothesis test is two-sided, the stopping rule encompasses both superiority and inferiority of the treatment arm.
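A minimal sketch of the resulting monitoring loop follows. The helpers get_new_data() and bf_two_sided() are hypothetical stand-ins for accruing the cumulative data $X_j$ and computing the Bayes factor (8); a cap on the number of looks is imposed purely for convenience, as in the simulations later:

```r
# Sequential monitoring with K2 = 0: stop and reject as soon as the
# two-sided Bayes factor (8) crosses c_{alpha/2}; otherwise keep enrolling.
alpha <- 0.05
c_cut <- (1 - alpha / 2) / (alpha / 2)  # = 39 when alpha = 0.05

run_experiment <- function(get_new_data, bf_two_sided, max_looks = 40) {
  for (j in seq_len(max_looks)) {
    x_j <- get_new_data(j)              # cumulative data through look j
    if (bf_two_sided(x_j) >= c_cut) {
      return(list(look = j, decision = "reject H0"))
    }
  }
  list(look = max_looks, decision = "accept H0")  # cap reached
}
```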
Estimation

Any number of estimation procedures would work under this framework. I prefer a Bayesian approach to the problem, because a posteriori comparisons and transformations are greatly simplified under a Bayesian framework, and because it is also appealing to assign a probability of being the best performing arm to each arm of a multivariate A/B test - something that is not feasible in a frequentist framework. Specifically, I choose to estimate parameters via a logistic regression, which provides the flexibility to additionally control for any demographic variables available if desired, as well as to extend the model to easily include multiple treatment groups and any interactions between treatment group and demographics, all in one single estimation procedure. The simple two-arm model can take the following form:

$\mathrm{logit}(p_i) = \alpha + \beta \cdot trt_i + \sum_{j=1}^{k} \gamma_j z_j$, (10)

where $p_i$ is the probability of a success (e.g., a conversion), $trt_i$ is a variable capturing whether person $i$ is in the treatment or control group, each $z_j$ is one of $k$ optional demographic variables to additionally control for, and $\Theta = (\alpha, \beta, \gamma_1, \ldots, \gamma_k)$ is a vector of parameters to be estimated by the model. The main parameter of interest is $\beta$, and the hypothesis test can be written:

$H_0: \beta = 0$ vs. $H_1: \beta \neq 0$.

Note that $\beta$ captures the log-odds of success for the treatment group compared to the control group, though it can easily be transformed to give the estimated probability of success in each group. The Bayes factor (8) can be calculated by comparing the model that includes the treatment variable to one that does not. As model complexity grows, and as the model is extended to accommodate multivariate A/B tests, a fully Bayesian framework can prove computationally intensive, particularly when evaluating tens or hundreds of campaigns at once. In practice, therefore, the parameters of the model can be estimated using standard statistical software, e.g., via the glm function in R, with samples from the posterior distribution of the parameters generated using the functionality of R's arm package. Additionally, the Bayes factor can be approximated by using Schwarz's Bayesian Information Criterion (BIC) from the fitted models [23, 24, 25], which also greatly reduces computation time.
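As a sketch of that workflow (the data frame dat and its columns y, trt, and z1 are illustrative assumptions): fit the logistic model (10) with and without the treatment indicator, approximate the Bayes factor (8) from the two BICs, and draw posterior simulations with arm:

```r
library(arm)  # for sim(); assumes dat has columns y (0/1), trt (0/1), z1

fit1 <- glm(y ~ trt + z1, data = dat, family = binomial)  # model (10)
fit0 <- glm(y ~ z1,       data = dat, family = binomial)  # same model, no trt

# BIC approximation to the Bayes factor: BF ~ exp((BIC_0 - BIC_1) / 2)
bf <- exp((BIC(fit0) - BIC(fit1)) / 2)

# Posterior simulations for a posteriori comparisons and transformations
post  <- sim(fit1, n.sims = 2000)
betas <- coef(post)  # n.sims x p matrix of simulated coefficients
p_B   <- plogis(betas[, "(Intercept)"] + betas[, "trt"])  # success rate, arm B (z1 = 0)
p_A   <- plogis(betas[, "(Intercept)"])                   # success rate, arm A (z1 = 0)
```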
Multivariate A/B Tests and Multiple Testing

The methodology outlined to this point covers only a simple two-arm A/B test. In this section, I present extensions of the results above to multivariate (> 2-arm) A/B tests, which requires accounting for the multiple comparisons being made. Perhaps the best known adjustment is the Bonferroni correction, which divides the desired Type I error rate, $\alpha$, by the number of hypotheses being tested, $m$, yielding an adjusted Type I error rate of $\alpha^* = \alpha/m$ for each of the $m$ tests. Other methods for controlling the Type I error rate include the Šidák correction and the Holm-Bonferroni correction, among others [see, e.g., 27]. All of these procedures aim to control the familywise error rate (FWER) across the group of tests at $\alpha$. The main drawback to Bonferroni-type procedures is that the Type I error problem is adjusted at the expense of the Type II error, meaning the power of the statistical test can be greatly reduced using one of these adjustments. An alternative approach to the multiple comparisons problem is to, instead of controlling the overall Type I error rate, control the False Discovery Rate (FDR). In doing so, the researcher is able to ensure that, among all tests declared statistically significant, only a certain percentage are expected to be false positives [28]. Under FDR adjustment, the test is less conservative with respect to Type I error, though the researcher will have confidence that a high percentage of the significant results detected represent real differences. A common thread between methods controlling the FWER and those controlling the FDR is that both do so by means of a p-value adjustment.

An alternative approach to the multiple comparisons problem is described by Gelman, et al. [26], wherein the authors shift the multiple comparisons problem into a modeling issue and address it by operating from a Bayesian multilevel modeling framework. Unlike FWER or FDR adjustments which, in essence, keep point estimates stationary and adjust for multiple comparisons by making confidence intervals wider (or, equivalently, by adjusting the p-values), the multilevel modeling approach instead shifts point estimates and their corresponding intervals closer together - a process known as "shrinkage" or "partial pooling." In this way, the estimates from a multilevel model naturally make comparisons more conservative by increasing the likelihood that any intervals for comparisons contain zero [26]. At the same time, the method does not correct for multiple comparisons at the expense of the power of the test, unlike many of the traditional methods discussed above [26].

Rather than operating from the standard Type I error paradigm, Gelman, et al. argue that more troublesome issues are when either: 1. estimators have the incorrect sign, or 2. estimators declare an effect to be small when it is actually large, or declare it to be large when in fact it is near zero. They dub these errors of sign and errors of magnitude - Type S and Type M errors, respectively. While the multilevel Bayesian approach addresses Type S and Type M errors, it also does well to control the conventional Type I error while maintaining good power. Additionally, given the Bayesian nature of the analysis proposed thus far, I find it desirable to address the multiple comparison issue in a very natural Bayesian way as opposed to making p-value adjustments, since p-values are inherently very non-Bayesian. Simulation results summarizing the performance of the full multivariate A/B testing methodology are presented in a later section, but first consider how the approach of Gelman, et al.
[26] can be applied to the method presented thus far.

Recall that in a simple two-arm A/B test, the comparison between the treatment and control group can be made by fitting the model given by (10). In an $(m+1)$-arm A/B test ($m > 1$), that model can be extended to:

$\mathrm{logit}(p_i) = \alpha + \sum_{r=1}^{m} \beta_r \cdot trt_{i,r} + \sum_{j=1}^{k} \gamma_j z_j$, (11)

where now $trt_{i,r}$ is an indicator for whether person $i$ was on arm $B_r$, and $\beta_r$ is the corresponding coefficient capturing the log-odds of success in group $B_r$ compared to the baseline group, A. However, to account for multiple comparisons under (11), some kind of p-value adjustment would need to be made to adjust the FWER or FDR of the test. Instead, I propose the following multilevel model based on [26]:

$\mathrm{logit}(p_i) = \alpha_{r[i]} + \sum_{j=1}^{k} \gamma_j z_j$, (12)

where $\alpha_{r[i]}$ captures the log-odds of success in person $i$'s arm of the experiment, and

$\alpha_r \sim N(\mu, \sigma^2_\alpha)$.

The caveat is that now $r = 1, 2, \ldots, m, m + 1$, meaning that the control group's success rate is also part of the pooling estimation. I believe this is reasonable in the sense that it is not unrealistic to assume these are all realizations from a common distribution - it is the same outcome being measured in each arm of the experiment, and it seems plausible that, across multiple experiments, some arms might perform better than baseline, some worse, and some very similarly. Assuming $\alpha_1$ captures the log-odds in the control group, the multiple hypotheses being tested are of the form:

$H_0: \alpha_r - \alpha_1 = 0$ vs. $H_1: \alpha_r - \alpha_1 \neq 0, \quad \forall r \geq 2$. (13)

The model in (12) can be fit using standard statistical software, e.g., via the glmer function in the lme4 package, with posterior simulations given by the arm package. It is also easy to extend the model to obtain subgroup estimates by treatment group. For more details on the multiple testing procedure, the reader is encouraged to see [26].
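A sketch of fitting (12) and forming the comparisons in (13) follows, again assuming an illustrative data frame dat with outcome y, a factor arm whose first level is the control group, and a demographic z1; the slot layout noted in the comments is my reading of arm's simulation output for lme4 fits:

```r
library(lme4)  # glmer()
library(arm)   # sim()

fit  <- glmer(y ~ z1 + (1 | arm), data = dat, family = binomial)  # model (12)
post <- sim(fit, n.sims = 2000)

# Simulated log-odds alpha_r for each arm: overall intercept plus arm effect.
# post@fixef is n.sims x p; post@ranef$arm is n.sims x (m + 1) x 1.
a_r <- post@fixef[, "(Intercept)"] + post@ranef$arm[, , 1]

# Draws of alpha_r - alpha_1 for each treatment arm vs. control (level 1)
diffs <- a_r[, -1, drop = FALSE] - a_r[, 1]
apply(diffs, 2, quantile, probs = c(0.025, 0.975))  # interval per comparison
```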
Simulation Results

In this section, simulation results for the full method outlined above are presented, combining the estimation procedure of (12) with the decision rule given by (9). Table 2 summarizes the average expected outcomes of five hypothetical multivariate A/B experiments, each of which is a five-arm (four treatment + one control) experiment with equal allocation in each arm. The results presented for each multivariate A/B experiment represent an average over 1,000 simulated trials. The hypotheses being tested are two-sided hypothesis tests taking the form of (13). All hypothesis tests are carried out at the $\alpha = 0.05$ level, yielding a cutoff value of $c_{\alpha/2} = 39$. For convenience, a maximum of 20,000 people in each arm is enforced, and the data are checked after every 500 enrollments in each arm, for a total of 40 interim looks at the data.
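Putting the pieces together, one simulated trial under this protocol might look like the simplified sketch below, which tests each arm against control separately (the paper's procedure uses the joint multilevel fit); bf_vs_control() is a hypothetical helper returning a BIC-approximated Bayes factor, and the true rates are illustrative:

```r
# One simulated five-arm trial: enroll 500 per arm between looks, test each
# treatment arm against control, record the look at which each arm crosses 39.
p_true <- c(0.05, 0.05, 0.055, 0.06, 0.065)  # control first; illustrative rates
c_cut  <- 39
batch  <- 500
looks  <- 40                                  # 40 x 500 = 20,000 cap per arm

y <- replicate(5, integer(0), simplify = FALSE)
crossed <- rep(NA_integer_, 4)                # look at which each arm rejected

for (j in seq_len(looks)) {
  for (r in 1:5) y[[r]] <- c(y[[r]], rbinom(batch, 1, p_true[r]))
  for (r in 2:5) {
    if (is.na(crossed[r - 1]) && bf_vs_control(y[[1]], y[[r]]) >= c_cut)
      crossed[r - 1] <- j
  }
  if (!anyNA(crossed)) break                  # every comparison resolved
}
```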
The fields in Table 2 are as follows:

Table 1: Results of Interest from Simulated Experiments

Field - Description
$p_0$ - Probability of success in the control group, i.e., the baseline success rate.
$p_r$ - Probability of success in treatment group $r$.
Power - Estimated power to detect a difference between each arm and baseline; formally, P(reject $H_0$ | $H_1$ true).
Type I Error ($\hat{\alpha}$) - Estimated Type I error for each arm; formally, P(reject $H_0$ | $H_0$ true).
$N$ - Average sample size required in each arm across all simulated trials.
Fixed-Sample Avg. Power - Average power for a fixed-sample design given the observed $N$ from each simulated trial, without multiple testing correction.
Fixed-Sample Avg. Power (Bonferroni) - Average power for a fixed-sample design given the observed $N$ from each simulated trial, with a Bonferroni correction.
FWER - Familywise error rate: percentage of simulated trials producing at least one false positive.
FDR - False discovery rate: percentage of statistically significant results that are actually false positives.
Per-test $\hat{\alpha}$ - Average Type I error rate across all arms where $H_0$ is true.

[Table 2: Five-arm Simulation Results. One row per arm of each experiment, with columns $p_0$, $p_r$, Power ($1 - \hat{\beta}$), Type I Error ($\hat{\alpha}$), $N$, Fixed-Sample Avg. Power, Fixed-Sample Avg. Power (Bonferroni), FWER, FDR, and Per-test $\hat{\alpha}$.]
Only the experiment with $p_0 = 0.50$ had more than one arm where the null hypothesis was true. Still, the results in Table 2 provide a good overview of the performance of the methodology where there are only a handful of hypotheses being simultaneously tested (only 4 comparisons being made) and where the null hypothesis is not true in most arms.

Table 3 below displays analogous results to those in Table 2 above, but for 11-arm (ten treatment + one control) experiments with equal allocation in each arm. In each experiment, the null hypothesis of no treatment effect is true in five out of ten of the arms. To conserve space, the results for only two (instead of five) such experiments are presented: one with $p_0 = 0.5$, and the other with $p_0 = 0.05$. As before, the results presented for each multivariate A/B experiment represent an average over 1,000 simulated trials; all hypothesis tests were two-sided and carried out at an $\alpha = 0.05$ level; and a maximum of 20,000 people in each arm was enforced, with interim checks at every 500 enrollments for each arm. Even in this scenario, the proposed method performs no worse than a Bonferroni-corrected experiment with respect to power, and for "larger" differences between treatment and control the proposed method far outperforms it - even doing better, again, than fixed-sample designs with no multiple testing correction. Of particular note, the FWER is controlled below the $\alpha = 0.05$ level, and the FDR is quite low, between 1% and 1.5% for both experiments.

Another attractive aspect of this method is that, in fitting the logistic model under a Bayesian framework, it is easy to obtain, via posterior simulation, the probability corresponding to each arm that it is the best performing arm in the study at each interim check, or $P(p_{r^*} = \max_r(p_r) \mid \mathcal{F}_j)$. The result is a procedure that not only quantifies lift over baseline and whether that lift is statistically significant or not, but also, through these posterior probabilities, provides a means to compare the strength of evidence against the null hypothesis for each of the arms in the study to one another. In other words, if two arms show significant lift over baseline and one has a point estimate of 0.03 and the other 0.02, without digging much deeper it would be reasonable to conclude that they performed very similarly to each other. If, however, you have readily available the fact that $P(p_{0.03} = \max_r(p_r) \mid \mathcal{F}_j)$ is substantially larger than $P(p_{0.02} = \max_r(p_r) \mid \mathcal{F}_j) = 0.25$, the arm with the point estimate of 0.03 now carries more weight than if one only knew the point estimates. Figure 1 below shows how these probabilities change throughout a trial as the data collected grows. Specifically, Figure 1 plots the average of $P(p_{r^*} = \max_r(p_r) \mid \mathcal{F}_j)$ for each $j$ and for each treatment arm corresponding to the simulated trial in Table 3 with $p_0 = 0.5$.

[Table 3: 11-arm Simulation Results - same columns as Table 2.]
[Figure 1: $P(p_{r^*} = \max_r(p_r))$ vs. $N$; 11 arms, $p_0 = 0.5$. One line per success rate ($p$ = 0.48, 0.49, 0.5, 0.51, 0.52, 0.53), with the y-axis giving the probability that each arm is the best.]
For convenience, all five arms with a success rate of 0.5 are condensed into one line in Figure 1. As expected, the arm with a success rate of 0.53, the maximum among all arms in the experiment, breaks away from the rest and converges towards one, while the probabilities of all the other arms being the best slowly fade towards zero. This phenomenon is not unique to this particular simulation, as all simulations presented in Table 2 and Table 3 exhibited nearly identical trends. Indeed, this makes intuitive sense. At the beginning of the experiment we don't know anything about any of the arms in the experiment, so a priori all arms have an equal probability of being the best. On average, however, as more data is accumulated, we're able to more accurately identify which arm is the best performing arm in the study.
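These probabilities are cheap to compute from the posterior simulations. A sketch, reusing the a_r matrix of per-arm log-odds draws from the multilevel-model sketch above:

```r
# P(arm r is the best | F_j) from posterior draws of each arm's success rate
p_draws   <- plogis(a_r)                    # n.sims x (m + 1) success-rate draws
best      <- apply(p_draws, 1, which.max)   # index of the best arm in each draw
prob_best <- tabulate(best, nbins = ncol(p_draws)) / nrow(p_draws)
prob_best                                   # one probability per arm; sums to 1
```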
Additional Comparisons
It is worth noting that the sequential monitoring portion of the method presented in this paper is also quite similar to that of Johari, Pekelis, and Walsh [21] (henceforth referred to as JPW), who appear to have independently derived very similar results to those in [18]. Indeed, they are largely concerned with the same problem presented in this paper - specifically, sequential monitoring and multiple comparisons adjustment in A/B tests as it relates to digital marketing. The method in [21], however, does not carry out model fitting and estimation in a fully Bayesian framework, and employs an FDR correction for multiple comparisons, both of which differ from the method presented here.

Briefly, JPW [21] (or, in a more accessible version, Pekelis, et al. [29]) also define their test statistic as the Bayes factor given by (8), which they denote $\Lambda_n$. They then invert $\Lambda_n$ to obtain a p-value, $p^*$. For a given Type I error, $\alpha$, $p^*$ would be interpreted in the usual way. For a family of p-values, $p^*_i$, $i = 1, \ldots, m$, obtained from a multiple-testing situation, the authors apply an FDR correction, e.g. [28], to control the false discovery rate. To compare the method presented here to that of JPW, I applied the method outlined in [21, 29] to the same 11-arm simulated data that I applied the proposed method to, the results of which were presented in Table 3. Table 4 below shows the results when the approach of [21, 29] is applied to that same data. Specifically, to implement their approach I fit a standard logistic regression model with an indicator for each arm in the experiment, as in (11), and calculated the Bayes factor corresponding to each covariate, $\beta_r$, via a BIC approximation. I then took the inverse of each Bayes factor to obtain the corresponding p-values, and applied the standard Benjamini-Hochberg correction [28] to the p-values at each interim check to control the FDR at 0.05.

It is immediately obvious from Table 4 that the method of JPW, as I have applied it, is extremely conservative when applied to this simulated data. While it is true that the FDR is nearly zero, as is the FWER, in both simulated experiments, this is achieved at the expense of the test's power. This is especially the case when the true difference between arms is small. As the difference between arms increases, the JPW approach begins to perform better, although the average sample size required to detect that difference is still much larger than what is seen from the method I have proposed. Table 5 provides a side-by-side comparison of the proposed method and the JPW method for a handful of metrics, clearly showing that the proposed method has much higher overall accuracy (a lower overall error rate) and accrues substantially fewer subjects, on average, than does JPW's method. On the other hand, JPW present a method that achieves a lower FDR and a lower false positive rate.
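As a sketch, the JPW-style adjustment as implemented here reduces to a few lines: invert each BIC-approximated Bayes factor and feed the resulting p-values to the Benjamini-Hochberg procedure (bf_r below is an assumed vector of per-arm Bayes factors from the model in (11)):

```r
# Invert Bayes factors to p-values and control the FDR at 0.05 at one look.
p_star <- pmin(1, 1 / bf_r)                # p* = 1 / Lambda_n, capped at 1
p_adj  <- p.adjust(p_star, method = "BH")  # Benjamini-Hochberg correction [28]
which(p_adj <= 0.05)                       # arms declared significant
```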
[Table 4: 11-arm Simulation Results, JPW Method - same columns as Table 2.]
[Table 5: Side-by-side comparison of the proposed method and the JPW method for the two 11-arm experiments ($p_0 = 0.5$ and $p_0 = 0.05$). Recoverable entries include overall error rates ((FP + FN)/$N_{tests}$) of 20.00% vs. 33.98% and 7.61% vs. 18.19%, along with rows for the average sample accrued for tests where $p_r \neq p_0$.]

It is possible that some of the difference observed between the two methods is attributable to how I have implemented the JPW method. In particular, in [21, 29] the authors discuss how an important part of their method involves selecting an "optimal prior" for the parameter of interest, something that they base on over 40,000 historical A/B tests run on their company's platform [21]. Still, as that appears to be proprietary information, it is not clear how an independent researcher can adequately evaluate their method without use of anything more than a non-informative prior. Because both methods presented here assume the same prior distribution that is imposed by using the BIC to approximate the Bayes factor, specifically a unit information prior [25], I think the comparison is fair. Additionally, anyone wishing to implement an approach like this for the first time or on an ad hoc basis wouldn't have much, if any, historical data to use to tune their prior, so it is important to evaluate each method's performance in the setting with little prior information. Yet, it is also important to consider that the JPW method would likely see improved performance given a well-tuned prior for the regression coefficients in the logistic regression.

Another natural comparison that is not presented here would be this method's performance relative to an experiment with an O'Brien-Fleming (OBF) stopping rule [8]. Cheng and Shen [18] present this comparison in Table 1 of their paper for a simple two-arm test. When the minimum detectable difference is correctly specified, an OBF design will perform just as well as this method with respect to Type I error, power, and average sample size. If, however, the minimum detectable difference is overestimated, the power in an OBF design will take a substantial hit. Alternatively, if the minimum detectable difference is underestimated, the study will be overpowered and result in a longer study that enrolls more people than needed, on average, compared to the sequential Bayesian design.
Discussion
Additional Considerations
I believe the most natural extension in this framework would be allowing for outcome-adaptive randomization throughout the experiment. That is, as the experiment is ongoing, the study would learn from itself and use the data that has already been accrued to proportionately allocate more people to the best performing arms, while reducing the allocation percentage in arms that are not performing well. Formally, let

$p_{j,r^*}(\max) = P(p_{r^*} = \max_r(p_r) \mid \mathcal{F}_j)$

be the probability that arm $r^*$ is the best performing arm at interim check $j$. Then, moving forward, the percentage of study participants to be allocated to any arm $r^*$ after interim check $j$ can be given by [10, 11]:

$\mathrm{Alloc}_{j+1}(r^*) = \frac{\{p_{j,r^*}(\max)\}^h}{\sum_r \{p_{j,r}(\max)\}^h}$, (14)

where $0 \leq h \leq 1$ is a tuning parameter. Setting $h = 0$ yields conventional randomization, while $h = 1$ gives $\mathrm{Alloc}_{j+1}(r^*) = p_{j,r^*}(\max)$. Thall and Wathen [11] note that $h = 1/2$ or $h = n/(2N)$ are reasonable choices, where $n/N$ is the current sample size divided by the maximum sample size for the experiment, though it could also be thought of as the fraction of the study completed to that point if no maximum sample size has been set. The adaptive randomization could be constructed so that (14) does not include the control group, and instead is only calculated for each of the different treatment arms in the study. Such an approach would likely help to hone in on the best performing variant more quickly, but might not be desirable in a setting where the goal is to accurately measure the treatment effect in each arm, because discontinuing allocation to some arms early on in the study should result in more variable estimates in those arms, meaning they'll be pulled closer towards the overall mean across all arms via (12). Therefore, depending on the goal of the study, this is potentially a decision the researcher must make before conducting the experiment. More work is required to determine exactly how well an adaptive randomization scheme like (14) would perform when used together with this method; a sketch of the allocation rule follows.
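The sketch below implements (14) directly, with an invented vector of "probability best" values for four treatment arms:

```r
# Outcome-adaptive allocation probabilities from equation (14).
alloc <- function(prob_best, h) {
  w <- prob_best^h
  w / sum(w)
}

p_best <- c(0.10, 0.15, 0.25, 0.50)  # illustrative P(arm r is best | F_j)
alloc(p_best, h = 0)                 # h = 0: equal allocation (0.25 each)
alloc(p_best, h = 1)                 # h = 1: allocate by p_best itself
alloc(p_best, h = 0.5)               # intermediate tempering
```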
Conclusions

In this paper I have presented a method for implementing multivariate A/B testing that I believe can be quite useful in online and mobile marketing. The method primarily draws on previous work in clinical trials methodology [18] and applied social sciences research [26] to create a unified approach both to continuous monitoring of A/B tests and to adequately accounting for multiple comparisons in the event of a multivariate A/B test. When compared to other existing methods, the method proposed herein performs quite favorably.

Recall the questions presented in the Introduction. By approaching the problem from a Bayesian perspective, a researcher can easily rank and compare arms, or variants, to each other by making use of the posterior simulations and calculating $p_{j,r^*}(\max)$ for each arm at each interim check, in addition to determining whether each is significantly better than baseline using the sequential monitoring procedure. Additionally, by not specifying the sample size, $N$, in advance, a researcher does not need to worry about setting a minimum detectable difference in the planning stage and thereby potentially enrolling too few or too many people in his study. The method also gives the researcher the freedom to look at the data as often as he pleases, with valid results no matter when he decides to stop the study. Additionally, while the method presented here was developed with binary outcomes in mind, the same rules would apply for a continuous outcome variable.

From the standpoint of an applied researcher, a major advantage of this approach is the relative ease with which it can be carried out. The models to be fitted are familiar, and approximating the Bayes factor using the BIC is straightforward and computationally inexpensive. Using the BIC approximation to the Bayes factor amounts to assuming a unit information prior for all of the parameters in the model, a conservative assumption with respect to rejecting the null hypothesis [25]. Coupled with a conservative rejection criterion of $c_{\alpha/2} = (1 - \alpha/2)/(\alpha/2)$, this helps ensure that the Type I error rate of the procedure is well controlled.

References
[1] Curtis L. Meinert. Clinical Trials Handbook: Design and Conduct. John Wiley & Sons, Inc., Hoboken, NJ, USA, 2012.
[2] The Office of Biostatistics and the Office of New Drugs in the Center for Drug Evaluation and Research. Guidance for Industry: Adaptive design clinical trials for drugs and biologics. Report, U.S. Food and Drug Administration, 2010.
[3] P. Armitage, C. K. McPherson, and B. C. Rowe. Repeated significance tests on accumulating data. Journal of the Royal Statistical Society, Series A, 132:235–244, 1969.
[4] Cyrus R. Mehta. Lecture 1: Introduction to flexible clinical trials. In Sequential Design and Interim Monitoring. Harvard University & Cytel Software Corporation, 2004.
[5] David L. DeMets and K. K. Gordon Lan. Interim analysis: The alpha spending function approach. Statistics in Medicine, 13:1341–1352, 1994.
[6] Stuart J. Pocock. Group sequential methods in the design and analysis of clinical trials. Biometrika, 64:191–199, 1977.
[7] R. Peto, M. C. Pike, P. Armitage, N. E. Breslow, D. R. Cox, S. V. Howard, N. Mantel, K. McPherson, J. Peto, and P. G. Smith. Design and analysis of randomized clinical trials requiring prolonged observations of each patient, I. Mortality results. British Journal of Cancer, 34:585–612, 1976.
[8] Peter C. O'Brien and Thomas R. Fleming. A multiple testing procedure for clinical trials. Biometrics, 35:549–556, 1979.
[9] S. B. Tan and D. Machin. Bayesian two-stage designs for phase II clinical trials. Statistics in Medicine, 21:1991–2012, 2002.
[10] Peter F. Thall and J. Kyle Wathen. Covariate-adjusted adaptive randomization in a sarcoma trial with multi-stage treatments. Statistics in Medicine, 24:1947–1964, 2005.
[11] Peter F. Thall and J. Kyle Wathen. Practical Bayesian adaptive randomization in clinical trials. European Journal of Cancer, 43:859–866, 2007.
[12] J. Kyle Wathen and Peter F. Thall. Bayesian adaptive model selection for optimizing group sequential clinical trials. Statistics in Medicine, 27:5586–5604, 2008.
[13] Valen E. Johnson and John D. Cook. Bayesian design of single-arm phase II clinical trials with continuous monitoring. Working paper series, paper 47, University of Texas, MD Anderson Cancer Center, 2008.
[14] Abraham Wald. Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16:117–186, 1945.
[15] M. Kulldorff, R. L. Davis, M. Kolczak, E. Lewis, T. Lieu, and R. Platt. A maximized sequential probability ratio test for drug and vaccine safety surveillance. Sequential Analysis, 30:58–78, 2011.
[16] D. G. Hoel, G. H. Weiss, and R. Simon. Sequential tests for composite hypotheses with two binomial populations. Journal of the Royal Statistical Society, Series B, 38:302–308, 1976.
[17] William Q. Meeker Jr. A conditional sequential test for the equality of two binomial proportions. Journal of the Royal Statistical Society, Series C, 30:109–115, 1981.
[18] Yi Cheng and Yu Shen. Bayesian adaptive designs for clinical trials. Biometrika, 92:633–646, 2005.
[19] Peter Armitage. Numerical studies in the sequential estimation of a binomial parameter. Biometrika, 45:1–15, 1958.
[20] Peter Armitage. Sequential Medical Trials. Thomas, Springfield, IL, USA, 1960.
[21] Ramesh Johari, Leo Pekelis, and David J. Walsh. Always valid inference: Bringing sequential analysis to A/B testing. arXiv preprint, arXiv:1512.04922, 2015.
[22] Alexandru D. Corlan. Medline trend: automated yearly statistics of PubMed results for any query. http://dan.corlan.net/medline-trend.html, 2016. Accessed: 2016-01-13.
[23] Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
[24] Robert E. Kass. Bayes factors in practice. The Statistician, 42:551–560, 1993. Special issue: Conference on practical Bayesian statistics.
[25] Adrian E. Raftery. Bayes factors and BIC: Comment on "A critique of the Bayesian information criterion for model selection". Sociological Methods & Research, 27:411–427, 1999.
[26] Andrew Gelman, Jennifer Hill, and Masanao Yajima. Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5:189–211, 2012.
[27] Jason C. Hsu. Multiple Comparisons: Theory and Methods. Chapman and Hall, London, UK, 1996.
[28] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57:289–300, 1995.