Optimal Testing in the Experiment-rich Regime
Sven Schmit, Virag Shah, Ramesh Johari
Stanford University
May 31, 2018
Abstract
Motivated by the widespread adoption of large-scale A/B testing in industry, we propose a new experimentation framework for the setting where potential experiments are abundant (i.e., many hypotheses are available to test), and observations are costly; we refer to this as the experiment-rich regime. Such scenarios require the experimenter to internalize the opportunity cost of assigning a sample to a particular experiment. We fully characterize the optimal policy and give an algorithm to compute it. Furthermore, we develop a simple heuristic that also provides intuition for the optimal policy. We use simulations based on real data to compare both the optimal algorithm and the heuristic to other natural alternative experimental design frameworks. In particular, we discuss the paradox of power: high-powered "classical" tests can lead to highly inefficient sampling in the experiment-rich regime.
1 Introduction

In modern A/B testing (e.g., for web applications), it is not uncommon to find organizations that run hundreds or even thousands of experiments at a time [Kaufman et al., 2017, Tang et al., 2010a, Kohavi et al., 2009, Bakshy et al., 2014]. Increased computational power and the ubiquity of software have made it easier to generate hypotheses and deploy experiments. Organizations typically experiment continuously using A/B testing. In particular, the space of potential experiments of interest (i.e., hypotheses being tested) is vast: e.g., testing the size, shape, font, etc., of page elements; testing different feature designs and user flows; testing different messages; and so on. Artificial intelligence techniques are being deployed to help automate the design of such tests, further increasing the pace at which new experiments are designed (e.g., Sensei, Adobe's A/B testing product, is being used in Adobe Target).

This abundance of potential experiments has led to an interesting phenomenon: despite the large numbers of visitors arriving per day at most online web applications, organizations need to constantly consider the most efficient way to allocate these visitors to experiments. For many experiments, baseline rates may be small (e.g., a low conversion rate), or more generally effect sizes may be quite small even relative to large sample sizes. For example, large organizations may be seeking relative changes in a conversion rate of 0.5% or less, potentially necessitating millions of users allocated to a single experiment to discover a true effect. (See Tang et al. [2010a,b], Deng et al. [2013] and Azevedo et al. [2018], where these issues are discussed extensively.) Since organizations have a plethora of hypotheses of interest to test, there is a significant opportunity cost: they must constantly trade off allocation of a visitor to a current experiment against the potential allocation of this visitor to a new experiment.

In this paper, we study a benchmark model with the feature that experiments are abundant relative to the arrival rate of data; we refer to this as the experiment-rich regime. A key feature of our analysis is the impact of the opportunity cost described above: whereas much of optimal experiment design takes place in the setting of a single experiment, the experiment-rich regime fundamentally requires us to trade off the potential for discoveries across multiple experiments. Our main contributions are a complete description of an optimal discovery algorithm for our setting; the development of an effective heuristic; and an extensive data-driven simulation analysis of its performance against more classical techniques commonly applied in industrial A/B testing.

We present our model in Section 2. The formal setting we consider mimics the setting of most industrial A/B testing contexts. The experimenter receives a stream of observational units and can assign them to an infinite number of possible experiments, or alternatives, of varying quality (effect size). We consider a Bayesian setting where there is a prior over the effect size of each alternative, which is natural in a setting with an infinite number of experiments.

We focus on the objective of finding an alternative that is at least as good as a given threshold $s$ as fast as possible. In particular, we call an alternative a discovery if the posterior probability that the effect is greater than $s$ is at least $1 - \alpha$, and the goal is to minimize the expected time per discovery. This is a natural criterion: good performance requires finding an alternative that is actually delivering practically significant effects (as measured by $s$). Adjusting $s$ and $\alpha$ allows the experimenter to trade off the quality and quantity of discoveries made. Note that under this criterion any optimal policy is naturally incentivized to find the "best" experiments, because the discovery criterion is easiest to meet for those alternatives.

In Section 3 we present an optimal policy for allocation of observations to experiments. Since observations arrive sequentially, the problem can be equivalently formulated as minimizing the cumulative number of observations until a discovery is made. We characterize a dynamic programming approximation of this problem, and show this method converges to the optimal policy in an appropriate sense. We also develop a simple heuristic that approximates and provides insight into the optimal policy.

In Section 4 we use data on baseball players' batting averages as input data for a simulation analysis of our approach. Our simulations demonstrate that our approach delivers fast discovery while controlling the rate of false discoveries, and that our heuristic approximates the optimal policy well. We also use the simulation setup to compare our method to "classical" techniques for discovery in experiments (e.g., hypothesis testing). This comparison reveals the ways in which classical methods can be inefficient in the experiment-rich regime. In particular, there is a paradox of power: efficient discovery can often lead to low power in a classical sense, and conversely high-powered classical tests can be highly inefficient in maximizing the discovery rate.

Due to space constraints, all proofs are given in the appendix.

1.1 Related work

The literature on sequential testing goes back many decades. Originally, Wald and Wolfowitz [1948] propose an optimal test, the sequential probability ratio test (SPRT), for testing a simple hypothesis. Chernoff [1959] studies the asymptotics of experimentation with two hypothesis tests and how to assign observations. Lai [1988] proposes a class of Bayesian sequential tests with a composite alternative for an exponential family of distributions. For a more thorough overview of sequential testing, we refer the interested reader to Siegmund [2013], Wetherill and Glazebrook [1986], Shiryayev [1978] and Lai [1997]. None of these approaches consider the opportunity cost associated with having multiple experiments.

Recently, there has been increased interest in sequential testing due to the rise in popularity of A/B testing [Deng et al., 2017, Kaufman et al., 2017, Kharitonov et al., 2015], and the ubiquity of peeking [Johari et al., 2017, Balsubramani and Ramdas, 2016, Deng et al., 2016]. A recent paper by Azevedo et al. [2018] discusses how the tails of the effect distribution affect the assignment strategy of observations to experiments, and complements this work.

There is also a strong connection to the multi-armed bandit literature [Gittins et al., 2011, Bubeck and Cesa-Bianchi, 2012], especially the pure exploration problem [Bubeck et al., 2009, Jamieson et al., 2014, Russo, 2016], where the goal is to find the best arm. The case with infinitely many arms is studied by Carpentier and Valko [2015], Chaudhuri and Kalyanakrishnan [2017], and Aziz et al. [2018]. Locatelli et al. [2016] study the setting of finding the set of arms (out of finitely many) above a given threshold within a fixed time horizon.

Methods to control the false discovery rate in the sequential hypothesis setting are discussed by Foster and Stine [2007], Javanmard and Montanari [2016] and Ramdas et al. [2017]. The connection with multi-armed bandits is made by Yang et al. [2017]. However, the Bayesian framework we propose does not require multiple testing corrections.

The heavy-coin problem [Chandrasekaran and Karp, 2012, Malloy et al., 2012, Jamieson et al., 2016, Lai et al., 2011] is another closely related research area. Here, a fraction of coins in a bag is considered heavy, while most are light. The goal is to find a heavy coin as quickly as possible. These approaches rely on likelihood ratios, as there are only two alternatives, and there is a connection to the CUSUM procedure [Page, 1954]. The approaches mentioned above all consider the same problem as we do in this work, albeit for testing two alternatives against each other.

Optimal stopping rules have been studied extensively, often under the umbrella of the secretary problem [Freeman, 1983, Samuels, 1991]. There, the focus is on comparing across alternatives.
2 Model

In this section we describe the model we study and the objective of the experimenter.
Experiments. We consider a model with an infinite number of experiments, or alternatives, indexed by $i \in \{1, 2, \ldots\}$. Each experiment is associated with a parameter $\mu_i \in M \subset \mathbb{R}$, drawn independently from a common (known) prior $\pi$, that completely characterizes the distribution of outcomes corresponding to that experiment. Throughout our analysis, the experimenter is interested in experiments with higher values of $\mu_i$.

Actions and outcomes. At times $t = 1, 2, \ldots$, the experimenter selects an alternative $I_t$ and observes an independent outcome $X_t$ drawn from a distribution $F(\mu_{I_t})$. Note, in particular, that opportunities for observations arrive in a sequential, streaming fashion. We also assume that observations are independent across experiments.

We assume that $F(\mu_i)$ is described by a single-parameter natural exponential family, i.e., the density for an observation can be written as
$$f_X(x \mid \mu) = h(x) \exp(\mu S(x) - A(\mu)), \quad (1)$$
for known functions $S$, $h$, and $A$. Let $S_{it} = \sum_{k \le t : I_k = i} S(X_k)$ be the canonical sufficient statistic for experiment $i$ at time $t$. Note that in particular, this model includes the conjugate normal model with known variance and the beta-binomial model for binary outcomes.

Policies. Let $\mathcal{F}_t = \sigma\{X_1, I_1, X_2, I_2, \ldots, X_t, I_t\}$ denote the $\sigma$-field generated by the observations and assignments up to time $t$. A policy is a mapping from $\mathcal{F}_t$ to experiments.

Discoveries. The experimenter is interested in finding discoveries, defined as follows.

Definition 1 (Discovery). We say that alternative $i$ is a discovery at time $t$, given $s$ and $\alpha$, if
$$P(\mu_i < s \mid \mathcal{F}_t) < \alpha. \quad (2)$$

Here $s$ and $\alpha$ are parameters that capture the experimenter's preferences, i.e., the level of aggressiveness and risk that she is willing to tolerate. (Note that this is more stringent than the related false discovery rate guarantees [Benjamini and Hochberg, 2007].)

We assume that the prior satisfies $P(\mu_i < s \mid \emptyset) \in (\alpha, 1)$, to avoid the trivial scenarios in which either all or none of the alternatives are discoveries before any trials begin.
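For concreteness, here is a minimal sketch (ours, not from the paper's code release) of the discovery check (2) in the beta-binomial case; the Beta(1, 1) prior and the values of $s$ and $\alpha$ are illustrative defaults only.

```python
from scipy.stats import beta

def is_discovery(y, n, a0=1.0, b0=1.0, s=0.27, alpha=0.05):
    """Discovery check (2) for binary outcomes: with a Beta(a0, b0)
    prior, the posterior after y successes in n trials is
    Beta(a0 + y, b0 + n - y); declare a discovery when the posterior
    mass below s is less than alpha."""
    return beta.cdf(s, a0 + y, b0 + n - y) < alpha

print(is_discovery(70, 200))  # True: posterior mass below 0.27 is ~0.008
print(is_discovery(60, 200))  # False: ~16% of posterior mass is still below 0.27
```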
Objective: Minimize time to discovery.
As motivated in the introduction, informally the objective is to find discoveries as fast as possible. We formalize this as follows: the goal of the experimenter is to design a policy (i.e., an algorithm to match observations to experiments) such that the number of observations until the first discovery is minimized. In particular, define the time to first discovery $\tau$ as
$$\tau = \min\{t : \text{there exists } i^* \text{ such that } P(\mu_{i^*} < s \mid \mathcal{F}_t) < \alpha\}. \quad (3)$$
The goal is then to minimize $E[\tau]$ over all policies. Given this goal, the only decision the experimenter needs to make at each point in time until the first discovery is whether to reject the current experiment or to continue with it.

Discussion. We conclude with two remarks regarding our model.

(1) Posterior validity. Note that at the (random) stopping time $\tau$, the posterior is computed under the potentially adaptive matching policy used by the experimenter. The following lemma shows that when the experimenter computes the posterior and decides to stop at the time $t$ at which the condition $P(\mu_{i^*} < s \mid \mathcal{F}_t) < \alpha$ is met, the decision to stop does not invalidate the discovery.

Lemma 1. The posterior for the discovered experiment $i^*$ at time $\tau$ satisfies
$$P(\mu_{i^*} < s \mid \mathcal{F}_\tau) < \alpha \quad (4)$$
almost surely.

(2) Fixed cost per experiment. In some scenarios, starting a new experiment has a cost; e.g., there may be a cost to implementing a new variant, or results may need to be analyzed on a per-experiment basis. We can incorporate such a cost in the objective, and our results and approach generalize accordingly. Formally, let $c$ be the cost of starting a new experiment, and let $m_t = |\{i : \exists\, t' \le t : I_{t'} = i\}|$ be the cumulative number of experiments matched up to time $t$. We can include the per-experiment cost by considering instead the problem of minimizing $E[\tau + c\, m_\tau]$.

3 Optimal policy
In this section, we characterize the structure of the optimal policy, show that it can be approximated arbitrarily well by considering a truncated problem, and give an algorithm to compute the optimal policy of the truncated problem. Finally, we present a simple heuristic that approximates the optimal policy remarkably well.
3.1 Sequential experimentation

We start with a key structural result that simplifies the search for an optimal policy. The following lemma shows that we can focus on policies that only consider experiments sequentially, in the sense that once a new experiment is being allocated observations, no previous experiment will ever again receive observations.
Lemma 2.
There exists an optimal policy such that $I_{t+1} \ge I_t$ for all $t$, almost surely.

This result hinges on three aspects of our model: experiments are independent of each other, with identically distributed effects $\mu_i$; there is an infinite number of experiments available; and observations arrive in an infinite stream. As a consequence, all experiments are a priori equally viable, and a posteriori, once the experimenter has decided to stop allocating observations to an experiment, she need never consider it again.

Note in particular that this lemma also reveals that any optimal policy for the first discovery also straightforwardly minimizes the expected time until the $k$'th discovery, for any $k$.

3.2 A sequential decision problem

Based on Lemma 2, we can reformulate and simplify the optimization problem faced by the experimenter as a sequential decision problem, where the only choice is whether or not to continue testing the current experiment. We abuse notation to describe this new perspective. Let $\mu$ denote the effect size of the current experiment. In particular, let $X_n$ be the $n$'th observation, and let $\mathcal{F}_n$ be the $\sigma$-field generated by the observations of the current experiment, $(X_1, \ldots, X_n)$. Let $S_n = \sum_{k=1}^{n} S(X_k)$ denote the canonical sufficient statistic at state $n$. The state of the sequential decision problem is $(n, S_n)$: the number of observations and the sufficient statistic of the current experiment.

If $(n, S_n)$ has the property that $P(\mu < s \mid S_n) < \alpha$, then a discovery has been found and the process stops. The following lemma shows that this discovery criterion induces an acceptance region on the sufficient statistic $S_n$, i.e., a sequence of thresholds $a_n$ such that the current experiment is a discovery when $S_n > a_n$.

Lemma 3. There exists a sequence $\{a_n\}_{n=1}^{\infty}$ such that $P(\mu < s \mid S_n) < \alpha$ if and only if $S_n > a_n$.
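In the beta-binomial model the sufficient statistic is the integer number of successes, and Lemma 3 guarantees that the posterior tail probability is monotone in it, so the sequence $\{a_n\}$ can be computed by binary search. A sketch under these assumptions (function and parameter names are ours):

```python
import numpy as np
from scipy.stats import beta

def acceptance_thresholds(k, a0=1.0, b0=1.0, s=0.27, alpha=0.05):
    """a[n] = fewest successes in n trials that make the alternative a
    discovery, i.e., the smallest integer y with
    P(mu < s | y successes in n trials) < alpha (np.inf if none)."""
    a = np.full(k + 1, np.inf)
    for n in range(1, k + 1):
        # the posterior tail is decreasing in y (Lemma 3), so
        # binary-search for the first y satisfying the criterion
        lo, hi = 0, n + 1
        while lo < hi:
            mid = (lo + hi) // 2
            if beta.cdf(s, a0 + mid, b0 + n - mid) < alpha:
                hi = mid
            else:
                lo = mid + 1
        if lo <= n:
            a[n] = lo
    return a
```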
If $S_n < a_n$, then the experimenter can make one of two decisions:

1. Continue (i.e., collect one additional observation on the current experiment); or

2. Reject (i.e., quit the current experiment and collect the first observation of a new experiment).

If Continue is chosen, the state updates to $(n+1, S_{n+1})$. If Reject is chosen, the state changes to $(1, S_1)$, where $S_1$ is an independent draw of the sufficient statistic after the first observation; in either case, the process continues. The goal of the experimenter is to minimize the expected time until the observation process stops, i.e., until a discovery is found. Let $V(n, S_n)$ be this minimum, starting from state $(n, S_n)$. The Bellman equation for this process is as follows:
$$V(n, S_n) = 0, \qquad S_n > a_n; \quad (5)$$
$$V(n, S_n) = 1 + \min\{E[V(n+1, S_{n+1}) \mid S_n],\; E[V(1, S_1)]\}, \qquad S_n \le a_n,\; n \ge 1. \quad (6)$$
The first line corresponds to the case where $S_n$ is in the acceptance region, i.e., the process stops. In the second line, we consider two possibilities: continuing incurs a unit cost for the current observation, plus the expected cost from the state $(n+1, S_{n+1})$; rejecting resets the state with no cost incurred. The optimal choice is found by minimizing between these alternatives. The expected number of samples $T^*$ until a discovery satisfies $T^* = 1 + E[V(1, S_1)]$.

3.3 Structure of the optimal policy

The following theorem shows that an optimal policy for the dynamic programming problem (5)-(6) can be expressed using a sequence of rejection thresholds on the sufficient statistic. That is, for each $n$ there is an $r_n$ such that it is optimal to Continue if $S_n \ge r_n$, and to Reject if $S_n < r_n$.
Theorem 4. There exists an optimal policy for (5)-(6) described by a sequence of rejection thresholds $\{r_n\}_{n=1}^{\infty}$ such that, after $n$ observations, Reject is declared if $S_n < r_n$, Continue is declared if $r_n \le S_n \le a_n$, and the process stops with a discovery if $S_n > a_n$.

The remainder of the section is devoted to computing the optimal sequence of rejection thresholds.

3.4 Approximating the optimal policy via truncation
In order to compute an optimal policy, we consider a truncated problem. This problem is identical in every respect to the problem in Section 3.2, except that we consider only policies that must choose Reject after $k$ observations. We refer to this as the $k$-truncated problem.

Let $V_k(n, S_n)$ denote the minimum expected cumulative time to discovery for the $k$-truncated problem, starting from state $(n, S_n)$. The Bellman equation is nearly identical to (5)-(6), except that now $V_k(k, S_k) = 1 + E[V_k(1, S_1)]$ for $S_k \le a_k$, and we add the additional constraint $n < k$ to (6). We have the following result.

Theorem 5. There exists an optimal policy for the $k$-truncated problem described by a sequence of rejection thresholds $\{r_n^k\}_{n=1}^{\infty}$ such that, after $n$ observations, Reject is declared if $S_n < r_n^k$, Continue is declared if $r_n^k \le S_n \le a_n$, and Accept is declared if $S_n > a_n$. Further, let $T_k^* = E[V_k(1, S_1)] + 1$ be the optimal expected number of observations until a discovery is made. Then for each $n$, $r_n^k \to r_n$ as $k \to \infty$, and $T_k^* \to T^*$ as $k \to \infty$.

3.5 Computing the optimal policy

The truncated horizon brings us closer to computing an optimal policy, but it is still an infinite-horizon dynamic programming problem. In this section we show instead that we can compute the truncated optimal policy by iteratively solving a single-experiment truncated problem with a fixed rejection cost $\kappa$. Let $W_k(n, S_n \mid \kappa)$ be the optimal expected cost for this problem starting from state $(n, S_n)$. We have the following Bellman equation:
$$W_k(n, S_n \mid \kappa) = 0, \qquad S_n > a_n; \quad (7)$$
$$W_k(k, S_k \mid \kappa) = \kappa, \qquad S_k \le a_k; \quad (8)$$
$$W_k(n, S_n \mid \kappa) = 1 + \min\{E[W_k(n+1, S_{n+1} \mid \kappa) \mid S_n],\; \kappa\}, \qquad n < k,\; S_n \le a_n. \quad (9)$$
For any terminal cost $\kappa$, this dynamic programming problem is easily solved using backward induction to find the rejection boundaries. The following theorem shows how we can use this solution to find an optimal policy for the truncated problem.

Theorem 6. If $\kappa = T_k^*$, then the optimal policy for (7)-(9), with rejection thresholds $\tilde r_n^k$ found by backward induction, satisfies $\tilde r_n^k = r_n^k$ for all $n \le k$. Furthermore, let $f(\kappa) = 1 + E[W_k(1, S_1 \mid \kappa)]$ be the optimal cost. Then if $\kappa > T_k^*$, $f(\kappa) < \kappa$, and if $\kappa < T_k^*$, then $f(\kappa) > \kappa$.

Thus, to find approximately optimal rejection thresholds, select $k$ suitably large and start with an arbitrary $\kappa$. Then iteratively compute the corresponding thresholds $\tilde r_n^k$ and the cost $f(\kappa)$, using bisection to converge on $T_k^*$, and thus on the corresponding optimal thresholds.
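To make the iteration concrete, the following sketch implements the backward induction (7)-(9) and the bisection on $\kappa$ for the beta-binomial model, where the sufficient statistic is the integer number of successes, so the expectations are exact. This is our illustration, not the paper's released code; acc is the sequence $a_n$ (e.g., as computed in the sketch after Lemma 3), and all names are ours.

```python
import numpy as np

def backward_induction(kappa, acc, a0, b0, k):
    """Solve (7)-(9) by backward induction for a fixed rejection cost
    kappa (beta-binomial model; acc[n] is the fewest successes that make
    a discovery at stage n). Returns f(kappa) = 1 + E[W_k(1, S_1)] and
    rejection thresholds r[n]: reject when successes < r[n]."""
    W = np.where(np.arange(k + 1) >= acc[k], 0.0, kappa)  # stage k: (7)-(8)
    r = np.zeros(k + 1)
    for n in range(k - 1, 0, -1):
        y = np.arange(n + 1)
        p = (a0 + y) / (a0 + b0 + n)             # posterior predictive P(X = 1)
        cont = p * W[y + 1] + (1.0 - p) * W[y]   # E[W_k(n+1, .) | S_n = y]
        W = np.where(y >= acc[n], 0.0, 1.0 + np.minimum(cont, kappa))  # (9)
        keep = np.nonzero((cont <= kappa) | (y >= acc[n]))[0]
        r[n] = keep[0] if keep.size else n + 1   # reject below r[n]
    p1 = a0 / (a0 + b0)                          # prior predictive for X_1
    return 1.0 + p1 * W[1] + (1.0 - p1) * W[0], r

def optimal_policy(acc, a0, b0, k, lo=1.0, hi=1e6, tol=1e-3):
    """Bisection on kappa. By Theorem 6, f(kappa) - kappa changes sign
    exactly at T*_k, so bisection converges to the optimal cost and the
    corresponding optimal rejection thresholds."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid, _ = backward_induction(mid, acc, a0, b0, k)
        lo, hi = (mid, hi) if f_mid > mid else (lo, mid)
    return backward_induction(0.5 * (lo + hi), acc, a0, b0, k)
```

With a per-experiment cost $c$, the same sketch applies with terminal cost $\kappa + c$ in (8), as noted below.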
Figure 1: Acceptance and rejection regions for the conjugate normal and the beta-binomial models. The dashed blue line gives the heuristic rejection boundary, while the red line corresponds to the optimal rejection thresholds. Note that the boundaries are shown in terms of the MAP estimate.

We note that the same program we have outlined in this section can be used to compute an optimal policy with a per-experiment fixed cost $c$, by using rejection cost $\kappa + c$ instead of $\kappa$. Empirically, this leads to only slightly lower rejection thresholds; due to space constraints, we omit the details.

3.6 A simple heuristic

We have seen that the optimal policy is easy to approximate by solving dynamic programs iteratively. However, this does not give us direct insight into the structure of the solution, and in certain cases a quick rule of thumb that provides an approximate policy might be all that is required. In this section, we show that there exists a simple heuristic that performs remarkably well; a code sketch follows at the end of the section.

The approximate rejection boundary at time $n$ is found as follows. Let $\hat\mu$ be the MAP estimate of $\mu$ for sufficient statistic $S_{n+T^*} = a_{n+T^*}$. Then reject the current experiment if $S_n$ is not plausible under $\hat\mu$. That is, the heuristic boundary $r_n$ satisfies, for a suitably chosen $\beta$,
$$P(S_n \le r_n \mid \mu = \hat\mu) = \beta. \quad (10)$$
Of course, this heuristic is not practical as is, since in general we do not know $T^*$ unless we compute the optimal policy. But often $a_{n+t}$ varies only a little in $t$, so a reasonable approximate choice $T_h$ is sufficient. In Figure 1 we plot the discovery and rejection boundaries, along with the heuristic outlined above (with $T_h = T^*$), for the normal and Bernoulli models.

The heuristic and optimal policies clearly exhibit aggressive rejection regions; cf. Figure 1. The interpretation is as follows: to continue sampling from the current experiment, we do not just want its quality to be $s$, but substantially better than $s$, since $a_n > s$ for all $n$. If not, it would take too many additional observations to verify the discovery.
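A sketch of the heuristic boundary (10) in the same beta-binomial setting (again our illustration; the default beta_r is a placeholder, since the paper tunes $\beta$ separately):

```python
from scipy.stats import binom

def heuristic_boundary(n, acc, T_h=2000, beta_r=0.1, a0=1.0, b0=1.0):
    """Heuristic rejection threshold r_n from (10): estimate mu_hat as
    the MAP effect size that would just qualify as a discovery T_h
    observations from now, then reject at stage n if S_n falls below
    the beta_r-quantile of Binomial(n, mu_hat).

    Assumes acc (the sequence a_n) extends at least to index n + T_h."""
    m = n + T_h
    y = acc[m]  # successes needed for a discovery at stage m
    mu_hat = (a0 + y - 1.0) / (a0 + b0 + m - 2.0)  # MAP of Beta(a0+y, b0+m-y)
    return binom.ppf(beta_r, n, mu_hat)
```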
4 Simulations

We now empirically analyze our testing framework based on a simulation with baseball data. First, we demonstrate empirically that the proposed algorithm leads to fast discoveries and behaves differently from traditional testing approaches. Second, we show that the performance of the rule-of-thumb heuristic is close to that of the optimal policy.

Data
We use the baseball dataset with pitching and hitting statistics from 1871 through 2016 from the Lahman R package. The number of At Bats (AB) and Hits (H) is collected for each player, and we are interested in finding players with a high batting average, defined as $b_i = \text{Hits}_i / \text{At Bats}_i$. We consider players with at least 200 At Bats, which leaves a total of 5721 players, with a mean of about 2300 At Bats. In the top left of Figure 2, we plot the histogram of batting averages, along with an approximation by a beta distribution, Beta(56.9, 167.9), fit via the method of moments. We note that this fits the data reasonably well, but not perfectly. This discrepancy helps us evaluate the robustness to a misspecified prior.
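A sketch of this prior-fitting step (ours; the file name and column layout are assumptions, with the Lahman data exported to CSV):

```python
import pandas as pd

def fit_beta_mom(x):
    """Method-of-moments fit of a Beta(a, b) distribution: match the
    sample mean and variance of the batting averages."""
    m, v = x.mean(), x.var()
    c = m * (1.0 - m) / v - 1.0
    return m * c, (1.0 - m) * c

# Hypothetical layout: one row per player with career At Bats and Hits.
players = pd.read_csv("batting.csv")                 # columns: AB, H (assumed)
players = players[players["AB"] >= 200]
a0, b0 = fit_beta_mom(players["H"] / players["AB"])  # roughly Beta(56.9, 167.9)
```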
Simulation setup
To construct the testing problem, we view the batters as alternatives, with the empirical batting average $b_i$ of batter $i$ treated as ground truth. We want to find alternatives with $b_i > s$. We draw a Bernoulli sample with mean $b_i$ to simulate an observation from alternative $i$. These samples are then used to test whether $b_i > s$. We set $\alpha = 0.05$, and vary $s$ between $0.25$ and $0.32$.

To assess performance, we compare several testing procedures. Note that the non-traditional setup of our testing framework does not allow for easy comparison with other methods, in particular frequentist approaches, as they give different guarantees. Thus, we restrict attention to Bayesian methods that provide the same error guarantee. All of the benchmarks use the same beta prior computed above. Code to replicate results can be found at https://github.com/schmit/optimal-testing-experiment-rich-regime. (Based on sequential batting data from the 2014-2018 seasons, there is no evidence of strong correlation between at-bats, which supports simulating independent draws.)

Optimal policy
First we study the optimal policy based on the beta-binomial model, computed using the bisection and backward induction approach of Section 3.5, where we truncate after $k = 5000$ samples. A sketch of the loop used to evaluate such a threshold policy follows.
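The evaluation loop is straightforward; a minimal sketch (ours), assuming acc and rej are the acceptance and rejection thresholds computed above:

```python
import numpy as np

def time_to_first_discovery(b, acc, rej, seed=0):
    """Run a threshold policy on simulated Bernoulli streams until the
    first discovery; returns the total number of observations consumed.
    b: true batting averages; acc/rej: acceptance and rejection
    thresholds on the number of successes, indexed by stage n.
    Assumes a discovery is eventually reachable."""
    rng = np.random.default_rng(seed)
    k = len(acc) - 1
    total = 0
    while True:
        p = b[rng.integers(len(b))]        # draw a fresh alternative
        y = 0
        for n in range(1, k + 1):
            y += rng.random() < p          # one Bernoulli observation
            total += 1
            if y >= acc[n]:
                return total               # discovery: stop
            if y < rej[n]:
                break                      # reject: move to a new alternative
```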
Heuristic policy
Next, we include the heuristic rejection thresholds that approximate the optimal policy, for truncation after $k = 5000$ samples. The heuristic policy requires setting two parameters: $T_h$, i.e., how far to look into the future to find the acceptance boundary, which is ideally set close to $T^*$; and the rejection quantile $\beta$. To demonstrate the insensitivity to $T^*$, we use $T_h = 2000$ and a fixed $\beta$, even though $T^*$ varies dramatically as we change the threshold $s$.
Fixed sample size test
Our next benchmark is a simple fixed sample size test. For each experiment, we gather $N$ observations, and claim a discovery if $P(\mu_i < s \mid Y_i) < \alpha$, where $Y_i$ is the number of Hits of alternative (batter) $i$. We focus our attention on using $N = 1000$ samples per test, as this seems to perform best when compared to other sample sizes, but any differences are immaterial for our conclusions.
Fixed sample size test with early stopping
This benchmark is similar to the fixed sample size test, except that we stop the experiment early if the discovery criterion is met. Thus, we can quantify the gains from being able to discover early.
Bayesian sequential test
Now we consider a sequential test that also rejects early. In particular, we reject the current experiment if $P(b_i > s \mid S_{it}) < \beta$. We also reject an alternative after 4000 samples. This approach requires careful tuning of $\beta$: if $\beta$ is too large, say larger than the prior probability $P(b_i > s)$, then the test is too aggressive and rejects all alternatives outright. Instead, we found empirically that setting $\beta$ to a fixed fraction of the prior probability $P(b_i > s)$ leads to good performance across all values of $s$.
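A sketch of one decision step of this benchmark (our code; the default beta_r is a placeholder for the tuned value described above):

```python
from scipy.stats import beta

def bayes_sequential_step(y, n, a0=56.9, b0=167.9, s=0.27,
                          alpha=0.05, beta_r=0.01, n_max=4000):
    """One decision of the Bayesian sequential benchmark after y hits in
    n trials: accept on the discovery criterion (2), reject when the
    posterior probability that b_i > s drops below beta_r (or at n_max),
    and continue otherwise."""
    below = beta.cdf(s, a0 + y, b0 + n - y)  # P(b_i < s | data)
    if below < alpha:
        return "accept"
    if 1.0 - below < beta_r or n >= n_max:
        return "reject"
    return "continue"
```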
Average time to discovery
The average number of observations until a discovery is shown in the top right plot of Figure 2. As expected, the fixed sample size test performs worst. Early stopping leads to slightly better performance, but this method is still not effective, as most of the gains come from early rejection. The Bayesian sequential test demonstrates this effect and shows substantial gains over the fixed tests. The heuristic policy, despite the lack of parameter tuning, performs very well, essentially matching the performance of the optimal algorithm for most thresholds.
Figure 2: Top left: histogram of batting averages. Top right: average time to discovery for each algorithm. Bottom left: false discovery proportion across thresholds. Bottom right: empirical power of each algorithm. Note the paradox of power: the most efficient algorithms have low power.
False discovery proportion and robustness
Next, we compare the false discovery proportion (FDP) [Benjamini and Hochberg, 2007], i.e., the fraction of discoveries that in fact had true $b_i < s$. If the prior is correctly specified, the methods we consider satisfy $E(\text{FDP}) \le \alpha$. Indeed, we observe that the guarantee holds for most thresholds and algorithms in the bottom left plot of Figure 2.
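Concretely, the FDP reported in Figure 2 can be estimated from the simulation output as follows (a minimal sketch, ours):

```python
import numpy as np

def false_discovery_proportion(true_b, s=0.27):
    """Empirical FDP: the fraction of discovered alternatives whose
    true batting average lies below the threshold s."""
    true_b = np.asarray(true_b, dtype=float)
    return float(np.mean(true_b < s)) if true_b.size else 0.0
```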
There is some minor exceedance of the FDP for thresholds around $s \approx 0.28$, which can be explained by the fact that the prior does not fit the empirical batting averages perfectly. Since there are few discoveries for thresholds beyond $s = 0.3$, the FDP estimate has higher variance in that regime. Across all simulations, the optimal policy has an FDP below $\alpha$. Finally, we see that the lack of early stopping makes the fixed test rather conservative.

The paradox of power
Finally, we compare power, i.e., the fraction of alternatives $i$ with $b_i > s$ that are declared a discovery. Power comparisons across the algorithms are plotted in the bottom right of Figure 2. The most surprising insight from the simulations is the paradox of power: algorithms that are effective have very low power. This is counter-intuitive: how can algorithms that make many discoveries have only a small chance of picking up true effects? The main driver of good performance for an algorithm is the ability to quickly reject unpromising alternatives. Some unpromising alternatives are "barely winners": i.e., $b_i$ is only slightly above $s$. In the experiment-rich regime, such alternatives should be rejected quickly, because it takes too many observations to get enough concentration of the posterior to claim a discovery. This effect leads to low power, but fast discoveries.
Figure 3: Analysis of the optimal policy. The left plot shows the average number of samples for rejections and discoveries. The center plot shows the fraction of discoveries. The right plot shows MAP estimates of effect sizes.
Characteristics of the optimal policy
We consider the outcomes of individual tests for the optimal algorithm ($s = 0.27$) in Figure 3. The average number of samples for rejected alternatives is very small, while it is much larger for discovered alternatives. We also note the concave shape for the discovered alternatives, which seems to peak around $0.28$ rather than around $s = 0.27$: as noted before, the optimal policy tries to avoid effects that are close to the threshold.

Finally, the right plot shows the MAP estimates for the batting averages of discovered batters. It illustrates a known but important fact: the parameter estimates of discovered alternatives are quite poor. If estimation of effects is important, the experimenter ought to obtain more samples for the discovered alternatives.
5 Conclusion

We consider an experimentation setting where observations are costly and there is an abundance of possible experiments to run; this scenario is increasingly prevalent as the world becomes more data-driven. Based on backward induction, we can compute an approximately optimal algorithm that allocates observations to experiments such that the time to a discovery is minimized. Simulations validate the efficacy of our approach, and also reveal the paradox of power: there is a tension between high-powered tests and being efficient with observations.

Our paradigm has several additional practical benefits. First, we can leverage knowledge across experiments through the prior. Second, adaptive matching of observations to experiments does not preclude valid inference, and thus outcomes can be continuously monitored. Finally, the framework also provides an easy "user interface": it directly incorporates the desired effect size, and leads to guarantees that are easy to explain to non-experts.
Further directions
The framework assumes there is a common prior among alternatives, and this allows us to view every rejection as a renewal. If experiments have different priors, then the order in which experiments are chosen matters. This is also true when the costs of observations or of starting experiments differ across experiments.

Furthermore, we assume the prior is known. The experimenter can take an empirical Bayes approach similar to our simulations before starting the experiments, but data gathered while the optimal policy is running distorts estimates of the prior.

We briefly touched upon the independence assumption across observations of a single experiment, and showed that for baseball data this does not lead to problems; in other use cases, time variation (e.g., novelty effects) might play a bigger role and would need to be encoded into the framework. One way to incorporate such effects, along with suitably chosen covariates that can reduce the variance of testing and thereby improve the time to discovery, is to use Bayesian generalized linear models.

Finally, we assume that experiments are independent. In certain settings the results of one experiment can affect future experiments, or there might be correlations between outcomes of experiments. Incorporating these effects is non-trivial and beyond the scope of this work.
Acknowledgments

The authors would like to thank Johan Ugander, David Walsh, Andrea Locatelli, and Carlos Riquelme for their suggestions and feedback. This work was supported by the Stanford TomKat Center, and by the National Science Foundation under Grants No. CNS-1544548 and CNS-1343253. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References
Eduardo M. Azevedo, Alex Deng, Jose Montiel Olea, Justin M. Rao, and E. Glen Weyl. A/B testing. In Proceedings of the Nineteenth ACM Conference on Economics and Computation. ACM, 2018.

Maryam Aziz, Jesse Anderton, Emilie Kaufmann, and Javed A. Aslam. Pure exploration in infinitely-armed bandit models with fixed-confidence. In ALT, 2018.

Eytan Bakshy, Dean Eckles, and Michael S. Bernstein. Designing and deploying online field experiments. In Proceedings of the 23rd International Conference on World Wide Web, pages 283–292. ACM, 2014.

Akshay Balsubramani and Aaditya Ramdas. Sequential nonparametric testing with the law of the iterated logarithm. CoRR, abs/1506.03486, 2016.

Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. 2007.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5:1–122, 2012.

Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In ALT, 2009.

Alexandra Carpentier and Michal Valko. Simple regret for infinitely many armed bandits. In ICML, 2015.

Karthekeyan Chandrasekaran and Richard M. Karp. Finding the most biased coin with fewest flips. CoRR, abs/1202.3639, 2012.

Arghya Roy Chaudhuri and Shivaram Kalyanakrishnan. PAC identification of a bandit arm relative to a reward quantile. In AAAI, pages 1777–1783, 2017.

Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959.

Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 123–132. ACM, 2013.

Alex Deng, Jiannan Lu, and Shouyuan Chen. Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing. pages 243–252, 2016.

Alex Deng, Jiannan Lu, and Jonathan Litz. Trustworthy analysis of online A/B tests: Pitfalls, challenges and solutions. In WSDM, 2017.

Dean P. Foster and Robert A. Stine. Alpha-investing: A procedure for sequential control of expected false discoveries. 2007.

P. R. Freeman. The secretary problem and its extensions: A review. International Statistical Review / Revue Internationale de Statistique, pages 189–206, 1983.

John Gittins, Kevin Glazebrook, and Richard Weber. Multi-armed Bandit Allocation Indices. John Wiley & Sons, 2011.

Kevin G. Jamieson, Matthew Malloy, Robert D. Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In COLT, 2014.

Kevin G. Jamieson, Daniel Haas, and Benjamin Recht. The power of adaptivity in identifying statistical alternatives. In NIPS, 2016.

Adel Javanmard and Andrea Montanari. Online rules for control of false discovery rate and false discovery exceedance. CoRR, abs/1603.09000, 2016.

Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. Peeking at A/B tests: Why it matters, and what to do about it. In KDD, 2017.

Raphael Lopez Kaufman, Jegar Pitchforth, and Lukas Vermeer. Democratizing online controlled experiments at Booking.com. arXiv preprint arXiv:1710.08217, 2017.

Eugene Kharitonov, Aleksandr Vorobev, Craig MacDonald, Pavel Serdyukov, and Iadh Ounis. Sequential testing for early stopping. 2015.

Ronny Kohavi, Thomas Crook, Roger Longbotham, Brian Frasca, Randy Henne, Juan Lavista Ferres, and Tamir Melamed. Online experimentation at Microsoft. 2009.

Lifeng Lai, H. Vincent Poor, Yan Xin, and Georgios Georgiadis. Quickest search over multiple sequences. IEEE Transactions on Information Theory, 57:5375–5386, 2011.

Tze Leung Lai. Nearly optimal sequential tests of composite hypotheses. The Annals of Statistics, pages 856–886, 1988.

Tze Leung Lai. On optimal stopping problems in sequential hypothesis testing. Statistica Sinica, 7(1):33–51, 1997.

Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In ICML, 2016.

Matthew Malloy, Gongguo Tang, and Robert D. Nowak. Quickest search for a rare distribution. pages 1–6, 2012.

Ewan S. Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.

Aaditya Ramdas, Fanny Yang, Martin J. Wainwright, and Michael I. Jordan. Online control of the false discovery rate with decaying memory. In Advances in Neural Information Processing Systems, pages 5655–5664, 2017.

Daniel Russo. Simple Bayesian algorithms for best arm identification. In Conference on Learning Theory, pages 1417–1418, 2016.

Stephen M. Samuels. Secretary problems. Handbook of Sequential Analysis, 118:381–405, 1991.

Alexi N. Shiryayev. Optimal Stopping Rules, volume 8 of Applications of Mathematics. 1978.

David Siegmund. Sequential Analysis: Tests and Confidence Intervals. Springer Science & Business Media, 2013.

Diane Tang, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. Overlapping experiment infrastructure: More, better, faster experimentation. In Proceedings of the 16th Conference on Knowledge Discovery and Data Mining, pages 17–26, Washington, DC, 2010a.

Diane Tang, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. Overlapping experiment infrastructure: More, better, faster experimentation (presentation). 2010b. URL https://static.googleusercontent.com/media/research.google.com/en//archive/papers/Overlapping_Experiment_Infrastructure_More_Be.pdf.

Abraham Wald and Jacob Wolfowitz. Optimum character of the sequential probability ratio test. The Annals of Mathematical Statistics, pages 326–339, 1948.

G. Barrie Wetherill and Kevin D. Glazebrook. Sequential Methods in Statistics. 1986.

David Williams. Probability with Martingales. Cambridge University Press, 1991.

Fanny Yang, Aaditya Ramdas, Kevin G. Jamieson, and Martin J. Wainwright. A framework for multi-A(rmed)/B(andit) testing with online FDR control. In Advances in Neural Information Processing Systems, pages 5959–5968, 2017.
A Proofs
A.1 Proofs from Section 2
Proof of Lemma 1.
The result relies on $\tau$ being a stopping time. Recall that $i^*$ denotes the discovered experiment. Then we find
$$P(\mu_{i^*} < s \mid \mathcal{F}_\tau) = \sum_{t=1}^{\infty} P(\mu_{i^*} < s \mid \mathcal{F}_\tau, \tau = t)\, P(\tau = t) = \sum_{t=1}^{\infty} P(\mu_{i^*} < s \mid \mathcal{F}_t)\, P(\tau = t) \le \alpha \sum_{t=1}^{\infty} P(\tau = t) = \alpha,$$
where we use that $F \in \mathcal{F}_\tau$ if and only if $F \cap \{\tau = t\} \in \mathcal{F}_t$ for all $t$, so that $\mathcal{F}_\tau$ and $\mathcal{F}_t$ agree on the event $\{\tau = t\}$ [Williams, 1991, p. 219].

Proof of Lemma 2.
Note that due to independence we can assume without loss of generality that the index of the arm corresponds to the order in which alternatives are first considered. Thus the result follows if we show that, for any $t$, the action $I_t < I_{t-1}$ cannot be strictly better than $I_t = I_{t-1} + 1$. Assume to the contrary that $I_t = y$ is optimal (and strictly better than $I_t = I_{t-1} + 1$) for some $y < I_{t-1}$. Consider the last time alternative $y$ was selected: $t' = \max\{k < t : I_k = y\}$. At that time it was at least as good to consider a new alternative, and subsequently the posterior for alternative $y$ has not changed, due to independence. Due to the infinite time horizon, it is thus still at least as good to consider a new alternative.

A.2 Proofs from Section 3
Proof of Lemma 3.
Let $n \ge 1$. We can rewrite the discovery criterion as
$$P(\mu < s \mid S_n = t) = \frac{\int_{-\infty}^{s} \prod_{i=1}^{n} h(X_i) \exp(\mu S(X_i) - A(\mu))\, d\pi(\mu)}{\int_{-\infty}^{\infty} \prod_{i=1}^{n} h(X_i) \exp(\mu S(X_i) - A(\mu))\, d\pi(\mu)} \quad (11)$$
$$= \frac{\int_{-\infty}^{s} \exp(\mu S_n - n A(\mu))\, d\pi(\mu)}{\int_{-\infty}^{\infty} \exp(\mu S_n - n A(\mu))\, d\pi(\mu)} \quad (12)$$
$$= \frac{\int_{-\infty}^{s} \exp(\mu t - n A(\mu))\, d\pi(\mu)}{\int_{-\infty}^{\infty} \exp(\mu t - n A(\mu))\, d\pi(\mu)}. \quad (13)$$
We show that this is decreasing in $t$. Take the logarithm and the derivative with respect to $t$ to obtain
$$\frac{d}{dt} \log P(\mu < s \mid S_n = t) = \frac{\int_{-\infty}^{s} \mu \exp(\mu t - n A(\mu))\, d\pi(\mu)}{\int_{-\infty}^{s} \exp(\mu t - n A(\mu))\, d\pi(\mu)} - \frac{\int_{-\infty}^{\infty} \mu \exp(\mu t - n A(\mu))\, d\pi(\mu)}{\int_{-\infty}^{\infty} \exp(\mu t - n A(\mu))\, d\pi(\mu)} \quad (14\text{--}15)$$
$$= E_{f_t}(\mu \mid \mu < s) - E_{f_t}(\mu) < 0, \quad (16)$$
where $f_t$ is the density given by
$$f_t(\mu)\, d\pi(\mu) = \frac{\exp(\mu t - n A(\mu))\, d\pi(\mu)}{\int_{-\infty}^{\infty} \exp(\mu t - n A(\mu))\, d\pi(\mu)}. \quad (17)$$
The last inequality holds because, in general,
$$E(\theta) = E(\theta \mid \theta < s) P(\theta < s) + E(\theta \mid \theta \ge s) P(\theta \ge s) > E(\theta \mid \theta < s) P(\theta < s) + s P(\theta \ge s) > E(\theta \mid \theta < s). \quad (18)$$
Now the lemma follows: if $P(\mu < s \mid S_n = t) < \alpha$, then $P(\mu < s \mid S_n = t') < \alpha$ for all $t' > t$, and similarly, if $P(\mu < s \mid S_n = t) > \alpha$, then $P(\mu < s \mid S_n = t') > \alpha$ for all $t' < t$.

To prove the theorems in Section 3 we use the following lemmas, which are proven at the end of this section.

Lemma 7.
The optimal policy for the truncated problem can be characterized by a rejection threshold. That is, the optimal policy rejects the current experiment if $S_n < r_n^k$, for a sequence $r_n^k$, and collects another observation for the current experiment otherwise, until a discovery is made.

Write $T_k^*$ for the expected number of observations required for a discovery under the optimal policy of the truncated problem. Then we can show that both $T_k^*$ and $r_n^k$ converge.

Lemma 8.
Both $T_k^*$ and $r_n^k$ converge as $k \to \infty$.

Proof of Theorem 4. Lemma 7 shows that the truncated problem has an optimal policy that has the form of a threshold. Next, Lemma 8 shows that both the thresholds and the optimal cost converge. Recall $T^* = \lim_{k \to \infty} T_k^*$ and $r_n = \lim_{k \to \infty} r_n^k$. We now show that the limiting policy $\{r_n\}$, with corresponding cost $T^*$, is optimal.

Suppose to the contrary that there exist $\varepsilon > 0$ and a policy $\phi$ with cost $T = T^* - \varepsilon < T^*$. Let $\tau$ be the stopping time of this policy. We consider the truncated version of this policy and show that it cannot be much worse; on the other hand, every truncated policy has cost at least $T_k^* \ge T^*$. The $k$-truncated policy, denoted by $\phi_k$, rejects the current alternative after $k$ samples, but is otherwise identical to $\phi$. Let $\tau$ and $\tau_k$ be the stopping times corresponding to $\phi$ and $\phi_k$. Trivially, we have $T = \sum_{k=1}^{\infty} P(\tau \ge k)$. Because $T$ is finite and $P(\tau \ge k)$ is non-increasing, $P(\tau \ge k) = o(1/k)$. Because $\phi$ and $\phi_k$ are identical up to $k$ observations, it follows that if $\tau < k$, then $\tau_k < k$, and thus we find that
$$E(\tau_k) = P(\tau > k) E(\tau_k \mid \tau > k) + E(\tau \mathbf{1}(\tau \le k)) \le P(\tau > k)(k + E(\tau_k)) + E(\tau \mathbf{1}(\tau \le k)).$$
Thus, it follows that
$$E(\tau_k) \le \frac{k P(\tau > k)}{1 - P(\tau > k)} + \frac{T}{1 - P(\tau > k)}. \quad (19)$$
Since $P(\tau > k) = o(1/k)$, $E(\tau_k) \to T$ as $k \to \infty$. However, $T^* \le T_k^* \le E(\tau_k)$ for all $k$, and thus $T^* \le \lim_{k \to \infty} E(\tau_k) = T$, which is a contradiction.

Proof of Theorem 5.
This is a direct consequence of Lemmas 7 and 8.
Proof of Theorem 6.
Let $\tau_r$ denote the (random) hitting time of the boundary for the first alternative,
$$\tau_r = \min\{n : S_n \ge a_n \text{ or } S_n < r_n\}, \quad (20)$$
under rejection boundary $r = \{r_n\}_{n=1}^{k}$. Furthermore, let $q_r = P(S_{\tau_r} < r_{\tau_r})$ denote the rejection probability. Now note that $f(\kappa) = \min_r \{E(\tau_r) + \kappa q_r\}$. Note that we can solve this minimization problem using backward induction, since the time horizon is fixed ($k$). First, we show that $f$ has a unique fixed point, which is equal to $T_k^*$. Note that the expected time to discovery $T_r$ under repeated use of boundary $r$ satisfies
$$T_r = E(\tau_r) + T_r q_r. \quad (21)$$
By definition, $T_k^*$ minimizes $\min_r \{E(\tau_r) + T_k^* q_r\}$; thus, it follows immediately that $T_k^*$ is a fixed point of $f$.

Next, we show that $f(\kappa) > \kappa$ for each $\kappa < T_k^*$ and $f(\kappa) < \kappa$ for each $\kappa > T_k^*$. First, fix $\kappa < T_k^*$, and suppose that $f(\kappa) \le \kappa$. Then there exists $r'$ such that $E(\tau_{r'}) + \kappa q_{r'} \le \kappa$, so
$$\kappa \ge \frac{E(\tau_{r'})}{1 - q_{r'}} = T_{r'},$$
where the last equality follows from (21). This, along with $\kappa < T_k^*$, implies that $T_{r'} < T_k^*$, a contradiction. Thus, we must have $f(\kappa) > \kappa$.

Finally, fix $\kappa > T_k^*$. We know that
$$T_k^* = E(\tau_{r^*}) + T_k^* q_{r^*} < E(\tau_{r^*}) + \kappa q_{r^*}. \quad (22)$$
Since $q_{r^*} < 1$, we have $E(\tau_{r^*}) + \kappa q_{r^*} = T_k^* + (\kappa - T_k^*) q_{r^*} < \kappa$. Thus, there exists $r$ (equal to $r^*$) such that $E[\tau_r] + \kappa q_r < \kappa$, and hence $f(\kappa) < \kappa$.

A.3 Proofs of lemmas

Proof of Lemma 7.
Based on Lemma 2, there exists a policy that can be characterized by a sequence of three sets:

• Discover if $S_n \in A_n$ (the experiment is a discovery);
• Continue if $S_n \in D_n$; and
• Reject if $S_n \in R_n$.

Now note that $R_n^k$ is a threshold region for $n \ge k$ by definition. Assume $R_m^k = (-\infty, r_m^k]$ for all $m > n$. Further, from the Bellman equation for the truncated problem, it is clear that the optimal solution rejects the current experiment at time $n$ if
$$E[V_k(n+1, S_{n+1}) \mid S_n] > E[V_k(1, S_1)] = T_k^* - 1. \quad (23)$$
Note that
$$E[V_k(n+1, S_{n+1}) \mid S_n = x] = \int_{y \in D_{n+1}} V_k(n+1, y)\, f(S_{n+1} = y \mid S_n = x)\, dy + T_k^*\, P(S_{n+1} < r_{n+1} \mid S_n = x). \quad (24)$$
Then for each $n$ we note that $E[V_k(n+1, S_{n+1}) \mid S_n = x]$ is decreasing in $x$. This follows since $V_k(n+1, y) < T_k^*$ for all $y \in D_{n+1} = [r_{n+1}, a_{n+1}]$, as for such $y$ it is better to continue than to reject; furthermore, arguing along the lines of the proof of Lemma 3, $P(S_{n+1} < r_{n+1} \mid S_n = x)$ is decreasing in $x$. This implies we can write $R_n^k = (-\infty, r_n^k]$ for some $r_n^k$.

Proof of Lemma 8.
Due to increased degrees of freedom, it follows that $T_k^*$ is decreasing in $k$. Since $T_k^*$ is bounded below by 0, $T_k^*$ converges; let $T^* = \lim_k T_k^*$.

Next, we show that $r_n^k$ is decreasing in $k$. Clearly, $r_k^{k+1} \le r_k^k$. Now suppose $r_{n+1}^{k+1} \le r_{n+1}^k$; then $r_n^{k+1} \le r_n^k$, which follows from the fact that $E[V_k(n+1, S_{n+1}) \mid S_n = r_n^k] + 1 = T_k^*$, and $T_k^*$ is decreasing in $k$. It remains to show that $r_n^k$ is bounded.

We construct a lower bound on $r_n^k$, for large $k$, as follows. Let $\varepsilon = \frac{1}{2 T^*}$, and let $x$ be such that $P(\exists\, m > n \text{ s.t. } S_m > a_m \mid S_n = x) < \varepsilon$, by choosing $x$ sufficiently small. Then the cost of obtaining another sample is at least $1 + (1 - \varepsilon) T_k^* \ge 1 + (1 - \varepsilon) T^* = T^* + 1/2$. However, if the experimenter rejects the current alternative now, the cost is $T_k^*$. Thus, if we can show that there exists a $K$ such that for all $k > K$, $T_k^* < T^* + 1/2$, then $x$ is a lower bound on $r_n^k$ for all $k > K$. But we have shown above that $T_k^* \to T^*$, hence such a $K$ exists. This implies that $r_n^k$ converges as $k \to \infty$.