The Annals of Statistics
© Institute of Mathematical Statistics, 2016
BATCHED BANDIT PROBLEMS
By Vianney Perchet∗, Philippe Rigollet†, Sylvain Chassang‡ and Erik Snowberg§

Université Paris Diderot and INRIA∗, Massachusetts Institute of Technology†, Princeton University‡, and California Institute of Technology and NBER§

Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy, and show that a very small number of batches gives close to minimax optimal regret bounds. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.
1. Introduction.
All clinical trials are run in batches: groups of patients are treated simultaneously, with the data from each batch influencing the design of the next. This structure arises because it is impractical to measure outcomes (rewards) for each patient before deciding what to do next. Despite the fact that this system is codified into law for drug approval, it has received scant attention from statisticians. What can be achieved with a small number of batches? How big should these batches be? How should results in one batch affect the structure of the next?

We address these questions using the multi-armed bandit framework. This encapsulates an "exploration vs. exploitation" dilemma fundamental to ethical clinical research [30, 34]. In the basic problem, there are two populations of patients (or arms), corresponding to different treatments. At each point in time t = 1, . . . , T, a decision maker chooses to sample one, and receives a random reward dictated by the efficacy of the treatment. The objective is

Received May 2015; revised August 2015.
Supported by ANR Grant ANR-13-JS01-0004.
Supported by NSF Grants DMS-13-17308 and CAREER-DMS-10-53987.
Supported by NSF Grant SES-1156154.
AMS 2000 subject classifications.
Primary 62L05; secondary 62C20.
Key words and phrases.
Multi-armed bandit problems, regret bounds, batches, multi-phase allocation, grouped clinical trials, sample size determination, switching cost.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2016, Vol. 44, No. 2, 660–681. This reprint differs from the original in pagination and typographic detail.
to devise a series of choices (a policy) maximizing the expected cumulative reward over T rounds. There is thus a clear tradeoff between discovering which treatment is the most effective (exploration) and administering the best treatment to as many patients as possible (exploitation).

The importance of batching extends beyond clinical trials. In recent years, the bandit framework has been used to study problems in economics, finance, chemical engineering, scheduling, marketing and, more recently, internet advertising. This last application has been the driving force behind a surge of interest in many variations of bandit problems over the past decade. Yet, even in internet advertising, technical constraints often force data to be considered in batches, and the size of these batches is usually based on technical convenience rather than statistical reasoning. Discovering the optimal structure, size and number of batches has applications in marketing [8, 31] and simulations [14].

In clinical trials, batches may be formal (the different phases required for approval of a new drug by the US Food and Drug Administration) or informal (a pilot, a full trial, and then diffusion to the full population that may benefit). In an informal setup, the second step may be skipped if the pilot is successful enough. In this three-stage approach, the first, and usually second, phases focus on exploration, while the third focuses on exploitation. This is in stark contrast to the basic bandit problem described above, which effectively consists of T batches, each containing a single patient.

We describe a policy that performs well with a small fixed number of batches. A fixed number of batches reflects clinical practice, but presents mathematical challenges. Nonetheless, we identify batch sizes that lead to minimax regret bounds as low as those of the best non-batched algorithms. We further show that these batch sizes perform well empirically.
Together, these features suggest that near-optimal policies could be implemented with only small changes to current clinical practice.
2. Description of the problem.
2.1. Notation.
For any positive integer n, define [n] = {1, . . . , n}, and for any n₁ < n₂, [n₁ : n₂] = {n₁, . . . , n₂} and (n₁ : n₂] = {n₁ + 1, . . . , n₂}. For any positive number x, let ⌊x⌋ denote the largest integer n such that n ≤ x, and ⌊x⌋₂ the largest even integer m such that m ≤ x. Additionally, for any real numbers a and b, a ∧ b = min(a, b) and a ∨ b = max(a, b). Further, define loḡ(x) = 1 ∨ (log x). 1(·) denotes the indicator function.

If I, J are closed intervals of ℝ, then I ≺ J if x < y for all x ∈ I, y ∈ J. Finally, for two sequences (u_T)_T, (v_T)_T, we write u_T = O(v_T) or u_T ≲ v_T if there exists a constant C > 0 such that |u_T| ≤ C|v_T| for any T. Moreover, we write u_T = Θ(v_T) if u_T = O(v_T) and v_T = O(u_T).

2.2. Framework.
We employ a two-armed bandit framework with horizon T ≥ 2. Central ideas and intuitions are well captured by this concise framework. Extensions to K-armed bandit problems are mostly technical (see, for instance, [28]).

At each time t ∈ [T], the decision maker chooses an arm i ∈ {1, 2} and observes a reward that comes from a sequence of i.i.d. draws Y^(i)_1, Y^(i)_2, . . . from some unknown distribution ν^(i) with expected value µ^(i). We assume that the distributions ν^(i) are standardized sub-Gaussian, that is,

∫_ℝ e^{λ(x−µ^(i))} ν^(i)(dx) ≤ e^{λ²/2} for all λ ∈ ℝ.

Note that these include Gaussian distributions with variance at most 1, and distributions supported on an interval of length at most 2. Rescaling extends the framework to other variance parameters σ².

For any integer M ∈ [2 : T], let T = {t₁, . . . , t_M} be an ordered sequence, or grid, of integers such that 1 ≤ t₁ < · · · < t_M = T. It defines a partition S = {S₁, . . . , S_M} of [T], where S₁ = [1 : t₁] and S_k = (t_{k−1} : t_k] for k ∈ [2 : M]. The set S_k is called the kth batch. An M-batch policy is a couple (T, π), where T = {t₁, . . . , t_M} is a grid and π = {π_t, t = 1, . . . , T} is a sequence of random variables π_t ∈ {1, 2}, indicating which arm to pull at each time t = 1, . . . , T, which depend only on observations from batches strictly prior to the current one. Formally, for each t ∈ [T], let J(t) ∈ [M] be the index of the current batch S_{J(t)}. Then, for t ∈ S_{J(t)}, π_t can only depend on observations

{Y^(π_s)_s : s ∈ S₁ ∪ · · · ∪ S_{J(t)−1}} = {Y^(π_s)_s : s ≤ t_{J(t)−1}}.

Denote by ⋆ ∈ {1, 2} the optimal arm, defined by µ^(⋆) = max_{i∈{1,2}} µ^(i), by † ∈ {1, 2} the suboptimal arm, and by ∆ := µ^(⋆) − µ^(†) > 0 the gap between them. The performance of a policy π is measured by its (cumulative) regret at time T,

R_T = R_T(π) = Tµ^(⋆) − Σ_{t=1}^T E µ^(π_t).

Denoting by T_i(t) = Σ_{s=1}^t 1(π_s = i), i ∈ {1, 2}, the number of times arm i was pulled before time t ≥ 2, regret can be rewritten as R_T = ∆ E T_†(T).

2.3. Previous results.
Bandit problems are well understood in the case where M = T, that is, when the decision maker can use all available data at each time t ∈ [T]. Bounds on the cumulative regret R_T for stochastic multi-armed bandits come in two flavors: minimax or adaptive. Minimax bounds hold uniformly in ∆ over a suitable subset of the positive real line, such as the interval (0, 1) or even (0, ∞). The first results of this kind are attributed to Vogel [36, 37], who proved that R_T = Θ(√T) in the two-armed case (see also [6, 20]).
Adaptive policies exhibit regret bounds that may be much smaller than the order of √T when ∆ is large. Such bounds were proved in the seminal paper of Lai and Robbins [25] in an asymptotic framework (see also [10]). While leading to tight constants, this framework washes out the correct dependency on ∆ of the logarithmic terms. In fact, recent research [1–3, 28] has revealed that R_T = Θ(∆T ∧ loḡ(T∆²)/∆).

Nonetheless, a systematic analysis of the batched case does not exist, even though Ucb2 [2] and Improved-Ucb [3] are implicitly M-batch policies with M = Θ(log T). These algorithms achieve optimal adaptive bounds. Thus, employing a batched policy is only a constraint when the number of batches M is much smaller than log T, as is often the case in clinical practice. Similarly, in the minimax framework, M-batch policies with M = Θ(log log T) lead to the optimal regret bound (up to logarithmic terms) of O(√(T log log log T)) [11, 12]. The sub-logarithmic range M ≪ log T is essential in applications where M is small and constant, like clinical trials. In particular, we wish to bound the regret for small values of M, such as 2, 3 or 4.

2.4. Literature.
This paper connects to two lines of work: batched sequential estimation [17, 18, 21, 33] and multistage clinical trials. Somerville [32] and Maurice [26] studied the two-batch bandit problem in a minimax framework under a Gaussian assumption. They prove that an "explore-then-commit" type policy has regret of order T^{2/3} for any value of the gap ∆; a result we recover and extend (see Section 4.3).

Colton [15, 16] introduced a Bayesian perspective, initiating a long line of work (see [22] for a recent overview). Most of this work focuses on the case of two or three batches, with isolated exceptions [13, 22]. Typically, this work claims the size of the first batch should be of order √T, which agrees with our results, up to a logarithmic term (see Section 4.2).

Batched procedures have a long history in clinical trials (see, for instance, [23] and [5]). Usually, batches are of the same size, or of random size, with the latter case providing robustness. This literature also focuses on inference questions rather than cumulative regret. A notable exception provides an ad hoc objective to optimize batch size, but recovers the suboptimal √T in the case of two batches [4].

2.5. Outline.
Section 3 introduces a general class of M-batch policies we call explore-then-commit (etc) policies. These policies are close to clinical practice within batches. The performance of generic etc policies is detailed in Proposition 1, found in Section 3.3. In Section 4, we study several instantiations of this generic policy and provide regret bounds with explicit, and often drastic, dependency on the number of batches M. Indeed, in Section 4.3, we describe a policy whose regret decreases doubly exponentially fast with the number of batches.

Two of the instantiations provide adaptive and minimax types of bounds, respectively. Specifically, we describe two M-batch policies, π¹ and π², that enjoy the following bounds on the regret:

R_T(π¹) ≲ (T/loḡ(T))^{1/M} · loḡ(T∆²)/∆,
R_T(π²) ≲ T^{1/(2−2^{1−M})} loḡ^{α_M}(T^{1/(2^M−1)}),  α_M ∈ [0, 1/4).

Note that the bound for π¹ corresponds to the optimal adaptive rate loḡ(T∆²)/∆ when M = Θ(log(T/loḡ(T))), and the bound for π² corresponds to the optimal minimax rate √T when M = Θ(log log T). The latter is entirely feasible in clinical settings. As a byproduct of our results, we show that the adaptive optimal bounds can be obtained with a policy that switches between arms fewer than Θ(log(T/loḡ(T))) times, while the optimal minimax bounds require only Θ(log log T) switches. Indeed, etc policies can be adapted to switch at most once in each batch.

Section 5 then examines lower bounds on the regret of any M-batch policy, and shows that the policies identified are optimal, up to logarithmic terms, within the class of M-batch policies. Finally, in Section 6, we compare policies through simulations using both standard distributions and real data from a clinical trial, and show that the policies we identify perform well even with a very small number of batches.
3. Explore-then-commit policies.
In this section, we describe a simple structure that can be used to build policies: explore-then-commit (etc). This structure consists of pulling each arm the same number of times in each non-terminal batch, and checking after each batch whether, according to some statistical test, one arm dominates the other. If one dominates, then only that arm is pulled until T. If, at the beginning of the terminal batch, neither arm has been declared dominant, then the policy commits to the arm with the largest average past reward. This "go for broke" step is dictated by regret minimization: in the last batch, exploration is pointless, as the information it produces can never be used.

Any policy built using this principle is completely characterized by two elements: the testing criterion and the sizes of the batches.

3.1. Statistical test.
We begin by describing the statistical test employed before non-terminal batches. Denote by

μ̂^(i)_s = (1/s) Σ_{ℓ=1}^s Y^(i)_ℓ

the empirical mean after s ≥ 1 pulls of arm i. This estimator allows for the construction of a collection of upper and lower confidence bounds for µ^(i), of the form μ̂^(i)_s + B_s and μ̂^(i)_s − B_s, where

B_s = 2 √(2 loḡ(T/s)/s)

(with the convention that B₀ = ∞; note that B_s does not depend on i). It follows from Lemma B.1 that for any τ ∈ [T],

P{∃ s ≤ τ : µ^(i) > μ̂^(i)_s + B_s} ∨ P{∃ s ≤ τ : µ^(i) < μ̂^(i)_s − B_s} ≤ 4τ/T.  (1)

These bounds enable us to design the following family of tests {φ_t}_{t∈[T]}, with values in {1, 2, ⊥}, where ⊥ indicates that the test was inconclusive. The test is only implemented at times t ∈ [T] at which each arm has been pulled exactly s = t/2 times. For t ≥ 1, define

φ_t = i ∈ {1, 2},  if T₁(t) = T₂(t) = t/2 and μ̂^(i)_{t/2} − B_{t/2} > μ̂^(j)_{t/2} + B_{t/2}, j ≠ i;
φ_t = ⊥,  otherwise.

The errors of such tests are controlled as follows.

Lemma 1.
Let
S ⊂ [T] be a deterministic subset of even times such that T₁(t) = T₂(t) = t/2 for all t ∈ S. Partition S into S₋ ∪ S₊, S₋ ≺ S₊, where

S₋ = {t ∈ S : ∆ < 16 √(loḡ(2T/t)/t)},  S₊ = {t ∈ S : ∆ ≥ 16 √(loḡ(2T/t)/t)}.

Let t̄ denote the smallest element of S₊. Then

(i) P(φ_t̄ ≠ ⋆) ≤ 4t̄/T, and
(ii) P(∃ t ∈ S₋ : φ_t = †) ≤ 4t̄/T.

Proof.
Assume without loss of generality that ⋆ = 1.

(i) By definition,

{φ_t̄ ≠ 1} = {μ̂^(1)_{t̄/2} − B_{t̄/2} ≤ μ̂^(2)_{t̄/2} + B_{t̄/2}} ⊂ {E¹_t̄ ∪ E²_t̄ ∪ E³_t̄},

where

E¹_t = {µ^(1) ≥ μ̂^(1)_{t/2} + B_{t/2}},  E²_t = {µ^(2) ≤ μ̂^(2)_{t/2} − B_{t/2}},

and

E³_t = {µ^(1) − µ^(2) < 2B_{t/2} + 2B_{t/2}} = {µ^(1) − µ^(2) < 16 √(loḡ(2T/t)/t)}.

It follows from (1), with τ = t̄/2, that P(E¹_t̄) ∨ P(E²_t̄) ≤ 2t̄/T. Finally, for any t ∈ S₊, in particular for t = t̄, we have

E³_t ⊂ {µ^(1) − µ^(2) < 16 √(loḡ(2T/t)/t)} = ∅.

(ii) Focus on the case t ∈ S₋, where ∆ < 16 √(loḡ(2T/t)/t). Here,

∪_{t∈S₋} {φ_t = 2} = ∪_{t∈S₋} {μ̂^(2)_{t/2} − B_{t/2} > μ̂^(1)_{t/2} + B_{t/2}} ⊂ ∪_{t∈S₋} {E¹_t ∪ E²_t ∪ F_t},

where E¹_t, E²_t are defined above and F_t = {µ^(1) − µ^(2) < 0} = ∅, as ⋆ = 1. It follows from (1), with τ = t̄/2, that

P(∪_{t∈S₋} E¹_t) ∨ P(∪_{t∈S₋} E²_t) ≤ 2t̄/T. □

3.2. Go for broke.
In the last batch, the etc structure will "go for broke" by selecting the arm i with the largest average. Formally, at time t, let ψ_t = i iff μ̂^(i)_{T_i(t)} ≥ μ̂^(j)_{T_j(t)}, with ties broken arbitrarily. While this criterion may select the suboptimal arm with higher probability than the statistical test described in the previous subsection, it also increases the probability of selecting the correct arm by eliminating inconclusive results. This statement is formalized in the following lemma. The proof follows immediately from Lemma B.1.

Lemma 2.
Fix an even time t ∈ [T], and assume that both arms have been pulled t/2 times each (i.e., T_i(t) = t/2 for i = 1, 2). Going for broke leads to a probability of error

P(ψ_t ≠ ⋆) ≤ exp(−t∆²/8).

3.3. Explore-then-commit policy.
In a batched process, an extra constraint is that past observations can only be inspected at a specific set of times T = {t₁, . . . , t_{M−1}} ⊂ [T], called a grid.

The generic etc policy uses a deterministic grid T that is fixed beforehand, and is described more formally in Figure 1. Informally, at each decision time t₁, . . . , t_{M−1}, the policy implements the statistical test. If one arm is determined to be better than the other, it is pulled until T. If no arm is declared best, then both arms are pulled the same number of times in the next batch.

We denote by ε_t ∈ {1, 2} the arm pulled at time t ∈ [T], and employ an external source of randomness to generate the variables ε_t. With N an even number, let (ε₁, . . . , ε_N) be uniformly distributed over the subset V_N = {v ∈ {1, 2}^N : Σ_i 1(v_i = 1) = N/2}. This randomization has no effect on the policy, and could easily be replaced by any other mechanism that pulls each arm an equal number of times. For example, a mechanism that pulls one arm for the first half of the batch, and the other for the second half, may be used if switching costs are a concern.

Odd numbers for the deadlines t_i could be considered, at the cost of rounding problems and complexity, by defining V_N = {v ∈ {1, 2}^N : |Σ_i 1(v_i = 1) − Σ_i 1(v_i = 2)| ≤ 1}.
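As an illustration, the deviation bound B_s, the test φ_t of Section 3.1, and the balanced randomization over V_N can be sketched in a few lines. This is only a sketch: the function and variable names are ours, and `None` stands for the inconclusive symbol ⊥.

```python
import math
import random

def log_bar(x):
    # log_bar(x) = 1 ∨ log(x), as in the notation section
    return max(1.0, math.log(x))

def B(s, T):
    # Deviation bound B_s = 2*sqrt(2*log_bar(T/s)/s), with B_0 = ∞ by convention
    return math.inf if s == 0 else 2.0 * math.sqrt(2.0 * log_bar(T / s) / s)

def ci_test(mean1, mean2, s, T):
    """Test phi_t at time t = 2s: return 1 or 2 if that arm's lower confidence
    bound exceeds the other arm's upper bound, or None (⊥) if inconclusive."""
    b = B(s, T)
    if mean1 - b > mean2 + b:
        return 1
    if mean2 - b > mean1 + b:
        return 2
    return None

def balanced_assignment(N, rng=None):
    # Uniform draw from V_N = {v in {1,2}^N : #{i : v_i = 1} = N/2}
    assert N % 2 == 0, "batch sizes are assumed even"
    rng = rng or random.Random(0)
    v = [1] * (N // 2) + [2] * (N // 2)
    rng.shuffle(v)
    return v
```

For instance, with T = 10000 and s = 5000 the bound is B = 2√(2/5000) = 0.04, so `ci_test(1.0, 0.0, 5000, 10000)` declares arm 1 better.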
Input:
• Horizon: T.
• Number of batches: M ∈ [2 : T].
• Grid: T = {t₁, . . . , t_{M−1}} ⊂ [T], t₀ = 0, t_M = T, such that |S_m| = t_m − t_{m−1} is even for m ∈ [M − 1].
• Let ε[m] = (ε[m]₁, . . . , ε[m]_{|S_m|}) be uniformly distributed over V_{|S_m|}(a), for m ∈ [M].
• The index ℓ of the batch in which a best arm was identified is initialized to ℓ = ◦.

Policy:
1. For t ∈ [1 : t₁], choose π_t = ε[1]_t.
2. For m ∈ [2 : M − 1]:
   (a) If ℓ ≠ ◦, then π_t = φ_{t_ℓ} for t ∈ (t_{m−1} : t_m].
   (b) Else, compute φ_{t_{m−1}}:
       i. If φ_{t_{m−1}} = ⊥, select an arm at random, that is, π_t = ε[m]_t for t ∈ (t_{m−1} : t_m].
       ii. Else, set ℓ = m − 1 and π_t = φ_{t_{m−1}} for t ∈ (t_{m−1} : t_m].
3. For t ∈ (t_{M−1} : T]:
   (a) If ℓ ≠ ◦, π_t = φ_{t_ℓ}.
   (b) Otherwise, go for broke, that is, π_t = ψ_{t_{M−1}}.

(a) In the case where |S_m| is not an even number, we use the general definition of footnote 4 for V_{|S_m|}.

Fig. 1.
Generic explore-then-commit policy with grid T.

In the terminal batch S_M, if no arm was determined to be optimal in any prior batch, the etc policy will go for broke by selecting the arm i such that μ̂^(i)_{T_i(t_{M−1})} ≥ μ̂^(j)_{T_j(t_{M−1})}, with ties broken arbitrarily.

To describe the regret incurred by a generic etc policy, we introduce extra notation. For any ∆ ∈ (0, 1), let τ(∆) = T ∧ ϑ(∆), where ϑ(∆) is the smallest integer such that

∆ ≥ 16 √(loḡ[2T/ϑ(∆)] / ϑ(∆)).

Notice that the above definition implies that τ(∆) ≥ 2 and

τ(∆) ≤ (256/∆²) loḡ(T∆²/128).  (2)

The time τ(∆) is, up to a multiplicative constant, the theoretical time at which the optimal arm will be declared better by the statistical test with large enough probability. As ∆ is unknown, the grid will not usually contain this value. Thus, the relevant time is the first point of the grid posterior to τ(∆):

m(∆, T) = min{m ∈ {1, . . . , M − 1} : t_m ≥ τ(∆)}  if τ(∆) ≤ t_{M−1},
m(∆, T) = M − 1  otherwise.  (3)

The first proposition gives an upper bound for the regret incurred by a generic etc policy run with a given set of times T = {t₁, . . . , t_{M−1}}.

Proposition 1.
Given the time horizon T ∈ ℕ, the number of batches M ∈ [2 : T], and the grid T = {t₁, . . . , t_{M−1}} ⊂ [T] with t₀ = 0, the generic etc policy described in Figure 1 incurs, for any ∆ ∈ [0, 1], a regret bounded as

R_T(∆, T) ≤ 9∆ t_{m(∆,T)} + T∆ e^{−(t_{M−1}∆²)/8} 1(m(∆, T) = M − 1).  (4)

Proof.
Denote m̄ = m(∆, T). Note that t_m̄ denotes the theoretical time on the grid at which the statistical test will declare ⋆ to be (with high probability) the better arm.

We first examine the case where m̄ < M − 1. Define the following events:

A_m = ∩_{n=1}^m {φ_{t_n} = ⊥},  B_m = {φ_{t_m} = †}  and  C_m = {φ_{t_m} ≠ ⋆}.

Regret can be incurred in one of the following three manners:

(i) by exploring before time t_m̄;
(ii) by choosing arm † before time t_m̄: this happens on event B_m;
(iii) by not committing to the optimal arm ⋆ at the optimal time t_m̄: this happens on event C_m̄.

Error (i) is unavoidable and may occur with probability close to one. It corresponds to the exploration part of the policy and leads to an additional term t_m̄∆/2 in the regret. Errors (ii) and (iii) can each cost up to T∆, so we need to ensure that they occur with low probability. Therefore, the regret incurred by the policy is bounded as

R_T(∆, T) ≤ t_m̄∆/2 + T∆ E[ 1(∪_{m=1}^{m̄−1} A_{m−1} ∩ B_m) + 1(A_{m̄−1} ∩ C_m̄) ],  (5)

with the convention that A₀ is the whole probability space.
Next, observe that m̄ is chosen such that

16 √(loḡ(2T/t_m̄)/t_m̄) ≤ ∆ < 16 √(loḡ(2T/t_{m̄−1})/t_{m̄−1}).

In particular, t_m̄ plays the role of t̄ in Lemma 1. Thus, using part (i) of Lemma 1,

P(A_{m̄−1} ∩ C_m̄) ≤ 4t_m̄/T.
Moreover, using part (ii) of the same lemma,

P(∪_{m=1}^{m̄−1} A_{m−1} ∩ B_m) ≤ 4t_m̄/T.
Together with (5), this implies that the regret is bounded by R_T(∆, T) ≤ 9∆t_m̄.

In the case where m(∆, T) = M − 1, Lemma 2 shows that the go-for-broke test errs with probability at most exp(−t_{M−1}∆²/8), so that

R_T(∆, T) ≤ 9∆ t_{m(∆,T)} + T∆ e^{−(t_{M−1}∆²)/8},

using the same arguments as before. □

Proposition 1 helps choose a grid by showing how that choice reduces to an optimal discretization problem.
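For concreteness, the quantities τ(∆) and m(∆, T) of (2) and (3), which drive the bound of Proposition 1, can be computed numerically as follows. This is an illustrative sketch: the names are ours, and the linear scan simply mirrors the definitions.

```python
import math

def log_bar(x):
    return max(1.0, math.log(x))

def tau(delta, T):
    """tau(Delta) = T ∧ (smallest v with Delta >= 16*sqrt(log_bar(2T/v)/v))."""
    for v in range(1, T + 1):
        if delta >= 16.0 * math.sqrt(log_bar(2.0 * T / v) / v):
            return v
    return T

def m_of_delta(delta, T, grid):
    """Index m(Delta, T) of (3); `grid` lists t_1, ..., t_{M-1}."""
    t = tau(delta, T)
    for m, t_m in enumerate(grid, start=1):
        if t_m >= t:
            return m
    return len(grid)  # equals M - 1 when tau(Delta) > t_{M-1}
```

A binary search would be preferable for large T, since the right-hand side of the defining inequality is decreasing in v.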
4. Functionals, grids and bounds.
The regret bound of Proposition 1 critically depends on the choice of the grid T = {t₁, . . . , t_{M−1}} ⊂ [T]. Ideally, we would like to optimize the right-hand side of (4) with respect to the t_m's. For a fixed ∆, this problem is easy, and it is enough to choose M = 2 and t₁ ≃ τ(∆) to obtain optimal regret bounds of the order R*(∆) := loḡ(T∆²)/∆. For unknown ∆, the problem is not well defined: as observed by [15, 16], it consists in optimizing a function R_T(∆, T) for all ∆, and there is no choice that is uniformly better than others. To overcome this limitation, we minimize pre-specified real-valued functionals of R_T(·, T). The functionals we focus on are:

F_xs[R_T(·, T)] = sup_{∆∈[0,1]} {R_T(∆, T) − C·R*(∆)}, C > 0   (Excess regret),
F_cr[R_T(·, T)] = sup_{∆∈[0,1]} R_T(∆, T)/R*(∆)   (Competitive ratio),
F_mx[R_T(·, T)] = sup_{∆∈[0,1]} R_T(∆, T)   (Maximum).

Optimizing different functionals leads to different optimal grids. We investigate the properties of these functionals and grids in the rest of this section.

4.1. Excess regret and the arithmetic grid.
We begin with the simple grid consisting in a uniform discretization of [T]. This grid is particularly prominent in the group sequential testing literature [23]. As we will see, even in a favorable setup, it yields poor regret bounds.

Assume, for simplicity, that T = 2KM for some positive integer K, so that the grid is defined by t_m = mT/M. In this case, the right-hand side of (4) is bounded below by ∆t₁ = ∆T/M. For small M, this lower bound is linear in T∆, which is a trivial bound on regret. To obtain a valid upper bound, note that

t_{m(∆,T)} ≤ τ(∆) + T/M ≤ (256/∆²) loḡ(T∆²/128) + T/M.
Moreover, if m(∆, T) = M − 1, then t_{M−1} ≥ T/2, so that T∆e^{−(t_{M−1}∆²)/8} ≲ 1/∆. Together with (4), this yields the following theorem.

Theorem 1.
The etc policy implemented with the arithmetic grid defined above ensures that, for any ∆ ∈ [0, 1],

R_T(∆, T) ≲ ((1/∆) loḡ(T∆²) + T∆/M) ∧ T∆.

The optimal rate is recovered if M = T. However, the arithmetic grid leads to a bound on the excess regret of the order of ∆T when T is large and M constant.

In Section 5, the bound of Theorem 1 is shown to be optimal for excess regret, up to logarithmic factors. Clearly, this criterion provides little useful guidance on how to attack the batched bandit problem when M is small.

4.2. Competitive ratio and the geometric grid.
The geometric grid is defined as T = {t₁, . . . , t_{M−1}}, where t_m = ⌊a^m⌋ and a ≥ 2 is a parameter. If m(∆, T) ≤ M − 2, then (4) gives

R_T(∆, T) ≤ 9∆a^{m(∆,T)} ≤ 9a∆τ(∆) ≤ (2304a/∆) loḡ(T∆²/128),

One could also consider the Bayesian criterion F_by[R_T(·, T)] = ∫ R_T(∆, T) dπ(∆), where π is a given prior distribution on ∆, rather than on the expected rewards, as in the traditional Bayesian bandit literature [7].
1, then τ (∆) > t M − . Then, (4), together with Lem-ma B.2 yields R T (∆ , T ) ≤ a M − + T ∆ e − ( a M − ∆ ) / ≤ a ∆ log (cid:18) T ∆ (cid:19) for a ≥ T log T ) /M ≥
2. We have proved the following theorem.
Theorem 2.
The etc policy implemented with the geometric grid defined above for the value a := 2(T/loḡ T)^{1/M}, when 2 ≤ M ≤ log(T/loḡ T), ensures that, for any ∆ ∈ [0, 1],

R_T(∆, T) ≲ (T/loḡ T)^{1/M} · (loḡ(T∆²)/∆) ∧ T∆.

For a logarithmic number of batches, M = Θ(log T), the geometric grid leads to the optimal regret bound

R_T(∆, T) ≲ (loḡ(T∆²)/∆) ∧ T∆.

This bound shows that the geometric grid leads to a deterioration of the regret bound by a factor (T/loḡ(T))^{1/M}, which can be interpreted as a uniform bound on the competitive ratio. For example, for M = 2 and ∆ = 1, this leads to the √T regret bound observed in the Bayesian literature, which is also optimal in the minimax sense. However, this minimax optimal bound is not valid for all values of ∆. Indeed, maximizing over ∆ > 0 yields

sup_∆ R_T(∆, T) ≲ T^{(M+1)/(2M)} log^{(M−1)/(2M)}((T/loḡ(T))^{1/M}),

which yields the minimax rate √T when M ≥ log(T/loḡ(T)), as expected from prior results. The decay in M can be made even faster if one focuses on the maximum risk, by employing our "minimax grid."

4.3. Maximum risk and the minimax grid.
The objective of this grid is to minimize the maximum risk, and to recover the classical distribution-independent minimax bound of order √T. The intuition behind this grid comes from Proposition 1, in which ∆t_{m(∆,T)} is the most important term to control. Consider a grid T = {t₁, . . . , t_{M−1}}, where the t_m's are defined recursively as t_{m+1} = f(t_m), so that, by definition, t_{m(∆,T)} ≤ f(τ(∆) − 1). The quantity ∆f(τ(∆) − 1) should be as small as possible, and constant with respect to ∆. This is ensured by choosing f(τ(∆) − 1) = a/∆ or, equivalently, by choosing f(x) = a/τ⁻¹(x + 1) for a suitable notion of the inverse. This yields ∆t_{m(∆,T)} ≤ a, so that the parameter a is actually a bound on the regret. This parameter also has to be large enough so that the regret

T sup_∆ ∆ e^{−(t_{M−1}∆²)/8} = 2T/√(e t_{M−1})

incurred in the go-for-broke step is also of the order of a. The formal definition below uses not only this delicate recurrence, but also takes care of rounding problems.

Let u₁ = a, for some a > 0 to be chosen later, and define

u_j = f(u_{j−1}), where f(u) = a √(u / loḡ((2T)/u)),  (6)

for all j ∈ {2, . . . , M − 1}. The minimax grid T = {t₁, . . . , t_{M−1}} has points given by t_m = ⌊u_m⌋₂, m ∈ {1, . . . , M − 1}.

If m(∆, T) ≤ M − 2, then it follows from (4) that R_T(∆, T) ≤ 9∆t_{m(∆,T)}, and as τ(∆) is the smallest integer such that ∆ ≥ a/f(τ(∆)), we have

∆t_{m(∆,T)} ≤ ∆f(τ(∆) − 1) ≤ a.

As discussed above, if a is greater than 2T/(16√(e t_{M−1})), then the regret is also bounded by 16a when m(∆, T) = M − 1. Therefore, in both cases, the regret is bounded by 16a. Before finding an a satisfying the above conditions, note that it follows from Lemma B.3 that, as long as 15a^{S_{M−1}} ≤ T,

t_{M−1} ≥ u_{M−1} ≥ a^{S_{M−1}} / (30 loḡ^{S_{M−2}/2}((2T)/a^{S_{M−1}})),

with the notation S_k := 2 − 2^{1−k}. Therefore, we need to choose a such that

a^{S_M} ≥ √(30/e)/8 · T · loḡ^{S_{M−2}/4}((2T)/a^{S_{M−1}})  and  15a^{S_{M−1}} ≤ T.

It follows from Lemma B.4 that the choice

a := (2T)^{1/S_M} loḡ^{1/4−(3/4)/(2^M−1)}((2T)^{1/(2^M−1)})

ensures both conditions when 2^M ≤ log(2T)/6. We emphasize that

loḡ^{1/4−(3/4)/(2^M−1)}((2T)^{1/(2^M−1)})

is bounded by a constant for M = ⌊log₂(log(2T)/6)⌋. As a consequence, in order to get the optimal minimax rate of √T, one only needs ⌊log₂ log(T)⌋ batches. If more batches are available, then our policy implicitly combines some of them. We have proved the following theorem.

Theorem 3. The etc policy over the minimax grid with

a = (2T)^{1/(2−2^{1−M})} loḡ^{1/4−(3/4)/(2^M−1)}((2T)^{1/(2^M−1)})

ensures that, for any M such that 2^M ≤ log(2T)/6,

sup_{0<∆≤1} R_T(∆, T) ≲ T^{1/(2−2^{1−M})} loḡ^{1/4−(3/4)/(2^M−1)}(T^{1/(2^M−1)}),

which is minimax optimal, that is, sup_∆ R_T(∆, T) ≲ √T, for M ≥ log₂ log(T).
Table 1
Regret and decision times of the etc policy with the minimax grid for M = 2, 3, 4, 5 (entries shown up to factors polynomial in l_T = log T)

M   sup_∆ R_T(∆, T)   t₁          t₂          t₃
2   T^{2/3}           T^{2/3}
3   T^{4/7}           T^{4/7}     T^{6/7}
4   T^{8/15}          T^{8/15}    T^{4/5}     T^{14/15}
5   T^{16/31}         T^{16/31}   T^{24/31}   T^{28/31}

Table 1 gives the regret bounds (without constant factors) and the decision times of the etc policy with the minimax grid for M = 2, 3, 4, 5.

An etc policy with the minimax grid can easily be adapted to have only O(log log T) switches, and yet still achieve regret of optimal order √T. To do so, in each batch one arm should be pulled for the first half of the batch, and the other for the second half, leading to only one switch within the batch, until the policy commits to a single arm. To ensure that a switch does not occur between batches, the first arm pulled in a batch should be set to the last arm pulled in the previous batch, assuming that the policy has not yet committed. This strategy is relevant in applications such as labor economics and industrial policy, where switching from one arm to the other may be expensive [24]. In this context, our policy compares favorably with the best current policies constrained to have log log(T) switches, which lead to a regret bound of order √(T log log log T) [11].
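For completeness, the two closed-form grids of Sections 4.1 and 4.2 are immediate to construct. This is a sketch with our own rounding conventions; Theorem 2 takes a = 2(T/loḡ T)^{1/M}.

```python
import math

def arithmetic_grid(T, M):
    # Section 4.1: uniform discretization t_m = m*T/M, m = 1, ..., M.
    return [round(m * T / M) for m in range(1, M + 1)]

def geometric_grid(T, M):
    # Section 4.2: t_m = floor(a**m) with a = 2*(T/log_bar(T))**(1/M),
    # capped at T so that t_M = T.
    a = 2.0 * (T / max(1.0, math.log(T))) ** (1.0 / M)
    return [min(T, math.floor(a ** m)) for m in range(1, M + 1)]
```

The geometric grid front-loads small batches: for T = 10⁴ and M = 4, the first batch has only ⌊a⌋ = 11 points, while the last stretches to T.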
5. Lower bounds.
In this section, we address the optimality of the regret bounds derived above for the specific functionals F_xs, F_cr and F_mx. The results below do not merely characterize optimality (up to logarithmic terms) of the chosen grid within the class of etc policies, but also optimality of the final policy among the class of all M-batch policies.

Theorem 4.
Fix T ≥ 2 and M ∈ [2 : T]. Any M-batch policy (T, π) must satisfy the following lower bounds:

sup_{∆∈(0,1]} { R_T(∆, T) − 1/∆ } ≳ T/M,
sup_{∆∈(0,1]} { ∆ R_T(∆, T) } ≳ T^{1/M},
sup_{∆∈(0,1]} R_T(∆, T) ≳ T^{1/(2−2^{1−M})}.

Proof.
Fix ∆_k = 1/√t_k, k = 1, . . . , M. Focusing first on excess regret, it follows from Proposition A.1 that

sup_{∆∈(0,1]} { R_T(∆, T) − 1/∆ } ≥ max_{1≤k≤M} [ Σ_{j=1}^M ∆_k(t_j − t_{j−1}) e^{−t_{j−1}∆_k²/2} − 1/∆_k ] ≥ max_{1≤k≤M} { t_{k+1}/√(e t_k) − √t_k }.

As t_{k+1} ≥ t_k, the last quantity above is minimized if all the terms are of the same order. This yields t_{k+1} = t_k + a for some positive constant a. As t_M = T, we get that t_j ∼ jT/M, and taking ∆ = 1 yields

sup_{∆∈(0,1]} { R_T(∆, T) − 1/∆ } ≥ t₁ ≳ T/M.
Proposition A.1 also yields

sup_{∆∈(0,1]} { ∆ R_T(∆, T) } ≥ max_k Σ_{j=1}^M ∆_k²(t_j − t_{j−1}) e^{−t_{j−1}∆_k²/2} ≥ max_k { t_{k+1}/(√e · t_k) }.

Arguments similar to the ones for the excess regret above give the lower bound for the competitive ratio. Finally,

sup_{∆∈(0,1]} R_T(∆, T) ≥ max_k Σ_{j=1}^M ∆_k(t_j − t_{j−1}) e^{−t_{j−1}∆_k²/2} ≥ max_k { t_{k+1}/√(e t_k) }

gives the lower bound for maximum risk. □
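Before turning to the simulations of the next section, the pieces of Section 3 can be assembled into a compact run of the generic etc policy of Figure 1, with Gaussian rewards, even batch sizes, and a simple alternating schedule in place of the uniform draw from V_N (as noted in Section 3.3, such a substitution does not affect the policy). This is our own sketch, not the authors' code.

```python
import math
import random

def log_bar(x):
    return max(1.0, math.log(x))

def etc_run(mu, T, grid, seed=0):
    """One run of a generic explore-then-commit policy on two Gaussian arms.

    `mu` holds the two means, `grid` the decision times t_1 < ... < t_{M-1}
    (batch sizes assumed even).  Returns the realized pseudo-regret
    Delta * (number of pulls of the suboptimal arm)."""
    rng = random.Random(seed)
    sums, pulls = [0.0, 0.0], [0, 0]
    committed = None
    t = 0
    for t_m in list(grid) + [T]:
        if committed is None and min(pulls) > 0:
            s = pulls[0]  # both arms pulled s times so far
            m0, m1 = sums[0] / s, sums[1] / s
            if t_m == T:
                committed = 0 if m0 >= m1 else 1  # go for broke
            else:
                b = 2.0 * math.sqrt(2.0 * log_bar(T / s) / s)
                if m0 - b > m1 + b:
                    committed = 0
                elif m1 - b > m0 + b:
                    committed = 1
        while t < t_m:  # fill the current batch
            arm = committed if committed is not None else t % 2
            pulls[arm] += 1
            sums[arm] += mu[arm] + rng.gauss(0.0, 1.0)
            t += 1
    gap = max(mu) - min(mu)
    return gap * pulls[mu.index(min(mu))]
```

When the gap is large, the policy commits right after the first batch, so the pseudo-regret is the gap times half the first batch size.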
6. Simulations.
In this final section, we briefly compare, in simulations, the various policies (grids) introduced above. These are also compared with
Ucb2 [2], which, as noted above, can be seen as an M = O(log T) batch trial. A more complete exploration can be found in [29].

The minimax and geometric grids perform well using an order of magnitude fewer batches than Ucb2. The number of batches required for
Ucb2 makes its use for medical trials functionally impossible. For example, a study that examined STI status six months after an intervention in [27] would require 1.5 years to run using minimax batch sizes, but Ucb2 would use as many as 56 batches, meaning the study would take 28 years.

Fig. 2. Performance of policies with different distributions and M = 5. (For all distributions µ(†) = 0.5 and µ(⋆) = 0.6.)

Specific examples of performance can be found in Figure 2. This figure compares average regret produced by different policies for many values of the total sample, T. For each value of T in the figure, a sample is drawn, grids are computed based on M and T, the policy is implemented, and average regret is calculated based on the choices in the policy. This is repeated 100 times for each value of T. The number of batches is set at M = 5 for all policies except Ucb2. Each panel considers one of four distributions: two continuous (Gaussian and Student's t-distribution) and two discrete (Bernoulli and Poisson). In all cases, we set the difference between the arms at ∆ = 0.1.

Several patterns emerge. First, when T is large enough, the etc policy will tend to commit after the first batch, as the first evaluation point will be greater than τ(∆). In the arithmetic grid, the size of this first batch is a constant proportion of the overall participant pool, so average regret will be constant when T is large enough. Second, the minimax grid also produces relatively constant average regret, although this holds for smaller values of T, and produces lower regret than the geometric or arithmetic case when M is small. This indicates, using the intuition above, that the minimax grid excels at choosing the optimal batch size to allow a decision to commit very close to τ(∆). This advantage over the arithmetic and geometric grids is clear. The minimax grid can even produce lower regret than Ucb2, using an order of magnitude fewer batches. Third, and finally, the
Ucb2 algorithm generally produces lower regret than any of the policies considered in this manuscript for all distributions except the heavy-tailed Student's t-distribution, for which batched policies perform significantly better. Indeed, Ucb2 is calibrated for sub-Gaussian rewards, as are batched policies. However, even with heavy-tailed distributions, the central limit theorem implies that batching a large number of observations returns averages that are sub-Gaussian; see the supplementary material [29]. Even when Ucb2 performs better, this increase in performance comes at a steep practical cost: many more batches. For example, with draws from a Gaussian distribution, and T between 10,000 and 40,000, the minimax grid with only 5 batches performs better than Ucb2. Throughout this range,
Ucb2 uses roughly 50 batches.

It is worth noting that in medical trials, there is nothing special about waiting six months for data from an intervention. Trials of cancer drugs often measure variables like the 1- or 3-year survival rate, or the increase in average survival compared to a baseline that may be greater than a year. In these cases, the ability to get relatively low regret with a small number of batches is extremely important.

APPENDIX A: TOOLS FOR LOWER BOUNDS

Our results hinge on tools for lower bounds, recently adapted to the bandit setting in [9]. Specifically, we reduce the problem of deciding which arm to pull to that of hypothesis testing. Consider the following two candidate setups for the reward distributions: P_1 = N(∆, 1) ⊗ N(0, 1) and P_2 = N(0, 1) ⊗ N(∆, 1). Under P_1, successive pulls of arm 1 yield N(∆, 1) rewards and successive pulls of arm 2 yield N(0, 1) rewards. The opposite is true for P_2, so arm i is optimal under P_i.

At a given time t ∈ [T], the choice of π_t ∈ {1, 2} is a test between P_1^t and P_2^t, where P_i^t denotes the distribution of observations available at time t under P_i. Let R(t, π) denote the regret incurred by policy π at time t, so that R(t, π) = ∆ 1(π_t ≠ i) under P_i. Denote by E_i^t the expectation under P_i^t, so that

E_1^t[R(t, π)] ∨ E_2^t[R(t, π)] ≥ (1/2)( E_1^t[R(t, π)] + E_2^t[R(t, π)] ) = (∆/2)( P_1^t(π_t = 2) + P_2^t(π_t = 1) ).

Next, we use the following lemma (see [35], Chapter 2).
Lemma A.1.
Let P_1 and P_2 be two probability distributions such that P_1 ≪ P_2. Then, for any measurable set A,

P_1(A) + P_2(A^c) ≥ (1/2) exp( −KL(P_1, P_2) ),

where KL(·, ·) is the Kullback–Leibler divergence, defined by

KL(P_1, P_2) = ∫ log( dP_1/dP_2 ) dP_1.

Here, observations are generated by an M-batch policy π. Recall that J(t) ∈ [M] denotes the index of the current batch. As π_t depends only on observations { Y_s^{(π_s)} : s ∈ [t_{J(t)−1}] }, P_i^t is a product distribution of at most t_{J(t)−1} marginals. It is straightforward to show that, whatever arms are observed over this history,

KL(P_1^t, P_2^t) = t_{J(t)−1} ∆²/2.

Therefore,

E_1^t[R(t, π)] ∨ E_2^t[R(t, π)] ≥ (∆/4) exp( −t_{J(t)−1} ∆²/2 ).

Summing over t yields the following result.

Proposition A.1.
Fix T = {t_1, . . . , t_M} and let (T, π) be an M-batch policy. There exist reward distributions with gap ∆ such that (T, π) has regret bounded below as, defining t_0 := 0,

R_T(∆, T) ≥ (∆/4) Σ_{j=1}^M (t_j − t_{j−1}) e^{−t_{j−1} ∆²/2}.

A variety of lower bounds in Section 5 are shown using this proposition.

APPENDIX B: TECHNICAL LEMMAS

A process {Z_t}_{t≥0} is a sub-Gaussian martingale difference sequence if E[Z_{t+1} | Z_1, . . . , Z_t] = 0 and E[e^{λ Z_{t+1}}] ≤ e^{λ²/2} for every λ > 0, t ≥ 0.

Lemma B.1.
Let Z_t be a sub-Gaussian martingale difference sequence. Then, for every δ > 0 and every integer t ≥ 1,

P{ Z̄_t ≥ √( (2/t) log(1/δ) ) } ≤ δ.

Moreover, for every integer τ ≥ 1,

P{ ∃ t ≤ τ : Z̄_t ≥ 2√( (2/t) log( 4τ/(δt) ) ) } ≤ δ.
The first inequality follows from a classical Chernoff bound. To prove the maximal inequality, define ε_t = 2√( (2/t) log( 4τ/(δt) ) ). Note that, by Jensen's inequality, for any α > 0 the process { exp(α s Z̄_s) }_s is a submartingale. Therefore, it follows from Doob's maximal inequality ([19], Theorem 3.2, page 314) that for every η > 0 and every t ≥ 1,

P{ ∃ s ≤ t : s Z̄_s ≥ η } = P{ ∃ s ≤ t : e^{α s Z̄_s} ≥ e^{α η} } ≤ E[e^{α t Z̄_t}] e^{−α η}.

Next, as Z_t is sub-Gaussian, we have E[exp(α t Z̄_t)] ≤ exp(α² t/2), so that optimizing over α > 0 yields

P{ ∃ s ≤ t : s Z̄_s ≥ η } ≤ exp( −η²/(2t) ).

Next, using a peeling argument, one obtains

P{ ∃ t ≤ τ : Z̄_t ≥ ε_t } ≤ Σ_{m=0}^{⌊log₂(τ)⌋} P{ ∪_{t=2^m}^{2^{m+1}} { Z̄_t ≥ ε_t } }
≤ Σ_{m=0}^{⌊log₂(τ)⌋} P{ ∪_{t=2^m}^{2^{m+1}} { Z̄_t ≥ ε_{2^{m+1}} } }
≤ Σ_{m=0}^{⌊log₂(τ)⌋} P{ ∪_{t=2^m}^{2^{m+1}} { t Z̄_t ≥ 2^m ε_{2^{m+1}} } }
≤ Σ_{m=0}^{⌊log₂(τ)⌋} exp( −(2^m ε_{2^{m+1}})² / 2^{m+2} )
= Σ_{m=0}^{⌊log₂(τ)⌋} (2^{m+1}/(4τ)) δ ≤ (2^{⌊log₂(τ)⌋+2}/(4τ)) δ ≤ δ.

Hence, the result. □
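As a quick sanity check, the maximal inequality of Lemma B.1 can be simulated for i.i.d. standard Gaussian increments, a special case of a sub-Gaussian martingale difference sequence. This is our illustration, not part of the paper, and the threshold 2√((2/t) log(4τ/(δt))) used below is our reading of the lemma's constants.

```python
import math
import random

def maximal_violation_rate(tau, delta, n_paths=2000, seed=0):
    """Fraction of simulated paths on which the running mean Z-bar_t ever
    exceeds the peeling threshold 2*sqrt((2/t) * log(4*tau/(delta*t)))."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_paths):
        s = 0.0
        for t in range(1, tau + 1):
            s += rng.gauss(0.0, 1.0)  # Z_t ~ N(0, 1) is sub-Gaussian
            threshold = 2 * math.sqrt((2 / t) * math.log(4 * tau / (delta * t)))
            if s / t >= threshold:
                hits += 1
                break
    return hits / n_paths

delta = 0.1
rate = maximal_violation_rate(tau=64, delta=delta)
print(f"empirical violation rate {rate:.4f}, guaranteed <= delta = {delta}")
```

The empirical rate is far below δ, as expected: the union-bound and peeling steps make the inequality conservative.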
Lemma B.2.
Fix two positive integers T and M ≤ log(T). It holds that

(T∆²/4) e^{−a^{M−1} ∆²/4} ≤ a log(T∆²/4)

if

a ≥ ( 2^M T / log T )^{1/M}.
Proof.
Fix the value of a and observe that M ≤ log T implies that a ≥ e. Define x := T∆²/4 > 0 and θ := a^{M−1}/T > 0. The first inequality is rewritten as

x e^{−θx} ≤ a log(x).  (7)

We will prove that this inequality is true for all x > 0, given that θ and a satisfy some relation. This, in turn, gives a condition that depends solely on a, ensuring that the statement of the lemma is true for all ∆ > 0.

First, (7) holds for x ≤ e, as a log(x) = a ≥ e at x = e. Similarly, x e^{−θx} ≤ 1/(θe) for every x > 0. Thus (7) holds for all x ≥ 1/√θ when a ≥ a* := 1/( θ log(1/θ) ).

We assume this inequality holds. Thus, we must show that (7) holds for x ∈ [e, 1/√θ]. For x ≤ a, the derivative of the right-hand side is a/x ≥ 1, while the derivative of the left-hand side is smaller than 1. As a consequence, (7) holds for every x ≤ a, in particular for every x ≤ a*. To summarize, whenever

a ≥ a* = ( T/a^{M−1} ) / log( T/a^{M−1} ),

equation (7) holds on (0, e], on [e, a*] and on [1/√θ, +∞), thus on (0, +∞), as a* ≥ 1/√θ. Next, if a^M ≥ 2^M T/ log T, we obtain

a/a* = (a^M/T) log( T/a^{M−1} ) ≥ (2^M/ log(T)) log( T ((log T)/(2^M T))^{(M−1)/M} ) = (1/ log(T)) log( [ T ((log T)/(2^M T))^{(M−1)/M} ]^{2^M} ).

The result follows from log(T)/M ≥ 1, hence a/a* ≥ 1. □

Lemma B.3.
Fix a ≥ 1, b ≥ e and let u_0, u_1, . . . be defined by u_0 = a and

u_{k+1} = a √( u_k log(b/u_k) ).

Define S_k = 0 for k < 0 and

S_k = Σ_{j=0}^{k} 2^{−j} = 2 − 2^{−k} for k ≥ 0.

Then, for any M such that a^{S_{M−1}} ≤ b, and all k ∈ [M − 2],

u_k ≥ a^{S_k} / ( 15 log^{S_{k−1}/2}( b/a^{S_{k−1}} ) ).

Moreover, for k ∈ [M − 1, M], we also have

u_k ≥ a^{S_k} / ( 15 log^{S_{k−1}/2}( b/a^{S_{M−2}} ) ).

Proof.
Define z_k := log(b/a^{S_k}). It is straightforward to show that z_k ≤ 2z_{k+1} if and only if a^{S_{k+2}} ≤ b. In particular, a^{S_{M−1}} ≤ b implies that z_k ≤ 2z_{k+1} for all k ∈ [0 : M − 3]. By induction,

u_{k+1} = a √( u_k log(b/u_k) ) ≥ a √( ( a^{S_k} / (15 z_{k−1}^{S_{k−1}/2}) ) log(b/u_k) ).  (8)

Observe that b/a^{S_{k−1}} ≥ 15, so for all k ∈ [0, M − 1] we have

log(b/u_k) ≤ log(b/a^{S_{k−1}}) + log 15 + (S_{k−1}/2) z_{k−1} ≤ 15 z_{k−1}.

This yields

z_{k−1}^{S_{k−1}/2} log(b/u_k) ≤ 15 z_{k−1}^{S_{k−1}/2} z_{k−1} = 15 z_{k−1}^{S_k}.

Plugging this bound into (8) completes the proof for k ∈ [M − 2]. For k ≥ M − 2, we have by induction on k, starting from M − 2,

u_{k+1} = a √( u_k log(b/u_k) ) ≥ a √( ( a^{S_k} / (15 z_{M−2}^{S_{k−1}/2}) ) log(b/u_k) ).

Moreover, as b/a^{S_{k−1}} ≥ 15, for k ∈ [M − 1, M] we have

log(b/u_k) ≤ log(b/a^{S_{k−1}}) + log 15 + (S_{k−1}/2) z_{M−2} ≤ 15 z_{M−2}. □

Lemma B.4. If 2^M ≤ log(4T)/6, the following specific choice

a := (2T)^{1/S_{M−1}} log^{1/2 − (3/4)/(2^M − 2)}( (2T)^{1/(2^M − 2)} )

ensures that

a^{S_{M−1}} ≥ √e T log^{S_{M−2}/2}( T/a^{S_{M−2}} )  (9)

and

a^{S_{M−1}} ≤ 2T.  (10)

Proof.
Immediate for M = 2. For M > 2, 2^M ≤ log(4T) implies

a^{S_{M−1}} = 2T log^{S_{M−1}/2 − 3·2^{−(M+1)}}( (2T)^{1/(2^M − 2)} ) ≥ 2T [ (16/(15 · 2^M)) log(2T) ]^{1/2} ≥ T.

Therefore, a ≥ (2T)^{1/S_{M−1}}, which in turn implies that T/a^{S_{M−2}} ≤ (2T)^{1 − S_{M−2}/S_{M−1}} = (2T)^{1/(2^M − 2)}, so that

a^{S_{M−1}} = 2T log^{S_{M−1}/2 − 3·2^{−(M+1)}}( (2T)^{1/(2^M − 2)} ) ≥ √e T log^{S_{M−2}/2}( T/a^{S_{M−2}} ).

This completes the proof of (9). Equation (10) follows if

15^{S_{M−2}} (2T)^{S_{M−2}} log^{(S_{M−1} S_{M−2})/2}( (2T)^{1/(2^M − 2)} ) ≤ (2T)^{S_{M−1}}.  (11)

Using that S_{M−k} ≤ 2, we get that the left-hand side of (11) is smaller than

15² (2T)^{S_{M−2}} log²( (2T)^{1/(2^M − 2)} ).

The result follows using 2^M ≤ log(2T)/6, which implies that 15² log²( (2T)^{1/(2^M − 2)} ) is bounded by (2T)^{2^{1−M}}. □

SUPPLEMENTARY MATERIAL
Supplement to “Batched bandit problems” (DOI: 10.1214/15-AOS1381SUPP; .pdf). The supplementary material [29] contains additional simulations, including some using real data.

REFERENCES

[1] Audibert, J.-Y. and Bubeck, S. (2010). Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res.
[2] Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Mach. Learn.
[3] Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Period. Math. Hungar.
[4] Bartroff, J. (2007). Asymptotically optimal multistage tests of simple hypotheses. Ann. Statist.
[5] Bartroff, J., Lai, T. L. and Shih, M.-C. (2013). Sequential Experimentation in Clinical Trials: Design and Analysis. Springer, New York. MR2987767
[6] Bather, J. A. (1981). Randomized allocation of treatments in sequential experiments. J. Roy. Statist. Soc. Ser. B
[7] Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman & Hall, London. MR0813698
[8] Bertsimas, D. and Mersereau, A. J. (2007). A learning approach for interactive marketing to a customer segment. Oper. Res.
[9] Bubeck, S., Perchet, V. and Rigollet, P. (2013). Bounded regret in stochastic multi-armed bandits. COLT 2013, JMLR W&CP
[10] Cappé, O., Garivier, A., Maillard, O.-A., Munos, R. and Stoltz, G. (2013). Kullback–Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist.
[11] Cesa-Bianchi, N., Dekel, O. and Shamir, O. (2013). Online learning with switching costs and other adaptive adversaries. Adv. Neural Inf. Process. Syst.
[12] Cesa-Bianchi, N., Gentile, C. and Mansour, Y. (2012). Regret minimization for reserve prices in second-price auctions. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms.
[13] Cheng, Y. (1996). Multistage bandit problems. J. Statist. Plann. Inference
[14] Chick, S. E. and Gans, N. (2009). Economic analysis of simulation selection problems. Manage. Sci.
[15] Colton, T. (1963). A model for selecting one of two medical treatments. J. Amer. Statist. Assoc.
[16] Colton, T. (1965). A two-stage model for selecting one of two treatments. Biometrics
[17] Cottle, R., Johnson, E. and Wets, R. (2007). George B. Dantzig (1914–2005). Notices Amer. Math. Soc.
[18] Dantzig, G. B. (1940). On the non-existence of tests of “Student's” hypothesis having power functions independent of σ. Ann. Math. Statist.
[19] Doob, J. L. (1990). Stochastic Processes. Wiley, New York. MR1038526
[20] Fabius, J. and van Zwet, W. R. (1970). Some remarks on the two-armed bandit. Ann. Math. Statist.
[21] Ghurye, S. G. and Robbins, H. (1954). Two-stage procedures for estimating the difference between means. Biometrika
[22] Hardwick, J. and Stout, Q. F. (2002). Optimal few-stage designs. J. Statist. Plann. Inference
[23] Jennison, C. and Turnbull, B. W. (2000). Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall/CRC, Boca Raton, FL. MR1710781
[24] Jun, T. (2004). A survey on the bandit problem with switching costs. De Economist
[25] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math.
[26] Maurice, R. J. (1957). A minimax procedure for choosing between two populations using sequential sampling. J. R. Stat. Soc., B
[27] Metsch, L. R., Feaster, D. J., Gooden, L. et al. (2013). Effect of risk-reduction counseling with rapid HIV testing on risk of acquiring sexually transmitted infections: The AWARE randomized clinical trial. JAMA
[28] Perchet, V. and Rigollet, P. (2013). The multi-armed bandit problem with covariates. Ann. Statist.
[29] Perchet, V., Rigollet, P., Chassang, S. and Snowberg, E. (2015). Supplement to “Batched bandit problems.” DOI:10.1214/15-AOS1381SUPP.
[30] Robbins, H. (1952). Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc.
[31] Schwartz, E. M., Bradlow, E. and Fader, P. (2013). Customer acquisition via display advertising using multi-armed bandit experiments. Technical report, Univ. Michigan.
[32] Somerville, P. N. (1954). Some problems of optimum sampling. Biometrika
[33] Stein, C. (1945). A two-sample test for a linear hypothesis whose power is independent of the variance. Ann. Math. Statist.
[34] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika
[35] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York. MR2724359
[36] Vogel, W. (1960). An asymptotic minimax theorem for the two armed bandit problem. Ann. Math. Statist.
[37] Vogel, W. (1960). A sequential design for the two armed bandit. Ann. Math. Statist.

V. Perchet
LPMA, UMR 7599
Université Paris Diderot
8, Place FM/13
75013 Paris
France
E-mail: [email protected]
P. Rigollet
Department of Mathematics and IDSS
Massachusetts Institute of Technology
77 Massachusetts Avenue
Cambridge, Massachusetts 02139-4307
USA
E-mail: [email protected]
S. Chassang
Department of Economics
Princeton University
Bendheim Hall 316
Princeton, New Jersey 08544-1021
USA
E-mail: [email protected]