Self-Tuning Bandits over Unknown Covariate-Shifts
SS ELF -T UNING B ANDITS OVER U NKNOWN C OVARIATE -S HIFTS
A P
REPRINT
Joseph Suk
Columbia University, Statistics [email protected]
Samory Kpotufe
Columbia University, Statistics [email protected]
July 20, 2020 A BSTRACT
Bandits with covariates, a.k.a. contextual bandits , address situations where optimal actions (or arms)at a given time t , depend on a context x t , e.g., a new patient’s medical history, a consumer’s pastpurchases. While it is understood that the distribution of contexts might change over time, e.g., dueto seasonalities, or deployment to new environments, the bulk of studies concern the most adversarialsuch changes, resulting in regret bounds that are often worst-case in nature. Covariate-shift on the other hand has been considered in classification as a middle-ground formalismthat can capture mild to relatively severe changes in distributions. We consider nonparametric banditsunder such middle-ground scenarios, and derive new regret bounds that tightly capture a continuumof changes in context distribution. Furthermore, we show that these rates can be adaptively attainedwithout knowledge of the time of shift nor the amount of shift.
Bandits with covariates, or contextual bandits, concern situations where the reward of an action depends on a currentcontext x , e.g., a patient’s medical record (actions are treatments), or a user’s profile and past history (actions are newproducts to propose). The problem is to maximize the total rewards of actions over time as similar contexts appearand rewards are observed over past actions. We adopt the stochastic setting where covariate and rewards are jointlydistributed over time.In the nonparametric version, little is assumed about the distribution of rewards over contexts, beyond Lipchitz conditionsthat capture the idea that rewards should be somewhat close for nearby contexts. Now suppose contexts are drawn froma fixed distribution over time; this ensures typicality, i.e., we can expect similar contexts to have appeared previously, sothere is much potential for learning to progress over time. Most recent advances have been made in nonparametricsettings with a fixed distribution, with early consistency results in [YZ + inthe nonparametric literature on the problem, although it has long been recognized in the more established parametricsetting on contextual bandits, under various formalisms of often adversarial nature [HMB15, KA16, LWAL17, LLS18,WIW18].As an initial take on a nonparametric setting with changes in distributions, we focus attention to the less adversarialcase of covariate-shift , a formalism often adopted in works on domain adaptation in classification, starting with[SNK +
08, CMRR08, GSH +
09, BDU12]. For example, consider a situation where clinical trials are to be extendedto a new population. The distribution on patients’ profiles X would likely shift, however the predictors in X (e.g.,biometrics, medical history) are naturally chosen to be predictive of treatment outcomes; in other words, the conditionaldistribution of rewards given X remains unchanged. Formally, in covariate-shift , Q Y | X = P Y | X but Q X (cid:54) = P X , where P and Q denote previous and new joint-distributions on context-reward pairs ( X, Y ) . We actually are not aware of any such work to date for the nonparametric setting. a r X i v : . [ s t a t . M L ] J u l elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
We are interested in achievable rewards after the shift P → Q , i.e., over the time period corresponding to the newdistribution Q X over contexts. Intuitively, such performance under Q depends on how far the prior distribution P X isfrom Q X , as typical contexts under Q X might not be likely to have been observed under P X . Hence, a first goal is tounderstand, whether and how a policy π started before the shift might adapt to an a priori unknown such discrepancybetween P X and Q X . The problem is generally harder than in classification where most established approacheswould compare observed X ∼ Q X to prior observations X ∼ P X to evaluate and adjust to the unknown discrepancy.In particular, we might not know the change time, and therefore cannot readily identify which data is which: forexample, in an ongoing clinical study, or running a recommender system, a change in distributions might occur due toseasonalities in population makeup, so is likely not known a priori.Interestingly, we find that it is not necessary to spend resources identifying such unknown change points. Namely, weshow that it possible to automatically adapt to unknown change points, along with unknown discrepancies betweencovariate distributions P X , Q X , while achieving regrets under Q X of near optimal order in terms of these unknownchange parameters. In particular, the achievable regret tightly characterizes the effective amount of past experience contributed to Q by previous runs on P , in terms of both the length of previous runs and the discrepancy P X → Q X . Further Background.
There is by now an expansive literature on parametric contextual bandits, ranging from fullyadversarial to stochastic settings (see e.g. [HM07, LZ08, BCB12, AC16, RS16]).In the stochastic parametric setting, the earlier cited results [HMB15, KA16, LWAL17, LLS18, WIW18] are closest inspirit to the present work. However, they consider settings of a more adversarial nature, as their aim is to achieve regrets– over stationary periods of length ∆ t – of similar order O ( √ ∆ t ) as would have been achieved without distributionshifts; in other words, they contend that past experiences could adversarially affect regret over stationary periods, andthe aim is to mitigate such adversity. In contrast, as we will see, past experience is actually useful under covariate-shift(to a variable extent depending on shift charateristics), as long as the bandits procedure is reasonably conservative inleast-observed regions of context space.[Sli14] considers nonparametric settings, however, with fixed non-stochastic contexts.Finally, in the setting of active online regression with multiple domains, [CLMZ20] establishes adaptive regretguarantees in terms of the domain dimensions and durations.We start with a formal setup in Section 2, followed by an overview of results in Section 3. Algorithms and proof ideasare discussed in Section 4. We consider a finite set of actions (or arms) [ K ] . = { , . . . , K } , and let Y ∈ [0 , K denote the rewards of each action i ∈ [ K ] . We assume that the covariate X , supported on X . = [0 , d , is jointly distributed with Y , and we thereforeassume a random independent sequence of covariate-reward pairs { ( X t , Y t ) } t ∈ N , identically distributed over different and possibly unknown periods of time .In particular we assume (in the bulk of the paper) that for some n P ≥ , possibly a priori unknown, the sequence { ( X t , Y t ) } t ∈ [ n p ] is i.i.d. according to a distribution P , while { ( X t , Y t ) } t>n p is i.i.d. according to a target distribution Q with different marginals. We will be interested in performance under Q , i.e., after such a shift. We note however thatour analysis readily extends to the case of multiple distribution shifts (before time n P ), as discussed in Appendix D. Assumption 1 (Covariate shift) . While the distribution of covariates X t might change overtime, the conditionaldistribution of Y t | X t remains fixed (i.e., in our context Q X (cid:54) = P X , but P Y | X = Q Y | X ). In particular, the aim is to maximize expected rewards conditioned on X t ; this is captured through the fixed regression function f : X → [0 , K as f i ( x ) . = E ( Y i | X = x ) , i ∈ [ K ] . The term characterizes settings displaying O ( √ n ) regrets, due to either parametric constraints on rewards or on hindsight baseline policies. The indexing set N denotes the natural numbers excluding . We use the terms time t or round t interchangeably, the latter in the context of a procedure. A P
REPRINT
In the bandits setting, a so-called policy (or bandit procedure) chooses actions at each round t , based on observedcovariates (up to round t ) and passed rewards, whereby at each round t only the rewards Y it of chosen actions i arerevealed. We say an arm i is pulled if action i is chosen by the policy. We adopt the following formalism. Definition 1 (Policy) . A policy π . = { π t } t ∈ N is a random sequence of functions π t : X t × [ K ] t − × [0 , t − → [ K ] .In an abuse of notation, in the context of a sequence of observations till round t , we will let π t ∈ [ K ] also denote theaction chosen at round t . In the case of a randomized policy, i.e., where π t in fact maps to distributions on [ K ] , we willstill let π t ∈ [ K ] denote the (random) action chosen at round t . We let X t . = { X s } s ≤ t , Y t . = { Y s } s ≤ t denote the observed covariates and (observed and unobserved) rewards fromrounds to t . The performance of a policy is evaluated through a notion of regret . Definition 2 (Cumulative regret) . Define the regret between rounds n P < n of a policy π , as R n P ,n ( π ) . = n (cid:88) t = n P +1 max i ∈ [ K ] (cid:0) f i ( X t ) − f π t ( X t ) (cid:1) . In our context of a shift to Q , we often will use the short notation R Qn ( π ) to denote R n P ,n ( π ) . The oracle policy π ∗ refers to the strategy that maximizes the expected reward at any round t , and is given by π ∗ t ( X t ) ∈ argmax i ∈ [ K ] f i ( X t ) . The regret of a policy π is therefore the excess expected reward of π ∗ relative to π over X n . We seek a policy π that minimizes E X n , Y n R Qn ( π ) .We emphasize that, while we will be interested in regret over particular periods n P + 1 : n (corresponding to a fixedtarget Q ), it is understood by definition that π runs starting at t = 1 , and R Qn ( π ) . = R n P ,n ( π ) therefore depends on priordecisions up till time n P . Finally, usual bounds for stationary distributions are recovered simply by letting n P = 0 . Our main assumptions below are stated under the (cid:96) ∞ norm on [0 , d for convenience, as we build our procedures π over regular grids of [0 , d . It should be clear however that the relevant conditions hold under any norm (e.g., any (cid:96) p , p ≥ ) when they hold under (cid:96) ∞ , by the equivalence of R d norms. • Standard Assumptions and Conditions.
We assume, as in prior work on nonparametric contextual bandits [RZ10, PR13, Sli14, RMB18, GJ18], that theregression function is Lipschitz, with some known upper-bound λ on the Lipchitsz constant (often simply assumed tobe ). Assumption 2 (Lipschitz f ) . There exists λ > such that for all i ∈ [ K ] and x, x (cid:48) ∈ X , | f i ( x ) − f i ( x (cid:48) ) | ≤ λ (cid:107) x − x (cid:48) (cid:107) ∞ . (1)Furthermore, the difficulty of detecting the optimal arm π ∗ ( x ) at any x is parametrized through the following margin condition of f w.r.t. Q , originally due to [T +
04] (for nonparametric classification).
Definition 3 (Margin Condition) . Let f (1) ( x ) , f (2) ( x ) denote the highest and second highest values of f i ( x ) , i ∈ [ K ] ,if they are not all equal; otherwise let f (1) ( x ) = f (2) ( x ) be that value.There exists δ > , C α > so that ∀ δ ∈ [0 , δ ] , Q X (0 < | f (1) ( X ) − f (2) ( X ) | < δ ) ≤ C α δ α . (2)In particular, the above is always satisfied with at least α = 0 . Intuitively, the larger the margin f (1) ( x ) − f (2) ( x ) at x ,the easier it is to detect the best arm, in the sense that a rough approximation to f is sufficient. The above condition,common in prior work on nonparametric bandits, encodes the margin distribution under Q X . Interestingly, we needno assumption on the margin distribution under P X , although our setting assumes that the procedure π is first ran oncovariates X t ∼ P X , t ≤ n P ; in fact, we will see that we only need to ensure that π maintains good choices of arms forevery potential x ∈ X , along with sufficient arm pulls, up till the distribution shifts at round n P + 1 .The next assumption ensures that Q X has good coverage of [0 , d . It holds for instance if Q X has lower-boundedLebesgue density on [0 , d . We remark that the term policy is often used to denote a mapping from state (or covariate) to action; here we simply equate itwith any decision procedure taking action based on current and past observations.
A P
REPRINT
Figure 1:
Some settings with < γ < ∞ . Left: the density f P ∝ | x | γ goes fast to , while f Q is uniform; f Q /f P then diverges(so density ratios , and f -divergences are ill defined). Right: P X moves mass away from regions of large Q X mass, with relativedensities captured by γ and the size of the region ( r ). Assumption 3 (Mass under Q ) . ∃ C d > s.t., ∀ (cid:96) ∞ balls B ⊂ [0 , d of diameter r ∈ (0 , : Q X ( B ) ≥ C d · r d . • Quantifying the Shift in P to Q . Next we aim to quantify how much the earlier covariate distribution P X differs from the shift Q X . Intuitively, P X hasinformation on Q X if it yields data useful to Q X , in other words, if it has sufficient mass in regions of large Q X mass.The next condition, adapted from recent work [KM18] on classification, parametrizes such intuition. Definition 4 (Transfer Exponent γ ) . ∃ C γ , γ ≥ s.t., ∀ (cid:96) ∞ balls B ⊂ [0 , d of diameter r ∈ (0 , : P X ( B ) ≥ C γ · r γ · Q X ( B ) . Note that the above condition always holds with at least γ = ∞ . The larger the shift, the larger γ , with γ = 0 capturing the mildest such shifts in covariate distribution. Some examples are given in Figure 1. As we will see, thetransfer exponent γ manages to tightly capture a continuum of easy to hard shifts in covariate distributions as evident inachievable regret rates R Qn over the shift period. A common algorithmic approach in nonparametric contextual bandits, starting from earlier work [RZ10, PR13], is tomaintain tree -based (regression) estimates ˆ f t of the expected reward function f , so that at any time t , upon observing X t , only those arms i with f i ( X t ) close to f (1) ( X t ) might be played. This assumes a good estimate of f at any time t ,which in the context of tree-based estimates boils down to choosing an optimal level in the tree – where each level r corresponds to a piecewise-constant regression estimate ˆ f t over bins of side-length r in X . In the usual setting with astationary distribution Q , a level r = r t might be chosen as O ( t − / (2+ d ) ) yielding optimal regression that would resultin the best provable regret rates.In our context however, the unknown amount of drift parametrized by γ has to be accounted for in the choice of alevel r t at any time t . Namely, using intuition from classification, it can be shown that an optimal choice, based on(unavailable) knowledge of γ and the switch time n P , is of the form r t ( γ, n P ) . = O (min { n − / (2+ d + γ ) P , t − / (2+ d ) } ) .A main aim is therefore to design a procedure which, without such knowledge, still makes near optimal adaptive choicesof levels at any time t .Our adaptive strategies, detailed in Section 4, rely directly on the relative proportions of samples observed on a pathfrom the root of a tree T down to a leaf containing X t . Roughly, let n r ( X t ) denote the covariate count in the bincontaining X t at level r (by time t ). We then choose the smallest level r such that n − r ( X t ) ≤ r . For intuition, thischoice roughly balances regression variance (controlled by n − r ) and bias (controlled by r ). Such a choice stems fromprior insights on adaptive tree-based regression with fixed data distribution but unknown d (see e.g. 
[KD12]), which weshow here to yield a regression rate – in terms of unknown γ, n p – similar to that of the oracle choice r t ( γ, n P ) .However, such an adaptive choice of level at each time t immediately introduces a book-keeping problem in the banditssetting: the number of observed rewards for a given arm i – which drive the estimates ˆ f – might significantly differ The switch time might be available in some situations as discussed in the introduction.
A P
REPRINT from the number of covariates n r in a bin, as we eliminate suboptimal arms over time. In particular, while an arm i might still be valid in a bin B at time t , it might have been eliminated in a child bin B (cid:48) ⊂ B at a much earlier time, andtherefore would lack enough observations for a confident reward estimate ˆ f i at time t (relative to other arms). Furthercare is thus required for such book-keeping on observed rewards (or arm pulls ).We will first consider a simplified bandit setting which alleviates book-keeping, namely a multiple-play variant wheremultiple arms might be pulled at once for every X t ; here we still have to eliminate suboptimal arms so the aboveproblem remains, but can be shown to be milder. This will serve as a warmup procedure that helps lay down much of thekey intuition towards adaptation to unknown distribution shift parameters. Much of our analysis overview in the maintext centers on the more intuitive multiple-play setting for brevity. We then show how to extend such a multiple-playprocedure to a single-play variant where only a single arm is pulled in each round. This is done by properly randomizingarms to be pulled to ensure a fair relative distribution of arm pulls. Adaptive Single-Play Bandits.
Our main theorem considers the canonical single-play variant where the policy π pulls one arm (and observes its reward) at every round t . This is given by the randomized procedure of Algorithm 2from Section 4, which takes in a confidence parameter δ ∈ (0 , . Just as in previous work for the stationary case, themargin parameter α needs not be known, while in addition here, we do not need the drift parameters n P , γ either. Theexpectation in the statement below is over the entire sequence X n , Y n ∼ P n P × Q n − n P , plus the randomness in π . Theorem 1.
Let π denote the procedure of Algorithm 2, ran, with parameter δ ∈ (0 , , up till time n > n P ≥ , with n P possibly unknown. Suppose P X has unknown transfer exponent γ w.r.t. Q X , and that the average reward function f satisfies a margin condition with unknown α under Q X . Let n Q . = n − n P denote the (possibly unkown) number ofrounds after the drift, i.e., over the phase X t ∼ Q X . We have for some constant C > : E R Qn ( π ) ≤ Cn Q (cid:34) min (cid:32)(cid:18) K log ( K/δ ) n P (cid:19) α +12+ d + γ , (cid:18) K log ( K/δ ) n Q (cid:19) α +12+ d (cid:33) + K log ( K/δ ) n Q + nδ (cid:35) The following corollary is immediate.
Corollary 1.
Under the setup of Theorem 1, letting δ = O (1 /n ) yields: E R Qn ( π ) ≤ Cn Q (cid:34) min (cid:32)(cid:18) K log( Kn ) n P (cid:19) α +12+ d + γ , (cid:18) K log( Kn ) n Q (cid:19) α +12+ d (cid:33) + K log( Kn ) n Q (cid:35) . The above rates interpolate between two terms: one involving n P past observations and the drift parameter γ , the otherinvolving n Q . This last term matches the minimax regret rate of n − α +12+ d Q of [PR13], and is attained by the adaptive π when there is no drift, i.e., for n P = 0 . For n P > , the interpolated rate can be rewritten as n Q · (cid:16) n d γ P + n Q (cid:17) − ∧ α +12+ d for d γ = (2 + d ) / (2 + d + γ ) ; in other words n d γ P might be viewed as the effective amount of past experience contributed despite the drift; this quantity is largest when γ = 0 , lowering regret, and vanishes as γ → ∞ , i.e.,with larger discrepancy between P X and Q X . Such intuition is confirmed in simulations (Figure 2). As previouslymentioned, the results readily extend to the case of multiple drifts before time n P , with γ above replaced by an average ¯ γ of transfer exponents between past P X ’s and Q X (Appendix D).Finally, we note that the above rates are tight (up to log terms) in the sense that the average regret (cid:16) n d γ P + n Q (cid:17) − α +12+ d matches minimax lower bounds for classification under covariate-shift of [KM18] (see discussion in Appendix E). All algorithms build on a dyadic partitioning tree T defined as follows. Definition 5 (Partition Tree) . Let R . = { − i : i ∈ N ∪ { }} , and let T r , r ∈ R denote a regular partition of [0 , d intohypercubes (which we refer to as bins ) of side length (a.k.a. bin size) r . We then define the dyadic tree T . = { T r } r ∈R ,i.e., a hierarchy of nested partitions of [0 , d . We will refer to the level r of T as the collection of bins in partition T r .The parent of a bin B ∈ T r , r < is the bin B (cid:48) ∈ T r containing B ; child , ancestor and descendant relations follownaturally. The notation T r ( x ) will then refer to the bin at level r containing x . Note that, while in the above definition, T has infinite levels r ∈ R , at any round t in a procedure, we implicitlyonly operate on the subset of T containing data. Our procedures, as in prior work on nonparametric bandits, maintainestimates ˆ f of the average reward function f over levels of T .5elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
Figure 2:
Simulation Results. Q X ∼ U ([0 , ) , P X has density ∝ (cid:107) x (cid:107) γ , K = 3 arms, with rewards Y i = f i ( X )+ N (0 , . , i ∈ [ K ] , where f i ( x ) ∝ (cid:80) k ± (1 − (cid:107) x − z k (cid:107) /r k ) + for 25 randomly placed bumps with centers z k , radius r k . A profile of f is shownon the left, with lower gradient colors corresponding to least margins (white meaning no margin). The right plots average 20 runs ofAlgorithm 2, and verify the guarantees of Theorem 1, namely that the procedure adapts to unknown shift parameters n p and γ . Inparticular, the amount of past experience n d γ P clearly helps, and how much it helps depends on the level of shift P → Q as capturedby γ . Algorithm 1
Adaptive-Multiple-Play Requires : upper bound on Lipschitz constant λ , set of arms [ K ] , tree T with levels r ∈ R Input : c . = max(1 , λ − ) , δ ∈ (0 , , covariates X , X , . . . Initialization : For any bin B at any level in T , set I B ← [ K ] for t = 1 , , . . . do
5: If t ≤ (cid:100) c log( K/δ ) (cid:101) , play all arms in [ K ] t > (cid:100) c log( K/δ ) (cid:101) :7: Choose a level r t ∈ R for X t : r t ← min (cid:110) r ∈ R : λr ≥ (cid:113) log( K/δ ) n r ( X t ) (cid:111) Update candidate arms for the bin B containing X t at level r t :
9: Set I B ← (cid:84) B (cid:48) ∈ T r ,r ≥ r t I B (cid:48) ˆ f i ( B ) for any i ∈ I B over B ˆ f i
11: Refine candidate arms: I B ← I B \ { i : ˆ f i ( B ) < ˆ f (1) ( B ) − λr t } .12: Play all arms in I B . end for Definition 6 (Regression estimates and arm pull counts) . At any round t > (cid:100) c log( K/δ ) (cid:101) , for any bin B at any levelin the tree, we define the following regression estimate for arm i : ˆ f it ( B ) . = 1 m t ( B, i ) (cid:88) X s ∈ B,s ≤ t − ,π s = i Y is , where m t ( B, i ) denotes the number of times arm i was pulled in B before time t . If m t ( B, i ) = 0 , we take ˆ f it ( B ) = 0 .For any B at level r in the tree, ˆ f it ( B ) serves as a regression estimate for any covariate x ∈ B . We often drop B or t inthe above definitions, when understood from context. Adaptive Multiple-Play Bandits.
We now discuss the simplest procedure Algorithm 1, which yields much of the basicintuition for adapting to unknown shift parameters n P , γ . Definition 7 (Covariate counts) . Let
B . = T r ( X t ) . We write: n r ( X t ) . = (cid:80) s ∈ [ t − { X s ∈ B } . At any time t , upon observing X t , a level r t is chosen according to the covariate counts n r ( X t ) along the path { T r ( X t ) } r ∈R . Roughly, r t is picked as the smallest r ∈ R such that / (cid:112) n r ( X t ) ≤ r . The level r t , more preciselythe bin B = T r t ( X t ) containing X t at that level, then determines the estimate ˆ f ( B ) to be used at time t .To understand this choice, recall that a main aim is to quickly identify which arms are suboptimal – and shouldn’t bepulled for X t , and we ought to therefore use a good estimate of f ( X t ) . We will show that r t indeed provides such anestimate at a near optimal regression rate in terms of unknown n P and γ (see Lemma 2). In particular, the covariatecounts n r ( X t ) , at any level r , account for covariates from both P X and Q X , whenever t > n P . Intuitively, we will6elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
Algorithm 2
Adaptive-Single-Play Requires : upper bound on Lipschitz constant λ , set of arms [ K ] , tree T with levels r ∈ R Input Parameters : c . = max(8 , λ − ) , δ ∈ (0 , , covariates X , X , . . . Initialization : For any bin B at any level in T , set I B ← [ K ] B .4: for t = 1 , , . . . do
5: If t ≤ K (cid:100) c log( K/δ ) (cid:101) , play a random arm i ∈ [ K ] selected with probability / | K | .6: Otherwise, for t > K (cid:100) c log( K/δ ) (cid:101) :7: Choose a level r t ∈ R for X t : r t ← min (cid:110) r ∈ R : λr ≥ (cid:113) K log( K/δ ) n r ( X t ) , and n r ( X t ) ≥ K log( K/δ ) (cid:111) Update candidate arms for the bin B containing X t at level r t :
9: Set I B ← (cid:84) B (cid:48) ∈ T r ,r ≥ r t I B (cid:48) ˆ f i ( B ) for any i ∈ I B over B ˆ f i
11: Refine candidate arms: I B ← I B \ { i : ˆ f i ( B ) < ˆ f (1) ( B ) − λr t } .12: Play a random arm i ∈ I B selected with probability / |I B | . end for then expect n r ( X t ) ≈ n P · P X ( T r ( X t )) + ( t − − n P ) · Q X ( T r ( X t )) (cid:38) n P · r d + γ + ( t − − n P ) · r d . The choiceof r t can then be shown to properly balance regression variance and bias in terms of unknown n P and γ .Once this choice is made, only those arms deemed safe are pulled for X t at time t . These so-called candidate arms aremaintained as I B ⊆ [ K ] for each bin B over time, and exclude identified suboptimal arms whose average rewards areclearly below that of the unknown best arm π ∗ ( x ) for any x ∈ B . In particular, suppose at any time t , we can ensurethat | ˆ f it ( B ) − f i ( x ) | (cid:46) r t for all remaining arms i over x ∈ B . Then we can safely discard i if ˆ f (1) t ( B ) − ˆ f it ( B ) (cid:38) r t .It then makes sense to also discard such an arm in all descendants of B .Now, adaptation to the unknown margin parameter α comes for free through such decisions over I B . Namely, if themargin f (1) ( x ) − f (2) ( x ) (cid:29) r t for all x ∈ B , then all suboptimal arms are discarded by time t so we suffer no regretfor X t ∈ B at time t . Otherwise, all arms i left in I B satisfy f (1) ( x ) − f i ( x ) (cid:46) r t , for x ∈ B , i.e., a bound on regret;on the other hand, the margin distribution ensures that the Q X -probability of X t landing in such a bin with low marginsis at most r αt . All that is left is to ensure that r t is of the right order in terms of t and the unknown n P , γ (as we show isthe case of the adaptive choice of r t discussed above). Remark 1 (Book-keeping) . As discussed earlier, our adaptive choice of level r t brings in additional difficulty in thebook-keeping of arm pulls. In fact the above discussion assumes that covariate counts n r ( X t ) (used in choosing r t ,towards adapting to unknown n P , γ ) and arm-pull counts m t ( B, i ) (used in estimating ˆ f t ( B ) ) are of similar order.However, because we can have r t > r t − as X t , X t − fall in different regions of space, the following situation canoccur: in a given bin B . = T r t ( X t ) chosen at time t , some arms might have been eliminated in a descendant of B atan earlier time, and therefore not pulled as much as other arms in I B . However, our choices ensure that the total armpull count (for any arm in I B ) at that earlier time must have been sufficiently large. This is argued in Lemma 1. Thesituation is however more severe in the single-play variant described below. Adaptive Single-Play Bandits.
We modify the above multiple-play variant as follows (as detailed in Algorithm 2).Upon choosing a bin B = T r t ( X t ) at level r t for X t , where we would have pulled all arms in I B , we instead pull asingle candidate arm at random. Lemma 3 then argues, similar to Lemma 1, that in expectation, total arm pull counts m t ( B, i ) (for i ∈ I B ) remain of sufficiently large order w.r.t. n r ( X t ) . Continuing on the discussion of Section 4, first consider the multiple-play setting of Algorithm 1 which yields similarregret rates as those of Theorem 1 (see Proposition 1 of Appendix B). At any round t (cid:38) log K with selected bin B . = T r t ( X t ) , we have by standard arguments (Lemma 2) that, with high probability (over random rewards, conditionedon all past covariates): ∀ x ∈ B, i ∈ I B : | ˆ f it ( B ) − f i ( x ) | (cid:46) (cid:115) log( K/δ ) m t ( B, i ) + λr t . (3)7elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
However, r t is selected based on covariate counts n r ( X t ) which might not directly relate to m t ( B, i ) . Fortunately, thetwo quantities are equal (for B ) till the first time the subtree rooted at B is visited; building on this, one can argue that m t ( B, i ) should be sufficiently large at any time B is selected: Lemma 1.
Suppose at round t , we select bin B = T r t ( X t ) . Then, max i ∈I B (cid:113) log( K/δ ) m t ( B,i ) ≤ λr t . Thus, the regression error in (3) is at most λr t . Therefore, since we only discard an arm i if ˆ f (1) t ( B ) − ˆ f it ( B ) ≥ λr t ,it follows that the best arm for any x ∈ B is never removed. In particular this holds up to the unknown shift time n P + 1 . Now, for t > n P the regression error r t can be shown to be of optimal order: it approximately minimizes theexpression, ( n r ( X t )) − / + r (cid:46) (cid:0) n P · r d + γ + ( t − − n P ) · r d (cid:1) − / + r, i.e., is less than the value r ∗ minimizing the r.h.s. (Lemma 2). Now r ∗ is of optimal regression order min (cid:32)(cid:18) log( K/δ ) n P (cid:19) d + γ , (cid:18) log( K/δ ) t − n P (cid:19) d (cid:33) . As mentioned previously, a non-zero regret can only occur if the margin at X t is below the regression error λr t (otherwise the best arm is pulled). Since the likelihood of picking such an X t is O ( r αt ) (by Definition 3), andfurthermore, any arm picked can only incur regret λr t (by equation (3)), it follows that the expected regret at time t isbounded by O ( r αt ) , which is of optimal order. Summing over t > n P yields a regret bound of the right order.For single-play, even at the round s when the subtree rooted at B is first visited, it is possible that m s ( B, i ) does notequal its covariate count (for some i ∈ I B ). However, by Lemma 3 (the counterpart to Lemma 1 for single-play)we have that m t ( B, i ) (cid:38) n r t ( X t ) /K , through concentration arguments on the distribution of arm pulls. The rest ofthe proof of Theorem 1 proceeds in the same manner as the multiple-play case described above. Details are given inAppendix C.The case of multiple shifts is handled similarly by properly bounding n r ( X t ) (see Appendix D). Broader Impact
Domain adaptation is now understood as an essential step towards the successful deployment of sequential learning systems across (related) real-world domains, e.g., adapting clinical trials (for experimental treatments) between differentpopulations, or sharing knowledge between automated driving runs from related urban environments.In contrast with classification and regression settings where much recent progress has been made, domain adaptationin sequential learning – ranging from bandits to more challenging reinforcement learning applications – is furthercomplicated by the strong interdependencies between learning rounds, the hardness of adequate sampling of statespaces, and the general lack of full information on the potential rewards of actions.The present work contributes basic insights into this broader research program, with a first focus on the relatively moreamenable setting of bandits with covariates.
References [AC16] Peter Auer and Chao-Kai Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic andadversarial bandits. In
Conference on Learning Theory , pages 116–120, 2016.[BCB12] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armedbandit problems. arXiv preprint arXiv:1204.5721 , 2012.[BDU12] Shai Ben-David and Ruth Urner. On the hardness of domain adaptation and the utility of unlabeled targetsamples. In
International Conference on Algorithmic Learning Theory , pages 139–153, 2012.[CBCG04] Nicoló Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learningalgorithms.
Information Theory, IEEE Transactions , 50(9):2050–2057, 2004.[CLMZ20] Yining Chen, Haipeng Luo, Tengyu Ma, and Chicheng Zhang. Active online domain adaptation. arXivpreprint arXiv:2006.14481 , 2020.[CMRR08] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correctiontheory. In
International conference on algorithmic learning theory , pages 38–53. Springer, 2008.8elf-Tuning Bandits over Unknown Covariate-Shifts
A P
REPRINT [GJ18] Melody Y Guan and Heinrich Jiang. Nonparametric stochastic contextual bandits.
AAAI , 2018.[GSH +
09] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and BernhardSchölkopf. Covariate shift by kernel mean matching.
Dataset shift in machine learning , 3(4):5, 2009.[HM07] Elad Hazan and Nimrod Megiddo. Online learning with prior knowledge. In
International Conference onComputational Learning Theory , pages 499–513. Springer, 2007.[HMB15] Negar Hariri, Bamshad Mobasher, and Robin Burke. Adapting to user preference changes in interactiverecommendation. In
Twenty-Fourth International Joint Conference on Artificial Intelligence , 2015.[KA16] Zohar S Karnin and Oren Anava. Multi-armed bandits: Competing with optimal sequences. In
Advancesin Neural Information Processing Systems , pages 199–207, 2016.[KD12] Samory Kpotufe and Sanjoy Dasgupta. A tree-based regressor that adapts to intrinsic dimension.
Journalof Computer and System Sciences , 78(5):1496–1515, 2012.[KM18] Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift.
COLT , 2018.[LLS18] Fang Liu, Joohyun Lee, and Ness Shroff. A change-detection based framework for piecewise-stationarymulti-armed bandit problem. In
Thirty-Second AAAI Conference on Artificial Intelligence , 2018.[LWAL17] Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, and John Langford. Efficient contextual bandits in non-stationary worlds. arXiv preprint arXiv:1708.01799 , 2017.[LZ08] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information.In
Advances in neural information processing systems , pages 817–824, 2008.[PR13] Vianney Perchet and Philippe Rigollet. The multi-armed bandit problem with covariates.
The Annals ofStatistics , 41(2):693–721, 2013.[RMB18] Henry W. J. Reeve, Joe Mellor, and Gavin Brown. The k -nearest neighbour ucb algorithm for multi-armedbandits with covariates. JMLR , 2018.[RS16] Alexander Rakhlin and Karthik Sridharan. Bistro: An efficient relaxation-based method for contextualbandits. In
ICML , pages 1977–1985, 2016.[RZ10] Phillipe Rigollet and Assaf Zeevi. Nonparametric bandits with covariates.
COLT , 2010.[Sli14] Aleksandrs Slivkins. Contextual bandits with similarity information.
The Journal of Machine LearningResearch , 15(1):2533–2568, 2014.[SNK +
08] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Directimportance estimation with model selection and its application to covariate shift adaptation. In
Advancesin neural information processing systems , pages 1433–1440, 2008.[T +
04] Alexander B Tsybakov et al. Optimal aggregation of classifiers in statistical learning.
The Annals ofStatistics , 32(1):135–166, 2004.[WIW18] Qingyun Wu, Naveen Iyer, and Hongning Wang. Learning contextual bandits in a non-stationary envi-ronment. In
The 41st International ACM SIGIR Conference on Research & Development in InformationRetrieval , pages 495–504, 2018.[YZ +
02] Yuhong Yang, Dan Zhu, et al. Randomized allocation with nonparametric estimation for a multi-armedbandit problem with covariates.
The Annals of Statistics , 30(1):100–121, 2002.9elf-Tuning Bandits over Unknown Covariate-Shifts
A P
REPRINT
Appendix A Additional Experiments
Figure 3:
Simulation results for bumps centers chosen uniformly, and for smaller n Q = 500 rounds. For our experiments, we fix a covariate space X = [0 , . Our target covariate distribution is Q X ∼ U ([0 , ) , theuniform on [0 , , and our source covariate distribution P X has density ∝ (cid:107) x (cid:107) γ so that P X , Q X satisfy Definition 4with transfer exponent γ .The reward function common to both P and Q is constructed as the sum of bump functions, each with a circu-lar support disjoint from the other bumps. The bump centers { z k } k =1 are randomly sampled from the Gaussian N ((0 . , . , . · Id ) in Figure 2 and from U ([0 , ) in Figure 3. The radii { r k } k =1 are then chosen in a randomorder to maximize the bump areas. Then, for each of the K = 3 arms, we determined the sign of each of the bumpsrandomly and independently.However, up to this point, all three arms are optimal in the region of the covariate space outside of the bumps since thereward functions are equal there. So, to introduce additional heterogeneity in the top arm identity, each reward function f i was further raised or lowered by a randomly selected height in the range [ − . , . , in the area outside of the bumps.This determines a top arm (Arm 1 in Figure 2 and Arm 2 in Figure 3) in the region outside of the bumps.Thus, the reward functions f i can be written as f i ( x ) ∝ (cid:88) k ± (1 − (cid:107) x − z k (cid:107) /r k ) + Furthermore, Gaussian noise was added to each f i to produce the rewards Y i according to Y i = f i ( X ) + N (0 , . for each i ∈ [ K ] . We generated data using a range of different values for the parameters n Q , n P , γ using the same f i ’sand ran Algorithm 2 on each dataset. The first plot in each of Figure 2 and 3 exhibits the guarantee of increasing pastexperience n P improving the regret, for fixed n Q , γ . The second plot in each figure shows the effect of increasing γ worsening the regret, for fixed n P , n Q , as proclaimed in Theorem 1. Each plot shows the mean and standard deviationof the regret over n Q rounds across simulations. Appendix B Upper Bound for Multiple-Play
B.1 Setup and Result
Our first (warmup) result considers a multiple-play variant where the policy π can pull multiple arms (and observetheir rewards) at every round t . In other words, we then let π t ⊂ [ K ] denote the subset of arms pulled at time t , andaccordingly consider the following multiple-play regret between time n P < n : R Qn ( π ) . = R n P ,n ( π ) . = n (cid:88) t = n P +1 (cid:88) i ∈ π t f (1) ( X t ) − f i ( X t ) . Our adaptive multiple-play variant is presented in Algorithm 1 of Section 4, which takes in a confidence parameter δ ∈ (0 , . We have the following result, where the expectation is over the entire sequence X n , Y n . Proposition 1.
Let π denote the procedure of Algorithm 1, ran, with parameter δ ∈ (0 , , up till time n > n P ≥ ,with n P possibly unknown. Suppose P X has unknown transfer exponent γ w.r.t. Q X , and that the average reward A P
REPRINT function f satisfies a margin condition with unknown α under Q X . Let n Q . = n − n P denote the (possibly unknown)number of rounds after the drift, i.e., over the phase X t ∼ Q X . We have for some constant C > : E R Qn ( π ) ≤ CKn Q (cid:34) min (cid:32)(cid:18) log( K/δ ) n P (cid:19) α +12+ d + γ , (cid:18) log( K/δ ) n Q (cid:19) α +12+ d (cid:33) + log( K/δ ) n Q + nδ (cid:35) . This gives an immediate corollary analogous to Corollary 1.
Corollary 2.
Under the conditions of Theorem 1, letting δ = O (1 /n ) gives us that E R Qn ( π ) ≤ CKn Q (cid:34) min (cid:32)(cid:18) log( Kn ) n P (cid:19) α +12+ d + γ , (cid:18) log( Kn ) n Q (cid:19) α +12+ d (cid:33) + log( Kn ) n Q (cid:35) . B.2 Proof of Proposition 1
Throughout the proof, c , c , . . . will denote positive constants not depending on t or n P , n Q . Here, c . = max(1 , λ − ) is taken from Line 2 of Algorithm 1.First, we justify that the criterion on Line 7 of Algorithm 1 is well defined for rounds t > c log( K/δ ) . We have c ≥ λ − implies for all t > c log( K/δ ) (cid:114) log( K/δ ) t = (cid:115) log( K/δ ) n ( X t ) ≤ λ Thus, the level r = 1 satisfies the constraint on Line 7 of Algorithm 1. Bias-Variance Bound and Covariate Count Analysis.
This first proposition establishes a standard bias-variancebound on the error of a regression function estimate | ˆ f it ( B ) − f i ( x ) | for a bin B , a round t , an arm i ∈ I B , and acovariate x ∈ B . Proposition 2.
Consider any round t > c log( K/δ ) with observed covariate X t , and fix any bin B containing X t .Consider the estimate ˆ f it ( B ) as in Definition 6, and let m t ( B, i ) be defined therein (i.e., the number of times arm i is pulled in B by time t . We then have at round t , that with probability at least − δ with respect to the conditionaldistribution Y t − | X t : ∀ x ∈ B : | ˆ f it ( B ) − f i ( x ) | ≤ (cid:32)(cid:115) log( K/δ ) m t ( B, i ) + λr (cid:33) . Proof.
Fix bin B and let r ∈ R be its side length. If m t ( B, i ) = 0 , then the desired bound is vacuous. Otherwise,recall from Definition 6, ˆ f it ( B ) = 1 m t ( B, i ) (cid:88) X u ∈ B,u ≤ t − ,i ∈ π u Y iu . Here, we note the estimate is formed over all rounds when an arm i was played in B , even if other arms were playedduring those rounds since we are in the multiple-play setup.For the sake of showing a bias-variance bound, define ˜ f it ( B ) . = E Y t − | X t − ( ˆ f it ( B )) = 1 m t ( B, i ) (cid:88) X u ∈ B,u ≤ t − ,i ∈ π u E ( Y iu | X u ) . Triangle inequality then yields | ˆ f it ( B ) − f i ( x ) | ≤ | ˆ f it ( B ) − ˜ f it ( B ) | + | ˜ f it ( B ) − f i ( x ) | . The second term on the RHS is at most λr by the Lipschitz assumption (Assumption 2). Now, fix the values of { m t ( B, i ) } i ∈ [ K ] . By Hoeffding inequality and union bound, the first term on the RHS above satisfies with probabilityat least − δ w.r.t. the distribution of Y t − | X t , { m t ( B, i ) } i ∈ [ K ] , ∀ i ∈ [ K ] : | ˆ f it ( B ) − ˜ f it ( B ) | ≤ (cid:115) log(2 K/δ ) m t ( B, i ) . In fact, the above holds with probability at least − δ w.r.t. the distribution of Y t − | X t by the tower property.11elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
Next, we prove Lemma 1 to show that the regression error bound of Proposition 2 is further bounded by λr t . Proof. (of Lemma 1) First, suppose t is the first round when the subtree rooted at B is visited (by which we mean a binin said subtree is chosen). Then, until round t , each candidate arm was played every time a covariate was observed in B .Thus, m t ( B, i ) = n r t ( X t ) > for all i ∈ I B . This proves the lemma in this case using Proposition 2 since λr t ≥ (cid:115) log( K/δ ) n r t ( X t ) . For an arbitrary round t , let s be the first round that the subtree rooted at B was visited. Then, for any i ∈ I B , m t ( B, i ) ≥ m s ( B, i ) = n r t ( X s ) so that since r s ≤ r t , (cid:115) log( K/δ ) m t ( B, i ) ≤ (cid:115) log( K/δ ) n r t ( X s ) ≤ λr s ≤ λr t . Justifying Arm Eliminations.
For each round t > c log( K/δ ) , define the event G t on which the bound in Proposi-tion 2 holds or G t = (cid:110) ∀ i ∈ I B , x ∈ B : | ˆ f it ( B ) − f i ( x ) | ≤ λr t , B = T r t ( X t ) (cid:111) . This is the “good” event on which our regression function estimates ˆ f it ( B ) are accurate enough to be able to discernwhich arms have low and high rewards. From here on, let B be the selected bin at round t . This first lemma asserts thatan eliminated arm cannot have a better reward than the best candidate arm. To simplify notation, we let ∆ t . = 4 λr t bethe bound mentioned in G t (so that t is the confidence used on Line 11 of Algorithm 1 to discard arms). Proposition 3.
Suppose at round t , under event G t , we select bin B . Then for any two arms i, j ∈ I B and any x ∈ B , ˆ f it ( B ) − ˆ f jt ( B ) > t = ⇒ f i ( x ) > f j ( x ) Proof.
Using the definition of G t , we have f i ( x ) − f j ( x ) ≥ ˆ f it ( B ) − ˆ f jt ( B ) − t > . We obtain two corollaries from Proposition 3. The first is that, under event G t , the best arm at any covariate in a bin B is always retained in I B . The second is that the regret of playing any candidate arm at any point in B is dominated by ∆ t . Corollary 3.
Suppose at round t , under event ( ∩ ts =1 G s ) , we select bin B . Then I B contains the best arm i ∗ ( x ) =argmax j ∈ [ K ] f j ( x ) for all x ∈ B .Proof. If i ∗ ( x ) (cid:54)∈ I B , then i ∗ ( x ) was eliminated at some round s < t when an ancestor bin B (cid:48) ⊃ B was selected.Thus, ˆ f (1) s ( B (cid:48) ) − ˆ f i ∗ ( x ) s ( B (cid:48) ) > s . By Proposition 3, under G s , this implies f (cid:96) ( x ) > f i ∗ ( x ) ( x ) for some arm (cid:96) (cid:54) = i ∗ ( x ) , a contradiction. Corollary 4.
Suppose at round t , under event ( ∩ ts =1 G s ) , we select bin B . Then, both of the following hold for all x ∈ B :1. | f (1) ( x ) − f j ( x ) | ≤ t for all j ∈ I B .2. Either < | f (1) ( x ) − f (2) ( x ) | ≤ t or f j ( x ) = f (1) ( x ) for all j ∈ I B Proof.
Fix x ∈ B , and let ˆ i = argmax j ∈I B ˆ f j ( B ) and i = argmax j ∈ [ K ] f j ( x ) . Using the definition of G t and the fact that i ∈ I B (Corollary 3), we have f (1) ( x ) − f ˆ i ( x ) = f i ( x ) − f ˆ i ( x ) ≤ f i ( x ) − ˆ f ˆ it ( B ) + ∆ t ≤ f i ( x ) − ˆ f it ( B ) + ∆ t ≤ t . A P
REPRINT
To show (1), we have for j ∈ I B , using the definition of G t and the above inequality, f (1) ( x ) − f j ( x ) ≤ f (1) ( x ) − ˆ f jt ( B ) + ∆ t ≤ f (1) ( x ) − ˆ f ˆ it ( B ) + 2∆ t + ∆ t (because j was not eliminated) ≤ f (1) ( x ) − f ˆ i ( x ) + ∆ t + 2∆ t + ∆ t ≤ t . To show (2), we have if I B contains a sub-optimal arm j ∈ I B at x , then by (1), | f (1) ( x ) − f (2) ( x ) | = f (1) ( x ) − f (2) ( x ) ≤ f (1) ( x ) − f j ( x ) ≤ t . Furthermore, f (1) ( x ) (cid:54) = f (2) ( x ) if there is a sub-optimal arm at x . Showing r t is of Optimal Regression Order. We consider the rounds n P + t past the switch time n P and we willbound the regret accrued under event ∩ n p + ts =1 G s by first bounding r n P + t . Lemma 2.
Fix a round n P + t with observed covariate X n P + t . Then, for some c > , with probability at least − δ w.r.t. the distribution of X n P + t − | X n P + t , we have r n P + t ≤ c min (cid:32)(cid:18) log( K/δ ) n P (cid:19) d + γ , (cid:18) log( K/δ ) t (cid:19) d (cid:33) . Proof.
It suffices to show r n P + t ≤ c (cid:16) log( K/δ ) n P (cid:17) d + γ when n P > log( K/δ ) (cid:16) log( K/δ ) t (cid:17) d when t > log( K/δ ) , since this implies the desired result for any c ≥ . We first show r n P + t ≤ c (cid:18) log( K/δ ) t (cid:19) d . The other inequality will have a similar proof. First, we simplify notation for the sake of this proof. At round n P + t , itwill be understood that the observed covariate is represented by X . = X n P + t . We let n ( r ) . = n r ( X ) be the covariatecount at the level r .First, we claim that a level r ∈ R which satisfies n ( r ) ≥ log(1 /δ ) also satisfies, with probability at least − δ , n ( r ) ≥ E ( n ( r ))8 . If E ( n ( r )) < /δ ) , then this is already true. Otherwise, a Chernoff bound gives: P (cid:18) n ( r ) ≤ E ( n ( r )) (cid:19) ≤ exp (cid:18) − E ( n ( r )) (cid:19) ≤ δ. Here, we note the probability measure P is w.r.t. { X i } n P i =1 ∼ P n P X and { X i } n P + t − i = n P +1 ∼ Q t − X .Next, by Assumptions 3 and 4, we have for some c > : E ( n ( r )) = n P P X ( T r ( X )) + ( t − Q X ( T r ( X )) ≥ c ( n P r d + γ + ( t − r d ) . (4)In fact, without loss of generality, we can assume c ≤ − dd C dd d λ . (5)Thus, for a given level r satisfying n ( r ) ≥ log(1 /δ ) , we have with probability at least − δ that (cid:115) log( K/δ ) n ( r ) ≤ (cid:115) K/δ ) c ( t − r d . (6)13elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
Now, let r ∗ ∈ R be the smallest level greater than or equal to λ − d ( c / − d (cid:18) log( K/δ ) t − (cid:19) d . Then, it suffices to show r n P + t ≤ r ∗ . For the next part of the proof, we define n Q ( r ) as the covariate count exclusivelyfrom Q X : n Q ( r ) . = (cid:88) s ∈ [ t − { X n P + s ∈ T r ( X ) } . Next, we claim r ∗ satisfies n Q ( r ) ≥ log(1 /δ ) with probability at least − δ , so that (6) holds for r ∗ . Since t > log( K/δ ) by hypothesis, we have by (5) and Assumption 3 that E ( n Q ( r )) ≥ C d ( t − r ∗ ) d ≥ C d ( t − λ − d d ( c / − d d (cid:18) log( K/δ ) t − (cid:19) d d ≥ /δ ) . Thus, by a Chernoff bound, we have Q X ( n Q ( r ∗ ) < log(1 /δ )) ≤ Q X (cid:18) n Q ( r ∗ ) < E ( n Q ( r )) (cid:19) ≤ exp (cid:18) − E ( n Q ( r )) (cid:19) ≤ δ. Thus, with probability at least − δ , we have n Q ( r ∗ ) ≥ log(1 /δ ) . Then, we have that λr ∗ ≥ (cid:115) K/δ ) c ( t − r ∗ ) d ≥ (cid:115) log( K/δ ) n ( r ∗ ) . This gives us that r n P + t ≤ r ∗ by the minimization on Line 7 of Algorithm 1, as desired.The other inequality r n P + t ≤ c (cid:18) log( K/δ ) n P (cid:19) d + γ , can be shown in a similar fashion to the case above with the appropriate modifications. Specifically, t is replaced with n P , (4) is replaced with the inequality E ( n ( r )) ≥ c n P r d + γ , and n Q ( r ) is replaced with n P ( r ) which is defined as the bin covariate counts from distribution P : n P ( r ) . = (cid:88) s ∈ [ n P ] { X s ∈ T r ( X ) } . Cumulative Regret Bound.
Next, we put the previous conclusions together to bound the cumulative regret bybounding the regret accrued at each round t and then summing over t . For t ≤ n P , define the event E t as the event onwhich the bound of Proposition 2 holds or E t = G t . For rounds t > n P , define the event E t as the event on which thebounds of Proposition 2 and Lemma 2 hold or: E t = G t ∩ (cid:40) r t ≤ c min (cid:32)(cid:18) log( K/δ ) n P (cid:19) d + γ , (cid:18) log( K/δ ) t − n P (cid:19) d (cid:33)(cid:41) . Recall from earlier that ∆ t . = 4 λr t . To sum the regrets across time t , the argument will involve conditioning on theevent E t , on which (1) Algorithm 1 correctly eliminates arms and (2) r t is of the optimal order.Let t > n P and let F t . = ∩ ts =1 E s . Also, to simplify notation, let U t denote U t . = c min (cid:32)(cid:18) log( K/δ ) n P (cid:19) d + γ , (cid:18) log( K/δ ) t − n P (cid:19) d (cid:33) . If n Q ≤ log( K/δ ) , we are already done since the regret is then bounded by K log( K/δ ) , which is the right order.Assume for the rest of the proof that log( K/δ ) < n Q . 14elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
Next, let t > n P be the largest positive integer satisfying c (cid:18) log( K/δ ) t − n P (cid:19) d > δ , where δ is the parameter appearing in the margin assumption (Assumption 3).The regret for the first max( t − n P , log( K/δ )) rounds among rounds { n P + 1 , . . . , n } can be bounded by O ( K log( K/δ )) which is always of the right order. For the rest of the proof, we constrain our attention to theremaining rounds t where we can now assume t − n P > log( K/δ ) and U t ≤ δ .Next, let the event A t be A t . = {| ˆ f (1) t ( B ) − ˆ f (2) t ( B ) | ≤ t , B = T r t ( X t ) } . Conditioned on X t , A t is the event where one arm remains in contention at round t according to Line 11 of Algorithm 1.For the remainder of the proof, let B be the bin that was selected at round t given an understood value of X t .Consider the expected regret of pulling arm j ∈ π t at round t : E X t , Y t f (1) ( X t ) − f j ( X t ) = E X t (cid:16) E X t − , Y t − | X t ( f (1) ( X t ) − f j ( X t ))( F t + F ct )( A t + A ct ) (cid:17) . Next, we consider three different cases depending on whether event F t or F ct holds and whether event A t or A ct holds:1) Suppose event F t ∩ A t holds. Suppose also that there is a suboptimal arm i ∈ I B for which f i ( X t ) < f (1) ( X t ) .Then, by Corollary 4, we have: < | f (1) ( X t ) − f (2) ( X t ) | ≤ t ≤ U t . Furthermore, for any j ∈ I B : | f (1) ( X t ) − f j ( X t ) | ≤ t ≤ U t . This last inequality happens with probability at most C α (6 U t ) α , under X t ∼ Q X , by the margin condition(Definition 3). Thus, we have E X t E X t − , Y t − | X t ( f (1) ( X t ) − f j ( X t )) F t ∩ A t ≤ C α α +1 U α +1 t .
2) Next, on F t ∩ A ct , the pointwise regret is zero by Corollary 3 since I B must contain the optimal arm at X t andno other arms.3) On F ct , the pointwise regret is bounded above by . By Proposition 2 and Lemma 2, this happens withprobability at most P ( F ct ) ≤ P (cid:0) ∪ ts =1 E cs (cid:1) ≤ t (cid:88) s =1 δ. Thus, E X t E X t − , Y t − | X t ( f (1) ( X t ) − f j ( X t )) F ct ≤ tδ .Next, we put the three cases above together. To further simplify notation, we reparametrize the round variable t andinstead consider rounds n P + t where t ∈ [ n Q ] . We have that, for some c > , the cumulative regret over the n Q rounds is then at most E R Qn ( π ) ≤ c K log ( K/δ ) + n Q (cid:88) t =log( K/δ ) min (cid:32)(cid:18) log ( K/δ ) n P (cid:19) α +12+ d + γ , (cid:18) log ( K/δ ) t (cid:19) α +12+ d (cid:33) (7) +( n P + t ) δ ] . (8)First, (cid:80) n Q t =1 ( n P + t ) δ = O ( n Q nδ ) . For the remaining sum, it suffices to bound n Q (cid:88) t =log( K/δ ) (cid:18) log( K/δ ) t (cid:19) α +12+ d . By an integral approximation, we have n Q (cid:88) t =log( K/δ ) (cid:18) log( K/δ ) t (cid:19) α +12+ d ≤ c (cid:90) n Q log( K/δ ) (cid:18) log( K/δ ) z (cid:19) α +12+ d dz If α ≤ d + 1 , this integral, for some c > , is bounded by c n Q (cid:18) log( K/δ ) n Q (cid:19) α +12+ d . Otherwise, it is bounded by O (log( K/δ )) . This concludes the proof.15elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
Appendix C Proof of Theorem 1
First, it is straightforward to verify that the criterion for choosing r t on Line 8 of Algorithm 2 is well-defined for t > K (cid:100) c log( K/δ ) (cid:101) . Relating Arm-Pull Counts to Covariate Counts.
To obtain an analogue of Lemma 1, we first relate the arm pullcounts m ( B, i ) = m t ( B, i ) to the covariate counts n r ( X t ) at round t . For the following lemma, we drop the subscript t from m t ( B, i ) to simplify notation. We also use Z t to denote the randomness of Algorithm 2 at round t in choosingthe particular arm π t to play. Thus, { Z t } t ∈ N is independent of X n , Y n . Lemma 3.
Fix a round t > K (cid:100) c log( K/δ ) (cid:101) with observed covariate X t and selected bin B . Suppose that t is the firstround that the subtree rooted at B is visited. Then, with probability at least − δ with respect to the distribution of Y t − , { Z s } s Proof. Fix the values of X t , Y t − , I B and fix some i ∈ I B . First, recall: m ( B, i ) = (cid:88) X s ∈ B,s ≤ t − { π s = i } . Next, we recall n r t ( X t ) ≥ K log( K/δ ) by Line 8 of Algorithm 2. Since the subtree rooted at B has not been visiteduntil round t , every round s < t for which a covariate landed in B , we pulled arm i independently with probability atleast /K . Thus, we have E ( m ( B, i ) | X t , Y t − , I B ) ≥ n r t ( X t ) · K ≥ K/δ ) . Then, by a Chernoff bound, we have w.r.t. the distribution of the Z s ’s: P (cid:18) m ( B, i ) ≤ n r t ( X t )2 K (cid:19) ≤ P (cid:18) m ( B, i ) ≤ E ( m ( B, i ) | X t , Y t − , I B )2 (cid:19) ≤ δ/K. By a union bound and the tower property, we have that the event {∀ i ∈ I B : m ( B, i ) ≥ n rt ( X t )2 K } holds with probabilityat least − δ w.r.t. the distribution of Y t − , { Z s } s REPRINT Adaptivity of r t in Single-Play. Next, Proposition 3 and Corollaries 3 and 4 still hold in the single-play settingprovided ∆ t . = 8 λr t and the event G t is now defined as: G t = (cid:110) ∀ i ∈ I B , x ∈ B : | ˆ f it ( B ) − f i ( x ) | ≤ λr t , B = T r t ( X t ) (cid:111) . Thus, it suffices to bound r t for rounds t > n P .Next, we proceed in a nearly identical manner as Lemma 2, with the only technical difference being that our criterion forchoosing the level r t ∈ R on Line 8 of Algorithm 2 involves the extra constraint n r t ( X t ) ≥ K log( K/δ ) comparedto Line 7 of Algorithm 1. It can be shown using an identical concentration argument as the proof of Lemma 2 that thisconstraint is satisfied by the “optimal level”: r ∗ ∝ min (cid:32)(cid:18) K log( K/δ ) n P (cid:19) d + γ , (cid:18) K log( K/δ ) t − n P (cid:19) d (cid:33) . Thus, following the same arguments as the proof of Lemma 2, we have r ∗ (cid:38) r t .Then, the remainder of the proof in summing the regret over rounds t > n P proceeds identically as in the multiple-playcase. Appendix D Multiple Shifts In this section, we give an extension of Theorem 1 to multiple distribution shifts. Let P . = { P j } Nj =1 be a sequence of N source distributions on the covariate-reward pair ( X, Y ) . Each P j satisfies covariate shift with respect to the targetdistribution Q . We then consider bandits with a sequence of shifts P → P → · · · → P N → Q. In this setup, data from each P j is observed for n j consecutive rounds. Then, there are n P . = (cid:80) Nj =1 n j totalrounds played under distributions from the source class P . We then consider the regret R Qn ( π ) of a policy π playing n = n P + n Q total rounds with the last n Q rounds having data observed from Q . Theorem 2. Let π denote the procedure of Algorithm 2, ran, with parameter δ ∈ (0 , , up till time n > n P ≥ , with n P , N, { n j } Nj =1 all possibly unknown. Suppose the marginal of the covariate X under each P j has unknown transferexponent γ j w.r.t. Q X , and that the average reward function f satisfies a margin condition with unknown α under Q X .Let n Q . = n − n P denote the (possibly unkown) number of rounds after the drift, i.e., over the phase X t ∼ Q X . Let γ = (cid:80) Nj =1 γ j · n j n P . 
Adaptivity of $r_t$ in Single-Play. Next, Proposition 3 and Corollaries 3 and 4 still hold in the single-play setting provided $\Delta_t \doteq 8\lambda r_t$ and the event $G_t$ is now defined as:
$$G_t = \Big\{\forall i \in I_B,\, x \in B : |\hat f^i_t(B) - f^i(x)| \le \lambda r_t,\ B = T_{r_t}(X_t)\Big\}.$$
Thus, it suffices to bound $r_t$ for rounds $t > n_P$. We proceed in a nearly identical manner as Lemma 2, the only technical difference being that our criterion for choosing the level $r_t \in R$ on Line 8 of Algorithm 2 involves the extra constraint $n_{r_t}(X_t) \ge K \log(K/\delta)$ compared to Line 7 of Algorithm 1. It can be shown, using an identical concentration argument as in the proof of Lemma 2, that this constraint is satisfied by the "optimal level":
$$r^* \propto \min\left(\left(\frac{K\log(K/\delta)}{n_P}\right)^{\frac{1}{2+d+\gamma}}, \left(\frac{K\log(K/\delta)}{t-n_P}\right)^{\frac{1}{2+d}}\right).$$
Thus, following the same arguments as the proof of Lemma 2, we have $r^* \gtrsim r_t$. The remainder of the proof, in summing the regret over rounds $t > n_P$, proceeds identically to the multiple-play case.

Appendix D Multiple Shifts

In this section, we give an extension of Theorem 1 to multiple distribution shifts. Let $\mathcal{P} \doteq \{P_j\}_{j=1}^N$ be a sequence of $N$ source distributions on the covariate-reward pair $(X,Y)$, each satisfying covariate shift with respect to the target distribution $Q$. We then consider bandits with a sequence of shifts
$$P_1 \to P_2 \to \cdots \to P_N \to Q.$$
In this setup, data from each $P_j$ is observed for $n_j$ consecutive rounds, so that there are $n_P \doteq \sum_{j=1}^N n_j$ total rounds played under distributions from the source class $\mathcal{P}$. We then consider the regret $R_n^Q(\pi)$ of a policy $\pi$ playing $n = n_P + n_Q$ total rounds, with the last $n_Q$ rounds having data observed from $Q$.

Theorem 2. Let $\pi$ denote the procedure of Algorithm 2, run with parameter $\delta \in (0,1)$, up till time $n > n_P \ge 1$, with $n_P, N, \{n_j\}_{j=1}^N$ all possibly unknown. Suppose the marginal of the covariate $X$ under each $P_j$ has unknown transfer exponent $\gamma_j$ w.r.t. $Q_X$, and that the average reward function $f$ satisfies a margin condition with unknown $\alpha$ under $Q_X$. Let $n_Q \doteq n - n_P$ denote the (possibly unknown) number of rounds after the drift, i.e., over the phase $X_t \sim Q_X$. Let $\bar\gamma = \sum_{j=1}^N \gamma_j \cdot \frac{n_j}{n_P}$. We have, for some constant $C > 0$:
$$\mathbb{E}\, R_n^Q(\pi) \le C n_Q\left[\min\left(\left(\frac{K\log(K/\delta)}{n_P}\right)^{\frac{\alpha+1}{2+d+\bar\gamma}}, \left(\frac{K\log(K/\delta)}{n_Q}\right)^{\frac{\alpha+1}{2+d}}\right) + \frac{K\log(K/\delta)}{n_Q} + n\delta\right].$$

Proof Outline. Consider a round $t > n_P$ and any level $r \in R$ such that $n_r(X_t) > \log(1/\delta)$. Similarly to the proof of Lemma 2, by a Chernoff bound and Assumption 4, the covariate count $n_r(X_t)$ satisfies, with probability at least $1-\delta$ for some $c > 0$:
$$n_r(X_t) \ge c\left[(t-1-n_P)\, r^d + \sum_{j=1}^N n_j \cdot r^{d+\gamma_j}\right].$$
Next, since the function $x \mapsto r^x$ is convex for any $r > 0$, by Jensen's inequality we have
$$\sum_{j=1}^N \frac{n_j}{n_P} \cdot r^{d+\gamma_j} \ge r^{d + \frac{1}{n_P}\sum_{j=1}^N n_j\gamma_j} = r^{d+\bar\gamma}.$$
Thus, $n_r(X_t) \ge c\, n_P\, r^{d+\bar\gamma}$. Using this bound, we can extend Lemma 2 to the case of single-play bandits in the same manner as in the proof of Theorem 1 (see Appendix C), except that $\gamma$ is now replaced by $\bar\gamma$. In particular, for a fixed round $n_P + t$ with observed covariate $X_{n_P+t}$, for some $c > 0$, with probability at least $1-\delta$, we have:
$$r_{n_P+t} \le c\, \min\left(\left(\frac{\log(K/\delta)}{n_P}\right)^{\frac{1}{2+d+\bar\gamma}}, \left(\frac{\log(K/\delta)}{t}\right)^{\frac{1}{2+d}}\right).$$
All other parts of the proof of Theorem 1 remain the same.

Appendix E Lower Bound

Here, we establish that the bound of Theorem 1 is minimax optimal, up to log terms, in the case where $K = 2$, over a continuum of regimes of choices of $n_P, n_Q, \gamma, \alpha$.

Our strategy is to use online-to-batch conversion to convert an online algorithm with regret $R_{n_P,n}$ during the last $n_Q$ rounds into a classifier with excess risk of order $R_{n_P,n}/n_Q$. This then implies a conversion from classification lower bounds to bandit lower bounds.

We note that online-to-batch conversion results, which we invoke as a black box, are usually given for i.i.d. sequences of covariate-reward pairs, while we instead consider a setting with a shift in distribution $P \to Q$. Therefore, in much of what follows, we treat the first phase $\{(X_t, Y_t)\}_{t=1}^{n_P} \sim P^{n_P}$ as a separate input randomness $Z$, and apply conversion arguments to the second phase $\{(X_t, Y_t)\}_{t=n_P+1}^{n} \sim Q^{n_Q}$.

First, we claim a bandit policy $\pi$ can be converted to an online classification algorithm where $\pi_t \in [K]$ indicates the predicted label for covariate $X_t$. This requires defining a reward $Y^i$ for each label $i \in [K]$, which is done in Definition 8 below. To simplify notation, we will denote the set of $K = 2$ arms as $\{0, 1\}$.

Definition 8 (Conversion from Labels to Rewards). In the case of binary classification with covariate $X \in \mathcal{X}$ and label $\tilde Y \in \{0,1\}$, we define the reward of arm $i \in \{0,1\}$ as $Y^i \doteq \mathbb{1}\{\tilde Y = i\}$. We use $\mathcal{T}$ to denote a class of tuples $(P,Q)$ of distributions on the covariate-label pair $(X, \tilde Y)$. Each distribution on $(X, \tilde Y)$ then induces a distribution on the covariate-reward pair $(X, Y)$. Let $\mathcal{T}'$ be the class of tuples of distributions on $(X,Y)$ induced by $\mathcal{T}$.

To simplify notation, in what follows, tuples $(P,Q)$ will refer either to tuples in $\mathcal{T}$ or to their one-to-one mapping to tuples in $\mathcal{T}'$, as will be clear from context. We will also let $\{(X_t, \tilde Y_t)\}_{t=1}^m$ be a sequence of covariate-label pairs and let $\{(X_t, Y_t)\}_{t=1}^m$ be the sequence of corresponding covariate-reward pairs. In this constructed bandit problem, the regression function of arm $i$ is $f^i(x) = P(\tilde Y = i \mid X = x)$.
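To make Definition 8 concrete, here is a toy Python rendering (the sampling model $f^1(x) = x$ is a placeholder, not one of the constructions underlying the lower bound): labels are converted to arm rewards so that misclassifying and pulling the suboptimal arm incur the same loss, with the Bayes classifier $h^*$ defined next.

```python
import random

def label_to_rewards(y_label):
    """Definition 8: the reward of arm i is the indicator {label == i}."""
    return (int(y_label == 0),   # Y^0
            int(y_label == 1))   # Y^1

# Toy stream: X uniform on [0,1] and P(label = 1 | X = x) = x (placeholder f^1).
def sample_pair():
    x = random.random()
    y = int(random.random() < x)
    return x, y, label_to_rewards(y)

def bayes(x):
    return int(x >= 0.5)         # h*(x) = 1{f^1(x) >= 1/2} = pi*(x)

x, y, (y0, y1) = sample_pair()
arm = bayes(x)
# Classification loss 1{h(x) != y} equals bandit loss 1 - Y^arm:
print(f"x={x:.2f}, label={y}, rewards=(Y^0={y0}, Y^1={y1}), Bayes arm={arm}")
print("losses agree:", int(arm != y) == 1 - (y0, y1)[arm])
```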
Next, let $h^*(x) \doteq \mathbb{1}\{f^1(x) \ge 1/2\} = \pi^*(x)$ be the Bayes classifier; the excess risk of a classifier $h$ w.r.t. distribution $Q$ is then given as:
$$\mathcal{E}_Q(h) \doteq \mathbb{E}_Q\Big(\mathbb{1}\{h(X) \ne \tilde Y\} - \mathbb{1}\{h^*(X) \ne \tilde Y\}\Big).$$
Consider an arbitrary online learner $\Lambda = \Lambda(Z)$, based on additional randomness $Z$ independent of the training data. We let $\Lambda_1, \Lambda_2, \ldots$ denote the sequentially generated classifiers of $\Lambda$. The regret $R_m(\Lambda)$ of $\Lambda$ over $m$ rounds is then defined as:
$$R_m(\Lambda) \doteq \sum_{t=1}^m \mathbb{1}\big(\Lambda_t(X_t) \ne \tilde Y_t\big) - \mathbb{1}\big(h^*(X_t) \ne \tilde Y_t\big).$$
The next few definitions and results will be stated in terms of an arbitrary online learner $\Lambda$; in Corollary 6, we will specialize $\Lambda$ to the online learner induced by the policy $\pi$.

First, we formalize the type of black-box guarantee on online-to-batch conversion our arguments will rely on. In what follows, let $X^m \doteq \{X_t\}_{t=1}^m$ and $Y^m \doteq \{Y_t\}_{t=1}^m$.

Definition 9. In what follows, let $a \doteq \{a_m\}$, $b \doteq \{b_m\}$ denote bounded sequences in $[0,1]$, indexed over $m \in \mathbb{N}$. An online-to-batch conversion rate is a mapping $F$ from sequences $a \mapsto b$ such that the following holds: if there exists an online learner $\Lambda = \Lambda(Z)$, for additional randomness $Z$, which achieves expected regret $\mathbb{E}_{Z, X^m, Y^m}(R_m(\Lambda)) \le m \cdot a_m$ for some sequence $a$, then there exists a classifier $\hat h = \hat h(\Lambda)$ with excess risk $\mathbb{E}_{Z, X^m, Y^m}(\mathcal{E}_Q(\hat h)) \le (F(a))_m$. Now, for any $b = \{b_m\}$, define the pseudo-inverse
$$F^\dagger(b) \doteq \inf\big\{a \doteq \{a_m\} : (F(a))_m > b_m\big\},$$
where the inf over a set of sequences is defined pointwise over $m \in \mathbb{N}$ (that is, $(F^\dagger(b))_m = \inf\{a_m : (F(a))_m > b_m\}$ for $m \in \mathbb{N}$).
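For intuition on Definition 9, the sketch below instantiates $F$ with the concrete conversion rate derived in Corollary 5 below, and evaluates its pointwise pseudo-inverse against the classification rate $b$ of Theorem 4 (all parameter values are placeholders, and the constant $c$ of Theorem 4 is not explicit, so $c = 1$ here).

```python
import math

def penalty(m):
    # Conversion overhead of Corollary 5 (with delta = 1/m in Theorem 3).
    return 6 * math.sqrt(math.log(2 * m * (m + 1)) / m) + 1 / m

def F(a_m, m):
    """Conversion rate: regret rate a_m -> excess-risk bound (F(a))_m."""
    return a_m + penalty(m)

def F_dagger(b_m, m):
    """Pointwise pseudo-inverse: inf{a_m : F(a_m, m) > b_m}."""
    return max(b_m - penalty(m), 0.0)

# Classification lower-bound rate b_m of Theorem 4, with placeholder parameters:
c, n_P, d, gamma, alpha = 1.0, 10**8, 3, 1.0, 0.5
b = lambda m: c * (n_P ** (d / (d + gamma)) + m) ** (-(alpha + 1) / (2 + d))

for m in (10**6, 10**9, 10**12):
    print(f"m={m:.0e}  b_m={b(m):.3g}  F_dagger(b)_m={F_dagger(b(m), m):.3g}")
# F_dagger(b)_m > 0 is exactly the regime where Lemma 4 yields a bandit lower bound.
```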
Next, we formally define the notion of a minimax lower bound for offline and online classification problems in terms of a rate $\{a_m\}$.

Definition 10. Fix $n_P \in \mathbb{N}$. We say that the class $\mathcal{T}$ (of distribution pairs $(P,Q)$) has a classification minimax lower bound of $b = \{b_m\}$ if the following holds: for any $m \in \mathbb{N}$ and any classifier $\hat h$ learned on data $\{(X_t, \tilde Y_t)\}_{t=1}^m \sim Q^m$ and additional randomness $Z = \{(X'_t, \tilde Y'_t)\}_{t=1}^{n_P} \sim P^{n_P}$,
$$\sup_{(P,Q) \in \mathcal{T}} \mathbb{E}_{Z, X^m, Y^m}\big(\mathcal{E}_Q(\hat h)\big) > b_m.$$
Similarly, a class $\mathcal{T}$ has an online minimax lower bound of $a = \{a_m\}$ if the following holds: for any $m \in \mathbb{N}$ and any online learner $\Lambda = \Lambda(Z)$ trained on data $\{(X_t, \tilde Y_t)\}_{t=1}^m \sim Q^m$ and additional randomness $Z = \{(X'_t, \tilde Y'_t)\}_{t=1}^{n_P} \sim P^{n_P}$, we have:
$$\sup_{(P,Q) \in \mathcal{T}} \mathbb{E}_{Z, X^m, Y^m}\big(R_m(\Lambda)\big) > m \cdot a_m.$$
Given an online-to-batch conversion rate, the next lemma allows us to deduce an online minimax lower bound from a classification minimax lower bound.

Lemma 4 (Minimax Lower Bound Conversion). Suppose $b \doteq \{b_m\}$ denotes a classification minimax lower bound for the class $\mathcal{T}$. Then, if there exists an online-to-batch conversion rate $F$ with $(F^\dagger(b))_m > 0$ for all $m \in \mathbb{N}$, we have that $\frac{1}{2} \cdot F^\dagger(b)$ is an online minimax lower bound for the class $\mathcal{T}$.

Proof. Consider an online learner $\Lambda = \Lambda(Z)$, with additional randomness $Z = \{(X'_t, \tilde Y'_t)\}_{t=1}^{n_P}$, with regret rate $a = \{a_m\}$. For contradiction, suppose there exists $m \in \mathbb{N}$ such that:
$$\sup_{(P,Q) \in \mathcal{T}} \mathbb{E}_{Z, X^m, Y^m}(R_m(\Lambda)) \le m \cdot a_m \le m \cdot \frac{1}{2} \cdot (F^\dagger(b))_m < m \cdot (F^\dagger(b))_m.$$
Then, by the definition of $F$ and the pseudo-inverse $F^\dagger$, there exists a classifier $\hat h = \hat h(\Lambda)$ such that:
$$\sup_{(P,Q) \in \mathcal{T}} \mathbb{E}_{Z, X^m, Y^m}(\mathcal{E}_Q(\hat h)) \le (F(a))_m \le b_m.$$
This contradicts $b$ being a classification minimax lower bound for the class $\mathcal{T}$.

We next specify the online-to-batch conversion rate $F$ that we will use with Lemma 4.

Theorem 3 (Theorem 4 of [CBCG04], paraphrased). Let $\Lambda = \Lambda(Z)$ be an arbitrary online learner, trained on $\{(X_t, \tilde Y_t)\}_{t=1}^m$, with additional randomness $Z$. Then, for any $\delta \in (0,1)$, there exists a classifier $\hat h = \hat h(\Lambda)$, trained on $\{(X_t, \tilde Y_t)\}_{t=1}^m$, such that:
$$\mathbb{P}\left(\mathbb{P}_{(X, \tilde Y) \sim Q}\big(\hat h(X) \ne \tilde Y\big) \ge \frac{1}{m}\sum_{t=1}^m \mathbb{1}\big(\Lambda_t(X_t) \ne \tilde Y_t\big) + 6\sqrt{\frac{\log(2(m+1)/\delta)}{m}} \,\Bigg|\, Z\right) \le \delta.$$

Corollary 5. Let $\Lambda = \Lambda(Z)$ be an online learner trained on data $\{(X_t, \tilde Y_t)\}_{t=1}^m$ with additional input $Z$. Then, there exists a classifier $\hat h = \hat h(\Lambda)$ such that for any distribution on $X^m, Y^m, Z$:
$$\mathbb{E}_{Z, X^m, Y^m}\big(\mathcal{E}_Q(\hat h)\big) \le \frac{\mathbb{E}_{Z, X^m, Y^m}(R_m(\Lambda))}{m} + 6\sqrt{\frac{\log\big(2m(m+1)\big)}{m}} + \frac{1}{m}.$$

Proof. Fix a value of $Z$, set $\delta = 1/m$, and let the event $A$ be as in Theorem 3:
$$A = \left\{\mathbb{P}_{(X, \tilde Y) \sim Q}\big(\hat h(X) \ne \tilde Y\big) \ge \frac{1}{m}\sum_{t=1}^m \mathbb{1}\big(\Lambda_t(X_t) \ne \tilde Y_t\big) + 6\sqrt{\frac{\log(2m(m+1))}{m}}\right\}.$$
Decomposing over $A$ and its complement, using $\mathcal{E}_Q(\hat h) \le 1$ on $A$ with $\mathbb{P}(A \mid Z) \le \delta = 1/m$, and subtracting the Bayes error $\mathbb{E}_{X^m, Y^m}\big(\frac{1}{m}\sum_t \mathbb{1}(h^*(X_t) \ne \tilde Y_t)\big) = \mathbb{P}_Q(h^*(X) \ne \tilde Y)$ from both sides, we have:
$$\mathbb{E}_{X^m, Y^m}\big(\mathcal{E}_Q(\hat h) \mid Z\big) \le \mathbb{E}_{X^m, Y^m}\left(\frac{1}{m}\sum_{t=1}^m \mathbb{1}(\Lambda_t(X_t) \ne \tilde Y_t) - \mathbb{1}(h^*(X_t) \ne \tilde Y_t) \,\Bigg|\, Z\right) + 6\sqrt{\frac{\log(2m(m+1))}{m}} + \frac{1}{m} = \frac{\mathbb{E}_{X^m, Y^m}(R_m(\Lambda) \mid Z)}{m} + 6\sqrt{\frac{\log(2m(m+1))}{m}} + \frac{1}{m}.$$
Taking a further expectation over $Z$ on both sides gives the desired result.

Theorem 1 of [KM18] provides the classification minimax lower bound, which we restate here.

Theorem 4 (Theorem 1 of [KM18]). Let $\mathcal{T}'$ be the class of all tuples $(P,Q)$ of distributions satisfying Assumptions 2 and 3, and Definitions 3 and 4, with some fixed parameters $(\lambda, C_d, d, C_\alpha, \alpha, \delta, C_\gamma, \gamma)$. In what follows, let $\mathcal{T}$ be the one-to-one mapping of $\mathcal{T}'$ to tuples of distributions on covariate-label pairs as in Definition 8. Suppose also that $\alpha \le d$. Then, there exists a constant $c > 0$ such that for any $n_P, n_Q \in \mathbb{N}$ and classifier $\hat h$ learned on $\{(X_t, \tilde Y_t)\}_{t=1}^{n_P} \sim P^{n_P}$ and $\{(X_t, \tilde Y_t)\}_{t=n_P+1}^{n_P+n_Q} \sim Q^{n_Q}$, we have:
$$\sup_{(P,Q) \in \mathcal{T}} \mathbb{E}\big(\mathcal{E}_Q(\hat h)\big) > c\left(n_P^{\frac{d}{d+\gamma}} + n_Q\right)^{-\frac{\alpha+1}{2+d}}.$$

Next, we will take
$$b_m \doteq c\left(n_P^{\frac{d}{d+\gamma}} + m\right)^{-\frac{\alpha+1}{2+d}}$$
as our classification minimax lower bound, where $m$ here stands for $n_Q$. Combining Lemma 4, Corollary 5, and Theorem 4, we obtain the following minimax lower bound for bandits.

Corollary 6 (Matching Lower Bounds over Given Regimes). Let the class $\mathcal{T}'$ and the constant $c > 0$ be as in Theorem 4. Suppose that $n_P, n_Q$ satisfy:
$$6\sqrt{\frac{\log\big(2n_Q(n_Q+1)\big)}{n_Q}} + \frac{1}{n_Q} < \frac{c}{2}\left(n_P^{\frac{d}{d+\gamma}} + n_Q\right)^{-\frac{\alpha+1}{2+d}}. \quad (10)$$
Then, for any fixed such $n_P, n_Q$ and any contextual bandit policy $\pi$, we have:
$$\sup_{(P,Q) \in \mathcal{T}'} \mathbb{E}_{X^n, Y^n}\big(R_{n_P,n}(\pi)\big) \ge \frac{c}{4}\, n_Q \left(n_P^{\frac{d}{d+\gamma}} + n_Q\right)^{-\frac{\alpha+1}{2+d}}.$$
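Before the proof, a quick numeric sketch of the regime carved out by condition (10) (again with placeholder parameters and $c = 1$, since the constant from Theorem 4 is not explicit): it checks (10) along the boundary $n_P = n_Q^{(d+\gamma)/d}$ discussed in Remark 2 below.

```python
import math

def cond_10(n_P, n_Q, d, gamma, alpha, c=1.0):
    """Condition (10): conversion overhead below half the classification rate."""
    lhs = 6 * math.sqrt(math.log(2 * n_Q * (n_Q + 1)) / n_Q) + 1 / n_Q
    rhs = (c / 2) * (n_P ** (d / (d + gamma)) + n_Q) ** (-(alpha + 1) / (2 + d))
    return lhs < rhs

def lower_bound(n_P, n_Q, d, gamma, alpha, c=1.0):
    """Regret lower bound of Corollary 6."""
    return (c / 4) * n_Q * (n_P ** (d / (d + gamma)) + n_Q) ** (-(alpha + 1) / (2 + d))

d, gamma, alpha = 3, 1.0, 0.5                # note alpha < d/2, as in Remark 2
for n_Q in (10**6, 10**9, 10**12):
    n_P = int(n_Q ** ((d + gamma) / d))      # boundary of the first subregime
    ok = cond_10(n_P, n_Q, d, gamma, alpha)
    print(f"n_Q={n_Q:.0e}  (10) holds: {ok}  bound={lower_bound(n_P, n_Q, d, gamma, alpha):.3g}")
# With these placeholders, (10) kicks in only for large n_Q (here, n_Q = 1e12).
```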
Proof. Fix $n_P, n_Q$ satisfying the inequality in (10) and let $n = n_P + n_Q$. Let $\hat\pi = \{\pi_t\}_{t > n_P}$ be the online learner induced by the policy $\pi$ restricted to the second phase $\{(X_t, Y_t)\}_{t=n_P+1}^n$, with additional randomness $Z = \{(X_t, Y_t)\}_{t=1}^{n_P}$. Then, by Corollary 5, there exists a classifier $\hat h$ such that:
$$\mathbb{E}_{X^n, Y^n}\big(\mathcal{E}_Q(\hat h)\big) \le \frac{\mathbb{E}_{X^n, Y^n}(R_{n_Q}(\hat\pi))}{n_Q} + 6\sqrt{\frac{\log\big(2n_Q(n_Q+1)\big)}{n_Q}} + \frac{1}{n_Q}.$$
We then have that the map $F$, defined below on a sequence $a = \{a_m\}$, is an online-to-batch conversion rate:
$$(F(a))_m \doteq a_m + 6\sqrt{\frac{\log\big(2m(m+1)\big)}{m}} + \frac{1}{m}.$$
Let $b_m \doteq c\big(n_P^{\frac{d}{d+\gamma}} + m\big)^{-\frac{\alpha+1}{2+d}}$ be as in Theorem 4. Then, by Theorem 4 and Lemma 4, we have:
$$\sup_{(P,Q) \in \mathcal{T}'} \mathbb{E}_{X^n, Y^n}\big(R_{n_Q}(\hat\pi)\big) \ge \frac{n_Q}{2} \cdot (F^\dagger(b))_{n_Q}.$$
Next, we observe:
$$\mathbb{E}_{X^n, Y^n}\big(R_{n_Q}(\hat\pi)\big) = \mathbb{E}_{X^n, Y^n}\left(\sum_{t=1}^{n_Q} \mathbb{1}\big(\hat\pi_t(X_t) \ne \tilde Y_t\big) - \mathbb{1}\big(h^*(X_t) \ne \tilde Y_t\big)\right) = \mathbb{E}_{X^n, Y^n}\left(\sum_{t=1}^{n_Q} Y_t^{\pi^*(X_t)} - Y_t^{\pi_t(X_t)}\right) = \mathbb{E}_{X^n, Y^n}\big(R_{n_P,n}(\pi)\big).$$
Thus:
$$\sup_{(P,Q) \in \mathcal{T}'} \mathbb{E}_{X^n, Y^n}\big(R_{n_P,n}(\pi)\big) \ge \frac{n_Q}{2}\,(F^\dagger(b))_{n_Q} \ge \frac{n_Q}{2}\left[b_{n_Q} - 6\sqrt{\frac{\log\big(2n_Q(n_Q+1)\big)}{n_Q}} - \frac{1}{n_Q}\right] \ge \frac{c}{4}\, n_Q\left(n_P^{\frac{d}{d+\gamma}} + n_Q\right)^{-\frac{\alpha+1}{2+d}}.$$

Remark 2. The inequality in (10) corresponds to the regime $n_P = \tilde O\big(n_Q^{(d+\gamma)/\alpha}\big)$ with $\alpha < d/2$. In particular, this includes the following subregimes.

• Performance on $Q$ depends mostly on covariates $X_t \sim Q_X$, $t > n_P$. This is the subregime where $n_P \lesssim n_Q^{(d+\gamma)/d}$, roughly, that is, when (in the upper bound of Theorem 1) $\min\big(n_P^{-\frac{\alpha+1}{2+d+\gamma}}, n_Q^{-\frac{\alpha+1}{2+d}}\big) = n_Q^{-\frac{\alpha+1}{2+d}}$, i.e., past experience under $P$ is too short to significantly influence regret under $Q$. The lower bound of Corollary 6 is then of the form $\sup_{(P,Q) \in \mathcal{T}'} \mathbb{E}\big(R_{n_P,n}(\pi)\big) \ge n_Q \cdot n_Q^{-\frac{\alpha+1}{2+d}}$, which confirms that the threshold $n_P = \tilde O\big(n_Q^{(d+\gamma)/d}\big)$ (on when past experience is too short) is indeed tight.

• Performance on $Q$ depends mostly on covariates $X_t \sim P_X$, $t \le n_P$. This is the subregime where $n_Q^{(d+\gamma)/d} \lesssim n_P \lesssim n_Q^{(d+\gamma)/\alpha}$. In other words, $\min\big(n_P^{-\frac{\alpha+1}{2+d+\gamma}}, n_Q^{-\frac{\alpha+1}{2+d}}\big) = n_P^{-\frac{\alpha+1}{2+d+\gamma}}$, i.e., past experience under $P$ significantly influences regret under $Q$. The lower bound of Corollary 6 is then of the form $\sup_{(P,Q) \in \mathcal{T}'} \mathbb{E}\big(R_{n_P,n}(\pi)\big) \ge n_Q \cdot n_P^{-\frac{\alpha+1}{2+d+\gamma}}$,