Self-Tuning Bandits over Unknown Covariate-Shifts
SS ELF -T UNING B ANDITS OVER U NKNOWN C OVARIATE -S HIFTS
A P
REPRINT
Joseph Suk
Columbia University, Statistics [email protected]
Samory Kpotufe
Columbia University, Statistics [email protected]
July 20, 2020 A BSTRACT
Bandits with covariates, a.k.a. contextual bandits , address situations where optimal actions (or arms)at a given time t , depend on a context x t , e.g., a new patient’s medical history, a consumer’s pastpurchases. While it is understood that the distribution of contexts might change over time, e.g., dueto seasonalities, or deployment to new environments, the bulk of studies concern the most adversarialsuch changes, resulting in regret bounds that are often worst-case in nature. Covariate-shift on the other hand has been considered in classification as a middle-ground formalismthat can capture mild to relatively severe changes in distributions. We consider nonparametric banditsunder such middle-ground scenarios, and derive new regret bounds that tightly capture a continuumof changes in context distribution. Furthermore, we show that these rates can be adaptively attainedwithout knowledge of the time of shift nor the amount of shift.
Bandits with covariates, or contextual bandits, concern situations where the reward of an action depends on a currentcontext x , e.g., a patient’s medical record (actions are treatments), or a user’s profile and past history (actions are newproducts to propose). The problem is to maximize the total rewards of actions over time as similar contexts appearand rewards are observed over past actions. We adopt the stochastic setting where covariate and rewards are jointlydistributed over time.In the nonparametric version, little is assumed about the distribution of rewards over contexts, beyond Lipchitz conditionsthat capture the idea that rewards should be somewhat close for nearby contexts. Now suppose contexts are drawn froma fixed distribution over time; this ensures typicality, i.e., we can expect similar contexts to have appeared previously, sothere is much potential for learning to progress over time. Most recent advances have been made in nonparametricsettings with a fixed distribution, with early consistency results in [YZ + inthe nonparametric literature on the problem, although it has long been recognized in the more established parametricsetting on contextual bandits, under various formalisms of often adversarial nature [HMB15, KA16, LWAL17, LLS18,WIW18].As an initial take on a nonparametric setting with changes in distributions, we focus attention to the less adversarialcase of covariate-shift , a formalism often adopted in works on domain adaptation in classification, starting with[SNK +
08, CMRR08, GSH +
09, BDU12]. For example, consider a situation where clinical trials are to be extendedto a new population. The distribution on patients’ profiles X would likely shift, however the predictors in X (e.g.,biometrics, medical history) are naturally chosen to be predictive of treatment outcomes; in other words, the conditionaldistribution of rewards given X remains unchanged. Formally, in covariate-shift , Q Y | X = P Y | X but Q X (cid:54) = P X , where P and Q denote previous and new joint-distributions on context-reward pairs ( X, Y ) . We actually are not aware of any such work to date for the nonparametric setting. a r X i v : . [ s t a t . M L ] J u l elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
We are interested in achievable rewards after the shift P → Q , i.e., over the time period corresponding to the newdistribution Q X over contexts. Intuitively, such performance under Q depends on how far the prior distribution P X isfrom Q X , as typical contexts under Q X might not be likely to have been observed under P X . Hence, a first goal is tounderstand, whether and how a policy π started before the shift might adapt to an a priori unknown such discrepancybetween P X and Q X . The problem is generally harder than in classification where most established approacheswould compare observed X ∼ Q X to prior observations X ∼ P X to evaluate and adjust to the unknown discrepancy.In particular, we might not know the change time, and therefore cannot readily identify which data is which: forexample, in an ongoing clinical study, or running a recommender system, a change in distributions might occur due toseasonalities in population makeup, so is likely not known a priori.Interestingly, we find that it is not necessary to spend resources identifying such unknown change points. Namely, weshow that it possible to automatically adapt to unknown change points, along with unknown discrepancies betweencovariate distributions P X , Q X , while achieving regrets under Q X of near optimal order in terms of these unknownchange parameters. In particular, the achievable regret tightly characterizes the effective amount of past experience contributed to Q by previous runs on P , in terms of both the length of previous runs and the discrepancy P X → Q X . Further Background.
There is by now an expansive literature on parametric contextual bandits, ranging from fullyadversarial to stochastic settings (see e.g. [HM07, LZ08, BCB12, AC16, RS16]).In the stochastic parametric setting, the earlier cited results [HMB15, KA16, LWAL17, LLS18, WIW18] are closest inspirit to the present work. However, they consider settings of a more adversarial nature, as their aim is to achieve regrets– over stationary periods of length ∆ t – of similar order O ( √ ∆ t ) as would have been achieved without distributionshifts; in other words, they contend that past experiences could adversarially affect regret over stationary periods, andthe aim is to mitigate such adversity. In contrast, as we will see, past experience is actually useful under covariate-shift(to a variable extent depending on shift charateristics), as long as the bandits procedure is reasonably conservative inleast-observed regions of context space.[Sli14] considers nonparametric settings, however, with fixed non-stochastic contexts.Finally, in the setting of active online regression with multiple domains, [CLMZ20] establishes adaptive regretguarantees in terms of the domain dimensions and durations.We start with a formal setup in Section 2, followed by an overview of results in Section 3. Algorithms and proof ideasare discussed in Section 4. We consider a finite set of actions (or arms) [ K ] . = { , . . . , K } , and let Y ∈ [0 , K denote the rewards of each action i ∈ [ K ] . We assume that the covariate X , supported on X . = [0 , d , is jointly distributed with Y , and we thereforeassume a random independent sequence of covariate-reward pairs { ( X t , Y t ) } t ∈ N , identically distributed over different and possibly unknown periods of time .In particular we assume (in the bulk of the paper) that for some n P ≥ , possibly a priori unknown, the sequence { ( X t , Y t ) } t ∈ [ n p ] is i.i.d. according to a distribution P , while { ( X t , Y t ) } t>n p is i.i.d. according to a target distribution Q with different marginals. We will be interested in performance under Q , i.e., after such a shift. We note however thatour analysis readily extends to the case of multiple distribution shifts (before time n P ), as discussed in Appendix D. Assumption 1 (Covariate shift) . While the distribution of covariates X t might change overtime, the conditionaldistribution of Y t | X t remains fixed (i.e., in our context Q X (cid:54) = P X , but P Y | X = Q Y | X ). In particular, the aim is to maximize expected rewards conditioned on X t ; this is captured through the fixed regression function f : X → [0 , K as f i ( x ) . = E ( Y i | X = x ) , i ∈ [ K ] . The term characterizes settings displaying O ( √ n ) regrets, due to either parametric constraints on rewards or on hindsight baseline policies. The indexing set N denotes the natural numbers excluding . We use the terms time t or round t interchangeably, the latter in the context of a procedure. A P
REPRINT
In the bandits setting, a so-called policy (or bandit procedure) chooses actions at each round t , based on observedcovariates (up to round t ) and passed rewards, whereby at each round t only the rewards Y it of chosen actions i arerevealed. We say an arm i is pulled if action i is chosen by the policy. We adopt the following formalism. Definition 1 (Policy) . A policy π . = { π t } t ∈ N is a random sequence of functions π t : X t × [ K ] t − × [0 , t − → [ K ] .In an abuse of notation, in the context of a sequence of observations till round t , we will let π t ∈ [ K ] also denote theaction chosen at round t . In the case of a randomized policy, i.e., where π t in fact maps to distributions on [ K ] , we willstill let π t ∈ [ K ] denote the (random) action chosen at round t . We let X t . = { X s } s ≤ t , Y t . = { Y s } s ≤ t denote the observed covariates and (observed and unobserved) rewards fromrounds to t . The performance of a policy is evaluated through a notion of regret . Definition 2 (Cumulative regret) . Define the regret between rounds n P < n of a policy π , as R n P ,n ( π ) . = n (cid:88) t = n P +1 max i ∈ [ K ] (cid:0) f i ( X t ) − f π t ( X t ) (cid:1) . In our context of a shift to Q , we often will use the short notation R Qn ( π ) to denote R n P ,n ( π ) . The oracle policy π ∗ refers to the strategy that maximizes the expected reward at any round t , and is given by π ∗ t ( X t ) ∈ argmax i ∈ [ K ] f i ( X t ) . The regret of a policy π is therefore the excess expected reward of π ∗ relative to π over X n . We seek a policy π that minimizes E X n , Y n R Qn ( π ) .We emphasize that, while we will be interested in regret over particular periods n P + 1 : n (corresponding to a fixedtarget Q ), it is understood by definition that π runs starting at t = 1 , and R Qn ( π ) . = R n P ,n ( π ) therefore depends on priordecisions up till time n P . Finally, usual bounds for stationary distributions are recovered simply by letting n P = 0 . Our main assumptions below are stated under the (cid:96) ∞ norm on [0 , d for convenience, as we build our procedures π over regular grids of [0 , d . It should be clear however that the relevant conditions hold under any norm (e.g., any (cid:96) p , p ≥ ) when they hold under (cid:96) ∞ , by the equivalence of R d norms. • Standard Assumptions and Conditions.
We assume, as in prior work on nonparametric contextual bandits [RZ10, PR13, Sli14, RMB18, GJ18], that theregression function is Lipschitz, with some known upper-bound λ on the Lipchitsz constant (often simply assumed tobe ). Assumption 2 (Lipschitz f ) . There exists λ > such that for all i ∈ [ K ] and x, x (cid:48) ∈ X , | f i ( x ) − f i ( x (cid:48) ) | ≤ λ (cid:107) x − x (cid:48) (cid:107) ∞ . (1)Furthermore, the difficulty of detecting the optimal arm π ∗ ( x ) at any x is parametrized through the following margin condition of f w.r.t. Q , originally due to [T +
04] (for nonparametric classification).
Definition 3 (Margin Condition) . Let f (1) ( x ) , f (2) ( x ) denote the highest and second highest values of f i ( x ) , i ∈ [ K ] ,if they are not all equal; otherwise let f (1) ( x ) = f (2) ( x ) be that value.There exists δ > , C α > so that ∀ δ ∈ [0 , δ ] , Q X (0 < | f (1) ( X ) − f (2) ( X ) | < δ ) ≤ C α δ α . (2)In particular, the above is always satisfied with at least α = 0 . Intuitively, the larger the margin f (1) ( x ) − f (2) ( x ) at x ,the easier it is to detect the best arm, in the sense that a rough approximation to f is sufficient. The above condition,common in prior work on nonparametric bandits, encodes the margin distribution under Q X . Interestingly, we needno assumption on the margin distribution under P X , although our setting assumes that the procedure π is first ran oncovariates X t ∼ P X , t ≤ n P ; in fact, we will see that we only need to ensure that π maintains good choices of arms forevery potential x ∈ X , along with sufficient arm pulls, up till the distribution shifts at round n P + 1 .The next assumption ensures that Q X has good coverage of [0 , d . It holds for instance if Q X has lower-boundedLebesgue density on [0 , d . We remark that the term policy is often used to denote a mapping from state (or covariate) to action; here we simply equate itwith any decision procedure taking action based on current and past observations.
A P
REPRINT
Figure 1:
Some settings with < γ < ∞ . Left: the density f P ∝ | x | γ goes fast to , while f Q is uniform; f Q /f P then diverges(so density ratios , and f -divergences are ill defined). Right: P X moves mass away from regions of large Q X mass, with relativedensities captured by γ and the size of the region ( r ). Assumption 3 (Mass under Q ) . ∃ C d > s.t., ∀ (cid:96) ∞ balls B ⊂ [0 , d of diameter r ∈ (0 , : Q X ( B ) ≥ C d · r d . • Quantifying the Shift in P to Q . Next we aim to quantify how much the earlier covariate distribution P X differs from the shift Q X . Intuitively, P X hasinformation on Q X if it yields data useful to Q X , in other words, if it has sufficient mass in regions of large Q X mass.The next condition, adapted from recent work [KM18] on classification, parametrizes such intuition. Definition 4 (Transfer Exponent γ ) . ∃ C γ , γ ≥ s.t., ∀ (cid:96) ∞ balls B ⊂ [0 , d of diameter r ∈ (0 , : P X ( B ) ≥ C γ · r γ · Q X ( B ) . Note that the above condition always holds with at least γ = ∞ . The larger the shift, the larger γ , with γ = 0 capturing the mildest such shifts in covariate distribution. Some examples are given in Figure 1. As we will see, thetransfer exponent γ manages to tightly capture a continuum of easy to hard shifts in covariate distributions as evident inachievable regret rates R Qn over the shift period. A common algorithmic approach in nonparametric contextual bandits, starting from earlier work [RZ10, PR13], is tomaintain tree -based (regression) estimates ˆ f t of the expected reward function f , so that at any time t , upon observing X t , only those arms i with f i ( X t ) close to f (1) ( X t ) might be played. This assumes a good estimate of f at any time t ,which in the context of tree-based estimates boils down to choosing an optimal level in the tree – where each level r corresponds to a piecewise-constant regression estimate ˆ f t over bins of side-length r in X . In the usual setting with astationary distribution Q , a level r = r t might be chosen as O ( t − / (2+ d ) ) yielding optimal regression that would resultin the best provable regret rates.In our context however, the unknown amount of drift parametrized by γ has to be accounted for in the choice of alevel r t at any time t . Namely, using intuition from classification, it can be shown that an optimal choice, based on(unavailable) knowledge of γ and the switch time n P , is of the form r t ( γ, n P ) . = O (min { n − / (2+ d + γ ) P , t − / (2+ d ) } ) .A main aim is therefore to design a procedure which, without such knowledge, still makes near optimal adaptive choicesof levels at any time t .Our adaptive strategies, detailed in Section 4, rely directly on the relative proportions of samples observed on a pathfrom the root of a tree T down to a leaf containing X t . Roughly, let n r ( X t ) denote the covariate count in the bincontaining X t at level r (by time t ). We then choose the smallest level r such that n − r ( X t ) ≤ r . For intuition, thischoice roughly balances regression variance (controlled by n − r ) and bias (controlled by r ). Such a choice stems fromprior insights on adaptive tree-based regression with fixed data distribution but unknown d (see e.g. 
[KD12]), which weshow here to yield a regression rate – in terms of unknown γ, n p – similar to that of the oracle choice r t ( γ, n P ) .However, such an adaptive choice of level at each time t immediately introduces a book-keeping problem in the banditssetting: the number of observed rewards for a given arm i – which drive the estimates ˆ f – might significantly differ The switch time might be available in some situations as discussed in the introduction.
A P
REPRINT from the number of covariates n r in a bin, as we eliminate suboptimal arms over time. In particular, while an arm i might still be valid in a bin B at time t , it might have been eliminated in a child bin B (cid:48) ⊂ B at a much earlier time, andtherefore would lack enough observations for a confident reward estimate ˆ f i at time t (relative to other arms). Furthercare is thus required for such book-keeping on observed rewards (or arm pulls ).We will first consider a simplified bandit setting which alleviates book-keeping, namely a multiple-play variant wheremultiple arms might be pulled at once for every X t ; here we still have to eliminate suboptimal arms so the aboveproblem remains, but can be shown to be milder. This will serve as a warmup procedure that helps lay down much of thekey intuition towards adaptation to unknown distribution shift parameters. Much of our analysis overview in the maintext centers on the more intuitive multiple-play setting for brevity. We then show how to extend such a multiple-playprocedure to a single-play variant where only a single arm is pulled in each round. This is done by properly randomizingarms to be pulled to ensure a fair relative distribution of arm pulls. Adaptive Single-Play Bandits.
Our main theorem considers the canonical single-play variant where the policy π pulls one arm (and observes its reward) at every round t . This is given by the randomized procedure of Algorithm 2from Section 4, which takes in a confidence parameter δ ∈ (0 , . Just as in previous work for the stationary case, themargin parameter α needs not be known, while in addition here, we do not need the drift parameters n P , γ either. Theexpectation in the statement below is over the entire sequence X n , Y n ∼ P n P × Q n − n P , plus the randomness in π . Theorem 1.
Let π denote the procedure of Algorithm 2, ran, with parameter δ ∈ (0 , , up till time n > n P ≥ , with n P possibly unknown. Suppose P X has unknown transfer exponent γ w.r.t. Q X , and that the average reward function f satisfies a margin condition with unknown α under Q X . Let n Q . = n − n P denote the (possibly unkown) number ofrounds after the drift, i.e., over the phase X t ∼ Q X . We have for some constant C > : E R Qn ( π ) ≤ Cn Q (cid:34) min (cid:32)(cid:18) K log ( K/δ ) n P (cid:19) α +12+ d + γ , (cid:18) K log ( K/δ ) n Q (cid:19) α +12+ d (cid:33) + K log ( K/δ ) n Q + nδ (cid:35) The following corollary is immediate.
Corollary 1.
Under the setup of Theorem 1, letting δ = O (1 /n ) yields: E R Qn ( π ) ≤ Cn Q (cid:34) min (cid:32)(cid:18) K log( Kn ) n P (cid:19) α +12+ d + γ , (cid:18) K log( Kn ) n Q (cid:19) α +12+ d (cid:33) + K log( Kn ) n Q (cid:35) . The above rates interpolate between two terms: one involving n P past observations and the drift parameter γ , the otherinvolving n Q . This last term matches the minimax regret rate of n − α +12+ d Q of [PR13], and is attained by the adaptive π when there is no drift, i.e., for n P = 0 . For n P > , the interpolated rate can be rewritten as n Q · (cid:16) n d γ P + n Q (cid:17) − ∧ α +12+ d for d γ = (2 + d ) / (2 + d + γ ) ; in other words n d γ P might be viewed as the effective amount of past experience contributed despite the drift; this quantity is largest when γ = 0 , lowering regret, and vanishes as γ → ∞ , i.e.,with larger discrepancy between P X and Q X . Such intuition is confirmed in simulations (Figure 2). As previouslymentioned, the results readily extend to the case of multiple drifts before time n P , with γ above replaced by an average ¯ γ of transfer exponents between past P X ’s and Q X (Appendix D).Finally, we note that the above rates are tight (up to log terms) in the sense that the average regret (cid:16) n d γ P + n Q (cid:17) − α +12+ d matches minimax lower bounds for classification under covariate-shift of [KM18] (see discussion in Appendix E). All algorithms build on a dyadic partitioning tree T defined as follows. Definition 5 (Partition Tree) . Let R . = { − i : i ∈ N ∪ { }} , and let T r , r ∈ R denote a regular partition of [0 , d intohypercubes (which we refer to as bins ) of side length (a.k.a. bin size) r . We then define the dyadic tree T . = { T r } r ∈R ,i.e., a hierarchy of nested partitions of [0 , d . We will refer to the level r of T as the collection of bins in partition T r .The parent of a bin B ∈ T r , r < is the bin B (cid:48) ∈ T r containing B ; child , ancestor and descendant relations follownaturally. The notation T r ( x ) will then refer to the bin at level r containing x . Note that, while in the above definition, T has infinite levels r ∈ R , at any round t in a procedure, we implicitlyonly operate on the subset of T containing data. Our procedures, as in prior work on nonparametric bandits, maintainestimates ˆ f of the average reward function f over levels of T .5elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
Figure 2:
Simulation Results. Q X ∼ U ([0 , ) , P X has density ∝ (cid:107) x (cid:107) γ , K = 3 arms, with rewards Y i = f i ( X )+ N (0 , . , i ∈ [ K ] , where f i ( x ) ∝ (cid:80) k ± (1 − (cid:107) x − z k (cid:107) /r k ) + for 25 randomly placed bumps with centers z k , radius r k . A profile of f is shownon the left, with lower gradient colors corresponding to least margins (white meaning no margin). The right plots average 20 runs ofAlgorithm 2, and verify the guarantees of Theorem 1, namely that the procedure adapts to unknown shift parameters n p and γ . Inparticular, the amount of past experience n d γ P clearly helps, and how much it helps depends on the level of shift P → Q as capturedby γ . Algorithm 1
Adaptive-Multiple-Play Requires : upper bound on Lipschitz constant λ , set of arms [ K ] , tree T with levels r ∈ R Input : c . = max(1 , λ − ) , δ ∈ (0 , , covariates X , X , . . . Initialization : For any bin B at any level in T , set I B ← [ K ] for t = 1 , , . . . do
5: If t ≤ (cid:100) c log( K/δ ) (cid:101) , play all arms in [ K ] t > (cid:100) c log( K/δ ) (cid:101) :7: Choose a level r t ∈ R for X t : r t ← min (cid:110) r ∈ R : λr ≥ (cid:113) log( K/δ ) n r ( X t ) (cid:111) Update candidate arms for the bin B containing X t at level r t :
9: Set I B ← (cid:84) B (cid:48) ∈ T r ,r ≥ r t I B (cid:48) ˆ f i ( B ) for any i ∈ I B over B ˆ f i
11: Refine candidate arms: I B ← I B \ { i : ˆ f i ( B ) < ˆ f (1) ( B ) − λr t } .12: Play all arms in I B . end for Definition 6 (Regression estimates and arm pull counts) . At any round t > (cid:100) c log( K/δ ) (cid:101) , for any bin B at any levelin the tree, we define the following regression estimate for arm i : ˆ f it ( B ) . = 1 m t ( B, i ) (cid:88) X s ∈ B,s ≤ t − ,π s = i Y is , where m t ( B, i ) denotes the number of times arm i was pulled in B before time t . If m t ( B, i ) = 0 , we take ˆ f it ( B ) = 0 .For any B at level r in the tree, ˆ f it ( B ) serves as a regression estimate for any covariate x ∈ B . We often drop B or t inthe above definitions, when understood from context. Adaptive Multiple-Play Bandits.
We now discuss the simplest procedure Algorithm 1, which yields much of the basicintuition for adapting to unknown shift parameters n P , γ . Definition 7 (Covariate counts) . Let
B . = T r ( X t ) . We write: n r ( X t ) . = (cid:80) s ∈ [ t − { X s ∈ B } . At any time t , upon observing X t , a level r t is chosen according to the covariate counts n r ( X t ) along the path { T r ( X t ) } r ∈R . Roughly, r t is picked as the smallest r ∈ R such that / (cid:112) n r ( X t ) ≤ r . The level r t , more preciselythe bin B = T r t ( X t ) containing X t at that level, then determines the estimate ˆ f ( B ) to be used at time t .To understand this choice, recall that a main aim is to quickly identify which arms are suboptimal – and shouldn’t bepulled for X t , and we ought to therefore use a good estimate of f ( X t ) . We will show that r t indeed provides such anestimate at a near optimal regression rate in terms of unknown n P and γ (see Lemma 2). In particular, the covariatecounts n r ( X t ) , at any level r , account for covariates from both P X and Q X , whenever t > n P . Intuitively, we will6elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
Algorithm 2
Adaptive-Single-Play Requires : upper bound on Lipschitz constant λ , set of arms [ K ] , tree T with levels r ∈ R Input Parameters : c . = max(8 , λ − ) , δ ∈ (0 , , covariates X , X , . . . Initialization : For any bin B at any level in T , set I B ← [ K ] B .4: for t = 1 , , . . . do
5: If t ≤ K (cid:100) c log( K/δ ) (cid:101) , play a random arm i ∈ [ K ] selected with probability / | K | .6: Otherwise, for t > K (cid:100) c log( K/δ ) (cid:101) :7: Choose a level r t ∈ R for X t : r t ← min (cid:110) r ∈ R : λr ≥ (cid:113) K log( K/δ ) n r ( X t ) , and n r ( X t ) ≥ K log( K/δ ) (cid:111) Update candidate arms for the bin B containing X t at level r t :
9: Set I B ← (cid:84) B (cid:48) ∈ T r ,r ≥ r t I B (cid:48) ˆ f i ( B ) for any i ∈ I B over B ˆ f i
11: Refine candidate arms: I B ← I B \ { i : ˆ f i ( B ) < ˆ f (1) ( B ) − λr t } .12: Play a random arm i ∈ I B selected with probability / |I B | . end for then expect n r ( X t ) ≈ n P · P X ( T r ( X t )) + ( t − − n P ) · Q X ( T r ( X t )) (cid:38) n P · r d + γ + ( t − − n P ) · r d . The choiceof r t can then be shown to properly balance regression variance and bias in terms of unknown n P and γ .Once this choice is made, only those arms deemed safe are pulled for X t at time t . These so-called candidate arms aremaintained as I B ⊆ [ K ] for each bin B over time, and exclude identified suboptimal arms whose average rewards areclearly below that of the unknown best arm π ∗ ( x ) for any x ∈ B . In particular, suppose at any time t , we can ensurethat | ˆ f it ( B ) − f i ( x ) | (cid:46) r t for all remaining arms i over x ∈ B . Then we can safely discard i if ˆ f (1) t ( B ) − ˆ f it ( B ) (cid:38) r t .It then makes sense to also discard such an arm in all descendants of B .Now, adaptation to the unknown margin parameter α comes for free through such decisions over I B . Namely, if themargin f (1) ( x ) − f (2) ( x ) (cid:29) r t for all x ∈ B , then all suboptimal arms are discarded by time t so we suffer no regretfor X t ∈ B at time t . Otherwise, all arms i left in I B satisfy f (1) ( x ) − f i ( x ) (cid:46) r t , for x ∈ B , i.e., a bound on regret;on the other hand, the margin distribution ensures that the Q X -probability of X t landing in such a bin with low marginsis at most r αt . All that is left is to ensure that r t is of the right order in terms of t and the unknown n P , γ (as we show isthe case of the adaptive choice of r t discussed above). Remark 1 (Book-keeping) . As discussed earlier, our adaptive choice of level r t brings in additional difficulty in thebook-keeping of arm pulls. In fact the above discussion assumes that covariate counts n r ( X t ) (used in choosing r t ,towards adapting to unknown n P , γ ) and arm-pull counts m t ( B, i ) (used in estimating ˆ f t ( B ) ) are of similar order.However, because we can have r t > r t − as X t , X t − fall in different regions of space, the following situation canoccur: in a given bin B . = T r t ( X t ) chosen at time t , some arms might have been eliminated in a descendant of B atan earlier time, and therefore not pulled as much as other arms in I B . However, our choices ensure that the total armpull count (for any arm in I B ) at that earlier time must have been sufficiently large. This is argued in Lemma 1. Thesituation is however more severe in the single-play variant described below. Adaptive Single-Play Bandits.
We modify the above multiple-play variant as follows (as detailed in Algorithm 2).Upon choosing a bin B = T r t ( X t ) at level r t for X t , where we would have pulled all arms in I B , we instead pull asingle candidate arm at random. Lemma 3 then argues, similar to Lemma 1, that in expectation, total arm pull counts m t ( B, i ) (for i ∈ I B ) remain of sufficiently large order w.r.t. n r ( X t ) . Continuing on the discussion of Section 4, first consider the multiple-play setting of Algorithm 1 which yields similarregret rates as those of Theorem 1 (see Proposition 1 of Appendix B). At any round t (cid:38) log K with selected bin B . = T r t ( X t ) , we have by standard arguments (Lemma 2) that, with high probability (over random rewards, conditionedon all past covariates): ∀ x ∈ B, i ∈ I B : | ˆ f it ( B ) − f i ( x ) | (cid:46) (cid:115) log( K/δ ) m t ( B, i ) + λr t . (3)7elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
However, r t is selected based on covariate counts n r ( X t ) which might not directly relate to m t ( B, i ) . Fortunately, thetwo quantities are equal (for B ) till the first time the subtree rooted at B is visited; building on this, one can argue that m t ( B, i ) should be sufficiently large at any time B is selected: Lemma 1.
Suppose at round t , we select bin B = T r t ( X t ) . Then, max i ∈I B (cid:113) log( K/δ ) m t ( B,i ) ≤ λr t . Thus, the regression error in (3) is at most λr t . Therefore, since we only discard an arm i if ˆ f (1) t ( B ) − ˆ f it ( B ) ≥ λr t ,it follows that the best arm for any x ∈ B is never removed. In particular this holds up to the unknown shift time n P + 1 . Now, for t > n P the regression error r t can be shown to be of optimal order: it approximately minimizes theexpression, ( n r ( X t )) − / + r (cid:46) (cid:0) n P · r d + γ + ( t − − n P ) · r d (cid:1) − / + r, i.e., is less than the value r ∗ minimizing the r.h.s. (Lemma 2). Now r ∗ is of optimal regression order min (cid:32)(cid:18) log( K/δ ) n P (cid:19) d + γ , (cid:18) log( K/δ ) t − n P (cid:19) d (cid:33) . As mentioned previously, a non-zero regret can only occur if the margin at X t is below the regression error λr t (otherwise the best arm is pulled). Since the likelihood of picking such an X t is O ( r αt ) (by Definition 3), andfurthermore, any arm picked can only incur regret λr t (by equation (3)), it follows that the expected regret at time t isbounded by O ( r αt ) , which is of optimal order. Summing over t > n P yields a regret bound of the right order.For single-play, even at the round s when the subtree rooted at B is first visited, it is possible that m s ( B, i ) does notequal its covariate count (for some i ∈ I B ). However, by Lemma 3 (the counterpart to Lemma 1 for single-play)we have that m t ( B, i ) (cid:38) n r t ( X t ) /K , through concentration arguments on the distribution of arm pulls. The rest ofthe proof of Theorem 1 proceeds in the same manner as the multiple-play case described above. Details are given inAppendix C.The case of multiple shifts is handled similarly by properly bounding n r ( X t ) (see Appendix D). Broader Impact
Domain adaptation is now understood as an essential step towards the successful deployment of sequential learning systems across (related) real-world domains, e.g., adapting clinical trials (for experimental treatments) between differentpopulations, or sharing knowledge between automated driving runs from related urban environments.In contrast with classification and regression settings where much recent progress has been made, domain adaptationin sequential learning – ranging from bandits to more challenging reinforcement learning applications – is furthercomplicated by the strong interdependencies between learning rounds, the hardness of adequate sampling of statespaces, and the general lack of full information on the potential rewards of actions.The present work contributes basic insights into this broader research program, with a first focus on the relatively moreamenable setting of bandits with covariates.
References [AC16] Peter Auer and Chao-Kai Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic andadversarial bandits. In
Conference on Learning Theory , pages 116–120, 2016.[BCB12] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armedbandit problems. arXiv preprint arXiv:1204.5721 , 2012.[BDU12] Shai Ben-David and Ruth Urner. On the hardness of domain adaptation and the utility of unlabeled targetsamples. In
International Conference on Algorithmic Learning Theory , pages 139–153, 2012.[CBCG04] Nicoló Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learningalgorithms.
Information Theory, IEEE Transactions , 50(9):2050–2057, 2004.[CLMZ20] Yining Chen, Haipeng Luo, Tengyu Ma, and Chicheng Zhang. Active online domain adaptation. arXivpreprint arXiv:2006.14481 , 2020.[CMRR08] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correctiontheory. In
International conference on algorithmic learning theory , pages 38–53. Springer, 2008.8elf-Tuning Bandits over Unknown Covariate-Shifts
A P
REPRINT [GJ18] Melody Y Guan and Heinrich Jiang. Nonparametric stochastic contextual bandits.
AAAI , 2018.[GSH +
09] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and BernhardSchölkopf. Covariate shift by kernel mean matching.
Dataset shift in machine learning , 3(4):5, 2009.[HM07] Elad Hazan and Nimrod Megiddo. Online learning with prior knowledge. In
International Conference onComputational Learning Theory , pages 499–513. Springer, 2007.[HMB15] Negar Hariri, Bamshad Mobasher, and Robin Burke. Adapting to user preference changes in interactiverecommendation. In
Twenty-Fourth International Joint Conference on Artificial Intelligence , 2015.[KA16] Zohar S Karnin and Oren Anava. Multi-armed bandits: Competing with optimal sequences. In
Advancesin Neural Information Processing Systems , pages 199–207, 2016.[KD12] Samory Kpotufe and Sanjoy Dasgupta. A tree-based regressor that adapts to intrinsic dimension.
Journalof Computer and System Sciences , 78(5):1496–1515, 2012.[KM18] Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift.
COLT , 2018.[LLS18] Fang Liu, Joohyun Lee, and Ness Shroff. A change-detection based framework for piecewise-stationarymulti-armed bandit problem. In
Thirty-Second AAAI Conference on Artificial Intelligence , 2018.[LWAL17] Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, and John Langford. Efficient contextual bandits in non-stationary worlds. arXiv preprint arXiv:1708.01799 , 2017.[LZ08] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information.In
Advances in neural information processing systems , pages 817–824, 2008.[PR13] Vianney Perchet and Philippe Rigollet. The multi-armed bandit problem with covariates.
The Annals ofStatistics , 41(2):693–721, 2013.[RMB18] Henry W. J. Reeve, Joe Mellor, and Gavin Brown. The k -nearest neighbour ucb algorithm for multi-armedbandits with covariates. JMLR , 2018.[RS16] Alexander Rakhlin and Karthik Sridharan. Bistro: An efficient relaxation-based method for contextualbandits. In
ICML , pages 1977–1985, 2016.[RZ10] Phillipe Rigollet and Assaf Zeevi. Nonparametric bandits with covariates.
COLT , 2010.[Sli14] Aleksandrs Slivkins. Contextual bandits with similarity information.
The Journal of Machine LearningResearch , 15(1):2533–2568, 2014.[SNK +
08] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Directimportance estimation with model selection and its application to covariate shift adaptation. In
Advancesin neural information processing systems , pages 1433–1440, 2008.[T +
04] Alexander B Tsybakov et al. Optimal aggregation of classifiers in statistical learning.
The Annals ofStatistics , 32(1):135–166, 2004.[WIW18] Qingyun Wu, Naveen Iyer, and Hongning Wang. Learning contextual bandits in a non-stationary envi-ronment. In
The 41st International ACM SIGIR Conference on Research & Development in InformationRetrieval , pages 495–504, 2018.[YZ +
02] Yuhong Yang, Dan Zhu, et al. Randomized allocation with nonparametric estimation for a multi-armedbandit problem with covariates.
The Annals of Statistics , 30(1):100–121, 2002.9elf-Tuning Bandits over Unknown Covariate-Shifts
A P
REPRINT
Appendix A Additional Experiments
Figure 3:
Simulation results for bumps centers chosen uniformly, and for smaller n Q = 500 rounds. For our experiments, we fix a covariate space X = [0 , . Our target covariate distribution is Q X ∼ U ([0 , ) , theuniform on [0 , , and our source covariate distribution P X has density ∝ (cid:107) x (cid:107) γ so that P X , Q X satisfy Definition 4with transfer exponent γ .The reward function common to both P and Q is constructed as the sum of bump functions, each with a circu-lar support disjoint from the other bumps. The bump centers { z k } k =1 are randomly sampled from the Gaussian N ((0 . , . , . · Id ) in Figure 2 and from U ([0 , ) in Figure 3. The radii { r k } k =1 are then chosen in a randomorder to maximize the bump areas. Then, for each of the K = 3 arms, we determined the sign of each of the bumpsrandomly and independently.However, up to this point, all three arms are optimal in the region of the covariate space outside of the bumps since thereward functions are equal there. So, to introduce additional heterogeneity in the top arm identity, each reward function f i was further raised or lowered by a randomly selected height in the range [ − . , . , in the area outside of the bumps.This determines a top arm (Arm 1 in Figure 2 and Arm 2 in Figure 3) in the region outside of the bumps.Thus, the reward functions f i can be written as f i ( x ) ∝ (cid:88) k ± (1 − (cid:107) x − z k (cid:107) /r k ) + Furthermore, Gaussian noise was added to each f i to produce the rewards Y i according to Y i = f i ( X ) + N (0 , . for each i ∈ [ K ] . We generated data using a range of different values for the parameters n Q , n P , γ using the same f i ’sand ran Algorithm 2 on each dataset. The first plot in each of Figure 2 and 3 exhibits the guarantee of increasing pastexperience n P improving the regret, for fixed n Q , γ . The second plot in each figure shows the effect of increasing γ worsening the regret, for fixed n P , n Q , as proclaimed in Theorem 1. Each plot shows the mean and standard deviationof the regret over n Q rounds across simulations. Appendix B Upper Bound for Multiple-Play
B.1 Setup and Result
Our first (warmup) result considers a multiple-play variant where the policy π can pull multiple arms (and observetheir rewards) at every round t . In other words, we then let π t ⊂ [ K ] denote the subset of arms pulled at time t , andaccordingly consider the following multiple-play regret between time n P < n : R Qn ( π ) . = R n P ,n ( π ) . = n (cid:88) t = n P +1 (cid:88) i ∈ π t f (1) ( X t ) − f i ( X t ) . Our adaptive multiple-play variant is presented in Algorithm 1 of Section 4, which takes in a confidence parameter δ ∈ (0 , . We have the following result, where the expectation is over the entire sequence X n , Y n . Proposition 1.
Let π denote the procedure of Algorithm 1, ran, with parameter δ ∈ (0 , , up till time n > n P ≥ ,with n P possibly unknown. Suppose P X has unknown transfer exponent γ w.r.t. Q X , and that the average reward A P
REPRINT function f satisfies a margin condition with unknown α under Q X . Let n Q . = n − n P denote the (possibly unknown)number of rounds after the drift, i.e., over the phase X t ∼ Q X . We have for some constant C > : E R Qn ( π ) ≤ CKn Q (cid:34) min (cid:32)(cid:18) log( K/δ ) n P (cid:19) α +12+ d + γ , (cid:18) log( K/δ ) n Q (cid:19) α +12+ d (cid:33) + log( K/δ ) n Q + nδ (cid:35) . This gives an immediate corollary analogous to Corollary 1.
Corollary 2.
Under the conditions of Theorem 1, letting δ = O (1 /n ) gives us that E R Qn ( π ) ≤ CKn Q (cid:34) min (cid:32)(cid:18) log( Kn ) n P (cid:19) α +12+ d + γ , (cid:18) log( Kn ) n Q (cid:19) α +12+ d (cid:33) + log( Kn ) n Q (cid:35) . B.2 Proof of Proposition 1
Throughout the proof, c , c , . . . will denote positive constants not depending on t or n P , n Q . Here, c . = max(1 , λ − ) is taken from Line 2 of Algorithm 1.First, we justify that the criterion on Line 7 of Algorithm 1 is well defined for rounds t > c log( K/δ ) . We have c ≥ λ − implies for all t > c log( K/δ ) (cid:114) log( K/δ ) t = (cid:115) log( K/δ ) n ( X t ) ≤ λ Thus, the level r = 1 satisfies the constraint on Line 7 of Algorithm 1. Bias-Variance Bound and Covariate Count Analysis.
This first proposition establishes a standard bias-variancebound on the error of a regression function estimate | ˆ f it ( B ) − f i ( x ) | for a bin B , a round t , an arm i ∈ I B , and acovariate x ∈ B . Proposition 2.
Consider any round t > c log( K/δ ) with observed covariate X t , and fix any bin B containing X t .Consider the estimate ˆ f it ( B ) as in Definition 6, and let m t ( B, i ) be defined therein (i.e., the number of times arm i is pulled in B by time t . We then have at round t , that with probability at least − δ with respect to the conditionaldistribution Y t − | X t : ∀ x ∈ B : | ˆ f it ( B ) − f i ( x ) | ≤ (cid:32)(cid:115) log( K/δ ) m t ( B, i ) + λr (cid:33) . Proof.
Fix bin B and let r ∈ R be its side length. If m t ( B, i ) = 0 , then the desired bound is vacuous. Otherwise,recall from Definition 6, ˆ f it ( B ) = 1 m t ( B, i ) (cid:88) X u ∈ B,u ≤ t − ,i ∈ π u Y iu . Here, we note the estimate is formed over all rounds when an arm i was played in B , even if other arms were playedduring those rounds since we are in the multiple-play setup.For the sake of showing a bias-variance bound, define ˜ f it ( B ) . = E Y t − | X t − ( ˆ f it ( B )) = 1 m t ( B, i ) (cid:88) X u ∈ B,u ≤ t − ,i ∈ π u E ( Y iu | X u ) . Triangle inequality then yields | ˆ f it ( B ) − f i ( x ) | ≤ | ˆ f it ( B ) − ˜ f it ( B ) | + | ˜ f it ( B ) − f i ( x ) | . The second term on the RHS is at most λr by the Lipschitz assumption (Assumption 2). Now, fix the values of { m t ( B, i ) } i ∈ [ K ] . By Hoeffding inequality and union bound, the first term on the RHS above satisfies with probabilityat least − δ w.r.t. the distribution of Y t − | X t , { m t ( B, i ) } i ∈ [ K ] , ∀ i ∈ [ K ] : | ˆ f it ( B ) − ˜ f it ( B ) | ≤ (cid:115) log(2 K/δ ) m t ( B, i ) . In fact, the above holds with probability at least − δ w.r.t. the distribution of Y t − | X t by the tower property.11elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
Next, we prove Lemma 1 to show that the regression error bound of Proposition 2 is further bounded by λr t . Proof. (of Lemma 1) First, suppose t is the first round when the subtree rooted at B is visited (by which we mean a binin said subtree is chosen). Then, until round t , each candidate arm was played every time a covariate was observed in B .Thus, m t ( B, i ) = n r t ( X t ) > for all i ∈ I B . This proves the lemma in this case using Proposition 2 since λr t ≥ (cid:115) log( K/δ ) n r t ( X t ) . For an arbitrary round t , let s be the first round that the subtree rooted at B was visited. Then, for any i ∈ I B , m t ( B, i ) ≥ m s ( B, i ) = n r t ( X s ) so that since r s ≤ r t , (cid:115) log( K/δ ) m t ( B, i ) ≤ (cid:115) log( K/δ ) n r t ( X s ) ≤ λr s ≤ λr t . Justifying Arm Eliminations.
For each round t > c log( K/δ ) , define the event G t on which the bound in Proposi-tion 2 holds or G t = (cid:110) ∀ i ∈ I B , x ∈ B : | ˆ f it ( B ) − f i ( x ) | ≤ λr t , B = T r t ( X t ) (cid:111) . This is the “good” event on which our regression function estimates ˆ f it ( B ) are accurate enough to be able to discernwhich arms have low and high rewards. From here on, let B be the selected bin at round t . This first lemma asserts thatan eliminated arm cannot have a better reward than the best candidate arm. To simplify notation, we let ∆ t . = 4 λr t bethe bound mentioned in G t (so that t is the confidence used on Line 11 of Algorithm 1 to discard arms). Proposition 3.
Suppose at round t , under event G t , we select bin B . Then for any two arms i, j ∈ I B and any x ∈ B , ˆ f it ( B ) − ˆ f jt ( B ) > t = ⇒ f i ( x ) > f j ( x ) Proof.
Using the definition of G t , we have f i ( x ) − f j ( x ) ≥ ˆ f it ( B ) − ˆ f jt ( B ) − t > . We obtain two corollaries from Proposition 3. The first is that, under event G t , the best arm at any covariate in a bin B is always retained in I B . The second is that the regret of playing any candidate arm at any point in B is dominated by ∆ t . Corollary 3.
Suppose at round t , under event ( ∩ ts =1 G s ) , we select bin B . Then I B contains the best arm i ∗ ( x ) =argmax j ∈ [ K ] f j ( x ) for all x ∈ B .Proof. If i ∗ ( x ) (cid:54)∈ I B , then i ∗ ( x ) was eliminated at some round s < t when an ancestor bin B (cid:48) ⊃ B was selected.Thus, ˆ f (1) s ( B (cid:48) ) − ˆ f i ∗ ( x ) s ( B (cid:48) ) > s . By Proposition 3, under G s , this implies f (cid:96) ( x ) > f i ∗ ( x ) ( x ) for some arm (cid:96) (cid:54) = i ∗ ( x ) , a contradiction. Corollary 4.
Suppose at round t , under event ( ∩ ts =1 G s ) , we select bin B . Then, both of the following hold for all x ∈ B :1. | f (1) ( x ) − f j ( x ) | ≤ t for all j ∈ I B .2. Either < | f (1) ( x ) − f (2) ( x ) | ≤ t or f j ( x ) = f (1) ( x ) for all j ∈ I B Proof.
Fix x ∈ B , and let ˆ i = argmax j ∈I B ˆ f j ( B ) and i = argmax j ∈ [ K ] f j ( x ) . Using the definition of G t and the fact that i ∈ I B (Corollary 3), we have f (1) ( x ) − f ˆ i ( x ) = f i ( x ) − f ˆ i ( x ) ≤ f i ( x ) − ˆ f ˆ it ( B ) + ∆ t ≤ f i ( x ) − ˆ f it ( B ) + ∆ t ≤ t . A P
REPRINT
To show (1), we have for j ∈ I B , using the definition of G t and the above inequality, f (1) ( x ) − f j ( x ) ≤ f (1) ( x ) − ˆ f jt ( B ) + ∆ t ≤ f (1) ( x ) − ˆ f ˆ it ( B ) + 2∆ t + ∆ t (because j was not eliminated) ≤ f (1) ( x ) − f ˆ i ( x ) + ∆ t + 2∆ t + ∆ t ≤ t . To show (2), we have if I B contains a sub-optimal arm j ∈ I B at x , then by (1), | f (1) ( x ) − f (2) ( x ) | = f (1) ( x ) − f (2) ( x ) ≤ f (1) ( x ) − f j ( x ) ≤ t . Furthermore, f (1) ( x ) (cid:54) = f (2) ( x ) if there is a sub-optimal arm at x . Showing r t is of Optimal Regression Order. We consider the rounds n P + t past the switch time n P and we willbound the regret accrued under event ∩ n p + ts =1 G s by first bounding r n P + t . Lemma 2.
Fix a round n P + t with observed covariate X n P + t . Then, for some c > , with probability at least − δ w.r.t. the distribution of X n P + t − | X n P + t , we have r n P + t ≤ c min (cid:32)(cid:18) log( K/δ ) n P (cid:19) d + γ , (cid:18) log( K/δ ) t (cid:19) d (cid:33) . Proof.
It suffices to show r n P + t ≤ c (cid:16) log( K/δ ) n P (cid:17) d + γ when n P > log( K/δ ) (cid:16) log( K/δ ) t (cid:17) d when t > log( K/δ ) , since this implies the desired result for any c ≥ . We first show r n P + t ≤ c (cid:18) log( K/δ ) t (cid:19) d . The other inequality will have a similar proof. First, we simplify notation for the sake of this proof. At round n P + t , itwill be understood that the observed covariate is represented by X . = X n P + t . We let n ( r ) . = n r ( X ) be the covariatecount at the level r .First, we claim that a level r ∈ R which satisfies n ( r ) ≥ log(1 /δ ) also satisfies, with probability at least − δ , n ( r ) ≥ E ( n ( r ))8 . If E ( n ( r )) < /δ ) , then this is already true. Otherwise, a Chernoff bound gives: P (cid:18) n ( r ) ≤ E ( n ( r )) (cid:19) ≤ exp (cid:18) − E ( n ( r )) (cid:19) ≤ δ. Here, we note the probability measure P is w.r.t. { X i } n P i =1 ∼ P n P X and { X i } n P + t − i = n P +1 ∼ Q t − X .Next, by Assumptions 3 and 4, we have for some c > : E ( n ( r )) = n P P X ( T r ( X )) + ( t − Q X ( T r ( X )) ≥ c ( n P r d + γ + ( t − r d ) . (4)In fact, without loss of generality, we can assume c ≤ − dd C dd d λ . (5)Thus, for a given level r satisfying n ( r ) ≥ log(1 /δ ) , we have with probability at least − δ that (cid:115) log( K/δ ) n ( r ) ≤ (cid:115) K/δ ) c ( t − r d . (6)13elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
Now, let r ∗ ∈ R be the smallest level greater than or equal to λ − d ( c / − d (cid:18) log( K/δ ) t − (cid:19) d . Then, it suffices to show r n P + t ≤ r ∗ . For the next part of the proof, we define n Q ( r ) as the covariate count exclusivelyfrom Q X : n Q ( r ) . = (cid:88) s ∈ [ t − { X n P + s ∈ T r ( X ) } . Next, we claim r ∗ satisfies n Q ( r ) ≥ log(1 /δ ) with probability at least − δ , so that (6) holds for r ∗ . Since t > log( K/δ ) by hypothesis, we have by (5) and Assumption 3 that E ( n Q ( r )) ≥ C d ( t − r ∗ ) d ≥ C d ( t − λ − d d ( c / − d d (cid:18) log( K/δ ) t − (cid:19) d d ≥ /δ ) . Thus, by a Chernoff bound, we have Q X ( n Q ( r ∗ ) < log(1 /δ )) ≤ Q X (cid:18) n Q ( r ∗ ) < E ( n Q ( r )) (cid:19) ≤ exp (cid:18) − E ( n Q ( r )) (cid:19) ≤ δ. Thus, with probability at least − δ , we have n Q ( r ∗ ) ≥ log(1 /δ ) . Then, we have that λr ∗ ≥ (cid:115) K/δ ) c ( t − r ∗ ) d ≥ (cid:115) log( K/δ ) n ( r ∗ ) . This gives us that r n P + t ≤ r ∗ by the minimization on Line 7 of Algorithm 1, as desired.The other inequality r n P + t ≤ c (cid:18) log( K/δ ) n P (cid:19) d + γ , can be shown in a similar fashion to the case above with the appropriate modifications. Specifically, t is replaced with n P , (4) is replaced with the inequality E ( n ( r )) ≥ c n P r d + γ , and n Q ( r ) is replaced with n P ( r ) which is defined as the bin covariate counts from distribution P : n P ( r ) . = (cid:88) s ∈ [ n P ] { X s ∈ T r ( X ) } . Cumulative Regret Bound.
Next, we put the previous conclusions together to bound the cumulative regret bybounding the regret accrued at each round t and then summing over t . For t ≤ n P , define the event E t as the event onwhich the bound of Proposition 2 holds or E t = G t . For rounds t > n P , define the event E t as the event on which thebounds of Proposition 2 and Lemma 2 hold or: E t = G t ∩ (cid:40) r t ≤ c min (cid:32)(cid:18) log( K/δ ) n P (cid:19) d + γ , (cid:18) log( K/δ ) t − n P (cid:19) d (cid:33)(cid:41) . Recall from earlier that ∆ t . = 4 λr t . To sum the regrets across time t , the argument will involve conditioning on theevent E t , on which (1) Algorithm 1 correctly eliminates arms and (2) r t is of the optimal order.Let t > n P and let F t . = ∩ ts =1 E s . Also, to simplify notation, let U t denote U t . = c min (cid:32)(cid:18) log( K/δ ) n P (cid:19) d + γ , (cid:18) log( K/δ ) t − n P (cid:19) d (cid:33) . If n Q ≤ log( K/δ ) , we are already done since the regret is then bounded by K log( K/δ ) , which is the right order.Assume for the rest of the proof that log( K/δ ) < n Q . 14elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
Next, let t > n P be the largest positive integer satisfying c (cid:18) log( K/δ ) t − n P (cid:19) d > δ , where δ is the parameter appearing in the margin assumption (Assumption 3).The regret for the first max( t − n P , log( K/δ )) rounds among rounds { n P + 1 , . . . , n } can be bounded by O ( K log( K/δ )) which is always of the right order. For the rest of the proof, we constrain our attention to theremaining rounds t where we can now assume t − n P > log( K/δ ) and U t ≤ δ .Next, let the event A t be A t . = {| ˆ f (1) t ( B ) − ˆ f (2) t ( B ) | ≤ t , B = T r t ( X t ) } . Conditioned on X t , A t is the event where one arm remains in contention at round t according to Line 11 of Algorithm 1.For the remainder of the proof, let B be the bin that was selected at round t given an understood value of X t .Consider the expected regret of pulling arm j ∈ π t at round t : E X t , Y t f (1) ( X t ) − f j ( X t ) = E X t (cid:16) E X t − , Y t − | X t ( f (1) ( X t ) − f j ( X t ))( F t + F ct )( A t + A ct ) (cid:17) . Next, we consider three different cases depending on whether event F t or F ct holds and whether event A t or A ct holds:1) Suppose event F t ∩ A t holds. Suppose also that there is a suboptimal arm i ∈ I B for which f i ( X t ) < f (1) ( X t ) .Then, by Corollary 4, we have: < | f (1) ( X t ) − f (2) ( X t ) | ≤ t ≤ U t . Furthermore, for any j ∈ I B : | f (1) ( X t ) − f j ( X t ) | ≤ t ≤ U t . This last inequality happens with probability at most C α (6 U t ) α , under X t ∼ Q X , by the margin condition(Definition 3). Thus, we have E X t E X t − , Y t − | X t ( f (1) ( X t ) − f j ( X t )) F t ∩ A t ≤ C α α +1 U α +1 t .
2) Next, on F t ∩ A ct , the pointwise regret is zero by Corollary 3 since I B must contain the optimal arm at X t andno other arms.3) On F ct , the pointwise regret is bounded above by . By Proposition 2 and Lemma 2, this happens withprobability at most P ( F ct ) ≤ P (cid:0) ∪ ts =1 E cs (cid:1) ≤ t (cid:88) s =1 δ. Thus, E X t E X t − , Y t − | X t ( f (1) ( X t ) − f j ( X t )) F ct ≤ tδ .Next, we put the three cases above together. To further simplify notation, we reparametrize the round variable t andinstead consider rounds n P + t where t ∈ [ n Q ] . We have that, for some c > , the cumulative regret over the n Q rounds is then at most E R Qn ( π ) ≤ c K log ( K/δ ) + n Q (cid:88) t =log( K/δ ) min (cid:32)(cid:18) log ( K/δ ) n P (cid:19) α +12+ d + γ , (cid:18) log ( K/δ ) t (cid:19) α +12+ d (cid:33) (7) +( n P + t ) δ ] . (8)First, (cid:80) n Q t =1 ( n P + t ) δ = O ( n Q nδ ) . For the remaining sum, it suffices to bound n Q (cid:88) t =log( K/δ ) (cid:18) log( K/δ ) t (cid:19) α +12+ d . By an integral approximation, we have n Q (cid:88) t =log( K/δ ) (cid:18) log( K/δ ) t (cid:19) α +12+ d ≤ c (cid:90) n Q log( K/δ ) (cid:18) log( K/δ ) z (cid:19) α +12+ d dz If α ≤ d + 1 , this integral, for some c > , is bounded by c n Q (cid:18) log( K/δ ) n Q (cid:19) α +12+ d . Otherwise, it is bounded by O (log( K/δ )) . This concludes the proof.15elf-Tuning Bandits over Unknown Covariate-Shifts A P
REPRINT
Appendix C Proof of Theorem 1
First, it is straightforward to verify that the criterion for choosing r t on Line 8 of Algorithm 2 is well-defined for t > K (cid:100) c log( K/δ ) (cid:101) . Relating Arm-Pull Counts to Covariate Counts.
To obtain an analogue of Lemma 1, we first relate the arm pullcounts m ( B, i ) = m t ( B, i ) to the covariate counts n r ( X t ) at round t . For the following lemma, we drop the subscript t from m t ( B, i ) to simplify notation. We also use Z t to denote the randomness of Algorithm 2 at round t in choosingthe particular arm π t to play. Thus, { Z t } t ∈ N is independent of X n , Y n . Lemma 3.
Fix a round t > K (cid:100) c log( K/δ ) (cid:101) with observed covariate X t and selected bin B . Suppose that t is the firstround that the subtree rooted at B is visited. Then, with probability at least − δ with respect to the distribution of Y t − , { Z s } s Proof. Fix the values of X t , Y t − , I B and fix some i ∈ I B . First, recall: m ( B, i ) = (cid:88) X s ∈ B,s ≤ t − { π s = i } . Next, we recall n r t ( X t ) ≥ K log( K/δ ) by Line 8 of Algorithm 2. Since the subtree rooted at B has not been visiteduntil round t , every round s < t for which a covariate landed in B , we pulled arm i independently with probability atleast /K . Thus, we have E ( m ( B, i ) | X t , Y t − , I B ) ≥ n r t ( X t ) · K ≥ K/δ ) . Then, by a Chernoff bound, we have w.r.t. the distribution of the Z s ’s: P (cid:18) m ( B, i ) ≤ n r t ( X t )2 K (cid:19) ≤ P (cid:18) m ( B, i ) ≤ E ( m ( B, i ) | X t , Y t − , I B )2 (cid:19) ≤ δ/K. By a union bound and the tower property, we have that the event {∀ i ∈ I B : m ( B, i ) ≥ n rt ( X t )2 K } holds with probabilityat least − δ w.r.t. the distribution of Y t − , { Z s } s REPRINT Adaptivity of r t in Single-Play. Next, Proposition 3 and Corollaries 3 and 4 still hold in the single-play settingprovided ∆ t . = 8 λr t and the event G t is now defined as: G t = (cid:110) ∀ i ∈ I B , x ∈ B : | ˆ f it ( B ) − f i ( x ) | ≤ λr t , B = T r t ( X t ) (cid:111) . Thus, it suffices to bound r t for rounds t > n P .Next, we proceed in a nearly identical manner as Lemma 2, with the only technical difference being that our criterion forchoosing the level r t ∈ R on Line 8 of Algorithm 2 involves the extra constraint n r t ( X t ) ≥ K log( K/δ ) comparedto Line 7 of Algorithm 1. It can be shown using an identical concentration argument as the proof of Lemma 2 that thisconstraint is satisfied by the “optimal level”: r ∗ ∝ min (cid:32)(cid:18) K log( K/δ ) n P (cid:19) d + γ , (cid:18) K log( K/δ ) t − n P (cid:19) d (cid:33) . Thus, following the same arguments as the proof of Lemma 2, we have r ∗ (cid:38) r t .Then, the remainder of the proof in summing the regret over rounds t > n P proceeds identically as in the multiple-playcase. Appendix D Multiple Shifts In this section, we give an extension of Theorem 1 to multiple distribution shifts. Let P . = { P j } Nj =1 be a sequence of N source distributions on the covariate-reward pair ( X, Y ) . Each P j satisfies covariate shift with respect to the targetdistribution Q . We then consider bandits with a sequence of shifts P → P → · · · → P N → Q. In this setup, data from each P j is observed for n j consecutive rounds. Then, there are n P . = (cid:80) Nj =1 n j totalrounds played under distributions from the source class P . We then consider the regret R Qn ( π ) of a policy π playing n = n P + n Q total rounds with the last n Q rounds having data observed from Q . Theorem 2. Let π denote the procedure of Algorithm 2, ran, with parameter δ ∈ (0 , , up till time n > n P ≥ , with n P , N, { n j } Nj =1 all possibly unknown. Suppose the marginal of the covariate X under each P j has unknown transferexponent γ j w.r.t. Q X , and that the average reward function f satisfies a margin condition with unknown α under Q X .Let n Q . = n − n P denote the (possibly unkown) number of rounds after the drift, i.e., over the phase X t ∼ Q X . Let γ = (cid:80) Nj =1 γ j · n j n P . 
Adaptivity of $r_t$ in Single-Play. Next, Proposition 3 and Corollaries 3 and 4 still hold in the single-play setting provided $\Delta_t \doteq 8\lambda r_t$ and the event $G_t$ is now defined as:
$$G_t = \Big\{\forall i \in I_B,\, x \in B : |\hat f^i_t(B) - f^i(x)| \le \lambda r_t,\ B = T_{r_t}(X_t)\Big\}.$$
Thus, it suffices to bound $r_t$ for rounds $t > n_P$. We proceed in a nearly identical manner as Lemma 2, the only technical difference being that our criterion for choosing the level $r_t \in R$ on Line 8 of Algorithm 2 involves the extra constraint $n_{r_t}(X_t) \ge K \log(K/\delta)$ compared to Line 7 of Algorithm 1. It can be shown, using an identical concentration argument as in the proof of Lemma 2, that this constraint is satisfied by the "optimal level":
$$r^* \propto \min\left(\left(\frac{K\log(K/\delta)}{n_P}\right)^{\frac{1}{2+d+\gamma}}, \left(\frac{K\log(K/\delta)}{t-n_P}\right)^{\frac{1}{2+d}}\right).$$
Thus, following the same arguments as the proof of Lemma 2, we have $r^* \gtrsim r_t$. The remainder of the proof, in summing the regret over rounds $t > n_P$, proceeds identically to the multiple-play case.

Appendix D Multiple Shifts

In this section, we give an extension of Theorem 1 to multiple distribution shifts. Let $\mathcal{P} \doteq \{P_j\}_{j=1}^N$ be a sequence of $N$ source distributions on the covariate-reward pair $(X,Y)$, each satisfying covariate shift with respect to the target distribution $Q$. We then consider bandits with a sequence of shifts
$$P_1 \to P_2 \to \cdots \to P_N \to Q.$$
In this setup, data from each $P_j$ is observed for $n_j$ consecutive rounds, so that there are $n_P \doteq \sum_{j=1}^N n_j$ total rounds played under distributions from the source class $\mathcal{P}$. We then consider the regret $R_n^Q(\pi)$ of a policy $\pi$ playing $n = n_P + n_Q$ total rounds, with the last $n_Q$ rounds having data observed from $Q$.

Theorem 2. Let $\pi$ denote the procedure of Algorithm 2, run with parameter $\delta \in (0,1)$, up till time $n > n_P \ge 1$, with $n_P, N, \{n_j\}_{j=1}^N$ all possibly unknown. Suppose the marginal of the covariate $X$ under each $P_j$ has unknown transfer exponent $\gamma_j$ w.r.t. $Q_X$, and that the average reward function $f$ satisfies a margin condition with unknown $\alpha$ under $Q_X$. Let $n_Q \doteq n - n_P$ denote the (possibly unknown) number of rounds after the drift, i.e., over the phase $X_t \sim Q_X$. Let $\bar\gamma = \sum_{j=1}^N \gamma_j \cdot \frac{n_j}{n_P}$. We have, for some constant $C > 0$:
$$\mathbb{E}\, R_n^Q(\pi) \le C n_Q\left[\min\left(\left(\frac{K\log(K/\delta)}{n_P}\right)^{\frac{\alpha+1}{2+d+\bar\gamma}}, \left(\frac{K\log(K/\delta)}{n_Q}\right)^{\frac{\alpha+1}{2+d}}\right) + \frac{K\log(K/\delta)}{n_Q} + n\delta\right].$$

Proof Outline. Consider a round $t > n_P$ and any level $r \in R$ such that $n_r(X_t) > \log(1/\delta)$. Similarly to the proof of Lemma 2, by a Chernoff bound and Assumption 4, the covariate count $n_r(X_t)$ satisfies, with probability at least $1-\delta$ for some $c > 0$:
$$n_r(X_t) \ge c\left[(t-1-n_P)\, r^d + \sum_{j=1}^N n_j \cdot r^{d+\gamma_j}\right].$$
Next, since the function $x \mapsto r^x$ is convex for any $r > 0$, by Jensen's inequality we have
$$\sum_{j=1}^N \frac{n_j}{n_P} \cdot r^{d+\gamma_j} \ge r^{d + \frac{1}{n_P}\sum_{j=1}^N n_j\gamma_j} = r^{d+\bar\gamma}.$$
Thus, $n_r(X_t) \ge c\, n_P\, r^{d+\bar\gamma}$. Using this bound, we can extend Lemma 2 to the case of single-play bandits in the same manner as in the proof of Theorem 1 (see Appendix C), except that $\gamma$ is now replaced by $\bar\gamma$. In particular, for a fixed round $n_P + t$ with observed covariate $X_{n_P+t}$, for some $c > 0$, with probability at least $1-\delta$, we have:
$$r_{n_P+t} \le c\, \min\left(\left(\frac{\log(K/\delta)}{n_P}\right)^{\frac{1}{2+d+\bar\gamma}}, \left(\frac{\log(K/\delta)}{t}\right)^{\frac{1}{2+d}}\right).$$
All other parts of the proof of Theorem 1 remain the same.

Appendix E Lower Bound

Here, we establish that the bound of Theorem 1 is minimax optimal, up to log terms, in the case where $K = 2$, over a continuum of regimes of choices of $n_P, n_Q, \gamma, \alpha$.

Our strategy is to use online-to-batch conversion to convert an online algorithm with regret $R_{n_P,n}$ during the last $n_Q$ rounds into a classifier with excess risk of order $R_{n_P,n}/n_Q$. This then implies a conversion from classification lower bounds to bandit lower bounds.

We note that online-to-batch conversion results, which we invoke as a black box, are usually given for i.i.d. sequences of covariate-reward pairs, while we instead consider a setting with a shift in distribution $P \to Q$. Therefore, in much of what follows, we treat the first phase $\{(X_t, Y_t)\}_{t=1}^{n_P} \sim P^{n_P}$ as a separate input randomness $Z$, and apply conversion arguments to the second phase $\{(X_t, Y_t)\}_{t=n_P+1}^{n} \sim Q^{n_Q}$.

First, we claim a bandit policy $\pi$ can be converted to an online classification algorithm where $\pi_t \in [K]$ indicates the predicted label for covariate $X_t$. This requires defining a reward $Y^i$ for each label $i \in [K]$, which is done in Definition 8 below. To simplify notation, we will denote the set of $K = 2$ arms as $\{0, 1\}$.

Definition 8 (Conversion from Labels to Rewards). In the case of binary classification with covariate $X \in \mathcal{X}$ and label $\tilde Y \in \{0,1\}$, we define the reward of arm $i \in \{0,1\}$ as $Y^i \doteq \mathbb{1}\{\tilde Y = i\}$. We use $\mathcal{T}$ to denote a class of tuples $(P,Q)$ of distributions on the covariate-label pair $(X, \tilde Y)$. Each distribution on $(X, \tilde Y)$ then induces a distribution on the covariate-reward pair $(X, Y)$. Let $\mathcal{T}'$ be the class of tuples of distributions on $(X,Y)$ induced by $\mathcal{T}$.

To simplify notation, in what follows, tuples $(P,Q)$ will refer either to tuples in $\mathcal{T}$ or to their one-to-one mapping to tuples in $\mathcal{T}'$, as will be clear from context. We will also let $\{(X_t, \tilde Y_t)\}_{t=1}^m$ be a sequence of covariate-label pairs and let $\{(X_t, Y_t)\}_{t=1}^m$ be the sequence of corresponding covariate-reward pairs. In this constructed bandit problem, the regression function of arm $i$ is $f^i(x) = P(\tilde Y = i \mid X = x)$.
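To make Definition 8 concrete, here is a toy Python rendering (the sampling model $f^1(x) = x$ is a placeholder, not one of the constructions underlying the lower bound): labels are converted to arm rewards so that misclassifying and pulling the suboptimal arm incur the same loss, with the Bayes classifier $h^*$ defined next.

```python
import random

def label_to_rewards(y_label):
    """Definition 8: the reward of arm i is the indicator {label == i}."""
    return (int(y_label == 0),   # Y^0
            int(y_label == 1))   # Y^1

# Toy stream: X uniform on [0,1] and P(label = 1 | X = x) = x (placeholder f^1).
def sample_pair():
    x = random.random()
    y = int(random.random() < x)
    return x, y, label_to_rewards(y)

def bayes(x):
    return int(x >= 0.5)         # h*(x) = 1{f^1(x) >= 1/2} = pi*(x)

x, y, (y0, y1) = sample_pair()
arm = bayes(x)
# Classification loss 1{h(x) != y} equals bandit loss 1 - Y^arm:
print(f"x={x:.2f}, label={y}, rewards=(Y^0={y0}, Y^1={y1}), Bayes arm={arm}")
print("losses agree:", int(arm != y) == 1 - (y0, y1)[arm])
```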
Next, let $h^*(x) \doteq \mathbb{1}\{f^1(x) \ge 1/2\} = \pi^*(x)$ be the Bayes classifier; the excess risk of a classifier $h$ w.r.t. distribution $Q$ is then given as:
$$\mathcal{E}_Q(h) \doteq \mathbb{E}_Q\Big(\mathbb{1}\{h(X) \ne \tilde Y\} - \mathbb{1}\{h^*(X) \ne \tilde Y\}\Big).$$
Consider an arbitrary online learner $\Lambda = \Lambda(Z)$, based on additional randomness $Z$ independent of the training data. We let $\Lambda_1, \Lambda_2, \ldots$ denote the sequentially generated classifiers of $\Lambda$. The regret $R_m(\Lambda)$ of $\Lambda$ over $m$ rounds is then defined as:
$$R_m(\Lambda) \doteq \sum_{t=1}^m \mathbb{1}\big(\Lambda_t(X_t) \ne \tilde Y_t\big) - \mathbb{1}\big(h^*(X_t) \ne \tilde Y_t\big).$$
The next few definitions and results will be stated in terms of an arbitrary online learner $\Lambda$; in Corollary 6, we will specialize $\Lambda$ to the online learner induced by the policy $\pi$.

First, we formalize the type of black-box guarantee on online-to-batch conversion our arguments will rely on. In what follows, let $X^m \doteq \{X_t\}_{t=1}^m$ and $Y^m \doteq \{Y_t\}_{t=1}^m$.

Definition 9. In what follows, let $a \doteq \{a_m\}$, $b \doteq \{b_m\}$ denote bounded sequences in $[0,1]$, indexed over $m \in \mathbb{N}$. An online-to-batch conversion rate is a mapping $F$ from sequences $a \mapsto b$ such that the following holds: if there exists an online learner $\Lambda = \Lambda(Z)$, for additional randomness $Z$, which achieves expected regret $\mathbb{E}_{Z, X^m, Y^m}(R_m(\Lambda)) \le m \cdot a_m$ for some sequence $a$, then there exists a classifier $\hat h = \hat h(\Lambda)$ with excess risk $\mathbb{E}_{Z, X^m, Y^m}(\mathcal{E}_Q(\hat h)) \le (F(a))_m$. Now, for any $b = \{b_m\}$, define the pseudo-inverse
$$F^\dagger(b) \doteq \inf\big\{a \doteq \{a_m\} : (F(a))_m > b_m\big\},$$
where the inf over a set of sequences is defined pointwise over $m \in \mathbb{N}$ (that is, $(F^\dagger(b))_m = \inf\{a_m : (F(a))_m > b_m\}$ for $m \in \mathbb{N}$).
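For intuition on Definition 9, the sketch below instantiates $F$ with the concrete conversion rate derived in Corollary 5 below, and evaluates its pointwise pseudo-inverse against the classification rate $b$ of Theorem 4 (all parameter values are placeholders, and the constant $c$ of Theorem 4 is not explicit, so $c = 1$ here).

```python
import math

def penalty(m):
    # Conversion overhead of Corollary 5 (with delta = 1/m in Theorem 3).
    return 6 * math.sqrt(math.log(2 * m * (m + 1)) / m) + 1 / m

def F(a_m, m):
    """Conversion rate: regret rate a_m -> excess-risk bound (F(a))_m."""
    return a_m + penalty(m)

def F_dagger(b_m, m):
    """Pointwise pseudo-inverse: inf{a_m : F(a_m, m) > b_m}."""
    return max(b_m - penalty(m), 0.0)

# Classification lower-bound rate b_m of Theorem 4, with placeholder parameters:
c, n_P, d, gamma, alpha = 1.0, 10**8, 3, 1.0, 0.5
b = lambda m: c * (n_P ** (d / (d + gamma)) + m) ** (-(alpha + 1) / (2 + d))

for m in (10**6, 10**9, 10**12):
    print(f"m={m:.0e}  b_m={b(m):.3g}  F_dagger(b)_m={F_dagger(b(m), m):.3g}")
# F_dagger(b)_m > 0 is exactly the regime where Lemma 4 yields a bandit lower bound.
```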
Next, we formally define the notion of a minimax lower bound for offline and online classification problems in terms of a rate $\{a_m\}$.

Definition 10. Fix $n_P \in \mathbb{N}$. We say that the class $\mathcal{T}$ (of distribution pairs $(P,Q)$) has a classification minimax lower bound of $b = \{b_m\}$ if the following holds: for any $m \in \mathbb{N}$ and any classifier $\hat h$ learned on data $\{(X_t, \tilde Y_t)\}_{t=1}^m \sim Q^m$ and additional randomness $Z = \{(X'_t, \tilde Y'_t)\}_{t=1}^{n_P} \sim P^{n_P}$,
$$\sup_{(P,Q) \in \mathcal{T}} \mathbb{E}_{Z, X^m, Y^m}\big(\mathcal{E}_Q(\hat h)\big) > b_m.$$
Similarly, a class $\mathcal{T}$ has an online minimax lower bound of $a = \{a_m\}$ if the following holds: for any $m \in \mathbb{N}$ and any online learner $\Lambda = \Lambda(Z)$ trained on data $\{(X_t, \tilde Y_t)\}_{t=1}^m \sim Q^m$ and additional randomness $Z = \{(X'_t, \tilde Y'_t)\}_{t=1}^{n_P} \sim P^{n_P}$, we have:
$$\sup_{(P,Q) \in \mathcal{T}} \mathbb{E}_{Z, X^m, Y^m}\big(R_m(\Lambda)\big) > m \cdot a_m.$$
Given an online-to-batch conversion rate, the next lemma allows us to deduce an online minimax lower bound from a classification minimax lower bound.

Lemma 4 (Minimax Lower Bound Conversion). Suppose $b \doteq \{b_m\}$ denotes a classification minimax lower bound for the class $\mathcal{T}$. Then, if there exists an online-to-batch conversion rate $F$ with $(F^\dagger(b))_m > 0$ for all $m \in \mathbb{N}$, we have that $\frac{1}{2} \cdot F^\dagger(b)$ is an online minimax lower bound for the class $\mathcal{T}$.

Proof. Consider an online learner $\Lambda = \Lambda(Z)$, with additional randomness $Z = \{(X'_t, \tilde Y'_t)\}_{t=1}^{n_P}$, with regret rate $a = \{a_m\}$. For contradiction, suppose there exists $m \in \mathbb{N}$ such that:
$$\sup_{(P,Q) \in \mathcal{T}} \mathbb{E}_{Z, X^m, Y^m}(R_m(\Lambda)) \le m \cdot a_m \le m \cdot \frac{1}{2} \cdot (F^\dagger(b))_m < m \cdot (F^\dagger(b))_m.$$
Then, by the definition of $F$ and the pseudo-inverse $F^\dagger$, there exists a classifier $\hat h = \hat h(\Lambda)$ such that:
$$\sup_{(P,Q) \in \mathcal{T}} \mathbb{E}_{Z, X^m, Y^m}(\mathcal{E}_Q(\hat h)) \le (F(a))_m \le b_m.$$
This contradicts $b$ being a classification minimax lower bound for the class $\mathcal{T}$.

We next specify the online-to-batch conversion rate $F$ that we will use with Lemma 4.

Theorem 3 (Theorem 4 of [CBCG04], paraphrased). Let $\Lambda = \Lambda(Z)$ be an arbitrary online learner, trained on $\{(X_t, \tilde Y_t)\}_{t=1}^m$, with additional randomness $Z$. Then, for any $\delta \in (0,1)$, there exists a classifier $\hat h = \hat h(\Lambda)$, trained on $\{(X_t, \tilde Y_t)\}_{t=1}^m$, such that:
$$\mathbb{P}\left(\mathbb{P}_{(X, \tilde Y) \sim Q}\big(\hat h(X) \ne \tilde Y\big) \ge \frac{1}{m}\sum_{t=1}^m \mathbb{1}\big(\Lambda_t(X_t) \ne \tilde Y_t\big) + 6\sqrt{\frac{\log(2(m+1)/\delta)}{m}} \,\Bigg|\, Z\right) \le \delta.$$

Corollary 5. Let $\Lambda = \Lambda(Z)$ be an online learner trained on data $\{(X_t, \tilde Y_t)\}_{t=1}^m$ with additional input $Z$. Then, there exists a classifier $\hat h = \hat h(\Lambda)$ such that for any distribution on $X^m, Y^m, Z$:
$$\mathbb{E}_{Z, X^m, Y^m}\big(\mathcal{E}_Q(\hat h)\big) \le \frac{\mathbb{E}_{Z, X^m, Y^m}(R_m(\Lambda))}{m} + 6\sqrt{\frac{\log\big(2m(m+1)\big)}{m}} + \frac{1}{m}.$$

Proof. Fix a value of $Z$, set $\delta = 1/m$, and let the event $A$ be as in Theorem 3:
$$A = \left\{\mathbb{P}_{(X, \tilde Y) \sim Q}\big(\hat h(X) \ne \tilde Y\big) \ge \frac{1}{m}\sum_{t=1}^m \mathbb{1}\big(\Lambda_t(X_t) \ne \tilde Y_t\big) + 6\sqrt{\frac{\log(2m(m+1))}{m}}\right\}.$$
Decomposing over $A$ and its complement, using $\mathcal{E}_Q(\hat h) \le 1$ on $A$ with $\mathbb{P}(A \mid Z) \le \delta = 1/m$, and subtracting the Bayes error $\mathbb{E}_{X^m, Y^m}\big(\frac{1}{m}\sum_t \mathbb{1}(h^*(X_t) \ne \tilde Y_t)\big) = \mathbb{P}_Q(h^*(X) \ne \tilde Y)$ from both sides, we have:
$$\mathbb{E}_{X^m, Y^m}\big(\mathcal{E}_Q(\hat h) \mid Z\big) \le \mathbb{E}_{X^m, Y^m}\left(\frac{1}{m}\sum_{t=1}^m \mathbb{1}(\Lambda_t(X_t) \ne \tilde Y_t) - \mathbb{1}(h^*(X_t) \ne \tilde Y_t) \,\Bigg|\, Z\right) + 6\sqrt{\frac{\log(2m(m+1))}{m}} + \frac{1}{m} = \frac{\mathbb{E}_{X^m, Y^m}(R_m(\Lambda) \mid Z)}{m} + 6\sqrt{\frac{\log(2m(m+1))}{m}} + \frac{1}{m}.$$
Taking a further expectation over $Z$ on both sides gives the desired result.

Theorem 1 of [KM18] provides the classification minimax lower bound, which we restate here.

Theorem 4 (Theorem 1 of [KM18]). Let $\mathcal{T}'$ be the class of all tuples $(P,Q)$ of distributions satisfying Assumptions 2 and 3, and Definitions 3 and 4, with some fixed parameters $(\lambda, C_d, d, C_\alpha, \alpha, \delta, C_\gamma, \gamma)$. In what follows, let $\mathcal{T}$ be the one-to-one mapping of $\mathcal{T}'$ to tuples of distributions on covariate-label pairs as in Definition 8. Suppose also that $\alpha \le d$. Then, there exists a constant $c > 0$ such that for any $n_P, n_Q \in \mathbb{N}$ and classifier $\hat h$ learned on $\{(X_t, \tilde Y_t)\}_{t=1}^{n_P} \sim P^{n_P}$ and $\{(X_t, \tilde Y_t)\}_{t=n_P+1}^{n_P+n_Q} \sim Q^{n_Q}$, we have:
$$\sup_{(P,Q) \in \mathcal{T}} \mathbb{E}\big(\mathcal{E}_Q(\hat h)\big) > c\left(n_P^{\frac{d}{d+\gamma}} + n_Q\right)^{-\frac{\alpha+1}{2+d}}.$$

Next, we will take
$$b_m \doteq c\left(n_P^{\frac{d}{d+\gamma}} + m\right)^{-\frac{\alpha+1}{2+d}}$$
as our classification minimax lower bound, where $m$ here stands for $n_Q$. Combining Lemma 4, Corollary 5, and Theorem 4, we obtain the following minimax lower bound for bandits.

Corollary 6 (Matching Lower Bounds over Given Regimes). Let the class $\mathcal{T}'$ and the constant $c > 0$ be as in Theorem 4. Suppose that $n_P, n_Q$ satisfy:
$$6\sqrt{\frac{\log\big(2n_Q(n_Q+1)\big)}{n_Q}} + \frac{1}{n_Q} < \frac{c}{2}\left(n_P^{\frac{d}{d+\gamma}} + n_Q\right)^{-\frac{\alpha+1}{2+d}}. \quad (10)$$
Then, for any fixed such $n_P, n_Q$ and any contextual bandit policy $\pi$, we have:
$$\sup_{(P,Q) \in \mathcal{T}'} \mathbb{E}_{X^n, Y^n}\big(R_{n_P,n}(\pi)\big) \ge \frac{c}{4}\, n_Q \left(n_P^{\frac{d}{d+\gamma}} + n_Q\right)^{-\frac{\alpha+1}{2+d}}.$$
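Before the proof, a quick numeric sketch of the regime carved out by condition (10) (again with placeholder parameters and $c = 1$, since the constant from Theorem 4 is not explicit): it checks (10) along the boundary $n_P = n_Q^{(d+\gamma)/d}$ discussed in Remark 2 below.

```python
import math

def cond_10(n_P, n_Q, d, gamma, alpha, c=1.0):
    """Condition (10): conversion overhead below half the classification rate."""
    lhs = 6 * math.sqrt(math.log(2 * n_Q * (n_Q + 1)) / n_Q) + 1 / n_Q
    rhs = (c / 2) * (n_P ** (d / (d + gamma)) + n_Q) ** (-(alpha + 1) / (2 + d))
    return lhs < rhs

def lower_bound(n_P, n_Q, d, gamma, alpha, c=1.0):
    """Regret lower bound of Corollary 6."""
    return (c / 4) * n_Q * (n_P ** (d / (d + gamma)) + n_Q) ** (-(alpha + 1) / (2 + d))

d, gamma, alpha = 3, 1.0, 0.5                # note alpha < d/2, as in Remark 2
for n_Q in (10**6, 10**9, 10**12):
    n_P = int(n_Q ** ((d + gamma) / d))      # boundary of the first subregime
    ok = cond_10(n_P, n_Q, d, gamma, alpha)
    print(f"n_Q={n_Q:.0e}  (10) holds: {ok}  bound={lower_bound(n_P, n_Q, d, gamma, alpha):.3g}")
# With these placeholders, (10) kicks in only for large n_Q (here, n_Q = 1e12).
```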
Proof. Fix $n_P, n_Q$ satisfying the inequality in (10) and let $n = n_P + n_Q$. Let $\hat\pi = \{\pi_t\}_{t > n_P}$ be the online learner induced by the policy $\pi$ restricted to the second phase $\{(X_t, Y_t)\}_{t=n_P+1}^n$, with additional randomness $Z = \{(X_t, Y_t)\}_{t=1}^{n_P}$. Then, by Corollary 5, there exists a classifier $\hat h$ such that:
$$\mathbb{E}_{X^n, Y^n}\big(\mathcal{E}_Q(\hat h)\big) \le \frac{\mathbb{E}_{X^n, Y^n}(R_{n_Q}(\hat\pi))}{n_Q} + 6\sqrt{\frac{\log\big(2n_Q(n_Q+1)\big)}{n_Q}} + \frac{1}{n_Q}.$$
We then have that the map $F$, defined below on a sequence $a = \{a_m\}$, is an online-to-batch conversion rate:
$$(F(a))_m \doteq a_m + 6\sqrt{\frac{\log\big(2m(m+1)\big)}{m}} + \frac{1}{m}.$$
Let $b_m \doteq c\big(n_P^{\frac{d}{d+\gamma}} + m\big)^{-\frac{\alpha+1}{2+d}}$ be as in Theorem 4. Then, by Theorem 4 and Lemma 4, we have:
$$\sup_{(P,Q) \in \mathcal{T}'} \mathbb{E}_{X^n, Y^n}\big(R_{n_Q}(\hat\pi)\big) \ge \frac{n_Q}{2} \cdot (F^\dagger(b))_{n_Q}.$$
Next, we observe:
$$\mathbb{E}_{X^n, Y^n}\big(R_{n_Q}(\hat\pi)\big) = \mathbb{E}_{X^n, Y^n}\left(\sum_{t=1}^{n_Q} \mathbb{1}\big(\hat\pi_t(X_t) \ne \tilde Y_t\big) - \mathbb{1}\big(h^*(X_t) \ne \tilde Y_t\big)\right) = \mathbb{E}_{X^n, Y^n}\left(\sum_{t=1}^{n_Q} Y_t^{\pi^*(X_t)} - Y_t^{\pi_t(X_t)}\right) = \mathbb{E}_{X^n, Y^n}\big(R_{n_P,n}(\pi)\big).$$
Thus:
$$\sup_{(P,Q) \in \mathcal{T}'} \mathbb{E}_{X^n, Y^n}\big(R_{n_P,n}(\pi)\big) \ge \frac{n_Q}{2}\,(F^\dagger(b))_{n_Q} \ge \frac{n_Q}{2}\left[b_{n_Q} - 6\sqrt{\frac{\log\big(2n_Q(n_Q+1)\big)}{n_Q}} - \frac{1}{n_Q}\right] \ge \frac{c}{4}\, n_Q\left(n_P^{\frac{d}{d+\gamma}} + n_Q\right)^{-\frac{\alpha+1}{2+d}}.$$

Remark 2. The inequality in (10) corresponds to the regime $n_P = \tilde O\big(n_Q^{(d+\gamma)/\alpha}\big)$ with $\alpha < d/2$. In particular, this includes the following subregimes.

• Performance on $Q$ depends mostly on covariates $X_t \sim Q_X$, $t > n_P$. This is the subregime where $n_P \lesssim n_Q^{(d+\gamma)/d}$, roughly, that is, when (in the upper bound of Theorem 1) $\min\big(n_P^{-\frac{\alpha+1}{2+d+\gamma}}, n_Q^{-\frac{\alpha+1}{2+d}}\big) = n_Q^{-\frac{\alpha+1}{2+d}}$, i.e., past experience under $P$ is too short to significantly influence regret under $Q$. The lower bound of Corollary 6 is then of the form $\sup_{(P,Q) \in \mathcal{T}'} \mathbb{E}\big(R_{n_P,n}(\pi)\big) \ge n_Q \cdot n_Q^{-\frac{\alpha+1}{2+d}}$, which confirms that the threshold $n_P = \tilde O\big(n_Q^{(d+\gamma)/d}\big)$ (on when past experience is too short) is indeed tight.

• Performance on $Q$ depends mostly on covariates $X_t \sim P_X$, $t \le n_P$. This is the subregime where $n_Q^{(d+\gamma)/d} \lesssim n_P \lesssim n_Q^{(d+\gamma)/\alpha}$. In other words, $\min\big(n_P^{-\frac{\alpha+1}{2+d+\gamma}}, n_Q^{-\frac{\alpha+1}{2+d}}\big) = n_P^{-\frac{\alpha+1}{2+d+\gamma}}$, i.e., past experience under $P$ significantly influences regret under $Q$. The lower bound of Corollary 6 is then of the form $\sup_{(P,Q) \in \mathcal{T}'} \mathbb{E}\big(R_{n_P,n}(\pi)\big) \ge n_Q \cdot n_P^{-\frac{\alpha+1}{2+d+\gamma}}$,