[PDF] Robust Bandit Learning with Imperfect Context

Abstract

A standard assumption in contextual multi-arm bandit is that the true context is perfectly known before arm selection. Nonetheless, in many practical applications (e.g., cloud resource management), prior to arm selection, the context information can only be acquired by prediction subject to errors or adversarial modification. In this paper, we study a contextual bandit setting in which only imperfect context is available for arm selection while the true context is revealed at the end of each round. We propose two robust arm selection algorithms: MaxMinUCB (Maximize Minimum UCB) which maximizes the worst-case reward, and MinWD (Minimize Worst-case Degradation) which minimizes the worst-case regret. Importantly, we analyze the robustness of MaxMinUCB and MinWD by deriving both regret and reward bounds compared to an oracle that knows the true context. Our results show that as time goes on, MaxMinUCB and MinWD both perform as asymptotically well as their optimal counterparts that know the reward function. Finally, we apply MaxMinUCB and MinWD to online edge datacenter selection, and run synthetic simulations to validate our theoretical analysis.

Full PDF

RRobust Bandit Learning with Imperfect Context

Jianyi Yang, Shaolei Ren

University of California, Riverside { jyang239, shaolei } @ucr.edu Abstract

MaxMinUCB (Maximize Minimum UCB) which maximizesthe worst-case reward, and

MinWD (Minimize Worst-caseDegradation) which minimizes the worst-case regret. Im-portantly, we analyze the robustness of

MaxMinUCB and

MinWD by deriving both regret and reward bounds comparedto an oracle that knows the true context. Our results showthat as time goes on,

MaxMinUCB and

MinWD both per-form as asymptotically well as their optimal counterparts thatknow the reward function. Finally, we apply

MaxMinUCB and

MinWD to online edge datacenter selection, and run syn-thetic simulations to validate our theoretical analysis.

Contextual bandits (Lu, P´al, and P´al 2010; Chu et al. 2011)concern online learning scenarios such as recommendationsystems (Li et al. 2010), mobile health (Lei, Tewari, andMurphy 2014), cloud resource provisioning (Chen and Xu2019), wireless communications (Saxena et al. 2019), inwhich arms (a.k.a., actions) are selected based on the under-lying context to balance the tradeoff between exploitationof the already learnt knowledge and exploration of uncer-tain arms (Auer et al. 2002; Auer, Cesa-Bianchi, and Fischer2002; Bubeck and Cesa-Bianchi 2012; Dani et al. 2008).The majority of the existing studies on contextual bandits(Chu et al. 2011; Valko et al. 2013; Saxena et al. 2019) as-sume that a perfectly accurate context is known before eacharm selection. Consequently, as long as the agent learns in-creasingly more knowledge about reward, it can select armswith lower and lower average regrets. In many cases, how-ever, the perfect (or true) context is not available to the agentprior to arm selection. Instead, the true context is revealedafter taking an action at the end of each round (Kirschner and

Krause 2019), but can be predicted using predictors, such astime series prediction(Brockwell et al. 2016; Gers, Schmid-huber, and Cummins 2000), to facilitate the agent’s arm se-lection. For example, in wireless communications, the chan-nel condition is subject to various attenuation effects (e.g.,path loss and small-scale multi-path fading), and is criticalcontext information for the transmitter conﬁguration suchas modulation and rate adaption (i.e., arm selection) (Gold-smith 2005; Saxena et al. 2019). But, the channel conditioncontext is predicted and hence can only be coarsely knownuntil the completion of transmission. For another example,the exact workload arrival rate is crucial context informa-tion for cloud resource management, but cannot be knownuntil the workload actually arrives. Naturally, context pre-diction is subject to prediction errors. Moreover, it can alsoopen a new attack surface — an outside attacker may adver-sarially modify the predicted context. For example, a recentstudy (Chen, Tan, and Zhang 2019) shows that the energyload predictor in smart grid can be adversarially attacked toproduce load estimates with higher-than-usual errors. Moremotivating examples are provided in (Yang and Ren 2021).In general, imperfectly predicted and even adversarially pre-sented context is very common in practice.As motivated by practical problems, we consider a banditsetting where the agent receives imperfectly predicted con-text and selects an arm at the beginning of each round andthe context is revealed after arm selection. We focus on ro-bust arm optimization given imperfect context, which is ascrucial as robust reward function estimation or explorationin contextual bandits (Dud´ık, Langford, and Li 2011; Neuand Olkhovskaya 2020; Zhu et al. 2018). Concretely, withimperfect context, our goal is to select arms online in a ro-bust manner to optimize the worst-case performance in aneighborhood domain with the received imperfect contextas center and a defense budget as radius. In this way, therobust arm selection can defend against the imperfect con-text error ( from either context prediction error or adversarialmodiﬁcation) constrained by the budget.Importantly and interestingly, given imperfect context,maximizing the worst-case reward (referred to as type-I ro-bustness objective) and minimizing the worst-case regret(referred to as type-II robustness objective) can lead to dif-ferent arms, while they are the same under the setting ofperfect context (Saxena et al. 2019; Li et al. 2010; Slivkins a r X i v : . [ c s . L G ] M a r MaxMinUCB (MaximizeMinimum UCB), which maximizes the worst-case rewardfor type-I objective, and

MinWD (Minimize Worst-caseDegradation), which minimizes the worst-case regret fortype-II objective. The challenge of algorithm designs is thatthe agent has no access to exact knowledge of reward func-tion but the estimated counterpart based on history collecteddata. Thus, in our design,

MaxMinUCB maximizes the lowerbound of reward, while

MinWD minimizes the upper boundof regret.We analyze the robustness of

MaxMinUCB and

MinWD by deriving both regret and reward bounds, compared to astrong oracle that knows the true context for arm selectionas well as the exact reward function. Importantly, our re-sults show that, while a linear regret term exists for both

MaxMinUCB and

MinWD due to imperfect context, theadded linear regret term is actually the same as the amount ofregret incurred by respectively optimizing type-I and type-II objectives with perfect knowledge of the reward function.This implies that as time goes on,

MaxMinUCB and

MinWD will asymptotically approach the corresponding optimizedobjectives from the reward and regret views, respectively.Finally, we apply

MaxMinUCB and

MinWD to the prob-lem of online edge datacenter selection and run syntheticsimulations to validate our theoretical analysis.

Contextual bandits.

Linear contextual bandit learning isconsidered in LinUCB by (Li et al. 2010). . The study(Abbasi-Yadkori, P´al, and Szepesv´ari 2011) improves theregret analysis of linear contextual bandit learning, whilethe studies (Agrawal and Goyal 2012, 2013) solve this prob-lem by Thompson sampling and give a regret bound. Thereare also studies to extend the algorithms to general rewardfunctions like non-linear functions, for which kernel methodis exploited in GP-UCB (Srinivas et al. 2010), Kernel-UCB (Valko et al. 2013), IGP-UCB and GP-TS (Chowdhuryand Gopalan 2017; Deshmukh, Dogan, and Scott 2017).Nonetheless, a standard assumption in these studies is thatperfect context is available for arm selection, whereas im-perfect context is common in many practical applications(Kirschner et al. 2020).

Adversarial bandits and Robustness.

The prior studieson adversarial bandits (Auer and Chiang 2016; Jun et al.2018; Altschuler, Brunel, and Malek 2019; Liu and Shroff 2019) have primarily focused on that the adversary ma-liciously presents rewards to the agent or directly injectserrors in rewards. Moreover, many studies (Audibert andBubeck 2009; Gerchinovitz and Lattimore 2016) considerthe best constant policy throughout the entire learning pro-cess as the oracle, while in our setting the best arm de-pends on the true context at each round. The adversarial set-ting has also been extended to contextual bandits (Neu andOlkhovskaya 2020; Syrgkanis, Krishnamurthy, and Schapire2016; Han et al. 2020).Recently, robust bandit algorithms have been proposed forvarious adversarial settings. Some focus on robust rewardestimation and exploration (Altschuler, Brunel, and Malek2019; Guan et al. 2020; Dud´ık, Langford, and Li 2011), andothers train a robust or distributionally robust policy (Wuet al. 2016; Syrgkanis, Krishnamurthy, and Schapire 2016;Si et al. 2020b,a). Our study differs from the existing ad-versarial bandits by seeking two different robust algorithmsgiven imperfect (and possibly adversarial) context.

Optimization and bandits with imperfect context. (Rakhlin and Sridharan 2013) considers online optimizationwith predictable sequences and (Jadbabaie et al. 2015) fo-cuses on adaptive online optimization competing with dy-namic benchmarks. Besides, (Chen et al. 2014; Jiang et al.2013) study the robust optimization of mini-max regret.These studies assume perfectly known cost functions with-out learning. A recent study (Bogunovic et al. 2018) consid-ers Bayesian optimization and aims at identifying a worst-case good input region with input perturbation (which canalso model a perturbed but ﬁxed environment/context pa-rameter). The study (Wang, Wu, and Wang 2016) consid-ers the linear bandit where certain context features are hid-den, and uses iterative methods to estimate hidden contextsand model parameters. Another recent study (Kirschner andKrause 2019) assumes the knowledge of context distribu-tion for arm selection, and considers a weak oracle thatalso only knows context distribution. The relevant papers(Kirschner et al. 2020) and (Nguyen et al. 2020) considerrobust Bayesian optimizations where context distribution in-formation is imperfectly provided, and propose to maximizethe worst-case expected reward for distributional robustness.Although the objective of

MaxMinUCB in our paper is simi-lar to the robust optimization objectives in the two papers,we additionally derive a lower bound for the true rewardin our analysis, which provides another perspective on therobustness of arm selection. More importantly, consideringthat the objectives in the two relevant papers are equivalentto minimizing a pseudo robust regret, we propose

MinWD and derive an upper bound for the incurred true regret.

Assume that at the beginning of round t , the agent receivesimperfect context ˆ x t ∈ X which is exogenously providedand not necessarily the true context x t . Given the imperfectcontext ˆ x t ∈ X and an arm set A , the agent selects an arm a t ∈ A for round t . Then, the reward y t along with the truecontext x t is revealed to the agent at the end of round t .Assume that X × A ⊆ R d , and we use x a t ,t to denote the d -dimensional concatenated vector [ x t , a t ] .he reward y t received by the agent in round t is jointlydecided by the true context x t and selected arm a t , and canbe expressed as follows y t = f ( x t , a t ) + n t , (1)where f : X × A → R is the reward function, X isthe context domain, and n t is the noise term. We assumethat the reward function f belongs to a reproducing ker-nel Hilbert space (RKHS) H generated by a kernel func-tion k : ( X × A ) × ( X × A ) → R . In this RKHS,there exists a mapping function φ : ( X × A ) → H whichmaps context and arm to their corresponding feature in H . By reproducing property, we have k ([ x, a ] , [ x (cid:48) , a (cid:48) ]) = (cid:104) φ ( x, a ) , φ ( x (cid:48) , a (cid:48) ) (cid:105) and f ( x, a ) = (cid:104) φ ( x, a ) , θ (cid:105) where θ isthe representation of function f ( · , · ) in H . Further, as com-monly considered in the bandit literature (Slivkins 2019;Li et al. 2010), the noise n t follows b -sub-Gaussian dis-tribution for a constant b ≥ , i.e. conditioned on the ﬁl-tration F t − = { x τ , y a,τ , a τ , τ = 1 , · · · , t − } , ∀ σ ∈ R , E [ e σn t |F t − ] ≤ exp (cid:16) σ b (cid:17) . Without knowledge of reward function f , bandit al-gorithms are designed to decide an arm sequence { a t , t = 1 , · · · , T } to minimize the cumulative regret R T = T (cid:88) t =1 f ( x t , A ∗ ( x t )) − f ( x t , a t ) , (2)where A ∗ ( x t ) = arg max a ∈A f ( x t , a ) is the oracle-optimalarm at round t given the true context x t . When the receivedcontexts are perfect, i.e. ˆ x t = x t , minimizing the cumula-tive regret is equivalent to maximizing the cumulative re-ward F T = (cid:80) Tt =1 f ( x t , a t ) . The context error can come from a variety of sources, includ-ing imperfect context prediction algorithms and adversarialcorruption (Kirschner et al. 2020; Chen, Tan, and Zhang2019) on context. We simply use context error to encap-sulate all the error sources without further differentiation.We assume that context error (cid:107) x t − ˆ x t (cid:107) , where (cid:107) · (cid:107) is acertain norm (Bogunovic et al. 2018), is less than ∆ . Also, ∆ is referred to as the defense budget and can be consid-ered as the level of robustness/safeguard that the agent in-tends to provide against context errors: with a larger ∆ , theagent wants to make its arm selection robust against largercontext errors (at the possible expense of its reward). Atime-varying error budget can be captured by using ∆ t . De-note the neighborhood domain of context x as B ∆ ( x ) = { y ∈ X | (cid:107) y − x (cid:107) ≤ ∆ } . Then, we have the true context x t ∈ B ∆ (ˆ x t ) , where ˆ x t is available to the agent. Reward estimation is critical for arm selection. Kernelridge regression, which is widely used in contextual bandits(Slivkins 2019) serves as the reward estimation method inour algorithm designs. By kernel ridge regression, the esti-mated reward given arm a and context x is expressed as ˆ f t ( x, a ) = k Tt ( x, a )( K t + λ I ) − y t (3) where I is an identity matrix, y t ∈ R t − contains the his-tory of y τ , k t ( x, a ) ∈ R t − contains k ([ x, a ] , [ x τ , a τ ]) ,and K t ∈ R ( t − × ( t − contains k ([ x τ , a τ ] , [ x τ (cid:48) , a τ (cid:48) ]) , for τ,τ (cid:48) ∈ { , · · · , t − } .The conﬁdence width of kernel ridge regression is givenin the following concentration lemma followed by a deﬁni-tion of reward UCB. Lemma 1 (Concentration of Kernel Ridge Regression) . As-sume that the reward function f ( x, a ) satisﬁes | f ( x, a ) | ≤ B , the noise n t satisﬁes a sub-Gaussian distribution withparameter b , and kernel ridge regression is used to esti-mate the reward function. With a probability of at least − δ, δ ∈ (0 , , for all a ∈ A and t ∈ N , the es-timation error satisﬁes | ˆ f t ( x, a ) − f ( x, a ) | ≤ h t s t ( x, a ) , where h t = √ λB + b (cid:112) γ t − δ ) , γ t = log det( I + K t /λ ) ≤ ¯ d log(1 + t ¯ dλ ) and ¯ d is the rank of K t . Let V t = λ I + (cid:80) ts =1 φ ( x, a ) φ ( x, a ) (cid:62) , the squared conﬁdencewidth is given by s t ( x, a ) = φ ( x, a ) (cid:62) V − t − φ ( x, a ) = λ k ([ x, a ] , [ x, a ]) − λ k t ( x, a ) T ( K t + λ I ) − k t ( x, a ) . Deﬁnition 1.

Given arm a ∈ A and context x ∈ X ,the reward UCB (Upper Conﬁdence Bound) is deﬁned as U t ( x, a ) = ˆ f t ( x, a ) + h t s t ( x, a ) . The next lemma shows that the term s t ( x t , a t ) has a van-ishing impact on regret over time. Lemma 2.

The sum of conﬁdence widths given x t for t ∈{ , · · · , T } satisﬁes (cid:80) Tt =1 s t ( x t , a t ) ≤ γ T , where γ T =log det( I + K T /λ ) ≤ ¯ d log(1 + T ¯ dλ ) and ¯ d is the rank of K T . Then, we give the deﬁnition of

UCB-optimal arm whichis important in our algorithm designs.

Deﬁnition 2.

Given context x ∈ X , the UCB-optimal armis deﬁned as A † t ( x ) = arg max a ∈A U t ( x, a ) . Note that if the received contexts are perfect, i.e. ˆ x t = x t ,the standard contextual UCB strategy selects arm at round t as A † t ( x t ) . Under the cases with imperfect context, a naivepolicy (which we call SimpleUCB ) is simply oblivious ofcontext errors, i.e. the agent selects the UCB-optimal armregarding imperfect context ˆ x t , denoted as a † t = A † t (ˆ x t ) ,by simply viewing the imperfect context ˆ x t as true context.Nonetheless, if we want to guarantee the arm selection per-formance even in the worst case, robust arm selection thataccounts for context errors is needed. In the existing bandit literature such as (Auer and Chiang2016; Han et al. 2020; Li et al. 2010), maximizing the cu-mulative reward is equivalent to minimizing the cumulativeregret, under the assumption of perfect context for arm se-lection. In this section, we will show that maximizing theworst-case reward is equivalent to minimizing a pseudo re-gret and is different from minimizing the worst-case true re-gret. I CRntext R e w a r d arm 1arm 2 c I CRntext R e g r e t arm 1arm 2 (a) Type-I Robustness c I CRntext R e w a r d arm 1arm 2 c I CRntext R e g r e t arm 1arm 2 (b) Type-II Robustness Figure 1: Illustration of reward and regret functions that Type-I and Type-II robustness objectives are suitable for, respectively.The golden dotted vertical line represents the imperfect context c I , and the gray region represents the defense region B ∆ ( c I ) . With imperfect context, one approach to robust arm selec-tion is to maximize the worst-case reward. With perfectknowledge of reward function, the oracle arm that maxi-mizes the worst-case reward at round t is ¯ a t = arg max a ∈A min x ∈B ∆ (ˆ x t ) f ( x, a ) . (4)For analysis in the following sections, given ¯ a t , the corre-sponding context for the worst-case reward is denoted as ¯ x t = arg min x ∈B ∆ (ˆ x t ) f ( x, ¯ a t ) , (5)and the resulting optimal worst-case reward is denoted as M F t = f (¯ x t , ¯ a t ) . (6)Next, Type-I robustness objective is deﬁned based on thedifference (cid:80) Tt =1 M F t − F T , where F T = (cid:80) Tt =1 f ( x t , a t ) is the actual cumulative reward. Deﬁnition 3.

If, with an arm selection strategy { a , · · · , a T } , the difference between the optimal cu-mulative worst-case reward and the cumulative true reward (cid:80) Tt =1 M F t − F T is sub-linear with respect to T , then thestrategy achieves Type-I robustness.If an arm selection strategy achieves Type-I robustness,the lower bound for the true reward f ( x t , a t ) approachesthe optimal worst-case reward M F t in the defense regionas t increases. Therefore, a strategy achieving type-I robust-ness objective can prevent very low reward. For example,in Fig. 1(a), arm 1 is the one that maximizes the worst-casereward, which is not necessarily optimal but always avoidsextremely low reward under any context in the defense re-gion.Note that maximizing the worst-case reward is equivalentto minimizing the robust regret deﬁned in (Kirschner et al.2020), which is written using our formulation as ¯ R T = T (cid:88) t =1 min x ∈B ∆ (ˆ x t ) f ( x, ¯ a t ) − min x ∈B ∆ (ˆ x t ) f ( x, a t ) . (7)However, this robust regret is a pseudo regret because the re-wards of oracle arm ¯ a t and selected arm a t are compared un-der different contexts (i.e., their respective worst-case con-texts), and it is not an upper or lower bound of the true regret R T . To obtain a robust regret performance, we need to deﬁneanother robustness objective based on the true regret. To provide robustness for the regret with imperfect context,we can minimize the cumulative worst-case regret, which isexpressed as ˜ R T = T (cid:88) t =1 max x ∈B ∆ (ˆ x t ) [ f ( x, A ∗ ( x )) − f ( x, a t )] . (8)Clearly, the true regret R T ≤ ˜ R T , and minimizing the worst-case regret is equivalent to minimizing an upper bound forthe true regret. Deﬁne the instantaneous regret function withrespect to context x and arm a as r ( x, a ) = f ( x, A ∗ ( x )) − f ( x, a ) . Since given the reward function the optimization isdecoupled among different rounds, the robust oracle arm tominimize the worst-case regret at round t is ˜ a t = arg min a ∈A max x ∈B t (ˆ x t ) r ( x, a ) . (9)For analysis in the following sections, given ˜ a t , the corre-sponding context for the worst-case regret is denoted as ˜ x t = arg max x ∈B ∆ (ˆ x t ) r ( x, ˜ a t ) , (10)and the resulting optimal worst-case regret is M R t = r (˜ x t , ˜ a t ) . (11)Now, we can give the deﬁnition of Type-II robustness as fol-lows. Deﬁnition 4.

If, with an arm selection strategy { a , · · · , a T } , the difference between the cumulativetrue regret and the optimal cumulative worst-case regret R T − (cid:80) Tt =1 M R t is sub-linear with respect to T , then thestrategy achieves Type-II robustness.If an arm selection strategy achieves Type-II robust-ness, as time increases, the upper bound for the true re-gret r ( x t , a t ) also approaches the optimal worst-case regret M R t . Hence, a strategy achieving type-II robustness objec-tive can prevent a high regret. As shown in Fig. 1(b), arm 1is selected by minimizing the worst-case regret, which is arobust arm selection because the regret of arm 1 under anycontext in the defense region is not too high. The two types of robustness correspond to the algorithmsmaximizing the worst-case reward and minimizing the lgorithm 1

Robust Arm Selection with Imperfect Context

Input:

Context error budget ∆ for t = 1 , · · · , T do Receive imperfect context ˆ x t .Select arm a I t to solve Eqn. (12) in MaxMinUCB ; orselect arm a II t to solve Eqn. (16) in MinWD

Observe the true context x t and the reward y t . end for worst-case regret, respectively. In many cases, they result indifferent arm selections. Take the two scenarios in Fig. 1 asexamples. In the scenario of Fig. 1(a), given the defense re-gion, arm 1 is selected by maximizing the worst-case rewardand arm 2 is selected by minimizing the worst-case regret. Itcan be observed that the worst-case regrets of the two armsare very close, but the worst-case reward of arm 2 is muchlower than that of arm 1. Thus, the strategy of maximizingthe worst-case reward is more suitable for this scenario. Dif-ferently, in the scenario of Fig. 1(b), arm 2 is selected bymaximizing the worst-case reward and arm 1 is selected byminimizing the worst-case regret. Since the worst-case re-wards of the two arms are very close and the worst-case re-gret of arm 2 is much larger than arm 1, it is more suitableto minimize the worst-case regret. In this section, we propose two robust arm selection algo-rithms: (1)

MaxMinUCB (Maximize Minimum Upper Con-ﬁdence Bound), which aims to maximize the minimum re-ward (Type-I robustness objective); and (2)

MinWD (Mini-mize Worst-case Degradation), which aims to minimize themaximum regret (Type-II robustness objective). We derivethe regret and reward bounds for both algorithms and theproofs are available in (Yang and Ren 2021).

MaxMinUCB : Maximize Minimum UCB

Algorithm

To achieve type-I robustness,

MaxMinUCB inAlgorithm 1 selects an arm a I t by maximizing the minimumUCB within the defense region B ∆ (ˆ x t ) : a I t = arg max a ∈A min x ∈B ∆ (ˆ x t ) U t ( x, a ) . (12)The corresponding context that attains the minimum UCB inEqn.(12) is x I t = min x ∈B ∆ (ˆ x t ) U t (cid:0) x, a I t (cid:1) . Analysis

The next theorem gives a lower bound of the cu-mulative true reward of

MaxMinUCB in terms of the optimalworst-case reward and a sub-linear term.

Theorem 3. If MaxMinUCB is used to select arms with im-perfect context, then for any true contexts x t ∈ B ∆ (ˆ x t ) atround t, t = 1 , · · · , T , with a probability of − δ, δ ∈ (0 , ,we have the following lower bound on the worst-case cumu-lative reward F T ≥ T (cid:88) t =1 M F t − h T (cid:114) T ¯ d log(1 + T ¯ dλ ) (13) where M F t is the optimal worst-case reward in Eqn. (6) , ¯ d is the rank of K t and h T is given in Lemma 1. Remark 1.

Theorem 3 shows that by

MaxMinUCB , the dif-ference between the optimal cumulative worst-case rewardand the cumulative true reward is sub-linear and thus effec-tively achieves Type-I robustness according to Deﬁnition 3.This means that the reward by

MaxMinUCB has a boundedsub-linear gap compared to the optimal worst-case reward (cid:80) Tt =1 M F t obtained with perfect knowledge of the rewardfunction. (cid:3) We are also interested in the cumulative true regret of

MaxMinUCB which is given in the following corollary.

Corollary 3.1. If MaxMinUCB is used to select arms withimperfect context, then for any true contexts x t ∈ B ∆ (ˆ x t ) atround t, t = 1 , · · · , T , with a probability of − δ, δ ∈ (0 , ,we have the following bound on the cumulative true regretdeﬁned in Eqn. (2) : R T ≤ T (cid:88) t =1 M R t + 2 h T (cid:114) T ¯ d log(1 + T ¯ dλ ) (14) where M R t = max x ∈B ∆ (ˆ x t ) f ( x, A ∗ ( x )) − M F t , M F t isthe optimal worst-case reward in Eqn. (6) . Remark 2.

Corollary 3.1 shows that the worst-case regretby

MaxMinUCB can be quite larger than the optimal worst-case regret

M R t given in Eqn. (11) (Type-II robustness ob-jective). Actually, despite being robust in terms of rewards,arms selected by MaxMinUCB can still have very large re-gret as shown in Fig. 1(b). Thus, to achieve type-II robust-ness, it is necessary to develop an arm selection algorithmthat minimizes the worst-case regret.

MinWD : Minimize Worst-case Degradation

Algorithm

MinWD is designed to asymptotically min-imize the worst-case regret. Without the oracle knowl-edge of reward function,

MinWD performs arm selectionbased on the upper bound of regret. Denote D a ( x ) = U t (cid:16) x, A † t ( x ) (cid:17) − U t ( x, a ) referred to as UCB degradation at context x . By Lemma 1, the instantaneous true regret canbe bounded as r ( x t , a t ) ≤ [ D a t ( x t )+2 h t s t ( x t , a t )] ≤ D a t +2 h t s t ( x t , a t ) , (15)where D a t = max x ∈B ∆ (ˆ x t ) D a ( x ) is called the worst casedegradation , and h t s t ( x t , a t ) has a vanishing impact byLemma 2. Thus, to minimize worst-case regret, MinWD minimizes its upper bound D a t excluding the vanishing term h t s t ( x t , a t ) , i.e. a II t = min a ∈A max x ∈B ∆ (ˆ x t ) (cid:110) U t (cid:16) x, A † t ( x ) (cid:17) − U t ( x, a ) (cid:111) . (16)The context that attains the worst case in Eqn. (16) is writtenas x II t = arg max x ∈B ∆ (ˆ x t ) D a II t ( x ) . Analysis

Given arm a II t selected by MinWD , the nextlemma gives an upper bound of worst-case degradation. emma 4. If MinWD is used to select arms with imperfectcontext, then for each t = 1 , , · · · , T , with a probability atleast − δ, δ ∈ (0 , , we have D a II t ,t ≤ M R t + 2 h t s t (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) , (17) where M R t is the optimal worst-case regret deﬁned inEqn. (11) , ˙ x t = arg max x ∈B ∆ (ˆ x t ) D ˜ a t ( x ) is the context thatmaximizes the degradation given the arm ˜ a t deﬁned for theoptimal worst-case regret in Eqn. (10) . Then, in order to show that D a II t ,t approaches M R t , weneed to prove that h t s t (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) vanishes as t in-creases. But, this is difﬁcult because the considered se-quence (cid:110) ˙ x t , A † t ( ˙ x t ) (cid:111) is different from the actual sequenceof context and selected arms (cid:8) x t , a II t (cid:9) under MinWD . Tocircumvent this issue, we ﬁrst introduce the concept of (cid:15) − covering (Wu 2016). Denote Φ =

X × A as the context-armspace. If a ﬁnite set Φ (cid:15) is an (cid:15) − covering of the space Φ , thenfor each ϕ ∈ Φ , there exists at least one ¯ ϕ ∈ Φ (cid:15) satisfying (cid:107) ϕ − ¯ ϕ (cid:107) ≤ (cid:15) . Denote C (cid:15) ( ¯ ϕ ) = { ϕ | (cid:107) ϕ − ¯ ϕ (cid:107) ≤ (cid:15) } as thecell with respect to ¯ ϕ ∈ Φ (cid:15) . Since the dimension of the en-tries in Φ is d , the size of the Φ (cid:15) is | Φ (cid:15) | ∼ O (cid:0) (cid:15) d (cid:1) . Besides,we assume the mapping function φ is Lipschitz continuous,i.e. ∀ x, y ∈ Φ , (cid:107) φ ( x ) − φ ( y ) (cid:107) ≤ L φ (cid:107) x − y (cid:107) . Next, weprove the following proposition to bound the sum of conﬁ-dence widths under some conditions. Proposition 5.

The conditions in Proposition 5 guarantee thatthe time interval between the events that true context-armfeature lies in the same cell is not larger than (cid:6) κ/(cid:15) d (cid:7) , whichis proportional to the size of the (cid:15) -covering | Φ (cid:15) | . That means,similar contexts and selected arms occur in the true sequencerepeatedly if T is large enough. If contexts are sampledfrom a bounded space X with some distribution, then simi-lar contexts will occur repeatedly. Also, note that the arm inour considered sequence A † t ( ˙ x t ) is the UCB-optimal arm,which becomes close to the optimal arm for ˙ x t if the conﬁ-dence width is sufﬁciently small. Hence, there exists somecontext error budget sequence { ∆ t } such that, starting from a certain round T , the two conditions are satisﬁed. The twoconditions in Proposition 5 are mainly for theoretical analy-sis of MinWD .By Lemma 4 and Proposition 5, we bound the cumulativeregret of

MinWD . Theorem 6. If MinWD is used to select arms with imperfectcontext and as time goes on, and the conditions in Proposi-tion 5 are satisﬁed, then for any true context x t ∈ B ∆ (ˆ x t ) atround t, t = 1 , · · · , T , with a probability of − δ, δ ∈ (0 , ,we have the following bound on the cumulative true regret: R T ≤ T (cid:88) t M R t + 2 h T T (cid:115)(cid:18) d log (cid:18) T ˜ dλ (cid:19) + 1 λ (cid:19) +4 (cid:114) λ L φ κ d h T T − d + 2 h T (cid:114) T ¯ d log(1+ T ¯ dλ ) , where M R t is the optimal worst-case regret for round t inEqn. (11) , d is the dimension of x a t ,t , ˜ d is the effective di-mension deﬁned in the proof of Proposition 5, ¯ d is the rankof K t and h T is given in Lemma 1. Remark 4.

Theorem 6 shows that by

MinWD , R T − (cid:80) Tt =1 M R t is sub-linear w.r.t. T and thus Type-II robust-ness is effectively achieved according to Deﬁnition 4. Thismeans the true regret bound approaches (cid:80) Tt M R t , the opti-mal worst-case regret, asymptotically.Next, in parallel with MaxMinUCB , we derive the boundof true reward for

MinWD . Corollary 6.1. If MinWD is used to select arms with imper-fect context and as time goes on, and the true sequence ofcontext and arm obeys the conditions in Proposition 5, thenfor any true contexts x t ∈ B ∆ (ˆ x t ) at round t, t = 1 , · · · , T ,with a probability of − δ, δ ∈ (0 , , we have the followinglower bound of the cumulative reward F T ≥ T (cid:88) t =1 [ M F t − M R t ] − h T T (cid:115)(cid:18) d log (cid:18) T ˜ dλ (cid:19) +1 λ (cid:19) − (cid:114) λ L φ κ d h T T − d − h T (cid:114) T ¯ d log(1+ T ¯ dλ ) , where M R t is the optimal worst-case regret for round t inEqn. (11) , d is the dimension of x a t ,t , ˜ d is the effective di-mension deﬁned in the proof of Proposition 5, ¯ d is the rankof K t , and h T is given in Lemma 1. Remark 5.

Corollary 6.1 shows that as t becomes suf-ﬁciently large, the difference between the optimal worst-case reward and the true reward of the selected arm is nolarger than the optimal worst-case regret M R t . With perfectcontext, we have M R t = 0 , and hence MaxMinUCB and

MinWD both asymptotically maximize the reward, imply-ing that these two types of robustness are the same underperfect context.

We summarize our analysis of

MaxMinUCB and

MinWD in Table 1, while the algorithms details are available in Al-gorithm 1. In the table, d is the dimension of context-arm

200 400 600 800 1000Time Slots051015 R e g r e t SimpleUCBMaxMinUCBMinWD (a) Robust regret. R e g r e t SimpleUCBMaxMinUCBMinWD (b) Worst-case regret. R e g r e t SimpleUCBMaxMinUCBMinWD (c) True regret.

Figure 2: Different cumulative regret objectives for different algorithms.Table 1: Summary of Analysis

Algorithms Regret Reward

MaxMinUCB (cid:80) Tt =1 M R t + O ( √ T log T ) (cid:80) Tt =1 M F t − O ( √ T log T ) MinWD (cid:80) Tt =1 M R t + O ( T √ log T + T − d + √ T log T ) (cid:80) Tt [ M F t − M R t ] − O ( T √ log T + T − d + √ T log T ) vector [ x, a ] , M R t = max x ∈B ∆ (ˆ x t ) f ( x, A ∗ ( x )) − M F t ,and M F t and M R t are deﬁned in Eqn. (6) and (11), respec-tively. Type-I and type-II robustness objectives are achievedby MaxMinUCB and

MinWD respectively.

Edge computing is a promising technique to meet the de-mand of latency-sensitive applications (Shi et al. 2016).Given multiple heterogeneous edge datacenters located indifferent locations, which one should be selected? Specif-ically, each edge datacenter is viewed as an arm, and theusers’ workload is context that can only be predicted priorto arm selection. Our goal is to learn datacenter selectionto optimize the latency in a robust manner given imperfectworkload information. We assume that the service rate ofthe edge datacenter a , a ∈ A , is µ a , the computation la-tency satisﬁes an M/M/1 queueing model and the averagecommunication delay between this datacenter and users is p a . Hence, the average total latency cost can be expressedas l ( x, a ) = p a · x + xµ a − x which is commonly-consideredin the literature (Lin et al. 2011; Xu, Chen, and Ren 2017;Lin et al. 2012). The detailed settings are given in (Yang andRen 2021).In Fig. 2, we compare different algorithms in terms ofthree cumulative regret objectives: robust regret in Eqn. (7),worst-case regret in Eqn. (8) and true regret in Eqn. (2).We consider the following algorithms: SimpleUCB with im-perfect context,

MaxMinUCB with imperfect context and

MinWD with imperfect context. Given a sequence of truecontexts, imperfect context sequence is generated by sam-pling i.i.d. uniform distribution over B ∆ ( x t ) at each round.In the simulations, Gaussian kernel with parameter . isused for reward (loss) estimation. λ in Eqn. (3) is set as . . The exploration rate is set as h t = 0 . .As is shown in Fig. 2(a), MaxMinUCB has the best perfor-mance of robust regret among the three algorithms. This isbecause

MaxMinUCB targets at type-I robustness objectivewhich is equivalent to minimizing the robust regret. How-ever,

MaxMinUCB is not the best algorithm in terms of trueregret as is shown in Fig. 2(c) since robust regret is not anupper or lower bound of true regret. Another robust algo-rithm

MinWD is also better than

SimpleUCB in terms of ro-bust regret, and it has the best performance among the threealgorithms in terms of the worst-case regret, as shown inFig. 2(b). This is because the regret of

MinWD approachesthe optimal worst-case regret (Theorem 6).

MinWD also hasa good performance of true regret, which coincides with thefact that the worst-case regret is the upper bound of the trueregret. By comparing the three algorithms in terms of thethree regret objectives, we can clearly see that

MaxMinUCB and

MinWD achieve performance robustness in terms of therobust regret and worst-case regret, respectively.

In this paper, considering a bandit setting with imperfectcontext, we propose:

MaxMinUCB which maximizes theworst-case reward; and

MinWD which minimizes the worst-case regret. Our analysis of

MaxMinUCB and

MinWD basedon regret and reward bounds shows that as time goes on,

MaxMinUCB and

MinWD both perform as asymptoticallywell as their counterparts that have perfect knowledge of thereward function. Finally, we consider online edge datacenterselection and run synthetic simulations for evaluation.

Acknowledgments

This work was supported in part by the NSF under grantsCNS-1551661 and ECCS-1610471.

References

Abbasi-Yadkori, Y.; P´al, D.; and Szepesv´ari, C. 2011. Im-proved Algorithms for Linear Stochastic Bandits.

NeurIPS .Agrawal, S.; and Goyal, N. 2012. Analysis of ThompsonSampling for the Multi-armed Bandit Problem.

COLT .Agrawal, S.; and Goyal, N. 2013. Thompson Sampling forContextual Bandits with Linear Payoffs.

ICML .ltschuler, J.; Brunel, V.-E.; and Malek, A. 2019. Best ArmIdentiﬁcation for Contaminated Bandits.

Journal of Ma-chine Learning Research

COLT .Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem.

MachineLearning

47: 235–256.Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E.2002. The Nonstochastic Multiarmed Bandit Problem.

SIAM Journal on Computing

32: 48–77.Auer, P.; and Chiang, C.-K. 2016. An Algorithm with NearlyOptimal Pseudo-regret for Both Stochastic and AdversarialBandits. In

COLT .Bogunovic, I.; Scarlett, J.; Jegelka, S.; and Cevher, V.2018. Adversarially Robust Optimization with GaussianProcesses. In

NIPS .Brockwell, P. J.; Brockwell, P. J.; Davis, R. A.; and Davis,R. A. 2016.

Introduction to time series and forecasting .Springer.Bubeck, S.; and Cesa-Bianchi, N. 2012. Regret analysis ofstochastic and nonstochastic multi-armed bandit problems.

Foundations and Trends® in Machine Learning

5: 1–122.Chen, B.; Wang, J.; Wang, L.; He, Y.; and Wang, Z. 2014.Robust optimization for transmission expansion planning:Minimax cost vs. minimax regret.

IEEE Transactions onPower Systems

IEEE Journal on Selected Areas in Communications

IEEE Transactions on Wireless Communications e-Energy .Chowdhury, S. R.; and Gopalan, A. 2017. On kernelizedmulti-armed bandits.

ICML .Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contex-tual bandits with linear payoff functions.

NeurIPS .Dani, V.; Hayes, T.; Thomas, P.; and Kakade, S. 2008.Stochastic linear optimization under bandit feedback.

COLT .Deshmukh, A. A.; Dogan, U.; and Scott, C. 2017. Multi-Task Learning for Contextual Bandits.

NeurIPS .Dud´ık, M.; Langford, J.; and Li, L. 2011. Doubly robust pol-icy evaluation and learning. arXiv preprint arXiv:1103.4601 . Garcıa, J.; and Fern´andez, F. 2015. A comprehensive surveyon safe reinforcement learning.

Journal of Machine Learn-ing Research

NeurIPS .Gers, F. A.; Schmidhuber, J.; and Cummins, F. 2000. Learn-ing to Forget: Continual Prediction with LSTM.

NeuralComputation

Wireless Communications . CambridgeUniversity Press.Guan, Z.; Ji, K.; Bucci Jr, D. J.; Hu, T. Y.; Palombo, J.; Lis-ton, M.; and Liang, Y. 2020. Robust Stochastic Bandit Algo-rithms under Probabilistic Unbounded Adversarial Attack.In

AAAI .Han, Y.; Zhou, Z.; Zhou, Z.; Blanchet, J.; Glynn,P. W.; and Ye, Y. 2020. Sequential Batch Learning inFinite-Action Linear Contextual Bandits. arXiv preprintarXiv:2004.06321 .Jadbabaie, A.; Rakhlin, A.; Shahrampour, S.; and Sridha-ran, K. 2015. Online optimization: Competing with dynamiccomparators. In

AISTATS .Jiang, R.; Wang, J.; Zhang, M.; and Guan, Y. 2013. Two-stage minimax regret robust unit commitment.

IEEE Trans-actions on Power Systems

NIPS .Kirschner, J.; Bogunovic, I.; Jegelka, S.; and Krause, A.2020. Distributionally Robust Bayesian Optimization. In

AISTATS .Kirschner, J.; and Krause, A. 2019. Stochastic Bandits withContext Distributions. In

NeurIPS .Lei, H.; Tewari, A.; and Murphy, S. 2014. An actor-criticcontextual bandit algorithm for personalized interventionsusing mobile devices.

NeurIPS .Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. AContextual-bandit Approach to Personalized News ArticleRecommendation. In

WWW .Lin, M.; Liu, Z.; Wierman, A.; and Andrew, L. L. H. 2012.Online algorithms for geographical load balancing. In

IGCC .Lin, M.; Wierman, A.; Andrew, L. L. H.; and Thereska,E. 2011. Dynamic right-sizing for power-proportional datacenters. In

INFOCOM .Liu, F.; and Shroff, N. 2019. Data Poisoning Attacks onStochastic Bandits. In

ICML .Lu, T.; P´al, D.; and P´al, M. 2010. Contextual multi-armedbandits.

AISTATS .Neu, G.; and Olkhovskaya, J. 2020. Efﬁcient and robust al-gorithms for adversarial linear contextual bandits. In

COLT .Nguyen, T.; Gupta, S.; Ha, H.; Rana, S.; and Venkatesh, S.2020. Distributionally robust bayesian quadrature optimiza-tion. In

AISTATS .akhlin, A.; and Sridharan, K. 2013. Online Learning withPredictable Sequences. In

COLT .Saxena, V.; Jald´en, J.; Gonzalez, J. E.; Bengtsson, M.; Tull-berg, H.; and Stoica, I. 2019. Contextual Multi-Armed Ban-dits for Link Adaptation in Cellular Networks. In

Workshopon Network Meets AI & ML (NetAI) .Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; and Xu, L. 2016. EdgeComputing: Vision and Challenges.

IEEE Internet of ThingsJournal arXiv preprintarXiv:2006.05630 .Si, N.; Zhang, F.; Zhou, Z.; and BlanchetWu, J. 2020b. Dis-tributionally Robust Policy Evaluation and Learning in Of-ﬂine Contextual Bandits. In

ICML .Slivkins, A. 2019. Introduction to Multi-Armed Bandits.

Foundations and Trends in Machine Learning

ICML .Sun, W.; Dey, D.; and Kapoor, A. 2017. Safety-aware algo-rithms for adversarial contextual bandit. In

ICML .Syrgkanis, V.; Krishnamurthy, A.; and Schapire, R. 2016.Efﬁcient algorithms for adversarial contextual learning.

ICML .Valko, M.; Korda, N.; Munos, R.; Flaounas, I.; and Cristian-ini, N. 2013. Finite-time analysis of kernelised contextualbandits.

UAI .Wang, H.; Wu, Q.; and Wang, H. 2016. Learning HiddenFeatures for Contextual Bandits.

CIKM ∼ yw562/teaching/598/lec14.pdf.Wu, Y.; Shariff, R.; Lattimore, T.; and Szepesv´ari, C. 2016.Conservative bandits. In ICML .Xu, J.; Chen, L.; and Ren, S. 2017. Online Learning for Of-ﬂoading and Autoscaling in Energy Harvesting Mobile EdgeComputing.

IEEE Transactions on Cognitive Communica-tions and Networking arXiv preprint arXiv:2102.05018 .Zhu, F.; Guo, J.; Li, R.; and Huang, J. 2018. Robust actor-critic contextual bandit for mobile health (mhealth) inter-ventions. In

Proceedings of the 2018 ACM InternationalConference on Bioinformatics, Computational Biology, andHealth Informatics , 492–501.

A Applications and Simulation Settings

A key novelty of our work is the consideration of imperfectcontext for arm selection, which characterizes many prac-tical applications such as resource management problemswhere the true context is difﬁcult to obtain for arm selection until revealed later. We list some examples of these applica-tions in this section and provide the simulation settings.

A.1 Motivating Applications

Cloud Resource Management.

Cloud computing platformsare crucial infrastructures offering utility-style computationresources to users on demand. To optimize the performancemetrics such as latency and cost for these applications, ef-ﬁcient online cloud resource management such as dynamicvirtual machine scheduling is necessary. Contextual banditlearning can be employed in this scenario where the exactworkload information (measured in, e.g., how many requestswill arrive per unit time) is crucial context information, butcannot be known until the workload actually arrives. Instead,the agent can only predict the upcoming workload by ex-ploiting the recent workload history plus other applicablesystem features. A real-word example is Amazon’s predic-tive scaling that leverages time series prediction to estimateupcoming workload for virtual machine scheduling (Ama-zon 2021). In this motivating example, the context predic-tion error comes primarily from the workload predictor.

Energy Scheduling in Smart Grid.

Energy load is a cru-cial context information for energy scheduling in smart grid.However, a recent study (Chen, Tan, and Zhang 2019) showsthat the energy load predictor in smart grid can be adver-sarially attacked to produce load estimates (i.e., context forenergy scheduling) with higher-than-normal errors. Thus,our model also captures a novel adversarial setting whereerroneously predicted context is presented to the agent forarm selection. This example shows that the use of machinelearning-based predictor for acquiring context to facilitatethe agent’s arm selection can also open a new attack sur-face — an outside attacker may adversarially modify thepredicted context — which requires a robust algorithm witha provable worst-case performance guarantee.

Online Edge Datacenter Selection.

With the rapidly in-creasing number of devices at the Internet edge, compu-tational demand by latency-sensitive applications (e.g., as-sisted driving and virtual reality) has been escalating. Edgecomputing is a promising solution to meet the demand,which deploys computation resources at densely-distributededge datacenters close to end users and thus reduces theoverall latency (Shi et al. 2016). Since users’ workloads canbe processed in multiple edge datacenters, dynamic selec-tion of edge datacenters plays a key role for minimizing theoverall latency — given multiple heterogeneous edge data-centers located in different locations, which one should beselected? Here, servers in each edge datacenter can be eithervirtual machines rented from third-party service providersor physical servers owned by the edge computing provideritself. We refer to this problem as online edge datacenter se-lection. Our considered bandit setting applies to the problemof online edge datacenter selection. Speciﬁcally, each edgedatacenter is viewed as an arm, and the users’ workload iscontext that can only be predicted prior to arm selection. Ourgoal is to dynamically select edge datacenters to optimizethe latency in a robust manner given imperfect informationabout users’ workloads. .2 Simulation Settings in Section 6

In our simulation, we apply our algorithms to online edgedatacenter selection in the context of edge computing. Weconsider users’ workloads can be processed in one of fouravailable edge datacenters, each having different compu-tation capabilities and communication latencies. Here, weconsider a simple latency model to capture ﬁrst-order ef-fects. Concretely, we assume that the service rate of the edgedatacenter a , a ∈ A = { , , , } , is µ a , the computationlatency satisﬁes an M/M/1 queueing model and the averagecommunication delay between this datacenter and users is p a . The values of µ a and p a are shown in Table 2. In thesimulations, perfect context sequence { x t } is generated bysampling i.i.d. uniform distribution between 10 and 30 foreach round, while the defense budget ∆ is set as 2. The av-erage total latency cost can be expressed as f ( x, a ) = xµ a − x + p a · x (18)whose inverse can be equivalently viewed as the reward inour model. Table 2: Simulation Settings Datacenter a I II III IV µ a

35 38 45 51 p a µ a ) may not be known to the agent and needsto be learnt based on latency feedback and revealed context.Note that some practical factors (e.g., workload parallelism)are beyond the scope of our analysis and incorporating theminto our simulation does not add substantially to our maincontribution. B Algorithm and Proofs Related toSimpleUCB

B.1 Algorithm of

SimpleUCB

We describe

SimpleUCB in Algorithm 2.

Algorithm 2

Simple UCB (

SimpleUCB ) for t = 1 , · · · , T do Receive imperfect context ˆ x t .Select arm a † t as the UCB-optimal arm A † t (ˆ x t ) deﬁnedin Deﬁnition 2.Observe the true context x t and the reward y t end for B.2 Proof of Lemma 1

Lemma 1

Assume that the reward function f ( x, a ) satis-ﬁes | f ( x, a ) | ≤ B , the noise n t satisﬁes a sub-Gaussiandistribution with parameter b , and kernel ridge regression is used to estimate the reward function. With a probabil-ity of at least − δ, δ ∈ (0 , , for all a ∈ A and t ∈ N , the estimation error satisﬁes | ˆ f t ( x, a ) − f ( x, a ) | ≤ h t s t ( x, a ) , where h t = √ λB + b (cid:112) γ t − δ ) , γ t =log det( I + K t /λ ) ≤ ¯ d log(1 + t ¯ dλ ) and ¯ d is the rank of K t .Let V t = λ I + (cid:80) ts =1 φ ( x, a ) φ ( x, a ) (cid:62) , the squared conﬁ-dence width is given by s t ( x, a ) = φ ( x, a ) (cid:62) V − t − φ ( x, a ) = λ k ([ x, a ] , [ x, a ]) − λ k t ( x, a ) T ( K t + λ I ) − k t ( x, a ) .Proof. Let φ : ( X × A ) → H be the mapping func-tion with respect to k ( · , · ) . Deﬁne Ψ t with the s th row as φ ( x s , a s ) , s = 1 , , · · · , t − and thus K t = Ψ t Ψ (cid:62) t . De-note y t = [ y , · · · , y t − ] (cid:62) as the collected rewards and thenoise vector n t = [ n , n · · · , n t − ] (cid:62) , so y t = Ψ t θ + n t .Then we have | f ( x, a ) − ˆ f t ( x, a ) | = | f ( x, a ) − k (cid:62) t ( x, a )( K t + λ I t ) − y t | = | f ( x, a ) − k (cid:62) t ( x, a )( K t + λ I t ) − (Ψ t θ + n t ) |≤ | f ( x, a ) − k (cid:62) t ( x, a )( K t + λ I t ) − Ψ t θ | + | k (cid:62) t ( x, a )( K t + λ I t ) − n t | . (19)where I t is an identity matrix with ( t − dimensions andthe last inequality comes from triangle inequality.Let V t − = Ψ (cid:62) t Ψ t + λ I and deﬁne s t ( x, a ) = φ ( x, a ) (cid:62) V − t − φ ( x, a ) . By Woodbury formula, we can write s t ( x, a ) by kernel functions: s t ( x, a ) = φ ( x, a ) (cid:62) (cid:18) λ I − λ Ψ (cid:62) t (cid:0) λ I t + Ψ t Ψ (cid:62) t (cid:1) − Ψ t (cid:19) φ ( x, a )= 1 λ k ([ x, a ] , [ x, a ]) − λ k t ( x, a ) (cid:62) ( K t + λ I t ) − k t ( x, a ) . (20)For the ﬁrst term of Eqn. (19), using Woodbury formula, wehave | f ( x, a ) − k (cid:62) t ( x, a )( K t + λ I t ) − Ψ t θ | = | φ ( x, a ) (cid:62) θ − φ ( x, a ) (cid:62) Ψ (cid:62) t (Ψ t Ψ (cid:62) t + λ I t ) − Ψ t θ | = | φ ( x, a ) (cid:62) θ − φ ( x, a ) (cid:62) (Ψ (cid:62) t Ψ t + λ I ) − Ψ (cid:62) t Ψ t θ | = | λφ ( x, a ) (cid:62) (Ψ (cid:62) t Ψ t + λ I ) − θ | ≤ √ λBs t ( x, a ) , (21)where the inequality comes from Cauchy-Schwartz inequal-ity and (cid:107) θ (cid:107) V − t − ≤ (cid:107) θ (cid:107) eig max ( V t − ) ≤ B √ λ .For the second term, we have the following inequalities. | k (cid:62) t ( x, a )( K t + λ I t ) − n t | = | φ ( x, a ) (cid:62) (Ψ (cid:62) t Ψ t + λ I ) − Ψ (cid:62) t n t |≤ (cid:107) φ ( x, a ) (cid:107) V − t − (cid:113) n (cid:62) t Ψ t (Ψ (cid:62) t Ψ t + λ I ) − Ψ (cid:62) t n t = s t ( x, a ) (cid:107) n t (cid:107) K t ( K t + λ I t ) − , (22)where the ﬁrst inequality comes from Cauchy-Schwartz in-equality.ince n t satisﬁes b -sub-Gaussain distribution conditionedon F t − , by Theorem 1 in (Abbasi-Yadkori, P´al, andSzepesv´ari 2011), with probability − δ , δ ∈ (0 , for t ∈ N and a ∈ A , we have (cid:107) n t (cid:107) K t ( K t + λ I t ) − = (cid:107) Ψ Tt n t (cid:107) V − t − ≤ b log (cid:18) det( V t − ) / δ det( λ I d ) / (cid:19) = b ( γ t − δ )) . (23)where γ t = log det ( K t /λ + I t ) ≤ ¯ d log(1 + t ¯ dλ ) where ¯ d is the dimension of K t the inequality comes from Lemma10 in (Abbasi-Yadkori, P´al, and Szepesv´ari 2011).Combining the bounds of the ﬁrst and second term inEqn. (19) and let h t = √ λB + b (cid:112) γ t − δ ) , we ob-tain the concentration bound in Lemma 3.1: | f ( x, a ) − ˆ f t ( x, a ) | ≤ h t s t ( x, a ) . B.3 Proof of Lemma 2

Lemma 2

The conﬁdence widths given x t for t ∈{ , · · · , T } satisﬁes (cid:80) Tt =1 s t ( x t , a t ) ≤ γ T , where γ T =log det( I + K T /λ ) ≤ ¯ d log(1 + T ¯ dλ ) and ¯ d is the rank of K T , and so (cid:80) Tt =1 s t ( x t , a t ) ≤ √ T γ T .Proof. By Lemma 11 in (Abbasi-Yadkori, P´al, andSzepesv´ari 2011), if λ ≥ , we have T (cid:88) t =1 s t ( x t , a t ) = T (cid:88) t =1 φ ( x t , a t ) (cid:62) V − t − φ ( x t , a t ) ≤ (cid:18) λ V T (cid:19) = 2 γ T (24)where γ T = log det ( K T /λ + I T ) ≤ ¯ d log(1 + T ¯ dλ ) . ByH¨older’s inequality, we have T (cid:88) t =1 s t ( x t , a t ) ≤ (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 s t ( x t , a t ) = (cid:112) T γ T . (25) C Proofs Related to

MaxMinUCB

C.1 Proof of Theorem 3

Theorem 3 If MaxMinUCB is used to select arms with im-perfect context, then for any true contexts x t ∈ B ∆ (ˆ x t ) atround t, t = 1 , · · · , T , with a probability of − δ, δ ∈ (0 , ,we have the following lower bound on the worst-case cumu-lative reward F T ≥ T (cid:88) t =1 M F t − h T (cid:114) T ¯ d log(1 + T ¯ dλ ) (26) where M F t is the optimal worst-case reward in Eqn. (6) , ¯ d is the rank of K t and h T is given in Lemma 1.Proof. By Lemma 1 and UCB deﬁned in Deﬁnition 1, withprobability δ , δ ∈ (0 , , the instantaneous regret can bebounded as f ( x t , a I t ) ≥ U t (cid:0) x t , a I t (cid:1) − h t s t (cid:0) x t , a I t (cid:1) ≥ U t (cid:0) x I t , a I t (cid:1) − h t s t (cid:0) x t , a I t (cid:1) (27) where the last inequality comes from the arm se-lection policy of MaxMinUCB implying that x I t =arg min x ∈B ∆ (ˆ x t ) U t (cid:0) x, a I t (cid:1) .With the deﬁnition of the optimal worst-case reward M F t and exploiting the arm selection policy of MaxMinUCB , wecan further bound the instantaneous reward as below. f ( x t ,a I t ) ≥ (cid:2) U t (cid:0) x I t , a I t (cid:1) − h t s t (cid:0) x t , a I t (cid:1)(cid:3) ≥ (cid:2) U t (cid:0) ¯ x I t , ¯ a t (cid:1) − h t s t (cid:0) x t , a I t (cid:1)(cid:3) ≥ (cid:2) U t (cid:0) ¯ x I t , ¯ a t (cid:1) − h t s t (cid:0) x t , a I t (cid:1)(cid:3) − f (cid:0) ¯ x I t , ¯ a t (cid:1) + M F t ≥ M F t − h t s t (cid:0) x t , a I t (cid:1) (28)where ¯ a t = arg max a ∈A min x ∈B ∆ (ˆ x t ) f ( x, a ) is the op-timal arm for maximizing the worst-case reward deﬁnedin Deﬁnition 1, ¯ x I t = arg min x ∈B ∆ (ˆ x t ) U t ( x, ¯ a t ) , thesecond inequality comes from the arm selection strat-egy of MaxMinUCB such that min x ∈B ∆ (ˆ x t ) U t (cid:0) x, a I t (cid:1) ≥ min x ∈B ∆ (ˆ x t ) U t ( x, ¯ a t ) , the third inequality holds becausethe deﬁnition of M F t in Eqn. (6). which guarantees M F t =min x ∈B ∆ (ˆ x t ) f ( x, ¯ a t ) ≤ f (cid:0) ¯ x I t , ¯ a t (cid:1) , and the last inequal-ity comes from Lemma 1 which guarantees U t (cid:0) ¯ x I t , ¯ a t (cid:1) ≥ f (cid:0) ¯ x I t , ¯ a t (cid:1) . Therefore, combined with Lemma 2, we can getthe bound the cumulative reward of MaxMinUCB . C.2 Proof of Corollary 3.1

Corollary 3.1 If MaxMinUCB is used to select arms withimperfect context, then for any true contexts x t ∈ B ∆ (ˆ x t ) atround t, t = 1 , · · · , T , with a probability of − δ, δ ∈ (0 , ,we have the following bound on the cumulative true regretdeﬁned in Eqn. (2) : R T ≤ T (cid:88) t =1 M R t + 2 h T (cid:114) T ¯ d log(1 + T ¯ dλ ) (29) where M R t = max x ∈B ∆ (ˆ x t ) f ( x, A ∗ ( x )) − M F t , M F t isthe optimal worst-case reward in Eqn. (6) , ¯ d is the rank of K t and h T is given in Lemma 1 .Proof. For any x t ∈ B ∆ (ˆ x t ) , by Theorem 3, with a prob-ability of − δ, δ ∈ (0 , , the instantaneous regret can bebounded as r ( x t ,a I t ) = f ( x t , A ∗ ( x t )) − f (cid:0) x t , a I t (cid:1) ≤ f ( x t , A ∗ ( x t )) − ( M F t − h t s t (cid:0) x t , a I t (cid:1) ) ≤ max x ∈B ∆ (ˆ x t ) { f ( x, A ∗ ( x )) } − M F t + 2 h t s t (cid:0) x t , a I t (cid:1) . (30)Corollary 3.1 can be obtained by taking summation with t and Lemma 2. D Proofs Related to

MinWD

D.1 Proof of Lemma 4

Lemma 4 If MinWD is used to select arms with imperfectcontext, then for each t = 1 , , · · · , T , with a probability atleast − δ, δ ∈ (0 , , we have D a II t ,t ≤ M R t + 2 h t s t (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) , (31) here M R t is the optimal worst-case regret deﬁned inEqn. (11) , ˙ x t = arg max x ∈B ∆ (ˆ x t ) D ˜ a t ( x ) is the context thatmaximizes the degradation given the arm ˜ a t deﬁned for theoptimal worst-case regret in Eqn. (10) .Proof. Recall that in Eqn. (9), the optimal armfor minimizing the worst-case regret is ˜ a t =arg min a ∈A max x ∈B ∆ (ˆ x t ) r ( x, a ) . By the arm selec-tion policy of MinWD , a II t is th arm that minimizes D a,t , sothe worst-case degradation can be bounded as follows. D a II t ,t ≤ D ˜ a t ,t = max x ∈B ∆ (ˆ x t ) (cid:110) U t (cid:16) x, A † t ( x ) (cid:17) − U t ( x, ˜ a t ) (cid:111) = U t (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) − U t ( ˙ x t , ˜ a t ) ≤ U t (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) − U t ( ˙ x t , ˜ a t ) − (cid:104) f (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) − f ( ˙ x t , ˜ a t ) (cid:105) + M R t ≤ h t s t (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) + M R t , (32)where ˙ x t = arg max x ∈B ∆ (ˆ x t ) D ˜ a t ,t ( x ) , the second in-equality holds because (cid:104) f (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) − f ( ˙ x t , ˜ a t ) (cid:105) ≤ [ f ( ˙ x t , A ∗ ( ˙ x t )) − f ( ˙ x t , ˜ a t )] = r ( ˙ x t , ˜ a t ) ≤ M R t where M R t = max x ∈B ∆ (ˆ x t ) r ( x, ˜ a t ) is the optimal worst-caseregret deﬁned in Eqn. (11) and the third inequality isfrom Lemma 1 which guarantees that U t (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) − f (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) ≤ h t s t (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) and f ( ˙ x t , ˜ a t ) − U t ( ˙ x t , ˜ a t ) ≤ . D.2 Proof of Proposition 5

We ﬁrst bound the conﬁdence width sum of consideredcontext-arm sequence in the Lemma 7 assuming a linear re-ward function, i.e. φ ( x, a ) = [ x, a ] , and context-arm space Φ =

X × A is ﬁnite with size | Φ | . Then in Lemma 8,we prove the bound for linear reward case and continuouscontext-arm space. Finally we generalize the bound to thekernel cases. For conciseness, denote x a t ,t = [ x t , a t ] ∈ R d as the true context and selected arm at round t and ˙ x ˙ a t ,t =[ ˙ x t , ˙ a t ] ∈ R d as the considered context and arm at round t . Lemma 7 (Sum of Conﬁdence Width with Finite Contex-t-arm Space) . Let X T = { x a , , · · · , x a T ,T } be the se-quence of true contexts and selected arms by bandit algo-rithms and ˙ X T = { ˙ x ˙ a , , · · · , ˙ x ˙ a T ,T } be the considered se-quence of contexts and arms. Suppose that both x a t ,t and ˙ x ˙ a t ,t ( t = 1 , · · · , T ) belong to a ﬁnite set Φ with size | Φ | .Besides, ∀ ϕ ∈ Φ , ∃ κ ≥ , two conditions are satisﬁed.First, ∃ t ≤ (cid:100) κ | Φ |(cid:101) such that x a t ,t = ϕ . Second, if at round t , x a t ,t = ϕ , then ∃ t (cid:48) , t ≤ t (cid:48) ≤ t + (cid:100) κ | Φ |(cid:101) , such that x a t (cid:48) ,t (cid:48) = ϕ . The sum of conﬁdence width is bounded as T (cid:88) t =1 s t ( ˙ x ˙ a t ,t ) ≤ (cid:100) κ | Φ |(cid:101) (cid:18) d log (cid:18) Tdλ (cid:19) + 1 λ (cid:19) , (33) where s t ( ˙ x ˙ a t ,t ) = ˙ x (cid:62) ˙ a t ,t V − t − ˙ x ˙ a t ,t and V t = λ I + (cid:80) ts =1 x a s ,s x (cid:62) a s ,s . " (cid:11)(cid:22)(cid:21)(cid:26)(cid:19)(cid:17)(cid:18)(cid:25)(cid:18)(cid:17)(cid:1)(cid:13)(cid:18)(cid:24)(cid:28)(cid:18)(cid:21)(cid:16)(cid:18)(cid:14)(cid:25)(cid:28)(cid:18)(cid:1)(cid:13)(cid:18)(cid:24)(cid:28)(cid:18)(cid:21)(cid:16)(cid:18) (cid:2) (cid:8)(cid:12)(cid:22)(cid:28)(cid:21)(cid:17)(cid:1)(cid:13)(cid:27)(cid:15)(cid:20)(cid:23)(cid:26) " $ " $ (cid:9) (cid:10) " " $ " " $ " $ " " " " $ " $ " " $ " " $ " (cid:3) (cid:4) (cid:5) (cid:6) (cid:7) (cid:1)(cid:1)(cid:1) Figure 3: An example of sequence grouping in Lemma 7.The feature set is

Φ = { z , z } . κ = 2 . (cid:100) κ | Φ |(cid:101) = 4 groupsare constructed: G = { , , , · · · } , G = { , , · · · } , G = { , , · · · } , G = { , , · · · } . The interval between the fea-tures in considered sequence and their counterparts in truesequence is less than (cid:100) κ | Φ |(cid:101) = 4 . Proof.

Let M = (cid:98) T / (cid:100) κ | Φ |(cid:101)(cid:99) . Divide the roundstamp sequence { , , · · · , M (cid:100) κ | Φ |(cid:101)} uniformlyinto (cid:100) κ | Φ |(cid:101) groups, each with M elements. The l th group of round stamps, (1 ≤ l ≤ (cid:100) κ | Φ |(cid:101) ) , is G l = { l, l + (cid:100) κ | Φ |(cid:101) , · · · , l + ( M − (cid:100) κ | Φ |(cid:101)} , and G l,m = { l, l + (cid:100) κ | Φ |(cid:101) , · · · , l + ( m − (cid:100) κ | Φ |(cid:101)} , (1 ≤ m ≤ M ) is a subset of G l with the ﬁrst m entries. A simpleexample of group construction is shown in Fig. 3.Let W l,m = λ I d + (cid:80) s ∈G l,m ˙ x ˙ a s ,s ˙ x (cid:62) ˙ a s ,s , for ≤ l ≤(cid:100) κ | Φ |(cid:101) . For the sequence { ˙ x ˙ a t ,t | t ∈ G l } , by Lemma 2, wehave (cid:80) t ∈G l ˙ x (cid:62) ˙ a t ,t W − l,m t − ˙ x ˙ a t ,t ≤ γ ( W l,M ) where m t = (cid:98) ( t − / (cid:100) κ | Φ |(cid:101)(cid:99) and γ ( W l,M ) = log det( W l,M )det( λ I ) .Let’s consider the sequence in G l . The two conditionsin Lemma 8, imply that for t ∈ G l and its index m t = (cid:98) ( t − / (cid:100) κ | Φ |(cid:101)(cid:99) , ∀ s ∈ G l,m t − , if ˙ x ˙ a s ,s = ϕ , there ex-ists s (cid:48) ( s ≤ s (cid:48) < s + (cid:100) κ | Φ |(cid:101) ) such that the true context-arm x a s (cid:48) ,s (cid:48) = ϕ (For example, in Fig. 3, s = 1 is in G , and thereexists s (cid:48) = 3 ≤ (4 − such that ˙ x ˙ a s ,s = x a (cid:48) s ,s (cid:48) = z ).Therefore we have s (cid:48) ∈ T t − = { , , · · · , t − } , andwe conclude that { ˙ x ˙ a s ,s , s ∈ G l,m t − } ⊆ { x a s ,s , s ∈ T t − } .Thus, considering V t − and W l,m t − are both positive def-inite matrices, we have V t − (cid:23) W l,m t − and V − m t − (cid:22) W − l,m t − . Therefore, for group l ( ≤ l ≤ (cid:100) κ | Φ |(cid:101) ), we have (cid:88) t ∈G l s t ( ˙ x ˙ a t ,t ) = (cid:88) t ∈G l ˙ x (cid:62) ˙ a t ,t V − t − ˙ x ˙ a t ,t ≤ (cid:88) t ∈G l ˙ x (cid:62) ˙ a t ,t W l,M ˙ x ˙ a t ,t ≤ γ ( W l,M ) . (34)From the above analysis, since s t ( ˙ x ˙ a t ,t ) ≤ λ , we can get T (cid:88) t =1 s t ( ˙ x ˙ a t ,t ) = (cid:100) κ | Φ |(cid:101) (cid:88) l =1 (cid:88) t ∈G l s t ( ˙ x ˙ a t ,t ) + T (cid:88) t = M (cid:100) κ | Φ |(cid:101) +1 s t ( ˙ x ˙ a t ,t ) ≤ (cid:100) κ | Φ |(cid:101) (cid:88) l =1 γ ( W l,M ) + (cid:100) κ | Φ |(cid:101) λ ≤ (cid:100) κ | Φ |(cid:101) (cid:18) d log (cid:18) Mdλ (cid:19) + 1 λ (cid:19) ≤ (cid:100) κ | Φ |(cid:101) (cid:18) d log (cid:18) Tdλ (cid:19) + 1 λ (cid:19) , (35)here the second inequality comes from Lemma 11 in(Abbasi-Yadkori, P´al, and Szepesv´ari 2011). thus complet-ing the proof.If the context-arm space is continuous, the ﬁnite set as-sumption of Φ in Lemma 7 is not satisﬁed anymore. To over-come this issue, we construct an (cid:15) - covering Φ (cid:15) ∈ Φ for thecontext-arm space Φ such that for each ϕ ∈ Φ , there ex-ists at least one ¯ ϕ ∈ Φ (cid:15) satisfying (cid:107) ϕ − ¯ ϕ (cid:107) ≤ (cid:15) . Sincethe dimension of x a t ,t is d , the size of the (cid:15) - covering is | Φ (cid:15) | ∼ O (cid:0) (cid:15) d (cid:1) . Now we can bound the sum of conﬁdencewidth with linear reward function and continuous context-arm space in Lemma 8. Lemma 8 (Sum of Conﬁdence Width with Continuous Con-text-arm Space) . Let X T = { x a , , · · · , x a T ,T } be the se-quence of true contexts and selected arms by bandit algo-rithms and ˙ X T = { ˙ x ˙ a , , · · · , ˙ x ˙ a T ,T } be the considered se-quence of contexts and actions. Suppose that both x a t ,t and ˙ x ˙ a t ,t ( t = 1 , · · · , T ) belong to Φ . Besides, with an (cid:15) − cover-ing Φ (cid:15) ⊆ Φ , (cid:15) > , there exists κ ≥ such that two condi-tions are satisﬁed for X T . First, ∀ ¯ ϕ ∈ Φ (cid:15) , ∃ t ≤ (cid:6) κ/(cid:15) d (cid:7) suchthat x a t ,t ∈ C (cid:15) ( ¯ ϕ ) . Second, if at round t , x a t ,t ∈ C (cid:15) ( ¯ ϕ ) for some ¯ ϕ ∈ Φ (cid:15) , then ∃ t ≤ t (cid:48) < t + (cid:6) κ/(cid:15) d (cid:7) such that x a (cid:48) t ,t (cid:48) ∈ C (cid:15) ( ¯ ϕ ) . The sum of squared conﬁdence width isbounded as T (cid:88) t =1 s t ( ˙ x ˙ a t ,t ) ≤√ T (cid:18) d log (cid:18) Tdλ (cid:19) + 1 λ (cid:19) + 8 κ /d λ T − /d , (36) where s t ( ˙ x ˙ a t ,t ) = ˙ x (cid:62) ˙ a t ,t V − t − ˙ x ˙ a t ,t and V t = λ I d + (cid:80) ts =1 x a s ,s x (cid:62) a s ,s .Proof. Based on the (cid:15) - covering Φ (cid:15) ∈ Φ , we di-vide the round stamp sequence (cid:8) , , · · · , M (cid:6) κ/(cid:15) d (cid:7)(cid:9) uni-formly into (cid:6) κ/(cid:15) d (cid:7) groups, each with M = (cid:4) T / (cid:6) κ/(cid:15) d (cid:7)(cid:5) elements. The l th group (cid:0) ≤ l ≤ (cid:6) κ/(cid:15) d (cid:7)(cid:1) is G l = (cid:8) l, l + (cid:6) κ/(cid:15) d (cid:7) , · · · , l + ( M − (cid:6) κ/(cid:15) d (cid:7)(cid:9) , and G l,m = (cid:8) l, l + (cid:6) κ/(cid:15) d (cid:7) , · · · , l + ( m − (cid:6) κ/(cid:15) d (cid:7)(cid:9) is a subset of G l with the ﬁrst m entries.Now we consider the l th group. The two conditions inthis lemma imply that for a certain t ∈ G l and its index m t = (cid:4) ( t − / (cid:6) κ/(cid:15) d (cid:7)(cid:5) , ∀ s ∈ G l,m t − , if ˙ x ˙ a s ,s ∈ C (cid:15) ( ¯ ϕ ) for some ¯ ϕ ∈ Φ (cid:15) , there exists s (cid:48) ( s ≤ s (cid:48) < s + (cid:6) κ/(cid:15) d (cid:7) ) such that the true context-arm x a s (cid:48) ,s (cid:48) ∈ C (cid:15) ( ¯ ϕ ) . Denote s (cid:48) = ζ ( s ) mapping s to its corresponding s (cid:48) , and we have ζ ( s ) ≤ t − . Also, denote G (cid:48) l,m t − = { ζ ( s ) | s ∈ G l,m t − } ,and we have G (cid:48) l,m t − ⊆ T t − = { , , · · · , t − } . Let W l,m = λ I d + (cid:80) s ∈G l,m x a ζ ( s ) ,ζ ( s ) x (cid:62) a ζ ( s ) ,ζ ( s ) = λ I d + (cid:80) s (cid:48) ∈G (cid:48) l,m x a s (cid:48) ,s (cid:48) x (cid:62) a s (cid:48) ,s (cid:48) , for ≤ l ≤ (cid:6) κ/(cid:15) d (cid:7) . By the aboveanalysis, we have V t − (cid:23) W l,m t − , and so V − t − (cid:22) W − l,m t − considering V t − and W l,m t − are both positivedeﬁnite matrices.By Lemma 2, for the sequence { x (cid:62) a ζ ( t ) ,ζ ( t ) , t ∈ G l } , wehave (cid:80) t ∈G l x (cid:62) a ζ ( t ) ,ζ ( t ) W − l,m t − x a ζ ( t ) ,ζ ( t ) ≤ γ ( W l,M ) . De- note e t = ˙ x ˙ a t ,t − x a ζ ( t ) ,ζ ( t ) . Since x a ζ ( t ) ,ζ ( t ) and ˙ x ˙ a t ,t be-long to the same cell in (cid:15) − covering Φ (cid:15) , we have (cid:107) e t (cid:107) ≤ (cid:15) by triangle inequality. Therefore for group G l , we have (cid:88) t ∈G l s t ( ˙ x ˙ a t ,t ) = (cid:88) t ∈G l ( ˙ x ˙ a t ,t ) (cid:62) V − t − ( ˙ x ˙ a t ,t )= (cid:88) t ∈G l (cid:16) x a ζ ( t ) ,ζ ( t ) + e t (cid:17) (cid:62) V − t − (cid:16) x a ζ ( t ) ,ζ ( t ) + e t (cid:17) ≤ (cid:88) t ∈G l x (cid:62) a ζ ( t ) ,ζ ( t ) V − t − x a ζ ( t ) ,ζ ( t ) + 2 (cid:88) t ∈G l e (cid:62) t V − t − e t ≤ (cid:88) t ∈G l x (cid:62) a ζ ( t ) ,ζ ( t ) W − l,m t − x a ζ ( t ) ,ζ ( t ) + 2 Mλ (2 (cid:15) ) ≤ γ ( W l,M ) + 8 Mλ (cid:15) , (37)where the ﬁrst inequality comes from Cauchy-Schwartz in-equality and the second inequality holds because V − t − (cid:22) W − l,m t − and eig( V − t − ) ≤ λ , and the last inequality holdsby Lemma 2.Now we can bound the sum of the conﬁdence widths forthe whole sequence as T (cid:88) t =1 s t ( ˙ x ˙ a t ,t ) = (cid:100) κ/(cid:15) d (cid:101) (cid:88) l =1 (cid:88) t ∈G l s t ( ˙ x ˙ a t ,t ) + T (cid:88) t = M (cid:100) κ/(cid:15) d (cid:101) +1 s t ( ˙ x ˙ a t ,t ) ≤ (cid:6) κ/(cid:15) d (cid:7) (cid:18) γ ( W l,M ) + 8 Mλ (cid:15) + 1 λ (cid:19) ≤ (cid:6) κ/(cid:15) d (cid:7) (cid:18) d log (cid:18) Tdλ (cid:19) + 1 λ (cid:19) + 8 Tλ (cid:15) , (38)where the last inequality holds because γ ( W l,M ) ≤ d log (cid:0) Mdλ (cid:1) by Lemma 11 in (Abbasi-Yadkori, P´al, andSzepesv´ari 2011) and M = (cid:4) T / (cid:6) κ/(cid:15) d (cid:7)(cid:5) . Let (cid:6) κ/(cid:15) d (cid:7) = T / , i.e. (cid:15) = (cid:0) κ /T (cid:1) / d , then we have T (cid:88) t =1 s t ( ˙ x ˙ a t ,t ) ≤√ T (cid:18) d log (cid:18) Tdλ (cid:19) + 1 λ (cid:19) +8 κ /d λ T − /d , (39)thus completing the proof.Now we can prove proposition 5 by generalizing Lemma8 into kernel case assuming the mapping function φ : X ×A → H is Lipschitz continuous with constant L φ , i.e. ∀ x, y ∈ X × A , (cid:107) φ ( x ) − φ ( y ) (cid:107) ≤ L φ (cid:107) x − y (cid:107) . Proof of Proposition 5Proposition 5

Let X T = { x a , , · · · , x a T ,T } be the se-quence of true contexts and selected arms by bandit algo-rithms and ˙ X T = { ˙ x ˙ a , , · · · , ˙ x ˙ a T ,T } be the considered se-quence of contexts and actions. Suppose that both x a t ,t and ˙ x ˙ a t ,t belong to Φ . Besides, with an (cid:15) − covering Φ (cid:15) ⊆ Φ , (cid:15) > , there exists κ ≥ such that two conditions are satis-ﬁed. First, ∀ ¯ ϕ ∈ Φ (cid:15) , ∃ t ≤ (cid:6) κ/(cid:15) d (cid:7) such that x a t ,t ∈ C (cid:15) ( ¯ ϕ ) .econd, if at round t , x a t ,t ∈ C (cid:15) ( ¯ ϕ ) for some ¯ ϕ ∈ Φ (cid:15) , then ∃ t ≤ t (cid:48) < t + (cid:6) κ/(cid:15) d (cid:7) such that x a (cid:48) t ,t (cid:48) ∈ C (cid:15) ( ¯ ϕ ) . If the map-ping function φ is Lipschitz continuous with constant L φ , thesum of squared conﬁdence widths is bounded as T (cid:88) t =1 s t ( ˙ x ˙ a t ,t ) ≤√ T (cid:18) d log (cid:18) T ˜ dλ (cid:19) + 1 λ (cid:19) +8 L φ κ /d λ T − /d , (40) where d is the dimension of x a t ,t , ˜ d is the effective dimensiondeﬁned in the proof, s t ( ˙ x ˙ a t ,t ) = φ ( ˙ x ˙ a t ,t ) (cid:62) V − t − φ ( ˙ x ˙ a t ,t ) and V t = λ I + (cid:80) ts =1 φ ( x a s ,s ) φ ( x a s ,s ) (cid:62) .Proof. With the same method in Lemma 8, we construct (cid:6) κ/(cid:15) d (cid:7) groups, each with M = (cid:4) T / (cid:6) κ/(cid:15) d (cid:7)(cid:5) elements.Deﬁne G l , G l,m , G (cid:48) l,m , ζ ( t ) , m t , e t same as Lemma 8 and let W l,m = λ I + (cid:80) s ∈G l,m φ ( x a ζ ( s ) ,ζ ( s ) ) φ ( x a ζ ( s ) ,ζ ( s ) ) (cid:62) = λ I + (cid:80) s (cid:48) ∈G (cid:48) l,m φ ( x a s (cid:48) ,s (cid:48) ) φ ( x a s (cid:48) ,s (cid:48) ) (cid:62) , for ≤ l ≤ (cid:6) κ/(cid:15) d (cid:7) .By Lemma 8, we have V t − (cid:23) W l,m t − , and so V − t − (cid:22) W − l,m t − . Therefore for group G l , we have (cid:88) t ∈G l s t ( ˙ x ˙ a t ,t ) = (cid:88) t ∈G l φ ( ˙ x ˙ a t ,t ) (cid:62) V − t − φ ( ˙ x ˙ a t ,t )= (cid:88) t ∈G l (cid:107) φ (cid:0) x a ζ ( t ) ,ζt (cid:1) + φ ( ˙ x ˙ a t ,t ) − φ (cid:0) x a ζ ( t ) ,ζt (cid:1) (cid:107) V − t − ≤ (cid:88) t ∈G l (cid:107) φ (cid:0) x a ζ ( t ) ,ζt (cid:1) (cid:107) V − t − +2 (cid:88) t ∈G l (cid:107) φ (˙ x ˙ a t ,t ) − φ (cid:0) x a ζ ( t ) ,ζt (cid:1) (cid:107) V − t − ≤ (cid:88) t ∈G l (cid:107) φ ( x a ζ ( t ) ,ζ ( t ) ) (cid:107) W − l,mt − + 2 Mλ ( L φ (cid:107) e t (cid:107) ) ≤ γ ( W l,M ) + 8 M L φ λ (cid:15) , (41)where the ﬁrst inequality comes from Cauchy-Schwartz in-equality and the second inequality holds because V − t − (cid:22) W − l,m t − , eig( V − t − ) ≤ λ and Lipschitz continuity of φ such that (cid:107) φ ( ˙ x ˙ a t ,t ) − φ (cid:0) x a ζ ( t ) ,ζt (cid:1) (cid:107) ≤ L φ (cid:107) e t (cid:107) , and thelast inequality holds by Lemma 2.For kernel case, with K l,M ∈ R M × M containing k ( x a ζ ( t ) ,ζ ( t ) , x a ζ ( t ) ,ζ ( t ) ) for t ∈ G l , we have γ ( W l,M ) =log det( W l,M )det( λ I H ) = log det( I M + K l,M )det( λ I M ) ≤ d l log(1 + M ˜ d l ) ,where ˜ d l is the rank of K l,M . Deﬁne the effective dimen-sion as ˜ d = arg max ˜ d l = ˜ d , ··· , ˜ d (cid:100) κ/(cid:15)d (cid:101) d l log(1 + M ˜ d l ) . Thenwe can bound the sum of the conﬁdence widths for the whole sequence as T (cid:88) t =1 s t ( ˙ x ˙ a t ,t ) = (cid:100) κ/(cid:15) d (cid:101) (cid:88) l =1 (cid:88) t ∈G l s t ( ˙ x ˙ a t ,t ) + T (cid:88) t = M (cid:100) κ/(cid:15) d (cid:101) +1 s t ( ˙ x ˙ a t ,t ) ≤ (cid:6) κ/(cid:15) d (cid:7) (cid:32) γ ( W l,M ) + 8 M L φ λ (cid:15) + 1 λ (cid:33) ≤ (cid:6) κ/(cid:15) d (cid:7) (cid:18) d log(1 + M ˜ d ) + 1 λ (cid:19) + 8 T L φ λ (cid:15) , (42)Let (cid:6) κ/(cid:15) d (cid:7) = T / , i.e. (cid:15) = (cid:0) κ /T (cid:1) / d , then we have T (cid:88) t =1 s t ( ˙ x ˙ a t ,t ) ≤√ T (cid:18) d log (cid:18) T ˜ dλ (cid:19) + 1 λ (cid:19) +8 L φ κ /d λ T − /d , (43)thus completing the proof. D.3 Proof of Theorem 6

Theorem 6. If MinWD is used to select arms with imperfectcontext and as time goes on, and the conditions in Proposi-tion 5 are satisﬁed, then for any true context x t ∈ B ∆ (ˆ x t ) atround t, t = 1 , · · · , T , with a probability of − δ, δ ∈ (0 , ,we have the following bound on the cumulative true regret: R T ≤ T (cid:88) t M R t + 2 h T T (cid:115)(cid:18) d log (cid:18) T ˜ dλ (cid:19) + 1 λ (cid:19) +4 (cid:114) λ L φ κ d h T T − d + 2 h T (cid:114) T ¯ d log(1+ T ¯ dλ ) , where M R t is the optimal worst-case regret for round t inEqn. (11) , d is the dimension of x a t ,t , ˜ d is the effective di-mension deﬁned in the proof of Proposition 5, ¯ d is the rankof K t and h T is given in Lemma 1.Proof. Since with a probability − δ , δ ∈ (0 , , r (cid:0) x t , a II t (cid:1) ≤ D a II t ,t + 2 h t s t ( x t , a II t ) , we can bound the cu-mulative regret of MinWD as R T = T (cid:88) t =1 r (cid:0) x t , a II t (cid:1) ≤ T (cid:88) t =1 (cid:104) D a II t ,t + 2 h t s t ( x t , a II t ) (cid:105) ≤ T (cid:88) t =1 (cid:104) M R t + 2 h t s t (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) + 2 h t s t ( x t , a II t ) (cid:105) , (44)where the second inequality comes from Lemma 4.Since the sequence { x t , a II t } meets the conditions inProposition 5, by replacing ˙ x ˙ a t ,t and x a t ,t in Proposition 5ith [ ˙ x t , A † t ( ˙ x t )] and [ x t , a II t ] , we have T (cid:88) t =1 s t (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) ≤ (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 s t (cid:16) ˙ x t , A † t ( ˙ x t ) (cid:17) ≤ (cid:115) T / (cid:18) d log (cid:18) T ˜ dλ (cid:19) + 1 λ (cid:19) + 8 L φ κ /d λ T − /d ≤ T (cid:115) d log (cid:18) T ˜ dλ (cid:19) + 1 λ + 2 (cid:114) λ L φ κ d T − d . (45)Combining with Lemma 2, we can get the bound in Theorem6. D.4 Proof of Corollary 6.1

Corollary 6.1 If MinWD is used to select arms with imper-fect context and as time goes on, and the true sequence ofcontext and arm obeys the conditions in Proposition 5, thenfor any true contexts x t ∈ B ∆ (ˆ x t ) at round t, t = 1 , · · · , T ,with a probability of − δ, δ ∈ (0 , , we have the followinglower bound of the cumulative reward F T ≥ T (cid:88) t =1 [ M F t − M R t ] − h T T (cid:115)(cid:18) d log (cid:18) T ˜ dλ (cid:19) +1 λ (cid:19) − (cid:114) λ L φ κ d h T T − d − h T (cid:114) T ¯ d log(1+ T ¯ dλ ) , where M R t is the optimal worst-case regret for round t inEqn. (11) , d is the dimension of x a t ,t , ˜ d is the effective di-mension deﬁned in the proof of Proposition 5, ¯ d is the rankof K t , and h T is given in Lemma 1.Proof. By Lemma 1, with a probability − δ , δ ∈ [0 , , wecan bound the reward as below f ( x t , a II t ) ≥ U t (cid:0) x t , a II t (cid:1) − h t s t (cid:0) x t , a II t (cid:1) = U t (cid:0) x t , a II t (cid:1) − U t (cid:16) x t , A † t ( x t ) (cid:17) + U t (cid:16) x t , A † t ( x t ) (cid:17) − h t s t (cid:0) x t , a II t (cid:1) = − D a II t ,t ( x t ) + U t (cid:16) x t , A † t ( x t ) (cid:17) − h t s t (cid:0) x t , a II t (cid:1) , (46)Recall that the upper bound of the worst-case degradation is deﬁned as D a,t =max x ∈B ∆ (ˆ x t ) (cid:110) U t (cid:16) x, A † t ( x ) (cid:17) − U t ( x, a ) (cid:111) , so the re-ward can be further bounded as f ( x t , a II t ) ≥ (cid:16) − D a II t ,t (cid:17) + U t (cid:16) x t , A † t ( x t ) (cid:17) − h t s t (cid:0) x t , a II t (cid:1) ≥ (cid:16) − D a II t ,t (cid:17) + min x ∈B ∆ (ˆ x t ) U t (cid:16) x, A † t ( x ) (cid:17) − h t s t (cid:0) x t , a II t (cid:1) ≥ (cid:16) − D a II t ,t (cid:17) + max a ∈A min x ∈B ∆ (ˆ x t ) U t ( x, a ) − h t s t (cid:0) x t , a II t (cid:1) , (47) where the last inequality is be-cause min x ∈B ∆ (ˆ x t ) U t (cid:16) x, A † t ( x ) (cid:17) =min x ∈B ∆ (ˆ x t ) max a ∈A U t ( x, a ) ≥ max a ∈A min x ∈B ∆ (ˆ x t ) U t ( x, a ) by max-min inequality.Note that max a ∈A min x ∈B ∆ (ˆ x t ) U t ( x, a ) is the arm se-lection policy of MaxMinUCB whose solutions are a I t and x I t . This observation is important since it bridges MinWD and

MaxMinUCB . Also, recall that in Eqn. (4), ¯ a t =arg max a ∈A min x ∈B ∆ (ˆ x t ) f ( x, a ) is the optimal arm forworst-case reward and M F t = min x ∈B ∆ (ˆ x t ) f ( x, ¯ a t ) is theoptimal worst-case reward. Then following Eqn. (47), wehave f ( x t , a II t ) ≥ (cid:16) − D a II t ,t (cid:17) + max a ∈A min x ∈B ∆ (ˆ x t ) U t ( x, a ) − h t s t (cid:0) x t , a II t (cid:1) = (cid:16) − D a II t ,t (cid:17) + U t (cid:0) x I t , a I t (cid:1) − h t s t (cid:0) x t , a II t (cid:1) ≥ (cid:16) − D a II t ,t (cid:17) + U t (cid:0) ¯ x I t , ¯ a t (cid:1) − h t s t (cid:0) x t , a II t (cid:1) ≥ (cid:16) − D a II t ,t (cid:17) + U t (cid:0) ¯ x I t , ¯ a t (cid:1) − h t s t (cid:0) x t , a II t (cid:1) − f (cid:0) ¯ x I t , ¯ a t (cid:1) + M F t ≥ (cid:16) − D a II t ,t (cid:17) + M F t − h t s t (cid:0) x t , a II t (cid:1) , (48)where ¯ x I t = arg min x ∈B ∆ (ˆ x t ) U t ( x, ¯ a t ) , the second inequal-ity holds by the arm selection strategy of MaxMinUCB suchthat U t (cid:0) x I t , a I t (cid:1) ≥ U t (cid:0) ¯ x I t , ¯ a t (cid:1) , the third inequality comesfrom the deﬁnition of M F t in Eqn. (6) which guarantees M F t = min x ∈B ∆ (ˆ x t ) f ( x, ¯ a t ) ≤ f (cid:0) ¯ x I t , ¯ a t (cid:1) and the forthinequality comes from Lemma 1 such that U t (cid:0) ¯ x I t , ¯ a t (cid:1) ≥ f (cid:0) ¯ x I t , ¯ a t (cid:1) .Finally, since D a II t ,t +2 h t s t (cid:0) x t , a II t (cid:1) is the upper bound ofinstantaneous regret of MinWD , by directly using Theorem6 we can prove the lower bound reward of