Stochastic Linear Bandits Robust to Adversarial Attacks
Ilija Bogunovic, Arpan Losalka, Andreas Krause, Jonathan Scarlett
Ilija Bogunovic
ETH Zürich
Arpan Losalka
National University of Singapore
Andreas Krause
ETH Zürich
Jonathan Scarlett
National University of Singapore
Abstract
We consider a stochastic linear bandit problem in which the rewards are not only subject to random noise, but also to adversarial attacks subject to a suitable budget C (i.e., an upper bound on the sum of corruption magnitudes across the time horizon). We provide two variants of a Robust Phased Elimination algorithm, one that knows C and one that does not. Both variants are shown to attain near-optimal regret in the non-corrupted case C = 0, while incurring additional additive terms respectively having a linear and quadratic dependency on C in general. We present algorithm-independent lower bounds showing that these additive terms are near-optimal. In addition, in a contextual setting, we revisit a setup of diverse contexts, and show that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing C.

1 Introduction

Over the past years, bandit algorithms have found application in computational advertising, recommender systems, clinical trials, and many more domains. These algorithms make online decisions by balancing exploitation of previously high-reward actions against exploration of less-known ones that could potentially lead to higher rewards. Bandit problems can roughly be categorized [16] into stochastic bandits, in which subsequently played actions yield independent rewards, and adversarial bandits, where the rewards are chosen by an adversary, possibly subject to constraints. A recent line of works has sought to reap the benefits of both approaches by studying bandit problems that are stochastic in nature, but with rewards subject to a limited amount of adversarial corruption. Various works have developed provably robust algorithms [10, 22, 4, 19], and attacks have been designed that cause standard algorithms to fail [8, 10, 11, 20].

While near-optimal theoretical guarantees have been established in the case of independent arms [10], more general settings remain relatively poorly understood or even entirely unexplored; see Section 1.2 for details. Our primary goal is to bridge these gaps via a detailed study of stochastic linear bandits with adversarial corruptions. In the case of a fixed finite (but possibly very large) set of arms, we develop an elimination-based robust algorithm and provide regret bounds with a near-optimal joint dependence on the time horizon and the adversarial attack budget, demonstrating distinct behavior depending on whether the attack budget is known or unknown. In addition, we introduce a novel contextual linear bandit setting under adversarial corruptions, and show that under a context diversity assumption, a simple greedy algorithm attains near-optimal regret under adversarial corruptions, despite having no built-in mechanism that explicitly encourages exploration or robustness.
1.1 Problem Statement

We consider the stochastic linear bandit setting with a given set of arms A ⊂ R^d of finite size k, and adversarially corrupted rewards. At each round t ∈ {1, ..., T}:
• The learner chooses an action A_t ∈ A.
• The adversary observes A_t and decides upon the attack/corruption c_t(A_t); in addition, c_t(·) may (implicitly) depend on other problem parameters, as detailed below.
• The learner receives a corrupted reward Y_t such that
    Y_t = ⟨θ, A_t⟩ + ε_t + c_t(A_t),    (1)
where θ ∈ R^d is an unknown parameter vector, and (ε_t)_{t=1}^T is a random noise term, which is assumed to be zero-mean and 1-sub-Gaussian.

We assume that the action feature vectors are unique, span R^d, and are bounded, i.e., ‖a‖ ≤ 1 for all a ∈ A (see Remark 1 below for the general case). We similarly make the standard assumption ‖θ‖ ≤ 1, which implies that |⟨θ, a⟩| ≤ 1 for all a ∈ A. We consider an adversary/attacker that has complete knowledge of the problem: it knows both A and θ, and observes both the precise arm pulled and the noise realization ε_t before choosing its attack. The total attack budget of the adversary is given by ∑_{t=1}^T |c_t(A_t)| ≤ C. We will consider both the case that C is known and the case that it is unknown to the learner.

The goal of the learner is to minimize the cumulative regret, defined as
    R_T = ∑_{t=1}^T max_{a∈A} ⟨θ, a − A_t⟩.    (2)
Broadly speaking, we say that an algorithm that attains low regret (e.g., sublinear scaling R_T = o(T)) is corruption-tolerant or robust to adversarial attacks.
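To make the interaction protocol concrete, the following is a minimal simulation sketch of rounds following (1). It assumes Gaussian noise and a budget-tracking adversary supplied as a function; the class and method names (CorruptedLinearBandit, pull, attack_fn) are illustrative conveniences, not part of the paper.

```python
import numpy as np

class CorruptedLinearBandit:
    """Simulates the interaction protocol (1): Y_t = <theta, A_t> + eps_t + c_t(A_t)."""

    def __init__(self, actions, theta, budget, noise_std=1.0, seed=0):
        self.actions = actions          # (k, d) array of arm feature vectors, ||a|| <= 1
        self.theta = theta              # unknown parameter vector, ||theta|| <= 1
        self.remaining = budget         # adversary's total budget C
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def pull(self, arm_idx, attack_fn=None):
        a = self.actions[arm_idx]
        eps = self.rng.normal(0.0, self.noise_std)
        c = 0.0
        if attack_fn is not None:
            # The adversary sees the pulled arm (and may see eps) before attacking.
            c = float(np.clip(attack_fn(a, eps), -self.remaining, self.remaining))
            self.remaining -= abs(c)    # enforce sum_t |c_t(A_t)| <= C
        return float(a @ self.theta) + eps + c
```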
1.2 Related Work

Recent surveys on bandit algorithms can be found in [16, 25]; here we focus on the most relevant works considering stochastic settings with adversarial corruptions and bandit attacks. Some additional related work is discussed in Appendix F.

Adversarial attacks on standard bandit algorithms (e.g., UCB, ε-greedy, and Thompson sampling) were introduced for the case of independent arms (i.e., a classical multi-armed bandit setting) in [11, 20, 21], and for linear bandits in [8]. We will use the latter in our experiments to test the robustness of the proposed algorithms, along with other heuristic attacks.

In the case of independent arms, Lykouris et al. [22] show that a simple elimination algorithm with enlarged confidence bounds is robust and near-optimal when the attack budget C is known. For unknown C, a randomized algorithm is given whose regret bound roughly amounts to scaling the uncorrupted regret by C, i.e., a multiplicative dependence. Subsequently, Gupta et al. [10] establish an improved algorithm whose regret bound is near-optimal, with an additive dependence on C.

Bogunovic et al. [4] consider corruption-tolerant bandits for functions with a bounded RKHS norm, which includes linear bandits as a special case. The algorithm of [4] is based on that of [22], and has analogous guarantees. However, even in the case of known C, the best dependence obtained is multiplicative; the possibility of an additive dependence was left as an open problem.

Li et al. [19] also study stochastic linear bandits with adversarial corruptions. A distinction in [19] is that the regret bounds are instance-dependent, relying on positive gaps between the function values at corner points of the polyhedral domain. These results are complementary to the instance-independent bounds with a finite number of arms that we seek in this paper, and neither can be deduced from the other; see [4, App. K] for a detailed discussion. Finally, we conclude this section by noting that the previously mentioned works [22, 10, 4, 19] consider a weaker adversary that cannot observe the current action, which is often assumed when designing efficient bandit attacks [11, 20].

Contributions. Our main contributions are as follows:
• For known C, we present a Robust Phased Elimination algorithm, and show that it recovers a near-optimal regret bound when C = 0, while incurring an additive O(d^{3/2} C log T) term (up to log log(dT) factors) more generally. A standard lower bound argument [22] shows that Ω(C) dependence is unavoidable, thus certifying the upper bound as being optimal up to log factors.
• For unknown C, we modify our algorithm to gradually decrease its confidence bound enlargement term over time, and show that we only pay a further O(C²) term compared to the known-C case. We also provide a novel algorithm-independent lower bound showing that the C² dependence is unavoidable for any algorithm that attains the usual non-corrupted bound in the case C = 0. Thus, we again establish the near-optimality of our regret bounds, and demonstrate a fundamental gap between the known-C and unknown-C settings.
• We introduce a linear contextual problem with adversarial attacks, and show that under the model of diverse contexts from [12], the greedy algorithm not only attains near-optimal regret in the uncorrupted setting (as shown in [12]), but is also robust to adversarial attacks.

2 Robust Phased Elimination Algorithm

We present our Robust Phased Elimination algorithm in Algorithm 1, which builds on non-robust elimination algorithms [16, 17, 27] (see Appendix F for further discussion). The known-C and unknown-C variants only differ in Line 1. The algorithm runs in epochs of exponentially increasing length and maintains a set of potentially optimal actions. In every epoch, the following steps are performed: (i) compute a near-optimal experimental design over the set of potentially optimal actions, and play each action from this subset in proportion to the computed design (Lines 2-4); (ii) compute an estimate of θ, and use it to eliminate actions that appear suboptimal (Lines 5-6). We proceed by describing these steps in more detail.
Action selection. To introduce the action selection procedure, consider the problem of finding a probability distribution ζ: A → [0, 1] that solves the following:
    minimize_ζ max_{a∈A} ‖a‖²_{Γ(ζ)⁻¹}   s.t. ∑_{a∈A} ζ(a) = 1,    (3)
where Γ(ζ) = ∑_{a∈A} ζ(a) a aᵀ, and ‖a‖_M = √(aᵀ M a). A classical result from [14] states that the optimal solution ζ* exists, and achieves max_{a∈A} ‖a‖²_{Γ(ζ*)⁻¹} = d with |supp(ζ*)| ≤ d(d+1)/2. For our purposes, however, it suffices to solve the problem in (3) only near-optimally. As noted in [17], there exists a near-optimal design of smaller support than d(d+1)/2. In particular, if A spans R^d, then we can efficiently compute ζ: A → [0, 1] such that
    max_{a∈A} ‖a‖²_{Γ(ζ)⁻¹} ≤ 2d   and   |supp(ζ)| ≤ 4d(log log d + 18).    (4)
This follows from [26, Proposition 3.17], which provides a polynomial-time Frank-Wolfe algorithm. Hence, in every epoch h, the algorithm recomputes a near-optimal design from (4) over the subset of actions that are still potentially optimal, i.e., A_h. It then plays each action from this subset in proportion to the computed design, but it also makes sure that every arm in the support is played at least some minimal number of times ⌈νm_h⌉, where ν is an input truncation parameter to be chosen below, and m_h is an exponentially increasing parameter proportional to the epoch length.
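As an illustration of how such a design can be computed, the sketch below runs the classical Frank-Wolfe (Wynn-Fedorov) iteration for (3); by the Kiefer-Wolfowitz equivalence theorem this drives max_a ‖a‖²_{Γ(ζ)⁻¹} toward d. The step size and stopping rule below are standard textbook choices rather than the exact scheme of [26, Prop. 3.17].

```python
import numpy as np

def g_optimal_design(A, iters=1000):
    """Frank-Wolfe iteration for the design problem (3).

    A: (k, d) array of arm features whose rows span R^d.
    Returns design weights zeta of shape (k,) summing to 1.
    """
    k, d = A.shape
    zeta = np.full(k, 1.0 / k)
    for _ in range(iters):
        G = A.T @ (A * zeta[:, None])                          # Gamma(zeta) = sum_a zeta(a) a a^T
        g = np.einsum('ij,jl,il->i', A, np.linalg.pinv(G), A)  # ||a||^2 in the Gamma^{-1} norm
        i = int(np.argmax(g))
        if g[i] <= 2 * d:                                      # near-optimality target from (4)
            break
        step = (g[i] / d - 1.0) / (g[i] - 1.0)                 # exact line-search step
        zeta = (1.0 - step) * zeta
        zeta[i] += step
    return zeta
```

In Algorithm 1, such a routine would be re-run at the start of each epoch on the active set A_h.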
Parameter estimation and arm elimination. Consider the estimator given in (6). This estimator only depends on the observations received in the current epoch, and hence, it is not affected by attacks suffered during previous epochs. However, it can still be biased due to the adversarial attacks suffered in the current epoch, and we need to account for this bias. In Lemma 4 (Appendix A), for any of the remaining potentially optimal actions, we bound the difference between the true mean reward and the estimated one, and show that this error grows linearly with the total attack budget C. Hence, the algorithm makes use of the enlarged confidence bounds in (7) to retain potentially optimal arms. Moreover, we show that when C is known, our estimator is guaranteed to have sufficient accuracy so that the optimal arm is always retained in (7) with high probability. For unknown C, this is not always the case, but we can control the level of suboptimality of the arms that are retained.

The estimator of θ is robust due to the fact that it averages the rewards that correspond to the same played action, which reduces the effect of the attack. Intuitively, actions that have higher importance according to the computed near-optimal design are played more times than others. Consequently, it is harder for the adversary to corrupt them, as it needs to use more of the attack budget. In addition, due to the introduced truncation, the algorithm plays each arm in the support of the computed design a fixed minimum number of times.
Remark 1. The following observations from [17] are useful: (i) While (4) is stated assuming the arms span R^d, we can simply work in the lower-dimensional subspace otherwise (e.g., when k < d); (ii) We can extend the algorithm and its analysis to infinite-arm settings using a covering argument.

Algorithm 1 Robust Phased Elimination
Require:
Actions A ⊂ R^d, confidence δ ∈ (0, 1), truncation parameter ν ∈ (0, 1), time horizon T
1: Initialize m₀ = 4d(log log d + 18) and, for each h ∈ {0, 1, ..., log₂ T − 1}, set Ĉ_h = C for known C, or Ĉ_h = min{ √T/(m₀ log₂ T), m₀√d · 2^{H̃−h} } with H̃ = log₂ T for unknown C. Initialize h = 0.
2: Compute a design ζ_h: A_h → [0, 1] such that
    max_{a∈A_h} ‖a‖²_{Γ(ζ_h)⁻¹} ≤ 2d   and   |supp(ζ_h)| ≤ 4d(log log d + 18).    (5)
3: Set u_h(a) = 0 if ζ_h(a) = 0, and u_h(a) = ⌈m_h max{ζ_h(a), ν}⌉ otherwise.
4: Take each action a ∈ A_h exactly u_h(a) times, with corresponding features (A_t)_{t=1}^{u_h} and rewards (Y_t)_{t=1}^{u_h} (implicitly depending on h), where u_h = ∑_{a∈A_h} u_h(a).
5: Estimate the parameter vector θ̂_h:
    θ̂_h = Γ_h⁻¹ ∑_{t=1}^{u_h} A_t u_h(A_t)⁻¹ ∑_{s∈T(A_t)} Y_s,   Γ_h = ∑_{a∈A_h} u_h(a) a aᵀ,    (6)
where T(a) = { s ∈ {1, ..., u_h} : A_s = a } is the set of times at which arm a is played.
6: Update the active set of arms:
    A_{h+1} ← { a ∈ A_h : max_{a'∈A_h} ⟨θ̂_h, a' − a⟩ ≤ 4√((d/m_h) log(2/δ)) + (4Ĉ_h/(m_h ν)) √(d(1 + νm₀)) }.    (7)
7: Set m_{h+1} ← 2m_h, h ← h + 1, and return to step 2 (terminating after T total arm pulls).
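For concreteness, the following sketch mirrors Algorithm 1 in code, reusing the g_optimal_design routine sketched above and the environment interface from Section 1.1. The confidence-width constants follow (7) as written above, and the interface names (env.pull, C_hat) are assumptions for illustration rather than a reference implementation.

```python
import numpy as np

def robust_phased_elimination(env, A, T, delta, C_hat):
    """Sketch of Algorithm 1. `env.pull(i)` returns the (possibly corrupted)
    reward of arm i, and `C_hat(h)` is the epoch-h corruption threshold
    (simply the constant C when the budget is known)."""
    k, d = A.shape
    m0 = int(np.ceil(4 * d * (np.log(np.log(d + 2)) + 18)))  # d+2 guards tiny d (cf. Remark after Thm. 1)
    nu = 1.0 / m0                            # truncation parameter from Theorem 1
    active = np.arange(k)
    m_h, h, t = m0, 0, 0
    while t < T:
        zeta = g_optimal_design(A[active])   # near-optimal design, step (5)
        u = np.where(zeta > 0, np.ceil(m_h * np.maximum(zeta, nu)), 0).astype(int)
        y_bar = np.zeros(len(active))        # per-arm averaged rewards
        for j, n in enumerate(u):            # step 4: play each supported arm n times
            for _ in range(n):
                if t >= T:
                    return active
                y_bar[j] += env.pull(active[j]) / n
                t += 1
        Gamma = (A[active] * u[:, None]).T @ A[active]
        theta_hat = np.linalg.pinv(Gamma) @ ((A[active] * u[:, None]).T @ y_bar)  # estimator (6)
        width = 4 * np.sqrt(d / m_h * np.log(2 / delta)) \
            + 4 * C_hat(h) / (m_h * nu) * np.sqrt(d * (1 + nu * m0))  # enlarged bound (7)
        scores = A[active] @ theta_hat
        active = active[scores >= scores.max() - width]  # step 6: eliminate arms
        m_h, h = 2 * m_h, h + 1
    return active
```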
2.1 Upper Bounds

We first provide a regret bound for the known-C case, proved in Appendix A.

Theorem 1. For any attack budget C ≥ 0, with probability at least 1 − δ, the Robust Phased Elimination algorithm with known C and truncation parameter ν = 1/(4d(log log d + 18)) satisfies
    R_T = Õ( √(dT log(k/δ)) + C d^{3/2} log T ),    (8)
where the notation Õ(·) hides log log(dT) factors. (When d = 1, we have log log d = −∞, but the results hold with log log d replaced by log(1 + log d).)

When C = 0, we recover the scaling of [16, Thm. 22.1], which is near-optimal in light of known lower bounds [7]. In Section 2.2, we will argue that the second term is also near-optimal.

Next, we consider the case that the total attack budget C is unknown to the learner. We start by discussing the choice of Ĉ_h in Algorithm 1. Let H be the number of epochs, and note that H̃ = log₂ T is a deterministic upper bound on H (see Appendix A.2 for a short proof). Then, the choice in Algorithm 1 can be rewritten as Ĉ_h = min{ √T/(m₀ log₂ T), m₀√d · 2^{H̃−h} }. Observe that the epochs' lengths u_h and corruption thresholds Ĉ_h are exponentially increasing and decreasing, respectively. It follows that the algorithm is more cautious/robust in early epochs (i.e., uses larger thresholds).
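The full threshold schedule can be computed upfront; a minimal sketch, with the constants taken from Line 1 of Algorithm 1 and the same small-d guard as before:

```python
import numpy as np

def corruption_thresholds(T, d):
    """Per-epoch thresholds C_hat_h = min{sqrt(T)/(m0 log2 T), m0 sqrt(d) 2^(H-h)}."""
    m0 = 4 * d * (np.log(np.log(d + 2)) + 18)
    H = int(np.ceil(np.log2(T)))              # deterministic upper bound on the number of epochs
    cap = np.sqrt(T) / (m0 * np.log2(T))      # largest budget handled by Theorem 2
    return [min(cap, m0 * np.sqrt(d) * 2.0 ** (H - h)) for h in range(H)]
```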
Our second main result is stated as follows, and proved in Appendix A.

Theorem 2. For any C ≤ √T/(4d(log log d + 18) log₂ T), with probability at least 1 − δ, the Robust Phased Elimination algorithm with unknown C and truncation parameter ν = 1/(4d(log log d + 18)) satisfies
    R_T = Õ( √(dT log(k/δ)) + C d^{3/2} log T + C² ).    (9)

This result matches that of Theorem 1, but with an additional penalty of C². In fact, due to this penalty, the regret bound (9) trivially holds when C = Ω(√T), because we have R_T ≤ 2T due to our assumption of bounded mean rewards (|⟨θ, a⟩| ≤ 1). If d = ω(1), then there still remains the regime √T/((d log log d) log T) ≪ C ≪ √T, but in any case, one can slightly increase the final term and state that R_T = Õ( √(dT log(k/δ)) + C d^{3/2} log T + C² d log T ) for arbitrary C. The question remains of whether the C² dependence is optimal; we provide a positive answer in the following.

2.2 Lower Bounds

Using the same reasoning as in the standard multi-armed bandit setting [22], it is straightforward to see that Ω(C) regret is unavoidable: The adversary can simply shift all rewards to zero for the first C rounds, and the learner can then do no better than random guessing. For completeness, this argument is given in more detail in Appendix C. This argument holds even when C is known, and thus, we see that the second term in Theorem 1 is optimal up to at most an Õ(log T) factor for fixed d. We expect that an improvement on the d^{3/2} dependence may be possible, but the following result, proved in Appendix C, shows that at least Ω(Cd) is unavoidable.
Theorem 3. For any dimension d, there exists an instance with k = d such that any algorithm (even with knowledge of C) must incur Ω(Cd) regret with probability at least 1/2.

Next, we provide another lower bound that allows us to claim that, in a certain sense, the regret bound of Theorem 2 is near-optimal when C is unknown.
Theorem 4. For d = 2 and k = 2, for any algorithm that guarantees R_T ≤ R̄_T^{(0)} with probability at least 1 − δ for some uncorrupted regret bound R̄_T^{(0)} ≤ T when C = 0, there exists an instance in which R_T = Ω(T) with probability at least 1 − δ when the attack budget is C = 2R̄_T^{(0)}.

The proof is given in Appendix C. While we focus on the simplest case d = k = 2, the proof can also be adapted to more general choices.
Discussion. Consider the general goal of attaining a regret upper bound of the form
    R_T ≤ R̄_T^{(0)} + f(C) log T,    (10)
for some f(·) satisfying f(0) = 0. Here we let the second term contain a log T factor in accordance with our upper bounds, but the following discussion still applies with only minor modifications when the log T factor is changed to poly(log T) or similar.

At first glance, it appears that f(C) should ideally be linear in C, and R̄_T^{(0)} should ideally be an order-optimal regret bound for the non-corrupted setting. However, Theorem 4 shows that we cannot have both terms exhibiting their "ideal" behavior simultaneously. To see this, note that the ideal uncorrupted regret bound behaves as R̄_T^{(0)} = Θ̃(√T) (for fixed d, k, and δ) [7, 16]. Then, to be consistent with Theorem 4, we require f(C) log T = Ω̃(T) for C = Θ(√T), and hence f(C) = Ω̃(C²/log C). On the other hand, it may be possible to avoid the C² dependence (i.e., improve robustness) at the expense of a worse uncorrupted regret. This idea is left for future work.

3 Contextual Greedy Algorithm

In this section, we consider a k-arm linear contextual bandit problem with a single unknown d-dimensional parameter vector θ ∈ R^d (e.g., see [12]). In each round t, contexts a_{1,t}, ..., a_{k,t} are presented to the learner, each in R^d and associated with one action. The learner then chooses an action indexed by I_t ∈ {1, ..., k} and observes the corrupted reward
    Y_t = ⟨θ, a_{I_t,t}⟩ + ε_t + c_t(a_{I_t,t}),    (11)
where the same assumptions from Section 1.1 hold for both (ε_t)_{t=1}^T and c_t(·) (with attack budget C), and ‖θ‖ ≤ 1. Similar to (2), the cumulative regret is R_T = ∑_{t=1}^T max_{i∈{1,...,k}} ⟨θ, a_{i,t} − a_{I_t,t}⟩.

In general, the introduction of contexts may significantly complicate the problem, with algorithms such as the one in Section 2 being difficult to extend, particularly with unknown C. However, perhaps surprisingly, a line of recent works has demonstrated that simple exploration-free greedy methods can provably work well (in the non-corrupted setting) under mild additional assumptions on the contexts. These assumptions amount to kinds of context diversity [3, 12, 23] ensuring that the collected samples are sufficiently informative for learning θ accurately.

Most related to this paper is [12], who analyze the greedy algorithm in the case that arbitrary context vectors undergo small random perturbations. Motivated by these results, we investigate the performance of the greedy algorithm under the same assumption on the contexts, but with the addition of adversarial attacks. Our main finding is that the context diversity assumption not only removes the need for explicit exploration [12], but also automatically inherits near-optimal robustness to adversarial attacks, with no need to know the attack budget C.
Context generation. In more detail, the setup of [12] is introduced as follows: An arbitrary tuple µ_{1,t}, ..., µ_{k,t} of mean context vectors is given (possibly selected by an adaptive adversary based on the history of contexts, actions, and rewards), such that ‖µ_{i,t}‖ ≤ 1 for all i, t. For every available action, the context vector is then generated as a_{i,t} = µ_{i,t} + ξ_{i,t}, where the random perturbation vectors ξ_{i,t} are drawn independently from some zero-mean distributions D_{1,t}, ..., D_{k,t}. We consider perturbations that are (r, δ)-bounded for some r ≤ 1 according to the following definition [12]:
    P[ ‖ξ_{i,t}‖_∞ ≤ r for all arms i and rounds t ] ≥ 1 − δ.    (12)

As outlined above, we are interested in the diversity of the samples collected by the greedy algorithm (defined below). The main idea is that the observed contexts should cover all directions in order to enable good estimation of the latent vector θ. Consequently, we make use of the notion of diversity from [12], which takes into account that the learner observes rewards for contexts that are selected greedily, and thus only observes a conditional distribution of contexts. Specifically, following [12], a distribution D is called (r, λ₀)-diverse with parameters r > 0 and λ₀ > 0 if, for a = µ + ξ with ξ ∼ D and any µ ∈ R^d, it holds for all θ̂ ∈ R^d and b̂ ∈ R satisfying b̂ ≤ r‖θ̂‖ that
    λ_min( E_{ξ∼D}[ a aᵀ | θ̂ᵀξ ≥ b̂ ] ) ≥ λ₀.    (13)
The overall perturbations are (r, λ₀)-diverse if the distributions D_{i,t} are (r, λ₀)-diverse for all i and t. This diversity condition is the main ingredient in [12] for proving that the minimum eigenvalue of the empirical covariance matrix, λ_min( ∑_{τ=1}^t a_{I_τ,τ} a_{I_τ,τ}ᵀ ), grows linearly with t. In Lemma 6 (Appendix B), we demonstrate that this is the main quantity impacting the accuracy of the estimator of θ, and in turn, the regret bounds in the corrupted setting.
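The conditional-eigenvalue condition (13) can be checked numerically for a given perturbation law. Below is a small Monte-Carlo sketch for Gaussian perturbations ξ ∼ N(0, η²I); it estimates λ_min of the conditional second-moment matrix given the greedy-selection event θ̂ᵀξ ≥ b̂. The function name and interface are illustrative assumptions.

```python
import numpy as np

def conditional_lambda_min(mu, theta_hat, b_hat, eta, n=200000, seed=0):
    """Monte-Carlo estimate of lambda_min(E[a a^T | theta_hat^T xi >= b_hat]) in (13)."""
    rng = np.random.default_rng(seed)
    xi = rng.normal(0.0, eta, size=(n, mu.shape[0]))
    keep = xi @ theta_hat >= b_hat               # conditioning event induced by greedy play
    a = mu + xi[keep]                            # surviving perturbed contexts
    second_moment = a.T @ a / max(keep.sum(), 1)
    return float(np.linalg.eigvalsh(second_moment)[0])
```

Sweeping b̂ up to r‖θ̂‖ over a range of µ and θ̂ then yields an empirical λ₀ for the chosen η.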
Greedy algorithm. In round t, the greedy algorithm (see Algorithm 2) receives a set of contexts {a_{1,t}, ..., a_{k,t}}, and chooses the best action according to the least squares estimate of θ:
    I_t = arg max_{i∈{1,...,k}} ⟨θ̂_t, a_{i,t}⟩,   where   θ̂_t = arg min_{θ'} ∑_{τ=1}^{t−1} ( ⟨θ', a_{I_τ,τ}⟩ − Y_τ )².    (14)
Our regret bound for this setup is stated as follows, and proved in Appendix B.
Theorem 5. Suppose that ‖a_{i,t}‖ ≤ 1 for all i, t, the random context perturbations are (r, 1/T)-bounded and (r, λ₀)-diverse with r ≤ 1, the reward noise is 1-sub-Gaussian, and the attack budget is C ≥ 0. Then with probability at least 1 − δ, the greedy algorithm has regret bounded by
    R_T = O( (1/λ₀)( √(dT log(dT/δ)) + C log T + log(dT/δ) ) + √(T log(k/δ)) ).    (15)
Under the mild assumptions δ = e^{−O(dT)} and k/δ = e^{O(dT)}, this bound simplifies to
    R_T = O( (1/λ₀)( √(dT log(Td/δ)) + C log T ) ).    (16)

In addition, when C = 0, Theorem 5 reduces to the result of [12]. The additional (1/λ₀) C log T term is essentially optimal when λ₀ = Θ(1), since a simple argument from [22] gives an Ω(C) lower bound (see Appendix C). In Corollary 1 (Appendix B), we specialize Theorem 5 to the case that the perturbations are Gaussian, i.e., every ξ_{i,t} is drawn independently from N(0, η²I), and show that the greedy algorithm has sublinear regret in the low-η regime.

4 Experiments

In this section, we evaluate the performance of the algorithms studied in this paper, along with the baselines LinUCB [18, 16] and Thompson sampling [1]. We consider both the Robust Phased Elimination algorithm and the contextual greedy algorithm, starting with the latter. We use LinUCB as described in [16, Sec. 19.2] with least-squares regularization parameter λ = 1 and a fixed confidence parameter δ, and Thompson sampling [1] uses an i.i.d. Gaussian prior with a fixed variance.
Figure 1: Contextual synthetic experiment: (Left) Regret at time T = 3500 as a function of C for a fixed perturbation level η > 0; (Middle two) Regret as a function of time with η = 0 and with η > 0; (Right) Performance of Greedy at time T = 3500 with C = 150 and varying η.
Figure 2: MovieLens experiment: (Left 3) Regret as a function of time with C = 150 for Greedy, LinUCB, and Thompson sampling; (Right) Regret of all algorithms under the Garcelon et al. attack.

4.1 Attacks

We consider the following attack algorithms, each depending on a target arm a_target and/or a target parameter vector θ_target. These are briefly outlined as follows, with more details in Appendix D and an illustrative code sketch after the list:

• Garcelon et al. attack.
This attack is a minor modification of that of [8], leaving pulls of a_target uncorrupted, while pushing all other rewards down to the minimum value.
• Oracle MAB attack.
This attack from [11] pushes the reward of any a ≠ a_target to some margin ε below that of a_target, or leaves the reward unchanged if such a margin is already met.
• Simple θ-based attack. This attack acts in the same way as that of Garcelon et al., but with a_target always chosen as arg max_a ⟨a, θ_target⟩. This is equivalent to that of [8] in the non-contextual setting, but otherwise may differ due to a_target varying with time.
• Flip-θ attack. This attack simply flips the reward from ⟨θ, a⟩ to ⟨−θ, a⟩.

Note that the terminology "oracle" refers to attacks that use knowledge of θ, which we assume to be permitted in this paper (the Flip-θ attack also falls in this category). We set a_target to be the first arm, which has the same effect as choosing any fixed arm (since our arm feature vectors are generated in a symmetric manner). In addition, we let θ_target be uniform on the unit sphere in the simple θ-based attack, and fix a small positive margin ε in the Oracle MAB attack.
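A minimal sketch of two of these attacks, written as reward-perturbation functions that respect the remaining budget; the exact bookkeeping used in our experiments (Appendix D) may differ, and the default margin value is illustrative.

```python
import numpy as np

def flip_theta_attack(theta, a, remaining):
    """Flip-theta: move the mean reward from <theta, a> to <-theta, a>."""
    c = -2.0 * float(theta @ a)
    return c if abs(c) <= remaining else 0.0   # skip if the budget would be exceeded

def oracle_mab_attack(theta, a, a_target, remaining, margin=0.3):
    """Oracle MAB attack [11]: push any non-target arm's mean reward to
    `margin` below the target's, if that margin is not already met."""
    if np.array_equal(a, a_target):
        return 0.0
    gap = float(theta @ a) - (float(theta @ a_target) - margin)
    c = -gap if gap > 0 else 0.0               # corrupt only when the margin is unmet
    return c if abs(c) <= remaining else 0.0
```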
Contextual synthetic experiment. In this experiment, we consider the contextual setting with contexts having uniform entries and Gaussian perturbations N(0, η²I) with η > 0; see Appendix E for the full details. We consider k = 25 arms, T = 5000 rounds, and attack budget C = 50. At each time instant, we plot the cumulative regret averaged over all trials, and error bars indicate one standard deviation. In Appendix E, we also provide analogous plots and discussion for the case that C = 150.

In Figure 1 (Left), we plot the regret of Greedy at T = 3500 as a function of C for a fixed η > 0. We observe a linear increase, which is in agreement with our theory. Analogous plots for LinUCB, Thompson sampling, and other values of η can be found in Appendix E. The middle two plots in Figure 1 show the regret as a function of time under the two most effective attacks, with η = 0 and with η > 0. We see that the regret curves are still increasing linearly under the Flip-θ attack by time T = 5000 when η = 0, whereas they are nearly flat when η > 0. While our theory only supports the robustness of Greedy, these experiments suggest that LinUCB and Thompson sampling may also enjoy similar robustness under context diversity. Finally, Figure 1 (Right) plots the regret of Greedy at T = 3500 as a function of η when C = 150. We observe that once η moves past a certain level, the performance remains fairly consistent, with a general (but not definitive) trend of decreasing regret. The greatest difference is at η = 0, particularly when the standard deviation is considered.

MovieLens experiment. We use the MovieLens-100K dataset in a similar manner to [5]; see Appendix E for details. In each trial of the experiment, we select a uniformly random user and treat the 1682 movies as possible contexts. At each time instant, k = 30 of these movies are chosen uniformly at random and presented as the context vectors.
Figure 3: Non-contextual synthetic experiment with 10 trials: (Left) Average regret as a function of time; (Middle) Worst run among 10; (Right) Second-worst run among 10.

Hence, a subset of the movie vectors form the contexts, and a fixed user vector forms θ. We set T = 20000 and C = 100, and we plot the regret averaged over 10 trials (each corresponding to a different user).

In Figure 2, we plot the regret as a function of time for Greedy, LinUCB, and Thompson sampling. Despite the lack of explicit context perturbation in this experiment, we see that the algorithms are again able to recover from the attacks, suggesting that the various movies in the dataset are sufficiently diverse. On the other hand, we do not claim the attacks here to be optimal, and it is possible that stronger attacks may incur linear regret. In Figure 2 (Right), we plot all three algorithms under the strongest attack and under no attack. We see that Greedy has very low regret when there is no attack, but slightly higher regret when attacked.

Robust Phased Elimination experiment. We now turn to experiments for the robust PE algorithm (Algorithm 1), with some minor practical changes detailed in Appendix E. We use the above synthetic experimental setup with the context perturbations removed (i.e., η = 0), and with d = 5, k = 50, T = 40000, and C = 150. For comparison, we also include non-robust PE, which removes the second term in (7).

For LinUCB, Thompson sampling, and non-robust PE, we continue to attack right from the start. However, for robust PE, this is a poor attack strategy, since the algorithm initially uses a very stringent condition for elimination. Instead, following insight from the proof of Theorem 2, we start the attack at the first epoch for which Ĉ_h < C. We consider the Flip-θ attack of Section 4.1, as well as an additional Top-N attack targeted at eliminating good arms: Whenever any of the top N remaining arms is pulled, push the reward down to the minimum value. We consider both N = 3 and N = 5. We focus on the case of unknown C here, and present similar plots for known C in Appendix E.

In Figure 3 (Left), we see that the average regret of all algorithms is similar by the end of the time horizon; however, an inspection of the error bars reveals that this is not the full story. In particular, the regret of LinUCB and Thompson sampling varies considerably depending on whether the attack was successful or not, whereas robust PE exhibits much lower variation. To highlight this, we plot the regret from the worst and second-worst runs out of 10 (as measured at time T) in Figure 3 (Middle) and Figure 3 (Right). In Appendix E, we provide analogous plots in the case of 40 trials, showing the worst 4-out-of-40 runs and observing similar behavior to Figure 3.

We see that LinUCB and Thompson sampling visibly have linear regret, whereas the regret of robust PE flattens out by the end of the time horizon even for these worst-2-of-10 curves, indicating better high-probability behavior. On the other hand, these experiments suggest the possibility of algorithms with improved finite-time performance guarantees, which was not the focus of this paper.

5 Conclusion

We have considered the stochastic linear bandit problem in the presence of adversarial attacks/corruptions. We provided novel algorithms in both the standard and contextual settings that are provably robust against such attacks. We demonstrated near-optimal regret bounds in all cases, and to our knowledge, we are the first to do so in each case.
A possible direction for future work is to consider a setting in which both the rewards and the contexts can be altered by the adversary, subject to a limited attack budget.

Broader Impact
Who may benefit from this research.
This is a theory-oriented paper targeted at the research community. The bandit algorithms studied could potentially be useful for practitioners in areas where data can be altered by malicious adversaries, such as online advertising and recommender systems.
Who may be put at disadvantage from this research.
We are not aware of any significant or specific risks of placing anyone at a disadvantage.
Consequences of failure of the system.
Since our algorithms are designed under an abstract mathematical framework, it is difficult to anticipate the consequences of failure in practice. We note that the algorithms are only shown to be robust to a particular type of adversarial attack on the rewards, and thus could still be non-robust to other attacks (e.g., attacks on the context vectors).
Potential biases.
Various notions of fairness have been studied in the multi-armed bandit literature, and the algorithms studied in this paper do not attempt to address this issue. Thus, they may be subject to the same fairness limitations as standard bandit algorithms.
Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 815943) and an ETH Zürich Postdoctoral Fellowship (19-2 FEL-47). J. Scarlett was supported by the Singapore National Research Foundation (NRF) under grant number R-252-000-A74-281.
References

[1] S. Agrawal and N. Goyal, "Thompson sampling for contextual bandits with linear payoffs," in International Conference on Machine Learning, 2013.
[2] P. Auer and C.-K. Chiang, "An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits," in Conference on Learning Theory, 2016.
[3] H. Bastani, M. Bayati, and K. Khosravi, "Mostly exploration-free algorithms for contextual bandits," arXiv preprint arXiv:1704.09011, 2017.
[4] I. Bogunovic, A. Krause, and J. Scarlett, "Corruption-tolerant Gaussian process bandit optimization," in Conference on Artificial Intelligence and Statistics, 2020.
[5] I. Bogunovic, J. Scarlett, S. Jegelka, and V. Cevher, "Adversarially robust optimization with Gaussian processes," in Advances in Neural Information Processing Systems, 2018.
[6] S. Bubeck and A. Slivkins, "The best of both worlds: Stochastic and adversarial bandits," in Conference on Learning Theory, 2012.
[7] V. Dani, T. P. Hayes, and S. M. Kakade, "Stochastic linear optimization under bandit feedback," in Conference on Learning Theory, 2008.
[8] E. Garcelon, B. Roziere, L. Meunier, O. Teytaud, A. Lazaric, and M. Pirotta, "Adversarial attacks on linear contextual bandits," arXiv preprint arXiv:2002.03839, 2020.
[9] A. Ghosh, S. R. Chowdhury, and A. Gopalan, "Misspecified linear bandits," in AAAI Conference on Artificial Intelligence, 2017.
[10] A. Gupta, T. Koren, and K. Talwar, "Better algorithms for stochastic bandits with adversarial corruptions," arXiv preprint arXiv:1902.08647, 2019.
[11] K.-S. Jun, L. Li, Y. Ma, and J. Zhu, "Adversarial attacks on stochastic bandits," in Advances in Neural Information Processing Systems, 2018.
[12] S. Kannan, J. H. Morgenstern, A. Roth, B. Waggoner, and Z. S. Wu, "A smoothed analysis of the greedy algorithm for the linear contextual bandit problem," in Advances in Neural Information Processing Systems, 2018.
[13] S. Kapoor, K. K. Patel, and P. Kar, "Corruption-tolerant bandit learning," Machine Learning, vol. 108, no. 4, pp. 687–715, Apr. 2019.
[14] J. Kiefer and J. Wolfowitz, "The equivalence of two extremum problems," Canadian Journal of Mathematics, vol. 12, pp. 363–366, 1960.
[15] A. Krishnamurthy, Z. S. Wu, and V. Syrgkanis, "Semiparametric contextual bandits," in International Conference on Machine Learning, 2018.
[16] T. Lattimore and C. Szepesvári, "Bandit algorithms," preprint, vol. 28, 2018.
[17] T. Lattimore and C. Szepesvári, "Learning with good feature representations in bandits and in RL with a generative model," in International Conference on Machine Learning, 2020.
[18] L. Li, W. Chu, J. Langford, and R. E. Schapire, "A contextual-bandit approach to personalized news article recommendation," in International Conference on World Wide Web, 2010.
[19] Y. Li, E. Y. Lou, and L. Shan, "Stochastic linear optimization with adversarial corruption," arXiv preprint arXiv:1909.02109, 2019.
[20] F. Liu and N. Shroff, "Data poisoning attacks on stochastic bandits," in International Conference on Machine Learning, 2019.
[21] G. Liu and L. Lai, "Action-manipulation attacks on stochastic bandits," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[22] T. Lykouris, V. Mirrokni, and R. Paes Leme, "Stochastic bandits robust to adversarial corruptions," in ACM SIGACT Symposium on Theory of Computing, 2018.
[23] M. Raghavan, A. Slivkins, J. W. Vaughan, and Z. S. Wu, "The externalities of exploration and how data diversity helps exploitation," arXiv preprint arXiv:1806.00543, 2018.
[24] Y. Seldin and A. Slivkins, "One practical algorithm for both stochastic and adversarial bandits," in International Conference on Machine Learning, 2014.
[25] A. Slivkins et al., "Introduction to multi-armed bandits," Foundations and Trends in Machine Learning, vol. 12, no. 1-2, pp. 1–286, 2019.
[26] M. J. Todd, Minimum-Volume Ellipsoids. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2016.
[27] M. Valko, R. Munos, B. Kveton, and T. Kocák, "Spectral bandits for smooth graph functions," in International Conference on Machine Learning, 2014.
[28] A. Zanette, A. Lazaric, M. Kochenderfer, and E. Brunskill, "Learning near optimal policies with low inherent Bellman error," arXiv preprint arXiv:2003.00153, 2020.

Supplementary Material
Stochastic Linear Bandits Robust to Adversarial Attacks
Ilija Bogunovic, Arpan Losalka, Andreas Krause, Jonathan Scarlett
A Proofs for Section 2 (Robust Phased Elimination Algorithm)
A.1 Single Epoch Analysis (Known Corruption Budget)
In this section, we consider a single fixed epoch indexed by h. Since the analysis here holds for any epoch, we omit the subscript (·)_h throughout; in particular, θ̂ = θ̂_h and Γ = Γ_h are as given in (6), A = A_h is the active set of arms, u(a) = u_h(a) is the number of times a is played, and u = u_h is the total length of the epoch.
Lemma 1. For any action b ∈ A, the following holds with probability at least 1 − δ:
    |⟨b, θ̂ − θ⟩| ≤ ‖b‖_{Γ⁻¹} √(2 log(2/δ)) + (C/(mν)) √(∑_{a∈A} u(a)) · ‖b‖_{Γ⁻¹}.    (17)
Proof. We recall the definition of the set T(a) = { s ∈ {1, ..., u} : A_s = a }, whose cardinality is given by u(a). We characterize the considered estimator of θ as follows:
    θ̂ = Γ⁻¹ ∑_{t=1}^u A_t u(A_t)⁻¹ ∑_{s∈T(A_t)} Y_s    (18)
      = Γ⁻¹ ∑_{t=1}^u A_t u(A_t)⁻¹ ( ∑_{s∈T(A_t)} ( ⟨θ, A_s⟩ + ε_s + c_s(A_s) ) )    (19)
      = ( Γ⁻¹ ∑_{t=1}^u A_t A_tᵀ θ ) + ( Γ⁻¹ ∑_{t=1}^u A_t u(A_t)⁻¹ ∑_{s∈T(A_t)} ε_s ) + ( Γ⁻¹ ∑_{t=1}^u A_t u(A_t)⁻¹ ∑_{s∈T(A_t)} c_s(A_s) )    (20)
      = θ + ( Γ⁻¹ ∑_{t=1}^u A_t ε_t ) + ( Γ⁻¹ ∑_{t=1}^u A_t u(A_t)⁻¹ ∑_{s∈T(A_t)} c_s(A_s) ),    (21)
where (19) uses the decomposition of Y_s into the reward/noise/corruption, (20) uses u(a) = |T(a)| and the fact that all s ∈ T(A_t) have A_s = A_t, and (21) uses Γ = ∑_{a∈A} u(a) a aᵀ = ∑_{t=1}^u A_t A_tᵀ.

By (21), for any action b ∈ A, we have
    |⟨b, θ̂ − θ⟩| ≤ | bᵀ Γ⁻¹ ∑_{t=1}^u A_t ε_t | + | bᵀ Γ⁻¹ ∑_{t=1}^u A_t u(A_t)⁻¹ ∑_{s∈T(A_t)} c_s(A_s) |.    (22)
We proceed by bounding the two terms separately. The second term can be rewritten as follows:
    | bᵀ Γ⁻¹ ∑_{t=1}^u A_t u(A_t)⁻¹ ∑_{s∈T(A_t)} c_s(A_s) | = | ∑_{a∈A, u(a)≠0} (C_a/u(a)) u(a) bᵀ Γ⁻¹ a |,    (23)
where we use C_a to denote ∑_{s∈T(a)} c_s(a), i.e., the sum of corruptions for arm a in the epoch, and we keep the factor u(a)/u(a) for convenience in what follows.

Next, we have
    | ∑_{a∈A, u(a)≠0} (C_a/u(a)) u(a) bᵀ Γ⁻¹ a | ≤ ∑_{a∈A, u(a)≠0} (C/u(a)) u(a) | bᵀ Γ⁻¹ a |    (24)
      ≤ (C/(mν)) ∑_{a∈A} u(a) | bᵀ Γ⁻¹ a |    (25)
      ≤ (C/(mν)) √( ( ∑_{a∈A} u(a) ) · bᵀ ( ∑_{a∈A} u(a) Γ⁻¹ a aᵀ Γ⁻¹ ) b )    (26)
      = (C/(mν)) √( ∑_{a∈A} u(a) ) · ‖b‖_{Γ⁻¹},    (27)
where (24) uses the triangle inequality and |C_a| ≤ C for every a, (25) holds since u(a) ≥ νm by the choice of u(a), (26) follows by multiplying and dividing by ∑_{ā∈A} u(ā) and applying E[|Z|] ≤ √(E[Z²]) with respect to the distribution u(a)/∑_{ā∈A} u(ā), and (27) follows by taking the Γ⁻¹ factors outside the sum and applying the definition of Γ = Γ_h from (6).

Since (ε_t)_{t=1}^u are independent and 1-sub-Gaussian, the first term in (22) is bounded via standard concentration results: From [16, Eq. (20.2)], with probability at least 1 − δ, we have
    | bᵀ Γ⁻¹ ∑_{t=1}^u A_t ε_t | ≤ ‖b‖_{Γ⁻¹} √(2 log(2/δ)).    (28)
Combining the bounds obtained in (27) and (28) completes the proof.

Next, we characterize the term ‖b‖_{Γ⁻¹} = √(bᵀ Γ⁻¹ b) appearing in (17).
Lemma 2. For any arm b ∈ A, it holds that
    ‖b‖²_{Γ⁻¹} ≤ 2d/m.    (29)
Proof. We have
    ‖b‖²_{Γ⁻¹} = bᵀ Γ⁻¹ b    (30)
      = bᵀ ( ∑_{a∈A} u(a) a aᵀ )⁻¹ b    (31)
      = bᵀ ( ∑_{a∈A} ⌈m max{ζ(a), ν}⌉ a aᵀ )⁻¹ b    (32)
      ≤ bᵀ ( ∑_{a∈A} m ζ(a) a aᵀ )⁻¹ b    (33)
      ≤ 2d/m,    (34)
where:
• (31) and (32) follow from the definitions of Γ = Γ_h and u(a) = u_h(a) in Algorithm 1;
• (33) follows by letting A = ∑_{a∈A} m ζ(a) a aᵀ and B = ∑_{a∈A} ⌈m max{ζ(a), ν}⌉ a aᵀ and noting that ‖b‖²_{A⁻¹} ≥ ‖b‖²_{B⁻¹} whenever A⁻¹ ⪰ B⁻¹, or equivalently B ⪰ A (i.e., inversion reverses Loewner orders);
• (34) follows from max_{a∈A_h} ‖a‖²_{Γ(ζ_h)⁻¹} ≤ 2d (second step in Algorithm 1) and the definition Γ(ζ) = ∑_{a∈A} ζ(a) a aᵀ.

Combining the results obtained in Lemmas 1 and 2, we find that with probability at least 1 − δ, the following holds for any b ∈ A:
    |⟨b, θ̂ − θ⟩| ≤ 2√((d/m) log(2/δ)) + (C/(mν)) √(2du/m).    (35)
In the following lemma, we bound the total epoch length u = u_h in terms of the quantity m = m_h from Algorithm 1.
Lemma 3. Let m₀ = 4d(log log d + 18), and let ν ∈ (0, 1) be the truncation parameter. Then, the epoch length in Algorithm 1 is bounded as u ≤ 2m(1 + νm₀).

Proof. We have
    u = ∑_{a∈A, ζ(a)≠0} ⌈m max{ζ(a), ν}⌉    (36)
      ≤ ∑_{a∈A, ζ(a)≠0} ( m max{ζ(a), ν} + 1 )    (37)
      ≤ 4d(log log d + 18) + ∑_{a∈A, ζ(a)≠0} m max{ζ(a), ν}    (38)
      ≤ m + ∑_{a∈A, ζ(a)≠0} m max{ζ(a), ν}    (39)
      ≤ 2m ∑_{a∈A, ζ(a)≠0} max{ζ(a), ν}    (40)
      ≤ 2m(1 + νm₀),    (41)
where (38) uses the support bound in (5), (39) uses m₀ = 4d(log log d + 18) and m₀ ≤ m, (40) follows since ∑_{a: ζ(a)≠0} max{ζ(a), ν} ≥ ∑_a ζ(a) = 1 for any ν ∈ (0, 1), and (41) uses max{α, β} ≤ α + β for α, β ≥ 0 together with the support bound.

We are now in a position to state the main lemma of this section, which follows by combining Lemma 3 with (35), and provides corruption-tolerant confidence bounds.
Lemma 4. In the given (arbitrary) epoch under consideration, with probability at least 1 − δ, we have for any a ∈ A that
    |⟨a, θ̂ − θ⟩| ≤ 2√((d/m) log(2/δ)) + (2C/(mν)) √(d(1 + νm₀)).    (42)
In addition, the same holds simultaneously for all a ∈ A with probability at least 1 − kδ.

Note that we have renamed b to a, and the second part follows by a union bound over the k arms.

A.2 Regret Analysis (Known Corruption Budget)
We start by showing that, with high probability, Algorithm 1 never eliminates the optimal arm. Recalling that the optimal arm is a* = arg max_{a∈A} ⟨θ, a⟩, we trivially have a* ∈ A₀. We first show that a* ∈ A₁ with high probability.

At the end of epoch 0, the estimate θ̂₀ is formed. Letting â = arg max_{a∈A₀} ⟨θ̂₀, a⟩ and conditioning on the second part of Lemma 4 holding true, we have
    ⟨θ̂₀, â − a*⟩ ≤ ⟨θ̂₀, â⟩ − ⟨θ, â⟩ + ⟨θ, a*⟩ − ⟨θ̂₀, a*⟩    (43)
      = ⟨θ̂₀ − θ, â⟩ + ⟨θ − θ̂₀, a*⟩    (44)
      ≤ 4√((d/m₀) log(2/δ)) + (4C/(m₀ν)) √(d(1 + νm₀)),    (45)
where (43) holds since a* maximizes ⟨θ, a⟩, while (45) is due to (42).

Notice that (45) is precisely the condition used in the algorithm to retain arms. It follows that the algorithm will not eliminate the optimal arm at the end of the first epoch, with probability at least 1 − kδ. By applying an induction argument with the same steps as above in subsequent epochs, it follows that if H̃ is any almost-sure upper bound on the number of epochs H, then with probability at least 1 − kH̃δ, the algorithm will retain the optimal arm in every epoch. We claim that we can set H̃ = log₂ T. To see this, note that m_h = 2^h m₀, and because each epoch's length u_h ≥ m_h is hence greater than 2^h, the total number of epochs H is deterministically upper bounded by log₂ T.

In the remainder of the proof, we condition on the preceding events that hold with probability at least 1 − kH̃δ (we will later rescale δ by kH̃ for consistency with the statement of Theorem 1). Hence, the optimal arm is retained, and the confidence bounds (42) apply in all epochs.

We proceed by analyzing the regret. Fix h ∈ {1, ..., H − 1}, and let u_h(a) denote the number of times arm a is played in epoch h.
From the definition of regret, we have
    R_T = ∑_{t=1}^T ( ⟨θ, a*⟩ − ⟨θ, A_t⟩ )    (46)
      = ∑_{h=0}^{H−1} ∑_{a∈A_h} u_h(a) ( ⟨θ, a*⟩ − ⟨θ, a⟩ )    (47)
      ≤ 2u₀ + ∑_{h=1}^{H−1} ∑_{a∈A_h} u_h(a) ( ⟨θ, a*⟩ − ⟨θ, a⟩ )    (48)
      ≤ 2u₀ + ∑_{h=1}^{H−1} ∑_{a∈A_h} u_h(a) · 8( √((d/m_{h−1}) log(2/δ)) + (C/(m_{h−1}ν)) √(d(1 + νm₀)) )    (49)
      = 2u₀ + ∑_{h=1}^{H−1} u_h · 8( √((d/m_{h−1}) log(2/δ)) + (C/(m_{h−1}ν)) √(d(1 + νm₀)) )    (50)
      ≤ 2u₀ + ∑_{h=1}^{H−1} 2m_h(1 + νm₀) · 8( √((d/m_{h−1}) log(2/δ)) + (C/(m_{h−1}ν)) √(d(1 + νm₀)) )    (51)
      = 2u₀ + ∑_{h=1}^{H−1} 4m_h · 8( √((d/m_{h−1}) log(2/δ)) + (Cm₀/m_{h−1}) √(2d) )    (52)
      = 2u₀ + ∑_{h=1}^{H−1} ( 64√(d m_{h−1} log(2/δ)) + 64√2 · Cm₀√d )    (53)
      ≤ 2u₀ + 64c √(dT log(2/δ)) + 64√2 · Cm₀√d log₂ T,    (54)
where:
• (47) uses the definition of u_h(a);
• (48) uses the facts that the instantaneous regret is at most 2 and the length of the first epoch is u₀;
• (49) follows since, for any a ∈ A_h,
    ⟨θ, a*⟩ − ⟨θ, a⟩ ≤ ⟨θ̂_{h−1}, a*⟩ − ⟨θ̂_{h−1}, a⟩ + 2( 2√((d/m_{h−1}) log(2/δ)) + (2C/(m_{h−1}ν)) √(d(1 + νm₀)) )    (55)
      ≤ 8( √((d/m_{h−1}) log(2/δ)) + (C/(m_{h−1}ν)) √(d(1 + νm₀)) ),    (56)
where (55) follows by using (42) to upper and lower bound ⟨θ, a*⟩ and ⟨θ, a⟩ respectively, and (56) follows from the condition (7) for retaining arms (with Ĉ_h = C);
• (51) follows from Lemma 3;
• (52) follows by choosing ν = 1/m₀ as per Theorem 1, so that 1 + νm₀ = 2;
• (53) follows by applying m_h = 2m_{h−1} and simplifying;
• (54) holds for some constant c > 0, since a sum of exponentially increasing terms is upper bounded by a constant times the last term (and the longest epoch length is trivially at most T), and we also use H ≤ H̃ = log₂ T for the final term.

We note that the first term in (54) is insignificant compared to the others, since u₀ ≤ 4m₀ by Lemma 3 with ν = 1/m₀. Since the preceding analysis holds with probability at least 1 − kH̃δ, we rescale δ ← δ/(kH̃) to obtain that, with probability at least 1 − δ,
    R_T = Õ( √(dT log(k/δ)) + C d^{3/2} log T ),    (57)
where the O(log H̃) = O(log log T) term and the O(log log d) term from m₀ are absorbed into the Õ(·) notation.

A.3 Unknown Corruption Budget
Recall that for unknown C, the algorithm uses Ĉ_h = min{ √T/(m₀ log₂ T), m₀√d · 2^{H̃−h} }, where H̃ = log₂ T is a deterministic upper bound on the number H of epochs. Recall also that we assume C ≤ √T/(m₀ log₂ T), with the case of larger C discussed following Theorem 2.

A few observations are in order before we proceed:
• We refer to epochs for which Ĉ_h ≥ C as safe, as the adversary cannot eliminate the optimal arm in these epochs. This follows from the analysis of Section A.2, where we showed that if the true C is used in the criterion for retaining arms, then the optimal arm is retained with high probability (thus, the same follows if Ĉ_h ≥ C is used instead).
• Let h' be the first epoch index for which C ≥ Ĉ_h. It follows that C ≥ 2^{H̃−h'} m₀√d, and hence
    log₂( C/(m₀√d) ) ≥ H̃ − h'.    (58)
Thus, the number of remaining epochs H − h' is at most log₂( C/(m₀√d) ).
• In the worst case, the adversary can distribute its budget C among these later epochs with indices h ≥ h' to force arms to be eliminated. That is, with C_h denoting the budget used in epoch h (so that ∑_{h=0}^{H−1} C_h ≤ C), it is possible to have C_h > Ĉ_h in these epochs. We therefore refer to these epochs as unsafe. Note that since C ≤ √T/(m₀ log₂ T), the adversary does not have enough budget to make any "early" epoch, for which min{ √T/(m₀ log₂ T), m₀√d · 2^{H̃−h} } = √T/(m₀ log₂ T), unsafe.

We consider the first unsafe epoch h', and suppose that the adversary eliminates the optimal arm. Hence, it holds that C_{h'} > Ĉ_{h'}. Our goal will be to show that although the optimal arm gets eliminated, an arm a_{h'} that is "almost" as good as the optimal arm is retained.

Because a* got eliminated, the rule (7) for retaining arms implies
    max_{a∈A_{h'}} ⟨θ̂_{h'}, a − a*⟩ > 4√((d/m_{h'}) log(2/δ)) + (4Ĉ_{h'}/(m_{h'}ν)) √(d(1 + νm₀)).    (59)
Let a_{h'} = arg max_{a∈A_{h'}} ⟨θ̂_{h'}, a − a*⟩, and observe that a_{h'} is not eliminated, i.e., a_{h'} ∈ A_{h'+1}. We again condition on the second part of Lemma 4, which holds with probability at least 1 − kH̃δ. This event implies for every a ∈ A_{h'} that
    |⟨a, θ̂_{h'} − θ⟩| ≤ 2√((d/m_{h'}) log(2/δ)) + (2C/(m_{h'}ν)) √(d(1 + νm₀)).    (60)
(As a side note, the final term in (51) contains both increasing and decreasing factors with respect to ν, thus not permitting us to set ν to an arbitrarily small value. The choice ν = 1/m₀ is convenient for the analysis, though we do not claim it to be optimal, nor necessarily the best in practice.)

In particular, (60) applies to both a_{h'} and a*, since both a_{h'}, a* ∈ A_{h'} (recall that the adversary does not have enough budget to remove a* from A_{h'}, according to the definition of h'). Combining the two associated bounds, we obtain
    ⟨θ̂_{h'}, a_{h'} − a*⟩ ≤ ⟨θ, a_{h'} − a*⟩ + 4√((d/m_{h'}) log(2/δ)) + (4C/(m_{h'}ν)) √(d(1 + νm₀)),    (61)
and combining (59) with (61) gives
    ⟨θ, a_{h'}⟩ > ⟨θ, a*⟩ − (4/(m_{h'}ν)) √(d(1 + νm₀)) (C − Ĉ_{h'})    (62)
      ≥ ⟨θ, a*⟩ − (4C/(m_{h'}ν)) √(d(1 + νm₀)).    (63)
By denoting η_{h'} := ⟨θ, a*⟩ − ⟨θ, a_{h'}⟩, we can rewrite (63) as
    η_{h'} < (4C/(m_{h'}ν)) √(d(1 + νm₀)).    (64)
Note that η_{h'} represents the additional regret (due to the elimination of the optimal arm) that our algorithm can incur in epoch h' + 1 for each arm pull.

In epoch h' + 1, the optimal arm a* is already eliminated in the worst case, and again, the adversary can potentially eliminate the best remaining arm a*_{h'+1} = arg max_{a∈A_{h'+1}} ⟨θ, a⟩ by using C_{h'+1} > Ĉ_{h'+1}. By repeating the same arguments as those leading to (64), the additional regret can be written as η_{h'+1} = ⟨θ, a*⟩ − ⟨θ, a_{h'+1}⟩, where a_{h'+1} = arg max_{a∈A_{h'+1}} ⟨θ̂_{h'+1}, a − a*_{h'+1}⟩, and it holds that
    η_{h'+1} < η_{h'} + (4C/(m_{h'+1}ν)) √(d(1 + νm₀))    (65)
      < (4C/ν) √(d(1 + νm₀)) ( 1/m_{h'} + 1/m_{h'+1} ).    (66)
By induction, for each of the remaining epochs, we have
    η_{h'+l} < (4C/ν) √(d(1 + νm₀)) ∑_{i=h'}^{h'+l} 1/m_i    (67)
      ≤ c₁ (C/(m_{h'}ν)) √(d(1 + νm₀)),    (68)
where (68) holds for some constant c₁, since the sum of exponentially shrinking terms is dominated by the first one in the sum. Next, by substituting ν = 1/m₀ and using u_h ≤ 4m_h, we deduce that the total additional regret that we incur is
    ∑_{h=h'}^{H−1} u_{h+1} η_h ≤ ∑_{h=h'}^{H−1} 4m_{h+1} η_h    (69)
      ≤ c₂ (Cm₀√(2d)/m_{h'}) ∑_{h=h'}^{H−1} m_{h+1}    (70)
      ≤ c₃ Cm₀√d · (m_H/m_{h'})    (71)
      = O(C²),    (72)
where (70) applies (68) with ν = 1/m₀, (71) holds for some constant c₃ (depending on c₂) since a sum of exponentially increasing terms is upper bounded by a constant times the last term, and (72) uses
    m_H/m_{h'} = 2^{H−h'} ≤ 2^{H̃−h'} ≤ 2^{log₂(C/(m₀√d))} = C/(m₀√d),    (73)
with the middle inequality using (58).

We now proceed to bound the regret similarly to (46):
    R_T = ∑_{t=1}^T ( ⟨θ, a*⟩ − ⟨θ, A_t⟩ )    (74)
      = ∑_{h=0}^{H−1} ∑_{a∈A_h} u_h(a) ( ⟨θ, a*⟩ − ⟨θ, a⟩ )    (75)
      ≤ 2u₀ + ∑_{h=1}^{H−1} ∑_{a∈A_h} u_h(a) ( ⟨θ, a*⟩ − ⟨θ, a⟩ )    (76)
      ≤ 2u₀ + ∑_{h=1}^{H−1} 4m_h ( 8√((d/m_{h−1}) log(2/δ)) + (4Ĉ_{h−1}m₀/m_{h−1}) √(2d) + (4Cm₀/m_{h−1}) √(2d) ) + ∑_{h=h'}^{H−1} u_{h+1} η_h,    (77)
where the Ĉ_{h−1} term comes from the condition (7) for retaining arms, and the C term comes from the use of (42) (this is in contrast to the known-C case, in which the former term also uses C).

It remains to bound the terms in (77) separately. We have already bounded the last term in (72).
Next, we show that
    ∑_{h=1}^{H−1} 4m_h (4Ĉ_{h−1}m₀/m_{h−1}) √(2d) = 32√2 · m₀√d ∑_{h=1}^{H−1} Ĉ_{h−1}    (78)
      = 32√2 · m₀√d ∑_{h=0}^{H−2} min{ √T/(m₀ log₂ T), m₀√d · 2^{H̃−h} }    (79)
      ≤ 32√2 · m₀√d · log₂ T · √T/(m₀ log₂ T)    (80)
      = O(√(dT)),    (81)
where (78) uses m_h = 2m_{h−1}, (79) substitutes the choice of Ĉ_h, and (80) upper bounds the minimum by the first term and applies H ≤ H̃ = log₂ T.

Similarly, recalling that m₀ = 4d(log log d + 18), we have
    ∑_{h=1}^{H−1} 4m_h (4Cm₀/m_{h−1}) √(2d) ≤ 32√2 · m₀√d · C log₂ T    (82)
      = Õ( d^{3/2} C log T ),    (83)
and
    ∑_{h=1}^{H−1} 4m_h · 8√((d/m_{h−1}) log(2/δ)) ≤ c √(dT log(2/δ))    (84)
      = O( √(dT log(2/δ)) ),    (85)
where (84) holds for some constant c > 0, similarly to (54). Then, again replacing δ ← δ/(kH̃) in the same way as in the known-C case, the term (85) becomes
    Õ( √(dT log(k/δ)) ),    (86)
and the associated probability is now 1 − δ.

Combining (72), (81), (83), and (86), we arrive at the regret bound
    R_T = Õ( √(dT log(k/δ)) + d^{3/2} C log T + C² ).    (87)

Algorithm 2 Contextual Greedy
Require: Time horizon $T$.
  Initialize $\hat{\theta}_1$ arbitrarily
  for $t = 1, 2, \ldots, T$ do
    Receive a set of contexts $\{a_{1,t}, \ldots, a_{k,t}\}$
    Choose arm $I_t = \arg\max_{i \in \{1,\ldots,k\}} \langle \hat{\theta}_t, a_{i,t} \rangle$
    Observe $Y_t = \langle \theta, a_{I_t,t} \rangle + \epsilon_t + c_t(a_{I_t,t})$
    Update $\hat{\theta}_{t+1} \in \arg\min_{\theta'} \sum_{\tau=1}^{t} \big(\langle \theta', a_{I_\tau,\tau} \rangle - Y_\tau\big)^2$  ▷ break ties arbitrarily
  end for
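To make the procedure concrete, the following is a minimal Python sketch of Algorithm 2 (a sketch only: the context distribution, the corruption oracle `corrupt`, and all parameter values are illustrative placeholders, and `np.linalg.lstsq` plays the role of the least-squares update).

```python
import numpy as np

def contextual_greedy(T, k, d, theta, corrupt, rng):
    """Minimal sketch of Algorithm 2: greedy arm selection with a
    least-squares estimate refit on all (attacked) observations."""
    theta_hat = rng.standard_normal(d)          # arbitrary initialization
    X, Y = [], []                               # history of contexts / rewards
    for t in range(T):
        contexts = rng.standard_normal((k, d)) / np.sqrt(d)    # {a_{i,t}}
        i_t = int(np.argmax(contexts @ theta_hat))             # greedy choice
        a = contexts[i_t]
        y = a @ theta + rng.standard_normal() + corrupt(t, a)  # corrupted reward
        X.append(a); Y.append(y)
        # Update: theta_{t+1} minimizes sum_tau (<theta', a_tau> - Y_tau)^2.
        theta_hat, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
    return theta_hat

rng = np.random.default_rng(0)
no_attack = lambda t, a: 0.0
print(contextual_greedy(T=200, k=5, d=3, theta=np.ones(3) / np.sqrt(3),
                        corrupt=no_attack, rng=rng))
```

Note that, exactly as in the pseudocode, the estimate is refit on all previously observed (possibly corrupted) rewards, and no exploration bonus or robustness mechanism is added.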
B Proofs for Section 3 (Contextual Greedy Algorithm)

For reference, a complete description of the greedy algorithm is given in Algorithm 2. Before proving Theorem 5, we introduce some useful auxiliary results. Our proof builds heavily on that of [12], whose setup matches ours but does not consider adversarial attacks (i.e., their setup corresponds to the case that $C = 0$).

Lemma 5 (Lemma 3.1 of [12]). If $\|a_{i,t}\| \le 1$ for all $i, t$, then for any $t_0 < T$, we have
$R_T \le 2t_0 + 2\sum_{t=t_0}^{T}\|\theta - \hat{\theta}_t\|$.  (88)

Proof.
We reproduce the proof for the sake of demonstrating the use of the greedy rule in (14). Recall that the least-squares estimator $\hat{\theta}_t$ is computed using the previously observed attacked rewards. We can bound the regret incurred in the first $t_0$ rounds by the maximum per-round regret value of 2. Then, we consider the regret $r_t$ incurred at time $t \ge t_0$; denoting $i^*_t = \arg\max_{i \in \{1,\ldots,k\}} \langle \theta, a_{i,t} \rangle$, we have
$r_t = \langle \theta, a_{i^*_t,t} \rangle - \langle \theta, a_{I_t,t} \rangle$  (89)
$= \big(\langle \theta, a_{i^*_t,t} \rangle - \langle \hat{\theta}_t, a_{i^*_t,t} \rangle\big) - \big(\langle \theta, a_{I_t,t} \rangle - \langle \hat{\theta}_t, a_{I_t,t} \rangle\big) + \big(\langle \hat{\theta}_t, a_{i^*_t,t} \rangle - \langle \hat{\theta}_t, a_{I_t,t} \rangle\big)$  (90)
$\le \big(\langle \theta, a_{i^*_t,t} \rangle - \langle \hat{\theta}_t, a_{i^*_t,t} \rangle\big) - \big(\langle \theta, a_{I_t,t} \rangle - \langle \hat{\theta}_t, a_{I_t,t} \rangle\big)$  (91)
$\le \big|\langle \theta, a_{i^*_t,t} \rangle - \langle \hat{\theta}_t, a_{i^*_t,t} \rangle\big| + \big|\langle \theta, a_{I_t,t} \rangle - \langle \hat{\theta}_t, a_{I_t,t} \rangle\big|$  (92)
$\le \|\theta - \hat{\theta}_t\|\,\|a_{i^*_t,t}\| + \|\theta - \hat{\theta}_t\|\,\|a_{I_t,t}\|$  (93)
$\le 2\|\theta - \hat{\theta}_t\|$,  (94)
where (91) follows since $I_t$ is selected greedily, and hence $\langle \hat{\theta}_t, a_{i^*_t,t} \rangle - \langle \hat{\theta}_t, a_{I_t,t} \rangle \le 0$.

Lemma 6.
For each round $t$, let $\Gamma_t = \sum_{\tau \le t} a_{I_\tau,\tau} a_{I_\tau,\tau}^T$, and suppose that all contexts satisfy $\|a_{i,t}\| \le 1$, the reward noise is 1-sub-Gaussian, and the attack budget is $C \ge 0$. If $\lambda_{\min}(\Gamma_t) > 0$, then with probability at least $1 - \delta$, it holds that
$\|\theta - \hat{\theta}_t\| \le \frac{\sqrt{dt\log(td/\delta)}}{\lambda_{\min}(\Gamma_t)} + \frac{C}{\lambda_{\min}(\Gamma_t)}$.  (95)

Proof.
Since $\lambda_{\min}(\Gamma_t) > 0$, the matrix $\Gamma_t$ is invertible, and we can use the standard closed-form least-squares solution $\hat{\theta}_t = \Gamma_t^{-1}\sum_{\tau \le t} a_{I_\tau,\tau} Y_\tau$. Decomposing $Y_\tau$ into the sum of the reward, noise, and adversarial corruption (similarly to (21)), we obtain
$\hat{\theta}_t = \theta + \Gamma_t^{-1}\sum_{\tau \le t} a_{I_\tau,\tau}\epsilon_\tau + \Gamma_t^{-1}\sum_{\tau \le t} a_{I_\tau,\tau} c_\tau(a_{I_\tau,\tau})$,  (96)
which implies that
$\|\hat{\theta}_t - \theta\| \le \big\|\Gamma_t^{-1}\sum_{\tau \le t} a_{I_\tau,\tau}\epsilon_\tau\big\| + \big\|\Gamma_t^{-1}\sum_{\tau \le t} a_{I_\tau,\tau} c_\tau(a_{I_\tau,\tau})\big\|$  (97)
$\le \frac{1}{\lambda_{\min}(\Gamma_t)}\left(\big\|\sum_{\tau \le t} a_{I_\tau,\tau}\epsilon_\tau\big\| + \big\|\sum_{\tau \le t} a_{I_\tau,\tau} c_\tau(a_{I_\tau,\tau})\big\|\right)$.  (98)
With probability at least $1 - \delta$, the first term is bounded as [12, Lemma A.1]:
$\big\|\sum_{\tau \le t} a_{I_\tau,\tau}\epsilon_\tau\big\| \le \sqrt{dt\log(td/\delta)}$.  (99)
For the second term in (98), we note the following:
$\big\|\sum_{\tau \le t} a_{I_\tau,\tau} c_\tau(a_{I_\tau,\tau})\big\| \le \sum_{\tau \le t}\big\|a_{I_\tau,\tau} c_\tau(a_{I_\tau,\tau})\big\|$  (100)
$\le \sum_{\tau \le t}|c_\tau(a_{I_\tau,\tau})| \cdot \|a_{I_\tau,\tau}\|$  (101)
$\le \sum_{\tau \le t}|c_\tau(a_{I_\tau,\tau})|$  (102)
$\le C$.  (103)
Combining (98) with (99) and (103) completes the proof.
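The decomposition (96) and the corruption bound (100)-(103) are easy to verify numerically; below is a minimal Python sanity check (synthetic data with hypothetical dimensions; the contexts are normalized so that $\|a\| \le 1$).

```python
import numpy as np

rng = np.random.default_rng(1)
d, t = 4, 50
A = rng.standard_normal((t, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # rows a_{I_tau,tau}, norm <= 1
eps = rng.standard_normal(t)                    # noise realizations
c = np.zeros(t); c[:5] = 0.3                    # corruptions, sum |c_tau| = C
theta = rng.standard_normal(d); theta /= np.linalg.norm(theta)
Y = A @ theta + eps + c

Gamma = A.T @ A                                 # Gamma_t = sum_tau a a^T
theta_hat = np.linalg.solve(Gamma, A.T @ Y)     # closed-form least squares

# Check (96): theta_hat = theta + Gamma^{-1} sum(a*eps) + Gamma^{-1} sum(a*c).
recon = theta + np.linalg.solve(Gamma, A.T @ eps) + np.linalg.solve(Gamma, A.T @ c)
assert np.allclose(theta_hat, recon)

# Check (100)-(103): the corruption term is at most C / lambda_min(Gamma_t).
lam_min = np.linalg.eigvalsh(Gamma)[0]
corr_term = np.linalg.norm(np.linalg.solve(Gamma, A.T @ c))
assert corr_term <= np.sum(np.abs(c)) / lam_min + 1e-9
print(corr_term, np.sum(np.abs(c)) / lam_min)
```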
We are now ready to prove Theorem 5.

Proof of Theorem 5. We follow the steps of the proof of [12, Thm. 3.1]. We start by proving the following counterpart of [12, Corollary 3.1]: Letting
$t_0 = \max\left\{\sqrt{\log\frac{k}{\delta}},\; 32\log\frac{T}{\delta},\; \frac{80\log(2dT/\delta)}{\lambda_0^2}\right\}$,
for every $t \ge t_0$, it holds with probability at least $1 - \delta$ that
$\|\theta - \hat{\theta}_t\| \le \frac{16\sqrt{d\log(2Td/\delta)}}{\lambda_0\sqrt{t}} + \frac{16C}{\lambda_0 t}$.  (104)
To prove this, we use Lemma 6 with $\delta/2$ in place of $\delta$; (104) will then follow once we show that
$\lambda_{\min}(\Gamma_t) \ge \frac{t\lambda_0}{16}$.  (105)
This result is shown in [12, Lemma B.1] (making use of the assumption $t \ge t_0$), and only requires that the random context perturbations are $(r, \lambda_0)$-diverse. Thus, it continues to hold in the corrupted setting with $C > 0$.

Combining (88) and (104), we have with probability at least $1 - \delta$ that
$R_T \le 2t_0 + 2\sum_{t=t_0}^{T}\left(\frac{16\sqrt{d\log(2Td/\delta)}}{\lambda_0\sqrt{t}} + \frac{16C}{\lambda_0 t}\right)$  (106)
$\le 2t_0 + \frac{64}{\lambda_0}\left(\sqrt{dT\log\frac{2Td}{\delta}} + C\log T\right)$,  (107)
since $\sum_{t=1}^{T}\frac{1}{\sqrt{t}} \le 2\sqrt{T}$ and $\sum_{t=1}^{T}\frac{1}{t} \le 2\log T$ (with the latter assuming $T \ge 2$). Substituting the definition of $t_0$, it follows that with probability at least $1 - \delta$, we have
$R_T = O\left(\frac{1}{\lambda_0}\left(\sqrt{dT\log\frac{Td}{\delta}} + C\log T\right) + \frac{1}{\lambda_0^2}\log\frac{dT}{\delta} + \sqrt{\log\frac{k}{\delta}}\right)$.  (108)
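The two elementary sum bounds used to obtain (107) can be checked numerically; a quick Python sanity check (with an illustrative horizon only):

```python
import math

T = 10_000
s_sqrt = sum(1 / math.sqrt(t) for t in range(1, T + 1))
s_harm = sum(1 / t for t in range(1, T + 1))
assert s_sqrt <= 2 * math.sqrt(T)   # sum_{t<=T} 1/sqrt(t) <= 2*sqrt(T)
assert s_harm <= 2 * math.log(T)    # sum_{t<=T} 1/t <= 2*log(T) for T >= 2
print(s_sqrt, s_harm)
```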
We now consider the special case of Gaussian perturbations, i.e., each $\xi_{i,t}$ is drawn independently from $N(0, \eta^2 I)$ for some $\eta > 0$. We make use of the above results, as well as those from [12, Section 3.2], to show that Algorithm 2 has sublinear regret under small perturbations. This is formally stated in the following corollary.

Corollary 1. Assume that the context perturbations $\xi_{i,t}$ are drawn independently from $N(0, \eta^2 I)$ for all $i, t$, the reward noise is 1-sub-Gaussian, and the attack budget of the adversary is $C \ge 0$. Then, for a fixed number of arms $k$ and $\eta \le O\big((\sqrt{d\log(Tkd/\delta)})^{-1}\big)$, with probability at least $1 - \delta$, the greedy algorithm (Algorithm 2) has regret bounded by
$R_T = O\left(\frac{\sqrt{Td}\,(\log(dT/\delta))^{3/2}}{\eta^2} + \frac{C(\log T)^2}{\eta^2}\right)$.  (109)

Proof. The proof strategy is to invoke Theorem 5, and in particular, since $k$ is assumed to be fixed, its variant stated in (16).

In Theorem 5, it is assumed that $\|a_{i,t}\| \le 1$. However, the value of 1 on the right-hand side was chosen only for convenience, and (as noted in [12]) the same result holds when $\|a_{i,t}\| \le R'$ for any given $R' \ge 1$ satisfying $R' = O(1)$. This change only affects the constants, in particular introducing an $(R')^{3/2}$ term [12]; this still behaves as $O(1)$ since we focus on the case that $R' = O(1)$.

To invoke such a variant of Theorem 5, we need to condition on an event that ensures $\|a_{i,t}\| \le R'$ for some constant $R' > 0$ for every $i, t$, and the context perturbations need to be $(r, 1/T^2)$-bounded and $(r, \lambda_0)$-diverse for some $\lambda_0 > 0$ and $r \le R'$. Next, we show that these conditions hold for the case of Gaussian context perturbations.

Towards that end, we start by outlining some results from [12]. First, [12, Lemma 3.5] states that when $\hat{R} \ge \eta\sqrt{2\log(kdT/\delta)}$, we have
$P\big[\,|\xi_{i,t}[j]| \le \hat{R},\ \forall i \in \{1,\ldots,k\},\, t \le T,\, j \in \{1,\ldots,d\}\,\big] \ge 1 - \delta/2$.  (110)
In what follows, we set $\hat{R} = 2\eta\sqrt{\log(kdT/\delta)}$, and we condition on the event in (110) holding true. Then, [12, Lemma 3.6] states that when $\|\mu_i\| \le 1$, we have
$\|a_{i,t}\| \le 1 + \sqrt{d}\,\hat{R} := R'$ for all $i, t$,  (111)
and the perturbations are $(r, 1/T^2)$-bounded for $r \ge \eta\sqrt{2\log T}$.

The perturbation distribution conditioned on the event in (110) holding true is a truncated Gaussian supported on $[-\hat{R}, \hat{R}]$, and satisfies $(r, \lambda_0)$-diversity when [12]
$\lambda_0 = \Omega\left(\frac{\eta^4}{r^2}\right) = \Omega\left(\frac{\eta^2}{\log T}\right)$.  (112)
Finally, from the definitions of $R'$ in (111) and $\hat{R} = 2\eta\sqrt{\log(kdT/\delta)}$, we have
$R' = 1 + \sqrt{d}\,\hat{R} \le 2\max\{1, \sqrt{d}\,\hat{R}\} = 2\max\big\{1, 2\eta\sqrt{d\log(dkT/\delta)}\big\}$,  (113)
and we see that to have $R' = O(1)$, it suffices to have $\eta \le O\big((\sqrt{d\log(Tkd/\delta)})^{-1}\big)$.

We now apply the above-mentioned variant of Theorem 5 with parameter $R' = O(1)$. By the union bound, the events in Theorem 5 and in (110) simultaneously hold with probability at least $1 - \delta$. By substituting the bound on $\lambda_0$ from (112) into (16), we arrive at
$R_T = O\left(\frac{\sqrt{Td}\,(\log(dT/\delta))^{3/2}}{\eta^2} + \frac{C(\log T)^2}{\eta^2}\right)$.  (114)
C Proofs of Lower Bounds

We prove Theorems 3 and 4 in Sections C.3 and C.1, respectively. In Sections C.2 and C.4, we prove two $\Omega(C)$ lower bounds using standard arguments, e.g., see [22].

C.1 Lower Bound for d = k = 2 (Unknown C)

Here we show that for $d = 2$ dimensions and $k = 2$ arms, for any algorithm that guarantees $R_T \le \bar{R}_T^{(0)}$ (say, with probability $1 - \delta$) for some uncorrupted regret bound $\bar{R}_T^{(0)} \le \frac{T}{2}$ when $C = 0$, there exists an instance in which $R_T = \Omega(T)$ (again with probability $1 - \delta$) when the attack budget is $C = 2\bar{R}_T^{(0)}$. We show that this is true even with no noise, i.e., $\epsilon_t = 0$ for all $t$.

To prove this, consider an instance with feature vectors $a_1 = [1, 0]^T$ and $a_2 = [0, 1]^T$, and parameter vector $\theta = [0, -1]^T$. In this case, pulling $a_1$ incurs zero regret, and pulling $a_2$ incurs regret 1. Hence, by the assumption $R_T \le \bar{R}_T^{(0)}$, we see that $a_2$ is pulled at most $\bar{R}_T^{(0)}$ times when $C = 0$.

Now consider a different instance, with the same feature vectors $a_1 = [1, 0]^T$ and $a_2 = [0, 1]^T$, but with $\theta = [0, 1]^T$. In this case, $a_2$ incurs zero regret, and $a_1$ incurs regret 1. Suppose that $C = 2\bar{R}_T^{(0)}$, and consider an adversary that pushes the reward of $a_2$ from 1 down to $-1$ whenever it is pulled, at a cost of $c_t(A_t) = 2$. Since $C = 2\bar{R}_T^{(0)}$, the adversary can afford to do this $\bar{R}_T^{(0)}$ times.

However, as long as the adversary is corrupting, the observed rewards are exactly the same as in the first instance above, in which we established that $a_2$ is pulled at most $\bar{R}_T^{(0)}$ times. Since we assume that $\bar{R}_T^{(0)} \le \frac{T}{2}$, it follows that $a_1$ is pulled at least $T - \bar{R}_T^{(0)} \ge \frac{T}{2}$ times, leading to $\Omega(T)$ regret.

C.2 Lower Bound for d = 1 and k = 2 (Known C)

In the case that $C$ is known, we can obtain an $\Omega(C)$ lower bound using a simple argument from [22, Sec. 5], which we reproduce here for completeness. Consider an instance with feature scalars $a_1 = 1$ and $a_2 = -1$. Clearly, arm 1 is better when $\theta = 1$, but arm 2 is better when $\theta = -1$, and in both cases, the worse arm incurs regret 2. Consider the case that there is no random noise, and suppose that the adversary shifts every reward to zero until its budget is depleted, i.e., for the first $\lfloor C \rfloor$ rounds. During these rounds, the learner must pull some arm at least $\lfloor C \rfloor / 2$ times, and for one of the two values of $\theta \in \{-1, 1\}$, a cumulative regret of at least $\lfloor C \rfloor$ is incurred. Since the two $\theta$ values are indistinguishable during these rounds, we conclude that $\Omega(C)$ regret is unavoidable.

C.3 Lower Bound for d = k > 2 (Known C)

Here we generalize the argument of the previous subsection to deduce a stronger lower bound with a joint dependence on $d$ and $C$. We consider the case that $d = k$, with $a_i$ being the $i$-th standard basis vector. Hence, for any $\theta \in \mathbb{R}^d$, we have $\langle a_i, \theta \rangle = \theta_i$. We again consider the noiseless setting.

Consider $d$ different bandit instances, the $i$-th of which has $\theta = a_i$. Hence, in the $i$-th instance, arm $i$ has reward 1, and the rest have reward zero. In addition, consider an adversary that pushes the reward of the $i$-th arm down to zero whenever it is pulled; this can again be done $\lfloor C \rfloor$ times. Roughly speaking, the learner can do no better than pulling each arm in succession, incurring $\Omega(Cd)$ regret.

To make this more precise, note that in the $i$-th instance, the adversary only runs out of its budget after the $i$-th arm is pulled $\lfloor C \rfloor$ times.
However, after $\big\lfloor \frac{\lfloor C \rfloor d}{2} \big\rfloor$ rounds, there must remain at least $d/2$ arms that have not been pulled $\lfloor C \rfloor$ times. When $\theta = a_i$ for any $i$ corresponding to one of these $d/2$ arms, the regret incurred is $\Omega(Cd)$.

C.4 Lower Bound for Diverse Contexts (Known C)

Finally, we argue that the approach of Section C.2 gives an $\Omega(C)$ lower bound on $R_T$ even under the assumption of diverse contexts. Recall that in (13) we consider fixed center points $\mu_1, \ldots, \mu_k$ and assume that these are perturbed by $\xi \sim \mathcal{D}$. The following argument holds under any such setup satisfying the mild assumption that, in each round, a constant positive fraction (e.g., 0.1) of the arms have regret lower bounded by some positive constant (e.g., 0.1).

We again consider an adversary that pushes the reward to zero (while leaving the random reward noise unchanged) until the budget is exhausted. Hence, for the first $\lfloor C \rfloor$ rounds, the learner learns nothing about $\theta$. The preceding assumption rules out pathological cases such as all arms being identical, and we conclude that constant regret is incurred per round (with constant probability), for $\Omega(C)$ regret in total.
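The indistinguishability argument of Section C.2 can be illustrated with a short simulation; the following minimal Python sketch (with a hypothetical, fixed action sequence standing in for an arbitrary learner) confirms that the observations are identical under $\theta = 1$ and $\theta = -1$ while the adversary's budget lasts.

```python
def run(theta, actions, C):
    """Noiseless instance with feature scalars a1 = 1, a2 = -1; the adversary
    shifts every reward to zero until its budget C is depleted."""
    budget, regret, observations = C, 0.0, []
    for a in actions:                      # a is +1 (arm 1) or -1 (arm 2)
        reward = theta * a
        if budget >= abs(reward):          # corrupt: push the reward to zero
            budget -= abs(reward)
            reward = 0.0
        observations.append(reward)
        regret += 1.0 - theta * a          # the better arm has reward 1
    return observations, regret

C = 100
actions = [(-1) ** t for t in range(C)]    # any fixed action sequence will do
obs_plus, reg_plus = run(+1.0, actions, C)
obs_minus, reg_minus = run(-1.0, actions, C)
assert obs_plus == obs_minus               # the two instances look identical
assert max(reg_plus, reg_minus) >= C / 2   # one of them incurs Omega(C) regret
```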
D Attack Methods

Recall that we consider the model $Y_t = \langle \theta, A_t \rangle + \epsilon_t + c_t(A_t)$, where $\epsilon_t$ is random noise and $c_t(A_t)$ is the adversarial corruption. In our experiments, we consider four attacks, summarized as follows.

Garcelon et al. attack.
In an attack proposed in [8], the attacker selects a target arm $a_{\rm target} \in \mathcal{A}$ that it wants to trick the learner into thinking is optimal, and operates as follows: (i) if $a_{\rm target}$ is pulled, leave the reward unchanged; (ii) if any other arm is pulled, change the reward so that $Y_t = \tilde{\epsilon}_t$, with $\tilde{\epsilon}_t$ being artificial random noise generated by the adversary.

Since our adversary is assumed to know $\langle \theta, A_t \rangle$ and $\epsilon_t$ individually, we can alter this attack to remove the need for artificial noise; instead, in case (ii) above, the adversary shifts $\langle \theta, A_t \rangle$ down to zero, while leaving $\epsilon_t$ unchanged. However, if $a_{\rm target}$ has a negative reward, then this attack will not make it appear optimal, so we also allow the adversary to shift down to a more generic value $v_{\rm target}$. In our experiments, we set $v_{\rm target} = -1$, which is the smallest possible reward.

Of course, the adversary will eventually run out of budget, at which point the attack stops. The same applies to all of the alternative attacks below.

Oracle MAB attack.
Since the adversary has full knowledge of the instance, we can consider the oracle attack proposed in [11], in which, for some target arm $a_{\rm target}$, the following is performed: (i) if $a_{\rm target}$ is pulled, leave the reward unchanged; (ii) if any other arm $a$ is pulled, shift the reward down by $\max\{0, \langle \theta, a \rangle - \langle \theta, a_{\rm target} \rangle + \epsilon_0\}$ for some $\epsilon_0 > 0$. This means that every other arm looks $\epsilon_0$-suboptimal compared to $a_{\rm target}$.

Simple θ-based attack. In the contextual setting, fixing a target arm $a_{\rm target}$ by index (e.g., the first) may not be the most suitable choice, since the contexts change every round. In the setup of Section 3, we are primarily interested in the case that the perturbations are small, which mitigates this issue. Nevertheless, when the perturbation variance $\eta^2$ becomes large enough, it is likely more effective for the attack to use a different strategy to choose $a_{\rm target}$.

Thus, we propose an attack that tries to make the arm most aligned with some vector $\theta_{\rm target}$ appear best. To do this, we simply use the above variation of the Garcelon et al. attack, but instead of letting $a_{\rm target}$ correspond to a fixed arm index (e.g., the first arm), the attacker updates $a_{\rm target}$ every round, choosing $a_{\rm target} = \arg\max_{a \in \mathcal{A}_t} \langle \theta_{\rm target}, a \rangle$.

Flip-θ attack. This attack simply flips the reward from $\langle \theta, a \rangle$ to $\langle -\theta, a \rangle$. This attack can be considered highly aggressive, potentially using the budget quickly to make the rewards appear to be the complete opposite of what they really are.
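For concreteness, the four attacks can be written as corruption functions $c_t(\cdot)$; the following minimal Python sketch uses hypothetical function signatures and omits the budget accounting (in the experiments, each attack stops once the budget $C$ is exhausted).

```python
import numpy as np

def garcelon_attack(theta, a, a_target, v_target=-1.0):
    """Leave a_target untouched; push any other arm's mean down to v_target."""
    if np.array_equal(a, a_target):
        return 0.0
    return v_target - float(theta @ a)

def oracle_mab_attack(theta, a, a_target, eps0=0.1):
    """Make every other arm look eps0-suboptimal compared to a_target."""
    if np.array_equal(a, a_target):
        return 0.0
    return -max(0.0, float(theta @ a) - float(theta @ a_target) + eps0)

def simple_theta_attack(theta, a, contexts, theta_target, v_target=-1.0):
    """Garcelon-style attack with a_target re-chosen every round."""
    a_target = contexts[int(np.argmax(contexts @ theta_target))]
    return garcelon_attack(theta, a, a_target, v_target)

def flip_theta_attack(theta, a):
    """Flip the mean reward from <theta, a> to <-theta, a>."""
    return -2.0 * float(theta @ a)
```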
E Additional Experimental Details and Results

E.1 Additional Details

Details of the contextual experiment.
The synthetic experimental setup of Section 4.2 is detailed as follows. We generate $k = 25$ "center points" $\mu_1, \ldots, \mu_k$ (one per arm), each having entries drawn i.i.d. from the uniform distribution on $\big[-\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}\big]$. The contexts $\{a_{i,t}\}_{i=1}^{k}$ at each time $t$ are then created by letting $a_{i,t} = \mu_i + \xi_{i,t}$, where the $\xi_{i,t}$ are i.i.d. $N\big(0, \frac{\eta^2}{d} I_d\big)$ for some variance $\eta^2 > 0$. We fix the true parameter vector as $\theta = \big(\frac{1}{\sqrt{d}}, \ldots, \frac{1}{\sqrt{d}}\big)$, and we assume that the observations are subject to additive $N(0, \sigma^2)$ noise.
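A minimal Python sketch of this data-generating process is given below (the values of $d$, $T$, $\eta$, and $\sigma$ are illustrative placeholders rather than the settings used in the experiments).

```python
import numpy as np

def make_contextual_instance(k=25, d=10, T=1000, eta=0.1, sigma=0.1, rng=None):
    """Synthetic setup of Section 4.2: uniform center points, Gaussian context
    perturbations, and the fixed parameter vector theta = (1/sqrt(d), ...)."""
    rng = rng or np.random.default_rng(0)
    centers = rng.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), size=(k, d))
    theta = np.full(d, 1 / np.sqrt(d))
    # Contexts for every round: a_{i,t} = mu_i + xi_{i,t}, xi ~ N(0, (eta^2/d) I_d).
    contexts = centers[None, :, :] + rng.normal(0.0, eta / np.sqrt(d), size=(T, k, d))
    noise = rng.normal(0.0, sigma, size=T)      # additive observation noise
    return centers, theta, contexts, noise
```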
Details of the MovieLens experiment. The MovieLens experimental setup of Section 4.2 follows [5], and is detailed as follows. The data for 1682 movies and 943 users takes the form of an incomplete matrix $R$ of ratings, where $R_{i,j}$ is the rating of movie $i$ given by user $j$. To impute the missing rating values, we apply non-negative matrix factorization with $d = 15$ latent factors. This produces a feature vector for each movie $M_i \in \mathbb{R}^d$ and each user $U_j \in \mathbb{R}^d$. We use a portion of the user data for training, in which we fit a Gaussian distribution $N(U \mid \mu, \Sigma)$. The reward for movie $i$ is given by $\langle M_i, U_j \rangle$ for some fixed user $j$.
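A minimal sketch of the feature construction is shown below, using scikit-learn's NMF; the file path, the treatment of missing ratings as zeros, and the choice of user are placeholders/assumptions rather than details taken from the paper.

```python
import numpy as np
from sklearn.decomposition import NMF

# R: (num_movies x num_users) rating matrix, missing entries stored as zeros.
R = np.load("movielens_ratings.npy")          # placeholder path

# Factorize with d = 15 latent factors: R ~ M @ U.T.
model = NMF(n_components=15, init="nndsvda", max_iter=500)
M = model.fit_transform(R)                    # movie features M_i in R^15
U = model.components_.T                       # user features U_j in R^15

# Fit a Gaussian to (a subset of) the user features for later sampling.
mu, Sigma = U.mean(axis=0), np.cov(U, rowvar=False)

# The reward of movie i for a fixed user j is the inner product <M_i, U_j>.
j = 0
rewards = M @ U[j]
```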
Changes to the robust PE algorithm. The parameters in Algorithm 1 (e.g., $\hat{C}_h$ and $m$) were chosen for convenience in a theoretical analysis that ignores constant factors, but we found that alternative choices are preferable in practice. Accordingly, we run the algorithm with the following modifications: (i) $m = d$; (ii) $\hat{C}_h = \min\{\sqrt{T}, 2^{\log_2 T - h}\}$; (iii) the right-hand side of (7) is replaced by $\sqrt{\frac{d}{m_h}\log\frac{1}{\delta}} + \frac{\hat{C}_h}{m_h}\sqrt{d}$; (iv) we fix $\delta$ and $\nu$ to small constant values. Thus, the key changes are removing the $m$ terms from $\hat{C}_h$, and removing the division by $\nu$ in the elimination condition.
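For reference, the modified budget schedule in (ii) can be computed as follows (a minimal sketch; the horizon value is illustrative):

```python
import math

def C_hat(h, T):
    """Practical per-epoch budget guess: min{sqrt(T), 2^(log2(T) - h)},
    i.e., min{sqrt(T), T / 2^h}."""
    return min(math.sqrt(T), 2.0 ** (math.log2(T) - h))

T = 2 ** 14
print([C_hat(h, T) for h in range(10)])
# The guess is capped at sqrt(T) for the early epochs, and then halves
# with each subsequent epoch, mirroring the theoretical schedule.
```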
[Figure 4 panels: six plots of regret versus adversarial budget C, each with one curve per attack (Garcelon et al., Oracle MAB, Simple-θ, Flip-θ); see the caption below.]
Figure 4: Contextual synthetic experiments: regret as a function of C with Greedy (Left), LinUCB (Middle), and Thompson Sampling (Right), under two perturbation levels η (smaller value on Top, larger value on Bottom).

[Figure 5 panels: five plots of regret versus round, with curves for Robust PE - Top-3, Robust PE - Top-5, LinUCB - Flip-θ, Thompson - Flip-θ, Non-robust PE - Top-5, Robust PE - Flip-θ, and Robust PE - No attack; see the caption below.]
Figure 5: Non-contextual synthetic experiment with 40 trials: (Top-Left) Average regret as a function of time; (Remaining) Worst 4 runs among 40.
E.2 Additional Results

Contextual setting: Budget vs. Regret.
In Figure 4, we provide plots analogous to Figure 1 (Left) for all three algorithms (Greedy, LinUCB, and Thompson sampling) and two choices of the perturbation level η. In all cases, we see a linear trend similar to that of Figure 1.

Non-contextual setting with 40 trials and unknown C. In Figure 5, we provide plots analogous to Figure 3, but with 40 trials instead of 10, and showing the worst 4 out of 40 curves instead of the worst 2 out of 10. We observe similar findings to those discussed in Section 4.3.
Non-contextual setting with known C. In Figure 6, we provide plots analogous to Figure 3 when Algorithm 1 is used with known C, with the following modifications similar to the unknown-C case: (i) $m = d$; (ii) the right-hand side of (7) is replaced by $\sqrt{\frac{d}{m_h}\log\frac{1}{\delta}} + \frac{C}{m_h}\sqrt{d}$; (iii) we fix $\delta$ and $\nu$ to the same constant values as before. The attack is chosen to start during the same epoch as in the unknown-C case. From the regret plots, we observe broadly similar behavior to the unknown-C case, with the exception that the regret is considerably lower when there is no attack (i.e., C = 0). This is to be expected, since in the known-C case, knowing that C = 0 means that one can confidently eliminate arms much faster.

[Figure 6 panels: three plots of regret versus round, with the same curves as in Figure 5; see the caption below.]
Figure 6: Non-contextual synthetic experiment with 10 trials and known C: (Left) Average regret as a function of time; (Middle) Worst run among 10; (Right) Second-worst run among 10.

F Additional Related Works
Best of both worlds.
The stochastic setting often leads to significantly smaller regret bounds, but at the expense of potentially restrictive modeling assumptions. Algorithms attaining the best of both worlds (stochastic and adversarial) [6, 24, 2] are also related to the corruption-tolerant setting, but consider an unbounded adversary (and a different regret notion) in the adversarial case. Hence, a key distinction is that the adversary's budget is "all-or-nothing" rather than being smoothly parametrized by C. See [22] for further discussion.

Model mismatch and misspecification.
A distinct but related direction in the linear bandit literature has been to address robustness to model mismatch and misspecification [9, 15] (see also [16, Sec. 24.4] and [28]). In [9], the deviations for each arm are fixed and the same every round, whereas in [15] the deviations may depend on a context vector but not on the learner's action. Hence, both can be viewed as considering a weaker adversary than the present paper. On the other hand, this can lead to stronger regret guarantees, such as paying essentially no penalty under broad misspecification scenarios [15].
Fractional corruption model.
In [13], a different corruption model was considered, both in the case of independent arms and linear rewards. The constraint on the adversary therein is that at any time t, at most an ηt fraction of the observed rewards have been corrupted by the adversary. This is distinct from (and complementary to) the setting that we consider, in which the adversary can choose where to concentrate its budget. The algorithms and bounds of [13] are not applicable in our setting.

Phased elimination.
Similar phased elimination algorithms to Algorithm 1 (but without robustness to adversarial attacks) have been previously considered in various settings, including the standard setting [16, Ch. 22], the misspecified setting [17], and graph bandits [27]. Among these, our algorithm is most similar to that of [17], with the main differences being the estimator of θ