Action-Manipulation Attacks Against Stochastic Bandits: Attacks and Defense
Guanlin Liu and Lifeng Lai

G. Liu and L. Lai are with the Department of Electrical and Computer Engineering, University of California, Davis, CA 95616. Email: {glnliu, lflai}@ucdavis.edu. The work of G. Liu and L. Lai was supported by the National Science Foundation under Grants CCF-1717943, ECCS-1711468, CNS-1824553 and CCF-1908258. This paper will be presented in part at the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing [1].
Abstract—Due to the broad range of applications of the stochastic multi-armed bandit model, understanding the effects of adversarial attacks and designing bandit algorithms robust to such attacks are essential for the safe application of this model. In this paper, we introduce a new class of attacks named action-manipulation attacks. In these attacks, an adversary can change the action signal selected by the user. We show that, without knowledge of the mean rewards of the arms, our proposed attack can manipulate the Upper Confidence Bound (UCB) algorithm, a widely used bandit algorithm, into pulling a target arm very frequently by spending only a logarithmic cost. To defend against this class of attacks, we introduce a novel algorithm that is robust to action-manipulation attacks when an upper bound on the total attack cost is given. We prove that our algorithm has a pseudo-regret upper bounded by $O(\max\{\log T, A\})$, where $T$ is the total number of rounds and $A$ is the upper bound on the total attack cost.

Index Terms—Stochastic bandits, action-manipulation attack, UCB.
I. INTRODUCTION
In order to develop trustworthy machine learning systems, understanding adversarial attacks on learning systems and correspondingly building robust defense mechanisms have attracted significant recent research interest [2]–[9]. In this paper, we focus on multi-armed bandits (MABs), a simple but very powerful framework of online learning that makes decisions over time under uncertainty. The MAB problem is widely investigated in machine learning and signal processing [10]–[16] and has many applications in a variety of scenarios, such as displaying advertisements [17], article recommendation [18], cognitive radios [19], [20], and search engines [21], to name a few.

Of particular relevance to our work is a line of interesting recent work on online reward-manipulation attacks on stochastic MABs [22]–[25]. In a reward-manipulation attack, there is an adversary who can change the reward signal from the environment, and hence the reward signal received by the user is not the true reward signal from the environment. In particular, [22] proposes an interesting attack strategy that can force a user, who runs either the $\epsilon$-Greedy or the Upper Confidence Bound (UCB) algorithm, to select a target arm while spending only effort that grows in logarithmic order. [23] proposes an optimization-based framework for offline reward-manipulation attacks.
Furthermore, it studies a form of online attack strategy that is effective in attacking any bandit algorithm whose regret scales in logarithmic order, without knowing which particular algorithm the user is using. [25] considers an attack model in which an adversary attacks with a certain probability at each round but whose attack value can be arbitrary and unbounded, and proposes algorithms that are robust to these types of attacks. [24] considers how to defend against reward-manipulation attacks, a problem complementary to [22], [23]. In particular, [24] introduces a bandit algorithm that is robust to reward-manipulation attacks under a certain attack cost, by using a multi-layer approach. [26] introduces another adversarial setting, in which each arm is able to manipulate its own reward and seeks to maximize its own expected pull count. Under this setting, [26] analyzes the robustness of Thompson Sampling, UCB, and $\epsilon$-greedy against the adversary, and proves that all three algorithms achieve a regret upper bound that increases over rounds in a logarithmic order or increases with the attack cost in a linear order. This line of reward-manipulation attacks has also recently been investigated for contextual bandits in [27], which develops an attack algorithm that can force the bandit algorithm to pull a target arm for a target context vector by slightly manipulating rewards in the data.

In this paper, we introduce a new class of attacks on MABs named action-manipulation attacks. In an action-manipulation attack, an attacker, sitting between the environment and the user, can change the action selected by the user to another action. The user then receives a reward from the environment corresponding to the action chosen by the attacker. Compared with the reward-manipulation attacks discussed above, the action-manipulation attack is more difficult to carry out. In particular, as the action-manipulation attack only changes the action, it can influence but does not directly control the reward signal, because the reward signal is a random variable drawn from a distribution that depends on the action chosen by the attacker. This is in contrast to reward-manipulation attacks, where an attacker has direct control and can change the reward signal to any value.

In order to demonstrate the significant security threat of action-manipulation attacks to stochastic bandits, we propose an action-manipulation attack strategy against the widely used UCB algorithm. The proposed attack strategy aims to force the user to pull a target arm chosen by the attacker frequently. We assume that the attacker does not know the true mean reward of each arm. This assumption is necessary, as otherwise the attacker could perform the attack trivially. To see this, note that with knowledge of the mean rewards, the attacker knows which arm has the lowest mean reward and can perform the following oracle attack: whenever the user pulls a non-target arm, the attacker changes the arm to the worst arm.
This oracle attack makes all non-target arms have expected rewards less than that of the target arm, provided the target arm selected by the attacker is not the worst arm. In addition, under this attack, all sublinear-regret bandit algorithms will pull the target arm $T - o(T)$ times. However, the oracle attack is not practical. The goal of our work is to develop an attack strategy that achieves performance similar to that of the oracle attack strategy without requiring knowledge of the true mean rewards. When the user pulls a non-target arm, the attacker may decide to attack by changing the action to the arm that is plausibly the worst. As the attacker does not know the true values of the arms, our attack scheme relies on lower confidence bounds (LCBs) of the value of each arm in making attack decisions. Correspondingly, we name our attack scheme the LCB attack strategy. Our analysis shows that, if the target arm selected by the attacker is not the worst arm, the LCB attack strategy can successfully manipulate the user into selecting the target arm almost all the time with only a logarithmic cost. In particular, the LCB attack strategy can force the user to pull the target arm $T - O(\log(T))$ times over $T$ rounds, with a total attack cost of only $O(\log(T))$. On the other hand, we also show that, if the target arm is the worst arm and the attacker can only incur logarithmic costs, no attack algorithm can force the user to pull the worst arm more than $T - O(T^{\alpha})$ times.

Motivated by the analysis of action-manipulation attacks and the significant security threat they pose to MABs, we then design a bandit algorithm that can defend against action-manipulation attacks while still achieving a small regret. The main idea of the proposed algorithm is to bound the maximum amount of offset, in terms of the user's estimates of the mean rewards, that can be introduced by action-manipulation attacks. We then use this bound on the maximum offset to properly modify the UCB algorithm and build specially designed high-probability upper bounds on the mean rewards, which are used to decide which arm to pull. We name our bandit algorithm maximum offset upper confidence bound (MOUCB). In particular, our algorithm first pulls every arm a certain number of times and then pulls the arm whose modified upper confidence bound is largest. Furthermore, we prove that the MOUCB bandit algorithm has a pseudo-regret upper bounded by $O(\max\{\log T, A\})$, where $T$ is the total number of rounds and $A$ is an upper bound on the total attack cost. In particular, if $A$ scales as $\log(T)$, MOUCB achieves a logarithmic pseudo-regret, which is the same as the regret order of the UCB algorithm without attacks.

The remainder of the paper is organized as follows. In Section II, we describe the model. In Section III, we describe the LCB attack strategy and analyze its cumulative attack cost. In Section IV, we propose MOUCB and analyze its regret. In Section V, we provide numerical examples to validate the theoretical analysis. Finally, we offer several concluding remarks in Section VI. The proofs are collected in the Appendix.
II. MODEL

In this section, we introduce our model. We consider the standard stochastic multi-armed bandit setting. The environment consists of $K$ arms, with each arm corresponding to a fixed but unknown reward distribution. The bandit algorithm, whose operator is also called the "user" in this paper, proceeds in discrete time $t = 1, 2, \ldots, T$, in which $T$ is the total number of rounds. At each round $t$, the user pulls an arm (or action) $I_t \in \{1, \ldots, K\}$ and receives a random reward $r_t$ drawn from the reward distribution of arm $I_t$. Denote $\tau_i(t) := \{s : s \le t, I_s = i\}$ as the set of rounds up to $t$ where the user chooses arm $i$, $N_i(t) := |\tau_i(t)|$ as the number of rounds that arm $i$ was pulled by the user up to time $t$, and
$$\hat{\mu}_i(t) := \frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} r_s \quad (1)$$
as the empirical mean reward of arm $i$. The pseudo-regret $\bar{R}(T)$ is defined as
$$\bar{R}(T) = T \max_{i \in [K]} \mu_i - \mathbb{E}\left[\sum_{t=1}^{T} r_t\right]. \quad (2)$$
The goal of the user is to minimize $\bar{R}(T)$.

In this paper, we introduce a novel adversary setting, in which the attacker sits between the user and the environment. The attacker can monitor the actions of the user and the reward signals from the environment. Furthermore, the attacker can introduce action-manipulation attacks on stochastic bandits. In particular, at each round $t$, after the user chooses an arm $I_t$, the attacker can manipulate the user's action by changing $I_t$ to another arm $\tilde{I}_t \in \{1, \ldots, K\}$. If the attacker decides not to attack, $\tilde{I}_t = I_t$. The environment generates a random reward $r_t$ from the reward distribution of the post-attack arm $\tilde{I}_t$. Then the user and the attacker receive the reward $r_t$ from the environment.

Fig. 1. Action-manipulation attack model.
Without loss of generality and for notational convenience, we assume arm $K$ is the "attack target" arm, or target arm. The attacker's goal is to manipulate the user into pulling the target arm very frequently while making attacks as rarely as possible. Define the set of rounds in which the attacker decides to attack as $\mathcal{C} := \{t : t \le T, \tilde{I}_t \ne I_t\}$. The cumulative attack cost is the total number of rounds where the attacker decides to attack, i.e., $|\mathcal{C}|$.

In this paper, we assume that the reward distribution of arm $i$ is a $\sigma$-sub-Gaussian distribution with mean $\mu_i$. Denote the true reward vector as $\mu = [\mu_1, \cdots, \mu_K]$. Neither the user nor the attacker knows $\mu$, but $\sigma$ is known to both the user and the attacker. Define the difference of the mean values of arms $i$ and $j$ as $\Delta_{i,j} = \mu_i - \mu_j$. Furthermore, we refer to the best arm as $i^O = \arg\max_i \mu_i$ and the worst arm as $i^W = \arg\min_i \mu_i$.

Note that the assumption that the attacker does not know $\mu$ is important. If the attacker knew these values, it could adopt a trivial oracle attack scheme: whenever the user pulls a non-target arm $I_t$, the attacker changes $I_t$ to the worst arm $i^W$. It is easy to show that, if the user runs a bandit algorithm whose regret is upper bounded by $O(\log(T))$ when there is no attack, the oracle attack scheme can force the user to pull the target arm $T - O(\log(T))$ times, using a cumulative cost $|\mathcal{C}| = O(\log(T))$. However, the oracle attack scheme is not practical when the true reward vector is unknown. In this paper, we will first design an effective attack scheme, which does not assume knowledge of the true reward vector and nearly matches the performance of the oracle attack scheme, to attack the UCB algorithm. We will then design a new bandit algorithm that is robust against action-manipulation attacks.

The action-manipulation attacks considered here are different from the reward-manipulation attacks introduced by the interesting recent work [22], [23], where the attacker can change the reward signal from the environment. In the setting considered in [22], [23], the attacker can change the reward signal $r_t$ from the environment to an arbitrary value chosen by the attacker. Correspondingly, the cumulative attack cost in [22], [23] is defined as the sum of the absolute values of the changes to the rewards. Compared with such reward-manipulation attacks, the action-manipulation attack is more difficult to carry out. In particular, as the action-manipulation attack only changes the action, it can influence but does not directly control the reward signal, which is a random variable drawn from a distribution depending on the action chosen by the attacker. This is in contrast to reward-manipulation attacks, where an attacker can change the rewards to any value.
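To make the interaction protocol of Fig. 1 concrete, the following is a minimal Python sketch of one run of the model, together with the oracle attack described above. The `user_policy`/`attacker_policy` interface (objects with `choose`, `attack`, and `update` methods) is our own illustrative convention, not notation from the paper.

```python
import numpy as np

def run_bandit(mu, sigma, T, user_policy, attacker_policy, rng):
    """One run of the protocol in Fig. 1: the user picks I_t, the attacker
    may replace it with a post-attack arm, and the environment draws the
    reward from the post-attack arm's distribution."""
    history = []
    for t in range(1, T + 1):
        I_t = user_policy.choose(t)               # pre-attack arm
        I_tilde = attacker_policy.attack(t, I_t)  # post-attack arm
        r_t = mu[I_tilde] + sigma * rng.standard_normal()
        user_policy.update(I_t, r_t)              # the user indexes by I_t
        attacker_policy.update(I_tilde, r_t)      # the attacker sees I_tilde
        history.append((I_t, I_tilde, r_t))
    return history

class OracleAttacker:
    """The oracle attack of Section II: with knowledge of the true means,
    reroute every non-target pull to the worst arm; no learning is needed."""
    def __init__(self, mu, target):
        self.worst = int(np.argmin(mu))
        self.target = target

    def attack(self, t, I_t):
        return I_t if I_t == self.target else self.worst

    def update(self, I_tilde, r_t):
        pass  # the oracle attacker does not use reward feedback
```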
III. ATTACK ON UCB AND COST ANALYSIS
In this section, we use the UCB algorithm as an example to illustrate the effects of action-manipulation attacks. We introduce the LCB attack strategy against the UCB bandit algorithm and analyze its cost.
A. Attack strategy
The UCB algorithm [28] is one of the most popular bandit algorithms. In the UCB algorithm, the user initially pulls each of the $K$ arms once in the first $K$ rounds. After that, the user chooses arms according to
$$I_t = \arg\max_i \left\{ \hat{\mu}_i(t-1) + 3\sigma\sqrt{\frac{\log t}{N_i(t-1)}} \right\}. \quad (3)$$

Under an action-manipulation attack, as the user does not know that $r_t$ is generated from arm $\tilde{I}_t$ instead of $I_t$, the empirical mean $\hat{\mu}_i(t)$ computed using (1) is no longer a proper estimate of the true mean reward of arm $i$. On the other hand, the attacker is able to obtain a good estimate of $\mu_i$ by
$$\tilde{\mu}_i(t) := \frac{1}{\tilde{N}_i(t)} \sum_{s \in \tilde{\tau}_i(t)} r_s, \quad (4)$$
where $\tilde{\tau}_i(t) := \{s : s \le t, \tilde{I}_s = i\}$ is the set of rounds up to $t$ when the post-attack arm is arm $i$, and $\tilde{N}_i(t) = |\tilde{\tau}_i(t)|$ is the number of pulls of post-attack arm $i$ up to round $t$. This information gap provides an opportunity for attack. In this section, we assume that the target arm is not the worst arm, i.e., $\mu_K > \mu_{i^W}$. We discuss the case where the target arm is the worst arm in Section III-C.

The proposed attack strategy works as follows. In the first $K$ rounds, the attacker does not attack. After that, at round $t$, if the user chooses a non-target arm $I_t$, the attacker changes it to the arm $\tilde{I}_t$ that has the smallest lower confidence bound (LCB):
$$\tilde{I}_t = \arg\min_i \left\{ \tilde{\mu}_i(t-1) - \mathrm{CB}\big(\tilde{N}_i(t-1)\big) \right\}, \quad (5)$$
where
$$\mathrm{CB}(N) = \sqrt{\frac{2\sigma^2}{N} \log \frac{\pi^2 K N^2}{3\delta}}. \quad (6)$$
Here $\delta$ is a parameter that is related to the probability statements in the analytical results presented in Section III-B. We call our scheme the LCB attack strategy. If at round $t$ the user chooses the target arm, the attacker does not attack. Thus the cumulative attack cost of our LCB attack scheme is equal to the total number of rounds in which a non-target arm is selected by the user. The algorithm is summarized in Algorithm 1.
Algorithm 1: LCB attack strategy on the UCB algorithm

Input: the user's bandit algorithm, the target arm $K$.
for $t = 1, 2, \ldots$ do
    The user chooses arm $I_t$ to pull according to the UCB algorithm (3).
    if $I_t = K$ then
        The attacker does not attack, and $\tilde{I}_t = I_t$.
    else
        The attacker attacks and changes arm $I_t$ to the arm $\tilde{I}_t$ chosen according to (5).
    end if
    The environment generates reward $r_t$ from arm $\tilde{I}_t$.
    The attacker and the user receive $r_t$.
end for

Here, we highlight the main idea of why the LCB attack strategy works. As discussed in Section II, if the attacker knew which arm is the worst, it could simply change the action to the worst arm whenever the user pulls a non-target arm. The main idea of the attack scheme is to estimate the mean of each arm and change a non-target arm to the arm whose lower confidence bound is the smallest. Effectively, this almost always changes the non-target arm to the worst arm. More formally, for $i \ne K$, we will show that this attack strategy ensures that $\hat{\mu}_i$ computed by the user using (1) converges to $\mu_{i^W}$. On the other hand, as the attacker does not attack when the user selects arm $K$, $\hat{\mu}_K$ computed by the user still converges to the true mean $\mu_K$ as $N_K$ increases. Because the target arm is assumed not to be the worst, i.e., $\mu_K > \mu_{i^W}$, $\hat{\mu}_i$ eventually falls below $\hat{\mu}_K$. The user will then rarely pull the non-target arms, and hence the attack cost is also small. The rigorous analysis of the cost is provided in Section III-B.
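The following is a minimal Python sketch of Algorithm 1 interacting with a UCB user. The UCB index follows (3), and the attacker's lower confidence bound follows (5)–(6) as reconstructed above; class and method names are illustrative, and the classes plug into the `run_bandit` loop sketched in Section II.

```python
import numpy as np

class UCBUser:
    """The UCB user of (3): pull each arm once, then maximize
    mu_hat_i + 3*sigma*sqrt(log t / N_i)."""
    def __init__(self, K, sigma):
        self.K, self.sigma = K, sigma
        self.N = np.zeros(K)  # pull counts N_i(t-1), indexed by pre-attack arm
        self.S = np.zeros(K)  # running sums of the received rewards

    def choose(self, t):
        if t <= self.K:
            return t - 1      # initialization: pull each arm once
        mu_hat = self.S / self.N
        bonus = 3 * self.sigma * np.sqrt(np.log(t) / self.N)
        return int(np.argmax(mu_hat + bonus))

    def update(self, arm, r):
        self.N[arm] += 1
        self.S[arm] += r

class LCBAttacker:
    """The LCB attack of Algorithm 1: reroute every non-target pull to the
    arm minimizing mu_tilde_i - CB(N_tilde_i), cf. (5)-(6)."""
    def __init__(self, K, sigma, delta, target):
        self.K, self.sigma, self.delta, self.target = K, sigma, delta, target
        self.N = np.zeros(K)  # post-attack pull counts N_tilde_i
        self.S = np.zeros(K)

    def _cb(self, N):
        # CB(N) of (6), using the constants as reconstructed in the text
        return np.sqrt(2 * self.sigma**2 / N
                       * np.log(np.pi**2 * self.K * N**2 / (3 * self.delta)))

    def attack(self, t, I_t):
        # no attack on the target arm, and none before every arm has feedback
        if I_t == self.target or np.any(self.N == 0):
            return I_t
        lcb = self.S / self.N - self._cb(self.N)
        return int(np.argmin(lcb))

    def update(self, arm, r):
        self.N[arm] += 1
        self.S[arm] += r
```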
B. Cost analysis

To analyze the cost of the proposed scheme, we need to track $\tilde{\mu}_i(t)$, the estimate obtained by the attacker using (4), and $\hat{\mu}_i(t)$, the estimate obtained by the user using (1).

The analysis of $\tilde{\mu}_i(t)$ is relatively simple, as the attacker knows which arm is truly pulled and hence $\tilde{\mu}_i(t)$ is a proper estimate of the mean of arm $i$. Define the event
$$E_1 := \left\{ \forall i, \forall t > K : |\tilde{\mu}_i(t) - \mu_i| < \mathrm{CB}\big(\tilde{N}_i(t)\big) \right\}. \quad (7)$$
Roughly speaking, $E_1$ is the event that the empirical means computed by the attacker using (4) are close to the true means. The following lemma, proved in [22], shows that the attacker can accurately estimate the average reward of each arm.

Lemma 1. (Lemma 1 in [22]) For $\delta \in (0, 1)$, $P(E_1) > 1 - \delta$.

The analysis of $\hat{\mu}_i(t)$ computed by the user is more complicated. When the user pulls arm $i$, because of the action-manipulation attacks, the random rewards may be drawn from different reward distributions. Define $\tau_{i,j}(t) := \{s : s \le t, I_s = i \text{ and } \tilde{I}_s = j\}$ as the set of rounds up to $t$ when the user chooses arm $i$ and the attacker changes it to arm $j$. Lemma 2 provides high-probability confidence bounds for $\hat{\mu}_{i,j}(t) := \frac{1}{N_{i,j}(t)} \sum_{s \in \tau_{i,j}(t)} r_s$, the empirical mean reward over the portion of arm $i$'s pulls whose post-attack arm is $j$, where $N_{i,j}(t) := |\tau_{i,j}(t)|$. Define the event
$$E_2 := \left\{ \forall i, \forall j, \forall t > K : |\hat{\mu}_{i,j}(t) - \mu_j| < \sqrt{\frac{2\sigma^2}{N_{i,j}(t)} \log \frac{\pi^2 K^2 (N_{i,j}(t))^2}{3\delta}} \right\}. \quad (8)$$

Lemma 2.
For $\delta \in (0, 1)$, $P(E_2) > 1 - \delta$.

Proof. Please refer to Appendix A.

Although the rewards $r_s$ in (1), used to calculate $\hat{\mu}_i(t)$, may be drawn from different reward distributions, we can build a high-probability bound on $\hat{\mu}_i(t)$ with the help of Lemma 2.

Lemma 3.
Under event $E_2$, for every arm $i$ and all $t > K$, we have
$$\left| \hat{\mu}_i(t) - \frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{\tilde{I}_s} \right| < \beta(N_i(t)), \quad (9)$$
where
$$\beta(N_i(t)) = \sqrt{\frac{2\sigma^2 K}{N_i(t)} \log \frac{\pi^2 (N_i(t))^2}{3\delta}}. \quad (10)$$

Proof.
Please refer to Appendix B.

Under events $E_1$ and $E_2$, we can build a connection between $\hat{\mu}_i(t)$ and $\mu_{i^W}$. In the proposed LCB attack strategy, the attacker explores and exploits the worst arm through a lower-confidence-bound method. Thus, when the user pulls a non-target arm, the attacker changes it to the worst arm in most rounds, which means that for all $i \ne K$, $\hat{\mu}_i(t)$ converges to $\mu_{i^W}$ as $N_i(t)$ increases. Lemma 4 makes the relationship between $\hat{\mu}_i(t)$ and $\mu_{i^W}$ precise.

Lemma 4.
Under events $E_1$ and $E_2$, using the LCB attack strategy in Algorithm 1, we have, for all $i \ne K$ and all $t > K$,
$$\hat{\mu}_i(t) \le \mu_{i^W} + \frac{1}{N_i(t)} \sum_{j \ne i^W} \frac{8\sigma^2}{\Delta_{j,i^W}} \log \frac{\pi^2 K t^2}{3\delta} \quad (11)$$
$$\qquad + \sqrt{\frac{2\sigma^2 K}{N_i(t)} \log \frac{\pi^2 (N_i(t))^2}{3\delta}}. \quad (12)$$

Proof.
Please refer to Appendix C.

Lemma 4 gives an upper bound on the empirical mean reward of each pre-attack arm $i \ne K$. Our main result is the following upper bound on the attack cost $|\mathcal{C}|$.

Theorem 1.
With probability at least $1 - 2\delta$, when $T \ge (\pi^2 K/(3\delta))^{2/5}$, using the LCB attack strategy specified in Algorithm 1, the attacker can manipulate the user into pulling the target arm in at least $T - |\mathcal{C}|$ rounds, with an attack cost
$$|\mathcal{C}| \le K - 1 + \frac{K-1}{4\Delta_{K,i^W}^2} \left( b + \sqrt{b^2 + 4\Delta_{K,i^W} S} \right)^2, \quad (13)$$
where
$$b = 3\sigma\sqrt{\log T} + \sqrt{2\sigma^2 K \log \frac{\pi^2 T^2}{3\delta}}, \qquad S = \sum_{j \ne i^W} \frac{8\sigma^2}{\Delta_{j,i^W}} \log \frac{\pi^2 K T^2}{3\delta}.$$

Proof.
Please refer to Appendix D.

The expression of the cost bound in Theorem 1 is complicated. The following corollary provides a simpler bound that is more explicit and interpretable.

Corollary 1.
Under the same assumptions as in Theorem 1, the total attack cost $|\mathcal{C}|$ of Algorithm 1 is upper bounded by
$$O\left( \frac{K\sigma^2 \log T}{\Delta_{K,i^W}^2} \left( K + \sum_{j \ne i^W} \frac{\Delta_{K,i^W}}{\Delta_{j,i^W}} + \sqrt{K \sum_{j \ne i^W} \frac{\Delta_{K,i^W}}{\Delta_{j,i^W}}} \right) \right), \quad (14)$$
and the total number of target arm pulls is $T - |\mathcal{C}|$.

From Corollary 1, we can see that the attack cost scales as $\log T$. Two constants, $\sigma^2/\Delta_{K,i^W}^2$ and $\sum_{j \ne i^W} \Delta_{K,i^W}/\Delta_{j,i^W}$, determine the prelog factor. In Section V, we provide numerical examples to illustrate the effects of these two constants on the attack cost.
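As a quick numerical illustration of these two constants, the snippet below evaluates them for the arm means used in the experiments of Section V; the noise level σ = 0.1 is an assumed example value.

```python
import numpy as np

# The two prelog constants of Corollary 1, evaluated for the arm means used
# in Section V; sigma = 0.1 is an assumed example noise level.
mu = np.array([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1, 0.2])
sigma = 0.1
target, worst = len(mu) - 1, int(np.argmin(mu))  # arm K and arm i_W

delta_K_W = mu[target] - mu[worst]               # Delta_{K, i^W} = 0.1
c1 = sigma**2 / delta_K_W**2                     # sigma^2 / Delta_{K,i^W}^2
c2 = sum(delta_K_W / (mu[j] - mu[worst])         # sum_j Delta_{K,i^W}/Delta_{j,i^W}
         for j in range(len(mu)) if j != worst)
print(c1, c2)                                    # approx. 1.0 and 2.83
```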
C. Attacks fail when the target arm is the worst arm

One limitation of our LCB attack strategy is that the target arm must not be the worst arm. With the LCB attack strategy, the attacker cannot force the user to pull the worst arm very frequently while spending only a logarithmic cost. The main reason is that, when the target arm is the worst, the average reward of every arm is larger than or equal to that of the target arm. As a result, our attack scheme is not able to make the target arm appear to have a higher expected reward than the user's estimates of the rewards of the other arms. In fact, the following theorem shows that no action-manipulation attack can manipulate the UCB algorithm into pulling the worst arm more than $T - O(T^{\alpha})$ times while spending only a logarithmic cost.

Theorem 2.
Let $\delta < 1/2$ and suppose the attack cost is limited to $O(\log(T))$. Then there is no attack that can force the UCB algorithm to pick the worst arm more than $T - O(T^{\alpha})$ times with probability at least $1 - \delta$, in which $\alpha < 1/(4K)$.

Proof. Please refer to Appendix E.

This theorem shows a contrast between the case where the target arm is not the worst arm and the case where it is. If the target arm is not the worst arm, our scheme forces the user to pick the target arm $T - O(\log(T))$ times with only logarithmic cost. On the other hand, if the target arm is the worst, Theorem 2 shows that there is no attack strategy that can force the user to pick the worst arm more than $T - O(T^{\alpha})$ times while incurring only logarithmic cost.

IV. ROBUST ALGORITHM AND REGRET ANALYSIS
The results in Section III expose a significant security threat posed by action-manipulation attacks on MABs. With only $O(\log(T))$ attacks carried out using the proposed LCB strategy, the UCB algorithm will almost always pull the target arm selected by the attacker. Although there are defense algorithms [24] and universal best arm identification schemes [29] for stochastic or adversarial bandits, they do not apply to the action-manipulation attack setting. This motivates us to design a new bandit algorithm that is robust against action-manipulation attacks. In this section, we propose such a robust bandit algorithm and analyze its regret.

A. Robust bandit algorithm
In this section, we assume that a valid upper bound $A$ on the cumulative attack cost $|\mathcal{C}|$ is known to the user, although the user does not need to know the exact cumulative attack cost $|\mathcal{C}|$. $A$ does not need to be constant; it can scale with $T$. In other words, for a given $A$, our proposed algorithm is robust to all action-manipulation attacks with a cumulative attack cost $|\mathcal{C}| < A$. This assumption is reasonable: if the cost were unbounded, it would not be possible to design a robust scheme.
1) :=( N ( t − , . . . , N K ( t − as the vector counting howmany times each action has been taken by the user, and ˆ µ ( t −
1) = (ˆ µ ( t − , . . . , ˆ µ K ( t − as the vector of thesample means computed by the user. The proposed algorithmis a modified UCB method by taking the maximum possiblemean estimate offset due to attack into consideration. We nameour scheme as maximum offset UCB (MOUCB).The proposed MOUCB works as follows. In the first AK rounds, MOUCB algorithm pulls each arm A times. Afterthat, at round t , the user chooses an arm I t by a modifiedUCB method: I t = arg max a { ˆ µ a ( t −
1) + β ( N a ( t − γ ( ˆ µ ( t − , N ( t − } , (15)where γ ( ˆ µ ( t − , N ( t − AN a ( t − i,j { ˆ µ i ( t − − ˆ µ j ( t −
1) + β ( N i ( t − β ( N j ( t − } , and β ( N ) = (cid:114) σ KN log π N δ . (16)The algorithm is summarized in Algorithm 2.Compared with the original UCB algorithm in (3), the maindifference is the additional term γ ( ˆ µ ( t − , N ( t − in (15).We now highlight the main idea why our bandit algorithmworks and the role of this additional term. In particular, inthe standard multi-armed stochastic bandit problem, ˆ µ i ( t ) isa proper estimation of µ i , the true mean reward of arm i .However, under the action-manipulation attacks, as the userdoes not know which arm is used to generate r t , ˆ µ i ( t ) is nota proper estimate of the true mean reward anymore. However,we can try to find a good bound of the true mean reward. Ifwe know ∆ i O ,i W , the reward difference between the optimalarm and the worst arm, we can describe the maximum offsetof the mean rewards caused by the attack. In particular, wehave µ i − AN i ( t ) ∆ i O ,i W ≤ N i ( t ) (cid:88) s ∈ τ i ( t ) µ I s ≤ µ i + AN i ( t ) ∆ i O ,i W , (17)5 lgorithm 2: Proposed MOUCB bandit algorithm
Input: a valid upper bound $A$ on the cumulative attack cost.
for $t = 1, 2, \ldots$ do
    if $t \le 2AK$ then
        The user pulls the arm whose pull count is the smallest, i.e., $I_t = \arg\min_i N_i(t-1)$.
    else
        The user chooses arm $I_t$ to pull according to (15).
    end if
    if the attacker decides to attack then
        The attacker attacks and changes $I_t$ to $\tilde{I}_t$.
    else
        The attacker does not attack and $\tilde{I}_t = I_t$.
    end if
    The environment generates reward $r_t$ from arm $\tilde{I}_t$.
    The attacker and the user receive $r_t$.
end for

Compared with the original UCB algorithm in (3), the main difference is the additional term $\gamma_a(\hat{\mu}(t-1), N(t-1))$ in (15). We now highlight why our bandit algorithm works and the role of this additional term. In the standard stochastic multi-armed bandit problem, $\hat{\mu}_i(t)$ is a proper estimate of $\mu_i$, the true mean reward of arm $i$. However, under action-manipulation attacks, as the user does not know which arm is used to generate $r_t$, $\hat{\mu}_i(t)$ is no longer a proper estimate of the true mean reward. We can, however, still bound the true mean reward. If we knew $\Delta_{i^O,i^W}$, the reward difference between the optimal arm and the worst arm, we could characterize the maximum offset of the mean rewards caused by the attack. In particular, since the attacker changes at most $A$ actions, we have
$$\mu_i - \frac{A}{N_i(t)} \Delta_{i^O,i^W} \le \frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{\tilde{I}_s} \le \mu_i + \frac{A}{N_i(t)} \Delta_{i^O,i^W}, \quad (17)$$
which implies
$$\mu_i \le \frac{A}{N_i(t)} \Delta_{i^O,i^W} + \frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{\tilde{I}_s}. \quad (18)$$
In (18), the first term on the right-hand side is the maximum offset that an attacker can introduce, regardless of the attack strategy. The second term on the right-hand side is related to the mean estimated by the user. In particular, under event $E_2$, as shown in Lemma 3, we have
$$\frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{\tilde{I}_s} < \hat{\mu}_i(t) + \beta(N_i(t)). \quad (19)$$
Hence, regardless of the attack strategy, we have an upper confidence bound on $\mu_i$:
$$\mu_i \le \hat{\mu}_i(t) + \frac{A}{N_i(t)} \Delta_{i^O,i^W} + \beta(N_i(t)). \quad (20)$$
In our case, however, $\Delta_{i^O,i^W}$ is also unknown. In our algorithm, we therefore use a high-probability bound on $\Delta_{i^O,i^W}$:
$$\Delta_{i^O,i^W} \le 2\max_{i,j} \left\{ \hat{\mu}_i - \hat{\mu}_j + \beta(N_i(t)) + \beta(N_j(t)) \right\}, \quad (21)$$
which will be proved in Lemma 5 below. Replacing $\Delta_{i^O,i^W}$ in the second term of (20) with the bound (21) turns that term into $\gamma_a(\hat{\mu}(t-1), N(t-1))$, and we obtain our final algorithm.
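The following is a minimal Python sketch of the MOUCB index computation in (15)–(16), following the same illustrative interface as the earlier sketches; the constants inside β(·) follow the reconstruction above and should be treated as assumptions.

```python
import numpy as np

class MOUCBUser:
    """The MOUCB user of Algorithm 2: a round-robin warm-up of 2A pulls per
    arm, then the modified index of (15)-(16)."""
    def __init__(self, K, sigma, delta, A):
        self.K, self.sigma, self.delta, self.A = K, sigma, delta, A
        self.N = np.zeros(K)
        self.S = np.zeros(K)

    def _beta(self, N):
        # beta(N) of (16), with the constants as reconstructed in the text
        return np.sqrt(2 * self.sigma**2 * self.K / N
                       * np.log(np.pi**2 * N**2 / (3 * self.delta)))

    def choose(self, t):
        if t <= 2 * self.A * self.K:
            return int(np.argmin(self.N))  # warm-up: 2A pulls per arm
        mu_hat = self.S / self.N
        beta = self._beta(self.N)
        # high-probability bound on Delta_{i^O, i^W}, cf. (21) and Lemma 5
        gap = np.max(mu_hat[:, None] - mu_hat[None, :]
                     + beta[:, None] + beta[None, :])
        gamma = 2 * self.A * gap / self.N  # maximum-offset term gamma_a
        return int(np.argmax(mu_hat + beta + gamma))

    def update(self, arm, r):
        self.N[arm] += 1
        self.S[arm] += r
```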
B. Regret analysis

Lemma 5 bounds $\Delta_{i^O,i^W}$, the maximum reward difference between any two arms, under event $E_2$.

Lemma 5.
For $\delta \le 1/4$, $t > 2AK$, and under event $E_2$, the MOUCB algorithm satisfies
$$\Delta_{i^O,i^W} \le 2\max_{i,j} \left\{ \hat{\mu}_i - \hat{\mu}_j + \beta(N_i(t)) + \beta(N_j(t)) \right\} \le 2\Delta_{i^O,i^W} + 8\sqrt{\frac{\sigma^2 K}{A} \log \frac{4\pi^2 A^2}{3\delta}}. \quad (22)$$

Proof.
Please refer to Appendix F.

Using Lemma 5, we now bound the regret of Algorithm 2.
Theorem 3.
Let $A$ be an upper bound on the total attack cost $|\mathcal{C}|$. For $\delta \le 1/4$ and $T \ge 2AK$, the MOUCB algorithm has pseudo-regret $\bar{R}(T)$ satisfying
$$\bar{R}(T) \le \sum_{a \ne i^O} \max\left\{ \frac{32\sigma^2 K}{\Delta_{i^O,a}} \log \frac{\pi^2 T^2}{3\delta},\; 2A\left( \Delta_{i^O,a} + 2\Delta_{i^O,i^W} + 8\sqrt{\frac{\sigma^2 K}{A} \log \frac{4\pi^2 A^2}{3\delta}} \right) \right\} \quad (23)$$
with probability at least $1 - \delta$.

Proof. Please refer to Appendix G.

Theorem 3 reveals that our bandit algorithm is robust to action-manipulation attacks. If the total attack cost is bounded by $O(\log T)$, the pseudo-regret of the MOUCB bandit algorithm is still bounded by $O(\log T)$. This is in contrast with UCB, for which we have shown in Section III that the pseudo-regret is $O(T)$ under an attack cost of $O(\log T)$. If the total attack cost is as large as $\Omega(\log T)$, the pseudo-regret of the MOUCB bandit algorithm is bounded by $O(A)$, which is linear in $A$.
V. NUMERICAL RESULTS

In this section, we provide numerical examples to illustrate the analytical results obtained. In our simulations, the bandit has 10 arms, and the reward distribution of arm $i$ is $\mathcal{N}(\mu_i, \sigma^2)$. The attacker's target arm is arm $K$. We let $\delta = 0.1$. We run each experiment for multiple trials, and in each trial we run the algorithm for $T$ rounds.
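The following sketch reproduces this setup for the LCB-attack experiments reported below, reusing `run_bandit`, `UCBUser`, and `LCBAttacker` from the earlier sketches; the horizon and trial count are illustrative placeholders, not the paper's exact values.

```python
import numpy as np

# Sketch of the Section V-A experiment. Horizon T and the trial count are
# assumed values chosen to run quickly; arm means and delta follow the text.
mu = np.array([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1, 0.2])
sigma, delta, T, trials = 0.1, 0.1, 100_000, 20
K, target = len(mu), len(mu) - 1  # the target arm is arm K

target_pulls = []
for trial in range(trials):
    rng = np.random.default_rng(trial)
    user = UCBUser(K, sigma)
    attacker = LCBAttacker(K, sigma, delta, target)
    history = run_bandit(mu, sigma, T, user, attacker, rng)
    target_pulls.append(sum(I == target for I, _, _ in history))
print("mean target-arm pulls:", np.mean(target_pulls))
```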
A. LCB attack strategy

We first illustrate the impact of the proposed LCB attack strategy on the UCB algorithm.

Fig. 2. Number of rounds the target arm was pulled.
In Figure 2, we fix $\sigma = 0.1$ and $\Delta_{K,i^W} = 0.1$ and compare the number of rounds in which the target arm is pulled with and without attack. In this experiment, the mean rewards of the arms are 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1, and 0.2, respectively. Arm $K$ is not the worst arm, but its average reward is lower than that of most arms. The results are averaged over 20 trials. The attacker successfully manipulates the user into pulling the target arm very frequently.

Fig. 3. Attack cost vs. $\sigma^2/\Delta_{K,i^W}^2$.
In Figure 3, in order to study how $\sigma^2/\Delta_{K,i^W}^2$ affects the attack cost, we fix $\Delta_{K,i^W} = 0.1$ and set $\sigma$ to 0.1, 0.3, and 0.5, respectively. The mean rewards of the arms are the same as above. From the figure, we can see that as $\sigma^2/\Delta_{K,i^W}^2$ increases, the attack cost increases. In addition, as predicted by our analysis, the attack cost increases with $T$, the total number of rounds, in a logarithmic order.

Fig. 4. Attack cost vs. $\sum_{j \ne i^W} \Delta_{K,i^W}/\Delta_{j,i^W}$.

Figure 4 illustrates how $\sum_{j \ne i^W} \Delta_{K,i^W}/\Delta_{j,i^W}$ affects the attack cost. In this experiment, we fix $\sigma/\Delta_{K,i^W} = 1$ and set $\Delta_{K,i^W}$ to 0.2, 0.6, and 0.9, respectively. The mean rewards of the remaining arms are the same as above. The figure illustrates that, as $\sum_{j \ne i^W} \Delta_{K,i^W}/\Delta_{j,i^W}$ increases, the attack cost also increases. This is consistent with our analysis in Corollary 1.

B. MOUCB bandit algorithm
We now illustrate the effectiveness of the MOUCB bandit algorithm.

In this experiment, we use a setting similar to that in the simulation of the LCB attack scheme. The mean rewards of the arms are set to 1.0, 0.8, 0.9, 0.5, 0.2, 0.3, 0.1, 0.4, 0.7, and 0.6, respectively. The total attack cost $|\mathcal{C}|$ is limited to 2000, and the given valid upper bound on the total attack cost is $A = 3000$. The results are averaged over 20 trials.

Fig. 5. Comparison of the number of rounds the optimal arm was pulled.
In Figure 5, we run the MOUCB algorithm under two different attacks and compare the number of rounds in which the optimal arm is pulled under each. The first attack is the LCB attack discussed in Section III. The second attack is the oracle attack, in which the attacker knows the true mean rewards of the arms and changes any non-target arm to the worst arm (see the discussion in Section II). For comparison purposes, we also add the curve for MOUCB under no attack and the curve for UCB under no attack. The results show that, even under the oracle attack, the proposed MOUCB bandit algorithm achieves almost the same performance as UCB without attack.

Fig. 6. Number of rounds the optimal arm was pulled using the UCB algorithm.
To further compare the performance of UCB and MOUCB, Figure 6 shows the performance of the UCB algorithm in the three scenarios discussed above: under the LCB attack, under the oracle attack, and under no attack. The results show that both the LCB and oracle attacks successfully manipulate the UCB algorithm into pulling a non-optimal arm very frequently, as the curves for the LCB attack and the oracle attack are far away from the curve for no attack. This is in sharp contrast with the situation for the MOUCB algorithm shown in Figure 5, where all the curves are almost identical.

Fig. 7. Pseudo-regret of the MOUCB algorithm.

Fig. 8. Pseudo-regret of the UCB algorithm.
Figures 7 and 8 illustrate the pseudo-regret of the MOUCB bandit algorithm and the UCB bandit algorithm, respectively. In Figure 7, as predicted by our analysis, the MOUCB algorithm achieves logarithmic pseudo-regret under both the LCB attack and the oracle attack. Furthermore, the curves under both attacks are very close to that of the case without attacks. In contrast, as shown in Figure 8, the pseudo-regret of UCB grows linearly under both attacks, while it grows logarithmically under no attack. The figures again show that UCB is vulnerable to action-manipulation attacks, while the proposed MOUCB is robust to such attacks (even oracle attacks).
VI. CONCLUSION
In this paper, we have introduced a new class of attacks on stochastic bandits: action-manipulation attacks. We have analyzed this attack against the UCB algorithm and proved that the proposed LCB attack scheme can force the user to almost always pull a (non-worst) target arm with only logarithmic effort. To defend against this type of attack, we have further designed a new bandit algorithm, MOUCB, that is robust to action-manipulation attacks. We have analyzed the regret of MOUCB under any attack with bounded cost, and have shown that the proposed algorithm is robust to action-manipulation attacks.
APPENDIX A
PROOF OF LEMMA 2

Let $\{X_j\}_{j=1}^{\infty}$ be a sequence of i.i.d. $\sigma$-sub-Gaussian random variables with mean $\mu$, and let $\hat{\mu}(t) = \frac{1}{N(t)} \sum_{j=1}^{N(t)} X_j$. By Hoeffding's inequality,
$$P(|\hat{\mu}(t) - \mu| \ge \eta) \le 2\exp\left( -\frac{N(t)\eta^2}{2\sigma^2} \right). \quad (24)$$
In order to ensure that $E_2$ holds for all arms $i$ and $j$ and all pull counts $N = N_{i,j}(t)$, we set $\delta_{i,j,N} := \frac{6\delta}{\pi^2 K^2 N^2}$. We then have
$$P\left( \exists i, \exists j, \exists N : |\hat{\mu}_{i,j}(t) - \mu_j| \ge \sqrt{\frac{2\sigma^2}{N} \log \frac{\pi^2 K^2 N^2}{3\delta}} \right) \le \sum_{i=1}^{K} \sum_{j=1}^{K} \sum_{N=1}^{\infty} \delta_{i,j,N} = \delta. \quad (25)$$

APPENDIX B
PROOF OF LEMMA 3

Under event $E_2$, we have
$$\left| \hat{\mu}_i(t) - \frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{\tilde{I}_s} \right| = \left| \sum_{j=1}^{K} \frac{N_{i,j}(t)}{N_i(t)} (\hat{\mu}_{i,j}(t) - \mu_j) \right| \le \sum_{j=1}^{K} \frac{N_{i,j}(t)}{N_i(t)} |\hat{\mu}_{i,j}(t) - \mu_j| < \frac{1}{N_i(t)} \sum_{j=1}^{K} \sqrt{2\sigma^2 N_{i,j}(t) \log \frac{\pi^2 K^2 (N_{i,j}(t))^2}{3\delta}}. \quad (26)$$
Define the function $f(N) = \sqrt{2\sigma^2 N \log \frac{\pi^2 K^2 N^2}{3\delta}} : \mathbb{R}^+ \to \mathbb{R}$. A direct computation gives
$$f''(N) = -\frac{4\left( \sigma^2 \log \frac{\pi^2 K^2 N^2}{3\delta} \right)^2 + 16\sigma^4}{4\left( 2\sigma^2 N \log \frac{\pi^2 K^2 N^2}{3\delta} \right)^{3/2}} < 0 \quad (27)$$
when $N \ge 1$. Hence $f$ is strictly concave for $N \ge 1$, and by Jensen's inequality we have
$$\sum_{j=1}^{K} f(N_{i,j}(t)) \le K f\left( \frac{1}{K} \sum_{j=1}^{K} N_{i,j}(t) \right) = K f\left( \frac{N_i(t)}{K} \right). \quad (28)$$
Thus,
$$\left| \hat{\mu}_i(t) - \frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{\tilde{I}_s} \right| < \frac{K}{N_i(t)} \sqrt{\frac{2\sigma^2 N_i(t)}{K} \log \frac{\pi^2 (N_i(t))^2}{3\delta}} = \sqrt{\frac{2\sigma^2 K}{N_i(t)} \log \frac{\pi^2 (N_i(t))^2}{3\delta}}. \quad (29)$$

APPENDIX C
PROOF OF LEMMA 4

Fix a round $t$, and consider the case that in round $t+1$ the user chooses a non-target arm $I_{t+1} = i \ne K$ and the attacker changes it to a non-worst arm $\tilde{I}_{t+1} = j \ne i^W$. On one hand, under event $E_1$, we have
$$\tilde{\mu}_{i^W}(t) - \mu_{i^W} < \mathrm{CB}(\tilde{N}_{i^W}(t)), \quad \text{and} \quad \tilde{\mu}_j(t) - \mu_j > -\mathrm{CB}(\tilde{N}_j(t)). \quad (30)$$
On the other hand, according to the attack scheme, it must be the case that
$$\tilde{\mu}_{i^W}(t) - \mathrm{CB}(\tilde{N}_{i^W}(t)) > \tilde{\mu}_j(t) - \mathrm{CB}(\tilde{N}_j(t)), \quad (31)$$
which is equivalent to
$$\mathrm{CB}(\tilde{N}_j(t)) > \tilde{\mu}_j(t) - \left( \tilde{\mu}_{i^W}(t) - \mathrm{CB}(\tilde{N}_{i^W}(t)) \right). \quad (32)$$
Combining (32) with (30), we have
$$\mathrm{CB}(\tilde{N}_j(t)) > \mu_j - \mathrm{CB}(\tilde{N}_j(t)) - \mu_{i^W}, \quad \text{i.e.,} \quad 2\,\mathrm{CB}(\tilde{N}_j(t)) > \Delta_{j,i^W}. \quad (33)$$
Using the facts that $\tilde{N}_j(t) \le t$ and $N_{i,j}(t) \le \tilde{N}_j(t)$, we have
$$\Delta_{j,i^W} < 2\,\mathrm{CB}(\tilde{N}_j(t)) = 2\sqrt{\frac{2\sigma^2}{\tilde{N}_j(t)} \log \frac{\pi^2 K (\tilde{N}_j(t))^2}{3\delta}} \le 2\sqrt{\frac{2\sigma^2}{N_{i,j}(t)} \log \frac{\pi^2 K t^2}{3\delta}}, \quad (34)$$
which is equivalent to
$$N_{i,j}(t) < \frac{8\sigma^2}{\Delta_{j,i^W}^2} \log \frac{\pi^2 K t^2}{3\delta}. \quad (35)$$
Hence, under event $E_2$, we have
$$\hat{\mu}_i(t) < \frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{\tilde{I}_s} + \beta(N_i(t)) = \frac{1}{N_i(t)} \sum_j N_{i,j}(t)\,\mu_j + \beta(N_i(t)) = \sum_j \frac{N_{i,j}(t)}{N_i(t)} \left( \Delta_{j,i^W} + \mu_{i^W} \right) + \beta(N_i(t)) < \mu_{i^W} + \beta(N_i(t)) + \frac{1}{N_i(t)} \sum_{j \ne i^W} \frac{8\sigma^2}{\Delta_{j,i^W}} \log \frac{\pi^2 K t^2}{3\delta}. \quad (36)$$
The lemma is proved.
APPENDIX D
PROOF OF THEOREM 1

By Lemma 1 applied to arm $K$, with probability at least $1 - \delta/K$, for all $t > K$: $|\tilde{\mu}_K(t) - \mu_K| < \mathrm{CB}(\tilde{N}_K(t))$. Because the LCB attack scheme does not attack the target arm, we can also conclude that with probability at least $1 - \delta/K$, for all $t > K$: $|\hat{\mu}_K(t) - \mu_K| < \mathrm{CB}(N_K(t))$.

The user relies on the UCB algorithm to choose arms. If at round $t$ the user chooses an arm $I_t = i \ne K$, which is not the target arm, then
$$\hat{\mu}_i(t-1) + 3\sigma\sqrt{\frac{\log t}{N_i(t-1)}} > \hat{\mu}_K(t-1) + 3\sigma\sqrt{\frac{\log t}{N_K(t-1)}}, \quad (37)$$
which is equivalent to
$$3\sigma\sqrt{\frac{\log t}{N_i(t-1)}} > -\hat{\mu}_i(t-1) + \hat{\mu}_K(t-1) + 3\sigma\sqrt{\frac{\log t}{N_K(t-1)}}. \quad (38)$$
Under event $E_1$, we have
$$\hat{\mu}_K(t) > \mu_K - \mathrm{CB}(N_K(t)). \quad (39)$$
Under event $E_1 \cap E_2$, according to Lemma 4, we have
$$\hat{\mu}_i(t) \le \mu_{i^W} + \sqrt{\frac{2\sigma^2 K}{N_i(t)} \log \frac{\pi^2 (N_i(t))^2}{3\delta}} + \frac{1}{N_i(t)} \sum_{j \ne i^W} \frac{8\sigma^2}{\Delta_{j,i^W}} \log \frac{\pi^2 K t^2}{3\delta}. \quad (40)$$
Combining the inequalities above,
$$3\sigma\sqrt{\frac{\log t}{N_i(t-1)}} > \mu_K - \mu_{i^W} - \sqrt{\frac{2\sigma^2 K}{N_i(t-1)} \log \frac{\pi^2 (N_i(t-1))^2}{3\delta}} - \frac{1}{N_i(t-1)} \sum_{j \ne i^W} \frac{8\sigma^2}{\Delta_{j,i^W}} \log \frac{\pi^2 K (t-1)^2}{3\delta} - \mathrm{CB}(N_K(t-1)) + 3\sigma\sqrt{\frac{\log t}{N_K(t-1)}}. \quad (41)$$
When $t \ge (\pi^2 K/(3\delta))^{2/5}$,
$$3\sigma\sqrt{\frac{\log t}{N_K(t-1)}} = \sqrt{\frac{4\sigma^2 \log t + 5\sigma^2 \log t}{N_K(t-1)}} \ge \sqrt{\frac{4\sigma^2 \log t + 2\sigma^2 \log \frac{\pi^2 K}{3\delta}}{N_K(t-1)}} = \sqrt{\frac{2\sigma^2 \log \frac{\pi^2 K t^2}{3\delta}}{N_K(t-1)}} \ge \sqrt{\frac{2\sigma^2 \log \frac{\pi^2 K (N_K(t-1))^2}{3\delta}}{N_K(t-1)}} = \mathrm{CB}(N_K(t-1)). \quad (42)$$
Now the inequality depends only on $N_i(t-1)$ and constants:
$$3\sigma\sqrt{\frac{\log t}{N_i(t-1)}} > \Delta_{K,i^W} - \sqrt{\frac{2\sigma^2 K}{N_i(t-1)} \log \frac{\pi^2 t^2}{3\delta}} - \frac{1}{N_i(t-1)} \sum_{j \ne i^W} \frac{8\sigma^2}{\Delta_{j,i^W}} \log \frac{\pi^2 K t^2}{3\delta}. \quad (43)$$
Solving this inequality for $N_i(t-1)$, we obtain
$$N_i(t-1) < \frac{1}{4\Delta_{K,i^W}^2} \left( b(t) + \sqrt{b(t)^2 + 4\Delta_{K,i^W} S(t)} \right)^2, \quad (44)$$
where $b(t) = 3\sigma\sqrt{\log t} + \sqrt{2\sigma^2 K \log \frac{\pi^2 t^2}{3\delta}}$ and $S(t) = \sum_{j \ne i^W} \frac{8\sigma^2}{\Delta_{j,i^W}} \log \frac{\pi^2 K t^2}{3\delta}$. Since event $E_1 \cap E_2$ occurs with probability at least $1 - 2\delta$, (44) holds with probability at least $1 - 2\delta$. Theorem 1 follows immediately from the definition of the attack cost and (44).

APPENDIX E
PROOF OF THEOREM 2

Suppose now that the target arm $K$ is the worst arm. Then for every arm $i$ and all $t$,
$$\frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{\tilde{I}_s} \ge \mu_K. \quad (45)$$
If the user pulls arm $K$ at round $t$, then according to the UCB algorithm, we have for any arm $i \ne K$,
$$\hat{\mu}_i(t-1) + 3\sigma\sqrt{\frac{\log t}{N_i(t-1)}} < \hat{\mu}_K(t-1) + 3\sigma\sqrt{\frac{\log t}{N_K(t-1)}}. \quad (46)$$
Under event $E_2$, Lemma 3 and (29) hold for every arm $i$, which implies
$$\hat{\mu}_i(t-1) > \frac{1}{N_i(t-1)} \sum_{s \in \tau_i(t-1)} \mu_{\tilde{I}_s} - \sqrt{\frac{2\sigma^2 K}{N_i(t-1)} \log \frac{\pi^2 (N_i(t-1))^2}{3\delta}}, \quad (47)$$
and
$$\hat{\mu}_K(t-1) < \frac{1}{N_K(t-1)} \sum_{s \in \tau_K(t-1)} \mu_{\tilde{I}_s} + \sqrt{\frac{2\sigma^2 K}{N_K(t-1)} \log \frac{\pi^2 (N_K(t-1))^2}{3\delta}}. \quad (48)$$
Note that for $\delta < 1/2$, $h(N) = \frac{2\sigma^2 K}{N} \log \frac{\pi^2 N^2}{3\delta} : \mathbb{R}^+ \to \mathbb{R}$ is monotonically decreasing in $N$ for $N \ge 1$.

If $N_i(t-1) < N_K(t-1)$ and $N_i(t-1) < \frac{\sqrt{3\delta}}{\pi} t^{1/(4K)}$ hold for arm $i$, then
$$3\sigma\sqrt{\frac{\log t}{N_K(t-1)}} < 3\sigma\sqrt{\frac{\log t}{N_i(t-1)}}, \quad (49)$$
and
$$\sqrt{\frac{2\sigma^2 K}{N_K(t-1)} \log \frac{\pi^2 (N_K(t-1))^2}{3\delta}} < \sqrt{\frac{2\sigma^2 K}{N_i(t-1)} \log \frac{\pi^2 (N_i(t-1))^2}{3\delta}} < \sigma\sqrt{\frac{\log t}{N_i(t-1)}}. \quad (50)$$
Combining the inequalities above, we find
$$\sigma\sqrt{\frac{\log t}{N_i(t-1)}} < \frac{1}{N_K(t-1)} \sum_{s \in \tau_K(t-1)} \mu_{\tilde{I}_s} - \mu_K. \quad (51)$$
Since the attack cost is limited to $O(\log t)$,
$$\frac{1}{N_K(t-1)} \sum_{s \in \tau_K(t-1)} \mu_{\tilde{I}_s} - \mu_K = \frac{O(\log t)}{N_K(t-1)}, \quad (52)$$
so
$$N_i(t-1) = \Omega\left( \frac{\sigma^2 (N_K(t-1))^2}{\log t} \right). \quad (53)$$
In summary, as long as event $E_2$ holds, at least one of the following three conditions must be true:
$$N_i(t-1) = \Omega\left( \frac{\sigma^2 (N_K(t-1))^2}{\log t} \right), \quad N_i(t-1) \ge N_K(t-1), \quad \text{or} \quad N_i(t-1) \ge \frac{\sqrt{3\delta}}{\pi} t^{1/(4K)}. \quad (54)$$
In addition, any one of the three conditions implies that the user pulls the non-target arms more than $O(t^{\alpha})$ times, in which $\alpha < 1/(4K)$. Since event $E_2$ holds with probability at least $1 - \delta$, the conclusion of the theorem holds with probability at least $1 - \delta$.

APPENDIX F
PROOF OF LEMMA 5

For $\delta \le 1/4$, $\beta(N) = \sqrt{\frac{2\sigma^2 K}{N} \log \frac{\pi^2 N^2}{3\delta}}$ is monotonically decreasing in $N$, as
$$\frac{\partial}{\partial N} \beta(N) = \frac{\sigma^2 K}{N^2 \beta(N)} \left( 2 - \log \frac{\pi^2 N^2}{3\delta} \right) \le \frac{\sigma^2 K}{N^2 \beta(N)} \left( 2 - \log \frac{\pi^2}{3\delta} \right) < 0 \quad (55)$$
for $N \ge 1$.

We first prove the first inequality in Lemma 5. Consider the optimal arm $i^O$ and the worst arm $i^W$. Define $C_i := |\{t : t \le T, \tilde{I}_t \ne I_t = i\}|$. In the action-manipulation setting, when $t > 2AK$, the MOUCB algorithm satisfies
$$\frac{1}{N_{i^O}(t)} \sum_{s \in \tau_{i^O}(t)} \mu_{\tilde{I}_s} \ge \frac{N_{i^O}(t) - C_{i^O}}{N_{i^O}(t)} \mu_{i^O} + \frac{C_{i^O}}{N_{i^O}(t)} \mu_{i^W} = \mu_{i^O} - \Delta_{i^O,i^W} \frac{C_{i^O}}{N_{i^O}(t)} \ge \mu_{i^O} - \Delta_{i^O,i^W} \frac{C_{i^O}}{2A}, \quad (56)$$
and
$$\frac{1}{N_{i^W}(t)} \sum_{s \in \tau_{i^W}(t)} \mu_{\tilde{I}_s} \le \frac{N_{i^W}(t) - C_{i^W}}{N_{i^W}(t)} \mu_{i^W} + \frac{C_{i^W}}{N_{i^W}(t)} \mu_{i^O} = \mu_{i^W} + \Delta_{i^O,i^W} \frac{C_{i^W}}{N_{i^W}(t)} \le \mu_{i^W} + \Delta_{i^O,i^W} \frac{C_{i^W}}{2A}. \quad (57)$$
Combining (56) and (57), and using $C_{i^O} + C_{i^W} \le |\mathcal{C}| \le A$, we have
$$\frac{1}{N_{i^O}(t)} \sum_{s \in \tau_{i^O}(t)} \mu_{\tilde{I}_s} - \frac{1}{N_{i^W}(t)} \sum_{s \in \tau_{i^W}(t)} \mu_{\tilde{I}_s} \ge \mu_{i^O} - \mu_{i^W} - \Delta_{i^O,i^W} \frac{A}{2A} = \frac{\Delta_{i^O,i^W}}{2}. \quad (58)$$
From (29), we find
$$\frac{1}{N_{i^O}(t)} \sum_{s \in \tau_{i^O}(t)} \mu_{\tilde{I}_s} - \frac{1}{N_{i^W}(t)} \sum_{s \in \tau_{i^W}(t)} \mu_{\tilde{I}_s} \le \hat{\mu}_{i^O}(t) + \beta(N_{i^O}(t)) - \left( \hat{\mu}_{i^W}(t) - \beta(N_{i^W}(t)) \right) \le \max_{i,j} \left\{ \hat{\mu}_i(t) + \beta(N_i(t)) - \left( \hat{\mu}_j(t) - \beta(N_j(t)) \right) \right\}, \quad (59)$$
which proves the first inequality.

We now prove the second inequality in Lemma 5:
$$\max_{i,j} \left\{ \hat{\mu}_i(t) + \beta(N_i(t)) - \left( \hat{\mu}_j(t) - \beta(N_j(t)) \right) \right\} \le \max_{i,j} \left\{ \frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{\tilde{I}_s} + 2\beta(N_i(t)) - \frac{1}{N_j(t)} \sum_{s \in \tau_j(t)} \mu_{\tilde{I}_s} + 2\beta(N_j(t)) \right\} \le \Delta_{i^O,i^W} + \max_{i,j} \left\{ 2\beta(N_i(t)) + 2\beta(N_j(t)) \right\}. \quad (60)$$
Recall that for $\delta \le 1/4$, $\beta(N)$ is monotonically decreasing in $N$, and that every arm has been pulled at least $2A$ times. Therefore,
$$\max_{i,j} \left\{ 2\beta(N_i(t)) + 2\beta(N_j(t)) \right\} \le 4\beta(2A) = 4\sqrt{\frac{\sigma^2 K}{A} \log \frac{4\pi^2 A^2}{3\delta}}, \quad (61)$$
and the second inequality in (22) follows.

APPENDIX G
PROOF OF THEOREM 3

Recall that in the first $2AK$ rounds, the MOUCB algorithm pulls every arm $2A$ times. For $t > 2AK$ and under event $E_2$, if at round $t+1$ the MOUCB algorithm chooses a non-optimal arm $I_{t+1} = a \ne i^O$, we have
$$\hat{\mu}_a + \beta(N_a(t)) + \frac{2A}{N_a(t)} \max_{i,j} \left\{ \hat{\mu}_i - \hat{\mu}_j + \beta(N_i(t)) + \beta(N_j(t)) \right\} \ge \hat{\mu}_{i^O} + \beta(N_{i^O}(t)) + \frac{2A}{N_{i^O}(t)} \max_{i,j} \left\{ \hat{\mu}_i - \hat{\mu}_j + \beta(N_i(t)) + \beta(N_j(t)) \right\},$$
which, according to Lemma 5, implies
$$\hat{\mu}_a + \frac{A}{N_a(t)} \left( 2\Delta_{i^O,i^W} + 8\sqrt{\frac{\sigma^2 K}{A} \log \frac{4\pi^2 A^2}{3\delta}} \right) + \beta(N_a(t)) \ge \hat{\mu}_{i^O} + \frac{A}{N_{i^O}(t)} \Delta_{i^O,i^W} + \beta(N_{i^O}(t)).$$
From (29), we find
$$\hat{\mu}_a \le \frac{1}{N_a(t)} \sum_{s \in \tau_a(t)} \mu_{\tilde{I}_s} + \beta(N_a(t)) \le \mu_a + \Delta_{i^O,a} \frac{C_a}{N_a(t)} + \beta(N_a(t)) \le \mu_a + \Delta_{i^O,a} \frac{A}{N_a(t)} + \beta(N_a(t)),$$
and
$$\hat{\mu}_{i^O} \ge \frac{1}{N_{i^O}(t)} \sum_{s \in \tau_{i^O}(t)} \mu_{\tilde{I}_s} - \beta(N_{i^O}(t)) \ge \mu_{i^O} - \Delta_{i^O,i^W} \frac{C_{i^O}}{N_{i^O}(t)} - \beta(N_{i^O}(t)) \ge \mu_{i^O} - \Delta_{i^O,i^W} \frac{A}{N_{i^O}(t)} - \beta(N_{i^O}(t)).$$
By combining the inequalities above, we have
$$\mu_{i^O} \le \mu_a + \Delta_{i^O,a} \frac{A}{N_a(t)} + 2\beta(N_a(t)) + \frac{A}{N_a(t)} \left( 2\Delta_{i^O,i^W} + 8\sqrt{\frac{\sigma^2 K}{A} \log \frac{4\pi^2 A^2}{3\delta}} \right),$$
which is equivalent to
$$\Delta_{i^O,a} \le \Delta_{i^O,a} \frac{A}{N_a(t)} + 2\sqrt{\frac{2\sigma^2 K}{N_a(t)} \log \frac{\pi^2 (N_a(t))^2}{3\delta}} + \frac{A}{N_a(t)} \left( 2\Delta_{i^O,i^W} + 8\sqrt{\frac{\sigma^2 K}{A} \log \frac{4\pi^2 A^2}{3\delta}} \right) \le \sqrt{\frac{8\sigma^2 K}{N_a(t)} \log \frac{\pi^2 t^2}{3\delta}} + \frac{A}{N_a(t)} \left( \Delta_{i^O,a} + 2\Delta_{i^O,i^W} + 8\sqrt{\frac{\sigma^2 K}{A} \log \frac{4\pi^2 A^2}{3\delta}} \right).$$
Therefore,
$$N_a(t) \le \max\left\{ \frac{32\sigma^2 K}{\Delta_{i^O,a}^2} \log \frac{\pi^2 t^2}{3\delta},\; \frac{2A}{\Delta_{i^O,a}} \left( \Delta_{i^O,a} + 2\Delta_{i^O,i^W} + 8\sqrt{\frac{\sigma^2 K}{A} \log \frac{4\pi^2 A^2}{3\delta}} \right) \right\}. \quad (62)$$
As event $E_2$ holds with probability at least $1 - \delta$, (62) holds with probability at least $1 - \delta$. Theorem 3 then follows immediately from the definition of the pseudo-regret in (2) and equation (62).

REFERENCES
[1] G. Liu and L. Lai, "Action-manipulation attacks on stochastic bandits," in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 2020.
[2] I. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[3] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel, "Adversarial attacks on neural network policies," arXiv preprint arXiv:1702.02284, 2017.
[4] Y. Lin, Z. Hong, Y. Liao, M. Shih, M. Liu, and M. Sun, "Tactics of adversarial attack on deep reinforcement learning agents," arXiv preprint arXiv:1703.06748, 2017.
[5] S. Mei and X. Zhu, "Using machine teaching to identify optimal training-set attacks on machine learners," in Proc. of AAAI Conference on Artificial Intelligence, Austin, TX, Jan. 2015, pp. 2871–2877.
[6] B. Biggio, B. Nelson, and P. Laskov, "Poisoning attacks against support vector machines," arXiv preprint arXiv:1206.6389, 2012.
[7] H. Xiao, B. Biggio, G. Brown, G. Fumera, C. Eckert, and F. Roli, "Is feature selection secure against training data poisoning?," in Proc. of International Conference on Machine Learning, Lille, France, July 2015, vol. 37 of Proceedings of Machine Learning Research, pp. 1689–1698.
[8] B. Li, Y. Wang, A. Singh, and Y. Vorobeychik, "Data poisoning attacks on factorization-based collaborative filtering," in Advances in Neural Information Processing Systems, 2016, pp. 1885–1893.
[9] S. Alfeld, X. Zhu, and P. Barford, "Data poisoning attacks against autoregressive models," in Proc. of AAAI Conference on Artificial Intelligence, Phoenix, AZ, Feb. 2016, pp. 1452–1458.
[10] H. S. Chang, J. Hu, M. C. Fu, and S. I. Marcus, "Adaptive adversarial multi-armed bandit approach to two-person zero-sum Markov games," IEEE Transactions on Automatic Control, vol. 55, no. 2, pp. 463–468, Feb. 2010.
[11] C. Tekin and M. van der Schaar, "Distributed online learning via cooperative contextual bandits," IEEE Transactions on Signal Processing, vol. 63, no. 14, pp. 3700–3714, July 2015.
[12] N. M. Vural, H. Gokcesu, K. Gokcesu, and S. S. Kozat, "Minimax optimal algorithms for adversarial bandit problem with multiple plays," IEEE Transactions on Signal Processing, vol. 67, no. 16, pp. 4383–4398, Aug. 2019.
[13] K. Liu, Q. Zhao, and B. Krishnamachari, "Dynamic multichannel access with imperfect channel state detection," IEEE Transactions on Signal Processing, vol. 58, no. 5, pp. 2795–2808, May 2010.
[14] K. Liu and Q. Zhao, "Distributed learning in multi-armed bandit with multiple players," IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, Nov. 2010.
[15] S. Shahrampour, M. Noshad, and V. Tarokh, "On sequential elimination algorithms for best-arm identification in multi-armed bandits," IEEE Transactions on Signal Processing, vol. 65, no. 16, pp. 4281–4292, Aug. 2017.
[16] C. Tekin and M. Liu, "Online learning of rested and restless bandits," IEEE Transactions on Information Theory, vol. 58, no. 8, pp. 5588–5611, Aug. 2012.
[17] O. Chapelle, E. Manavoglu, and R. Rosales, "Simple and scalable response prediction for display advertising," ACM Trans. Intell. Syst. Technol., vol. 5, no. 4, pp. 61:1–61:34, Dec. 2014.
[18] L. Li, W. Chu, J. Langford, and R. Schapire, "A contextual-bandit approach to personalized news article recommendation," in Proc. of International Conference on World Wide Web, New York, NY, Apr. 2010, pp. 661–670.
[19] L. Lai, H. El Gamal, H. Jiang, and H. Vincent Poor, "Cognitive medium access: Exploration, exploitation and competition," IEEE Transactions on Mobile Computing, vol. 10, no. 2, pp. 239–253, Feb. 2011.
[20] M. Bande and V. V. Veeravalli, "Adversarial multi-user bandits for uncoordinated spectrum access," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, United Kingdom, May 2019, pp. 4514–4518.
[21] B. Kveton, C. Szepesvari, Z. Wen, and A. Ashkan, "Cascading bandits: Learning to rank in the cascade model," in Proc. of International Conference on Machine Learning, Lille, France, July 2015, vol. 37 of Proceedings of Machine Learning Research, pp. 767–776.
[22] K. Jun, L. Li, Y. Ma, and X. Zhu, "Adversarial attacks on stochastic bandits," in Proc. of International Conference on Neural Information Processing Systems, Montréal, Canada, Dec. 2018, pp. 3644–3653.
[23] F. Liu and N. Shroff, "Data poisoning attacks on stochastic bandits," in Proc. of International Conference on Machine Learning, Long Beach, CA, June 2019, vol. 97, pp. 4042–4050.
[24] T. Lykouris, V. Mirrokni, and R. Paes Leme, "Stochastic bandits robust to adversarial corruptions," in Proc. of Annual ACM SIGACT Symposium on Theory of Computing, Los Angeles, CA, June 2018, pp. 114–122.
[25] Z. Guan, K. Ji, D. J. Bucci Jr., T. Y. Hu, J. Palombo, M. Liston, and Y. Liang, "Robust stochastic bandit algorithms under probabilistic unbounded adversarial attack," in Proc. of AAAI Conference on Artificial Intelligence, New York City, NY, Feb. 2020.
[26] Z. Feng, D. Parkes, and H. Xu, "The intrinsic robustness of stochastic bandits to strategic manipulation," CoRR, vol. abs/1906.01528, 2019.
[27] Y. Ma, K. Jun, L. Li, and X. Zhu, "Data poisoning attacks in contextual bandits," CoRR, vol. abs/1808.05760, 2018.
[28] S. Bubeck and N. Cesa-Bianchi, "Regret analysis of stochastic and nonstochastic multi-armed bandit problems," Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
[29] C. Shen, "Universal best arm identification," IEEE Transactions on Signal Processing, vol. 67, no. 17, pp. 4464–4478, Sep. 2019.