Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms
Mengfan Xu
Northwestern University, Evanston, IL 60208
Diego Klabjan
Northwestern University, Evanston, IL 60208
[email protected]
Abstract
EXP-based algorithms are often used for exploration in multi-armed bandits. We revisit the EXP3.P algorithm and establish both the lower and upper bounds of regret in the Gaussian multi-armed bandit setting, as well as a more general distribution option. The analyses do not require bounded rewards compared to classical regret assumptions. We also extend EXP4 from multi-armed bandits to reinforcement learning to incentivize exploration by multiple agents. The resulting algorithm has been tested on hard-to-explore games and it shows an improvement on exploration compared to state-of-the-art.
Multi-armed bandit (MAB) is the problem of maximizing the cumulative reward of a player throughout a bandit game by choosing different arms at each time step. It is equivalent to minimizing the regret, defined as the difference between the best reward that can be achieved and the actual reward gained by the player. Formally, given time horizon T, in time step t ≤ T the player chooses one arm a_t among K arms, receives r_t^{a_t} among rewards r_t = (r_t^1, r_t^2, …, r_t^K), and maximizes the total reward ∑_{t=1}^T r_t^{a_t} or, equivalently, minimizes the regret. Computationally efficient and with abundant theoretical analyses are the EXP-type MAB algorithms. In EXP3.P, each arm has a trust coefficient (weight). The player samples each arm with probability being the sum of its normalized weight and a bias term, receives the reward of the sampled arm, and exponentially updates the weights based on the corresponding reward estimates. It achieves regret of the order O(√T) in a high probability sense. In EXP4, there can be any number of experts. Each has a sampling rule over actions and a weight. The player samples according to the weighted average of the experts' sampling rules and updates the weights accordingly.

Contextual bandit is a variant of MAB obtained by adding a context or state space S. At time step t, the player observes context s_t ∈ S, with s^T = (s_1, s_2, …, s_T) being independent. Rewards r_t follow F(µ(s_t)), where F is any distribution and µ(s_t) is the mean vector that depends on state s_t. Reinforcement Learning (RL) generalizes contextual bandits: state and reward transitions follow a Markov Decision Process (MDP) represented by transition kernel P(s_{t+1}, r_t | a_t, s_t). A key challenge in RL is the trade-off between exploration and exploitation. Exploration encourages the player to try new arms in MAB or new actions in RL in order to understand the game better. It helps to plan for the future, but at the sacrifice of potentially lowering the current reward.
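The EXP3.P sampling rule and exponential weight update just described can be sketched in Python. This is a minimal illustration only: the mixing parameter δ and bias parameter α below are arbitrary rather than tuned as in [6], and the update constants follow the shape of the classical algorithm.

```python
import numpy as np

def exp3p(rewards, alpha=2.0, delta=0.1, rng=None):
    """Minimal EXP3.P sketch. `rewards` is a T x K array of (possibly
    unbounded) realized rewards; returns the sequence of pulled arms."""
    rng = np.random.default_rng() if rng is None else rng
    T, K = rewards.shape
    # Exponential initialization of the trust coefficients (weights).
    w = np.full(K, np.exp((alpha * delta / 3) * np.sqrt(T / K)))
    pulls = []
    for t in range(T):
        # Mixture of normalized weights and a uniform bias term.
        p = (1 - delta) * w / w.sum() + delta / K
        i = rng.choice(K, p=p)
        pulls.append(i)
        # Importance-weighted reward estimate, zero for unpulled arms.
        x_hat = np.zeros(K)
        x_hat[i] = rewards[t, i] / p[i]
        # Exponential update with an optimistic bias term alpha / (p_j sqrt(KT)).
        w *= np.exp((delta / (3 * K)) * (x_hat + alpha / (p * np.sqrt(K * T))))
    return pulls
```

Note that the weights enter only through the mixture (1 − δ)·w_i/∑_j w_j + δ/K, so every arm retains sampling probability at least δ/K at every step.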
Exploitation aims to exploit currently known states and arms to maximize the current reward, but it potentially prevents the player from gaining more information that would increase future reward. To maximize the cumulative reward, the player needs to learn the game by exploration, while guaranteeing the current reward by exploitation.

How to incentivize exploration has been a main focus in RL. Since RL is built on MAB, it is natural to extend MAB techniques to RL, and UCB is such a success. UCB motivates count-based exploration in RL and the subsequent Pseudo-Count exploration. New deep RL exploration algorithms have been recently proposed. Using deep neural networks to keep track of the Q-values by means of Q-networks in RL is called DQN [1]. This combination of deep learning and RL has shown great success. ε-greedy [2] is a simple exploration technique using DQN. Besides ε-greedy, intrinsic model exploration computes intrinsic rewards by focusing on experiences. Intrinsic rewards directly measure and incentivize exploration if added to the extrinsic (actual) rewards of RL, e.g. DORA [3] and [4]. Random Network Distillation (RND) [5] is a more recent suggestion relying on a fixed target network. A drawback of RND is its local focus without global exploration.

Preprint. Under review.

In order to address weak points of these various exploration algorithms in the RL context, the notion of experts is natural and thus EXP-type MAB algorithms are appropriate. The allowance of arbitrary experts provides exploration for harder contextual bandits and hence provides exploration possibilities for RL. We develop an EXP4 exploration algorithm for RL that relies on several general experts.
This is the first RL algorithm using several exploration experts enabling global exploration. Focusing on DQN, in the computational study we use two agents consisting of RND and ε-greedy DQN. We implement the RL EXP4 algorithm on the hard-to-explore RL game Montezuma's Revenge and compare it with the benchmark algorithm RND [5]. The numerical results show that the algorithm gains more exploration than RND and it gains the ability of global exploration by not getting stuck in the local maxima of RND. Its total reward also increases with training. Overall, our algorithm improves exploration and exploitation on the benchmark game and demonstrates a learning process in RL.

Reward in RL is in many cases unbounded, which relates to unbounded MAB rewards. There are three major versions of MAB: adversarial, stochastic, and the herein introduced Gaussian. For adversarial MAB, the rewards r_t of the K arms can be chosen arbitrarily by the adversary at step t. For stochastic MAB, the rewards at different steps are assumed to be i.i.d. and the rewards of different arms are independent. It is assumed that 0 ≤ r_t^i ≤ 1 for any arm i and step t. For Gaussian MAB, rewards r_t follow the multivariate normal N(µ, Σ) with µ being the mean vector and Σ the covariance matrix of the K arms. Here the rewards are neither bounded, nor independent among the arms. For this reason the introduced Gaussian MAB reflects the RL setting and is the subject of our MAB analyses of EXP3.P.

EXP-type algorithms [6] are optimal in the two classical MABs. [6] shows lower and upper bounds on regret of the order O(√T) for adversarial MAB and of the order O(log(T)) for stochastic MAB. All of the proofs of these regret bounds for EXP-type algorithms are based on the bounded reward assumption, which does not hold for Gaussian MAB.
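To make the Gaussian setting concrete, the following sketch draws correlated, unbounded rewards from N(µ, Σ) and computes the pseudo regret of a uniformly random player; the particular µ and Σ are arbitrary illustrative choices, not instances from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 1000, 3
mu = np.array([0.0, 0.5, 1.0])            # arm means (illustrative)
Sigma = np.array([[1.0, 0.3, 0.0],        # correlated arms, unbounded rewards
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])

# Rewards r_t ~ N(mu, Sigma), one vector per step; no boundedness assumption.
rewards = rng.multivariate_normal(mu, Sigma, size=T)

# Pseudo regret R'_T = T * max_k mu_k - sum_t E[y_t]. For uniform play,
# E[y_t] = mean(mu), so R'_T = T * (max(mu) - mean(mu)).
pseudo_regret = T * (mu.max() - mu.mean())

# Empirical counterpart using realized pulls (depends on reward realizations).
pulls = rng.integers(0, K, size=T)
realized = rewards[np.arange(T), pulls]
empirical_regret = T * mu.max() - realized.sum()
print(pseudo_regret)  # 500.0
```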
Therefore, the regret bounds for Gaussian MAB with unbounded rewards studied herein are significantly different from prior works.

We show both lower and upper bounds on the regret of Gaussian MAB under certain assumptions. Some analyses even hold for more generally distributed MAB. The upper bounds carry ideas from the analysis of the EXP3.P algorithm [6] for bounded MAB over to our unbounded MAB, while the lower bounds rest on our brand new construction of instances. Precisely, we derive lower bounds of order T for certain fixed T and upper bounds of order O*(√T) for T large enough. The question of bounds for any value of T remains open.

The main contributions of this work are as follows. On the analytical side we introduce Gaussian MAB with the unique aspect and challenge of unbounded rewards. We provide the very first regret lower bound in such a case by constructing a novel family of Gaussian bandits, and we are able to analyze the EXP3.P algorithm for Gaussian MAB. Unbounded reward poses a non-trivial challenge in the analyses. We also provide the very first extension of EXP4 to RL exploration. We show its superior performance on two hard-to-explore RL games.

A literature review is provided in Section 2. Then in Section 3 we exhibit upper bounds for unbounded MAB of the EXP3.P algorithm and lower bounds, respectively. Section 4 discusses the EXP4 algorithm for RL exploration. Finally, in Section 5, we present numerical results related to the proposed algorithm.

2 Literature review

The importance of exploration in RL is well understood. Count-based exploration in RL relies on UCB. [7] develops the Bellman value iteration V(s) = max_a { R̂(s, a) + γ E[V(s′)] + β N(s, a)^{−1/2} }, where N(s, a) is the number of visits to (s, a) for state s and action a. The value N(s, a)^{−1/2} is positively correlated with the curiosity about (s, a) and encourages exploration.
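As a toy illustration of this count-based bonus (a sketch under our own assumptions, not the method of [7] verbatim), the term β·N(s,a)^{−1/2} can be dropped into tabular value iteration; the reward table R_hat, transition matrix P, and visit counts N below are hypothetical.

```python
import numpy as np

def bonus_value_iteration(R_hat, P, N, gamma=0.9, beta=1.0, iters=100):
    """Value iteration with a count-based exploration bonus beta / sqrt(N(s,a)).
    R_hat: (S, A) estimated rewards; P: (S, A, S) transition probabilities;
    N: (S, A) visit counts (>= 1 to keep the bonus finite)."""
    S, A = R_hat.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Q(s,a) = R_hat(s,a) + gamma * E[V(s')] + beta * N(s,a)^{-1/2}
        Q = R_hat + gamma * P @ V + beta / np.sqrt(N)
        V = Q.max(axis=1)
    return V

# Toy example: 2 states, 2 actions; the rarely visited action gets a big bonus.
R_hat = np.array([[0.0, 0.0], [1.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
N = np.array([[100, 1], [100, 100]])   # action 1 in state 0 barely visited
V = bonus_value_iteration(R_hat, P, N)
```

The bonus shrinks as counts grow, so the incentive to revisit a pair (s, a) decays exactly when curiosity about it is exhausted.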
This method is limited to tableau model-based MDPs with small state spaces, while [8] introduces Pseudo-Count exploration for non-tableau MDPs with density models.

In conjunction with DQN, ε-greedy [2] is a simple exploration technique. Besides ε-greedy, intrinsic model exploration computes intrinsic rewards by the accuracy of a model trained on experiences. Intrinsic rewards directly measure and incentivize exploration if added to the extrinsic (actual) rewards of RL, e.g. DORA [3] and [4]. Intrinsic rewards in [4] are defined as e(s, a) = ||σ(s′) − M_φ(σ(s), a)||², where M_φ is a parametric model, s′ is the next state and σ is input extraction. Intrinsic reward e(s, a) relies on the stochastic transition from s to s′ and thus brings noise to exploration. Random Network Distillation (RND) [5] addresses this by defining e(s, a) = ||f̂(s′) − f(s′)||², where f̂ is a parametric model and f is a randomly initialized but fixed model. Here e(s, a), being independent of the transition, only depends on state s′ and drives RND to outperform other algorithms on Montezuma's Revenge. None of these algorithms use several experts, which is a significant departure from our work.

In terms of MAB regret analyses focusing on EXP-type algorithms, Auer et al. [6] first introduce EXP3.P for bounded adversarial MAB and EXP4 for contextual bandits. Under the EXP3.P algorithm, an upper bound on regret of the order O(√T) is achieved, which has no gap with the lower bound and hence it establishes that EXP3.P is optimal. However, these regret bounds are not applicable to Gaussian MAB since rewards can be infinite. Meanwhile, for unbounded MAB, [9] demonstrates a regret bound of order O(√(T · γ_T)) for noisy Gaussian process bandits where a reward observation contains noise. The information gain γ_T is not well-defined in a noiseless Gaussian setting.
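A minimal numpy sketch of the RND bonus, under our own toy architecture choices (two-layer tanh networks, plain gradient descent) rather than the networks used in [5]: a fixed random target f, a trainable predictor f̂, and intrinsic reward ||f̂(s) − f(s)||².

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16                      # state dimension and hidden width (arbitrary)

def init_net(rng):
    return [rng.normal(0, 0.5, (D, H)), rng.normal(0, 0.5, (H, 1))]

def forward(params, s):
    W1, W2 = params
    return np.tanh(s @ W1) @ W2

f_target = init_net(rng)          # f: randomly initialized, never trained
f_hat = init_net(rng)             # f-hat: predictor trained on visited states

def intrinsic_reward(s):
    # e(s') = ||f_hat(s') - f(s')||^2 is large for unfamiliar states
    return float(np.sum((forward(f_hat, s) - forward(f_target, s)) ** 2))

s = rng.normal(size=(1, D))       # a "visited" state
before = intrinsic_reward(s)
for _ in range(200):              # gradient descent on the squared error
    W1, W2 = f_hat
    h = np.tanh(s @ W1)
    err = h @ W2 - forward(f_target, s)
    f_hat[1] = W2 - 0.05 * h.T @ err
    f_hat[0] = W1 - 0.05 * s.T @ ((err @ W2.T) * (1 - h ** 2))
after = intrinsic_reward(s)
assert after < before             # the bonus decays on familiar states
```

Because the target depends only on the state and not on the transition, the bonus is noise-free, which is exactly the property the text attributes to RND.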
For noiseless Gaussian bandits, [10] shows both the optimal lower and upper bounds on regret, but the regret definition is not consistent with the one used in [6]. We establish a lower bound of the order O(T) for certain T and an upper bound of the order O*(√T) asymptotically on the regret of unbounded noiseless Gaussian MAB, following the standard definitions of regret.

3 Regret bounds of Gaussian MAB

For Gaussian MAB with time horizon T, at step 0 < t ≤ T rewards r_t follow the multivariate normal N(µ, Σ), where µ = (µ_1, µ_2, …, µ_K) is the mean vector and Σ = (a_{ij})_{i,j ∈ {1,…,K}} is the covariance matrix of the K arms. The player receives reward y_t = r_t^{a_t} by pulling arm a_t. We use R′_T = T · max_k µ_k − ∑_t E[y_t] to denote the pseudo regret, called simply regret. (Note that the alternative definition of regret R_T = max_i ∑_{t=1}^T r_t^i − ∑_{t=1}^T y_t depends on realizations of rewards.)

3.1 Lower bounds

In this section we derive a lower bound for Gaussian and general MAB under an assumption. General MAB replaces Gaussian with a general distribution. The main technique is to construct instances or sub-classes that have certain regret, no matter what strategies are deployed. We need the following assumption or setting.
Assumption 1
There are two types of arms with general K, with one type being superior (S is the set of superior arms) and the other being inferior (I is the set of inferior arms). Let 1 − q, q be the proportions of the superior and inferior arms, respectively, which is known to the adversary, and clearly 0 ≤ q ≤ 1. The arms in S are indistinguishable and so are those in I. The first pull of the player has two steps. In the first step the player selects an inferior or superior set of arms based on P(S) = 1 − q and P(I) = q, and once a set is selected, the corresponding reward of an arm from the selected set is received.

An interesting special case of Assumption 1 is the case of two arms and q = 1/2. In this case, the player has no prior knowledge and in the first pull chooses an arm uniformly at random.

The lower bound is defined as R_L(T) = inf sup R′_T, where, first, the inf is taken among all the strategies and then the sup is among all Gaussian MAB. All proofs are in the Appendix.

The following is the main result with respect to lower bounds and it is based on inferior arms being distributed as N(0, 1) and superior arms as N(µ, 1) with µ > 0.

Theorem 1. In Gaussian MAB under Assumption 1, for any q ≥ 1/2 we have R_L(T) ≥ (q − ε) · µ · T, where µ has to satisfy G(q, µ) < q, with ε and T determined by

G(q, µ) < ε < q,   T ≤ (ε − G(q, µ)) / ( (1 − q) · ∫ | e^{−x²/2} − e^{−(x−µ)²/2} | dx ) + 2,

and

G(q, µ) = max{ ∫ | q e^{−x²/2} − (1 − q) e^{−(x−µ)²/2} | dx,  ∫ | (1 − q) e^{−x²/2} − q e^{−(x−µ)²/2} | dx }.

To prove Theorem 1, we construct a special subset of Gaussian MAB with equal variances and zero covariances. On these instances we find a unique way to explicitly represent any policy. This builds a connection between abstract policies and this concrete mathematical representation.
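Treating the integrands exactly as written in Theorem 1 (unnormalized Gaussian kernels e^{−x²/2} and e^{−(x−µ)²/2}), the quantity G(q, µ) and the largest admissible horizon can be checked numerically; the integration grid and the choice ε = (G + q)/2 below are our own illustrative choices.

```python
import numpy as np

def kernel(x, mu):
    return np.exp(-(x - mu) ** 2 / 2)    # unnormalized Gaussian kernel

def theorem1_quantities(q, mu, lo=-20.0, hi=20.0, n=400001):
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    g1, g2 = kernel(x, 0.0), kernel(x, mu)
    integ = lambda y: float(np.sum(np.abs(y)) * dx)     # Riemann sum
    G = max(integ(q * g1 - (1 - q) * g2), integ((1 - q) * g1 - q * g2))
    eps = (G + q) / 2                                   # any G < eps < q works
    T_max = (eps - G) / ((1 - q) * integ(g1 - g2)) + 2  # admissible horizons
    return G, eps, T_max

G, eps, T_max = theorem1_quantities(q=0.5, mu=1e-3)
# Small mu keeps G below q, so the linear bound (q - eps) * mu * T applies
# for all horizons T up to T_max.
assert G < 0.5 and T_max > 2
```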
Then we show that the pseudo regret R′_T must be greater than certain values no matter what policies are deployed, which yields a regret lower bound on this subset of instances.

The feasibility of the aforementioned conditions is established in the following theorem.

Theorem 2.
In Gaussian MAB under Assumption 1, for any q ≥ 1/2, there exist µ and ε, ε < µ, such that R_L(T) ≥ (q − ε) · µ · T.

The following result with two arms and equal probability in the first pull deals with general distributions. Even in the case of Gaussian MAB it is not a special case of Theorem 2 since it is stronger.
Theorem 3.
For general MAB under Assumption 1 with K = 2 and q = 1/2, we have that R_L(T) ≥ T · µ/4 holds for any distribution f_1 for the arms in I and any distribution f_2 for the arms in S with ∫ |f_1 − f_2| > 0 (possibly with unbounded support), for any µ > 0 and T satisfying T ≤ 1/(2 · ∫ |f_1 − f_2|) + 1.

The theorem establishes that for any fixed µ > 0 there is a finite set of horizons T and instances of Gaussian MAB so that no algorithm can achieve regret smaller than linear in T. Table 1 provides the relationship between µ and the largest such T in the Gaussian case where the inferior arms are distributed based on the standard normal and the superior arms have mean µ > 0 and variance 1: the smaller µ is, the larger the horizon up to which the linear bound applies, and the largest admissible T decreases very quickly as µ grows.

Table 1: Upper bounds for T as a function of µ

Our lower bound R_L(T) ≥ O(T) is larger than known results for classical MAB. This is not surprising since the rewards in classical MAB are assumed to be bounded, while the rewards in our setting follow an unbounded Gaussian distribution, which apparently increases the regret.

Besides the known result O*(√T) for adversarial MAB and O*(log T) for stochastic MAB, for noisy Gaussian Process bandits [9] shows R_L(T) ≥ O(√(T · γ_T)). Our lower bound for Gaussian MAB is different from this lower bound. The information gain term γ_T in noisy Gaussian bandits is not well-defined in Gaussian MAB and thus the two lower bounds are not comparable.

3.2 Upper bounds

In this section, we establish upper bounds on the regret of Gaussian MAB by means of the EXP3.P algorithm (see Algorithm 1) from [6]. We stress that rewards can be infinite, without the boundedness assumption present in stochastic and adversarial MAB. We only consider non-degenerate Gaussian MAB where the variance of each arm is strictly positive, i.e.
min_i a_{ii} > 0.

Formally, we provide analyses for upper bounds on R_T with high probability, on E[R_T] and on R′_T. In [6] EXP3.P is studied to yield a bound on the regret R_T with high probability in the bounded MAB setting. As part of our contributions, we show that the EXP3.P regret is of the order O*(√T) in the unbounded Gaussian MAB in the case of R_T with high probability, E[R_T] and R′_T. The results are summarized as follows. The density of N(µ, Σ) is denoted by f.

Theorem 4.
For Gaussian MAB, any time horizon T, and any 0 < η < 1, EXP3.P has regret

R_T ≤ ∆(η) · ( √(KT log(KT/δ)) + 4 √(KT log K) + 8 log(KT/δ) )

with probability at least (1 − δ) · (1 − η)^T, where ∆(η) is determined by

∫_{−∆}^{∆} ⋯ ∫_{−∆}^{∆} f(x_1, …, x_K) dx_1 ⋯ dx_K = 1 − η.

Algorithm 1: EXP3.P
Initialization: weights w_i(1) = exp( (αδ/3) √(T/K) ), i ∈ {1, 2, …, K}, for parameters α > 0 and δ ∈ (0, 1);
for t = 1, 2, …, T do
  for i = 1, 2, …, K do
    p_i(t) = (1 − δ) w_i(t) / ∑_{j=1}^K w_j(t) + δ/K
  end
  Choose i_t randomly according to the distribution p_1(t), …, p_K(t);
  Receive reward r_{i_t}(t);
  for j = 1, …, K do
    x̂_j(t) = ( r_j(t) / p_j(t) ) · 1{j = i_t};
    w_j(t+1) = w_j(t) · exp( (δ/(3K)) ( x̂_j(t) + α/(p_j(t)√(KT)) ) )
  end
end

In the proof of Theorem 4, we first truncate the rewards of Gaussian MAB, dividing them into a bounded part and an unbounded tail throughout the game. For the bounded part, we directly borrow the regret upper bound of EXP3.P [6] and conclude with a regret upper bound of order O(∆(η)√T). Since the Gaussian distribution is light-tailed, we can control the shrinking tail probability, which leads to the overall result.

The dependence of the bound on ∆ can be removed by considering large enough T, as stated next.

Theorem 5.
For Gaussian MAB, and any a > 0, 0 < δ < 1, EXP3.P has regret R_T ≤ log(1/δ) · O*(√T) with probability at least (1 − δ) · (1 − T^{−a})^T. The constant behind O* depends on K, a, µ and Σ.

The above theorems deal with R_T, but the aforementioned lower bounds are with respect to the pseudo regret. To complete the analysis of Gaussian MAB, it is desirable to have an upper bound on the pseudo regret, which is established next. It is easy to verify by Jensen's inequality that R′_T ≤ E[R_T], and thus it suffices to obtain an upper bound on E[R_T].

For adversarial and stochastic MAB, the upper bound for E[R_T] is of the same order as for R_T, which follows by a simple argument. For Gaussian MAB, establishing an upper bound on E[R_T] or R′_T based on R_T requires more work. We show an upper bound on E[R_T] by using select mathematical inequalities, limit theories, and Rademacher complexity. To this end, the main result reads as follows.

Theorem 6.
The regret of EXP3.P in Gaussian MAB satisfies R′_T ≤ E[R_T] ≤ O*(√T).

All three theorems also hold for sub-Gaussian MAB, which is defined by replacing Gaussian with sub-Gaussian. This generalization is straightforward and it is directly shown in the proof for Gaussian MAB in the Appendix. Optimal upper bounds for adversarial MAB and noisy Gaussian Process bandits are of the same order as our upper bound. Work [6] derives an upper bound of the same order O(√T) as the lower bound for adversarial and stochastic MAB. For noisy Gaussian Process bandits, there is also no gap between the upper and lower bounds.

Our upper bound of the order O*(√T) is of the same order as the one for bounded MAB. In our case the upper bound O*(√T) holds for large enough T, which is hidden behind O*, while the linear lower bound is valid only for small values of T. This illustrates the rationality of the lower bound of O(T) and the upper bound of order O*(√T).

4 EXP4 algorithm for RL
EXP4 has shown great success in contextual bandits. Therefore, in this section, we extend EXP4 to RL and develop EXP4-RL, presented in Algorithm 2.
Algorithm 2: EXP4-RL
Initialization: Trust coefficients w_k = 1 for every k ∈ {1, …, E}, E = number of experts (Q-networks), K = number of actions, ∆, ε, η > 0 and temperatures z, τ > 0, n_r = −∞ (running upper bound on observed reward);
while True do
  Initialize episode by setting s_1;
  for i = 1, 2, …, T (length of episode) do
    Observe state s_i;
    Let the probability of the Q_k-network be ρ_k = (1 − η) w_k / ∑_{k′=1}^E w_{k′} + η/E;
    Sample network k̄ according to {ρ_k}_k;
    For the Q_k̄-network, use ε-greedy to sample an action:
      a* = argmax_a Q_k̄(s_i, a),
      π_j = (1 − ε) · 1{j = a*} + ε/(K − 1) · 1{j ≠ a*}, j ∈ {1, 2, …, K};
    Sample action a_i based on π;
    Interact with the environment to receive reward r_i and next state s_{i+1};
    n_r = max{r_i, n_r};
    Update the trust coefficient w_k of each Q_k-network as follows:
      P^k = ε-greedy(Q_k),
      x̂_j^k = 1 − ( 1{j = a_i} / (P_j^k + ∆) ) · (n_r − r_i), j ∈ {1, 2, …, K},
      y^k = E[x̂_j^k],
      w_k = w_k · e^{y^k / z};
    Store (s_i, a_i, r_i, s_{i+1}) in experience replay buffer B;
  end
  Update each expert's Q_k-network from buffer B;
end

The player has experts that are represented by deep Q-networks trained by RL algorithms (there is a one-to-one correspondence between the experts and Q-networks). Each expert also has a trust coefficient. Trust coefficients are updated exponentially based on the reward estimates as in EXP4. At each step of an episode, the player samples an expert (Q-network) with probability proportional to the weighted average of the experts' trust coefficients. Then ε-greedy DQN is applied to the chosen Q-network. Here, different from EXP4, the player needs to store all the interaction tuples in an experience buffer since RL is an MDP.
After one episode, the player trains all Q-networks with the experience buffer and uses the trained networks as experts for the next episode.

The basic idea is the same as in EXP4, using experts that give advice vectors, here with deep Q-networks. It is a combination of deep neural networks with EXP4 updates. From a different perspective, we can also view it as an ensemble as in classification [11], by treating Q-networks as ensemble members in RL instead of classification algorithms. While Q-networks do not necessarily have to be experts, i.e., other experts can be used, these are natural in a DQN framework.

In our implementation and experiments we use two experts, thus E = 2 with two Q-networks. The first one is based on RND [5] while the second one is a simple DQN. To this end, in the algorithm before storing to the buffer, we also record c_r^i = ||f̂(s_i) − f(s_i)||², the RND intrinsic reward as in [5]. This value is then added to the 4-tuple pushed to B. When updating the Q-network corresponding to RND at the end of an iteration in the algorithm, we modify the Q-network by using r_j + c_r^j, and an update to f̂ is executed by using c_r^j. The network pertaining to ε-greedy is updated directly by using r_j.

Intuitively, Algorithm 2 circumvents the local-focus drawback of RND with the total exploration guided by two experts with EXP4-updated trust coefficients. When the RND expert drives high exploration, its trust coefficient leads to a high total exploration. When it has low exploration, the second expert, DQN, should have a high one and it incentivizes the total exploration accordingly. Trust coefficients are updated by reward estimates iteratively as in EXP4, so they keep track of the long-term performance of the experts and then guide the total exploration globally. These dynamics of EXP4 combined with intrinsic rewards guarantee global exploration.
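Stripped of the deep RL components, the trust-coefficient dynamics of Algorithm 2 can be sketched as follows. This is a sketch under our own simplifications: each expert's Q-values at the current state are passed in as a plain array standing in for the Q-networks, and the constants η, ε, ∆, z are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
E, K = 2, 4                    # experts (Q-networks) and actions
eta, eps, delta, z = 0.05, 0.1, 0.1, 1.0
w = np.ones(E)                 # trust coefficients
n_r = -np.inf                  # running upper bound on observed reward

def eps_greedy_probs(q):
    # pi_j = (1 - eps) * 1{j = argmax} + eps / (K - 1) * 1{j != argmax}
    p = np.full(K, eps / (K - 1))
    p[np.argmax(q)] = 1 - eps
    return p

def step(state_q, reward_fn):
    """One EXP4-RL step: sample an expert, sample an action, update trust
    coefficients. state_q: (E, K) array of each expert's Q-values."""
    global w, n_r
    rho = (1 - eta) * w / w.sum() + eta / E     # expert-sampling distribution
    k_bar = rng.choice(E, p=rho)
    pi = eps_greedy_probs(state_q[k_bar])
    a = rng.choice(K, p=pi)
    r = reward_fn(a)
    n_r = max(n_r, r)
    for k in range(E):                          # EXP4-style exponential update
        P = eps_greedy_probs(state_q[k])
        x_hat = 1 - (np.arange(K) == a) / (P + delta) * (n_r - r)
        y = float(np.dot(P, x_hat))             # expected estimate under expert k
        w[k] *= np.exp(y / z)
    return a, r

a, r = step(rng.normal(size=(E, K)), lambda act: float(act == 2))
```

An expert whose ε-greedy distribution puts high probability on actions that achieve rewards close to the running maximum n_r sees its trust coefficient grow, which is the mechanism behind the global guidance described above.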
The experimental results exhibited in the next section verify this intuition regarding the exploration behind Algorithm 2.

We point out that potentially more general RL algorithms based on Q-factors can be used, e.g., bootstrapped DQN [12], random prioritized DQN [13] or adaptive ε-greedy VDBE [14] are a possibility. Furthermore, experts in EXP4 can even be policy networks trained by PPO [15] instead of DQN for exploration. These possibilities demonstrate the flexibility of the EXP4-RL algorithm.

5 Experiments

As a numerical demonstration of the superior performance and exploration incentive of Algorithm 2, we show the improvements over baselines on two hard-to-explore RL games, Mountain Car and Montezuma's Revenge. More precisely, we show that the real reward on Mountain Car improves significantly under our algorithm in Section 5.1. Then, for the exploration incentive, we implement Algorithm 2 on Montezuma's Revenge and show the growing and remarkable improvement of exploration over baselines in Section 5.2.

For the Mountain Car experiment, we use the Adam optimizer. The batch size for updating models is 64 with a replay buffer size of 10,000. The remaining parameters are as follows: the discount factor for the Q-networks is 0.95, the temperature parameter τ is 0.1, η is 0.05, and ε decays exponentially with respect to the number of steps, with maximum 0.9 and minimum 0.05. The length of one epoch is 200 steps. The target networks load the weights and biases of the trained networks every 400 steps.
Since a reward upper bound is known in advance, we use n_r = 1.

For the Montezuma's Revenge experiment, we use the Adam optimizer. The other parameters read: the mini-batch size is 4, the replay buffer size is 1,000, the discount factor for the Q-networks is 0.999 and the same value is used for the intrinsic value head, the temperature parameter τ is 0.1, η is 0.05, and ε increases exponentially with minimum 0.05 and maximum 0.9. The length of one epoch is 100 steps. Target networks are updated every 300 steps. Pre-normalization lasts 50 epochs and the weights for the intrinsic and extrinsic values in the first network are 1 and 2, respectively. The upper bound on reward is set to the constant n_r = 1.

The intrinsic reward c_r^i = ||f̂(s_i) − f(s_i)||² given by the intrinsic model f̂ represents the exploration of RND in [5], as introduced in Sections 2 and 4. We use the same criterion for evaluating the exploration performance of our algorithm and RND herein. RND incentivizes local exploration with the single-step intrinsic reward but lacks global exploration.

5.1 Mountain Car

In this part, we summarize the experimental results of Algorithm 2 on Mountain Car, a classical control RL game. This game has very sparse positive rewards, which brings the necessity and hardness of exploration. Blog post [16] shows that RND based on DQN improves the performance of traditional DQN, since RND has an intrinsic reward to incentivize exploration. We use RND on DQN from [16] as the baseline and show the real reward improvement of Algorithm 2, which supports the intuition behind and the superiority of the algorithm.

The neural networks of both experts are linear. For the RND expert, it has an input layer with 2 input neurons, followed by a hidden layer with 64 neurons, and then a two-headed output layer.
The first output head represents the Q-values, with 64 hidden neurons as input and the number of actions as output neurons, while the second output head corresponds to the intrinsic values, with 1 output neuron. For the DQN expert, the only difference lies in the absence of the second output head.

The comparison between Algorithm 2 and RND is presented in Figure 1. Here the x-axis is the epoch number and the y-axis is the cumulative reward of that epoch. Figure 1a shows the raw data comparison between EXP4-RL and RND. We observe that though at first RND has several spikes exceeding those of EXP4-RL, EXP4-RL has much higher rewards than RND after 300 epochs. Overall, the relative difference in areas under the curve (AUC) is 4.9% for EXP4-RL over RND, which indicates the significant improvement of our algorithm. This improvement is better illustrated in Figure 1b with the smoothed reward values. Here there is a notable difference between EXP4-RL and RND. Note that the maximum reward hit by EXP4-RL exceeds the one hit by RND, which additionally demonstrates our improvement on RND.

Figure 1: The performance of Algorithm 2 and RND measured by the epoch-wise reward on Mountain Car, with the left plot (a) showing the original data and the right plot (b) the smoothed reward values.

Based on the above discussion of Mountain Car, we arrive at the conclusion that Algorithm 2 performs better than the RND baseline and that the improvement increases at the later training stage. Exploration brought by Algorithm 2 gains real reward on this hard-to-explore Mountain Car, compared to the RND counterpart (without the DQN expert). The power of our algorithm with multiple experts and trust coefficients could be enhanced by adopting more complex experts, not limited to only DQN.
5.2 Montezuma's Revenge

In this section, we show the experimental details of Algorithm 2 on Montezuma's Revenge, another notoriously hard-to-explore RL game. The benchmark on Montezuma's Revenge is RND based on DQN, which achieves a reward of zero in our environment (the PPO algorithm reported in [5] has reward 8,000 with many more computing resources; we ran the PPO-based RND with 10 environments and 800 epochs to observe that the reward is also 0), which indicates that DQN has room for improvement regarding exploration.

To this end, we first implement the DQN-version RND on Montezuma's Revenge as our benchmark by replacing the PPO architecture in [5] with DQN. Then we implement Algorithm 2 with two experts as aforementioned.

The experiment of RND with PPO in [5] uses 1024 parallel environments and runs 30,000 epochs for each environment. For the DQN-version of RND (called simply RND hereafter), we use the same settings as in [5], such as observations, intrinsic reward normalization and random initialization. Our computing environment allows at most 10 parallel environments. In subsequent figures the x-axis always corresponds to the number of epochs. The RND update probability is the proportion of experiences in the replay buffer that are used for training the intrinsic model f̂ in RND (see [5]).

We use CNN architectures since we are dealing with videos. More precisely, for the Q-network of the DQN expert in EXP4-RL and the predictor network f̂ for computing the intrinsic rewards, we use Alexnet [17] pretrained on ImageNet [18]. The number of output neurons of the final layer is 18, the number of actions in Montezuma. For the RND baseline and the RND expert in EXP4-RL, we customize the Q-network with different linear layers while keeping all the layers except the final layer of pretrained Alexnet. Here we have two final linear layers representing two value heads, the extrinsic value head and the intrinsic value head.
The number of output neurons in the first value head is again 18, while the second value head has 1 output neuron.

A comparison between Algorithm 2 (EXP4-RL) and RND without parallel environments (the update probability is 100% since it is a single environment) is shown in Figure 2, with the emphasis on exploration by means of the intrinsic reward. We use 3 different numbers of burn-in periods (58, 88, 167 burn-in epochs) to remove the initial training steps, which is common in Gibbs sampling. Overall, EXP4-RL outperforms RND with many significant spikes in the intrinsic rewards. The larger the number of burn-in periods is, the more significant is the dominance of EXP4-RL over RND. EXP4-RL has much higher exploration than RND at some epochs and stays close to RND at other epochs. At some epochs, EXP4-RL even has 6 times higher exploration. The relative differences in the areas under the curves are 6.9%, 17.0%, 146.0%, respectively, which quantifies the much better performance of EXP4-RL.

Figure 2: The performance of Algorithm 2 and RND measured by the intrinsic reward without parallel environments, with three different burn-in periods (small, medium, large).

We next compare EXP4-RL and RND with 10 parallel environments and different RND update probabilities in Figure 3. The experiences are generated by the 10 parallel environments and are stored in the replay buffer.

Figure 3a shows that both experts in EXP4-RL are learning, with decreasing losses of their respective Q-networks. The drop is steeper for the RND expert but it also starts with a higher loss. With RND update probability 0.25 in Figure 3b, we observe that EXP4-RL and RND are very close when RND exhibits high exploration. When RND is at its local minima, EXP4-RL outperforms it. Usually these local minima are driven by sticking to local maxima and then training the model intensively at those local maxima, typical of the RND local exploration behavior. EXP4-RL improves on RND as training progresses, e.g.
the improvement after 550 epochs is higher than the one between epochs 250 and 550. In terms of AUC, this is expressed by 1.6% and 3.5%, respectively. Overall, EXP4-RL improves the RND local minima of exploration, keeps the high exploration of RND, and induces a smoother global exploration.

With the update probability of 0.125 in Figure 3c, EXP4-RL almost always outperforms RND with a notable difference. The improvement also increases with epochs and is dramatically larger at RND's local minima. These local minima appear more frequently in the training of RND, so our improvement is more significant as well as crucial. The relative AUC improvement is 49.4%. The excellent performance in Figure 3c additionally shows that EXP4-RL improves RND with global exploration by improving the local minima of RND and by not staying at local maxima.

Figure 3: The performance of Algorithm 2 and RND with 10 parallel environments and with RND update probability 0.25 and 0.125, measured by loss and intrinsic reward: (a) Q-network losses with 0.25 update, (b) intrinsic reward after smoothing with 0.25 update, (c) intrinsic reward after smoothing with 0.125 update.

Overall, with either 0.25 or 0.125, EXP4-RL incentivizes global exploration on top of RND by not getting stuck in local exploration maxima and decisively outperforms RND exploration. With 0.125 the improvement with respect to RND is more significant and steady. This experimental evidence verifies our intuition behind EXP4-RL and provides excellent support for the algorithm. With experts being more advanced RL exploration algorithms, e.g. DORA [3], EXP4-RL can bring additional possibilities.
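The "intrinsic reward after smoothing" curves in Figure 3 refer to smoothed per-epoch series. As an illustration, a trailing moving average can be sketched as follows (the window size is a hypothetical choice, not the one used in the experiments):

```python
def smooth(xs, window=10):
    """Trailing moving average: each point is the mean of the
    last `window` values seen so far (fewer at the start)."""
    out = []
    for i in range(len(xs)):
        lo = max(0, i - window + 1)
        out.append(sum(xs[lo:i + 1]) / (i + 1 - lo))
    return out

vals = smooth([0.0, 1.0, 2.0, 3.0], window=2)
```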
A Proof of results in Section 3.1
For brevity, we define $n = T - 1$. We start by showing the following proposition that is used in the proofs.

Proposition 1.
Let $G(q,\mu)$, $q$, and $\mu$ be defined as in Theorem 1. Then for any $q \ge 1/2$, there exists a $\mu$ that satisfies the constraint $G(q,\mu) < q$.

Proof. Let us denote
$$G_1 = \int |q f_1(x) - (1-q) f_2(x)| \, dx, \qquad G_2 = \int |(1-q) f_1(x) - q f_2(x)| \, dx.$$
Then we have
$$G_1 = \int (q f_1(x) - (1-q) f_2(x)) \, \mathbb{1}_{\{q f_1(x) > (1-q) f_2(x)\}} \, dx + \int ((1-q) f_2(x) - q f_1(x)) \, \mathbb{1}_{\{q f_1(x) < (1-q) f_2(x)\}} \, dx.$$
As $\mu \to 0$ we have $f_2 \to f_1$ pointwise, and by dominated convergence $G_1 \to \int |q f_1(x) - (1-q) f_1(x)| \, dx = 2q - 1$; the same limit holds for $G_2$. Since $2q - 1 < q$ for $q < 1$ and $G(q,\mu) = \max\{G_1, G_2\}$ is continuous in $\mu$, there exists $\mu > 0$ with $G(q,\mu) < q$.
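The existence claim can be checked numerically. A small sketch, assuming $f_1 = N(0,1)$ and $f_2 = N(\mu,1)$ as in the proofs, using a plain midpoint Riemann sum (no external libraries; the grid bounds and step count are illustrative choices):

```python
import math

def phi(x, mu=0.0):
    """Unit-variance Gaussian density N(mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def G(q, mu, lo=-12.0, hi=12.0, steps=100000):
    """Numeric G(q, mu): the maximum of the two L1-type
    integrals appearing in Lemma 1."""
    h = (hi - lo) / steps
    g1 = g2 = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        g1 += abs(q * phi(x) - (1 - q) * phi(x, mu)) * h
        g2 += abs((1 - q) * phi(x) - q * phi(x, mu)) * h
    return max(g1, g2)

# for q = 0.6 and small mu, G(q, mu) stays close to 2q - 1 = 0.2 < q
val = G(0.6, 0.1)
```

For $q = 1/2$ the same routine returns the total variation $TV(f_1, f_2) = \frac{1}{2}\int |f_1 - f_2|$ used later in the proof of Theorem 3, where the bound $TV \le \mu/2$ can also be checked numerically.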
As in Assumption 1, let the inferior arm set be $I$ and the superior one be $S$, respectively, with $P(I) = q$ and $P(S) = 1 - q$. Arms in $I$ follow $f_1(x) = N(0, 1)$ and arms in $S$ follow $f_2(x) = N(\mu, 1)$ where $\mu > 0$. According to Assumption 1, at the first step the player pulls an arm from either $I$ or $S$ and receives reward $y_1$. At time step $i > 1$, the reward is $y_i$ and we let $b_i$ represent a policy of the player. We can always define $b_i$ as
$$b_i = \begin{cases} 1 & \text{if the chosen arm at step } i \text{ is not in the same arm set as the initial arm}, \\ 0 & \text{otherwise}. \end{cases}$$
Let $a_i \in \{0, 1\}$ be the actual arm played at step $i$. It suffices to only specify whether $a_i$ is in arm set $I$ ($a_i = 0$) or $S$ ($a_i = 1$) since the arms in $I$ and in $S$ are identical. The connection between $a_i$ and $b_i$ is explicitly given by $b_i = |a_i - a_1|$. By Assumption 1, it is easy to argue that $b_i = S'_i(y_1, y_2, \ldots, y_{i-1})$ for a set of functions $S'_2, S'_3, \ldots, S'_n, S'_{n+1}$. We proceed with the following lemma.

Lemma 1.
Let the rewards of the arms in set $I$ follow any $L^1$ distribution $f_1(x)$ and in set $S$ follow any $L^1$ distribution $f_2(x)$, where the means satisfy $\mu(f_2) > \mu(f_1)$. Let $B$ be the number of arms played in the game in set $S$. Let us assume the player meets Assumption 1. Then no matter what strategy the player takes, we have
$$\left| \frac{E[B] - (1-q) \cdot (n+1)}{n+1} \right| \le \epsilon$$
where $\epsilon, T, f_1, f_2$ satisfy
$$G(q, f_1, f_2) + (1-q)(n-1) \int |f_1(x) - f_2(x)| \, dx \le \epsilon, \qquad G(q, f_1, f_2) = \max\left\{ \int |q f_1(x) - (1-q) f_2(x)| \, dx, \ \int |(1-q) f_1(x) - q f_2(x)| \, dx \right\}.$$

Proof. With the convention that the reward density of arm $a_i$ is $f_{1+a_i}$, we have
$$E[B] = \int (a_1 + a_2 + \cdots + a_{n+1}) f_{1+a_1}(y_1) f_{1+a_2}(y_2) \cdots f_{1+a_n}(y_n) \, dy_1 dy_2 \cdots dy_n.$$
If $a_1 = 0$, then $a_i = b_i$ and
$$E[B \mid a_1 = 0] = \int (0 + b_2(y_1) + \cdots + b_{n+1}(y_n)) f_1(y_1) f_{1+b_2}(y_2) \cdots f_{1+b_n}(y_n) \, dy_1 dy_2 \cdots dy_n.$$
If $a_1 = 1$, then $1 - a_i = b_i$ and
$$E[B \mid a_1 = 1] = \int (1 + 1 - b_2(y_1) + \cdots + 1 - b_{n+1}(y_n)) f_2(y_1) f_{2-b_2}(y_2) \cdots f_{2-b_n}(y_n) \, dy_1 dy_2 \cdots dy_n.$$
This gives us
$$E[B] = q \cdot E[B \mid a_1 = 0] + (1-q) \cdot E[B \mid a_1 = 1] = (1-q)(n+1) + \int (b_2 + \cdots + b_{n+1}) \cdot \left( q \cdot f_1(y_1) f_{1+b_2}(y_2) \cdots f_{1+b_n}(y_n) - (1-q) \cdot f_2(y_1) f_{2-b_2}(y_2) \cdots f_{2-b_n}(y_n) \right) dy_1 dy_2 \cdots dy_n.$$
By defining $b_1 = 0$, we have
$$E[B] = (1-q)(n+1) + \int (b_2 + \cdots + b_{n+1}) \left( q \cdot \prod_{i=1}^{n} f_{1+b_i}(y_i) - (1-q) \cdot \prod_{i=1}^{n} f_{2-b_i}(y_i) \right) dy_1 dy_2 \cdots dy_n.$$
For any $1 \le m \le n$ we also derive
$$\int \left| \prod_{i=1}^{m} f_{1+b_i}(y_i) - \prod_{i=1}^{m} f_{2-b_i}(y_i) \right| dy_1 \cdots dy_m \le \int \prod_{i=1}^{m-1} f_{1+b_i}(y_i) \left| f_{1+b_m}(y_m) - f_{2-b_m}(y_m) \right| dy_1 \cdots dy_m + \int \left| \prod_{i=1}^{m-1} f_{1+b_i}(y_i) - \prod_{i=1}^{m-1} f_{2-b_i}(y_i) \right| f_{2-b_m}(y_m) \, dy_1 \cdots dy_m$$
$$\le \int |f_1(x) - f_2(x)| \, dx + \int \left| \prod_{i=1}^{m-1} f_{1+b_i}(y_i) - \prod_{i=1}^{m-1} f_{2-b_i}(y_i) \right| dy_1 \cdots dy_{m-1} \le 2 \int |f_1(x) - f_2(x)| \, dx + \int \left| \prod_{i=1}^{m-2} f_{1+b_i}(y_i) - \prod_{i=1}^{m-2} f_{2-b_i}(y_i) \right| dy_1 \cdots dy_{m-2} \le \cdots \le m \int |f_1(x) - f_2(x)| \, dx. \quad (1)$$
This provides
$$\left| \frac{E[B] - (1-q) \cdot (n+1)}{n+1} \right| \le \int \left| q \cdot \prod_{i=1}^{n} f_{1+b_i}(y_i) - (1-q) \cdot \prod_{i=1}^{n} f_{2-b_i}(y_i) \right| dy_1 dy_2 \cdots dy_n$$
$$\le \int \prod_{i=1}^{n-1} f_{1+b_i}(y_i) \left| q \cdot f_{1+b_n}(y_n) - (1-q) \cdot f_{2-b_n}(y_n) \right| dy_1 \cdots dy_n + \int \left| (1-q) \prod_{i=1}^{n-1} f_{1+b_i}(y_i) - (1-q) \prod_{i=1}^{n-1} f_{2-b_i}(y_i) \right| f_{2-b_n}(y_n) \, dy_1 \cdots dy_n$$
$$\le \max\left\{ \int |q f_1(x) - (1-q) f_2(x)| \, dx, \int |(1-q) f_1(x) - q f_2(x)| \, dx \right\} + (1-q) \int \left| \prod_{i=1}^{n-1} f_{1+b_i}(y_i) - \prod_{i=1}^{n-1} f_{2-b_i}(y_i) \right| dy_1 \cdots dy_{n-1}$$
$$\le \max\left\{ \int |q f_1(x) - (1-q) f_2(x)| \, dx, \int |(1-q) f_1(x) - q f_2(x)| \, dx \right\} + (1-q)(n-1) \int |f_1(x) - f_2(x)| \, dx,$$
where the last inequality follows from (1). The statement of the lemma now follows.

According to Proposition 1, there is such $\mu$ satisfying the constraint $G(q, \mu) < q$.
Note that $G(q, \mu) = G(q, f_1, f_2)$. Then we can choose $\epsilon$ to be any quantity such that $G(q, \mu) < \epsilon < q$. Finally, there is $T$ satisfying
$$T \le \frac{\epsilon - G(q, \mu)}{(1-q) \cdot \int |f_1(x) - f_2(x)| \, dx} + 2$$
that gives us
$$G(q, \mu) + (1-q)(T-2) \int |f_1(x) - f_2(x)| \, dx \le \epsilon.$$
By choosing $\epsilon, T, \mu$ as above, by Lemma 1 we have
$$\left| \frac{E[B] - (1-q) \cdot T}{T} \right| < \epsilon,$$
which implies $E[B] < (1-q+\epsilon) \cdot T$. Therefore, regret $R'_T$ satisfies, with $A$ being the number of arm pulls from $I$, the inequality
$$R'_T = \sum_t \max_k (\mu_k) - \sum_t E[y_t] = T \mu - \sum_t E[y_t] = T \mu - (E[B] \cdot \mu + E[A] \cdot 0) \ge T \mu - (1-q+\epsilon) \mu T = (q - \epsilon) \mu T.$$
This yields
$$R^L_T = \inf \sup R'_T \ge (q - \epsilon) \cdot \mu T.$$
Theorem 2 follows from Theorem 1 and Proposition 1.
Proof of Theorem 3.
The assumption here is the special case of Assumption 1 where there are two arms and $q = 1/2$. Set $I$ follows $f_1$ and set $S$ follows $f_2$, where $\mu(f_1) < \mu(f_2)$. In the same way as in the proof of Theorem 1 we obtain
$$R^L(T) \ge \left( \frac{1}{2} - \epsilon \right) \cdot T \cdot \mu$$
under the constraint that $n \cdot TV(f_1, f_2) < \epsilon$, where $TV(f_1, f_2) = \frac{1}{2} \int |f_1 - f_2|$ stands for total variation. Here we use $G(1/2, \mu) = TV(f_1, f_2)$. Setting $\epsilon = 1/4$ yields the statement.

In the Gaussian case it turns out that $\epsilon = 1/4$ yields the highest bound. For the total variation of Gaussian variables $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, [19] show that
$$TV\left( N(\mu_1, \sigma_1^2), N(\mu_2, \sigma_2^2) \right) \le \frac{3 |\sigma_1^2 - \sigma_2^2|}{2 \sigma_1^2} + \frac{|\mu_1 - \mu_2|}{2 \sigma_1},$$
which in our case yields $TV \le \mu / 2$. The constraint thus holds whenever $\mu T / 2 \le \epsilon$, and choosing $\mu = 2\epsilon / T$ in turn gives $R^L_T \ge 2 \epsilon \cdot \left( \frac{1}{2} - \epsilon \right)$. The maximum of the right-hand side is obtained at $\epsilon = 1/4$. This justifies the choice of $\epsilon$ in the proof of Theorem 3.

B Proof of results in Section 3.2
B.1 Proof for Theorem 4
Proof.
Since the rewards can be unbounded in our setting, we truncate the reward at level $\Delta > 0$: for any arm $i$ we write $r^t_i = \bar r^t_i + \hat r^t_i$ where $\bar r^t_i = r^t_i \cdot \mathbb{1}(-\Delta \le r^t_i \le \Delta)$ and $\hat r^t_i = r^t_i \cdot \mathbb{1}(|r^t_i| > \Delta)$. Then for any parameter $0 < \eta < 1$, we choose such $\Delta$ that satisfies
$$P(r^t_i = \bar r^t_i, \ i \le K) = P(-\Delta \le r^t_1 \le \Delta, \ldots, -\Delta \le r^t_K \le \Delta) = \int_{-\Delta}^{\Delta} \int_{-\Delta}^{\Delta} \cdots \int_{-\Delta}^{\Delta} f(x_1, \ldots, x_K) \, dx_1 \cdots dx_K \ge 1 - \eta. \quad (2)$$
The existence of such $\Delta = \Delta(\eta)$ follows from elementary calculus.

Let $A = \{ |r^t_i| \le \Delta \text{ for every } i \le K, \ t \le T \}$. Then the probability of this event is $P(A) = P(r^t_i = \bar r^t_i, \ i \le K, \ t \le T) \ge (1 - \eta)^T$. With probability $(1-\eta)^T$, the rewards of the player are bounded in $[-\Delta, \Delta]$ throughout the game. Then $R^B_T = \sum_{t=1}^T (\max_i \bar r^t_i - \bar r^t_{a_t}) \le T \cdot \Delta - \sum_{t=1}^T \bar r^t_{a_t}$ is the regret under event $A$, i.e. $R_T = R^B_T$ with probability $(1-\eta)^T$. For the EXP3.P algorithm and $R^B_T$, for every $\delta > 0$, according to [6] we have
$$R^B_T \le 2\Delta \left( \sqrt{KT \log(KT/\delta)} + 4 \sqrt{KT \log K} + 8 \log(KT/\delta) \right) \quad \text{with probability } 1 - \delta.$$
Then we have
$$R_T \le 2\Delta(\eta) \left( \sqrt{KT \log(KT/\delta)} + 4 \sqrt{KT \log K} + 8 \log(KT/\delta) \right) \quad \text{with probability } (1 - \delta) \cdot (1 - \eta)^T.$$

B.2 Proof for Theorem 5

Lemma 2.
For any non-decreasing differentiable function $\Delta = \Delta(T) > 0$ satisfying
$$\lim_{T \to \infty} \frac{\Delta(T)}{\log(T)} = \infty, \qquad \lim_{T \to \infty} \Delta'(T) \le C < \infty,$$
and any $0 < \delta < 1$, $a > 0$, we have
$$P\left( R_T \le \Delta(T) \cdot \log(1/\delta) \cdot O^*(\sqrt{T}) \right) \ge (1 - \delta) \left( 1 - \frac{1}{T^a} \right)^T$$
for any $T$ large enough.

Proof. Let $a > 0$ and let us denote
$$F(y) = \int_{-y}^{y} f(x_1, x_2, \ldots, x_K) \, dx_1 dx_2 \cdots dx_K, \qquad \zeta(T) = F(\Delta(T) \cdot \mathbf{1}) - \left( 1 - \frac{1}{T^a} \right)$$
for $y \in \mathbb{R}^K$ and $\mathbf{1} = (1, \ldots, 1) \in \mathbb{R}^K$. Let also $y_{-i} = (y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_K)$ and $x|_{x_i = y} = (x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_K)$. We have $\lim_{T \to \infty} \zeta(T) = 0$.

The gradient of $F$ can be estimated componentwise as
$$\nabla F \le \left( \int_{-y_{-1}}^{y_{-1}} f(x|_{x_1 = y_1}) \, dx_2 \cdots dx_K, \ \ldots, \ \int_{-y_{-K}}^{y_{-K}} f(x|_{x_K = y_K}) \, dx_1 \cdots dx_{K-1} \right).$$
According to the chain rule and since $\Delta'(T) \ge 0$, we have
$$\frac{dF(\Delta(T) \cdot \mathbf{1})}{dT} \le \left[ \int_{-\Delta(T) \cdot \mathbf{1}_{-1}}^{\Delta(T) \cdot \mathbf{1}_{-1}} f\left( x|_{x_1 = \Delta(T)} \right) dx_2 \cdots dx_K + \cdots + \int_{-\Delta(T) \cdot \mathbf{1}_{-K}}^{\Delta(T) \cdot \mathbf{1}_{-K}} f\left( x|_{x_K = \Delta(T)} \right) dx_1 \cdots dx_{K-1} \right] \cdot \Delta'(T).$$
Next we consider
$$\int_{-\Delta(T) \cdot \mathbf{1}_{-i}}^{\Delta(T) \cdot \mathbf{1}_{-i}} f\left( x|_{x_i = \Delta(T)} \right) dx_1 \cdots dx_{i-1} dx_{i+1} \cdots dx_K = e^{-a_{ii} (\Delta(T))^2 + \mu_i \Delta(T)} \cdot \int_{-\Delta(T) \cdot \mathbf{1}_{-i}}^{\Delta(T) \cdot \mathbf{1}_{-i}} e^{g(x_{-i})} \, dx_1 \cdots dx_{i-1} dx_{i+1} \cdots dx_K.$$
Here $e^{g(x_{-i})}$ is the conditional density function given $x_i = \Delta(T)$ and thus $\int_{-\Delta(T) \cdot \mathbf{1}_{-i}}^{\Delta(T) \cdot \mathbf{1}_{-i}} e^{g(x_{-i})} \, dx_1 \cdots dx_{i-1} dx_{i+1} \cdots dx_K \le 1$. We have
$$\int_{-\Delta(T) \cdot \mathbf{1}_{-i}}^{\Delta(T) \cdot \mathbf{1}_{-i}} f\left( x|_{x_i = \Delta(T)} \right) dx_1 \cdots dx_{i-1} dx_{i+1} \cdots dx_K \le e^{-a_{ii} (\Delta(T))^2 + \mu_i \Delta(T)} \le e^{-\min_j a_{jj} (\Delta(T))^2 + \max_j \mu_j \Delta(T)}.$$
Then for $T \ge T_1$ we have $\Delta'(T) \le C + 1$ and in turn
$$\zeta'(T) \le (C+1) \cdot K \cdot e^{-\min_j a_{jj} (\Delta(T))^2 + \max_j \mu_j \Delta(T)} - a \cdot T^{-a-1}.$$
Since we only consider non-degenerate Gaussian bandits with $\min_i a_{ii} > 0$, the $\mu_i$ are constants and $\Delta(T) \to \infty$ as $T \to \infty$ according to the assumptions in Lemma 2, there exist $C_1 > 0$ and $T_2$ such that $e^{-\min_j a_{jj} (\Delta(T))^2 + \max_j \mu_j \Delta(T)} \le e^{-C_1 \Delta(T)}$ for every $T > T_2$. Since $\lim_{T \to \infty} \Delta(T) / \log(T) = \infty$, we have $\Delta(T) > \frac{2(a+1)}{C_1} \log(T)$ for $T > T_3$. These give us that
$$\zeta'(T) \le (C+1) K e^{-2(a+1) \log T} - a T^{-a-1} = (C+1) K e^{-2(a+1) \log T} - a e^{-(a+1) \log T} < 0$$
for $T \ge T_4 \ge \max(T_1, T_2, T_3)$. Hence $\zeta'(T) < 0$ for $T \ge T_4$, and we also have $\lim_{T \to \infty} \zeta(T) = 0$. Therefore, we finally arrive at $\zeta(T) > 0$ for $T \ge T_4$. This is equivalent to
$$\int_{-\Delta(T) \cdot \mathbf{1}}^{\Delta(T) \cdot \mathbf{1}} f(x_1, \ldots, x_K) \, dx_1 \cdots dx_K \ge 1 - \frac{1}{T^a},$$
i.e. the rewards are bounded by $\Delta(T)$ with probability $1 - \frac{1}{T^a}$. Then by the same argument as in the proof of Theorem 4, for $T$ large enough we have
$$P\left( R_T \le \Delta(T) \cdot \log(1/\delta) \cdot O^*(\sqrt{T}) \right) \ge (1 - \delta) \left( 1 - \frac{1}{T^a} \right)^T.$$
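Both the truncation level $\Delta(\eta)$ in Theorem 4 and the growing threshold $\Delta(T)$ in Lemma 2 are levels that capture all $K$ rewards with high probability. For independent Gaussian arms such a level can be computed numerically; a minimal sketch with hypothetical means, unit variances, and bisection on the joint coverage (the names `coverage` and `delta_for` are illustrative, not from the paper):

```python
import math

def gauss_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def coverage(delta, means):
    """P(-delta <= r_i <= delta for all arms), independent N(mu_i, 1)."""
    p = 1.0
    for mu in means:
        p *= gauss_cdf(delta, mu) - gauss_cdf(-delta, mu)
    return p

def delta_for(eta, means, lo=0.0, hi=50.0):
    """Approximate smallest delta with coverage >= 1 - eta, by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if coverage(mid, means) >= 1 - eta:
            hi = mid
        else:
            lo = mid
    return hi

# hypothetical 3-armed Gaussian bandit
d = delta_for(eta=0.01, means=[0.0, 0.5, 1.0])
```

Because the Gaussian tails decay as $e^{-c\Delta^2}$, the level grows only like $\sqrt{\log(1/\eta)}$, which is why the slowly growing $\Delta(T)$ of Lemma 2 suffices.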
In Lemma 2, we choose $\Delta(T) = \log^2(T)$, which meets all of the assumptions. The result now follows from $\log^2(T) \cdot O^*(\sqrt{T}) = O^*(\sqrt{T})$, Lemma 2 and Theorem 4.

B.3 Proof for Theorem 6
We first list 3 known lemmas. The following lemma by Duchi [20] provides a way to bound deviations.
Lemma 3.
For any function class $\mathcal{F}$ and i.i.d. random variables $\{x_1, x_2, \ldots, x_T\}$, the result
$$E_x \left[ \sup_{f \in \mathcal{F}} \left| E_x f - \frac{1}{T} \sum_{t=1}^T f(x_t) \right| \right] \le 2 R^c_T(\mathcal{F})$$
holds, where $R^c_T(\mathcal{F}) = E_{x, \sigma} \left[ \sup_f \left| \frac{1}{T} \sum_{t=1}^T \sigma_t f(x_t) \right| \right]$ and $\sigma_1, \ldots, \sigma_T$ are i.i.d. uniform $\{-1, 1\}$ (Rademacher) random variables.

The following result holds according to [21].
Lemma 4.
For any finite subclass $A \subset \mathcal{F}$, we have
$$\hat{R}^c_T \le R(A, T) \cdot \frac{\sqrt{2 \log |A|}}{T},$$
where $R(A, T) = \sup_{f \in A} \sqrt{\sum_{t=1}^T f(x_t)^2}$ and $\hat{R}^c_T = \sup_f \left| \frac{1}{T} \sum_{t=1}^T \sigma_t f(x_t) \right|$.

A random variable $X$ is $\sigma$-sub-Gaussian if for any $t > 0$, the tail probability satisfies $P(|X| > t) \le B e^{-t^2 / (2\sigma^2)}$, where $B$ is a positive constant. The following lemma is listed in Appendix A of [22].

Lemma 5.
For i.i.d. $\sigma$-sub-Gaussian random variables $\{Y_1, Y_2, \ldots, Y_T\}$, we have
$$E\left[ \max_{1 \le t \le T} |Y_t| \right] \le \sigma \sqrt{2 \log T} + \frac{4 \sigma}{\sqrt{2 \log T}}.$$

Proof for Theorem 6.
Let us define $\mathcal{F} = \{ f_j : x \to x_j \mid j = 1, 2, \ldots, K \}$. Let $x_t = (r^t_1, r^t_2, \ldots, r^t_K)$ where $r^t_i$ is the reward of arm $i$ at step $t$, and let $a_t$ be the arm selected at time $t$ by EXP3.P. Then for any $f_j \in \mathcal{F}$, $f_j(x_t) = r^t_j$. In Gaussian-MAB, $\{x_1, x_2, \ldots, x_T\}$ are i.i.d. random variables since the Gaussian distribution $N(\mu, \Sigma)$ is invariant to time and independent across time steps. Then by Lemma 3, we have $E\left[ \max_i \left| \mu_i - \frac{1}{T} \sum_{t=1}^T r^t_i \right| \right] \le 2 R^c_T(\mathcal{F})$.

We consider
$$E[|R'_T - R_T|] = E\left[ \left| T \cdot \max_i \mu_i - \sum_{t=1}^T \mu_{a_t} - \left( \max_i \sum_{t=1}^T r^t_i - \sum_{t=1}^T r^t_{a_t} \right) \right| \right] = E\left[ \left| T \cdot \max_i \mu_i - \max_i \sum_{t=1}^T r^t_i - \left( \sum_{t=1}^T \mu_{a_t} - \sum_{t=1}^T r^t_{a_t} \right) \right| \right]$$
$$\le E\left[ \left| T \cdot \max_i \mu_i - \max_i \sum_{t=1}^T r^t_i \right| \right] + E\left[ \left| \sum_{t=1}^T \mu_{a_t} - \sum_{t=1}^T r^t_{a_t} \right| \right] \le E\left[ \max_i \left| T \mu_i - \sum_{t=1}^T r^t_i \right| \right] + E\left[ \left| \sum_{t=1}^T \mu_{a_t} - \sum_{t=1}^T r^t_{a_t} \right| \right]$$
$$\le 2 T R^c_T(\mathcal{F}) + 2 T_1 R^c_{T_1}(\mathcal{F}) + \cdots + 2 T_K R^c_{T_K}(\mathcal{F}) \quad (3)$$
where $T_i$ is the number of pulls of arm $i$. Clearly $T_1 + T_2 + \ldots + T_K = T$. By Lemma 4 with $A = \mathcal{F}$ we get
$$R^c_T(\mathcal{F}) = E\left[ \hat{R}^c_T(\mathcal{F}) \right] \le E[R(\mathcal{F}, T)] \cdot \frac{\sqrt{2 \log K}}{T}, \qquad R^c_{T_i}(\mathcal{F}) \le E[R(\mathcal{F}, T_i)] \cdot \frac{\sqrt{2 \log K}}{T_i}, \quad i \in \{1, 2, \ldots, K\}.$$
Since $R(\mathcal{F}, T)$ is increasing in $T$ and $T_i \le T$, we have $R^c_{T_i}(\mathcal{F}) \le E[R(\mathcal{F}, T)] \cdot \frac{\sqrt{2 \log K}}{T_i}$. We next bound the expected deviation $E[|R'_T - R_T|]$ based on (3) as follows:
$$E[|R'_T - R_T|] \le 2 T E[R(\mathcal{F}, T)] \frac{\sqrt{2 \log K}}{T} + \sum_{i=1}^K \left[ 2 T_i E[R(\mathcal{F}, T)] \frac{\sqrt{2 \log K}}{T_i} \right] \le 2 (K+1) \sqrt{2 \log K} \, E[R(\mathcal{F}, T)]. \quad (4)$$
Regarding $E[R(\mathcal{F}, T)]$, we have
$$E[R(\mathcal{F}, T)] = E \sup_{f \in \mathcal{F}} \sqrt{\sum_{t=1}^T f(x_t)^2} = E \sup_i \sqrt{\sum_{t=1}^T (r^t_i)^2} \le E \sum_{i=1}^K \sqrt{\sum_{t=1}^T (r^t_i)^2} \le \sum_{i=1}^K E\left[ \sqrt{T \cdot \max_{1 \le t \le T} (r^t_i)^2} \right] = \sqrt{T} \cdot \sum_{i=1}^K E\left[ \max_{1 \le t \le T} |r^t_i| \right]. \quad (5)$$
We next use Lemma 5 for any arm $i$. To this end let $Y_t = r^t_i$. Since $x_t$ are Gaussian, the marginals $Y_t$ are also Gaussian with mean $\mu_i$ and standard deviation $a_{ii}$. Combining this with the fact that a Gaussian random variable is also $\sigma$-sub-Gaussian justifies the use of the lemma. Thus $E\left[ \max_{1 \le j \le T} |r^j_i| \right] \le a_{i,i} \sqrt{2 \log T} + \frac{4 a_{i,i}}{\sqrt{2 \log T}}$. Continuing with (5) we further obtain
$$E[R(\mathcal{F}, T)] \le \sqrt{T} \cdot K \cdot \max_i \left( a_{i,i} \sqrt{2 \log T} + \frac{4 a_{i,i}}{\sqrt{2 \log T}} \right) = \left( K \sqrt{2 T \log T} + \frac{4 K \sqrt{T}}{\sqrt{2 \log T}} \right) \cdot \max_i a_{i,i}. \quad (6)$$
By combining (4) and (6) we conclude
$$E[|R'_T - R_T|] \le 2 (K+1) \sqrt{2 \log K} \cdot \max_i a_{i,i} \cdot \left( K \sqrt{2 T \log T} + \frac{4 K \sqrt{T}}{\sqrt{2 \log T}} \right) = O^*(\sqrt{T}). \quad (7)$$
We now turn our attention to the expectation of regret $E[R_T]$. It can be written as
$$E[R_T] = E\left[ R_T \mathbb{1}_{R_T \le O^*(\sqrt{T})} \right] + E\left[ R_T \mathbb{1}_{R_T > O^*(\sqrt{T})} \right] \le O^*(\sqrt{T}) \cdot P\left( R_T \le O^*(\sqrt{T}) \right) + E\left[ R_T \mathbb{1}_{R_T > O^*(\sqrt{T})} \right] \le O^*(\sqrt{T}) + E\left[ R_T \mathbb{1}_{R_T > O^*(\sqrt{T})} \right]$$
Nature
International Conference on Learning Representations. 2018.
[4] B. C. Stadie, S. Levine, and P. Abbeel. "Incentivizing exploration in reinforcement learning with deep predictive models." In: arXiv preprint arXiv:1507.00814 (2015).
[5] Y. Burda, H. Edwards, A. Storkey, and O. Klimov. "Exploration by random network distillation." In:
International Conference on Learning Representations. 2018.
[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. "The nonstochastic multiarmed bandit problem." In:
SIAM Journal on Computing
Journal of Computer and System Sciences
Advances in Neural Information Processing Systems. 2016, pp. 1471–1479.
[9] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. "Gaussian process optimization in the bandit setting: no regret and experimental design." In:
Proceedings of the 27th International Conference on Machine Learning. 2010.
[10] S. Grünewälder, J. Y. Audibert, M. Opper, and J. Shawe-Taylor. "Regret bounds for Gaussian process bandit problems." In:
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010, pp. 273–280.
[11] R. Xia, C. Zong, and S. Li. "Ensemble of feature sets and classification algorithms for sentiment classification." In:
Information Sciences
Advances in Neural Information Processing Systems. 2016, pp. 4026–4034.
[13] I. Osband, J. Aslanides, and A. Cassirer. "Randomized prior functions for deep reinforcement learning." In:
Advances in Neural Information Processing Systems. 2018, pp. 8617–8629.
[14] M. Tokic. "Adaptive ε-greedy exploration in reinforcement learning based on value differences." In: Annual Conference on Artificial Intelligence. Springer. 2010, pp. 203–210.
[15] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. "Proximal policy optimization algorithms." In: arXiv preprint arXiv:1707.06347 (2017).
[16] O. Rivlin.
MountainCar_DQN_RND. https://github.com/orrivlin/MountainCar_DQN_RND. 2019.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "Imagenet classification with deep convolutional neural networks." In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. "Imagenet: a large-scale hierarchical image database." In: IEEE Conference on Computer Vision and Pattern Recognition. IEEE. 2009, pp. 248–255.
[19] L. Devroye, A. Mehrabian, and T. Reddad. "The total variation distance between high-dimensional Gaussians." In: arXiv preprint arXiv:1810.08693 (2018).
[20] J. Duchi.