Efficient Exploration for Model-based Reinforcement Learning with Continuous States and Actions
Ying Fan
Department of Computer Science, University of Wisconsin–Madison, [email protected]
Yifei Ming
Department of Computer Science, University of Wisconsin–Madison, [email protected]
Abstract
Balancing exploration and exploitation is crucial in reinforcement learning (RL). In this paper, we study the model-based posterior sampling algorithm in continuous state-action spaces theoretically and empirically. First, we improve the regret bound: with the assumption that reward and transition functions can be modeled as Gaussian Processes with linear kernels, we develop a Bayesian regret bound of $\tilde{O}(H^{3/2}d\sqrt{T})$, where $H$ is the episode length, $d$ is the dimension of the state-action space, and $T$ indicates the total time steps. Our bound can be extended to nonlinear cases as well: using linear kernels on the feature representation $\phi$, the Bayesian regret bound becomes $\tilde{O}(H^{3/2}d_\phi\sqrt{T})$, where $d_\phi$ is the dimension of the representation space. Moreover, we present MPC-PSRL, a model-based posterior sampling algorithm with model predictive control for action selection. To capture the uncertainty in models and realize posterior sampling, we use Bayesian linear regression on the penultimate layer (the feature representation layer $\phi$) of neural networks. Empirical results show that our algorithm achieves the best sample efficiency in benchmark control tasks compared to prior model-based algorithms, and matches the asymptotic performance of model-free algorithms.

1 Introduction

In reinforcement learning (RL), an agent interacts with an unknown environment which is typically modeled as a Markov Decision Process (MDP). Efficient exploration has been one of the main challenges in RL: the agent is expected to balance between exploring unseen state-action pairs to gain more knowledge about the environment, and exploiting existing knowledge to optimize rewards in the presence of known data.

To achieve efficient exploration, Bayesian reinforcement learning treats the MDP itself as a random variable with a prior distribution. This prior distribution of the MDP provides an initial uncertainty estimate of the environment, and generally contains distributions over transition dynamics and reward functions. The epistemic uncertainty (subjective uncertainty due to limited data) in reinforcement learning can then be captured by posterior distributions given the data collected by the agent.

Posterior sampling reinforcement learning (PSRL), motivated by Thompson sampling in bandit problems [1], serves as a provably efficient algorithm under Bayesian settings. In PSRL, the agent maintains a posterior distribution over the MDP and, in each episode, follows an optimal policy with respect to a single MDP sampled from this posterior. Appealing results for PSRL in tabular RL were presented by both model-based [2, 3] and model-free approaches [4] in terms of the Bayesian regret. For $H$-horizon episodic RL, PSRL was proved to achieve a regret bound of $\tilde{O}(H\sqrt{SAT})$, where $S$ and $A$ denote the number of states and actions, respectively. However, in continuous state-action spaces $S$ and $A$ can be infinite, hence the above results do not apply.

Although PSRL in continuous spaces has also been studied in episodic RL, existing results either provide no guarantee or suffer from an exponential order of $H$. In this paper, we achieve the first Bayesian regret bound for posterior sampling algorithms that is near optimal in $T$ (i.e., $\sqrt{T}$) and polynomial in the episode length $H$ for continuous state-action spaces. We explain the limitations of previous works in Section 1.1, and summarize our approach and highlight our contributions in Section 1.2.
1.1 Limitations of previous works

Dependency on H: In model-based settings, [5] derive a regret bound of $\tilde{O}(\sigma_R\sqrt{d_K(R)d_E(R)T} + \mathbb{E}[L^*]\sigma_P\sqrt{d_K(P)d_E(P)T})$, where $L^*$ is a global Lipschitz constant for the future value function defined in their Eq. (3). However, $L^*$ depends on $H$: the difference between input states propagates over $H$ steps, which results in a term dependent on $H$ in the value function, although the authors do not mention this dependency. As a result, there is no clear dependency on $H$ in their regret. Moreover, they use the Lipschitz constant of the underlying value function as an upper bound of $L^*$ in the corollaries, which yields an exponential order in $H$. Take their Corollary 2 on linear quadratic systems as an example: the regret bound is $\tilde{O}(\sigma C\lambda n\sqrt{T})$, where $\lambda$ is the largest eigenvalue of the matrix $Q$ in the optimal value function $V(s) = s^\top Q s$. Here $V$ denotes the value function counting from step 1 to $H$ within an episode, $s$ is the initial state, the reward at the $i$-th step is $r_i = s_i^\top P s_i + a_i^\top R a_i + \epsilon_{R,i}$, and the state at the $(i+1)$-th step is $s_{i+1} = A s_i + B a_i + \epsilon_{P,i}$, $i\in[H]$. However, the largest eigenvalue of $Q$ is actually exponential in $H$. Even if we change the reward function from quadratic to linear, say $r_i = s_i^\top P + a_i^\top R + \epsilon_{R,i}$, the Lipschitz constant of the optimal value function is still exponential in $H$. Chowdhury and Gopalan [6] maintain the assumption of this Lipschitz property, so $\mathbb{E}[L^*]$ remains in their bound. As a result, there is still no clear dependency on $H$ in their regret, and in their Corollary 2 on LQR they follow the same steps as Osband and Van Roy [5] and still retain a term with $\lambda$, which is exponential in $H$ as discussed. Although Osband and Van Roy [5] mention that system noise helps to smooth future values, they do not exploit this, even though the noise is assumed to be subgaussian; they directly use the Lipschitz continuity of the underlying function in the analysis of LQR and thus cannot avoid the exponential term in $H$. [6] do not explore how the system noise can improve the theoretical bound either. In model-free settings, [7] develop a regret bound of $\tilde{O}(d_\phi\sqrt{T})$ using a linear function approximator in the Q-network, where $d_\phi$ is the dimension of the feature representation vector of the state-action space, but their bound is still exponential in $H$, as mentioned in their paper.

High dimensionality: The eluder dimension of neural networks in Osband and Van Roy [5] can be infinite, and the information gain [8] used in [6] yields an exponential order of the state-action space dimension $d$ if nonlinear kernels, such as SE kernels, are used. However, linear kernels can only model linear functions, so the representation power is highly restricted if a polynomial order of $d$ is desired.
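To make the exponential dependence on $H$ discussed above concrete: for an uncontrolled linear system the value function $V_1(s)$ contains a term $(A^{H-1}s)^\top P(A^{H-1}s)$ (spelled out in Section 1.2 below), whose matrix has largest eigenvalue growing like $\rho(A)^{2(H-1)}$ whenever the spectral radius of $A$ exceeds one. The following small numerical check is our own illustration, not part of the paper; the matrices $A$ and $P$ are arbitrary choices.

```python
import numpy as np

# Illustrative 2-D linear system with spectral radius slightly above 1 (arbitrary choice).
A = np.array([[1.10, 0.00],
              [0.10, 1.05]])
P = np.eye(2)  # quadratic reward/cost matrix

for H in [5, 10, 20, 40]:
    A_pow = np.linalg.matrix_power(A, H - 1)
    M = A_pow.T @ P @ A_pow              # the term appearing in V_1(s)
    lam = np.linalg.eigvalsh(M).max()
    print(f"H = {H:3d}, largest eigenvalue of (A^(H-1))^T P A^(H-1) = {lam:.3e}")

# The eigenvalue grows roughly like rho(A)^(2(H-1)), i.e., exponentially in H,
# so a Lipschitz constant of the optimal value function is exponential in H.
```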
1.2 Our approach and contributions

To further improve the regret bound for PSRL in continuous spaces, especially with explicit dependency on $H$, we study model-based posterior sampling algorithms in episodic RL. We assume that rewards and transitions can be modeled as Gaussian Processes with linear kernels, and extend the assumption to nonlinear settings that utilize features extracted by neural networks. For the linear case, we develop a Bayesian regret bound of $\tilde{O}(H^{3/2}d\sqrt{T})$. Using feature embedding as mentioned in Yang and Wang [9], we derive a bound of $\tilde{O}(H^{3/2}d_\phi\sqrt{T})$.

Our Bayesian regret is the best-known Bayesian regret for posterior sampling algorithms in continuous state-action spaces, and it also matches the best-known frequentist regret ([10], discussed in Section 2). With explicit dependence on $d$, $H$, and $T$, our result achieves a significant improvement in the Bayesian regret of PSRL algorithms compared to previous works:
1. We significantly improve the order of H to polynomial: In our analysis, we use the property of subgaussian noise, which is already assumed in [5] and [6], to develop a bound with a clear polynomial dependency on $H$, without assuming the Lipschitz continuity of the underlying value function. More specifically, we prove Lemma 1 and use it to develop a clear dependency on $H$ as long as the expected reward is bounded; thus we can avoid handling the Lipschitz continuity of the underlying value function. (To see why the Lipschitz route is problematic in the LQR example of Section 1.1: recall the Bellman equation $V_i(s_i) = \min_{a_i} s_i^\top P s_i + a_i^\top R a_i + \epsilon_{R,i} + V_{i+1}(As_i + Ba_i + \epsilon_{P,i})$, with $V_{H+1}(s) = 0$. Thus $V_1(s)$ contains a term $(A^{H-1}s)^\top P (A^{H-1}s)$, and the eigenvalue of the matrix $(A^{H-1})^\top P A^{H-1}$ is exponential in $H$. Even with the linear reward there is still a term $(A^{H-1}s)^\top P$ in $V_1(s)$.)
2. Lower dimensionality compared to [5] and [6]: In addition to the aforementioned differences, we also use feature embedding to lower the dimensionality in the regret bounds. We first derive results for linear kernels, then increase the representation power of our model by extracting the last hidden layer of neural networks and performing Bayesian linear regression on it, so that the result for linear kernels yields a bound linear in the dimension of the last hidden layer. This feature dimension, which in practice is the width of the last hidden layer required for learning, is much lower than an exponential of the input dimension, so we avoid the exponential order in the dimension incurred by the use of nonlinear kernels in Chowdhury and Gopalan [6].
3. Looser assumptions compared to Chowdhury and Gopalan [6]: Although we also use kernelized MDPs like Chowdhury and Gopalan [6], we omit their Assumption A1 (Lipschitz assumption) and A2 (regularity assumption), and only use the assumption of subgaussian noise. We directly analyze the regret bound of posterior sampling (PSRL), instead of first analyzing UCRL (Upper Confidence Bounds in RL) and then applying the result to PSRL as in their analysis. The only overlap between our proof and theirs is the use of the information gain of Srinivas et al. [8], and we develop a better bound with fewer assumptions.

Moreover, we implement PSRL using Bayesian linear regression (BLR) on the penultimate layer (for feature representation) of neural networks when fitting transition and reward models. We use model predictive control (MPC, Camacho and Alba [11]) to optimize the policy under the sampled models in each episode as an approximate solution of the sampled MDP, as described in Section 5. Experiments show that our algorithm achieves more efficient exploration compared with previous model-based algorithms on benchmark control tasks.
2 Related Work

Besides the aforementioned works on Bayesian regret bounds, the majority of papers in efficient RL take the non-Bayesian perspective and develop frequentist regret bounds, where the regret for any MDP $M^*\in\mathcal{M}$ is bounded and $M^*\in\mathcal{M}$ holds with high probability. Frequentist regret bounds can be expressed in the Bayesian view: for a given confidence set $\mathcal{M}$, a frequentist regret bound implies an identical Bayesian regret bound for any prior distribution with support on $\mathcal{M}$. Note that frequentist regret is extensively studied in tabular RL (see Jaksch et al. [12], Azar et al. [13], and Jin et al. [14] as examples), among which the best bound for episodic settings is $\tilde{O}(H\sqrt{SAT})$.

There is also a line of work that develops frequentist bounds with feature representation. Most recently, MatrixRL proposed by [9] uses low-dimensional representations and achieves a regret bound of $\tilde{O}(H^2 d_\phi\sqrt{T})$, which is the best-known frequentist bound in model-based settings. While our method is also model-based, we achieve a tighter regret bound when compared in the Bayesian view. In model-free settings, Jin et al. [15] developed a bound of $\tilde{O}(H^{3/2}d_\phi^{3/2}\sqrt{T})$. Zanette et al. [10] further improved the regret to $\tilde{O}(H^{3/2}d_\phi\sqrt{T})$ with their proposed algorithm ELEANOR, which achieves the best-known frequentist bound in model-free settings; they showed that it is unimprovable via a lower bound established in the bandit literature. Although our regret is developed in model-based settings, it matches their bound with the same order of $H$, $d_\phi$, and $T$ in the Bayesian view. Moreover, their algorithm involves optimization over all MDPs in the confidence set and can thus be computationally prohibitive. Our method is computationally tractable, as it can be implemented by optimizing only a single sampled MDP, while matching their regret bound in the Bayesian view.

3 Preliminaries

3.1 Problem Formulation

We model an episodic finite-horizon Markov Decision Process (MDP) $M$ as $\{\mathcal{S}, \mathcal{A}, R^M, P^M, H, \sigma_r, \sigma_f, R_{\max}, \rho\}$, where $\mathcal{S}\subset\mathbb{R}^{d_s}$ and $\mathcal{A}\subset\mathbb{R}^{d_a}$ denote the state and action spaces, respectively. Each episode of length $H$ has an initial state distribution $\rho$. At time step $i\in[1,H]$ within an episode, the agent observes $s_i\in\mathcal{S}$, selects $a_i\in\mathcal{A}$, receives a noised reward $r_i\sim R^M(s_i,a_i)$, and transitions to a noised new state $s_{i+1}\sim P^M(s_i,a_i)$. More specifically, $r(s_i,a_i) = \bar r^M(s_i,a_i) + \epsilon_r$ and $s_{i+1} = f^M(s_i,a_i) + \epsilon_f$, where $\epsilon_r\sim\mathcal{N}(0,\sigma_r^2)$ and $\epsilon_f\sim\mathcal{N}(0,\sigma_f^2 I_{d_s})$. The variances $\sigma_r^2$ and $\sigma_f^2$ are fixed to control the noise level. Without loss of generality, we assume the expected reward an agent receives at a single step is bounded: $|\bar r^M(s,a)| \le R_{\max}$, $\forall s\in\mathcal{S}, a\in\mathcal{A}$.

Let $\mu:\mathcal{S}\to\mathcal{A}$ be a deterministic policy. We define the value function for state $s$ at time step $i$ under policy $\mu$ as $V^M_{\mu,i}(s) = \mathbb{E}[\sum_{j=i}^{H}\bar r^M(s_j,a_j)\mid s_i = s]$, where $s_{j+1}\sim P^M(s_j,a_j)$ and $a_j = \mu(s_j)$. With the bounded expected reward, we have $|V^M_{\mu,i}(s)| \le HR_{\max}$, $\forall s$.

We use $M^*$ to denote the real unknown MDP, which includes $R^*$ and $P^*$, and $M^*$ itself is treated as a random variable. Thus, we can treat the real noiseless reward function $\bar r^*$ and transition function $f^*$ as random processes as well.
In the posterior sampling algorithm $\pi_{PS}$, $M_k$ is a random sample from the posterior distribution of the real unknown MDP $M^*$ in the $k$-th episode, which includes the posterior samples of $R^k$ and $P^k$, given the history prior to the $k$-th episode: $\mathcal{H}_k := \{s_{1,1}, a_{1,1}, r_{1,1}, \cdots, s_{k-1,H}, a_{k-1,H}, r_{k-1,H}\}$, where $s_{k,i}$, $a_{k,i}$, and $r_{k,i}$ indicate the state, action, and reward at time step $i$ in episode $k$. We define the optimal policy under $M$ as $\mu^M(s_i)\in\arg\max_{\mu} V^M_{\mu,i}(s_i)$. In particular, $\mu^*$ indicates the optimal policy under $M^*$ and $\mu_k$ represents the optimal policy under $M_k$. Let $\Delta_k$ denote the regret over the $k$-th episode:

$$\Delta_k = \int \rho(s)\big(V^{M^*}_{\mu^*,1}(s) - V^{M^*}_{\mu_k,1}(s)\big)\,ds \qquad (1)$$

Then we can express the regret of $\pi_{ps}$ up to time step $T$ as:

$$\mathrm{Regret}(T,\pi_{ps},M^*) := \sum_{k=1}^{\lceil T/H\rceil}\Delta_k, \qquad (2)$$

Let $\mathrm{BayesRegret}(T,\pi_{ps},\phi)$ denote the Bayesian regret of $\pi_{ps}$ as defined in Osband and Van Roy [3], where $\phi$ is the prior distribution of $M^*$:

$$\mathrm{BayesRegret}(T,\pi_{ps},\phi) = \mathbb{E}[\mathrm{Regret}(T,\pi_{ps},M^*)]. \qquad (3)$$

3.2 Gaussian Processes

Generally, we consider modeling an unknown target function $g:\mathbb{R}^d\to\mathbb{R}$. We are given a set of noisy samples $y = [y_1,\dots,y_T]^\top$ at points $X = [x_1,\dots,x_T]^\top$, $X\subset D$, where $D$ is compact and convex, and $y_i = g(x_i)+\epsilon_i$ with $\epsilon_i\sim\mathcal{N}(0,\sigma^2)$ i.i.d. Gaussian noise for all $i\in\{1,\cdots,T\}$.

We model $g$ as a sample from a Gaussian Process $GP(\mu(x), K(x,x'))$, specified by the mean function $\mu(x) = \mathbb{E}[g(x)]$ and the covariance (kernel) function $K(x,x') = \mathbb{E}[(g(x)-\mu(x))(g(x')-\mu(x'))]$. Let the prior distribution without any data be $GP(0, K(x,x'))$. Then the posterior distribution over $g$ given $X$ and $y$ is also a GP with mean $\mu_T(x)$, covariance $K_T(x,x')$, and variance $\sigma_T^2(x)$: $\mu_T(x) = K(x,X)(K(X,X)+\sigma^2 I)^{-1}y$, $K_T(x,x') = K(x,x') - K(X,x)^\top(K(X,X)+\sigma^2 I)^{-1}K(X,x)$, and $\sigma_T^2(x) = K_T(x,x)$, where $K(X,x) = [K(x_1,x),\dots,K(x_T,x)]^\top$ and $K(X,X) = [K(x_i,x_j)]_{1\le i\le T,\,1\le j\le T}$.

We model our reward function $\bar r^M$ as a Gaussian Process with noise $\sigma_r^2$. For transition models, we treat each dimension independently: each $f_i(s,a)$, $i=1,\dots,d_s$, is modeled independently as above, with the same noise level $\sigma_f^2$ in each dimension. Thus the model corresponds to our formulation in the RL setting. Since the posterior covariance matrix depends only on the inputs rather than the target values, the distributions of the $f_i(s,a)$ share the same covariance matrix and only differ in the mean function.
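As a concrete reference for the formulas above, the following sketch (our own illustration; function and variable names are ours) computes the posterior mean and variance for a GP with a linear kernel $K(x,x') = x^\top\Sigma_p x'$, which coincides with Bayesian linear regression under the prior $w\sim\mathcal{N}(0,\Sigma_p)$.

```python
import numpy as np

def gp_posterior(X, y, x_star, sigma2, Sigma_p):
    """Posterior mean/variance at x_star for a GP prior GP(0, K) with
    linear kernel K(x, x') = x^T Sigma_p x' and observation noise sigma2."""
    K_XX = X @ Sigma_p @ X.T                      # K(X, X)
    K_Xs = X @ Sigma_p @ x_star                   # K(X, x*)
    K_ss = x_star @ Sigma_p @ x_star              # K(x*, x*)
    G = np.linalg.solve(K_XX + sigma2 * np.eye(len(X)),
                        np.column_stack([y, K_Xs]))
    mu = K_Xs @ G[:, 0]                           # posterior mean
    var = K_ss - K_Xs @ G[:, 1]                   # posterior variance
    return mu, var

# Toy usage: d = 3 inputs, noisy targets from an unknown function.
rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
mu, var = gp_posterior(X, y, rng.normal(size=d), sigma2=0.01, Sigma_p=np.eye(d))
print(mu, var)
```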
4 Bayesian Regret Analysis

4.1 Regret Bound with Linear Kernels

Theorem 1 In the RL problem formulated in Section 3.1, under the assumption of Section 3.2 with linear kernels, we have $\mathrm{BayesRegret}(T,\pi_{ps},\phi) = \tilde{O}(H^{3/2}d\sqrt{T})$, where $d$ is the dimension of the state-action space, $H$ is the episode length, and $T$ is the time elapsed.

Proof. The regret in episode $k$ can be rearranged as:

$$\Delta_k = \int\rho(s)\Big(\big(V^{M^*}_{\mu^*,1}(s) - V^{M_k}_{\mu_k,1}(s)\big) + \big(V^{M_k}_{\mu_k,1}(s) - V^{M^*}_{\mu_k,1}(s)\big)\Big)\,ds \qquad (4)$$

Note that, conditioned upon the history $\mathcal{H}_k$ for any $k$, $M_k$ and $M^*$ are identically distributed. Osband and Van Roy [5] showed that $V^{M^*}_{\mu^*,1} - V^{M_k}_{\mu_k,1}$ is zero in expectation, and that only the second part of the regret decomposition needs to be bounded when deriving the Bayesian regret of PSRL. Thus we can focus on the policy $\mu_k$, the sampled $M_k$, and the real environment data generated by $M^*$. For clarity, the value function $V^{M_k}_{\mu_k,1}$ is abbreviated as $V^k_{k,1}$ and $V^{M^*}_{\mu_k,1}$ as $V^*_{k,1}$. It suffices to derive bounds for any initial state $s_1$, as the regret bound still holds after integration over the initial distribution $\rho(s)$. We can rewrite the regret via concentration through the Bellman operator (see Section 5.1 in Osband et al. [2]):

$$\begin{aligned}
\mathbb{E}[\tilde\Delta_k\mid\mathcal{H}_k] &:= \mathbb{E}\big[V^k_{k,1}(s_1) - V^*_{k,1}(s_1)\mid\mathcal{H}_k\big]\\
&= \mathbb{E}\Big[\bar r^k(s_1,a_1) - \bar r^*(s_1,a_1) + \int P^k(s'\mid s_1,a_1)V^k_{k,2}(s')\,ds' - \int P^*(s'\mid s_1,a_1)V^*_{k,2}(s')\,ds' \,\Big|\, \mathcal{H}_k\Big]\\
&= \mathbb{E}\Big[\sum_{i=1}^{H}\big(\bar r^k(s_i,a_i) - \bar r^*(s_i,a_i)\big) + \sum_{i=1}^{H}\int\big(P^k(s'\mid s_i,a_i) - P^*(s'\mid s_i,a_i)\big)V^k_{k,i+1}(s')\,ds' \,\Big|\, \mathcal{H}_k\Big]\\
&= \mathbb{E}[\tilde\Delta_k(r) + \tilde\Delta_k(f)\mid\mathcal{H}_k] \qquad (5)
\end{aligned}$$

where $a_i = \mu_k(s_i)$, $s_{i+1}\sim P^*(\cdot\mid s_i,a_i)$, $\tilde\Delta_k(r) = \sum_{i=1}^{H}\bar r^k(s_i,a_i) - \bar r^*(s_i,a_i)$, and $\tilde\Delta_k(f) = \sum_{i=1}^{H}\int(P^k(s'\mid s_i,a_i) - P^*(s'\mid s_i,a_i))V^k_{k,i+1}(s')\,ds'$. Here $(s_i,a_i)$ is the state-action pair the agent encounters in the $k$-th episode while using $\mu_k$ for interaction in the real MDP $M^*$, and we define $V^k_{k,H+1} = 0$ for consistency. Note that we cannot treat $s_i$ and $a_i$ as deterministic and only take the expectation directly over the random reward and transition functions. Instead, we need to bound the difference using concentration properties of reward and transition functions modeled as Gaussian Processes (which apply to any state-action pair), and then derive bounds on this expectation. For all $i$, we have $\int(P^k(s'\mid s_i,a_i) - P^*(s'\mid s_i,a_i))V^k_{k,i+1}(s')\,ds' \le \max_s|V^k_{k,i+1}(s)|\int|P^k(s'\mid s_i,a_i) - P^*(s'\mid s_i,a_i)|\,ds' \le HR_{\max}\int|P^k(s'\mid s_i,a_i) - P^*(s'\mid s_i,a_i)|\,ds'$.

Lemma 1
For two multivariate Gaussian distributions $\mathcal{N}(\mu,\sigma^2 I)$ and $\mathcal{N}(\mu',\sigma^2 I)$ with probability density functions $p_1(x)$ and $p_2(x)$ respectively, $x\in\mathbb{R}^d$,

$$\int|p_1(x) - p_2(x)|\,dx \le \sqrt{\frac{2}{\pi\sigma^2}}\,\|\mu-\mu'\|_2.$$

The proof is in Appendix A. Clearly, this result can also be extended to sub-Gaussian noises. Using this lemma, we can derive a regret bound with explicit dependency on the episode length $H$. Recall that $P^k(s'\mid s_i,a_i) = \mathcal{N}(f^k(s_i,a_i), \sigma_f^2 I)$ and $P^*(s'\mid s_i,a_i) = \mathcal{N}(f^*(s_i,a_i), \sigma_f^2 I)$. By Lemma 1 we have

$$\int|P^k(s'\mid s_i,a_i) - P^*(s'\mid s_i,a_i)|\,ds' \le \sqrt{\frac{2}{\pi\sigma_f^2}}\,\|f^k(s_i,a_i) - f^*(s_i,a_i)\|_2 \qquad (6)$$

(A GP with a linear kernel corresponds to Bayesian linear regression $f(x) = w^\top x$, where the prior distribution of the weights is $w\sim\mathcal{N}(0,\Sigma_p)$.)

Lemma 2 [16] Let $X_1,\dots,X_N$ be $N$ sub-Gaussian random variables with variance proxy $\sigma^2$ (not required to be independent). Then for any $t>0$, $P(\max_{1\le i\le N}|X_i| > t) \le 2Ne^{-\frac{t^2}{2\sigma^2}}$.

Given history $\mathcal{H}_k$, let $\bar f^k(s,a)$ denote the posterior mean of $f^k(s,a)$ in episode $k$, and let $\sigma_k^2(s,a)$ denote the posterior variance of $f^k$ in each dimension. Note that $f^*$ and $f^k$ share the same variance in each dimension given history $\mathcal{H}_k$, as described in Section 3. Considering all dimensions of the state space, by Lemma 2 we have that with probability at least $1-\delta$, $\max_{1\le i\le d_s}|f_i^k(s,a) - \bar f_i^k(s,a)| \le \sqrt{2\sigma_k^2(s,a)\log\frac{2d_s}{\delta}}$. We can also derive an upper bound on the norm of the state difference, $\|f^k(s,a) - \bar f^k(s,a)\|_2 \le \sqrt{d_s}\max_{1\le i\le d_s}|f_i^k(s,a) - \bar f_i^k(s,a)|$, and the same holds for $\|f^*(s,a) - \bar f^k(s,a)\|_2$ since $f^*$ and $f^k$ share the same posterior distribution. By the union bound, we have that with probability at least $1-2\delta$,

$$\|f^k(s,a) - f^*(s,a)\|_2 \le 2\sqrt{2 d_s\,\sigma_k^2(s,a)\log\tfrac{2d_s}{\delta}}.$$

Then we look at the sum of the differences over the horizon $H$, without requiring the variables in the sum to be independent:

$$\begin{aligned}
P\Big(\sum_{i=1}^{H}\|f^k(s_i,a_i) - f^*(s_i,a_i)\|_2 > \sum_{i=1}^{H} 2\sqrt{2d_s\sigma_k^2(s_i,a_i)\log\tfrac{2d_s}{\delta}}\Big)
&\le P\Big(\bigcup_{i=1}^{H}\Big\{\|f^k(s_i,a_i) - f^*(s_i,a_i)\|_2 > 2\sqrt{2d_s\sigma_k^2(s_i,a_i)\log\tfrac{2d_s}{\delta}}\Big\}\Big)\\
&\le \sum_{i=1}^{H} P\Big(\|f^k(s_i,a_i) - f^*(s_i,a_i)\|_2 > 2\sqrt{2d_s\sigma_k^2(s_i,a_i)\log\tfrac{2d_s}{\delta}}\Big) \qquad (7)
\end{aligned}$$

Thus, with probability at least $1-2H\delta$, we have $\sum_{i=1}^{H}\|f^k(s_i,a_i) - f^*(s_i,a_i)\|_2 \le \sum_{i=1}^{H} 2\sqrt{2d_s\sigma_k^2(s_i,a_i)\log\frac{2d_s}{\delta}}$. Letting $\delta' = 2H\delta$, we have that with probability at least $1-\delta'$, $\sum_{i=1}^{H}\|f^k(s_i,a_i) - f^*(s_i,a_i)\|_2 \le \sum_{i=1}^{H} 2\sqrt{2d_s\sigma_k^2(s_i,a_i)\log\frac{4Hd_s}{\delta'}} \le 2H\sqrt{2d_s\sigma_k^2(s^k_{\max},a^k_{\max})\log\frac{4Hd_s}{\delta'}}$, where the index $k_{\max} = \arg\max_i\sigma_k^2(s_i,a_i)$, $i=1,\dots,H$, in episode $k$. Here, since the posterior distribution is only updated every $H$ steps, we use the data point with the maximum variance in each episode to bound the result. Similarly, using the union bound over the $\lceil T/H\rceil$ episodes, and letting $C = \sqrt{\frac{2}{\pi\sigma_f^2}}$, we have that with probability at least $1-\delta$,

$$\sum_{k=1}^{\lceil T/H\rceil}[\tilde\Delta_k(f)\mid\mathcal{H}_k] \le \sum_{k=1}^{\lceil T/H\rceil}\sum_{i=1}^{H} CHR_{\max}\|f^k(s_i,a_i) - f^*(s_i,a_i)\|_2 \le \sum_{k=1}^{\lceil T/H\rceil} 2CH^2R_{\max}\sqrt{2d_s\,\sigma_k^2(s^k_{\max},a^k_{\max})\log\tfrac{4Td_s}{\delta}}.$$
In each episode $k$, let $\sigma'^2_k(s,a)$ denote the posterior variance given only the subset of data points $\{(s^1_{\max},a^1_{\max}),\dots,(s^{k-1}_{\max},a^{k-1}_{\max})\}$, where each element has the maximum variance in the corresponding episode. By Eq. (6) in Williams and Vivarelli [17], the posterior variance decreases as the number of data points grows; hence $\sigma_k^2(s,a)\le\sigma'^2_k(s,a)$ for all $(s,a)$. By Theorem 5 in Srinivas et al. [8], which bounds the information gain, and Lemma 2 in Russo and Van Roy [18], which bounds the sum of variances by the information gain, we have $\sum_{k=1}^{\lceil T/H\rceil}\sigma'^2_k(s^k_{\max},a^k_{\max}) = O\big((d_s+d_a)\log\lceil T/H\rceil\big)$ for linear kernels with bounded variances. Note that the bounded-variance property for linear kernels only requires that the range of all state-action pairs actually encountered in $M^*$ does not expand to infinity as $T$ grows, which holds in general episodic MDPs.

Thus, with probability $1-\delta$ and taking $\delta = 1/T$,

$$\begin{aligned}
\sum_{k=1}^{\lceil T/H\rceil}[\tilde\Delta_k(f)\mid\mathcal{H}_k] &\le \sum_{k=1}^{\lceil T/H\rceil} 2CH^2R_{\max}\sqrt{2d_s\,\sigma_k^2(s^k_{\max},a^k_{\max})\log\tfrac{4Td_s}{\delta}}\\
&\le \sum_{k=1}^{\lceil T/H\rceil} 2CH^2R_{\max}\sqrt{2d_s\,\sigma'^2_k(s^k_{\max},a^k_{\max})\log(4T^2 d_s)}\\
&\le 2CH^2R_{\max}\sqrt{\sum_{k=1}^{\lceil T/H\rceil}\sigma'^2_k(s^k_{\max},a^k_{\max})}\,\sqrt{\lceil T/H\rceil}\,\sqrt{2d_s\log(4T^2 d_s)}\\
&= 2CH^{3/2}R_{\max}\sqrt{2Td_s\log(4T^2 d_s)}\cdot\sqrt{O\big((d_s+d_a)\log\lceil T/H\rceil\big)} = \tilde{O}\big((d_s+d_a)H^{3/2}\sqrt{T}\big) \qquad (8)
\end{aligned}$$

where $\tilde{O}$ ignores logarithmic factors. Therefore, $\mathbb{E}[\sum_{k=1}^{\lceil T/H\rceil}\tilde\Delta_k(f)\mid\mathcal{H}_k] \le (1-\frac{1}{T})\,\tilde{O}\big((d_s+d_a)H^{3/2}\sqrt{T}\big) + \frac{1}{T}\cdot HR_{\max}\cdot\lceil\tfrac{T}{H}\rceil = \tilde{O}(H^{3/2}d\sqrt{T})$, where $HR_{\max}$ is the upper bound on the difference of value functions and $d = d_s+d_a$. By a similar derivation, $\mathbb{E}[\sum_{k=1}^{\lceil T/H\rceil}\tilde\Delta_k(r)\mid\mathcal{H}_k] = \tilde{O}(\sqrt{dHT})$. Finally, through the tower property we have

$$\mathrm{BayesRegret}(T,\pi_{ps},\phi) = \tilde{O}(H^{3/2}d\sqrt{T}). \qquad\square$$

4.2 Extension to Feature Representations

We can slightly modify the previous proof to derive the bound in settings that use feature representations. We can transform the state-action pair $(s,a)$ to $\phi_f(s,a)\in\mathbb{R}^{d_\phi}$ as the input of the transition model, and transform the newly transitioned state $s'$ to $\psi_f(s')\in\mathbb{R}^{d_\psi}$ as the target; the transition model can then be established with respect to this feature embedding. We further assume $d_\psi = O(d_\phi)$, as in Assumption 1 of Yang and Wang [9]. Besides, we assume $d_{\phi'} = O(d_\phi)$ for the feature representation $\phi_r(s,a)\in\mathbb{R}^{d_{\phi'}}$ of the reward model, so the reward model can also be established with respect to the feature embedding. Following similar steps, we can derive a Bayesian regret of $\tilde{O}(H^{3/2}d_\phi\sqrt{T})$.

5 Algorithm

In this section, we elaborate our proposed algorithm, MPC-PSRL, as shown in Algorithm 1.

Algorithm 1 MPC-PSRL
  Initialize data D with random actions for one episode
  repeat
    Sample a transition model and a cost model at the beginning of each episode
    for i = 1 to H steps do
      Obtain action using MPC with planning horizon τ: a_i ∈ arg max_{a_{i:i+τ}} Σ_{t=i}^{i+τ} E[r(s_t, a_t)]
      D = D ∪ {(s_i, a_i, r_i, s_{i+1})}
    end for
    Train cost and dynamics representations φ_r and φ_f using data in D
    Update φ_r(s, a), φ_f(s, a) for all (s, a) collected
    Perform posterior update of w_r and w_f in cost and dynamics models using the updated representations φ_r(s, a), φ_f(s, a) for all (s, a) collected
  until convergence
When modeling the rewards and transitions, we use features extracted from the penultimate layer of fitted neural networks, and perform Bayesian linear regression on the feature vectors to update the posterior distributions.
Feature representation: We first fit neural networks for transitions and rewards, using the same network architecture as Chua et al. [19]. Let $x_i$ denote the state-action pair $(s_i,a_i)$ and $y_i$ denote the target value. Specifically, we use the reward $r_i$ as $y_i$ to fit rewards, and we take the difference between two consecutive states, $s_{i+1}-s_i$, as $y_i$ to fit transitions. The penultimate layer of the fitted neural networks is extracted as the feature representation, denoted $\phi_f$ and $\phi_r$ for transitions and rewards, respectively. Note that in the transition feature embedding, we only use one neural network to extract features of state-action pairs from the penultimate layer to serve as $\phi$, and leave the target states without further feature representation (the general setting is discussed in Section 4.2, where feature representations are used for both inputs and outputs), so the dimension of the target in the transition model, $d_\psi$, equals $d_s$. Thus we have a modified regret bound of $\tilde{O}(H^{3/2}\sqrt{d\,d_\phi\,T})$. We do not find it necessary to further extract feature representations in the target space, as this might introduce additional computational overhead. Although a higher dimensionality of the hidden layers might imply better representation, we find that simply setting the width of the penultimate layer to $d_\phi = d_s + d_a$ suffices in our experiments for both reward and transition models. How to optimize the dimension of the penultimate layer for more efficient feature representation deserves further exploration.

Bayesian update and posterior sampling: Here we describe the Bayesian update of the transition and reward models using the extracted features. Recall that a Gaussian process with a linear kernel is equivalent to Bayesian linear regression. By extracting the penultimate layer as the feature representation $\phi$, the target value $y$ and the representation $\phi(x)$ can be seen as linearly related: $y = w^\top\phi(x) + \epsilon$, where $\epsilon$ is zero-mean Gaussian noise with variance $\sigma^2$ (which is $\sigma_f^2$ for the transition model and $\sigma_r^2$ for the reward model, as defined in Section 3.1). We choose the prior distribution of the weights $w$ as zero-mean Gaussian with covariance matrix $\Sigma_p$; then the posterior distribution of $w$ is also multivariate Gaussian (Rasmussen [20]):

$$p(w\mid\mathcal{D}) \sim \mathcal{N}\big(\sigma^{-2}A^{-1}\Phi Y,\; A^{-1}\big), \qquad A = \sigma^{-2}\Phi\Phi^\top + \Sigma_p^{-1},$$

where $\Phi\in\mathbb{R}^{d\times N}$ is the concatenation of feature representations $\{\phi(x_i)\}_{i=1}^N$, and $Y\in\mathbb{R}^N$ is the concatenation of target values. At the beginning of each episode, we sample $w$ from the posterior distribution to build the model, collect new data during the whole episode, and update the posterior distribution of $w$ at the end of the episode using all the data collected.

Besides the posterior distribution of $w$, the feature representation $\phi$ is also updated in each episode with the newly collected data. We adopt a dual-update procedure similar to Riquelme et al. [21]: after the representations for rewards and transitions are updated, the feature vectors of all collected state-action pairs are re-computed, and we then apply the Bayesian update on these feature vectors. See the description of Algorithm 1 for details.
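The posterior update above is standard Bayesian linear regression, so it can be sketched in a few lines. In the snippet below (our own illustration, not the authors' released code; a fixed random feature map stands in for the fitted penultimate layer $\phi$, and all names are ours), we compute $p(w\mid\mathcal{D})$ and draw one posterior sample of $w$, as is done at the beginning of each episode.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_phi, N = 5, 8, 200       # state-action dim, feature dim, data points
sigma2 = 0.01                     # observation noise variance (sigma_f^2 or sigma_r^2)
Sigma_p = np.eye(d_phi)           # prior covariance of w

# Stand-in for the fitted penultimate layer phi(x): a fixed random feature map.
W1 = rng.normal(size=(d_phi, d_in))
phi = lambda X: np.tanh(X @ W1.T)

# Toy data: targets generated from some unknown function of the inputs.
X = rng.normal(size=(N, d_in))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

# Posterior over w: p(w | D) = N(sigma^-2 A^-1 Phi Y, A^-1), A = sigma^-2 Phi Phi^T + Sigma_p^-1.
Phi = phi(X).T                                    # shape (d_phi, N)
A = Phi @ Phi.T / sigma2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)
w_mean = A_inv @ Phi @ Y / sigma2
w_cov = A_inv

# Posterior sampling at the start of an episode: draw one w, use it for the whole episode.
w_sample = rng.multivariate_normal(w_mean, w_cov)
predict = lambda x: phi(x[None, :]) @ w_sample    # sampled model's prediction
print(predict(rng.normal(size=d_in)))
```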
During interaction with the environment, we use an MPC controller (Camacho and Alba [11]) for planning. At each time step $i$, the controller takes the state $s_i$ and an action sequence $a_{i:i+\tau} = \{a_i, a_{i+1},\cdots,a_{i+\tau}\}$ as input, where $\tau$ is the planning horizon. We use the transition and reward models to produce the first action $a_i$ of the optimized action sequence $\arg\max_{a_{i:i+\tau}}\sum_{t=i}^{i+\tau}\mathbb{E}[r(s_t,a_t)]$, where the expected return of a sequence of actions is approximated by the mean return of several particles propagated with the noise of our sampled reward and transition models. To compute the optimal action sequence, we use CEM (Botev et al. [22]), which samples actions from a distribution that moves closer to previous action samples with high rewards.
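For concreteness, the following is a minimal CEM planner in the spirit of the description above (our own sketch under simplifying assumptions: a known deterministic toy model stands in for the sampled transition and reward models, and the particle-based propagation of model noise is omitted).

```python
import numpy as np

def cem_plan(s0, dynamics, reward, horizon, act_dim,
             iters=5, popsize=400, elites=40, rng=None):
    """Return the first action of the CEM-optimized action sequence (the elite mean)."""
    rng = rng or np.random.default_rng()
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample candidate action sequences and roll them out under the model.
        seqs = mean + std * rng.normal(size=(popsize, horizon, act_dim))
        returns = np.empty(popsize)
        for n in range(popsize):
            s, total = s0, 0.0
            for t in range(horizon):
                total += reward(s, seqs[n, t])
                s = dynamics(s, seqs[n, t])
            returns[n] = total
        # Refit the sampling distribution to the elite sequences.
        elite_idx = np.argsort(returns)[-elites:]
        mean = seqs[elite_idx].mean(axis=0)
        std = seqs[elite_idx].std(axis=0) + 1e-6
    return mean[0]   # MPC executes only the first action, then replans

# Toy usage: drive a 1-D point toward the origin.
dyn = lambda s, a: s + 0.1 * a[0]
rew = lambda s, a: -s**2 - 0.01 * a[0]**2
a0 = cem_plan(2.0, dyn, rew, horizon=10, act_dim=1)
print(a0)
```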
6 Experiments

We compare our method with the following state-of-the-art model-based and model-free algorithms on benchmark control tasks.

Model-free:
Soft Actor-Critic (SAC) from Haarnoja et al. [23] is an off-policy deep actor-critic algorithm that utilizes entropy maximization to guide exploration. Deep Deterministic Policy Gradient (DDPG) from Barth-Maron et al. [24] is an off-policy algorithm that concurrently learns a Q-function and a policy, with a discount factor to guide exploration.
Model-based:
Probabilistic Ensembles with Trajectory Sampling (PETS) from Chua et al. [19] models the dynamics via an ensemble of probabilistic neural networks to capture epistemic uncertainty for exploration, and uses MPC for action selection, with the requirement of access to oracle rewards for planning. Model-Based Policy Optimization (MBPO) from Janner et al. [25] uses the same bootstrap ensemble technique as PETS for modeling, but differs from PETS in policy optimization, which relies on a large number of short model-generated rollouts, and can cope with environments where no oracle rewards are provided. We do not compare with Gal et al. [26], which adopts a single Bayesian neural network (BNN) with moment matching, as it is outperformed by PETS, which uses an ensemble of BNNs with trajectory sampling. Nor do we compare with GP-based trajectory optimization methods with real rewards provided (Deisenroth and Rasmussen [27], Kamthe and Deisenroth [28]), which are not only outperformed by PETS but also computationally expensive and thus limited to very small state-action spaces.

We use environments of varying complexity and dimensionality for evaluation. Low-dimensional environments: continuous Cartpole ($d_s=4$, $d_a=1$, $H=200$; its continuous action space makes it harder to learn than the classic Cartpole) and Pendulum Swing-Up ($d_s=3$, $d_a=1$, $H=200$; a modified version of Pendulum where we limit the start state to make exploration harder). Trajectory optimization with oracle rewards in these two environments is easy and there is almost no difference in performance among the model-based algorithms we compare, so we omit these learning curves. Higher-dimensional environments: 7-DOF Reacher ($d_s=17$, $d_a=7$, $H=150$) and 7-DOF Pusher ($d_s=20$, $d_a=7$, $H=150$) are two more challenging tasks as provided in [19], where we conduct experiments both with and without true rewards, to compare with all baseline algorithms mentioned.

Figure 1: Training curves of MPC-PSRL (shown in red) and the baseline algorithms on different tasks. Solid curves are the mean of five trials, shaded areas correspond to the standard deviation among trials, and the dotted line shows the reward at convergence.

The learning curves of these algorithms are shown in Figure 1. When oracle rewards are provided in Pusher and Reacher, our method outperforms PETS and MBPO: it converges more quickly with similar performance at convergence in Pusher, while in Reacher it not only learns faster but also performs better at convergence. As we use the same planning method (MPC) as PETS, the results indicate that our model better captures uncertainty, which is beneficial for improving sample efficiency. When exploring in environments where both rewards and transitions are unknown, our method learns significantly faster than previous model-based and model-free methods that do not require oracle rewards. Meanwhile, it matches the performance of SAC at convergence. Moreover, the performance of our algorithm in environments with and without oracle rewards can be similar, or it can even converge faster (see Pusher with and without rewards), indicating that our algorithm excels at exploring both rewards and transitions.

From the experimental results, it can be verified that our algorithm better captures model uncertainty, and makes better use of that uncertainty through posterior sampling.
In our method, by sampling from a Bayesian linear regression on a fitted feature space, and optimizing under the same sampled MDP for the whole episode instead of re-sampling at every step, the performance of our algorithm is guaranteed from a Bayesian view, as analysed in Section 4. In contrast, PETS and MBPO use bootstrap ensembles of models with a limited ensemble size to "simulate" a Bayesian model, in which the convergence of the uncertainty is not guaranteed and is highly dependent on the training of the neural networks. A limitation of our method is its use of MPC, which might fail in even higher-dimensional tasks, as shown in Janner et al. [25]. Incorporating policy-gradient techniques for action selection might further improve the performance, and we leave this for future work.

7 Conclusion

In this paper, we derive a novel Bayesian regret bound for the PSRL algorithm in continuous spaces under the assumption that the true rewards and transitions (with or without feature embedding) can be modeled by GPs with linear kernels. While matching the best-known bounds of previous works from a Bayesian view, PSRL also enjoys computational tractability. Moreover, we propose MPC-PSRL for continuous environments, and experiments show that our algorithm exceeds existing model-based and model-free methods with more efficient exploration.
References

[1] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
[2] Ian Osband, Benjamin Van Roy, and Daniel Russo. (More) efficient reinforcement learning via posterior sampling. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3003–3011, USA, 2013. Curran Associates Inc.
[3] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, pages 2701–2710, International Convention Centre, Sydney, Australia, 2017. PMLR.
[4] Ian Osband, Benjamin Van Roy, Daniel J Russo, and Zheng Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1–62, 2019.
[5] Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pages 1466–1474, 2014.
[6] Sayak Ray Chowdhury and Aditya Gopalan. Online learning in kernelized Markov decision processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3197–3205, 2019.
[7] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. In , pages 1–9. IEEE, 2018.
[8] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias W Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
[9] Lin F Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.
[10] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. arXiv preprint arXiv:2003.00153, 2020.
[11] Eduardo F Camacho and Carlos Bordons Alba. Model Predictive Control. Springer Science & Business Media, 2013.
[12] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
[13] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017.
[14] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
[15] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143, 2020.
[16] Phillippe Rigollet and Jan-Christian Hütter. High dimensional statistics. Lecture notes for course 18S997, 2015.
[17] Christopher KI Williams and Francesco Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Machine Learning, 40(1):77–102, 2000.
[18] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
[19] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
[20] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
[21] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
[22] Zdravko I Botev, Dirk P Kroese, Reuven Y Rubinstein, and Pierre L'Ecuyer. The cross-entropy method for optimization. In Handbook of Statistics, volume 31, pages 35–59. Elsevier, 2013.
[23] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[24] Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
[25] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12498–12509, 2019.
[26] Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, ICML, volume 4, page 34, 2016.
[27] Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
[28] Sanket Kamthe and Marc Deisenroth. Data-efficient reinforcement learning with probabilistic model predictive control. In International Conference on Artificial Intelligence and Statistics, pages 1701–1710. PMLR, 2018.
A Proof of Lemma 1
Here we provide a proof of Lemma 1.

We first prove the result in $\mathbb{R}^d$ with $d=1$: $p_1(x)\sim\mathcal{N}(\mu,\sigma^2)$, $p_2(x)\sim\mathcal{N}(\mu',\sigma^2)$; without loss of generality, assume $\mu'\ge\mu$. The probability densities are symmetric with respect to $\frac{\mu+\mu'}{2}$, and $p_1(x) = p_2(x)$ at $x = \frac{\mu+\mu'}{2}$. Thus the integral of the absolute difference between the densities of $p_1$ and $p_2$ can be simplified as twice the integral over one side:

$$\int_{-\infty}^{\infty}|p_1(x)-p_2(x)|\,dx = \frac{2}{\sqrt{2\pi\sigma^2}}\int_{\frac{\mu+\mu'}{2}}^{\infty}\Big(e^{-\frac{(x-\mu')^2}{2\sigma^2}} - e^{-\frac{(x-\mu)^2}{2\sigma^2}}\Big)\,dx \qquad (9)$$

Letting $z_1 = x-\mu$ and $z_2 = x-\mu'$, we have:

$$\begin{aligned}
\frac{2}{\sqrt{2\pi\sigma^2}}\int_{\frac{\mu+\mu'}{2}}^{\infty}\Big(e^{-\frac{(x-\mu')^2}{2\sigma^2}} - e^{-\frac{(x-\mu)^2}{2\sigma^2}}\Big)\,dx
&= \sqrt{\frac{2}{\pi\sigma^2}}\int_{\frac{\mu-\mu'}{2}}^{\infty} e^{-\frac{z^2}{2\sigma^2}}\,dz - \sqrt{\frac{2}{\pi\sigma^2}}\int_{\frac{\mu'-\mu}{2}}^{\infty} e^{-\frac{z^2}{2\sigma^2}}\,dz\\
&= \sqrt{\frac{2}{\pi\sigma^2}}\int_{\frac{\mu-\mu'}{2}}^{\frac{\mu'-\mu}{2}} e^{-\frac{z^2}{2\sigma^2}}\,dz = 2\sqrt{\frac{2}{\pi\sigma^2}}\int_{0}^{\frac{\mu'-\mu}{2}} e^{-\frac{z^2}{2\sigma^2}}\,dz\\
&\le 2\sqrt{\frac{2}{\pi\sigma^2}}\int_{0}^{\frac{\mu'-\mu}{2}} dz = \sqrt{\frac{2}{\pi\sigma^2}}\,|\mu'-\mu|. \qquad (10)
\end{aligned}$$

Now we extend the result to $\mathbb{R}^d$ ($d\ge 2$): $p_1(x)\sim\mathcal{N}(\mu,\sigma^2 I)$, $p_2(x)\sim\mathcal{N}(\mu',\sigma^2 I)$. We can rotate the coordinate system to align the last axis with the vector $\mu-\mu'$, so that the coordinates of $\mu$ and $\mu'$ can be written as $(0,0,\cdots,0,\hat\mu)$ and $(0,0,\cdots,0,\hat\mu')$ respectively, with $|\hat\mu'-\hat\mu| = \|\mu-\mu'\|_2$. Without loss of generality, let $\hat\mu\ge\hat\mu'$.

Clearly, all points with equal distance to $\hat\mu'$ and $\hat\mu$ define a hyperplane $P: x_d = \frac{\hat\mu+\hat\mu'}{2}$, on which $p_1(x) = p_2(x)$, $\forall x\in P$; more specifically, the probability densities are symmetric with respect to $P$. Similar to the analysis in $\mathbb{R}^1$:

$$\begin{aligned}
&\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}|p_1(x)-p_2(x)|\,dx_1\cdots dx_d\\
&= \frac{2}{\sqrt{(2\pi)^d\sigma^{2d}}}\int_{-\infty}^{\infty}\cdots\int_{\frac{\hat\mu+\hat\mu'}{2}}^{\infty} e^{-\frac{x_1^2}{2\sigma^2}}\cdots e^{-\frac{x_{d-1}^2}{2\sigma^2}}\Big(e^{-\frac{(x_d-\hat\mu)^2}{2\sigma^2}} - e^{-\frac{(x_d-\hat\mu')^2}{2\sigma^2}}\Big)\,dx_1\cdots dx_d\\
&= \frac{2}{\sqrt{(2\pi)^d\sigma^{2d}}}\Big(\int_{-\infty}^{\infty} e^{-\frac{x_1^2}{2\sigma^2}}dx_1\Big)\cdots\Big(\int_{-\infty}^{\infty} e^{-\frac{x_{d-1}^2}{2\sigma^2}}dx_{d-1}\Big)\Big(\int_{\frac{\hat\mu+\hat\mu'}{2}}^{\infty} e^{-\frac{(x_d-\hat\mu)^2}{2\sigma^2}}dx_d - \int_{\frac{\hat\mu+\hat\mu'}{2}}^{\infty} e^{-\frac{(x_d-\hat\mu')^2}{2\sigma^2}}dx_d\Big)\\
&= \sqrt{\frac{2}{\pi\sigma^2}}\Big(\int_{\frac{\hat\mu+\hat\mu'}{2}}^{\infty} e^{-\frac{(x_d-\hat\mu)^2}{2\sigma^2}}dx_d - \int_{\frac{\hat\mu+\hat\mu'}{2}}^{\infty} e^{-\frac{(x_d-\hat\mu')^2}{2\sigma^2}}dx_d\Big) \qquad (11)
\end{aligned}$$

Letting $z_1 = x_d-\hat\mu$ and $z_2 = x_d-\hat\mu'$, we have:

$$\begin{aligned}
\int_{\frac{\hat\mu+\hat\mu'}{2}}^{\infty} e^{-\frac{(x_d-\hat\mu)^2}{2\sigma^2}}dx_d - \int_{\frac{\hat\mu+\hat\mu'}{2}}^{\infty} e^{-\frac{(x_d-\hat\mu')^2}{2\sigma^2}}dx_d
&= \int_{\frac{\hat\mu'-\hat\mu}{2}}^{\infty} e^{-\frac{z^2}{2\sigma^2}}dz - \int_{\frac{\hat\mu-\hat\mu'}{2}}^{\infty} e^{-\frac{z^2}{2\sigma^2}}dz\\
&= \int_{\frac{\hat\mu'-\hat\mu}{2}}^{\frac{\hat\mu-\hat\mu'}{2}} e^{-\frac{z^2}{2\sigma^2}}dz = 2\int_{0}^{\frac{\hat\mu-\hat\mu'}{2}} e^{-\frac{z^2}{2\sigma^2}}dz \le 2\int_{0}^{\frac{\hat\mu-\hat\mu'}{2}} dz = |\hat\mu-\hat\mu'| \qquad (12)
\end{aligned}$$

Thus $\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}|p_1(x)-p_2(x)|\,dx_1\cdots dx_d \le \sqrt{\frac{2}{\pi\sigma^2}}\,\|\mu-\mu'\|_2$, which completes the proof.
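As a quick numerical sanity check of Lemma 1 (our own script, not part of the paper), one can estimate the $L_1$ distance between two isotropic Gaussians by importance sampling and compare it with the bound $\sqrt{2/(\pi\sigma^2)}\,\|\mu-\mu'\|_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 4, 0.7
mu1 = rng.normal(size=d)
mu2 = mu1 + 0.5 * rng.normal(size=d)

def log_pdf(x, mu):
    # Unnormalized log-density; the normalizer cancels in the ratio below.
    return -0.5 * np.sum((x - mu) ** 2, axis=1) / sigma**2

# Importance-sample the L1 distance: ∫|p1 - p2| dx = E_{x~p1}[|1 - p2(x)/p1(x)|].
x = mu1 + sigma * rng.normal(size=(200_000, d))
ratio = np.exp(log_pdf(x, mu2) - log_pdf(x, mu1))
l1_est = np.mean(np.abs(1.0 - ratio))

bound = np.sqrt(2.0 / (np.pi * sigma**2)) * np.linalg.norm(mu1 - mu2)
print(f"Monte Carlo L1 distance ~ {l1_est:.4f}, Lemma 1 bound = {bound:.4f}")
```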
B Experimental details

Here we provide the hyperparameters for MBPO:

env                          | cartpole                          | pendulum                          | pusher                            | reacher
env steps per episode        | 200                               | 200                               | 150                               | 150
model rollouts per env step  | 400                               | 400                               | 400                               | 400
ensemble size                | 5                                 | 5                                 | 5                                 | 5
network architecture         | MLP with 2 hidden layers of size 200 | MLP with 2 hidden layers of size 200 | MLP with 4 hidden layers of size 200 | MLP with 4 hidden layers of size 200
policy updates per env step  | 40                                | 40                                | 40                                | 40
model horizon                | 1->15 from episode 1->30          | 1->15 from episode 1->30          | 1                                 | 1->15 from episode 1->30

Table 1: Hyperparameters for MBPO

And we provide the hyperparameters for MPC and the neural networks in PETS:

env                          | pusher                            | reacher
env steps per episode        | 150                               | 150
popsize                      | 500                               | 400
number of elites             | 50                                | 50
network architecture         | MLP with 4 hidden layers of size 200 | MLP with 4 hidden layers of size 200
planning horizon             | 30                                | 30
max iter                     | 5                                 | 5
ensemble size                | 5                                 | 5

Table 2: Hyperparameters for PETS

Here are the hyperparameters of our algorithm, which are similar to those of PETS, except for the ensemble size (since we do not use ensembled models):

env                          | cartpole                          | pendulum                          | pusher                            | reacher
env steps per episode        | 200                               | 200                               | 150                               | 150
popsize                      | 500                               | 100                               | 500                               | 400
number of elites             | 50                                | 5                                 | 50                                | 50
network architecture         | MLP with 2 hidden layers of size 200 | MLP with 2 hidden layers of size 200 | MLP with 4 hidden layers of size 200 | MLP with 4 hidden layers of size 200
planning horizon             | 30                                | 20                                | 30                                | 30
max iter                     | 5                                 | 5                                 | 5                                 | 5

Table 3: Hyperparameters for our method

For SAC and DDPG, we use the open source code at https://github.com/dongminlee94/deep_rl.