Mitigating Planner Overfitting in Model-Based Reinforcement Learning
Dilip Arumugam, David Abel, Kavosh Asadi, Nakul Gopalan, Christopher Grimm, Jun Ki Lee, Lucas Lehnert, Michael L. Littman
Abstract
An agent with an inaccurate model of its environment faces a difficult choice: it can ignore the errors in its model and act in the real world in whatever way it determines is optimal with respect to its model. Alternatively, it can take a more conservative stance and eschew its model in favor of optimizing its behavior solely via real-world interaction. This latter approach can be exceedingly slow to learn from experience, while the former can lead to "planner overfitting," in which aspects of the agent's behavior are optimized to exploit errors in its model. This paper explores an intermediate position in which the planner seeks to avoid overfitting through a kind of regularization of the plans it considers. We present three different approaches that demonstrably mitigate planner overfitting in reinforcement-learning environments.
1. Introduction
Model-based reinforcement learning (RL) has proven to be a powerful approach for generating reward-seeking behavior in sequential decision-making environments. For example, a number of methods are known for guaranteeing near-optimal behavior in a Markov decision process (MDP) by adopting a model-based approach (Kearns & Singh, 1998; Brafman & Tennenholtz, 2002; Strehl et al., 2009). In this line of work, a learning agent continually updates its model of the transition dynamics of the environment and actively seeks out parts of its environment that could contribute to achieving high reward but that are not yet well learned. Policies, in this setting, are designed specifically to explore unknown transitions so that the agent will be able to exploit (that is, maximize reward) in the long run.

A distinct model-based RL problem is one in which an agent has explored its environment, constructed a model, and must then use this learned model to select the best policy that it can. A straightforward approach to this problem, referred to as the certainty equivalence approximation (Dayan & Sejnowski, 1996), is to take the learned model, compute its optimal policy, and deploy the resulting policy in the real environment. The promise of such an approach is that, for environments that are defined by relatively simple dynamics but require complex behavior, a model-based learner can start making high-quality decisions with little data.

Nevertheless, recent large-scale successes of reinforcement learning have not been due to model-based methods but instead derive from value-function-based or policy-search methods (Mnih et al., 2015; 2016; Schulman et al., 2017; Hessel et al., 2018). Attempts to leverage model-based methods have fallen below expectations, particularly when models are learned using function-approximation methods. Jiang et al. (2015) highlighted a significant shortcoming of the certainty equivalence approximation, showing that it is important to hedge against possibly misleading errors in a learned model. They found that reducing the effective planning depth by decreasing the discount factor used for decision making can result in improved performance when operating in the true environment.

At first, this result might seem counterintuitive: the best way to exploit a learned model can be to exploit it incompletely. However, an analogous situation arises in supervised machine learning. It is well established that, particularly when data is sparse, the representational capacity of supervised learning methods must be restrained or regularized to avoid overfitting. Returning the best hypothesis in a hypothesis class relative to the training data can be problematic if the hypothesis class is overly expressive relative to the size of the training data. The classic result is that testing performance improves, plateaus, then drops as the complexity of the learner's hypothesis class is increased.

In this paper, we extend the results on avoiding planner overfitting via decreased discount rates by introducing several other ways of regularizing policies in model-based RL. In each case, we see the classic "overfitting" pattern in which resisting the urge to treat the learned model as correct, and instead searching in a reduced policy class, is repaid by improved performance in the actual environment.
We believe this research direction may hold the key to large-scale applications of model-based RL.

Section 2 provides a set of definitions, which supply a vocabulary for the paper. Section 3 reviews the results on decreasing discount rates, Section 4 presents a new approach that plans using epsilon-greedy policies, and Section 5 presents results where policy search is performed using lower-capacity representations of policies. Section 6 summarizes related work and Section 7 concludes.
2. Definitions
An MDP $M$ is defined by the quantities $\langle S, A, R, T, \gamma \rangle$, where $S$ is a state space, $A$ is an action space, $R : S \times A \rightarrow \mathbb{R}$ is a reward function, $T : S \times A \rightarrow P(S)$ is a transition function, and $0 \le \gamma < 1$ is a discount factor. The notation $P(X)$ represents the set of probability distributions over the discrete set $X$. Given an MDP $M = \langle S, A, R, T, \gamma \rangle$, its optimal value function $Q^*$ is the solution to the Bellman equation:
$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a)(s') \max_{a'} Q^*(s',a').$$
This function is unique and can be computed by algorithms such as value iteration or linear programming (Puterman, 1994).

A (deterministic) policy is a mapping from states to actions, $\pi : S \rightarrow A$. Given a value function $Q : S \times A \rightarrow \mathbb{R}$, the greedy policy with respect to $Q$ is $\pi_Q(s) = \operatorname{argmax}_a Q(s,a)$. The greedy policy with respect to $Q^*$ maximizes expected discounted reward from all states. We assume that ties between actions of the greedy policy are broken arbitrarily but consistently, so there is always a unique optimal policy for any MDP.

The value function for a policy $\pi$ deployed in $M$ can be found by solving
$$Q^\pi_M(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a)(s')\, Q^\pi_M(s', \pi(s')).$$
The value function of the optimal policy is the optimal value function. For a policy $\pi$, we also define the scalar $V^\pi_M = \sum_s w_s\, Q^\pi_M(s, \pi(s))$, where $w$ is an MDP-specific weighting function over the states.

The epsilon-greedy policy (Sutton & Barto, 1998) is a stochastic policy where the probability of choosing action $a$ is $(1-\epsilon) + \epsilon/|A|$ if $a = \operatorname{argmax}_{a} Q(s,a)$ and $\epsilon/|A|$ otherwise. The optimal epsilon-greedy policy for $M$ is not generally the epsilon-greedy policy for $Q^*$. Instead, it is necessary to solve a different set of Bellman equations:
$$Q_\epsilon(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a)(s') \left( (1-\epsilon) \max_{a'} Q_\epsilon(s',a') + \frac{\epsilon}{|A|} \sum_{a'} Q_\epsilon(s',a') \right).$$
The optimal epsilon-greedy policy plays an important role in the analysis of learning algorithms like SARSA (Rummery, 1994; Littman & Szepesvári, 1996).

These examples of optimal policies are with respect to all possible deterministic Markov policies. In this paper, we also consider optimization with respect to a restricted set of policies $\tilde{\Pi}$. The optimal restricted policy can be found by comparing the scalar values of the policies: $\rho^* = \operatorname{argmax}_{\rho \in \tilde{\Pi}} V^\rho_M$.
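To make the two Bellman equations above concrete, here is a small tabular sketch (our own illustration, not code from the paper; the array shapes and iteration counts are arbitrary choices) that computes $Q^*$ by value iteration and the fixed point of the epsilon-greedy Bellman equation.

```python
import numpy as np

def value_iteration(R, T, gamma, iters=1000):
    """Compute Q* for a tabular MDP.
    R: (S, A) rewards; T: (S, A, S) transition probabilities; 0 <= gamma < 1."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') max_a' Q(s',a')
        Q = R + gamma * T @ Q.max(axis=1)
    return Q

def epsilon_greedy_value_iteration(R, T, gamma, eps, iters=1000):
    """Fixed point of the epsilon-greedy Bellman equation: the next-state value
    mixes the greedy value with the mean value over actions."""
    A = R.shape[1]
    Q = np.zeros(R.shape)
    for _ in range(iters):
        next_v = (1 - eps) * Q.max(axis=1) + (eps / A) * Q.sum(axis=1)
        Q = R + gamma * T @ next_v
    return Q

# The greedy policy with respect to a value function Q is simply Q.argmax(axis=1).
```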
3. Decreased Discounting
Let $M = \langle S, A, R, T, \gamma \rangle$ be the evaluation environment and $\hat{M} = \langle S, A, R, \hat{T}, \tilde{\gamma} \rangle$ be the planning environment, where $\hat{T}$ is the learned model and $\tilde{\gamma} \le \gamma$ is a smaller discount factor used to decrease the effective planning horizon. Jiang et al. (2015) proved a bound on the difference between the performance of the optimal policy in $M$ and the performance of the optimal policy in $\hat{M}$ when executed in $M$:
$$\frac{\gamma - \tilde{\gamma}}{(1-\gamma)(1-\tilde{\gamma})} R_{\max} + \frac{2 R_{\max}}{(1-\tilde{\gamma})^2} \sqrt{\frac{1}{2n}\log\frac{2|S||A|\,|\Pi_{R,\tilde{\gamma}}|}{\delta}}. \qquad (1)$$
Here, $R_{\max} = \max_{s,a} R(s,a)$ is the largest reward (we assume all rewards are non-negative), $\delta$ is the certainty with which the bound needs to hold, $n$ is the number of samples of each transition used to build the model, and $|\Pi_{R,\tilde{\gamma}}|$ is the number of distinct possibly optimal policies for $\langle S, A, R, \cdot, \tilde{\gamma} \rangle$ over the entire space of possible transition functions.

They show that $|\Pi_{R,\tilde{\gamma}}|$ is an increasing function of $\tilde{\gamma}$, growing from 1 to as high as $|A|^{|S|}$, the size of the set of all possible deterministic policies. They left open the shape of this function, which is most useful if it grows gradually, but could possibly jump abruptly.

To help ground intuitions, we estimated the shape of $|\Pi_{R,\tilde{\gamma}}|$ over a set of randomly generated MDPs. Following Jiang et al. (2015), a "ten-state chain" MDP $M = \langle S, A, T, R, \gamma \rangle$ is drawn such that, for each state–action pair $(s,a) \in S \times A$, the transition function $T(s,a)$ is constructed by choosing a subset of states at random from $S$, then assigning probabilities to these states by drawing independent samples from a uniform distribution over $[0,1]$ and normalizing the resulting numbers; the probability of transitioning to any other state is zero. For each state–action pair $(s,a) \in S \times A$, the reward $R(s,a)$ is drawn from a uniform distribution with support $[0,1]$. For our MDPs, we chose $|S| = 10$, $|A| = 2$, and $\gamma = 0.99$. We examined an evenly spaced grid of values of $\tilde{\gamma}$ and computed optimal policies by running value iteration. We sampled transition functions repeatedly until no new optimal policy was discovered for many consecutive samples.

Figure 1: The number of distinct optimal policies found generating random transition functions for a fixed reward function, varying $\tilde{\gamma}$, in random MDPs.

Figure 1 is an estimate of how $|\Pi_{R,\tilde{\gamma}}|$ grows in this class of randomly generated MDPs. Fortunately, the set appears to grow gradually, making $\tilde{\gamma}$ an effective parameter for fighting planner overfitting.
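One way to produce an estimate like Figure 1 is sketched below (our own reconstruction of the sampling procedure, not the authors' code; the subset size, stopping patience, and grid of $\tilde{\gamma}$ values are assumptions, since the paper's exact constants are not reproduced here): fix a reward function, repeatedly sample random transition functions, and count the distinct $\tilde{\gamma}$-optimal policies that appear.

```python
import numpy as np

def sample_transition(S, A, support=5, rng=np.random):
    """Random transition function: each (s, a) spreads uniform-normalized mass over a
    small random subset of next states (the subset size is an assumed constant)."""
    T = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            nxt = rng.choice(S, size=support, replace=False)
            p = rng.uniform(size=support)
            T[s, a, nxt] = p / p.sum()
    return T

def greedy_policy(R, T, gamma, iters=500):
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * T @ Q.max(axis=1)
    return tuple(Q.argmax(axis=1))          # hashable, so policies can be collected in a set

def count_optimal_policies(R, gamma_tilde, S=10, A=2, patience=1000, rng=np.random):
    """Count distinct optimal policies across sampled transition functions, stopping once
    no new policy has appeared for `patience` consecutive samples (assumed threshold)."""
    seen, since_new = set(), 0
    while since_new < patience:
        pi = greedy_policy(R, sample_transition(S, A, rng=rng), gamma_tilde)
        since_new = since_new + 1 if pi in seen else 0
        seen.add(pi)
    return len(seen)

R = np.random.uniform(size=(10, 2))          # fixed reward function with U[0, 1] entries
for g in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(g, count_optimal_policies(R, g))
```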
Estimating $|\Pi_{R,\tilde{\gamma}}|$ with an exponential function of $\tilde{\gamma}$ fit to Figure 1, Figure 2 shows the bound of Equation 1 applied to the random MDP distribution ($|S| = 10$, $|A| = 2$, $R_{\max} = 1$, $\gamma = 0.99$). Note that the expected "U" shape is visible, but only for a relatively narrow range of values of $n$. For sufficiently small sample sizes, the minimal loss bound is achieved at $\tilde{\gamma} = 0$; for sufficiently large sample sizes, the minimal loss bound is achieved at $\tilde{\gamma} = \gamma$. (Note that the pattern shown here is relatively insensitive to the estimated shape of $|\Pi_{R,\tilde{\gamma}}|$.)

Figure 2: Bound on policy loss (Equation 1) for randomly generated MDPs, for $n$ ranging from 10,000 to 1,500,000 samples, showing the tightest bound at intermediate values of $\tilde{\gamma}$ for intermediate amounts of data.

For actual MDPs, the "U" shape is much more robust. Using this same distribution over MDPs, Figure 3 replicates an empirical result of Jiang et al. (2015) showing that intermediate values of $\tilde{\gamma}$ are most successful and that this value grows as the model used in planning becomes more accurate (having been trained on more trajectories). We sampled MDPs from the same random distribution and, for each of several values of $n$, we generated datasets consisting of $n$ fixed-length trajectories, each starting from a state selected uniformly at random and executing a random policy. In all experiments, the estimated MDP $\hat{M}$ was computed using maximum likelihood estimates of $T$ and $R$ with no additive Gaussian noise. Optimal policies were all found by running value iteration in the estimated MDP $\hat{M}$. The empirical loss (Equation 14 of Jiang et al. (2015)) was computed for a grid of values of $\tilde{\gamma}$. The error bars shown in the figure represent confidence intervals.

Figure 3: Reducing the discount factor used in planning combats planner overfitting in random MDPs.
4. Increased Exploration
In this section, we consider a novel regularization approach in which planning is performed over the set of epsilon-greedy policies. The intuition here is that adding noise to the policies makes it harder for them to be tailored explicitly to the learned model, resulting in less planner overfitting. In Section 4.1, a general bound is introduced, and then Section 4.2 applies the bound to the set of epsilon-greedy policies.
We can relate the structure of a restricted set of policies $\tilde{\Pi}$ to the performance in an approximate model with the following theorem.

Theorem 1. Let $\tilde{\Pi}$ be a set of policies for an MDP $M = \langle S, A, T, R, \gamma \rangle$. Let $\hat{M} = \langle S, A, \hat{T}, R, \gamma \rangle$ be an MDP like $M$, but with a different transition function. Let $\pi$ be the optimal policy for $M$ and $\hat{\pi}$ be the optimal policy for $\hat{M}$. Let $\rho$ be the optimal policy in $\tilde{\Pi}$ for $M$ and $\hat{\rho}$ be the optimal policy in $\tilde{\Pi}$ for $\hat{M}$. Then,
$$|V^\pi_M - V^{\hat{\rho}}_M| \le |V^\pi_M - V^\rho_M| + 2 \max_{p \in \tilde{\Pi}} |V^p_M - V^p_{\hat{M}}|.$$

Proof. We can write
$$V^\pi_M - V^{\hat{\rho}}_M = (V^\pi_M - V^\rho_M) + (V^\rho_M - V^\rho_{\hat{M}}) - (V^{\hat{\rho}}_M - V^{\hat{\rho}}_{\hat{M}}) - (V^{\hat{\rho}}_{\hat{M}} - V^\rho_{\hat{M}})$$
$$\le (V^\pi_M - V^\rho_M) + (V^\rho_M - V^\rho_{\hat{M}}) - (V^{\hat{\rho}}_M - V^{\hat{\rho}}_{\hat{M}}) \qquad (2)$$
$$\le |V^\pi_M - V^\rho_M| + |V^\rho_M - V^\rho_{\hat{M}}| + |V^{\hat{\rho}}_M - V^{\hat{\rho}}_{\hat{M}}|$$
$$\le |V^\pi_M - V^\rho_M| + 2 \max_{p \in \tilde{\Pi}} |V^p_M - V^p_{\hat{M}}|. \qquad (3)$$
Equation 2 follows from the fact that $V^{\hat{\rho}}_{\hat{M}} - V^\rho_{\hat{M}} \ge 0$, since $\hat{\rho}$ is chosen as optimal among the set of restricted policies with respect to $\hat{M}$. Equation 3 follows because both $\rho$ and $\hat{\rho}$ are included in $\tilde{\Pi}$. The theorem follows from the fact that $V^\pi_M - V^{\hat{\rho}}_M \ge 0$, since $\pi$ is chosen to be optimal in $M$.

Theorem 1 shows that the restricted policy set $\tilde{\Pi}$ impacts the resulting value of the plan in two ways. First, the bigger the class is, the closer $V^\rho_M$ becomes to $V^\pi_M$; that is, the more policies we consider, the closer to optimal we become. At the same time, $\max_{p \in \tilde{\Pi}} |V^p_M - V^p_{\hat{M}}|$ grows as $\tilde{\Pi}$ gets larger, as there are more policies that can differ in value between $M$ and $\hat{M}$.

Jiang et al. (2015) leverage this structure in the specific case of defining $\tilde{\Pi}$ by optimizing policies using a smaller value of the discount factor. Our Theorem 1 generalizes the idea to arbitrary restricted policy classes and arbitrary pairs of MDPs $M$ and $\hat{M}$. In particular, consider a sequence of $\Pi_i$ such that $\Pi_i \subseteq \Pi_{i+1}$. Then, the first part of the bound is monotonically non-increasing (it goes down each time a better policy is included in the set) and the second part of the bound is monotonically non-decreasing (it goes up each time a policy is included that magnifies the difference in performance possible in the two MDPs).

In Lemma 1, we show that the particular choice of $\hat{M}$ that comes from statistically sampling transitions, as in certainty equivalence, leads to a bound on $|V^p_M - V^p_{\hat{M}}|$ for an arbitrary policy $p$.

Lemma 1.
Given true MDP $M$, let $\hat{M}$ be an MDP comprised of the reward function $R$ and a transition function $\hat{T}$ estimated from $n$ samples for each state–action pair. Then the following holds with probability at least $1 - \delta$:
$$\max_{p \in \tilde{\Pi}} |V^p_M - V^p_{\hat{M}}| \le \frac{R_{\max}}{(1-\gamma)^2} \sqrt{\frac{1}{2n}\log\frac{2|S||A|\,|\tilde{\Pi}|}{\delta}}.$$

Proof. This lemma is a variation of the classic "Simulation Lemma" (Kearns & Singh, 1998; Strehl et al., 2009) and is proven in this form as Theorem 2 of Jiang et al. (2015) (specifically, Lemma 2). Note that their proof, though stated with respect to a particular choice of the set $\tilde{\Pi}$, holds in this general form.

It remains to show that $\|V^\pi_M - V^\rho_M\|_\infty$ is bounded when the policy class is restricted to epsilon-greedy policies. For the case of planning with a decreased discount factor, Jiang et al. (2015) provide a bound for this quantity in their Lemma 1. For the case of epsilon-greedy policies, the corresponding bound is proven in the following lemma.

Lemma 2. For any MDP $M$, the difference in value of the optimal policy $\pi$ and the optimal $\epsilon$-greedy policy $\rho$ is bounded by:
$$|V^\pi_M - V^\rho_M| \le \frac{\epsilon\, R_{\max}}{(1-\gamma)(1-\gamma(1-\epsilon))}.$$

Proof. Let $\pi$ be the optimal policy for $M$ and $u$ be a policy that selects actions uniformly at random. We can define $\pi_\epsilon$, an $\epsilon$-greedy version of $\pi$, as
$$\pi_\epsilon^a(s) = (1-\epsilon)\,\pi^a(s) + \epsilon\, u^a(s), \qquad (4)$$
where $\pi^a(s)$ refers to the probability associated with action $a$ under a policy $\pi$. Let $T_\pi$ denote the transition matrix from states to states under policy $\pi$. Using the above definition, we can decompose the transition matrix into
$$T_{\pi_\epsilon} = (1-\epsilon) T_\pi + \epsilon T_u. \qquad (5)$$
Similarly, we have for the reward vector over states
$$R_{\pi_\epsilon} = (1-\epsilon) R_\pi + \epsilon R_u. \qquad (6)$$
To obtain our bound, note
$$V^\pi_M - V^{\pi_\epsilon}_M = \sum_{t=1}^{\infty} \gamma^{t-1} [T_\pi]^{t-1} R_\pi - \gamma^{t-1} [T_{\pi_\epsilon}]^{t-1} R_{\pi_\epsilon}$$
$$= \sum_{t=1}^{\infty} \gamma^{t-1} [T_\pi]^{t-1} R_\pi - \gamma^{t-1} [T_{\pi_\epsilon}]^{t-1} \big[(1-\epsilon) R_\pi + \epsilon R_u\big]$$
$$= \sum_{t=1}^{\infty} \gamma^{t-1} [T_\pi]^{t-1} R_\pi - \gamma^{t-1} [T_{\pi_\epsilon}]^{t-1} (1-\epsilon) R_\pi - \underbrace{\gamma^{t-1} [T_{\pi_\epsilon}]^{t-1} \epsilon R_u}_{\ge 0}$$
$$\le \sum_{t=1}^{\infty} \gamma^{t-1} [T_\pi]^{t-1} R_\pi - \gamma^{t-1} \big[(1-\epsilon) T_\pi + \epsilon T_u\big]^{t-1} (1-\epsilon) R_\pi. \qquad (7)$$
Since $T_\pi$ is a transition matrix, all its entries lie in $[0,1]$; hence, we have the following element-wise matrix inequality:
$$\big[(1-\epsilon) T_\pi + \epsilon T_u\big]^{t-1} \ge \big[(1-\epsilon) T_\pi\big]^{t-1}. \qquad (8)$$
Plugging inequality 8 into the bound 7 results in
$$V^\pi_M - V^{\pi_\epsilon}_M \le \sum_{t=1}^{\infty} \gamma^{t-1} [T_\pi]^{t-1} R_\pi - \gamma^{t-1} (1-\epsilon)^t [T_\pi]^{t-1} R_\pi \le \left\| \sum_{t=1}^{\infty} \gamma^{t-1} \big(1 - (1-\epsilon)^t\big) [T_\pi]^{t-1} R_\pi \right\|_\infty.$$
Since $\|R_\pi\|_\infty = R_{\max}$, we can upper bound the norm of the difference of the value vectors over states with
$$\|V^\pi_M - V^{\pi_\epsilon}_M\|_\infty \le \sum_{t=1}^{\infty} \gamma^{t-1} \big(1 - (1-\epsilon)^t\big) R_{\max} = \frac{\epsilon}{(1-\gamma)(\epsilon\gamma - \gamma + 1)} R_{\max}.$$
Using this inequality, we can bound the difference in value of the optimal policy $\pi$ and the optimal $\epsilon$-greedy policy $\rho$ by
$$\|V^\pi_M - V^\rho_M\|_\infty \le \|V^\pi_M - V^{\pi_\epsilon}_M\|_\infty \le \frac{\epsilon}{(1-\gamma)(\epsilon\gamma - \gamma + 1)} R_{\max}.$$
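As a quick numerical sanity check of Lemma 2 (our own sketch, not from the paper; the dense random-MDP generator and all constants are assumptions), one can compare the gap between the optimal value and the value of the optimal $\epsilon$-greedy policy against the bound $\epsilon R_{\max} / ((1-\gamma)(1-\gamma(1-\epsilon)))$.

```python
import numpy as np

def q_star(R, T, gamma, iters=2000):
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * T @ Q.max(axis=1)
    return Q

def q_eps(R, T, gamma, eps, iters=2000):
    # Fixed point of the epsilon-greedy Bellman equation from Section 2.
    A = R.shape[1]
    Q = np.zeros(R.shape)
    for _ in range(iters):
        v = (1 - eps) * Q.max(axis=1) + (eps / A) * Q.sum(axis=1)
        Q = R + gamma * T @ v
    return Q

S, A, gamma, eps = 10, 2, 0.9, 0.1
rng = np.random.RandomState(0)
R = rng.uniform(size=(S, A))
T = rng.dirichlet(np.ones(S), size=(S, A))    # dense random transitions (an assumption)

Q1, Q2 = q_star(R, T, gamma), q_eps(R, T, gamma, eps)
v_opt = Q1.max(axis=1)                                            # V^pi at each state
v_eps = (1 - eps) * Q2.max(axis=1) + (eps / A) * Q2.sum(axis=1)   # optimal eps-greedy value
bound = eps * R.max() / ((1 - gamma) * (1 - gamma * (1 - eps)))
print(np.max(v_opt - v_eps), "<=", bound)
```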
Figure 4: The number of distinct optimal policies found generating random transition functions for a fixed reward function, varying $\epsilon$.

Figure 4 is an estimate of how $|\Pi_{R,\epsilon}|$ grows over the class of randomly generated MDPs. Again, the set appears to grow gradually as $\epsilon$ decreases, making $\epsilon$ another effective parameter for fighting planner overfitting.

We evaluated this exploration-based regularization approach in the distribution over MDPs used in Figure 3. Figure 5 shows results for a grid of values of $\epsilon$. Here, the maximum likelihood transition function $\hat{T}$ was replaced with the epsilon-softened transition function $T_\epsilon$. In contrast to the previous figure, regularization increases as we move to the right. Once again, we see that intermediate values of $\epsilon$ are most successful and the best value of $\epsilon$ decreases as the model used in planning becomes more accurate (having been trained on more trajectories). The similarity to Figure 3 is striking: in spite of the difference in approach, it is essentially the mirror image of Figure 3.

Figure 5: Increasing the randomness in action selection during planning combats planner overfitting in random MDPs.

We see that manipulating either $\tilde{\gamma}$ or $\epsilon$ can be used to modulate the impact of planner overfitting. Which method to use in practice depends on the particular planner being used and how easily it is modified to use these methods.
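A minimal sketch of the softening step is below. It reflects one plausible reading of $T_\epsilon$ (with probability $\epsilon$ the executed action is replaced by a uniformly random one); treating the reward analogously is our own assumption, and none of this is the authors' code.

```python
import numpy as np

def soften_model(T_hat, R_hat, eps):
    """Epsilon-soften a tabular model: each (s, a) mixes in the action-averaged dynamics
    (and, as an assumption, the action-averaged reward), as if a random action were
    executed with probability eps."""
    T_bar = T_hat.mean(axis=1, keepdims=True)   # (S, 1, S)
    R_bar = R_hat.mean(axis=1, keepdims=True)   # (S, 1)
    return (1 - eps) * T_hat + eps * T_bar, (1 - eps) * R_hat + eps * R_bar

def plan_with_softening(T_hat, R_hat, gamma, eps, iters=500):
    """Plan greedily in the softened model; the resulting policy is then deployed
    and evaluated in the true MDP."""
    T_eps, R_eps = soften_model(T_hat, R_hat, eps)
    Q = np.zeros(R_hat.shape)
    for _ in range(iters):
        Q = R_eps + gamma * T_eps @ Q.max(axis=1)
    return Q.argmax(axis=1)
```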
5. Decreased Policy Complexity
In addition to indirectly controlling policy complexity via $\epsilon$ and $\gamma$, it is possible to control the complexity via the representation of the policy itself. In this section, we look at varying the complexity of the policy in the context of model-based RL in which a model is learned and then a policy for that model is optimized via a policy-search approach. Such an approach was used in the setting of helicopter control (Ng et al., 2003) in the sense that collected data in that work was used to build a model, a policy was constructed to optimize performance in this model (via policy search, in that case), and the policy was then deployed in the environment.

Our test domain was Lunar Lander, an environment with a continuous state space and discrete actions. The goal of the environment is to control a falling spacecraft so as to land gently in a target area. It consists of 8 state variables, namely the lander's x and y coordinates, x and y velocities, angle and angular velocity, and two Boolean flags corresponding to whether each leg has touched down. The agent can take 4 actions, corresponding to which of its three thrusters (or no thruster) is active during the current time step. The Lunar Lander environment is publicly available as part of the OpenAI Gym toolkit (Brockman et al., 2016).

We collected a number of k-step episodes of data on Lunar Lander. During data collection, decisions were made by a policy-gradient algorithm. Specifically, we ran the REINFORCE algorithm with the state-value function as the baseline (Williams, 1992; Sutton et al., 2000). For the policy and value networks, we used a single-hidden-layer neural network with 16 hidden units and ReLU activation functions, trained with the Adam algorithm (Kingma & Ba, 2014) using its default parameters. The learned model was a 3-layer neural network with ReLU activation functions, taking as input the agent's state (8 inputs corresponding to the 8 state variables) as well as a one-hot representation of the action (4 inputs corresponding to the 4 possible actions). The model consisted of two fully connected hidden layers with 32 units each and ReLU activations. We again used Adam to learn the model.

We then ran policy-gradient RL (REINFORCE) as a planner using the learned model. The policy was represented by a neural network with a single hidden layer. To control the complexity of the policy representation, we varied the number of units in the hidden layer over a wide range, and results were averaged over multiple runs. Figure 6 shows that increasing the size of the hidden layer in the policy resulted in better and better performance on the learned model (top line). However, after 250 or so units, the resulting policy performed less well on the actual environment (bottom line). Thus, we see that reducing policy complexity serves as yet another way to reduce planner overfitting.

Figure 6: Decreasing the number of hidden units used to represent a policy combats planner overfitting in the Lunar Lander domain (mean return of the optimized policy on the model and on the environment).
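The capacity knob in this experiment is simply the width of the policy network's hidden layer. The PyTorch sketch below is our own reconstruction, not the authors' code: the learned-model interface `model(state, action) -> (next_state, reward, done)`, the initial-state convention, the constants, and the use of return normalization in place of the paper's state-value baseline are all assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Policy(nn.Module):
    """Single-hidden-layer softmax policy; `hidden` is the complexity knob."""
    def __init__(self, obs_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return Categorical(logits=self.net(obs))

def reinforce_on_model(model, hidden, episodes=2000, horizon=300, gamma=0.99):
    """Run REINFORCE using rollouts from the *learned* model.
    `model(state, action)` is assumed to return (next_state, reward, done)."""
    policy = Policy(hidden=hidden)
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(episodes):
        s = torch.zeros(8)                      # assumed initial-state convention
        log_probs, rewards = [], []
        for _ in range(horizon):
            dist = policy(s)
            a = dist.sample()
            log_probs.append(dist.log_prob(a))
            s, r, done = model(s, a.item())
            rewards.append(r)
            if done:
                break
        # Monte Carlo returns; the REINFORCE loss is -sum_t log pi(a_t | s_t) * G_t.
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns = torch.tensor(returns[::-1])
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        loss = -(torch.stack(log_probs) * returns).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

Sweeping `hidden` and evaluating each resulting policy both in the learned model and in the real environment yields the comparison summarized in Figure 6.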
6. Related Work
Prior work has explored the use of regularization in rein-forcement learning to mitigate overfitting. We survey someof the previous methods according to which function isregularized: (1) value, (2) model, or (3) policy.
Many prior approaches have applied regularization to value-function approximation, including Least-Squares Temporal Difference (LSTD) learning (Bradtke & Barto, 1996), policy evaluation, and the batch approach of Fitted Q-Iteration (FQI) (Ernst et al., 2005).

Kolter & Ng (2009) applied regularization techniques to LSTD (Bradtke & Barto, 1996) with an algorithm they called LARS-TD. In particular, they argued that, without regularization, LSTD's performance depends heavily on the number of basis functions chosen and the size of the data set collected. If the data set is too small, the technique is prone to overfitting. They showed that $L_1$ and $L_2$ regularization yield a procedure that inherits the benefits of selecting good features while making it possible to compute the fixed point. Later work by Liu et al. (2012) built on this work with the algorithm RO-TD, an $L_1$-regularized off-policy temporal difference learning method. Johns et al. (2010) cast the $L_1$-regularized fixed-point computation as a linear complementarity problem, which provides stronger solution-uniqueness guarantees than those provided for LARS-TD. Petrik et al. (2010) examined the approximate linear programming (ALP) framework for finding approximate value functions in large MDPs. They showed the benefits of adding an $L_1$ regularization constraint to the ALP that increases the error bound at training time and helps fight overfitting.

Farahmand et al. (2008a) and Farahmand et al. (2009) focused on regularization applied to Policy Iteration and Fitted Q-Iteration (FQI) (Ernst et al., 2005) and developed two related methods for Regularized Policy Iteration, each leveraging $L_2$ regularization during the evaluation of policies for each iteration. The first method adds a regularization term to the Least-Squares Temporal Difference (LSTD) error (Bradtke & Barto, 1996), while the second adds a similar term to the optimization of Bellman residual minimization (Baird et al., 1995; Schweitzer & Seidmann, 1985; Williams & Baird, 1993) with regularization (Loth et al., 2007). Their main result shows finite convergence for the $Q$ function under the approximated policy and the true optimal policy. A method for FQI adds a regularization cost to the least-squares regression of the $Q$ function. Follow-up work (Farahmand et al., 2008b) expanded Regularized Fitted Q-Iteration to planning. That is, given a data set $D = \langle (s_1, a_1, r_1, s'_1), \ldots, (s_m, a_m, r_m, s'_m) \rangle$ and a function family $F$ (like regression trees), FQI approximates a $Q$ function through repeated iterations of the following regression problem:
$$\hat{Q}_{t+1} = \operatorname{argmin}_{Q \in F} \sum_{i=1}^{m} \left[ r_i + \gamma \max_{a' \in A} \hat{Q}_t(s'_i, a') - Q(s_i, a_i) \right]^2 + \lambda \operatorname{Pen}(Q),$$
where $\operatorname{Pen}(Q)$ imposes a regularization penalty and $\lambda$ is a regularization coefficient. They prove bounds relating this regularization cost to the approximation error in $\hat{Q}$ between iterations of FQI.

Farahmand & Szepesvári (2011) and Farahmand (2011) focused on a problem relevant to our approach: regularization for $Q$-value selection in RL and planning. They considered an offline setting in which an algorithm, given a data set of experiences and a set of possible $Q$ functions, must choose a $Q$ function from the set that minimizes the true Bellman error. They provided a general complexity regularization bound for model selection, which they applied to bound the approximation error for the $Q$ function chosen by their proposed algorithm, BErMin.

In model-based RL, regularization can be used to improve estimates of $R$ and $T$ when data is finite or limited. Taylor & Parr (2009) investigated the relationship between Kernelized LSTD (Xu et al., 2005) and other related techniques, with a focus on regularization in model-based RL. Most relevant to our work is their decomposition of the Bellman error into transition and reward error, which they empirically show offers insight into the choice of regularization parameters.

Bartlett & Tewari (2009) developed an algorithm, REGAL, with optimal regret for weakly communicating MDPs. REGAL relies heavily on regularization; based on all prior experience, the algorithm continually updates a set $\mathcal{M}$ that, with high probability, contains the true MDP. Letting $\lambda^*(M)$ denote the optimal per-step reward of the MDP $M$, the traditional optimistic exploration tactic would suggest that the agent should choose the $M'$ in $\mathcal{M}$ with maximal $\lambda^*(M')$. REGAL adds a regularization term to this maximization to prevent overfitting based on the experiences so far, resulting in state-of-the-art regret bounds.
The focus of applying regularization to policies is to limit the complexity of the policy class being searched in the planning process. It is this approach that we adopt in the present paper.

Somani et al. (2013) explored how regularization can help online planning for Partially Observable Markov Decision Processes (POMDPs). They introduced the DESPOT algorithm (Determinized Sparse Partially Observable Tree), which constructs a tree that models the execution of all policies on a number of sampled scenarios (rollouts). However, the authors note that DESPOT typically succumbs to overfitting, as a policy that performs well on the sampled scenarios is not likely to perform well in general. The work proposes a regularized extension of DESPOT, R-DESPOT, where regularization takes the form of balancing the performance of the policy on the samples against the complexity of the policy class. Specifically, R-DESPOT imposes a regularization penalty on the utility of each node in the belief tree. The algorithm then computes the policy that maximizes regularized utility for the tree using a bottom-up dynamic programming procedure on the tree. The approach is similar to ours in that it also limits policy complexity through regularization, but focuses on regularizing utility instead of regularizing the use of a transition model. Investigating the interplay between these two approaches poses an interesting direction for future work. In a similar vein, Thomas et al. (2015) developed a batch RL algorithm with a probabilistic performance guarantee that limits the complexity of the policy class as a means of regularization.

Petrik & Scherrer (2008) conducted an analysis similar to that of Jiang et al. (2015). Specifically, they investigated the situations in which using a lower-than-actual discount factor can improve solution quality given an approximate model, noting that this procedure has the effect of regularizing rewards. That work also advanced the first bounds on the error of using a smaller discount factor.
7. Conclusion
For three different regularization methods (decreased discounting, increased exploration, and decreased policy complexity), we found a consistent U-shaped tradeoff between the size of the policy class being searched and its performance on a learned model. Future work will evaluate other methods such as dropout and early stopping.

The plots that varied $\epsilon$ and $\tilde{\gamma}$ were quite similar, raising the possibility that epsilon-greedy action selection is functioning as another way to decrease the effective horizon depth used in planning: chaining together random actions makes future states less predictable, so they carry less weight. Later work can examine whether jointly choosing $\epsilon$ and $\tilde{\gamma}$ is more effective than setting only one at a time.

More work is needed to identify methods that can learn in much larger domains (Bellemare et al., 2013). One concept worth considering is adapting regularization non-uniformly to the state space. That is, it should be possible to modulate the complexity of the policies considered in parts of the state space where the model is more accurate, allowing more expressive plans in some places than others.

References
Baird, Leemon et al. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37, 1995.

Bartlett, Peter L. and Tewari, A. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 35–42, 2009.

Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bradtke, Steven J. and Barto, Andrew G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1–3):33–57, 1996.

Brafman, Ronen I. and Tennenholtz, Moshe. R-MAX: A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.

Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym, 2016.

Dayan, Peter and Sejnowski, Terrence J. Exploration bonuses and dual control. Machine Learning, 25:5–22, 1996.

Ernst, Damien, Geurts, Pierre, and Wehenkel, Louis. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.

Farahmand, Amir-massoud. Regularization in Reinforcement Learning. PhD thesis, University of Alberta, 2011.

Farahmand, Amir-massoud and Szepesvári, Csaba. Model selection in reinforcement learning. Machine Learning, 85:299–332, 2011.

Farahmand, Amir Massoud, Ghavamzadeh, Mohammad, Szepesvári, Csaba, and Mannor, Shie. Regularized policy iteration. In NIPS, pp. 441–448, 2008a.

Farahmand, Amir Massoud, Ghavamzadeh, Mohammad, Szepesvári, Csaba, and Mannor, Shie. Regularized fitted Q-iteration: Application to planning. Lecture Notes in Computer Science, 5323 LNAI:55–68, 2008b.

Farahmand, Amir Massoud, Ghavamzadeh, Mohammad, Szepesvári, Csaba, and Mannor, Shie. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In Proceedings of the American Control Conference, pp. 725–730, 2009.

Hessel, Matteo, Modayil, Joseph, van Hasselt, Hado, Schaul, Tom, Ostrovski, Georg, Dabney, Will, Horgan, Dan, Piot, Bilal, Azar, Mohammad Gheshlaghi, and Silver, David. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.

Jiang, Nan, Kulesza, Alex, Singh, Satinder, and Lewis, Richard. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181–1189, 2015.

Johns, J., Painter-Wakefield, C., and Parr, R. Linear complementarity for regularized policy evaluation and improvement. Advances in Neural Information Processing Systems, 23:1009–1017, 2010.

Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. In Proceedings of the 15th International Conference on Machine Learning, pp. 260–268, 1998. URL citeseer.nj.nec.com/kearns98nearoptimal.html.

Kingma, Diederik P. and Ba, Jimmy Lei. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kolter, J. Zico and Ng, Andrew Y. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pp. 1–8, 2009.

Littman, Michael L. and Szepesvári, Csaba. A generalized reinforcement-learning model: Convergence and applications. In Saitta, Lorenza (ed.), Proceedings of the Thirteenth International Conference on Machine Learning, pp. 310–318, 1996.

Liu, Bo, Mahadevan, Sridhar, and Liu, Ji. Regularized off-policy TD-learning. In Advances in Neural Information Processing Systems, pp. 836–844, 2012.

Loth, Manuel, Davy, Manuel, and Preux, Philippe. Sparse temporal difference learning using LASSO. pp. 352–359, IEEE, 2007.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin A., Fidjeland, Andreas, Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

Mnih, Volodymyr, Badia, Adrià Puigdomènech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P., Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In ICML, 2016.

Ng, Andrew Y., Kim, H. Jin, Jordan, Michael I., and Sastry, Shankar. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems 16 (NIPS-03), 2003.

Petrik, Marek and Scherrer, Bruno. Biasing approximate dynamic programming with a lower discount factor. Advances in Neural Information Processing Systems (NIPS), 1:1–8, 2008.

Petrik, Marek, Taylor, Gavin, Parr, Ron, and Zilberstein, Shlomo. Feature selection using regularization in approximate linear programs for Markov decision processes. arXiv preprint arXiv:1005.1860, 2010.

Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.

Rummery, G. A. Problem Solving with Reinforcement Learning. PhD thesis, Cambridge University Engineering Department, 1994.

Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov, Oleg. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

Schweitzer, Paul J. and Seidmann, Abraham. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110(2):568–582, 1985.

Somani, A., Ye, Nan, Hsu, D., and Lee, W. S. DESPOT: Online POMDP planning with regularization. In Advances in Neural Information Processing Systems, pp. 1–9, 2013.

Strehl, Alexander L., Li, Lihong, and Littman, Michael L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Sutton, Richard S., McAllester, David, Singh, Satinder, and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pp. 1057–1063, 2000.

Taylor, Gavin and Parr, Ronald. Kernelized value function approximation for reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pp. 1–8, 2009.

Thomas, Philip, Theocharous, Georgios, and Ghavamzadeh, Mohammad. High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2380–2388, 2015.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

Williams, Ronald J. and Baird, Leemon C. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Citeseer, 1993.

Xu, Xin, Xie, Tao, Hu, Dewen, and Lu, Xicheng. Kernel least-squares temporal difference learning, 2005.