A General Markov Decision Process Framework for Directly Learning Optimal Control Policies
A Control-Model-Based Approach for Reinforcement Learning
Yingdong Lu, Mark S. Squillante, Chai W. Wu
Mathematical Sciences, IBM Research, Yorktown Heights, NY 10598, USA
{yingdong, mss, cwwu}@us.ibm.com
Abstract
We consider a new form of model-based reinforcement learning methods that directly learn the optimal control parameters, instead of learning the underlying dynamical system. This includes a form of exploration and exploitation in learning and applying the optimal control parameters over time. It also includes a general framework that manages a collection of such control-model-based reinforcement learning methods running in parallel and that selects the best decision from among these parallel methods, with the different methods interactively learning together. We derive theoretical results for the optimal control of linear and nonlinear instances of the new control-model-based reinforcement learning methods. Our empirical results demonstrate and quantify the significant benefits of our approach.
Over the past many years, reinforcement learning (RL) has proven to be very successful in solving a wide variety of problems involving learning and decision making under uncertainty. This includes problems such as those related to game playing (e.g., Tesauro [20], Togelius et al. [21]), bicycle riding (e.g., Randlov and Alstrom [13]), and robotic control (e.g., Riedmiller et al. [15]). Many different RL approaches, with varying levels of success, have been developed to address these problems [8, 19, 18]. Among these different approaches, model-free RL has been demonstrated to learn and solve various problems without any prior knowledge (e.g., Randlov and Alstrom [13] and Mnih et al. [11]). Such model-free approaches, however, often suffer from high sample complexity that can require an inordinate amount of samples for some problems, which can be prohibitive in practice, especially for those problems limited by time or other constraints. Model-based RL has been demonstrated to significantly reduce sample complexity and has been shown to outperform model-free approaches for various problems (e.g., Deisenroth and Rasmussen [6] and Meger et al. [10]). Such model-based approaches, however, can often suffer from the difficulty of learning an appropriate model and from worse asymptotic performance than model-free approaches due to model bias, which stems from inherently assuming that the learned system dynamics model accurately represents the true system environment (e.g., Atkeson and Santamaria [3], Schneider [17], and Schaal [16]). While model-free approaches can asymptotically achieve better performance because they are not limited by the accuracy of the system model, this again comes at the potential expense of significantly higher sample complexity. Some recent work has sought to address these issues by initializing a model-free approach with a proposed model-based approach (e.g., Nagabandi et al. [12]).

In this paper we consider a novel general approach for RL comprising two key aspects. The first aspect is our proposal of alternative model-based RL methods that, instead of learning a system dynamics model, learn an optimal control model for a general underlying (unknown) dynamical system and directly apply the corresponding optimal control from the model. Many traditional
model-based RL methods, after learning the system dynamics model, which is often of high complexity and dimensionality, then use this system dynamics model to compute an optimal solution of a corresponding dynamic programming problem, often applying model predictive control (e.g., Nagabandi et al. [12]). In contrast, our new alternative model-based RL methods learn the parameters of an optimal control model, often of lower complexity and dimensionality, from which the optimal solution is directly obtained. Furthermore, we establish that our control-model-based (CMB) RL approach converges to the optimal solution analogously to model-free RL approaches, while eliminating the problems of model bias in traditional model-based RL approaches.

The second aspect of our model-based RL approach is a general framework that supports multiple CMB RL methods running in parallel. This framework includes a metacontroller that continually selects the best decision from among the different CMB RL methods over time and that manages the interactions among the different CMB RL methods through the novel concept of a control table, which further allows the methods to interactively learn from each other. In doing so, our framework aims to reap the advantages of each of the different CMB RL methods running in parallel while minimizing any disadvantages of these different CMB RL methods. The resulting benefits include closer and more general interactions among the different RL methods than those based on initializing a model-free approach with a model-based approach as in [12].

To the best of our knowledge, this paper presents the first proposal and derivation of such general CMB RL methods and general frameworks for parallel CMB RL methods, an approach that should be exploited to a much greater extent in the RL literature. In the remainder of this paper, we first devise our new CMB RL methods in which the parameters of both linear and nonlinear optimal control models are learned, including the convergence of our CMB methods to the optimal solution. We then present our general framework that manages multiple CMB RL methods running in parallel. Lastly, we present empirical results for a couple of classical problems in the OpenAI Gym framework [5] that demonstrate and quantify the significant benefits of our general approach over existing algorithms. We will release the corresponding Python code on GitHub and make it publicly available.

The dynamic programming formulation for the RL problems of interest can be expressed as

\[
\min_{u_1,\ldots,u_T} \; \sum_{t=1}^{T} E[c(x_t, u_t)], \quad \text{s.t.} \quad x_t = f(x_{t-1}, u_{t-1}), \tag{1}
\]

where x_t represents the state of the system, u_t represents the control variables, f(·,·) represents the evolution function of the dynamical system characterizing the system state given the previous state and the action taken, together with unknown uncertainty, and c(·,·) represents a cost-based objective function of both the system state and control action. Alternatively, the dynamic programming formulation associated with the RL problem of interest can be expressed as
\[
\max_{u_1,\ldots,u_T} \; \sum_{t=1}^{T} E[r(x_t, u_t)], \quad \text{s.t.} \quad x_t = f(x_{t-1}, u_{t-1}), \tag{2}
\]

where r(·,·) represents a reward-based objective function of both the system state and control action, with all other variables and functions as given above.

We note that (1) and (2) can represent a wide variety of RL problems and corresponding dynamic programming problems based on the different forms taken by the evolution function f(·,·). For example, a linear system dynamics model results when f(·,·) takes the form of linear transformations. The function f(·,·) can also characterize the discretized evolutionary system dynamics governed by (partial) differential equations. In addition, the cost function c(·,·) and reward function r(·,·) are also allowed to take on various general forms, and thus can represent any combination of cumulative and terminal costs or rewards, respectively. Both formulations (1) and (2) can also be analogously defined in continuous time. On the other hand, most classical RL formulations assume the dynamics evolve in discrete time. When the underlying system is based on a continuous-time model, a discretization operator such as forward Euler discretization is used to generate the discrete-time samples.

Traditional model-based RL methods seek to first learn the system dynamics model of high complexity and dimensionality, and then incur the additional overhead of computing the optimal solution to (1) or (2) with respect to (w.r.t.) the learned system dynamics model. In contrast, our CMB RL methods have more of an optimal control-theoretic focus, seeking to learn the parameters of an optimal control model of lower complexity and dimensionality for a general underlying dynamical system, and then directly applying the corresponding optimal control policy from the model.

We begin with a derivation of a particular general class of our CMB RL methods. By taking a first-order Taylor series as an approximation, we obtain the linearized dynamical system f(x, u) ≈ f(x_0, u_0) + ∇_x f(x_0, u_0) x + ∇_u f(x_0, u_0) u. The corresponding control can be assumed to take the form u(y) = By in the case of a linear control model, and the form u(y) = By + C (y′Dy) in the case of a simple nonlinear control model, where y = K(x) is the output process representing a measurement of the state variable (which can be linear or nonlinear), C and D are matrices of the proper dimensions, and y′ denotes the transpose of the vector y.

We then need to learn the matrix B in the case of the linear control model, and the matrices B, C, D in the case of the simple nonlinear control model, based on the sample measurements satisfying the corresponding expressions above. Note that B, C and D can be updated in parallel (w.r.t. a given state, an action taken, and a corresponding outcome) as part of the learning process, thus supporting close interactions and collaborative learning among the CMB RL methods. To facilitate this individual and collaborative learning, we introduce the novel concept of a control table, or c-table, that is somewhat analogous to the Q-table but differs in several important ways, including its construction and usage. Alternative forms of RL can be applied to determine the best parameters for the matrix B or the matrices B, C, D.
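To make the two control parameterizations concrete, the following minimal Python sketch evaluates u(y) for the linear and simple nonlinear models (we also include an offset vector b, which appears in the examples of Section 2.2); the function names and the dense NumPy representation are our own illustrative assumptions rather than the authors' released code.

    import numpy as np

    def linear_control(B, b, y):
        """Linear control model: u(y) = B y + b."""
        return B @ y + b

    def nonlinear_control(B, b, C, D, y):
        """Simple nonlinear (quadratic) control model: u(y) = B y + b + C (y' D y).

        y' D y is a scalar, so C has the same dimension as the control vector u.
        """
        return B @ y + b + C * float(y @ D @ y)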
In particular, we can exploit the low complexity and dimensionality of learning the parameters of the optimal control model, especially relative to the high complexity and dimensionality of learning the system dynamics model, to solve the corresponding optimization problem after a relatively small number of sample measurements. Hence, the learning problem is reduced to solving a small stochastic optimization problem in which uncertainty has to be sampled. Many different algorithms, including a combination of local search with random restarts and regression-based fitting against the c-table, can be deployed to solve these optimization problems and learn the desired control parameters based on forms of exploration and exploitation.

We note that higher-order Taylor series approximations can be considered in an analogous manner. Furthermore, although the above derivation focuses on a deterministic optimal control method, a corresponding set of stochastic optimal control methods is the subject of ongoing research.

Our general CMB RL approach can be summarized as follows (see also Algorithm 1 of Appendix A).

1. Identify the reduced complexity and dimensionality of the control model based on the dimensions of the state and action vectors, possibly exploiting additional prior knowledge.
2. Initialize the matrices B* = B_0 (linear) or B* = B_0, C* = C_0, D* = D_0 (nonlinear).
3. Execute the system for the e-th episode of the task using the current control parameters B_e (linear) or B_e, C_e, D_e (nonlinear), generating a sequence of state, action, reward tuples.
4. Exploit the c-table to record the best possible reward for each state-action pair (a sketch of this update follows this discussion).
5. When the episode yields an improvement in the total reward, update the control-model parameters B* = B_e (linear) or B* = B_e, C* = C_e, D* = D_e (nonlinear).
6. Identify an alternative candidate for the control model B_{e+1} (linear) or B_{e+1}, C_{e+1}, D_{e+1} (nonlinear), based on one of several available options.
7. Increment e and repeat from Step 3 until B* or B*, C*, D* satisfies the tolerance.

Prior knowledge (e.g., expert opinion, mathematical models), when available, can also be exploited in Step 1 to bound the degrees of freedom of the problem; in such cases, we simply take advantage of the additional prior knowledge and boundary conditions available in many real-world problems. Initial conditions can be determined from prior solutions of similar problems, determined mathematically from a simple model, or chosen randomly. Step 4 evaluates the total reward of episode e and updates the c-table based on the path of the episode. An alternative candidate for the control model can be identified in several ways based on various forms of exploration and exploitation. While our algorithm can exploit many approaches of interest, we focus in this paper on a combination of local search with random restarts and hierarchical regression-based fitting of the control matrices against the c-table. To start, we use local search with random restarts to find new candidates and fill in the c-table. Then, once the score (R-value) of the hierarchical regression reaches a level of sufficient accuracy, we switch to using a combination of local search with random restarts and hierarchical regression-based fitting, where the regression score is used to determine the candidate control model.
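To make Steps 4 and 5 concrete, the following minimal Python sketch performs the backward discounted pass over an episode and records the best value observed for each state-action pair, mirroring Update_Table_and_Model in Algorithm 1 of Appendix A; keying the c-table by raw state tuples and the discount factor gamma = 0.99 are illustrative assumptions.

    def update_c_table(c_table, episode, gamma=0.99):
        """Backward pass over one episode (cf. Update_Table_and_Model in Algorithm 1).

        episode is a list of (state, action, reward) tuples; c_table maps a
        (state-key, action) pair to the best value observed so far for that pair.
        """
        value = 0.0
        for state, action, reward in reversed(episode):
            value = reward + gamma * value            # discounted value-to-go along this path
            key = (tuple(state), action)
            if value > c_table.get(key, float("-inf")):
                c_table[key] = value                  # new best outcome for this state-action pair
            else:
                value = c_table[key]                  # propagate the better stored value backward
        return c_table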
Local search has been previously considered in linear policy optimization [9]. Our approach, however, combines local search together with hierarchical regression fitting against the control table to efficiently and effectively learn the control model parameters. The choice between the local-search and hierarchical-regression candidates can be formulated as a multi-armed bandit problem with a balance between exploration (local search) and exploitation (hierarchical regression). Although our algorithm can substitute linear regression for the hierarchical regression, we note that the latter has the advantage of partitioning the control matrices into submatrices with different parameters for the control submatrices that best fit the corresponding region of the c-table; the control model parameters are then based on the regions of the control submatrices traversed by the path in the current episode. In particular, when the c-table is adequately populated, it can be beneficial to identify control models by conducting linear regressions that treat the state and optimal-action pair as observations of independent and dependent variables. To allow different linear controls for different regions of states (e.g., these regions of states could depend on geometric or algebraic properties of the states, such as the distance to the landing zone or permutations of the directions in the lunar lander problem), hierarchical regression can be run with the introduction of new variables. Specifically, a two-level hierarchical linear regression typically takes the form Y_{ij} = β_{0,j} + β_{1,j} X_{ij} + r_{ij}, where i indexes the individual state and j indexes the different domains of the states; X and Y represent the independent and dependent variables, β_{0,j} and β_{1,j} are the intercepts and slopes that can differ across domains, and r_{ij} represents the errors. For additional details on hierarchical regression, refer to [14]. Note that software implementations of hierarchical regression are readily available in many statistical packages (e.g., R, SAS, and SPSS), as well as probabilistic programming packages such as Stan.
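The following minimal Python sketch illustrates how a candidate control matrix could be proposed each episode by combining local search with random restarts and a regression fit against the c-table; ordinary least-squares regression (scikit-learn's LinearRegression) stands in here for the hierarchical regression described above, and the thresholds, function names, and assumed c-table layout are our own illustrative choices.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def propose_next_control(best_B, c_table, r2_threshold=0.9, restart_prob=0.05, rng=None):
        """Propose the next candidate control matrix B (cf. Step 6).

        c_table is assumed to map (state tuple, action) to the best observed value;
        for each state, the highest-valued action serves as the regression target.
        """
        if rng is None:
            rng = np.random.default_rng()
        best_per_state = {}
        for (state, action), value in c_table.items():
            if state not in best_per_state or value > best_per_state[state][1]:
                best_per_state[state] = (action, value)
        states = np.array([np.asarray(s, dtype=float) for s in best_per_state])
        actions = np.array([np.atleast_1d(a) for (a, _) in best_per_state.values()])
        reg = LinearRegression().fit(states, actions)
        if reg.score(states, actions) >= r2_threshold:
            return reg.coef_                                      # exploit: regression-based candidate
        if rng.random() < restart_prob:
            return rng.standard_normal(best_B.shape)              # explore: random restart
        return best_B + 0.1 * rng.standard_normal(best_B.shape)   # explore: local perturbation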
As noted in the introduction and discussed in the literature (e.g., [12]), one advantage of model-free RL over traditional model-based RL is that the Bellman equation in the former guarantees convergence of Q-learning to the optimal policy, whereas traditional model-based RL, which learns the system dynamics model and then uses this model to compute an optimal solution of the corresponding dynamic programming problem (often applying model predictive control), may not guarantee convergence, especially if the learned system dynamics model deviates from the true dynamics of the system. On the other hand, there is no clear bound on the number of iterations that model-free RL may require to reach the optimal policy, and these approaches can often exhibit very slow convergence behavior in practice for some RL problems. This degree of high sample complexity, requiring an inordinate amount of samples, can be prohibitive in practice for such problems when each trial is costly or time-consuming or there are simply only a limited number of trials available to compute the action policy. W.r.t. our CMB RL method, we establish the following theoretical result on guaranteed convergence to an optimal stabilizing feedback control.

Theorem 2.1. Under the assumption that optimal control matrices exist, the CMB RL approach will asymptotically converge to a set of optimal control matrices.

Proof.
Consider the dynamics of a system given by ẋ = f(x, t) for the continuous-time case or by x_{n+1} = f(x_n, n) for the discrete-time case. Suppose f is Lipschitz continuous with Lipschitz constant L and the goal of the task at hand is to ensure that the trajectory of the system converges toward an equilibrium point x_0 of the system. Lyapunov stability analysis and known results from time-varying linear systems [7] show that u = Bx is a linear stabilizing feedback provided that Df(x, t) + B has eigenvalues with negative real parts bounded away from the origin and slowly varying, where Df(·,·) is the Jacobian matrix of f(·,·). In particular, −αI is a stabilizing feedback matrix for α > L.

The proof implies that a binary search on α will result in a stabilizing feedback. A more optimal search can be performed in practice to find feedback matrices that satisfy bounds, such as spectral bounds, using both gradient-based and gradient-free methods, such as genetic algorithms, differential evolution, and local-search methods. The eigenvalue conditions show that a feedback matrix within a small ball around B will still stabilize the system, and therefore such instances of our CMB optimization method are guaranteed to find a solution given sufficient time to search the space, analogous to the guaranteed convergence under model-free RL via the Bellman equation.
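As an illustration of the search suggested by the proof, the following minimal Python sketch doubles α until the feedback u = −αx stabilizes the system and then performs a binary search for a smaller stabilizing value; the stabilizes predicate (an episode-level convergence test) and all other names are hypothetical placeholders, not part of the authors' implementation.

    import numpy as np

    def find_stabilizing_alpha(stabilizes, dim, alpha_max=1024.0, iters=30):
        """Search for alpha such that the feedback u = -alpha * x stabilizes the system.

        stabilizes(B) is a hypothetical predicate that runs an episode with feedback
        u = B x and reports whether the trajectory converged toward the equilibrium.
        """
        hi = 1.0
        while hi < alpha_max and not stabilizes(-hi * np.eye(dim)):
            hi *= 2.0                        # grow until -hi * I stabilizes (any hi > L works)
        lo = 0.0
        for _ in range(iters):               # binary search for a smaller stabilizing alpha
            mid = 0.5 * (lo + hi)
            if stabilizes(-mid * np.eye(dim)):
                hi = mid
            else:
                lo = mid
        return hi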
We next illustrate these approaches using two well-known examples of RL control. In this subsection we apply our CMB RL methods to two different problems from the OpenAI Gym [5], namely Lunar Lander and Cart Pole, which also comprise the basis for our experimental results in Section 4. While exact solutions can be obtained for both problems with complete information using optimal control techniques (e.g., Anderson and Moore [2]), we seek to obtain the optimal control models based solely on the known dimensions of the state and action vectors, with all other aspects of the problem unknown. In the case of Cart Pole, as an illustrative example, we further assume that a small amount of partial, domain-based information is additionally known about the problem at hand, which we exploit within the context of a hybrid version of our CMB RL method.

Lunar Lander.
This problem is discussed in Brockman et al. [5]. In each state, characterized by an 8-dimensional state vector, there are four possible discrete actions (left, right, vertical, or no thrusters). The goal is to maximize the cumulative reward, comprising positive points for successful degrees of landing and negative points for fuel usage and crashing.

Under the assumption that the system dynamics are unknown and that a simple linear or nonlinear state feedback control is sufficient to solve the Lunar Lander problem, the goal then becomes learning the parameters of the corresponding optimal control model. Here our rationale is that the control matrix is of simpler complexity and dimensionality than the system dynamics, as the number of control inputs is smaller. In particular, we assume an (unknown) dynamical system model ẋ = f(x), and further assume that a linear control model (B, b) exists such that ẋ = f(x) + Bx + b will solve the problem of landing the spacecraft; we also analogously consider a simple nonlinear (quadratic) control model of the form ẋ = f(x) + Bx + b + C x′Dx. W.r.t. complexity and dimensionality, in the Lunar Lander problem, the system is nonlinear and the state space is of dimension 8, implying that each linearized vector field is of size 8×8, has 64 elements, and depends on the current state. This is in comparison with there being only two control dimensions (left/right and vertical), and thus the linear control matrix B and vector b (which do not depend on the state) are of size 2×8 and 2×1, respectively, having a total of only 18 elements. Similarly, the simple nonlinear control uses the scalar x′Dx in addition to the 8-dimensional state, which introduces only the additional matrix D and vector C of fixed (state-independent) size. This representative example illustrates how the complexity and dimensionality of the system dynamics model tends to be higher than that for the optimal control model, given that the physics of the problem is well known to be complex. Any additional knowledge, as we exploit next, can further restrict the degrees of freedom of the optimal control model.
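A minimal Python sketch of how such a learned control model could be applied is shown below: the 2-dimensional continuous control is computed from the 8-dimensional state and then quantized to one of the four discrete Lunar Lander actions. The component ordering of u, the thresholds, and the function name are our own illustrative assumptions; the action encoding follows the OpenAI Gym environment.

    import numpy as np

    def lunar_lander_action(x, B, b, C=None, D=None):
        """Map the 8-dimensional Lunar Lander state x to one of the 4 discrete actions.

        The 2-dimensional control u = B x + b (+ C (x' D x) for the simple nonlinear
        model) is quantized to a discrete action; thresholds are illustrative only.
        """
        u = B @ x + b
        if C is not None:
            u = u + C * float(x @ D @ x)
        vertical, lateral = u[0], u[1]
        if vertical > 0.5:
            return 2        # fire main (vertical) engine
        if lateral < -0.5:
            return 1        # fire left orientation engine
        if lateral > 0.5:
            return 3        # fire right orientation engine
        return 0            # do nothing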
Cart Pole.

This problem is discussed in Barto et al. [4]. In each state, characterized by a 4-dimensional state vector, there are two possible discrete actions (push left, push right). The goal is to maximize the score representing the number of steps that the cart pole stays upright before either falling over or going out of bounds. The task is solved when the number of steps reaches 200.

We consider two approaches for this problem. The first is analogous to the approach used for Lunar Lander, where we assume a linear control model of the form u = Bx + b. Since the actions are 1-dimensional (left, right), the matrix B and vector b are of size 1×4 and 1×1, respectively, totaling 5 elements. The solution proceeds along the lines of the above Lunar Lander description.

For the second approach, we assume that the form of the system dynamics equations is known, but not the system parameters and physical constants such as the gravitational acceleration g, the mass m and length l of the pole, the mass M of the cart, and the force F that is applied. The state vector is 4-dimensional, consisting of x, ẋ, θ and θ̇, and satisfies equations (3) and (4) in Appendix B. When linearized around the origin, which is the goal position, the function f(x) reduces to a linearized form Ax + FW, with A and W respectively given in (5) and (6) of Appendix B.

We obtain estimates of A via the forward Euler approximation of the state equation that is returned by the state transition function x_{t+1} = x_t + h(Ax_t + FW), where we further assume the discretization constant h to be unknown. Hence, with p observations of the next state given the current state, we have p nonlinear equations with unknown parameters g, m, M, l, F and h. On the other hand, in order to find the optimal control, we do not need to know the values of these parameters; rather, we solely need to know the values of the matrices A and W. We therefore can use a least squares approach to solve for the elements of A and W, and then use these learned parameters to construct an optimal control near the origin with a Linear Quadratic Regulator (LQR) [2]. Since a discrete action is assumed, the control vector is quantized to a discrete action.

We would like to point out the important difference here between this hybrid CMB approach and traditional model-based RL. Even though we use a form of the system dynamics equations, this model form is exploited solely to derive the structural properties of the matrices A and W in order to reduce the degrees of freedom of the optimal control model. Unlike traditional model-based RL, our hybrid CMB approach directly learns the system matrices A and W, together with the optimal control model, and does not learn the system dynamics parameters such as M and l; whereas the traditional approach learns these individual parameters and then computes a control policy.
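The following minimal Python sketch illustrates the least-squares estimation and LQR step just described, assuming the push direction is encoded as u_t ∈ {−1, +1} and that transition samples are stacked row-wise; the function names, the default Q and R weights, and the omission of the sliding-window smoothing used later in Section 4 are our own simplifications.

    import numpy as np
    from scipy.linalg import solve_discrete_are

    def estimate_transition(states, next_states, pushes):
        """Least-squares fit of x_{t+1} = x_t + h (A x_t + F W u_t), u_t in {-1, +1}.

        Only the products hA and hFW are identified; g, m, M, l, F and h themselves
        are never recovered, which is all that is needed for control.
        """
        X = np.hstack([states, pushes.reshape(-1, 1)])     # p x 5 regressors [x_t, u_t]
        Y = next_states - states                           # p x 4 targets x_{t+1} - x_t
        theta, *_ = np.linalg.lstsq(X, Y, rcond=None)      # 5 x 4 least-squares solution
        hA, hFW = theta[:4].T, theta[4].reshape(-1, 1)
        return np.eye(4) + hA, hFW                         # discrete-time (A_d, B_d)

    def lqr_gain(A_d, B_d, Q=np.eye(4), R=np.eye(1)):
        """Discrete-time LQR gain K for u = -K x near the upright equilibrium."""
        P = solve_discrete_are(A_d, B_d, Q, R)
        return np.linalg.solve(R + B_d.T @ P @ B_d, B_d.T @ P @ A_d)

The resulting continuous control u = −Kx is then quantized to the discrete push-left or push-right action, as described above.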
Summary.

A main thesis of our general CMB RL approach is that there exists a continuum of knowledge about the system that we could exploit in order to reduce the complexity of the optimal control model and speed up the learning of the optimal solution, where these different modeling and learning techniques can benefit from each other (as discussed in the next section). With no knowledge of the system dynamics, we can directly learn the control model parameters, such as (B, b) for linear control and (B, C, D, b) for simple nonlinear control. Then, with different degrees of knowledge of the form of the system dynamics equations, such as knowing that the state equations are due to Newton's laws, we can additionally infer the quasi-companion form of the relatively sparse matrix A and the structural form of W to reduce the degrees of freedom of the optimal control model.

Our general framework supports the running of multiple CMB RL methods simultaneously. This framework seeks to gain the advantages of different CMB RL methods and to minimize any disadvantages of these different RL methods by allowing the multiple RL methods to run in parallel and learn from each other, primarily through the c-table. Each CMB RL method operates exactly as in its standard RL environment, providing the next action that should be taken given the current state of the system. A metacontroller then determines which action to actually take given the set of next-action recommendations provided by each of the RL methods. The c-table and each CMB RL method are then updated in an appropriate manner with the information they maintain, based on the action actually taken and the resulting outcome. Figure 3 illustrates our general framework.

The design of the metacontroller for our general framework includes a higher-level RL problem in which the metacontroller must determine which next action to take given the set of next-action recommendations from the collection of CMB RL methods running in parallel. This can include factors such as changes in the operating environment. We focus here on having the metacontroller determine the next best action from a linear optimal control model and a simple nonlinear optimal control model, both for a general (unknown) dynamical system underlying the problem of interest. The metacontroller initially starts by using the linear-CMB RL method described in the previous section and applies its next-action recommendation for an episode. The sequence of actions taken and resulting outcomes are then recorded in the c-table and made available to both the linear-CMB RL method and the nonlinear-CMB RL method described in the previous section, each of which can update its internal information, which is also visible to the metacontroller. Iterations of the general RL framework continue in this manner, during which time the metacontroller monitors both the hierarchical regression score and the current best total reward for each CMB RL method. The metacontroller then selects the CMB RL method that is currently performing best and applies its control model parameters for the next episode. To provide an additional level of exploration and exploitation, the metacontroller will switch to one of the alternative CMB RL methods with some probability.
While our focus here is on the design of a metacontroller over the combination of these two linear-control and nonlinear-control models, for which our experimental results in Section 4 demonstrate significant performance improvements, we note that this can be extended in a similar manner to the design of a metacontroller across a broader set of CMB RL methods.

We note that the design of the metacontroller also includes support for managing the interactions among the different RL methods, which allows the different methods to closely interact and learn from each other. This is achieved through the updating of the c-table based on each episode and the use of hierarchical regression to efficiently and effectively fit the corresponding control models against the c-table, as described above. More specifically, once the metacontroller selects one of the CMB RL methods and applies its recommended next actions for the current episode, the c-table is updated accordingly and each CMB RL method updates its control model parameters based on the revised c-table. The metacontroller therefore provides support that allows the linear-CMB RL method and the simple nonlinear-CMB RL method to operate in parallel in a cooperative manner, which enables interactions and learning from each other.
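The following minimal Python sketch outlines one possible realization of this metacontroller loop; the method interface (propose, update, best_reward), the switching probability, and the reuse of the update_c_table sketch from Section 2 are all illustrative assumptions rather than the authors' implementation.

    import numpy as np

    def run_metacontroller(methods, run_episode, n_episodes=2000, switch_prob=0.1, seed=0):
        """Metacontroller loop over parallel CMB RL methods (interface assumed).

        methods maps a name to an object with propose(), update(c_table, total_reward)
        and a best_reward attribute; run_episode(control) returns (trajectory, total_reward).
        """
        rng = np.random.default_rng(seed)
        c_table = {}                                     # shared (state-key, action) -> best value
        for _ in range(n_episodes):
            # Select the currently best-performing method, with occasional exploratory switches.
            chosen = max(methods, key=lambda name: methods[name].best_reward)
            if rng.random() < switch_prob:
                chosen = list(methods)[rng.integers(len(methods))]
            trajectory, total_reward = run_episode(methods[chosen].propose())
            update_c_table(c_table, trajectory)          # shared table (see the earlier sketch)
            for method in methods.values():              # every method refits against the c-table
                method.update(c_table, total_reward)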
Figure 1: Comparison of CMB RL methods (linear and nonlinear control models) with classical Q-learning for the Lunar Lander problem.

Our general metacontroller approach for managing the interactions among the different CMB RL methods can be summarized as follows. First, we initialize the c-table comprising the four dimensions of state, next state, action, and reward. Second, at the start of each episode, the metacontroller selects a control from among the set of control models. Third, at the end of each episode, the c-table is updated according to the sequence of observations of the system under the actions of the current control model, and the control model parameters are updated based on their best fit against the latest version of the c-table that maximizes the total reward.

As noted above, hybrid approaches are possible in which our new CMB RL methods can be exploited to learn both the optimal control model parameters and structural properties of the system dynamics equations. Specifically, in Sections 2 and 4, we consider one instance of such an approach that consists of simultaneously learning a simplified lower-dimensional version of the system dynamics model and learning the control parameters of linear-control and nonlinear-control models.

In this section we present experiments for the two problems from OpenAI Gym [5] discussed above, i.e., Lunar Lander and Cart Pole. For each case the state space is continuous and, in order to use Q-learning, it is discretized to a finite set of states, where each dimension is partitioned into equally spaced bins and the number of bins depends on both the problem to be solved and the reference codebase that is used. For our CMB methods the continuous state is used. The experiments for each problem from OpenAI Gym with model-free RL methods were executed using the existing code found at [22, 1] exactly as is; our CMB RL methods described in Section 2 were implemented on top of this same existing code base, namely our algorithm starts with a combination of local search and random restarts, building up the c-table, and then alternates between local search with random restarts and hierarchical regression-based fitting of the control matrices against the contents of the c-table.

Recall that the state vector is 8-dimensional with a total of four possible actions, and the score represents the cumulative reward comprising positive points for successful degrees of landing and negative points for fuel usage and crashing. In our Q-learning experiments, the continuous state variables are each discretized into bins. Our algorithm is applied to the space of all control matrices of the sizes given in Section 2.2 (linear and simple nonlinear control) to find the optimal control directly derived from the control matrices, except for a special condition when one of the legs makes contact. The code change to the existing Q-learning codebase is straightforward and consists of replacing the function that uses a quantizer and the Q-table to determine the next action with a linear or simple nonlinear function that maps the (unquantized) state to a control vector that is quantized to an action. The total reward of an episode is used as the objective function to be maximized by our algorithm. Figure 1 plots the corresponding score results, averaged over trials, from our CMB RL methods in comparison with the corresponding results from the classical Q-learning approach using the Bellman operator.
The results plotted for our CMB RL methods represent those obtained with local search and random restarts; further improvements are obtained, consisting of some additional increases in average scores and significant reductions in the variance of scores, when regression-based fitting is included after a few thousand or so episodes. Observe from these results that, while Q-learning continues to realize a negative score on average after thousands of episodes, our CMB algorithm finds a control matrix for the simple nonlinear control model that achieves a high positive mean score after a few hundred episodes. Note that the simple nonlinear control method reaches this level sooner than the linear control method, though both CMB RL methods perform very well. Further note that constructing the table for Q-learning over a high-dimensional space can be prohibitive because of the large number of grid points and the way the table is updated (in contrast to the c-table).
Figure 2: Comparison of CMB RL methods with classical Q-learning for the Cart Pole problem. (a) Average scores for linear CMB RL and Q-learning. (b) Scores for hybrid linear CMB RL and Q-learning.

Recall that the state vector is 4-dimensional with two actions possible in each state, and the score represents the number of steps during which the cart pole stays upright before either falling over or going out of bounds. With a score of 200, the problem is considered solved and the simulation ends, i.e., no score is above 200. In our Q-learning experiments, the position and velocity are discretized into bins, as are the angle and angular velocity (with a different number of bins for each pair). We consider two versions of our CMB RL methods to find the feedback control policy. The first method is analogous to the method for Lunar Lander, which we use to directly find the linear control model matrices B and b. In Figure 2a we plot the corresponding score results, averaged over trials. Observe that our CMB RL approach is able to find the optimal control that solves the problem within a few hundred episodes, whereas Q-learning still oscillates well below the maximal score of 200.

The second version of our methods consists of expanding the base CMB RL approach to also obtain an estimate of A via the forward Euler approximation returned by the state transition function given in the cart pole example of Section 2.2. We therefore have p nonlinear equations w.r.t. any p observations of the next state given the current state. The unknown parameters are g, m, M, l, F and h, i.e., the time discretization constant h is also assumed to be unknown. Once again, our only assumption here is that we know the structural equations of the system, but not the exact details of the physics, since the timescale h and the gravitational acceleration on the Earth's surface g are considered unknown. As described above, rather than estimating these parameters, we estimate the nonzero entries of the matrices A and W directly. Applying a least squares approach on a sliding window of the past observations near the origin, we can solve for the nonzero and nonconstant parameters of A and W and update the current estimates as A_new = α A_old + (1 − α) A_est and W_new = α W_old + (1 − α) W_est. These estimates are then used to construct an optimal control near the origin based on an LQR. Since a discrete control is assumed, the control vector is quantized. In Figure 2b we plot the score of this hybrid instance of our base CMB RL method over episodes in comparison with the classical Q-learning approach using the Bellman operator. We observe that the CMB RL method quickly solves the problem, in contrast with Q-learning, which continues to take many episodes to converge to the optimal solution. For both CMB methods, the code changes are analogous to the Lunar Lander case described above, with the exception that in the second method the local search is replaced with the LQR derived from past observations.
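A minimal Python sketch of the sliding-window blending and quantized LQR control described above is given below; it reuses lqr_gain from the earlier sketch in Section 2.2, and the smoothing weight α = 0.9 and the sign convention (consistent with the u_t ∈ {−1, +1} push encoding assumed there) are illustrative assumptions.

    def hybrid_cartpole_step(x, A_old, W_old, A_est, W_est, alpha=0.9):
        """Blend the sliding-window estimates and derive the quantized LQR action.

        Implements A_new = alpha*A_old + (1-alpha)*A_est (and likewise for W), then
        applies lqr_gain from the earlier sketch; alpha is an assumed smoothing weight.
        """
        A_new = alpha * A_old + (1 - alpha) * A_est
        W_new = alpha * W_old + (1 - alpha) * W_est
        K = lqr_gain(A_new, W_new)
        u = float(-K @ x)                     # continuous LQR control
        action = 1 if u > 0 else 0            # Gym CartPole: 0 = push left, 1 = push right
        return A_new, W_new, action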
Observe from Figure 2b that, in comparison with Figure 2a, the hybrid version of our CMB RL method solves the cart pole problem faster than the base linear CMB RL method that directly obtains the optimal feedback control. This is due to the fact that the hybrid linear CMB RL method takes advantage of partial knowledge of the physical equations of the system in consideration, which therefore results in a model of lower complexity (fewer degrees of freedom), i.e., the matrices A and W are not dense, and further supports our thesis that we should exploit as much accurate prior knowledge as possible in building our optimal control model. Lastly, we note that additional improvements are obtained with both versions of our CMB RL methods in Figure 2, consisting of some additional increases in average scores and significant reductions in the variance of scores, when regression-based fitting is included together with the LQR after a few thousand or so episodes.

In this paper we considered a novel general approach for RL comprising: (1) a new form of model-based RL methods that, instead of learning a system dynamics model, directly learns optimal control model parameters; and (2) a general framework that supports multiple CMB RL methods running in parallel and interactively learning together. We presented derivations of instances of our new CMB RL methods and results on the convergence of our CMB RL approach to the optimal solution. Empirical results also demonstrated and quantified the significant benefits of our general approach.
References

[1] M. Alzantot. Solution of MountainCar OpenAI Gym problem using Q-learning. https://gist.github.com/malzantot/9d1d3fa4fdc4a101bc48a135d8f9a289, 2017.
[2] B. D. O. Anderson and J. B. Moore. Optimal Control: Linear Quadratic Methods. Prentice-Hall, Upper Saddle River, NJ, USA, 1990.
[3] C. G. Atkeson and J. C. Santamaria. A comparison of direct and model-based reinforcement learning. In Proc. International Conference on Robotics and Automation, 1997.
[4] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, Sept. 1983.
[5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016.
[6] M. Deisenroth and C. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proc. International Conference on Machine Learning, 2011.
[7] A. Ilchmann, D. Owens, and D. Pratzel-Wolters. Sufficient conditions for stability of linear time-varying systems. Systems & Control Letters, 9(2):157–163, 1987.
[8] L. Kaelbling, M. Littman, and A. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[9] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach to reinforcement learning. In NIPS, 2018.
[10] D. Meger, J. Higuera, A. Xu, P. Giguere, and G. Dudek. Learning legged swimming gaits from experience. In Proc. International Conference on Robotics and Automation, 2015.
[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Workshop on Deep Learning, 2013.
[12] A. Nagabandi, G. Kahn, R. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. ArXiv e-prints, December 2017.
[13] J. Randlov and P. Alstrom. Learning to drive a bicycle using reinforcement learning and shaping. In Proc. International Conference on Machine Learning, 1998.
[14] S. Raudenbush and A. Bryk. Hierarchical Linear Models: Applications and Data Analysis Methods. Advanced Quantitative Techniques in the Social Sciences. SAGE Publications, 2002.
[15] M. Riedmiller, T. Gabel, R. Hafner, and S. Lange. Reinforcement learning for robot soccer. Autonomous Robots, 27, 2009.
[16] S. Schaal. Learning from demonstration. In Advances in Neural Information Processing Systems, 1997.
[17] J. Schneider. Exploiting model uncertainty estimates for safe dynamic control learning. In Advances in Neural Information Processing Systems, 1997.
[18] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2011.
[19] C. Szepesvari. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, volume 4.1, pages 1–103. Morgan & Claypool, 2010.
[20] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38, 1995.
[21] J. Togelius, S. Karakovskiy, J. Koutnik, and J. Schmidhuber. Super Mario evolution. In Proc. Symposium on Computational Intelligence and Games, 2009.
[22] V. M. Vilches. Basic reinforcement learning tutorial 4: Q-learning in OpenAI Gym. https://github.com/vmayoral/basic_reinforcement_learning/blob/master/tutorial4/README.md, May 2016.
A Algorithm for General Control-Model-Based Reinforcement Learning

Algorithm 1: General CMB RL Method

Input: Initial control matrices (B_0, C_0, D_0), initial c-table c(·,·)
Output: Set of best control matrices C*_p = (B*, C*, D*), corresponding c-table c*(·,·)

Initialize C_p = newC_p = (B_0, C_0, D_0)
for episode e ∈ [1, ...) do
    (x_i, a_i, r_i)_{i ∈ [T_e]} = Run_Episode(newC_p)
    c(·,·), C_p = Update_Table_and_Model((x_i, a_i, r_i)_{i ∈ [T_e]}, c(·,·), newC_p, C_p)
    newC_p = Find_Next_Control_Model(C_p, c(·,·))
    if tolerance satisfied then
        return C_p, c(·,·)
    end
end

Function Run_Episode
Data: Current control matrices newC_p for the episode
Result: Sequence (x_i, a_i, r_i)_{i ∈ [T]} comprising the current episode under newC_p
    Execute the real system (or a simulation thereof) using newC_p to obtain the corresponding
    sample measurements for the episode.

Function Update_Table_and_Model
Data: Sequence (x_i, a_i, r_i)_{i ∈ [T]}, current c-table c(·,·), current control matrices newC_p and C_p
Result: Updated c-table c(·,·), updated C_p
    for t ranging from T down to 1 do
        r_t(t) = r(t) + γ r_t(t+1)
        if r_t(t) > c(x(t), a(t)) then
            c(x(t), a(t)) = r_t(t)
        else
            r_t(t) = c(x(t), a(t))
        end
    end
    Update C_p with newC_p if newC_p performs better.
    return updated c(·,·) and C_p

Function Find_Next_Control_Model
Data: Current best control matrices C_p, current c-table c(·,·)
Result: Next control matrices newC_p
    Apply hierarchical regression to obtain the best fit for (B, C, D) based on c(·,·)
    Compute the overall score δ of the hierarchical regression
    if δ < µ then
        Set newC_p based on local search with random restart
    else
        Set newC_p based on (B, C, D)
    end
    return newC_p

B Form of System Dynamics Equations: Cart Pole
For the second approach considered in the cart pole example of Section 2.2, we assume that the form of the system dynamics equations is known, but not the system parameters and physical constants such as the gravitational acceleration g, the mass m and length l of the pole, the mass M of the cart, and the force F that is applied. The state vector is 4-dimensional, consisting of x, ẋ, θ and θ̇, and satisfies the equations

\[
\ddot{\theta} = \frac{g \sin\theta - \cos\theta \, \dfrac{F + m l \dot{\theta}^2 \sin\theta}{m + M}}{l \left( \dfrac{4}{3} - \dfrac{m \cos^2\theta}{m + M} \right)}, \tag{3}
\]

\[
\ddot{x} = \frac{\left( F + m l \dot{\theta}^2 \sin\theta \right) - m l \ddot{\theta} \cos\theta}{m + M}. \tag{4}
\]

When linearized around the origin, which is the goal position, the function f(x) reduces to the linearized form Ax + FW, where

\[
A = \begin{pmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & \dfrac{g m}{m - \frac{4}{3}(M + m)} & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & \dfrac{g (M + m)}{l \left( -m + \frac{4}{3}(M + m) \right)} & 0
\end{pmatrix}, \tag{5}
\]

\[
W = \begin{pmatrix}
0 \\
\dfrac{1}{M + m} - \dfrac{m}{(M + m)\left( m - \frac{4}{3}(M + m) \right)} \\
0 \\
\dfrac{1}{l \left( m - \frac{4}{3}(M + m) \right)}
\end{pmatrix}. \tag{6}
\]

C General Framework for Control-Model-Based Reinforcement Learning
[Figure 3: The general CMB RL framework, in which a metacontroller selects among multiple CMB RL methods that interact and learn from each other through the shared c-table.]