Combining system identification with reinforcement learning-based MPC
Andreas B. Martinsen, Anastasios M. Lekkas and Sébastien Gros
Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU), O. S. Bragstads plass 2D, 7491 Trondheim, Norway. E-mails: {andreas.b.martinsen, anastasios.lekkas, sebastien.gros}@ntnu.no

Abstract:
In this paper we propose and compare methods for combining system identification (SYSID) and reinforcement learning (RL) in the context of data-driven model predictive control (MPC). Assuming a known model structure of the controlled system, and considering a parametric MPC, the proposed approach simultaneously: a) learns the parameters of the MPC using RL in order to optimize performance, and b) fits the observed model behaviour using SYSID. Six methods that avoid conflicts between the two optimization objectives are proposed and evaluated using a simple linear system. Based on the simulation results, the hierarchical, parallel projection, nullspace projection, and singular value projection methods achieved the best performance.
Keywords:
Reinforcement Learning, Model predictive control, System identification

1. INTRODUCTION

Reinforcement Learning (RL) is a powerful tool for tackling Markov Decision Processes (MDP) without depending on a model of the probability distributions underlying the state transitions. Most RL methods rely purely on observed state transitions and realizations of the stage cost in order to increase the performance of the control policy. RL has drawn increasing attention due to recent high-profile accomplishments made possible using function approximators (Busoniu et al., 2017). Notable examples include performing at superhuman levels in games such as Go, chess and Atari (Silver et al., 2017a,b; Mnih et al., 2013), and robots learning to walk, fly without supervision, and perform complex manipulation (Wang et al., 2012; Abbeel et al., 2007; Andrychowicz et al., 2018). Most of these recent advances have been the result of RL with Deep Learning (DL), using Deep Neural Networks (DNNs) as function approximators. While systems controlled by DNNs show a lot of promise, they are difficult to analyze, and in turn their behaviour is difficult to certify and trust.

Model Predictive Control (MPC) is a popular approach for optimizing the closed-loop performance of complex systems subject to constraints. MPC works by solving an optimal control problem at each control interval in order to find an optimal policy. The optimal control problem seeks to minimize the sum of stage costs over a horizon, provided a model of the system and the current observed state. While MPC is a well-studied approach, and an extensive literature exists on analysing its properties (Mayne et al., 2000; Rawlings and Amrit, 2009), the closed-loop performance relies heavily on the accuracy of the underlying system model, which naturally presents challenges when significant unmodeled uncertainties are present.

In recent works, such as Gros and Zanon (2019) and Zanon and Gros (2019), RL and MPC have been combined by allowing RL to use an MPC as a function approximator. This approach makes it possible to combine the benefits of data-driven optimization from RL with the tools available for analysing and certifying the closed-loop performance of MPC. In this paper we extend the work by Gros and Zanon (2019) by using a parametric MPC as a function approximator for performing RL, and combining it with on-line system identification (SYSID). The SYSID component is added with the purpose of aiding RL when there is a large model mismatch, as well as helping to improve the accuracy of the resulting MPC trajectory prediction. The main contributions of the paper are the methods for combining the competing optimization objectives of RL and SYSID in a way that minimizes plant-model mismatch while not affecting the closed-loop performance of the MPC. This paper focuses on the Q-learning approach to RL.

The paper is organized into five sections. Section 2 gives a brief overview of data-driven MPC, reinforcement learning and system identification. Section 3 describes several approaches for combining RL and SYSID in order to avoid loss in performance due to conflicting objectives. Section 4 shows simulation results for the different proposed methods, and finally, Section 5 concludes the paper.

2. BACKGROUND
As in Gros and Zanon (2019), we will use a parametric optimization problem as a function approximator for reinforcement learning. Given a stage cost $L(x, u)$ we can express the following MPC problem:
\begin{align}
\min_{x, u, \sigma} \quad & \lambda_\theta(x_0) + \sum_{i=0}^{N-1} \gamma^i \left( L(x_i, u_i) + L_\theta(x_i, u_i) + \omega^\top \sigma_i \right) + \gamma^N V^f_\theta(x_N) \tag{1a}\\
\text{s.t.} \quad & x_{i+1} = f_\theta(x_i, u_i), \tag{1b}\\
& h(x_i, u_i) + h_\theta(x_i, u_i) \le \sigma_i, \tag{1c}\\
& x_0 = s, \tag{1d}
\end{align}
where we optimize the states $x$, actions $u$ and slack variables $\sigma$ over the time horizon $N$. In the optimization problem, $\lambda_\theta(x)$ is an initial cost modifier, $L(x, u)$ is the stage cost, $L_\theta(x, u)$ is a parametric stage cost modifier, $V^f_\theta(x)$ is a parametric terminal cost approximation, $f_\theta(x, u)$ is a parametric model approximation, $h(x, u)$ and $h_\theta(x, u)$ are the inequality constraints and inequality constraint modifiers, and $\gamma \in (0, 1]$ is the discount factor.
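To make the structure of (1) concrete, the following is a minimal sketch of how such a parametric MPC could be set up with CasADi's Opti interface. The dimensions, horizon, stage cost, bounds and weights are illustrative placeholders rather than the values used in the paper, and only a subset of the parameter vector θ (a linear parametric model and the measured state) is exposed as parameters.

```python
import casadi as ca

nx, nu, N = 2, 1, 20          # illustrative dimensions and horizon
gamma, omega = 0.99, 1e2      # discount factor and slack penalty (assumed values)

opti = ca.Opti()
x = opti.variable(nx, N + 1)      # states
u = opti.variable(nu, N)          # actions
sigma = opti.variable(nx, N)      # slack variables

# A subset of theta: a linear parametric model f_theta(x,u) = A x + B u + b,
# plus the current measured state s.
A = opti.parameter(nx, nx)
B = opti.parameter(nx, nu)
b = opti.parameter(nx, 1)
s = opti.parameter(nx, 1)

cost = 0
for i in range(N):
    stage = ca.sumsqr(x[:, i]) + 0.5 * ca.sumsqr(u[:, i]) + omega * ca.sum1(sigma[:, i])
    cost += gamma ** i * stage
    opti.subject_to(x[:, i + 1] == ca.mtimes(A, x[:, i]) + ca.mtimes(B, u[:, i]) + b)
    # relaxed state constraints h(x,u) <= sigma (illustrative box bounds)
    opti.subject_to(opti.bounded(-1 - sigma[:, i], x[:, i], 1 + sigma[:, i]))
    opti.subject_to(sigma[:, i] >= 0)
cost += gamma ** N * ca.sumsqr(x[:, N])   # stand-in for the terminal cost V^f_theta
opti.minimize(cost)
opti.subject_to(x[:, 0] == s)

opti.solver('ipopt')
# At run time: call opti.set_value(...) for A, B, b and s, then
# sol = opti.solve() and apply the first action sol.value(u[:, 0]).
```

The cost modifiers $\lambda_\theta$, $L_\theta$ and $h_\theta$ would enter the same way, as additional `opti.parameter` objects appearing in the objective and constraints.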
The goal of the RL component is to modify the parameters $\theta$ of the parametric optimization problem in order to find a policy $\pi_\theta(x)$ that minimizes the expected cumulative discounted baseline stage cost:
\[
\min_\theta \; \mathbb{E}\left[ \sum_{i=0}^{\infty} \gamma^i \bar{L}(x_i, \pi_\theta(x_i)) \right],
\]
where the baseline stage cost $\bar{L}$ is defined as:
\[
\bar{L}(x_i, u_i) = L(x_i, u_i) + \omega^\top \max(0, h(x_i, u_i)).
\]
Here the second term penalizes the constraint violations. Ideally we would like strict constraints; however, this would mean the MPC problem can become infeasible when model mismatch or disturbances cause constraint violations. In order to mitigate this problem, a slack penalty $\omega$ is used, which is chosen large enough such that the constraints are only violated when the MPC becomes infeasible. For the RL, adding slack constraints is also important, as strict constraints mean a penalty of $\infty$ for constraint violations, which most RL algorithms are not able to deal with.

Given the parametric optimization problem (1), we define the parametric action-value function as:
\begin{align}
Q_\theta(s, a) = \min_{x, u, \sigma} \quad & \text{(1a)} \tag{2a}\\
\text{s.t.} \quad & \text{(1b)--(1d)}, \tag{2b}\\
& u_0 = a, \tag{2c}
\end{align}
which trivially satisfies the fundamental equalities underlying the Bellman equation:
\begin{align}
V_\theta(s) &= \min_a Q_\theta(s, a), \tag{3}\\
\pi_\theta(s) &= \arg\min_a Q_\theta(s, a). \tag{4}
\end{align}
A classical RL approach is Q-learning (Watkins, 1989). To perform Q-learning for MPC we can use semi-gradient methods (Sutton and Barto, 2018), which are based on parameter updates driven by minimizing the temporal-difference error $\delta$:
\[
\delta_t = y_t - Q_\theta(s_t, a_t),
\]
where $y_t = \bar{L}(x_t, u_t) + \gamma V_\theta(x_{t+1})$ is the fixed target value. Defining the squared temporal-difference error as the minimization objective, and assuming that the target value is independent of the parameterization $\theta$, we get the semi-gradient update:
\[
\theta \leftarrow \theta + \alpha \delta \nabla_\theta Q_\theta(x_t, u_t), \tag{5}
\]
where $\alpha > 0$ is the learning rate. For faster learning, a second-order update can instead be used:
\[
\theta \leftarrow \theta + \alpha \delta H^{-1} \nabla_\theta Q_\theta(x_t, u_t), \tag{6}
\]
where $H = \nabla^2_\theta \left( y_t - Q_\theta(x_t, u_t) \right)^2$ is the Hessian of the squared error between the targets and the action-value function. For a batch of transitions, the problem becomes a nonlinear least-squares problem:
\[
\min_\theta \psi(\theta), \quad \text{where} \quad \psi(\theta) = \sum_t \delta_t^2,
\]
which may be solved using a Gauss-Newton method, as proposed in Zanon et al. (2019). The modified Gauss-Newton method gives the following update law:
\[
\theta \leftarrow \theta + \underbrace{\alpha \left( J_Q^\top J_Q + \lambda_Q I \right)^{-1} J_Q^\top \delta}_{:= \Delta\theta_Q}, \tag{7}
\]
where $J_Q$ is the Jacobian of the action-value function over the batch in use, and $\delta$ is the vector of temporal-difference errors:
\[
J_Q = \begin{bmatrix} \nabla_\theta Q_\theta(x_{t,1}, u_{t,1})^\top \\ \nabla_\theta Q_\theta(x_{t,2}, u_{t,2})^\top \\ \vdots \\ \nabla_\theta Q_\theta(x_{t,B}, u_{t,B})^\top \end{bmatrix}, \quad
\delta = \begin{bmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_B \end{bmatrix},
\]
over the batch $\mathcal{B} = \{ (x_{t,i}, u_{t,i}, x_{t+1,i}) \mid i \in 1 \dots B \}$. The diagonal matrix $\lambda_Q I$ is added such that $J_Q^\top J_Q + \lambda_Q I$ is positive definite, and acts as a regularization of the Gauss-Newton method.

It is worth noting that the semi-gradient Q-learning method given above yields no guarantee of finding the globally optimal parameters for nonlinear function approximators $Q_\theta$. This limitation pertains to most applications of RL relying on nonlinear function approximators, such as the commonly used DNNs. It is also worth noting that in practice the parameterization $\theta$ is limited. This means that we are in general not able to fit the Q function globally, but rather that the formulation above fits the Q function to the distribution from which the samples are drawn.
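As a concrete illustration, the regularized Gauss-Newton update (7) can be written as a small numpy helper. Here $J_Q$ and the vector of temporal-difference errors $\delta$ are assumed to be computed elsewhere, e.g. by differentiating the parametric MPC solution with respect to $\theta$ as in Gros and Zanon (2019).

```python
import numpy as np

def gauss_newton_step(J, r, lam):
    """Regularized Gauss-Newton step: (J^T J + lam*I)^{-1} J^T r."""
    n_theta = J.shape[1]
    return np.linalg.solve(J.T @ J + lam * np.eye(n_theta), J.T @ r)

# Q-learning update (7) over a batch of B transitions:
#   J_Q:   (B, n_theta) matrix with rows grad_theta Q_theta(x_i, u_i)
#   delta: (B,) vector of temporal-difference errors
# theta = theta + alpha * gauss_newton_step(J_Q, delta, lam_Q)
```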
System identification offers a large set of tools for building mathematical models of dynamic systems, using measurements of the system's input and output signals. Based on the data-driven MPC scheme outlined in the previous section, we want an on-line parameter estimation method compatible with the parametric model. A classical SYSID approach is the Prediction Error Method (PEM), where the objective is to minimize the difference between the observed state and the predicted state given the observed transition $(x_t, u_t, x_{t+1})$. For a parametric model approximation of the form:
\[
\hat{x}_{t+1} = f_\theta(x_t, u_t),
\]
the state error $e$ between the parametric model and the observed state can be expressed as follows:
\[
e_t = x_{t+1} - \hat{x}_{t+1} = x_{t+1} - f_\theta(x_t, u_t).
\]
In the simplest case, where the state vector $x$ is fully observable, PEM can be performed by minimizing the squared error between the observed state and the predicted state:
\[
\min_\theta \phi(\theta), \quad \text{where} \quad \phi(\theta) = e^\top e,
\]
where $e$ collects a batch of measurements $e_i$. This optimization problem can be tackled via gradient descent, giving the following update law:
\[
\theta \leftarrow \theta - \beta \nabla_\theta\, e^\top e,
\]
where $\beta$ is the learning rate. Since $\theta$ collects all the parameters appearing in the MPC (1), PEM is in practice only modifying the subset of the parameters $\theta$ that appear in the parametric model. For faster learning, we propose using a second-order approach and performing quasi-Newton steps on the parameters. One such method is the modified Gauss-Newton method, which for a batch of transitions reads as follows:
\[
\theta \leftarrow \theta + \underbrace{\beta \left( J_f^\top J_f + \lambda_f I \right)^{-1} J_f^\top e}_{:= \Delta\theta_f}, \tag{8}
\]
where $J_f$ is the Jacobian of the parametric system model, and $e$ is the vector of model errors over a batch $\mathcal{B} = \{ (x_{t,i}, u_{t,i}, x_{t+1,i}) \mid i \in 1 \dots B \}$:
\[
J_f = \begin{bmatrix} \nabla_\theta f_\theta(x_{t,1}, u_{t,1}) \\ \nabla_\theta f_\theta(x_{t,2}, u_{t,2}) \\ \vdots \\ \nabla_\theta f_\theta(x_{t,B}, u_{t,B}) \end{bmatrix}, \quad
e = \begin{bmatrix} x_{t+1,1} - f_\theta(x_{t,1}, u_{t,1}) \\ x_{t+1,2} - f_\theta(x_{t,2}, u_{t,2}) \\ \vdots \\ x_{t+1,B} - f_\theta(x_{t,B}, u_{t,B}) \end{bmatrix}.
\]
Similarly to RL, for a batch of transitions the problem becomes a least-squares problem, fitting the parametric model to the observed transitions.

It is worth noting that for a linear parameterization, the Gauss-Newton method converges to the least-squares solution over the batch in one step when $\beta = 1$ and $\lambda_f = 0$. It is also worth noting that since PEM only acts on a subset of the parameters $\theta$, $J_f$ is rank deficient and hence $J_f^\top J_f$ is singular by construction. Choosing the regularization term $\lambda_f > 0$ ensures that $J_f^\top J_f + \lambda_f I$ is nonsingular and hence invertible. For the regularization parameters $\lambda_f$ and $\lambda_Q$ we typically want to choose a small value, to get performance close to the pure Gauss-Newton method, while only slightly regularizing in order to avoid issues arising from a singular Hessian approximation.
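The PEM step (8) has the same regularized Gauss-Newton structure as (7); a numpy sketch with assumed shapes could look as follows.

```python
import numpy as np

def pem_gauss_newton_step(J_f, e, lam_f):
    """Regularized Gauss-Newton step (8) on the prediction-error objective."""
    n_theta = J_f.shape[1]
    return np.linalg.solve(J_f.T @ J_f + lam_f * np.eye(n_theta), J_f.T @ e)

# J_f stacks the model Jacobians grad_theta f_theta(x_i, u_i) over the batch
# (B*nx rows for an nx-dimensional state), and e stacks the one-step
# prediction errors x_{i+1} - f_theta(x_i, u_i).
# theta = theta + beta * pem_gauss_newton_step(J_f, e, lam_f)
```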
3. SYSTEM IDENTIFICATION FOR DATA-DRIVEN MPC

The Prediction Error Method and Reinforcement Learning modify the same parameter vector $\theta$, but operate using two different objectives. RL targets policy optimization by minimizing the temporal-difference error against a fixed target, while PEM fits the parametric model to the observed state transitions. A combination of the two methods then becomes a multi-objective optimization problem. The simplest approach is to directly combine the steps from both Q-learning and PEM. Using the second-order update laws in (7) and (8), with the parameter updates $\Delta\theta_Q$ and $\Delta\theta_f$ respectively, we get the following:
\[
\theta \leftarrow \theta + \alpha \Delta\theta_Q + \beta \Delta\theta_f. \tag{9}
\]
Here the step lengths $\alpha$ and $\beta$ can be thought of as the weighting between the Q-learning and SYSID objectives respectively. However, the end goal is arguably to maximize the closed-loop performance of the MPC scheme rather than to minimize the prediction error of the model; hence, if the two objectives are competing, the RL objective should be prioritized. This suggests that an alternative to the naive sum-of-update-laws approach must be considered.

In order to introduce a hierarchy between the PEM and RL objectives, we can consider the optimization problem:
\begin{align}
\min_\theta \quad & \phi(\theta), \tag{10a}\\
\text{s.t.} \quad & \nabla_\theta \psi(\theta) = 0, \tag{10b}
\end{align}
which requires $\theta$ to minimize the PEM objective while being a stationary point of the RL objective. If additionally $\nabla^2_\theta \psi(\theta) \succeq 0$, then $\theta$ is a (local) minimizer of the RL objective. The KKT conditions associated to (10) read as:
\begin{align}
\nabla_\theta \phi(\theta) + \nabla^2_\theta \psi(\theta) \lambda &= 0, \tag{11a}\\
\nabla_\theta \psi(\theta) &= 0. \tag{11b}
\end{align}
A quasi-Newton step on (11) reads as:
\begin{align}
\nabla^2_\theta \phi(\theta) \Delta\theta + \nabla^2_\theta \psi(\theta) \lambda &= -\nabla_\theta \phi(\theta), \tag{12a}\\
\nabla^2_\theta \psi(\theta) \Delta\theta &= -\nabla_\theta \psi(\theta). \tag{12b}
\end{align}
Let us consider a (possibly $\theta$-dependent) nullspace / fullspace decomposition of the RL Hessian $\nabla^2_\theta \psi$, i.e. $N$, $F$ such that:
\[
\nabla^2_\theta \psi(\theta) N = 0, \quad [N \; F] \text{ full rank}, \quad N^\top F = 0, \tag{13}
\]
and the associated decomposition of the primal step $\Delta\theta$:
\[
\Delta\theta = N n + F f. \tag{14}
\]
We then observe that the primal quasi-Newton step given by (12) can be decomposed into:
\begin{align}
N^\top \nabla^2_\theta \phi\, N n + N^\top \nabla^2_\theta \phi\, F f &= -N^\top \nabla_\theta \phi, \tag{15a}\\
F^\top \nabla^2_\theta \psi\, F f &= -F^\top \nabla_\theta \psi. \tag{15b}
\end{align}
One can then verify that:
\begin{align}
n &= -\left( N^\top \nabla^2_\theta \phi\, N \right)^{\dagger} \left( N^\top \nabla_\theta \phi + N^\top \nabla^2_\theta \phi\, F f \right), \tag{16a}\\
f &= -\left( F^\top \nabla^2_\theta \psi\, F \right)^{-1} F^\top \nabla_\theta \psi, \tag{16b}
\end{align}
where $(\cdot)^{\dagger}$ stands for the Moore-Penrose pseudo-inverse. Let us label:
\[
\nabla^2_\theta \phi^{\dagger\perp} = N \left( N^\top \nabla^2_\theta \phi\, N \right)^{\dagger} N^\top, \tag{17}
\]
the pseudo-inverse of the SYSID Hessian $\nabla^2_\theta \phi$ projected in the nullspace of the RL Hessian. Let us additionally label:
\[
\Delta\theta_{HQ} = F f, \quad \Delta\theta_{Hf} = N n. \tag{18}
\]
The primal step $\Delta\theta$ then reads as:
\begin{align}
\Delta\theta &= \Delta\theta_{HQ} + \Delta\theta_{Hf}, \tag{19a}\\
\Delta\theta_{HQ} &= -F \left( F^\top \nabla^2_\theta \psi\, F \right)^{-1} F^\top \nabla_\theta \psi = -\left(\nabla^2_\theta \psi\right)^{\dagger} \nabla_\theta \psi, \tag{19b}\\
\Delta\theta_{Hf} &= -\nabla^2_\theta \phi^{\dagger\perp} \nabla_\theta \phi - \nabla^2_\theta \phi^{\dagger\perp} \nabla^2_\theta \phi\, \Delta\theta_{HQ}. \tag{19c}
\end{align}
In practice, pseudo-inverses are not always numerically stable. In order to alleviate this potential issue, we can use regularizations of the PEM and RL Hessians instead, i.e. we can select $\lambda_Q, \lambda_f > 0$ and use:
\begin{align}
\Delta\theta_{HQ} &= -\left( \nabla^2_\theta \psi + \lambda_Q I \right)^{-1} \nabla_\theta \psi, \tag{20a}\\
\nabla^2_\theta \phi^{\dagger\perp} &= N \left( N^\top \left( \nabla^2_\theta \phi + \lambda_f I \right) N \right)^{-1} N^\top, \tag{20b}
\end{align}
together with (19c) instead of (19b) and (17). For $\lambda_Q, \lambda_f \to 0$, (20) together with (19c) asymptotically delivers the same steps as (19).
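A numpy sketch of the regularized hierarchical step (19)-(20) is given below. The Hessians and gradients of the RL objective $\psi$ and the PEM objective $\phi$ are assumed to be available, e.g. from the Gauss-Newton quantities above ($H_\psi \approx 2 J_Q^\top J_Q$, $g_\psi \approx -2 J_Q^\top \delta$, and similarly for $\phi$ from $J_f$ and $e$), and the nullspace basis $N$ is obtained numerically from an SVD; the tolerance and regularization values are placeholders.

```python
import numpy as np

def hierarchical_step(H_psi, g_psi, H_phi, g_phi, lam_Q=1e-6, lam_f=1e-6, tol=1e-8):
    """Regularized hierarchical step (19)-(20): a Newton-type RL step plus a PEM
    step restricted to the (numerical) nullspace of the RL Hessian H_psi."""
    n = H_psi.shape[0]
    # Nullspace / fullspace decomposition (13) of the RL Hessian via an SVD.
    _, S, Vt = np.linalg.svd(H_psi)
    null_mask = S <= tol * max(S.max(), 1.0)
    N = Vt[null_mask].T                       # orthonormal basis of the nullspace
    # RL part, regularized as in (20a).
    d_theta_HQ = -np.linalg.solve(H_psi + lam_Q * np.eye(n), g_psi)
    if N.shape[1] == 0:                       # no nullspace: pure RL step
        return d_theta_HQ, np.zeros(n)
    # PEM part (19c), using the projected regularized pseudo-inverse (20b).
    M = N @ np.linalg.solve(N.T @ (H_phi + lam_f * np.eye(n)) @ N, N.T)
    d_theta_Hf = -M @ (g_phi + H_phi @ d_theta_HQ)
    return d_theta_HQ, d_theta_Hf

# theta = theta + d_theta_HQ + d_theta_Hf
```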
In this section we will discuss several projections we canperform in order to mitigate conflicts between the twooptimization objectives. As discussed earlier, we typicallywant the RL updates to dominate, as these are directlyrelated to the MPC closed-loop performance.
Parallel projection
We first consider a parallel projection, where the PEM step $\Delta\theta_f$ is projected along the RL step $\Delta\theta_Q$, giving the following projected PEM step:
\[
\Delta\theta_f^{\parallel} = \frac{\Delta\theta_Q \Delta\theta_Q^\top}{\Delta\theta_Q^\top \Delta\theta_Q} \Delta\theta_f.
\]
Intuitively, this projection can be thought of as an adaptive step length for the RL step, i.e. the SYSID modifies the RL step length in the direction that improves the SYSID loss.
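A short numpy sketch of this projection, with the two steps given as flat parameter vectors:

```python
import numpy as np

def parallel_projection(d_theta_Q, d_theta_f):
    """Component of the PEM step along the RL step direction."""
    q = d_theta_Q
    return (q @ d_theta_f) / (q @ q) * q
```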
Orthogonal projection
Similarly to the parallel projection, we may use the orthogonal projection:
\[
\Delta\theta_f^{\perp} = \left( I - \frac{\Delta\theta_Q \Delta\theta_Q^\top}{\Delta\theta_Q^\top \Delta\theta_Q} \right) \Delta\theta_f.
\]
The orthogonal projection is dual to the parallel projection, in that it does not affect the length of the RL step. It may however have the drawback of working against the RL step, since we do not account for the sensitivity of the RL objective in the orthogonal direction. This can easily be seen in the case where an optimum of the RL objective is achieved, i.e. $\delta \nabla_\theta Q_\theta(x, u) = 0$, where any PEM step will in general push the parameters away from the RL optimum.
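The corresponding numpy sketch simply removes the parallel component computed above:

```python
import numpy as np

def orthogonal_projection(d_theta_Q, d_theta_f):
    """Remove from the PEM step its component along the RL step."""
    q = d_theta_Q
    return d_theta_f - (q @ d_theta_f) / (q @ q) * q
```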
Nullspace projection

Based on the hierarchical optimization problem in (19c), we see that in the particular case where $\nabla^2_\theta \phi = c^{-1} \cdot I$ holds, the hierarchical optimization problem reduces to simply projecting the PEM step into the nullspace of the RL Hessian:
\begin{align}
\nabla^2_\theta \phi^{\dagger\perp} &= c\, N N^\top, \tag{21a}\\
\Delta\theta_{Hf} &= -c\, N N^\top \nabla_\theta \phi = -N N^\top \left( \nabla^2_\theta \phi \right)^{-1} \nabla_\theta \phi. \tag{21b}
\end{align}
Using this nullspace projection with the Gauss-Newton approach in (8), we get the following update law:
\[
\Delta\theta_f^{N} = N N^\top \Delta\theta_f.
\]
Choosing this simplified nullspace projection, the PEM step is projected into a direction to which the value function is not sensitive; hence the gradient step for the SYSID will not affect the primary goal of optimizing the RL objective. The nullspace projection may also be thought of as a regularization of the RL objective.
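A numpy sketch, with the nullspace basis $N$ of the (approximate) RL Hessian extracted numerically; the tolerance is a placeholder:

```python
import numpy as np

def nullspace_projection(H, d_theta_f, tol=1e-8):
    """Project the PEM step onto the nullspace of the (approximate) RL Hessian H."""
    _, S, Vt = np.linalg.svd(H)
    N = Vt[S <= tol * max(S.max(), 1.0)].T   # orthonormal nullspace basis
    return N @ (N.T @ d_theta_f)
```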
Smallest singular value projection

The nullspace projection maps the PEM step into the nullspace of $H$, i.e. the space where the singular values of $H$ are zero. As a generalization of the nullspace projection, we can project the PEM step into the subspace where the Hessian is least sensitive. Using the singular value decomposition of the Hessian of the temporal-difference loss,
\[
U \Sigma V^\top = H,
\]
we can extract an orthonormal basis $V_p$ of the directions corresponding to the $p$ smallest singular values (the last $p$ columns of $V$). The projection onto the $p$ smallest singular values is then given by:
\[
\Delta\theta_f^{S} = V_p V_p^\top \Delta\theta_f.
\]
We can alternatively choose $p$ to be the number of singular values under a certain threshold. Note that if we choose $p$ to be the number of singular values equal to zero, the projection becomes equivalent to the nullspace projection. While the nullspace projection will give no progress if $H$ is full rank, the singular value projection ensures some progress on the PEM objective, at a small cost to the RL objective.
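A numpy sketch of this generalization, with $p$ supplied by the user:

```python
import numpy as np

def smallest_sv_projection(H, d_theta_f, p):
    """Project the PEM step onto the p least-sensitive directions of H."""
    _, S, Vt = np.linalg.svd(H)        # singular values in descending order
    Vp = Vt[-p:].T                     # right singular vectors of the p smallest values
    return Vp @ (Vp.T @ d_theta_f)
```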
4. SIMULATIONS

In this section we will compare the performance of the different RL-MPC modifications proposed above. In order to gauge the results we consider the following simple linear MPC problem:
\begin{align}
\min_{x, u, \sigma} \quad & \sum_{i=0}^{N-1} \gamma^i \left( \|x_i\|^2 + \tfrac{1}{2}\|u_i\|^2 + f^\top \begin{bmatrix} x_i \\ u_i \end{bmatrix} + \omega^\top \sigma_i \right) + V_0 + \gamma^N x_N^\top S\, x_N \tag{22a}\\
\text{s.t.} \quad & x_{i+1} = A x_i + B u_i + b, \tag{22b}\\
& \begin{bmatrix} \cdot \\ \cdot \end{bmatrix} + \underline{x} - \sigma_i \le x_i \le \begin{bmatrix} \cdot \\ \cdot \end{bmatrix} + \bar{x} + \sigma_i, \tag{22c}\\
& -\cdot \le u_i \le \cdot, \tag{22d}
\end{align}
where the parameters $\theta$ of the optimization problem are given as
\[
\theta = (V_0, f, S, A, B, b, \underline{x}, \bar{x}).
\]
For the initial model parameter guess used in the MPC we have
\[
A = \begin{bmatrix} \cdot & \cdot \\ \cdot & \cdot \end{bmatrix}, \quad B = \begin{bmatrix} \cdot \\ \cdot \end{bmatrix}, \quad b = \begin{bmatrix} \cdot \\ \cdot \end{bmatrix}.
\]
Additionally, the terminal cost matrix $S$ was chosen as the solution to the discrete-time algebraic Riccati equation, while the rest of the parameters were initialized to zero.
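As an illustration of how the terminal cost can be obtained, the sketch below computes $S$ from the discrete-time algebraic Riccati equation with scipy; the model matrices and cost weights shown are placeholders, not the values used in the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Illustrative (not the paper's) initial model guess and quadratic weights.
A0 = np.array([[1.0, 0.25],
               [0.0, 1.0]])
B0 = np.array([[0.0],
               [0.25]])
Q = np.eye(2)            # weight on ||x||^2
R = 0.5 * np.eye(1)      # weight on (1/2)||u||^2

S = solve_discrete_are(A0, B0, Q, R)   # terminal cost matrix in (22a)
```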
For the real process we used the following dynamics:
\[
x_{i+1} = \begin{bmatrix} \cdot & \cdot \\ \cdot & \cdot \end{bmatrix} x_i + \begin{bmatrix} \cdot \\ \cdot \end{bmatrix} u_i + e_k,
\]
where the disturbance $e_k$ is uniformly distributed on the interval $[-\cdot,\ \cdot\,]$.

In Figure 1 the states $x$ and the action $u$ are shown for the baseline method, which only uses pure RL steps.

Fig. 1. Baseline when using only reinforcement learning (7). The optimal unconstrained solution would be to regulate the system to $x = 0$; however, due to the constraint and the disturbances, this is no longer the case.

As seen in the figure, the constraints on the first state are violated in the beginning, but by updating the parameters using RL the system quickly learns to avoid the constraints, while at the same time staying as close to them as possible in order to minimize the discounted stage cost.

Running the on-line RL together with the proposed PEM methods, we get the results seen in Figures 2, 3 and 4. Figure 2 shows the moving average stage cost, which is a good measure of the closed-loop performance of the MPC. From the results we see that the hierarchical, parallel, singular value and nullspace projections all converge to a slightly better performance than the baseline, while the orthogonal projection and the weighted sum of steps perform worse than the baseline. The drop in performance of the orthogonal projection, the weighted sum of steps, and to a certain degree the parallel projection, is the result of competing objectives. This is also reflected in the parameter error seen in Figure 3, where the model fit comes at the expense of closed-loop performance. Looking at the temporal-difference error in Figure 4, we see that most of the proposed methods give faster initial convergence. This is a result of the reduced plant-model mismatch, which in turn gives better value function estimates from the MPC.

A similar observation can be made in Figures 5 and 6, where the initial model parameters were chosen as a double integrator, giving a larger plant-model mismatch. From the results we see that all the proposed methods have better initial convergence of the closed-loop performance, with the parallel, singular value and nullspace projections also giving better final closed-loop performance. For the parameter error, we see a significant improvement for all methods, except for the hierarchical and nullspace projections, in comparison with the baseline. The results are in line with the constraints imposed by the different projections, with the hierarchical and nullspace projections being the most conservative, and the summation of gradients being the least conservative.
Fig. 2. Moving average stage cost over 100 steps. Jumps/steps in performance indicate constraint violations, which result in a large cost.
Fig. 3. Norm of the parameter error for the model parameters $A$, $B$ and $b$. Lower error means the parametric model in the MPC is closer to the simulated model.
Fig. 4. Moving average absolute temporal difference error $|\delta|$ over 100 steps.
Fig. 5. Moving average stage cost over 100 steps using poor initial model parameters; a clear improvement in performance is seen in the early learning stage.
Fig. 6. Norm of the parameter error for the parametric model with poor initial model parameters.

5. CONCLUSION

In this paper we proposed and tested a number of strategies for combining RL, PEM and data-driven MPC in order to perform on-line learning and control. The main contribution is the addition of PEM as an on-line system identification method, included in order to aid the RL when there is a large plant-model mismatch, as well as to help obtain better accuracy from the resulting MPC trajectory prediction. The proposed parallel, singular value and nullspace projection methods show promising results in terms of decreasing plant-model mismatch and giving slightly better closed-loop MPC performance than using pure RL, while the orthogonal projection and the sum of steps resulted in an improved model fit, however at the cost of the closed-loop performance of the proposed MPC scheme. In conclusion, combining PEM with RL can give better initial learning when we do not have a good initial guess for the parameters, as well as lead to better overall performance of the closed-loop MPC, without significant additional computational overhead.

For future work, it is of interest to look at methods for adaptively changing the step lengths of the two objectives. For example, choosing a step length $\beta$ dependent on the RL step may help mitigate the problem of competing objectives, and in turn improve the performance of the proposed methods. Combining the proposed method with policy gradient is also an area of interest, as policy gradient methods offer a way of directly optimizing the policy.

REFERENCES
Abbeel, P., Coates, A., Quigley, M., and Ng, A.Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems, 1-8.

Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. (2018). Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.

Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D. (2017). Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press.

Gros, S. and Zanon, M. (2019). Data-driven economic NMPC using reinforcement learning. IEEE Transactions on Automatic Control.

Mayne, D.Q., Rawlings, J.B., Rao, C.V., and Scokaert, P.O. (2000). Constrained model predictive control: Stability and optimality. Automatica, 36(6), 789-814.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Rawlings, J.B. and Amrit, R. (2009). Optimizing process economic performance using model predictive control. In Nonlinear Model Predictive Control, 119-138. Springer.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2017a). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017b). Mastering the game of Go without human knowledge. Nature, 550(7676), 354.

Sutton, R.S. and Barto, A.G. (2018). Reinforcement Learning: An Introduction. MIT Press.

Wang, S., Chaovalitwongse, W., and Babuska, R. (2012). Machine learning algorithms in bipedal robot control. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(5), 728-743.

Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University.

Zanon, M. and Gros, S. (2019). Safe reinforcement learning using robust MPC. arXiv preprint arXiv:1906.04005.

Zanon, M., Gros, S., and Bemporad, A. (2019). Practical reinforcement learning of stabilizing economic MPC. arXiv preprint arXiv:1904.04614.