Reinforcement Learning of the Prediction Horizon in Model Predictive Control
Eivind Bøhn ∗, Sebastien Gros ∗∗, Signe Moe ∗, Tor Arne Johansen ∗∗,∗∗∗

∗ SINTEF Digital, Oslo, Norway (e-mail: {eivind.bohn, signe.moe}@sintef.no)
∗∗ Department of Engineering Cybernetics, NTNU, Trondheim, Norway (e-mail: [email protected])
∗∗∗ Centre for Autonomous Marine Operations and Systems (e-mail: [email protected])
Abstract:
Model predictive control (MPC) is a powerful trajectory optimization control technique capable of controlling complex nonlinear systems while respecting system constraints and ensuring safe operation. The MPC's capabilities come at the cost of a high online computational complexity, the requirement of an accurate model of the system dynamics, and the necessity of tuning its parameters to the specific control application. The main tunable parameter affecting the computational complexity is the prediction horizon length, controlling how far into the future the MPC predicts the system response and thus evaluates the optimality of its computed trajectory. A longer horizon generally increases the control performance, but requires an increasingly powerful computing platform, excluding certain control applications. The performance sensitivity to the prediction horizon length varies over the state space, and this motivated the adaptive horizon model predictive control (AHMPC), which adapts the prediction horizon according to some criteria. In this paper we propose to learn the optimal prediction horizon as a function of the state using reinforcement learning (RL). We show how the RL learning problem can be formulated and test our method on two control tasks, showing clear improvements over the fixed horizon MPC scheme while requiring only minutes of learning.
Keywords:
Adaptive horizon model predictive control, Reinforcement learning control

1. INTRODUCTION

Model predictive control (MPC) is a well studied and widely adopted control technique, particularly in the process control industry. Its popularity stems in large part from its ability to control complex systems while respecting system constraints, ensuring safe operation. It operates by solving an optimal control problem (OCP) with the current state of the plant as the initial condition, and using a model of the plant to predict the plant response to the controlled variables. In this way it finds the sequence of control inputs that minimizes the objective function over the prediction horizon while remaining feasible, in the sense that the trajectories remain within the specified constraints. The first control input of the solution sequence is applied to the plant, and the MPC then solves the OCP again at the next sampling instance. The drawbacks of the MPC framework are that the quality of the input sequence relies heavily on the accuracy of the model of the plant dynamics, that the hyperparameters of the MPC need to be fine-tuned to the task at hand, and that the computational complexity of solving the OCP is fairly high, limiting the type of platforms and applications that can implement MPC.
This work was financed by grants from the Research Council of Norway (PhD Scholarships at SINTEF grant no. 272402, and NTNU AMOS grant no. 223254).
The prediction horizon length is a key parameter of the MPC framework. In conjunction with the step size, it controls how far into the future the controller evaluates the consequences of its actions. If chosen too short, the computed trajectories are myopic in nature and might lead to instability and poor approximations of the infinite horizon solution, while the computational complexity grows at best linearly with increasing prediction horizon. Moreover, different regions of the state space might have varying requirements on the horizon length for stability and for finding nearly optimal trajectories. This observation motivated the adaptive horizon model predictive control (AHMPC). In Michalska and Mayne (1993) the horizon is adapted so that a terminal constraint is satisfied and the system enters a known region of attraction of a second terminal controller. Krener (2018) proposes a heuristics-based approach, presenting one ideal but not implementable approach, and one practical method using iterative deepening search where stability criteria are checked on each iteration to determine the lowest stabilizing horizon. A more direct approach is presented in Scokaert and Mayne (1998), where the prediction horizon is included as a decision variable of the MPC scheme. Gardezi and Hasan (2018) propose a learning based approach in which they construct a rich dataset of numerous combinations of states and MPC computations with varying horizons, and then apply supervised learning on this dataset to develop an optimal horizon predictor.

Reinforcement learning (RL) (Sutton and Barto, 2018) is a field of machine learning concerned with optimal sequential decision making. While RL has proven to be the state-of-the-art approach for certain classes of problems such as game-playing (Schrittwieser et al., 2020), it has not seen many real world applications in control. This is in large part due to its data intensive nature, combined with its inability to handle constraints and therefore lack of guarantees for safe operation of the system, both in the learning stage and in production. However, RL can be employed for control in a safe manner by using it to augment existing control techniques such as MPC (Aswani et al., 2013; Fisac et al., 2018; Zanon and Gros, 2020), e.g. to learn the system dynamics (Nagabandi et al., 2018) or tune parameters (Mehndiratta et al., 2018).

In this paper we propose to learn the optimal prediction horizon length of the MPC scheme as a function of the state using RL. To the best of our knowledge, this is the first work to employ RL for AHMPC. The contribution of this paper lies in exploring how the RL problem of optimizing the MPC prediction horizon can be formulated, and showcasing its effectiveness on two control problems. Further, we suggest to jointly learn the MPC value function due to its synergistic relationship with the prediction horizon, enhancing the adaptive capabilities. While the AHMPC approaches described earlier can be designed with favorable properties such as theoretical stability guarantees, they often assume access to privileged information such as terminal sets and control Lyapunov functions. Learned approaches on the other hand typically assume little is known, and as such are applicable to more problems.

The rest of the paper is organized as follows.
Section 2 presents the algorithms and theory employed in this paper. Section 3 presents the formulation of learning the optimal MPC prediction horizon as an RL problem, while Section 4 describes the experiments undertaken, the results of which are presented and discussed in Section 5. Finally, Section 6 concludes the paper with our thoughts on the proposed method and future prospects.

2. BACKGROUND

MPC is a model-based control method where the control inputs are obtained by solving, at every time step, an open loop finite-horizon OCP (1), using a model of the plant to predict the response to the control inputs from the current state of the plant. Solving the OCP yields a control input sequence that minimizes the objective function over the optimization horizon. The first control input of this sequence is then applied to the plant, and the OCP is solved again at the subsequent time step to get the next control input. In this paper we consider discrete-time state-feedback nonlinear constrained MPC, for which the MPC receives exact measurements of the states at equidistant points in time. It reads as:

\begin{align}
\min_{x,u}\;& \sum_{k=0}^{N-1} \gamma^k \ell(x_k, u_k, \hat{p}_k) + \gamma^N m(x_N) \tag{1a}\\
\text{s.t.}\;& x_0 = \bar{x} \tag{1b}\\
& x_{k+1} = f(x_k, u_k, \hat{p}_k), \quad \forall k \in \{0, \dots, N-1\} \tag{1c}\\
& H(x_k, u_k) \le 0, \quad \forall k \in \{0, \dots, N-1\} \tag{1d}
\end{align}

We use ∗ to indicate solving for the arguments that minimize the function. Here, x_k is the plant state vector at optimization step k and x̄ is the plant state at the current time, u_k is the vector of control inputs, p̂_k are time-varying parameters whose values are projected over the optimization horizon, f is the model dynamics, H is the constraint vector and N is the horizon. The state and control inputs are subject to constraints, which must hold over the whole optimization horizon for the MPC solution to be considered feasible. The MPC objective function consists of the stage cost ℓ(x_k, u_k, p̂_k), the terminal cost m(x_N), and the discounting factor γ ∈ (0, 1]. The stage cost can further include a term Δu_k⊤ D Δu_k that discourages bang-bang control, where Δu_k = u_k − u_{k−1}. The stage cost only evaluates the trajectory locally, up to N − 1 steps ahead, while the terminal cost m(x_N) should ideally provide global information about the desirability of the considered terminal state, helping the MPC avoid local minima. The more accurate the terminal cost is wrt. the infinite horizon solution to problem (1), the shorter the horizon can be while still achieving good control performance (Zhong et al., 2013; Lowrey et al., 2019).
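To make the structure of (1) concrete, the following is a minimal sketch of how such a finite-horizon OCP could be set up with CasADi's Opti interface and solved with IPOPT. The double-integrator dynamics, quadratic stage and terminal costs, and input bounds are placeholder assumptions for illustration, not the models used later in the paper.

```python
import casadi as ca

def build_ocp(N, dt=0.1, gamma=0.97):
    """Finite-horizon OCP in the form of (1) for a placeholder double integrator."""
    opti = ca.Opti()
    X = opti.variable(2, N + 1)      # state trajectory x_0 ... x_N
    U = opti.variable(1, N)          # input sequence u_0 ... u_{N-1}
    x_bar = opti.parameter(2)        # current plant state, used in (1b)

    cost = 0
    for k in range(N):
        # discounted quadratic stage cost, stand-in for l(x_k, u_k, p_k)
        cost += gamma ** k * (ca.sumsqr(X[:, k]) + 1e-2 * ca.sumsqr(U[:, k]))
        # placeholder dynamics x_{k+1} = f(x_k, u_k): forward-Euler double integrator (1c)
        x_next = X[:, k] + dt * ca.vertcat(X[1, k], U[0, k])
        opti.subject_to(X[:, k + 1] == x_next)
        # input constraint, stand-in for H(x_k, u_k) <= 0 (1d)
        opti.subject_to(opti.bounded(-1, U[:, k], 1))
    cost += gamma ** N * ca.sumsqr(X[:, N])   # terminal cost m(x_N)

    opti.subject_to(X[:, 0] == x_bar)         # initial condition (1b)
    opti.minimize(cost)
    opti.solver("ipopt", {"print_time": False}, {"print_level": 0})
    return opti, X, U, x_bar

# Closed loop: solve, apply the first input, and re-solve at the next sample.
opti, X, U, x_bar = build_ocp(N=20)
opti.set_value(x_bar, [1.0, 0.0])
sol = opti.solve()
u0 = sol.value(U[:, 0])   # first input of the optimal sequence is applied to the plant
```

In this sketch the horizon N is fixed at construction time; the adaptive scheme discussed later simply rebuilds or selects among such problems with different N.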
The ideal choice for the terminal cost m(x_N) in the MPC scheme would be the optimal value function V∗. A value function V^π measures the expected total infinite horizon discounted cost accrued when following the control law π (2), and the optimal value function is then the value of an optimal control law π∗ that chooses the optimal input at every point (3). Equation (3) is written in the form of a Bellman equation, where the value is decomposed into a one-step cost ℓ and the total value from the next state x′ = f(x, u, p̂). Computing V∗ exactly from (3) is intractable for problems with continuous state and input spaces, and iterative approaches such as Q-learning require an enormous amount of data for such problems.

\begin{align}
V^{\pi}(x, \hat{p}) &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \ell\big(x_t, \pi(x_t, \hat{p}), \hat{p}\big) \,\Big|\, x_0 = x\right] \tag{2}\\
V^{*}(x, \hat{p}) &= \min_u\; \ell(x, u, \hat{p}) + \gamma V^{*}(x', \hat{p}'), \quad \forall x, \hat{p} \tag{3}
\end{align}

The MPC scheme delivers local approximations to (3), and as such V^MPC is a good surrogate for V∗ as the terminal cost m(x_N). While computing V^MPC exactly is not possible either, as it would require running the MPC scheme with an infinite horizon, it can be approximated with fitted value iteration from data gathered when running the MPC.
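As one hedged illustration of the data that running the MPC makes available: the optimal objective of the solved OCP is the MPC's own value estimate at the current state, and it can be logged alongside the applied input and the realized stage cost. The helper below reuses the hypothetical build_ocp sketch shown after (1); the names and return format are assumptions for illustration.

```python
import numpy as np

def log_mpc_step(opti, X, U, x_bar, x_current):
    """Solve the OCP at x_current; return the state, the V^MPC estimate, and the first input."""
    opti.set_value(x_bar, x_current)
    sol = opti.solve()
    v_mpc = float(sol.value(opti.f))          # optimal OCP objective: local value estimate
    u0 = np.atleast_1d(sol.value(U[:, 0]))    # input that is actually applied to the plant
    return np.asarray(x_current, dtype=float), v_mpc, u0
```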
\begin{align}
\theta^{*}_{\mathrm{MPC}} &= \arg\min_{\theta_{\mathrm{MPC}}} \mathbb{E}\left[\Big(y(x, \hat{p}) - \hat{V}_{\theta_{\mathrm{MPC}}}(x, \hat{p})\Big)^2\right] \tag{4a}\\
y(x, \hat{p}) &= \mathbb{E}\left[\ell\big(x, \pi_{\mathrm{MPC}}(x, \hat{p}), \hat{p}\big) + \gamma \hat{V}_{\theta_{\mathrm{MPC}}}(x', \hat{p}')\right] \tag{4b}
\end{align}

The value function approximator V̂_θMPC is parameterized by the parameters θ_MPC, which are updated according to (4) to minimize the mean squared Bellman error (MSBE). Moreover, the MPC scheme provides good approximations to the n-step Bellman equation, which when employed in the update rule (4) is known to accelerate convergence and promote stability of the value function learning process:

\begin{align}
V^{*}(x, \hat{p}) = \min_{u_0:u_{N-1}} \mathbb{E}\left[\sum_{t=0}^{N-1} \gamma^t \ell(x_t, u_t, \hat{p}_t) + \gamma^N V^{*}(x_N, \hat{p})\right] \tag{5}
\end{align}

This is because a larger share of the value of the future trajectory is known exactly, and the contribution from the bootstrapping component γ^N V(x_N, p̂) is reduced (van Seijen, 2016; Sutton and Barto, 2018).
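The update (4) with n-step targets of the form (5) can be approximated from logged closed-loop data. Below is a minimal sketch, assuming a buffer of per-episode states and stage costs; the quadratic polynomial regressor mirrors the value function approximator described in Section 3, but the buffer layout and helper names are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def n_step_targets(costs, values, gamma=0.97, n=32):
    """y_t = sum_{i<n} gamma^i * cost_{t+i} + gamma^n * V_hat(x_{t+n}), truncated at episode end."""
    T = len(costs)
    targets = np.empty(T)
    for t in range(T):
        horizon = min(n, T - t)
        discounts = gamma ** np.arange(horizon)
        y = np.dot(discounts, costs[t:t + horizon])
        if t + n < T:                       # bootstrap only if x_{t+n} exists in the buffer
            y += gamma ** n * values[t + n]
        targets[t] = y
    return targets

def fit_value_function(states, costs, gamma=0.97, n=32, sweeps=5):
    """A few fitted-value-iteration sweeps with a quadratic polynomial regression model."""
    model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1e-3))
    values = np.zeros(len(states))          # initial V_hat = 0
    for _ in range(sweeps):
        y = n_step_targets(costs, values, gamma=gamma, n=n)
        model.fit(states, y)
        values = model.predict(states)      # re-evaluate V_hat for the next sweep
    return model
```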
The system to optimize using RL is framed as a Markov decision process (MDP), which is defined by a set of components (S, A, T, R, γ). S is the state space of the system, A is the action space, T is the discrete-time state transition function which describes the transformation of the states due to time and actions, i.e. s′ = T(s, a), R(s, a) is the cost function and γ ∈ [0, 1) is the discount factor describing the relative value of immediate and future costs. The aim of RL methods is to discover optimal decision making for the problem as defined above, usually by constructing a policy π_θ, i.e. a (possibly stochastic) function that maps states to actions, here parameterized by θ, and/or a value function as in (2), where ℓ corresponds to R. The objective to be optimized is then:

\begin{align}
J(\theta) = \min_{\theta} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R\big(s_t, \pi_\theta(s_t)\big)\right], \quad \forall s_0 \in S \tag{6}
\end{align}

that is, minimize the expected sum of costs acquired over the states visited by the policy over an infinite horizon. The expectation is taken over the initial state distribution S and the trajectory distribution generated by the policy and the state transition function.

Soft actor critic (SAC) (Haarnoja et al., 2018) is an actor-critic entropy-maximization RL algorithm with a parameterized stochastic policy. Entropy is a measure of the randomness of a variable, and is in the case of a continuous action space defined as H(π_θ(·|s)) = E_{a∼π_θ(a|s)}[−log π_θ(a|s)], where π_θ(a|s) is the probability of taking a given action in state s under the policy π_θ. In maximum entropy RL the objective is regularized by the entropy of the policy, that is, the aim of the policy is to minimize the sum of expected costs while simultaneously maximizing the expected entropy. This in turn yields multi-modal behaviour and innate exploration of the environment, as well as improved robustness because the policy is explicitly trained to handle perturbations. For the sake of brevity, we limit the discussion of the specifics of the SAC algorithm to the policy implementation; see Haarnoja et al. (2018) for details on how the policy is optimized. SAC learns a parameterized stochastic policy implemented as:

\begin{align}
\pi_\theta(s, \xi) = \tanh\big(\mu_\theta(s) + \sigma_\theta(s) \odot \xi\big) \tag{7}
\end{align}

Here μ_θ and σ_θ are the two outputs of the policy function approximator, representing the mean action and the covariance, respectively, ξ ∼ N(0, 1) is independently drawn Gaussian noise, ⊙ denotes element-wise multiplication, and tanh is employed to squash the Gaussian's infinite support to the interval [−1, 1]. When evaluating the policy we set σ_θ = 0, such that the policy becomes deterministic, as this tends to give better performance.
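A minimal numerical sketch of the squashed Gaussian policy in (7) follows; the array shapes and the use of a log-standard-deviation output are common SAC conventions assumed here, not details given in the paper.

```python
import numpy as np

def squashed_gaussian_action(mu, log_std, deterministic=False, rng=None):
    """Sample an action as in (7): tanh(mu + sigma * xi), with xi ~ N(0, 1)."""
    if deterministic:                    # evaluation mode: sigma effectively set to zero
        return np.tanh(mu)
    rng = rng or np.random.default_rng()
    xi = rng.standard_normal(np.shape(mu))
    return np.tanh(mu + np.exp(log_std) * xi)

# Example: a scalar policy output, later scaled and rounded to a horizon length.
a = squashed_gaussian_action(mu=np.array([0.3]), log_std=np.array([-1.0]))
```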
3. METHOD

We learn a policy π^N_θ that outputs the prediction horizon N of the MPC scheme using SAC. The prediction horizon is a positive integer, which for convenience we choose to upper bound. As such, we modify the output of the SAC policy by linearly scaling the output from the tanh's limits of −1 and 1 to 1 and N_max, and then rounding the output to the closest integer:

\begin{align}
a_t = \mathrm{round}\Big(\mathrm{scale}\big(\pi^N_\theta(s_t), [-1, 1], [1, N_{\max}]\big)\Big) \tag{8}
\end{align}

The gradients of the RL problem are not affected by these transformations as they are applied in the environment, while the gradients are calculated based on the unscaled and unrounded outputs from π^N_θ(s_t). This does however mean that the agent must "learn" that similar outputs from the policy will be rounded to the same action in (8), and thus lead to the same subsequent state and cost. We considered alternative ways of formulating the policy as a discrete distribution from which integer horizon lengths could be drawn directly, such as N-head neural networks (NNs), Poisson models, and negative binomial models, but settled on the described rounding approach due to its simplicity and favorable results.

The cost function of the horizon policy consists of a control performance cost R_P, i.e. the MPC stage cost ℓ, a constraint violation cost R_C, and a computation cost R_N to encourage lower horizons when suitable:

\begin{align}
R(s, a) = R_P(s') + \lambda_C (t_{\max} - t) R_C(s') + \lambda_N R_N(a) \tag{9}
\end{align}

where λ_C and λ_N are weighting factors. R_C(s) is a binary variable indicating whether a hard constraint of the problem was violated, upon which the episode is ended, and t_max − t is the number of steps left in the episode, such that the agent receives a penalty proportional to how early the episode is ended. We assume the computational complexity of the MPC scheme grows linearly in the horizon length, i.e. R_N(a) = a, as a lower bound for the true complexity. This generally holds true for the interior point method we use if one assumes local convergence and a reasonable initial guess (Rao et al., 1998).

The RL state space S = {x, p̂} consists of the MPC state space x and the time-varying parameters p̂, as these are necessary to ensure the Markov property.

The MPC's value function is trained jointly with the RL horizon policy to minimize the MSBE as described in Section 2.2, using 32-step bootstrapping. We found that n-step learning provided sufficient stabilization such that other common techniques in value estimation, such as target networks and multiple estimators, were not needed (Fujimoto et al., 2018). We experimented with two types of approximators, NNs and polynomial regression models, finding that they achieved similar prediction accuracy. We therefore use quadratic polynomial regression models due to their convexity, reducing the computational complexity of the MPC scheme.
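The following sketch shows the mapping (8) from a squashed policy output to an integer horizon, together with the cost (9); the function names, the constraint-violation bookkeeping, and the weight values are illustrative assumptions.

```python
import numpy as np

def action_to_horizon(a_raw, n_max=50):
    """Map a policy output in [-1, 1] to an integer horizon in {1, ..., n_max}, as in (8)."""
    scaled = 1.0 + (a_raw + 1.0) * (n_max - 1.0) / 2.0   # linear scaling [-1, 1] -> [1, n_max]
    return int(np.clip(np.round(scaled), 1, n_max))

def horizon_cost(stage_cost, constraint_violated, steps_left, horizon,
                 lam_c=1.0, lam_n=1e-2):
    """Cost (9): performance term + weighted constraint-violation penalty + computation term.

    lam_c and lam_n are placeholder weighting factors, not the values used in the paper.
    """
    r_c = 1.0 if constraint_violated else 0.0      # binary constraint-violation indicator
    r_n = float(horizon)                           # R_N(a) = a, linear computation proxy
    return stage_cost + lam_c * steps_left * r_c + lam_n * r_n
```

The rounding in action_to_horizon happens in the environment, so the policy gradients still flow through the continuous, unrounded output.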
Since the environments are randomized, we construct a test set consisting of 10 episodes for which all stochastic variables, such as state initial conditions and references, are drawn in advance and thereby fixed for all policies, ensuring a fair comparison. The learned horizon policy is compared against the standard MPC scheme with a fixed horizon, to assess the contribution of the learning. Each fixed horizon MPC also has its own value function estimated using a dataset of 15k time steps.

4. EXPERIMENTS

We illustrate our approach on two systems. We set N_max = 50, λ_N = 3 · 10^− and · 10^− for the two systems, and λ_C = 10; π^N_θ is a 2-layer fully connected NN with 32 nodes in each layer; γ = 0.97 for both the MPC scheme and the RL algorithm; and a reward scaling of 0.6 is used for the inverted pendulum system and 0.3 for the collision avoidance system.
The first system we experiment on is the classic control problem of stabilizing an inverted pendulum mounted on a cart that is fixed on a track, so that the cart can only move back and forth in one dimension. The cart's position is constrained to the size of the track, and the pendulum angle is constrained to be above perpendicular to the surface. The controller should also track a time-varying position reference. As the position of the cart and the stabilization of the pendulum are intricately linked, respecting both constraints while tracking the position reference requires a fairly long optimization horizon. Each episode is terminated after a maximum of 100 time steps, or when a constraint is violated.

The state space consists of the states x_k = [η, v, β, ω], where η and v are the position and velocity of the cart along the horizontal axis, while β and ω are the angle to the upright position and the angular velocity of the pendulum. The system dynamics are described by the equations in (10), where m and M are the mass parameters and l = 0.25 is the length of the pendulum. For the MPC model the dynamics are discretized with a fixed step time ∆k. The stage cost ℓ(x_k, u_k, p̂_k) = E_kinetic − E_potential + 10 (η_k − η_{k,r})², plus a small quadratic penalty on u_k, reflects the objective of stabilizing the pendulum in the up position, formulated through minimizing the negative potential energy of the system, while tracking the position reference η_{k,r}.

\begin{align}
\dot{\eta} &= v \tag{10a}\\
\dot{v} &= \frac{m g \sin(\beta)\cos(\beta) - \big(u + m l \omega^2 \sin(\beta)\big)}{m\cos^2(\beta) - M} \tag{10b}\\
\dot{\beta} &= \omega \tag{10c}\\
\dot{\omega} &= \frac{M g \sin(\beta) - \cos(\beta)\big(u + m l \omega^2 \sin(\beta)\big)}{M l - m l \cos^2(\beta)} \tag{10d}
\end{align}

The input u, the cart position η, and the pendulum angle β are each subject to symmetric box constraints (10e).

The second system we consider is a reference tracking problem, in which a vehicle is controlled to follow a trajectory τ with obstacles placed in the path that need to be avoided. The MPC receives information about the reference trajectory as well as any obstacles in its vicinity; however, the position of the obstacles grows more uncertain the longer the prediction horizon is. This means that longer horizons consider increasingly uncertain information, and a short or medium horizon might be better suited in some situations. The episode is ended when reaching the end point of the trajectory, when colliding with an obstacle, or after a maximum of 150 time steps.

For the vehicle we employ a unicycle model (11), where the MPC provides a forward velocity u_s as well as an angular velocity u_ω to turn the vehicle. The MPC model is discretized with a fixed step time ∆k. The positions and sizes of the obstacles are randomly generated at the beginning of every episode, and their projected positions supplied to the MPC are randomly drawn within a two-dimensional cone originating from the vehicle, such that the uncertainty grows the further away the object is from the vehicle. An episode is illustrated in Figure 1.

\begin{align}
\dot{p}_x &= u_s \cos(\beta) \tag{11a}\\
\dot{p}_y &= u_s \sin(\beta) \tag{11b}\\
\dot{\beta} &= u_\omega \tag{11c}
\end{align}

The forward velocity is constrained to be non-negative and upper bounded, while u_ω is bounded symmetrically. The stage cost is ℓ(x_k, u_k, p̂_k) = ‖v_k − τ_k‖, where v_k and τ_k are the vehicle position and the trajectory reference at time k. Further, soft constraints with slack variables are added around each obstacle at 150% of the obstacle's radius.

Fig. 1. An episode in the collision avoidance environment. The blue rectangle represents the vehicle tracking the trajectory (the orange dashed line) from left to right, while avoiding the grey obstacles. The sensor beam's inaccuracy grows with distance, such that the projected position of the object in the beam is drawn from the yellow circle. The vehicle's size is enlarged for visual clarity.
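To make the second environment concrete, the sketch below shows a forward-Euler discretization of the unicycle model (11) together with one way the growing sensor uncertainty could be emulated, drawing a projected obstacle position whose spread increases with distance. The noise model, parameter values, and function names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def unicycle_step(x, u_s, u_w, dt=0.1):
    """Forward-Euler step of the unicycle model (11); x = [p_x, p_y, beta]."""
    p_x, p_y, beta = x
    return np.array([p_x + dt * u_s * np.cos(beta),
                     p_y + dt * u_s * np.sin(beta),
                     beta + dt * u_w])

def project_obstacle(vehicle_pos, obstacle_pos, noise_per_meter=0.05, rng=None):
    """Projected obstacle position with uncertainty growing linearly with distance."""
    rng = rng or np.random.default_rng()
    offset = np.asarray(obstacle_pos) - np.asarray(vehicle_pos)
    distance = np.linalg.norm(offset)
    # draw uniformly from a disc whose radius grows with distance (stand-in for the sensor cone)
    radius = noise_per_meter * distance * np.sqrt(rng.uniform())
    angle = rng.uniform(0.0, 2.0 * np.pi)
    return np.asarray(obstacle_pos) + radius * np.array([np.cos(angle), np.sin(angle)])
```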
5. RESULTS

The standard MPC scheme with various prediction horizons and the learned RL AHMPC are compared in Figure 3. The RL policy outperforms the standard MPC scheme for all horizon lengths, improving on the second best performing policy by about 4% and 8% for the inverted pendulum and collision avoidance systems, respectively. The improvement is more significant for the latter system, as its performance objective varies more with the prediction horizon and the RL policy is able to identify when to use long and short horizons. For the inverted pendulum system, all horizons capable of respecting the constraints achieve similar performance costs, and as such the difference lies mainly in the computation term, although the RL policy achieves the lowest performance cost here as well. RL's ability to find improvement in a problem with such a noisy cost landscape and with so little potential improvement speaks to its strength. Moreover, the gains from reducing computation would be greater when using, e.g., active set methods for the MPC scheme, which typically yield quadratic growth in computational complexity (Lau et al., 2015).

In the collision avoidance environment, the best performing fixed horizon is the short 10-step horizon. With the shortest 5-step horizon, the MPC is unable to navigate around all the obstacles, preferring to stay still in front of large obstacles, although the addition of the value function mitigates this issue to some extent. Longer prediction horizons allow the MPC to recognize that sometimes the longer way around the closest obstacle yields a shorter total path due to other obstacle locations, but its planned routes are more sensitive to the uncertainty in the projected locations. A robust MPC scheme could alleviate this deficiency; however, the RL policy is also able to recognize this issue and leverage the strengths of both short and long horizons.
We found that implementing value function estimation in the MPC scheme could significantly improve the performance when using horizons in a neighborhood of the horizon scale where performance changes abruptly, as illustrated in Figure 2, which shows the percentage improvement for each policy when including V̂_θMPC as the terminal cost. The RL horizon policy does not benefit as much from the value function as we would expect, even being the best performing policy when removing the value function from it but not from the other policies. The benefit would probably be more significant in problems that are more temporally or spatially complex. In the collision avoidance problem the shortest horizons show the largest improvements, while for the inverted pendulum system the most improved horizons are the ones that lie close to the apparent minimum horizon required to successfully stabilize the pendulum and track the position reference. For both systems, the longer horizons benefit less from the addition of the value function. This is in part due to the fact that both these systems are heavily influenced by future information that is not available to the value function estimator, i.e. accurate information about distant obstacles and the future position reference for the cart. We note that the performance costs and the value function improvement are not monotonic wrt. the horizon length. This could partly be explained by the randomness in the data collection stage for the value function estimation.

Fig. 2. Total cost improvement over the test sets with the value function as the terminal cost in the MPC.

Figure 4 shows the progression of the training process of the RL horizon policy. It learns quickly, converging after around 15 thousand time steps for both systems, corresponding to about 10 and 25 minutes of data collection for the inverted pendulum and collision avoidance systems, respectively. Moreover, we find that the RL horizon policy itself converges even faster and that the value function estimation is the slower, less data efficient component. From these results it seems evident that RL is able to cope well with the rounding described in Section 3.1.

6. CONCLUSION

We have shown in this paper that RL can be used to automatically tune and adapt the prediction horizon of the MPC scheme online with only minutes of data collection, at least for simple systems. An important direction for further work is to investigate how this affects the stability properties of the MPC framework, and whether any guarantees can be given.
Fig. 3. Mean episode costs for the horizon policies on the test sets for the two systems, using the value function as the terminal cost; lower is better. The objectives are connected in that the policy does not accrue performance or computation cost after the episode is terminated from a constraint violation. The left figure is cut off due to the worst performing policies having a significantly higher cost.
Fig. 4. Total cost on the test set for the RL policy at different stages of the learning process. The solid line is the mean score, while the shaded region is one standard deviation over three initialization seeds.

REFERENCES

Aswani, A., Gonzalez, H., Sastry, S.S., and Tomlin, C. (2013). Provably safe and robust learning-based model predictive control. Automatica, 49(5), 1216–1226.
Fisac, J.F., Akametalu, A.K., Zeilinger, M.N., Kaynama, S., Gillula, J., and Tomlin, C.J. (2018). A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7), 2737–2752.
Fujimoto, S., Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning. PMLR.
Gardezi, M.S.M. and Hasan, A. (2018). Machine learning based adaptive prediction horizon in finite control set model predictive control. IEEE Access, 6, 32392–32400. doi:10.1109/ACCESS.2018.2839519.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning. PMLR.
Krener, A.J. (2018). Adaptive horizon model predictive control. IFAC-PapersOnLine, 51(13), 31–36. doi:10.1016/j.ifacol.2018.07.250.
Lau, M., Yue, S., Ling, K., and Maciejowski, J. (2015). A comparison of interior point and active set methods for FPGA implementation of model predictive control. In Proc. European Control Conference.
Lowrey, K., Rajeswaran, A., Kakade, S., Todorov, E., and Mordatch, I. (2019). Plan online, learn offline: Efficient learning and exploration via model-based control. In International Conference on Learning Representations (ICLR).
Mehndiratta, M., Camci, E., and Kayacan, E. (2018). Automated tuning of nonlinear model predictive controller by reinforcement learning. In , 3016–3021. IEEE.
Michalska, H. and Mayne, D.Q. (1993). Robust receding horizon control of constrained nonlinear systems. IEEE Transactions on Automatic Control, 38(11), 1623–1633. doi:10.1109/9.262032.
Nagabandi, A., Kahn, G., Fearing, R.S., and Levine, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In , 7559–7566. doi:10.1109/ICRA.2018.8463189.
Rao, C.V., Wright, S.J., and Rawlings, J.B. (1998). Application of interior-point methods to model predictive control. Journal of Optimization Theory and Applications, 99(3), 723–757.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839), 604–609.
Scokaert, P.O.M. and Mayne, D.Q. (1998). Min-max feedback model predictive control for constrained linear systems. IEEE Transactions on Automatic Control, 43(8), 1136–1142. doi:10.1109/9.704989.
Sutton, R.S. and Barto, A.G. (2018). Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA. doi:10.5555/3312046.
van Seijen, H. (2016). Effective multi-step temporal-difference learning for non-linear function approximation.
Zanon, M. and Gros, S. (2020). Safe reinforcement learning using robust MPC. IEEE Transactions on Automatic Control.
Zhong, M., Johnson, M., Tassa, Y., Erez, T., and Todorov, E. (2013). Value function approximation and model predictive control. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL).