Neural Lyapunov Model Predictive Control: Learning Safe Global Controllers from Sub-optimal Examples
Mayank Mittal, Marco Gallieri, Alessio Quaglino, Seyed Sina Mirrazavi Salehian, Jan Koutník
Abstract
This paper presents
Neural Lyapunov MPC, an algorithm to alternately train a Lyapunov neural network and a stabilising constrained Model Predictive Controller (MPC), given a neural network model of the system dynamics. This extends recent work on Lyapunov networks to train solely from expert demonstrations of one-step transitions. The learned Lyapunov network is used as the value function for the MPC in order to guarantee stability and extend the stable region. Formal results are presented on the existence of a set of MPC parameters, such as discount factors, that guarantees stability with a horizon as short as one. Robustness margins are also discussed, and existing performance bounds on value-function MPC are extended to the case of imperfect models. The approach is tested on unstable non-linear continuous control tasks with hard constraints. Results demonstrate that, when a neural network trained on short sequences is used for predictions, a one-step-horizon Neural Lyapunov MPC can successfully reproduce the expert behaviour and significantly outperform longer-horizon MPCs.
1. Introduction
Control systems are subject to constraints that need to be considered during the controller design process. In most robotic applications, these constraints appear as actuator saturations or state boundaries. Typically, a control strategy that violates these specifications can adversely affect the performance of the overall system and lead to unsafe behaviors. In this work, we wish to learn robust control policies that can perform set-point reaching tasks safely, i.e., while respecting the constraints of the system. A pivotal concept for the safe control of a dynamical system is the stability of an equilibrium point. One way to verify that a given feedback controller stabilizes a system is by means of a Lyapunov function for the closed loop.

*Equal contribution. 1 NNAISENSE, Lugano, Switzerland. 2 ETH Zurich, Switzerland. Correspondence to: Mayank Mittal.
Summary of contributions
In this paper, we present
Neural Lyapunov MPC, an algorithmic framework that obtains a single-step-horizon Model Predictive Controller (MPC) for Lyapunov-based control of a non-linear deterministic system with constraints. We justify the choice of a unitary horizon by the use of an imperfect forward model for predictions. In our proposed framework, alternate learning is used to train a Lyapunov NN in a supervised manner and to tune the parameters of the MPC. The learned Lyapunov NN is used as the terminal cost of the MPC to obtain closed-loop stability and a robustness margin to model errors. To the best of our knowledge, this is the first time such a framework has been presented. Using our approach, we show that the size of the stable region increases compared to a baseline MPC that has a longer prediction horizon. By treating the learned Lyapunov NN as a value-function estimate, we provide theoretical results on the performance of an MPC with an imperfect forward model. These results complement the ones by Lowrey et al. (2018), which only consider the case when a perfect dynamics model is available.
Summary of experiments
We demonstrate our approach on constrained non-linear continuous control tasks: a torque-limited inverted pendulum and non-holonomic vehicle kinematics. We show that our approach can successfully transfer between an inaccurate surrogate and a nominal model, outperforming a long-horizon MPC demonstrator.
2. Preliminaries
Controlled Dynamical System
Let us consider a discrete-time, time-invariant, deterministic dynamical system of the form:

$x(t+1) = f(x(t), u(t)), \quad y(t) = x(t)$,   (1)

where $t \in \mathbb{N}$ is the timestep index, and $x(t) \in \mathbb{R}^{n_x}$, $u(t) \in \mathbb{R}^{n_u}$ and $y(t) \in \mathbb{R}^{n_y}$ are, respectively, the state, control input, and measurement at timestep t. We assume that the states and measurements are equivalent. Further, the system described by Equation (1) is subject to closed and bounded, convex constraints over the state and input spaces. These specifications can be compactly denoted as:

$x(t) \in X \subseteq \mathbb{R}^{n_x}, \quad u(t) \in U \subset \mathbb{R}^{n_u}, \quad \forall t \ge 0$.   (2)

The system is to be controlled by a feedback policy, $K: \mathbb{R}^{n_x} \to \mathbb{R}^{n_u}$, its closed-loop behaviour being defined by $x(t+1) = f(x(t), K(x(t))) = f_K(x(t))$. The policy K is considered safe if there exists an invariant set, $X_s \subseteq X$, for the closed-loop dynamics, inside the constraints. The set $X_s$ is also referred to as the safe set for $f_K$: every trajectory of the system $f_K$ that starts at some $x \in X_s$ remains inside this set. Furthermore, if x asymptotically reaches the equilibrium/target state, $\bar x \in X_s$, then $X_s$ is also a Region of Attraction (ROA).
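For concreteness, the sketch below instantiates the system (1) and the box constraints (2) in Python. The Euler-discretized pendulum dynamics, time step, and bounds are illustrative assumptions, not the exact configuration used in the paper (that is described in its Appendix C).

    import numpy as np

    # Illustrative instance of Equation (1): a one-step transition map
    # x(t+1) = f(x(t), u(t)) for a torque-limited pendulum (assumed params).
    def f(x, u, dt=0.05, m=1.0, l=1.0, g=9.81):
        theta, omega = x
        theta_next = theta + dt * omega
        omega_next = omega + dt * ((g / l) * np.sin(theta) + u / (m * l ** 2))
        return np.array([theta_next, omega_next])

    # Closed, bounded box constraints X and U from Equation (2) (placeholders).
    X_LOW, X_HIGH = np.array([-np.pi, -8.0]), np.array([np.pi, 8.0])
    U_MAX = 2.0

    def in_constraints(x, u):
        return bool(np.all(X_LOW <= x) and np.all(x <= X_HIGH) and abs(u) <= U_MAX)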
Lyapunov Functions

Lyapunov functions are positive scalar functions of the state, $V: \mathbb{R}^{n_x} \to \mathbb{R}_+$, which monotonically decrease along the closed-loop trajectory of a controlled system until a target point (or set) is reached. The existence of such a function is a necessary and sufficient condition for stability and convergence of dynamical systems (Khalil, 2014). Candidate Lyapunov functions (or Control Lyapunov Functions) can be used to design control policies, for instance through direct numerical optimization for piecewise-linear dynamics (Blanchini & Miani, 2007). The quest for a general solution to non-linear Lyapunov-based constrained control design is still open.
Lyapunov Conditions

In order to leverage the representational power of neural networks to approximate a Lyapunov function, the learned function must satisfy two conditions. The first requirement is that the candidate function be upper and lower bounded by strictly increasing, unbounded, positive ($\mathcal{K}_\infty$) functions (Khalil, 2014). In the context of optimal control with a stage cost

$\ell(x, u) = x^T Q x + u^T R u, \quad Q \succeq 0, \; R \succ 0$,   (3)

a possible choice of $\mathcal{K}_\infty$-functions is the scaled sum-of-squares of the states:

$l_\ell \|x\|^2 \le V(x) \le L_V \|x\|^2$,   (4)

where $l_\ell$ and $L_V$ are, respectively, the minimum eigenvalue of Q and a (possibly local) Lipschitz constant of V. (In general one uses $V(x - \bar x)$, where $\bar x$ is a target state; for notational convenience, we set $\bar x = 0$.)

The second, and more important, condition is that V(x) must decrease along the closed-loop system $f_K(x)$. A common condition, which relates to the Bellman equation in optimal control, is that, $\forall x(t) \in X_s$ and $u(t) = K(x(t))$, the following holds:

$V(x(t+1)) - V(x(t)) \le -\ell(x(t), u(t))$.   (5)

Here $X_s$ is the safe set described earlier. For a valid Lyapunov function V, the safe set can be defined as a level set of V:

$X_s = \{x \in X : V(x) \le l_s\}$.   (6)

It is worth noting that, if K solves an infinite-horizon optimal control problem with the stage cost (3) and V as its value function, then Equation (5) holds with equality and the value function V is also a Lyapunov function.
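Continuing the sketch above, the stage cost (3) and the decrease condition (5) can be checked numerically on sampled transitions; the weight matrices Q and R below are placeholder values.

    # Quadratic stage cost (3) and a check of the decrease condition (5)
    # on a single transition (Q, R are assumed weights).
    Q = np.diag([1.0, 0.1])
    R = np.array([[0.01]])

    def stage_cost(x, u):
        u = np.atleast_1d(u)
        return float(x @ Q @ x + u @ R @ u)

    def decrease_condition_holds(V, x, u, x_next):
        # Equation (5): V(x+) - V(x) <= -l(x, u)
        return V(x_next) - V(x) <= -stage_cost(x, u)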
3. Neural Lyapunov MPC
In recent work, Gallieri et al. (2019) proposed an alternating descent method for training a Lyapunov NN along with a control policy. They designed the control policy as a multi-layer perceptron (MLP) with tanh activation functions and initialized the network with a known stabilizing policy K. This design choice is a bottleneck of their approach, since having an explicit K is often not possible. To overcome this limitation, we present a novel approach where the control policy K(x) is a non-linear Model Predictive Controller (MPC) (Maciejowski, 2000; Rawlings & Mayne, 2009; Kouvaritakis & Cannon, 2015; Raković & Levine, 2019). Our method uses an expert policy, K, only to generate an initial dataset of single-step demonstrations. This dataset is then used to learn a stabilizing controller.

In the context of MPC, a function V(x) which satisfies the Lyapunov property (5) for some local controller K is instrumental to formally guarantee stability (Mayne et al., 2000; Limon et al., 2003). We use this insight and build a general Lyapunov-function terminal cost for our MPC, based on neural networks. We discuss the formulation of the Lyapunov network and the MPC in Section 3.1 and Section 3.2, respectively. In order to extend the controller's ROA while maintaining a short prediction horizon, an alternate optimization scheme is proposed to tune the parameters of the MPC and re-train the Lyapunov NN. We describe this procedure in Section 3.3 and provide pseudocode of our approach in Algorithm 1.

We use the Lyapunov function network introduced by Gallieri et al. (2019):

$V(x) = x^T \big( l_\ell I + V_{net}(x)^T V_{net}(x) \big) x$,   (7)

where $V_{net}(x)$ is a (Lipschitz) feedforward network that produces an $n_V \times n_x$ matrix. The scalars $n_V$ and $l_\ell > 0$ are hyper-parameters. It is easy to verify that Equation (7) satisfies the condition in Equation (4). In our algorithm, we learn the parameters of the network, $V_{net}(x)$, and a safe level, $l_s$.

In order to reduce the dependency on the stage-cost parameters, instead of enforcing Equation (5), V is learned such that:

$\forall x(t) \in X_s, \; u(t) = K(x(t)) \;\Rightarrow\; V(x(t+1)) - \lambda V(x(t)) \le 0$,   (8)

where $\lambda \in [0, 1)$ is a constant. If Equation (8) holds, then the set $X_s$ is positively invariant (Blanchini & Miani, 2007; Kerrigan, 2000). Since $\lambda < 1$, it is also $\lambda$-contractive and the state converges to the origin. Equation (8) allows V to be learned from demonstrations without knowing an explicit parameterization of the demonstrator.
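A minimal PyTorch sketch of the Lyapunov network (7) is given below; the hidden-layer size and activation are assumptions, as the paper's exact architecture is described in its appendix. Positive definiteness holds by construction, since the quadratic form adds $l_\ell \|x\|^2$ to a sum of squares.

    import torch
    import torch.nn as nn

    class LyapunovNet(nn.Module):
        # V(x) = x^T (l_min * I + V_net(x)^T V_net(x)) x, Equation (7).
        def __init__(self, n_x, n_v, l_min=1e-3, hidden=64):
            super().__init__()
            self.n_x, self.n_v, self.l_min = n_x, n_v, l_min
            self.net = nn.Sequential(
                nn.Linear(n_x, hidden), nn.Tanh(),
                nn.Linear(hidden, n_v * n_x),
            )

        def forward(self, x):                       # x: (batch, n_x)
            Vmat = self.net(x).view(-1, self.n_v, self.n_x)
            # x^T V_net^T V_net x  ==  ||V_net(x) x||^2
            quad = (Vmat @ x.unsqueeze(-1)).squeeze(-1).pow(2).sum(-1)
            return self.l_min * x.pow(2).sum(-1) + quad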
Loss function

Suppose $\mathcal{D}_K$ denotes a set of state-action-transition tuples of the form $(x, u, x^+)$, where $x^+$ is the next state obtained from applying the policy $u = K(x)$. The Lyapunov network is trained using the following loss:

$\min_{V_{net},\, l_s} \; E_{(x, u, x^+) \in \mathcal{D}_K}\big[ J(x, u, x^+) \big]$,   (9)

where:

$J(x, u, x^+) = I_{X_s}(x)\, \rho\, J_s(x, u, x^+) + J_{vol}(x, u, x^+)$,   (10)

and:

$I_{X_s}(x) = 0.5\,\big(\mathrm{sign}[l_s - V(x)] + 1\big)$,
$J_s(x, u, x^+) = \mathrm{ReLU}[\Delta V(x)] \,/\, (V(x) + \epsilon_V)$,
$J_{vol}(x, u, x^+) = \mathrm{sign}\big[\Delta V(x)\big]\,[l_s - V(x)]$,
$\Delta V(x) = V(x^+) - \lambda V(x) + \upsilon\, \ell(x, u)$.

In Equation (10), $I_{X_s}$ is the indicator function of the safe set $X_s$, which multiplies $J_s$, a term that penalises instability. The term $J_{vol}$ is a classification loss that tries to compute the correct boundary between the stable and unstable points; it is also instrumental in increasing the safe-set volume. The scalars $\epsilon_V > 0$, $\lambda \in [0, 1)$, $\upsilon \in [0, 1]$, and $\rho > 0$ are hyper-parameters, where the latter trades off volume for stability. When $\ell$ is unknown for K, we set $\upsilon = 0$. An additional term could be added to the cost to encourage $X_s \subseteq X$; we instead choose to scale down the computed level $l_s$ a posteriori, for practical reasons.

The loss (10) extends the one proposed by Berkenkamp et al. (2017) in the sense that only one-step transitions are required, and safe trajectories are not explicitly labeled before training. This loss is also used to tune the MPC parameters.
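The loss (9)-(10) can be estimated on a batch of transitions as in the sketch below; the hyper-parameter values are placeholders, and `l_s` is assumed to be a learnable scalar tensor. `V` is a batched callable such as `LyapunovNet` above.

    def lyapunov_loss(V, l_s, x, u, x_next, lam=0.9, rho=10.0,
                      ups=0.0, eps_V=1e-3, stage_cost=None):
        # Monte-Carlo estimate of Equations (9)-(10) on one batch.
        Vx, Vnext = V(x), V(x_next)
        dV = Vnext - lam * Vx                         # Delta V(x), Eq. (10)
        if ups > 0.0 and stage_cost is not None:
            dV = dV + ups * stage_cost(x, u)
        in_safe = 0.5 * (torch.sign(l_s - Vx) + 1.0)  # indicator of X_s
        J_s = torch.relu(dV) / (Vx + eps_V)           # instability penalty
        J_vol = torch.sign(dV) * (l_s - Vx)           # volume / classification
        return (in_safe * rho * J_s + J_vol).mean()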
Local Lyapunov functions and safety

It is often possible that the trained function V does not satisfy the Lyapunov condition everywhere. This is not uncommon in control problems that deal with non-linearities and uncertainties, or when function approximators are used for generating the control signals. Training a Lyapunov NN with a gradient-based method does not guarantee that Equation (8) is satisfied. In general, it may hold only between two level sets, $X_s$ and $X_T = \psi X_s$ with $\psi < 1$, of V:

$\forall x(t) \in X_s \setminus X_T, \; u(t) = K(x(t)) \;\Rightarrow\; V(x(t+1)) - \lambda V(x(t)) \le 0$.   (11)

However, if V does not increase beyond $l_s$ inside $\psi X_s$:

$\forall x(t) \in X_T, \; u(t) = K(x(t)) \;\Rightarrow\; V(x(t+1)) < l_s$,   (12)

then trajectories starting in $X_s$ remain in $X_s$ under the policy used to generate the training data for V. If, instead, $X_T$ is invariant, then all trajectories from the demonstrator terminate in $X_T$.

Formal verification methods can also be used to check that the Lyapunov conditions are satisfied. We refer the reader to Gallieri et al. (2019); Bobiti (2017); Bobiti & Lazar (2016); Berkenkamp et al. (2017) for a set of verification algorithms.

It is desirable to learn powerful control policies that can perform set-point-reaching tasks safely, namely, while respecting constraints on the system state and inputs. At the same time, the policies need to be general and directly aware of the constraints. This motivates the use of Model Predictive Control (MPC) policies. The Lyapunov-function network techniques from the previous section can be leveraged to formally enforce stability. The key idea is that a policy may be only locally optimal, or even sub-optimal, as long as this results in safe behavior. Control solutions based on MPC are particularly powerful due to their potential to exploit the inductive bias arising from the direct use of a surrogate model and from well-studied constrained optimization methods, e.g., sequential quadratic programming (SQP; Nocedal & Wright, 2006), which makes use of convex optimization steps (Boyd & Vandenberghe, 2004). MPC is a global control policy that solves a local optimal control problem of the form:

$J^\star_{MPC}(x(t)) = \min_{\mathbf{u}} \; \gamma^N \alpha V(\hat x(N)) + \sum_{i=0}^{N-1} \gamma^i \ell(\hat x(i), \hat u(i))$
$\text{s.t.} \quad \hat x(i+1) = \hat f(\hat x(i), \hat u(i))$,   (13)
$\qquad \hat x(i) \in X, \; \forall i \in [0, N]$,
$\qquad \hat u(i) \in U, \; \forall i \in [0, N-1]$,
$\qquad \hat x(0) = x(t)$,

where $\hat x(i)$ and $\hat u(i)$ are the predicted state and input i steps into the future, $\mathbf{u} = \{\hat u(i)\}_{i=0}^{N-1}$, the stage cost $\ell$ is given by (3), $\gamma \in (0, 1]$ is a discount factor, the function $\hat f$ is a forward model, the function V is the terminal cost, in our case a Lyapunov NN from Equation (7), scaled by a factor $\alpha \ge 1$ to provide stability, and x(t) is the measured system state at the current time. For practical reasons, state constraints are softened and penalty functions are used, as in Kerrigan & Maciejowski (2000). Problem (13) is extended with a set of slack variables, $\hat s(i)$, for state-constraint violation and an additional penalty:

$\ell_X(s) = \eta_1\, s^T s + \eta_2\, \|s\|$,   (14)

where $\eta_1 > 0$, $\eta_2 \gg 1$. For the MPC, problem (13) is solved online (in real time) given the current state x(t); then, the first element of the optimal control sequence, $u^\star(0)$, provides the action for the physical system. A new state is then measured, and Equation (13) is solved again, in a receding horizon.
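The sketch below illustrates a one-step (N = 1) instance of problem (13), solved by projected gradient descent rather than the SQP solver the paper uses; the soft state-constraint penalty (14) is omitted, and alpha, gamma, and the input bound are placeholder values. Q and R are torch tensors.

    def one_step_mpc(V, f_hat, x, Q, R, alpha=50.0, gamma=0.99,
                     u_max=2.0, iters=50, lr=0.1):
        # Minimize l(x, u) + gamma * alpha * V(f_hat(x, u)) over u in U.
        u = torch.zeros(R.shape[0], requires_grad=True)
        opt = torch.optim.Adam([u], lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            x_next = f_hat(x, u)
            cost = x @ Q @ x + u @ R @ u \
                   + gamma * alpha * V(x_next.unsqueeze(0)).squeeze()
            cost.backward()
            opt.step()
            with torch.no_grad():
                u.clamp_(-u_max, u_max)    # projection onto the box U
        return u.detach()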
NN-based dynamics model

Neural network models are of particular interest in order to leverage existing pipelines for automatic differentiation and computer vision, and the flexibility offered by the NN formalism. It is quite straightforward to include structural priors in the model (Quaglino et al., 2020; Yıldız et al., 2019; Pozzoli et al., 2019) and to exploit time dependencies (Hochreiter & Schmidhuber, 1997; Gers et al., 2000). Recurrent structures can be very beneficial in reducing the modeling errors over a time horizon, which is key for the MPC success. On the other hand, one-step models can perform quite poorly when unrolled. In practice, prediction horizons can be very long, and it might not be possible to gather sufficient data from demonstrations. Even if this is not the case, every model is imperfect, and errors can accumulate through the prediction window. We analyze the framework performance for a bounded one-step-ahead prediction error, w(t), where:

$w = f(x, u) - \hat f(x, u), \quad \|w\| \le \mu, \quad \forall (x, u) \in \tilde X \times U$,   (15)

for some compact set of states, $\tilde X \supseteq X$. We assume that both f and $\hat f$ are Lipschitz in this set with constants $(L_{fx}, L_{fu})$ and $(L_{\hat f x}, L_{\hat f u})$, respectively.
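Given paired black-box access to the true dynamics f and the surrogate f̂ on sampled state-input pairs, the bound µ in (15) can be estimated empirically, as in this sketch:

    def estimate_mu(f, f_hat, samples):
        # Empirical estimate of the one-step error bound mu in Eq. (15):
        # max over sampled (x, u) pairs of ||f(x, u) - f_hat(x, u)||.
        return max(float(torch.linalg.norm(f(x, u) - f_hat(x, u)))
                   for x, u in samples)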
Stability and safety

Stability of MPC without terminal constraint (undiscounted) has been considered by several authors in the past. The proposed framework is based on the results from the work by Limon et al. (2003), where it was first shown that an appropriate α can be computed so that the resulting MPC is stable without a constraint on the terminal cost. The stability of discounted infinite-horizon discrete-time optimal control problems was analyzed, for instance, by Postoyan et al. (2014); Gaitsgory et al. (2015), where it was shown that for γ sufficiently close to 1, the stability of an undiscounted problem can be retained locally.

The intrinsic robustness to bounded additive uncertainty of an (undiscounted) MPC with a nominal forward model and without an explicit representation of uncertainty can be formulated within the framework of Input-to-State Stability (ISS) (Limon et al., 2009). We extend this result to the discounted case. In order to prove it, we make use of the uniform continuity of the model, the MPC, and the terminal cost, V, as done by Limon et al. (2009). Consider the set:

$\Upsilon_{N,\gamma,\alpha} = \Big\{ x \in \mathbb{R}^{n_x} : J^\star_{MPC}(x) \le \frac{1-\gamma^N}{1-\gamma}\, d + \gamma^N \alpha\, l_s \Big\}$,   (16)

where:

$d = \inf_{x \notin X_s} \ell(x, 0)$.   (17)

The following result is obtained for a deterministic system (1) in closed loop with the MPC defined by problem (13):

Lemma 1 (Stability and robustness).
Assume that V(x) satisfies (8) for a given $\lambda \in [0, 1)$. Then, for any horizon length $N \ge 1$ there exist a constant $\bar\alpha \ge 1$, a minimum discount factor $\bar\gamma \in (0, 1)$, and a model-error bound $\bar\mu$ such that, if $\alpha \ge \bar\alpha$, $\mu \le \bar\mu$ and $\gamma \ge \bar\gamma$, then, $\forall x(0) \in R(X_s)$:

1. If N = 1 and µ = 0, then the system is asymptotically stable for any γ > 0, $\forall x(0) \in \Upsilon_{N,\gamma,\alpha}$.
2. If N > 1 and µ = 0, then the system reaches a set $B_\gamma$ that is included in $X_s$. This set increases monotonically with decreasing discount factors, γ, $\forall x(0) \in \Upsilon_{N,\gamma,\alpha}$.
3. If N > 1, µ = 0, and once in $X_s$ we switch to the expert policy, then the system is asymptotically stable, $\forall x(0) \in \Upsilon_{N,\gamma,\alpha}$.
4. If αV(x) is the global value function for the discounted problem, µ = 0, and $R(X_s) = X_s$, then the system is asymptotically stable, $\forall x(0) \in \Upsilon_{N,\gamma,\alpha}$.
5. If αV(x) is the value function only in $X_s$ for the problem, µ = 0, and $R(X_s) \ne X_s$, then the system is asymptotically stable, $\forall x(0) \in \Upsilon_{N,\gamma,\alpha}$.
6. The MPC has a stability margin. If the MPC uses a surrogate model satisfying Equation (15), then the system is Input-to-State (practically) Stable (ISpS) and there exists a set $B_{N,\gamma,\mu}$ such that $x(t) \to B_{N,\gamma,\mu}$, $\forall x(0) \in \beta\,\Upsilon_{N,\gamma,\alpha}$, with $\beta \le 1$. This final error bound increases monotonically with the model error, µ, and the horizon length, N, as well as for decreasing discount factors, γ.

Lemma 1 states that, for a given horizon length N and contraction factor λ, one can find a minimum scaling of the Lyapunov function V and a lower bound on the discount factor such that the system under the MPC is stable. Hence, if the model is perfect, the state converges to the origin as time progresses. If the model is not perfect, the safety of the system depends on the size of the model error: if this error is less than the maximum tolerable error, $\mu \le \bar\mu$, then the system is safe. The state converges to a bound whose size increases with the model error and the prediction horizon N, and is inversely proportional to the scaling α. In other words, the longer the predictions with an incorrect model, the worse the outcome. Note that the ROA also improves with larger α and γ. The proof of the lemma is provided in Appendix A.
MPC solvers and intrinsic robustness

The use of iterative convex optimisation, e.g., iterative LQR (iLQR; Tassa et al., 2012) or sequential quadratic programming (SQP; Nocedal & Wright, 2006), is convenient for several reasons. A clear advantage of this approach is that, if the model $\hat f$ is Lipschitz and the sub-optimal MPC solution is produced by a sequence of strongly convex optimizations, as in SQP, then the resulting control law K(x) is also Lipschitz (Bemporad et al., 2000; Darup et al., 2017). This is instrumental for proving the existence of a robust stability margin, namely, to account for the model and solver uncertainties, as shown in Lemma 1. In the same way, when using SQP as a solver, the resulting MPC optimal cost, $J^\star_{MPC}(x)$, is also Lipschitz. Moreover, it can be bounded by $\mathcal{K}_\infty$-functions of the same form as Equation (4) with appropriate constants. This allows using the MPC optimal cost as a candidate robust Lyapunov function. In general, if the solution is feasible and it can be demonstrated that the optimal cost is uniformly continuous, then the MPC has intrinsic robustness independently of the solver (Limon et al., 2003). We formulate our MPC as an SQP; the implementation details are provided in Appendix B.
Suboptimal solutions

The use of sub-optimal solvers can be interpreted as an additional disturbance, $w_{solver}$, in the MPC predictions. If an SQP is used, then K(x) is Lipschitz, and the set of initial states is bounded. Hence the effect of sub-optimality is also bounded and can be treated as in point 6 of Lemma 1.
Local Lyapunov functions

If the function V is a Lyapunov function only outside the set $X_T$, then, in the worst case, the MPC optimal cost decreases only when the optimal prediction at time N is outside $X_T$. This can be used to check for convergence to a set, as in point 2 of Lemma 1.
Performance with surrogate models

Our approach to characterising performance with a surrogate model is related to the work by Lowrey et al. (2018). We use αV(x) instead of an approximation of the value function for the task defined using the stage cost (3). The focus of Lowrey et al. (2018) was not on stability but instead on optimality and value-function learning. If the estimated value function is not also enforced to be a local Lyapunov function, then constraint satisfaction and safety can only be guaranteed once the value-function training has converged and the learning is completed.

Lowrey et al. (2018) formally characterize the MPC performance with an imperfect value function and provide a sub-optimality bound for perfect models. Building upon Lemma 1, we extend the performance bound provided by Lowrey et al. (2018) to the case of imperfect models. Let $E_{x \in \mathcal D}[J_{V^\star}(K^\star)]$ define the expected infinite-horizon performance of the optimal policy $K^\star$, evaluated using the correct value function $V^\star$, for a task specified by the stage cost (3). Similarly, let $E_{x \in \mathcal D}[J^\star_{MPC}(x)]$ define the MPC's expected performance with the learned V when a surrogate model is used, and $E_{x \in \mathcal D}[J^\star_{MPC}(x; f)]$ when a perfect model is used with the learned V. We obtain the following lemma:

Lemma 2 (Performance with imperfect models).
Assume that the value-function error is bounded for all x, namely, $\|V^\star(x) - \alpha V(x)\| \le \epsilon$, and that the model error satisfies (15), for some µ > 0. Then, for any δ > 0:

$E_{x\in\mathcal D}[J^\star_{MPC}(x)] - E_{x\in\mathcal D}[J^\star_{V^\star}(x)] \le \frac{2\gamma^N \epsilon}{1-\gamma^N} + \Big(1+\frac1\delta\Big)\|Q\| \sum_{i=0}^{N-1}\gamma^i \Big(\sum_{j=0}^{i-1}\bar L_f^{\,j}\,\mu\Big)^2 + \Big(1+\frac1\delta\Big)\gamma^N \alpha L_V \Big(\sum_{i=0}^{N-1}\bar L_f^{\,i}\Big)^2 \mu^2 + \bar\psi(\mu) + \delta\, E_{x\in\mathcal D}[J^\star_{MPC}(x; f)]$,

where $\bar L_f = \min(L_{\hat f x}, L_{fx})$ and $\bar\psi$ is a $\mathcal{K}_\infty$-function representing the constraint-penalty terms.

Lemma 2 is related to the result by Asadi et al. (2018) for value-based RL. However, here we do not constrain the Lipschitz constant of the model or of the system, which can be open-loop unstable. Moreover, in Lemma 2, we do not assume the MPC optimal cost to be Lipschitz. The proof of the lemma is described in Appendix A.
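For intuition, the sketch below evaluates the Lemma 2 bound numerically for user-supplied constants; every argument is an assumed estimate (e.g., of ‖Q‖, L_V, L̄_f and the expected perfect-model MPC cost), not a quantity computed by the paper.

    def lemma2_bound(eps, mu, N, gamma, alpha, L_V, Q_norm, L_f,
                     delta, J_mpc_perfect, psi_bar=lambda m: 0.0):
        # Numerical evaluation of the Lemma 2 sub-optimality bound.
        c = 1.0 + 1.0 / delta
        value_term = 2 * gamma**N * eps / (1 - gamma**N)
        # drift[i] bounds ||d(i)||, the state-prediction deviation (Eq. 24).
        drift = [sum(L_f**j for j in range(i)) * mu for i in range(N)]
        stage_term = c * Q_norm * sum(gamma**i * drift[i]**2 for i in range(N))
        term_N = sum(L_f**i for i in range(N)) * mu
        terminal_term = c * gamma**N * alpha * L_V * term_N**2
        return value_term + stage_term + terminal_term \
               + psi_bar(mu) + delta * J_mpc_perfect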
MPC auto-tuning
The stability bounds discussed in Lemma 1 can be conservative, and their computation is non-trivial. Theoretically, the larger the α, the larger the ROA (the safe region) of the MPC, up to its maximum extent. Practically, for a very high α, the MPC solver may not converge due to ill-conditioning. Good stability and performance can often be achieved by manually tuning the MPC parameters, although this is time-consuming. In theory, it is possible to use a gradient-based optimization scheme to tune the MPC parameters, thanks to the recent advances in differentiable convex programming (Amos & Kolter, 2017; Agrawal et al., 2019b;a). East et al. (2020) trained a stable MPC for the linear case by differentiating through the LQR solution. For the non-linear case, Amos et al. (2018) trained an MPC without state constraints by differentiating through an iLQR (Tassa et al., 2014). However, these approaches are limited in scope, as they either do not address stability for non-linear systems, do not explicitly handle state constraints, or require optimality in order to compute the gradients correctly. Initially, by using the tool from Agrawal et al. (2019a) within an SQP scheme, we tried to tune the parameters through gradient-based optimization of the loss (9). These attempts were not successful, as expected from the considerations in Amos et al. (2018). Therefore, for practical reasons, in this work we perform a grid search over the MPC parameter α.

Our alternate optimization of the Lyapunov NN, V(x), and the controller is similar to the one of Gallieri et al. (2019). However, instead of training a NN policy, we tune the scaling α of V(x) used in the MPC (13). Further, we extend their approach by using a dataset of demonstrations, $\mathcal D_{demo}$, instead of an explicitly defined initial policy. These are one-step transition tuples, $(x(0), u(0), x(1))_m$, $m = 1, \dots, M$, generated by a (possibly unknown) stabilizing policy, K. Unlike in the approach by Berkenkamp et al. (2017), V is learned without labels. It is also not needed to back-propagate through the forward model during the first computation of V, differently from Chang et al. (2019).
Algorithm 1: Neural Lyapunov MPC learning

In: $\mathcal D_{demo}$, $\hat f$, $\lambda \in [0, 1)$, $\{l_\ell, \epsilon_{ext}\} > 0$, $\gamma \in (0, 1]$, $N \ge 1$, $\alpha_{list}$, $N_{ext}$, $N_V$, $\epsilon_V$, $V_{init}$, $\ell(x, u)$ (if known), $\upsilon \in [0, 1]$
Out: $V_{net}$, $l_s$, $\alpha^\star$

$\mathcal D \leftarrow \mathcal D_{demo}$;  $V_{net} \leftarrow V_{init}$
for $j = 0 \dots N_V$ do
    $(V_{net}, l_s) \leftarrow$ Adam step on Equation (9)
end for
for $i = 0 \dots N_{ext}$ do
    $l_s \leftarrow (1 + \epsilon_{ext})\, l_s$
    for $\alpha \in \alpha_{list}$ do
        $U^\star \leftarrow$ MPC$(V_{net}, \hat f, \mathcal D_{demo}; \alpha)$, from Equation (13)
        $\mathcal D_{MPC}(\alpha) \leftarrow$ one_step_sim$(\hat f, \mathcal D_{demo}, U^\star)$
        $L(\alpha) \leftarrow$ Equation (9) evaluated on $\mathcal D_{MPC}(\alpha)$
    end for
    $\alpha^\star \leftarrow \arg\min_\alpha L(\alpha)$
    $\mathcal D \leftarrow \mathcal D_{MPC}(\alpha^\star)$
    $V_{net} \leftarrow V_{init}$ (optional)
    for $j = 0 \dots N_V$ do
        $(V_{net}, l_s) \leftarrow$ Adam step on Equation (9)
    end for
end for
Once an initial V is learned from the demonstrations, given a forward model, we tune the MPC parameter α to minimize the loss defined in Equation (9). This procedure is repeated over multiple iterations, where after each iteration the tuned MPC serves as a demonstrator for training the next V, which verifies the MPC in closed loop with the model. The resulting procedure is outlined in Algorithm 1. Additionally, we select the Lyapunov function and the MPC using the criterion that the percentage of stable points (ΔV < 0) increases and that of unstable points decreases while iterating over j and i. In Algorithm 1, MPC denotes the proposed Neural Lyapunov MPC, while one_step_sim denotes a one-step propagation of the MPC action through the system surrogate model. To train the parameters of V and the level set $l_s$, the Adam optimizer is used (Kingma & Ba, 2014). A grid search over the MPC parameter α is performed. A thorough tuning of all MPC parameters is possible, for instance, by using black-box optimisation methods; this is left for future work.
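A condensed Python sketch of Algorithm 1, reusing the helper functions from the earlier sketches, is shown below; loop counts, the learning rate, and the batching scheme are simplifying assumptions. `demos` is a tuple of batched tensors (X, U, X_next), and `l_s` a learnable scalar tensor.

    def alternate_learning(V, l_s, f_hat, demos, alpha_list, Q, R,
                           n_ext=3, n_V=500, eps_ext=0.1, lr=1e-3):
        data = demos
        opt = torch.optim.Adam(list(V.parameters()) + [l_s], lr=lr)
        for _ in range(n_ext):
            for _ in range(n_V):                  # fit V on current data
                loss = lyapunov_loss(V, l_s, *data)
                opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():
                l_s.mul_(1.0 + eps_ext)           # inflate the safe level
            X = demos[0]
            best_loss, best_data = float("inf"), None
            for alpha in alpha_list:              # grid search over alpha
                U = torch.stack([one_step_mpc(V, f_hat, x, Q, R, alpha=alpha)
                                 for x in X])
                X_next = torch.stack([f_hat(x, u) for x, u in zip(X, U)])
                L = lyapunov_loss(V, l_s, X, U, X_next).item()
                if L < best_loss:
                    best_loss, best_data = L, (X, U, X_next)
            data = best_data      # tuned MPC becomes the new demonstrator
        return V, l_s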
4. Numerical experiments
Through our experiments, we show the following: 1) the increase in the safe set of the controller obtained by our proposed alternate learning algorithm; 2) a comparison between the proposed Neural Lyapunov MPC and an MPC with a longer horizon in a nominal environment; and 3) the robustness of our one-step MPC compared to a longer-horizon MPC, with both controllers used in conjunction with a surrogate model.
Figure 1: Non-linear robotic control problems. We test our approach on: (a) a torque-limited inverted pendulum of mass m and length l, and (b) non-holonomic vehicle kinematics with constraints.

The proposed approach is verified on two different robotic control problems, shown in Figure 1. In Section 4.1, a torque-limited inverted pendulum is considered, while in Section 4.2 we consider non-holonomic vehicle kinematics with constraints. Details about the models and their configurations can be found in Appendix C.

4.1. Inverted Pendulum

In this task, the pendulum starts near the unstable equilibrium (θ = 0°). The goal is to stay upright. We bound the input so that the system cannot be stabilized once |θ| exceeds a certain angle.
Table 1: Inverted Pendulum: Learning on nominal model.

Iter.   Loss (log(1+x))   Verified (%)   Not Verified (%)
1       3.21              13.25          0.00
2       1.08              13.54          0.00
Setup
We use an MPC with a longer horizon as the demonstrator, with terminal cost $x^T P_{LQR}\, x$, where $P_{LQR}$ is the LQR optimal cost matrix. This is evaluated on a set of equally spaced initial states to generate the dataset $\mathcal D_{demo}$. We train a grey-box NN surrogate model. More details are in Appendix C.
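As a sketch of the demonstrator's terminal cost, P_LQR can be obtained from the discrete algebraic Riccati equation for a linearization of the pendulum around the upright equilibrium; the physical constants and weights below are assumptions, not the paper's configuration.

    import numpy as np
    from scipy.linalg import solve_discrete_are

    # Discrete-time linearization of the pendulum at the upright state.
    dt, m, l, g = 0.05, 1.0, 1.0, 9.81
    A = np.array([[1.0, dt],
                  [dt * g / l, 1.0]])
    B = np.array([[0.0],
                  [dt / (m * l ** 2)]])
    Q = np.diag([1.0, 0.1])
    R = np.array([[0.01]])

    # x^T P_LQR x is the demonstrator MPC's terminal cost.
    P_LQR = solve_discrete_are(A, B, Q, R)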
Results

The learned V and α, obtained from Algorithm 1, produce a one-step MPC that stabilizes both the surrogate and the actual system. Table 1 shows that the loss and the percentage of verified points improve across iterations. The final ROA estimate is nearly maximal and is depicted, along with the safe trajectories produced by the MPC using predictions from the nominal and surrogate models, in Figure 2. The performance matches that of the baseline, and the transfer is successful due to the accuracy of the learned model. A full ablation study is in Appendix D.
Figure 2: Inverted Pendulum: Testing the learned controller on the nominal model. Lyapunov function with safe trajectories; panels: (a) MPC demonstrator with nominal model, (b) Lyapunov MPC with nominal model, (c) Lyapunov MPC with surrogate model. The learning and transfer are successful.

4.2. Vehicle Kinematics

The goal is to steer the car to the (0, 0) position with zero orientation. This is only possible through non-linear control.

Setup
The vehicle cannot move sideways; hence policies such as LQR are not usable to generate demonstrations. Thus, to create $\mathcal D_{demo}$, an MPC with a longer horizon is evaluated over a set of random initial states. The surrogate is a grey-box NN. More details are in Appendix C.

Results
Figure 3 shows the learning curves for the loss and for the percentages of stable and unstable points over three iterations, using the nominal model for training. All the metrics improve across the iterations, indicating an increase in the ROA. This is summarized in Table 2. Similar results are obtained when learning on the surrogate model, as shown in Table 3. We test the transfer capability of the approach in two ways. First, we learn using the nominal model and test using the surrogate model for the MPC predictions. This is reported in Figure 4, where it can be noticed that our MPC significantly outperforms the demonstrator when the inaccurate model is used for predictions. Second, the learning is performed using the surrogate model as in Algorithm 1, and the MPC is then tested on the nominal model while still using the surrogate for prediction. This is depicted in Figure 5. Once again, our MPC works better than the demonstrator when using the incorrect model. The learned MPC transfers successfully and completes the task safely.
Table 2: Car Kinematics: Learning on nominal model.

Iter.   Loss (log(1+x))   Verified (%)   Not Verified (%)
1       1.55              92.20          4.42
2       0.87              93.17          4.89
3       0.48              94.87          3.89
Table 3: Car Kinematics: Learning on surrogate model.

Iter.   Loss (log(1+x))   Verified (%)   Not Verified (%)
1       1.84              91.74          8.26
2       1.43              92.26          7.74
3       1.65              91.61          8.39
5. Additional references
In the context of reinforcement learning, Gu et al. (2016) consider the use of local models and quadratic advantage functions to accelerate deep Q-learning. Lowrey et al. (2018) learn a value function and use it with an MPC and a perfect model; they show that this improves both the learning and the performance. We extend their results to imperfect models in Lemma 2. Farshidian et al. (2019) extend the work by Lowrey et al. (2018) to stochastic MPC in continuous time. Asadi et al. (2018) provide probabilistic bounds on the value error with Lipschitz models, where the Lipschitz constant is limited by the discount factor. Janner et al. (2019) characterize and analyze the performance of policy iteration with imperfect models.
6. Conclusion
We presented Neural Lyapunov MPC, a framework to train a stabilizing non-linear MPC based on a learned neural-network terminal cost and surrogate model. After extending existing theoretical results for MPC and value-based reinforcement learning, we have demonstrated that the proposed framework can incrementally increase the stability region of the MPC and transfer safely in simulated constrained non-linear control scenarios.
Figure 3: Car kinematics: Alternate learning on nominal model. We plot the Lyapunov loss (9) on a log(1+x) scale, together with the percentages of verified and not-verified points; panels: (a) Lyapunov loss (log(1+x)), (b) verified points (%), (c) not verified points (%). Vertical lines separate alternate learning iterations.
Figure 4: Car kinematics: Testing the learned controller on the nominal model. Top: the Lyapunov function at φ = 0 with trajectories. Middle: the evaluated Lyapunov function. Bottom: the Lyapunov function time differences. Panels: (a) short-horizon MPC using nominal model, (b) MPC demonstrator using nominal model, (c) Lyapunov MPC using nominal model, (d) MPC demonstrator using surrogate model, (e) Lyapunov MPC using surrogate model.
Figure 5: Car kinematics: Transfer from the surrogate to the nominal model. Top: the Lyapunov function at φ = 0 with trajectories. Middle: the evaluated Lyapunov function. Bottom: the Lyapunov function time differences. Panels: (a) short-horizon MPC using surrogate model, (b) MPC demonstrator using surrogate model, (c) Lyapunov MPC using surrogate model, (d) MPC demonstrator using nominal model, (e) Lyapunov MPC using nominal model.
References
Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., and Kolter, Z. Differentiable convex optimization layers. arXiv:1910.12430, 2019a.
Agrawal, A., Barratt, S., Boyd, S., Busseti, E., and Moursi, W. M. Differentiating through a cone program. arXiv:1904.09043, 2019b.
Amos, B. and Kolter, J. Z. OptNet: Differentiable optimization as a layer in neural networks. arXiv:1703.00443, 2017.
Amos, B., Jimenez, I., Sacks, J., Boots, B., and Kolter, J. Z. Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems 31, pp. 8289-8300. Curran Associates, Inc., 2018.
Asadi, K., Misra, D., and Littman, M. L. Lipschitz continuity in model-based reinforcement learning. arXiv:1804.07193, 2018.
Bemporad, A., Morari, M., Dua, V., and Pistikopoulos, E. The explicit solution of model predictive control via multiparametric quadratic programming. In Proceedings of the American Control Conference. IEEE, 2000.
Berkenkamp, F., Turchetta, M., Schoellig, A. P., and Krause, A. Safe model-based reinforcement learning with stability guarantees. arXiv:1705.08551, 2017.
Blanchini, F. and Miani, S. Set-Theoretic Methods in Control (Systems & Control: Foundations & Applications). Birkhäuser, 2007.
Bobiti, R. and Lazar, M. Sampling-based verification of Lyapunov's inequality for piecewise continuous nonlinear systems. arXiv:1609.00302, 2016.
Bobiti, R. V. Sampling driven stability domains computation and predictive control of constrained nonlinear systems. PhD thesis, 2017.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Chang, Y.-C., Roohi, N., and Gao, S. Neural Lyapunov control. In Advances in Neural Information Processing Systems, pp. 3240-3249, 2019.
Darup, M. S., Jost, M., Pannocchia, G., and Mönnigmann, M. On the maximal controller gain in linear MPC. 50(1):9218-9223, July 2017.
East, S., Gallieri, M., Masci, J., Koutnik, J., and Cannon, M. Infinite-horizon differentiable model predictive control. arXiv:2001.02244, 2020.
Farshidian, F., Hoeller, D., and Hutter, M. Deep value model predictive control. arXiv:1910.03358, 2019.
Fossen, T. I. Handbook of Marine Craft Hydrodynamics and Motion Control. John Wiley & Sons, 2011.
Gaitsgory, V., Gruene, L., and Thatcher, N. Stabilization with discounted optimal control. 82:91-98, 2015. doi: 10.1016/j.sysconle.2015.05.010.
Gallieri, M., Salehian, S. S. M., Toklu, N. E., Quaglino, A., Masci, J., Koutník, J., and Gomez, F. Safe interactive model-based learning. arXiv:1911.06556, 2019.
Gers, F. A., Schmidhuber, J., and Cummins, F. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 249-256. PMLR, 2010.
Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep Q-learning with model-based acceleration. 2016.
Hadfield-Menell, D., Lin, C., Chitnis, R., Russell, S., and Abbeel, P. Sequential quadratic programming for task plan optimization. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5040-5047. IEEE, 2016.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
Janner, M., Fu, J., Zhang, M., and Levine, S. When to trust your model: Model-based policy optimization. arXiv:1906.08253, 2019.
Kerrigan, E. Robust constraint satisfaction: Invariant sets and predictive control. Technical report, 2000. URL http://hdl.handle.net/10044/1/4346.
Kerrigan, E. C. and Maciejowski, J. M. Soft constraints and exact penalty functions in model predictive control. In Proceedings of the UKACC International Conference, 2000.
Khalil, H. K. Nonlinear Control. Pearson, 2014.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
Kouvaritakis, B. and Cannon, M. Model Predictive Control: Classical, Robust and Stochastic. Advanced Textbooks in Control and Signal Processing, Springer, London, 2015.
Limon, D., Alamo, T., and Camacho, E. F. Stable constrained MPC without terminal constraint. In Proceedings of the American Control Conference, 2003.
Limon, D., Alamo, T., Raimondo, D. M., de la Peña, D. M., Bravo, J. M., Ferramosca, A., and Camacho, E. F. Input-to-state stability: A unifying framework for robust model predictive control. In Nonlinear Model Predictive Control, pp. 1-26. Springer Berlin Heidelberg, 2009.
Lowrey, K., Rajeswaran, A., Kakade, S., Todorov, E., and Mordatch, I. Plan online, learn offline: Efficient learning and exploration via model-based control. arXiv:1811.01848, 2018.
Maciejowski, J. Predictive Control with Constraints. Prentice Hall, 2000.
Mayne, D., Rawlings, J., Rao, C., and Scokaert, P. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789-814, 2000.
Nocedal, J. and Wright, S. Numerical Optimization. Springer Science & Business Media, 2006.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., et al. PyTorch: An imperative style, high-performance deep learning library. arXiv:1912.01703, 2019.
Postoyan, R., Busoniu, L., Nešić, D., and Daafouz, J. Stability of infinite-horizon optimal control with discounted cost. In Proceedings of the IEEE Conference on Decision and Control, 2014.
Pozzoli, S., Gallieri, M., and Scattolini, R. Tustin neural networks: A class of recurrent nets for adaptive MPC of mechanical systems. arXiv:1911.01310, 2019.
Quaglino, A., Gallieri, M., Masci, J., and Koutník, J. SNODE: Spectral discretization of neural ODEs for system identification. arXiv:1906.07038, 2020.
Raković, S. V. and Levine, W. S. (eds.). Handbook of Model Predictive Control. Springer International Publishing, 2019. doi: 10.1007/978-3-319-77489-3.
Rawlings, J. B. and Mayne, D. Q. Model Predictive Control: Theory and Design. Nob Hill Publishing, 2009.
Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4906-4913, 2012. doi: 10.1109/IROS.2012.6386025.
Tassa, Y., Mansard, N., and Todorov, E. Control-limited differential dynamic programming. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1168-1175, 2014. doi: 10.1109/ICRA.2014.6907001.
Yıldız, Ç., Heinonen, M., and Lähdesmäki, H. ODE²VAE: Deep generative second order ODEs with Bayesian neural networks. arXiv:1905.10994, 2019.
Supplementary Material
A. Proof of the Lemmas
Here we provide the proofs of the lemmas stated in the paper. We write the proof of Lemma 2 before that of Lemma 1, since it is simpler and helps in proving the latter.
Proof of Lemma 2
Proof.
To prove the lemma, we first write:

$E_{x\in\mathcal D}[J^\star_{MPC}(x)] - E_{x\in\mathcal D}[J^\star_{V^\star}(x)] = \underbrace{E_{x\in\mathcal D}[J^\star_{MPC}(x)] - E_{x\in\mathcal D}[J^\star_{MPC,f}(x)]}_{I_1} + \underbrace{E_{x\in\mathcal D}[J^\star_{MPC,f}(x)] - E_{x\in\mathcal D}[J^\star_{V^\star}(x)]}_{I_2}$,   (18)

where $E_{x\in\mathcal D}[J^\star_{MPC,f}(x)]$ denotes the MPC performance when a perfect model is used for predictions.

For the term $I_2$, Lowrey et al. (2018) provide a bound on the performance of the MPC policy. It is important to note that in their problem formulation the MPC objective is a maximization of the cumulative discounted reward, while in our formulation (13) we consider a minimization of the cost. Consequently, compared to the inequality presented by Lowrey et al. (2018), there is a change of sign of the terms on the left-hand side. This means:

$E_{x\in\mathcal D}[J^\star_{MPC,f}(x)] - E_{x\in\mathcal D}[J^\star_{V^\star}(x)] \le \frac{2\gamma^N \epsilon}{1-\gamma^N}$.   (19)

We now focus on the term $I_1$ in Equation (18). Let us denote by $x^\star(i)$ and $u^\star(i)$ the optimal state and action predictions, respectively, obtained using the correct model, f, and the MPC policy at time i. By the principle of optimality, the optimal sequence for the MPC using the correct model, $\mathbf u_f$, can be used to upper-bound the optimal cost of the MPC using the surrogate model:

$E_{x\in\mathcal D}[J^\star_{MPC}(x)] - E_{x\in\mathcal D}[J^\star_{MPC,f}(x)] \le E_{x\in\mathcal D}[J_{MPC}(x, \mathbf u_f)] - E_{x\in\mathcal D}[J^\star_{MPC,f}(x)]$.   (20)

Since the input sequence is now the same for both terms on the right-hand side of Equation (20), the difference in cost is driven by the different state-trajectory costs over the horizon (which include the state-constraint violation penalty defined in Equation (14)) as well as the terminal cost:

$E_{x\in\mathcal D}[J_{MPC}(x, \mathbf u_f)] - E_{x\in\mathcal D}[J^\star_{MPC,f}(x)] = E_{x(0)\in\mathcal D}\Big[ \sum_{j=0}^{N-1}\gamma^j\big\{\hat x(j)^T Q \hat x(j) - x^\star(j)^T Q x^\star(j)\big\} + \gamma^N\alpha\big\{V(\hat x(N)) - V(x^\star(N))\big\} + \sum_{j=0}^{N}\big\{\ell_X(\hat x(j)) - \ell_X(x^\star(j))\big\}\Big]$.   (21)

Recall that we assume the surrogate model is Lipschitz with constant $L_{\hat f x}$: for all $\tilde x, x \in \mathbb R^{n_x}$ and the same input $u \in \mathbb R^{n_u}$, $\|\hat f(\tilde x, u) - \hat f(x, u)\| \le L_{\hat f x}\|\tilde x - x\|$. Further, from Equation (15), $\|w(x,u)\| = \|f(x,u) - \hat f(x,u)\| \le \mu$, $\forall (x,u) \in \tilde X \times U$. Under the optimal policy for the correct model, denote the deviation of the state prediction when the MPC input is applied through the model $\hat f$ as $\hat d(j) := \hat x(j) - x^\star(j)$.

At step j = 1:

$\|\hat d(1)\| = \|\hat f(x^\star(0), u^\star(0)) - f(x^\star(0), u^\star(0))\| = \|w(x^\star(0), u^\star(0))\| \le \mu$.

At step j = 2:

$\|\hat d(2)\| \le \underbrace{\|\hat f(x^\star(1)+\hat d(1), u^\star(1)) - \hat f(x^\star(1), u^\star(1))\|}_{\le L_{\hat f x}\|\hat d(1)\|} + \underbrace{\|\hat f(x^\star(1), u^\star(1)) - f(x^\star(1), u^\star(1))\|}_{=\|w(x^\star(1),u^\star(1))\| \le \mu} \le L_{\hat f x}\mu + \mu$.

By induction, it can be shown that:

$\|\hat d(j)\| \le \sum_{i=0}^{j-1} L_{\hat f x}^{i}\,\mu$.   (22)

Alternately, if we assume the correct system is Lipschitz with constant $L_{fx}$, then the same argument with the roles of f and $\hat f$ exchanged gives:

$\|\hat d(j)\| \le \sum_{i=0}^{j-1} L_{fx}^{i}\,\mu$.   (23)

Combining Equations (22) and (23), and letting $\bar L_f^{\,i} = \min(L_{\hat f x}^{i}, L_{fx}^{i})$, we obtain:

$\|\hat d(j)\| \le \sum_{i=0}^{j-1} \bar L_f^{\,i}\,\mu$.   (24)

The following identity is used, $\forall \delta > 0$:

$\|a + b\|^2 \le (1+\delta)\|a\|^2 + \Big(1+\frac1\delta\Big)\|b\|^2$.

Hence, we can write the cost of the predicted state as:

$\hat x(j)^T Q\hat x(j) = \|Q^{1/2}(x^\star(j)+\hat d(j))\|^2 \le (1+\delta)\, x^\star(j)^T Q x^\star(j) + \Big(1+\frac1\delta\Big)\|Q\|\,\|\hat d(j)\|^2$.   (25)

From Equations (24) and (25), $\forall j \in \{0, 1, \dots, N-1\}$:

$\hat x(j)^T Q\hat x(j) - x^\star(j)^T Q x^\star(j) \le \delta\,\ell(x^\star(j), u^\star(j)) + \Big(1+\frac1\delta\Big)\|Q\|\Big(\sum_{i=0}^{j-1}\bar L_f^{\,i}\mu\Big)^2$,   (26)

using $x^\star(j)^T Q x^\star(j) = \ell(x^\star(j), 0) \le \ell(x^\star(j), u^\star(j))$. Recall from Equation (4) that $V(x) \le L_V\|x\|^2$. Proceeding as before, for the terminal cost:

$V(\hat x(N)) - V(x^\star(N)) \le \delta\, V(x^\star(N)) + \Big(1+\frac1\delta\Big) L_V \Big(\sum_{i=0}^{N-1}\bar L_f^{\,i}\mu\Big)^2$.   (27)

The final part of the proof concerns the constraint-cost term. Let the state constraints be defined as a set of inequalities, $X = \{x \in \mathbb R^n : g(x) \le 1\}$, where g is a convex function. For the optimal solution, $x^\star$, the violation of the constraint is represented through the slack variable $s^\star = s(x^\star) = (g(x^\star) - 1) + |g(x^\star) - 1|$, proportional to the positive part of $g(x^\star) - 1$. Since the constraints are convex and compact, and they contain the origin, at the optimal solution $x^\star$ there exists a $\mathcal K_\infty$-function, $\bar\eta(r)$, such that:

$\big|\ell_X(s(x^\star + \hat d)) - \ell_X(s(x^\star))\big| \le \bar\eta(\|\hat d\|)$.

Using the above inequality and Equation (24), it follows that, $\forall j \in \{0, 1, \dots, N\}$:

$\ell_X(\hat x(j)) - \ell_X(x^\star(j)) \le \bar\eta\Big(\sum_{i=0}^{j-1}\bar L_f^{\,i}\mu\Big) = \bar\eta_j$.   (28)

By combining Equations (18), (19), (20), (21), (26), (27) and (28), we obtain the bound stated in the lemma:

$E_{x\in\mathcal D}[J^\star_{MPC}(x)] - E_{x\in\mathcal D}[J^\star_{V^\star}(x)] \le \frac{2\gamma^N\epsilon}{1-\gamma^N} + \Big(1+\frac1\delta\Big)\|Q\|\sum_{i=0}^{N-1}\gamma^i\Big(\sum_{j=0}^{i-1}\bar L_f^{\,j}\mu\Big)^2 + \Big(1+\frac1\delta\Big)\gamma^N\alpha L_V\Big(\sum_{i=0}^{N-1}\bar L_f^{\,i}\Big)^2\mu^2 + \underbrace{\sum_{j=0}^{N}\bar\eta_j}_{=:\,\bar\psi(\mu)} + \delta\, E_{x\in\mathcal D}[J^\star_{MPC}(x; f)]$. ∎

Proof of Lemma 1
Proof.
In order to prove point 1 in the lemma, we first use the standard arguments for the MPC without terminal constraint (Mayne et al., 2000; Limon et al., 2003) in the undiscounted case. We then extend the results to the discounted case.
Nominal stability
First, when an invariant-set terminal constraint is used, which in our case corresponds to the condition $V(x(N)) \le l_s$ with $X_s \subseteq X$, Mayne et al. (2000) have provided conditions to prove stability by demonstrating that $J^\star_{MPC}(x)$ is a Lyapunov function. These require the terminal cost to be a Lyapunov function that satisfies Equation (5). Hence, we start by looking for values of α such that αV(x) satisfies Equation (5). In other words, we wish to find an $\bar\alpha_1 \ge 0$ such that, for all $\alpha \ge \bar\alpha_1$ and for some policy K (in our case, the demonstrator for V), the following condition holds:

$\alpha V(f(x, K(x))) - \alpha V(x) \le -\ell(x, K(x))$.   (29)

Let us denote $x^+ = f(x, K(x))$ for brevity. We have, by assumption, that:

$\alpha\big(V(x^+) - \lambda V(x)\big) \le 0$.   (30)

This implies that:

$\alpha V(x^+) - \alpha V(x) \le -\alpha(1-\lambda)V(x)$.   (31)

Recall that the stage cost satisfies $l_\ell\|x\|^2 \le \ell(x, u)$. Since the MPC is solved using a sequence of convex quadratic programs, it is also Lipschitz (Bemporad et al., 2000). Similarly, if K is Lipschitz or (uniformly) continuous over the closed and bounded set U, then, since X is also closed and bounded, there also exists a local upper bound of the stage cost along this policy, namely, $\ell(x, K(x)) \le L_\ell\|x\|^2$. Further, recall from Equation (4) that $l_\ell\|x\|^2 \le V(x)$. Using these bounds, we have:

$\alpha V(x^+) - \alpha V(x) \le -\alpha(1-\lambda)\,l_\ell\|x\|^2 = -\alpha(1-\lambda)\frac{l_\ell}{L_\ell}\,L_\ell\|x\|^2 \le -\alpha(1-\lambda)\frac{l_\ell}{L_\ell}\,\ell(x, K(x)) = -\beta\,\ell(x, K(x))$.   (32)

To satisfy condition (29), solving for $\beta \ge 1$ is sufficient. From Equations (31) and (32), this implies:

$\alpha \ge \frac{L_\ell}{l_\ell(1-\lambda)} = \bar\alpha_1 > 0$.   (33)

Now the function αV(x) satisfies all the sufficient conditions stated by Mayne et al. (2000) for the stability of an MPC under the terminal constraint $\hat x(N) \in X_s$, which is equivalent to $V(\hat x(N)) \le l_s$, without discount (γ = 1). Since we do not wish to have such a terminal constraint, we look for another lower bound $\bar\alpha_2 \ge 0$ such that, if $\alpha \ge \bar\alpha_2$, then $V(\hat x(N)) \le l_s$ at the optimal solution. The computation of $\bar\alpha_2$ has been outlined by Limon et al. (2009) for the undiscounted case. Since our constraints are closed and bounded and contain the origin, and our model and the MPC control law are both Lipschitz, we directly use their result:

$\bar\alpha_2 = \frac{\sum_{i=0}^{N-1}\ell(\tilde x(i), \tilde u(i)) - N d}{(1-\rho)\, l_s}$,   (34)

where $(\tilde x(i), \tilde u(i))$ represents a sub-optimal state-action sequence for which $V(\tilde x(N)) \le \rho l_s$ with $\rho \in [0, 1)$, and d is a lower bound of the stage cost $\ell$ for all x outside $X_s$ and all u in U. Then, one can take:

$\alpha \ge \max(\bar\alpha_1, \bar\alpha_2) = \bar\alpha$   (35)

to guarantee stability when γ = 1. When a discount factor (γ < 1) is used, condition (29) is still respected by the same range of α, since:

$\gamma V(x^+) - V(x) \le V(x^+) - V(x)$.   (36)
However, from the discussion in Gaitsgory et al. (2015) for infinite-horizon optimal control, it appears that Equation (29) is not sufficient for $J^\star_{MPC}(x)$ to be a Lyapunov function, even when a terminal constraint is used. We wish to find a lower bound $\bar\gamma$ such that, given α satisfying Equation (35), the MPC is stable for $\gamma \ge \bar\gamma$. For infinite-horizon optimal control, this was done by Gaitsgory et al. (2015). First, recall that:

$\alpha V(x) \le \alpha L_V\|x\|^2 \le \alpha\frac{L_V}{l_\ell}\,\ell(x, 0) \le C\inf_{u\in U}\ell(x, u)$.   (37)

Gaitsgory et al. (2015) show that $0 \le C < 1/(1-\gamma)$ is sufficient for the stability of an infinite-horizon discounted optimal control problem when αV(x) is its value function. This means that:

$\alpha\frac{L_V}{l_\ell} < \frac{1}{1-\gamma}$,   (38)

which implies that:

$\gamma > 1 - \frac{l_\ell}{\alpha L_V} = \bar\gamma_1 \in [0, 1)$.   (39)

For MPC, we will instead present an additional condition that leads at least to convergence to the safe set, resulting in a bounded and safe solution. Exact convergence to the origin is then confirmed when V is the actual value function, as in Gaitsgory et al. (2015), or if we switch to the demonstrating policy, K, once in the terminal set. Finally, we will remove the terminal constraint as done for the undiscounted case, with a final bound on α and γ.

Recall that condition (29) applies. If the terminal constraint was met at time t, namely, if $V(x^\star(N)) \le l_s$, then at the next time step, t+1, the input $u(N+1) = K(x^\star(N))$ is feasible. Hence, the optimal MPC solution can be upper-bounded by the shifted solution from the previous time t, with the policy K appended at the end of the horizon (Mayne et al., 2000). Denote this input sequence by $\tilde{\mathbf u}$ and the corresponding predictions by $\tilde x$. We have that:

$\Delta J^\star_{MPC}(x) = J^\star_{MPC}(x^+) - J^\star_{MPC}(x) \le J_{MPC}(x^+, \tilde{\mathbf u}) - J^\star_{MPC}(x)$.

Hence,

$\Delta J^\star_{MPC}(x) \le \sum_{i=1}^{N}\gamma^{i-1}\ell(\tilde x(i), \tilde u(i)) + \gamma^N\alpha V(\tilde x^+(N)) - \ell(x, \tilde u(0)) - \sum_{i=1}^{N-1}\gamma^{i}\ell(\tilde x(i), \tilde u(i)) - \gamma^N\alpha V(\tilde x(N))$
$= (1-\gamma)L_{N-1}(x) - \ell(x, \tilde u(0)) + \gamma^N\big(\alpha V(\tilde x^+(N)) - \alpha V(\tilde x(N)) + \ell(\tilde x(N), K(\tilde x(N)))\big)$
$\le (1-\gamma)L_{N-1}(x) - \ell(x, \tilde u(0))$,

where $L_{N-1}(x) = \sum_{i=1}^{N-1}\gamma^{i-1}\ell(\tilde x(i), \tilde u(i))$. Now, for γ = 1, the effect of $L_{N-1}$ disappears and the MPC optimal cost is a Lyapunov function, as in the standard MPC stability result of Mayne et al. (2000). By inspection of $L_{N-1}$, since the cost is bounded over bounded sets, a discount factor close enough to 1 can also be found such that $(1-\gamma)L_{N-1}(x) < \ell(x, \tilde u(0))$; this γ, however, depends on x. Consider $x \notin X_s$ for which there exists a feasible solution, namely, one providing $\tilde x(N) \in X_s$. Then, since $\ell$ is strictly increasing with $\ell(0, 0) = 0$, $X_s$ contains the origin, and the constraints are bounded, there exists $\upsilon \ge 0$ such that, for any feasible x,

$\bar L_{N-1} = \upsilon\,(N-1)\inf_{x\notin X_s}\ell(x, 0)$

is an upper bound of $L_{N-1}(x)$. For instance,

$\upsilon = \frac{\sup_{(x,u)\in\epsilon X\times U}\ell(x, u)}{\inf_{x\notin X_s}\ell(x, 0)}$

is sufficient for any closed set of initial conditions $x(0) \in \epsilon X \supset X_s$, with $\epsilon > 0$. In order to have stability, it suffices to have $(1-\gamma)\bar L_{N-1} - \ell(x, \tilde u(0)) \le 0$, which requires:

$\gamma \ge 1 - \frac{\ell(x, \tilde u(0))}{\bar L_{N-1}} = \bar\gamma_2(x)$.   (40)

In the above condition, $\bar\gamma_2(x)$ can be less than 1 only outside a neighborhood of the origin. Consider again

$d = \inf_{x\notin X_s}\ell(x, 0)$.   (41)

Then taking

$\gamma \ge 1 - \frac{d}{\bar L_{N-1}} = \bar\gamma_2 \in (0, 1)$   (42)

provides that the system trajectory enters the safe set $X_s$; hence $B_\gamma \subseteq X_s$. Finally, once $x \in X_s$, we have that the policy K(x) is feasible and:

$\ell(x, K(x)) \le \alpha V(x) - \alpha V(x^+) \le \alpha V(x)$.
Hence, we can use this policy to upper bound the MPC cost:
$$J^\star_{\mathrm{MPC}}(x) \le \sum_{i=0}^{N-1} \gamma^i\, \alpha V(\tilde{x}(i)) + \gamma^N \alpha V(\tilde{x}(N)).$$
If the above is true with equality, then we can proceed as in Theorem 3.1 of Gaitsgory et al. (2015), with $\gamma > \bar{\gamma}_1$. This would require $\alpha V(x)$ to be also a value function for the discounted problem.

From the above considerations, we can conclude that:
1. If $N = 1$, then $\bar{L}_{N-1} = 0$ and the system is asymptotically stable for any $\gamma > 0$.
2. If $N > 1$ and $\gamma \ge \bar{\gamma}_2$, then the system reaches a bound $B_\gamma$ that is included in $X_s$.
3. If $N > 1$, $\gamma \ge \bar{\gamma}_2$, and once in $X_s$ we switch to the policy $K(x)$, then the system is asymptotically stable.
4. If $\alpha V(x)$ is the global value function for the discounted problem and if $R(X_s) = X_s$, then $\gamma > \bar{\gamma}_1$ provides that the system is asymptotically stable.
5. If $\alpha V(x)$ is only the value function in $X_s$ for the discounted problem, and if $R(X_s) \ne X_s$, then $\gamma > \max(\bar{\gamma}_1, \bar{\gamma}_2)$ provides that the system is asymptotically stable.

Finally, following Theorem 3 from (Limon et al., 2003), the terminal constraint can be removed for all points $x \in R_N(\rho X_s)$, with $\rho \in [0, 1)$, by setting:
$$\alpha \ge \bar{\alpha} = \max(\bar{\alpha}_{1,2}, \bar{\alpha}_3), \qquad (43)$$
$$\bar{\alpha}_3 = \frac{\bar{L}_{N-1} - \frac{1-\gamma^N}{1-\gamma}\, d}{(1-\rho)\, \gamma^N l_s} > 0. \qquad (44)$$
In fact, by the same argument of Limon et al. (2003), for any state for which there exists a feasible sequence $\tilde{u}$ taking $\hat{x}(N)$ to $\rho X_s$, we have that:
$$J^\star_{\mathrm{MPC}}(x) \le J_{\mathrm{MPC}}(x, \tilde{u}) = L_N(x) + \rho\, \alpha \gamma^N l_s. \qquad (45)$$
If $\alpha$ satisfies (43), then we also have that:
$$(1-\rho)\, \alpha\, \gamma^N l_s \ge L_N(x) - \frac{1-\gamma^N}{1-\gamma}\, d, \qquad (46)$$
from which we can directly verify that the set defined in (16) is a ROA (for either asymptotic or practical stability):
$$\Upsilon_{N,\gamma,\alpha} = \left\{ x \in \mathbb{R}^{n_x} : J^\star_{\mathrm{MPC}}(x) \le \frac{1-\gamma^N}{1-\gamma}\, d + \gamma^N \alpha l_s \right\}.$$
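To make the bounds above concrete, the following worked instance evaluates (42) and (44) with purely illustrative constants (these are not values from the experiments):

```latex
% Illustrative constants only; not taken from the experiments.
\begin{align*}
  N &= 3, \quad d = \inf_{x \notin X_s} \ell(x, 0) = 0.1, \quad \upsilon = 100
     \;\Rightarrow\; \bar{L}_{N-1} = \upsilon\,(N-1)\,d = 20,\\
  \bar{\gamma}_2 &= 1 - \frac{d}{\bar{L}_{N-1}} = 1 - \frac{0.1}{20} = 0.995,\\
  \bar{\alpha}_3 &= \frac{\bar{L}_{N-1} - \frac{1-\gamma^N}{1-\gamma}\,d}
                        {(1-\rho)\,\gamma^N l_s}
     = \frac{20 - 2.985 \cdot 0.1}{0.5 \cdot 0.995^{3} \cdot 1} \approx 40.0
     \qquad (\gamma = 0.995,\ \rho = 0.5,\ l_s = 1).
\end{align*}
```

Even in this mild setting, $\bar{\gamma}_2$ is close to one and $\bar{\alpha}_3$ is large; for $N = 1$, instead, $\bar{L}_{N-1} = 0$ and both restrictions vanish, which is consistent with the single-step horizon used in this work.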
Robustness

For point 6, the stability margins of nominal MPC have been studied in (Limon et al., 2009). In particular, in a setup without terminal constraint, under nominal stabilising conditions, with a uniformly continuous model (in our case even Lipschitz) and cost functions that are also uniformly continuous, the optimal MPC cost is also uniformly continuous (Limon et al., 2009, Proposition 1). In other words, from (Limon et al., 2009, Lemma 1), there is a $\mathcal{K}_\infty$-function, $\sigma$, such that at the optimal solution, $u^\star$, we have:
$$\left| J^\star_{\mathrm{MPC}}(x + w) - J^\star_{\mathrm{MPC}}(x) \right| \le \sigma(\|w\|). \qquad (47)$$
Using the above identity, one wishes to bound the increase in the MPC cost due to uncertainty. At the same time, we wish the MPC to remain feasible and to perform a contraction, namely, to have a stability margin. Since we are using soft constraints, the MPC always remains feasible; however, we need the predictions at the end of the horizon to be in an invariant set $X_s$ even under the effect of uncertainty. In particular, we will use $V(x)$ and its contraction factor $\lambda$ to compute a smaller level set, $\zeta X_s$, for some $\zeta \in (0, 1)$, which is invariant under the uncertainty. Once this is found, we can compute a new $\alpha$ for this set according to (43). In particular, under the policy $K$, we have that:
$$V(x^+ + w) - V(x) \le V(x^+) - V(x) + L_V \|w\| \le (\lambda - 1)\, V(x) + L_V \|w\|.$$
We wish this quantity to be non-positive for $x \notin \zeta X_s$, which means that $V(x) \ge \zeta l_s$. For this it is sufficient to have:
$$\|w\| \le \bar{\mu} = \frac{1-\lambda}{L_V}\, \zeta l_s \le \frac{1-\lambda}{L_V}\, V(x). \qquad (48)$$
Therefore, given the model error $w$, if the MPC remains feasible and if $\alpha$ and $\gamma$ exceed their lower bounds given the restricted set $\zeta X_s$, then $\|w\| \le \bar{\mu}$ implies that:
$$\begin{aligned} \Delta J^\star_{\mathrm{MPC}}(x) &= J^\star_{\mathrm{MPC}}(x^+ + w) - J^\star_{\mathrm{MPC}}(x^+) + J^\star_{\mathrm{MPC}}(x^+) - J^\star_{\mathrm{MPC}}(x) \\ &\le J_{\mathrm{MPC}}(x^+, \tilde{u}) - J^\star_{\mathrm{MPC}}(x) + \sigma(\|w\|) \\ &\le (1-\gamma)\, \bar{L}_{N-1} - \ell(x, \tilde{u}(0)) + \sigma(\|w\|) \\ &\le -l_\ell \|x\|^2 + \sigma(\|w\|) + \bar{d}(N), \end{aligned}$$
which is the definition of Input-to-State practical Stability (Khalil, 2014; Limon et al., 2009), where we have defined $\bar{d}(N) = (1-\gamma)\bar{L}_{N-1}$. The trajectory of the system is therefore bounded by the level set of $J^\star_{\mathrm{MPC}}(x)$ outside which $\sigma(\mu) + \bar{d}(N) \le l_\ell \|x\|^2$. Since $\sigma$ is strictly increasing and $\bar{d}$ is strictly decreasing, we can also conclude that the size of this bound increases with increasing model error $\mu$ and with the horizon length $N$. Note that the term $\bar{d}$ vanishes if $\gamma = 1$, but the term $\sigma$ will also increase with $N$ if $\bar{L}_f > 1$. From the restriction of the terminal set, it also follows that the ROA defined in Equation (16) will also be restricted unless we recompute a larger $\alpha$ for this new set. In practice, the ISS bound $\sigma(\mu)$ from Lemma 1 has a form similar to the one discussed for the constraint penalty in the proof of Lemma 2, see Equation (28). Its explicit computation is omitted for brevity; however, in general, we can expect the bound to become worse for systems that are open-loop unstable as the horizon length increases.
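The margin in Equation (48) is cheap to evaluate once $L_V$ and $\lambda$ are known. A minimal sketch, with assumed constants that are not taken from the experiments:

```python
def robustness_margin(L_V: float, lam: float, zeta: float, l_s: float) -> float:
    """mu_bar = (1 - lambda) / L_V * zeta * l_s, from Equation (48)."""
    assert 0.0 < lam < 1.0 and 0.0 < zeta < 1.0
    return (1.0 - lam) / L_V * zeta * l_s

# Example: a Lyapunov function with Lipschitz constant 50, contraction 0.9,
# safe level l_s = 1, and a 20% restriction of the terminal set.
print(robustness_margin(L_V=50.0, lam=0.9, zeta=0.8, l_s=1.0))  # 0.0016
```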
B. MPC as SQP Formulation

We solve the MPC problem through iterative linearisations of the forward model, which give the state and input matrices:
$$A(i) = \left. \frac{\partial \hat{f}}{\partial x} \right|_{\bar{x}(i), \bar{u}(i)}, \qquad B(i) = \left. \frac{\partial \hat{f}}{\partial u} \right|_{\bar{x}(i), \bar{u}(i)}. \qquad (49)$$
These are evaluated around a reference trajectory:
$$x^\star = \{ \bar{x}^\star(i),\ i = 0, \dots, N \}, \qquad (50)$$
$$u^\star = \{ \bar{u}^\star(i),\ i = 0, \dots, N-1 \}. \qquad (51)$$
This is initialised by running the forward model with zero inputs, and then updated at each iteration by simulating the forward model on the current optimal solution. The Lyapunov function $V$ is expanded to a second-order term by using a Taylor expansion and is evaluated around the same trajectory. The Jacobian and Hessian matrices, respectively $\Gamma$ and $H$, are:
$$\Gamma = \left. \frac{\partial V}{\partial x} \right|_{\bar{x}(N)}, \qquad H = \frac{1}{2} \left. \frac{\partial^2 V}{\partial x^2} \right|_{\bar{x}(N)}. \qquad (52)$$
All the quantities in Equations (49) and (52) are computed using automatic differentiation. Using these matrices, we solve the convex optimization problem:
$$\begin{aligned} \delta u^\star = \arg\min\ & \gamma^N \alpha \left( \| H^{1/2}\, \delta\hat{x}(N) \|^2 + \Gamma^T \delta\hat{x}(N) \right) + \sum_{i=0}^{N-1} \gamma^i\, \ell(\hat{x}(i), \hat{u}(i)) \qquad (53) \\ \text{s.t.}\ & \hat{x}(i+1) = A(i)\, \delta\hat{x}(i) + B(i)\, \delta\hat{u}(i) + \hat{f}(\bar{x}(i)), \\ & \hat{x}(i) - \delta\hat{x}(i) = \bar{x}(i), \\ & \hat{u}(i) - \delta\hat{u}(i) = \bar{u}(i), \\ & \hat{x}(i) \in X,\ \forall i \in [0, N], \\ & \hat{u}(i) \in U,\ \forall i \in [0, N-1], \\ & \hat{x}(0) = x(t), \\ & \| \delta\hat{u}(i) \|_\infty \le r_{\mathrm{trust}},\ \forall i \in [0, N-1], \end{aligned}$$
where the state constraints are again softened and the last inequality constraint is used to impose a trust region with a fixed radius, $r_{\mathrm{trust}}$. This notably improves the search for an optimal solution, as shown for the inverted pendulum case in Figure 15. Once problem (53) is solved, we obtain the delta sequence $\delta u^\star$. The new optimal solution is then computed according to the update rule:
$$u^\star \leftarrow u^\star + lr\, \delta u^\star,$$
where $lr < 1$ is a learning rate, which is annealed after each iteration. Finally, the reference trajectories used for the next linearisation are $u^\star$ and the state series resulting from simulating the forward model on $u^\star$, namely $x^\star = \hat{f} \circ u^\star$. This is summarised in Algorithm 2. Interested readers can find more details on SQP in the work by Nocedal & Wright (2006) and Hadfield-Menell et al. (2016), where adaptive schemes for computing the trust radius are discussed.
Algorithm 2
Neural Lyapunov MPC solver
In: $x(t)$, $\hat{f}$, $\alpha$, $V$, $N_{\mathrm{steps}}$, $\epsilon_{lr} \in [0, 1)$, $r_{\mathrm{trust}} > 0$, $lr$
Out: $u(t)$
  $x^\star \leftarrow \{x(t)\}^{N+1}$, $u^\star \leftarrow \{0\}^{N}$
  for $j = 0, \dots, N_{\mathrm{steps}}$ do
    $\{A(i)\}, \{B(i)\} \leftarrow$ linearisation of $\hat{f}$ using Equation (49)
    $(\Gamma, H) \leftarrow$ Taylor expansion of $V$ using Equation (52)
    $\delta u^\star \leftarrow$ solution of optimization problem (53)
    $u^\star \leftarrow u^\star + lr\, \delta u^\star$
    $x^\star \leftarrow \hat{f} \circ u^\star$
    $lr \leftarrow (1 - \epsilon_{lr})\, lr$
  $u(t) \leftarrow u^\star(0)$
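A Python rendering of this loop is sketched below. The helper names `rollout` and `solve_qp` are hypothetical placeholders (`solve_qp` stands for any convex QP solver applied to problem (53)); this is a sketch under those assumptions, not the released implementation:

```python
import torch

def rollout(f_hat, x0, inputs):
    """Simulate the forward model f_hat over an input sequence."""
    xs = [x0]
    for u in inputs:
        xs.append(f_hat(xs[-1], u))
    return xs

def neural_lyapunov_mpc_solve(x_t, f_hat, V, alpha, gamma, N, n_u,
                              N_steps, lr, eps_lr, r_trust, solve_qp):
    """Sketch of Algorithm 2; `solve_qp` is a placeholder QP solver for (53)."""
    u_star = [x_t.new_zeros(n_u) for _ in range(N)]      # u* <- {0}^N
    x_star = rollout(f_hat, x_t, u_star)                 # x* <- f_hat o u*
    for _ in range(N_steps):
        # Linearise the model along the reference trajectory, Equation (49).
        A, B = zip(*(torch.autograd.functional.jacobian(
                        f_hat, (x_star[i], u_star[i])) for i in range(N)))
        # Second-order Taylor expansion of the terminal cost V, Equation (52).
        Gamma = torch.autograd.functional.jacobian(V, x_star[N])
        H = 0.5 * torch.autograd.functional.hessian(V, x_star[N])
        # Solve the trust-region QP (53) for the input increments.
        delta_u = solve_qp(A, B, Gamma, H, x_star, u_star, alpha, gamma, r_trust)
        # Update the solution, re-simulate, and anneal the learning rate.
        u_star = [u + lr * du for u, du in zip(u_star, delta_u)]
        x_star = rollout(f_hat, x_t, u_star)
        lr = (1.0 - eps_lr) * lr
    return u_star[0]                                     # u(t) <- u*(0)
```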
C. Experimental Setup

We demonstrate our approach on an inverted pendulum and a vehicle kinematics control problem. We implement our code using PyTorch (Paszke et al., 2019).
C.1. Implementation Specifics
A practical consideration for implementing Algorithm 1 is tuning the MPC terminal cost scaling, $\alpha$, via grid search. The MPC needs to run over the entire dataset of initial points, $X = \{x_m(0)\}_{m=1}^{M}$, with different configurations. In order to speed up the search for $\alpha$, we run the MPC only on a subsample of the initial dataset. Only once the optimal $\alpha$ is found do we run the MPC over the entire dataset and use this data to compute the next $V(x)$.

In addition to what is presented in the main text, we parameterize $V(x)$ with a trainable scaling factor, $\beta$, as:
$$V(x) = \mathrm{softplus}(\beta)\, x^T \left( l_\ell I + V_{\mathrm{net}}(x)^T V_{\mathrm{net}}(x) \right) x, \qquad (54)$$
where $\mathrm{softplus}(x) = \log(1 + \exp(x))$. The parameter $\beta$ is initialized to the same fixed value in all experiments except for the inverted pendulum without LQR loss, i.e., for the results in Figure 14, where $\beta = \mathrm{softplus}^{-1}(25)$ is used. The full set of parameters for the experiments can be found in Table 4.
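A possible PyTorch rendering of the parameterisation (54) is sketched below; the hidden sizes are assumptions (Table 4 only records the first width, 128), and the initial value of `beta` is a placeholder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LyapunovFunction(nn.Module):
    """Sketch of Equation (54):
    V(x) = softplus(beta) * x^T (l_ell*I + V_net(x)^T V_net(x)) x."""

    def __init__(self, n_x: int, n_V: int, l_ell: float = 0.1):
        super().__init__()
        self.l_ell = l_ell
        self.beta = nn.Parameter(torch.tensor(1.0))  # trainable scaling (init assumed)
        self.V_net = nn.Sequential(                  # maps x to an (n_V x n_x) matrix
            nn.Linear(n_x, 128), nn.Tanh(),
            nn.Linear(128, n_V * n_x),
        )
        self.n_V, self.n_x = n_V, n_x

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_x); returns V(x): (batch,)
        W = self.V_net(x).view(-1, self.n_V, self.n_x)
        M = self.l_ell * torch.eye(self.n_x) + W.transpose(1, 2) @ W  # PSD by construction
        quad = torch.einsum("bi,bij,bj->b", x, M, x)
        return F.softplus(self.beta) * quad
```

By construction, $l_\ell I + V_{\mathrm{net}}^T V_{\mathrm{net}} \succ 0$, so the sketch yields $V(x) \ge \mathrm{softplus}(\beta)\, l_\ell \|x\|^2 > 0$ away from the origin, matching the lower bound used in the proofs.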
Table 4: Configuration for Experiments. We specify the parameters used for the simulation of the system dynamics, the MPC demonstrator, the Neural Lyapunov MPC, as well as for the alternate learning algorithm.

Parameter                 Symbol       Car Kinematics        Pendulum
GENERAL
Mass                      m            –                     0.15 kg
Length                    l            –                     0.5 m
Rotational friction       λ_F          –                     0.1 N m s rad⁻¹
Gravity                   g            –                     9.81 m s⁻²
Sampling time             dt [s]       –                     –
State constraint          X            [–, –]² × [−π, π]     [−π, π] × [−π, π]
Input constraint          U            [–, –] × [−π, π]      [–, –]
State penalty             Q            diag(1, –, –)         diag(–, –)
Input penalty             R            diag(100, π)          –
Discount factor           γ            –                     –
DEMONSTRATOR MPC
Horizon                   N            –                     –
# of linearisations       N_steps      –                     –
Trust region              r_trust      ∞                     –
Learning rate             lr           –                     –
Decay rate                ε_lr         –                     –
Terminal penalty          P            Q                     P_LQR
NEURAL LYAPUNOV MPC
Horizon                   N            1                     1
# of linearisations       N_steps      18                    18
Trust region              r_trust      –                     –
Learning rate             lr           –                     –
Scaling of V              α            –                     –
Decay rate                ε_lr         –                     –
V_net architecture                     MLP(128, –, –, –)     –
V_net output              n_V × n_x    –                     –
ALTERNATE LEARNING
Outer iterations          N_ext        3                     –
Enlargement factor        ε_ext        –                     –
Line search               α list       –                     –
Lyapunov epochs           N_V          500                   200
Loss                                   Equation (9)          Equation (9)
ρ                         ρ            –                     –
Decrease factor           v            –                     –
Contraction factor        λ            –                     –
Lyapunov learning rate    lr           –                     –
Lyapunov weight decay     wd           –                     –

C.2. Baseline Controllers
Our Neural Lyapunov MPC has a single-step horizon and uses the learned Lyapunov function as the terminal cost. To compare its performance, we consider two other MPCs:
• Long-horizon MPC (demonstrator): An MPC with a longer horizon and a quadratic terminal cost $x^T P x$. This MPC is also used to generate the initial demonstrations for alternate learning.
• Short-horizon MPC: An MPC with a single-step horizon and the same terminal cost as the long-horizon MPC. All other parameters are the same as the Neural Lyapunov MPC except $\alpha$, which is tuned manually.

C.3. Forward models
We use an Euler forward model for the environments. With $dt$ as the sampling time, the state transition is:
$$\eta(t+1) = \eta(t) + dt\, f_u(\eta(t), u(t)), \qquad (55)$$
where $\eta(t)$ is the state, $u(t)$ is the input, and $f_u(\eta(t), u(t))$ is the time-invariant, deterministic dynamical system.

Figure 6: Testing the surrogate models. We generate each trajectory by propagating an initial state $\eta(0)$ with input sequence $\{u(t)\}_{t=0}^{T-1}$ through the nominal system and the learned surrogate model; each panel compares the nominal and surrogate trajectories. (a) For the inverted pendulum, inputs are sampled uniformly and the simulation is performed for $T = 25$ steps. (b) For the vehicle kinematics, inputs are sampled uniformly and the simulation is performed for $T = 5$ steps.

C.3.1. Vehicle Kinematics
World model
For the non-holonomic vehicle, $\eta = (x, y, \phi) \in \mathbb{R}^3$ is the state, respectively the coordinates in the plane and the vehicle orientation, and $u = (v, \omega) \in \mathbb{R}^2$ is the control input, respectively the linear and angular velocity in the body frame. $f_u(\eta, u)$ encodes the coordinate transformation from the body to the world frame (Fossen, 2011):
$$f_u(\eta(t), u(t)) = \begin{pmatrix} \dot{x} \\ \dot{y} \\ \dot{\phi} \end{pmatrix} = \begin{pmatrix} v(t) \cos\phi(t) \\ v(t) \sin\phi(t) \\ \omega(t) \end{pmatrix} = \underbrace{\begin{pmatrix} \cos\phi(t) & 0 \\ \sin\phi(t) & 0 \\ 0 & 1 \end{pmatrix}}_{J(\eta)} \begin{pmatrix} v(t) \\ \omega(t) \end{pmatrix}. \qquad (56)$$
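For reference, Equations (55)-(56) amount to the following few lines; the sampling time used in `euler_step` is an assumed value:

```python
import torch

def vehicle_dynamics(eta: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    """Continuous-time kinematics f_u(eta, u) from Equation (56);
    eta = (x, y, phi), u = (v, omega)."""
    x, y, phi = eta.unbind(-1)
    v, omega = u.unbind(-1)
    return torch.stack((v * torch.cos(phi), v * torch.sin(phi), omega), dim=-1)

def euler_step(eta: torch.Tensor, u: torch.Tensor, dt: float = 0.1) -> torch.Tensor:
    """One Euler step of Equation (55); dt = 0.1 is a placeholder."""
    return eta + dt * vehicle_dynamics(eta, u)
```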
Surrogate model

We build a gray-box model using a neural network to model $J(\eta)$, similar to the work by Quaglino et al. (2020). The input features to the network are $\sin\phi$ and $\cos\phi$, where $\phi$ is the vehicle's orientation. The network consists of a hidden layer with a tanh activation function and an output layer without any activation function. The weights of the network are initialized using Xavier initialization (Glorot & Bengio, 2010), and the biases are initialized from a standard normal distribution.
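A minimal sketch of such a gray-box module follows; the hidden width is an assumption, since the exact value is not recoverable here:

```python
import torch
import torch.nn as nn

class VehicleSurrogate(nn.Module):
    """Gray-box sketch: a small network predicts the entries of J(eta)
    from (sin phi, cos phi); hidden width assumed."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 6))   # 3x2 entries of J(eta)
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)        # Glorot & Bengio (2010)
                nn.init.normal_(m.bias)                  # standard normal biases

    def forward(self, eta: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
        phi = eta[..., 2]
        feats = torch.stack((torch.sin(phi), torch.cos(phi)), dim=-1)
        J = self.net(feats).view(*eta.shape[:-1], 3, 2)
        return (J @ u.unsqueeze(-1)).squeeze(-1)         # eta_dot = J(eta) u
```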
Training the surrogate model

We generate a dataset of K sequences, each of length $T = 1$. For each sequence, the initial state $\eta(0)$ is sampled uniformly from $X$, while the input $u(0)$ is sampled uniformly from $U$. We split the data into training and validation sets. Training is performed using the Adam optimizer (Kingma & Ba, 2014) and the mean-squared error (MSE) loss over the predicted states, with a fixed learning rate and batch size.
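A compact sketch of this procedure follows; the dataset size, number of epochs, learning rate, and batch size are placeholders for the garbled values, and `X_set`/`U_set` are assumed to be `torch.distributions.Uniform` samplers over the constraint sets:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_surrogate(model, f_nominal, X_set, U_set, dt,
                    n_seq=10_000, epochs=100, lr=1e-3, batch_size=256):
    """Sketch of one-step (T = 1) surrogate training with MSE and Adam."""
    etas = X_set.sample((n_seq,))                 # eta(0) ~ U(X)
    us = U_set.sample((n_seq,))                   # u(0)   ~ U(U)
    targets = etas + dt * f_nominal(etas, us)     # one-step Euler targets
    loader = DataLoader(TensorDataset(etas, us, targets),
                        batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for eta, u, target in loader:
            pred = eta + dt * model(eta, u)       # surrogate one-step prediction
            loss = torch.nn.functional.mse_loss(pred, target)
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```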
C.3.2. Inverted Pendulum

World model
The inverted pendulum is one of the most standard non-linear systems for testing control methods. We consider the following model (Berkenkamp et al., 2017):
$$m l^2 \ddot{\theta} = m g l \sin\theta - \lambda_F \dot{\theta} + u, \qquad (57)$$
where $\theta \in \mathbb{R}$ is the angle, $m$ and $l$ are the mass and pole length, respectively, $g$ is the gravitational acceleration, $\lambda_F$ is the rotational friction coefficient, and $u \in \mathbb{R}$ is the control input. We denote the state of the system as $\eta = (\theta, \dot{\theta}) \in X \subset \mathbb{R}^2$ and the input as $u \in U \subset \mathbb{R}$. We use an Euler discretization, as in Equation (55), to solve the initial-value problem (IVP) associated with the following equation:
$$f_u(\eta(t), u(t)) = \begin{pmatrix} \dot{\theta} \\ \ddot{\theta} \end{pmatrix} = \begin{pmatrix} \dot{\theta}(t) \\ \dfrac{m g l \sin\theta(t) + u(t) - \lambda_F \dot{\theta}(t)}{m l^2} \end{pmatrix}. \qquad (58)$$
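In code, Equation (58) reduces to the following function, using the physical constants listed in Table 4; it can be integrated with the same Euler step as Equation (55):

```python
import torch

def pendulum_dynamics(eta: torch.Tensor, u: torch.Tensor,
                      m=0.15, l=0.5, g=9.81, lam_F=0.1) -> torch.Tensor:
    """f_u from Equation (58); constants m, l, g, lambda_F are from Table 4."""
    theta, theta_dot = eta.unbind(-1)
    theta_ddot = (m * g * l * torch.sin(theta) + u.squeeze(-1)
                  - lam_F * theta_dot) / (m * l ** 2)
    return torch.stack((theta_dot, theta_ddot), dim=-1)
```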
Surrogate model

We use a neural network to predict the acceleration of the pendulum, $\ddot{\theta}(t)$. The input to the network is the state $\eta(t)$ and the action $u(t)$ at the current time step $t$. The network is a three-layer feedforward network with a tanh activation in each hidden layer. The output layer is linear. All the layers have no biases, and their weights are initialized as in Glorot & Bengio (2010).
Training the surrogate model
To train the surrogate model, we generate a dataset of K sequences, each of length $T = 1$. We use the MSE loss and the Adam optimizer (Kingma & Ba, 2014) to train the model. The rest of the parameters are kept the same as those used for the vehicle kinematics model training.

D. Additional Results
Here we provide additional results and plots related to the experiments specified in the paper.
D.1. Vehicle Kinematics
The vehicle model results were run over different machines. The resulting losses and $\alpha$ are discussed below. In all experiments, the parameters of $V$ are reinitialised after every outer epoch. The Lyapunov loss does not make use of the LQR loss $\ell$.

Choosing the number of outer iterations $N_{\mathrm{ext}}$

We run our algorithm for more iterations on a machine with a different operating system. This leads to a slight difference from the results mentioned in the paper. As observed from Table 5, the third iteration leads to the best performance in terms of verified and not-verified percentages. We set $N_{\mathrm{ext}} = 3$ in all experiments.

Table 5: Car Kinematics: Choosing the number of outer iterations.
Lyapunov function learning performed on a different machine (results are slightly different from the ones in the paper).

Iter.   Loss (log(1+x))   Verified (%)   Not Verified (%)
1       1.42              92.83          3.95
2       0.91              93.08          4.71
3       0.62              94.26          3.89
4       0.46              93.65          4.38
5       0.53              92.18          5.63
Alternate learning on nominal model
We train the Neural Lyapunov MPC while using a nominal model for the internal dynamics. In Figure 7, we plot the training curves, the resulting Lyapunov function, and the line-search for the MPC in each outer epoch. The results are also shown in Table 6. As can be seen, tuning the MPC parameter $\alpha$ helps in minimizing the loss (9) further. Points near the origin do not always verify. The MPC obtained after the third iteration achieves the best performance. This can further be validated from Figure 8, where we plot trajectories obtained by using the Neural Lyapunov MPC from each iteration.

Alternate learning on surrogate model
In order to test the transfer capability of the approach, we perform the training of the Neural Lyapunov MPC using an inaccurate surrogate model for the internal dynamics. This model is also used for calculating the Lyapunov loss (9) and for evaluating the performance of the Lyapunov function. We plot the training curves, the resulting Lyapunov function, and the line-search for the MPC in each outer epoch in Figure 9. The results of the training procedure are presented in Table 7. The MPC obtained from the second iteration achieves the best performance. In the third iteration, the Lyapunov loss increases, and the numbers of verified and not-verified points become worse. The poor performance is also reflected in the trajectories obtained by using the Lyapunov MPC from the third iteration, as shown in Figure 10.

Table 6: Car Kinematics: Learning on nominal model. Results for training the Neural Lyapunov MPC while using a nominal model for the internal dynamics. We use the Lyapunov loss (9) for both learning the Lyapunov function and tuning the MPC. The loss is specified in the log(1+x) scale.

(a) Lyapunov Function Learning

Iter.   Loss (log(1+x))   Verified (%)   Not Verified (%)
1       1.55              92.20          4.42
2       0.87              93.17          4.89
3       0.48              94.87          3.89

(b) MPC Parameter Tuning

Iter.   Loss before   Loss after   α⋆
1       –             –            –
2       –             –            –
3       –             –            –

Table 7: Car Kinematics: Learning on surrogate model. Results for training the Neural Lyapunov MPC while using the surrogate model for the internal dynamics as well as in Lyapunov training. We use the Lyapunov loss (9) for both learning the Lyapunov function and tuning the MPC. The loss is specified in the log(1+x) scale.

(a) Lyapunov Function Learning

Iter.   Loss (log(1+x))   Verified (%)   Not Verified (%)
1       1.84              91.74          8.26
2       1.43              92.26          7.74
3       1.65              91.61          8.39

(b) MPC Parameter Tuning

Iter.   Loss before   Loss after   α⋆
1       –             –            –
2       –             –            –
3       –             –            –
D.2. Inverted Pendulum
Unlike the car kinematics, for the inverted pendulum task the parameters of $V$ are not re-initialized after every outer epoch, and the Lyapunov loss makes use of the LQR loss, $\ell$, for all the experiments except for the results in Figure 14. In this section, we discuss the results obtained from the alternate learning on the nominal model. We also provide an ablation study on: 1) a solution based solely on a contraction factor, and 2) the effect of having an imperfect solver, in particular the instability resulting from wrongly tuning the trust-region radius.

Alternate learning
Since the trained surrogate model has a high accuracy, we only consider the scenario of alternate learning with the nominal model. The main results for this scenario are in Figure 11 and Table 8. We notice a slight improvement in the MPC's performance in the second iteration of the training procedure. In Figure 11, it can be noticed that a small $\alpha$ needs to be used, which contradicts the ideal theoretical result. In practice, a very large value of this parameter results in bad conditioning of the QPs used by the MPC and causes the solver to fail.¹ Since the pendulum is open-loop unstable, an increase of the Lyapunov loss can be noticed for larger values of $\alpha$. This demonstrates that it is necessary to perform a search over the parameter and that we cannot simply set it to a very large value.

In Figure 12, we show the trajectories obtained by running the Neural Lyapunov MPC obtained from the first and second iterations. The initial states are sampled inside the safe level-set by using rejection sampling. The trajectories obtained from the two iterations are similar, even though the Lyapunov functions differ. The Lyapunov function from the second iteration has a larger ROA.

We also compare the Neural Lyapunov MPC from the second iteration with the baseline MPCs in Figure 13. The baseline controllers behave quite similarly in this problem, although they have different prediction horizons. This is because, for both of them, the LQR terminal cost is a dominating term in the optimization's objective function. The Neural Lyapunov MPC achieves a slightly slower decrease rate compared to the demonstrator; however, it still stabilizes the system. The transfer from the nominal to the surrogate model is very successful for all the controllers, though in this case the surrogate model is particularly accurate.

It should be kept in mind that, in order to produce these results, it was necessary to impose in the Lyapunov loss (9) that the decrease rate of $V(x)$ be upper bounded by the LQR stage loss, as in Equation (5). This resulted in the most effective learning of the function $V(x)$, contrary to the vehicle kinematics example.

¹When the solver fails, we simply set the solution to zero.
Table 8: Inverted Pendulum: Learning on nominal model. Results for training the Neural Lyapunov MPC while using a nominal model for the internal dynamics. We use the Lyapunov loss (9) for both learning the Lyapunov function and tuning the MPC. The loss is specified in the log(1+x) scale.

(a) Lyapunov Function Learning

Iter.   Loss (log(1+x))   Verified (%)   Not Verified (%)
1       3.21              13.25          0.00
2       1.08              13.54          0.00

(b) MPC Parameter Tuning

Iter.   Loss before   Loss after   α⋆
1       –             –            –
2       –             –            –

Alternate learning without LQR loss
We now consider the case when the learning is performed using only a contraction factor $\lambda < 1$ and without the LQR loss term in the Lyapunov loss (i.e., $v = 0$). The results are depicted in Figure 14. In order to obtain these results, the Lyapunov NN scaling $\beta$ in Equation (54) was initialized with $\beta = \mathrm{softplus}^{-1}(25)$, according to a rough estimate of the minimum scaling $\alpha$ from Equation (33). This was able to produce a Lyapunov function that makes the MPC safe with $\alpha = 12$. However, the learning becomes more difficult, and it results in a smaller safe-region estimate with a slower convergence rate for the system trajectories.
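For completeness, the softplus inverse used for this initialisation follows directly from its definition:

```latex
\beta = \operatorname{softplus}^{-1}(25) = \log\!\left(e^{25} - 1\right) \approx 25 .
```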
Effects of the trust region

In Figure 15, we show the result of varying the trust radius of the SQP solver on the inverted pendulum. While a larger value can result in further approximation given the limited number of iterations, in this case a small value of the radius results in a weaker control signal and local instability near the origin.
Acknowledgements
The authors would like to thank Nihat Engin Toklu, Sebastian East, and Mark Cannon for their constructive input, as well as everyone at NNAISENSE, for contributing to an inspiring research environment.
[Figure 7 panels: (a) Lyapunov Loss (log(1+x)), (b) Verified Points (%), (c) Not Verified Points (%); rows: Iteration 1, Iteration 2, Iteration 3 (best).]
Figure 7:
Car kinematics: Alternate learning on nominal model.
After every $N_V = 500$ epochs of Lyapunov learning, the learned Lyapunov function is used to tune the MPC parameters. Top: the training curves for the Lyapunov function; vertical lines separate iterations. Middle: the resulting Lyapunov function $V$ at $\phi = 0$ with the best performance. Bottom: line-search for the MPC parameter $\alpha$ to minimize the Lyapunov loss (9) with $V$ as the terminal cost. The loss is plotted on the y-axis in a log(1+x) scale. The point marked in red is the parameter which minimizes the loss.

[Figure 8 panels: Iteration 1, Iteration 2, Iteration 3 (best).]
Figure 8:
Car kinematics: Testing Neural Lyapunov MPC obtained from training on nominal model.
For each iteration, we show the trajectories obtained through our Neural Lyapunov MPC while using the resulting Lyapunov function and the MPC parameter selected from the line-search. Top: the Lyapunov function at $\phi = 0$ with trajectories at each iteration. Middle: the evaluated Lyapunov function. Bottom: the Lyapunov function time difference.
[Figure 9 panels: (a) Lyapunov Loss (log(1+x)), (b) Verified Points (%), (c) Not Verified Points (%); rows: Iteration 1, Iteration 2 (best), Iteration 3.]
Figure 9:
Car kinematics: Alternate learning on surrogate model.
After every $N_V = 800$ epochs of Lyapunov learning, the learned Lyapunov function is used to tune the MPC parameters. Top: the training curves for the Lyapunov function; vertical lines separate iterations. Middle: the resulting Lyapunov function $V$ at $\phi = 0$ with the best performance. Bottom: line-search for the MPC parameter $\alpha$ to minimize the Lyapunov loss (9) with $V$ as the terminal cost. The loss is plotted on the y-axis in a log(1+x) scale. The point marked in red is the parameter which minimizes the loss.

[Figure 10 panels: Iteration 1, Iteration 2 (best), Iteration 3.]
Figure 10:
Car kinematics: Testing Neural Lyapunov MPC obtained from training on surrogate model.
For each iteration, we show the trajectories obtained through our Neural Lyapunov MPC while using the resulting Lyapunov function and the MPC parameter selected from the line-search. Top: the Lyapunov function at $\phi = 0$ with trajectories at each iteration. Middle: the evaluated Lyapunov function. Bottom: the Lyapunov function time difference.
[Figure 11 panels: (a) Lyapunov Loss (log(1+x)), (b) Verified Points (%), (c) Not Verified Points (%); rows: Iteration 1, Iteration 2 (best); the line-search plots mark where the MPC solver starts failing.]
Figure 11:
Inverted Pendulum: Alternate learning on surrogate model.
After every $N_V = 200$ epochs of Lyapunov learning, the learned Lyapunov function is used to tune the MPC parameters. Unlike the vehicle kinematics example, we do not reinitialize $V$ between the iterations. Top: the training curves for the Lyapunov function; vertical lines separate iterations. Middle: the resulting Lyapunov function $V$ with the best performance. Bottom: line-search for the MPC parameter $\alpha$ to minimize the Lyapunov loss (9) with $V$ as the terminal cost. The loss is plotted on the y-axis in a log(1+x) scale. The point marked in red is the parameter which minimizes the loss.

[Figure 12 panels: Iteration 1, Iteration 2 (best).]
Figure 12:
Inverted Pendulum: Testing Neural Lyapunov MPC obtained from training on nominal model over iterations.
For each iteration, we show the trajectories obtained through our Neural Lyapunov MPC while using the resulting Lyapunov function and the MPC parameter selected from the line-search. The initial states are sampled inside the safe level-set using rejection sampling. Top: the Lyapunov function with trajectories at each iteration. Middle: the evaluated Lyapunov function. Bottom: the Lyapunov function time difference.
[Figure 13 panels: (a) Short-horizon MPC with nominal system, (b) Long-horizon MPC with nominal model, (c) Lyapunov MPC with nominal model, (d) Long-horizon MPC with surrogate model, (e) Lyapunov MPC with surrogate model.]

Figure 13:
Inverted Pendulum: Transfer from nominal to surrogate model. Top: the Lyapunov function with overlaid trajectories. Middle: the Lyapunov function evaluated along trajectories. Bottom: the Lyapunov decrease evaluated along trajectories.

[Figure 14 panels: (a) MPC Demonstrator with Nominal Model, (b) Lyapunov MPC with Nominal Model.]

Figure 14:
Inverted Pendulum: Effect of using Lyapunov MPC with contraction factor and no LQR loss. The Lyapunov function and safe level-set obtained from the first iteration of alternate learning with contraction factor $\lambda$ and $v = 0$ in the Lyapunov loss (9). This results in a smaller safe-region estimate and slower closed-loop trajectories compared to the case when $v = 1$. Each trajectory is simulated for $T = 80$ timesteps.

[Figure 15 panels: the Lyapunov function for two values of the trust radius, (a) and (b).]

Figure 15: Inverted Pendulum: Effect of trust region on MPC. The solver hyperparameter, $r_{\mathrm{trust}}$