Safe Interactive Model-Based Learning
Marco Gallieri∗, Seyed Sina Mirrazavi Salehian, Nihat Engin Toklu, Alessio Quaglino, Jonathan Masci, Jan Koutník, Faustino Gomez
NNAISENSE, Lugano, Switzerland - Austin, Texas
{marco,sina,engin,alessio,jon,jan,tino}@nnaisense.com
∗ Corresponding author.
Abstract
Control applications present hard operational constraints. A violation of these can result in unsafe behaviour. This paper introduces Safe Interactive Model-Based Learning (SiMBL), a framework to refine an existing controller and a system model while operating on the real environment. SiMBL is composed of the following trainable components: a Lyapunov function, which determines a safe set; a safe control policy; and a Bayesian RNN forward model. A min-max control framework, based on alternate minimisation and backpropagation through the forward model, is used for the offline computation of the controller and the safe set. Safety is formally verified a posteriori with a probabilistic method that utilises the Noise Contrastive Priors (NCP) idea to build a Bayesian RNN forward model with an additive state uncertainty estimate which is large outside the training data distribution. Iterative refinement of the model and the safe set is achieved thanks to a novel loss that conditions the uncertainty estimates of the new model to be close to the current one. The learned safe set and model can also be used for safe exploration, i.e., to collect data within the safe invariant set, for which a simple one-step MPC is proposed. The single components are tested on the simulation of an inverted pendulum with limited torque and stability region, showing that iteratively adding more data can improve the model, the controller, and the size of the safe region.
Safe Interactive Model-Based Learning (SiMBL) aims to control a deterministic dynamical system:

$x(t+1) = x(t) + dt\, f(x(t), u(t)), \qquad y(t) = x(t),$  (1)

where $x$ is the state and $y$ are the measurements, in this case assumed equivalent. The system (1) is sampled with a known constant time step $dt$ and is subject to closed and bounded, possibly non-convex, operational constraints on the state and input:

$x(t) \in X \subseteq \mathbb{R}^{n_x}, \quad u(t) \in U \subset \mathbb{R}^{n_u}, \quad \forall t \geq 0.$  (2)

The stability of (1) is studied using discrete-time systems analysis. In particular, tools from discrete-time control Lyapunov functions (Blanchini & Miani, 2007; Khalil, 2014) will be used to compute policies that can keep the system safe.
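For concreteness, the sketch below simulates one step of a system of the form (1) while clipping the input to a box $U$. The pendulum dynamics $f$ and all parameter values are illustrative assumptions, not the paper's exact simulation code.

```python
import numpy as np

def pendulum_f(x, u, g=9.81, l=1.0, m=1.0):
    """Illustrative continuous-time dynamics f(x, u) for an inverted pendulum:
    x = [angle, angular velocity], u = torque (all parameters are assumptions)."""
    angle, vel = x
    acc = (g / l) * np.sin(angle) + u / (m * l ** 2)
    return np.array([vel, acc])

def step(x, u, dt=0.05, u_max=2.0):
    """One step of system (1): x(t+1) = x(t) + dt f(x(t), u(t)),
    with the input clipped to the box U = [-u_max, u_max] as in (2)."""
    u = np.clip(u, -u_max, u_max)
    return x + dt * pendulum_f(x, u)
```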
Safety.
In this work, safety is defined as the capability of a system to remain within a subset $X_s \subseteq X$ of the operational constraints and to return asymptotically to the equilibrium state from anywhere in $X_s$. A feedback control policy, $u = K(x) \in U$, is certified as safe if it can provide safety with high probability. In this work, safety is verified with a statistical method that extends Bobiti (2017).
Figure 1: Safe Interactive Model-Based Learning (SiMBL) rationale. The approach is centered around an uncertain RNN forward model for which we compute a safe set and a control policy using principles from robust control. This allows for safe exploration through MPC and iterative refinement of the model and the safe set. An initial safe policy is assumed known.
Safe learning.
The proposed framework aims at learning a policy, $K(x)$, and a Lyapunov function, $V(x)$, by means of simulated trajectories from an uncertain forward model and an initial policy, $K_0$, used to collect data. The model, the policy, and the Lyapunov function are iteratively refined while safely collecting more data through a Safe Model Predictive Controller (Safe-MPC). Figure 1 illustrates the approach.
Summary of contribution.
This work presents the algorithms for: 1) iteratively learning a novel Bayesian RNN model with a large posterior over unseen states and inputs; 2) learning a safe set and the associated controller with neural networks from the model trajectories; 3) safe exploration with MPC. For 1) and 2), we propose to retrain the model from scratch using a consistency prior to include knowledge of the previous uncertainty and then to recompute the safe set. The increase of the safe set as more data becomes available and the safety of the exploration strategy are demonstrated on an inverted pendulum simulation with limited control torque and stability region. Their final integration for continuous model and controller refinement with data from safe exploration (see Figure 1) is left for future work.
A discrete-time stochastic forward model of system (1) is formulated as a Bayesian RNN. A grey-box approach is used, where available prior knowledge is integrated into the network in a differentiable way (for instance, the known relation between an observation and its derivative). The model provides an estimate of the next-state distribution that is large (up to a defined value) where there is no available data. This is inspired by recent work on Noise Contrastive Priors (NCP) (Hafner et al., 2018b). We extend the NCP approach to RNNs and propose the Noise Contrastive Prior Bayesian RNN (NCP-BRNN), with full state information, which follows the discrete-time update:

$\hat{x}(t+1) = \hat{x}(t) + dt\, d\hat{x}(t), \quad d\hat{x}(t) = \mu(\hat{x}(t), u(t); \theta_\mu) + \hat{w}(t),$  (3)
$\hat{w}(t) \sim q(\hat{x}(t), u(t); \theta_\Sigma), \quad q(\hat{x}(t), u(t); \theta_\Sigma) = \mathcal{N}(0, \Sigma(\hat{x}(t), u(t); \theta_\Sigma)),$  (4)
$\hat{y}(t) \sim \mathcal{N}(\hat{x}(t), \sigma_y), \quad \hat{x}(0) \sim \mathcal{N}(x(0), \sigma_y),$  (5)
$\Sigma(\cdot) = \sigma_w\, \mathrm{sigm}(\Sigma_{net}(\cdot)), \quad \mu(\cdot) = \mathrm{GreyBox}_{net}(\cdot),$  (6)

where $\hat{x}(t)$, $\hat{y}(t)$ denote the state and measurement estimated from the model at time $t$, and $d\hat{x}(t)$ is drawn from the distribution model, where $\mu$ and $\Sigma$ are computed from neural networks sharing some initial layers. In particular, $\mu$ combines an MLP with some physics prior, while the final activation of $\Sigma$ is a sigmoid which is then scaled by the hyperparameter $\sigma_w$, namely, a finite maximum variance. The next-state distribution depends on the current state estimate $\hat{x}$, the input $u$, and a set of unknown constant parameters $\theta$, which are to be learned from the data. The estimated state $\hat{x}(t)$ is for simplicity assumed to have the same physical meaning as the true system state $x(t)$. The system state is measured with a Gaussian uncertainty with standard deviation $\sigma_y$, which is also learned from data. During control, the measurement noise is assumed to be negligible ($\sigma_y \approx 0$). Therefore, the control algorithms will need to be robust with respect to the model uncertainty. Extensions to partial state information and output-noise robust control are also possible but are left for future work.
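A minimal PyTorch sketch of one update of (3)-(6) is given below; the layer sizes, the shared trunk, and the placement of the grey-box prior are assumptions for illustration.

```python
import torch
import torch.nn as nn

class NCPBRNNCell(nn.Module):
    """One step of the NCP-BRNN (eqs. 3-6): mean from a grey-box net,
    diagonal std from a sigmoid head scaled by sigma_w."""
    def __init__(self, nx, nu, hidden=64, sigma_w=0.1, dt=0.05):
        super().__init__()
        self.dt, self.sigma_w = dt, sigma_w
        self.shared = nn.Sequential(nn.Linear(nx + nu, hidden), nn.Tanh())
        self.mu_head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                     nn.Linear(hidden, nx))
        self.sigma_head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                        nn.Linear(hidden, nx))

    def forward(self, x, u):
        h = self.shared(torch.cat([x, u], dim=-1))
        mu = self.mu_head(h)                 # physics priors would be added here
        sigma = self.sigma_w * torch.sigmoid(self.sigma_head(h))  # bounded std, eq. (6)
        w = sigma * torch.randn_like(sigma)  # w ~ N(0, Sigma), eq. (4)
        x_next = x + self.dt * (mu + w)      # Euler update, eq. (3)
        return x_next, mu, sigma
```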
Towards reliable uncertainty estimates with RNNs
The fundamental assumption for a model-based safe learning algorithm is that the model predictions contain the actual system state transitions with high probability (Berkenkamp et al., 2017). This is difficult to meet in practice for most neural network models. To mitigate this risk, we train our Bayesian RNNs on sequences and include a Noise Contrastive Prior (NCP) term (Hafner et al., 2018b). In the present work, the uncertainty is modelled as a point-wise Gaussian with mean and standard deviation that depend on the current state as well as on the input. The learned 1-step standard deviation, $\Sigma$, is assumed to be a diagonal matrix. This assumption is limiting, but it is common in variational neural networks for practicality reasons (Zhao et al., 2017; Chen et al., 2016). The NCP concept is illustrated in Figure 2. More complex uncertainty representations will be considered in future work.
The cost function used to train the model is:

$L(\theta_\mu, \theta_\Sigma) = -\frac{1}{T}\sum_{t=0}^{T} \mathbb{E}_{p_{train}(x(0), u(t))}\big[\mathbb{E}_{q(\hat{x}(t), u(t); \theta_\Sigma)}[\ln p(\hat{y}(t)\,|\,y(t), u(t); \theta_\mu, \theta_\Sigma)]\big]$
$\qquad + D_{KL}\big[q(\tilde{x}(t), \tilde{u}(t); \theta_\Sigma)\,\|\,\mathcal{N}(0, \sigma_w)\big] + \mathbb{E}_{p_{train}(x(0), u(t))}\big[\mathrm{ReLU}\big[\Sigma(\hat{x}(t), u(t); \theta_\Sigma) - \Sigma(\hat{x}(t), u(t); \theta_{\Sigma_{prev}})\big]\big],$  (7)

where the first term is the expected negative log-likelihood over the uncertainty distribution, evaluated over the training data. The second term is the KL-divergence, which is evaluated in closed form over predictions $\tilde{x}$ generated from a set of background initial states and input sequences, $\tilde{x}(0)$ and $\tilde{u}(t)$. These are sampled from a uniform distribution for the first model and then, once a previous model is available and new data is collected, they are obtained using rejection sampling with PyMC (Salvatier et al., 2016), with the acceptance condition that $\Sigma(\tilde{x}(0), \tilde{u}(t); \theta_{\Sigma_{prev}})$ is at least a fixed fraction of $\sigma_w$. If a previous model is available, then the final term is used, which is an uncertainty consistency prior that forces the uncertainty estimates over the training data not to increase with respect to the previous model. The loss (7) is optimised using stochastic backpropagation through truncated sequences. In order to have further consistency between model updates, if a previous model is available, we train from scratch but stop optimising once the final loss of the previous model is reached.
Figure 2: Variational neural networks with Noise Contrastive Priors (NCP).
Predicting sine-wave data (red-black) with confidence bounds (blue area) using NAIS-Net (Ciccone et al., 2018) and NCP (Hafner et al., 2018b).
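The following sketch shows how the three terms of (7) can be assembled for one batch. It simplifies the expectation over $q$ to the mean prediction and assumes tensors of matching shape; it is an illustration, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def ncp_brnn_loss(mu, sigma, y, bg_sigma, prev_sigma, sigma_w):
    """Sketch of loss (7). mu, sigma: model outputs on training sequences;
    bg_sigma: std predicted on background (out-of-distribution) samples;
    prev_sigma: std of the previous model on the same training points."""
    # expected negative log-likelihood of the observations (first term)
    nll = -torch.distributions.Normal(mu, sigma).log_prob(y).mean()
    # closed-form KL between N(0, bg_sigma^2) and the prior N(0, sigma_w^2),
    # pushing the background uncertainty towards its maximum value (second term)
    kl = (torch.log(sigma_w / bg_sigma)
          + (bg_sigma ** 2) / (2 * sigma_w ** 2) - 0.5).mean()
    # uncertainty consistency prior: do not exceed the previous model's
    # uncertainty on the training data (third term)
    consistency = F.relu(sigma - prev_sigma).mean()
    return nll + kl + consistency
```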
We approximate a chance-constrained stochastic control problem with a min-max robust control problem over a convex uncertainty set. This non-convex min-max control problem is then also approximated by computing the loss only at the vertices of the uncertainty set. To compensate for this approximation, and inspired by variational inference, the centre of the set is sampled from the uncertainty distribution itself (Figure 3). The procedure is detailed below.
Lyapunov-Net.
The considered Lyapunov function is:

$V(x) = x^T\big(\epsilon I + V_{net}(x)^T V_{net}(x)\big)x + \psi(x),$  (8)

where $V_{net}(x)$ is a feedforward network that produces an $n_V \times n_x$ matrix, where $n_V$ and $\epsilon > 0$ are hyperparameters. The network parameters have to be trained and are omitted from the notation. The term $\psi(x) \geq 0$ represents the prior knowledge of the state constraints. In this work we use:

$\psi(x) = \mathrm{ReLU}(\phi(x) - 1),$  (9)

where $\phi(x) \geq 0$ is the Minkowski functional of a user-defined usual region of operation, namely:

$X_\phi = \{x \in X : \phi(x) \leq 1\}.$  (10)

Possible choices for the Minkowski functional include quadratic functions, norms, or semi-norms (Blanchini & Miani, 2007; Horn & Johnson, 2012); Minkowski functionals measure the distance from the set center and are positive definite. Since $V(x)$ must be positive definite, the hyperparameter $\epsilon > 0$ is introduced. While other forms are possible, as in Blanchini & Miani (2007), with (8) the activation function does not need to be invertible. The trainable part of $V(x)$ is chosen to be piece-wise quadratic, but this is not the only possible choice: any positive definite and radially unbounded function can be used, and for the same problem multiple Lyapunov functions can exist (see also Blanchini & Miani, 2007). The study of the generality of the proposed function is left for future consideration.

Safe set definition.
Denote the candidate safe level set of $V$ as:

$X_s = \{x \in X : V(x) \leq l_s\},$  (11)

where $l_s$ is the safe level. If, for $x \in X_s$, the function $V(x)$ satisfies the Lyapunov inequality over the system closed-loop trajectory with a control policy $K$, namely,

$u(t) = K(x(t)) \Rightarrow V(x(t+1)) - V(x(t)) \leq 0, \quad \forall x(t) \in X_s,$  (12)

then the set $X_s$ is safe, i.e., it satisfies the conditions of positive invariance (Blanchini & Miani, 2007; Kerrigan, 2000). Note that the policy $K(x)$ can be either a neural network or a model-based controller, for instance a Linear Quadratic Regulator (LQR, see Kalman (2001)) or a Model Predictive Controller (MPC, see Maciejowski (2000); Rawlings & Mayne (2009); Kouvaritakis & Cannon (2015); Raković & Levine (2019)). A stronger condition than (12) is often used in the context of optimal control:

$u(t) = K(x(t)) \Rightarrow V(x(t+1)) - V(x(t)) \leq -\ell(x(t), u(t)), \quad \forall x(t) \in X_s,$  (13)

where $\ell(x(t), u(t))$ is a positive semi-definite stage loss. In this paper, we focus on training policies with the quadratic loss used in LQR and MPC, where the origin is the target equilibrium, namely:

$\ell(x, u) = x^T Q x + u^T R u, \quad Q \succeq 0, \quad R \succ 0.$  (14)
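A sketch of (8)-(10) as a PyTorch module follows; the hidden sizes and the example Minkowski functional $\phi$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LyapunovNet(nn.Module):
    """Sketch of the Lyapunov function (8): V(x) = x'(eps*I + L(x)'L(x))x + psi(x),
    with psi a ReLU penalty on a user-defined operating region phi(x) <= 1."""
    def __init__(self, nx, nv=16, hidden=64, eps=1e-3):
        super().__init__()
        self.eps, self.nv, self.nx = eps, nv, nx
        self.net = nn.Sequential(nn.Linear(nx, hidden), nn.Tanh(),
                                 nn.Linear(hidden, nv * nx))

    def phi(self, x):
        # example Minkowski functional: squared Euclidean norm (an assumption)
        return (x ** 2).sum(-1)

    def forward(self, x):
        L = self.net(x).view(-1, self.nv, self.nx)        # n_V x n_x matrix
        P = self.eps * torch.eye(self.nx) + L.transpose(1, 2) @ L
        quad = torch.einsum('bi,bij,bj->b', x, P, x)      # x' P x
        psi = torch.relu(self.phi(x) - 1.0)               # state-constraint prior (9)
        return quad + psi
```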
From chance constrained to min-max control
Consider the problem of finding a controller $K$ and a function $V$ such that $u(t) = K(x(t))$ and:

$P\big[V(\hat{x}(t+1)) - V(x(t)) \leq -\ell(x(t), u(t))\big] \geq 1 - \epsilon_p,$  (15)

where $P$ represents a probability and $0 < \epsilon_p \ll 1$. This is a chance-constrained non-convex optimal control problem (Kouvaritakis & Cannon, 2015). We truncate the distributions and approximate (15) by:

$\max_{\hat{x}(t+1) \in W(x(t), u(t), \theta)}\big[V(\hat{x}(t+1))\big] - V(x(t)) \leq -\ell(x(t), K(x(t))),$  (16)

which is deterministic. A strategy to jointly learn $(V, K)$ fulfilling (16) is presented next. We wish to build a controller $K$, a function $V$, and a safe level $l_s$, given the state transition probability model, $(\mu, \Sigma)$, such that the condition in (13) is satisfied with high probability for the physical system generating the data. Denote the one-step prediction from the model in (3), in closed loop with $K$, as:

$\hat{x}^+ = x + dt\, d\hat{x}, \quad \text{with } u = K(x),$

where $\hat{x}^+$ represents the next-state prediction and the time index $t$ is omitted.
Approximating the high-confidence prediction set.
A polytopic approximation of a high-confidence region of the estimated uncertain set, $\hat{x}^+ \in W(x, u, \theta)$, is obtained from the parameters of $\Sigma$ and used for training $(V, K)$. In this work, the uncertain set is taken as a hyper-diamond centered at $x$, scaled by the (diagonal) standard deviation matrix, $\Sigma$:

$W(x, u, \theta) = \big\{\hat{x}^+ : \hat{x}^+ = x + dt\, d\hat{x},\ \|\Sigma(x, u; \theta_\Sigma)^{-1}\hat{w}\|_1 \leq \bar{\sigma}\big\},$  (17)

where $\bar{\sigma} > 0$ is a hyperparameter. This choice of set is inspired by the Unscented Kalman Filter (Wan & Van Der Merwe, 2000). Since $\Sigma$ is diagonal, the vertices are given by the columns of the matrix resulting from multiplying $\Sigma$ with a mask $M$ such that:

$\mathrm{vert}[W(x, u, \theta)] = \mathrm{cols}[\Sigma(x, u; \theta_\Sigma) M], \quad M = \bar{\sigma}[I, -I].$  (18)
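For a diagonal $\Sigma$, the vertex enumeration (18) is cheap; a sketch is given below, with the vertices taken around the mean one-step prediction (an interpretation, for illustration).

```python
import torch

def uncertainty_vertices(x, mu, sigma, dt, sigma_bar):
    """Sketch of eqs. (17)-(18): the 2*nx vertices of the hyper-diamond W,
    for a diagonal std `sigma` of shape (batch, nx)."""
    center = x + dt * mu                                  # nominal next state
    offsets = dt * sigma_bar * torch.diag_embed(sigma)    # row i is sigma_i * e_i
    verts = torch.cat([center.unsqueeze(1) + offsets,
                       center.unsqueeze(1) - offsets], dim=1)
    return verts                                          # shape (batch, 2*nx, nx)
```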
Learning the safe set.
Assume that a controller $K$ is given. Then we wish to learn a $V$ of the form of (8) such that the corresponding safe set $X_s$ is as big as possible, ideally as big as the state constraints $X$. In order to do so, the parameters of $V_{net}$ are trained using a grid of initial states, a forward model to simulate the next state under the policy $K$, and an appropriate cost function. The cost for $V_{net}$ and $l_s$ is inspired by (Richards et al., 2018). It consists of a combination of two objectives: the first one penalises the deviation from the Lyapunov stability condition; the second one is a classification penalty that separates the stable points from the unstable ones by means of the decision boundary $V(x) = l_s$. The combined robust Lyapunov function cost is:

$\min_{V_{net},\, l_s} \mathbb{E}_{x(t) \in X_{grid}}\big[J(x(t))\big],$  (19)
$J(x) = I_{X_s}(x)\, J_s(x) + \mathrm{sign}\big[\Delta V(x)\big]\,[l_s - V(x)],$  (20)
$I_{X_s}(x) = 0.5\,(\mathrm{sign}[l_s - V(x)] + 1), \quad J_s(x) = \frac{1}{\rho V(x)}\mathrm{ReLU}[\Delta V(x)],$  (21)
$\Delta V(x) = \max_{\hat{x}^+ \in W(x, K(x), \theta)}\big[V(\hat{x}^+)\big] - V(x) + \ell(x, K(x)),$  (22)

where $\rho > 0$ trades off stability for volume. The robust Lyapunov decrease in (22) is evaluated by using sampling to account for uncertainty over the confidence interval $W$. Sampling of the set centre is performed, as opposed to keeping the set fixed at its nominal centre, which did not seem to produce valid results. Let us omit $\theta$ for ease of notation. We substitute $\Delta V(x)$ with $\mathbb{E}_W\big[\Delta V(x)\big]$, which we define as:

$\mathbb{E}_{\hat{w} \sim \mathcal{N}(0, \Sigma(x, K(x)))}\bigg\{\max_{\hat{x}^+ \in W(x, K(x), \theta) + \hat{w}\,dt}\big[V(\hat{x}^+)\big]\bigg\} - V(x) + \ell(x, K(x)).$  (23)

Equations (22) and (23) require a maximisation of the non-convex function $V(x)$ over the convex set $W$. For the considered case, a sampling technique or another optimisation (similar to adversarial learning) could be used for a better approximation of the max operator. The maximum over $W$ is instead approximated by the maximum over its vertices:

$\Delta V(x) \approx \max_{\hat{x}^+ \in \mathrm{vert}[W(x, K(x), \theta)] + \hat{w}\,dt}\big[V(\hat{x}^+)\big] - V(x) + \ell(x, K(x)).$  (24)

This consists of a simple enumeration followed by a max over tensors that can be easily handled. Finally, during training, (23) is implemented in a variational-inference fashion by evaluating (24) at each epoch over a different sample of $\hat{w}$. This entails a variational posterior over the center of the uncertainty interval. The approach is depicted in Figure 3. The proposed cost is inspired by Richards et al. (2018), with the difference that here there is no need for labelling the states as safe by means of a multi-step simulation. Moreover, in this work we train the Lyapunov function and controller together, while in (Richards et al., 2018) the latter was given.
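A sketch of how the loss (19)-(24) could be assembled on a batch of grid states follows, reusing the `uncertainty_vertices` helper sketched earlier; the callable signatures (`mu_fn`, `sigma_fn`, `ell`) are assumptions for illustration.

```python
import torch

def lyapunov_loss(V, K, x, mu_fn, sigma_fn, dt, sigma_bar, l_s, rho, ell):
    """Sketch of the robust Lyapunov cost (19)-(24) on a batch of grid states x.
    V, K are modules; mu_fn/sigma_fn evaluate the forward model under u = K(x)."""
    u = K(x)
    mu, sigma = mu_fn(x, u), sigma_fn(x, u)
    w = sigma * torch.randn_like(sigma)               # sampled set centre, eq. (23)
    verts = uncertainty_vertices(x + dt * w, mu, sigma, dt, sigma_bar)
    v_next = V(verts.reshape(-1, x.shape[-1])).reshape(x.shape[0], -1)
    dV = v_next.max(dim=1).values - V(x) + ell(x, u)  # vertex max, eq. (24)
    in_safe = 0.5 * (torch.sign(l_s - V(x)) + 1)      # level-set indicator, eq. (21)
    J_s = torch.relu(dV) / (rho * V(x))               # stability penalty, eq. (21)
    J = in_safe * J_s + torch.sign(dV).detach() * (l_s - V(x))  # classification term
    return J.mean()
```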
Learning the safe policy.
We alternate the minimisation of the Lyapunov loss (19) with the solution of the variational robust control problem:

$\min_{u = K(x)} \mathbb{E}_{x \in X_{grid}}[I_{X_s}(x)\, L_c(x, u)], \quad \text{s.t. } K(0) = 0,$  (25)
$L_c(x, u) = \ell(x, u) + \mathbb{E}_W\bigg\{\max_{\hat{x}^+ \in W(x, u, \theta)}\big[V(\hat{x}^+) - \gamma\log(l_s - V(\hat{x}^+))\big]\bigg\},$  (26)

subject to the forward model (3).
Figure 3: Approximating the non-convex maximisation. The centre of the uncertain set is sampled and the Lyapunov function is evaluated at its vertices.
In this work, (25) is solved using backpropagation through the policy, the model, and $V$. The safety constraint, $\hat{x}^+ \in X_s$, namely $V(\hat{x}^+) \leq l_s$, is relaxed through a log-barrier (Boyd & Vandenberghe, 2004). If a neural policy $K(x)$ solves (25) and satisfies the safety constraint, $\forall x \in X_s$, then it is a candidate robust controller for keeping the system within the safe set $X_s$. Note that the expectation in (26) is once again treated as a variational approximation of the expectation over the center of the uncertainty interval.
Obtaining an exact solution to the control problem for all points is computationally impractical. In order to provide statistical guarantees of safety, probabilistic verification is used after $V$ and $K$ have been trained. This refines the safe level set $l_s$ and, if successful, provides a probabilistic safety certificate. If the verification is unsuccessful, then the learned $(X_s, K)$ are not safe. The data collection continues with the previous safe controller until suitable $V$, $l_s$, and $K$ are found. Note that the number of training points used for the safe set and controller is in general lower than the one used for verification. The alternate learning procedure for $X_s$ and $K$ is summarised in Algorithm 1. The use of 1-step predictions makes the procedure highly scalable through parallelisation on GPU.
Algorithm 1: Alternate descent for safe set
In: $K$, $X_{grid}$, $\theta_\mu$, $\theta_\Sigma$, $\sigma_w \geq 0$, $\bar{\sigma} > 0$, $\epsilon > 0$
Out: $(V_{net}, l_s, K)$
for $i = 0 \dots N$ do
    for $j = 0 \dots N_v$ do
        $(V_{net}, l_s) \leftarrow$ Adam step on (20)
    for $j = 0 \dots N_k$ do
        $K \leftarrow$ Adam step on (25)
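In code, Algorithm 1 reduces to two optimisers stepping in turn; a sketch is shown below, with loss callables assumed to wrap the Lyapunov cost (20) and the policy loss (25).

```python
import torch

def alternate_descent(V, K, l_s, x_grid, n_outer, n_v, n_k, lyap_loss, policy_loss):
    """Sketch of Algorithm 1: alternate Adam steps on the Lyapunov loss and
    on the policy loss. `l_s` is a trainable scalar tensor."""
    opt_v = torch.optim.Adam(list(V.parameters()) + [l_s], lr=1e-3)
    opt_k = torch.optim.Adam(K.parameters(), lr=1e-3)
    for _ in range(n_outer):
        for _ in range(n_v):                 # refine V_net and the level l_s
            opt_v.zero_grad()
            lyap_loss(V, K, l_s, x_grid).backward()
            opt_v.step()
        for _ in range(n_k):                 # refine the policy K
            opt_k.zero_grad()
            policy_loss(V, K, l_s, x_grid).backward()
            opt_k.step()
    return V, K, l_s
```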
Probabilistic safety verification.
A probabilistic verification is used to numerically prove the physical system stability with high probability. The resulting certificate is of the form (15), where $\epsilon_p$ decreases with an increasing number of samples. Following the work of Bobiti (2017), the simulation is evaluated at a large set of points within the estimated safe set $X_s$. Monte Carlo rejection sampling is performed with PyMC (Salvatier et al., 2016). In practical applications, several factors limit the convergence of the trajectory to a neighborhood of the target (the ultimate bound, Blanchini & Miani (2007)): for instance, the policy structural bias, discount factors in RL methods, or persistent uncertainty in the model, the state estimates, and the physical system itself. Therefore, we extended the verification algorithm of (Bobiti, 2017) to estimate the ultimate bound as well as the invariant set, as outlined in Algorithm 2. Given a maximum and a minimum level, $l_l$, $l_u$, we first sample initial states uniformly within these two levels and check for a robust decrease of $V$ over the next-state distribution. If this is verified, then we sample uniformly from inside the minimum level set $l_l$ (where $V$ may not decrease) and check that $V$ does not exceed the maximum level $l_u$ over the next-state distribution. The distribution is evaluated by means of uniform samples of $w$, independent of the current state, within $(-\bar{\sigma}, \bar{\sigma})$. These are then scaled using $\Sigma$ from the model. We search for $l_l$, $l_u$ with a step $\delta$.
Algorithm 2: Probabilistic safety verification
In: $N$, $V$, $K$, $\theta_\mu$, $\theta_\Sigma$, $\sigma_w \geq 0$, $\bar{\sigma} > 0$, $\delta > 0$
Out: $(SAFE, l_u, l_l)$
$SAFE \leftarrow$ False
for $l_u = 1, 1 - \delta, 1 - 2\delta, \dots$ do
    for $l_l = 0, \delta, 2\delta, \dots, l_u$ do
        draw $N$ uniform $x$-samples s.t. $l_l\, l_s \leq V(x) \leq l_u\, l_s$
        draw $N$ $w$-samples from $U(-\bar{\sigma}, \bar{\sigma})$
        if $V(\hat{x}^+) - V(x) \leq 0$, $\forall x, \forall w$ then
            draw $N$ uniform $x$-samples s.t. $V(x) \leq l_l\, l_s$
            if $V(\hat{x}^+) \leq l_u\, l_s$, $\forall x, \forall w$ then
                $SAFE \leftarrow$ True
                return $SAFE, l_u, l_l$
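A compact NumPy sketch of the two-level search in Algorithm 2 follows; `sample_level_band` is a hypothetical rejection-sampling helper, and `V` and `step` are assumed to be vectorised over batches of states.

```python
import numpy as np

def verify_levels(V, step, n_samples, sigma_bar, l_s, delta):
    """Sketch of Algorithm 2: search for the largest stable level l_u and the
    smallest ultimate bound l_l. `step(x, w)` returns the batch of next states
    for uniform uncertainty samples w in (-sigma_bar, sigma_bar)."""
    for l_u in np.arange(1.0, 0.0, -delta):
        for l_l in np.arange(0.0, l_u, delta):
            x = sample_level_band(V, l_l * l_s, l_u * l_s, n_samples)  # hypothetical helper
            w = np.random.uniform(-sigma_bar, sigma_bar, x.shape)
            if np.all(V(step(x, w)) - V(x) <= 0):        # robust decrease in the band
                x_in = sample_level_band(V, 0.0, l_l * l_s, n_samples)
                w_in = np.random.uniform(-sigma_bar, sigma_bar, x_in.shape)
                if np.all(V(step(x_in, w_in)) <= l_u * l_s):  # containment inside l_l
                    return True, l_u, l_l
    return False, None, None
```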
Verification failed.
Note that, in Algorithm 2, the uncertainty of the surrogate model is taken into account by sampling a single uncertainty realisation for the entire set of initial states. The values of $w$ are then scaled using $\Sigma$ in the forward model. This step is computationally convenient but breaks the assumption that variables are drawn from a uniform distribution. We leave this to future work. In this paper, independent Gaussian uncertainty models are used and stability is verified directly on the environment. Note that probabilistic verification is expensive but necessary, as pathological cases could result in the training loss (19) for the safe set converging to a local minimum with a very small set. If this is the case, then usually the forward model is not accurate enough or the uncertainty hyperparameter $\sigma_w$ is too large. Note that Algorithm 2 is highly parallelizable. Once a verified safe set is found, the environment can be controlled by means of a 1-step MPC with probabilistic stability (see Appendix). Consider the constraint $V(x) \leq l_s^\star = l_u l_s$, where $V$ and $l_s$ come from Algorithm 1 and $l_u$ from Algorithm 2. The Safe-MPC exploration strategy follows:
Safe-MPC for exploration.
For collecting new data, solve the following MPC problem:

$u^\star = \arg\min_{u \in U}\bigg\{\beta\ell(x, u) - \alpha\ell_{expl}(x, u) + \max_{\hat{x}^+ \in W(x, u, \theta)}\big[\beta V(\hat{x}^+) - \gamma\log(l_s^\star - V(\hat{x}^+))\big]\bigg\},$  (27)

where $\alpha \leq \gamma$ is the exploration hyperparameter, $\beta \in [0, 1]$ is the regulation or exploitation parameter, and $\ell_{expl}(x, u)$ is the info-gain from the model, similar to (Hafner et al., 2018b):

$\ell_{expl}(x, u) = \sum_{i=1,\dots,N_x} \frac{\Sigma_{ii}(x, u, \theta)}{N_x\, \sigma_{y,ii}}.$  (28)

The full derivation of the problem and a probabilistic safety result are discussed in the Appendix.

Alternate min-max optimization.
Problem (27) is approximated using alternate descent. In particular, the maximization in the loss function over the uncertain future state $\hat{x}^+$ with respect to $\hat{w}$, given the current control candidate $u$, is alternated with the minimization with respect to $u$, given the current candidate $\hat{x}^+$. Adam (Kingma & Ba, 2014) is used for both steps.

The approach is demonstrated on an inverted pendulum, where the input is the angular torque and the states/outputs are the angular position and velocity of the pendulum. The aim is to collect data safely around the unstable equilibrium point (the origin). The system has a torque constraint that limits the controllable region: in particular, if the initial angle is greater than 60 degrees, then the torque is not sufficient to swing up the pendulum. In order to compare to the LQR, we choose a linear policy with a tanh activation, meeting the torque constraints while preserving differentiability.
Safe set with known environment, comparison to LQR.
We first test the safe-net algorithm on the nominal pendulum model and compare the policy and the safe set with those from a standard LQR policy. Figure 4 shows the safe set and the learned control gain $K$ at different stages of the algorithm (iterations $i = 1, 30, 50, 60$), approaching the LQR.
Figure 4: Inverted pendulum. Safe set and controller with the proposed method for a known environment model. The initial set ($i = 0$) is based on a unit circle plus a constraint on $|\alpha|$. Contours show the function levels. The control gain gets closer to the LQR solution as iterations progress until circa $i = 50$, where the minimum of the Lyapunov loss (19) is achieved. The set and controller at this iteration are closest to the LQR solution, which is optimal around the equilibrium in the unconstrained case. In order to maximise the chances of verification, the optimal parameters are selected with early stopping, namely when the Lyapunov loss reaches its minimum, resulting in the gain shown in the figure.
Safe set with Bayesian model.
In order to test the proposed algorithms, the forward model is fitted on fixed-length sequences for an increasing amount of data points. Data is collected in closed loop with the initial controller, $K_0$, with different initial states. In particular, we perturb the initial state and control values with random noise whose standard deviations start from small values and double as the dataset grows. The only prior used is that the velocity is the derivative of the angular position (both suitably normalised). The uncertainty bound $\sigma_w$ was fixed a priori. The architecture was cross-validated with a train-validation split. The model with the best validation predictions as well as the largest safe set was used to generate the results in Figure 6.
Figure 5: Inverted pendulum verification. Panels: (a) environment; (b) NCP-BRNN. Nominal and robust safe sets are verified on the pendulum simulation using sampled states. We search for the largest stability region and the smallest ultimate bound of the solution. If a simulation is not available, then a two-level sampling on the BRNN is performed.
The results demonstrate that the size of the safe set can improve with more data, provided that the model uncertainty decreases and the predictions have comparable accuracy. This motivates exploration.
Verification on the environment.
The candidate Lyapunov function, safe level set, and robust control policy are formally verified through probabilistic sampling of the system state, according to Algorithm 2, where the simulation is used directly. The results are shown in Figure 5. In particular, the computed level sets verify at the first attempt and no further search for sub-levels or ultimate bounds is needed.
Figure 6: Inverted pendulum safe set with Bayesian model.
Surrogates are obtained with an increasing amount of data. The initial state and input perturbations from the safe policy are drawn from Gaussians with standard deviations that double as the dataset grows. Top: mean predictions and uncertainty contours for the NCP-BRNN model; beyond a certain amount of data, no further improvement is noticed. Bottom: comparison of safe sets with surrogates and the environment. Reducing the model uncertainty while maintaining a similar prediction accuracy leads to an increase in the safe set; beyond a certain dataset size, no further benefits are noticed on the set, which is consistent with the uncertainty estimates.
Figure 7: Safe exploration. Comparison of a naive semi-random exploration strategy with the proposed Safe-MPC for exploration; the panel titles report the covered state-space volume for each strategy. The proposed algorithm has an efficient space coverage with safety guarantees.
Safe exploration.
Safe exploration is performed using the min-max approach in Section 5. For comparison, a semi-random exploration strategy is also used: while inside the safe set, the action magnitude is set to the maximum torque and its sign is given by a random uniform variable; once $V(x)$ exceeds a fixed fraction of $l_s$, the safe policy $K$ is used. This does not provide any formal guarantee of safety, as the value of $V(x)$ could exceed the safe level, especially for very fast systems and large input signals. This is repeated for several trials in order to estimate the maximum reachable set within the safe set. The results are shown in Figure 7, where the semi-random strategy is used as a baseline and is compared to a single trial of the proposed safe-exploration algorithm. The area covered by our algorithm in a single trial is a substantial fraction of that covered by the semi-random baseline over several trials. Extending the length of the trials did not significantly improve the baseline results. Despite being more conservative, our algorithm continues to explore safely indefinitely.

Preliminary results show that SiMBL produces a Lyapunov function and a safe set using neural networks that are comparable with those of standard optimal control (LQR) and can account for state-dependent additive model uncertainty. A Bayesian RNN surrogate with NCP was proposed and trained for an inverted pendulum simulation. An alternate descent method was presented to jointly learn a Lyapunov function, a safe level set, and a stabilising control policy for the surrogate model with backpropagation. We demonstrated that adding data points to the training set can increase the safe-set size, provided that the model improves and its uncertainty decreases. To this end, an uncertainty prior from the previous model was added to the framework. The safe set was then formally verified through a novel probabilistic algorithm for ultimate bounds and used for safe data collection (exploration). A one-step safe MPC was proposed where the Lyapunov function provides the terminal cost and constraint to mimic an infinite horizon with high probability of recursive feasibility. Results show that the proposed safe-exploration strategy has better coverage than a naive policy which switches between random inputs and the safe policy.

References
Akametalu, A. K., Fisac, J. F., Gillula, J. H., Kaynama, S., Zeilinger, M. N., & Tomlin, C. J. (2014). Reachability-based safe learning with Gaussian processes. In IEEE Conference on Decision and Control (CDC). IEEE. URL https://doi.org/10.1109/cdc.2014.7039601
Bemporad, A., Borrelli, F., & Morari, M. (2003). Min-max control of constrained uncertain discrete-time linear systems. IEEE Transactions on Automatic Control, 48(9), 1600-1606.
Ben-Tal, A., Ghaoui, L. E., & Nemirovski, A. (2009). Robust Optimization (Princeton Series in Applied Mathematics). Princeton University Press.
Berkenkamp, F., Turchetta, M., Schoellig, A. P., & Krause, A. (2017). Safe model-based reinforcement learning with stability guarantees. arXiv:1705.08551. URL http://arxiv.org/abs/1705.08551
Blanchini, F., & Miani, S. (2007). Set-Theoretic Methods in Control (Systems & Control: Foundations & Applications). Birkhäuser.
Bobiti, R., & Lazar, M. (2016). Sampling-based verification of Lyapunov's inequality for piecewise continuous nonlinear systems. arXiv:1609.00302. URL http://arxiv.org/abs/1609.00302
Bobiti, R. V. (2017). Sampling driven stability domains computation and predictive control of constrained nonlinear systems. Ph.D. thesis. URL https://pure.tue.nl/ws/files/78458403/20171025_Bobiti.pdf
Borrelli, F., Bemporad, A., & Morari, M. (2017). Predictive Control for Linear and Hybrid Systems. Cambridge University Press.
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. New York, NY, USA: Cambridge University Press.
Camacho, E. F., & Bordons, C. (2007). Model Predictive Control. Springer London.
Carron, A., Arcari, E., Wermelinger, M., Hewing, L., Hutter, M., & Zeilinger, M. N. (2019). Data-driven model predictive control for trajectory tracking with a robotic arm. URL http://hdl.handle.net/20.500.11850/363021
Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., & Abbeel, P. (2016). Variational lossy autoencoder. arXiv:1611.02731. URL http://arxiv.org/abs/1611.02731
Cheng, R., Orosz, G., Murray, R. M., & Burdick, J. W. (2019). End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. arXiv:1903.08792. URL http://arxiv.org/abs/1903.08792
Chow, Y., Nachum, O., Duenez-Guzman, E., & Ghavamzadeh, M. (2018). A Lyapunov-based approach to safe reinforcement learning. arXiv:1805.07708. URL http://arxiv.org/abs/1805.07708
Chow, Y., Nachum, O., Faust, A., Duenez-Guzman, E., & Ghavamzadeh, M. (2019). Lyapunov-based safe policy optimization for continuous control. arXiv:1901.10031. URL http://arxiv.org/abs/1901.10031
Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. arXiv:1805.12114. URL http://arxiv.org/abs/1805.12114
Ciccone, M., Gallieri, M., Masci, J., Osendorfer, C., & Gomez, F. (2018). NAIS-Net: Stable deep networks from non-autonomous differential equations. In NeurIPS.
Deisenroth, M., & Rasmussen, C. (2011). PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, (pp. 465-472). Omnipress.
Deisenroth, M. P., Fox, D., & Rasmussen, C. E. (2015). Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2), 408-423. URL http://arxiv.org/abs/1502.02860
Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F., & Udluft, S. (2016). Learning and policy search in stochastic dynamical systems with Bayesian neural networks. arXiv:1605.07127. URL http://arxiv.org/abs/1605.07127
Frigola, R., Chen, Y., & Rasmussen, C. E. (2014). Variational Gaussian process state-space models. In NIPS.
Gal, Y., McAllister, R., & Rasmussen, C. E. (2016). Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, ICML.
Gallieri, M. (2016). Lasso-MPC - Predictive Control with ℓ1-Regularised Least Squares. Springer International Publishing. URL https://doi.org/10.1007/978-3-319-27963-3
Gros, S., & Zanon, M. (2019). Towards safe reinforcement learning using NMPC and policy gradients: Part II - Deterministic case. arXiv:1906.04034. URL http://arxiv.org/abs/1906.04034
Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2018a). Learning latent dynamics for planning from pixels. arXiv:1811.04551. URL http://arxiv.org/abs/1811.04551
Hafner, D., Tran, D., Irpan, A., Lillicrap, T., & Davidson, J. (2018b). Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv:1807.09289. URL http://arxiv.org/abs/1807.09289
Hewing, L., Kabzan, J., & Zeilinger, M. N. (2017). Cautious model predictive control using Gaussian process regression. arXiv:1705.10702. URL http://arxiv.org/abs/1705.10702
Horn, R. A., & Johnson, C. R. (2012). Matrix Analysis. New York, NY, USA: Cambridge University Press, 2nd ed.
Kalman, R. (2001). Contribution to the theory of optimal control. Bol. Soc. Mat. Mexicana.
Kerrigan, E. (2000). Robust constraint satisfaction: Invariant sets and predictive control. Tech. rep. URL http://hdl.handle.net/10044/1/4346
Kerrigan, E. C., & Maciejowski, J. M. (2004). Feedback min-max model predictive control using a single linear program: robust stability and the explicit solution. International Journal of Robust and Nonlinear Control, 14(4), 395-413. URL https://doi.org/10.1002/rnc.889
Khalil, H. K. (2014). Nonlinear Control. Pearson.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980. URL http://arxiv.org/abs/1412.6980
Koller, T., Berkenkamp, F., Turchetta, M., & Krause, A. (2018). Learning-based model predictive control for safe exploration and reinforcement learning. arXiv:1803.08287. URL http://arxiv.org/abs/1803.08287
Kouvaritakis, B., & Cannon, M. (2015). Model Predictive Control: Classical, Robust and Stochastic. Advanced Textbooks in Control and Signal Processing, Springer, London.
Kurutach, T., Clavera, I., Duan, Y., Tamar, A., & Abbeel, P. (2018). Model-ensemble trust-region policy optimization. arXiv:1802.10592. URL http://arxiv.org/abs/1802.10592
Limon, D., Calliess, J., & Maciejowski, J. (2017). Learning-based nonlinear model predictive control. IFAC-PapersOnLine, 50(1), 7769-7776. URL https://doi.org/10.1016/j.ifacol.2017.08.1050
Lorenzen, M., Cannon, M., & Allgower, F. (2019). Robust MPC with recursive model update. Automatica, 467-471. URL https://ora.ox.ac.uk/objects/pubs:965898
Lowrey, K., Rajeswaran, A., Kakade, S., Todorov, E., & Mordatch, I. (2018). Plan online, learn offline: Efficient learning and exploration via model-based control. arXiv:1811.01848. URL http://arxiv.org/abs/1811.01848
Maciejowski, J. (2000). Predictive Control with Constraints. Prentice Hall.
Mayne, D. Q., Rawlings, J. B., Rao, C. V., & Scokaert, P. O. M. (2000). Constrained model predictive control: Stability and optimality. Automatica, 36(6), 789-814.
Papini, M., Battistello, A., & Restelli, M. (2018). Safely exploring policy gradient.
Pozzoli, S. (2019). State Estimation and Recurrent Neural Networks for Model Predictive Control. Politecnico di Milano, MS thesis, supervisors: R. Scattolini, M. Gallieri, E. Terzi, M. Farina.
Pozzoli, S., Gallieri, M., & Scattolini, R. (2019). Tustin neural networks: a class of recurrent nets for adaptive MPC of mechanical systems. arXiv:1911.01310. URL http://arxiv.org/abs/1911.01310
Raimondo, D., Limon, D., Lazar, M., Magni, L., & Camacho, E. (2009). Min-max model predictive control of nonlinear systems: A unifying overview on stability. European Journal of Control.
Raković, S. V., Kouvaritakis, B., Findeisen, R., & Cannon, M. (2012). Homothetic tube model predictive control. Automatica, 48, 1631-1638.
Raković, S. V., & Levine, W. S. (Eds.) (2019). Handbook of Model Predictive Control. Springer International Publishing. URL https://doi.org/10.1007/978-3-319-77489-3
Rawlings, J. B., & Mayne, D. Q. (2009). Model Predictive Control: Theory and Design. Nob Hill Pub, Llc.
Richards, A. G. (2004). Robust Constrained Model Predictive Control. Ph.D. thesis, MIT.
Richards, S. M., Berkenkamp, F., & Krause, A. (2018). The Lyapunov neural network: Adaptive stability certification for safe learning of dynamical systems. arXiv:1808.00924. URL http://arxiv.org/abs/1808.00924
Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2, e55. URL https://doi.org/10.7717/peerj-cs.55
Shyam, P., Jaskowski, W., & Gomez, F. (2018). Model-based active exploration. CoRR, abs/1810.12162. URL http://arxiv.org/abs/1810.12162
Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2), 99-127.
Taylor, A. J., Dorobantu, V. D., Le, H. M., Yue, Y., & Ames, A. D. (2019). Episodic learning with control Lyapunov functions for uncertain robotic systems. arXiv:1903.01577. URL http://arxiv.org/abs/1903.01577
Thananjeyan, B., Balakrishna, A., Rosolia, U., Li, F., McAllister, R., Gonzalez, J. E., Levine, S., Borrelli, F., & Goldberg, K. (2019). Safety augmented value estimation from demonstrations (SAVED): Safe deep model-based RL for sparse cost robotic tasks. arXiv:1905.13402. URL http://arxiv.org/abs/1905.13402
Verdier, C. F., & Mazo Jr., M. (2017). Formal controller synthesis via genetic programming. IFAC-PapersOnLine, 50(1), 7205-7210. URL https://doi.org/10.1016/j.ifacol.2017.08.1362
Vinogradska, J. (2017). Gaussian Processes in Reinforcement Learning: Stability Analysis and Efficient Value Propagation. URL http://tuprints.ulb.tu-darmstadt.de/7286/1/GPs_in_RL_Stability_Analysis_and_Efficient_Value_Propagation_Version1.pdf
Wabersich, K. P., Hewing, L., Carron, A., & Zeilinger, M. N. (2019). Probabilistic model predictive safety certification for learning-based control. arXiv:1906.10417. URL http://arxiv.org/abs/1906.10417
Wan, E., & Van Der Merwe, R. (2000). The unscented Kalman filter for nonlinear estimation. (pp. 153-158).
Williams, G., Wagener, N., Goldfain, B., Drews, P., Rehg, J. M., Boots, B., & Theodorou, E. A. (2017). Information theoretic MPC for model-based reinforcement learning. In IEEE International Conference on Robotics and Automation (ICRA). IEEE. URL https://doi.org/10.1109/icra.2017.7989202
Yan, S., Goulart, P., & Cannon, M. (2018). Stochastic model predictive control with discounted probabilistic constraints. arXiv:1807.07465. URL http://arxiv.org/abs/1807.07465
Yang, X., & Maciejowski, J. (2015a). Risk-sensitive model predictive control with Gaussian process models. IFAC-PapersOnLine, 48(28), 374-379. URL https://doi.org/10.1016/j.ifacol.2015.12.156
Yang, X., & Maciejowski, J. M. (2015b). Fault tolerant control using Gaussian processes and model predictive control. International Journal of Applied Mathematics and Computer Science, 25(1), 133-148. URL https://doi.org/10.1515/amcs-2015-0010
Zhao, S., Song, J., & Ermon, S. (2017). InfoVAE: Information maximizing variational autoencoders. arXiv:1706.02262. URL http://arxiv.org/abs/1706.02262

A Robust optimal control for safe learning
Further detail is provided regarding robust and chance-constrained control.
Chance-constrained and robust control.
Consider the problem of finding a controller $K$ and a function $V$ such that $u(t) = K(x(t))$ and:

$P\big[V(\hat{x}(t+1)) - V(x(t)) \leq -\ell(x(t), u(t))\big] \geq 1 - \epsilon_p,$  (29)

where $\hat{x}$ is given by the forward model (3), $P$ represents a probability, and $0 < \epsilon_p \ll 1$. This is a chance-constrained control problem (Kouvaritakis & Cannon, 2015; Yan et al., 2018). Since finding $K$ and $V$ that satisfy (29) requires solving a non-convex and also stochastic optimization, we approximate (29) with a min-max condition over a high-confidence interval, in the form of a convex set $W(x(t), u(t), \theta)$, as follows:

$\max_{\hat{x}(t+1) \in W(x(t), u(t), \theta)}\big[V(\hat{x}(t+1))\big] - V(x(t)) \leq -\ell(x(t), K(x(t))).$  (30)

This is a robust control problem, which is still non-convex but deterministic. In the convex case, (30) can be satisfied by means of robust optimization (Ben-Tal et al., 2009; Rawlings & Mayne, 2009). Following this consideration, we frame the control problem as a non-convex min-max optimization.
Links to optimal control and intrinsic robustness.
To link our approach with optimal control and reinforcement learning, note that if the condition in (13) is met with equality, then the controller $K$ and the Lyapunov function $V$ satisfy the Bellman equation (Rawlings & Mayne, 2009). Therefore, $u = K(x)$ is optimal and $V(x)$ is the value function of the infinite-horizon optimal control problem with stage loss $\ell(x, u)$. In practice, this condition is not met with exact equality. Nevertheless, the inequality in (13) guarantees by definition that the system controlled by $K(x)$ is asymptotically stable (converges to $x = 0$) and has a degree of tolerance to uncertainty in the safe set $X_s$ (i.e., if the system is locally Lipschitz) (Rawlings & Mayne, 2009). Vice versa, infinite-horizon optimal control with the considered cost produces a value function which is also a Lyapunov function and provides an intrinsic degree of robustness (Rawlings & Mayne, 2009).

B From robust MPC to safe exploration
Once a robust Lyapunov function and invariant set are found, the environment can be controlled by means of a one-step MPC with probabilistic safety guarantees.
One-step robust MPC.
Start by considering the following min-max 1-step MPC problem:

$u^\star = \arg\min_{u \in U}\bigg\{\ell(x, u) + \max_{\hat{x}^+ \in W(x, u, \theta)}\big[V(\hat{x}^+)\big]\bigg\},$  (31)

subject to $\max_{\hat{x}^+ \in W(x, u, \theta)}\big[V(\hat{x}^+)\big] \leq l_s$ and to the model (3), given $x = x(t)$, with $V(x) \leq l_s$. This is a non-convex min-max optimisation problem with hard non-convex constraints. Solving (31) is difficult, especially in real-time, but is in general possible if the constraints are feasible. This is true with a probability that depends on the verification procedure, the confidence level used in the procedures, as well as the probability of the model being correct.
Relaxed problem.
Solutions of (31) can be computed in real-time, to a degree of accuracy, by iterative convexification of the problem and the use of fast convex solvers. For the purpose of this paper, we will consider the soft-constrained or relaxed problem:

$u^\star = \arg\min_{u \in U}\bigg\{\ell(x, u) + \max_{\hat{x}^+ \in W(x, u, \theta)}\big[V(\hat{x}^+) - \gamma\log(l_s - V(\hat{x}^+))\big]\bigg\},$  (32)

once again subject to (3). It is assumed that a scalar $\gamma > 0$ exists such that the constraint can be enforced. For the sake of simplicity, problem (32) will be addressed using backpropagation, at the price of losing real-time guarantees.

Safe exploration.
For collecting new data, we modify the robust MPC problem as follows:

$u^\star = \arg\min_{u \in U}\bigg\{\beta\ell(x, u) - \alpha\ell_{expl}(x, u) + \max_{\hat{x}^+ \in W(x, u, \theta)}\big[\beta V(\hat{x}^+) - \gamma\log(l_s - V(\hat{x}^+))\big]\bigg\},$  (33)

where $\alpha \leq \gamma$ is the exploration hyperparameter, $\beta \in [0, 1]$ is the regulation or exploitation parameter, and $\ell_{expl}(x, u)$ is the info-gain from the model, similar to (Hafner et al., 2018b):

$\ell_{expl}(x, u) = \sum_{i=1,\dots,N_x} \frac{\Sigma_{ii}(x, u, \theta)}{N_x\, \sigma_{y,ii}}.$  (34)
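A minimal sketch of solving (32)-(33) by backpropagation follows. It assumes `model(x, u)` returns the mean and diagonal std, reuses the `uncertainty_vertices` helper sketched earlier, replaces the inner maximisation with the vertex enumeration (24), and requires the iterates to stay strictly inside the level set so that the log-barrier is defined; the stage cost `ell` and bounds are assumptions.

```python
import torch

def safe_mpc_action(x, V, model, ell, l_star, alpha, beta, gamma, u_max, steps=50):
    """Sketch of the exploration MPC (33): Adam steps on u, with the inner max
    over W approximated at the vertices of the uncertainty set."""
    u = torch.zeros(1, model.nu, requires_grad=True)
    opt = torch.optim.Adam([u], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        mu, sigma = model(x, u)
        verts = uncertainty_vertices(x, mu, sigma, model.dt, model.sigma_bar)
        v = V(verts.reshape(-1, x.shape[-1]))
        worst = (beta * v - gamma * torch.log(l_star - v)).max()  # barrier keeps x+ in X_s
        info_gain = alpha * sigma.mean()            # info-gain term, cf. eq. (34)
        loss = beta * ell(x, u) - info_gain + worst
        loss.backward()
        opt.step()
        with torch.no_grad():
            u.clamp_(-u_max, u_max)                 # projection onto the input box U
    return u.detach()
```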
We study the feasibility and stability of the proposed scheme, following the framework of Mayne et al. (2000); Rawlings & Mayne (2009). In particular, if the MPC (31) is always feasible, and the terminal cost and terminal set satisfy (15) with probability 1, then the MPC (31) enjoys some intrinsic robustness properties. In other words, we should be able to control the physical system and come back to a neighborhood of the initial equilibrium point for any state in $X_s$, the size of this neighborhood depending on the model accuracy. We assume a perfect solver is used and that the relaxed problems enforce the constraints exactly for a given $\gamma$. For the exploration MPC to be safe, we wish to be able to find a $u^\star(t)$ satisfying the terminal constraint, $V(\hat{x}^+(t)) \leq l_s, \forall t$, starting from the stochastic system:

$\hat{x}^+(t) = x(t) + (\mu(x(t), u^\star(t), \theta_\mu) + w(t))\,dt, \quad w(t) \sim \mathcal{N}(0, \Sigma(x(t), u^\star(t), \theta_\Sigma)).$

We aim at a probabilistic result. First, recall that we truncate the distribution of $w$ to a high-confidence-level z-score, $\bar{\sigma}$. Once again, we switch to a set-valued uncertainty representation, as it is most convenient, and provide a result that depends on the z-score $\bar{\sigma}$. Assume known the probability of our model being able to perform one-step predictions, given the model $\mathcal{M}$, such that the real state increments are within the given confidence interval, $\bar{\sigma}$, and define it as: $P(x(t+1) \in \mathcal{M}(x(t))\,|\,\mathcal{M}, \bar{\sigma}) = 1 - \epsilon_M(\bar{\sigma})$. This probability can be estimated and improved using cross-validation, for instance by fine-tuning $\sigma_w$. It can also be increased with $\bar{\sigma}$ after the model training; this can, however, make the control search more challenging. Finally, since we use probabilistic verification, from (15) we have a probability of the terminal set being invariant for the model with truncated distributions: $P(R(X_s) \subseteq X_s\,|\,\mathcal{M}, \bar{\sigma}) = P(R(X_s) \subseteq X_s) = 1 - \epsilon_p$, where $R$ is the one-step reachability set operator (Kerrigan, 2000) computed using the model in closed loop with $K$. Note that this probability is determined by the number of verification samples (Bobiti, 2017). Safety of the next state is determined by:
Theorem 1. Given $x(t) \in X_s$, the probability of (31)-(33) being feasible (safe) at the next time step is:

$P(x(t+1) \in X_s\,|\,\mathcal{M}, \bar{\sigma}, x(t) \in X_s)$  (35)
$= P(x(t+1) \in \mathcal{M}(x(t))\,|\,\mathcal{M}, \bar{\sigma})\, P(R(X_s) \subseteq X_s)$  (36)
$= (1 - \epsilon_M(\bar{\sigma}))(1 - \epsilon_p).$  (37)

It must be noticed that, whilst $P(R(X_s) \subseteq X_s)$ is constant, the size of $X_s$ will generally decrease for increasing $\bar{\sigma}$ as well as $\sigma_w$. The probability of any state leading to safety in the next step is given by:
Theorem 2. Given $x(t)$, the probability of (31)-(33) being feasible (safe) at the next step is:

$P_{SAFE}(\mathcal{M}, \bar{\sigma}) = P(x(t) \in C(X_s))\, P(x(t+1) \in X_s\,|\,\mathcal{M}, \bar{\sigma}, x(t) \in X_s)$  (38)
$\geq P(x(t) \in X_s)(1 - \epsilon_M(\bar{\sigma}))(1 - \epsilon_p),$  (39)

where $C$ denotes the one-step robust controllable set for the model (Kerrigan, 2000). The size of the safe set is a key factor for a safe system. This depends also on the architecture of $V$ and $K$ as well as on the stage cost matrices $Q$ and $R$. A stage cost is not explicitly needed for the proposed approach; however, $Q$ can be beneficial in terms of additional robustness and $R$ serves as a regularisation for $K$.
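As a numerical illustration with hypothetical values: if cross-validation gives $\epsilon_M(\bar{\sigma}) = 0.01$ and the verification sample size gives $\epsilon_p = 0.01$, then Theorem 1 yields a one-step safety probability of $(1 - 0.01)(1 - 0.01) = 0.99 \times 0.99 \approx 0.98$ whenever $x(t) \in X_s$.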
C Network architecture for inverted pendulum

Forward Model
Recall the NCP-BRNN definition:

$\hat{x}(t+1) = \hat{x}(t) + dt\, d\hat{x}(t), \quad d\hat{x}(t) = \mu(\hat{x}(t), u(t); \theta_\mu) + \hat{w}(t),$  (40)
$\hat{w}(t) \sim q(\hat{x}(t), u(t); \theta_\Sigma), \quad q(\hat{x}(t), u(t); \theta_\Sigma) = \mathcal{N}(0, \Sigma(\hat{x}(t), u(t); \theta_\Sigma)),$  (41)
$\hat{y}(t) \sim \mathcal{N}(\hat{x}(t), \sigma_y), \quad \hat{x}(0) \sim \mathcal{N}(x(0), \sigma_y).$  (42)

Partition the state as $x = [x_1\ x_2]^T$, where the former represents the angular position and the latter the velocity, both suitably normalised. We consider a $\mu$ of the form:

$\mu(x, u) = \begin{bmatrix} c\, x_2 \\ f_\mu(x, u) \end{bmatrix},$  (43)

where $c$ is a fixed normalisation constant and $f_\mu(x, u)$ is a three-layer feed-forward neural network with tanh activations in the two hidden layers. The final layer is linear. The first layer of $f_\mu$ is shared with the standard deviation network, $\Sigma(x, u)$, where it is then followed by one further hidden layer before the final sigmoid layer. The parameter $\sigma_w$ is fixed a priori. The noise standard deviation, $\sigma_y$, is passed through a softplus layer in order to keep it positive and was initialised by inverting the softplus. Training used a learning rate of 1E-4 on truncated sequences of fixed length; the number of sequences was increased in increments and the batch size adjusted accordingly, and the target (stopping) loss was initialised to a fixed value. We point out that this architecture is quite general and has been positively tested on other applications, for instance a double pendulum or a joint-space robot model, with states partitioned accordingly.
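A sketch of the grey-box mean (43) follows; the hidden size and the normalisation constant are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PendulumGreyBoxMu(nn.Module):
    """Sketch of the grey-box mean (43): the angle derivative is wired to the
    (normalised) velocity state, only the acceleration is learned."""
    def __init__(self, hidden=64, vel_scale=1.0):
        super().__init__()
        self.vel_scale = vel_scale  # normalisation constant c (an assumption)
        self.f_mu = nn.Sequential(nn.Linear(3, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))

    def forward(self, x, u):
        d_angle = self.vel_scale * x[..., 1:2]          # known physics prior
        d_vel = self.f_mu(torch.cat([x, u], dim=-1))    # learned acceleration
        return torch.cat([d_angle, d_vel], dim=-1)
```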
Lyapunov function
The Lyapunov net consists of three fully-connected hidden layers with tanh activations, followed by a final linear layer whose outputs are reshaped into the matrix $V_{net}$ of size $n_V \times n_x$. The function is evaluated as:

$V(x) = \mathrm{softplus}(\alpha)\, x^T\big(\epsilon I + V_{net}(x)^T V_{net}(x)\big)x + \psi(x),$  (44)

where $\epsilon > 0$ is a hyperparameter and $\alpha$ is a trainable scaling parameter which is passed through a softplus. The introduction of $\alpha$ noticeably improved results. The prior function $\phi$ was set to bound $|x|$ within the region of operation. Training alternated outer epochs with inner epochs for the updates of $V$ and $K$, with a learning rate of 1E-3, using a uniform grid of initial states as a single batch.
Exploration MPC
For demonstrative purposes, we solved the Safe-MPC using Adam for the minimisation step and SGD for the maximisation step, within an outer alternation loop. The learning rates were set separately for the two steps, with 1E-4 used for the maximisation. The exploration factor, $\alpha$, was set equal to the soft-constraint factor, $\gamma$. The exploitation factor, $\beta$, was fixed.

D Considerations on model refinement and policies
Using neural networks presents several advantages over other popular inference models: for instance, their scalability to high-dimensional problems and to large amounts of data, the ease of including physics-based priors and structure in the architecture, and the possibility to learn over long sequences. At the same time, NNs require more data than other methods and offer no formal guarantees. For the guarantees, we have considered a-posteriori probabilistic verification. For the larger amount of data, we have assumed that an initial controller exists (this is often the case) that can be used to safely collect as much data as we need.
Model refinement.
A substantial difficulty was encountered while trying to incrementally improve the results of the neural network with an increasing amount of data. In particular, as more data is collected, larger or more batches must be used. This implies that either the gradient computation or the number of backward passes performed per epoch is different from the previous model training. Consequently, if a model is retrained entirely from scratch, then the final loss and the network parameters can be substantially different from the ones obtained in the previous trial. This might result in having a larger uncertainty than before in certain regions of the state space. If this is the case, then stabilising the model can become more difficult and the resulting safe set might be smaller. We observed this pathology initially and mitigated it by employing these particular steps: first, we use a sigmoid layer to effectively limit the maximum uncertainty to a known hyperparameter; second, we added a consistency loss that encourages the new model to have uncertainty smaller than the previous one over the (new) training set; third, we used rejection sampling for the background based on the uncertainty of the previous model, so that the NCP does not penalise previously known datapoints; finally, we stop the training loop as soon as the final loss of the previous model is exceeded. These ingredients have proven successful in reducing this pathology and, together with having training data with increasing variance, have ensured that the uncertainty and the safe set improve as data is added. Beyond a certain dataset size, however, adding further datapoints has not improved the size of the safe set, which has not reached its maximal possible size. We believe that closing the loop with exploration could improve on this result but are also going to investigate further alternatives.
Noticeably, Gal et al. (2016) remarked that improving their BNN model was not simple. They tried, for instance, to use a forgetting factor, which was not successful, and concluded that their best solution was to save only a fixed number of most recent trials. We believe this could be insufficient for safe learning, as the uncertain space needs to be explored. Future work will further address this topic, for instance by retraining only part of the network, or possibly by exploring combinations of our approach with the ensemble approach used in Shyam et al. (2018). Initial trials of the former seemed encouraging for deterministic models.
Training NNs as robust controllers.
In this paper, we have used a neural network policy for the computation of the safe controller. This choice was made fundamentally to compare the results with an LQR, which can successfully solve the example. Training policies with backpropagation is not an easy task in general. For more complex scenarios, we envisage two possible solutions: the first is to use evolutionary strategies (Stanley & Miikkulainen, 2002; Salimans et al., 2017) or other global optimisation methods to train the policy; the second is to use a robust MPC instead of a policy. Initial trials of the former seemed encouraging. The latter would result in a change of Algorithm 1, where $K$ would not be learned but just evaluated point-wise through an MPC solver. Future work is going to investigate these alternatives.

E Related work
E Related work
Robust and Stochastic MPC
Robust MPC can be formulated using several methods: for instance, min-max optimisation (Bemporad et al., 2003; Kerrigan & Maciejowski, 2004; Raimondo et al., 2009), tube MPC (Rawlings & Mayne, 2009; Raković et al., 2012) and constraint restriction (Richards, 2004) all provide robust recursive feasibility given a known bounded uncertainty set. In tube MPC, as well as in constraint restriction, the nominal cost is optimised while the constraints are restricted according to the uncertainty set estimate. This can be more conservative, but it does not require the maximisation step. For non-linear systems, computing the required control invariant sets is generally challenging. Stochastic MPC approaches the control problem in a probabilistic way, using either expected or probabilistic constraints. For a broad review of MPC methods and theory, one can refer to Maciejowski (2000); Camacho & Bordons (2007); Rawlings & Mayne (2009); Kouvaritakis & Cannon (2015); Gallieri (2016); Borrelli et al. (2017); Raković & Levine (2019).
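For concreteness, a generic min-max MPC over a horizon H can be written, in the notation of (1)-(2) with an additive disturbance w_k from a bounded set W, as the textbook template below; this is a generic formulation, not the exact one used in any of the cited works.

```latex
\min_{u_0,\dots,u_{H-1}\in\mathcal{U}} \;
\max_{w_0,\dots,w_{H-1}\in\mathcal{W}} \;
\sum_{k=0}^{H-1} \ell(x_k, u_k)
\quad \text{s.t.} \quad
x_{k+1} = x_k + dt\, f(x_k, u_k) + w_k, \qquad
x_k \in \mathcal{X}.
```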
Adaptive MPC
Lorenzen et al. (2019) presented an approach based on tube MPC for linear parameter-varying systems using set membership estimation. In particular, the constraint and model parameter set estimates are updated in order to guarantee recursive feasibility. Pozzoli et al. (2019) used the Unscented Kalman Filter (UKF) to adapt online the last layer of a novel RNN architecture, the Tustin Net (TN), which was then used to successfully control a double pendulum through MPC. TN is a deterministic RNN related to the architecture used in this paper. A comparison of different network architectures, estimation and adaptation heuristics for neural MPC can be found, for instance, in Pozzoli (2019).
Stability certification
Bobiti & Lazar (2016) proposed a grid-based deterministic verification method which relies on local Lipschitz bounds of the system dynamics. This approach requires knowledge of the model equations; it was extended to black-box simulations (Bobiti, 2017) using a probabilistic approach. We extend this framework by means of a check for ultimate boundedness and propose to use it with Bayesian models through uncertainty sampling, as in the sketch below.
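The sketch below illustrates the basic sampling step of such a verification, assuming a candidate Lyapunov function V, a model f, and a policy K; it omits the probabilistic confidence bound and, for a Bayesian model, the additional sampling of the model uncertainty, so it is a simplification rather than the full procedure.

```python
import numpy as np

def verify_decrease(V, f, K, sample_states, dt=0.05, lam=0.99):
    """Monte-Carlo check of the Lyapunov decrease condition on sampled
    states from a candidate safe level set (simplified: no confidence
    bound is computed here).

    Returns the fraction of samples for which
    V(x + dt * f(x, K(x))) <= lam * V(x).
    """
    passed = 0
    for x in sample_states:
        x_next = x + dt * f(x, K(x))   # closed-loop one-step prediction
        if V(x_next) <= lam * V(x):
            passed += 1
    return passed / len(sample_states)
```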
MPC for Reinforcement Learning
Williams et al. (2017) presented an information-theoretic framework to solve a non-linear MPC in real time using a neural network model and performed model-based RL on a scale-model race car with non-convex constraints. Gros & Zanon (2019) used policy gradient methods to learn a classic robust MPC for linear systems.
Safe learning
Yang & Maciejowski (2015a,b) looked, respectively, at using GPs for risk-sensitive and fault-tolerant MPC. Vinogradska (2017) proposed a quadrature method for computing invariant sets, stabilising and unrolling GP models for use in RL. Berkenkamp et al. (2017) used the deterministic verification method of Bobiti & Lazar (2016) on GP models and embedded it into an RL framework using approximate dynamic programming. The resulting policy has a high probability of safety. Akametalu et al. (2014) studied the reachable sets of GP models and proposed an iterative procedure to refine the stabilisable (safe) set as more data is collected. Hewing et al. (2017) reviewed uncertainty propagation methods and formulated a chance-constrained MPC for grey-box GP models. The approach was demonstrated on an autonomous racing example with non-linear constraints. Limon et al. (2017) presented an approach to learn a non-linear robust model predictive controller based on worst-case bounding functions and Hölder constant estimates from a non-parametric method. In particular, they use the trajectories of an initial offline model for recursive feasibility, as well as those of an online refined model to compute the optimal loss. Koller et al. (2018) provide high-probability guarantees of feasibility for a GP-based MPC with Gaussian kernels. This is done using a closed-form exact Taylor expansion that results in the solution of a generalised eigenvalue problem at each step of the prediction horizon. Cheng et al. (2019) complemented model-free RL methods (TRPO and DDPG) with a GP model-based approach using a barrier function safety loss, the GP being refined online. Chow et al. (2018) developed a safe Q-learning variant for constrained Markov decision processes based on a state-action Lyapunov function. The Lyapunov function is shown to be equal to the value function for a safety constraint function defined over a finite horizon, and it is constructed by means of a linear programme. Chow et al. (2019) extended this approach to policy gradient methods for continuous control, proposing two projection strategies to map the policy into the space of functions that satisfy the Lyapunov stability condition. Papini et al. (2018) proposed a policy gradient method for exploration with a statistical guarantee of improvement of the value function. Wabersich et al. (2019) formulated a probabilistically safe method to project the action resulting from a model-free RL algorithm into a safe manifold. Their algorithm is based on results from chance-constrained tube MPC and makes use of a linear surrogate model. Thananjeyan et al. (2019) approximated the model uncertainty using an ensemble of recurrent neural networks. Safety was approached by constraining the ensemble to be close to a set of successful demonstrations, for which a non-parametric distribution is trained. A stochastic constrained MPC is then approximated using the trajectories of the ensemble models, the model rollouts being entirely independent. Under several assumptions, the authors proved the system's safety. Although these assumptions can rarely be met in practice, the authors demonstrated that the approach works in practice on the control of a manipulator in non-convex constrained spaces with a small ensemble size.
Learning Lyapunov functions
Verdier & Mazo (2017) used genetic programming to learn a polynomial control Lyapunov function for automatic synthesis of a continuous-time switching controller. Richards et al. (2018) proposed an architecture and a learning method to obtain a Lyapunov neural network from labelled sequences of state-action pairs. Our Lyapunov loss function is inspired by Richards et al. (2018) but makes use neither of labels nor of sequences longer than one step; a simplified sketch of such a one-step loss is given below. These approaches were all demonstrated on an inverted pendulum simulation. Taylor et al. (2019) developed an episodic learning method to iteratively refine the derivative of a continuous-time Lyapunov function and improve an existing controller by solving a QP. Their approach exploits a factorisation of the Lyapunov function derivatives based on feedback linearisation of robotic system models. They test the approach on a Segway simulation.
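The following PyTorch sketch shows what a one-step, label-free Lyapunov loss of this kind could look like; the decrease factor, the positivity margin, and the exact combination of terms are illustrative assumptions and not the paper's actual loss.

```python
import torch

def lyapunov_loss(V, x, x_next, lam=0.99, eps=1e-3):
    """Illustrative one-step, label-free Lyapunov loss: penalises violations
    of the decrease condition V(x_next) <= lam * V(x) on observed one-step
    transitions, plus a margin keeping V positive away from the origin.

    V: candidate Lyapunov network mapping states to non-negative scalars,
    (x, x_next): batch of one-step state pairs.
    """
    decrease_violation = torch.relu(V(x_next) - lam * V(x))
    positivity = torch.relu(eps * x.pow(2).sum(dim=-1, keepdim=True) - V(x))
    return (decrease_violation + positivity).mean()
```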
Planning and value functions
POLO (Lowrey et al., 2018) consists of a combination of online planning (MPC) and offline value function learning. The value function is then used as the terminal cost for the MPC, mimicking an infinite horizon. The result is that, as the value function estimate improves, one can, in theory, shorten the planning horizon and still obtain a near-optimal solution. The authors demonstrated the approach using exact simulation models. This work is related to SiMBL, with the difference that our terminal cost is a Lyapunov function that can be used to certify safety.
Uncertain models for RL