Safe Learning-based Gradient-free Model Predictive Control Based on Cross-entropy Method
Lei Zheng^{a,1}, Rui Yang^{b}, Zhixuan Wu^{b}, Jiesen Pan^{b}, and Hui Cheng^{b,*}

^{a} School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510006, China
^{b} School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
Abstract
In this paper, a safe and learning-based control framework for model predictive control (MPC) is proposed to optimize nonlinear systems with a gradient-free objective function under uncertain environmental disturbances. The control framework integrates a learning-based MPC with an auxiliary controller in a way of minimal intervention. The learning-based MPC augments the prior nominal model with incremental Gaussian Processes to learn the uncertain disturbances. The cross-entropy method (CEM) is utilized as the sampling-based optimizer for the MPC with a gradient-free objective function. A minimal intervention controller is devised with a control Lyapunov function and a control barrier function to guide the sampling process and endow the system with high probabilistic safety. The proposed algorithm shows a safe and adaptive control performance on a simulated quadrotor in the tasks of trajectory tracking and obstacle avoidance under uncertain wind disturbances.
Keywords:
Model predictive control, learning-based control, cross-entropy method, minimal intervention controller
1. Introduction
Robots and autonomous systems are increasingly widely applied to solve complex tasks in highly uncertain and dynamic environments [1]. Operating in such environments requires sophisticated control methods that can adapt to the uncertain environmental disturbances, plan and execute trajectories utilizing the full dynamics, and complete the predefined tasks safely. Model predictive control (MPC) [2] provides a general framework to consider the system constraints naturally, anticipate future events, and take control actions accordingly to complete complex tasks encoded in the objective function. However, designing differentiable objective functions corresponding to all requirements of a task is non-trivial [3]. For example, it is challenging to design a differentiable objective function that simultaneously accounts for avoiding unexpectedly detected obstacles and aggressively tracking a trajectory for a safety-critical quadrotor in a cluttered obstacle field. The trade-off between safety and tracking performance is hard to mediate using a simplified differentiable form. It is more convenient for the designer to compose multiple high-level, simple but possibly gradient-free terms in the objective function, but the resulting gradient-free objective function brings difficulties to the optimization [4]. Obtaining solutions to such an optimization problem using regular gradient-based optimizers is difficult due to the nonlinear dynamics constraints and the gradient-free objective function.

Sampling-based methods such as random shooting [5], path integral control [6], and the cross-entropy method (CEM) [7] are effective approaches to solve optimization problems with a gradient-free objective function. The CEM was initially devised to estimate the probability of a rare event and was later employed as a general optimization framework.
The optimal sampling distribution in CEM is recursively approximated to guide the sampled trajectories toward lower-cost regions until converging to a delta distribution at the local optimum [7].

* Corresponding author. Email address: [email protected] (Hui Cheng).
1 First author. Email: [email protected].

Figure 1: Diagram of the proposed control scheme for a nonlinear system under uncertain disturbances. Taking in the desired state x_d and current state x, the MPC optimizes the nonlinear system with a simple gradient-free objective function using CEM. The output u_MPC of the MPC is modified by an auxiliary controller in a way of minimal intervention, which guides the sampling process and preserves system safety. The environmental uncertainties are learned using incremental GPs. The learned model is combined with a prior nominal model to serve in both the CEMPC and the MI auxiliary controller.

It has been demonstrated to effectively optimize complex nonlinear systems with a gradient-free objective function [8][9]. An adaptive sampling strategy is developed in [10] using receding-horizon cross-entropy trajectory optimization with the Gaussian Process upper confidence bound [11]. In [12], the planning problem of high-dimensional systems is solved by interleaving CEM and gradient descent optimization. As a local search method, however, the CEM is inherently sensitive to model errors caused by unexpected disturbances, which may make the sampling get stuck in a bad region of the state space or even diverge, posing dangers to safety-critical systems [4]. To address this problem, a decent initialization and gradient signal are required to lead the sampling distribution back to the low-cost region safely; on the other hand, the accuracy of the predictive model under uncertain disturbances should be improved.

To guide the sampling distribution safely in case of external disturbances while keeping the control proactive, a sampling-based Tube-MPC method is augmented in [4] with an iterative linear quadratic Gaussian controller for sampling distribution guidance and disturbance rejection.
Model predictive path integral control [6] is integrated with L

The main contributions of this paper are summarized as follows:

• A novel learning-based CEM-based MPC (CEMPC) framework is proposed for system optimization with a gradient-free objective function. The CEM is utilized as the optimizer to solve the gradient-free MPC based on a prior predictive model and incremental GPs (IGPs) for disturbance estimation.

• A minimal intervention (MI) auxiliary controller based on CBF and CLF is devised to intervene in the sampling-based MPC, endowing the system with safety and guiding the sampling distribution to low-cost regions when necessary.

• The proposed control methodology is validated in simulation on a quadrotor tracking an aggressive trajectory and simultaneously avoiding detected obstacles under uncertain time-varying wind disturbances.

The rest of this paper is organized as follows. The problem statement is presented in Section 2. The IGPs, the learning-based CEMPC, and the MI auxiliary controller are described in Section 3. Numerical simulation of the proposed algorithm on a quadrotor system is shown in Section 4. Finally, a conclusion is drawn in Section 5.
2. Problem Statement
Consider a nonlinear control affine system with dynamics

ẋ = f(x) + G(x)u,   (1)

where x ∈ X ⊂ R^n denotes the system state and u ∈ U ⊂ R^m is the control input. Assume that the function f : X → R^n is partially unknown but has a bounded reproducing kernel Hilbert space (RKHS) norm under a known kernel, and that the function G : X → R^{n×m} is known and Lipschitz continuous. The partially unknown f(x) consists of a known nominal model \hat{f}(x) and the uncertain disturbances d(x):

f(x) = \hat{f}(x) + d(x).   (2)

If there is no disturbance, the nominal model matches the actual one and d(x) = 0. However, it is difficult to obtain an accurate model in advance for practical nonlinear systems under uncertain disturbances, e.g. for a quadrotor under uncertain wind disturbances.

The goal is to optimize the system (1) to accomplish specified complex tasks safely under uncertain environmental disturbances. The specified tasks can be described with a cost function of the form

L(x, u) = ι(x)^T Q ι(x) + Σ_{i=1}^{N} w_i I_{C_i}(x) + u^T R_u u,   (3)

where ι(x) extracts features from the state, and Q and R_u are a positive semi-definite and a positive definite weight matrix, respectively. N is the number of simple encodings of task descriptions, w_i is the corresponding weight coefficient, and I_{C_i} is an indicator function for the set C_i, which is 1 if x ∈ C_i and 0 otherwise.

The first term of the cost function (3) encodes the main task for the system, e.g. tracking a trajectory. The second term encodes specific tasks, which could be sparse, gradient-free and hard to rewrite into a typically differentiable form like the first term, e.g. avoiding unexpectedly detected obstacles for a quadrotor with a limited sensing range. The last term regularizes the control inputs.
With this form of the cost function, the tasks can be conveniently encoded with several interpretable terms, weighted differently according to the importance of the task requirements.

Remark 1.
Note that though the gradient-free terms are technically soft constraints, they act like hard constraints immediately once the state falls into the specified sets. However, unlike hard constraints, they have the advantage that the importance of different requirements can be delineated by setting different weight coefficients w_i.
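As a concrete illustration, a cost of the form (3) can be composed in a few lines. The goal state, obstacle positions, weight w, and threshold r_safe below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def gradient_free_cost(x, u, x_goal, obstacles, Q, R_u, w=50.0, r_safe=0.5):
    """Sketch of Eq. (3): quadratic tracking + indicator penalties + input cost."""
    iota = x - x_goal                                  # feature: tracking error
    cost = float(iota @ Q @ iota + u @ R_u @ u)        # differentiable terms
    for c in obstacles:                                # indicator terms I_{C_i}(x)
        if np.linalg.norm(x[:2] - np.asarray(c)) < r_safe:
            cost += w                                  # sparse, gradient-free penalty
    return cost
```

The indicator terms make the cost discontinuous, which is exactly why a sampling-based optimizer is used later instead of a gradient-based one.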
3. Methodology
In this section, the proposed control scheme for the nonlinear system (1) is described, as shown in Fig. 1. With a predefined objective function consisting of simple and gradient-free encodings of task descriptions, the CEMPC optimizes the nonlinear system with a predictive model. This predictive model is composed of a nominal model from prior knowledge and a discrepancy model learned via GPs. To guide the sampling distribution back toward the low-cost regions in case of disturbances and to preserve system safety, the control inputs computed by the CEMPC are modified by a designed MI auxiliary controller in an efficient QP framework.

The GPs for learning the uncertain disturbances (2) and the incremental implementation that reduces the computational complexity are first introduced in Section 3.1. With the prior and learned model, the CEMPC is then presented in Section 3.2, and the MI auxiliary controller is described in Section 3.3.
3.1. Incremental Gaussian Processes (IGPs)

A GP is an efficient nonparametric regression method to estimate complex functions and their uncertainty [28]. It assumes that the function values associated with different inputs are random variables, and that any finite number of them have a joint Gaussian distribution. The uncertain model errors d(x) resulting from the uncertain disturbances in (2) can be learned using GPs with data collected from the system during operation. Similar to [26], we train n separate GPs to model the disturbances d(x) with n-dimensional output, based on the following assumption.

Assumption 1.
The unknown disturbances d(x) in (2) are uncorrelated.

The approximation of the disturbances d(x) is denoted by \tilde{d}. To make the problem tractable, similar to Assumption 1 in [29], the following assumption is considered.

Assumption 2.
The unknown disturbances d(x) have a bounded norm in the associated Reproducing Kernel Hilbert Space (RKHS) [30], corresponding to a continuously differentiable kernel k.

This assumption can be interpreted as a requirement on the smoothness of the disturbances d(x). Its boundedness implies that d(x) is regular with respect to the kernel k [31]. It is common practice to use the squared-exponential kernel k(x, x') = σ_f^2 exp(−(x − x')^T L^{−2} (x − x')) to estimate the similarity between states x and x', which is characterized by the hyperparameters of the length-scale diagonal matrix L and the prior variance σ_f^2. Besides, we assume that a training set D is available to the GPs for regression.

Assumption 3. The state x and the function value d(x) can be measured with noise over a finite time horizon to make up a training set with n data pairs D = {(x^{(i)}, y^{(i)})}_{i=1}^{n},

y^{(i)} = d(x^{(i)}) + υ_i,   (4)

where υ_i are i.i.d. noises, υ_i ∼ N(0, σ_noise^2 I_n), σ_noise ∈ R.

Given the dataset D, the mean and variance of \tilde{d}(x_*) at the query state x_* are given by [28]

μ(x_*) = k_n^T (K_σ + σ_noise^2 I)^{−1} y_n,   (5)
σ^2(x_*) = k(x_*, x_*) − k_n^T (K_σ + σ_noise^2 I)^{−1} k_n,   (6)

respectively, where y_n = [y(x_1), y(x_2), ..., y(x_n)]^T is the observation vector, K_σ ∈ R^{n×n} is the covariance matrix with entries [K_σ]_{(i,j)} = k(x_i, x_j), i, j ∈ {1, ..., n}, k_n = [k(x_1, x_*), k(x_2, x_*), ..., k(x_n, x_*)]^T is the vector of kernel values, and I_n ∈ R^{n×n} is the identity matrix.

A high-probability confidence interval D(x) on \tilde{d}(x) can then be obtained [29]:

D(x) = {\tilde{d} | μ(x) − c_δ σ(x) ≤ \tilde{d} ≤ μ(x) + c_δ σ(x)},   (7)

where c_δ is a parameter designed to get a confidence interval of (1 − δ), δ ∈ (0, 1). For example, 95% and 99.7% confidence of the uncertainty bound can be achieved at c_δ = 2 and c_δ = 3, respectively.

The computational complexity of GPs is O(n^3) due to the inversion of the matrix K = K_σ + σ_noise^2 I, which brings non-negligible challenges for practical applications. An incremental implementation of GPs is devised to reduce the time of inference and learning by fixing the size of the dataset and recursively computing the matrix inverse. Specifically, the IGP considers the following two stages.
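As a reference point for the incremental variant below, the batch posterior (5)-(6) can be sketched for a 1-D toy problem. The kernel hyperparameters, noise level, and test function are illustrative, and the kernel here includes the common 1/2 factor in the exponent:

```python
import numpy as np

def se_kernel(x1, x2, sf=1.0, ell=1.0):
    """Squared-exponential kernel on 1-D inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, x_star, sn=0.1):
    """Posterior mean and variance, Eqs. (5)-(6)."""
    K = se_kernel(X, X) + sn**2 * np.eye(len(X))   # K_sigma + sigma_noise^2 I
    k_n = se_kernel(X, np.atleast_1d(x_star)).ravel()
    K_inv = np.linalg.inv(K)                       # O(n^3): the cost the IGP avoids
    mu = k_n @ K_inv @ y                           # Eq. (5)
    var = se_kernel(np.atleast_1d(x_star),
                    np.atleast_1d(x_star))[0, 0] - k_n @ K_inv @ k_n  # Eq. (6)
    return mu, var
```

The explicit n × n inversion is what makes each update O(n^3) and motivates the incremental scheme.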
1) Adding New Data:
Given a predefined size n of the dataset, a new data point is added into the dataset at each time step until the size of the dataset reaches the predefined size. Assume there are already i data points in the dataset, 0 < i < n. Denote the computed kernel matrix as K_old ∈ R^{i×i} and the corresponding matrix inverse as K_old^{−1}, based on the old dataset. When a new data point (x_{i+1}, y_{i+1}) is added, the new kernel matrix K_new ∈ R^{(i+1)×(i+1)} can be computed based on K_old:

K_new = [ K_old       k_{i+1}
          k_{i+1}^T   k(x_{i+1}, x_{i+1}) + σ_noise^2 ],   (8)

where k_{i+1} = [k(x_1, x_{i+1}), ..., k(x_i, x_{i+1})]^T. The incremental update of the matrix inverse K_new^{−1} can be computed by

K_new^{−1} = [ K_old^{−1} + ξ Γ Γ^T   −ξ Γ
               −ξ Γ^T                 ξ ],   (9)

where Γ = K_old^{−1} k_{i+1} and ξ = (k(x_{i+1}, x_{i+1}) + σ_noise^2 − k_{i+1}^T Γ)^{−1} ∈ R^{1×1}.
Remark 2. The inversion operation only needs to be performed on the scalar ξ rather than on the entire matrix K_new ∈ R^{(i+1)×(i+1)}. The matrix multiplications in this method take O(i^2) complexity. Thus, this method largely reduces the computational burden by saving and reusing the computed matrix inverse, especially when the predefined size n of the dataset is large.

2) Replacing Data: If the size of the dataset has reached the predefined size n, a new data point is added to the dataset at each time step while the oldest one is deleted. Denote the old kernel matrix in the form of a block matrix as

K_old = [ k_1   k^T
          k     Υ ],   (10)

where k_1 and k are the variance value and covariance vector of the oldest data point in the dataset, and Υ ∈ R^{(n−1)×(n−1)} is the sub-matrix at the bottom right corner. The inverse of K_old can be written as

K_old^{−1} = [ ρ   q^T
               q   Ξ ],   (11)

where ρ, q and Ξ are the corresponding sub-matrices, Ξ ∈ R^{(n−1)×(n−1)}. With the newly obtained data point, the oldest data point is removed from the dataset, and the variance k(x_{i+1}, x_{i+1}) and covariance vector k_{i+1} are computed. The new kernel matrix can be obtained based on the old kernel matrix K_old:

K_new = [ Υ           k_{i+1}
          k_{i+1}^T   k(x_{i+1}, x_{i+1}) + σ_noise^2 ],   (12)

and its inverse can be computed by

K_new^{−1} = [ Λ + (Λ k_{i+1})(Λ k_{i+1})^T l   −(Λ k_{i+1}) l
               −(Λ k_{i+1})^T l                 l ],   (13)

where Λ = Ξ − q q^T ρ^{−1} and l = (k(x_{i+1}, x_{i+1}) + σ_noise^2 − k_{i+1}^T Λ k_{i+1})^{−1}.
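The adding stage (8)-(9) can be sketched and checked against a direct inverse. The kernel, data points, and noise level below are illustrative assumptions:

```python
import numpy as np

def add_point_inverse(K_inv_old, k_vec, k_star, sn2):
    """Incremental inverse of Eqs. (8)-(9): grow K by one row/column.

    K_inv_old : inverse of the old i x i kernel matrix (noise included)
    k_vec     : covariances k(X_old, x_new), shape (i,)
    k_star    : prior variance k(x_new, x_new) (scalar)
    sn2       : noise variance added on the new diagonal entry
    """
    gamma = K_inv_old @ k_vec                       # Γ = K_old^{-1} k_{i+1}
    xi = 1.0 / (k_star + sn2 - k_vec @ gamma)       # scalar ξ: the only inversion
    top_left = K_inv_old + xi * np.outer(gamma, gamma)
    return np.block([[top_left, -xi * gamma[:, None]],
                     [-xi * gamma[None, :], np.array([[xi]])]])
```

Only the scalar ξ is inverted; everything else is O(i^2) matrix-vector work, as noted in Remark 2.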
Remark 3. Note that there are only four matrix multiplications in this method. Taking the few matrix additions and transpositions into consideration, it takes O(n^2) computational complexity.

3.2. Learning-based Model Predictive Control with Cross-Entropy Method (CEMPC)

With the disturbances \tilde{d} learned via the IGPs, the predictive model for the MPC can be obtained based on the prior model. Considering the gradient-free objective function, a sampling-based MPC scheme is designed with CEM to provide the nominal controls for the nonlinear system (1). The MPC is based on the following open-loop finite-horizon optimal control problem, given the measured state x(t_k) at each sampling time t_k = k · dt with a control period dt, k ∈ N^+:

u^* = arg min_{\bar{u}(t) ∈ PC([t_k, t_k+T], R^m)}  Φ(\bar{x}(t_k + T)) + ∫_{t_k}^{t_k+T} L(\bar{x}(τ), \bar{u}(τ)) dτ,   (14)
s.t.  \dot{\bar{x}}(t) = \hat{f}(\bar{x}(t)) + g(\bar{x}(t)) \bar{u}(t) + \tilde{d}(\bar{x}(t)),   (15)
      \bar{x}(t_k) = x(t_k),   (16)

where PC([t_k, t_k+T], R^m) represents the set of all piece-wise continuous functions φ : [t_k, t_k+T] → R^m, \bar{x} denotes the state predicted by the system model (15) given candidate controls \bar{u}(t) over the prediction horizon T > 0, Φ is a terminal cost function, L is the gradient-free running cost (3), and \tilde{d} is the approximation to the actual environmental uncertainties d(x). Equation (16) is the state initialization of the finite-horizon optimal control problem.

Optimization and execution take place alternately. With the current state x(t_k) fed back at each sampling time t_k, the problem (14)-(16) is solved to obtain the optimal open-loop control inputs u^*(t), ∀t ∈ [t_k, t_k+T], and only the first-step input u_MPC(t) = u^*(t), ∀t ∈ [t_k, t_k+dt], is applied to the nonlinear system (1). The overall process is repeated at the next sampling time t_{k+1}.

It is hard to obtain a closed-form solution or to apply a gradient-based optimizer to the optimization problem (14)-(16), since the dynamics (15) are nonlinear and stochastic, and the objective function (14) is gradient-free. The CEM, widely applied as a general optimization framework to solve complex optimization problems [32][8][9], is adopted as an adaptive sampling-based method to solve the MPC. It treats the optimization problem as an estimation problem of the probability of a rare event, with the distribution parameters to be estimated. The control space is sampled repeatedly, and the sampling distributions of the controls are optimized via the importance sampling technique.

To ease understanding, the optimization process with CEM for (14)-(16) is described in discrete time. Denote the time variable (t + k · dt) by the time-step subscript k. At the i-th CEM iteration, M random control sequences with H time steps are sampled:

{(u_0^{(i)}, ..., u_{H−1}^{(i)})_j}_{j=0}^{M−1} ∼ N(O_{0:H−1}^{(i)}, Σ_{0:H−1}^{(i)}),   (17)

where N(O_{0:H−1}^{(i)}, Σ_{0:H−1}^{(i)}) denotes H separate multivariate Gaussian distributions from which the control input sequences are sampled. The mean and covariance matrix of the distributions are initially set to
O_{0:H−1}^{(0)} = {0}^H and Σ_{0:H−1}^{(0)} = {Σ_init}^H, respectively, where Σ_init = (u_max − u_min)^2.

Algorithm 1 Learning-based CEMPC
Require: N: number of CEM iterations; M: number of samples per iteration; H: predictive time steps of the MPC; K: size of the elite set; Σ_min: minimum variance bound for optimization; β: update rate; N(O_{0:H−1}^{(0)}, Σ_{0:H−1}^{(0)}): initial sampling distribution, where O_{0:H−1}^{(0)} and Σ_{0:H−1}^{(0)} denote the mean and covariance matrix of the H separate initial multivariate Gaussian distributions, respectively.
Ensure: μ^*: optimized mean value of the control input sampling distribution.
  while the task is not completed do
    Measure the current state x_k; i = 0.
    while i < N and max(Σ_{H−1}) > Σ_min do
      Sample {(u_0^{(i)}, ..., u_{H−1}^{(i)})_j}_{j=0}^{M−1} ∼ N(O_{0:H−1}^{(i)}, Σ_{0:H−1}^{(i)}).
      Score the samples according to (18), (15) and (16).
      Sort them in ascending order, J_0^{(i)} ≤ ... ≤ J_{M−1}^{(i)}.
      Choose the K elite sequences with J_0^{(i)} ≤ ... ≤ J_{K−1}^{(i)}.
      Update the sampling distribution using the elite set:
        O_{0:H−1}^{(i+1)} ← Mean({(u_0^{(i)}, ..., u_{H−1}^{(i)})_j}_{j=0}^{K−1}),
        Σ_{0:H−1}^{(i+1)} ← Var({(u_0^{(i)}, ..., u_{H−1}^{(i)})_j}_{j=0}^{K−1}).
      (O_{0:H−1}^{(i+1)}, Σ_{0:H−1}^{(i+1)}) = (1 − β)(O_{0:H−1}^{(i+1)}, Σ_{0:H−1}^{(i+1)}) + β(O_{0:H−1}^{(i)}, Σ_{0:H−1}^{(i)}).
      i ← i + 1
    end while
    u_MPC ← first-step control input (u_0^{(N−1)}) of the lowest-cost sequence.
    u_k ← MIControlScheme(x_k, u_MPC) in Algorithm 2.
    x_{k+1} ← apply u_k to the system (1).
    Collect the data point {x_k, u_k, x_{k+1}} and update the GPs in (5) and (6).
    Reinitialize the sampling distribution: O_{0:H−1}^{(0)} ← O_{1:H}^{(N−1)}, Σ_{0:H−1}^{(0)} ← Σ_{1:H}^{(N−1)}.
  end while

With these control sequences, the accumulated cost J_j^{(i)} of the j-th control sequence is evaluated based on the cost function L (3):

J_j^{(i)} = Φ(x_H) + Σ_{k=0}^{H−1} L(x_k, u_k),  ∀j = 0, ..., M−1.   (18)

With the estimated costs {J_j^{(i)}}_{j=0}^{M−1}, an elite set of the K control sequences with the K lowest costs is chosen out of the M sequences (17). The elite set is used to update the sampling distribution N(O_{0:H−1}^{(i+1)}, Σ_{0:H−1}^{(i+1)}) for the next iteration i + 1:

O_{0:H−1}^{(i+1)} ← (1 − β) Mean({(u_0^{(i)}, ..., u_{H−1}^{(i)})_j}_{j=0}^{K−1}) + β O_{0:H−1}^{(i)},   (19)
Σ_{0:H−1}^{(i+1)} ← (1 − β) Var({(u_0^{(i)}, ..., u_{H−1}^{(i)})_j}_{j=0}^{K−1}) + β Σ_{0:H−1}^{(i)},   (20)

where β is a smoothness coefficient that adjusts the update of the distribution parameters.

The sampling distribution is updated toward the regions with lower cost, as the above process iterates N times or until the maximum variance drops below the minimum variance bound Σ_min. The remaining control sequence {(u_0^{(N)}, ..., u_{H−1}^{(N)})_j} with the lowest cost is returned, and the first-step control input u_MPC = {(u_0^{(N)})_j} is applied as the output of the CEMPC. Modified by the auxiliary controller when necessary, the control input is applied to the system. The data of the system evolution are then collected in the dataset to update the GPs. Algorithm 1 details the proposed learning-based CEMPC.
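Stripped of the MPC rollout and the MI step, the core CEM loop of Algorithm 1 can be sketched as below. This is a toy single-shot minimization; all hyperparameter values are illustrative:

```python
import numpy as np

def cem_minimize(cost_fn, dim, iters=30, M=200, K=20, beta=0.2, sigma0=1.0, seed=0):
    """Cross-entropy minimization: sample, keep elites, refit the Gaussian."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)
    var = np.full(dim, sigma0 ** 2)
    for _ in range(iters):
        samples = mu + np.sqrt(var) * rng.standard_normal((M, dim))
        costs = np.array([cost_fn(s) for s in samples])
        elites = samples[np.argsort(costs)[:K]]            # K lowest-cost samples
        mu = (1 - beta) * elites.mean(axis=0) + beta * mu  # smoothed update, Eq. (19)
        var = (1 - beta) * elites.var(axis=0) + beta * var # smoothed update, Eq. (20)
    return mu
```

The variance shrinks toward a delta distribution at a local optimum, which is why Algorithm 1 also stops once max(Σ) falls below Σ_min.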
Remark 4.
Note that the prediction of the mean in (5) takes O(n^2 H + M n H) computational complexity when predicting M samples over H time steps in (17). The calculation of the cost function in (18) is applied to M samples and H time steps per sample, which takes O(M H) complexity.

3.3. Minimal Intervention (MI) Auxiliary Controller

In the learning-based CEMPC, task requirements and state constraints can be encoded in the cost function with different penalty weights. Though the design process is simplified in this way, system safety regarding state constraints is not guaranteed. Besides, the sampling distribution should be guided to the low-cost regions in the case of unexpected disturbances. To solve this problem, an auxiliary controller is designed in this part to minimally intervene in the CEMPC.
3.3.1. Safety Constraints

Safety under environmental uncertainties should be considered carefully before the optimized control inputs are applied to the system. It can be enforced with safety constraints via a CBF, which can quantify the system safety regarding state constraints [33]. The safety set S of the system (1) can be defined by

S := {x ∈ X | h(x) ≥ 0},   (21)

where h : R^n → R is a continuously differentiable function related to the state constraints.

Definition 1.
The set S is called forward invariant if, for every x(0) ∈ S, x(t) ∈ S for all t ∈ R_0^+.

To ensure the forward invariance of S, e.g. that the quadrotor stays in the collision-free safety set at all times, we consider the following definition.

Definition 2. (Definition 5 of [34]) For the dynamical system (1), given a set
S ⊂ R^n defined by (21) for a continuously differentiable function h : R^n → R, the function h is called a Zeroing Control Barrier Function (ZCBF) defined on the set E with S ⊆ E ⊂ R^n, if there exists an extended class K function κ (κ(0) = 0 and strictly increasing) such that

sup_{u∈U} [L_f h(x) + L_g h(x) u + κ(h(x))] ≥ 0,  ∀x ∈ E,   (22)

where L represents the Lie derivative. To be more specific,

L_f h(x) = (∂h(x)/∂x) f(x),  L_g h(x) = (∂h(x)/∂x) g(x).   (23)

A ZCBF is a special control barrier function that comes with asymptotic stability [35]. The existence of a ZCBF implies the asymptotic stability and forward invariance of S, as proved in [35].

Based on the ZCBF defined in Definition 2, a safety barrier for safety-critical systems can be constructed. Concretely, we aim to design a safety barrier for the uncertain system (1) to keep the state x in the forward invariant safety set S, which requires \dot{h}(x) ≥ −κ(h(x)) to hold. With the learned disturbances \tilde{d}(x) via IGPs and the high-confidence interval D (7) for the uncertain dynamical system (1), the following safe control space K_rzbf is formulated, as shown in our previous work [17]:

K_rzbf(x) = {u ∈ U | inf_{d∈D(x)} [\dot{h}(x) + κ(h(x))] ≥ 0},   (24)

where h(x) is a ZCBF and \dot{h}(x) = (∂h(x)/∂x) ẋ = L_{\hat{f}} h(x) + L_g h(x) u + L_{\tilde{d}} h(x), where L_{\hat{f}} h(x) and L_{\tilde{d}} h(x) denote the Lie derivatives of h with respect to the known nominal model \hat{f} of f and the learned disturbances \tilde{d}(x), respectively.

Lemma 1. Given a set
S ⊂ R^n defined by (21) with an associated ZCBF h(x), the control input u ∈ K_rzbf has a probability of at least (1 − δ), δ ∈ (0, 1), to guarantee the forward invariance of the set S for the uncertain dynamical system (1).

Proof. From (7), there is a probability of at least (1 − δ) that the bounded model uncertainty satisfies d(x) ∈ D(x) for all x ∈ X. Since the control input u ∈ U of the safe control space K_rzbf satisfies the constraint in (24), the following result holds with a probability of at least (1 − δ):

\dot{h}(x) + κ(h(x)) ≥ 0,  ∀x ∈ S.   (25)

As a result, the control input u ∈ U of the safe control space K_rzbf has a probability of at least (1 − δ) to guarantee the forward invariance of the set S for the uncertain dynamical system (1), as proven in [36].

For convenience, with the estimated high-confidence interval D (7) via GPs, the constraint in (24) can be equivalently expressed as

L_{\hat{f}} h(x) + L_g h(x) u + L_μ h(x) − c_δ |L_σ h(x)| ≥ −κ(h(x)),   (26)

where L_μ h(x) and L_σ h(x) denote the Lie derivatives of h(x) with respect to μ and σ, respectively.

Remark 5.
Note that as more informative data are collected, the uncertainty bound σ^2 (6) will gradually decrease. Thus, the probability of rendering S forward invariant is much higher than (1 − δ) in most cases.

3.3.2. Sampling Guidance Scheme

In this section, we develop a sampling guidance scheme for the CEMPC under uncertain disturbances based on the stability property. The sampling distribution can easily deviate from the low-cost regions in the case of external disturbances, and an optimization based on a deviated distribution may get stuck in local minima. Lyapunov methods have proven to be an efficient way to improve sampling performance for nonlinear systems in model-based RL [29, 37]. To guide the sampling distribution of the CEMPC to the low-cost region in the case of uncertain disturbances, a stability constraint with regard to the main task, corresponding to the first term in the cost function (3), can be constructed based on the CLF V(x): \dot{V}(x) ≤ −αV(x), α > 0. With the disturbances \tilde{d} estimated by the IGPs, the CLF can be utilized to construct a stability control set K_rclf [17]:

K_rclf(x) = {u ∈ U | sup_{d∈D(x)} [\dot{V}(x) + αV(x)] ≤ 0},   (27)

where \dot{V}(x) = (∂V(x)/∂x) ẋ = L_{\hat{f}} V(x) + L_g V(x) u + L_{\tilde{d}} V(x) and α > 0. With the estimated high-confidence interval D (7) via GPs, the constraint in (27) can be simplified as

L_{\hat{f}} V(x) + L_g V(x) u + L_μ V(x) + c_δ |L_σ V(x)| ≤ −αV(x),   (28)

where L_μ V(x) and L_σ V(x) denote the Lie derivatives of V(x) with respect to μ and σ, respectively.

With the above two conditions, a QP (29)-(32) can be constructed to modify the control inputs u_MPC of the learning-based CEMPC in a way of minimal intervention. By solving the following QP, the nonlinear system under environmental uncertainties can be guaranteed safe and guided to a low-cost region of the main task with high probability:

u^*(x) = arg min_{(u, ε, η) ∈ R^{m+2}}  ||u − u_MPC||^2 + λ_ε ε^2 + λ_η η^2   (29)
s.t.  A_cbf u + b_cbf ≤ ε,   (30)
      A_clf u + b_clf ≤ η,   (31)
      u_min ≤ u ≤ u_max,   (32)

where u_min, u_max ∈ U are the lower and upper bounds of the control inputs, respectively, and λ_ε, λ_η ∈ R^+ are penalty coefficients of the slack variables ε ∈ R and η ∈ R, respectively. Here A_cbf = −L_g h(x), b_cbf = −L_{\hat{f}} h(x) − L_μ h(x) + c_δ |L_σ h(x)| − κ(h(x)), A_clf = L_g V(x), and b_clf = L_{\hat{f}} V(x) + L_μ V(x) + c_δ |L_σ V(x)| + αV(x).

The feasibility of the QP (29)-(32) is ensured by the slack variables, while violations of the safety constraints can be heavily penalized as long as the corresponding coefficients are large enough. The designed MI auxiliary control scheme is shown in Algorithm 2. We can always directly apply the control output u_MPC of the learning-based CEMPC (Algorithm 1, line 10) if u_MPC satisfies the constraints (26) and (28). Otherwise, it is modified by solving the QP (29)-(32).
Algorithm 2
MI Auxiliary Control Scheme
Require: x_k: current state; u_MPC: control inputs from the learning-based CEMPC.
  u_k ← u_MPC
  if u_MPC does not satisfy (26) and (28) then
    u_k ← solve the QP (29)-(32) with x_k and u_MPC
  end if

Remark 6. Note that the optimization (29) is not sensitive to the parameters λ_ε and λ_η. The violation of the safety (26) and stability (28) constraints can be heavily penalized as long as λ_ε and λ_η are large enough (e.g. λ_ε = 10, λ_η = 10). Besides, λ_ε can be set much larger than λ_η to make the safety constraint stricter.
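To make the MI step concrete, the sketch below checks a robust ZCBF margin of the form (26) for a scalar toy system and, when it is violated, applies the closest-point correction that the full QP (29)-(32) generalizes. The system (ẋ = u + d), the barrier h(x) = 1 − x^2, κ(h) = αh, and the GP moments are illustrative assumptions; a real implementation would solve the slacked QP with the CLF constraint and input bounds as well:

```python
def zcbf_margin(x, u, mu_d, sigma_d, c_delta=3.0, alpha=1.0):
    """Left-hand side of the robust safety condition (26) for the toy system
    x_dot = u + d(x), barrier h(x) = 1 - x^2, kappa(h) = alpha * h.
    Nonnegative margin means u lies in the safe control space K_rzbf."""
    h = 1.0 - x ** 2
    grad = -2.0 * x                           # dh/dx
    lie = grad * (mu_d + u)                   # L_fh + L_mu h + L_g h u (f_hat = 0, g = 1)
    robust = -c_delta * abs(grad * sigma_d)   # worst case over the interval D(x)
    return lie + robust + alpha * h

def minimally_intervene(u_mpc, x, mu_d, sigma_d):
    """If u_mpc violates (26), shift it the minimum amount needed (closed form
    for a single constraint affine in u; the paper's QP handles the rest)."""
    m = zcbf_margin(x, u_mpc, mu_d, sigma_d)
    if m >= 0.0:
        return u_mpc                          # minimal intervention: no change
    grad = -2.0 * x                           # coefficient of u in the margin
    return u_mpc - m / grad                   # move u until the margin is zero
```

When the CEMPC output is already safe, it passes through untouched, which is exactly the "minimal intervention" behavior of Algorithm 2.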
4. Simulation Studies
In this section, the proposed control architecture is verified on the task of simultaneous trajectory tracking and obstacle avoidance with a quadrotor. For a safety-critical quadrotor with a limited sensing range, avoiding an uncertain number of detected obstacles around or on the trajectory can be conveniently encoded as gradient-free running-cost terms in the objective function. It is difficult to describe such a complex task and to trade off tracking and safety well with a simplified differentiable objective function. Besides, a quadrotor is prone to uncertain wind disturbances, which are hard to model accurately. Uncertain wind disturbances not only pose a critical challenge to accurate tracking, but may also cause the quadrotor to collide in a cluttered obstacle field.
4.1. Quadrotor Model

The quadrotor is a well-modeled dynamical system with torques and forces generated by four rotors and gravity. The Euler angles (roll φ, pitch θ, and yaw ψ) are defined with the ZYX convention. The attitude rotation matrix R ∈ SO(3) from the body frame B to the global frame W can be written as [39]

R = [ cθcψ   sφsθcψ − cφsψ   cφsθcψ + sφsψ
      cθsψ   sφsθsψ + cφcψ   cφsθsψ − sφcψ
      −sθ    sφcθ            cφcθ ],   (33)

where s and c denote sin and cos, respectively. The nonlinear quadrotor system can be modeled as follows [40]:

\dot{p} = v,   (34)
m\dot{v} = mg e_3 + R f_u + d_w,   (35)
\dot{R} = R S(ω),   (36)

where m and g denote the mass and the gravitational acceleration, respectively, e_3 = [0, 0, 1]^T is the unit vector, and p = [p_x, p_y, p_z]^T and v = [v_x, v_y, v_z]^T denote the translational position and velocity in W, respectively. f_u = [0, 0, f_T]^T with f_T the total thrust generated by the four rotors in B, and S(·) is the skew-symmetric mapping.

The uncertain wind disturbance acting on the quadrotor dynamics is represented as d_w = K_drag(v_w − v), where v_w ∈ R^3 is the velocity of the wind disturbance in W and K_drag ∈ R^{3×3} is the diagonal drag coefficient matrix. We define in the dynamics equation (1) the state x = [p_x, p_y, p_z, v_x, v_y, v_z, φ, θ, ψ]^T and the control input u = [f_T, ω^T]^T, where ω = [ω_x, ω_y, ω_z]^T is the body rotational rate. It is assumed that the body rotational rates are directly controllable through the fast-response onboard controller of commercial quadrotors.

For the task of simultaneous trajectory tracking and obstacle avoidance, the cost function L in the objective function (18) can be defined as

L(x) = ι(x − x_d)^T Q ι(x − x_d) + Σ_{i}^{N} w_i I_{C_i}(r_i),   (37)
C_i = {x | ||r_i|| < 0.},   (38)

where ι = Diag(1, 1, 1, 1, 1, 1, 0, 0, 0) extracts the position and velocity states from the state x, the weight coefficient matrix Q = Diag(8., ., ., ., ., .), N denotes the number of detected obstacles, and the weight coefficient w_i is a positive constant, ∀i = 1, ..., N. x_d = [p_d^T, \dot{p}_d^T, φ_d, θ_d, ψ_d]^T is the desired state, where φ_d, θ_d, ψ_d ∈ R are the desired attitudes, r_i is the shortest Euclidean distance from the quadrotor to the i-th detected obstacle, and I_{C_i} is an indicator function that is turned on if the state is in the set C_i.

Remark 7.
Note that the second term in (37) is designed to show the predictivity inherent in the CEMPCfor obstacle avoidance. The trade-off between the safety and tracking performance can be adjusted by theweight w i in (37). Remark 8.
Note that designing a differentiable cost function for this task would be nontrivial! Besides,composing a cost function with different nonlinear cost terms may lead to local minima, which brings dif-ficulties to the optimization. In this case, specifying a gradient-free cost function with several interpretableterms and different weights can be convenient according to the importance of task requirements.4.2. Simulation setup
A simulation platform is created using Python 3.6 on an Intel Xeon X5675 CPU with a 3.07 GHz clock frequency. A model of the Blade mQX quadrotor is used, with the parameters set as in [41]. The mass is m = 0. kg. The arm length from the center of mass to each motor is 0. m. The maximum total thrust is f_Tmax = 1. N. The maximum and minimum body rotational rates are ω_max = [3., ., .]^T rad/s and ω_min = [−., −., −.]^T rad/s, respectively. The obstacle detection range is omnidirectional at 2 m. The simulation time is set to 20 s with a control frequency of 50 Hz.

A wind model from [42] is utilized to evaluate the algorithm performance. The wind velocity consists of a constant component v_c and a turbulent component v_t, i.e., v_w = v_c + v_t. The turbulent wind uses the von Kármán velocity model defined analytically in the specification MIL-F-8785C [43], with the specific low-altitude model for the model parameters. The drag coefficient diagonal matrix is K_drag = Diag(0., ., .). Four magnitudes of the constant wind component are used to validate the trajectory tracking performance, as shown in Table 1.

Three GPs are built to estimate the effects of the unknown wind disturbances d_w. Each GP uses the same squared-exponential kernel with parameters L = 1 and σ_f = 1. The size of the IGP dataset is fixed at n = 20. We set c_δ = 3 in (7) to generate high-confidence intervals of the estimates.

The quadrotor, with a limited sensing range, is required to track a reference trajectory while avoiding obstacles under varying wind disturbances in two scenarios. The initial state of the quadrotor is set to the initial position of the reference trajectory with zero velocity and attitude.

Scenario 1 is designed to validate the effectiveness of the proposed learning-based CEMPC method with the MI controller. The reference trajectory is given as the spiral curve p_d(t) = [2 sin(0.t), − cos(0.t), .t]^T and ψ_d(t) = 0.
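The squared-exponential GP regression behind the disturbance estimates can be sketched in a few lines. The kernel parameters L = 1 and σ_f = 1 follow the setup above; the one-dimensional inputs, the noise level `sigma_n`, and the helper names are illustrative assumptions.

```python
import numpy as np

def se_kernel(A, B, length=1.0, sigma_f=1.0):
    """Squared-exponential kernel with L = 1 and sigma_f = 1 as above.
    A is (n, d), B is (m, d); returns the (n, m) Gram matrix."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-0.5 * d2 / length ** 2)

def gp_predict(X, y, X_star, sigma_n=0.1):
    """Posterior mean and variance of a GP disturbance model at X_star;
    sigma_n (observation noise) is an assumed illustrative value."""
    K = se_kernel(X, X) + sigma_n ** 2 * np.eye(len(X))
    K_s = se_kernel(X_star, X)
    mean = K_s @ np.linalg.solve(K, y)
    var = se_kernel(X_star, X_star).diagonal() - np.einsum(
        "ij,ji->i", K_s, np.linalg.solve(K, K_s.T))
    return mean, var
```

With c_δ = 3 as in the setup, the high-confidence interval at a query point is `mean ± 3 * np.sqrt(var)`.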
It lies in a densely cluttered obstacle field under time-varying wind disturbances, where a moving obstacle flies along the reference trajectory toward the quadrotor with a speed of 0. m/s. The weight coefficients in (37) are set to w_i = 10, ∀i = 1, ..., N.

Scenario 2 is studied to further validate the safety and tracking performance trade-off in the design of the gradient-free cost function (37), where the weight coefficient w_i is set to different orders of magnitude, ∀i = 1, ..., N. The difference from Scenario 1 is that there are only one static obstacle and one dynamic obstacle along the reference trajectory. It is devised to clearly show the proactive ability inherent in the CEMPC framework for obstacle avoidance.

The number of samples and the size of the elite set for the CEMPC per iteration are set to M = 100 and K = 10, respectively. The number of CEM iterations is N = 5. The minimum variance bound for the optimization is Σ_min = 0.001, and the update rate is β = 0.

The CBF h can be constructed from the distance between the quadrotor and the obstacles within its sensing region. This distance can be obtained from the largest ellipsoidal region of obstacle-free space, which can be computed efficiently with the IRIS algorithm [44] through semidefinite programming. Specifically, an ellipsoid can be represented as the image of the unit ball:

$$\mathcal{E}(C, \zeta) = \{\, C o + \zeta \mid \lVert o \rVert = 1 \,\}, \quad (39)$$

where $o \in \mathbb{R}^3$ with $o^T o = 1$ parameterizes the unit ball, and the mapping matrix $C \in \mathbb{R}^{3 \times 3}$ and the offset vector $\zeta \in \mathbb{R}^3$ are obtained with the IRIS algorithm. For a vector $\epsilon$ on the ellipsoid, $\epsilon \in \mathcal{E}$, we have $(\epsilon - \zeta)^T C^{-T} C^{-1} (\epsilon - \zeta) = 1$. The CBF enforcing the quadrotor to stay within the safe ellipsoidal region is then

$$h(x) = 1 - (\iota_p x - \zeta)^T C^{-T} C^{-1} (\iota_p x - \zeta), \quad (40)$$

where $\iota_p = \mathrm{Diag}(1, 1, 1, 0, 0, 0, 0, 0, 0)$ extracts the position from the state $x$.

For the main task of trajectory tracking, the CLF is designed as

$$V(x) = (\iota_p x - p_d)^T Q_p (\iota_p x - p_d) + (\iota_v x - \dot{p}_d)^T Q_v (\iota_v x - \dot{p}_d), \quad (41)$$

where $Q_p = \mathrm{Diag}(0.,\ .,\ .)$, $Q_v = \mathrm{Diag}(0.,\ .,\ 0.2)$, and $\iota_v = \mathrm{Diag}(0, 0, 0, 1, 1, 1, 0, 0, 0)$ extracts the velocity from the state $x$. The extended class-$\mathcal{K}$ function $\kappa$ is chosen as $\kappa(h(x)) = 10\,h(x)$, and the positive constant $\alpha = 0.1$. The penalty coefficients of the slack variables in the QP (29) are set to $\lambda_\varepsilon = 1\mathrm{e}{30}$ and $\lambda_\eta = 1\mathrm{e}{20}$. The QP is solved with the CVXOPT solver [45].
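Evaluating the ellipsoidal CBF (40) and the tracking CLF (41) reduces to a few linear-algebra operations. The sketch below is a direct numpy transcription; the concrete matrices passed in (identity `C`, the weight matrices) are placeholders for the IRIS output and the paper's truncated diagonal values.

```python
import numpy as np

def cbf_ellipsoid(p, C, zeta):
    """CBF of (40): h = 1 - (p - zeta)^T C^{-T} C^{-1} (p - zeta).
    h > 0 iff the position p lies strictly inside the ellipsoid E(C, zeta)."""
    u = np.linalg.solve(C, p - zeta)      # u = C^{-1}(p - zeta)
    return 1.0 - u @ u                    # 1 - ||u||^2

def clf_tracking(p, v, p_d, v_d, Q_p, Q_v):
    """CLF of (41): weighted quadratic position and velocity tracking error."""
    ep, ev = p - p_d, v - v_d
    return ep @ Q_p @ ep + ev @ Q_v @ ev
```

For the unit ball (C the identity, ζ the origin), h equals 1 at the center, 0 on the boundary, and is negative outside, which is the sign convention the safety filter relies on.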
To validate the effectiveness of the proposed control scheme, an ablation study is conducted to assess whether the proposed method is able to: 1) learn and adapt to the uncertain environmental disturbances, 2) handle a task with a gradient-free objective function, 3) achieve safe control with a low tracking error, and 4) provide a way to trade off safety and tracking performance. We compare the following four methods:
• CEMPC: a CEM-based MPC without the IGPs for learning the uncertain disturbances.
• LB-CEMPC: a learning-based CEMPC without the MI auxiliary controller.
• LB-CEMPC-CBF: the proposed learning-based CEMPC without the sampling-guidance scheme in the MI auxiliary controller.
• LB-CEMPC-MI: the proposed learning-based CEMPC with the designed MI auxiliary controller.
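All four variants share the same CEM sampling core. The sketch below uses the setup's M = 100 samples, K = 10 elites, 5 iterations, and Σ_min = 0.001 variance floor; the smoothing rate β = 0.8 and the generic `cost_fn` interface are illustrative assumptions (in the controllers, the cost is the gradient-free objective (37) rolled out over the learned dynamics).

```python
import numpy as np

def cem_optimize(cost_fn, mu0, sigma0, n_samples=100, n_elite=10,
                 n_iters=5, beta=0.8, sigma_min=0.001, seed=0):
    """Cross-entropy method over an open-loop decision vector.
    Repeatedly samples, keeps the lowest-cost elites, and moves the
    Gaussian sampling distribution toward them with smoothing rate beta."""
    rng = np.random.default_rng(seed)
    mu, sigma = mu0.copy(), sigma0.copy()
    for _ in range(n_iters):
        samples = rng.normal(mu, sigma, size=(n_samples,) + mu.shape)
        costs = np.array([cost_fn(s) for s in samples])
        elites = samples[np.argsort(costs)[:n_elite]]   # K best samples
        # Smoothed update of the sampling distribution toward the elites.
        mu = beta * elites.mean(axis=0) + (1.0 - beta) * mu
        sigma = beta * elites.std(axis=0) + (1.0 - beta) * sigma
        sigma = np.maximum(sigma, sigma_min)            # variance floor
    return mu
```

Because only `cost_fn` evaluations are needed, the indicator penalties of (37) pose no difficulty, and the MI controller can bias the initial mean `mu0` to guide the sampling.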
Figures 2(a) and 2(b) show the learning and inference times of the IGP at each iteration under Wind-4. The learning time stays below 0. s most of the time, and the inference time stays below 0. s. This shows that the IGP can serve as an online learning technique with fast learning and inference. Note that the Python code has not been optimized for speed and could be accelerated in a C++ implementation.

Figure 2: (a) The learning time, and (b) the inference time of the IGP under Wind-4.

The uncertain wind disturbances in the three axes modeled by the GPs are shown in Fig. 3. The IGPs estimate the actual turbulent wind disturbances well, and the actual disturbances lie within the uncertainty bounds of the estimates.

Figure 3: The wind disturbances estimated by the IGPs in three axes on the spiral trajectory under Wind-4.

As shown in Table 1 and Fig. 4, the CEMPC with a longer horizon achieves a smaller tracking RMS error under the mild Wind-1 or Wind-2, due to the robustness afforded by the predictivity and receding-horizon optimization inherent in MPC [2]. However, under larger wind disturbances, e.g., Wind-3 and Wind-4, the RMS error of the CEMPC instead increases as the prediction horizon grows, since the accumulation of model error along the multi-step predictions degrades the control performance. The tracking errors of the CEMPC and the LB-CEMPC with a prediction horizon of T_h = 0. s under Wind-1 and Wind-4 are shown in Fig. 5. The tracking errors of the CEMPC are similar to those of the LB-CEMPC under Wind-1, while the errors of the CEMPC become quite large without the IGPs under the stronger Wind-4. These results indicate that the LB-CEMPC benefits from the IGPs learning and compensating for the wind disturbances.

The proposed LB-CEMPC-MI controller achieves the lowest tracking errors among the four controllers under all four magnitudes of wind disturbances, as shown in Table 1.

Table 1: Statistics of RMS Errors (in meter) with Different Control Schemes and Configurations.

Prediction Horizon T_h | High-level Scheme | Wind-1 (v_c = 5 m/s) | Wind-2 (v_c = 8 m/s) | Wind-3 (v_c = 10 m/s) | Wind-4 (v_c = 12 m/s)
 | LB-CEMPC MI | 0.354 | 0.195 | 0.246 | 0.388
 | LB-CEMPC MI | 0.215 | 0.124 | 0.214 | 0.349
 | LB-CEMPC MI | 0.170 | 0.178 | 0.271 | 0.341

Figs. 6, 7 and 8 show the control performance of the proposed LB-CEMPC-MI control framework on the quadrotor under Wind-4. At about 4.3 and 7.5 seconds, the quadrotor deviates from the reference trajectory with increasing
tracking errors, as it changes its trajectory to safely avoid a collision with the grey static obstacle on the trajectory. Fig. 7(b) shows that the barrier function stays positive, so safety is ensured while avoiding the obstacle. Besides, compared with the other three baseline controllers, the LB-CEMPC-MI controller achieves a smoother obstacle-avoidance behavior, as shown in Fig. 6(a). Moreover, Figs. 6(a) and 8 illustrate that the MI auxiliary controller can effectively help the main task of trajectory tracking by guiding the sampling distribution. The quadrotor is quickly guided back to the reference when it deviates from the desired trajectory due to unexpected obstacle avoidance. These results demonstrate that the LB-CEMPC-MI control scheme can drive the system to track the reference trajectory stably when the trajectory is safe, and can relax the tracking to avoid obstacles safely when the system is about to collide with the detected obstacles.

Figure 4: The RMS errors of the quadrotor trajectory tracking using the LB-CEMPC and the CEMPC controllers with different prediction horizons under wind disturbances. The proposed LB-CEMPC outperforms the baseline CEMPC in four settings of time-varying wind disturbances.

Figure 5: Evolution of the trajectory tracking errors using the CEMPC and the LB-CEMPC under (a) the mild Wind-1 and (b) the strong Wind-4 disturbances.

Figure 6: Safe trajectory tracking in Scenario 2. (a) Tracking error with different controllers, and (b) tracking velocity with different controllers under Wind-4 disturbances.

Figure 7: Safe trajectory tracking in Scenario 2. (a) Position of the quadrotor with the proposed LB-CEMPC-MI control scheme, and (b) values of the CBF in the LB-CEMPC-CBF and the LB-CEMPC-MI under Wind-4 disturbances.
Table 2: Trajectory Tracking Control and Safety Performance with Different Weight Coefficients.

w_i | MI Scheme | Time to avoid obstacle 1 (in second) | Time to avoid obstacle 2 (in second) | Minimum Barrier Value (in meter) | Max tracking error (in meter) | RMS error (in meter)
10 | CBF | 3.283 | 11.010 | 0.994 | 1.957 | 0.632
1 | CBF | 3.737 | 11.136 | 0.427 | 1.321 | 0.311
0.1 | CBF | 3.737 | 11.364 | 0.617 | 0.772 | 0.214
0.01 | CBF | 4.289 | 11.710 | 0.485 | 0.703 | 0.167
0 | CBF | 4.293 | 11.717 | 0.481 | 0.723 | 0.190

Figure 8: Numerical validation of the trajectory tracking in Scenario 1, where a quadrotor flies through a densely cluttered obstacle field under uncertain Wind-4 disturbances. Snapshots of the simulation at t = 6 s, 8 s, 12 s, and 20 s are shown in 8(a)-8(d). The black dashed line denotes the reference trajectory. The irregular polyhedra in orange and gray denote the dynamic and static obstacles crossing the reference trajectory, respectively.
To validate the trade-off between safety and tracking performance, Scenario 2 with Wind-2 is fixed in this part. To demonstrate the effects of the predictivity inherent in the MPC and the CBF-enforced safety scheme, we keep only the CBF (40) in the MI scheme, without the CLF-based sampling-guidance scheme, and run the LB-CEMPC algorithm with different coefficients w_i in (37) for obstacle avoidance.

Statistics of the simulations are listed in Table 2, with the trajectories shown graphically in Fig. 9 and the tracking errors in Fig. 10(a). We highlight three key takeaways from these results. First, the controllers with w_i = 10 predictively avoid the unexpected gray static obstacle and the orange dynamic obstacle in advance, at 3.283 s and 11.010 s respectively, while the controllers with w_i = 0 avoid these two obstacles later, at 4.293 s and 11.717 s. This illustrates that the proposed method retains its predictive ability for obstacle avoidance even under time-varying environmental disturbances. Second, the RMS and maximum tracking errors increase with a larger positive weight w_i, while the time to obstacle avoidance decreases, as observed in Table 2. This indicates a trade-off between safety margin and tracking performance, which can be adjusted with the weight coefficients w_i in the cost function (37). Third, to give an intuitive view of the safety performance under disturbances, Fig. 10(b) shows the safety performance of the quadrotor tracking the reference trajectory. The values of the CBF stay positive, which indicates that the position of the quadrotor always stays within the safe obstacle-free ellipsoidal region.

Figure 9: Numerical validation of the trajectory tracking in Scenario 2, where a quadrotor tracks the reference spiral trajectory under uncertain Wind-2 disturbances. Snapshots of the simulation at t = 6 s, 12 s, 16 s, and 20 s are shown in 9(a)-9(d). The black dashed line denotes the reference trajectory p_d(t). The irregular polyhedra in orange and gray denote the dynamic and static obstacles crossing the reference trajectory, respectively. The quadrotor tracks the reference trajectory and strictly guarantees non-collision with the obstacles.

The predictive nature of the MPC brings proactivity to the control scheme, which improves the trajectory tracking accuracy and introduces conservatism in terms of safety, as shown in Section 4.3. This results in safer behaviors of the system, but it also degrades the control performance of the main task, e.g., the trajectory tracking accuracy. Such a trade-off can be handled by adjusting the weight coefficients w_i in the cost function (37) according to the requirements of practical applications. For example, if larger weight coefficients w_i are set to achieve better foresight for avoiding obstacles, there will be a larger tracking RMS error for a given prediction horizon T_h. In contrast, if small weight coefficients w_i are chosen to achieve a low tracking RMS error, the foresight for obstacle avoidance will be poor. It is therefore flexible for the designer to set the weight coefficients w_i according to their considerations. The designed MI scheme preserves safety and guides the optimization, which enables a customized and convenient design of the cost function for high-level tasks.

Figure 10: Simulation results in Scenario 2. The prediction horizon is T_h = 0. s. (a) The evolution of tracking errors with different weight coefficients w_i, where the dashed red line distinguishes the tracking error of the controller with w_i = 1e−1; (b) the values of the CBF.
5. Conclusion
In this paper, a safe learning-based MPC architecture is designed to optimize nonlinear systems with a gradient-free objective function under uncertain environmental disturbances. The proposed approach allows a convenient objective-function design using simple but gradient-free running-cost terms. The IGPs are utilized to estimate the model uncertainties with a low computational burden and to augment the prior predictive model in the MPC. Solved with the sampling-based CEM, the proposed CEMPC is augmented with an auxiliary controller based on a control Lyapunov function and a CBF, which guides the sampling process and theoretically endows the system with safety in a minimally invasive way. We provide numerical simulation results comparing the CEMPC, LB-CEMPC, LB-CEMPC-CBF, and LB-CEMPC-MI algorithms on quadrotor trajectory tracking and obstacle avoidance tasks under different wind disturbances in two scenarios. The results show that the proposed LB-CEMPC-MI algorithm can successfully and safely optimize the quadrotor system with a conveniently designed gradient-free objective function, achieving accurate tracking and safe obstacle avoidance under uncertain wind disturbances. In future work, the proposed learning-based MPC framework will be verified on hardware platforms under realistic conditions and extended to navigation problems with uncertain environmental disturbances.
References

[1] Bruno Siciliano and Oussama Khatib. Springer Handbook of Robotics. Springer, 2016.
[2] David Q. Mayne. Model predictive control: Recent developments and future promise. Automatica, 50(12):2967–2986, 2014.
[3] Stefan Schaal and Christopher G. Atkeson. Learning control in robotics. IEEE Robotics & Automation Magazine, 17(2):20–29, June 2010.
[4] Grady Williams, Brian Goldfain, Paul Drews, Kamil Saigol, James M. Rehg, and Evangelos A. Theodorou. Robust sampling based model predictive control with sparse objective information. In Robotics: Science and Systems, 2018.
[5] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566, 2018.
[6] Grady Williams, Andrew Aldrich, and Evangelos A. Theodorou. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 40(2):344–357, 2017.
[7] Pieter-Tjerk De Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19–67, 2005.
[8] Marin Kobilarov. Cross-entropy randomized motion planning. In Robotics: Science and Systems, volume 7, pages 153–160, 2012.
[9] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
[10] Yew Teck Tan, Abhinav Kunapareddy, and Marin Kobilarov. Gaussian process adaptive sampling using the cross-entropy method for environmental sensing and monitoring. Pages 6220–6227, 2018.
[11] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
[12] Homanga Bharadhwaj, Kevin Xie, and Florian Shkurti. Model-predictive control via cross-entropy and gradient-based optimization. arXiv preprint arXiv:2004.08763, 2020.
[13] Jintasit Pravitra, Kasey A. Ackerman, Chengyu Cao, Naira Hovakimyan, and Evangelos A. Theodorou. L1-adaptive MPPI architecture for robust and agile control of multirotors. arXiv preprint arXiv:2004.00152, 2020.
[14] Karen Leung, Edward Schmerling, Mengxuan Zhang, Mo Chen, John Talbot, J. Christian Gerdes, and Marco Pavone. On infusing reachability-based safety assurance within planning frameworks for human–robot vehicle interactions. The International Journal of Robotics Research, 39(10-11):1326–1345, 2020.
[15] Wenhao Luo and Ashish Kapoor. Multi-robot collision avoidance under uncertainty with probabilistic safety barrier certificates. arXiv preprint arXiv:1912.09957, 2020.
[16] Tom Hirshberg, Sai Vemprala, and Ashish Kapoor. Safety considerations in deep control policies with safety barrier certificates under uncertainty. Pages 6245–6251, 2020.
[17] Lei Zheng, Rui Yang, Jiesen Pan, Hui Cheng, and Haifeng Hu. Learning-based safety-stability-driven control for safety-critical systems under model uncertainties. Pages 1112–1118. IEEE, 2020.
[18] Hiroaki Fukushima, Tae-Hyoung Kim, and Toshiharu Sugie. Adaptive model predictive control for a class of constrained linear systems based on the comparison model. Automatica, 43(2):301–308, 2007.
[19] Anil Aswani, Humberto Gonzalez, S. Shankar Sastry, and Claire Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, 2013.
[20] Vishnu R. Desaraju and Nathan Michael. Experience-driven predictive control. Robot Learning and Planning (RLP 2016), page 29, 2016.
[21] Angel Urbina, Sankaran Mahadevan, and Thomas L. Paez. Quantification of margins and uncertainties of complex systems in the presence of aleatoric and epistemic uncertainty. Reliability Engineering & System Safety, 96(9):1114–1125, 2011.
[22] Gang Cao, Edmund M-K. Lai, and Fakhrul Alam. Gaussian process model predictive control of an unmanned quadrotor. Journal of Intelligent & Robotic Systems, 88(1):147–162, 2017.
[23] Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
[24] Lukas Hewing, Kim P. Wabersich, Marcel Menner, and Melanie N. Zeilinger. Learning-based model predictive control: Toward safe learning in control. Annual Review of Control, Robotics, and Autonomous Systems, 3:269–296, 2020.
[25] Lukas Hewing, Alexander Liniger, and Melanie N. Zeilinger. Cautious NMPC with Gaussian process dynamics for autonomous miniature race cars. In European Control Conference (ECC), pages 1341–1348. IEEE, 2018.
[26] Chris J. Ostafew, Angela P. Schoellig, and Timothy D. Barfoot. Robust constrained learning-based NMPC enabling reliable mobile robot path tracking. The International Journal of Robotics Research, 35(13):1547–1563, 2016.
[27] Mohit Mehndiratta and Erdal Kayacan. Gaussian process-based learning control of aerial robots for precise visualization of geological outcrops. Pages 10–16. IEEE, 2020.
[28] Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. The Journal of Machine Learning Research, 11:3011–3015, 2010.
[29] Felix Berkenkamp, Riccardo Moriconi, Angela P. Schoellig, and Andreas Krause. Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes. In IEEE Conference on Decision and Control (CDC), pages 4661–4666, 2016.
[30] Bernhard Schölkopf, Alexander J. Smola, Francis Bach, et al. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[31] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
[32] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793, 2017.
[33] Aaron D. Ames, Samuel Coogan, Magnus Egerstedt, Gennaro Notomista, Koushil Sreenath, and Paulo Tabuada. Control barrier functions: Theory and applications. In European Control Conference (ECC), pages 3420–3431. IEEE, 2019.
[34] Aaron D. Ames, Xiangru Xu, Jessy W. Grizzle, and Paulo Tabuada. Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control, 62(8):3861–3876, 2016.
[35] Xiangru Xu, Paulo Tabuada, Jessy W. Grizzle, and Aaron D. Ames. Robustness of control barrier functions for safety critical control. IFAC-PapersOnLine, 48(27):54–61, 2015.
[36] A. Ames, X. Xu, J. Grizzle, and P. Tabuada. Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control, 62:3861–3876, 2017.
[37] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.
[38] Hassan K. Khalil and Jessy W. Grizzle. Nonlinear Systems, volume 3. Prentice Hall, Upper Saddle River, NJ, 2002.
[39] Markus Hehn and Raffaello D'Andrea. Real-time trajectory generation for quadrocopters. IEEE Transactions on Robotics, 31(4):877–892, 2015.
[40] Guanya Shi, Xichen Shi, Michael O'Connell, Rose Yu, Kamyar Azizzadenesheli, Animashree Anandkumar, Yisong Yue, and Soon-Jo Chung. Neural lander: Stable drone landing control using learned dynamics. In IEEE International Conference on Robotics and Automation (ICRA), pages 9784–9790. IEEE, 2019.
[41] David Cabecinhas, Rita Cunha, and Carlos Silvestre. A globally stabilizing path following controller for rotorcraft with wind disturbance rejection. IEEE Transactions on Control Systems Technology, 23(2):708–714, 2014.
[42] Kenan Cole and Adam M. Wickenheiser. Reactive trajectory generation for multiple vehicles in unknown environments with wind disturbances. IEEE Transactions on Robotics, 34(5):1333–1348, 2018.
[43] D. Moorhouse and R. Woodcock. US military specification MIL-F-8785C. US Department of Defense, 1980.
[44] Robin Deits and Russ Tedrake. Computing large convex regions of obstacle-free space through semidefinite programming. In Algorithmic Foundations of Robotics XI, pages 109–124. Springer, 2015.
[45] Martin S. Andersen, Joachim Dahl, and Lieven Vandenberghe. CVXOPT: A Python package for convex optimization. abel.ee.ucla.edu/cvxopt.