Learning-based Robust Motion Planning with Guaranteed Stability: A Contraction Theory Approach
Hiroyasu Tsukamoto and Soon-Jo Chung
Abstract — This paper presents Learning-based Autonomous Guidance with Robustness, Optimality, and Safety guarantees (LAG-ROS), a real-time robust motion planning algorithm for safety-critical nonlinear systems perturbed by bounded disturbances. The LAG-ROS method consists of three phases: 1) Control Lyapunov Function (CLF) construction via contraction theory; 2) imitation learning of the CLF-based robust feedback motion planner; and 3) its real-time and decentralized implementation with a learning-based model predictive safety filter. For the CLF, we exploit a neural-network-based method of Neural Contraction Metrics (NCMs), which provides a differential Lyapunov function to minimize an upper bound of the steady-state Euclidean distance between perturbed and unperturbed system trajectories. The NCM ensures that the perturbed state stays in bounded error tubes around given desired trajectories, where we sample training data for imitation learning of the NCM-CLF-based robust centralized motion planner. Using local observations in training also enables its decentralized implementation. Simulation results for perturbed nonlinear systems show that the LAG-ROS achieves higher control performance and task success rate with faster execution speed for real-time computation, when compared with existing real-time robust MPC and learning-based feedforward motion planners.
Index Terms — Machine Learning for Robot Control, Robust/Adaptive Control, Optimization and Optimal Control.
I. INTRODUCTION
In the near future of aerospace and robotic exploration on land, in water, and in deep space, teams of robots are expected to perform complex decision-making tasks autonomously in extreme environments, where their motions are governed by nonlinear dynamical systems with unknown external disturbances. In order for such operations to be successful, the agents are required to compute 1) an optimal target trajectory for the given nonlinear system, and 2) a feedback control input which robustly guarantees (exponential) convergence to the trajectory, both in real-time with their limited on-board computational resources. The objective of this work is to propose a learning-based autonomous guidance and control algorithm that meets these challenging requirements.
Related Work:
The authors are with the Graduate Aerospace Laboratories, California Institute of Technology, Pasadena, CA, {htsukamoto, sjchung}@caltech.edu. Code: https://github.com/astrohiro.

Learning-based guidance and control designs of nonlinear systems have been an emerging area of research with the rise of neural networks and reinforcement learning [1], [2]. These techniques can be categorized into model-free and model-based methods, each of which has pros and cons. The former is supposed to learn desired optimal policies that work robustly, as it uses training data obtained in real-world environments. This in turn implies that these techniques are not suitable for situations where sampling large training datasets is difficult, as in aerospace and robotics systems. Also, proving stability and robustness properties of such model-free systems is challenging in general, although some techniques do exist [3], [4]. In contrast, the latter approach allows us to sample as much data as we want using a given dynamical system model, thereby constructing optimal policies using reinforcement learning [5], [6], imitation learning [7], [8], or both [9]–[11]. However, the learnt controller could yield cascading errors in the real-world environment if the nominal model poorly represents the true underlying dynamics [12].

Control-theoretical approaches to circumvent such difficulties include robust Model Predictive Control (MPC) [13] equipped with a feedback control law for stability certificates, i.e., $u = k(x, x_d, u_d, t)$, where $(x_d, u_d)$ is a target trajectory given by the tube-based MPC and $k(\cdot)$ is a feedback control policy that guarantees $x \to x_d$ for perturbed systems [14]. In [15], a differential feedback control law is proposed to guarantee tracking to any feasible nominal target trajectory given by existing motion planners for nonlinear systems, including [16]–[19]. Since these frameworks explicitly use the knowledge of the desired states $x_d$ for robustness properties, keeping the system trajectories in a safe invariant tube around $x_d$ even under external disturbances, they can provide provably stable and optimal control inputs at each time instant, unlike most learning-based control frameworks.
One drawback is that they assume the existence of either a real-time implementable motion planning algorithm or a stabilizing tracking controller (e.g., a Control Lyapunov Function (CLF) [20]–[22]) for given nonlinear systems, which is not always the case in practical applications.

Contributions:
Therefore, in this study, we present Learning-based Autonomous Guidance with Robustness, Optimality, and Safety guarantees (LAG-ROS) as a novel way to bridge the gap between the learning-based and robust MPC-based motion planning techniques. In particular, the LAG-ROS requires only one neural network evaluation to get an optimal control input, which imitates the CLF-based robust feedback control policy. Since it explicitly models the robust policy $x \mapsto u$, implicitly accounting for target trajectories $(x_d, u_d)$, instead of a feedforward control policy $x_d \mapsto u_d$ given by existing motion planners [16]–[18], the LAG-ROS attains high control performance even under unknown external disturbances. Although we consider bounded disturbances in this paper, stochastic disturbances and parametric uncertainty could be handled similarly using [23], [24]. Further, we propose one explicit way to construct a CLF for any general nonlinear system in the LAG-ROS framework using a Neural Contraction Metric (NCM) [25], which is a learning-based optimal contraction metric designed to minimize an upper bound of the steady-state Euclidean distance between perturbed and unperturbed system trajectories [26]. We also delineate how to implement it in a distributed way and guarantee safety even for perturbed nonlinear systems, using the global-to-local autonomy synthesis [27] and the MPC-based safety filter [8], respectively. Note that existing system identification and adaptive control techniques, including [24], [28], could be used to obtain a more accurate description of the underlying dynamical system, for the sake of less-conservative application of the LAG-ROS. The architecture of the proposed framework depicted in Fig. 1 is summarized as follows.

Fig. 1. Illustration of LAG-ROS.

In the offline phase, we first construct an NCM as described in [23]–[25], and utilize it as a CLF to design a robust optimal feedback control law.
Since the NCM-CLF control ensures that the system state stays in bounded error tubes around $x_d$ sampled by existing motion planners, including the one used in this work [17], [18], we then sample state $x$ and the NCM-CLF robust control $u$ in these tubes for imitation learning of $x \mapsto u$. We also show that a known Lyapunov function based on a special structure of the underlying dynamical systems can be used in the LAG-ROS framework, taking Lagrangian systems [29] as an example. In the online phase, we implement it by evaluating one neural network at each time instant, using the decentralization technique [8], [30] and the safety filter [27] on top of the learnt robust control. It is demonstrated that the LAG-ROS outperforms existing real-time motion planning frameworks [8], [16]–[18] in cart-pole balancing [31], motion planning and collision avoidance of multiple spacecraft simulators [32] in a cluttered environment, and autonomous reconfiguration of multiple spacecraft in Low Earth Orbit (LEO) [17]. In particular, the LAG-ROS requires less than 0.1 s for real-time computation in all of these tasks, and achieves higher control performance and task success rate when compared with the existing real-time robust MPC and learning-based feedforward motion planners.
Notation:
For a vector $x \in \mathbb{R}^{n}$ and a matrix $A \in \mathbb{R}^{n \times m}$, we let $\|x\|$, $\delta x$, and $\|A\|$ denote the Euclidean norm, the infinitesimal variation of $x$, and the induced 2-norm, respectively. For a square matrix $A$, we use the notation $A \succ 0$, $A \succeq 0$, $A \prec 0$, and $A \preceq 0$ for positive definite, positive semi-definite, negative definite, and negative semi-definite matrices, respectively, and $\mathrm{sym}(A) = (A + A^{\top})/2$. Also, $I \in \mathbb{R}^{n \times n}$ denotes the identity matrix.

II. ROBUST AND OPTIMAL TRACKING CONTROL USING NEURAL CONTRACTION METRICS
We first propose a novel feedback control framework for general nonlinear systems, which guarantees robustness, optimality, and stability when a target trajectory $(x_d, u_d)$ is given. In particular, we use the Neural Contraction Metric (NCM) [25] when the Lyapunov function of a given nonlinear system is unknown. We also investigate the case where we know a Lyapunov function candidate, taking Lagrangian systems as an example.

A. Neural Contraction Metrics as Lyapunov Functions

The NCM is a recently developed deep-learning-based design framework for constructing an optimal Lyapunov function for provably stable feedback control and estimation of nonlinear systems given as $\dot{x}(t) = f(x(t), t) + d(x, t)$, where $\bar{d} = \sup_{x,t}\|d(x, t)\| < +\infty$ [25]. This section presents the NCM Control Lyapunov Function (NCM-CLF) for designing an optimal robust control law, which guarantees exponential stability with respect to a given target trajectory $(x_d, u_d)$. We consider the following nonlinear dynamical system with a controller $u \in \mathbb{R}^{m}$:

$$\dot{x} = f(x, t) + B(x, t)u + d(x, t), \quad \dot{x}_d = f(x_d, t) + B(x_d, t)u_d \quad (1)$$

where $t \in \mathbb{R}_{\geq 0}$, $x, x_d : \mathbb{R}_{\geq 0} \mapsto \mathbb{R}^{n}$, $u_d : \mathbb{R}^{n} \times \mathbb{R}_{\geq 0} \mapsto \mathbb{R}^{m}$, $f : \mathbb{R}^{n} \times \mathbb{R}_{\geq 0} \mapsto \mathbb{R}^{n}$, and $B : \mathbb{R}^{n} \times \mathbb{R}_{\geq 0} \mapsto \mathbb{R}^{n \times m}$. The objective of this section is to find a robust optimal feedback control law $u = k(x, x_d, t)$ given $x_d$ and $u_d(x_d, t)$ for the system (1). The following lemma is useful for this purpose.
Lemma 1: Consider a general feedback controller $u$ defined as $u = k(x, x_d, t)$ with $k(x_d, x_d, t) = u_d(x_d, t)$, where $k : \mathbb{R}^{n} \times \mathbb{R}^{n} \times \mathbb{R}_{\geq 0} \mapsto \mathbb{R}^{m}$. If $k$ is piecewise continuously differentiable, then $\exists K : \mathbb{R}^{n} \times \mathbb{R}^{n} \times \mathbb{R}_{\geq 0} \mapsto \mathbb{R}^{m \times n}$ s.t. $u = k(x, x_d, t) = u_d(x_d, t) + K(x, x_d, t)(x - x_d)$.

Proof: We have $u = u_d + (k(x, x_d, t) - k(x_d, x_d, t))$ due to $k(x_d, x_d, t) = u_d(x_d, t)$. Since $k(x, x_d, t) - k(x_d, x_d, t) = \int_{0}^{1} (dk(cx + (1-c)x_d, x_d, t)/dc)\,dc$, choosing $K$ as $\int_{0}^{1} (\partial k(cx + (1-c)x_d, x_d, t)/\partial x)\,dc$ gives the desired relation.

Lemma 1 implies that designing an optimal $k$ of $u = k(x, x_d, t)$ reduces to designing an optimal $K(x, x_d, t)$ of $u = u_d(x_d, t) + K(x, x_d, t)(x - x_d)$. For such $u$, the virtual system of (1), which has $y = x$ and $y = x_d$ as its particular solutions, is given as follows:

$$\dot{y} = \dot{x}_d + (A(x, x_d, t) + B(x, t)K(x, x_d, t))(y - x_d) + d_e(y, t) \quad (2)$$

where $A$ is the State-Dependent Coefficient (SDC) form of the dynamical system (1) given by Lemma 2 of [23], i.e., $A(x, x_d, t)(x - x_d) = f(x, t) + B(x, t)u_d - f(x_d, t) - B(x_d, t)u_d$, and $d_e$ satisfies $d_e(x, t) = d(x, t)$ and $d_e(x_d, t) = 0$.
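The integral construction of $K$ in the proof above can be checked numerically. The sketch below uses a hypothetical policy $k(x, x_d) = u_d + \tanh(x) - \tanh(x_d)$ (an illustrative choice satisfying $k(x_d, x_d) = u_d$; not the controller of Theorem 1) and midpoint quadrature:

```python
import numpy as np

# Numerical check of Lemma 1 with an illustrative feedback policy
# k(x, x_d) = u_d + tanh(x) - tanh(x_d), which satisfies k(x_d, x_d) = u_d.
def k(x, x_d, u_d):
    return u_d + np.tanh(x) - np.tanh(x_d)

def K_gain(x, x_d, n_quad=1000):
    # K = \int_0^1 (dk/dx)(c x + (1 - c) x_d) dc by the midpoint rule; the
    # Jacobian of the componentwise tanh policy is diagonal.
    c = ((np.arange(n_quad) + 0.5) / n_quad)[:, None]
    z = c * x + (1.0 - c) * x_d
    return np.diag(np.mean(1.0 - np.tanh(z) ** 2, axis=0))

rng = np.random.default_rng(0)
x, x_d, u_d = (rng.standard_normal(3) for _ in range(3))
u_direct = k(x, x_d, u_d)                        # u = k(x, x_d, t)
u_affine = u_d + K_gain(x, x_d) @ (x - x_d)      # u = u_d + K (x - x_d)
print(np.allclose(u_direct, u_affine, atol=1e-5))  # the two forms coincide
```

The agreement confirms that any smooth policy with $k(x_d, x_d, t) = u_d$ can be written in the affine form of Lemma 1.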
Theorem 1: Suppose that $f$ and $B$ are piecewise continuously differentiable, and let $B = B(x, t)$ and $A = A(x, x_d, t)$ in (2) for notational simplicity. Consider a neural network trained to perfectly model the optimal contraction metric $M(x, x_d, t) = W(x, x_d, t)^{-1} \succ 0$, which solves

$$J_{CV}^{*} = \min_{\nu > 0,\ \chi \in \mathbb{R},\ \bar{W} \succ 0} \frac{\bar{d}\chi}{\alpha} \ \text{ s.t. (4) and (5)} \quad (3)$$

with the convex constraints (4) and (5) given as

$$-\dot{\bar{W}} + 2\,\mathrm{sym}(A\bar{W}) - \nu B R^{-1} B^{\top} \preceq -2\alpha\bar{W},\ \forall x, x_d, t \quad (4)$$

$$I \preceq \bar{W}(x, x_d, t) \preceq \chi I,\ \forall x, x_d, t \quad (5)$$

where $\alpha > 0$, $\underline{\omega}, \bar{\omega} > 0$, $\chi = \bar{\omega}/\underline{\omega}$, $\bar{W} = \nu W$, $\nu = 1/\underline{\omega}$, and $R = R(x, x_d, t) \succ 0$ is a given weight matrix on $u$. Suppose also that the dynamics (1) is controlled by $u = u_d + K^{*}(x, x_d, t)e$, where $e = x - x_d$ and $K^{*}$ is given by the following convex optimization-based controller for given $(x, x_d, t)$:

$$K^{*} = \operatorname*{arg\,min}_{K \in \mathbb{R}^{m \times n}} \|u - u_d\|^{2} = \operatorname*{arg\,min}_{K \in \mathbb{R}^{m \times n}} \|K(x, x_d, t)e\|^{2} \quad (6)$$

$$\text{s.t. } \dot{M} + 2\,\mathrm{sym}(MA + MBK(x, x_d, t)) \preceq -2\alpha M. \quad (7)$$

Then (6) is always feasible, i.e., $V = \delta y^{\top} M \delta y$ is a CLF for $y$ in (2), and minimizes the deviation of $u$ from $u_d$ under the stability constraint (7). Furthermore, we have that

$$\|x(t) - x_d(t)\| \leq R(0)\sqrt{\bar{\omega}}\,e^{-\alpha t} + \frac{\bar{d}}{\alpha}\sqrt{\frac{\bar{\omega}}{\underline{\omega}}} = R_{\mathrm{NCM}}(t) \quad (8)$$

where $R(t) = \int_{x_d}^{x} \|\Theta(x(t), x_d(t), t)\,\delta y(t)\|$ for $M = \Theta^{\top}\Theta$. The metric $M$ minimizes the steady-state upper bound of (8), and we call (6) the NCM-CLF control.

Proof: Let us first show that (6) is always feasible. Since the virtual dynamics is given as $\delta\dot{y} = (A + BK)\delta y$ for $d_e = 0$, we have for $K = -R^{-1}B^{\top}M$ that $\dot{V} = \delta y^{\top}(\dot{M} + 2\,\mathrm{sym}(MA) - 2MBR^{-1}B^{\top}M)\delta y \leq \nu^{-1}\delta y^{\top} M(-\dot{\bar{W}} + 2\,\mathrm{sym}(A\bar{W}) - \nu BR^{-1}B^{\top})M\delta y$. Thus, the condition (4) ensures that (7) is feasible for $K = -R^{-1}B^{\top}M$. Also, for systems perturbed by $d_e$ as in (2), the controller (6) gives $\dot{V} \leq -2\alpha V + 2\delta y^{\top} M \delta d_e$, resulting in $\dot{R} \leq -\alpha R + \bar{d}/\sqrt{\underline{\omega}}$ due to (5). The comparison lemma [22] with (5) then yields (8). Further, Lemma 1 indicates that (6) minimizes the deviation of $u$ from $u_d$ under the stability constraint (7) without loss of generality. The optimality of $M$ follows from Corollary 1 of [25].
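The feasibility argument above can be illustrated on a constant-metric, linear-system simplification (an assumption for brevity; the paper's $M(x, x_d, t)$ is state-dependent and learned). The sketch computes the Riccati metric $M$ by Kleinman's iteration in pure NumPy and verifies that $K^{*} = -R^{-1}B^{\top}M$ satisfies the constant-metric analogue of (7), i.e., $\mathrm{sym}(M(A + BK^{*})) \prec 0$ with $\dot{M} = 0$:

```python
import numpy as np

# LTI double-integrator illustration (not the paper's learned NCM).
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

def lyap(Acl, C):
    # Solve Acl^T M + M Acl = C for symmetric M via Kronecker vectorization.
    n = Acl.shape[0]
    L = np.kron(np.eye(n), Acl.T) + np.kron(Acl.T, np.eye(n))
    M = np.linalg.solve(L, C.reshape(-1, order="F")).reshape(n, n, order="F")
    return 0.5 * (M + M.T)

K = np.array([[1.0, 1.0]])                 # initial stabilizing gain
for _ in range(50):                        # Kleinman (policy) iteration
    M = lyap(A - B @ K, -(Q + K.T @ R @ K))
    K = np.linalg.solve(R, B.T @ M)

K_star = -np.linalg.solve(R, B.T @ M)      # feedback gain of the proof
S = M @ (A + B @ K_star)
S = 0.5 * (S + S.T)                        # sym(M(A + B K*))
print(np.linalg.eigvalsh(M).min() > 0)     # M is a valid (PD) metric
print(np.linalg.eigvalsh(S).max() < 0)     # differential Lyapunov decrease
```

With the converged metric, the closed-loop quadratic form decreases strictly, which is the constant-metric counterpart of the contraction constraint (7).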
Remark 1: Although a neural network is used to model $M$ in Theorem 1, we could directly solve the convex optimization at the expense of computational efficiency [33]. Also, one can select other performance-based objective functions in (3) and (6) depending on the application of interest, as described in [23], [33].

B. Dynamical Systems with Known Lyapunov Functions
There are several science and engineering applications where we know a Lyapunov function candidate of a given nonlinear system, one of which is a feedback linearizable system [22], [29], [34]. In this section, we illustrate how to construct robust tracking control for such systems, taking Lagrangian systems as an example. Consider the following system with a bounded disturbance $d(x, t)$:

$$H(q)\ddot{q} + C(q, \dot{q})\dot{q} + G(q) = u + d(x, t) \quad (9)$$

where $q : \mathbb{R}_{\geq 0} \to \mathbb{R}^{n}$, $u : \mathbb{R}_{\geq 0} \to \mathbb{R}^{n}$, $H : \mathbb{R}^{n} \to \mathbb{R}^{n \times n}$, $C : \mathbb{R}^{n} \times \mathbb{R}^{n} \to \mathbb{R}^{n \times n}$, $G : \mathbb{R}^{n} \to \mathbb{R}^{n}$, and $d : \mathbb{R}^{2n} \times \mathbb{R}_{\geq 0} \to \mathbb{R}^{n}$ with $\bar{d} = \sup_{x,t}\|d(x, t)\| < +\infty$ and $x = [q^{\top}, \dot{q}^{\top}]^{\top}$. We note that the matrix $C(q, \dot{q})$ is selected to make $\dot{H} - 2C$ skew-symmetric, so we have the useful property that $z^{\top}(\dot{H} - 2C)z = 0,\ \forall z \in \mathbb{R}^{n}$.
Theorem 2: Suppose that (9) is controlled by

$$u = u_n = H(q)\ddot{q}_r + C(q, \dot{q})\dot{q}_r + G(q) - K_{\ell}(t)s \quad (10)$$

where $\dot{q}_r = \dot{q}_d - \Lambda(q - q_d)$, $s = \dot{q} - \dot{q}_r$, $K_{\ell} \succ 0$, and $\Lambda \succ 0$. If $K_{\ell}$, $\Lambda$, and $\varepsilon > 0$ are selected s.t. $\exists \alpha > 0$ with

$$\begin{bmatrix} 2K_{\ell} & -\varepsilon I \\ -\varepsilon I & 2\varepsilon\Lambda \end{bmatrix} \succeq 2\alpha M,$$

where $M = \Theta^{\top}\Theta = \mathrm{diag}(H, \varepsilon I)$, then we have that

$$\|x - x_d\| \leq \frac{1}{\sqrt{\underline{p}}}\left(R(0)e^{-\alpha t} + \frac{\bar{d}}{\alpha\sqrt{\underline{m}}}\right) = R_{\mathrm{NCM}}(t) \quad (11)$$

where $x_d = [q_d^{\top}, \dot{q}_d^{\top}]^{\top}$, $R(t) = \int_{0}^{\xi} \|\Theta(q(t))\,\delta y(t)\|$, $\xi = [s^{\top}, e^{\top}]^{\top}$, $e = q - q_d$, $\underline{m}I \preceq M$, $P = \begin{bmatrix} \Lambda H \Lambda + \varepsilon I & \Lambda H \\ H\Lambda & H \end{bmatrix} \succ 0$ with $\underline{p}I \preceq P$, and $y$ is the state of the virtual system of (9), which has $[s^{\top}, e^{\top}]^{\top}$ and $[0^{\top}, 0^{\top}]^{\top}$ as its particular solutions.

Proof: For the virtual system defined as $H\dot{y}_1 + Cy_1 + K_{\ell}y_1 = d_s(y, t)$ and $\dot{y}_2 = y_1 - \Lambda y_2$, we have that $y = [y_1^{\top}, y_2^{\top}]^{\top} = \xi = [s^{\top}, e^{\top}]^{\top}$ and $y = 0$ are its particular solutions, where $d_s(\xi, t) = d(x, t)$ and $d_s(0, t) = 0$. Thus, using a differential Lyapunov function $V = \delta y^{\top} M \delta y$, we can show that $\dot{R} \leq -\alpha R + \bar{d}/\sqrt{\underline{m}}$ under the condition of this theorem [29], [35]. Since we have $\xi^{\top} M \xi = (x - x_d)^{\top} P (x - x_d)$ by definition, applying the comparison lemma [22] yields the desired relation.
Remark 2: The control parameters $K_{\ell}$ and $\Lambda$ in Theorem 2 could also be optimized using the same technique as in Theorem 1 or the one in Remark 1 [33].

III. IMITATION LEARNING OF ROBUST OPTIMAL MOTION PLANNERS
Although the NCM-CLF control of Theorem 1 guarantees exponential stability and is robust against external disturbances, it requires online computation of an optimal target trajectory $(x_d, u_d)$, which is not necessarily straightforward in practice. This section thus proposes one approach to directly compute the robust control input (6) of Theorem 1 in real-time via imitation learning. Note that all the following discussions are also applicable to systems with a known Lyapunov function due to Theorem 2.

Fig. 2. Illustration of state sampling in the robust bounded error tube.

A. Sampling of Expert Demonstrations
We generate expert demonstration data using existing centralized motion planners, including the one we use in this work [17]. Since the dynamics is perturbed by $d(x, t)$ as in (1), we sample target trajectories by solving the following bounded error tube-based motion planning problem:

$$\min_{x, u}\ c_1\int_{0}^{T} \|u(t)\|^{2}\,dt + c_2\int_{0}^{T} P(x(t), u(t), t)\,dt \quad (12)$$

$$\text{s.t. } \dot{x} = f(x, t) + B(x, t)u \text{ and } x \in \bar{\mathcal{X}} \quad (13)$$

where $c_1 > 0$, $c_2 \geq 0$, $P(x(t), u(t), t)$ is some performance-based cost function, and $\bar{\mathcal{X}}$ is the robust admissible state space defined as $\bar{\mathcal{X}} = \{x \in \mathbb{R}^{n} \mid \forall \xi \in \{\xi \mid \|x - \xi\| \leq \lim_{t \to \infty} R_{\mathrm{NCM}}(t)\},\ \xi \in \mathcal{X}\}$, with $\mathcal{X}$ being the admissible state space for the unperturbed system. Note that $R_{\mathrm{NCM}}(t)$ is given in (8) (and in (11) for systems with a known Lyapunov function).
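For box and ball constraints, the tightened set $\bar{\mathcal{X}}$ admits a simple closed form: shrink the admissible box and inflate obstacles by the steady-state tube radius. A minimal sketch with illustrative numbers (the actual radius comes from (8)):

```python
import numpy as np

# Tube-tightening sketch: any nominal x_d in the tightened set keeps the true
# perturbed state x inside X, since ||x - x_d|| <= lim R_NCM(t). Values are
# illustrative assumptions.
d_bar, alpha, w_lo, w_hi = 0.5, 1.0, 1.0, 4.0
R_inf = (d_bar / alpha) * np.sqrt(w_hi / w_lo)   # steady-state radius of (8)

lo, hi = np.array([-5.0, -5.0]), np.array([5.0, 5.0])
lo_bar, hi_bar = lo + R_inf, hi - R_inf          # tightened box for the planner

obs_c, obs_r = np.array([1.0, 1.0]), 0.8
obs_r_bar = obs_r + R_inf                        # inflated obstacle radius

def in_Xbar(x_d):
    # x_d is admissible for planning iff the whole tube around it stays in X.
    in_box = np.all(x_d >= lo_bar) and np.all(x_d <= hi_bar)
    return bool(in_box and np.linalg.norm(x_d - obs_c) >= obs_r_bar)

print(in_Xbar(np.array([3.0, 3.0])))   # True: tube fits inside X
print(in_Xbar(np.array([1.5, 1.5])))   # False: tube would clip the obstacle
```

Planning in this tightened set is what makes the guarantee of Lemma 2 below possible.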
Lemma 2: Consider the perturbed system (1) and suppose that $x_d(0) = x(0)$. If $(x_d, u_d)$ is given by (12), then the NCM-CLF control (6) of Theorem 1 ensures that the perturbed solution $x$ stays within the admissible state space $\mathcal{X}$, i.e., $x \in \mathcal{X},\ \forall t$.

Proof: The NCM-CLF control with the relation $x_d(0) = x(0)$ guarantees $\|x(t) - x_d(t)\| \leq \lim_{t \to \infty} R_{\mathrm{NCM}}(t),\ \forall t$, by (8) of Theorem 1. Since $x_d \in \bar{\mathcal{X}}$, we have that $\forall \xi \in S(x_d) = \{\xi \mid \|x_d - \xi\| \leq \lim_{t \to \infty} R_{\mathrm{NCM}}(t)\},\ \xi \in \mathcal{X}$. These indicate that $x \in S(x_d)$ and thus $x \in \mathcal{X}$.

Therefore, we utilize (12) for generating expert demonstrations in $N$ random environments, $\{(x_d, u_d)\}_{i=1}^{N}$, based on the result of Lemma 2. The same argument holds for the Lagrangian systems due to Theorem 2.

B. Robust Control Inputs Sampling and Imitation Learning
The novelty of our approach lies in modeling the robust control policy (6) that maps $x$ to $u$, instead of the feedforward policy that maps $x_d$ to $u_d$. To this end, we sample state and robust control input pairs $(x, u)$ using the target trajectories $\{(x_d, u_d)\}_{i=1}^{N}$ given by (12) as follows. We first sample artificially perturbed states $x$ in the tube $B(x_d)$ given as $B(x_d) = \{\xi \in \mathbb{R}^{n} \mid \|\xi - x_d\| \leq \lim_{t \to \infty} R_{\mathrm{NCM}}(t)\}$, as depicted in Fig. 2. Since any perturbed state trajectory $x(t)$ with $x_d(0) = x(0)$ satisfies $x \in B(x_d)$ and $x \in \mathcal{X}$ as long as it is controlled by the NCM-CLF (6), as shown in (8) of Theorem 1 and in Lemma 2, respectively, we then sample $u$ by solving (6) with such $x \in B(x_d)$ and $(x_d, u_d)$ of (12). For each of the $N$ target trajectories $\{(x_d, u_d)\}_{i=1}^{N}$, we compute $M$ robust control inputs, resulting in $NM$ training samples $\{(x, u)\}_{j=1}^{NM}$. We propose a novel imitation learning-based robust and optimal controller in Proposition 1 below.
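The tube-sampling step above can be sketched as follows. A fixed placeholder gain stands in for the optimization-based $K^{*}$ of (6), and all sizes are illustrative assumptions:

```python
import numpy as np

# Dataset-generation sketch: perturb each nominal state x_d inside the tube
# B(x_d) and pair it with the robust feedback input u = u_d + K*(x - x_d).
# K_star below is a fixed placeholder for the NCM-CLF gain of (6).
rng = np.random.default_rng(2)
n, M_samples, R_inf = 4, 10, 1.0
K_star = -np.ones((2, n))                  # placeholder robust gain

def sample_in_ball(center, radius, rng):
    # Uniform sample in the Euclidean ball of given radius around center.
    v = rng.standard_normal(center.size)
    v *= radius * rng.random() ** (1.0 / center.size) / np.linalg.norm(v)
    return center + v

X_train, U_train = [], []
for _ in range(50):                        # 50 nominal trajectory points
    x_d, u_d = rng.standard_normal(n), rng.standard_normal(2)
    for _ in range(M_samples):             # M robust inputs per point
        x = sample_in_ball(x_d, R_inf, rng)
        X_train.append(x)
        U_train.append(u_d + K_star @ (x - x_d))
X_train, U_train = np.array(X_train), np.array(U_train)
print(X_train.shape, U_train.shape)        # (500, 4) (500, 2)
```

In the actual pipeline each pair $(x, u)$ would come from solving (6) at the sampled $x$, not from a fixed gain.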
Proposition 1: The imitation learning-based NCM-CLF control of Theorem 1 is a neural network model for approximating the state-robust control input training data $\{(x, u)\}_{j=1}^{NM}$, sampled based on Lemma 2 and Sec. III-B (see Fig. 2). The architecture of this framework is summarized in Fig. 1.

Remark 3: Since (6) is modeled by a neural network in Proposition 1, using local observations in training also enables its decentralized implementation accounting for centralized global solutions [8], [30]. This property will be demonstrated in Sec. V to enable real-time implementation of Proposition 1.
IV. LEARNING-BASED AUTONOMOUS GUIDANCE WITH ROBUSTNESS, OPTIMALITY, AND SAFETY GUARANTEES (LAG-ROS)

The final component we add to the proposed imitation learning-based NCM-CLF control law of Proposition 1 is a provable safety guarantee even for systems perturbed by bounded disturbances (1). To this end, we utilize a recently-developed concept of learning-based MPC [27]. The key difference of this approach in Proposition 2 from existing MPC frameworks is that we can select its time horizon much smaller for online implementation depending on the relative degree of safety constraints (see Corollary 1), as it attempts to find $u$ that minimizes the deviation from the learning-based robust and optimal controller $u_L$ of Proposition 1, $\int \|u - u_L\|^{2}\,dt$, instead of $\int \|u\|^{2}\,dt$, along with hard safety constraints.
Proposition 2: Let $u_L$ denote the learning-based NCM-CLF control of Proposition 1, $u = u_d + K^{*}(x, x_d, t)e$, approximately modeled by a neural network. The Learning-based Autonomous Guidance of Nonlinear Systems with Robustness, Optimality, and Safety (LAG-ROS) control input during $t \in [t_i, t_{i+1})$ is given by $u_i^{*}$ as a result of the following optimization problem, constructed to minimize the deviation of $u_i^{*}$ from $u_L(t_i)$:

$$u_i^{*}, u_{i+1}^{*}, \cdots, u_{I_i - 1}^{*} = \operatorname*{arg\,min}_{u_k \in \mathbb{R}^{m},\ \forall k} \sum_{k=i}^{I_i - 1} \|u_k - u_L(t_k)\|^{2} \quad (14)$$

$$\text{s.t. } \dot{\bar{x}} = f(\bar{x}, t) + B(\bar{x}, t)u_k,\ \forall t \in [t_k, t_{k+1}), \quad (15)$$

$$\bar{x}(t_i) = x(t_i), \text{ and } \bar{x}(t_{k+1}) \in \bar{\mathcal{X}}_k,\ \forall k = i, \cdots, I_i - 1 \quad (16)$$

where $x(t_i)$ is the perturbed solution of (1) observed at $t = t_i$, $\bar{\mathcal{X}}_k = \{q \in \mathbb{R}^{n} \mid \forall \xi \in \{\xi \mid \|q - \xi\| \leq \bar{d}\Delta t_i\},\ \xi \in \mathcal{X}(t_{k+1})\}$, $\Delta t_i = t_{i+1} - t_i$, $\mathcal{X}$ is the admissible safe state space, and $\bar{d} = \sup_{x,t}\|d(x, t)\|$ as defined in (1). If $I_i$ is selected small enough to make (14) solvable online at $t = t_i$, the LAG-ROS is implementable in real-time and its perturbed solution $x$ governed by (1) stays in $\mathcal{X}$ for $t \in [t_i, t_{i+1})$.

Proof: Integrating (15) and (1) from $t_i$ to $t$ with the constraint $\bar{x}(t_i) = x(t_i)$ of (16) yields $\|x(t) - \bar{x}(t)\| \leq \bar{d}|t - t_i|$. Thus, $\bar{x}(t_{i+1}) \in \bar{\mathcal{X}}_i$ of (16) implies that $x(t_{i+1}) \in \mathcal{X}(t_{i+1})$. Since $|t - t_i| < \Delta t_i$ for $t \in [t_i, t_{i+1})$, $x$ of (1) stays in $\mathcal{X}$ for $t \in [t_i, t_{i+1})$, as desired.

The following corollary is useful for dramatically reducing the computational complexity of Proposition 2 for real-time implementation.
Corollary 1: The optimization problem (14) of Proposition 2 with $I_i = i + 1$ is convex/quadratic if $\bar{\mathcal{X}}_i$ of (16) is convex/quadratic.

Proof: Since $\bar{x}(t_{i+1})$ is affine in $u_i$ due to (15) and (16), the constraint $\bar{x}(t_{i+1}) \in \bar{\mathcal{X}}_i$ is convex/quadratic if $\bar{\mathcal{X}}_i$ is convex/quadratic [36].

Corollary 1 indicates that a wide range of safety constraints, such as affine state constraints (e.g., $x_{\min} \leq x \leq x_{\max}$) and collision avoidance [17], can be incorporated into the LAG-ROS framework without destroying the convex structure for computational efficiency. We demonstrate its performance and its real-time implementability later in Sec. V.
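For the single-integrator special case with box constraints, the one-step ($I_i = i + 1$) problem of Corollary 1 even reduces to componentwise clipping of $u_L$. A minimal sketch under these simplifying assumptions (values illustrative):

```python
import numpy as np

# One-step safety-filter sketch for xdot = u + d, |d| <= d_bar: the QP (14)
# with I_i = i+1 becomes: min |u - u_L|^2 s.t. x + dt*u in the tightened box.
d_bar, dt = 0.5, 0.1
lo, hi = -1.0, 1.0
lo_bar, hi_bar = lo + d_bar * dt, hi - d_bar * dt   # tightened set \bar{X}_i

def lag_ros_filter(x, u_L):
    # Projection of u_L onto {u : lo_bar <= x + dt*u <= hi_bar} (affine in u).
    u_min, u_max = (lo_bar - x) / dt, (hi_bar - x) / dt
    return float(np.clip(u_L, u_min, u_max))

x = 0.9
u_learned = 2.0                        # learned input would overshoot the box
u_safe = lag_ros_filter(x, u_learned)  # ~0.5: minimal deviation from u_L
x_next_nominal = x + dt * u_safe
print(u_safe, x_next_nominal <= hi_bar + 1e-12)
```

The filtered input deviates from $u_L$ only as much as needed to keep the nominal next state in $\bar{\mathcal{X}}_i$, so the true perturbed state stays in $\mathcal{X}$ by the argument of Proposition 2.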
Remark 4: The safety part of the LAG-ROS approach in Proposition 2 is not intended to supersede, but can be replaced by, other existing state-of-the-art control techniques for safety-critical systems, depending on the application of interest. This includes [37], which utilizes control barrier functions for ensuring safety of nonlinear dynamical systems under input disturbances (i.e., $\dot{x} = f(x) + B(x)(u + d)$). In general, safety can be augmented on top of the learning-based NCM-CLF control $u_L$ in Proposition 1 as follows:

$$u^{*}(t_i) = \operatorname*{arg\,min}_{u \in \mathbb{R}^{m}} \|u - u_L(t_i)\|^{2} \quad (17)$$

$$\text{s.t. safety constraints given by existing techniques.} \quad (18)$$

Note that input constraints $u \in \mathcal{U}$ can be incorporated in (6), (12), (14), and (17) following the same proof.

The pseudocode to construct and implement the LAG-ROS control is presented in Algorithm 1. We train the learning-based NCM-CLF control of Proposition 1 offline, and then implement it online with a safety constraint as in Proposition 2 and Corollary 1.

V. SIMULATION
Our proposed LAG-ROS framework is demonstrated using multiple motion planning problems under external disturbances (https://github.com/astrohiro). CVXPY [38] with the MOSEK solver [39] is used to solve optimization problems. In this section, we use a neural network with 3 layers and 100 neurons for imitation learning in Proposition 1, and 3 layers and 32 neurons for deep set-based global-to-local safe autonomy synthesis [8], [30]. Note that the computational time of each framework is measured on a MacBook Pro laptop (2.2 GHz Intel Core i7, 16 GB 1600 MHz DDR3 RAM), and simulation results are the average over 50 simulations for each random environment and disturbance realization.
A. Cart-Pole Balancing
Algorithm 1: LAG-ROS Algorithm

1. OFFLINE PHASE
Inputs: Random environments $\{E_i\}_{i=1}^{N}$; NCM of [25]/known Lyapunov function; motion planner $\mathcal{P}$
Outputs: Learning-based control $u_L$ of Proposition 1
for $i \leftarrow 1$ to $N$ do
  Solve (12) for $E_i$ using $\mathcal{P}$ and obtain $(x_d^{(i)}, u_d^{(i)})$;
  Sample $M$ robust NCM-CLF control inputs $\{(x_j^{(i)}, u_j^{(i)})\}_{j=1}^{M}$ based on Lemma 2 and Sec. III-B (see Fig. 2);
end
Model $\{\{(x_j^{(i)}, u_j^{(i)})\}_{j=1}^{M}\}_{i=1}^{N}$ by a neural network as in Proposition 1 (see Fig. 1) and save it as $u_L$.

2. ONLINE PHASE
Inputs: Current state and time $(x(t), t)$; learning-based control $u_L$ of Proposition 1
Outputs: LAG-ROS control input of Proposition 2
Determine $I_i$ of (14) small enough based on the current computational power at time $t$;
Solve (14) of Proposition 2 and obtain $u(t)$.

Fig. 3. Cart-pole balancing task.

We first utilize the cart-pole balancing task in Fig. 3 for illustrating how to apply the LAG-ROS motion planner in practice. Its equation of motion is given in [24], [31], with parameters $g$, $m_c$, $m$, $\mu_c$, $\mu_p$, and $l$ set as therein. The task is to drive the state $x = [p, \theta, \dot{p}, \dot{\theta}]^{\top}$ in Fig. 3 to $x = [p_f, 0, 0, 0]^{\top}$ in 10 seconds with minimum $L^{2}$ control effort under bounded disturbances, where $p_f$ is a random terminal position at each episode. To this end, we sample target trajectories $(x_d, u_d)$ for 100 random initial and terminal states (10000 training samples) using the centralized motion planner in [17], and construct the LAG-ROS neural net as explained in Proposition 1 using the NCM [25] as a CLF. Its performance is compared with the global robust MPC-based controller [16], the learning-based controller which models the feedforward (FF) control policy $x_d \mapsto u_d$ [8], and the robust NCM controller [25].

As summarized in Table I, the LAG-ROS objective value $\int \|u(t)\|^{2}\,dt$ is only 9.5% larger than that of the robust MPC with 42.2% smaller computational time. Since we select the time interval of control inputs to be $\Delta t = 0.1$ s, the LAG-ROS is implementable in real-time. It achieves a 100.0% success rate even under external disturbances, unlike the learning-based FF control (40.0% success rate), and gives 61.3% smaller $\int \|u(t)\|^{2}\,dt$ than that of the NCM robust control.

B. Multi-Agent Nonlinear Motion Planning
TABLE I. Control performances for cart-pole balancing (averaged over 50 simulations under bounded disturbances; computational time should be less than the control time interval $\Delta t = 0.1$ s for real-time implementation). Columns: success rate out of 50 trials (%), $L^{2}$ control effort $\int \|u(t)\|^{2}\,dt$, and computational time per time step (s); rows: (a) learning-based FF [8], (b) robust NCM [25], (c) LAG-ROS, and (d) robust MPC [14].

We next consider motion planning and collision avoidance of multiple spacecraft simulators [32] in a cluttered environment with external disturbances, where the agents are supposed to perform tasks based only on local observations, and thus centralized motion planners are not implementable in real-time, unlike in the previous setting. Note that its equation of motion is given in [32] and all of its parameters are normalized to 1. The goal of each agent is to reach a random target position with minimum $L^{2}$ control effort, avoiding collisions with circular obstacles (0.5 m in radius) and with the other spacecraft, within 45 s under bounded disturbances. The LAG-ROS remains implementable in real-time even with the safety filter of Proposition 2, as its computational time is less than the control time interval $\Delta t$.

As shown in Fig. 4 and Table II, the LAG-ROS improves its control performance and achieves a 111.3% improvement in its success rate compared with the robust MPC, even when the learning-based FF control ends up with only a 2.6% success rate. This is primarily due to the fact that the LAG-ROS approximates the global robust optimal solution, while the learning-based FF lacks robustness and the robust MPC only considers local optimality. We remark that a slightly better onboard computer is required for the robust MPC, as its computational time is greater than $\Delta t$. The LAG-ROS objective value is only slightly larger than that of the global robust motion planner, which indicates that it can approximate the robust optimal solution in real-time only with local observations. These overall trends are the same for other sizes of disturbances, as can be seen in Fig. 5: the LAG-ROS keeps more than a 90% success rate for all the disturbances when averaged over 50 simulations, with the $L^{2}$ control effort closest to that of the global solution.

C. Spacecraft Reconfiguration in Low Earth Orbit (LEO)
We finally consider the problem of spacecraft reconfiguration in LEO as an example of systems equipped with a known Lyapunov function. The dynamical system is presented in [40], accounting for $J_2$ and atmospheric drag perturbations. Since it can be expressed as a fully-actuated Lagrangian system [17], we can design a Lyapunov function by its inertia matrix [29] and use it for the LAG-ROS construction of Proposition 2 due to Theorem 2. The spacecraft communication radius for local observations is fixed, and the disturbance is bounded as in the previous examples.

As summarized in Table III, the LAG-ROS attains only a small increase in the objective value relative to the global solution, with its computational time kept less than $\Delta t$, whereas the computational time of the robust MPC is greater than $\Delta t$. The LAG-ROS also attains a decrease in the objective value when compared with the robust MPC, even though the success rate of the learning-based FF control is only 2.6% due to external disturbances. These results indicate that the LAG-ROS can be applied with known Lyapunov functions constructed utilizing a special structure of the underlying dynamical systems [29], [41] for the sake of a learning-based robustness guarantee. These trends are the same for other sizes of disturbances: the LAG-ROS keeps more than a 98% success rate for all the disturbances when averaged over 50 simulations, with the $L^{2}$ control effort closest to that of the global solution.

VI. CONCLUSION
In this work, we propose the LAG-ROS, a real-time implementable safe and optimal motion planner with a learning-based robustness guarantee. It utilizes imitation learning and contraction theory to model a tube-based robust feedback motion planner, constructed using an NCM (or a known Lyapunov function if available [29]) as a CLF. Simulation results demonstrate that it outperforms existing state-of-the-art real-time motion planners even in decentralized multi-agent settings with external disturbances.

Fig. 4. Spacecraft trajectories for baselines (a) [8] and (b) [16] and for (c) LAG-ROS, where (d) shows the global solution computed offline assuming each agent has access to global information. The LAG-ROS achieves the highest success rate with the control effort closest to that of the global solution.
ONTROL P ERFORMANCES FOR M ULTI - AGENT S PACECRAFT M OTION P LANNING ( AVERAGED OVER SIMULATIONS WITH sup (cid:107) d (cid:107) = . OTE THAT COMPUTATIONAL TIME SHOULD BE LESS THAN CONTROL TIME INTERVAL ∆ t = . S ) FOR REAL - TIME IMPLEMENTATION .Success rate out of 50 trials (%) L (cid:82) (cid:107) u ( t ) (cid:107) dt ) Computational time per time step (s)(a) Learning-based FF [8] 2 . . × (1 . × % of (d)) 8 . × − (b) Robust MPC [16] 4 . ×
10 7 . × (4 . × % of (d)) 2 . × − (c) LAG-ROS 9 . ×
10 2 . × (1 . × % of (d)) 9 . × − (d) Global solution 1 . × . × (1 . × % of (d)) 1 . × (offline computation)TABLE IIIC ONTROL P ERFORMANCES FOR M ULTI - AGENT S PACECRAFT R ECONFIGURATION IN
LEO (
AVERAGED OVER SIMULATIONS WITH sup (cid:107) d (cid:107) = . OTE THAT COMPUTATIONAL TIME SHOULD BE LESS THAN CONTROL TIME INTERVAL ∆ t = . S ) FOR REAL - TIME IMPLEMENTATION .Success rate out of 50 trials (%) L (cid:82) (cid:107) u ( t ) (cid:107) dt ) Computational time per time step (s)(a) Learning-based FF [8] 3 . . × (1 . × % of (d)) 7 . × − (b) Robust MPC [16] 1 . × . × (1 . × % of (d)) 1 . × − (c) LAG-ROS 1 . × . × (1 . × % of (d)) 8 . × − (d) Global solution 1 . × . × (1 . × % of (d)) 7 . × (offline computation) . . . . . x , t k d ( x , t ) k s u cce ss r a t e ( % ) (a)(b) (c)(d) . . . . . x , t k d ( x , t ) k L c on t r o l e ff o r t Fig. 5. Control performances versus sup x , t (cid:107) d ( x , t ) (cid:107) for multi-agentspacecraft motion planning (averaged over 50 simulations). (a)–(d) are asin Table II. other types of disturbances can be incorporated in thisframework using [23] for stochastic systems and [24] forparametric uncertain systems. Acknowledgments:
This work was funded by the JetPropulsion Laboratory, California Institute of Technology,and benefited from discussions with Julie Castillo-Rogez,Michel D. Ingham, and Jean-Jacques E. Slotine.R
EFERENCES[1] R. S. Sutton and A. G. Barto,
Reinforcement Learning: An Introduction. MIT Press, 1998.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[3] N. M. Boffi, S. Tu, N. Matni, J.-J. E. Slotine, and V. Sindhwani, "Learning stability certificates from data," in CoRL, Nov. 2020.
[4] D. Ho and J. C. Doyle, "Robust model-free learning and control without prior knowledge," in IEEE CDC, 2019, pp. 4577–4582.
[5] M. Everett, Y. F. Chen, and J. P. How, "Motion planning among dynamic, decision-making agents with deep reinforcement learning," in IEEE/RSJ IROS, 2018, pp. 3052–3059.
[6] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, "Safe model-based reinforcement learning with stability guarantees," in Adv. Neural Inf. Process. Syst., vol. 30. Curran Associates, Inc., 2017, pp. 908–918.
[7] J. Carius, F. Farshidian, and M. Hutter, "MPC-Net: A first principles guided policy search," IEEE Robot. Automat. Lett., vol. 5, no. 2, pp. 2897–2904, 2020.
[8] B. Rivière, W. Hönig, Y. Yue, and S.-J. Chung, "GLAS: Global-to-local safe autonomy synthesis for multi-robot motion planning with end-to-end learning," IEEE Robot. Automat. Lett., vol. 5, no. 3, pp. 4249–4256, 2020.
[9] J. Ho and S. Ermon, "Generative adversarial imitation learning," in Adv. Neural Inf. Process. Syst., D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, Eds., vol. 29. Curran Associates, Inc., 2016, pp. 4565–4573.
[10] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, "Social GAN: Socially acceptable trajectories with generative adversarial networks," in IEEE/CVF Conf. Comput. Vision Pattern Recognit., 2018, pp. 2255–2264.
[11] A. Kuefler, J. Morton, T. Wheeler, and M. Kochenderfer, "Imitating driver behavior with generative adversarial networks," in IEEE Intell. Vehicles Symp., 2017, pp. 204–211.

Fig. 6. Spacecraft reconfiguration in LEO with baselines (a) [8] & (b) [16] and for (c) LAG-ROS, where (d) shows the solution computed offline assuming global environment observations. The LAG-ROS achieves the highest success rate with the control effort closest to that of the global solution.

Fig. 7. Control performances versus sup_{x,t} ‖d(x,t)‖ for spacecraft reconfiguration in LEO (averaged over 50 simulations). (a)–(d) are as in Table III.

[12] S. Ross and D. Bagnell, "Efficient reductions for imitation learning," in Proc. Mach. Learn. Res., vol. 9, May 2010, pp. 661–668.
[13] A. Bemporad and M. Morari, "Robust model predictive control: A survey," in
Robustness in identification and control. London: Springer London, 1999, pp. 207–226.
[14] D. Mayne, M. Seron, and S. Raković, "Robust model predictive control of constrained linear systems with bounded disturbances," Automatica, vol. 41, no. 2, pp. 219–224, 2005.
[15] S. Singh, A. Majumdar, J.-J. E. Slotine, and M. Pavone, "Robust online motion planning via contraction theory and convex optimization," in IEEE ICRA, May 2017, pp. 5883–5890.
[16] D. Q. Mayne, E. C. Kerrigan, E. J. van Wyk, and P. Falugi, "Tube-based robust nonlinear model predictive control," Int. J. Robust Nonlinear Control, vol. 21, no. 11, pp. 1341–1353, 2011.
[17] D. Morgan, S.-J. Chung, and F. Y. Hadaegh, "Model predictive control of swarms of spacecraft using sequential convex programming," J. Guid. Control Dyn., vol. 37, no. 6, pp. 1725–1740, 2014.
[18] J. Schulman, Y. Duan, J. Ho, A. Lee, I. Awwal, H. Bradlow, J. Pan, S. Patil, K. Goldberg, and P. Abbeel, "Motion planning with sequential convex optimization and convex collision checking," Int. J. Robot. Res., vol. 33, no. 9, pp. 1251–1270, 2014.
[19] Y. K. Nakka, A. Liu, G. Shi, A. Anandkumar, Y. Yue, and S.-J. Chung, "Chance-constrained trajectory optimization for safe exploration and learning of nonlinear systems," IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 389–396, 2021.
[20] E. Sontag, "A Lyapunov-like characterization of asymptotic controllability," SIAM J. Control Optim., vol. 21, no. 3, pp. 462–471, 1983.
[21] E. D. Sontag, "A 'universal' construction of Artstein's theorem on nonlinear stabilization," Syst. Control Lett., vol. 13, no. 2, pp. 117–123, 1989.
[22] H. K. Khalil, Nonlinear Systems, 3rd ed. Prentice-Hall, 2002.
[23] H. Tsukamoto, S.-J. Chung, and J.-J. E. Slotine, "Neural stochastic contraction metrics for learning-based control and estimation," IEEE Control Syst. Lett., vol. 5, no. 5, pp. 1825–1830, 2021.
[24] ——, "Learning-based adaptive control via contraction theory," Submitted to IEEE Control Syst. Lett., Mar. 2021.
[25] H. Tsukamoto and S.-J. Chung, "Neural contraction metrics for robust estimation and control: A convex optimization approach," IEEE Control Syst. Lett., vol. 5, no. 1, pp. 211–216, 2021.
[26] W. Lohmiller and J.-J. E. Slotine, "On contraction analysis for nonlinear systems," Automatica, no. 6, pp. 683–696, 1998.
[27] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger, "Learning-based model predictive control: Toward safe learning in control,"
Annu. Rev. Control Robot. Auton. Syst., vol. 3, no. 1, pp. 269–296, 2020.
[28] G. Shi et al., "Neural lander: Stable drone landing control using learned dynamics," in IEEE ICRA, May 2019.
[29] J.-J. E. Slotine and W. Li, Applied Nonlinear Control. Upper Saddle River, NJ: Pearson, 1991.
[30] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, "Deep sets," in Adv. Neural Inf. Process. Syst., vol. 30. Curran Associates, Inc., 2017, pp. 3391–3401.
[31] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst. Man Cybern., vol. SMC-13, no. 5, pp. 834–846, 1983.
[32] Y. Nakka et al., "A six degree-of-freedom spacecraft dynamics simulator for formation control research," in Astrodyn. Specialist Conf., 2018.
[33] H. Tsukamoto and S.-J. Chung, "Robust controller design for stochastic nonlinear systems via convex optimization," IEEE Trans. Autom. Control, to appear, Oct. 2021.
[34] A. Isidori, Nonlinear Control Systems, 3rd ed. Berlin, Heidelberg: Springer-Verlag, 1995.
[35] J.-J. E. Slotine, "Modular stability tools for distributed computation and control," Int. J. Adapt. Control Signal Process., vol. 17, no. 6, pp. 397–416, 2003.
[36] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, Mar. 2004.
[37] S. Kolathaya and A. D. Ames, "Input-to-state safety with control barrier functions," IEEE Control Syst. Lett., vol. 3, no. 1, pp. 108–113, 2019.
[38] S. Diamond and S. Boyd, "CVXPY: A Python-embedded modeling language for convex optimization," J. Mach. Learn. Res., 2016.
[39] MOSEK ApS, MOSEK Optimizer API for Python 9.0.105, 2020.
[40] D. Morgan, S.-J. Chung, L. Blackmore, B. Acikmese, D. Bayard, and F. Y. Hadaegh, "Swarm-keeping strategies for spacecraft under J2 and atmospheric drag perturbations," J. Guid. Control Dyn., vol. 35, no. 5, pp. 1492–1506, 2012.
[41] I. R. Manchester and J.-J. E. Slotine, "Control contraction metrics: Convex and intrinsic criteria for nonlinear feedback design," IEEE Trans. Autom. Control, vol. 62, no. 6, pp. 3046–3053, Jun. 2017.