Learning to Satisfy Unknown Constraints in Iterative MPC
Monimoy Bujarbaruah†, Charlott Vallon†, and Francesco Borrelli

June 11, 2020

Abstract
We propose a control design method for linear time-invariant systems that iteratively learns to satisfy unknown polyhedral state constraints. At each iteration of a repetitive task, the method constructs an estimate of the unknown environment constraints using collected closed-loop trajectory data. This estimated constraint set is improved iteratively upon collection of additional data. An MPC controller is then designed to robustly satisfy the estimated constraint set. This paper presents the details of the proposed approach, and provides robust and probabilistic guarantees of constraint satisfaction as a function of the number of executed task iterations. We demonstrate the efficacy of the proposed framework in a detailed numerical example.
1 Introduction

Data-driven decision making and control has garnered significant attention in recent times [1–4]. As such approaches are increasingly being deployed in automated systems [5–8], the satisfaction of safety requirements is of utmost importance.
Safety is often represented as containment of system states (or outputs) within a pre-defined constraint set over all possible time evolutions of the considered system. Such constraint sets define the safe environment for the system, in which the system is allowed to evolve during execution of a control task. Various control methods exist for ensuring system safety during a control task execution [9–11].

In the additional presence of uncertainty in the system model, data-driven methods have been used to quantify and bound the uncertainty in order to ensure system safety either robustly [12–15], or with high probability [16–19]. The majority of existing methods assume that the environment constraints are known to the control designer. If the environment constraints are unknown, data-driven methods can be used to first learn the unknown constraints [20–22] and then design safe controllers using one of the previous methods. However, these approaches assume a perfectly known system model, not subject to any disturbances. The literature on safe, data-driven controller design in the presence of uncertainties in both the system model and the constraint set is rather limited. In particular, such methods typically are unable to quantify the probability of the system failing to satisfy the true environment constraints.

In this paper we propose an algorithm to design a safe controller for an uncertain system while learning polyhedral state constraints. Specifically, we consider a linear, time-invariant system with known system matrices, subject to an additive disturbance, performing an iterative task. The environment constraints of the task are assumed polyhedral, characterized by a set of hyperplanes, some of which are unknown to the control designer.

†These authors contributed equally to this work. Emails: {monimoyb, charlott, fborrelli}@berkeley.edu.
We assume that violations of the unknown constraints can be directly measured or observed from closed-loop state trajectories. Our algorithm iteratively constructs estimates of the unknown constraints using collected system trajectories. These estimates are then used to design a robust MPC controller [13, 23] for safely achieving the control task despite the uncertainty.

The main contributions of this paper are as follows:

• Given a user-specified upper bound $\epsilon$ on the probability of violating the true constraint set $\mathcal{Z}$ within any $j$-th task iteration, we construct constraint estimates $\hat{\mathcal{Z}}^j$ from previously collected closed-loop task data, using convex hull operations (for $\epsilon = 0$) or a Support Vector Machine (SVM) classifier (for $\epsilon \in (0, 1)$). We then design an MPC controller that robustly satisfies $\hat{\mathcal{Z}}^j$ along the $j$-th iteration, for all possible additive disturbance values.

• When $\hat{\mathcal{Z}}^j$ is formed with the SVM classification approach (for $\epsilon \in (0, 1)$), we show that after a finite number of successful task iterations, $\hat{\mathcal{Z}}^j$ is deemed safe with respect to $\epsilon$. Here, "successful task iterations" refers to closed-loop trajectories satisfying the unknown constraints $\mathcal{Z}$.

• When $\hat{\mathcal{Z}}^j$ is formed using the convex hull approach (for $\epsilon = 0$), we show how to design a robust MPC that provides satisfaction of the true constraints $\mathcal{Z}$ at all future iterations $k \geq j$.

The remainder of the paper is organized as follows: In Section 2 we formulate the robust optimization problem to be solved in each iteration, and define the inherent system dynamics along with state and input constraints. In Section 3 the MPC optimization problem is presented along with a definition of Iteration Failure under unknown (or partially known) state constraints. Section 4 delineates the control design requirements while finding approximations of the unknown constraints and consequently presents the associated algorithms. Finally, we present detailed numerical simulations corroborating our results in Section 5.
2 Problem Formulation

We consider linear time-invariant systems of the form:
\[
x_{t+1} = A x_t + B u_t + w_t, \tag{1}
\]
where $x_t \in \mathbb{R}^n$ is the state at time $t$, $u_t \in \mathbb{R}^m$ is the input, and $A$ and $B$ are known system matrices. At each time step $t$, the system is affected by an independently and identically distributed (i.i.d.) random disturbance $w_t$ with a known polytopic support $\mathbb{W} \subset \mathbb{R}^n$. We define $H_x \in \mathbb{R}^{s \times n}$, $h_x \in \mathbb{R}^s$, $H_u \in \mathbb{R}^{o \times m}$, and $h_u \in \mathbb{R}^o$, and formulate the state and input constraints imposed by the task environment for all time steps $t \geq 0$ as:
\[
\mathcal{Z} := \{ (x, u) : H_x x \leq h_x,\; H_u u \leq h_u \}. \tag{2}
\]
Throughout the paper, we assume that system (1) performs the same task repeatedly, with each task execution referred to as an iteration. Our goal is to design a controller that, at each iteration $j$, aims to solve the following finite horizon robust optimal control problem:
\[
\begin{aligned}
V^{j,\star}(x_S) = \min_{u_0^j, u_1^j(\cdot), \ldots} \quad & \sum_{t=0}^{T-1} \ell\big(\bar{x}_t^j, u_t^j(\bar{x}_t^j)\big) \\
\text{s.t.} \quad & x_{t+1}^j = A x_t^j + B u_t^j(x_t^j) + w_t^j, \\
& \bar{x}_{t+1}^j = A \bar{x}_t^j + B u_t^j(\bar{x}_t^j), \\
& H_x x_t^j \leq h_x, \; \forall w_t^j \in \mathbb{W}, \\
& H_u u_t^j \leq h_u, \; \forall w_t^j \in \mathbb{W}, \\
& x_0^j = x_S, \quad t = 0, 1, \ldots, T-1,
\end{aligned} \tag{3}
\]
where $x_t^j$, $u_t^j$ and $w_t^j$ denote the realized system state, control input and disturbance at time $t$ of the $j$-th iteration, respectively. The pair $(\bar{x}_t^j, u_t^j(\bar{x}_t^j))$ denotes the disturbance-free nominal state and corresponding nominal input. The optimal control problem (3) minimizes the nominal cost over a time horizon of length $T \gg 0$ during the $j$-th iteration, with $j \in \{1, 2, \ldots\}$. The state and input constraints must be robustly satisfied for all uncertain realizations. The optimal control problem (3) consists of finding $[u_0^j, u_1^j(\cdot), u_2^j(\cdot), \ldots]$, where $u_t^j : \mathbb{R}^n \ni x_t^j \mapsto u_t^j = u_t^j(x_t^j) \in \mathbb{R}^m$ are state feedback policies.

In this work we consider constraints of the form:
\[
H_x = \begin{bmatrix} H_x^b \\ H_x^{ub} \end{bmatrix}, \quad h_x = \begin{bmatrix} h_x^b \\ h_x^{ub} \end{bmatrix},
\]
where the superscripts $\{b, ub\}$ denote the known and unknown parts of the constraints, respectively. That is to say, we consider a scenario in which we only know a subset of the system's environment constraint set. At the beginning of the $j$-th task iteration we construct approximations of $H_x$ and $h_x$, denoted as $\hat{H}_x^j$ and $\hat{h}_x^j$, respectively, using closed-loop trajectories of the system from previous task iterations. The estimated constraints form a safe set estimate $\hat{\mathcal{Z}}^j$:
\[
\hat{\mathcal{Z}}^j := \{ (x, u) : \hat{H}_x^j x \leq \hat{h}_x^j,\; H_u u \leq h_u \}. \tag{4}
\]
These estimates are refined iteratively using new data as the system continues to perform the task, and are used to solve an estimate of (3). The construction of the safe set estimates is detailed in Section 4.

3 MPC Problem Formulation

For computational tractability when considering task duration $T \gg 0$, we try to approximate a solution to the optimal control problem (3) by solving a simpler constrained optimal control problem with prediction horizon $N \ll T$ in a receding horizon fashion.

3.1 Problem Definition

Since the true constraint set $\mathcal{Z}$ is not completely known, we use our estimate $\hat{\mathcal{Z}}^j$ built from data and formulate the robust optimal control problem as:
\[
\begin{aligned}
V^{\mathrm{MPC},j}_{t \to t+N}(x_t^j, \hat{\mathcal{Z}}^j, \hat{\mathcal{X}}_N^j) := \min_{U_t^j(\cdot)} \quad & \sum_{k=t}^{t+N-1} \ell(\bar{x}_{k|t}^j, v_{k|t}^j) + Q(\bar{x}_{t+N|t}^j) \\
\text{s.t.} \quad & x_{k+1|t}^j = A x_{k|t}^j + B u_{k|t}^j + w_{k|t}^j, \\
& \bar{x}_{k+1|t}^j = A \bar{x}_{k|t}^j + B v_{k|t}^j, \\
& u_{k|t}^j = \sum_{l=t}^{k-1} M_{k,l|t}^j w_{l|t}^j + v_{k|t}^j, \\
& \hat{H}_x^j x_{k|t}^j \leq \hat{h}_x^j, \\
& H_u u_{k|t}^j \leq h_u, \\
& x_{t|t}^j = \bar{x}_{t|t}^j, \\
& x_{t+N|t}^j \in \hat{\mathcal{X}}_N^j, \\
& \forall w_{k|t}^j \in \mathbb{W}, \; \forall k \in \{t, \ldots, t+N-1\},
\end{aligned} \tag{5}
\]
where, in the $j$-th iteration, $x_t^j$ is the measured state at time $t$, and $x_{k|t}^j$ is the predicted state at time $k$, obtained by applying the predicted input policies $[u_{t|t}^j, \ldots, u_{k-1|t}^j]$ to system (1). We denote the disturbance-free nominal state and corresponding input as $\{\bar{x}_{k|t}^j, v_{k|t}^j\}$, with $v_{k|t}^j = u_{k|t}^j(\bar{x}_{k|t}^j)$. The MPC controller minimizes, over the predicted nominal trajectory $\big\{\{\bar{x}_{k|t}^j, v_{k|t}^j\}_{k=t}^{t+N-1}, \bar{x}_{t+N|t}^j\big\}$, a cost comprised of a positive definite stage cost $\ell(\cdot, \cdot)$ and a terminal cost $Q(\cdot)$. We note that the above formulation uses an affine disturbance feedback parameterization [23] of the input policies. We use state feedback to construct a terminal set $\hat{\mathcal{X}}_N^j = \{ x \in \mathbb{R}^n : \hat{Y}^j x \leq \hat{z}^j,\; \hat{Y}^j \in \mathbb{R}^{r_j \times n},\; \hat{z}^j \in \mathbb{R}^{r_j} \}$, which is the $(T - N)$-step robust reachable set [11, Chapter 10] to the set of state constraints in (4).
Specifically, this set has the properties:
\[
\hat{\mathcal{X}}_N^j \subseteq \{ x \mid (x, Kx) \in \hat{\mathcal{Z}}^j \},
\]
\[
\begin{aligned}
& \hat{H}_x^j \Big( (A+BK)^i x + \sum_{\tilde{i}=0}^{i-1} (A+BK)^{i-\tilde{i}-1} w_{\tilde{i}} \Big) \leq \hat{h}_x^j, \\
& H_u K \Big( (A+BK)^i x + \sum_{\tilde{i}=0}^{i-1} (A+BK)^{i-\tilde{i}-1} w_{\tilde{i}} \Big) \leq h_u, \\
& \forall x \in \hat{\mathcal{X}}_N^j, \; \forall w_i \in \mathbb{W}, \; \forall i = 1, 2, \ldots, (T - N). \tag{6}
\end{aligned}
\]
After solving (5) at time step $t$ of the $j$-th iteration, we apply
\[
u_t^j = v_{t|t}^{j,\star} \tag{7}
\]
to system (1). We then solve the problem (5) again at the next $(t+1)$-th time step, yielding a receding horizon strategy.

We note that computing the set $\hat{\mathcal{X}}_N^j$ in (6) at each iteration can be computationally expensive. In such cases one can opt for data-driven methods such as [24, 25] or simple approximation methods such as [26, 27] to construct these terminal sets.

Assumption 1 (Well-Posedness of Task). We assume that given an initial task state $x_S$, the optimization problem (5) is feasible at all times $0 \leq t \leq T-1$ for the true constraint set $\hat{\mathcal{Z}}^j = \mathcal{Z}$ as defined in (2), for all iterations $j \in \{1, 2, \ldots\}$. We further assume that $0_{n \times 1} \in \mathcal{Z}$.

At each iteration, the true constraint set $\mathcal{Z}$ is unknown and is being estimated with $\hat{\mathcal{Z}}^j$ built from data. Depending on how $\hat{\mathcal{Z}}^j$ is constructed, robust satisfaction of the true constraints (2) during an iteration may not be guaranteed. It is thus possible that (2) becomes infeasible at some point while solving (5) during $0 \leq t \leq T-1$ of the $j$-th iteration. We formalize this with the following definition:

Definition 1 (Successful Iteration). A Successful Iteration for an iteration $j$ is defined as the event
\[
[\mathrm{SI}]^j : \; H_x x_t^j \leq h_x, \; \forall t \in [0, T]. \tag{8}
\]
That is, an iteration is successful if there are no state constraint violations during $0 \leq t \leq T$. Otherwise, the iteration is deemed failed; that is, an Iteration Failure event is implicitly defined as $[\mathrm{IF}]^j = ([\mathrm{SI}]^j)^c$, where $(\cdot)^c$ denotes the complement of an event.

The probability of a Successful Iteration $[\mathrm{SI}]^j$ is a function of the sets $\hat{\mathcal{Z}}^j$.
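Definition 1 reduces to a pointwise check of the true state constraints along a realized closed-loop trajectory. The sketch below simulates system (1) under a fixed state-feedback gain standing in for the MPC policy (7); the system matrices, the gain, and the box constraints are assumed example values, not those used later in the paper.

```python
import numpy as np

# Sketch of checking the Successful Iteration event [SI]^j of Definition 1.
# All numbers are assumed for illustration: a 2-state example system, a
# hypothetical stabilizing gain K standing in for the MPC policy, and a
# box state-constraint set {x : H_x x <= h_x}.
rng = np.random.default_rng(0)

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.4, -1.2]])                 # hypothetical feedback gain
H_x = np.vstack([np.eye(2), -np.eye(2)])     # encodes |x_1| <= 10, |x_2| <= 10
h_x = 10.0 * np.ones(4)

def run_iteration(x0, T=10, w_scale=0.1):
    """Simulate x_{t+1} = A x_t + B u_t + w_t with u_t = K x_t; return x_0..x_T."""
    x, traj = x0, [x0]
    for _ in range(T):
        w = rng.uniform(-w_scale, w_scale, size=2)   # i.i.d. disturbance in W
        x = A @ x + B @ (K @ x) + w
        traj.append(x)
    return np.array(traj)

def is_successful_iteration(traj, H, h):
    """Event [SI]^j: H x_t <= h for every t in [0, T]."""
    return bool(np.all(traj @ H.T <= h))

traj = run_iteration(np.array([-5.0, 0.0]))
print(is_successful_iteration(traj, H_x, h_x))
```

An Iteration Failure $[\mathrm{IF}]^j$ is then simply the complement of this check on the same trajectory.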
Our aim is not only to keep the probability of $[\mathrm{IF}]^j$ low along each iteration, but also to maintain satisfactory controller performance in terms of cost during successful iterations. Let the closed-loop cost of a successful iteration $j$ under observed disturbance samples $\mathbf{w}^j$ be denoted by
\[
\hat{V}^j(x_S, \mathbf{w}^j) = \sum_{t=0}^{T-1} \ell(x_t^j, v_{t|t}^{j,\star}),
\]
where the notation $\mathbf{w}^j$ denotes $[w_0^j, w_1^j, \ldots, w_{T-1}^j]$. We use the average closed-loop cost $\mathbb{E}[\hat{V}^j(x_S, \mathbf{w}^j)]$ to quantify controller performance. Specifically, our goal is to lower the iteration performance loss, defined as
\[
[\mathrm{PL}]^j = \mathbb{E}[\hat{V}^j(x_S, \mathbf{w}^j)] - \mathbb{E}[V^\star(x_S, \mathbf{w}^j)], \tag{9}
\]
where $\mathbb{E}[V^\star(x_S, \mathbf{w}^j)]$ denotes the average closed-loop cost of an iteration if $\mathcal{Z}$ had been known, i.e., if $\hat{\mathcal{Z}}^j = \mathcal{Z}$ for all $j \in \{1, 2, \ldots\}$. To formalize this joint focus on obtaining a low probability of Iteration Failures while maintaining satisfactory controller performance, we summarize our control design objectives as:

(C1) Design a closed-loop MPC control law (7) which ensures that the system (1) maintains a user-specified upper bound on the probability of Iteration Failure $[\mathrm{IF}]^j$ (8), for all iterations $j \in \{1, 2, \ldots\}$.

(C2) Minimize $[\mathrm{PL}]^j$ (as defined in (9)) at each iteration $j \in \{1, 2, \ldots\}$, while satisfying (C1).

However, as we start the control task from scratch without assuming the initial availability of a large number of trajectory data samples, and as it is difficult in general to obtain statistical properties of the estimated constraint sets $\hat{\mathcal{Z}}^j$, methods such as [28–30] cannot be used to satisfy (C1)-(C2) directly. We therefore relax the above two specifications and formulate two control design specifications (D1) and (D2) in the next section.
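The performance-loss metric (9) can be estimated empirically by averaging closed-loop costs over sampled disturbance sequences, as is done in the numerical study of Section 5. The sketch below illustrates the mechanics on a hypothetical one-state example, with two fixed feedback gains standing in for the controller under $\hat{\mathcal{Z}}^j$ and for the oracle controller that knows $\mathcal{Z}$; all numbers are assumed.

```python
import numpy as np

# Sketch of a Monte Carlo estimate of the performance loss [PL]^j in (9).
# `gain` is a hypothetical stand-in for a full controller: one value plays
# the role of the estimated-constraint controller, the other the oracle.
rng = np.random.default_rng(1)
T, n_mc = 10, 100

def closed_loop_cost(w_seq, gain):
    """Quadratic closed-loop cost of a 1-state example x_{t+1} = x_t + u_t + w_t."""
    x, cost = -5.0, 0.0
    for w in w_seq:
        u = gain * x
        cost += x**2 + 2.0 * u**2          # stage cost l(x_t, u_t)
        x = x + u + w
    return cost

draws = rng.uniform(-0.1, 0.1, size=(n_mc, T))                     # 100 draws of w^j
V_hat = np.mean([closed_loop_cost(w, gain=-0.6) for w in draws])   # assumed estimated-constraint controller
V_star = np.mean([closed_loop_cost(w, gain=-0.8) for w in draws])  # assumed oracle controller
perf_loss = V_hat - V_star                                         # empirical [PL]^j
```

The same sample-average pattern is what Section 5 uses to report normalized costs over 100 disturbance draws.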
4 Control Design

We consider the following design specifications:

(D1) Design a closed-loop MPC control law (7) which ensures that the system (1) maintains a user-specified upper bound $\epsilon$ on the probability of Iteration Failure, after some iteration $\bar{j} \in \{1, 2, \ldots\}$.

(D2) Minimize $[\mathrm{PL}]^j$ (as defined in (9)) after some iteration $\bar{j} \in \{1, 2, \ldots\}$.

We wish to find the smallest index $\bar{j}$ such that (D1) and (D2) are satisfied for all $j \geq \bar{j}$. The design specifications (D1)-(D2) indicate that the approach to construct estimated state constraint sets proposed in this paper is our best possible attempt to satisfy (C1)-(C2), given the information available at each iteration $j$.

Assumption 2 (Feasibility Classification). Given a system state trajectory, we assume that a classifier is available to check the feasibility of each point in the trajectory based on whether it satisfies the true state constraints in (2). This classifier returns a corresponding sequence of feasibility flags.
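A minimal sketch of the classifier assumed in Assumption 2: it labels each trajectory point by whether it satisfies the true state constraints (2). The constraint matrices below are assumed example data, with the last row playing the role of a hyperplane unknown to the control designer.

```python
import numpy as np

# Sketch of the feasibility classifier of Assumption 2: return one flag per
# trajectory point (True = satisfies the true constraints H_x x <= h_x).
# The numbers are assumed example data; the last row of H_x mimics an
# "unknown" halfspace that only this classifier can evaluate.
H_x = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [-1.0, 1.0]])
h_x = np.array([10.0, 10.0, 10.0, 10.0, 4.0])

def feasibility_flags(traj, H=H_x, h=h_x):
    """Return a sequence of feasibility flags, one per trajectory point."""
    return [bool(np.all(H @ x <= h)) for x in traj]

traj = np.array([[0.0, 0.0], [-2.0, 3.0], [-8.0, 1.0]])
print(feasibility_flags(traj))   # → [True, False, False]
```

These flags are exactly the labels used to train the SVM classifier of Section 4.1 and to select the hull points of Section 4.2.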
Assumption 3 (Simulator). We assume that each iteration is run until completion at time $T$, and that state constraint satisfaction as described in Assumption 2 is checked only at the end of the simulation.

We note that Assumption 3 could be relaxed in several ways. For example, constraint satisfaction could be checked in real time and the simulations stopped if violations occur. One could also run physical experiments and check the feasibility of (2) in real time, by observing if the physical experiment fails. Some constraint violations may be hard to evaluate during physical experiments, but this discussion goes beyond the scope of this paper.

4.1 Constructing the Estimated Constraint Sets $\hat{\mathcal{Z}}^j$

We now show how the estimated constraint sets $\hat{\mathcal{Z}}^j$ are constructed in order to satisfy the design specifications (D1) and (D2). This process depends on the user-specified upper bound $\epsilon$ on the probability of Iteration Failure. To satisfy (D1) we search for the smallest $\bar{j}$, such that
\[
\mathbb{P}([\mathrm{IF}]^j) \leq \epsilon, \tag{10}
\]
for all $j \geq \bar{j}$, where $\epsilon \in (0, 1)$ is the bound on the probability of Iteration Failure. At the start of the first iteration, $j = 1$, we use only the known information about the imposed constraints:
\[
\hat{\mathcal{Z}}^1 := \{ (x, u) : H_x^b x \leq h_x^b,\; H_u u \leq h_u \}. \tag{11}
\]
Next, consider any $j \in \{1, 2, \ldots\}$. Let the closed-loop realized states collected until the end of the $j$-th iteration be
\[
\mathbf{x}^j = [\mathbf{x}_0^j, \mathbf{x}_1^j, \ldots, \mathbf{x}_T^j], \tag{12}
\]
where $\mathbf{x}_i^j \in \mathbb{R}^{n \times j}$ is a matrix containing all states corresponding to time step $i$ from the first $j$ iterations. Let $f^j(x) : \mathbb{R}^n \mapsto \mathbb{R}$ denote a curve that separates the points in (12) according to whether they satisfy all true state constraints in (2), such that $f^j(0_{n \times 1}) \leq 0$. Based on Assumption 2, such a binary classification curve can be obtained with supervised learning techniques. In this paper we use a kernelized Support Vector Machine algorithm [31, Chapter 12].

Let a polyhedral inner approximation of the intersection of $\{x : f^j(x) \leq 0\}$ and the known state constraints be given by:
\[
\hat{\mathcal{P}}^{j+1}_{\mathrm{svm}} = \{ x : \hat{H}^{j+1}_{x,\mathrm{svm}}\, x \leq \hat{h}^{j+1}_{x,\mathrm{svm}} \} \subseteq \{ x : f^j(x) \leq 0 \} \cap \{ x : H_x^b x \leq h_x^b \}. \tag{13}
\]
We then use (13) to form the constraint set estimates for the following iteration:
\[
\hat{\mathcal{Z}}^{j+1}_{\mathrm{svm}} := \{ (x, u) : \hat{H}^{j+1}_{x,\mathrm{svm}} x \leq \hat{h}^{j+1}_{x,\mathrm{svm}},\; H_u u \leq h_u \}, \tag{14}
\]
setting $\hat{\mathcal{Z}}^{j+1} = \hat{\mathcal{Z}}^{j+1}_{\mathrm{svm}}$ in our robust optimization problem (5) for $j \in \{1, 2, \ldots\}$. In other words, at each iteration $j > 1$, the estimated state constraints in $\hat{\mathcal{Z}}^j$ are formed from the SVM classification boundary learned from all previous state trajectories, intersected with the known state constraints.

Remark 1. In case the set $\hat{\mathcal{Z}}^{j+1}_{\mathrm{svm}}$ in (14) yields either infeasibility of (5) or an empty terminal set $\hat{\mathcal{X}}^{j+1}_N$ for any iteration $j \in \{1, 2, \ldots\}$, the set of estimated state constraints can be scaled appropriately until feasibility of (5) is obtained. Such scaling is not further analyzed in the remaining sections of this paper.

Since the estimated constraint sets (14) are not necessarily inner approximations of the true unknown constraints (2), the closed-loop state trajectories in future iterations may result in Iteration Failures with a nonzero probability. In the following proposition we quantify the probability of an Iteration Failure, given a $\hat{\mathcal{Z}}^{\bar{j}}$, for some $\bar{j} \in \{1, 2, \ldots\}$.

Proposition 1.
Consider $\hat{\mathcal{Z}}^{\bar{j}} = \hat{\mathcal{Z}}^1$ from (11) for $\bar{j} = 1$, or a constraint estimate set $\hat{\mathcal{Z}}^{\bar{j}}_{\mathrm{svm}}$ from (14) formed using trajectories up to iteration $\bar{j} - 1$ for $\bar{j} > 1$. Let this set $\hat{\mathcal{Z}}^{\bar{j}}_{\mathrm{svm}}$ be used as the constraint estimate set for the next $N_{it}$ task iterations, beginning with iteration $\bar{j}$. If, for a chosen $\epsilon \in (0, 1)$ and $0 < \beta \ll 1$, Successful Iterations are obtained for the next
\[
N_{it} \geq \frac{\ln(1/\beta)}{\ln(1/(1-\epsilon))}
\]
iterations, then $\mathbb{P}([\mathrm{IF}]^j) \leq \epsilon$ with confidence at least $1 - \beta$ for all subsequent task iterations $j \geq \bar{j}$ using $\hat{\mathcal{Z}}^j = \hat{\mathcal{Z}}^{\bar{j}}_{\mathrm{svm}}$.

Proof. See Appendix.

Proposition 1 requires that the polytope $\hat{\mathcal{P}}^{j+1}_{\mathrm{svm}}$ for $j \in \{1, 2, \ldots\}$ is updated only if new violation points for constraints (2) are seen at the end of an iteration $j$. This update strategy is highlighted in Algorithm 1. If no violations are seen for $N_{it}$ successive iterations, a probabilistic safety certificate is provided and Algorithm 1 is terminated. Approximation techniques are elaborated in Section 5.

4.2 Safety vs Performance Trade-Off

Proposition 1 proves that constructing estimated constraint sets as per (14) can result in satisfaction of (10) for some $\epsilon \in (0, 1)$. The user may, however, require $\epsilon = 0$. In such cases, we can utilize the closed-loop system trajectories for obtaining guaranteed inner approximations of (2), so that $\mathbb{P}([\mathrm{IF}]^j) = 0$ for all future iterations $j \geq \bar{j}$, for some $\bar{j}$ to be determined.

Recalling (12), let the closed-loop realized states collected until the end of the $j$-th iteration be denoted as
\[
\mathbf{x}^j = [\mathbf{x}_0^j, \mathbf{x}_1^j, \ldots, \mathbf{x}_T^j], \tag{15}
\]
and let $\hat{\mathbf{x}}^j$ denote the collection of states from (15) which satisfy all true state constraints in (2). Then an inner approximation of the state constraints in (2) is provided by the polyhedron:
\[
\hat{\mathcal{P}}^{j+1}_{\mathrm{cvx}} = \{ x : \hat{H}^{j+1}_{x,\mathrm{cvx}} x \leq \hat{h}^{j+1}_{x,\mathrm{cvx}} \} = \mathrm{conv}([0_{n \times 1}, \hat{\mathbf{x}}^j]), \tag{16}
\]
where $\mathrm{conv}(\cdot)$ denotes the convex hull operator.
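The convex hull construction (16) can be sketched as follows, using `scipy.spatial.ConvexHull` to recover a halfspace representation $\{x : Hx \leq h\}$ from sampled feasible states (assumed example data) together with the origin, which lies in $\mathcal{Z}$ by Assumption 1:

```python
import numpy as np
from scipy.spatial import ConvexHull

# Sketch of (16): build the halfspace form of conv([0, x-hat]) from feasible
# closed-loop states. The sample states below are assumed example data.
feasible_states = np.array([[4.0, 0.5], [-3.0, 1.0], [1.0, -2.0], [2.5, 2.0]])
points = np.vstack([np.zeros((1, 2)), feasible_states])   # conv([0_{n x 1}, x-hat])

hull = ConvexHull(points)
# scipy stores each facet as A x + b <= 0, so H = A and h = -b.
H_cvx = hull.equations[:, :-1]
h_cvx = -hull.equations[:, -1]

def in_hull(x, tol=1e-9):
    """Membership test for the polytopic inner approximation P-hat_cvx."""
    return bool(np.all(H_cvx @ x <= h_cvx + tol))

print(in_hull(np.array([0.5, 0.0])), in_hull(np.array([10.0, 10.0])))
```

Because every generating point satisfies the convex true constraints, the resulting polytope is an inner approximation of them, which is the property Proposition 2 below relies on.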
We can now define
\[
\hat{\mathcal{Z}}^{j+1}_{\mathrm{cvx}} := \{ (x, u) : \hat{H}^{j+1}_{x,\mathrm{cvx}} x \leq \hat{h}^{j+1}_{x,\mathrm{cvx}},\; H_u u \leq h_u \}, \tag{17}
\]
and use $\hat{\mathcal{Z}}^j = \hat{\mathcal{Z}}^j_{\mathrm{cvx}}$ for $j \in \{2, 3, \ldots\}$ in (5) as a robust alternative to (14).

Proposition 2. If $\hat{\mathcal{Z}}^{\bar{j}}_{\mathrm{cvx}}$ in (17) yields feasibility of (5) for some $\bar{j} \in \{2, 3, \ldots\}$, then $\hat{\mathcal{Z}}^{\bar{j}}_{\mathrm{cvx}} \subseteq \mathcal{Z}$, and $\hat{\mathcal{Z}}^j_{\mathrm{cvx}} = \hat{\mathcal{Z}}^{\bar{j}}_{\mathrm{cvx}}$ for all $j \geq \bar{j}$.

Proof. Let the closed-loop realized states collected until the end of the $(\bar{j} - 1)$-th iteration be
\[
\mathbf{x}^{\bar{j}-1} = [\mathbf{x}_0^{\bar{j}-1}, \mathbf{x}_1^{\bar{j}-1}, \ldots, \mathbf{x}_T^{\bar{j}-1}], \tag{18}
\]
and let $\hat{\mathbf{x}}^{\bar{j}-1}$ be the collection of all trajectory points in (18) that satisfy the state constraints in (2). Following (16), we form $\hat{\mathcal{P}}^{\bar{j}}_{\mathrm{cvx}} = \{ x : \hat{H}^{\bar{j}}_{x,\mathrm{cvx}} x \leq \hat{h}^{\bar{j}}_{x,\mathrm{cvx}} \}$ as
\[
\hat{\mathcal{P}}^{\bar{j}}_{\mathrm{cvx}} = \mathrm{conv}([0_{n \times 1}, \hat{\mathbf{x}}^{\bar{j}-1}]).
\]
By the convexity of the true unknown state constraints (2) and Assumption 1, we have that $\hat{\mathcal{P}}^{\bar{j}}_{\mathrm{cvx}} \subseteq \{ x : H_x x \leq h_x \}$. This implies $\hat{\mathcal{Z}}^{\bar{j}}_{\mathrm{cvx}} \subseteq \mathcal{Z}$. Furthermore, since (5) is feasible at time $t = 0$ in iteration $\bar{j}$, (5) remains feasible with system (1) in closed loop with the MPC controller (7) at all future times $t \leq T - 1$. (This recursive feasibility property is stated without proof; interested readers can refer to the standard detailed proofs in [11, Chapter 12].) It follows that $x_t^{\bar{j}} \in \hat{\mathcal{P}}^{\bar{j}}_{\mathrm{cvx}}$ for all $0 \leq t \leq T$, which implies $\hat{\mathcal{P}}^{\bar{j}+1}_{\mathrm{cvx}} = \hat{\mathcal{P}}^{\bar{j}}_{\mathrm{cvx}}$ from (16). Extending this argument, we can similarly prove $\hat{\mathcal{P}}^j_{\mathrm{cvx}} = \hat{\mathcal{P}}^{\bar{j}}_{\mathrm{cvx}}$ for all $j > \bar{j}$, which implies $\hat{\mathcal{Z}}^j_{\mathrm{cvx}} = \hat{\mathcal{Z}}^{\bar{j}}_{\mathrm{cvx}}$ for all $j > \bar{j}$. This completes the proof.

Thus, if $\bar{j}$ is the first iteration index for which (17) yields feasibility of (5), then the probability of Iteration Failure at iteration $j$ is exactly 0 for all $j \geq \bar{j}$. Moreover, Proposition 2 suggests that after the $\bar{j}$-th iteration, the constraint estimation update (17) can be terminated. The update strategy (17) strictly ensures that $\hat{\mathcal{Z}}^j_{\mathrm{cvx}} \subseteq \mathcal{Z}$ for all $j \in \{2, 3, \ldots\}$, which is not necessarily true for sets obtained using the SVM method (14).
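For contrast with the convex hull approach, the SVM-based construction of Section 4.1 can be sketched as below: fit a kernelized classifier to visited states labeled with the feasibility flags of Assumption 2, then inner-approximate the learned feasible region by a polytope, here via a convex hull of classifier-feasible samples, mirroring how the polytopes are generated in Section 5. The data, the hidden true constraint, and the SVM hyperparameters are all assumed for illustration.

```python
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.svm import SVC

# Sketch of the SVM-based estimate (13)-(14) on assumed example data.
rng = np.random.default_rng(2)

# Visited states, labeled by a hidden true constraint x_1 + x_2 <= 4
# (the classifier never sees the constraint itself, only the labels).
states = rng.uniform(-6.0, 6.0, size=(200, 2))
labels = (states[:, 0] + states[:, 1] <= 4.0).astype(int)   # 1 = feasible

clf = SVC(kernel="rbf", gamma=0.5).fit(states, labels)      # kernelized SVM f^j

# Sample test points and keep those the classifier labels feasible,
# i.e. points with f^j(x) <= 0; their hull gives a polyhedral estimate.
test_pts = rng.uniform(-6.0, 6.0, size=(1000, 2))
feasible_pts = test_pts[clf.predict(test_pts) == 1]

hull = ConvexHull(feasible_pts)          # polyhedral estimate P-hat_svm
H_svm = hull.equations[:, :-1]
h_svm = -hull.equations[:, -1]
```

Unlike (16), this polytope need not lie inside the true constraint set, which is why the certificate of Proposition 1 is needed: the number of consecutive violation-free iterations required is $N_{it} \geq \ln(1/\beta)/\ln(1/(1-\epsilon))$, e.g., $\epsilon = 0.3$ and $\beta = 0.01$ give $N_{it} = \lceil 12.92 \rceil = 13$.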
However, choosing this robust constraint estimation can increase the performance loss (9) over successful iterations after $j \geq \bar{j}$. This is the safety vs. performance trade-off, which the user can manage with an appropriate choice of $\epsilon$.

Remark 2. Following Remark 1, if the optimization problem (5) is infeasible or the terminal set $\hat{\mathcal{X}}^j_N$ constructed in (6) using the estimate $\hat{\mathcal{Z}}^j_{\mathrm{cvx}}$ is empty, one can switch to the constraint estimates (14) and collect additional trajectory data, since $\hat{\mathcal{P}}^{j_1}_{\mathrm{cvx}} \subseteq \hat{\mathcal{P}}^{j_2}_{\mathrm{cvx}}$ for any $j_1 < j_2$.

4.3 The RMPC-ICL Algorithm

We present our Robust MPC with Iterative Constraint Learning (RMPC-ICL) algorithm, which uses the estimated constraint sets $\hat{\mathcal{Z}}^j$ from Section 4.1 or Section 4.2 while solving (5) in an iterative fashion. The algorithm terminates upon finding the smallest $\bar{j}$ such that (10) is satisfied.

5 Numerical Example

We verify the effectiveness of the proposed Algorithm 1 in a simulation example. We find approximate solutions to the following iterative optimal control problem in receding horizon:
\[
\begin{aligned}
V^{j,\star}(x_S) = \min_{u_0^j, u_1^j(\cdot), \ldots} \quad & \sum_{t=0}^{T-1} \big\| \bar{x}_t^j - x_{\mathrm{ref}} \big\|_2^2 + 2 \big\| u_t^j(\bar{x}_t^j) \big\|_2^2 \\
\text{s.t.} \quad & x_{t+1}^j = A x_t^j + B u_t^j(x_t^j) + w_t^j, \\
& \begin{bmatrix} H_x^b \\ H_x^{ub} \end{bmatrix} x_t^j \leq \begin{bmatrix} h_x^b \\ h_x^{ub} \end{bmatrix}, \; \forall w_t^j \in \mathbb{W}, \\
& -1 \leq u_t^j(x_t^j) \leq 1, \; \forall w_t^j \in \mathbb{W}, \\
& x_0^j = x_S, \quad t = 0, 1, \ldots, T-1,
\end{aligned} \tag{19}
\]
where the disturbance support $\mathbb{W} \subset \mathbb{R}^2$ is a known box, and
\[
B = [0, 1]^\top
\]
and $A \in \mathbb{R}^{2 \times 2}$ are known. The known and unknown parts of the state constraints are parametrized by the polytopes $\{x : H_x^b x \leq h_x^b\}$ and $\{x : H_x^{ub} x \leq h_x^{ub}\}$, respectively, where
\[
H_x^b = \begin{bmatrix} 20 & 0 \\ 0 & 20 \\ -20 & 0 \\ 0 & -20 \end{bmatrix},
\]
and $H_x^{ub} \in \mathbb{R}^{1 \times 2}$ is a single additional halfspace that is unknown to the control designer and evaluated only by the feasibility classifier of Assumption 2.

Algorithm 1: RMPC-ICL Algorithm
Initialize: $j = 1$, $l = 0$, $\hat{\mathcal{Z}}^1$ from (11)
Inputs: $\mathbb{W}$, $\epsilon$, $\beta$, $N$, and $x_0^j = x_S$ for all $j \in \{1, 2, \ldots\}$
Data: $\tilde{\mathbf{x}}^0 = [x_0, x_1, \ldots, x_T]$; $\hat{\mathcal{P}}^1$ formed with (16)
while $j \geq 1$ do
    if points in $\tilde{\mathbf{x}}^{j-1}$ violate (2) then
        construct $\hat{\mathcal{Z}}^j_{\mathrm{svm}}$ with (14); construct $\hat{\mathcal{X}}^j_N$ with (6);
    else
        $\hat{\mathcal{Z}}^j_{\mathrm{svm}} = \hat{\mathcal{Z}}^{j-1}$; $l = l + 1$; (if $l \geq \frac{\ln 1/\beta}{\ln 1/(1-\epsilon)}$, break; (10) is satisfied)
    end if
    if $\mathbb{P}([\mathrm{IF}]^j) = 0$ desired then
        construct $\hat{\mathcal{Z}}^j_{\mathrm{cvx}}$ with (17); construct $\hat{\mathcal{X}}^j_N$ with (6);
        if problem (5) is feasible with $\hat{\mathcal{Z}}^j_{\mathrm{cvx}}$ then
            use $\hat{\mathcal{Z}}^j = \hat{\mathcal{Z}}^j_{\mathrm{cvx}}$ for solving (5); break; ($\mathbb{P}([\mathrm{IF}]^j) = 0$ is satisfied)
        else
            use $\hat{\mathcal{Z}}^j = \hat{\mathcal{Z}}^j_{\mathrm{svm}}$ for solving (5);
        end if
    else
        use $\hat{\mathcal{Z}}^j = \hat{\mathcal{Z}}^j_{\mathrm{svm}}$ for solving (5);
    end if
    set $\tilde{\mathbf{x}}^j = x_S$, $t = 0$;
    while $0 \leq t \leq T - 1$ do
        solve (5) and apply the MPC control (7) to (1);
        collect states and append $\tilde{\mathbf{x}}^j = [\tilde{\mathbf{x}}^j, x_{t+1}^j]$; $t = t + 1$;
    end while
    $j = j + 1$;
end while

We solve the above optimization problem (19) with the initial state $x_S = [-5, 0]^\top$ and reference point $x_{\mathrm{ref}} = [5, 0]^\top$ for a task duration of $T = 10$ steps over all iterations. Algorithm 1 is implemented with a control horizon of $N = 4$, and the feedback gain $K$ in (6) is chosen to be the optimal LQR gain with parameters $Q_{\mathrm{LQR}} = 10 I_{2 \times 2}$ and $R_{\mathrm{LQR}} = 2$. The optimization problems are formulated with the YALMIP interface [32] in MATLAB, and we use the Gurobi solver to solve a quadratic program at every time step for control synthesis. The goal of this section is to show:

• The design specification (D1) is satisfied by Algorithm 1. Consequently, we find an iteration index $\bar{j}$ such that (10) is guaranteed to hold for all iterations $j \geq \bar{j}$.

• Performance loss $[\mathrm{PL}]^j$ over Successful Iterations (after $j \geq \bar{j}$) increases as the tolerable probability $\epsilon$ of Iteration Failure is lowered. This highlights the safety vs. performance trade-off.

5.1 Bounding the Probability of Iteration Failure

We demonstrate satisfaction of design specification (D1) by Algorithm 1. First, we focus on the SVM-based approach. We choose an SVM classifier with a Radial Basis Function kernel [31, Chapter 12]. To introduce exploration properties, the SVM classifier was initially warm-started by exciting the system (1) with random inputs and collecting trajectory data for two trajectories. After that, the control process was started by solving (5). The polytopes $\hat{\mathcal{P}}^{j+1}_{\mathrm{svm}}$ were generated by taking a convex hull of 1000 randomly generated test points before each iteration, which were classified as $f^j(x_{\mathrm{test}}) \leq 0$ for $j \in \{1, 2, \ldots\}$.

We consider two cases of tolerable Iteration Failure, with respective probabilities of 30% and 50%, corresponding to $\epsilon = 0.30$ and $\epsilon = 0.50$. The corresponding sets $\hat{\mathcal{Z}}^{\bar{j}}$ were obtained for $\bar{j} = 5$ and $\bar{j} = 3$, respectively. These sets satisfy design requirement (D1) and are shown in Fig. 1.
As expected, the constraint set estimated with $\epsilon = 0.50$ is less conservative than the one estimated with $\epsilon = 0.30$. Both estimated sets partially violate the true constraint set (outlined in black).

Furthermore, in order to verify the certificate obtained using Proposition 1, we run 100 offline Monte Carlo simulations (or trials) of iterations by solving (5) with $\hat{\mathcal{Z}}^j = \hat{\mathcal{Z}}^{\bar{j}}_s$, for each of the above $\hat{\mathcal{Z}}^{\bar{j}}$ sets, and estimate the actual resulting Iteration Failure probability. (For brevity, with slight abuse of notation, we use $\hat{\mathcal{Z}}^{\bar{j}}_s$ to denote the corresponding state constraints.) This probability is estimated by averaging over 100 Monte Carlo draws of disturbance samples $\mathbf{w}^{T-1} = [w_0, w_1, \ldots, w_{T-1}]$, i.e.,
\[
\mathbb{P}\big(\mathbf{x}^T \notin \hat{\mathcal{Z}}^{\bar{j}}_s\big) \approx \frac{1}{100} \sum_{\tilde{m}=1}^{100} \big(F(\mathbf{x}^T)\big)^\star_{\tilde{m}},
\]
where
\[
\big(F(\mathbf{x}^T)\big)^\star_{\tilde{m}} = \begin{cases} 1, & \text{if } \mathbf{x}^T \notin \hat{\mathcal{Z}}^{\bar{j}}_s \mid (\mathbf{w}^{T-1})^\star_{\tilde{m}}, \\ 0, & \text{otherwise}, \end{cases}
\]
and $(\cdot)^\star_{\tilde{m}}$ represents the $\tilde{m}$-th Monte Carlo sample. The values obtained were $\hat{\epsilon} \approx 0.01$ and $\hat{\epsilon} \approx 0.04$ for $\epsilon = 0.30$ and $\epsilon = 0.50$ respectively, i.e., approximately 92–96% lower than the corresponding chosen $\epsilon$. This highlights the conservatism of the bounds given in Proposition 1.

We next verify the satisfaction of design requirement (D1) when the estimated constraint sets are obtained using the robust convex hull based approach from Section 4.2. We use the same 100 draws of disturbance sequences as for the SVM-based approach above. The resulting constraint estimate set is shown in Fig. 1 and is obtained at $\bar{j} = 4$. Using this set in (5) ensures no Iteration Failures for all $j \geq \bar{j}$, as proven in Proposition 2. These results from Section 5.1 are summarized in Table I. (We note that the exact value of $\bar{j}$, as well as the associated estimated constraint sets, depend on the disturbance sequence. Running this example several times will yield similar, but not exactly the same, results.)

5.2 Trade-Off Between $[\mathrm{PL}]^j$ and $\mathbb{P}([\mathrm{IF}]^j)$

For the same Monte Carlo draws of $\mathbf{w}$, we approximate the average closed-loop cost $\mathbb{E}[\hat{V}^{\bar{j}}(x_S, \mathbf{w}^{T-1})]$ by taking an empirical average over the 100 Monte Carlo draws,
\[
\mathbb{E}[\hat{V}^{\bar{j}}(x_S, \mathbf{w}^{T-1})] \approx \frac{1}{100} \sum_{\tilde{m}=1}^{100} \hat{V}^{\bar{j}}\big(x_S, (\mathbf{w}^{T-1})^\star_{\tilde{m}}\big),
\]
with the $\hat{\mathcal{Z}}^{\bar{j}}$ sets obtained in Fig. 1. The cost values are normalized by $V^\star(x_S)$, which denotes the empirical average closed-loop cost if $\mathcal{Z}$ had been known.

The results are summarized in Table I. We see that the average closed-loop cost shows an inverse relationship with the tolerable Iteration Failure probability $\epsilon$. For the lower probability of $[\mathrm{IF}]^j$ with $\epsilon = 0.30$, we pay a 3% lower average closed-loop cost compared to $V^\star(x_S)$. Allowing for a higher probability of $[\mathrm{IF}]^j$ with $\epsilon = 0.50$ proves to be more cost-efficient, where we pay around 7% lower average closed-loop cost compared to $V^\star(x_S)$. The cost for the approach in Section 4.2 is the highest, with a 4% higher average closed-loop cost compared to $V^\star(x_S)$. This directly reflects the safety vs. performance trade-off.

Table 1

$\epsilon$ | $\bar{j}$ | $\hat{\epsilon}$ | $\mathbb{E}[\hat{V}^{\bar{j}}(x_S, \mathbf{w}^{T-1})] / V^\star(x_S)$
0.30 | 5 | $\approx 0.01$ | $\approx 0.97$
0.50 | 3 | $\approx 0.04$ | $\approx 0.93$
0 (cvx) | 4 | 0 | $\approx 1.04$

6 Conclusion

We propose a framework for an uncertain LTI system to iteratively learn to satisfy unknown polyhedral state constraints in the environment. From historical trajectory data, we construct an estimate of the true environment constraints before starting an iteration, which the MPC controller robustly satisfies at all times along the iteration. A safety certification is then provided for the estimated constraints if the true (and unknown) environment constraints are also satisfied by the controller in closed loop. We further highlight a trade-off between safety and controller performance, demonstrating that a controller designed with estimated constraint sets which are deemed highly safe incurs a higher average closed-loop cost across iterations. Finally, we demonstrated the efficacy of the proposed framework via a detailed numerical example.
Acknowledgements
We thank Akhil Shetty for useful comments and discussions. This research was sustained in part by fellowship support from the National Physical Science Consortium and the National Institute of Standards and Technology. The research was also partially funded by Office of Naval Research grant ONR-N00014-18-1-2833.
Appendix
Proof of Proposition 1
Recall matrices $H_x$ and $h_x$ defined in (2). Let us denote $\mathbf{H}_x = H_x \otimes I_T$ and $\mathbf{h}_x = h_x \otimes \mathbf{1}_T$. Let $[\mathbf{H}_x]_i$ and $[\mathbf{h}_x]_i$ denote the $i$-th row of $\mathbf{H}_x$ and the $i$-th entry of $\mathbf{h}_x$, respectively. For a fixed initial condition $x_0^j = x_S$ and a random disturbance realization $w = [w_0, w_1, \ldots, w_{T-1}]$, consider the corresponding closed-loop trajectory $x(w) = [x_0^\top, x_1^\top, \ldots, x_T^\top]^\top$. Now consider the following function:
\[
Q(w) := \max_i \big\{ [\mathbf{H}_x]_i\, x(w) - [\mathbf{h}_x]_i \big\},
\]
and then define $\hat{Q}_{N_{it}} := \max_{j = 1, 2, \ldots, N_{it}} \{ Q(w^j) \}$, where the $w^j$ for $j \in \{1, 2, \ldots, N_{it}\}$ are a collection of independent samples of $w$ drawn according to $\mathbb{P}$. It follows [33, Theorem 3.1] that if
\[
N_{it} \geq \frac{\ln(1/\beta)}{\ln(1/(1-\epsilon))},
\]
then
\[
\mathbb{P}^{N_{it}}\Big[ \mathbb{P}\big[Q(w) > \hat{Q}_{N_{it}}\big] \leq \epsilon \Big] \geq 1 - \beta.
\]
Proposition 1 now follows upon setting $\hat{Q}_{N_{it}} = 0$.

References

[1] M. Tanaskovic, L. Fagiano, C. Novara, and M. Morari, "Data-driven control of nonlinear systems: An on-line direct approach,"
Automatica , vol. 75, pp. 1–10, 2017.[2] B. Recht, “A tour of reinforcement learning: The view from continuous control,”
AnnualReview of Control, Robotics, and Autonomous Systems , vol. 2, pp. 253–279, 2019.[3] U. Rosolia, X. Zhang, and F. Borrelli, “Data-driven predictive control for autonomoussystems,”
Annual Review of Control, Robotics, and Autonomous Systems , vol. 1, pp. 259–286, 2018.[4] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger, “Learning-based modelpredictive control: Toward safe learning in control,”
Annual Review of Control, Robotics,and Autonomous Systems , vol. 3, 2019. 135] A. Liniger, A. Domahidi, and M. Morari, “Optimization-based autonomous racing of 1: 43scale RC cars,”
Optimal Control Applications and Methods , vol. 36, no. 5, pp. 628–647,2015.[6] D. Jain, A. Li, S. Singhal, A. Rajeswaran, V. Kumar, and E. Todorov, “Learning deepvisuomotor policies for dexterous hand manipulation,” in , May 2019, pp. 3636–3643.[7] F. Berkenkamp and A. P. Schoellig, “Safe and robust learning control with gaussian pro-cesses,” in , July 2015, pp. 2496–2501.[8] D. P. Losey, M. Li, J. Bohg, and D. Sadigh, “Learning from my partner’s actions: Roles indecentralized robot teams,” arXiv preprint arXiv:1910.07613 , 2019.[9] H. Yin, A. Packard, M. Arcak, and P. Seiler, “Finite horizon backward reachability analysisand control synthesis for uncertain nonlinear systems,” in , July 2019, pp. 5020–5026.[10] A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada, “Control barrier function basedquadratic programs for safety critical systems,”
IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3861–3876, 2016.
[11] F. Borrelli, A. Bemporad, and M. Morari, Predictive control for linear and hybrid systems. Cambridge University Press, 2017.
[12] S. L. Herbert, M. Chen, S. Han, S. Bansal, J. F. Fisac, and C. J. Tomlin, “FaSTrack: A modular framework for fast and guaranteed safe motion planning,” in , Dec 2017, pp. 1517–1522.
[13] M. Bujarbaruah, X. Zhang, M. Tanaskovic, and F. Borrelli, “Adaptive MPC under time varying uncertainty: Robust and Stochastic,” arXiv preprint arXiv:1909.13473, 2019.
[14] J. Köhler, M. A. Müller, and F. Allgöwer, “Nonlinear reference tracking: An economic model predictive control perspective,” IEEE Transactions on Automatic Control, vol. 64, no. 1, pp. 254–269, Jan 2019.
[15] S. Singh, A. Majumdar, J. Slotine, and M. Pavone, “Robust online motion planning via contraction theory and convex optimization,” in , 2017, pp. 5883–5890.
[16] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Advances in Neural Information Processing Systems, 2017, pp. 908–918.
[17] L. Hewing and M. N. Zeilinger, “Cautious model predictive control using Gaussian process regression,” arXiv preprint arXiv:1705.10702, 2017.
[18] R. Soloperto, M. A. Müller, S. Trimpe, and F. Allgöwer, “Learning-based robust model predictive control with state-dependent uncertainty,” in IFAC Conference on Nonlinear Model Predictive Control, Madison, Wisconsin, USA, Aug. 2018.
[19] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based model predictive control for safe exploration,” in , Dec 2018, pp. 6059–6066.
[20] L. Armesto, J. Bosga, V. Ivan, and S. Vijayakumar, “Efficient learning of constraints and generic null space policies,” in , May 2017, pp. 1520–1526.
[21] C. Pérez-D’Arpino and J. A. Shah, “C-LEARN: Learning geometric constraints from demonstrations for multi-step manipulation in shared autonomy,” in , May 2017, pp. 4058–4065.
[22] G. Chou, D. Berenson, and N. Ozay, “Learning constraints from demonstrations,” arXiv preprint arXiv:1812.07084, 2018.
[23] P. J. Goulart, E. C. Kerrigan, and J. M. Maciejowski, “Optimization over state feedback policies for robust control with constraints,”
Automatica, vol. 42, no. 4, pp. 523–533, 2006.
[24] U. Rosolia and F. Borrelli, “Learning model predictive control for iterative tasks: A computationally efficient approach for linear system,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 3142–3147, 2017.
[25] K. P. Wabersich and M. N. Zeilinger, “Linear model predictive safety certification for learning-based control,” in , Dec 2018, pp. 7130–7135.
[26] A. Girard and G. J. Pappas, “Approximation metrics for discrete and continuous systems,” IEEE Transactions on Automatic Control, vol. 52, no. 5, pp. 782–798, 2007.
[27] A. B. Kurzhanski and P. Varaiya, “Ellipsoidal techniques for reachability analysis,” in International Workshop on Hybrid Systems: Computation and Control. Springer, 2000, pp. 202–214.
[28] G. C. Calafiore and M. C. Campi, “The scenario approach to robust control design,”
IEEE Transactions on Automatic Control, vol. 51, no. 5, pp. 742–753, 2006.
[29] X. Zhang, K. Margellos, P. Goulart, and J. Lygeros, “Stochastic model predictive control using a combination of randomized and robust optimization,” in IEEE Conference on Decision and Control (CDC), Florence, Italy, 2013.
[30] M. Bujarbaruah, A. Shetty, K. Poolla, and F. Borrelli, “Learning robustness with bounded failure: An iterative MPC approach,” arXiv preprint arXiv:1911.09910, 2019.
[31] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning. Springer Series in Statistics, New York, 2001, vol. 1, no. 10.
[32] J. Löfberg, “YALMIP: A toolbox for modeling and optimization in MATLAB,” in IEEE International Symposium on Computer Aided Control Systems Design, 2004, pp. 284–289.
[33] R. Tempo, E. W. Bai, and F. Dabbene, “Probabilistic robustness analysis: explicit bounds for the minimum number of samples,” in
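As a standalone numerical illustration of the sample bound from [33, Theorem 3.1] used in the proof of Proposition 1 above, the sketch below computes the required number of task iterations $N_{\mathrm{it}}$ for given $(\epsilon, \beta)$ and then checks the resulting guarantee by Monte Carlo. The scalar "cost" $Q(w) = w$ with $w \sim \mathrm{Uniform}(0,1)$ is a toy stand-in for the paper's closed-loop constraint function (an assumption made purely for illustration), chosen because the violation probability $P[Q(w) > \hat{Q}_{N_{\mathrm{it}}}]$ is then exactly $1 - \hat{Q}_{N_{\mathrm{it}}}$ and can be evaluated in closed form.

```python
import math
import random

def scenario_sample_bound(eps: float, beta: float) -> int:
    """Smallest integer N_it satisfying N_it >= ln(1/beta) / ln(1/(1-eps)),
    the bound from [33, Theorem 3.1] used in the proof of Proposition 1."""
    return math.ceil(math.log(1.0 / beta) / math.log(1.0 / (1.0 - eps)))

def empirical_check(eps: float, beta: float, trials: int = 20000, seed: int = 0):
    """Monte Carlo check of the guarantee for the toy case Q(w) = w,
    w ~ Uniform(0, 1). Here P[Q(w) > Qhat] = 1 - Qhat exactly, so a
    trial 'succeeds' (violation probability <= eps) precisely when the
    empirical maximum of the N_it samples is at least 1 - eps."""
    n_it = scenario_sample_bound(eps, beta)
    rng = random.Random(seed)
    successes = sum(
        1.0 - max(rng.random() for _ in range(n_it)) <= eps
        for _ in range(trials)
    )
    # The theorem predicts the success fraction is at least 1 - beta.
    return n_it, successes / trials

n_it, frac = empirical_check(eps=0.05, beta=0.01)
print(n_it)  # 90 samples suffice for eps = 0.05, beta = 0.01
print(frac)  # empirical success fraction, close to or above 1 - beta
```

Note that the bound grows only logarithmically in $1/\beta$ but roughly like $1/\epsilon$ in the violation level, which is why small $\epsilon$ dominates the number of iterations required.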