Recurrent Model Predictive Control

Zhengyu Liu, Jingliang Duan, Wenxuan Wang, Shengbo Eben Li*, Yuming Yin, Ziyu Lin, Qi Sun, and Bo Cheng
Abstract—This paper proposes an offline algorithm, called Recurrent Model Predictive Control (RMPC), to solve general nonlinear finite-horizon optimal control problems. Unlike traditional Model Predictive Control (MPC) algorithms, it can make full use of the current computing resources and adaptively select the longest model prediction horizon. Our algorithm employs a recurrent function to approximate the optimal policy, which maps the system states and reference values directly to the control inputs. The number of prediction steps is equal to the number of recurrent cycles of the learned policy function. Starting from an arbitrary initial policy function, the proposed RMPC algorithm converges to the optimal policy by directly minimizing the designed loss function. We further prove the convergence and optimality of the RMPC algorithm through the Bellman optimality principle, and demonstrate its generality and efficiency using two numerical examples.
Index Terms—Model predictive control, recurrent function, dynamic programming
I. INTRODUCTION

Model Predictive Control (MPC) is a well-known method for solving finite-horizon optimal control problems online, and it has been widely studied and applied in various fields [1]–[3]. However, MPC algorithms typically suffer from a major challenge: very low computational efficiency [4].

To overcome this issue, several methods have been proposed. One famous approach is move blocking, which holds the control input constant over fixed portions of the prediction horizon, thus reducing the number of variables to be optimized. Cagienard et al. (2007) and Shekhar et al. (2012) presented a time-varying blocking structure, which allows a shifted version of the previous optimal control sequence to be reused at the following time step [5], [6]. This solution, however, is sometimes unstable or infeasible. In addition, Wang and Boyd (2009) proposed an early-termination interior-point method that reduces the calculation time by limiting the maximum number of iterations [7].

However, these methods are still unable to meet real-time requirements for nonlinear systems. Therefore, many algorithms instead calculate an optimal explicit policy offline to ensure real-time online implementation. Bemporad et al. (2002) first proposed explicit MPC to increase computing speed: it partitions the constrained state space and designs an explicit feedback control law for each region [8]. During online implementation, the computer only needs to select the state feedback control law corresponding to the current system state, which reduces the online computational burden. However, the method can only be applied
Z. Liu and J. Duan contributed equally to this study. All correspondence should be sent to S. Li (email: [email protected]).

to small-scale problems, as the required storage capacity grows exponentially with the problem scale [9].

Furthermore, significant effort has been devoted to approximation algorithms that reduce the number of polyhedral state regions and simplify the explicit control laws. Geyer et al. (2008) provided an approach that reduces the number of partitions by merging regions with the same control law [10]. Jones et al. (2010) proposed a polytopic approximation method using double description and barycentric functions to estimate the optimal policy, which greatly reduces the number of partitions and can be applied to any convex optimization problem [11]. Wen et al. (2009) proposed a piecewise continuous grid function to express the explicit MPC solution, which reduces the storage space and online computation requirements [12]. Borrelli et al. (2010) proposed an explicit MPC algorithm that can be executed partially online, trading off storage space against online computational burden [13]. Several researchers have proposed approximation algorithms that use only one partition and a continuous function as the explicit control law, updating the controller parameters by means of supervised learning or reinforcement learning while minimizing a cost function with a fixed prediction horizon [14]–[16].

Note that the policy performance and the computation time per step usually increase with the number of prediction steps, especially for nonlinear and non-input-affine systems. Since the above algorithms require a fixed prediction horizon, they usually have to trade off control performance against calculation time limits when choosing a feasible prediction horizon. In practical applications, however, the available computing resources change dynamically. As a result, these algorithms often suffer from calculation timeouts or calculation redundancy. In other words, they cannot adapt to the dynamic allocation of computing resources and make full use of the available computing time to select the longest model prediction horizon.

In this paper, we propose an offline-optimization MPC algorithm, called Recurrent MPC (RMPC), for finite-horizon optimal control problems with general nonlinear and non-input-affine systems. Our main contributions can be summarized as follows:
1) Our algorithm can make full use of the current computing resources and adaptively select the longest model prediction horizon. We employ a recurrent function to approximate the optimal policy, which maps the system states and reference values directly to the control inputs. The number of prediction steps is equal to the number of recurrent cycles of the learned policy function.
2) The optimal recurrent policy can be obtained by directly minimizing the designed loss function, which is applicable to general nonlinear and non-input-affine systems. We prove the convergence and optimality of the RMPC algorithm through the Bellman optimality principle.
3) Simulation results show that the RMPC algorithm is about 20 times faster than a traditional MPC algorithm under the same problem scale.

The paper is organized as follows. In Section II, we provide the formulation of the MPC problem.
Section III presents the RMPC algorithm and proves its convergence. In Section IV, we present simulation examples that show the generalizability and effectiveness of the RMPC algorithm. Section V concludes this paper.

II. PRELIMINARIES
Consider a discrete-time nonlinear dynamical system
$$x_{i+1} = f(x_i, u_i) \tag{1}$$
with state $x_i \in \mathcal{X} \subset \mathbb{R}^n$, control input $u_i \in \mathcal{U} \subset \mathbb{R}^m$, and system dynamics function $f: \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}^n$. We assume that $f(x_i, u_i)$ is Lipschitz continuous on a compact set $\mathcal{X}$, and that the system is stabilizable on $\mathcal{X}$.

Define the cost function $V(x_0, r_{1:N}, N)$ with finite prediction horizon:
$$V(x_0, r_{1:N}, N) = \sum_{i=1}^{N} l\big(x_i, r_i, u^{N}_{i-1}(x_0, r_{1:N})\big) \tag{2}$$
where $x_0$ is the initial state, $N$ is the length of the prediction horizon, $r_{1:N} = [r_1, r_2, \cdots, r_N]$ is the reference trajectory, $V(x_0, r_{1:N}, N)$ is the $N$-step cost of state $x_0$ with reference $r_{1:N}$, $u^{N}_{i-1}$ is the control input of the $i$th step among the $N$ prediction steps, and $l \geq 0$ is the utility function. The optimal control sequence that minimizes $V(x_0, r_{1:N}, N)$ is denoted as
$$\big[u^{N*}_{0}(x_0, r_{1:N}), u^{N*}_{1}(x_0, r_{1:N}), \cdots, u^{N*}_{N-1}(x_0, r_{1:N})\big] = \arg\min_{u^{N}_{0}, u^{N}_{1}, \cdots, u^{N}_{N-1}} V(x_0, r_{1:N}, N) \tag{3}$$
where $u^{N*}_{i-1}$ is the optimal control input of the $i$th step among the $N$ prediction steps. The superscript $*$ denotes the optimal policy.
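To make the notation concrete, the finite-horizon cost (2) is just a forward simulation of (1) with the utilities accumulated along the way. The Python sketch below is illustrative only: `f` and `l` stand for the abstract dynamics and utility defined above, and the scalar placeholder dynamics in the usage example are an assumption, not the paper's model.

```python
import numpy as np

def rollout_cost(f, l, x0, u_seq, r_seq):
    """Evaluate V(x0, r_{1:N}, N) = sum_{i=1}^N l(x_i, r_i, u_{i-1})
    by propagating x_{i+1} = f(x_i, u_i) forward for N steps."""
    x, cost = x0, 0.0
    for u, r in zip(u_seq, r_seq):
        x = f(x, u)           # next state x_i
        cost += l(x, r, u)    # utility of the reached state
    return cost

# Usage with placeholder scalar dynamics and a quadratic utility (assumptions):
f = lambda x, u: 0.9 * x + 0.1 * u
l = lambda x, r, u: (x - r) ** 2 + 0.01 * u ** 2
print(rollout_cost(f, l, x0=1.0, u_seq=np.zeros(10), r_seq=np.zeros(10)))
```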
III. RMPC METHOD

A. The recurrent function
In real-time applications, we only need to execute the optimal control input of the first step, $u^{N*}_{0}(x_0, r_{1:N})$, at each time step, instead of the whole control sequence (3). Given a control problem, assume that $N_{\max}$ is the maximum prediction horizon required. The purpose of our algorithm is to make full use of the computing resources to adaptively select the longest prediction step $k \in [1, N_{\max}]$, which means that we need to calculate and store the optimal control input $u^{k*}_{0}(x, r_{1:k})$ for all $x \in \mathcal{X}$, all $r_{1:k}$, and all $k \in [1, N_{\max}]$ in advance. This requires an efficient way to represent the policy and to solve for it offline so as to obtain a nearly optimal policy function.

We therefore first introduce a recurrent function, denoted $\pi^{c}(x_0, r_{1:c}; \theta)$, to approximate the optimal control input $u^{c*}_{0}(x_0, r_{1:c})$, where $\theta$ is the function parameter and $c$ is the number of recurrent cycles. The goal of the proposed RMPC algorithm is to find the optimal parameters $\theta^*$ such that
$$\pi^{c}(x_0, r_{1:c}; \theta^*) = u^{c*}_{0}(x_0, r_{1:c}), \quad \forall x_0 \in \mathcal{X}, \forall r_{1:c}, \forall c \in [1, N_{\max}] \tag{4}$$
The structure of the recurrent policy function is illustrated in Fig. 1. All recurrent cycles share the same parameters $\theta$, and $h_c$ is the hidden state. Each recurrent cycle is mathematically described as
$$h_c = \sigma_h(x_0, r_c, h_{c-1}; \theta_h), \qquad \pi^{c}(x_0, r_{1:c}; \theta) = \sigma_y(h_c; \theta_y), \qquad c \in [1, N_{\max}], \; \theta = \theta_h \cup \theta_y \tag{5}$$
where $\sigma_h$ and $\sigma_y$ are the layer functions.

Fig. 1. The structure of the recurrent policy function.
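As a concrete (hypothetical) instance of (5), the sketch below implements the shared-parameter recurrence with a single PyTorch GRU cell for $\sigma_h$ and a linear layer for $\sigma_y$; the layer sizes and the concatenation of $x_0$ with $r_c$ are our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """pi^c(x0, r_{1:c}; theta): all cycles share one cell (theta_h)
    and one output head (theta_y); cycle c emits the c-step control."""
    def __init__(self, state_dim, ref_dim, ctrl_dim, hidden_dim=64):
        super().__init__()
        self.cell = nn.GRUCell(state_dim + ref_dim, hidden_dim)  # sigma_h
        self.head = nn.Linear(hidden_dim, ctrl_dim)              # sigma_y

    def forward(self, x0, refs):
        # x0: (batch, state_dim); refs: (batch, N, ref_dim)
        h = torch.zeros(x0.shape[0], self.cell.hidden_size)
        outputs = []
        for c in range(refs.shape[1]):
            h = self.cell(torch.cat([x0, refs[:, c]], dim=-1), h)  # h_c
            outputs.append(self.head(h))  # pi^{c+1}(x0, r_{1:c+1}; theta)
        return outputs
```

Here `outputs[c-1]` plays the role of $\pi^{c}(x_0, r_{1:c}; \theta)$, so a single forward pass yields the whole family of horizon-$c$ controls.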
As shown in Fig. 1, the recurrent policy function outputs a specific control input in each cycle. Assuming we have found the optimal parameters $\theta^*$, the output of the $c$th cycle satisfies $\pi^{c}(x_0, r_{1:c}; \theta^*) = u^{c*}_{0}(x_0, r_{1:c})$. In practical applications, the calculation time $t_c$ of each cycle varies due to the dynamic change of computing resource allocation and other reasons, as shown in Fig. 2. At each time step, the total time assigned to the control input calculation is denoted by $T$. Denoting the total number of recurrent cycles executed at each time step by $k$, the final output policy is $\pi^{k}(x_0, r_{1:k}; \theta^*)$, where
$$k = \begin{cases} N_{\max}, & \sum_{c=1}^{N_{\max}} t_c \leq T \\ p, & \sum_{c=1}^{p} t_c \leq T \;\wedge\; \sum_{c=1}^{p+1} t_c > T \end{cases}$$
It follows that the recurrent policy can make full use of the computing resources to adaptively select the longest prediction step $k$. In other words, the more computing resources are allocated, the longer the prediction horizon that is selected, which in turn leads to better control performance.
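The selection rule above translates into an anytime loop: keep running cycles while the budget $T$ allows and return the last completed output. A minimal sketch, assuming a hypothetical `policy_cycle` callable that performs one recurrent cycle ($h_{c-1} \rightarrow h_c$, $\pi^c$):

```python
import time

def anytime_control(policy_cycle, x0, refs, h0, budget_T):
    """Run recurrent cycles until the time budget T is exhausted;
    the number of fully completed cycles k is the prediction horizon."""
    h, u, k = h0, None, 0
    start = time.perf_counter()
    for c, r in enumerate(refs, start=1):        # at most N_max cycles
        h, u_candidate = policy_cycle(x0, r, h)  # one cycle: h_c, pi^c
        if time.perf_counter() - start > budget_T and u is not None:
            break          # cycle c overran the deadline: keep pi^{c-1}
        u, k = u_candidate, c
    return u, k            # control of the longest completed horizon k
```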
B. The objective function

Fig. 2. The relationship between computing resources and the maximum number of recurrent cycles.

To find the optimal parameters $\theta^*$ offline, we first need to express the MPC cost function (2) in terms of $\theta$ (denoted by $V(x_0, r_{1:N}, N; \theta)$). As shown in Fig. 3, the MPC cost function (2) can be decomposed as follows:
$$\begin{aligned} V(x_0, r_{1:N}, N) &= l\big(x_1, r_1, u^{N}_{0}(x_0, r_{1:N})\big) + V(x_1, r_{2:N}, N-1) \\ &= \sum_{i=1}^{2} l\big(x_i, r_i, u^{N-i+1}_{0}(x_{i-1}, r_{i:N})\big) + V(x_2, r_{3:N}, N-2) \\ &\;\;\vdots \\ &= \sum_{i=1}^{N-1} l\big(x_i, r_i, u^{N-i+1}_{0}(x_{i-1}, r_{i:N})\big) + V(x_{N-1}, r_{N}, 1) \\ &= \sum_{i=1}^{N} l\big(x_i, r_i, u^{N-i+1}_{0}(x_{i-1}, r_{i:N})\big) \end{aligned} \tag{6}$$
According to (2) and (6), the global minimum $V^*(x_0, r_{1:N}, N)$ can be expressed as
$$V^*(x_0, r_{1:N}, N) = \sum_{i=1}^{N} l\big(x_i, r_i, u^{N*}_{i-1}(x_0, r_{1:N})\big) = \sum_{i=1}^{N} l\big(x_i, r_i, u^{(N-i+1)*}_{0}(x_{i-1}, r_{i:N})\big)$$
Therefore, for the same $x_0$ and $r_{1:N}$, it is clear that
$$u^{N*}_{i-1}(x_0, r_{1:N}) = u^{(N-i+1)*}_{0}(x_{i-1}, r_{i:N}), \quad \forall i \in [1, N]. \tag{7}$$
This indicates that the $i$th optimal control input $u^{N*}_{i-1}(x_0, r_{1:N})$ of sequence (3) can be regarded as the optimal control input at the initial state $x_{i-1}$ of the $(N-i+1)$-step MPC problem. Hence, the $N$-step MPC problem can also be solved by minimizing $V(x_0, r_{1:N}, N) = \sum_{i=1}^{N} l\big(x_i, r_i, u^{N-i+1}_{0}(x_{i-1}, r_{i:N})\big)$. Replacing each $u^{N-i+1}_{0}(x_{i-1}, r_{i:N})$ with $\pi^{N-i+1}(x_{i-1}, r_{i:N}; \theta)$, we finally obtain the $N$-step cost function in terms of $\theta$:
$$V(x_0, r_{1:N}, N; \theta) = \sum_{i=1}^{N} l\big(x_i, r_i, \pi^{N-i+1}(x_{i-1}, r_{i:N}; \theta)\big) \tag{8}$$
From (8), to find the optimal parameters $\theta^*$ that make (4) hold, we construct the following objective function:
$$J(\theta) = \mathop{\mathbb{E}}_{x_0 \in \mathcal{X},\, r_{1:N_{\max}}} \Big\{ V(x_0, r_{1:N_{\max}}, N_{\max}; \theta) \Big\} \tag{9}$$
From (3) and (4), it is obvious that
$$\theta^* = \arg\min_{\theta} J(\theta) \tag{10}$$
Hence, we can update $\theta$ by directly minimizing $J(\theta)$. The update gradient of the policy function is
$$\frac{\mathrm{d}J}{\mathrm{d}\theta} = \mathop{\mathbb{E}}_{x_0 \in \mathcal{X},\, r_{1:N_{\max}}} \Big\{ \frac{\mathrm{d}V(x_0, r_{1:N_{\max}}, N_{\max}; \theta)}{\mathrm{d}\theta} \Big\}, \tag{11}$$
where
$$\frac{\mathrm{d}V(x_0, r_{1:N_{\max}}, N_{\max}; \theta)}{\mathrm{d}\theta} = \sum_{i=1}^{N_{\max}} \frac{\mathrm{d}\, l\big(x_i, r_i, \pi^{N_{\max}-i+1}(x_{i-1}, r_{i:N_{\max}}; \theta)\big)}{\mathrm{d}\theta}$$
Denoting $\pi^{N_{\max}-i+1}(x_{i-1}, r_{i:N_{\max}}; \theta)$ by $\pi^{N_{\max}-i+1}$ and $l\big(x_i, r_i, \pi^{N_{\max}-i+1}\big)$ by $l_i$, we have
$$\frac{\mathrm{d}V(x_0, r_{1:N_{\max}}, N_{\max}; \theta)}{\mathrm{d}\theta} = \sum_{i=1}^{N_{\max}} \Big\{ \frac{\partial l_i}{\partial x_i}\, \phi_i + \frac{\partial l_i}{\partial \pi^{N_{\max}-i+1}}\, \psi_i \Big\}$$
$$\phi_i = \begin{cases} 0, & i = 1 \\ \dfrac{\partial f(x_{i-1}, \pi^{N_{\max}-i+1})}{\partial x_{i-1}}\, \phi_{i-1} + \dfrac{\partial f(x_{i-1}, \pi^{N_{\max}-i+1})}{\partial \pi^{N_{\max}-i+1}}\, \psi_i, & \text{otherwise} \end{cases}$$
$$\psi_i = \frac{\partial \pi^{N_{\max}-i+1}}{\partial x_{i-1}}\, \phi_{i-1} + \frac{\partial \pi^{N_{\max}-i+1}}{\partial \theta}$$
Taking the Gradient Descent (GD) method as an example, the update rule of the recurrent function at the $K$th iteration is
$$\theta^{K+1} = \theta^{K} - \alpha_\theta \frac{\mathrm{d}J}{\mathrm{d}\theta} \tag{12}$$
where $\alpha_\theta$ denotes the learning rate. The framework and pseudo-code of the proposed RMPC algorithm are shown in Fig. 3 and Algorithm 1.
Algorithm 1 RMPC algorithm

Given an appropriate learning rate $\alpha_\theta$ and an arbitrarily small positive number $\epsilon$.
Initialize with arbitrary $\theta$.
repeat
    Randomly select $x_0 \in \mathcal{X}$ and $r_{1:N_{\max}}$
    Calculate $\mathrm{d}J(\theta^K)/\mathrm{d}\theta^K$ using (11)
    Update the policy function using (12)
until $|J(\theta^{K+1}) - J(\theta^K)| \leq \epsilon$
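A compact sketch of one iteration of Algorithm 1, under the assumption that the model rollout is written in PyTorch, so that the gradient (11) is obtained by automatic differentiation rather than the hand-derived $\phi/\psi$ recursion; the samplers, dynamics `f`, and utility `l` are placeholders, and Adam stands in for the generic update (12).

```python
import torch

def rmpc_training_step(policy, f, l, sample_x0, sample_refs, optimizer):
    """One iteration of Algorithm 1: sample (x0, r_{1:N_max}), build the
    Monte-Carlo estimate of J(theta) in (9) via the rollout (8), descend."""
    x0, refs = sample_x0(), sample_refs()  # x0: (B, n); refs: (B, N_max, .)
    N_max = refs.shape[1]
    x, J = x0, 0.0
    for i in range(1, N_max + 1):
        # pi^{N_max-i+1}(x_{i-1}, r_{i:N_max}; theta): last output of the
        # remaining-horizon pass of the recurrent policy
        u = policy(x, refs[:, i - 1:])[-1]
        x = f(x, u)                             # x_i = f(x_{i-1}, u)
        J = J + l(x, refs[:, i - 1], u).mean()  # accumulate (8), average (9)
    optimizer.zero_grad()
    J.backward()      # autograd reproduces the gradient (11)
    optimizer.step()  # theta^{K+1} = theta^K - alpha * dJ/dtheta, cf. (12)
    return float(J.detach())
```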
C. Convergence and Optimality

There are many kinds of recurrent functions with the structure defined in (5); recurrent neural networks (RNNs) are the most popular. In recent years, deep RNNs have been favored in many fields, such as natural language processing and system control, owing to their strong ability to process sequential data. Next, we show that as the iteration index $K$ tends to infinity, the optimal policy $\pi^{c}(x_0, r_{1:c}; \theta^*)$ that makes (4) hold can be achieved by Algorithm 1, as long as $\pi^{c}(x_0, r_{1:c}; \theta)$ is an over-parameterized RNN. Over-parameterization means that the number of hidden neurons is sufficiently large. Before the main theorem, the following lemma and assumption are introduced.

Fig. 3. The training flowchart of the RMPC algorithm.

Lemma 1 (Universal Approximation Theorem). Consider a sequence of finite functions $\{F_i(y_{1:i})\}_{i=1}^{n}$, where $y_{1:i} = [y_1, y_2, \ldots, y_i] \in \mathbb{R}^i$, $i$ is the input dimension, and $F_i(y_{1:i}): \mathbb{R}^i \rightarrow \mathbb{R}^d$ is a continuous function on a compact set. Describe the RNN $G_c(y_{1:c}; W, b)$ as
$$h_c = \sigma_h(W_h y_c + U_h h_{c-1} + b_h), \qquad G_c(y_{1:c}; W, b) = \sigma_y(W_y h_c + b_y)$$
where $c$ is the number of recurrent cycles, $U_h$, $W = W_h \cup W_y$, and $b = b_h \cup b_y$ are parameters, and $\sigma_h$ and $\sigma_y$ are activation functions. Supposing $G_c(y_{1:c}; W, b)$ is over-parameterized, then for any $\{F_i(y_{1:i})\}_{i=1}^{n}$ there exist $U_h, W, b$ such that
$$\|G_c(y_{1:c}; W, b) - F_c(y_{1:c})\|_\infty \leq \epsilon, \quad \forall c \in [1, n],$$
where $\epsilon \in \mathbb{R}^+$ is an arbitrarily small error [17]–[19].

In recent years, many experimental results and theoretical proofs have shown that simple optimization algorithms, such as GD and stochastic GD (SGD), can find global minima of most training objectives in polynomial time when the approximating function is an over-parameterized neural network or RNN [20], [21]. Based on this fact, we make the following assumption.
Assumption 1.
If the approximating function is an over-parameterized RNN, the global minimum of the objective function in (9) can be found using an appropriate optimization algorithm such as SGD [22].
We now present our main result.
Theorem 1 (Recurrent Model Predictive Control). Suppose $\pi^{c}(x_0, r_{1:c}; \theta)$ is an over-parameterized RNN. Through Algorithm 1, any initial parameters $\theta$ converge to $\theta^*$ such that (4) holds.

Proof.
By Lemma 1, there always exists $\theta^*$ such that (4) holds. Since Algorithm 1 continuously minimizes $J(\theta)$ in (9), according to (10) and Assumption 1 we can always find $\theta^*$ using simple optimization algorithms such as GD and SGD. Thus, we have proven that the RMPC algorithm converges to the optimal policy, which supports a dynamic prediction horizon whose number of recurrent cycles equals the number of prediction steps.

TABLE I
STATE AND CONTROL INPUT

Mode  | Name                               | Symbol | Unit
state | Lateral velocity                   | v_y    | [m/s]
      | Yaw rate at center of gravity (CG) | ω_r    | [rad/s]
      | Longitudinal velocity              | v_x    | [m/s]
      | Yaw angle                          | φ      | [rad]
      | Trajectory                         | y      | [m]
input | Front wheel angle                  | δ      | [rad]

TABLE II
VEHICLE PARAMETERS

Name                            | Symbol | Value / Unit
Front wheel cornering stiffness | k_f    | −88000 [N/rad]
Rear wheel cornering stiffness  | k_r    | −94000 [N/rad]
Mass                            | m      | [kg]
Distance from CG to front axle  | a      | [m]
Distance from CG to rear axle   | b      | [m]
Yaw moment of inertia           | I_z    | [kg·m²]
Sampling frequency              | f      | 20 [Hz]
System frequency                |        | 20 [Hz]

IV. ALGORITHM VERIFICATION
To evaluate the performance of our RMPC algorithm, we choose the vehicle lateral control problem in a path-tracking task as an example [23]. The path-tracking task has a constant longitudinal speed $v_x = 16$ m/s, and the expected trajectory is sinusoidal. The system states and control inputs of this problem are listed in Table I, and the vehicle parameters are listed in Table II. We offer two simulation examples with different vehicle dynamics: one with a linear system, and the other with a nonlinear, non-input-affine system.

A. Example 1: Linear Time-Invariant System

1) Problem Description:
The vehicle model is the bicycle model, and the vehicle dynamics are
$$x = \begin{bmatrix} y \\ \varphi \\ v_y \\ \omega_r \end{bmatrix}, \quad u = \delta, \quad x_{i+1} = \frac{1}{f}\big(A x_i + B u_i\big) + x_i \tag{13}$$
$$A = \begin{bmatrix} 0 & v_x & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & \dfrac{k_f + k_r}{m v_x} & \dfrac{a k_f - b k_r}{m v_x} - v_x \\ 0 & 0 & \dfrac{a k_f - b k_r}{I_z v_x} & \dfrac{a^2 k_f + b^2 k_r}{I_z v_x} \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 0 \\ -\dfrac{k_f}{m} \\ -\dfrac{a k_f}{I_z} \end{bmatrix}$$
The utility function of this problem is set to
$$l(x_k, r_k, u_k) = \big([1, 0, 0, 0]\, x_k - r_k\big)^2 + 100\, u_k^2.$$
Therefore, the policy optimization problem of this example can be formulated as
$$\min_{\theta} \mathop{\mathbb{E}}_{x_0 \in \mathcal{X},\, r_{1:N_{\max}}} \Big\{ V(x_0, r_{1:N_{\max}}, N_{\max}; \theta) \Big\}$$
$$\text{s.t.} \quad x_i = \frac{1}{f}\big(A x_{i-1} + B\, \pi^{N_{\max}-i+1}(x_{i-1}, r_{i:N_{\max}}; \theta)\big) + x_{i-1}, \quad i \in [1, N_{\max}]$$
where $N_{\max} = 15$.
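The linear bicycle model (13) can be assembled directly from Table II. In the sketch below, the cornering stiffnesses, speed, and frequency come from the paper; the mass, geometry, and inertia values are missing from the extracted table, so the ones used here are placeholder assumptions.

```python
import numpy as np

# Parameters from the paper (k_f, k_r, v_x, f); m, a, b, I_z are assumed
k_f, k_r = -88000.0, -94000.0   # front/rear cornering stiffness [N/rad]
v_x, freq = 16.0, 20.0          # longitudinal speed [m/s], frequency [Hz]
m, a, b, I_z = 1500.0, 1.2, 1.4, 2500.0  # placeholder vehicle parameters

# State x = [y, phi, v_y, omega_r]; input u = delta
A = np.array([
    [0.0, v_x, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, (k_f + k_r) / (m * v_x),
     (a * k_f - b * k_r) / (m * v_x) - v_x],
    [0.0, 0.0, (a * k_f - b * k_r) / (I_z * v_x),
     (a**2 * k_f + b**2 * k_r) / (I_z * v_x)],
])
B = np.array([0.0, 0.0, -k_f / m, -a * k_f / I_z])

def step(x, u):
    """Forward-Euler discretization: x_{i+1} = x_i + (A x_i + B u_i) / f."""
    return x + (A @ x + B * u) / freq

def utility(x, r, u):
    """l = ([1,0,0,0] x - r)^2 + 100 u^2 (tracking error + control cost)."""
    return (x[0] - r) ** 2 + 100.0 * u ** 2
```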
2) Algorithm Details:
The optimal policy is represented by an RNN variant, the Gated Recurrent Unit (GRU). The input layer is composed of the states, followed by 4 hidden layers using rectified linear units (ReLUs) as activation functions, and the output of the network is a linear unit. We use the Adam method to update the network.
Fig. 4. The error between the optimal control and RMPC during training.
Fig. 5. The error between optimal control and RMPC after training.
3) Result Analysis:
We run Algorithm 1 twenty times, and the mean and 95% confidence interval of the policy error at each iteration are shown in Fig. 4. The policy error for each prediction horizon $c$, which is also the number of recurrent cycles, is calculated by
$$e^{N}_{c} = \mathop{\mathbb{E}}_{x_0 \in \mathcal{X},\, r_{1:N}} \left[ \frac{\big|u^{N*}_{0}(x_0, r_{1:N}) - \pi^{c}(x_0, r_{1:c}; \theta^*)\big|}{u^{*}_{\max} - u^{*}_{\min}} \right], \quad N \in [1, 15],\; c \in [1, 15]$$
where $u^{*}_{\max}$ and $u^{*}_{\min}$ are, respectively, the maximum and minimum values of the optimal control input over all prediction steps, $N$ is the number of prediction steps, and $c$ is the number of recurrent cycles. $e^{N}_{c}$ is the error between the theoretical optimal control law with $N$ prediction steps and the output of the recurrent function with $c$ cycles. The errors are shown in Fig. 4 and Fig. 5.

In Fig. 4, $c$ is equal to $N$. The error for each recurrent cycle decreases rapidly and converges during training; after training, each error is less than 0.7%. This indicates that Algorithm 1 is able to converge the policy function to optimality.

In Fig. 5, the gray line represents the error when $c = N$, and the lines of other colors represent the error for each $c \in \{3, 5, 7, 9, 11, 13, 15\}$, where $N \in [1, 15]$. Each line represents the error over 20 different runs, and the shaded area represents the 95% confidence interval. $e^{N}_{c}$ achieves its minimum when $c = N$. Therefore, the RMPC method is able to approximate the optimal control law for each prediction horizon, and the number of prediction steps equals the number of recurrent cycles. The error achieves its minimum (0.2%) at 15 prediction steps, and the maximum error is no more than 0.7%, which illustrates the generalizability and effectiveness of the RMPC method.
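For reference, $e^{N}_{c}$ can be estimated by Monte-Carlo sampling, comparing the first control of a baseline MPC solver against the $c$-th cycle output of the trained network; `mpc_solve`, `policy`, and the samplers below are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

def policy_error(policy, mpc_solve, sample_state, sample_refs, N, c, M=1000):
    """Monte-Carlo estimate of e^N_c: mean absolute gap between the
    N-step optimal first control and the c-cycle policy output,
    normalized by the range of the optimal controls."""
    gaps, u_star = [], []
    for _ in range(M):
        x0, refs = sample_state(), sample_refs(N)
        u_opt = mpc_solve(x0, refs)[0]      # u_0^{N*}(x0, r_{1:N})
        u_net = policy(x0, refs[:c])        # pi^c(x0, r_{1:c}; theta*)
        gaps.append(abs(u_opt - u_net))
        u_star.append(u_opt)
    return float(np.mean(gaps) / (max(u_star) - min(u_star)))
```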
Fig. 6. The performance of the tracking problem in each recurrent cycle.
Fig. 6 shows the mean and 95% confidence interval of the loss in each recurrent cycle. The performance of each recurrent cycle is measured by the loss over 200 steps (10 s) of simulation starting from a randomly initialized state. The loss is calculated as
$$L = \sum_{i=1}^{200} l_i = \sum_{i=1}^{200} \big([1, 0, 0, 0]\, x_i - r_i\big)^2 + 100\, u_i^2, \qquad x_{i+1} = \frac{1}{f}\big(A x_i + B u_i\big) + x_i$$
As shown in Fig. 6, the performance of RMPC is very close to that of the optimal solution, while the performance differs noticeably across prediction steps: the longer the recurrent cycle, the better the performance. Therefore, in real applications we use the control law with the longest recurrent cycle that meets the real-time requirement.

In Fig. 7, the blue line represents the average computation time of RMPC for each recurrent cycle, while the red line represents the computation time of online-optimization MPC for each prediction horizon. We use CVXPY as the QP solver for the online-optimization MPC [24]. The figure shows that the computation time of RMPC grows linearly at a low rate as the prediction horizon increases, while that of the online-optimization MPC grows rapidly. At 15 recurrent cycles, the average computation time of online-optimization MPC is about 20 times that of RMPC for obtaining the approximately optimal control law.

Fig. 7. The computation time of RMPC and CVXPY [24].

Fig. 8 shows the state trajectory controlled by the trained policy function, which is very similar to the state trajectory controlled by the optimal policy. The results of Example 1 show that the RMPC algorithm can solve the optimal control problem for linear systems.

The simulation results demonstrate the effectiveness of the RMPC method. Because of the increasing range of the output, the error between the theoretical optimal control and the output of RMPC increases with the number of prediction steps for $c$ from 1 to 9. The objective function is the cost over 15 prediction steps; hence, the error decreases with an increasing number of prediction steps once $c$ exceeds 9, reaching its minimum (0.2%) at 15 prediction steps. The maximum error is no more than 0.7%, which illustrates the generalizability and effectiveness of the RMPC method. Moreover, RMPC is about 20 times faster than online-optimization MPC, and its computation time grows linearly at a low rate.

B. Example 2: Nonlinear and Non-Input-Affine System

1) Problem Description:
The vehicle dynamics are
$$x = \begin{bmatrix} v_y \\ r \\ \varphi \\ y \end{bmatrix}, \quad u = \delta, \quad x_{k+1} = \frac{1}{f} \begin{bmatrix} \dfrac{F_{yf}\cos\delta + F_{yr}}{m} - v_x r \\[4pt] \dfrac{a F_{yf}\cos\delta - b F_{yr}}{I_z} \\[4pt] r - \dfrac{v_x \cos\varphi - v_y \sin\varphi}{R - y} \\[4pt] v_x \sin\varphi + v_y \cos\varphi \end{bmatrix} + x_k \tag{14}$$
where $F_{yf}$ and $F_{yr}$ are the lateral tire forces of the front and rear tires, respectively [25], and $R$ is the radius of the reference path.
Fig. 8. Performance comparison of different recurrent cycles in Example 1. (a) c = 5. (b) c = 7. (c) c = 9. (d) c = 11. (e) c = 13. (f) c = 15.

The lateral tire forces are usually approximated by the Fiala tire model:
$$F_{y\sharp} = \begin{cases} -C_\sharp \tan\alpha_\sharp \left( \dfrac{C_\sharp^2 (\tan\alpha_\sharp)^2}{27 (\mu F_{z\sharp})^2} - \dfrac{C_\sharp |\tan\alpha_\sharp|}{3 \mu F_{z\sharp}} + 1 \right), & |\alpha_\sharp| \leq |\alpha_{\max,\sharp}| \\ -\mu F_{z\sharp}\, \mathrm{sgn}(\alpha_\sharp), & |\alpha_\sharp| > |\alpha_{\max,\sharp}| \end{cases}$$
where $\alpha$ is the tire slip angle, $F_z$ is the tire load, $\mu$ is the lateral friction coefficient, $C$ is the tire cornering stiffness, and the subscript $\sharp \in \{f, r\}$ denotes the front or rear tire. The slip angles can be calculated from the geometric relationship between the front/rear axle and the center of gravity (CG):
$$\alpha_f = \arctan\Big(\frac{v_y + a r}{v_x}\Big) - \delta, \qquad \alpha_r = \arctan\Big(\frac{v_y - b r}{v_x}\Big)$$
The loads on the front and rear tires can be approximated by
$$F_{zf} = \frac{b}{a+b}\, m g, \qquad F_{zr} = \frac{a}{a+b}\, m g$$
The utility function is
$$l = \big([0, 0, 0, 1]\, x - r\big)^2 + 100\, u^2$$
The policy optimization problem of this example can be formulated as
$$\min_{\theta} \mathop{\mathbb{E}}_{x_0 \in \mathcal{X},\, r_{1:N_{\max}}} \Big\{ V(x_0, r_{1:N_{\max}}, N_{\max}; \theta) \Big\}$$
$$\text{s.t.} \quad x_i = f\big(x_{i-1}, \pi^{N_{\max}-i+1}(x_{i-1}, r_{i:N_{\max}}; \theta)\big), \quad i \in [1, N_{\max}], \qquad u_{\min} \leq \pi^{N_{\max}-i+1} \leq u_{\max} \tag{15}$$
where $N_{\max} = 15$, and $u_{\min} < 0 < u_{\max}$ are the front wheel angle limits.

Then we can train the vehicle control policy using the proposed RMPC algorithm to minimize the objective function.
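The tire-force part of (14) is straightforward to code. In the sketch below, $\mu$ and the mass/geometry values are placeholder assumptions (the extracted Table II only preserves the cornering stiffnesses), and the saturation angle uses the standard Fiala relation $\tan\alpha_{\max} = 3\mu F_z / C$.

```python
import numpy as np

# k_f, k_r magnitudes from Table II; mu, m, a, b are assumed placeholders
mu, g = 1.0, 9.81
m, a, b = 1500.0, 1.2, 1.4
C_f, C_r = 88000.0, 94000.0     # cornering stiffness magnitudes [N/rad]

def fiala_lateral_force(alpha, C, F_z):
    """Fiala model: cubic law inside saturation, pure sliding beyond it."""
    alpha_max = np.arctan(3.0 * mu * F_z / C)  # standard saturation angle
    t = np.tan(alpha)
    if abs(alpha) <= alpha_max:
        return -C * t * (C**2 * t**2 / (27.0 * (mu * F_z)**2)
                         - C * abs(t) / (3.0 * mu * F_z) + 1.0)
    return -mu * F_z * np.sign(alpha)

def slip_angles(v_y, r, v_x, delta):
    """Front/rear slip angles from the axle-to-CG geometry."""
    return (np.arctan((v_y + a * r) / v_x) - delta,
            np.arctan((v_y - b * r) / v_x))

# Static front/rear tire loads
F_zf, F_zr = b / (a + b) * m * g, a / (a + b) * m * g
```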
2) Algorithm Details:
The optimal policy is represented by an RNN variant, the GRU (gated recurrent unit), with 4 fully-connected hidden layers; the architecture is the same as in Example 1 except for the output layer, which is set as a tanh unit scaled to respect the bounded control. The input layer is composed of the states, followed by the hidden layers using rectified linear units (ReLUs) as activation functions. We use the Adam method to update the network.
3) Result Analysis:
Fig. 9 shows that the loss value decreases rapidly and converges to the optimal solution during training. Fig. 10 shows the state trajectory controlled by the trained policy function. The learned policy can bring the vehicle to the equilibrium state very quickly, taking less than 1 s for the cases in Fig. 10 when $c$ is greater than 9. The results of Example 2 show that the RMPC algorithm can also solve the optimal control problem for general nonlinear systems with saturated actuators.

V. CONCLUSION
Fig. 9. The performance during training.

This paper proposes an offline algorithm, called Recurrent Model Predictive Control (RMPC), to solve general nonlinear finite-horizon optimal control problems. Unlike traditional
MPC algorithms, it can make full use of the current computing resources and adaptively select the longest model prediction horizon by using a recurrent function to approximate the optimal policy. The optimal recurrent policy can be obtained by directly minimizing the designed loss function, and we prove the convergence and optimality of the RMPC algorithm through the Bellman optimality principle. To demonstrate its generality and efficiency, we apply the algorithm to the vehicle lateral control problem in two kinds of path-tracking tasks, using an RNN variant, the GRU (gated recurrent unit), as the recurrent function, and compute the error between the theoretical optimal control and the output of RMPC for each prediction horizon. The results show that the error is below 0.7% for every prediction horizon from 1 to 15, and that the RMPC algorithm is about 20 times faster than a traditional MPC algorithm for the large-scale problem in Example 1. For both examples, the simulation performance improves as the number of recurrent cycles increases.
REFERENCES

[1] S. J. Qin and T. A. Badgwell, "A survey of industrial model predictive control technology," Control Engineering Practice, vol. 11, no. 7, pp. 733–764, 2003.
[2] S. Vazquez, J. Leon, L. Franquelo, J. Rodriguez, H. A. Young, A. Marquez, and P. Zanchetta, "Model predictive control: A review of its applications in power electronics," IEEE Industrial Electronics Magazine, vol. 8, no. 1, pp. 16–31, 2014.
[3] S. E. Li, Z. Jia, K. Li, and B. Cheng, "Fast online computation of a model predictive controller and its application to fuel economy–oriented adaptive cruise control," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 3, pp. 1199–1209, 2014.
[4] J. H. Lee, "Model predictive control: Review of the three decades of development," International Journal of Control, Automation and Systems, vol. 9, no. 3, p. 415, 2011.
[5] R. Cagienard, P. Grieder, E. C. Kerrigan, and M. Morari, "Move blocking strategies in receding horizon control," Journal of Process Control, vol. 17, no. 6, pp. 563–570, 2007.
[6] R. C. Shekhar and J. M. Maciejowski, "Robust variable horizon MPC with move blocking," Systems & Control Letters, vol. 61, no. 4, pp. 587–594, 2012.
[7] Y. Wang and S. Boyd, "Fast model predictive control using online optimization," IEEE Transactions on Control Systems Technology, vol. 18, no. 2, pp. 267–278, 2009.
[8] A. Bemporad, M. Morari, V. Dua, and E. N. Pistikopoulos, "The explicit linear quadratic regulator for constrained systems," Automatica, vol. 38, no. 1, pp. 3–20, 2002.
[9] B. Kouvaritakis, M. Cannon, and J. A. Rossiter, "Who needs QP for linear MPC anyway?," Automatica, vol. 38, no. 5, pp. 879–884, 2002.
[10] T. Geyer, F. D. Torrisi, and M. Morari, "Optimal complexity reduction of polyhedral piecewise affine systems," Automatica, vol. 44, no. 7, pp. 1728–1740, 2008.
[11] C. N. Jones and M. Morari, "Polytopic approximation of explicit model predictive controllers," IEEE Transactions on Automatic Control, vol. 55, no. 11, pp. 2542–2553, 2010.
[12] C. Wen, X. Ma, and B. E. Ydstie, "Analytical expression of explicit MPC solution via lattice piecewise-affine function," Automatica, vol. 45, no. 4, pp. 910–917, 2009.
[13] F. Borrelli, M. Baotić, J. Pekar, and G. Stewart, "On the computation of linear model predictive control laws," Automatica, vol. 46, no. 6, pp. 1035–1041, 2010.
[14] B. M. Åkesson, H. T. Toivonen, J. B. Waller, and R. H. Nyström, "Neural network approximation of a nonlinear model predictive controller applied to a pH neutralization process," Computers & Chemical Engineering, vol. 29, no. 2, pp. 323–335, 2005.
[15] B. M. Åkesson and H. T. Toivonen, "A neural network model predictive controller," Journal of Process Control, vol. 16, no. 9, pp. 937–946, 2006.
[16] L. Cheng, W. Liu, Z.-G. Hou, J. Yu, and M. Tan, "Neural-network-based nonlinear model predictive control for piezoelectric actuators," IEEE Transactions on Industrial Electronics, vol. 62, no. 12, pp. 7717–7727, 2015.
[17] L. K. Li, "Approximation theory and recurrent networks," in Proc. of IJCNN, vol. 2, pp. 266–271, IEEE, 1992.
[18] A. M. Schäfer and H.-G. Zimmermann, "Recurrent neural networks are universal approximators," International Journal of Neural Systems, vol. 17, no. 04, pp. 253–263, 2007.
[19] B. Hammer, "On the approximation capability of recurrent neural networks," Neurocomputing, vol. 31, no. 1-4, pp. 107–123, 2000.
[20] Z. Allen-Zhu, Y. Li, and Z. Song, "A convergence theory for deep learning via over-parameterization," in International Conference on Machine Learning, (Long Beach, California, USA), pp. 242–252, ICML, 2019.
[21] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, "Gradient descent finds global minima of deep neural networks," in International Conference on Machine Learning, (Long Beach, California, USA), pp. 1675–1685, ICML, 2019.
[22] Z. Allen-Zhu, Y. Li, and Z. Song, "On the convergence rate of training recurrent neural networks," in Advances in Neural Information Processing Systems, pp. 6673–6685, 2019.
[23] R. Li, Y. Li, S. E. Li, E. Burdet, and B. Cheng, "Driver-automation indirect shared control of highly automated vehicles with intention-aware authority transition," in 2017 IEEE Intelligent Vehicles Symposium (IV), (Redondo Beach, California, USA), pp. 26–32, IEEE, 2017.
[24] S. Diamond and S. Boyd, "CVXPY: A Python-embedded modeling language for convex optimization," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2909–2913, 2016.
[25] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, "Kinematic and dynamic vehicle models for autonomous driving control design," in 2015 IEEE Intelligent Vehicles Symposium (IV), (Seoul, South Korea), pp. 1094–1099, IEEE, 2015.

Fig. 10. Actual and expected lateral position trajectories in Example 2 for different recurrent cycles, panels (a)–(f).