H∞ Tracking Control via Variable Gain Gradient Descent-Based Integral Reinforcement Learning for Unknown Continuous Time Nonlinear System
This paper is a preprint of a paper submitted to IET Control Theory & Applications. If accepted, the copy of record will be available at the IET Digital Library.
Amardeep Mishra (Student) and Satadal Ghosh (Faculty), Department of Aerospace Engineering, IIT Madras, Chennai 600036, India. E-mail: [email protected]
Abstract:
Optimal tracking of continuous time nonlinear systems has been extensively studied in the literature. However, in several applications, absence of knowledge about the system dynamics poses a severe challenge to solving the optimal tracking problem. This has found growing attention among researchers recently, and integral reinforcement learning (IRL)-based methods augmented with an actor neural network (NN) have been deployed to this end. However, very few studies have been directed to model-free H∞ optimal tracking control, which helps in attenuating the effect of disturbances on the system performance without any prior knowledge about the system dynamics. To this end, a recursive least square-based parameter update was recently proposed. However, a gradient descent-based parameter update scheme is more sensitive to real-time variation in plant dynamics, and the experience replay (ER) technique has been shown to improve the convergence of NN weights by utilizing past observations iteratively. Motivated by these, this paper presents a novel parameter update law based on variable gain gradient descent and the experience replay technique for tuning the weights of the critic, actor and disturbance NNs. The presented update law leads to improved model-free tracking performance under $\mathcal{L}_2$-bounded disturbance. Simulation results are presented to validate the presented update law.

Introduction

Optimal control is one of the prominent control techniques that aims to find control policies that minimize a cost function subject to the plant dynamics as constraints. Traditional optimal control techniques require full knowledge of the plant dynamics and corresponding parameters for their implementation. However, in practice, knowledge of the same might be partially available or unavailable. In order to implement optimal control methods online under such limitations, reinforcement learning (RL) [1], [2] and adaptive dynamic programming (ADP) [3], [4] approaches were proposed that solve the optimal control problem forward in time.

Regulation problems and trajectory tracking problems are the two broad classifications of the optimal control problem. The prime objective of regulation problems [5-11] is to find a control policy that brings the desired states to the origin in a finite amount of time while minimizing a cost function. On the other hand, optimal tracking control problems (OTCP) [12-15] entail finding control policies that will make the desired states (output of the system) track a time-varying reference trajectory. Traditionally, the OTCP requires the development of two different controllers: (i) a transient controller and (ii) a steady-state controller [4], [15]. The limitation of traditional OTCP solving schemes lies in the requirement of (a) knowledge of the reference dynamics and (b) an invertibility condition on the control gain matrix. Modares et al. [16], [17] proposed an augmented system comprising the error and desired dynamics to bypass this limitation. Finding the control policy that stabilizes the augmented system while minimizing the performance index was the prime objective of their novel control algorithm. The control policy generated by their algorithm also consisted of both transient and steady-state controllers.

In the ADP schemes mentioned above, identifiers were used to obviate the exact knowledge of the nominal plant dynamics. However, identifiers add to the computational complexity and also reduce the accuracy of the computations [18]. In most cases, identifiers also require knowledge of the structure of the plant dynamics.
Hence, efforts have been devoted to make RL schemes either partially model-free or completely model-free. In order to develop continuous time optimal control policies under partial or no knowledge of the plant dynamics, the integral reinforcement learning (IRL) algorithm was leveraged. While the first few results in this direction for the regulation problem were presented in [19-22], Modares et al. [16] developed algorithms for the OTCP for partially-unknown systems. Thereafter, Zhu et al. [18] developed off-policy model-free tracking control of continuous time nonlinear systems using IRL. They leveraged the experience replay (ER) technique to effectively utilize past observations in order to update the NN weights. Further, the UUB stability of the update law was also proved.

It may also be noted that most of the aforementioned RL schemes, for both regulation and tracking problems, do not deal with attenuation of the effects of disturbance. To this end, the H∞ regulation problem has been studied using RL both offline [23, 24] and online [25-27]. Online IRL was also utilized in [28] and [29] for the H∞ regulation problem for partially-unknown systems. Note that under partial or no knowledge of the plant dynamics structure, IRL has been leveraged in several works for the regulation problem, while very few studies have dealt with IRL for the OTCP with disturbance rejection. To the best of the authors' knowledge, [30] and [31] are the only few papers that have recently presented control policies for model-free OTCP of continuous time nonlinear systems in the H∞ framework. Modares et al. [30] updated the parameters of the critic, actor and disturbance NNs using the least square method, which could only be initiated after a certain number of data had been collected. This makes their algorithm less sensitive to real-time variations in plant dynamics [18]. On the other hand, Zhang et al. [31] utilized a gradient descent driven parameter update law for the H∞ tracking control problem, and uniform ultimate boundedness (UUB) stability of the parameter update law was proven. However, their gradient descent followed a constant learning rate.

While a continuous time update law driven by gradient descent is more sensitive to real-time variations in plant dynamics, the experience replay (ER) technique has been shown to improve the learning speed significantly by utilizing past observations iteratively [18].
Also, the addition of 'robust terms' in the update law was shown to shrink the residual set in [32]. Inspired by [18] and [30], this paper presents a novel off-policy IRL-based H∞ tracking control scheme for continuous time nonlinear systems, in which the parameter update laws for tuning the weights of the critic, actor and disturbance NNs are driven by variable gain gradient descent and the ER technique in addition to robust terms. Instead of a constant learning rate as in traditional gradient descent-based schemes, the learning rate of the gradient descent developed in this paper is variable and a function of the Hamilton-Jacobi-Isaacs (HJI) error. This results in an increased learning rate when the HJI error is large, and the learning rate is reduced as the HJI error becomes smaller. The variable gain gradient descent technique is also shown to have the added advantage of shrinking the size of the residual set to which the NN weights finally converge. The term corresponding to the ER technique and the robust terms in the update law also contribute to further shrinking the size of the residual set. Unlike [32], the update law presented in this paper leverages robust terms not only for the present instance but for past instances as well. After the completion of the learning phase, the final learnt policies leveraging variable gain gradient descent are executed, and they are shown to reduce the oscillations in the transient phase and the steady-state errors, thus resulting in improved tracking performance.

The rest of the paper is structured as follows. Preliminaries and background of the H∞ tracking controller and the tracking HJI equation for the augmented system are presented in Section 2. Next, in order to obviate the requirement of the system dynamics in the policy evaluation step, a model-free version of the HJI equation to formulate IRL, and neural networks to approximate the value function, control and disturbance policies, are presented in Section 3. Section 4 highlights the main contribution of this paper, i.e., the continuous time weight update law that is driven by variable gain gradient descent and the experience replay technique. The update law also incorporates the robust terms and their past observations. UUB stability analysis for the proposed mechanism is shown. Numerical studies are presented in Section 5 to justify the effectiveness of the presented algorithm. Finally, Section 6 provides concluding remarks.

H∞ Tracking Problem and HJI Equation
Problem Formulation
It is desired to drive certain states of interest of a dynamical system to follow predefined reference trajectories under $\mathcal{L}_2$-bounded disturbance in an optimal way. Let the dynamical system be described by an affine-in-control differential equation:
$\dot{x} = f(x) + g(x)u + k(x)d$  (1)
where $x \in \mathbb{R}^n$, $u \in \mathbb{R}^m$, $d \in \mathbb{R}^l$, $f(x): \mathbb{R}^n \to \mathbb{R}^n$ is the drift dynamics, $g(x): \mathbb{R}^n \to \mathbb{R}^{n\times m}$ represents the control coupling dynamics and $k(x): \mathbb{R}^n \to \mathbb{R}^{n\times l}$ is the disturbance dynamics. In the subsequent analysis in this paper, it is assumed that none of the system dynamics, that is $f(x)$, $g(x)$ and $k(x)$, are known. However, Lipschitz continuity of the system dynamics as well as controllability of the system over a compact set $\Omega \subset \mathbb{R}^n$ are assumed. The reference trajectory is generated by a command generator or a reference system whose dynamics is described by
$\dot{x}_d = \eta(x_d)$  (2)
Thus, the tracking error is given by
$e = x - x_d$  (3)
Therefore, the error dynamics is given as
$\dot{e} = f(x) + g(x)u + k(x)d - \eta(x_d)$  (4)
In order to formulate the corresponding HJI equation and assess the effect of the disturbance on the closed-loop system, a virtual performance index $|X|^2$ is defined as [30]
$|X|^2 = e^T Q e + u^T R u$  (5)
where $Q$ and $R$ are positive definite matrices with only diagonal entries. It is to be noted that all vector and matrix norms used in this paper are 2-norms (the Euclidean norm). In [30], the disturbance attenuation condition was characterized as the $\mathcal{L}_2$-gain being smaller than or equal to $\alpha^2$ for all $d \in \mathcal{L}_2[0,\infty)$, that is,
$\dfrac{\int_t^{\infty} e^{-\gamma(\tau-t)} |X|^2\, d\tau}{\int_t^{\infty} e^{-\gamma(\tau-t)} \|d(\tau)\|^2\, d\tau} \le \alpha^2$  (6)
where $\gamma > 0$ is the discount factor and $\alpha$ determines the degree of attenuation from the disturbance input to the virtual performance measure. The value of $\alpha$ is selected based on trial and error; the minimum value of $\alpha$ for which (6) is satisfied provides the optimal-robust control solution [30]. Now, using (5) and (6),
$\int_t^{\infty} e^{-\gamma(\tau-t)} \left( e^T Q e + u^T R u \right) d\tau \le \alpha^2 \int_t^{\infty} e^{-\gamma(\tau-t)} \|d(\tau)\|^2\, d\tau$  (7)
Finding a control policy $u$, dependent on the tracking error and the reference trajectory, such that the system dynamics (1) satisfies the disturbance attenuation condition (7) and the error dynamics (4) is locally asymptotically stable for $d = 0$, constitutes the H∞ tracking control problem [30].
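For readers who wish to check the attenuation condition (6) numerically along a simulated trajectory, the following minimal Python sketch forms the ratio of the two discounted integrals using simple Euler integration over a long but finite horizon. The function names, the integration scheme and the finite horizon are illustrative assumptions and are not part of the formulation above.

```python
import numpy as np

def attenuation_ratio(f, g, k, policy, disturbance, x0, xd_traj, Q, R,
                      gamma, dt=1e-3, horizon=50.0):
    """Numerically evaluate the L2-gain ratio in (6) along one trajectory.

    f, g, k     : callables giving the drift, control-coupling and disturbance dynamics
    policy      : u = policy(x, x_d);  disturbance : d = disturbance(t)
    xd_traj     : x_d = xd_traj(t), the reference trajectory
    """
    x = np.array(x0, dtype=float)
    num, den = 0.0, 0.0
    for step in range(int(horizon / dt)):
        t = step * dt
        xd = xd_traj(t)
        u = policy(x, xd)
        d = disturbance(t)
        e = x - xd
        w = np.exp(-gamma * t)                       # discount e^{-gamma*(tau - t)} with t = 0
        num += w * (e @ Q @ e + u @ R @ u) * dt      # discounted |X|^2
        den += w * (d @ d) * dt                      # discounted ||d||^2
        x = x + dt * (f(x) + g(x) @ u + k(x) @ d)    # Euler step of (1)
    return num / den                                 # compare against alpha^2
```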
HJI Equation: Preliminaries

The first part of this section deals with the development of the HJI equation for solving the H∞ tracking problem stated above, while the second part discusses the policy iteration steps. As discussed in [30], the H∞ tracking problem can also be posed as a min-max optimization problem subject to an augmented system dynamics comprising the error dynamics and the desired state dynamics. Subsequently, the solution to the min-max optimization problem is obtained by imposing the stationarity condition on the Hamiltonian. In order to formulate the tracking HJI equation, an augmented state vector is defined as
$z = [e^T, x_d^T]^T$  (8)
And the augmented system dynamics is then given as
$\dot{z} = F(z) + G(z)u + K(z)d$  (9)
where
$F(z) = \begin{pmatrix} f(x) - \eta(x_d) \\ \eta(x_d) \end{pmatrix},\quad G(z) = \begin{pmatrix} g(x) \\ 0 \end{pmatrix},\quad K(z) = \begin{pmatrix} k(x) \\ 0 \end{pmatrix}$  (10)
In the subsequent analysis, $F \triangleq F(z)$, $G \triangleq G(z)$ and $K \triangleq K(z)$. Using the augmented states, the attenuation condition (7) can be described as
$\int_t^{\infty} e^{-\gamma(\tau-t)} \left( z^T \bar{Q} z + u^T R u \right) d\tau \le \alpha^2 \int_t^{\infty} e^{-\gamma(\tau-t)} \|d(\tau)\|^2 d\tau$  (11)
where $\bar{Q}$ is given by
$\bar{Q} = \begin{pmatrix} Q_{n\times n} & 0_{n\times n} \\ 0_{n\times n} & 0_{n\times n} \end{pmatrix}_{2n\times 2n}$  (12)
Thus, a final performance index including the disturbance input is defined as
$J(u,d) = \int_t^{\infty} e^{-\gamma(\tau-t)} \left( z(\tau)^T \bar{Q} z(\tau) + u(\tau)^T R u(\tau) - \alpha^2 \|d(\tau)\|^2 \right) d\tau$  (13)
The problem of finding a control input $u$ that satisfies (6) is the same as minimizing (13) subject to the augmented dynamics. In [33], a direct relationship between the H∞ control problem and the two-player zero-sum differential game was established. It was shown that the solution of the H∞ control problem is equivalent to the solution of the following zero-sum game:
$V^*(z) = J(u^*, d^*) = \min_u \max_d J(u,d)$  (14)
In the subsequent analysis, $V \triangleq V(z)$, $V_z \triangleq \nabla V(z)$. The term $J$ is as defined in (13) and $V^*$ is the optimal value function. The existence of a game theoretic saddle point was also shown to guarantee the existence of the solution of the two-player zero-sum game control problem. This is encapsulated in the following Nash condition:
$V^*(z) = \min_u \max_d J(u,d) = \max_d \min_u J(u,d)$  (15)
Differentiating (13) along the augmented system trajectories, the following Bellman equation is obtained:
$z^T \bar{Q} z + u^T R u - \alpha^2 d^T d - \gamma V + V_z^T (F + Gu + Kd) = 0$  (16)
Let the Hamiltonian be defined as
$H(z, V, u, d) = z^T \bar{Q} z + u^T R u - \alpha^2 d^T d - \gamma V + V_z^T (F + Gu + Kd)$  (17)
$V^*$, being the optimal cost, satisfies the Bellman equation. Applying the stationarity condition on the Hamiltonian, both the optimal control input and the optimal disturbance input are obtained as
$\dfrac{\partial H(z, V^*, u, d)}{\partial u} = 0 \implies u^* = -\dfrac{1}{2} R^{-1} G^T \nabla V^*; \qquad \dfrac{\partial H(z, V^*, u, d)}{\partial d} = 0 \implies d^* = \dfrac{1}{2\alpha^2} K^T \nabla V^*$  (18)
The optimal control input and disturbance input given above provide the saddle point solution of the game [23]. Using (18) in (17), the tracking HJI equation is
$z^T \bar{Q} z + V_z^{*T} F - \gamma V^* - \dfrac{1}{4} V_z^{*T} G R^{-1} G^T V_z^* + \dfrac{1}{4\alpha^2} V_z^{*T} K K^T V_z^* = H(z, V^*, u^*, d^*) = 0$  (19)
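The stationarity conditions (18) translate into a short computation once an estimate of the value gradient is available. The sketch below assumes the augmented input matrices G, K and an approximation of the gradient at the current augmented state are supplied by the caller; it is an illustration of (18), not an implementation prescribed by the paper.

```python
import numpy as np

def optimal_policies(grad_V, G, K, R, alpha):
    """Evaluate the stationarity conditions (18) for a given value-gradient estimate.

    grad_V : (2n,) approximation of the value-function gradient at z
    G, K   : (2n, m) and (2n, l) augmented control and disturbance matrices at z
    R      : (m, m) positive definite control penalty
    alpha  : disturbance attenuation level
    """
    u_star = -0.5 * np.linalg.solve(R, G.T @ grad_V)      # u* = -1/2 R^{-1} G^T grad(V)
    d_star = (1.0 / (2.0 * alpha**2)) * (K.T @ grad_V)    # d* = 1/(2 alpha^2) K^T grad(V)
    return u_star, d_star
```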
Theorem 2.1. Let $V^*$ be the smooth, positive semi-definite and quadratic solution of the tracking HJI equation (19). Then the optimal control action $u^*$ generated by (18) makes the tracking error dynamics (4) asymptotically stable in the limiting sense when the discount factor $\gamma \to 0$ and when $d = 0$. Additionally, there exists an upper bound on the discount factor $\gamma$ below which the error dynamics is asymptotically stable. However, if $\gamma$ is greater than this bound or if $d \neq 0$, then only UUB stability can be ensured.

Proof: The detailed discussion of this proof can be seen in Theorems 3 and 4 of [30]. Theorem 3 of [30] proves the asymptotic stability of the error dynamics in the limiting sense when $\gamma \to 0$. Theorem 4 of [30], on the other hand, provides an upper bound on the discount factor $\gamma$ below which asymptotic stability of the error dynamics can still be ensured when $d = 0$. Now, when $d \neq 0$ and $\gamma \neq 0$, only UUB stability of the error dynamics can be ensured. From (16), it can be seen that
$\dot{V} = \gamma V - z^T \bar{Q} z - u^T R u + \alpha^2 d^T d$  (20)
In order for $\dot{V}$ to be negative, the following inequality should hold (using $z^T \bar{Q} z = e^T Q e$):
$\|e\| > \sqrt{\dfrac{\gamma V^* - \lambda_{min}(R)\|u\|^2 + \alpha^2 \|d\|^2}{\lambda_{min}(Q)}}$  (21)
provided $V^*$, $\|u\|$ and $\|d\|$ are finite. $\square$

As a consequence of Theorem 4 of [30], a small value of $\gamma$ and/or a positive definite matrix $Q$ (such that its minimum eigenvalue is large) was suggested to ensure asymptotic stability of the error dynamics.

Policy iteration is a computational approach to iteratively solve the Bellman equation and improve the control policies. It is generally started with some known initial stabilizing policy $u_0$, and then the following two steps are repeated iteratively until convergence is achieved.
(i) Policy evaluation: Given admissible control and disturbance policies, this step entails solving the Bellman equation (where $V_i$, $u_i$, $d_i$ denote the value function and policies at the $i$-th iteration):
$\nabla V_i^T (F + G u_i + K d_i) - \gamma V_i + z^T \bar{Q} z + u_i^T R u_i - \alpha^2 d_i^T d_i = 0$  (22)
(ii) Policy improvement: This step produces improved control and disturbance policies:
$u_{i+1} = -\dfrac{1}{2} R^{-1} G^T \nabla V_i; \qquad d_{i+1} = \dfrac{1}{2\alpha^2} K^T \nabla V_i$  (23)
Note that the implementation of traditional policy iteration algorithms requires complete knowledge of the system dynamics. However, this requirement in the policy evaluation step can be obviated via integral reinforcement learning (IRL) [16], [22]. In their formulation, the requirement of $f(x)$ is precluded; however, $g(x)$ and $k(x)$ are still needed to improve the policies. In order to make the control completely model-free, Modares and Lewis [30] presented H∞ tracking control leveraging IRL as well as three NNs - actor, critic and disturbance - to approximate the control action, value function and disturbance policy, respectively. Motivated by this result, this paper also uses three sets of NNs to approximate the value function, control and disturbance policies in order to solve the H∞ tracking problem with a novel continuous time update law for all three NNs.
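The two-step iteration above can be summarized by the following schematic loop. The routines `evaluate_value` and `improve_policies` are placeholders for whatever solver realizes (22) and (23) (for instance, the IRL/NN machinery developed in the following sections), and the convergence test on the value parameters is an illustrative choice.

```python
import numpy as np

def policy_iteration(u0, d0, evaluate_value, improve_policies,
                     tol=1e-6, max_iters=100):
    """Schematic policy iteration for the zero-sum tracking game (22)-(23).

    evaluate_value(u, d) returns a parameter vector describing V_i (e.g. critic
    weights); improve_policies(V) applies (23). Both are placeholders.
    """
    u, d = u0, d0                       # initial admissible policies
    V_prev = None
    for _ in range(max_iters):
        V = evaluate_value(u, d)        # policy evaluation: solve (22)
        u, d = improve_policies(V)      # policy improvement: apply (23)
        if V_prev is not None and np.linalg.norm(V - V_prev) < tol:
            break                       # value parameters have converged
        V_prev = V
    return V, u, d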
Derivation of Model-Free HJI equation

In order to completely remove the requirement of the system dynamics from the policy evaluation step, IRL is utilized in the following way. In the subsequent analysis, $u$ refers to the executed control policy and $d$ refers to the disturbance present in the system. It is assumed that an initial admissible policy is known. The improved policies, on the other hand, are denoted by $u_i$, $d_i$. Then, from (9), the augmented system dynamics can be re-written as
$\dot{z} = F + G u_i + K d_i + G(u - u_i) + K(d - d_i)$  (24)
Taking the derivative of $V_i(z)$ along (24), a revised form of the Bellman equation (see (16)) is given by
$V_{zi}^T(F + G u_i + K d_i) + V_{zi}^T G(u - u_i) + V_{zi}^T K(d - d_i) - \gamma V_i = -z^T \bar{Q} z - u_i^T R u_i + \alpha^2 d_i^T d_i$  (25)
Multiplying both sides of (25) by $e^{-\gamma t}$, the left hand side (LHS) of (25) can be expressed as
$\dfrac{d(e^{-\gamma t} V_i(z))}{dt} = e^{-\gamma t}\big( V_{zi}^T(F + G u_i + K d_i) + V_{zi}^T G(u - u_i) + V_{zi}^T K(d - d_i) - \gamma V_i \big)$  (26)
where $V_{zi} \triangleq \nabla_z V_i$. Using (22) and (23) in (26), $d(e^{-\gamma t} V_i)/dt$ can be rewritten as
$\dfrac{d(e^{-\gamma t} V_i(z))}{dt} = e^{-\gamma t}\big( -z^T \bar{Q} z - u_i^T R u_i + \alpha^2 d_i^T d_i + V_{zi}^T G(u - u_i) + V_{zi}^T K(d - d_i) \big) = e^{-\gamma t}\big( -z^T \bar{Q} z - u_i^T R u_i + \alpha^2 d_i^T d_i - 2 u_{i+1}^T R(u - u_i) + 2\alpha^2 d_{i+1}^T(d - d_i) \big)$  (27)
Integrating both sides of (27) over $[t-T, t]$ and rearranging,
$V_i(t) - e^{\gamma T} V_i(t-T) + \int_{t-T}^{t} e^{-\gamma(\tau-t)}\big( z^T \bar{Q} z + u_i^T R u_i - \alpha^2 d_i^T d_i + 2 u_{i+1}^T R(u - u_i) - 2\alpha^2 d_{i+1}^T(d - d_i) \big)\, d\tau = 0$  (28)
where $V_i(t) \triangleq V_i(z(t))$ and $V_i(t-T) \triangleq V_i(z(t-T))$. Note that (28) resembles (25) in the limiting sense when $T \to 0$. To maintain this equivalence, the reinforcement interval $T$ should be selected as small as possible. However, it is important to note that, compared to (22) or (25), equation (28) does not include the system dynamics. Eq. (28) can be rewritten as
$V_i(t) - e^{\gamma T} V_i(t-T) + I + \int_{t-T}^{t} e^{-\gamma(\tau-t)}\big( 2 u_{i+1}^T R(u - u_i) - 2\alpha^2 d_{i+1}^T(d - d_i) \big)\, d\tau = 0$  (29)
where
$I = \int_{t-T}^{t} e^{-\gamma(\tau-t)}\big( z^T \bar{Q} z + u_i^T R u_i - \alpha^2 d_i^T d_i \big)\, d\tau$  (30)
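Since (28)-(30) are evaluated from measured data over each interval $[t-T, t]$, the reinforcement integral $I$ in (30) can be approximated by quadrature over the samples collected in that interval. The following sketch uses the trapezoidal rule; the sampling arrangement and the quadrature rule are assumptions made for illustration only.

```python
import numpy as np

def reinforcement_integral(ts, zs, us, ds, Qbar, R, alpha, gamma):
    """Approximate the reinforcement integral I in (30) over one interval
    [t-T, t] from sampled data, using the trapezoidal rule.

    ts         : (N,) sample times covering [t-T, t]
    zs, us, ds : sequences of augmented state, control and disturbance samples
    """
    t_end = ts[-1]
    integrand = np.array([
        np.exp(-gamma * (tau - t_end)) *
        (z @ Qbar @ z + u @ R @ u - alpha**2 * d @ d)
        for tau, z, u, d in zip(ts, zs, us, ds)
    ])
    return np.trapz(integrand, ts)
```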
Approximation of Value function, Control policy and Disturbance policy

Similar to [18] and [30], the value function and the improved policies are represented by
$V_i = W_c^T \sigma_c + \varepsilon_c; \quad u_{i+1} = W_a^T \sigma_a + \varepsilon_a; \quad d_{i+1} = W_d^T \sigma_d + \varepsilon_d$  (31)
where $W_c \in \mathbb{R}^{a_1}$, $\sigma_c \in \mathbb{R}^{a_1}$, $W_a \in \mathbb{R}^{a_2 \times m}$, $\sigma_a \in \mathbb{R}^{a_2}$, $W_d \in \mathbb{R}^{a_3 \times l}$ and $\sigma_d \in \mathbb{R}^{a_3}$. Using (31) in (29), the HJI error becomes
$\varepsilon_{HJI} = W_c^T[\sigma_c(t) - e^{\gamma T}\sigma_c(t-T)] + I + \int_{t-T}^{t} e^{-\gamma(\tau-t)}\big( 2\sigma_a^T W_a R(u - u_i) - 2\alpha^2 \sigma_d^T W_d (d - d_i) \big)\, d\tau$  (32)
Now, since the ideal weights are not known, their estimates will be utilized instead:
$\hat{V}_i = \hat{W}_c^T \sigma_c; \quad \hat{u}_{i+1} = \hat{W}_a^T \sigma_a; \quad \hat{d}_{i+1} = \hat{W}_d^T \sigma_d$  (33)
Then, the HJI error in terms of the estimated weights can be written as
$\hat{e}(t) = \hat{W}_c^T[\sigma_c(t) - e^{\gamma T}\sigma_c(t-T)] + I + v(\hat{W}_a)^T \int_{t-T}^{t} e^{-\gamma(\tau-t)}\big( 2R(u - u_i) \otimes \sigma_a \big)\, d\tau - v(\hat{W}_d)^T \int_{t-T}^{t} e^{-\gamma(\tau-t)}\big( 2\alpha^2 (d - d_i) \otimes \sigma_d \big)\, d\tau$  (34)
Eq. (34) can be written in compact form as
$\hat{e}(t) = \hat{W}^T \rho + I(t)$  (35)
where $I$ is the reinforcement integral given in (30), and
$\hat{W} = \begin{pmatrix} \hat{W}_c \\ v(\hat{W}_a) \\ v(\hat{W}_d) \end{pmatrix}, \qquad \rho = \begin{pmatrix} \Delta\sigma_c \\ \int_{t-T}^{t} e^{-\gamma(\tau-t)}\big( 2R(u - u_i) \otimes \sigma_a \big)\, d\tau \\ -\int_{t-T}^{t} e^{-\gamma(\tau-t)}\big( 2\alpha^2 (d - d_i) \otimes \sigma_d \big)\, d\tau \end{pmatrix}$  (36)
where $v(\cdot)$ represents vectorization of a matrix and $\otimes$ denotes the Kronecker product. Here, $\hat{W} \in \mathbb{R}^{q}$ is the composite NN weight vector and $\rho \in \mathbb{R}^{q}$ is the composite regressor vector, where $q = a_1 + m a_2 + l a_3$, in which $m$ is the dimension of the control vector, $l$ is the dimension of the disturbance vector, and $a_1$, $a_2$, $a_3$ are the numbers of neurons in the hidden layers (the sizes of the regressor vectors) of the critic, actor and disturbance NNs, respectively. Also, $\Delta\sigma_c = \sigma_c(t) - e^{\gamma T}\sigma_c(t-T)$. In the subsequent discussion, $\hat{e}(t)$ and $\hat{e}(t_j)$ are denoted by $\hat{e}$ and $\hat{e}_j$, respectively, and $\rho \triangleq \rho(t)$, $\rho_j \triangleq \rho(t_j)$. Similarly, $I \triangleq I(t)$ and $I_j \triangleq I(t_j)$.
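The compact form (35)-(36) amounts to stacking the critic difference vector and two Kronecker-product interval integrals into one regressor. A minimal sketch is given below; it assumes the two interval integrals have already been computed (for example with a quadrature routine like the one sketched earlier), and the argument names are illustrative.

```python
import numpy as np

def hji_error(W_hat, dsigma_c, int_Ru_kron_sa, int_ad_kron_sd, I_t):
    """Assemble the composite regressor rho and the HJI error of (35)-(36).

    dsigma_c       : sigma_c(t) - exp(gamma*T) * sigma_c(t-T)
    int_Ru_kron_sa : precomputed integral of 2 R (u - u_i) kron sigma_a
    int_ad_kron_sd : precomputed integral of 2 alpha^2 (d - d_i) kron sigma_d
    W_hat          : composite weight estimate [W_c; vec(W_a); vec(W_d)]
    """
    rho = np.concatenate([dsigma_c, int_Ru_kron_sa, -int_ad_kron_sd])
    e_hat = W_hat @ rho + I_t           # HJI error (35)
    return e_hat, rho
```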
Existing update laws in literature for Off-policy IRL

In [30], a recursive least square-based update law was proposed to minimize the HJI approximation error in order to solve the H∞ tracking problem. Their update law was given by
$\hat{W} = (\aleph\aleph^T)^{-1}\aleph Y$  (37)
where $\aleph = [\rho(t_1), \rho(t_2), \rho(t_3), \ldots, \rho(t_N)]$ and $Y = [-I(t_1), -I(t_2), -I(t_3), \ldots, -I(t_N)]^T$. This yields $V_i$, $u_{i+1}$ and $d_{i+1}$. However, this discrete-time update law requires that $N$ samples be collected before (37) can be implemented. This procedure makes it less sensitive to real-time variation in plant parameters.

A continuous-time update law is more sensitive to real-time parametric variations, and hence a continuous-time update law was presented in [18] utilizing constant-gain gradient descent and the experience replay (ER) technique to train the actor and critic NNs in off-policy IRL to solve the optimal tracking problem. A memory stack $\{\rho(t_j), I(t_j)\}_{j=1}^{N}$ containing past observations was constructed, which was then utilized for incorporating the ER terms in the parameter update law in order to improve the performance of the gradient descent-based algorithm by iteratively using past observations. However, this control formulation did not incorporate any disturbance rejection. Their update law was given as
$\dot{\hat{W}} = -\dfrac{\eta}{N+1}\Big( \dfrac{\rho}{m_s^2}\hat{e} + \sum_{j=1}^{N} \dfrac{\rho_j}{m_{sj}^2}\hat{e}_j \Big)$  (38)
where $m_s$ and $m_{sj}$ are the regressor normalization terms defined after (39), and $\rho$ in (38) is made up of the first two components of $\rho$ in (36); that is, $\rho$ here does not contain the last component of $\rho$, which corresponds to the disturbance term ($d$).

Recently, a continuous time update law was presented in [31] for H∞ tracking control incorporating terminal constraints, in which the update law relied only on constant learning rate-based gradient descent apart from a term dedicated to incorporating the terminal constraints.
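For comparison with the continuous-time laws developed in the next section, the batch relation (37) is simply a linear least-squares solve over the $N$ stored samples. The sketch below illustrates this with a pseudo-inverse-based solver; the data layout is an assumption for illustration.

```python
import numpy as np

def batch_ls_weights(rhos, Is):
    """Batch least-squares solve of (37) from N stored samples.

    rhos : list of composite regressors rho(t_1), ..., rho(t_N)
    Is   : list of reinforcement integrals I(t_1), ..., I(t_N)
    """
    Aleph = np.column_stack(rhos)                  # q x N regressor matrix
    Y = -np.array(Is)                              # targets -I(t_k)
    # Solves min ||Aleph^T W - Y||, equivalent to W = (Aleph Aleph^T)^{-1} Aleph Y
    W_hat, *_ = np.linalg.lstsq(Aleph.T, Y, rcond=None)
    return W_hat
```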
Novel update law

All the update laws mentioned in Section 3.3 either utilize the recursive least square (RLS) method [30] or gradient descent with a constant learning rate [31]. While the RLS-based update laws are usually found to be less sensitive to real-time parameter variations in plant dynamics [18], a constant learning rate-based update law cannot scale the learning rate based on the instantaneous value of the HJI error [34].
Fig. 1: Block diagram of the control system.

Also, the experience replay technique and the inclusion of robust terms in update laws were found to be beneficial [18, 35], [32]. Considering these, a novel continuous time update law is presented in this paper to tune the critic, actor and disturbance NN weights online in order to solve the H∞ tracking problem. The novel update law utilizing variable gain gradient descent and the ER technique is
$\dot{\hat{W}} = -\dfrac{\eta}{N+1}\Big( \dfrac{\rho}{m_s^2}|\hat{e}|^{k}\hat{e} + \sum_{j=1}^{N} \dfrac{\rho_j}{m_{sj}^2}|\hat{e}_j|^{k}\hat{e}_j - K_1|\hat{e}|^{k}\dfrac{\rho^T}{m_s}\hat{W} + |\hat{e}|^{k}K_2\hat{W} - K_1\sum_{j=1}^{N}|\hat{e}_j|^{k}\dfrac{\rho_j^T}{m_{sj}}\hat{W} \Big)$  (39)
where $m_s = \sqrt{\rho^T\rho}$, $m_{sj} = \sqrt{\rho_j^T\rho_j}$, $K_1 \in \mathbb{R}^{q}$ and $K_2 \in \mathbb{R}^{q\times q}$, with $q = a_1 + m a_2 + l a_3$ ($m$ is the dimension of the control vector, $l$ is the dimension of the disturbance vector, and $a_1$, $a_2$, $a_3$ are as defined after (36)).

It can be seen that the update law (39) utilizes a variable learning rate (via the term $|\hat{e}|^k$) that is a function of the instantaneous HJI error. This has the advantage of scaling the learning rate and reducing the size of the residual set for the error in NN weights, as will become clear in the stability proof of Theorem 4.1. Additionally, the second and fifth terms (the terms under summation) correspond to the ER terms; these terms use past observations much more effectively. The memory stack in ER can be updated with recent data as and when they arrive. This leads to efficient learning from past data. Finally, the inclusion of robust terms in the update law provides robustness against variations in the approximation errors and also reduces the size of the residual set for the error in NN weights.

Now, assuming that $u_i$ and $d_i$ are in a sufficiently small neighborhood of the optimal policies, the HJI error can be modified in the same way as mentioned in [18] (refer to Section 4.2 in [18]). It is assumed that there exist NNs that can approximate the optimal value, action and disturbance policies as
$V^* = W_c^T \sigma_c + \varepsilon_c; \quad u^* = W_a^T \sigma_a + \varepsilon_a; \quad d^* = W_d^T \sigma_d + \varepsilon_d$  (40)
Under these assumptions, using (40) in (28), the HJI approximation error is obtained as
$\varepsilon_{HJI} = W_c^T[\sigma_c(t) - e^{\gamma T}\sigma_c(t-T)] + \int_{t-T}^{t} e^{-\gamma(\tau-t)}\big( z^T \bar{Q} z + 2\sigma_a^T W_a R u - \sigma_a^T W_a R W_a^T \sigma_a + \alpha^2 \sigma_d^T W_d W_d^T \sigma_d - 2\alpha^2 \sigma_d^T W_d d \big)\, d\tau$  (41)
In terms of the approximation errors, the HJI error $\varepsilon_{HJI}$ can equivalently be given as
$\varepsilon_{HJI} = \varepsilon_c(t) - e^{\gamma T}\varepsilon_c(t-T) - \int_{t-T}^{t} e^{-\gamma(\tau-t)}\big[ 2\varepsilon_a^T R u - 2\varepsilon_a^T R W_a^T \sigma_a - \varepsilon_a^T R \varepsilon_a + \alpha^2 \varepsilon_d^T \varepsilon_d - 2\alpha^2 \varepsilon_d^T d \big]\, d\tau$  (42)
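A minimal numerical sketch of one integration step of (39) is given below. It assumes the current pair $(\hat e, \rho)$, a replay stack of past pairs and the gains are available; the explicit Euler step and the argument names are illustrative choices rather than the paper's implementation.

```python
import numpy as np

def vggd_update_step(W_hat, e_hat, rho, stack, K1, K2, eta, k, dt):
    """One Euler step of the variable gain gradient descent law (39).

    stack : list of past (e_hat_j, rho_j) pairs kept for experience replay
    K1    : (q,) robust gain vector,  K2 : (q, q) robust gain matrix
    k     : exponent of the variable gain |e|^k,  eta : base learning rate
    """
    N = len(stack)
    ms2 = rho @ rho                       # m_s^2
    g = abs(e_hat) ** k                   # variable gain for the current sample
    grad = (rho / ms2) * g * e_hat        # instantaneous gradient term
    robust = -K1 * g * (rho @ W_hat) / np.sqrt(ms2) + g * (K2 @ W_hat)
    replay = np.zeros_like(W_hat)
    for e_j, rho_j in stack:              # experience-replay terms
        gj = abs(e_j) ** k
        msj2 = rho_j @ rho_j
        replay += (rho_j / msj2) * gj * e_j
        robust += -K1 * gj * (rho_j @ W_hat) / np.sqrt(msj2)
    W_dot = -(eta / (N + 1)) * (grad + replay + robust)
    return W_hat + dt * W_dot             # Euler integration of the weights
```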
Assumption 1. It is assumed that the control policy $u$ is an admissible policy for the augmented system. This makes the augmented system remain within a compact set $\Omega$. Such an admissible policy is chosen for the online training of the critic, actor and disturbance NNs.
Assumption 2. There exist bounds such that $\|W\| \le W_M$, $\|\bar{\rho}\| \le \bar{\rho}_M$, $|m_s| \le m_{sM}$, $\|W_c\| \le W_{cm}$, $\|W_a\| \le W_{am}$, $\|W_d\| \le W_{dm}$, $\|\sigma_c\| \le b_c$, $\|\sigma_a\| \le b_a$, $\|\sigma_d\| \le b_d$, $\|\varepsilon_c\| \le b_{\varepsilon c}$, $\|\varepsilon_a\| \le b_{\varepsilon a}$, $\|\varepsilon_d\| \le b_{\varepsilon d}$. This is in line with Assumption 2 of [18].

Now, since the ideal NN weights are not known, their estimates will be utilized to express the optimal value function and the optimal policies:
$\hat{V}^* = \hat{W}_c^T \sigma_c, \quad \hat{u}^* = \hat{W}_a^T \sigma_a, \quad \hat{d}^* = \hat{W}_d^T \sigma_d$  (43)
Based on these estimated weights, following (41) the approximate HJI error can be re-stated as
$\hat{e} = \hat{W}_c^T[\sigma_c(t) - e^{\gamma T}\sigma_c(t-T)] + \int_{t-T}^{t} e^{-\gamma(\tau-t)}\big( z^T \bar{Q} z + 2\sigma_a^T \hat{W}_a R u - \sigma_a^T \hat{W}_a R \hat{W}_a^T \sigma_a + \alpha^2 \sigma_d^T \hat{W}_d \hat{W}_d^T \sigma_d - 2\alpha^2 \sigma_d^T \hat{W}_d d \big)\, d\tau$  (44)
The approximate HJI error can thus be expressed in compact form as
$\hat{e}(t) = \hat{W}^T \rho + v(\hat{W}_a)^T A_2 v(\hat{W}_a) - v(\hat{W}_d)^T B_2 v(\hat{W}_d) + I$  (45)
where
$\Delta\sigma_c \triangleq \sigma_c(t) - e^{\gamma T}\sigma_c(t-T)$, $A_1 \triangleq \int_{t-T}^{t} e^{-\gamma(\tau-t)}(Ru \otimes \sigma_a)\, d\tau$, $A_2 \triangleq \int_{t-T}^{t} e^{-\gamma(\tau-t)}(R \otimes \sigma_a\sigma_a^T)\, d\tau$, $B_1 \triangleq \int_{t-T}^{t} e^{-\gamma(\tau-t)}\alpha^2(d \otimes \sigma_d)\, d\tau$, $B_2 \triangleq \int_{t-T}^{t} e^{-\gamma(\tau-t)}\alpha^2(I_l \otimes \sigma_d\sigma_d^T)\, d\tau$, $I \triangleq \int_{t-T}^{t} e^{-\gamma(\tau-t)} z^T \bar{Q} z\, d\tau$  (46)
$\hat{W} \triangleq \begin{pmatrix} \hat{W}_c \\ v(\hat{W}_a) \\ v(\hat{W}_d) \end{pmatrix}, \qquad \rho \triangleq \begin{pmatrix} \Delta\sigma_c \\ 2A_1 - 2A_2 v(\hat{W}_a) \\ -2B_1 + 2B_2 v(\hat{W}_d) \end{pmatrix}$  (47)
Now, in order to implement the experience replay technique, past data $\{\hat{e}(t_j), m_s(t_j), \rho(t_j), I(t_j)\}_{j=1}^{N}$ are stored in a memory stack of size $N$.
These data stored in the memory stack are given as
$\rho(t_j) \triangleq \begin{pmatrix} \Delta\sigma_c(t_j) \\ 2A_1(t_j) - 2A_2(t_j) v(\hat{W}_a) \\ -2B_1(t_j) + 2B_2(t_j) v(\hat{W}_d) \end{pmatrix}, \quad m_s(t_j) \triangleq \sqrt{\rho^T(t_j)\rho(t_j)}, \quad \hat{e}(t_j) \triangleq \hat{W}^T \rho(t_j) + v(\hat{W}_a)^T A_2(t_j) v(\hat{W}_a) - v(\hat{W}_d)^T B_2(t_j) v(\hat{W}_d) + I(t_j)$  (48)
Also, in the subsequent analysis, $\rho_j \triangleq \rho(t_j)$, $m_{sj} \triangleq m_s(t_j)$, $\hat{e}_j \triangleq \hat{e}(t_j)$ for $j = 1, 2, \ldots, N$. Similar to (39), the continuous-time update law for this case can be written as
$\dot{\hat{W}} = -\dfrac{\eta}{N+1}\Big( \dfrac{\rho}{m_s^2}|\hat{e}|^{k}\hat{e} + \sum_{j=1}^{N} \dfrac{\rho_j}{m_{sj}^2}|\hat{e}_j|^{k}\hat{e}_j - K_1|\hat{e}|^{k}\dfrac{\rho^T}{m_s}\hat{W} + |\hat{e}|^{k}K_2\hat{W} - K_1\sum_{j=1}^{N}|\hat{e}_j|^{k}\dfrac{\rho(t_j)^T}{m_{sj}}\hat{W} \Big)$  (49)
It can be observed that certain terms appearing in (49) are defined differently from (39). For instance, $\hat{e}$ and $\rho$ appearing in (49) are given by (45) and (47), respectively.

Note that the update law presented above in (49) is different from the least square-based update law mentioned in [30] and the continuous time gradient descent-based update laws mentioned in [18] and [31]. The update law presented in this paper consists of five terms. The first term is directly responsible for reducing the HJI error, while the second term is a representation of its past observations over the memory stack. Also, unlike the constant learning rate in [18] and [31], the learning rate in (49) is time-varying and considered as a function of the HJI error, such that it can accelerate the learning when the HJI error is large and reduce the learning speed when the HJI error becomes small. The next three terms are responsible for providing robustness in achieving a small residual set. Moreover, the second and fifth terms correspond to the experience replay (ER) of the first and third terms, respectively. The significance of each term in the update law (49) in improving the performance of the tracking controller will be evident in the proof of Theorem 4.1 and its subsequent discussion.
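The memory stack used by the ER terms in (49) stores the tuples defined in (48). One simple way to maintain it is a fixed-size buffer that overwrites the oldest record; the overwrite-oldest policy and the class interface below are assumptions made for illustration, not a prescription of the paper.

```python
from collections import deque

class ReplayStack:
    """Fixed-size memory stack of past samples used by the ER terms in (49).

    Each record stores (e_hat_j, m_sj, rho_j, I_j); when the stack is full,
    the oldest record is dropped so that recent data keep entering the stack.
    """
    def __init__(self, size_N):
        self.buffer = deque(maxlen=size_N)     # oldest sample dropped automatically

    def add(self, e_hat_j, m_sj, rho_j, I_j):
        self.buffer.append((e_hat_j, m_sj, rho_j, I_j))

    def samples(self):
        return list(self.buffer)               # iterate over stored observations
```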
Stability proof of the update law

Theorem 4.1.
Let $\hat{W}$ be the estimated parameters for the critic, actor and disturbance NNs. Under Assumptions 1 and 2, and provided that the normalized regressor $\bar{\rho} \triangleq \rho/\sqrt{\rho^T\rho}$ is persistently excited, the update law in (49) ensures that the error in the NN weights $\tilde{W}$ is UUB stable.

Proof: Let the Lyapunov function candidate be $L = \frac{1}{2}\tilde{W}^T \eta^{-1} \tilde{W}$.
In order to prove the stability of the update law, the HJI error needs to be expressed as a function of $\tilde{W}$. To accomplish this, using (41) and (45),
$\hat{e} = \varepsilon_{HJI} - \tilde{W}^T\rho - v(\tilde{W}_a)^T A_2 v(\tilde{W}_a) + v(\tilde{W}_d)^T B_2 v(\tilde{W}_d)$  (50)
In the subsequent discussion, $\varepsilon$ is used in place of $\varepsilon_{HJI}$. Differentiating the Lyapunov function,
$\dot{L} = \tilde{W}^T\eta^{-1}\dot{\tilde{W}} = \dfrac{g}{(N+1)m_s^2}\big(\tilde{W}^T\rho\,\varepsilon - (\tilde{W}^T\rho)^2 - \tilde{W}^T\rho\, v(\tilde{W}_a)^T A_2 v(\tilde{W}_a) + \tilde{W}^T\rho\, v(\tilde{W}_d)^T B_2 v(\tilde{W}_d)\big) + \dfrac{1}{N+1}\Big(\tilde{W}^T\varepsilon\sum_{j=1}^{N}\dfrac{\rho(t_j)g(t_j)}{m_s(t_j)^2} - \tilde{W}^T\sum_{j=1}^{N}\dfrac{\rho(t_j)\rho(t_j)^T g(t_j)}{m_s(t_j)^2}\tilde{W} - \tilde{W}^T\sum_{j=1}^{N}\dfrac{g(t_j)\rho(t_j)}{m_s(t_j)^2}\, v(\tilde{W}_a)^T A_2(t_j) v(\tilde{W}_a) + \tilde{W}^T\sum_{j=1}^{N}\dfrac{\rho(t_j)g(t_j)}{m_s(t_j)^2}\, v(\tilde{W}_d)^T B_2(t_j) v(\tilde{W}_d)\Big) - \dfrac{g}{N+1}\tilde{W}^T K_1\dfrac{\rho^T}{m_s}W + \dfrac{g}{N+1}\tilde{W}^T K_1\dfrac{\rho^T}{m_s}\tilde{W} + \dfrac{g}{N+1}\tilde{W}^T K_2 W - \dfrac{g}{N+1}\tilde{W}^T K_2\tilde{W} - \tilde{W}^T K_1\sum_{j=1}^{N}\dfrac{g(t_j)\rho(t_j)^T}{m_{sj}(N+1)}W + \tilde{W}^T K_1\sum_{j=1}^{N}\dfrac{g(t_j)\rho(t_j)^T}{m_{sj}(N+1)}\tilde{W}$  (51)
where
$g = |\hat{e}|^k, \quad g(t_j) = |\hat{e}_j|^k, \quad j = 1, 2, \ldots, N$  (52)
Now, in order to bound $\dot{L}$, it is required to bound the terms containing $A_2$ and $B_2$ (cf. (46)). Recall from Assumption 2 that $\|\sigma_a\| \le b_a$ and $\|\sigma_d\| \le b_d$. Utilizing properties of matrices (maximum and minimum eigenvalues) and the persistent excitation (PE) condition ($\lambda_1 I \le \int_{t-T}^{t}\bar{\rho}\bar{\rho}^T d\tau \le \lambda_2 I$, where $\lambda_1$, $\lambda_2$ are positive constants) on the regressor, the bounds on the terms containing $A_2$ and $B_2$ can be derived as
$v(\tilde{W}_a)^T(R\otimes\sigma_a\sigma_a^T)v(\tilde{W}_a) \le q_1\|v(\tilde{W}_a)\|^2 \le q_1\|\tilde{W}\|^2, \qquad v(\tilde{W}_d)^T(\alpha^2 I_l\otimes\sigma_d\sigma_d^T)v(\tilde{W}_d) \le q_2\|v(\tilde{W}_d)\|^2 \le q_2\|\tilde{W}\|^2$  (53)
where $q_1$ and $q_2$ are the maximum eigenvalues of the matrices $(R\otimes\sigma_a\sigma_a^T)$ and $(\alpha^2 I_l\otimes\sigma_d\sigma_d^T)$, respectively. Further, the bounds on the terms $v(\tilde{W}_a)^T A_2 v(\tilde{W}_a)$ and $v(\tilde{W}_d)^T B_2 v(\tilde{W}_d)$ can be derived from (46), (47) and (53) as
$v(\tilde{W}_a)^T A_2 v(\tilde{W}_a) \le q_1\int_{t-T}^{t} e^{-\gamma(\tau-t)}\|\tilde{W}\|^2 d\tau \le \dfrac{q_1}{\gamma\beta_1}(e^{\gamma T}-1)\|\tilde{W}^T\bar{\rho}\|^2 = \dfrac{q_1}{\gamma\beta_1}(e^{\gamma T}-1)\,\tilde{W}^T\bar{\rho}\bar{\rho}^T\tilde{W}$  (54)
$v(\tilde{W}_d)^T B_2 v(\tilde{W}_d) \le q_2\int_{t-T}^{t} e^{-\gamma(\tau-t)}\|\tilde{W}\|^2 d\tau \le \dfrac{q_2}{\gamma\beta_1}(e^{\gamma T}-1)\|\tilde{W}^T\bar{\rho}\|^2 = \dfrac{q_2}{\gamma\beta_1}(e^{\gamma T}-1)\,\tilde{W}^T\bar{\rho}\bar{\rho}^T\tilde{W}$  (55)
where $\beta_1 = \|\rho\|^2$. Using the same PE condition on $A_1$, $B_1$ in $\rho$, from (47) there exists a constant $L_1$ such that
$\Big|\tilde{W}^T\dfrac{\rho}{m_s}\Big| \le L_1\|\tilde{W}\|$  (56)
Combining (54), (55) and (56),
$\Big|\tilde{W}^T\dfrac{\rho}{m_s}\, v(\tilde{W}_a)^T A_2 v(\tilde{W}_a)\Big| \le \dfrac{L_1 q_1}{\gamma\beta_1}(e^{\gamma T}-1)\|\tilde{W}\|\,\tilde{W}^T\bar{\rho}\bar{\rho}^T\tilde{W}, \qquad \Big|\tilde{W}^T\dfrac{\rho}{m_s}\, v(\tilde{W}_d)^T B_2 v(\tilde{W}_d)\Big| \le \dfrac{L_1 q_2}{\gamma\beta_1}(e^{\gamma T}-1)\|\tilde{W}\|\,\tilde{W}^T\bar{\rho}\bar{\rho}^T\tilde{W}$  (57)
where $\bar{\rho} = \rho/m_s$, and the reinforcement interval $T$ can be selected such that
$\dfrac{L_1 q_1}{\gamma\beta_1}(e^{\gamma T}-1)\|\tilde{W}\| \le \varepsilon_{Ta}, \qquad \dfrac{L_1 q_2}{\gamma\beta_1}(e^{\gamma T}-1)\|\tilde{W}\| \le \varepsilon_{Td}$  (58)
where $\varepsilon_{Ta}$ and $\varepsilon_{Td}$ are two small positive scalar constants. Therefore, using (58) in (57),
$\Big|\tilde{W}^T\dfrac{\rho}{m_s}\, v(\tilde{W}_a)^T A_2 v(\tilde{W}_a)\Big| \le \varepsilon_{Ta}\,\tilde{W}^T\bar{\rho}\bar{\rho}^T\tilde{W}, \qquad \Big|\tilde{W}^T\dfrac{\rho}{m_s}\, v(\tilde{W}_d)^T B_2 v(\tilde{W}_d)\Big| \le \varepsilon_{Td}\,\tilde{W}^T\bar{\rho}\bar{\rho}^T\tilde{W}$  (59)
Similarly, their ER versions can be represented as
$\Big|\tilde{W}^T\sum_{j=1}^{N} g_j\dfrac{\rho(t_j)}{m_s(t_j)}\, v(\tilde{W}_a)^T A_2(t_j) v(\tilde{W}_a)\Big| \le \varepsilon_{Tsa}\,\tilde{W}^T\sum_{j=1}^{N}\bar{\rho}_j\bar{\rho}_j^T\tilde{W}, \qquad \Big|\tilde{W}^T\sum_{j=1}^{N} g_j\dfrac{\rho(t_j)}{m_s(t_j)}\, v(\tilde{W}_d)^T B_2(t_j) v(\tilde{W}_d)\Big| \le \varepsilon_{Tsd}\,\tilde{W}^T\sum_{j=1}^{N}\bar{\rho}_j\bar{\rho}_j^T\tilde{W}$  (60)
Now, using (59) and (60), Eq. (51) can be rewritten as
$\dot{L} \le \tilde{W}^T\varepsilon\Big(\dfrac{\rho g}{m_s^2(N+1)} + \sum_{j=1}^{N}\dfrac{\rho(t_j)g(t_j)}{m_s(t_j)^2(N+1)}\Big) - \tilde{W}^T\Big(\dfrac{g\bar{\rho}\bar{\rho}^T}{N+1} + \sum_{j=1}^{N}\dfrac{g_j\bar{\rho}_j\bar{\rho}_j^T}{N+1}\Big)\tilde{W} + \tilde{W}^T\Big(\dfrac{\varepsilon_{Ta}}{N+1}\bar{\rho}\bar{\rho}^T + \dfrac{\varepsilon_{Tsa}}{N+1}\sum_{j=1}^{N}\bar{\rho}_j\bar{\rho}_j^T\Big)\tilde{W} + \tilde{W}^T\Big(\dfrac{\varepsilon_{Td}}{N+1}\bar{\rho}\bar{\rho}^T + \dfrac{\varepsilon_{Tsd}}{N+1}\sum_{j=1}^{N}\bar{\rho}_j\bar{\rho}_j^T\Big)\tilde{W} - \dfrac{g}{N+1}\tilde{W}^T K_1\dfrac{\rho^T}{m_s}W + \dfrac{g}{N+1}\tilde{W}^T K_1\dfrac{\rho^T}{m_s}\tilde{W} + \dfrac{g}{N+1}\tilde{W}^T K_2 W - \dfrac{g}{N+1}\tilde{W}^T K_2\tilde{W} - \tilde{W}^T K_1\sum_{j=1}^{N}\dfrac{g(t_j)\rho(t_j)^T}{m_{sj}(N+1)}W + \tilde{W}^T K_1\sum_{j=1}^{N}\dfrac{g(t_j)\rho(t_j)^T}{m_{sj}(N+1)}\tilde{W}$  (61)
After further simplification, (61) can be rendered into the following inequality:
$\dot{L} \le \dfrac{-\mathcal{P}^T M \mathcal{P} + \mathcal{P}^T \mathcal{N} - \mathcal{P}^T M_{\varepsilon Ta}\mathcal{P} + \mathcal{P}^T M_{\varepsilon Td}\mathcal{P}}{N+1}$  (62)
where
$\mathcal{P} \triangleq \big(\tilde{W}^T\bar{\rho},\ \tilde{W}^T,\ \tilde{W}^T\sum_{j=1}^{N}\bar{\rho}(t_j)\big)^T$  (63)
and $M$, $\mathcal{N}$, $M_{\varepsilon Ta}$ and $M_{\varepsilon Td}$ are obtained by grouping the corresponding terms of (61): $M$ collects the quadratic coefficients built from $g$, $g_j$ and the robust gains $K_1 \in \mathbb{R}^{q}$ and $K_2 \in \mathbb{R}^{q\times q}$ ($q$ being the dimension of the composite regressor vector $\rho$, see (47)); $\mathcal{N}$ collects the terms linear in $\mathcal{P}$ (involving $\varepsilon$, $W$, $K_1$ and $K_2$); and $M_{\varepsilon Ta}$, $M_{\varepsilon Td}$ collect the $\varepsilon_{Ta}$, $\varepsilon_{Tsa}$ and $\varepsilon_{Td}$, $\varepsilon_{Tsd}$ contributions, respectively, together with constants $G_1, G_2 \in \mathbb{R}^{q}$, $c_1, c_2 \in \mathbb{R}$ and $L_1, L_2 \in \mathbb{R}^{q}$.  (64)

Proposition 1.
Let $x \in \mathbb{R}^n$ and let $M \in \mathbb{R}^{n\times n}$ be any square matrix. Then $\frac{1}{2}\lambda_{min}(M + M^T)\|x\|^2 \le x^T M x \le \frac{1}{2}\lambda_{max}(M + M^T)\|x\|^2$, where $\lambda_{min}(\cdot)$ and $\lambda_{max}(\cdot)$ denote the minimum and maximum eigenvalues of the corresponding matrices, respectively.

Proof: The proof of this proposition is provided in Lemma 7.1 in Section 7. $\square$
Using Proposition 1, (62) can be simplified into
$\dot{L} \le \big(-\tfrac{1}{2}\lambda_{min}(M')\|\mathcal{P}\|^2 + b_N\|\mathcal{P}\| - \tfrac{1}{2}\lambda_{min}(M'_{\varepsilon Ta})\|\mathcal{P}\|^2 + \tfrac{1}{2}\lambda_{max}(M'_{\varepsilon Td})\|\mathcal{P}\|^2\big)/(N+1)$  (65)
where $b_N$ is the maximum value of the norm of $\mathcal{N}$, i.e., $\|\mathcal{N}\| \le b_N$, and is expressed using (63) and Assumption 2 as
$b_N = \Big\|\big(g\varepsilon - g\|K_1\|\|W\|,\ \ g\|K_2\|\|W\|,\ \ \varepsilon\sum_{j=1}^{N}g_j - \|K_1\|\sum_{j=1}^{N}g_j\|W\|\big)\Big\|$  (66)
Note that in (65) the following substitutions were made:
$M'_{\varepsilon Ta} \triangleq M_{\varepsilon Ta} + M_{\varepsilon Ta}^T, \quad M'_{\varepsilon Td} \triangleq M_{\varepsilon Td} + M_{\varepsilon Td}^T, \quad M' \triangleq M + M^T$  (67)
From (65), in order to ensure negative definiteness of $\dot{L}$, the following inequality should hold:
$\|\mathcal{P}\| > \dfrac{2 b_N}{\lambda_{min}(M') + \lambda_{min}(M'_{\varepsilon Ta}) - \lambda_{max}(M'_{\varepsilon Td})} \implies \dot{L} < 0$  (68)
From the definition of $\mathcal{P}$ in (63),
$\|\mathcal{P}\| \le \|\tilde{W}\|\sqrt{1 + \|\bar{\rho}_M\|^2 + \Big\|\sum_{j=1}^{N}\bar{\rho}_M(t_j)\Big\|^2}$  (69)
The square-root factor on the right hand side of (69) will be represented as
$S \triangleq \sqrt{1 + \|\bar{\rho}_M\|^2 + \Big\|\sum_{j=1}^{N}\bar{\rho}_M(t_j)\Big\|^2}$  (70)
From (68), (69) and (70), the UUB set for the error in NN weights is obtained as
$\|\tilde{W}\| > \dfrac{2 b_N}{S\big(\lambda_{min}(M') + \lambda_{min}(M'_{\varepsilon Ta}) - \lambda_{max}(M'_{\varepsilon Td})\big)}$  (71)
Thus, from (68), (69) and (71), under the NN parameter update law (49), the error in the NN weights is guaranteed to decrease outside the
residual ball given as
$\Omega_{\tilde{W}} = \Big\{\tilde{W} : \|\tilde{W}\| \le \dfrac{2 b_N}{S\big(\lambda_{min}(M') + \lambda_{min}(M'_{\varepsilon Ta}) - \lambda_{max}(M'_{\varepsilon Td})\big)}\Big\}$  (72)
This concludes the stability proof of the continuous-time update mechanism. $\square$
Discussion on the presented update law
Note that the update law presented in (49) is different from the gradient descent-based update laws of [18] and [31] and the least square-based one presented in [30] in several ways. First of all, being a continuous-time update law based on gradient descent, it is more sensitive to variations in plant dynamics than the least square-based update mechanism in [30]. Secondly, unlike [18], it utilizes the H∞ framework for disturbance rejection as well. While [31] utilized the H∞ framework for their tracking controller, their gradient descent had only a constant learning rate and lacked the ER and robust terms that further shrink the size of the residual set. The prime novelties of the update law (49) are the use of variable gain gradient descent and the incorporation of robust terms, i.e., the last three terms in (49). These help in improving the performance of the final learnt control policies in tracking a given reference trajectory.

From Theorem 4.1, it is evident that $\|\tilde{W}\|$ decreases in the stable region, i.e., where $\dot{L}$ is negative definite. This results in the estimated NN weights $\hat{W}$ getting closer to the ideal NN weights $W$, which in turn implies that the HJI error (52) is decreasing in the stable region. Now, note that the numerator on the right hand side (RHS) of (72), i.e., $b_N$, is a function of $g = |\hat{e}|^k$ and $g_j = |\hat{e}(t_j)|^k$ (see (66)), which implies that the size of the ball (72) shrinks due to decreasing $g$ and $g_j$. Thus, $b_N$ encapsulates the effect of variable gain gradient descent in the off-policy parameter update law. Further, the variable gains in the gradient descent, i.e., $|\hat{e}|^k$ and $|\hat{e}_j|^k$, $j = 1, 2, \ldots, N$, scale the learning rate based on the instantaneous and past values of the HJI error, respectively, where the constant $k \ge 0$ governs the amount of scaling in the learning of the gradient descent. These terms increase the learning rate when the HJI error is large and slow it down as the HJI error becomes smaller in magnitude. So, the actual learning rate becomes $l = \eta|\hat{e}|^k$. Note that if $|\hat{e}| \le 1$, then $l \le \eta$ for all $k \ge 0$. However, if $|\hat{e}| \ge 1$, then $l \ge \eta$ for all $k \ge 0$.

Furthermore, the gains $K_1$ and $K_2$ in the robust terms of the adaptation law (49) can be selected so as to have a large $\lambda_{min}((M + M^T)/2)$ (see (63)), which in turn leads to a smaller ball (see (72)) and hence a tighter residual set for $\tilde{W}$. With these novel modifications, the variable gain gradient descent-based off-policy update law presented in this paper yields a much tighter residual set for $\tilde{W}$ and hence improved tracking performance.

Simulation Results

In order to evaluate the performance of the update law proposed in this paper, two applications are considered for simulation studies in this section:
• a nonlinear system [18] in subsection 5.1, and
• a linearized F16 model [30] in subsection 5.2.
A Nonlinear System
The dynamics of the nonlinear system is considered from [18] and is described as
$\dot{x}_1 = -\sin(x_1) + x_2, \qquad \dot{x}_2 = -x_1 + u + d, \qquad y = x_1$  (73)
where the disturbance affecting the system is an exponentially decaying sinusoid. The reference system is considered as in [18]: a two-state command generator of the form (2) whose states are bounded sinusoids (74). The penalty on the states appearing in (13) is chosen as
$\bar{Q} = \begin{pmatrix} \mathrm{diag}(217, 0) & 0_{2\times 2} \\ 0_{2\times 2} & 0_{2\times 2} \end{pmatrix}$  (75)
Here, $x_{d1}$ is the desired trajectory for the output $y = x_1$, and the initial state of the system is fixed. The regressor vectors for the critic, actor and disturbance NNs are chosen as
$\sigma_c = (z_1^2, z_2^2, z_3^2, z_4^2, z_1 z_2, z_1 z_3, z_1 z_4, z_2 z_3, z_2 z_4, z_3 z_4)^T, \quad \sigma_a = (z_1, z_2, z_3, z_4)^T$, and $\sigma_d \in \mathbb{R}^{5}$ composed of first-order and bilinear monomials of $z$  (76)
where $z = (e^T, x_d^T)^T \in \mathbb{R}^4$ and $z_i$ is the $i$-th component of $z$. The constant part of the learning rate for both cases is selected as $\eta = 2998$, and the memory stack used by the experience replay technique has a fixed size $N$. The level of attenuation $\alpha$ is chosen by trial and error (cf. Section 2.1). The value of the reinforcement interval should be selected as small as possible in order to preserve the relationship between the Bellman equation and the IRL equation (refer to Section 3.1); here, the reinforcement interval is selected as $T = 0.001$ s. All the NN weights are initialized to the same value. Also, note that in order to yield a tighter residual set, $\lambda_{min}(M')$ in the denominator of the RHS of (72) needs to be large, which can be achieved by selecting the gains $K_1$ and $K_2$ in the robust terms with high norms. However, the norms of both $K_1$ and $K_2$ also appear in the numerator of the RHS of (72); hence, $K_1$ and $K_2$ cannot be selected with very high norms. For ease in the simulation study, $K_1$ and $K_2$ are selected as a constant $q$-dimensional vector and a constant $q \times q$ matrix, respectively (for both cases), where $q = 19$ is the dimension of the composite regressor vector $\rho$ (refer to (47)). At first, an exploratory control policy is fired into the system, and the system is allowed to explore the state space. The exploratory control signal is an exponentially decaying sum of sinusoidal products (probing noise), similar to the one mentioned in [22]. When the critic, actor and disturbance NN weights have converged, the exploration is stopped and the learnt policies are executed on the system, as shown in Fig. 1.
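The learning procedure just described (probing the system, updating the weights every reinforcement interval, and switching to the learnt policies once the weights settle) can be summarized by the following schematic sketch. The probing generator, plant interface, step size and convergence test are placeholders for illustration, not the exact settings used in the simulations.

```python
import numpy as np

def online_training(plant_step, probe, update_step, W0, T, t_final, tol=1e-4):
    """Schematic of the online learning phase: excite the system with a probing
    control, form the data for one reinforcement interval T, and integrate the
    weight update until the weights stop changing.

    plant_step(u, T) advances the (unknown) plant and returns the data needed to
    build the regressor; update_step(W, data) applies one step of (49).
    """
    W = W0.copy()
    t = 0.0
    while t < t_final:
        u = probe(t)                        # exploratory control signal
        data = plant_step(u, T)             # collect one reinforcement interval
        W_new = update_step(W, data)        # tune critic/actor/disturbance weights
        if np.linalg.norm(W_new - W) < tol:
            break                           # weights converged: stop exploring
        W, t = W_new, t + T
    return W                                # converged weights -> learnt policies
```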
Without variable gain gradient descent scheme: Simulation results for the update law (49) without variable gain gradient descent on the nonlinear system (73) considered above are shown in Fig. 2. For this, the exponent in the variable gain term, i.e., $k$, was chosen to be 0, leading to just a constant learning speed (see (49)) in the online training of the NN weights. Note that with $k = 0$ and $K_1$, $K_2$ being a zero vector and a zero matrix, the update law (49) resembles the update law mentioned in [18]. The difference between (49) and the update law of [18] arises from the inclusion of disturbance rejection (via the regressor term $\rho$) in (49). The NN weights corresponding to the critic, actor and disturbance are shown to converge in a finite amount of time in Figs. 2a, 2b and 2c, respectively. The final learnt control policy due to the converged weights of the critic, actor and disturbance NNs is depicted in Fig. 2d. The tracking error under the learnt policies resulting from the constant learning rate-based update law is shown to become small in finite time, as depicted in Figs. 2e and 2f. However, it can be observed that there still exists a small steady state error in the tracking performance (see Fig. 2e).
With variable gain gradient descent scheme: The variable gain gradient descent-based update law (49) is validated on the nonlinear system (73) in Fig. 3. Here, the exponent $k$ in the variable gain term is chosen to be positive, and all other parameters are kept the same.
Fig. 2: Online training of NN weights and state and control profiles without variable gain gradient descent: (a) critic NN weights; (b) actor NN weights; (c) disturbance NN weights; (d) learnt control policy; (e) state profile under the learnt policy; (f) HJI error during the learning phase.

Fig. 3: Online training of NN weights and state and control profiles with variable gain gradient descent: (a) critic NN weights; (b) actor NN weights; (c) disturbance NN weights; (d) learnt control policy; (e) state profile under the learnt policy; (f) HJI error during the learning phase.
The NN weights of the critic, actor and disturbance NNs converge very close to their ideal values in a finite amount of time, as can be seen in Figs. 3a, 3b and 3c, respectively. The learnt control policy arising out of the converged NN weights is depicted in Fig. 3d. It is able to track the reference trajectory with high accuracy in a finite amount of time, as evident from Fig. 3e. The HJI error profile during the learning phase is depicted in Fig. 3f, and it can be seen that $|\hat{e}| \le 1$ during the learning phase.

Note that the oscillations in the learnt control policies (see Fig. 2d) are larger and persist for a longer duration in the case when the constant learning rate is used, as compared to the case when variable gain gradient descent (Fig. 3d) is utilized. This in turn leads to an oscillatory tracking performance (Fig. 2e) in the transient phase, with steady state error, for the case with constant learning speed. On the other hand, the final learnt policies arising out of the variable gain gradient descent-based update law lead to very few oscillations and almost no steady state error (Fig. 3e). All this is possible because the variable gain gradient descent-based update law leads to a much tighter residual set for $\tilde{W}$. This implies that the control policies resulting from the variable gain gradient descent-based update law are closer to the ideal optimal controller than the policies due to just the constant learning rate gradient descent-based update laws. It can also be noted from Figs. 2f and 3f that the HJI error is within $[-1, 1]$, and since variable gain gradient descent uses a learning rate that is a function of the instantaneous HJI error, for our problem set the presence of the term $g = |\hat{e}|^k$ actually reduces the learning rate. This is also the reason why, in this case, the convergence time in Fig. 3a is slightly longer than in Fig. 2a. However, when the HJB or HJI error is large, the variable gain gradient descent-based update law leads to faster convergence of the NN weights, as can be observed in [34].

Linearized F16 Model
In this section, the tracking performance of the update law (49) developed in this paper is studied on the following linearized dynamics model of an F16 fighter aircraft [30]:
$\dot{x} = Ax + Bu + Dd$  (77)
where the numerical entries of $A \in \mathbb{R}^{3\times 3}$, $B \in \mathbb{R}^{3\times 1}$ and $D \in \mathbb{R}^{3\times 1}$ are the linearized F16 short-period values given in [30]  (78)
The state vector is $x = [\alpha, q, \delta_e]^T$, where $\alpha$ is the angle of attack, $q$ is the pitch rate and $\delta_e$ is the elevator deflection. The control input is the voltage signal to the elevators, and the disturbance is caused by a wind gust acting on the angle of attack. It is required to track a reference angle of attack. For this, the augmented dynamics of $z = [e^T, x_d^T]^T$ is given as
$\dot{z} = A_1 z + B_1 u + D_1 d$  (79)
where $A_1$, $B_1$ and $D_1$ are the augmented system matrices obtained from $A$, $B$ and $D$ following (10)  (80)
The problem set-up in this section is considered in line with that in [30], where the same problem was considered using a least square-based update law. The disturbance is assumed to be an exponentially decaying sinusoid. The reinforcement interval $T$ is kept small, $R = 1$, and $\bar{Q}$ is chosen as a diagonal matrix whose leading entry penalizes the angle-of-attack tracking error. A small discount factor $\gamma$ was chosen for the simulation. In the simulation, the desired angle of attack was first held at one set point and was subsequently stepped to a higher set point. The constant part of the learning rate is selected as $\eta \approx 209$, with a positive variable gain exponent $k$ in $|\hat{e}|^k$. The regressor vectors for the critic, actor and disturbance NNs were chosen as follows: $\sigma_c$ is a 15-dimensional vector of quadratic monomials $z_i z_j$ of the augmented state, $\sigma_a = (z_1, z_2, z_3, z_4, z_5, z_6)^T$, and $\sigma_d$ is an 11-dimensional vector of first-order and quadratic monomials of $z$  (81)
The exploratory control policy used during the learning phase is again an exponentially decaying sum of sinusoidal products (probing noise).

Note from Figs. 4a, 4b and 4c that the NN weights converge close to their ideal values in a finite amount of time. For this example, in this paper, the disturbance attenuation factor is chosen to be close to 1, whereas in [30] this factor was $\alpha = 10$. The update law presented in this paper is able to yield optimal policies even under a higher degree of disturbance attenuation. As can be clearly seen, the final learnt policy (see Fig. 4d) is able to track the set point with high accuracy, as is evident from Fig. 4e. Contrary to the tracking results presented in [30] for the same problem set-up, the tracking performance of the presented update law is devoid of any peak overshoot (refer to Fig. 4 of [30] and Fig. 4e in this paper). Overall, the variable gain gradient descent-based continuous time update law, consisting of ER and robust terms, leads to much better tracking performance even in the presence of disturbance, which justifies the tighter UUB set proved in Theorem 4.1.

Conclusions

A continuous time neural network (NN) parameter update law driven by variable gain gradient descent, the experience replay technique and robust terms for the model-free H∞ optimal tracking control problem of continuous time nonlinear systems has been presented in this paper. Integral reinforcement learning (IRL) has been leveraged in a policy iteration framework. Incorporation of IRL obviates the requirement of the drift dynamics in the policy evaluation stage, while the usage of actor and disturbance NNs to approximate the control and disturbance policies obviates the requirement of the control coupling dynamics and disturbance dynamics in the policy improvement stage.
Variable gain gradient descent increases the learning rate when the HJI error is large and dampens the learning rate when the HJI error becomes smaller. It also results in a smaller residual set to which the errors in the NN weights converge. Besides this, the ER term and the robust terms in the update law help in further shrinking the size of the residual set to which the error in NN weights finally converges. This results in an improved learnt control policy, sufficiently close to the ideal optimal controller, leading to highly accurate tracking performance.

Appendix

Lemma 7.1.
Let $x \in \mathbb{R}^n$ and let $M \in \mathbb{R}^{n\times n}$ be any square matrix. Then $\frac{1}{2}\lambda_{min}(M + M^T)\|x\|^2 \le x^T M x \le \frac{1}{2}\lambda_{max}(M + M^T)\|x\|^2$.

Proof:
$x^T M x = x^T\Big(\dfrac{M + M^T}{2} + \dfrac{M - M^T}{2}\Big)x$  (82)
The RHS of the above equation can be rewritten as
$x^T M x = x^T\Big(\dfrac{M + M^T}{2}\Big)x + 0.5\, x^T M x - 0.5\, x^T M^T x = x^T\Big(\dfrac{M + M^T}{2}\Big)x + 0.5\, x^T M x - 0.5\,(x^T M x)^T$  (83)
Therefore,
$x^T M x = x^T\Big(\dfrac{M + M^T}{2}\Big)x$  (84)
Using (84),
$\dfrac{1}{2}\lambda_{min}(M + M^T)\|x\|^2 \le x^T M x \le \dfrac{1}{2}\lambda_{max}(M + M^T)\|x\|^2$  (85) $\square$

Fig. 4: Set point tracking with variable gain gradient descent for the F16 model: (a) critic NN weights; (b) actor NN weights; (c) disturbance NN weights; (d) optimal control policy (elevator voltage); (e) angle of attack profile under the optimal policies.

References

[1] Sutton, R.S., Barto, A.G.: 'Introduction to reinforcement learning'. vol. 2 (MIT Press, Cambridge, 1998)
[2] Lewis, F.L., Liu, D.: 'Reinforcement learning and approximate dynamic programming for feedback control'. vol. 17 (John Wiley & Sons, 2013)
[3] Powell, W.B.: 'Approximate dynamic programming: solving the curses of dimensionality'. vol. 703 (John Wiley & Sons, 2007)
[4] Zhang, H., Liu, D., Luo, Y., Wang, D.: 'Adaptive dynamic programming for control: algorithms and stability' (Springer Science & Business Media, 2012)
[5] Murray, J.J., Cox, C.J., Lendaris, G.G., Saeks, R.: 'Adaptive dynamic programming', IEEE Transactions on Systems, Man, and Cybernetics, Part C, 2002, (2), pp. 140-153
[6] Abu-Khalaf, M., Lewis, F.L.: 'Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach', Automatica, 2005, (5), pp. 779-791
[7] Li, H., Liu, D.: 'Optimal control for discrete-time affine non-linear systems using general value iteration', IET Control Theory & Applications, 2012, (18), pp. 2725-2736
[8] Yang, X., Liu, D., Wei, Q.: 'Online approximate optimal control for affine non-linear systems with unknown internal dynamics using adaptive dynamic programming', IET Control Theory & Applications, 2014, (16), pp. 1676-1688
[9] Zhao, D., Zhu, Y.: 'MEC: a near-optimal online reinforcement learning algorithm for continuous deterministic systems', IEEE Transactions on Neural Networks and Learning Systems, 2014, (2), pp. 346-356
[10] Zhu, Y., Zhao, D., Liu, D.: 'Convergence analysis and application of fuzzy-HDP for nonlinear discrete-time HJB systems', Neurocomputing, 2015, pp. 124-131
[11] Bhasin, S., Kamalapurkar, R., Johnson, M., Vamvoudakis, K.G., Lewis, F.L., Dixon, W.E.: 'A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems',
[12] Park, Y.M., Choi, M.S., Lee, K.Y.: 'An optimal tracking neuro-controller for nonlinear dynamic systems', IEEE Transactions on Neural Networks, 1996, (5), pp. 1099–1110
[13] Toussaint, G.J., Basar, T., Bullo, F.: 'H∞-optimal tracking control techniques for nonlinear underactuated systems'. Proceedings of the 39th IEEE Conference on Decision and Control (Cat. No. 00CH37187), vol. 3, 2000, pp. 2078–2083
[14] Alameda-Hernandez, E., Blanco, D., Ruiz, D., Carrion, M.: 'Optimal tracking of time-varying systems with the overdetermined recursive instrumental variable algorithm', IET Control Theory & Applications, 2007, (1), pp. 291–297
[15] Zhang, H., Cui, L., Zhang, X., Luo, Y.: 'Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method', IEEE Transactions on Neural Networks, 2011, (12), pp. 2226–2236
[16] Modares, H., Lewis, F.L.: 'Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning', Automatica, 2014, (7), pp. 1780–1792
[17] Kiumarsi, B., Lewis, F.L., Modares, H., Karimpour, A., Naghibi-Sistani, M.B.: 'Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics', Automatica, 2014, (4), pp. 1167–1175
[18] Zhu, Y., Zhao, D., Li, X.: 'Using reinforcement learning techniques to solve continuous-time non-linear optimal tracking problem without system dynamics', IET Control Theory & Applications, 2016, (12), pp. 1339–1347
[19] Vrabie, D., Pastravanu, O., Abu-Khalaf, M., Lewis, F.L.: 'Adaptive optimal control for continuous-time linear systems based on policy iteration', Automatica, 2009, (2), pp. 477–484
[20] Jiang, Y., Jiang, Z.P.: 'Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics', Automatica, 2012, (10), pp. 2699–2704
[21] Luo, B., Wu, H.N., Huang, T., Liu, D.: 'Data-based approximate policy iteration for nonlinear continuous-time optimal control design', arXiv preprint arXiv:1311.0396, 2013
[22] Vamvoudakis, K.G., Vrabie, D., Lewis, F.L.: 'Online adaptive algorithm for optimal control with integral reinforcement learning', International Journal of Robust and Nonlinear Control, 2014, (17), pp. 2686–2710
[23] Abu-Khalaf, M., Lewis, F.L., Huang, J.: 'Neurodynamic programming and zero-sum games for constrained control systems', IEEE Transactions on Neural Networks, 2008, (7), pp. 1243–1252
[24] Zhang, H., Wei, Q., Liu, D.: 'An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games', Automatica, 2011, (1), pp. 207–214
[25] Vamvoudakis, K.G., Lewis, F.L.: 'Online solution of nonlinear two-player zero-sum games using synchronous policy iteration', International Journal of Robust and Nonlinear Control, 2012, (13), pp. 1460–1483
[26] Vamvoudakis, K.G., Lewis, F.L.: 'Online gaming: real-time solution of nonlinear two-player zero-sum games using synchronous policy iteration'. In: Advances in Reinforcement Learning. (IntechOpen, 2011)
[27] Modares, H., Lewis, F.L., Sistani, M.B.N.: 'Online solution of nonquadratic two-player zero-sum games arising in the H∞ control of constrained input systems', International Journal of Adaptive Control and Signal Processing, 2014, (3-5), pp. 232–254
[28] Vrabie, D., Lewis, F.: 'Adaptive dynamic programming for online solution of a zero-sum differential game', Journal of Control Theory and Applications, 2011, (3), pp. 353–360
[29] Luo, B., Wu, H.N., Huang, T.: 'Off-policy reinforcement learning for H∞ control design', IEEE Transactions on Cybernetics, 2014, (1), pp. 65–76
[30] Modares, H., Lewis, F.L., Jiang, Z.P.: 'H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning', IEEE Transactions on Neural Networks and Learning Systems, 2015, (10), pp. 2550–2562
[31] Zhang, H., Cui, X., Luo, Y., Jiang, H.: 'Finite-horizon H∞ tracking control for unknown nonlinear systems with saturating actuators', IEEE Transactions on Neural Networks and Learning Systems, 2017, (4), pp. 1200–1212
[32] Liu, D., Yang, X., Wang, D., Wei, Q.: 'Reinforcement-learning-based robust controller design for continuous-time uncertain nonlinear systems subject to input constraints', IEEE Transactions on Cybernetics, 2015, (7), pp. 1372–1385
[33] Başar, T., Bernhard, P.: 'H∞ optimal control and related minimax design problems: a dynamic game approach'. (Springer Science & Business Media, 2008)
[34] Mishra, A., Ghosh, S.: 'Variable gain gradient descent-based reinforcement learning for robust optimal tracking control of uncertain nonlinear system with input-constraints', arXiv preprint arXiv:1911.04157, 2019
[35] Modares, H., Lewis, F.L., Naghibi-Sistani, M.B.: 'Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks', IEEE Transactions on Neural Networks and Learning Systems, 2013, (10), pp. 1513–1525