A Predictive Deep Learning Approach to Output Regulation: The Case of Collaborative Pursuit Evasion
Shashwat Shivam, Aris Kanellopoulos, Kyriakos G. Vamvoudakis, Yorai Wardi
Abstract — In this paper, we consider the problem of controlling an underactuated system in unknown, and potentially adversarial, environments. The emphasis is on autonomous aerial vehicles, modelled by Dubins dynamics. The proposed control law is based on a variable integrator via online prediction for target tracking. To showcase the efficacy of our method, we analyze a pursuit evasion game between multiple autonomous agents. To obviate the need for perfect knowledge of the evader's future strategy, we use a deep neural network that is trained to approximate the behavior of the evader based on measurements gathered online during the pursuit.

S. Shivam and Y. Wardi are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA. E-mail: ([email protected], [email protected]). A. Kanellopoulos and K. G. Vamvoudakis are with the Daniel Guggenheim School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA, USA. E-mail: ([email protected], [email protected]). This work was supported in part by ONR Minerva and by NSF SaTC and CPS grants.
I. INTRODUCTION
Output tracking in dynamical systems, arising in areas such as robotics, flight control, economics, biology, and cyber-physical systems, is the practice of designing decision makers which ensure that a system's output tracks a given signal [1], [2]. Well-known existing methods for nonlinear output regulation and tracking include control techniques based on nonlinear inversion [3], high-gain observers [4], and the framework of model predictive control (MPC) [5], [6]. Recently a new approach has been proposed, based on the Newton-Raphson flow for solving algebraic equations [7]. Subsequently it has been tested on various applications, including control of an inverted pendulum and position control of platoons of mobile robotic vehicles [8], [9]. While perhaps not as general as the aforementioned established techniques, it holds out promise of efficient computations and large domains of stability.

The successful deployment of complex control systems in real-world applications increasingly depends on their ability to operate in highly unstructured, even adversarial, settings, where a-priori knowledge of the evolution of the environment is impossible to acquire. Moreover, due to the increasing interconnection between the physical and the cyber domains, control systems become more intertwined with human operators, making model-based solutions fragile in the face of unpredictable behavior. Towards that end, methods that augment low-level control techniques with intelligent decision-making mechanisms have been investigated extensively [10]. Machine learning [11], [12] offers a suitable framework to allow control systems to autonomously adapt by leveraging data gathered from their environment. To enable data-driven solutions for autonomy, learning algorithms use artificial neural networks (NNs); classes of functions that, owing to properties that stem from their neurobiological analogy, offer adaptive data representations and prediction based on external observations.

NNs have been used extensively in control applications [13], both in open-loop and closed-loop fashion. In closed-loop applications, NNs have been utilized as dynamics approximators, or, in the framework of reinforcement learning, to enable online solution of the Hamilton-Jacobi-Bellman equation [14]. However, the applicability of NNs to open-loop control objectives is broader, due to their ability to operate as classifiers or as nonlinear function approximators [15].

The authors of [13] introduced NN structures for system identification as well as adaptive control. Extending the identification capabilities of learning algorithms, the authors of [16] introduce a robustification term that guarantees asymptotic estimation of the state and the state derivative. Furthermore, reinforcement learning has received increasing attention since the development of methods that solve optimal control problems for continuous-time control systems online, without knowledge of the dynamics [17]. Prediction has also been at the forefront of research on machine learning.
Learning-based attack prediction was employed in both [18] and [19] in the context of cyber-security, and [20] utilized NNs to solve a pursuit evasion game by constructing both the evader's and the pursuer's strategies offline using pre-computed trajectories. Recently, the authors of this paper applied an NN for online model construction in a control application [21].

This paper applies an NN technique to the pursuit-evasion problem investigated in [22], which is more challenging than the problem addressed in [21]. The strategies of both pursuers and evader are based on respective games. In Ref. [22], the pursuers know the game of the evader ahead of time, and an MPC technique is used to determine their trajectories. In this paper the pursuers do not have a-priori knowledge of the evader's game or its structure, and they employ an NN in real time to identify its input-output mapping. We use our tracking-control technique [7] rather than MPC, and obtain results similar to [22]. Furthermore, the input to the system has a lower dimension than its output, and hence the control is underactuated. We demonstrate a way of overcoming this limitation, which may have a broad scope in applications.

The rest of the paper is structured as follows. Section II describes our proposed control technique and some preliminary results on NNs, and formulates the pursuers-evader problem. Section III describes results on model-based and learning-based strategies. Simulation results are presented in Section IV. Finally, Section V concludes the paper and discusses directions for future research.

Fig. 1. Basic control system scheme.

II. PRELIMINARIES AND PROBLEM FORMULATION
A. Tracking Control Technique
This subsection recounts results published in our previous work, in which prediction-based output tracking was used for fully-actuated systems [7]-[9]. Consider a system as shown in Figure 1 with $r(t) \in \mathbb{R}^m$, $y(t) \in \mathbb{R}^m$, $u(t) \in \mathbb{R}^m$, and $e(t) := r(t) - y(t)$. The objective of the controller is to ensure that
$$\lim_{t \to \infty} \|r(t) - y(t)\| < \varepsilon, \qquad (1)$$
for a given (small) $\varepsilon \in \mathbb{R}^+$.

To illustrate the basic idea underscoring the controller, let us first assume that (i) the plant subsystem is a memoryless nonlinearity of the form
$$y(t) = g(u(t)), \qquad (2)$$
for a continuously-differentiable function $g: \mathbb{R}^m \to \mathbb{R}^m$, and (ii) the reference target $\{r(t): t \ge 0\}$ is a constant, $r(t) \equiv r$ for a given $r \in \mathbb{R}^m$. These assumptions will be relaxed later. In this case, the tracking controller is defined by the following equation,
$$\dot{u}(t) = \Big(\frac{\partial g}{\partial u}(u(t))\Big)^{-1}\big(r - y(t)\big), \qquad (3)$$
assuming that the Jacobian matrix $\frac{\partial g}{\partial u}(u(t))$ is nonsingular at every point $u(t)$ computed by the controller via (3). Observe that (3) defines the Newton-Raphson flow for solving the algebraic equation $r - g(u) = 0$, and hence (see [7], [8]) the controller converges in the sense that $\lim_{t\to\infty}\big(r(t) - y(t)\big) = 0$.

Next, suppose that the reference target is time-dependent, while keeping the assumption that the plant is a memoryless nonlinearity. Suppose that $\{r(t)\}$ is bounded, continuous, and piecewise-continuously differentiable, and that $\{\dot{r}(t)\}$ is bounded. (Henceforth we use the notation $\{x(t)\}$ for a generic signal $\{x(t), t \in [0, \infty)\}$, to distinguish it from its value $x(t)$ at a particular time $t$.) Define
$$\eta := \limsup_{t\to\infty} \|\dot{r}(t)\|, \qquad (4)$$
then (see [8]), with the controller defined by (3), we have that
$$\limsup_{t\to\infty} \|r(t) - y(t)\| \le \eta. \qquad (5)$$

Note that Eqs. (2) and (3) together define the closed-loop system. Observe that the plant equation (2) is an algebraic equation while the controller equation (3) is a differential equation, hence the closed-loop system represents a dynamical system. Its stability, in the sense that $\{y(t)\}$ is bounded whenever $\{r(t)\}$ and $\{\dot{r}(t)\}$ are bounded, is guaranteed by (5) as long as the control trajectory $\{u(t)\}$ does not pass through a point $u(t)$ where the Jacobian matrix $\frac{\partial g}{\partial u}(u(t))$ is singular.

Finally, let us dispense with the assumption that the plant subsystem is a memoryless nonlinearity. Instead, suppose that it is a dynamical system modeled by the following two equations,
$$\dot{x}(t) = f(x(t), u(t)), \quad x(0) := x_0, \qquad (6)$$
$$y(t) = h(x(t)), \qquad (7)$$
where the state variable $x(t)$ is in $\mathbb{R}^n$, and the functions $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ and $h: \mathbb{R}^n \to \mathbb{R}^m$ satisfy the following assumption.

Assumption 1. (i) The function $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is continuously differentiable, and for every compact set $\Gamma \subset \mathbb{R}^m$ there exists $K \in \mathbb{R}^+$ such that, for every $x \in \mathbb{R}^n$ and $u \in \Gamma$, $\|f(x,u)\| \le K(\|x\| + 1)$. (ii) The function $h: \mathbb{R}^n \to \mathbb{R}^m$ is continuously differentiable.

This assumption ensures that whenever the control signal $\{u(t)\}$ is bounded and continuous, the state equation (6) has a unique solution $x(t)$ on the interval $t \in [0, \infty)$. In this setting, $y(t)$ is no longer a function of $u(t)$, but rather of $x(t)$, which is a function of $\{u(\tau): \tau < t\}$. Therefore (2) is no longer valid, and hence the controller cannot be defined by (3). To get around this conundrum we pull the feedback not from the output $y(t)$ but from a predicted value thereof.
Specifically, fix the look-ahead time $T \in \mathbb{R}^+$, and suppose that at time $t$ the system computes a prediction of $y(t+T)$, denoted by $\tilde{y}(t+T)$. Suppose also that $\tilde{y}(t+T)$ is a function of $(x(t), u(t))$, hence it can be written as $\tilde{y}(t+T) = g(x(t), u(t))$, where the function $g: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ is continuously differentiable. Now the feedback law is defined by the following equation,
$$\dot{u}(t) = \Big(\frac{\partial g}{\partial u}(x(t), u(t))\Big)^{-1}\big(r(t+T) - g(x(t), u(t))\big). \qquad (8)$$
The state equation (6) and the control equation (8) together define the closed-loop system. This system can be viewed as an $(n+m)$-dimensional dynamical system with state variable $(x(t)^T, u(t)^T)^T \in \mathbb{R}^{n+m}$ and input $r(t) \in \mathbb{R}^m$. We are concerned with a variant of Bounded-Input-Bounded-State (BIBS) stability whereby, if $\{r(t)\}$ and $\{\dot{r}(t)\}$ are bounded, then $\{x(t)\}$ is bounded as well. Such stability can no longer be taken for granted as in the case where the plant is a memoryless nonlinearity.

We remark that a larger $T$ means larger prediction errors, and these translate into larger asymptotic tracking errors. On the other hand, an analysis of various second-order systems in [7] reveals that they all were unstable if $T$ is too small, and stable if $T$ is large enough. It can be seen that a requirement for a restricted prediction error can stand in contradiction with the stability requirement. This issue was resolved by speeding up the controller in the following manner. Consider $\alpha > 1$, and modify (8) by multiplying its right-hand side by $\alpha$, resulting in the following control equation:
$$\dot{u}(t) = \alpha\Big(\frac{\partial g}{\partial u}(x(t), u(t))\Big)^{-1}\big(r(t+T) - g(x(t), u(t))\big).$$
It was verified in [7]-[9] that, regardless of the value of $T \in \mathbb{R}^+$, a large-enough $\alpha$ stabilizes the closed-loop system; this statement seems to have a broad scope, and does not require the plant to be a minimum-phase system. Furthermore, if the closed-loop system is stable, then the following bound holds,
$$\limsup_{t\to\infty} \|r(t) - \tilde{y}(t)\| \le \frac{\eta}{\alpha}, \qquad (9)$$
where $\eta$ is defined by (4). Thus, a large gain $\alpha$ can both stabilize the closed-loop system and reduce the asymptotic tracking error.
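To give a concrete feel for the controller (8) and the speed-up gain $\alpha$, the following minimal Python sketch simulates the closed loop for a fully-actuated scalar plant. The plant $\dot{x} = -x + u$, the sinusoidal reference, and all parameter values are assumptions made only for this example (they are not from the paper); they are chosen so that the look-ahead prediction $g(x,u)$ has a closed form under a constant input.

```python
import numpy as np

# Variable-gain Newton-Raphson flow tracker (Section II-A) on a toy scalar plant.
# For xdot = -x + u with u held constant over [t, t+T], the prediction is exact:
#   g(x, u) = x * exp(-T) + u * (1 - exp(-T)).
T, alpha, dt = 0.5, 10.0, 1e-3          # look-ahead horizon, controller gain, step size
x, u = 0.0, 0.0                          # plant state and control
r = lambda t: np.sin(0.5 * t)            # reference signal to be tracked

err_log = []
for k in range(int(30.0 / dt)):
    t = k * dt
    g = x * np.exp(-T) + u * (1.0 - np.exp(-T))      # predicted output y(t+T)
    dg_du = 1.0 - np.exp(-T)                          # Jacobian of the prediction w.r.t. u
    u += dt * alpha * (r(t + T) - g) / dg_du          # Newton-Raphson flow (8) scaled by alpha
    x += dt * (-x + u)                                # plant dynamics (Euler step)
    err_log.append(abs(r(t) - x))

print("mean |r - y| over the last 5 s:", np.mean(err_log[-int(5.0 / dt):]))
```

Increasing `alpha` in this sketch tightens the asymptotic error, consistent with the bound (9).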
B. Problem Formulation

In an attempt to broaden the application scope of the control algorithm, we explore underactuated systems such as fixed-wing aircraft, which are widely used in aerospace engineering. The behavior of a fixed-wing aircraft at constant elevation can be approximated by a planar Dubins vehicle with states [23], for all $t \ge 0$,
$$\dot{z}_{p,1}(t) = V_p \cos\theta_p(t), \quad \dot{z}_{p,2}(t) = V_p \sin\theta_p(t), \quad \dot{\theta}_p(t) = u(t),$$
where $(z_{p,1}(t), z_{p,2}(t))^T$ denotes the planar position of the vehicle, $\theta_p(t)$ its heading, and $u(t)$ its turning rate (angular velocity), constrained as $\|u\| \le u_{\max}$. The input saturation enforces a minimum turning radius equal to $V_p / u_{\max}$. To test the efficacy of the controller for the underactuated system, henceforth referred to as the pursuer, it is tasked with tracking an evading vehicle, modeled as a single integrator with dynamics
$$\frac{d}{dt}\begin{bmatrix} z_{e,1}(t) \\ z_{e,2}(t) \end{bmatrix} = \begin{bmatrix} V_e \cos\theta_e \\ V_e \sin\theta_e \end{bmatrix},$$
where $(z_{e,1}(t), z_{e,2}(t))^T$ denotes the planar position of the evader and $V_e$ is its speed. We consider two cases: one where the evader is agnostic to the pursuer and follows a known trajectory, and one where the evader is adversarial in nature and its trajectory is not known to the pursuer. The next section provides two solutions to the problem of estimating the evader's trajectory, based on a model-based approach and a learning-based approach, respectively.
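For reference, a minimal Python sketch of one integration step of the two models above is given below. The speeds, the saturation level, and the Euler discretization are illustrative assumptions, not values used in the paper.

```python
import numpy as np

def pursuer_step(z1, z2, theta, u, V_p=1.0, u_max=np.pi / 2, dt=0.01):
    """One Euler step of the planar Dubins pursuer; the turning rate is saturated."""
    u = np.clip(u, -u_max, u_max)        # saturation implies a minimum turning radius V_p / u_max
    return (z1 + V_p * np.cos(theta) * dt,
            z2 + V_p * np.sin(theta) * dt,
            theta + u * dt)

def evader_step(z1, z2, theta_e, V_e=0.6, dt=0.01):
    """One Euler step of the single-integrator evader moving along heading theta_e."""
    return (z1 + V_e * np.cos(theta_e) * dt,
            z2 + V_e * np.sin(theta_e) * dt)
```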
III. PREDICTIVE FRAMEWORK
A. Model-Based Pursuit Evasion
The considered system is underactuated because the pursuer's position, $(z_{p,1}(t), z_{p,2}(t))^T$, is two-dimensional while it is controlled by a one-dimensional variable, $u(t)$. This raises a problem, since the application of the proposed tracking technique requires the control variable and the system's output to have the same dimension. To get around this difficulty, we define a suitable function $F: \mathbb{R}^2 \to \mathbb{R}^+$ and set $g(x(t), u(t)) := \int_t^{t+T} F\big(\tilde{y}_p(\tau) - \tilde{y}_e(\tau)\big)\, d\tau$, where $\tilde{y}_p(\tau)$ and $\tilde{y}_e(\tau)$ are the predicted positions of the pursuer and the evader at time $\tau$; we then apply the Newton-Raphson flow to the equation $g(x(t), u(t)) = 0$. The modified controller becomes
$$\dot{u}(t) = -\alpha \Big(\frac{\partial g}{\partial u}(x(t), u(t))\Big)^{-1} g(x(t), u(t)), \quad t \ge 0. \qquad (10)$$
Since $g(x,u)$ is a scalar, the modified algorithm works similarly to the base case.

Assume general nonlinear system dynamics as in (6) with output described by (7). The predicted state trajectory is computed by holding the input at a constant value over the prediction horizon, and is given by the following differential equation:
$$\dot{\xi}(\tau) = f(\xi(\tau), u(t)), \quad \tau \in [t, t+T], \qquad (11)$$
with the initial condition $\xi(t) = x(t)$, as shown in [7]. The predicted output at $\tau$ is $\tilde{y}_p(\tau) = h(\xi(\tau))$. Furthermore, by taking the partial derivative of (11) with respect to $u(t)$, we obtain
$$\frac{d}{d\tau}\frac{\partial \xi}{\partial u}(\tau) = \frac{\partial f}{\partial \xi}(\xi(\tau), u(t))\,\frac{\partial \xi}{\partial u}(\tau) + \frac{\partial f}{\partial u}(\xi(\tau), u(t)), \qquad (12)$$
with the initial condition $\frac{\partial \xi}{\partial u}(t) = 0$. The above is a differential equation in $\frac{\partial \xi}{\partial u}(\tau)$, $\tau \in [t, t+T]$, and (11) and (12) can be solved numerically. Finally, the values of $g(x,u)$ and $\frac{\partial g}{\partial u}(x,u)$ can be substituted into (10) to obtain the control law.

In the next section, results are presented for an agnostic as well as an adversarial pursuer-evader system. As mentioned above, in the adversarial problem formulation the trajectory of the evader is not known in advance, which can be overcome in two ways.

In the first approach, the pursuer(s) use game theory to predict the approximate direction of evasion. As mentioned in [24], in the case of a single pursuer, the evader's optimal strategy is to move along the line joining the evader's and pursuer's positions, provided the pursuer is far enough away. When the distance between the pursuer and the evader reduces to the turning radius of the pursuer, the evader switches strategies and enters the non-holonomic constraint region of the pursuer. This can be represented as follows:
$$\theta_E = \begin{cases} \arctan\Big(\dfrac{z_{e,2}(t) - z_{p,2}(t)}{z_{e,1}(t) - z_{p,1}(t)}\Big), & d > R_P, \\[2mm] \arctan\Big(\dfrac{z_{e,2}(t) - z_{p,2}(t)}{z_{e,1}(t) - z_{p,1}(t)}\Big) \pm \pi/2, & d \le R_P. \end{cases} \qquad (13)$$
Here $\theta_E$ is the expected evasion angle of the evader and $d$ is the distance between the pursuer and the evader.

If there are multiple pursuers, it is assumed that the evader follows the same strategy by considering only the closest pursuer. Notably, this will not provide the pursuers with a perfectly correct prediction of the evader's motion, since they do not know about the goal-seeking behavior mentioned above. However, it gives a good enough approximation of the evader's motion for the algorithm to be used for tracking.
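To make the model-based construction concrete, the following Python sketch implements a simplified single-pursuer version of the scheme: the pursuer is propagated under a constant input as in (11), the evader is propagated along a frozen evasion angle of the form (13), and one Euler step of the Newton-Raphson flow (10) is taken. All numerical values, the Euler discretization, the use of the terminal predicted distance as the scalar output $g$ (rather than the integral of $F$ over the window), and the finite-difference Jacobian used in place of the variational equation (12) are simplifying assumptions made for illustration.

```python
import numpy as np

V_P, V_E = 1.0, 0.6          # pursuer / evader speeds (illustrative)
U_MAX = np.pi / 2            # input saturation (illustrative)
T_HORIZON, ALPHA = 1.0, 5.0  # look-ahead time and controller gain (illustrative)
DT = 0.01                    # integration step

def predict_pursuer(state, u, horizon, dt=DT):
    """Propagate the Dubins pursuer as in (11), with the input held constant."""
    z1, z2, th = state
    for _ in range(int(horizon / dt)):
        z1 += V_P * np.cos(th) * dt
        z2 += V_P * np.sin(th) * dt
        th += u * dt
    return np.array([z1, z2])

def predict_evader(pos_e, theta_E, horizon):
    """Straight-line evader prediction with the evasion angle of (13) held fixed."""
    return pos_e + V_E * horizon * np.array([np.cos(theta_E), np.sin(theta_E)])

def g_scalar(state_p, u, pos_e, theta_E):
    """Scalar tracking output: predicted terminal pursuer-evader distance."""
    return np.linalg.norm(predict_pursuer(state_p, u, T_HORIZON)
                          - predict_evader(pos_e, theta_E, T_HORIZON))

def control_update(state_p, u, pos_e, theta_E, du=1e-4):
    """One Euler step of the Newton-Raphson flow (10) with gain ALPHA."""
    g = g_scalar(state_p, u, pos_e, theta_E)
    dg_du = (g_scalar(state_p, u + du, pos_e, theta_E) - g) / du  # finite-difference Jacobian
    if abs(dg_du) < 1e-6:
        dg_du = 1e-6                                              # guard against a near-singular Jacobian
    u_dot = -ALPHA * g / dg_du
    return float(np.clip(u + u_dot * DT, -U_MAX, U_MAX))

# Example: one controller step for a pursuer behind the evader (far-field case of (13)).
state_p = np.array([0.0, 0.0, 0.0])                               # (z1, z2, heading)
pos_e = np.array([3.0, 1.0])
theta_E = np.arctan2(pos_e[1] - state_p[1], pos_e[0] - state_p[0])
print("updated turning-rate command:", control_update(state_p, 0.0, pos_e, theta_E))
```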
The second approach involves learning the evader's behavior over time using an NN. The pursuers take their positions and the position of the evader as input, and after training the NN outputs the estimated evasion direction.

To showcase the efficacy of our method, we consider a pursuit evasion problem involving multiple pursuing agents. Such problems are typically formulated as zero-sum differential games [24]. Due to the difficulty of solving the underlying Hamilton-Jacobi-Isaacs (HJI) equations [25] of this problem, we shall utilize the method described in Section II-A to approximate the desired behavior. Furthermore, we show that augmenting the controller with learning structures in order to tackle the pursuit evasion problem without explicit knowledge of the evader's behavior is straightforward.

In order to formulate the pursuit evasion problem, we define a global state-space system consisting of the dynamics of the pursuers and the evader. For ease of exposition, the analysis will focus on the two-pursuer, one-evader problem, since extending the results to more pursuers is straightforward. The global state dynamics become
$$\frac{d}{dt}\begin{bmatrix} z_{p_1,1}(t) \\ z_{p_1,2}(t) \\ \theta_{p_1}(t) \\ z_{p_2,1}(t) \\ z_{p_2,2}(t) \\ \theta_{p_2}(t) \\ z_{e,1}(t) \\ z_{e,2}(t) \end{bmatrix} = \begin{bmatrix} V_p \cos\theta_{p_1} \\ V_p \sin\theta_{p_1} \\ u_{p_1} \\ V_p \cos\theta_{p_2} \\ V_p \sin\theta_{p_2} \\ u_{p_2} \\ V_e \cos\theta_e \\ V_e \sin\theta_e \end{bmatrix}, \qquad (14)$$
where the subscripts indicate the autonomous agent. For compactness, we denote the global state vector as $x(t) \in \mathbb{R}^8$, the pursuers' control vector as $u(t) \in \mathbb{R}^2$, and the nonlinear mapping described by the right-hand side of (14) as $f$. Thus, given the initial states of the agents $x_0 \in \mathbb{R}^8$, the evolution of the pursuit evasion game is described by $\dot{x}(t) = f(x(t), u, u_e)$, $x(0) = x_0$, $t \ge 0$.

Subsequently, this zero-sum game can be described as a minimax optimization problem through the cost index
$$J(x_0, u, u_e) = \int_0^\infty e^{-\gamma t} L(x)\, dt := \int_0^\infty e^{-\gamma t}\Big(\beta_1(d_1 + d_2) + \beta_2 \frac{d_1 d_2}{d_1 + d_2}\Big)\, dt, \qquad (15)$$
where $d_i = \sqrt{(z_{i,1} - z_{e,1})^2 + (z_{i,2} - z_{e,2})^2}$, $i \in \{p_1, p_2\}$, is the distance between the $i$-th pursuer and the evader, $\beta_1, \beta_2 \in \mathbb{R}^+$ are user-defined constants, and $\gamma \in \mathbb{R}^+$ is a discount factor. The first term ensures that the pursuers remain close to the evader, while the second term encourages cooperation between the agents. The cost decreases exponentially to ensure that the integral has a finite value in the absence of equilibrium points.

Let $V(x): \mathbb{R}^8 \to \mathbb{R}$ be a smooth function quantifying the value of the game when specific policies are followed starting from state $x(t)$. Then, we can define the corresponding Hamiltonian of the game as
$$H\Big(x, u, u_e, \frac{\partial V}{\partial x}\Big) = L(x) + \frac{\partial V}{\partial x}^T f(x, u, u_e) + \gamma V. \qquad (16)$$
The optimal feedback policies $u^\star(x)$, $u_e^\star(x)$ of this game are known to constitute a saddle point [25] such that
$$u^\star(x) = \arg\min_u H(x, u, u_e), \qquad (17)$$
$$u_e^\star(x) = \arg\max_{u_e} H(x, u, u_e). \qquad (18)$$
Under the optimal policies (17), (18), the HJI equation is satisfied,
$$H\Big(x, u^\star, u_e^\star, \frac{\partial V^\star}{\partial x}\Big) = 0. \qquad (19)$$
Evaluating the optimal pursuit policies yields the singular optimal solutions described by $V_{\theta_{p_1}} = V_{\theta_{p_2}} = 0$, where $V_{x_i}$ is the partial derivative of the value function with respect to the state $x_i$, calculated by solving (19).
To obviate the need for the bang-bang control derived from (17) and (18), we shall employ the predictive tracking technique described in Section II-A to derive approximate, easy-to-implement feedback controllers for the pursuing autonomous agents. Furthermore, by augmenting the predictive controller with learning mechanisms, the approximate controllers will have no need for explicit knowledge of $u_e^\star(x)$, the evader's policy. The following theorem presents bounds on the optimality loss induced by the use of the look-ahead controller approximation.

Theorem 1. Let the pursuit evasion game evolve according to the dynamics given by (14), where the evader is optimal with respect to (15) and the pursuers utilize the learning-based predictive tracking strategy given by (10). Then, the tracking error of the pursuers and the optimality loss due to the use of the predictive controller are bounded if there exists $\bar{\Delta} \in \mathbb{R}^+$ such that $\Delta(x(t), \hat{u}(t), \hat{u}_e(t)) \le \bar{\Delta}$, $\forall t \ge 0$, where $\Delta(x, \hat{u}, \hat{u}_e) = V_{x_e} v_e (\cos\hat{u}_e - \cos u_e^\star) + V_{y_e} v_e (\sin\hat{u}_e - \sin u_e^\star) + V_{\theta_{p_1}}(u_1^\star - \hat{u}_1) + V_{\theta_{p_2}}(u_2^\star - \hat{u}_2)$, with $V_\xi$ denoting the partial derivative of the game value with respect to the state component $\xi(t)$.

Proof: Consider the Hamiltonian function when the approximate controller, denoted $\hat{u}(t)$, and the NN-based prediction of the evader's policy, $\hat{u}_e(t)$, are used,
$$H(x, \hat{u}, \hat{u}_e) = L(x) + \Big(\frac{\partial V}{\partial x}\Big)^T f(x, \hat{u}, \hat{u}_e) + \gamma V. \qquad (20)$$
Taking into account the nonlinear dynamics of the system (14), one can rewrite (20) in terms of the optimal Hamiltonian as $H(x, \hat{u}, \hat{u}_e) = H(x, u^\star, u_e^\star) + \Delta(x, \hat{u}, \hat{u}_e)$, where $H(x, u^\star, u_e^\star) = 0$ is the HJI equation obtained after substituting (17) and (18) into (16). Now, take the orbital derivative of the value function along the trajectories generated by the approximate controllers, $\dot{V} = \big(\frac{\partial V}{\partial x}\big)^T f(x, \hat{u}, \hat{u}_e)$. Substituting (20) yields $\dot{V} = -L(x) - \gamma V + \Delta(x, \hat{u}, \hat{u}_e)$. Thus, since $L(x) > 0$, $\forall x \in \mathbb{R}^8 \setminus \{0\}$,
$$\dot{V} < -\gamma V + \Delta(x, \hat{u}, \hat{u}_e) \;\Rightarrow\; \dot{V} < -\gamma V + \bar{\Delta}.$$
Hence, for $V \ge \bar{\Delta}/\gamma$ we have $\dot{V} \le 0$; indeed, by the comparison lemma, $V(t) \le \big(V(0) - \bar{\Delta}/\gamma\big)e^{-\gamma t} + \bar{\Delta}/\gamma$ for all $t \ge 0$. Thus $\{x \in \mathbb{R}^8 \,|\, V(x) \le \bar{\Delta}/\gamma\}$ is a forward invariant set, which implies that the tracking error and the optimality loss over any finite horizon are bounded.

Remark 1. Note that we do not use optimal control or MPC to solve the pursuit evasion problem. Instead, the controller is governed by (10), which is simple to implement and has low computational complexity.
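For completeness, a minimal Python sketch of the discounted game cost (15), which is later used to compare the model-based and learning-based controllers, is given below. The weights, the discount factor, and the Riemann-sum approximation are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def running_cost(p1, p2, e, beta1=1.0, beta2=1.0):
    """Integrand L(x) of the game cost (15) for the two-pursuer, one-evader game."""
    d1 = np.linalg.norm(p1 - e)
    d2 = np.linalg.norm(p2 - e)
    return beta1 * (d1 + d2) + beta2 * d1 * d2 / (d1 + d2 + 1e-9)

def discounted_cost(traj_p1, traj_p2, traj_e, dt, gamma=0.1, beta1=1.0, beta2=1.0):
    """Riemann-sum approximation of J in (15) along sampled trajectories."""
    t = np.arange(len(traj_e)) * dt
    L = np.array([running_cost(p1, p2, e, beta1, beta2)
                  for p1, p2, e in zip(traj_p1, traj_p2, traj_e)])
    return float(np.sum(np.exp(-gamma * t) * L) * dt)

# Example: cost of a frozen configuration over a short horizon.
traj_e = [np.zeros(2)] * 100
traj_p1 = [np.array([2.0, 0.0])] * 100
traj_p2 = [np.array([0.0, 2.0])] * 100
print(discounted_cost(traj_p1, traj_p2, traj_e, dt=0.05))
```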
B. Deep Learning-Based Pursuit Evasion

A deep NN, consisting of $L > 1$ hidden layers, describes a nonlinear mapping between its input space $\mathbb{R}^n$ and output space $\mathbb{R}^p$. Each layer receives the output of the previous layer as an input and, subsequently, feeds its own output to the next layer. Each layer's output consists of the weighted sum of its input along with a bias term, filtered through an application-specific activation function [11].

Specifically, let $\mathbb{R}^{n_l}$ be the input space of a specific layer, and $\mathbb{R}^{p_l}$ the corresponding output space. Then the layer's output is
$$Y_i(x) = \sigma\Big(\sum_{j=1}^{n_l} v_{ij} X_j + v_{i0}\Big), \quad i = 1, 2, \ldots, p_l,$$
where $X = [X_1 \ldots X_{n_l}]^T \in \mathbb{R}^{n_l}$ is the input vector, gathered from training data or from the output of previous layers, $v_{ij} \in \mathbb{R}$ is a collection of $n_l$ weights for each layer, $v_{i0} \in \mathbb{R}$ is the bias term, and $\sigma$ is the layer's activation function. We note that it is typical to write the output of a layer compactly, with a slight abuse of notation, as
$$Y = \sigma\big(W^T \bar{\sigma}(\bar{X})\big), \qquad (21)$$
where $Y = [Y_1 \ldots Y_{p_l}] \in \mathbb{R}^{p_l}$, $W = [v_{ij}] \in \mathbb{R}^{(n_l+1)\times p_l}$, and $\bar{\sigma}$ is the activation function of the previous layer, taking as input the augmented vector $\bar{X} = [1 \;\; X^T]^T$.

It is known [26] that two-layer NNs possess the universal approximation property, according to which any smooth function can be approximated arbitrarily closely by an NN of two or more layers. Let $S \subset \mathbb{R}^n$ be a simply connected compact set and consider the nonlinear function $\kappa: S \to \mathbb{R}^p$. Given any $\epsilon_b > 0$, there exists an NN structure such that $\kappa(x) = \sigma\big(W^T \bar{\sigma}(x)\big) + \epsilon$, $\forall x \in S$, where $\|\epsilon\| \le \epsilon_b$. We note that, typically, the activation function of the output layer $\sigma(\cdot)$ is taken to be linear.

Evaluating the weight matrix $W$ of a network is the main concern of the area of machine learning. In this work, we employ the gradient-descent-based backpropagation algorithm. Given a collection of $N_d$ training data, stored in the tuples $\{x_k, \kappa_k\}_k$, where $x_k \in \mathbb{R}^n$, $\kappa_k \in \mathbb{R}^p$, $\forall k = 1, \ldots, N_d$, we denote the output errors as $r_k = \kappa(x_k) - \kappa_k$. Then, the update equation for the weights at each optimization iteration $t_k$ is given by
$$w_{ij}(t_k + 1) = w_{ij}(t_k) - \eta \frac{\partial (r_k^T r_k)}{\partial w_{ij}}, \quad \forall t_k \in \mathbb{N}, \qquad (22)$$
where $\eta \in \mathbb{R}^+$ denotes the learning rate. We note that the update index $t_k$ need not correspond to the sample index $k$, since different update schedules leverage the gathered data in different ways [26].

It can be seen that, in order for the proposed method to compute the pursuers' control inputs, an accurate prediction of the future state of the evader is required. However, this presupposes that the pursuers have access to the evader's future decisions; an assumption that is, in most cases, invalid. Thus, we augment the pursuers' controllers with an NN structure that learns to predict the actions of the evader based on past recorded data.

Initially, we assume that the evader's strategy is computed by a feedback algorithm, given her relative position to the pursuers. This way, the unknown function we wish to approximate is $f: \mathbb{R}^{2N} \to \mathbb{R}$, with $u_e = f(\delta z_{p_1,1}, \delta z_{p_1,2}, \ldots, \delta z_{p_N,1}, \delta z_{p_N,2})$, where $(\delta z_{p_i,1}, \delta z_{p_i,2})$ denote the distance of pursuer $i$ to the evader along the X and Y axes, respectively. In order to train the network, we let the pursuers gather data regarding the fleet's position with respect to the evader, as well as her behavior, over a predefined time window $T_l > 0$.
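The update (22) is ordinary gradient-descent backpropagation. As an illustration only, the following NumPy sketch trains a single-hidden-layer network of the form (21) to map relative pursuer-evader displacements to an evasion angle. The synthetic data-generating rule, the layer sizes, and the learning rate are assumptions made for this example, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: inputs are relative displacements (delta-x, delta-y for N = 2
# pursuers, stacked); targets are evasion angles produced by a stand-in rule used
# only to generate data for this sketch.
N_d = 500
X = rng.uniform(-10.0, 10.0, size=(N_d, 4))
kappa = np.arctan2(X[:, 1] + X[:, 3], X[:, 0] + X[:, 2]).reshape(-1, 1)

n_in, n_hidden, n_out = 4, 16, 1
W1 = 0.1 * rng.standard_normal((n_in + 1, n_hidden))   # hidden-layer weights (bias row included)
W2 = 0.1 * rng.standard_normal((n_hidden + 1, n_out))  # output-layer weights (bias row included)

def forward(x, W1, W2):
    """One hidden tanh layer followed by a linear output layer, as in (21)."""
    x_aug = np.hstack([np.ones((x.shape[0], 1)), x])    # prepend the bias input
    h = np.tanh(x_aug @ W1)                              # hidden activations
    h_aug = np.hstack([np.ones((h.shape[0], 1)), h])
    return h_aug @ W2, x_aug, h_aug

eta = 0.05   # learning rate in (22)
for epoch in range(2000):
    y_hat, x_aug, h_aug = forward(X, W1, W2)
    r = y_hat - kappa                                    # output errors r_k
    # Gradient of sum_k r_k^T r_k with respect to the weights (backpropagation).
    grad_W2 = h_aug.T @ (2.0 * r)
    grad_h = (2.0 * r) @ W2[1:, :].T                     # drop the bias row when backpropagating
    grad_W1 = x_aug.T @ (grad_h * (1.0 - np.tanh(x_aug @ W1) ** 2))
    W2 -= eta * grad_W2 / N_d                            # gradient-descent update as in (22)
    W1 -= eta * grad_W1 / N_d

print("final mean squared error:", float(np.mean((forward(X, W1, W2)[0] - kappa) ** 2)))
```

In the pursuit setting, the rows of `X` would be the displacements recorded over the window $T_l$ and the targets the observed evasion directions, with the same update applied in batches.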
Remark 2. Increasing the time window $T_l$ allows the pursuers to gather more training data for the predictive network. However, this will not only increase the computational complexity of the learning procedure, but will also make the pursuers more inert to sudden changes in the evader's behavior. Simulation results corroborate our choice of training parameters.

Subsequently, we denote by $\hat{u}_e(x)$ the current prediction function for the evader's strategy, i.e., $\hat{u}_e(x) = \sigma\big(\hat{W}^T \hat{\sigma}(\chi)\big)$, where $\chi = [\delta x_1 \; \delta y_1 \; \ldots \; \delta x_N \; \delta y_N] \in \mathbb{R}^{2N}$, $\hat{W}$ denotes the current weight estimate of the NN's output layer, and $\hat{\sigma}(\cdot)$ is the current estimate of the hidden layers, parametrized by appropriate hidden weights.

Remark 3. While the learning algorithm for the evader's behavior operates throughout the duration of the pursuit, thus making the approximation weights time-varying, we suppress their explicit dependence on time since the process is open-loop, in the sense that the system learns in batches rather than in a continuous fashion.

IV. SIMULATION RESULTS
This section presents results for the problems described in the previous section. First, the agnostic-evader case is considered, followed by the adversarial case. For the latter, single- and multiple-pursuer systems are considered separately. The controller is implemented on a Dubins vehicle. For the purpose of tracking, we define the system output to be $y_i = [z_{i,1} \; z_{i,2}]^T$, $i \in \{p_1, p_2, e\}$.

A. Single Pursuer - Agnostic Target

In this subsection, the controller is tested on a Dubins vehicle tasked with pursuing an agnostic target moving along a known trajectory. Since the vehicle has a constant speed and an input saturation is enforced, it has an inherent minimum turning radius. For this simulation, the pursuer speed $V_p$ is held constant and the input saturation is set first to a smaller and then to a larger multiple of $\pi$ rad/s, yielding a large and a small turning radius, respectively. The evader moves along two semicircular curves with a constant speed which is less than $V_p$.

As a consequence, when the pursuer catches up to the evader, it overshoots and has to go around a full circle before it can resume tracking. Naturally, a lower turning radius translates to better tracking, as the vehicle can make "tighter" turns. This can be seen by comparing the trajectories of the vehicle in Figure 2 with those in Figure 4. For the same evader trajectory, the tracking performance is far better in the second case. Once the pursuer catches up to the target, the maximum tracking error in the first case is several times larger than in the second case, as shown in Figures 3 and 5, which is consistent with the ratio of the two turning radii.

Fig. 2. Agnostic evader with a large turning radius.
Fig. 3. Evolution of the agnostic evader tracking error with a large turning radius.
Fig. 4. Agnostic evader with a small turning radius.
Fig. 5. Evolution of the agnostic evader tracking error with a small turning radius.

B. Single Pursuer - Adversarial Evader

The pursuer is again modelled as a Dubins vehicle, while the evader is modelled as a single integrator with a maximum velocity less than the speed of the pursuer. Hence, while the pursuer is faster, the evader is more agile and can instantly change its direction of motion. In this and the subsequent cases, the evader is adversarial in nature and uses game theory to choose its evasion direction.

Let $y_p(t)$ and $y_e(t)$ be the position vectors of the pursuer and the evader, respectively, at time $t$. First, the pursuer estimates the optimal evasion direction based on the relative position of the evader and itself at time $t$ using (13). Assuming this direction of evasion to be fixed over the prediction window from $t$ to $t+T$ gives the predicted position of the evader at all time instances in this interval, denoted as $\tilde{y}_e(\tau)$, $\tau \in [t, t+T]$. Next, the pursuer estimates its own predicted position if its input is kept constant, called $\tilde{y}_p(\tau)$, $\tau \in [t, t+T]$. Finally, $g(t)$ is set as $\|\tilde{y}_e(t+T) - \tilde{y}_p(t+T)\|$, and the value of $\frac{\partial g}{\partial u}(x(t), u(t))$ ($x(t)$ being the ensemble vector of the states of the pursuer and the evader) is used to compute the input differential equation (10).

Figure 6 shows the trajectories of the pursuer and the evader, with the goal for the evader set to a fixed point. It can be observed that the evader moves towards the goal while the pursuer is far away, and starts evasive maneuvers when the pursuer gets close, by entering its non-holonomic region. Figure 7 displays the tracking error, defined as the distance between the pursuer and the evader, which is almost periodic. This is because the evader's maneuvers force the pursuer to circle back. The peak tracking error after the pursuer catches up is slightly more than twice the turning radius, as expected.

Fig. 6. Trajectories for a single pursuer-evader system.
Fig. 7. Evolution of the tracking error for a single pursuer-evader system.

Algorithm 1: Deep Learning-Based and Predictive Pursuit Evasion
Inputs: $X_{P_i}(t)$, $\forall i \in \{1, \ldots, N\}$, $X_E(t)$, and evasion-strategy approximation weights $W$.
Output: $u_{P_i}(t)$, $\forall i \in \{1, \ldots, N\}$.
1: Compute $(\delta x_i, \delta y_i)$, $i \in \{1, \ldots, N\}$.
2: Predict the evader's future behavior via (21).
3: Train the NN as in (22).
4: Predict the evader's future state as $\tilde{X}_E(t+T) = X_E(t) + [V_E \cos\theta_E \;\; V_E \sin\theta_E]^T T$.
5: Propagate the pursuer dynamics to get $\tilde{X}_{P_i}(t+T)$.
6: Compute the current Newton-flow parameters using (23).
7: Compute the control dynamics $u_{P_i}(t)$ from (3).
8: Propagate the actual system evolution using (14).
9: Append the current distances $(\delta x_i, \delta y_i)$ to the stack of previous observations.
10: Update the evader prediction network through (22).
C. Multiple Pursuers - Adversarial Evader
While the previous subsection had only one pursuer, this simulation considers the case of two pursuers and a single evader. Having multiple pursuers means there must be cooperation between them in order to utilize resources optimally. Thus, a pursuer can no longer make decisions based solely on the position of the evader relative to itself; the positions of the other pursuers must also be factored in. We therefore redefine the expression for $g(x,u)$ to include these parameters, as shown below for the case of two pursuers. Let $d_p$ be the distance between the two pursuers, and let
$$g(x(t), u(t)) := \int_t^{t+T} \Big\{ \beta_1\big(d_1(\tau) + d_2(\tau)\big) + \beta_2 \frac{d_1(\tau)\, d_2(\tau)}{d_1(\tau) + d_2(\tau)} + \beta_3 e^{-\gamma d_p(\tau)} \Big\}\, d\tau, \quad \forall t \ge 0. \qquad (23)$$
The first term ensures that the pursuers remain close to the evader, while the second term encourages cooperation between the agents. The last term is added to repel the pursuers from each other if they come too close, since having multiple pursuers in close vicinity of one another is sub-optimal.

Figure 8 shows the trajectories of the pursuers and the evader when the goal for the evader is set to a fixed point. In this case, the pursuers close in on the evader and trap it away from its goal due to their cooperative behavior. The evader is forced to continuously perform evasive maneuvers, as one pursuer closes in whenever the other has to make a turn. This can be seen more clearly in the tracking-error plot of Figure 9: after the pursuers catch up with the evader, whenever one pursuer is at its maximum distance, the other is at its minimum. The results show good coordination between the pursuers and low tracking error, and are qualitatively comparable to [22].

Lastly, we present the results under the learning-based prediction. In Figure 11, we present a comparison of the tracking error of the model-based algorithm vis-a-vis the NN-based control. Figure 12 showcases the quality of the performance of the proposed algorithm in terms of the game-theoretic cost metric. From these figures, it can be seen that the NN structure offers fast predictive capabilities to the controller; hence the overall performance is comparable to the model-based control.

Fig. 8. Trajectories for the two pursuer-single evader system.
Fig. 9. Evolution of the tracking error for the two pursuer-single evader system.
Fig. 10. Trajectories for the two pursuer-single evader system with learning.
Fig. 11. Evolution of the tracking error for the systems with and without learning.
Fig. 12. Total cost for the system with and without learning.
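To make the cooperative objective concrete, the following Python sketch evaluates the integrand of (23) for a candidate configuration; the weights `beta1`, `beta2`, `beta3` and the decay rate `gamma1` are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

def cooperative_stage_cost(p1, p2, e, beta1=1.0, beta2=1.0, beta3=1.0, gamma1=1.0):
    """Integrand of (23): pursuit term + cooperation term + separation penalty.

    p1, p2, e are 2-D position vectors of the two pursuers and the evader.
    """
    d1 = np.linalg.norm(p1 - e)                        # pursuer 1 to evader
    d2 = np.linalg.norm(p2 - e)                        # pursuer 2 to evader
    dp = np.linalg.norm(p1 - p2)                       # inter-pursuer distance
    pursuit = beta1 * (d1 + d2)                        # keep both pursuers close to the evader
    cooperation = beta2 * d1 * d2 / (d1 + d2 + 1e-9)   # small when either pursuer is very close
    separation = beta3 * np.exp(-gamma1 * dp)          # repels pursuers that bunch together
    return pursuit + cooperation + separation

# g(x(t), u(t)) in (23) is the integral of this quantity along the predicted
# trajectories over [t, t+T]; a left Riemann sum over the prediction grid is a
# simple way to approximate it in practice.
print(cooperative_stage_cost(np.array([2.0, 0.0]), np.array([0.0, 2.0]), np.zeros(2)))
```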
V. CONCLUSION AND FUTURE WORK
This work extends the framework of prediction-based nonlinear tracking to the context of pursuit evasion games. We present results for vehicle pursuit of agnostic targets, modeled as moving along known trajectories, as well as for adversarial target tracking, where the evader evolves according to game-theoretic principles. Furthermore, to obviate the need for explicit knowledge of the evader's strategy, we employ learning algorithms alongside the predictive controller. The overall algorithm is shown to produce results comparable to those in the literature, while it precludes the need for solving an optimal control problem.

Future work will focus on developing robustness guarantees that will allow for more realistic scenarios, where noise and external disturbances are taken into consideration.

REFERENCES
[1] S. Devasia, D. Chen, and B. Paden, "Nonlinear inversion-based output tracking," IEEE Transactions on Automatic Control, vol. 41, no. 7, pp. 930-942, 1996.
[2] P. Martin, S. Devasia, and B. Paden, "A different look at output tracking: control of a VTOL aircraft," Automatica, vol. 32, no. 1, pp. 101-107, 1996.
[3] A. Isidori and C. Byrnes, "Output regulation of nonlinear systems," IEEE Transactions on Automatic Control, vol. 35, pp. 131-140, 1990.
[4] H. Khalil, "On the design of robust servomechanisms for minimum phase nonlinear systems," in Proc. 37th IEEE Conference on Decision and Control, Tampa, FL, pp. 3075-3080, 1998.
[5] F. Allgöwer and A. Zheng, Nonlinear Model Predictive Control. Birkhäuser, 2012, vol. 26.
[6] J. Rawlings, D. Mayne, and M. Diehl, Model Predictive Control: Theory, Computation, and Design, 2nd Edition. Nob Hill Publishing, 2017.
[7] Y. Wardi, C. Seatzu, M. Egerstedt, and I. Buckley, "Performance regulation and tracking via lookahead simulation: Preliminary results and validation," in 56th IEEE Conference on Decision and Control, Melbourne, Australia, December 12-15, 2017.
[8] Y. Wardi, C. Seatzu, and M. Egerstedt, "Tracking control via variable-gain integrator and lookahead simulation: Application to leader-follower multiagent networks," in 6th IFAC Conference on Analysis and Design of Hybrid Systems (ADHS 2018), Oxford, UK, July 11-13, 2018.
[9] S. Shivam, I. Buckley, Y. Wardi, C. Seatzu, and M. Egerstedt, "Tracking control by the Newton-Raphson flow: Applications to autonomous vehicles," in European Control Conference, Naples, Italy, June 25-28, 2019.
[10] G. Saridis, "Intelligent robotic control," IEEE Transactions on Automatic Control, vol. 28, no. 5, pp. 547-557, 1983.
[11] S. S. Haykin, Neural Networks and Learning Machines. Pearson, Upper Saddle River, 2009, vol. 3.
[12] D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. IET, 2013, vol. 2.
[13] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4-27, 1990.
[14] K. G. Vamvoudakis and F. L. Lewis, "Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem," Automatica, vol. 46, no. 5, pp. 878-888, 2010.
[15] C. M. Bishop et al., Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[16] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 82-92, 2013.
[17] K. G. Vamvoudakis, "Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach," Systems & Control Letters, vol. 100, pp. 14-20, 2017.
[18] B. G. Weber and M. Mateas, "A data mining approach to strategy prediction," in IEEE Symposium on Computational Intelligence and Games. IEEE, 2009, pp. 140-147.
[19] T. Alpcan and T. Başar, Network Security: A Decision and Game-Theoretic Approach. Cambridge University Press, 2010.
[20] H. J. Pesch, I. Gabler, S. Miesbach, and M. H. Breitner, "Synthesis of optimal strategies for differential games by neural networks," in New Trends in Dynamic Games and Applications. Springer, 1995, pp. 111-141.
[21] A. Kanellopoulos, K. Vamvoudakis, and Y. Wardi, "Predictive learning via lookahead simulation," in AIAA Scitech 2019 Forum, San Diego, California, January 7-11, 2019.
[22] S. A. Quintero, D. A. Copp, and J. P. Hespanha, "Robust UAV coordination for target tracking using output-feedback model predictive control with moving horizon estimation," in American Control Conference, Chicago, Illinois, July 1-3, 2015.
[23] S. M. LaValle, Planning Algorithms. Cambridge University Press, 2006.
[24] R. Isaacs, Differential Games: A Mathematical Theory with Applications to Warfare and Pursuit, Control and Optimization. Courier Corporation, 1999.
[25] T. Basar and G. J. Olsder, Dynamic Noncooperative Game Theory. SIAM, 1999, vol. 23.
[26] F. Lewis, S. Jagannathan, and A. Yesildirak, Neural Network Control of Robot Manipulators and Non-Linear Systems. Taylor & Francis, 1999.