A 3D Game Theoretical Framework for the Evaluation of Unmanned Aircraft Systems Airspace Integration Concepts
Negin Musavi, Ayman Manzoor, and Yildiray Yildiz,
Senior Member, IEEE
Abstract—Predicting the outcomes of integrating Unmanned Aerial Systems (UAS) into the National Airspace System (NAS) is a complex problem that needs to be addressed by simulation studies before UAS are allowed routine access to the NAS. This paper focuses on providing a 3-dimensional (3D) simulation framework using a game theoretical methodology to evaluate integration concepts in scenarios where manned and unmanned air vehicles co-exist. In the proposed method, the interactive decision making process of human pilots is incorporated into airspace models, which fills a gap in the literature, where pilot behavior is generally assumed to be known a priori. Pilot behavior is modeled using the dynamic level-k reasoning concept and approximate reinforcement learning. Level-k reasoning is a notion in game theory based on the assumption that humans have various levels of decision making. In the conventional "static" approach, each agent makes assumptions about his or her opponents and chooses his or her actions accordingly. In dynamic level-k reasoning, on the other hand, agents can update their beliefs about their opponents and revise their level-k rule. In this study, Neural Fitted Q Iteration, an approximate reinforcement learning method, is used to model the time-extended decisions of pilots with 3D maneuvers. An analysis of UAS integration is conducted using an example 3D scenario in the presence of manned aircraft and fully autonomous UAS equipped with sense and avoid algorithms.
Index Terms—UAS integration into NAS, Modeling, Reinforcement Learning, Game Theory, Neural Fitted Q Iteration.
I. INTRODUCTION

ALTHOUGH unmanned aircraft systems (UAS) have operational and cost advantages over manned aircraft in many applications, they do not have routine access to the National Airspace System (NAS). The aviation industry, being very sensitive to safety, needs strong evidence that UAS integration will not have any negative impact on the existing airspace system in terms of safety [2], [3] before UAS are granted routine access to the NAS. Until the technologies, standards, and procedures for a safe integration of UAS into the airspace mature, there will not be enough data accumulated about the issue, and it will be hard to predict the effectiveness of the related technologies and concepts. Although research efforts exist to develop a safe and efficient real test environment for UAS integration [25], flight tests are expensive and experimental failures can cause severe economic loss. Therefore, employing
N. Musavi, A. Manzoor, and Y. Yildiz are with the Department of Mechanical Engineering, Bilkent University, Ankara, 06800 Turkey (e-mail: [email protected]; [email protected]; [email protected]).

simulations is currently the most efficient way to understand the effects of UAS integration on the air traffic system [4]. These simulation studies need to be conducted with hybrid airspace system (HAS) models, where manned and unmanned vehicles coexist.

HAS models in the literature are generally based on the assumption that the pilots of manned aircraft always behave as expected, without deviating from ideal behavior [5]-[10]. Most of the existing HAS models are designed to evaluate and test the performance of collision avoidance systems in single-encounter scenarios in which the intruder (generally a manned aircraft) has a pre-defined behavior, with no consideration of the decision making process of the pilot. These models are valuable and essential at the initial stages of evaluating a new method, but it is not realistic to expect that the pilot, as a decision maker, will always behave deterministically and in a pre-defined manner. It is not always predictable, for example, how pilots will respond to the traffic alert and collision avoidance system (TCAS) [11]. TCAS is an on-board collision avoidance system that observes and tracks surrounding air traffic, detects conflicts and suggests avoidance maneuvers to the pilots. In recent studies, it was shown that only 13% of pilot responses match the deterministic pilot model that was assumed for TCAS development [12], [13].
Therefore, incorporating human decision-making processes into HAS models has a strong potential to improve the predictive power of these models.

In prior works [14], [15], the authors created HAS models with human decision making models, inspired by a game theoretical methodology known as semi network-form games [12], where pilot behavior was not assumed to be known a priori but was obtained using 1) the level-k reasoning concept, a game theoretical approach used to model the interactions of multiple strategic players, where it is assumed that humans have various levels of reasoning, level-0 being the lowest, and 2) reinforcement learning, which helps model time-extended decisions as opposed to assuming one-shot decision making. Although these studies introduced one of the very first examples of HAS models where several decision makers can be modeled simultaneously in a time-extended manner, they had two limitations: First, the HAS models were developed for a 2-dimensional (2D) airspace. Second, the policies, i.e. maps from observation spaces to action spaces, obtained for the decision makers remain unchanged during their interaction. In the proposed framework, these limitations are removed and a 3D HAS model is introduced where the strategic decision makers can modify their policies during interactions with each other. Therefore, compared to [14], [15], a much larger class of interactions can be modeled.

It is shown in the literature that 1) in repeated strategic interactions, where agents consider other agents' possible actions before determining their own, agents with different cognitive abilities change their behavior during the interaction [17], and 2) there is a positive relationship between cognitive ability and reasoning levels [16], [17].
These observations lead to agents with different levels of reasoning who can observe their opponents' behavior during repeated interactions, update their beliefs about their opponents' reasoning level, and change their own level-k rule against them. In [16] and [17], a systematic level-k structure is introduced where players can update their beliefs about their opponents and switch their own level rule up one level during their interactions. There are also other level-k rule learning models in the literature, such as the ones presented in [18] and [19], where the agent levels can reach up to infinity. This is not a problem for the applications investigated in [18] and [19], in which obtaining the level-k rules (k = 0, 1, 2, ..., ∞) is straightforward and has an analytical solution. Since it is computationally expensive to obtain higher levels, and certain experimental studies show that humans in general have a maximum reasoning level of 2 [32], the existing level-k rule learning methods may not be suitable for the application considered in this work, where more than 188 decision makers are modeled simultaneously in a time-extended manner. Here, we propose a simpler method for modeling level-k rule updates during interactions by a) limiting the levels to 2 and b) allowing rule updates only if a trajectory conflict is observed.

Different from the 2D HAS model developed in [14], [15], in this study the game theoretical modeling framework is developed for a 3D HAS model, which allows covering a much larger class of integration scenarios.
The reinforcement learning algorithm used in the authors' earlier works [33], [34], [35] employs tables to store the Q values of all state (location of the intruder, approach angle of the intruder, best trajectory action, best destination action and previous action)-action (turn left, turn right, go straight) pairs, which define how preferable it is to take a certain action given the observations/states. This poses a challenge for the application of the method to systems with a large number of state-action pairs, such as the proposed 3D HAS model in this study. To circumvent this issue, the Neural Fitted Q Iteration (NFQ) method [22], [23], [24], an approximate reinforcement learning algorithm, is utilized. Approximate reinforcement learning methods use function approximators to represent the Q value function [20]. In other words, instead of saving Q values for each state-action pair, the Q value function is approximated by a function approximator. In the case of NFQ, a neural network is used as the function approximator. The NFQ approach also allows using a continuous observation space, which contributes to a more precise definition of the agents' observations, compared to conventional approaches, where a discretized observation space is required.

In the simulations, pilot models that are obtained using the proposed game theoretical modeling framework are used in complex scenarios, where UAS and manned aircraft co-exist, to analyze the probable outcomes of HAS interactions. HAS scenarios contain interacting humans (pilots) who also interact with multiple UAS equipped with their own sense and avoid (SAA) systems. It is noted that automation algorithms other than SAA systems, such as TCAS, as well as possible air traffic management instructions, can also be incorporated into the proposed framework. During the simulations, UAS fly autonomously based on pre-programmed flight plans, but they can deviate from their plans to resolve a possible conflict with the help of their SAA algorithms.
In these simulations, as an example to demonstrate how the proposed framework can be utilized, the effect of responsibility assignment for conflict resolution on the safety and performance of the HAS is analyzed (see [25] for the importance of these variables and of responsibility assignment for UAS integration).

The organization of the paper is as follows: In Section II, the HAS scenario for UAS integration into the NAS is described in detail. In Section III, the proposed pilot decision modeling method is explained. In Section IV, simulation results are provided. Finally, conclusions are given in Section V.

II. UAS INTEGRATION SCENARIO
In order to evaluate the possible outcomes of integrating Unmanned Aircraft Systems (UAS) into the National Airspace System (NAS), a hybrid airspace (HAS) scenario, where manned and unmanned aircraft co-exist, is designed and explained in this section. The scenario consists of 188 manned aircraft and 3 UAS. The size of the airspace is km × km (horizontal) × ft (altitude).

Fig. 1. 2D snapshot of the airspace scenario in the simulation platform.
In the scenario, it is assumed that each aircraft is able to receive the surrounding traffic information using Automatic Dependent Surveillance-Broadcast (ADS-B) technology. ADS-B can provide an own-ship aircraft with the identification, position and velocity information of surrounding aircraft that are also equipped with ADS-B.
A. UAS Conflict Detection And Avoidance Logic
UAS fly according to their pre-programmed flight plans, an example of which is marked in yellow in Fig. 1. UAS are assumed to have the dynamics of the RQ-4 Global Hawk, with an operation speed of knots [29]. UAS are also equipped with SAA systems, which enable them to detect trajectory conflicts and to initiate evasive maneuvers, if necessary. If no conflict is detected, UAS continue to follow their mission plan. Whether receiving a conflict resolution command from the SAA system or flying based on their pre-defined flight plan, UAS always receive a velocity command during the flight. The UAS velocity vector variation is modeled as first order dynamics with a time constant of s [30], represented as

$\dot{\vec{v}} = -(\vec{v} - \vec{v}_d)$, (1)

where $\vec{v}$ and $\vec{v}_d$ are the current and the desired/commanded velocity vectors, respectively. The two SAA logics utilized in this study were developed by Fasano et al. [31], referred to as SAA1, and Mujumdar et al. [30], referred to as SAA2. Both SAA logics contain two phases: a conflict detection phase and a conflict resolution phase. The conflict detection phase is the same for both SAA1 and SAA2. A conflict is detected if the minimum distance between the UAS and the intruder aircraft is calculated to be less than a minimum required distance, R, during a predefined time interval. The minimum distance is calculated by projecting the trajectories of the UAS and the intruder aircraft in time. Once a conflict is detected, SAA1 and SAA2 suggest their own velocity adjustment commands in order to resolve the conflict.
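Since both SAA logics share this detection phase, it can be sketched as a constant-velocity trajectory projection: the relative position is propagated over a look-ahead window and a conflict is flagged when the predicted separation drops below R. This is a minimal sketch; the function and parameter names, and the 1 s sampling step, are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def detect_conflict(p_uas, v_uas, p_intr, v_intr, R, horizon, dt=1.0):
    """Project both trajectories forward assuming constant velocities and
    flag a conflict if the predicted separation drops below R within the
    look-ahead horizon. Returns (conflict, time_of_minimum_distance)."""
    times = np.arange(0.0, horizon + dt, dt)
    # Under the constant-velocity assumption the relative position
    # evolves linearly in time.
    rel0 = np.asarray(p_intr, float) - np.asarray(p_uas, float)
    rel_v = np.asarray(v_intr, float) - np.asarray(v_uas, float)
    dists = np.linalg.norm(rel0[None, :] + times[:, None] * rel_v[None, :], axis=1)
    i_min = int(np.argmin(dists))
    return bool(dists[i_min] < R), float(times[i_min])
```

For example, two aircraft flying head-on along the x-axis are flagged well before their closest approach, while parallel traffic at a safe offset is not.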
The velocity adjustment commands of the SAA1 and SAA2 logics, $\vec{v}_{d_{A1}}$ and $\vec{v}_{d_{A2}}$, are given in the equations below:

$\vec{v}_{d_{A1}} = \left[ \frac{v_{AB}\cos(\eta - \zeta)}{\sin(\zeta)} \left( \sin(\eta)\,\frac{\vec{v}_{AB}}{v_{AB}} - \sin(\eta - \zeta)\,\frac{\vec{r}}{\|\vec{r}\|} \right) \right] + \vec{v}_B$ (2)

$\vec{v}_{d_{A2}} = \dfrac{ -\vec{v}_A \left( \frac{\vec{r}\cdot\vec{v}_{AB}}{\|\vec{v}_{AB}\|} \right) - (R - \|\vec{r}_m\|)\frac{\vec{r}_m}{\|\vec{r}_m\|} }{ \left\| -\vec{v}_A \left( \frac{\vec{r}\cdot\vec{v}_{AB}}{\|\vec{v}_{AB}\|} \right) - (R - \|\vec{r}_m\|)\frac{\vec{r}_m}{\|\vec{r}_m\|} \right\| }$ (3)

where $\vec{v}_A$ and $\vec{v}_B$ refer to the velocity vectors of the UAS and the intruder, and $\vec{r}$ and $\vec{v}_{AB}$ denote the relative position and velocity between the UAS and the intruder, respectively. $\zeta$ is the angle between $\vec{r}$ and $\vec{v}_{AB}$, and $\eta$ is calculated as $\eta = \sin^{-1}\frac{R}{\|\vec{r}\|}$. $\vec{r}$ refers to the initial relative position vector between the UAS and the intruder. If multiple conflicts are detected, the UAS starts an evasive maneuver to resolve the conflict that is predicted to happen earliest. The velocity adjustment suggested by the SAA1 logic guarantees minimum deviation from the trajectory, while in the case of the SAA2 logic, the UAS maneuvers to resolve the conflict until it regains the minimum safe distance from the intruder.

B. Manned Aircraft
All manned aircraft are assumed to be in the en-route phase of travel at constant speed, v, in the range of [150 − ] knots. Pilots may decide to change the heading angle by ± °, or change the pitch angle by ± °, or may decide to keep both the heading and pitch angles unchanged. Once the pilot gives a heading or pitch command, the aircraft moves to the desired heading and pitch, $\psi_d$ and $\theta_d$, in constant speed mode, where the heading and pitch changes are modeled by first order dynamics with the standard rate turn: a turn in which an aircraft changes its heading at a rate of ° per second ( ° in 2 minutes) [28]. This is approximated by first order dynamics with a time constant of s. Therefore, the aircraft heading and pitch angle dynamics can be given as

$\dot{\psi} = -(\psi - \psi_d)$ (4)

$\dot{\theta} = -(\theta - \theta_d)$ (5)

and the velocity, $\vec{v} = (v_x, v_y, v_z)$, is then obtained as

$v_x = \|\vec{v}\| \sin\psi \cos\theta$, (6)

$v_y = \|\vec{v}\| \cos\psi \cos\theta$, (7)

$v_z = \|\vec{v}\| \sin\theta$. (8)

III. PILOT DECISION MODEL
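A minimal discrete-time sketch of the dynamics in Eqs. (4)-(8) is given below, assuming a placeholder time constant `tau` and angles in radians; the paper's own constants are not reproduced here.

```python
import math

def step_attitude(psi, theta, psi_d, theta_d, tau=5.0, dt=0.1):
    """One forward-Euler step of the first-order heading/pitch dynamics:
    psi_dot = -(psi - psi_d)/tau, theta_dot = -(theta - theta_d)/tau.
    tau is an illustrative time constant, not the paper's value."""
    psi += dt * (-(psi - psi_d) / tau)
    theta += dt * (-(theta - theta_d) / tau)
    return psi, theta

def velocity_components(speed, psi, theta):
    """Resolve the constant airspeed into (vx, vy, vz) per Eqs. (6)-(8)."""
    vx = speed * math.sin(psi) * math.cos(theta)
    vy = speed * math.cos(psi) * math.cos(theta)
    vz = speed * math.sin(theta)
    return vx, vy, vz
```

Iterating `step_attitude` drives the heading exponentially toward the commanded value, which is exactly the first-order behavior Eqs. (4)-(5) describe.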
The proposed model for pilot decision making in this study is formed by combining two methodologies: dynamic level-k reasoning and Neural Fitted Q Iteration (NFQ), an approximate reinforcement learning algorithm. A level-k type model is trained by assigning level-(k-1) type behavior to all of the agents (manned aircraft) except the one being trained. The trainee learns to react as best as he/she can in this environment using NFQ. Thus, the resulting behavior becomes level-k type. This process starts with training a level-1 type behavior and continues until the highest desired level is reached. Once all of the desired levels are obtained, the training stage ends and, in the simulation stage, the obtained level-k reaction models are used in the airspace scenario explained in Section II, where manned aircraft and UAS co-exist. In the simulation, certain proportions of level-0, level-1 and level-2 behavior types are assigned to the manned aircraft. It is noted that each level-1 and level-2 agent can change its level-k behavior type based on the dynamic level-k reasoning method after observing its intruder's behavior.
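As an illustration of this iterative training scheme, the sketch below builds a small level-k hierarchy in a toy symmetric matrix game, replacing the NFQ best-response training with an exact argmax best response; the game and all names are illustrative, not the paper's airspace model.

```python
import numpy as np

def build_level_k_hierarchy(payoff, max_level, level0_action=0):
    """Build level-k actions iteratively: level-0 plays a fixed nonstrategic
    action, and each level-k action is the best response to a level-(k-1)
    opponent (here an exact argmax; the paper learns it with NFQ instead).
    payoff[i, j] is the row player's payoff for action i against action j."""
    actions = {0: level0_action}
    for k in range(1, max_level + 1):
        actions[k] = int(np.argmax(payoff[:, actions[k - 1]]))
    return actions

# Rock-paper-scissors: 0 = rock, 1 = paper, 2 = scissors.
rps = np.array([[0, -1, 1],
                [1, 0, -1],
                [-1, 1, 0]])
```

Against a level-0 rock player, the level-1 best response is paper, and the level-2 best response, which assumes a level-1 opponent, is scissors, mirroring the best-response chain described above.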
A. Dynamic Level-k Reasoning
Level-k reasoning is a game theoretical model whose main idea is that humans have various levels of reasoning in their decision-making process [18]. It has been observed that reasoning levels are related to the cognitive abilities of humans [26]. The level hierarchy is defined iteratively such that the level-k rule is a best response to the level-(k-1) rule. A level-1 decision maker (DM), for example, assumes that the other agents in the scenario are level-0 and takes actions accordingly to provide the best response. A level-2 DM takes actions to give the best response to other DMs that have level-1 reasoning, and so on. From a modeling standpoint, the level-0 rule represents an initial point from which more sophisticated rules can be obtained iteratively. A level-0 rule represents a "nonstrategic" DM who does not take into account other DMs' possible moves when choosing his/her own actions. This behavior can also be considered "reflexive", since it only reacts to the immediate observations. In this study, a level-0 pilot flies an aircraft with constant heading and pitch angles, starting from its initial position toward its destination.

In its conventional form, level-k reasoning helps model the interactions between the DMs, where a level-k DM assumes that the other DMs have level-(k-1) reasoning. Although this approach has proved successful in modeling short-term or one-shot interactions, it misses the point that agents, during their interactions, may update their assumptions about the other agents and in turn update their own behavior. To remedy this problem, we introduce a closed loop algorithm that allows the agents to dynamically update their reasoning levels if a trajectory conflict is detected. This algorithm is explained in Fig. 2, where a pseudo-code is provided.
Check if there is a conflict
IF there is a conflict
    Take action using the existing level-k rule, observe the resulting HAS state and see whether the conflict persists
    IF the conflict persists
        Switch level-k rule with 50% probability
    ELSE
        Continue with the existing level-k rule
    END IF
ELSE
    Continue with the existing level-k rule
END IF
Fig. 2. Dynamic level-k reasoning pseudo-code.
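The pseudo-code of Fig. 2 can be sketched in Python as follows. The toggle between level-1 and level-2 rules reflects the paper's cap at level 2; the dictionary-based agent representation and the function name are illustrative assumptions.

```python
import random

def dynamic_level_k_step(agent, conflict_detected, conflict_persists, p_switch=0.5):
    """One decision step of the dynamic level-k rule: if a conflict is
    detected and still persists after acting with the current rule, switch
    the reasoning level with probability p_switch. Toggling between levels
    1 and 2 is an assumption based on the paper's cap at level 2."""
    if conflict_detected and conflict_persists:
        if random.random() < p_switch:
            # Switch the level-k rule (50% probability by default).
            agent["level"] = 2 if agent["level"] == 1 else 1
    return agent
```

Setting `p_switch` to 1.0 or 0.0 makes the rule deterministic, which is convenient for testing; the paper's value is 50%.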
B. Neural Fitted Q Iteration
Reinforcement learning is a mathematical learning method based on reward and punishment [21]. The agent interacts with an environment through its observations, actions and a scalar reward signal received from the environment. The agent's aim is to select actions that maximize the cumulative future reward. Given a state, when an action increases (decreases) the value of an objective function (reward), which defines the goals of the agent, the probability of taking that action increases (decreases). Reinforcement learning algorithms involve estimating the state-action value function, or "the Q value", which is a measure of how valuable taking an action is, in terms of maximizing the total accumulated reward, given the agent's state. This estimation is generally conducted in an iterative fashion by updating the Q values after each training step. In the classical Q-learning reinforcement learning algorithm, for discrete state and action spaces, the update rule is given as [22], [24]

$Q(s, a) \rightarrow (1 - \alpha) Q(s, a) + \alpha \left( r(s, a, s') + \gamma \max_{a} Q(s', a) \right)$ (9)

where s, a, and s' refer to the state, action, and the successor state, respectively; $\alpha$ is a learning rate and $\gamma$ is a discounting factor. The discounting factor sets the relative importance of future rewards compared to current rewards. The Neural Fitted Q Iteration (NFQ) method [22], [24] approximates the Q value function using a neural network of multi-layer perceptron type. For a given state-action pair, the neural network takes the state and action as its input and provides an approximate value of the corresponding Q value. The method aims to minimize the following error function [22]:

$\left( Q(s, a) - \left( r(s, a, s') + \gamma \max_{a} Q(s', a) \right) \right)^2$, (10)

where $r(\cdot)$ is the reward signal that the agent receives from the environment after the transition from state s to state s' by taking action a.
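A tabular sketch of the update rule in Eq. (9) is shown below, using a dictionary as the Q table; the paper replaces this table with a neural network approximator, and the names and default hyperparameters here are illustrative.

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Classical tabular Q-learning update, Eq. (9):
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a')).
    Unvisited state-action pairs default to a Q value of 0."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q
```

For instance, starting from an empty table with alpha = 0.5 and a reward of 1, a single update moves Q(s, a) halfway from 0 toward the target value of 1.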
This error function measures the deviation between the state-action Q value approximated by the multi-layer perceptron, $Q(s, a)$, and the target value, $r(s, a, s') + \gamma \max_{a} Q(s', a)$. In the NFQ method, Q value functions are updated in batches, meaning that the entire set of input patterns $(s_i, a_i), i = 0, 1, 2, \ldots$ and target patterns $r(s_i, a_i, s'_i) + \gamma \max_{a} Q(s'_i, a), i = 0, 1, 2, \ldots$ is collected and the update is performed at the end of a full episode. To summarize, the NFQ method consists of two major steps: the generation of a training set and, at the end of each episode, the training of a multi-layer perceptron on this set to obtain a Q-value function approximating the optimal state-action Q values. The training is stopped whenever the average reward received per episode converges.

The goal of the reinforcement learning algorithm is to learn the optimal Q values by maximizing the agent's return, which is calculated via a reward/objective function. A reward function can be considered a happiness function, goal function or utility function, which represents, mathematically, the preferences of the pilot. In this paper, the pilot reward function is defined as

$\text{reward} = w_1(-C) + w_2(-S) + w_3(-A) + w_4(-P)$. (11)

In (11), C is the number of aircraft within the collision region. Based on the definition provided by the Federal Aviation Administration (FAA), the radius of the collision region is taken as ft in the horizontal direction and ft in the vertical direction [27]. S is the number of air vehicles within the separation region. The radius of the separation region is nm in the horizontal direction [8] and ft in the vertical direction, based on the "reduced vertical separation minima" [27]. A represents whether the aircraft is getting closer to the intruder or going away from the intruder in terms of their approach angle, and takes the value 1 for getting closer and 0 for going away.
P represents whether the aircraft gets closer to or goes away from its trajectory vector in terms of angle, and takes the value 0 for getting closer and 1 for going away.

Although ADS-B provides the positions and the velocities of other aircraft, a pilot, with his/her limited cognitive capabilities, cannot possibly process all of this information during his/her decision making process. In this study, in order to model pilot limitations, including limitations in visual acuity and depth perception, as well as the limited viewing range of an aircraft, it is assumed that pilots can observe (or process) the information from a limited portion of the nearby airspace. This limited portion is called the "observation space". Since the aircraft are moving in a 3D region, the observation space is a 3D portion of the nearby airspace, considered as a portion of a sphere centered at the location of the pilot. To illustrate the observation space, it is divided into horizontal and vertical parts and is schematically depicted in Fig. 3. The viewing range of a pilot may be different in the horizontal and vertical directions, which is why the observation space in these two directions is shown as different angular portions of a circle. Since the standard separation for manned aviation is − nm [8], the radius of the observation space is taken as a variable larger than nm. Whenever an intruder aircraft moves toward the observation space (see Fig. 3, where Agent B is the intruder), the approach geometry is defined by two angles: φ_H, in the horizontal plane, and φ_V, in the vertical plane. The aircraft's angular orientation with respect to its ideal trajectory is also defined by two angles: β_H, in the horizontal plane, and β_V, in the vertical plane. Fig. 3 depicts a typical example, where aircraft B is moving toward the observation space with φ_H = −40°, φ_V = +35°, β_H = −110° and β_V = −90°. Aircraft relative orientations are also coded as different "encounter types".
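The reward of Eq. (11) can be sketched directly from the definitions of C, S, A and P above; the weights default to placeholder values, since the paper's weight values are not reproduced here.

```python
def pilot_reward(C, S, A, P, w=(1.0, 1.0, 1.0, 1.0)):
    """Pilot reward of Eq. (11): a weighted sum that penalizes aircraft in
    the collision region (C), aircraft in the separation region (S), a
    closing approach geometry (A = 1 when approaching the intruder) and
    divergence from the ideal trajectory (P = 1 when diverging). The
    weights w are illustrative placeholders."""
    return w[0] * (-C) + w[1] * (-S) + w[2] * (-A) + w[3] * (-P)
```

A conflict-free, on-course state thus earns the maximum reward of zero, and every violated term subtracts from it in proportion to its weight.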
Fig. 4 depicts 8 types of encounter geometries projected onto the horizontal plane (left column) and the vertical plane (right column). These geometries are indicated as $C_i$, $i = 1, \ldots, 8$ in the figure. Finally, the observation space includes the pilot's memory of his/her action at the previous time step. Given an observation, the pilots can choose among five actions: turn ° left, go straight, turn ° right, pitch ° up, or pitch ° down. It is noted that these pilot commands are filtered through the aircraft dynamics provided in Section II-B.

Fig. 3. Pilot observation space.

Fig. 4. Encounter types.

The information of the observations and actions of the pilots is fed to the neural network, which is in charge of approximating the Q values. The input vector fed into the neural network is [sign(β_H), sign(β_V), intruder status, sign(φ_H), sign(φ_V), encounter type, previous action, action]^T, where sign(·) takes the value +1, −1 or 0 depending on whether its argument is positive, negative or zero, respectively. The intruder status is taken as 1 whenever an intruder is detected by the pilot, and 0 otherwise. The encounter type (see Fig. 4) is fed to the neural network in the form of a vector with 2 elements, indicating the encounter type in the horizontal plane and the vertical plane. The 1st element takes the values −1, −0.5, +0.5 and +1 for encounter types C , C , C and C in the horizontal plane, and 0 otherwise. Similarly, the 2nd element takes the values −1, −0.5, +0.5 and +1 for encounter types C , C , C and C in the vertical plane, and 0 otherwise. The previous action and action are fed to the neural network in the form of a vector with 4 elements: the actions "turn ° left", "go straight", "turn ° right", "pitch ° up" and "pitch ° down" are coded as [0, , , ], [0, , , ], [0, , , ], [1, , , ] and [−, −, −, −], respectively. It is noted that, to improve precision, the continuous values of the orientation angles can be used instead of their signs.

IV. SIMULATION RESULTS AND DISCUSSION
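The assembly of the neural-network input vector described above can be sketched as follows, with the encounter and action codes passed in as pre-computed vectors; the function name and the encoding values in the example are illustrative assumptions.

```python
import numpy as np

def sign(x):
    """Return +1, -1 or 0 for positive, negative or zero arguments."""
    return (x > 0) - (x < 0)

def build_nn_input(beta_h, beta_v, intruder_detected, phi_h, phi_v,
                   encounter_code, prev_action_code, action_code):
    """Assemble the neural-network input vector described in the text:
    [sign(beta_H), sign(beta_V), intruder status, sign(phi_H), sign(phi_V),
    encounter type (2 values), previous action (4 values), action (4 values)].
    The code vectors are assumed to follow the paper's encoding scheme."""
    return np.concatenate([
        [sign(beta_h), sign(beta_v), 1 if intruder_detected else 0,
         sign(phi_h), sign(phi_v)],
        encounter_code,      # 2-element horizontal/vertical encounter code
        prev_action_code,    # 4-element code of the previous action
        action_code,         # 4-element code of the candidate action
    ])
```

With the geometry from Fig. 3 (β_H = −110°, β_V = −90°, φ_H = −40°, φ_V = +35°), the first five entries become [−1, −1, 1, −1, +1], and the full vector has 15 elements.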
In this section, a quantitative analysis of multiple UAS integration in a crowded airspace is presented. Before presenting these results, single encounter scenarios, where two manned aircraft with different reasoning levels are on a collision course, are investigated.
A. Single Encounter Scenarios for Manned Aircraft
Fig. 5 presents the separation violation rates of manned aircraft in 5000 random single encounters where pilots are modeled as level-1 and level-2 decision makers. A separation violation occurs when the horizontal and the vertical distances between two aircraft are less than the horizontal separation requirement, nm [8], and the reduced vertical separation requirement, ft [8], respectively. In the figure, separation violations are shown for 3 different "distance horizon" values, where the distance horizon is the radius of the observation space depicted in Fig. 3. Pilots oversee a s time window prior to a probable separation violation, with a second decision frequency for choosing their actions. The distance horizon takes three values: nm (equal to the horizontal separation requirement), . nm and nm. Pilots are either level-0, level-1 or level-2 agents. There are 4 possible types of scenarios: 1) level-1 pilot vs. level-0 pilot, 2) level-2 pilot vs. level-1 pilot, 3) level-1 pilot vs. level-1 pilot, and 4) level-2 pilot vs. level-2 pilot. According to Fig. 5, in the 1st and 2nd type scenarios (level-k vs. level-(k-1)), increasing the distance horizon from nm to nm decreases the separation violation rate from . to . . It is noted that a fraction (9.8/11.9 = 0.82) of the nm distance horizon separation violations occur in scenarios where the encounters are difficult to resolve. This is because, for these types of encounters, the separation violation occurs no matter how the pilots on the collision course maneuver to resolve the conflict. In the 3rd and 4th type scenarios (level-k vs. level-k), the separation violation rate decreases with increasing pilot distance horizon; however, the separation violation rate is still high ( . ) even for the nm case. The reason for this high separation violation rate is that when a level-k agent encounters another agent with the same level, his/her assumption about the other becomes invalid.
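The separation-violation test used throughout this section can be sketched as a joint horizontal-and-vertical threshold check; the default thresholds below (5 nm horizontal, 1000 ft vertical) are illustrative placeholders for the horizontal separation requirement and the reduced vertical separation minima cited in the text.

```python
import math

def separation_violation(pos_a, pos_b, horiz_req_nm=5.0, vert_req_ft=1000.0):
    """Flag a separation violation: the horizontal AND the vertical distance
    must both be below their requirements at the same time. Positions are
    (x_nm, y_nm, altitude_ft); threshold defaults are placeholders."""
    dh = math.hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1])  # nm
    dv = abs(pos_a[2] - pos_b[2])                              # ft
    return dh < horiz_req_nm and dv < vert_req_ft
```

Note that aircraft that are horizontally close but vertically well separated (or vice versa) do not count as a violation, which is why both conditions are combined with a logical AND.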
This problem and its implications were discussed in Section III-A, where a remedy, termed the "dynamic" level-k reasoning method, was proposed (see Fig. 2). According to Fig. 5, the separation violation rate for regular encounters (red) decreases from . to . when dynamic level-k reasoning is chosen as the interactive decision making model, for the case of the nm distance horizon.

To visually demonstrate the pilot decision making process, two 3D single encounter scenarios are designed. The first scenario consists of two manned aircraft flying towards each other at different altitude levels, where the initial horizontal and vertical distances between the aircraft are . nm and ft, respectively. The 2nd scenario consists of two manned aircraft flying towards each other, where one of them is flying on a horizontal plane while the other is climbing at a constant vertical rate of ft/min. In the second scenario, the initial horizontal and vertical distances between the aircraft are nm and ft, respectively. In the scenarios, pilots can oversee conflicts in a s time window prior to a separation violation. The pilots' distance horizon is considered to be nm. Both aircraft can change the heading angle by ± °, or change the pitch angle by ± °, or may keep both the heading and pitch angles unchanged. In both scenarios, 4 cases for the pilots' level-k type behavior are considered: 1) a

Fig. 5. Separation violation rates. "dst hor" refers to the distance horizon of the pilots. On the columns that show the separation violation rates for a nm distance horizon, the cyan color shows the percentage of violations that occur when the encounters are "difficult to resolve", meaning that the initial conditions of the encounters do not permit any type of pilot action to avoid a separation violation.
Fig. 6. Encounter scenario 1: level-1 pilot vs. level-0 pilot.

level-1 pilot vs. a level-0 pilot, 2) a level-2 pilot vs. a level-1 pilot, 3) a level-1 pilot vs. a level-1 pilot, and 4) a dynamic level-1 pilot vs. a level-1 pilot.

Figures 6-9 depict 2D horizontal projection snapshots of the four cases for the first scenario. Black, red and green squares correspond to manned aircraft with level-0, level-1 and level-2 type pilots, respectively. The circles and stars stand for the initial positions and final destinations of the aircraft. The gray track lines behind the aircraft represent the paths traveled from their initial positions to where they stand in the snapshot. Two neighboring grid points are nm apart. It is seen from these figures that while the level-1 vs. level-0 and level-2 vs. level-1 encounter conflicts are properly resolved, the level-1 vs. level-1 conflict results in a separation violation. As explained earlier, this problem may occur in certain approach geometries due to an incorrect opponent level assumption. Figure 9 shows an encounter scenario where a dynamic level-k pilot has an encounter with a level-1 pilot. The dynamic level-k pilot starts with a level-1 policy and, when he/she detects a probable conflict, changes his/her policy to level-2 and prevents a separation violation. Figure 10 and Fig. 11 depict the horizontal and vertical distances between the two aircraft during the simulation of the four cases discussed here. It is seen that, for the 1st scenario, no separation violation occurs except in the level-1 vs. level-1 case.
Fig. 7. Encounter scenario 1: level-2 pilot vs. level-1 pilot.

Fig. 8. Encounter scenario 1: level-1 pilot vs. level-1 pilot.
Fig. 9. Encounter scenario 1: dynamic level-k pilot vs. level-1 pilot.
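The dynamic level-k behavior shown in Fig. 9 can be sketched as a simple policy-switching rule: fly the current level-k policy, and when a probable conflict is detected, revise the belief about the opponent's level and move one step up the hierarchy. The conflict test, threshold values, and level cap below are illustrative assumptions, not the paper's trained policies.

```python
# Illustrative sketch of dynamic level-k reasoning: the agent starts at an
# initial level and, whenever a probable conflict is detected, revises its
# belief about the opponent and switches to the next-higher level policy.
# The conflict predicate and MAX_LEVEL are simplified assumptions.

MAX_LEVEL = 2  # highest policy available to the agent (assumed)

class DynamicLevelKPilot:
    def __init__(self, initial_level=1):
        self.level = initial_level

    def probable_conflict(self, horizontal_nm, vertical_ft):
        # Hypothetical conflict test: both separations below a warning margin.
        return horizontal_nm < 10.0 and vertical_ft < 2000.0

    def step(self, horizontal_nm, vertical_ft):
        # If a conflict is predicted while playing level-k, the opponent is
        # probably not level-(k-1); assume level-k and respond with level-(k+1).
        if self.probable_conflict(horizontal_nm, vertical_ft) and self.level < MAX_LEVEL:
            self.level += 1
        return self.level

pilot = DynamicLevelKPilot(initial_level=1)
print(pilot.step(20.0, 5000.0))  # no conflict predicted: stays at level 1
print(pilot.step(8.0, 1500.0))   # conflict predicted: switches to level 2
```

This mirrors the sequence annotated in Fig. 9: the pilot flies level-1 until the conflict is detected, then completes the encounter with the level-2 policy.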
Similar observations can be drawn for the simulations of the second scenario, the results of which are presented in Figures 12-15.
Fig. 10. Encounter scenario 1: horizontal distance.
Fig. 11. Encounter scenario 1: vertical distance.
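A separation violation of the kind plotted in Fig. 10 and Fig. 11 can be checked with a simple geometric test: compare the horizontal and vertical distances between two aircraft against minimum separation thresholds, and flag a violation only when both are breached simultaneously. The threshold values below are placeholders, not the minima used in the paper's simulations.

```python
import math

# Hypothetical minimum separation thresholds (placeholders, not the paper's values).
MIN_HORIZONTAL_NM = 5.0
MIN_VERTICAL_FT = 1000.0

def separation_violation(pos_a, pos_b):
    """pos = (x_nm, y_nm, alt_ft). A violation requires BOTH the horizontal
    and the vertical separation to drop below their minima at the same time."""
    horizontal = math.hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1])
    vertical = abs(pos_a[2] - pos_b[2])
    return horizontal < MIN_HORIZONTAL_NM and vertical < MIN_VERTICAL_FT

# Far apart horizontally: no violation even at the same altitude.
print(separation_violation((0, 0, 30000), (10, 0, 30000)))  # False
# Close in both dimensions: violation.
print(separation_violation((0, 0, 30000), (3, 0, 30500)))   # True
```

Requiring both conditions at once is what makes the distinct horizontal and vertical distance plots of Fig. 10 and Fig. 11 informative: either separation alone is sufficient to keep the encounter safe.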
B. UAS flying in a crowded airspace
In this section, the scenario explained in Section II is simulated to investigate the effect of responsibility assignment for conflict resolution on safety and performance. Separation responsibility assignment is an important issue in addressing the integration of UAS into the NAS [25]: it is crucial to determine which of the agents (manned aircraft or UAS) will take the responsibility of conflict resolution. Since loss of separation is the most serious issue, the safety metric is taken as the total number of separation violations between all aircraft, whether manned or unmanned. The performance metric, on the other hand, is taken as the average trajectory deviation of the manned and unmanned aircraft. In all of the simulations, level-0, level-1, and level-2 pilot policies are randomly distributed over the manned aircraft such that 10% of the pilots fly based on level-0 policies, 60% based on level-1 policies, and 30% based on level-2 policies. This distribution is obtained from the human experimental studies discussed in [32] and may not necessarily reflect the true distribution for the scenarios discussed here. It is noted, however, that the level distribution can be adapted to other distributional data in the proposed framework. Level-1 and level-2 type pilots utilize the dynamic level-k reasoning method.

Fig. 18, Fig. 19, Fig. 20, and Fig. 21 depict a comparison of different resolution responsibility cases: manned aircraft are responsible (dark blue), both manned aircraft and UAS are responsible (blue), and UAS are responsible (cyan). Both the SAA1 and SAA2 logics are employed in the simulations. In the case when only manned aircraft are responsible for conflict resolution, UAS are forced to continue their path without
Fig. 12. Encounter scenario 2: level-1 pilot vs. level-0 pilot.

Fig. 13. Encounter scenario 2: level-2 pilot vs. level-1 pilot.

Fig. 14. Encounter scenario 2: level-1 pilot vs. level-1 pilot.

Fig. 15. Encounter scenario 2: dynamic level-k pilot vs. level-1 pilot.

running their SAA system, and the manned aircraft act as dynamic level-1 and level-2 decision makers. In the case when the UAS are responsible for conflict resolution, the manned aircraft are forced to continue their path without changing their heading, and the UAS execute the maneuvers dictated by their SAA algorithms. In the case when both the manned
Fig. 16. Encounter scenario 2: horizontal distance.
Fig. 17. Encounter scenario 2: vertical distance.

aircraft and the UAS are responsible for conflict resolution, they both utilize their evasive maneuvers. Fig. 18 shows that the manned aircraft deviate more from their trajectories when only the manned aircraft carry the resolution responsibility, compared to the case when both the UAS and the manned aircraft are responsible. This is true for both the SAA1 (left) and the SAA2 (right) algorithms. On the other hand, Fig. 19 shows that the UAS deviate from their trajectories more when they are responsible for the resolution, compared to the case when the responsibility is shared, which holds true for both SAA methods. Figure 20 shows, as expected, that for both SAA1 and SAA2, the UAS flight times are shortest when only the manned aircraft are responsible for the resolution. According to Fig. 21, for both SAA1 and SAA2, the minimum number of separation violations is observed when the conflict resolution responsibility is shared between the UAS and the manned aircraft.

V. CONCLUSION
In this paper, a combination of the level-k reasoning game theoretical concept and an approximate reinforcement learning method called Neural Fitted Q-learning is used to create a 3-dimensional (3D) airspace modeling framework for predicting the possible outcomes of integrating Unmanned Aircraft Systems (UAS) into the National Airspace System (NAS). Compared to the earlier results of the authors, the assumption that the decision makers' levels remain the same during interactions and the requirement of keeping a large Q-value table are removed. These are achieved by the introduction of a dynamic level-k reasoning method and the employment of
Fig. 18. Average trajectory deviation of manned aircraft.

Fig. 19. Average trajectory deviation of UAS.

Fig. 20. UAS flight time.

Fig. 21. Separation violations between manned aircraft and UAS.

the Neural Fitted Q-learning algorithm, respectively. These improvements made it possible to model a larger class of interactions between the decision makers, which is demonstrated by simulating various single encounter scenarios in a 3D airspace. The proposed modeling framework can be used to quantitatively investigate how the safety and performance of the simulated airspace system are affected by various integration technologies and concepts, such as airspace density, minimum separation distance, and various UAS sense and avoid algorithms and their design parameters. One of the issues in UAS integration is responsibility assignment during conflicts, and it is shown how the 3D game theoretical modeling framework discussed in this paper can be used to study this problem.

ACKNOWLEDGMENT
This effort was sponsored by the Scientific and Technological Research Council of Turkey under grant number 114E282.

REFERENCES

[1] V. K. P. Dalamagkidis, D. L. A. Piegl, "On unmanned aircraft systems issues, challenges and operational restrictions preventing integration into national airspace system," Progress in Aerospace Sciences, vol. 44, no. 7, pp. 503-519, 2008, DOI: 10.1016/j.paerosci.2008.08.001.
[2] Federal Aviation Administration, "Integration of civil unmanned aircraft system (UAS) in the national airspace system (NAS) roadmap," U.S. Department of Transportation, Technical Report, 2013.
[3] European RPAS Steering Group, "Roadmap for the integration of civil RPAS into the European aviation system," European Commission, Technical Report, 2013.
[4] The MITRE Corporation Center for Advanced Aviation System Development, "Issues concerning integration of unmanned aerial vehicles in civil airspace," Technical Report, 2004.
[5] D. Maki, C. Parry, K. Noth, M. Molinario, and R. Miraflor, "Dynamic protection zone alerting and pilot maneuver logic for ground based sense and avoid of unmanned aircraft systems," in Proc. AIAA Infotech@Aerospace, AIAA Paper 2012-2505, 2012, pp. 1-10, DOI: 10.2514/6.2012-2505.
[6] M. J. Kochenderfer, L. P. Espindle, J. K. Kuchar, J. D. Griffith, "Correlated encounter model for cooperative aircraft in the National Airspace System version 1.0," Lincoln Laboratory, Massachusetts Institute of Technology, 24 October 2008.
[7] J. K. Kuchar, J. Andrews, T. H. Drumm, V. Heinz, S. Thompson, and J. Welch, "A safety analysis process for the traffic alert and collision avoidance system (TCAS) and see-and-avoid systems on remotely piloted vehicles," in Proc. AIAA 3rd Unmanned Unlimited Technical Conference, Workshop and Exhibit, AIAA Paper 2004-6423, 2004, pp. 1-13, DOI: 10.2514/6.2004-6423.
[8] M. Perez-Batlle, E. Pastor, P. Royo, X. Prats, and C. Barrado, "A taxonomy of UAS separation maneuvers and their automated execution," in Proceedings of the , IRIT Press, London, May 2012, pp. 1-11.
[9] M. Florent, R. R. Schultz, and Z. Wang, "Unmanned aircraft systems sense and avoid flight testing utilizing ADS-B transceiver," in Proc. AIAA Infotech@Aerospace, AIAA Paper 2010-3441, 2010, pp. 1-8, DOI: 10.2514/6.2010-3441.
[10] T. B. Billingsley, "Safety analysis of TCAS on Global Hawk using airspace encounter models," Ph.D. dissertation, Massachusetts Institute of Technology, 2006.
[11] E. Salas and D. Maurino, Human Factors in Aviation, Elsevier, Academic Press, 2nd ed., 2010.
[12] R. Lee, D. Wolpert, "Game theoretic modeling of pilot behavior during mid-air encounters," Decision Making with Multiple Imperfect Decision Makers, Intelligent Systems Reference Library Series, Springer, 2011, DOI: 10.1007/978-3-642-24647-0-4.
[13] J. K. Kuchar, A. C. Drumm, "The traffic alert and collision avoidance system," Lincoln Laboratory Journal, vol. 16, no. 2, pp. 277-296, 2007.
[14] N. Musavi, D. Onural, K. Gunes, Y. Yildiz, "Unmanned aircraft systems airspace integration: A game theoretical framework for concept evaluations," Journal of Guidance, Control, and Dynamics, vol. 40, pp. 96-109, 2016.
[15] N. Musavi, K. B. Tekeliolu, Y. Yildiz, K. Gunes, D. Onural, "A game theoretical modeling and simulation framework for the integration of unmanned aircraft systems into the national airspace," in Proc. AIAA Infotech@Aerospace, AIAA Paper 2016-1001, 2016, DOI: 10.2514/6.2016-1001.
[16] D. Gill and V. Prowse, "Cognitive ability and learning to play equilibrium: A level-k analysis," pp. 1-33, 2012. Available at SSRN: https://ssrn.com/abstract=2043336 or http://dx.doi.org/10.2139/ssrn.2043336.
[17] D. Gill, V. Prowse, "Cognitive ability, character skills, and learning to play equilibrium: A level-k analysis," pp. 1-47, 2014, Journal of Political Economy, to be published. Available at SSRN: https://ssrn.com/abstract=2448144 or http://dx.doi.org/10.2139/ssrn.2448144.
[18] J. Chong, T. Ho, and C. Camerer, "A generalized cognitive hierarchy model of games," Journal of Games and Economic Behavior, vol. 99, pp. 257-274, 2016.
[19] T. H. Ho, X. Su, "A dynamic level-k model in sequential games," Management Science, vol. 59, no. 2, pp. 452-469, 2013.
[20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[21] M. Wiering and M. van Otterlo, Reinforcement Learning: State of the Art, vol. 12, Berlin, Heidelberg, Germany: Springer, 2012.
[22] M. Riedmiller, "Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method," in Proc. European Conference on Machine Learning, Porto, Portugal, 2005, pp. 317-328.
[23] M. Riedmiller, M. Montemerlo, and H. Dahlkamp, "Learning to drive a real car in 20 minutes," in Proc. IEEE Frontiers in the Convergence of Bioscience and Information Technologies, 2007, pp. 645-650.
[24] G. Thomas, C. Lutz, and M. Riedmiller, "Improved neural fitted Q iteration applied to a novel computer gaming and learning benchmark," in Proc. Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), IEEE Symposium, 2011, pp. 279-286.
[25] "Unmanned Aircraft Systems (UAS) Integration in the National Airspace System (NAS) Project," NASA Advisory Council Aeronautics Committee, UAS Subcommittee, June 27, 2012.
[26] D. Gill, V. Prowse, "Cognitive ability, character skills, and learning to play equilibrium: A level-k analysis," Journal of Political Economy, vol. 124, no. 6, pp. 1619-1676, 2016.
[27] "NextGen Concept of Operations 2.0," Washington, DC: FAA Joint Planning and Development Office, 2007.
[28] Federal Aviation Administration, "Pilot Handbook of Aeronautical Knowledge (FAA-H-8083-25B)," U.S. Department of Transportation, 2016.
[29] K. Dalamagkidis, K. P. Valavanis, D. L. A. Piegl, "On Integrating Unmanned Aircraft Systems into the National Airspace System," Springer Science and Business Media, vol. 54, 2011, DOI: 10.1007/978-94-007-2479-2.
[30] A. Mujumdar, R. Padhi, "Reactive collision avoidance using nonlinear geometric and differential geometric guidance," Journal of Guidance, Control, and Dynamics, vol. 34, no. 1, pp. 303-310, 2011, DOI: 10.2514/1.50923.
[31] G. Fasano, D. Accardo, and A. Moccia, "Multi-sensor based fully autonomous non-cooperative collision avoidance system for unmanned air vehicles," Journal of Aerospace Computing, Information, and Communication, vol. 5, pp. 338-360, 2008, DOI: 10.2514/1.35145.
[32] M. A. Costa-Gomes, V. P. Crawford, and N. Iriberri, "Comparing models of strategic thinking in Van Huyck, Battalio, and Beil's coordination games," Journal of the European Economic Association, vol. 7, no. 2-3, pp. 365-376, 2009, DOI: 10.1162/JEEA.2009.7.2-3.365.
[33] N. Li, D. W. Oyler, M. Zhang, Y. Yildiz, I. V. Kolmanovsky, A. Girard, "Game-theoretic modeling of driver and vehicle interactions for verification and validation of autonomous vehicle control systems," IEEE Transactions on Control Systems Technology, to be published.
[34] Y. Yildiz, G. Brat, A. Agogino, "Predicting pilot behavior in medium scale scenarios using game theory and reinforcement learning," AIAA Journal of Guidance, Control and Dynamics, vol. 37, no. 4, pp. 1335-1343, 2014, DOI: 10.2514/1.G000176.
[35] S. Backhaus, R. Bent, J. Bono, R. Lee, B. Tracey, D. Wolpert, D. Xie, Y. Yildiz, "Cyber-physical security: A game theory model of humans interacting over control systems,"