Act to Reason: A Dynamic Game Theoretical Model of Driving
Cevahir Köprülü
Department of Electrical and Electronics Engineering, Bilkent University, Ankara, Turkey 06800, [email protected]
Yıldıray Yıldız∗
Department of Mechanical Engineering, Bilkent University, Ankara, Turkey 06800, [email protected]

∗Corresponding author

ABSTRACT
The focus of this paper is to propose a driver model that incorporates human reasoning levels as actions during interactions with other drivers. Different from earlier work using game theoretical human reasoning levels, we propose a dynamic approach, where the actions are the levels themselves, instead of conventional driving actions such as accelerating or braking. This results in a dynamic behavior, where the agent adapts to its environment by exploiting different behavior models as available moves to choose from, depending on the requirements of the traffic situation. The bounded rationality assumption is preserved since the selectable strategies are designed by adhering to the fact that humans are cognitively limited in their understanding and decision making. Using a highway merging scenario, it is demonstrated that the proposed dynamic approach produces more realistic outcomes compared to the conventional method that employs fixed human reasoning levels.
Keywords
Driver Modeling · Game Theory · Reinforcement Learning
1 Introduction

Modeling human driver behavior in complicated traffic scenarios paves the way for designing autonomous driving algorithms that incorporate human behavior. Complementary to behavior integration, building realistic simulators to test and verify autonomous vehicles (AVs) is imperative. Currently, the pace of AV development is hindered by demanding requirements such as millions of miles of testing to ensure safe deployment (Kalra and Paddock, 2016). High-fidelity simulators can boost the development phase by constructing an environment that is comprised of human-like drivers (Wongpiromsarn et al., 2009; Lygeros et al., 1998; Kamali et al., 2017). In addition, human-driver models may induce the development of safer algorithms. Deploying control systems that behave similarly to humans enables a familiar experience for the passengers and a recognizable pattern for surrounding drivers to interact with. Therefore, designing models that display human-like driving is vital.

Various driver models employing a range of methods are proposed in the literature. Examples of inverse reinforcement learning based approaches, where the agents' preferences are learned from demonstrations, can be found in the studies conducted by Kuderer et al. (2015); Kuefler et al. (2017); Sadigh et al. (2018); Sun et al. (2018). In addition, an example of a direct reinforcement learning method can be found in the work of Zhu et al. (2018). Data-driven machine learning methods are commonly proposed to model driver behavior as well (Xie et al., 2019; Liu and Ozguner, 2007; Gadepally et al., 2014; Yang et al., 2019; Ding et al., 2016; Li et al., 2016, 2017; Klingelschmitt et al., 2016; Li et al., 2019; Okuda et al., 2016; Weng et al., 2017; Dong et al., 2017a,b; Zhu et al., 2019). Different from machine learning methods, studies that utilize game theoretical approaches investigate strategic decision making. Studies that employ these approaches are presented by Yu et al. (2018); Zimmermann et al. (2018); Yoo and Langari (2013); Ali et al. (2019); Schwarting et al. (2019) and Fisac et al. (2019). Apart from machine learning and game theory based concepts, there exist studies following a control theory direction (Hao et al., 2016; Da Lio et al., 2018).

In recent years, a concept based on "Semi Network Form Games" (Lee and Wolpert, 2012; Lee et al., 2013) emerged that combines reinforcement learning and game theory to model human operators in both the aerospace (Yildiz et al., 2012, 2014; Musavi et al., 2017) and automotive domains (Li et al., 2018b; Albaba and Yildiz, 2019; Albaba et al., 2019; Li et al., 2016; Oyler et al., 2016; Tian et al., 2020; Garzón and Spalanzani, 2019). One of the main components of this method is the level-k game theoretical approach (Stahl and Wilson, 1995; Costa-Gomes et al., 2009; Camerer, 2011), which assigns different levels of reasoning to intelligent agents. The lowest level, level-0, is considered to be non-strategic since it acts based on predetermined rules without considering other agents' possible moves. A level-1 agent, on the other hand, provides a best response assuming that the others are level-0 agents. Similarly, a level-k agent, where k is an integer, acts to maximize its rewards based on its belief that the environment agents have level-(k−1) reasoning. In all of these earlier studies, it is assumed that agents' reasoning levels remain the same throughout their interactions with each other.
Although this assumption may be valid for the initial stages of interaction, it falls short of modeling adaptive behavior. For example, drivers can and do adapt to the driving styles of other drivers, which is not possible to model with the conventional fixed level-k approach. In this paper, we fill this gap in the literature and introduce a "dynamic level-k" approach where the human drivers may have varying levels of reasoning. We achieve this goal by transforming the action space from direct driving actions, such as "accelerate", "brake", or "change lane", to the reasoning levels themselves. This is made possible by using a two-step reinforcement learning approach: In the first step, conventional level-k models, such as level-1, level-2 and so on, are trained. In the second step, the models are trained again using reinforcement learning, but this time they use the reasoning levels, which are obtained from the previous step, as their possible "actions".

There are other studies that also incorporate a dynamic level-k reasoning in agent training. Li et al. (2018a) utilized receding-horizon optimal control to determine the agent's actions, where the opponent's reasoning level is deduced from a probability distribution, which is updated during the interaction. Tian et al. (2018) developed a controller which directly predicts the opponent's level based on its belief function, and acts accordingly. Both of these studies model the interactions of two agents at a time, which makes them nontrivial to extend for modeling larger traffic scenarios. Our study distinguishes itself by designing agents that dynamically select a reasoning level given only their partial observation of the environment, instead of employing a belief function. This allows our method to naturally model crowded traffic scenarios. Ho and Su (2013) also proposed a dynamic level-k method, where, similar to the ones discussed above, the opponents' levels are predicted using belief functions, and therefore it may not be suitable for modeling crowded multi-player games. Ho et al. (in press) extended their previous work to n-player games by proposing an iterative adaptation mechanism, where the ego agent assumes that all of its opponents have the same reasoning level. This may be restricting for heterogeneous traffic models.

Specifically, the contributions of this study are the following:

1. We introduce a dynamic level-k model, which incorporates a real-time selection of driver behavior based on the traffic scenario. It is noted that individual level-k models have a reasonable match with real traffic data, as previous studies show (Albaba and Yildiz, 2019). In this regard, we preserve bounded rationality by selecting agents' actions only from a level-k strategy set.

2. The proposed method allows a single model for driver behavior on both the main road and the ramp in the highway merging setting. In general, earlier studies only consider one or the other scenario due to the complexity of the problem at hand.

3. In comparison to other methods, large traffic scenarios with multiple vehicles can naturally be modeled, which is better suited to the complexity of real-life conditions.

This paper is organized as follows. In Section 2, level-k game theory and Deep Q-learning are briefly described. In Section 3, the proposed dynamic level-k model is explained. In Section 4, the construction of the highway merging environment based on the NGSIM I-80 dataset (U.S. Department of Transportation/Federal Highway Administration, 2006) is demonstrated along with the observation and action spaces, vehicle model and the reward function. In Section 5, the training configuration, implementation details, level-0 policy, and the results of training and simulation are provided. Finally, a conclusion is given in Section 6.

2 Preliminaries

In this section, we provide the main components of the proposed method. Namely, level-k reasoning, Deep Q-learning (DQN) and the synergistic employment of level-k reasoning and DQN are explained. For brevity, we provide only a brief summary, the details of which can be found in the work of Albaba and Yildiz (2020).

2.1 Level-k Game Theory
Level-k game theory is a non-equilibrium hierarchical game theoretic concept that models the strategic decision making process of humans (Camerer et al., 2004; Stahl and Wilson, 1995). In a game where N players are interacting, a level-k strategy maximizes the utility of a player given that the other players are making decisions based on the level-(k−1) strategy. Therefore, in a game with players $p_i$, $i \in \{0, 1, 2, \cdots, N-1\}$, we can define the level-k policy as

$$\pi^k = \arg\max_{\pi} u_i\left(\pi \mid \pi_{p_0}, \pi_{p_1}, \cdots, \pi_{p_{N-2}}\right), \qquad (1)$$

where $\pi_{p_j} = \pi^{k-1}$ for all $j \in \{0, 1, \cdots, N-2\}$, i.e., all remaining players follow the level-(k−1) policy.

In practice, a level-0 player is determined as a non-strategic agent whose actions are selected based on a set of rules that may or may not be stochastic. Then, a level-1 player that provides the best responses to the level-0 decision makers is formed. Similarly, a level-2 agent, who provides the best responses to a level-1 player, is created. Iteratively, this process continues until a final level-K, $K \in \mathbb{Z}^+$, strategy is obtained.
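To make the iterative construction concrete, the following Python sketch outlines how a ladder of level-k policies could be built. Here, `train_best_response` and `level0_policy` are hypothetical stand-ins for the DQN training procedure and the rule-based level-0 driver described later; the paper does not prescribe this exact interface.

```python
# A sketch of the iterative level-k construction described above. Both
# `train_best_response` and `level0_policy` are hypothetical stand-ins for
# the DQN training procedure and the rule-based level-0 driver defined later.

def build_level_k_policies(K, train_best_response, level0_policy):
    """Return [pi_0, pi_1, ..., pi_K], each a best response to the level below."""
    policies = [level0_policy]
    for k in range(1, K + 1):
        # While pi_k is trained, every environment driver follows pi_(k-1).
        policies.append(train_best_response(opponent_policy=policies[k - 1]))
    return policies
```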
2.2 Deep Q-Learning and Level-k Reasoning

In early implementations where reinforcement learning and game theory were merged (Li et al., 2018b; Albaba and Yildiz, 2019; Albaba et al., 2019; Li et al., 2016; Oyler et al., 2016; Tian et al., 2020), a tabular RL method was used. Due to the requirement of an enlarged observation space to obtain higher-fidelity driver models, a Deep Q-Learning approach was recently introduced (Albaba and Yildiz, 2020). Deep Q-learning (DQN) (Mnih et al., 2013, 2015) utilizes a neural network of fully-connected or convolutional layers and maps an observed state $s_t$ at time $t$ from the state space $S$ to an action value function $Q(s_t, a_t)$, where $a_t \in A$ is the action taken at time $t$. The DQN algorithm achieves this goal using Experience Replay and Boltzmann Exploration (Ravichandiran, 2018).

In the context of obtaining high fidelity driver models, DQN is used to obtain the level-k driver behavior by training the ego vehicle's policy in an environment where all other drivers are assigned a level-(k−1) policy. The details of the algorithm merging DQN and level-k reasoning are explained by Albaba and Yildiz (2020).
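As a small illustration of the Boltzmann exploration used throughout this work, the following sketch (an assumed form; the paper lists no code) samples an action index in proportion to the exponentiated Q-values at a given temperature:

```python
import numpy as np

def boltzmann_sample(q_values, temperature):
    """Sample an index from the Boltzmann (softmax) distribution over Q-values."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()            # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```

High temperatures make the choice nearly uniform (exploration), while low temperatures concentrate the probability on the highest-valued action (exploitation).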
3 Dynamic Level-k Approach

In this section, the dynamic level-k approach is explained in detail.

3.1 Dynamic Level-k Model

The level-k strategy is built on the assumption that the ego player is interacting with agents who have level-(k−1) policies. Such an assumption brings about a problem when the agent is in an environment consisting of agents with policies that are different from level-(k−1). In that case, the agent may make incorrect assumptions about others' intentions and act in a way that can be detrimental to itself and others. As a countermeasure, we train the agent to select a policy among the available level-k policies at each time-step in order to maximize its utility in mixed traffic.

We start by defining a set of (K+1)-many level-k policies as

$$\Pi = \{\pi^0, \pi^1, \pi^2, \cdots, \pi^K\}. \qquad (2)$$

This set determines the available level-k policies, where each policy, with the exception of the pre-determined level-0 policy, is trained in an environment where the rest of the drivers are taking actions based on level-(k−1) policies. To train the ego agent with a dynamic level-k policy, we place it in an environment with N cars, including the agent. From the ego's perspective, the environment is comprised of the (N−1)-fold Cartesian product of policies, represented as

$$\Lambda_N = \Pi \times \Pi \times \cdots \times \Pi. \qquad (3)$$

An element λ of the set $\Lambda_N$ is an ordered (N−1)-tuple, expressed as

$$\lambda = (\pi_{p_0}, \pi_{p_1}, \cdots, \pi_{p_{N-2}}) \in \Lambda_N, \qquad (4)$$

where $\pi_{p_i}$ is the policy of player $i$, $i \in \{0, 1, \cdots, N-2\}$. In this manner, the dynamic level-k policy can be defined as

$$\pi^{dyn} = \arg\max_{\pi} u(\pi \mid \Lambda_N), \qquad (5)$$

where u represents utility. The dynamic level-k policy defined in (5) is the policy that maximizes the utility of an agent placed in an environment consisting of the (N−1)-fold Cartesian product of K different level-k policies. In the reinforcement learning setting, given an observation, the dynamic policy selects a level-k policy among the available level-k policies, and then a direct driving action (braking, accelerating, etc.) is sampled using this selected policy. The working principle of the dynamic level-k model is illustrated in Fig. 1.

Figure 1: Dynamic Level-k Model

The algorithm for the training process is given in Algorithm 1.
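As an illustration of this two-step sampling, the sketch below reuses `boltzmann_sample` from the previous section; `dyn_qnet` and `level_policies` are hypothetical callables standing in for the trained networks.

```python
def dynamic_action(obs, dyn_qnet, level_policies, temperature):
    """Two-step sampling: pick a reasoning level, then a driving action.

    dyn_qnet(obs)           -> Q-values over the selectable levels 1..K
    level_policies[k](obs)  -> Q-values over driving actions for the level-k policy
    """
    level_q = dyn_qnet(obs)
    k = boltzmann_sample(level_q, temperature) + 1   # level-0 is not selectable
    action_q = level_policies[k](obs)
    action = boltzmann_sample(action_q, temperature)
    return k, action
```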
Remark:
In the proposed approach, the selection of levels as actions is not based on the history of the environment vehicles, which would be prohibitive for modeling crowded scenarios. Instead, the behavior selection is conducted using a policy, which provides a direct stochastic map from immediate observations to actions.
Algorithm 1 Training the Dynamic Level-k Decision Maker with DQN

    Initialize the environment with N vehicles
    Load the level-k policies π^k ∈ Π, trained earlier, into the networks QN_k for k = 1, 2, ..., K
    Initialize the primary network QN_dyn via the Xavier uniform initializer (Glorot and Bengio, 2010)
    Initialize the target network QN_dyn,target with the primary network weights
    Initialize the replay memory M as a queue with a fixed maximum size (the replay memory size in Table 2)
    Initialize the Boltzmann temperature T
    for e = 1 to E do
        Generate a random (N−1)-tuple λ ∈ Λ_N for episode e
        for t = 1 to L do
            Pass s_t through QN_dyn to obtain Q_{π_dyn}(s_t, π^m), ∀π^m ∈ Π − {π^0}
            Sample π^{m_t} among the available policies via Boltzmann sampling
            Pass s_t through QN_{m_t} to obtain Q_{π^{m_t}}(s_t, a_j), ∀a_j ∈ A
            Sample an action a_t via Boltzmann sampling
            Transition into the next state s_{t+1} by executing action a_t
            Observe the reward r_t
            Store the current experience (s_t, π^{m_t}, r_t, s_{t+1}) in M
            if size(M) ≥ N_initial then
                Randomly sample a P-sized batch of experiences (s_i, π^{k_i}, r_i, s_{i+1})
                for i = 1 to P do
                    if s_{i+1} is terminal then
                        y_i = r_i
                    else
                        y_i = r_i + γ max_{π'} QN_dyn,target(s_{i+1}, π'; θ_dyn,target)
                Calculate the loss (y_i − QN_dyn(s_i, π^{k_i}; θ_dyn))²
                Perform gradient descent using the loss on the primary weights θ_dyn
                Every U steps, set θ_dyn,target := θ_dyn
            if s_{t+1} is terminal then
                End episode e
        Update the Boltzmann temperature: T := max(c · T, T_min), 0 < c < 1, where T_min is a temperature floor
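For concreteness, the target computation and gradient step in Algorithm 1 could be implemented as follows. This is a sketch under assumed array shapes and Keras-style models; none of this code is from the paper.

```python
import numpy as np
import tensorflow as tf

def dqn_update(qn_dyn, qn_dyn_target, optimizer, batch, gamma=0.95):
    """One gradient step on the dynamic-level DQN, mirroring Algorithm 1.

    batch: numpy arrays (s, level, r, s_next, terminal) of shapes
    (P, 9), (P,), (P,), (P, 9), (P,); `level` indexes the chosen reasoning level.
    """
    s, level, r, s_next, terminal = batch
    # Bootstrapped target: y_i = r_i + gamma * max_level' Q_target(s_{i+1}, level')
    q_next = qn_dyn_target(s_next).numpy().max(axis=1)
    y = (r + gamma * q_next * (1.0 - terminal)).astype(np.float32)
    with tf.GradientTape() as tape:
        q = qn_dyn(s)                                # (P, K) Q-values over levels
        q_taken = tf.gather(q, level, batch_dims=1)  # Q of the level actually chosen
        loss = tf.reduce_mean(tf.square(y - q_taken))
    grads = tape.gradient(loss, qn_dyn.trainable_variables)
    optimizer.apply_gradients(zip(grads, qn_dyn.trainable_variables))
    return float(loss)
```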
4 Game Environment

We implement the proposed dynamic level-k approach in a highway merging scenario. To demonstrate the capabilities of the new method, we choose to create a realistic scenario that represents a real traffic environment. To achieve this, we employ the NGSIM I-80 dataset (U.S. Department of Transportation/Federal Highway Administration, 2006) to construct the road geometry (see Fig. 2) and to impose constraints on the driver models.

4.1 Highway Merging Scenario

The highway merging scenario primarily consists of 2 lanes, one of which is part of the main road while the other is the ramp. In comparison to highway driving, highway merging is a more challenging problem, whose participants are grouped into two types: the first type attempts to merge from the ramp to the main road, and the second type commutes on the main road. Its demanding nature is caused by mandatory lane changes and the frequently varying velocities of the participants. As the traffic becomes more and more crowded, predicting where and how an interacting vehicle merges grows increasingly troublesome. In this context, risky interactions occur more often than in a highway setting. Such a distinction is mainly brought about by the imminent trade-off between being safe and leaving the merging region as fast as possible. Essentially, how a driver handles this trade-off determines the preferred style.
4.1.1 NGSIM I-80 Dataset

The NGSIM I-80 dataset consists of vehicle trajectories from the northbound traffic in the San Francisco Bay area in Emeryville, CA, on April 13, 2005. The area of interest is around 500 meters in length and includes six highway lanes and an on-ramp. Fig. 2 shows a schematic of the area.

Figure 2: NGSIM I-80 Study Area Schematic (U.S. Department of Transportation/Federal Highway Administration, 2006): The red region is the area of interest for this work.

The dataset comprises vehicle trajectories collected from observations made for 45 minutes in total. There are three distinct 15-minute periods: 4:00 p.m. to 4:15 p.m., 5:00 p.m. to 5:15 p.m., and 5:15 p.m. to 5:30 p.m. Therefore, the dataset includes both congestion accumulation and fully congested traffic. The sampling frequency at which the vehicle trajectories are recorded is 10 Hz. A reconstruction of the data is needed due to excessive noise in the velocity and acceleration data. Montanino and Punzo (2015) carried out such a study for the first 15-minute section, and we use their velocity and acceleration data for our scenario creation. Although the reconstruction provides the lane information of each vehicle, positions along the axis perpendicular to the lanes are discrete, not continuous. Therefore, when a car changes its lane, the reconstructed data does not provide all the details of the transition, but simply indicates the lane change.

The headway distance of a vehicle indicates the distance between that vehicle and the vehicle in front of it on the same lane. The distributions of headway on the main road, on the ramp, and on both lanes are calculated using the data and shown in Fig. 3. The mean and the standard deviation of the distribution on both lanes (Fig. 3c) are 12.72 m and 10.11 m, respectively. Similarly, velocity and acceleration distributions are calculated and presented in Figs. 4 and 5, respectively. As expected, the vehicles traveling on the ramp commute slower than the ones on the main road. The mean and the standard deviation of the velocity distribution considering both lanes are calculated as 9.78 m/s and 4.84 m/s, respectively. Regarding the acceleration distribution, the vehicles on the ramp seem to be decelerating more frequently than the ones on the main road. Since the trend is not dominant, it is not clear whether this can be generalized to other merging scenarios. The mean and the standard deviation of the acceleration distribution on both lanes are calculated as −0. m/s² and 0. m/s², respectively. Vehicle population distributions are given in Fig. 6. We observe that the main road is more congested compared to the ramp.

Details regarding the utilization of the headway, velocity, acceleration and population distribution data for scenario generation are explained in the following sections.

Figure 3: NGSIM I-80 Headway Distance Distributions. (a) Main Road, (b) Ramp, (c) Both Lanes.
Figure 4: NGSIM I-80 Velocity Distributions. (a) Main Road, (b) Ramp, (c) Both Lanes.

Figure 5: NGSIM I-80 Acceleration Distributions. (a) Main Road, (b) Ramp, (c) Both Lanes.

Figure 6: NGSIM I-80 Population Distributions. (a) Main Road, (b) Ramp, (c) Both Lanes.

Figure 7: Highway Merging Environment
4.1.2 Environment

The environment consists of a 305 m long main road, a 185 m long ramp measured from the "Start of On-ramp for Ego" point, and a 145 m long merging region (see Fig. 7). Following the end of the merging region, the main road has an additional 45 m. The lanes in the environment have a width of 3.7 m. The vehicles are 5 m in length and 2 m in width.

The initial position of the ego vehicle is the x = 0 m point if placed on the main road, and the x = 75 m point, which is the "Start of On-ramp for Ego" location, if the ego is placed on the ramp. During the initialization of the environment, vehicles are placed in a random fashion with a minimum initial distance of 2 car lengths, which is equal to 10 m. Initial velocities of the vehicles, other than the ones placed on the merging region of the ramp, are sampled from a uniform distribution on an interval centered at the nominal velocity v_nom = 9.8 m/s. The initial velocities of the vehicles placed on the merging region of the ramp are determined using the formula

$$v(t) = v_{nom}\left(0.5 + 0.5\,\frac{x_m^e - x(t)}{x_m^e - x_m^s}\right) + z, \qquad (6)$$

where $x_m^s$ and $x_m^e$ are the start and end positions of the merging region (see Fig. 7), x(t) is the initial position of the vehicle, and z is a realization of the random variable Z, which is sampled from a uniform distribution. It is noted that the closest initial position to the end of the merging region is taken to be 23 m away from the end, to allow for a safe slow down. This distance is approximately one standard deviation longer than the mean headway distance calculated from the I-80 dataset (see Fig. 3c).

The maximum number of cars that can be on the ramp is set to 7, which prevents unreasonably congested situations on the ramp. This is supported by the data, as more than 95% of the cases include 7 or fewer vehicles on the ramp (see Fig. 6b). As the traffic flows, whenever a car leaves the environment, another one is added with a probability of 70%, and its placement is selected to be the main lane with a probability of 70%. The initial velocity of each newly added vehicle is sampled using the methods explained above. In order to add a new car safely, the minimum initial distance condition is satisfied.

4.1.3 Observation Space

In a merging scenario, a driver observes the car in front, the cars on the adjacent lane, and the distance to the end of the merging region, which helps to determine when to merge or when to yield the right of way to a merging vehicle. The regions used to form the observation space variables are described in Fig. 8, and the variables are given in Table 1.
Figure 8: Observation Space. (a) Ego vehicle on the ramp, (b) Ego vehicle on the main road. RS, FS and FC refer to rear side, front side, and front center, respectively.

The variables FC, FS and RS in Table 1 refer to front-center, front-side and rear-side, respectively (see Fig. 8). The subscripts v and d indicate whether each variable denotes relative velocity or relative distance.

Table 1: Observation Space Variables
    Observation Space Variable   Normalized Range
    FC_v                         [−1, 1]
    FC_d                         [0, 1]
    FS_v                         [−1, 1]
    FS_d                         [0, 1]
    RS_v                         [−1, 1]
    RS_d                         [0, 1]
    d_e                          [−1, 1]
    v_x                          [0, 1]
    l                            {0, 1}

The normalization constants used for the relative velocity and relative distance variables are v_max ≈ 29 m/s and d_far = 23 m, respectively. The ego driver can also observe the distance, d_e, to the end of the merging region; d_e is normalized by the total length of the merging region, which is 45 m. Furthermore, the ego can observe its own velocity along the x-axis, v_x, and its own lane, l. v_x is normalized with v_max, and l takes the values of 0 (ramp) and 1 (main lane). It is noted that, except for the lane number, all the observation space variables are continuous.
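A small sketch of how such an observation vector could be assembled is given below; the helper and constant names are hypothetical, and the value of v_max is an assumption (the text gives the I-80 speed limit, approximately 29 m/s).

```python
import numpy as np

V_MAX = 29.0    # m/s, assumed value for the I-80 speed limit used in the text
D_FAR = 23.0    # m, distance normalization constant from Table 1
L_MERGE = 45.0  # m, normalization length for d_e as stated in the text

def normalize_observation(fc_v, fc_d, fs_v, fs_d, rs_v, rs_d, d_e, v_x, lane):
    """Return the 9-dimensional normalized observation vector of Table 1."""
    return np.array([
        np.clip(fc_v / V_MAX, -1.0, 1.0), np.clip(fc_d / D_FAR, 0.0, 1.0),
        np.clip(fs_v / V_MAX, -1.0, 1.0), np.clip(fs_d / D_FAR, 0.0, 1.0),
        np.clip(rs_v / V_MAX, -1.0, 1.0), np.clip(rs_d / D_FAR, 0.0, 1.0),
        np.clip(d_e / L_MERGE, -1.0, 1.0),
        np.clip(v_x / V_MAX, 0.0, 1.0),
        float(lane),   # 0 = ramp, 1 = main lane
    ])
```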
4.1.4 Actions

In the highway merging problem, an agent on the main road can change only its velocity, whereas an agent on the ramp can merge as well as change its velocity. The specific actions are given below.

• Maintain: The acceleration is sampled from a zero-mean Laplace distribution, restricted to the interval [−0.25 m/s², 0.25 m/s²].
• Accelerate: The acceleration is sampled from an exponential distribution with starting location 0.25 m/s² and largest acceleration 2 m/s².
• Decelerate: The acceleration is sampled from an inverse exponential distribution with starting location −0.25 m/s² and smallest acceleration −2 m/s².
• Hard-Accelerate: The acceleration is sampled from an exponential distribution with starting location 2 m/s² and largest acceleration 3 m/s².
• Hard-Decelerate: The acceleration is sampled from an inverse exponential distribution with starting location −2 m/s² and smallest acceleration −4.5 m/s².
• Merge: Merging is assumed to be completed in one time-step.
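One possible implementation of this sampling is sketched below. The distribution rate parameters are not legible in the source, so the `scale` values are assumptions; the starting locations and limits follow the list above.

```python
import numpy as np

rng = np.random.default_rng()

def sample_acceleration(action):
    """Sample an acceleration (m/s^2) for the given action; scales are assumed."""
    if action == "maintain":
        a = rng.laplace(loc=0.0, scale=0.1)           # scale parameter assumed
        return float(np.clip(a, -0.25, 0.25))         # truncated by clipping
    if action == "accelerate":
        return float(min(0.25 + rng.exponential(scale=0.5), 2.0))
    if action == "decelerate":
        return float(max(-0.25 - rng.exponential(scale=0.5), -2.0))
    if action == "hard_accelerate":
        return float(min(2.0 + rng.exponential(scale=0.5), 3.0))
    if action == "hard_decelerate":
        return float(max(-2.0 - rng.exponential(scale=0.5), -4.5))
    return 0.0  # "merge" changes lane within one time-step; acceleration unchanged
```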
4.1.5 Vehicle Model

Let x(t), v_x(t) and a(t) be the position, velocity and acceleration of a vehicle along the x-axis (see Fig. 7), respectively. Then, the equations of motion of the vehicle are given as

$$x(t + \Delta t) = x(t) + v_x(t)\,\Delta t + \tfrac{1}{2}\,a(t)\,\Delta t^2, \qquad (7)$$
$$v_x(t + \Delta t) = v_x(t) + a(t)\,\Delta t, \qquad (8)$$

where Δt is the time-step, which is set to 0.5 seconds.

4.1.6 Reward Function

The reward function, R, is a mathematical representation of driver preferences. In this work, we use

$$R = c\,w_1 + h\,w_2 + m\,w_3 + e\,w_4 + nm\,w_5 + s\,w_6, \qquad (9)$$

where the $w_i$'s, $i \in \{1, 2, 3, 4, 5, 6\}$, are the weights used to emphasize the relative importance of each term. The terms used to form the reward function in (9) are explained below.

• c: Collision parameter. Takes the value of −1 when a collision occurs, and zero otherwise. There are three different possible collision cases, which can be listed as
  1. Ego fails to merge, and crashes into the barrier at the end of the merging region;
  2. Ego merges into an environment car on the main lane;
  3. Ego crashes into a car in front.

• h: Headway distance parameter. Defining d_close = 3 m, d_nom = 13 m and d_far = 23 m, this parameter is calculated as

$$h = \begin{cases} -1, & \text{if } FC_d \leq d_{close} \\ \frac{FC_d - d_{nom}}{d_{nom} - d_{close}}, & \text{if } d_{close} \leq FC_d \leq d_{nom} \\ \frac{FC_d - d_{nom}}{d_{far} - d_{nom}}, & \text{if } d_{nom} \leq FC_d \leq d_{far} \\ 1, & \text{otherwise} \end{cases} \qquad (10)$$

The parameters d_close, d_nom and d_far are defined using the mean and standard deviation information obtained from the headway distribution presented in Fig. 3c. The variable FC_d is introduced in Table 1.

• m: Velocity parameter. Calculated as

$$m = \begin{cases} \frac{v - v_{nom}}{v_{nom}}, & \text{if } v \leq v_{nom} \\ \frac{v_{max} - v}{v_{max} - v_{nom}}, & \text{otherwise} \end{cases} \qquad (11)$$

where v is the velocity of the ego vehicle, v_nom = 9.8 m/s corresponds to the mean of the velocity distribution in Fig. 4c, and v_max ≈ 29 m/s is the speed limit on I-80. The parameter m encourages the ego vehicle to increase its speed until the maximum velocity is reached.

• e: Effort parameter. Equals zero when the velocity is less than a set fraction of v_nom, and is otherwise determined as

$$e = \begin{cases} -0.5, & \text{if } act = \text{Accelerate or Decelerate} \\ -1, & \text{if } act = \text{Hard-Accelerate or Hard-Decelerate} \\ 0, & \text{otherwise} \end{cases} \qquad (12)$$

where act denotes the action taken by the agent. The main purpose of the effort parameter is to discourage extreme actions such as Hard-Accelerate and Hard-Decelerate as much as possible. For low velocities, it is set to zero to encourage faster speeds so that unnecessary congestion is prevented.

• nm: "Not Merging" parameter. Equals −1 when the agent is on the ramp. This parameter discourages the ego agent from driving on the ramp unnecessarily, without merging.

• s: Stopping parameter. Utilized to discourage the agent from making unnecessary stops. Using the observation space variables defined in Table 1, and the definitions of d_close and d_far before (10), this parameter is defined as in the following.

If l = 1 (main road), then

$$s = \begin{cases} -1, & \text{if } act \neq \text{Hard-Accelerate and } FC_d \geq d_{far} \text{ and } d_e \geq d_{far} \\ 0, & \text{otherwise} \end{cases} \qquad (13)$$

If l = 0 (ramp), then

$$s = \begin{cases} -1, & \text{if } act \neq \text{Merge and } FS_d \geq d_{close} \text{ and } RS_d \leq 0.5\,d_{far} \\ -0.5, & \text{else if } d_e \leq d_{far} \\ 0, & \text{otherwise} \end{cases} \qquad (14)$$
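As a concrete illustration, the headway term (10) and the velocity term (11) can be computed as follows. This is a sketch: the constants follow the definitions above (d_close = 3 m, d_nom = 13 m, d_far = 23 m, v_nom = 9.8 m/s, v_max ≈ 29 m/s) and should be treated as assumptions where the source is ambiguous.

```python
D_CLOSE, D_NOM, D_FAR = 3.0, 13.0, 23.0   # m, from the headway statistics
V_NOM, V_MAX = 9.8, 29.0                  # m/s, nominal speed and assumed speed limit

def headway_term(fc_d):
    """Headway reward h of Eq. (10)."""
    if fc_d <= D_CLOSE:
        return -1.0
    if fc_d <= D_NOM:                           # rises from -1 at d_close to 0 at d_nom
        return (fc_d - D_NOM) / (D_NOM - D_CLOSE)
    if fc_d <= D_FAR:                           # rises from 0 at d_nom to 1 at d_far
        return (fc_d - D_NOM) / (D_FAR - D_NOM)
    return 1.0

def velocity_term(v):
    """Velocity reward m of Eq. (11)."""
    if v <= V_NOM:                              # negative below the nominal speed
        return (v - V_NOM) / V_NOM
    return (V_MAX - v) / (V_MAX - V_NOM)        # decays toward 0 at the speed limit
```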
5 Training and Results

The DQN training using dynamic level-k game theory is carried out as explained in Algorithm 1. The traffic environment is constructed as described in the previous section. An episode starts once the ego and the environment vehicles are initialized, and ends when the ego car leaves the environment or experiences an accident.

5.1 Training Configuration

Training consists of 3 different phases: the Initial Phase, the Sinusoidal Car Population Phase, and the Random Traffic Population Phase.

1. Initial Phase: For the first 200 episodes, the car population, including the ego vehicle, is set to 4. The reasoning behind the first phase is to enable the agent to learn the basic dynamics of the environment under simple circumstances.
2. Sinusoidal Car Population Phase: Starting after the Initial Phase, the vehicle population is modified every 100 episodes in a sinusoidal fashion, where the population takes values from a fixed set of seven population levels (see Fig. 9). The reasoning behind this variation is to prevent the learning agent from getting stuck in congested traffic, and to provide an environment that changes in a controlled manner.

3. Random Traffic Population Phase: The car population of the last 1000 episodes is altered randomly using numbers from the same set, in order to expose the ego agent to a wide variety of scenarios without letting the learning process overfit to any scenario variation structure.

Figure 9: Population variation for Phases 1 and 2.

5.2 Implementation Details

The DQN for the level-k agents is implemented via the TensorFlow Keras API (Abadi et al., 2015), with an architecture that is based on a 4-layered neural network, where each layer is fully-connected. The input layer is a vector of 9 units, consisting of the variables given in Table 1. The first and second fully-connected layers are 256 units in size. The size of the third layer decreases to 128 units, and it is connected to the output layer, which is comprised of 5 units that correspond to the Q-values of the 5 possible actions.

The DQN for the dynamic level-k agent has the same structure, except for the output layer, which is comprised of 3 units that correspond to the level-1, level-2 and level-3 reasoning actions.

The computation specifications of the computer used for training and simulation are:
• Processor: Intel® Core™ i7-4700HQ 2.40 GHz
• RAM: DDR3L 1600 MHz SDRAM, size: 16 GB.

The trainings for each level and for the dynamic level-k last for varying durations, as episodes are not time-restricted. Level-1 training lasts around 1.5 hours, whereas the level-2, level-3 and dynamic agents are trained in approximately 3 hours each. The longer training times for the higher levels and the dynamic level can be explained by the increased sophistication of the policies defined by these levels. Further details regarding the training of a DQN agent can be found in Table 2.
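A minimal sketch of the described architecture in the TensorFlow Keras API follows. The layer sizes are from the text and the optimizer and learning rate from Table 2; the ReLU activations are an assumption, as the paper does not state them.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_q_network(n_outputs):
    """Fully-connected DQN: 9 inputs -> 256 -> 256 -> 128 -> n_outputs Q-values.
    n_outputs = 5 for a level-k driving policy, 3 for the dynamic level selector."""
    model = tf.keras.Sequential([
        layers.Dense(256, activation="relu", input_shape=(9,),
                     kernel_initializer="glorot_uniform"),  # Xavier uniform init
        layers.Dense(256, activation="relu",
                     kernel_initializer="glorot_uniform"),
        layers.Dense(128, activation="relu",
                     kernel_initializer="glorot_uniform"),
        layers.Dense(n_outputs, activation="linear"),       # one Q-value per action
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0013),
                  loss="mse")
    return model

level_k_qnet = build_q_network(5)   # driving actions
dynamic_qnet = build_q_network(3)   # reasoning levels 1-3
```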
Table 2: DQN Training Configuration

    Hyper-parameter                   Value
    Replay Memory Size                50000
    Replay Start Size                 5000
    Target Network Update Frequency   1000
    Initial Boltzmann Temperature     50
    Mini-batch Size                   32
    Discount Factor                   0.95
    Learning Rate                     0.0013
    Optimizer                         Adam

5.3 Level-0 Policy

The level-0 policy is the anchoring piece of level-k game theory based behavior modeling. Several approaches can be considered for designing a level-0 agent, considering the fact that the decision-maker is non-strategic and that the decision making logic can be stochastic. The policy can be a uniform random selection mechanism (Shapiro et al., 2014), can take a single action regardless of the observation (Musavi et al., 2017, 2016; Yildiz et al., 2014), or can be a conditional logic based on experience (Backhaus et al., 2013). Our level-0 policy approach is stochastic for the merging action, and rule-based for the rest of the actions. This leads to a simple but sufficiently rich set of actions during an episode, which provides enough excitation for the training of higher levels.

Specifically, we define two level-0 policies, one for the case when the ego vehicle is on the ramp and the other for the main road. To describe these policies, we first introduce a function that provides a metric for an agent's proximity to the end of the merging region as

$$f(d_e(t)) = \left(\frac{l_m - d_e}{l_m}\right)^2, \qquad (15)$$

where d_e is defined in Table 1 and l_m is the length of the merging region. A quadratic function is used to reflect the rapidly increasing danger of approaching the end point. Furthermore, we define two threshold parameters:

• TTC_hd = 4 s (seconds), which is the time-to-collision (TTC) threshold for the Hard-Decelerate action, and
• TTC_d = 7 s, which is the TTC threshold for the Decelerate action.

We also define a small positive constant ε to prevent division-by-zero cases. Algorithms 2 and 3 describe the level-0 policies for driving on the ramp and on the main road, respectively. The parameters used in these algorithms are defined in Table 1 and before (10).
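For illustration, the proximity metric of Eq. (15), with the exponent reconstructed from the "quadratic function" remark, can be written as:

```python
def merge_urgency(d_e, l_m=45.0):
    """f(d_e) from Eq. (15): grows quadratically as the remaining distance to the
    end of the merging region shrinks. The default l_m = 45 m follows the d_e
    normalization length given in the text and is an assumption here."""
    return ((l_m - d_e) / l_m) ** 2
```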
Algorithm 2 Level-0 Policy: On-ramp

    act := Maintain
    if the vehicle is in the merging region then
        Let Z be a random variable such that Z ∼ U(0, 1), with realization z
        if z < f(d_e) or d_e < d_far then
            if FS_v > 0 then FS_v := ε else FS_v := max(−FS_v, ε)
            if (FS_d / FS_v ≥ TTC_hd and FS_d > d_close) or FS_d > d_far then
                if RS_v < 0 then RS_v := −ε else RS_v := max(RS_v, ε)
                if (−RS_d / RS_v ≥ TTC_hd and −RS_d > d_close) or −RS_d > 0.5 d_far then
                    act := Merge
    if act = Maintain then
        FC_v := min(FC_v, −ε)
        v̄ := v_nom · d_e / l_m
        if (−FC_d / FC_v ≤ TTC_hd and FC_d > d_close) or FC_d ≤ d_close then
            act := Hard-Decelerate
        else if (−FC_d / FC_v ≤ TTC_d and FC_d > d_close) or (d_e is below a threshold and v > v̄) then
            act := Decelerate
        else if d_e ≥ d_far and FC_d > d_close and FC_v > ε then
            act := Accelerate
Algorithm 3 Level-0 Policy: Main Road

    act := Maintain
    FC_v := min(FC_v, −ε)
    v̄ := v_nom · d_e / l_m
    if (−FC_d / FC_v ≤ TTC_hd and FC_d > d_close) or FC_d ≤ d_close then
        act := Hard-Decelerate
    else if −FC_d / FC_v ≤ TTC_d and FC_d > d_close then
        act := Decelerate
    else if FC_d > d_close and FC_v > ε and (v < v_nom or d_e is below a threshold) then
        act := Accelerate
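As an illustration, Algorithm 3 can be transcribed compactly as follows. This is a sketch: the clamped relative-velocity copy and the omission of the garbled distance-threshold condition for Accelerate are assumptions.

```python
def level0_main_road(fc_v, fc_d, d_e, v,
                     v_nom=9.8, d_close=3.0, ttc_hd=4.0, ttc_d=7.0, eps=1e-3):
    """Rule-based level-0 action on the main road (see Algorithm 3).
    fc_v, fc_d: relative velocity and distance to the front-center vehicle."""
    closing_v = min(fc_v, -eps)      # clamped copy keeps the TTC ratio finite
    ttc = -fc_d / closing_v          # time-to-collision with the front car
    if (ttc <= ttc_hd and fc_d > d_close) or fc_d <= d_close:
        return "hard_decelerate"
    if ttc <= ttc_d and fc_d > d_close:
        return "decelerate"
    if fc_d > d_close and fc_v > eps and v < v_nom:
        return "accelerate"
    return "maintain"
```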
5.4 Training Results

The evolution of the average training reward and of the reward per training episode are given in Fig. 10a and Fig. 10b, respectively. The highway merging problem brings about a commonly observed obstacle during training: bottlenecks. At the moments when the ego agent drives slowly, or stops and waits for the traffic to move, it may get stuck in a local extremum and stay on the ramp, or remain stopped even though the environment traffic moves again. During training, the agent is punished for such decisions, and the average reward decreases suddenly. This can be observed at certain episodes of the level-1 training, for the level-2 training when the sinusoidal population variation reaches more crowded intervals, and at a similar point for level-3. Thanks to the stochastic nature of the exploration policy, these bottlenecks are overcome and the rewards eventually increase. It is noted that at the end of the training, the last 5 models, 100 episodes apart, are simulated against themselves (level-k vs. level-k), and the one with the least accidents is selected.

Figure 10: Reward Plots. (a) Average Training Reward, (b) Reward per Training Episode.
5.5 Simulation Results

In this section, we present simulation results that demonstrate the advantages of the new modeling framework, where the agent actions are determined through a two-step sampling process: First, an appropriate reasoning level is sampled using the level selection policy trained with RL. Then, a driving action is sampled using the selected reasoning level's policy, which is also trained using RL. This is radically different from the fixed level-k method, where the actions are always sampled from a fixed policy. To investigate how the two methods compare, we tested four different ego drivers, namely the level-1, level-2 and level-3 drivers, and the dynamic level-k driver. We put these drivers in different traffic settings and measured how many accidents these policies experience. The accident types are given in Section 4.1.6. The number of vehicles in the traffic is selected from the same population set used during training. There are 150 episodes per population selection, which amounts to 1050 episodes in total.

Table 3 shows how the four different types of ego agents perform in terms of collision rates. "Traffic Level" indicates the environment that the ego is tested in. "Level-k" traffic consists of only level-k agents. "Dynamic" traffic denotes an environment consisting of dynamic agents only. "Mixed" traffic, on the other hand, consists of level-0, level-1, level-2 and level-3 agents, the numbers of which are determined by uniform sampling. The table shows the drawback of a fixed level-k policy: all level-k policies perform significantly worse when they are placed in a level-k traffic, compared to a level-(k−1) traffic. This is expected, since a level-k policy is trained to provide a best response to a level-(k−1) environment only. However, the table shows that the dynamic agent experiences the minimum collision rate in an environment consisting of dynamic agents. Furthermore, the dynamic agent outperforms every level-k strategy in the mixed traffic environment. These results support the hypothesis that agents that incorporate human adaptability and select policies based on human reasoning perform better than static agents.
Table 3: Collision rates of simulations for every ego agent type (columns: ego agent; rows: traffic level)

    Traffic Level   Level-1   Level-2   Level-3   Dynamic
    Level-0         1.5%      –         –         –
    Level-1         20.7%     2.3%      –         –
    Level-2         –         6.9%      2.8%      –
    Level-3         –         –         4.68%     –
    Dynamic         –         –         –         1.2%
    Mixed           37.1%     3.9%      6.1%      1.5%

Table 4 shows a detailed analysis of the mixed traffic results, where the number of each specific type of accident (see Section 4.1.6) that each ego agent undergoes is presented. The numbers are normalized by 100/390, where 390 is the total number of level-1 accidents in mixed traffic over 1050 episodes. The main takeaway from this table is that the dynamic agent outperforms every other level at each collision type. This explains why the dynamic agent's accident rates are lowest in a dynamic traffic environment (see Table 3): a traffic scenario consisting of adaptable, thus more human-like, agents provides a more realistic case compared to a scenario with static agents.

Table 4: Mixed traffic collision results for different levels of ego agents
    Ego        Collision Type 1   Type 2   Type 3
    Level-1    0                  10.256   89.744
    Level-2    0.005              0.054    0.046
    Level-3    0.018              0.077    0.069
    Dynamic    0                  0.008    0.033
6 Conclusion

In this work, dynamic level-k reasoning combined with DQN is proposed for driver behavior modeling. Previous studies show that level-k game theory enables the design of realistic agents, especially in highway traffic settings, when integrated with DQN. Our study builds on these early results and presents a dynamic approach which constructs agents that can decide their level of reasoning in real-time. This solves the main problem of fixed level-k methods, where the agents do not have the ability to adapt to changing traffic conditions. Our simulations show that allowing an agent to select a policy among hierarchical strategies provides a better approach while interacting with human-like agents. This paves the way for a more realistic driver modeling framework, as collision rates are significantly reduced. The proposed method is also computationally feasible, and therefore naturally capable of modeling crowded scenarios, thanks to reasoning directly from observations instead of relying on belief functions. Furthermore, the bounded rationality assumption is preserved, as the available strategies are designed by adhering to the fact that humans are cognitively limited in their deductions.

References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org, last accessed on 25/10/2020.

Albaba, B. M. and Yildiz, Y. (2019). Modeling cyber-physical human systems via an interplay between reinforcement learning and game theory. Annual Reviews in Control, 48:1–21.

Albaba, B. M. and Yildiz, Y. (2020). Driver modeling through deep reinforcement learning and behavioral game theory.

Albaba, M., Yildiz, Y., Li, N., Kolmanovsky, I., and Girard, A. (2019). Stochastic driver modeling and validation with traffic data. pages 4198–4203.

Ali, Y., Zheng, Z., Haque, M. M., and Wang, M. (2019). A game theory-based approach for modelling mandatory lane-changing behaviour in a connected environment. Transportation Research Part C: Emerging Technologies, 106:220–242.

Backhaus, S., Bent, R., Bono, J., Lee, R., Tracey, B., Wolpert, D., Xie, D., and Yildiz, Y. (2013). Cyber-physical security: A game theory model of humans interacting over control systems. IEEE Transactions on Smart Grid, 4(4):2320–2327.

Camerer, C. F. (2011). Behavioral Game Theory: Experiments in Strategic Interaction. Princeton University Press.

Camerer, C. F., Ho, T.-H., and Chong, J.-K. (2004). A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3):861–898.

Costa-Gomes, M. A., Crawford, V. P., and Iriberri, N. (2009). Comparing models of strategic thinking in Van Huyck, Battalio, and Beil's coordination games. Journal of the European Economic Association, 7(2-3):365–376.

Da Lio, M., Mazzalai, A., Gurney, K., and Saroldi, A. (2018). Biologically guided driver modeling: the stop behavior of human car drivers. IEEE Transactions on Intelligent Transportation Systems, 19(8):2454–2469.

Ding, C., Wu, X., Yu, G., and Wang, Y. (2016). A gradient boosting logit model to investigate driver's stop-or-run behavior at signalized intersections using high-resolution traffic data. Transportation Research Part C: Emerging Technologies, 72:225–238.

Dong, C., Dolan, J. M., and Litkouhi, B. (2017a). Intention estimation for ramp merging control in autonomous driving. pages 1584–1589.

Dong, C., Dolan, J. M., and Litkouhi, B. (2017b). Interactive ramp merging planning in autonomous driving: Multi-merging leading PGM (MML-PGM). pages 1–6.

Fisac, J. F., Bronstein, E., Stefansson, E., Sadigh, D., Sastry, S. S., and Dragan, A. D. (2019). Hierarchical game-theoretic planning for autonomous vehicles. pages 9590–9596.

Gadepally, V., Krishnamurthy, A., and Ozguner, U. (2014). A framework for estimating driver decisions near intersections. IEEE Transactions on Intelligent Transportation Systems, 15(2):637–646.

Garzón, M. and Spalanzani, A. (2019). Game theoretic decision making for autonomous vehicles' merge manoeuvre in high traffic scenarios. pages 3448–3453.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 9:249–256.

Hao, H., Ma, W., and Xu, H. (2016). A fuzzy logic-based multi-agent car-following model. Transportation Research Part C: Emerging Technologies, 69:477–496.

Ho, T.-H., Park, S.-E., and Su, X. (in press). A Bayesian level-k model in n-person games. Management Science.

Ho, T.-H. and Su, X. (2013). A dynamic level-k model in sequential games. Management Science, 59(2):452–469.

Kalra, N. and Paddock, S. M. (2016). Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice, 94:182–193.

Kamali, M., Dennis, L. A., McAree, O., Fisher, M., and Veres, S. M. (2017). Formal verification of autonomous vehicle platooning. Science of Computer Programming, 148:88–106. Special issue on Automated Verification of Critical Systems (AVoCS 2015).

Klingelschmitt, S., Damerow, F., Willert, V., and Eggert, J. (2016). Probabilistic situation assessment framework for multiple, interacting traffic participants in generic traffic scenes. pages 1141–1148.

Kuderer, M., Gulati, S., and Burgard, W. (2015). Learning driving styles for autonomous vehicles from demonstration. pages 2641–2646.

Kuefler, A., Morton, J., Wheeler, T., and Kochenderfer, M. (2017). Imitating driver behavior with generative adversarial networks. pages 204–211.

Lee, R. and Wolpert, D. (2012). Game theoretic modeling of pilot behavior during mid-air encounters. Decision Making with Imperfect Decision Makers, 28:75–111.

Lee, R., Wolpert, D. H., Bono, J., Backhaus, S., Bent, R., and Tracey, B. (2013). Counter-factual reinforcement learning: How to model decision-makers that anticipate the future. Decision Making and Imperfection, Studies in Computational Intelligence, pages 101–128.

Li, G., Li, S. E., Cheng, B., and Green, P. (2017). Estimation of driving style in naturalistic highway traffic using maneuver transition probabilities. Transportation Research Part C: Emerging Technologies, 74:113–125.

Li, J., Ma, H., Zhan, W., and Tomizuka, M. (2019). Coordination and trajectory prediction for vehicle interactions via Bayesian generative modeling.

Li, K., Wang, X., Xu, Y., and Wang, J. (2016). Lane changing intention recognition based on speech recognition models. Transportation Research Part C: Emerging Technologies, 69:497–514.

Li, N., Kolmanovsky, I., Girard, A., and Yildiz, Y. (2018a). Game theoretic modeling of vehicle interactions at unsignalized intersections and application to autonomous vehicle control. pages 3215–3220.

Li, N., Oyler, D., Zhang, M., Yildiz, Y., Girard, A., and Kolmanovsky, I. (2016). Hierarchical reasoning game theory based approach for evaluation and testing of autonomous vehicle control systems. pages 727–733.

Li, N., Oyler, D. W., Zhang, M., Yildiz, Y., Kolmanovsky, I., and Girard, A. R. (2018b). Game theoretic modeling of driver and vehicle interactions for verification and validation of autonomous vehicle control systems. IEEE Transactions on Control Systems Technology, 26(5):1782–1797.

Liu, Y. and Ozguner, U. (2007). Human driver model and driver decision making for intersection driving. pages 642–647.

Lygeros, J., Godbole, D., and Sastry, S. (1998). Verified hybrid controllers for automated vehicles. IEEE Transactions on Automatic Control, 43(4):522–539.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Montanino, M. and Punzo, V. (2015). Trajectory data reconstruction and simulation-based validation against macroscopic traffic patterns. Transportation Research Part B: Methodological, 80:82–106.

Musavi, N., Onural, D., Gunes, K., and Yildiz, Y. (2017). Unmanned aircraft systems airspace integration: A game theoretical framework for concept evaluations. Journal of Guidance, Control, and Dynamics, 40(1):96–109.

Musavi, N., Tekelioğlu, K. B., Yildiz, Y., Gunes, K., and Onural, D. (2016). A game theoretical modeling and simulation framework for the integration of unmanned aircraft systems into the national airspace. AIAA Infotech @ Aerospace.

Okuda, H., Harada, K., Suzuki, T., Saigo, S., and Inoue, S. (2016). Modeling and analysis of acceptability for merging vehicle at highway junction. pages 1004–1009.

Oyler, D. W., Yildiz, Y., Girard, A. R., Li, N. I., and Kolmanovsky, I. V. (2016). A game theoretical model of traffic with multiple interacting drivers for use in autonomous vehicle development. pages 1705–1710.

Ravichandiran, S. (2018). Hands-on Reinforcement Learning with Python: Master Reinforcement and Deep Reinforcement Learning Using OpenAI Gym and TensorFlow. Packt Publishing Ltd.

Sadigh, D., Landolfi, N., Sastry, S. S., Seshia, S. A., and Dragan, A. D. (2018). Planning for cars that coordinate with people: leveraging effects on human actions for planning and active information gathering over human internal state. Autonomous Robots, 42(7):1405–1426.

Schwarting, W., Pierson, A., Alonso-Mora, J., Karaman, S., and Rus, D. (2019). Social behavior for autonomous vehicles. Proceedings of the National Academy of Sciences, 116(50):24972–24978.

Shapiro, D., Shi, X., and Zillante, A. (2014). Level-k reasoning in a generalized beauty contest. Games and Economic Behavior, 86:308–329.

Stahl, D. O. and Wilson, P. W. (1995). On players' models of other players: Theory and experimental evidence. Games and Economic Behavior, 10(1):218–254.

Sun, L., Zhan, W., and Tomizuka, M. (2018). Probabilistic prediction of interactive driving behavior via hierarchical inverse reinforcement learning. pages 2111–2117.

Tian, R., Li, N., Kolmanovsky, I., Yildiz, Y., and Girard, A. R. (2020). Game-theoretic modeling of traffic in unsignalized intersection network for autonomous vehicle control verification and validation. IEEE Transactions on Intelligent Transportation Systems, pages 1–16.

Tian, R., Li, S., Li, N., Kolmanovsky, I., Girard, A., and Yildiz, Y. (2018). Adaptive game-theoretic decision making for autonomous vehicle control at roundabouts. pages 321–326.

U.S. Department of Transportation/Federal Highway Administration (2006). Interstate 80 freeway dataset, FHWA-HRT-06-137. Last accessed on 13/01/2021.

Weng, J., Li, G., and Yu, Y. (2017). Time-dependent drivers' merging behavior model in work zone merging areas. Transportation Research Part C: Emerging Technologies, 80:409–422.

Wongpiromsarn, T., Mitra, S., Murray, R. M., and Lamperski, A. (2009). Periodically controlled hybrid systems. Hybrid Systems: Computation and Control, Lecture Notes in Computer Science, pages 396–410.

Xie, D.-F., Fang, Z.-Z., Jia, B., and He, Z. (2019). A data-driven lane-changing model based on deep learning. Transportation Research Part C: Emerging Technologies, 106:41–60.

Yang, S., Wang, W., Jiang, Y., Wu, J., Zhang, S., and Deng, W. (2019). What contributes to driving behavior prediction at unsignalized intersections? Transportation Research Part C: Emerging Technologies, 108:100–114.

Yildiz, Y., Agogino, A., and Brat, G. (2014). Predicting pilot behavior in medium-scale scenarios using game theory and reinforcement learning. Journal of Guidance, Control, and Dynamics, 37(4):1335–1343.

Yildiz, Y., Lee, R., and Brat, G. (2012). Using game theoretic models to predict pilot behavior in NextGen merging and landing scenario. AIAA Modeling and Simulation Technologies Conference.

Yoo, J. H. and Langari, R. (2013). A Stackelberg game theoretic driver model for merging. Dynamic Systems and Control Conference.

Yu, H., Tseng, H. E., and Langari, R. (2018). A human-like game theory-based controller for automatic lane changing. Transportation Research Part C: Emerging Technologies, 88:140–158.

Zhu, B., Jiang, Y., Zhao, J., He, R., Bian, N., and Deng, W. (2019). Typical-driving-style-oriented personalized adaptive cruise control design based on human driving data. Transportation Research Part C: Emerging Technologies, 100:274–288.

Zhu, M., Wang, X., and Wang, Y. (2018). Human-like autonomous car-following model with deep reinforcement learning. Transportation Research Part C: Emerging Technologies, 97:348–368.

Zimmermann, M., Schopf, D., Lütteken, N., Liu, Z., Storost, K., Baumann, M., Happee, R., and Bengler, K. J. (2018). Carrot and stick: A game-theoretic approach to motivate cooperative driving through social interaction.