Towards a Systematic Computational Framework for Modeling Multi-Agent Decision-Making at Micro Level for Smart Vehicles in a Smart World
Qi Dai∗, Xunnong Xu, Wen Guo, Suzhou Huang, Dimitar Filev
{qdai2, xxu63, wguo16, shuang10, dfilev}@ford.com
Ford Motor Company, Dearborn, MI 48126
September 28, 2020

Abstract
We propose a multi-agent based computational framework for modeling decision-making and strategic interaction at micro level for smart vehicles in a smart world. The concepts of Markov game and best response dynamics are heavily leveraged. Our aim is to make the framework conceptually sound and computationally practical for a range of realistic applications, including micro path planning for autonomous vehicles. To this end, we first convert the would-be stochastic game problem into a closely related deterministic one by introducing a risk premium in the utility function for each individual agent. We show how the sub-game perfect Nash equilibrium of the simplified deterministic game can be solved by an algorithm based on best response dynamics. In order to better model human driving behaviors with bounded rationality, we seek to further simplify the solution concept by replacing the Nash equilibrium condition with a heuristic and adaptive optimization with finite look-ahead anticipation. In addition, the algorithm corresponding to the new solution concept drastically improves the computational efficiency. To demonstrate how our approach can be applied to realistic traffic settings, we conduct a simulation experiment: deriving merging and yielding behaviors on a double-lane highway with an unexpected barrier. Despite the different assumptions involved in the two solution concepts, the derived numerical solutions show that the endogenized driving behaviors are very similar. We also briefly comment on how the proposed framework can be further extended in a number of directions in our forthcoming work, such as behavioral calibration using real traffic video data, computational mechanism design for traffic policy optimization, and so on.
Keywords: Multi-Agent Decision-Making, Coordinated Path Planning, Human Driving Behavior Modeling

∗ Corresponding Author. Telephone: 313-236-7640.

arXiv [cs.MA]

Introduction
Aiming to improve customer experience and traffic flow and to reduce emissions and congestion, vast attention and resources have been devoted to making motor vehicles autonomous and electrified. In this endeavor the main focus in the literature so far has mostly followed a vehicle-centric strategy: to make the vehicle as smart as possible, while treating the environment as a given background. As AI and communication/network technology rapidly progress, vehicles are becoming increasingly connected as well as smarter. Concurrently, infrastructure is also becoming more and more intelligent. The time appears ripe to start shifting attention from the vehicle-centric paradigm to a system-centric one. We anticipate that a smart transportation system that works well in reality has to explicitly address the issue of interaction and coordination among smart vehicles and the smart infrastructure, and likely other players, such as pedestrians, bicycles, scooters, and even animals. It is from this system perspective that we propose our modeling framework for smart vehicles in a smart world. Recently, we have also witnessed increasing efforts to take advantage of the forthcoming connectivity and autonomy, such as reducing congestion and emissions. For example, an optimal control framework [1] has been put forward to address issues related to intelligent vehicles in smart city settings.

In our approach, how to endogenize the decision-making at micro level with interaction and coordination among all players is central to the modeling. We further hope that the modeling framework can be used to quantitatively characterize human behaviors in various realistic traffic settings, and to help refine existing traffic rules or derive new ones from the perspective of the transportation authority. It turns out that such a modeling framework has already been around for quite some time: game theory.
For those who are versed in microeconomics, it is apparent that the framework we are going to propose is essentially identical to what has been developed in microeconomics. This should not be a big surprise, since both share the same theme of modeling smart agents interacting with one another within a regulated regime. The high level concepts, methodologies, and even some of the detailed mathematics are inevitably common.

There are generally three classes of questions the modeling framework will be able to address conceptually. The first one has to do with endogenizing the decision-making using optimization given preferences and game rules. This is what we call the forward problem, or utility maximization in microeconomics. Here we expect that driving behaviors, such as coasting, distance keeping, yielding, merging, lane-changing, passing, entering and exiting, can be solved in various situations given physical constraints and initial conditions. One of the ultimate goals we have in mind is to provide driving instructions to autonomous vehicles at sub-second frequency in real time, taking into account multi-vehicle interactions in mixed settings where autonomous vehicles and human driven vehicles co-exist. The second class has to do with how to derive and quantify human driving behavioral models given the observed action sequences in real traffic situations. This is what we call the inverse problem, whose methodologies are similar to those of micro-econometrics and which is also known as imitation learning in the machine learning community. Here we expect to extract driving preferences quantitatively, such as the preferred speed on a specific road, the headway distance when following other vehicles, and how these depend on weather, time of the day, road condition and lighting, traffic density, and so on, by using traffic video data with explicit heterogeneity.
The third class has to do with how the traffic rules can be optimized to benefit society as a whole, and how new traffic rules can be invented to accommodate emerging phenomena, once we understand the decision-making and preferences of all players involved. This is what we call the mechanism design problem, in which we hope the modeling framework will also be helpful for dealing with potential trolley-type problems and the associated moral ambiguities that can arise unavoidably.

The other focus in our modeling framework is to make the numerical algorithms practical for a large number of realistic path planning applications, and ultimately even for real-time applications. This requires us to identify an appropriate mathematical framework that is both conceptually clean and computationally efficient. Deterministic Markov games appear to satisfy these criteria. However, realistic traffic settings are generally stochastic, and hence we introduce a risk premium to ease the burden of tracking state variables precisely in the game, in the spirit of reward shaping [2]. We explore two types of solution concepts. The first one is the standard sub-game perfect Nash equilibrium in a dynamic setting. The advantage of Nash equilibrium is its conceptual and mathematical cleanness. However, this cleanness comes at the expense of practical relevance, due to very strong rationality assumptions and heavy computational requirements. This leads us to consider a heuristics based solution concept, i.e., adaptive optimization with finite look-ahead anticipation. We obtain this new solution concept by systematically relaxing some of the less realistic assumptions mandated by Nash equilibrium. Along the way these conceptual simplifications also open the avenue for drastically reducing the computational burden.

There is a large body of literature on applications of game theory to traffic behavioral modeling. A thorough review of them all is practically impossible.
We concentrate only on those that are the closest to our approach: using game theory to address micro level decision-making with explicit interaction and coordination among multiple agents with some generality. In a series of papers, Yoo and Langari [3], Kim and Langari [4], Yu et al. [5], Zhang et al. [6], and Zhang et al. [7] used the Stackelberg solution concept to model highway driving. In this approach strategic decisions are made sequentially, with vehicles divided into a leader and follower(s). Payoffs are typically given in matrix form with discrete moves. Some additional technical simplifications are made to make the approach tractable. Consequently, the modeling results are mostly meant to be qualitative, and hence cannot be used for micro path planning. In a different strand of inquiry, Oyler et al. [8], Tian et al. [9], and Li et al. [10] utilized a hierarchical cognitive decision-making strategy, dubbed level-$k$ game theory, that essentially goes one step further than the Stackelberg solution concept by iterating the leader-follower sequence $k$ times. They tackled settings of highway driving and un-signaled intersections. Interesting results were obtained, though some additional assumptions or technical simplifications were made, again for tractability. It is not obvious whether their approach can be applied to micro path planning, or how it would prevent occasional accidents from happening.

Lane changing behaviors were modeled using game theory by a number of authors. Talebpour et al. [11] also used a leader-follower matrix game with discrete moves to model discretionary lane-changes. They even calibrated their model using NGSIM data [12]. They further claimed that their lane-change model provided a greater level of realism than a basic gap-acceptance model. Instead of a sequential game, Meng et al. [13] used a discrete simultaneous-move matrix game, with a receding horizon adaptively applied, to model highway lane changes.
One important feature of their approach is its use of reachability analysis to guarantee safety. Because the game only involves discrete moves, it also cannot be used for micro path planning. In a more systematic approach, the adaptive receding horizon technique is also deployed in a differential cooperative game for predictive lane-changing and car-following control by Wang et al. [14]. A decomposition technique is used to divide the original problem into smaller pieces, and an iterative algorithm based on Pontryagin's maximum principle is then applied to solve each sub-problem. However, the lane change in this paper appears to be instantaneous, which in turn limits its applicability. Motivated by computational efficiency, Huang et al. [15] introduced a radically different approach: a mean field to represent the average behavior of all agents in a local vicinity. When the number of agents $N$ becomes large, the pairwise interaction is supposed to be negligible, and the problem is vastly simplified. The original multi-agent game is then approximated by the mean field solution to within the realm of $\epsilon$-equilibrium. While theoretically interesting, the condition of $N$ being large is hard to meet in a physical world, especially when lateral movements as well as longitudinal movements are modeled simultaneously.

In a totally different modeling strategy, Cunningham et al. [16], Galceran et al. [17], and Mehta et al. [18] utilized the so-called multi-policy decision-making framework. In this framework, while a decision-maker is contemplating its own move, a "library" of policies (either discrete or continuously parameterized) is evaluated, via simulation, for anticipating potential future moves that can be dangerous. They also proposed a gradient-based algorithm to improve the policy parameterization online. This type of approach appears to be very promising.
It will be interesting to see how this type of approach can be successfully applied to micro level modeling of traffic behaviors. One challenge seems to be having good enough handcrafted starting policies from which the algorithm can figure out ways to refine them. Lastly, there is also a relatively large literature on using centralized decision-making to improve traffic systems, taking advantage of the emerging smart environment. The techniques utilized can be either multi-agent based simulation, such as Dresner and Stone [19], or based on cooperative game theory, such as Elhenawy et al. [20] and Ding et al. [21]. While these papers studied situations at automated intersections, Rios-Torres and Malikopoulos [22] investigated vehicle merging situations at highway on-ramps. For the latter case, even a game engine was utilized in a recent work of Wang et al. [23].

Of course, there is plenty of research on micro path planning based on feasibility in the autonomous driving literature. In this strand of endeavor, surrounding agents are often treated as a passive background, and hence their strategic behavior and coordination are not explicitly endogenized. A recent review of this type of approach can be found in Katrakazas et al. [24].

Our work differs from this previous work in several important ways. We first introduce a risk premium into the utility function so that the burden of tracking state evolution precisely is substantially reduced. Consequently, we can treat the traffic behavioral modeling as deterministic Markov games. When we develop our game theoretical solution concept in Section 3 (betaNash), we actually provide an iterative algorithm, based on best response dynamics, that can be used to solve the sub-game perfect dynamic Markov Nash equilibrium. In our approach, no additional artificial assumptions are made beyond those required by standard non-cooperative game theory. The solution is sufficiently detailed and can be used for micro path planning.
In an attempt to relax the common strong assumptions associated with game theory so that the behavioral modeling can be directly applied to human drivers, and to substantially improve the computational efficiency, we introduce an adaptive optimization algorithm with finite look-ahead anticipation in Section 4 (adaptiveSeek). This new solution concept is systematically simplified from the game theory solution concept, guided by human driving heuristics. Even though some of the important modeling ingredients, such as receding time horizon, anticipation, and adaptivity, appeared in some of the earlier papers cited above, our starting point and detailed implementation are quite different. We further show that, despite the conceptual difference between the two solution concepts, one being game theory based and the other being almost unilateral optimization based, the numerical results of the solutions are reasonably close. Lastly, we also try to be very careful in treating the information in the solution concepts: what is shared as common knowledge and what is kept as private intentions.

Incidentally, one recent article [25] deserves special mention here. This is the work that comes closest to ours, as far as the computational Nash equilibrium algorithm is concerned. Yet their work is motivated very differently from ours: to embed social psychology into game theory so as to better predict driver behavior. In contrast, we are motivated to provide a realistic computational framework for micro path planning. The detailed formulations of the utility function and the associated interpretations are also quite different. Their emphasis is on the heterogeneity of Social Value Orientation (SVO), whereas ours is on introducing a risk premium in the pairwise collision part of the utility so that we can convert a would-be uncertain game model into a deterministic one, hence reducing the computational burden.
Nevertheless, their computing methodology is essentially identical to our betaNash, as both are based on the concept of iterative best response, though our algorithm adaptiveSeek is entirely novel. In addition, they further analyzed the NGSIM data [12] to quantify SVO at the individual driver level by combining the Nash equilibrium algorithm with Max Entropy Inverse Reinforcement Learning [26]. It is our belief that, despite the improved fitting, SVO as an individual's preference parameter should not be made to vary at a time scale of seconds. If a driver is an altruist, he/she should not suddenly become an egoist a second later. It is conceptually much preferable that the fast changing role be taken up by the action variables instead. On the other hand, we will describe our approach to the inverse problem, calibrating the utility function with observed action data using a state space model, in a separate paper [27]. The utility parameters in our case, while being heterogeneous individually, are all constant during the course of 250 seconds of driving in the Sugiyama experiment [28].

Finally, it is also worthwhile to compare our approach with approaches utilizing some form of deep reinforcement learning (RL), such as those in [29, 30, 31, 32, 33]. One of the common characteristics of RL approaches is that they are all policy oriented, in the sense that optimal policies are typically trained offline and then deployed/generalized in similar contexts. Most of the time, the offline training is done in a manner of exploration/exploitation, a strategy that is often very slow. The heavy computational burden implies that inverse problems and mechanism design problems become very hard to tackle under this framework. Furthermore, since it embodies the specific information of the training context, the trained optimal policy can have a hard time generalizing to new contexts.
In contrast, our approach is objective oriented, in the sense that the optimal (or equilibrium) policy is solved online for a given context with computationally efficient algorithms. Consequently, inverse problems, mechanism design problems, and generalizability all become easier to handle (see Section 4 for further comments).

The rest of this paper is organized as follows. In the next section we detail the modeling environment and state the relevant assumptions. Markov games are proposed as the mathematical framework. We then explore two solution concepts in the two subsequent sections. The first solution concept is based on sub-game perfect Nash equilibrium, as described in Section 3. We exploit the best response dynamics to numerically solve for the equilibrium, resulting in Algorithm betaNash. The second solution concept is based on heuristic adaptive optimization that is obtained by systematically relaxing some of the less realistic assumptions associated with Nash equilibrium, as described in Section 4. The simplified but more realistic approach is dubbed Algorithm adaptiveSeek. We further show how noise/uncertainty can be accommodated within the framework, and outline how it can be utilized to formulate the inverse problem using a state space model approach. To demonstrate how the framework can be applied to tackle concrete problems, we consider a simulation experiment in Section 5. In this experiment two vehicles on a double-lane highway with an unexpected barrier are studied in detail. Solutions derived by using betaNash and adaptiveSeek are compared and contrasted. In the final section we summarize our achievements and highlight forthcoming work.

In this section we first establish the physical assumptions that are reasonable for modeling smart vehicles in a smart world.
Following the approach of microeconomics, we also outline the high level conceptual modeling framework. Then, we detail the mathematical considerations to make our modeling framework concrete and practical.
First, the agents involved can be classified into the following types:
• The transportation authority is a special agent who has the power to decide on the rules of the game, such as the right-of-way, traffic infrastructure and signal systems. Its decision-making is governed by some proxies of social aggregate outcomes. In certain contexts, the transportation authority can also serve as the controller for centralized automated transportation systems.
• A human agent (excluding the transportation authority) is regarded as rational, with decision-making governed by maximizing an intrinsic utility function which can potentially be calibrated using actually observed behaviors. Examples are human driven vehicles, bicycles, and pedestrians.
• An algorithmic agent is also regarded as rational, with decision-making governed by maximizing a utility function which needs to be endowed according to certain rules. Examples are autonomous vehicles, traffic signals or infrastructure facilities.
• Other agents include unexpected debris and animals, which are possibly irrational and hence treated as random events.

Second, in the settings of smart vehicles in a smart world, the following simplifying assumptions appear to be appropriate:
• The transportation authority always has strategic precedence relative to all other agents in the game. This is equivalent to saying that the transportation authority is the first mover of the traffic behavior game, as we assume that the transportation authority sets the rules of the game beforehand.
• Vision, sensing, mapping, and awareness of the surroundings are adequately provided to all agents, with possible latency and communication interruption, either by the smart vehicles or by the edge-based smart infrastructure. However, we do not generally assume common knowledge of everything, such as an individual's driving preference, destination, and so on, as this may be private, unless an explicit communication mechanism is specified.
• Agents are sufficiently smart so that they, when required, can faithfully execute the decision recommendations provided by the system, within the scope of the physical constraints. This implies that we can concentrate on high level variables, such as locations, velocities, accelerations, and orientations, rather than worry about gas pedals, steering wheels, voltages, or actuators.

The high level conceptual modeling framework and the game playing setting are illustrated by the block diagram in Figure 1. The transportation authority, who decides on the rules of the game, resides in Box 1. All other game participants along with their preferences, which underlie their behaviors in the game, are represented by Box 2. The physical environment, such as roads, initial conditions, communication channels, physical capabilities and constraints, is represented in Box 3. The endogenized behavioral outcomes, derived by solving the game with a prescribed algorithm of a chosen solution concept, are denoted in Box 4.

Figure 1: An illustration of the modeling framework, its components and their relationships. Here SPRL stands for self-play reinforcement learning.

The conceptual framework illustrated above is very broad. It will enable us to address the following types of questions systematically, depending on the specific perspective of what is treated as input and what is treated as output:
• The forward problem (given boxes 1, 2, and 3, solve for box 4): solving driving behaviors given the modeling environment and physical constraints, the preference of each agent, and the rules of the game. Examples include path planning behaviors such as merging, yielding, lane changing, following, passing, and exiting; interactions between different types of agents such as vehicle-pedestrian and vehicle-infrastructure; communication and coordination among agents, and so on.
• The inverse problem (given boxes 1, 3, and 4, solve for box 2): calibrating decision model parameters given the observed actions from all agents involved. This is critical if the computational framework is going to be quantitatively relevant to reality. One focus here is to understand individual and distributional priorities in various preferences, such as how human drivers rank the relative importance of obeying the speed limit, vehicle smoothness, lane deviation, and accident avoidance, and how these priorities manifest themselves in terms of the appropriate functional forms.
• The mechanism design problem (same as the forward problem, but with a feedback loop from box 4 to solve for box 1): refining and optimizing the game rules from the transportation authority's point of view. Topics in this area can include how to tweak and improve existing traffic rules and how to invent new ones to accommodate new vehicle capabilities and the ever changing infrastructure; how to assign objective functions for algorithmic agents, such as autonomous vehicles; and how to handle potential moral ambiguities which can arise.

In this paper we will solely concentrate on the forward problem, and leave the inverse and mechanism design problems to our forthcoming papers [27] and [34].
There are multiple agents $i \in I = \{1, 2, \ldots, N\}$ playing the game specified by the transportation authority. Time is discrete: $t \in \{0, 1, 2, \ldots, T-1\}$ in units of $\Delta t$. The precise choice of $\Delta t$ is based on a compromise among several factors: 1) making the dynamics sufficiently smooth and safe; 2) making the planning time horizon $T \times \Delta t$ not too small so that there can be some room for anticipation; and 3) $\Delta t$ being reasonably close to human reflex time during driving. Our typical choices are $\Delta t = 0.\ldots \sim \ldots$, and $T \times \Delta t$ is typically $\sim 10$ seconds.

We follow the standard non-cooperative game theory approach: decision-making is modeled as utility maximizing. The state of agent $i$ at period $t$, $s_{i,t}$, embodies sufficient information for decision-making. In our setting, $s_{i,t}$ typically contains the surrounding agents' positions and velocities in the simplest cases. It may include additional information in more complex settings, such as the orientation of a vehicle, and indicators for certain events, e.g., whether a vehicle has made a stop at a red light. The action of agent $i$ at time $t$, $a_{i,t}$, is typically a vector including its own acceleration at time $t$, steering angle, turn signal status, and so on. Again, the details highly depend on the setting. We find it convenient to separate private information, such as the destination, from the observable state variable. When necessary we will detail the specific communication mechanism for sharing private information.

In a multi-agent setting, it is not so obvious a priori who should have strategic precedence, except for the transportation authority. Therefore, the theoretical framework of a discrete time Markov sequential game with simultaneous moves [35] appears to be the most relevant choice as our starting modeling framework. Generally, the game should have been treated as a stochastic one, which would require solving coupled Markov Decision Problems (MDPs). Unfortunately, it is well known that even an MDP with a single decision maker is already very hard to make computationally practical in realistic settings, let alone the coupled one. Therefore, we need to find a way to dramatically reduce the computational burden if our approach is to have any chance of being useful at all, apart from purely theoretical exercises. To this end, we introduce an explicit risk premium into the utility function, borrowing a concept from economics.
The basic idea is to let the agents involved feel the risk, in the sense of decision-making, before any actual physical damage occurs. For example, even when two cars have not yet touched each other, their drivers can feel the danger of collision when the distance between them is sufficiently small. It is this danger that will be explicitly reflected in the utility function. (Alternatively, we can also view the risk premium introduced here as reward shaping [2], a well known technique in the reinforcement learning literature, for the sparse penalty associated with collisions.) Once the risk premium is introduced, the precise form of the uncertainties becomes less critical. This in turn allows us to treat the game deterministically, thereby radically simplifying the game. (We comment on how to reconcile our deterministic formulation with the standard MDP approach in subsection 4.3, where we explicitly introduce uncertainties in the state evolution and allow random deviations from the optimal action.) We intuitively believe that our modeling strategy is consistent with human driving heuristics.

Since the game we consider is Markov with a continuous and deterministic state evolution, the solution space is limited to Markov pure strategies. The Markov condition implies that the per period utility function for agent $i$ at time period $t$, with own action $a_{i,t}$ and actions of others $a_{-i,t}$, can be written as $u_{i,t}(a_{i,t} \mid s_{i,t}, a_{-i,t})$, i.e., it only depends on the current state and actions. The dependence of agent $i$'s utility on its own action $a_{i,t}$ is obvious. The dependence on others' actions $a_{-i,t}$ is needed to handle interactions among agents.
Furthermore, we assume that it can be decomposed as a sum over a number of components with appropriate weights

$$u_{i,t}(a_{i,t} \mid s_{i,t}, a_{-i,t}) = \sum_{k} w_{i,k}\, \phi^{(k)}_{i,t}(a_{i,t} \mid s_{i,t}, a_{-i,t}), \tag{1}$$

where the weights are agent dependent in order to take care of potential heterogeneity, but not time dependent within the planning time horizon. Different problems could imply different sets of components to be explicitly included in the sum. Some of the components can represent rewards, such as moving towards the goal, whereas others can represent penalties, such as collision, lane violation, and roughness. We anticipate that the weights will form a hierarchy in terms of their magnitudes, ranging from small to very large, reflecting the preferred priority of the agent. Furthermore, the Markov condition implies that the state evolution can be written as

$$s_{i,t+1} = f_{i,t}(a_{i,t} \mid s_{i,t}, a_{-i,t}), \tag{2}$$

where $f_{i,t}()$ is a deterministic function specified by the appropriate Newtonian kinematics or dynamics. This function can depend on agent $i$ and potentially on the explicit time period $t$.

It is conceptually important to realize that we treat the utility function purely as a decision modeling device. The utility function can deviate from the physical benefits and penalties, especially the part related to the risk premium. Consequently, the precise form of the risk premium will depend on the solution concept used to derive the behavioral outcome. For example, the risk premium deployed to solve the exact Nash equilibrium need not be the same as that used in a method for deriving a heuristic solution to approximate the Nash equilibrium. This is because, at least for deterministic pure strategy games, everything is perfectly anticipated in a Nash equilibrium, whereas the time horizon of the anticipation in a heuristic solution is typically shorter than the full planning time horizon. Therefore, the utility function should be commensurate with its corresponding solution concept.
Of course, the physical part of the utility should always be kept identical. Ultimately, the utility function should always be calibrated using actual traffic data together with the appropriate solution concept.

In the next section, we first explore the solution concept based on sub-game perfect Nash equilibrium. In Section 4, we will show how to relax the deterministic assumption by explicitly introducing noise factors when we replace the Nash equilibrium based solution concept with a heuristic based concept of adaptive optimization with finite look-ahead anticipation.

In this and the following sections we mostly keep things general and formal. Implementation details will be illustrated in Section 5 when we apply the proposed algorithms to a concrete example.
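To make the weighted decomposition of Eq. (1) and the deterministic state evolution of Eq. (2) more concrete, the following is a minimal Python sketch for two vehicles moving along a single lane. Everything specific in it is an assumption made for illustration and is not taken from the paper: the step length `DT`, the preferred speed `v_pref`, the comfort distance `d_safe`, the three component functions, and the quadratic shape of the risk premium are all hypothetical choices.

```python
import numpy as np

DT = 0.5  # assumed time step in seconds (hypothetical value)


def step(state, accel, dt=DT):
    """Deterministic state evolution in the spirit of Eq. (2).

    state = [position, velocity]; plain 1-D point-mass kinematics.
    """
    pos, vel = state
    return np.array([pos + vel * dt + 0.5 * accel * dt**2, vel + accel * dt])


def progress(own_next, other_next, accel, v_pref=30.0):
    """Reward-like component: stay near a preferred speed (hypothetical)."""
    return -(own_next[1] - v_pref) ** 2


def smoothness(own_next, other_next, accel):
    """Penalty on harsh acceleration (hypothetical)."""
    return -accel ** 2


def risk_premium(own_next, other_next, accel, d_safe=10.0):
    """Penalty felt before any contact: grows once the gap drops below d_safe."""
    gap = abs(own_next[0] - other_next[0])
    return -max(0.0, d_safe - gap) ** 2


COMPONENTS = (progress, smoothness, risk_premium)


def utility(weights, own, other, a_own, a_other):
    """Per-period utility as a weighted sum of components, as in Eq. (1).

    It depends only on the current states and the current actions of both
    agents, which is the Markov condition stated in the text.
    """
    own_next, other_next = step(own, a_own), step(other, a_other)
    return sum(w * phi(own_next, other_next, a_own)
               for w, phi in zip(weights, COMPONENTS))
```

With weights such as `(1.0, 0.1, 100.0)`, the large weight on the risk premium mirrors the hierarchy of magnitudes the text anticipates: a small gap dominates the utility well before a physical collision would occur.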
Here we deploy the standard non-cooperative game theory solution concept in a deterministic dynamic setting: Markov sub-game perfect Nash equilibrium. The aim of each agent is to maximize the cumulative utility function defined as

$$U_i(a_{i,\{t=0,1,\ldots,T-1\}} \mid s_{i,0}, a_{-i,\{t=0,1,\ldots,T-1\}}) = \sum_{t=0}^{T-1} u_{i,t}(a_{i,t} \mid s_{i,t}, a_{-i,t}), \tag{3}$$

while taking into account how other agents would behave using only Markov strategies. In the above equation we have used the fact that future states, $s_{i,t>0}$, can be expressed as a function of the initial condition $s_{i,0}$ and the action sequences via the state evolution Eq. (2). (The additive decomposition of the per period utility in Eq. (1) is standard in both the economics and the machine learning literature; see for example [26, 36, 37].) Since the planning time horizon in our context is typically a few seconds, there is no point in introducing discounting.

The best response of agent $i$ to a given fixed set of action sequences of all other agents, $a_{-i,\{t=0,1,\ldots,T-1\}}$, is then defined by maximizing the cumulative utility function with respect to the self-action sequence

$$a^{*}_{i,\{t=0,1,\ldots,T-1\}}(s_{i,0}, a_{-i,\{t=0,1,\ldots,T-1\}}) = \operatorname*{argmax}_{a_{i,\{t=0,1,\ldots,T-1\}}} U_i(a_{i,\{t=0,1,\ldots,T-1\}} \mid s_{i,0}, a_{-i,\{t=0,1,\ldots,T-1\}}). \tag{4}$$

Note that the above equation effectively becomes a "static but simultaneous" optimization problem over the whole self-action sequence $a_{i,t}$, $\forall t \in \{0, 1, \ldots, T-1\}$, hence side-stepping all the computational issues associated with dynamic programming. In particular, the curse of dimensionality and the representation of the value function do not arise in our context. The Nash equilibrium of the game is reached when all agents are simultaneously using their best responses against each other, i.e.,

$$a^{*}_{i,\{t=0,1,\ldots,T-1\}}(s_{i,0}, a^{*}_{-i,\{t=0,1,\ldots,T-1\}}) = \operatorname*{argmax}_{a_{i,\{t=0,1,\ldots,T-1\}}} U_i(a_{i,\{t=0,1,\ldots,T-1\}} \mid s_{i,0}, a^{*}_{-i,\{t=0,1,\ldots,T-1\}}). \tag{5}$$

This condition is particularly useful for verifying whether the solution found by some numerical procedure is actually a Nash equilibrium.

One subtle aspect of our formulation above is that we only impose the initial condition explicitly in the optimization process, while leaving the terminal condition $s_{i,T}$ implicit, essentially as the outcome of the optimization. We choose to enforce the terminal condition, such as the destination, in a soft manner in the sense of Lagrange, by augmenting the utility function. This choice generally makes the optimization problem easier, but also endogenizes certain terminal outcomes, such as avoiding accidents or forming traffic jams when reaching the destination becomes physically impossible.

We can also re-write the optimization problem in Eq. (4) as

$$W_{i,t}(s_{i,t} \mid a_{-i,\{t,\ldots,T-1\}}) = \max_{a_{i,t}} \Big\{ u_{i,t}(a_{i,t} \mid s_{i,t}, a_{-i,t}) + W_{i,t+1}\big(s_{i,t+1} = f_{i,t}(a_{i,t} \mid s_{i,t}, a_{-i,t}) \mid a_{-i,\{t+1,\ldots,T-1\}}\big) \Big\}, \tag{6}$$

where the utility-to-go function $W_{i,t}()$ is defined, in conjunction with the state evolution, as

$$W_{i,t+1}(s_{i,t+1} \mid a_{-i,\{t+1,\ldots,T-1\}}) = \max_{a_{i,\{t+1,\ldots,T-1\}}} \Big\{ \sum_{t'=t+1}^{T-1} u_{i,t'}(a_{i,t'} \mid s_{i,t'}, a_{-i,t'}) \Big\}. \tag{7}$$

Eq. (6) is immediately recognized as the Bellman equation from dynamic programming. This in turn implies that the Nash equilibrium defined in Eq. (5) is also a sub-game perfect equilibrium.

betaNash

We exploit the best response dynamics to numerically solve the Markov sub-game perfect Nash equilibrium with pure strategy. The basic idea is that, starting from some reasonable initial action sequence, each agent tries to respond optimally in every iteration (viewed as a self-play learning process) given the prevailing strategies of all other agents. Under certain mathematical conditions, the best response dynamics converges to a Nash equilibrium.
A broad class of such games, called supermodular games with pure strategy, can be found in economics [38]. Formally, the best response dynamics is defined by
$$a^*_{i,\{t=0,1,\dots,T-1\},\tau+1}\big(s_{i,0}, a_{-i,\{t=0,1,\dots,T-1\},\tau}\big) = \operatorname*{argmax}_{a_{i,\{t=0,1,\dots,T-1\}}} U_i\big(a_{i,\{t=0,1,\dots,T-1\}} \mid s_{i,0}, a_{-i,\{t=0,1,\dots,T-1\},\tau}\big), \tag{8}$$
where we have added the subscript $\tau$ to denote the iteration step of the best response dynamics. It is worth pointing out that, while there is no explicit coordination in the formulation, actual coordination is achieved through the iterative nature of the best response dynamics in the above equation. In some sense, the best response iteration provides a "negotiation process" for all agents involved to figure out one another's intentions, such as where to go in what order at what speed, so that a mutually satisfactory outcome is obtained for all agents involved.
We also recognize that the best response dynamics is a specific form of self-play reinforcement learning, originally invoked to justify why a Nash equilibrium, which is so complex mathematically, exists and can be found by game players that generally have bounded rationality. It is also very similar, at least in spirit, to the self-play reinforcement learning used in DeepMind's AlphaGo Zero [39], though the detailed state representation, function approximation, and search strategies are totally different. There is no need for a neural network representation in our context, because we know the explicit functional forms of the utility functions and the state evolution. It is computationally much more efficient to rely on traditional maximization methods than on stochastic gradient based or tree-search methods in the best response dynamics.
With the above mathematical preparation we state the algorithm below:

Algorithm I: betaNash
initialize the simulation environment
for $\forall i \in I$ do
    initialize state variables $s_{i,t=0}$
    initialize the action set, e.g. $a_{i,\{t=0,1,\cdots,T-1\},\tau=0} = 0$
end for
for iteration step $\tau = 1 : n$ do   // this is the loop for best response dynamics
    for agent $i = 1 : N$ do
        get the most recent action sequence: $a_{-i,\{t=0,1,\dots,T-1\},\tau-1}$
        do the optimization for agent $i$:   // the loop over T is embedded in this step
            $a^*_{i,\{t=0,1,\dots,T-1\},\tau} = \operatorname{argmax}_{a_{i,\{t=0,1,\dots,T-1\}}} U_i(a_{i,\{t=0,1,\dots,T-1\}} \mid s_{i,0}, a_{-i,\{t=0,1,\dots,T-1\},\tau-1})$
        update the action set: $a_{i,\{t=0,1,\dots,T-1\},\tau} \leftarrow a^*_{i,\{t=0,1,\dots,T-1\},\tau}$
    end for
    check Nash equilibrium condition:
        $a^*_{i,\{t=0,1,\dots,T-1\}}(a^*_{-i,\{t=0,1,\dots,T-1\}}) = \operatorname{argmax}_{a_{i,\{t=0,1,\dots,T-1\}}} U_i(a_{i,\{t=0,1,\dots,T-1\}} \mid a^*_{-i,\{t=0,1,\dots,T-1\}})$
    if condition is true then
        break
    end if
end for
for $\forall i \in I$ do
    for $t = 0, 1, \cdots, T-1$ do
        $s_{i,t+1} = f_{i,t}(a^*_{i,t} \mid s_{i,t}, a^*_{-i,t})$
    end for
end for

The algorithm defined above (betaNash) has a number of nice properties:
• It is based on a rigorous Markov game whose Nash equilibrium can be solved by a systematic numerical method called best response dynamics.
• The dynamic Nash equilibrium so obtained is automatically sub-game perfect by construction.
• The best response dynamics provides an explicit negotiation or learning mechanism among all agents, for figuring out both their driving intentions and optimal actions.
• While the assumptions of game theory are very strong, so that the derived sub-game perfect Nash equilibrium solutions may not be immediately relevant for modeling mixed settings where human drivers are involved, they at least serve as the theoretical benchmark. We intuitively expect that solutions based on game theory are often more efficient in coordination than heuristically motivated solutions. This is also what we experienced empirically. One such example will be presented in Section 5. Other examples can be found in our forthcoming papers [40, 41].
• It is also conceivable, in the forthcoming connected and autonomous world, that the same solutions can be implemented through an edge infrastructure system, provided that the decision-making for all vehicles in the vicinity can be centralized.
• The algorithm is computationally feasible for at least some small but realistic problems where the number of agents in the game is not too large.
However, the algorithm also has a number of weaknesses:
• The assumptions associated with the dynamic Nash equilibrium may be too strong to be sufficiently realistic. Of course, this is the same criticism that applies to the basic assumptions of game theory, such as common knowledge (e.g. knowing the utility functions of all agents) and complete rationality (e.g. infinite computing power).
• The common knowledge assumption is perhaps only realizable when all agents are autonomous, their utility functions are endowed, and the solution is derived by a central controller.
• The computational burden grows dramatically when the number of agents becomes large, at least N × T times the number of iterations in the best response dynamics, where N is the number of agents involved in the game. This makes it hard to apply to realistic settings where N is large.
• The best response dynamics is intrinsically sequential as defined in Eq.(8), and hence betaNash is not naturally parallelizable. It is possible to make the algorithm parallel by simultaneously calculating the best response to the actions of all other agents at one time step earlier, though this may lose some efficiency on a per iteration basis.
• Another issue, though conceptually minor, is that the number of agents N has to be fixed for the game in the entire planning time window.
This is not very realistic, given the fact that the planning horizon can be as long as ∼
• The common knowledge and rationality assumptions, together with the heavy computational burden, will also limit the scope of applications to inverse problems and mechanism design problems.
It was the hope of overcoming these weaknesses that prompted us to pursue another solution concept, as detailed in the next section.
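To make the best response iteration of Eq.(8) concrete, the following is a minimal sketch on a toy problem, not the paper's driving game. It assumes two agents with a single scalar action each and hypothetical quadratic utilities (`make_utility`, `target` and `coupling` are illustrative names that do not appear in the paper), and it uses a simple ternary search in place of a general-purpose optimizer. The stopping test plays the role of the equilibrium check in Eq.(5): once no agent changes its action in a full sweep, every action is (numerically) a best response to the others.

```python
def make_utility(target, coupling):
    """Hypothetical concave utility: stay near `target`, stay near the others."""
    def utility(a_self, a_others):
        return -(a_self - target) ** 2 - coupling * sum(
            (a_self - a) ** 2 for a in a_others)
    return utility


def best_response(utility, a_others, lo=-10.0, hi=10.0, iters=100):
    """Maximize a concave scalar utility on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if utility(m1, a_others) < utility(m2, a_others):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def beta_nash_toy(utilities, n_iter=200, tol=1e-6):
    """Sequential best response dynamics; returns (actions, sweeps used)."""
    actions = [0.0] * len(utilities)          # zero initial actions
    for tau in range(1, n_iter + 1):
        max_change = 0.0
        for i, u in enumerate(utilities):
            others = actions[:i] + actions[i + 1:]   # most recent actions
            new_action = best_response(u, others)
            max_change = max(max_change, abs(new_action - actions[i]))
            actions[i] = new_action
        if max_change < tol:                  # fixed point: equilibrium check
            break
    return actions, tau


utilities = [make_utility(0.0, 1.0), make_utility(1.0, 1.0)]
actions, sweeps = beta_nash_toy(utilities)
# For these toy utilities the fixed point is (1/3, 2/3).
print(actions, sweeps)
```

For this toy game the best response map is a contraction, so convergence is guaranteed; that is a property of the chosen utilities and should not be read as a statement about the driving game itself.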
In the preceding section, we introduced an explicit risk premium into the utility function so that the would-be stochastic game can be simplified to a game of perfect information, at least from the point of view of the state evolution. Consequently, the computational burden is eased substantially. However, we have additional reasons to further simplify the model so that we can accommodate human drivers and autonomous vehicles at the same time. This is due to several considerations. First, in a mixed setting, where human agents and algorithmic agents coexist, the common knowledge assumption becomes dubious. Of course, we can always invoke the Bayesian Nash equilibrium concept to relax some of the common knowledge requirements. But that will again substantially increase the computational burden. Second, it is unreasonable to assume that human agents, with bounded rationality, can always find the pertinent Nash equilibrium in real time in realistic traffic settings. Third, in order to make the approach practical, a further reduction in computational requirements is necessary.
Before we proceed more formally let us first recapitulate human driving heuristics:
• All decisions are decentralized and individually made by each agent, mostly without explicit communication.
• All the immediate actions are planned according to the current state, with some reasonable or good enough but otherwise imprecise anticipation of the future state.
• In case the future state has multiple scenarios, action planning is done according to the most conservative one.
• The planned actions have some time persistence, unless they are interrupted by a change in the environment.
• The action planning starts anew once new information is assessed.
Armed with the formulation in Section 3, we have a good starting point from which we can gradually relax the conditions that were originally required for the Nash equilibrium approach. We continue to rely on human driving heuristics to guide us in this endeavor. The focus shifts from the perspective of the full game and the pertinent equilibrium in the preceding section to the perspective of the individual agent's decision-making in this section.
The first step is to simplify the state evolution in Eq.(2). Recall that we used the state evolution to eliminate all the dependence of the cumulative utility function on future states $s_{i,t>0}$. While this is natural theoretically, it is hard to imagine that a human driver can iterate the state evolution many times in a realistic traffic setting, in which both one's own and others' future action sequences are needed. More likely, a human driver will simply anticipate the future states by extrapolating from the current state, the immediate own action, and some reasonable assumptions on the subsequent action sequences of all other agents. This motivates the following estimated state evolution
$$\tilde{s}_{i,t+1} = \tilde{f}_{i,t}(a_{i,t} \mid s_{i,t}). \tag{9}$$
In the above equation we have kept the self-action dependence and dropped the precise dependence on the actions of all other agents by assuming $a_{-i,t>0}$ to be some naturally intuitive maneuvering that is commensurate with the specific context. However, Eq.(9) cannot be entirely independent of $a_{-i,t>0}$, otherwise we would be dealing with non-interacting cases. We specify this imprecise dependence on $a_{-i,t>0}$ in Eq.(9) by what we call anticipation for agent $i$ at time $t$ given state $s_{i,t}$. Our strategy is to replace the precise, dynamically interacting sequence of actions by some prescribed sequence of actions for all the other agents that is reasonably consistent with the context.
In the remainder of this subsection we outline the two generic elements necessary for finite look-ahead anticipation. Of course, the detailed form of the anticipated $a_{-i,t>0}$ is highly context dependent and hence cannot be made general. We will provide a concrete example of how this is done in the simulation experiment in Section 5, including algorithmic details.
The next step is to simplify the best response in Eq.(4). One obvious approach is to break the planning horizon into smaller chunks. The extreme case is to let $T = 1$. Unfortunately, we know empirically that purely myopic decision-making is not sufficient. Therefore, some form of anticipation needs to be built into the process. Towards this end, we define an effective utility function by looking ahead $h \gg 1$ periods:
$$\tilde{u}_{i,t}(a_{i,t} \mid s_{i,t}; h) = \sum_k w_{i,k}\, g_k\Big(\phi^{(k)}_{i,t}(a_{i,t} \mid \tilde{s}_{i,t}); h\Big). \tag{10}$$
The choice of functional form for $g_k()$ depends on the specific feature and is made conservatively when necessary for safety reasons. For some components, such as the moving forward reward and the lane departure penalty, $g_k()$ is the average of the corresponding component over the $h$ periods. For driving smoothness, we choose the penalty in the first period. For components that are potentially calamitous, such as crash or collision penalties, we choose the maximum penalty among the $h$ periods given the prescribed action sequences for others. The effective utility function defined in Eq.(10) embodies the first element of what we call finite look-ahead anticipation once all the paths are hypothesized. Then, the resulting approximate best action can be derived from
$$\tilde{a}^*_{i,t}(s_{i,t}; h) = \operatorname*{argmax}_{a_{i,t}} \tilde{u}_{i,t}(a_{i,t} \mid s_{i,t}; h). \tag{11}$$
Notice that, due to the dropping of the action sequence for all other agents in the simplified effective utility function, the above optimization is only a reaction to the pre-chosen action sequences in the anticipation, and hence the original game problem is reduced to a step-by-step control problem.
One subtlety arises because there is no explicit coordination mechanism in Eq.(11): how can each agent figure out the intentions of all the other nearby agents when they do not have common knowledge? To address this problem, we introduce the concept of path scenario: a finite number of possible paths that a nearby agent can pursue at any given moment. This can be done by enlarging the state space in Eq.(11). Again, we take a conservative approach and plan the action for agent $i$ with the most conservative path scenarios from all nearby agents. In this way, the state variable should also be understood as adaptive, including the number of nearby agents and their path scenarios. This can be viewed as the second element of finite look-ahead anticipation. It is important to emphasize that the anticipation of the future state is done for all agents, including self and all other agents, though they are treated differently, because the self intention is known whereas others' intentions are unknown.
Once solved from Eq.(11), the solution $\tilde{a}^*_{i,t}(s_{i,t}; h)$ is only executed for one period, even though it is derived with information from $h$ periods. As time increments from $t$ to $t+1$, a new optimization problem is re-assessed from the observable state variables at $t+1$. This is our third step in formulating a modeling framework which is ultimately capable of handling human driving behaviors and being computationally feasible at the same time. The intuition behind this modeling strategy is that each agent plans its path at micro level with deliberate intention and anticipation, and then monitors the situation constantly and adapts when necessary.
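The three steps above (estimated state evolution, effective utility with look-ahead $h$, and executing only the first period before re-planning) can be illustrated by a small one-dimensional car-following sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper's experiment: a single lead vehicle extrapolated with zero action, a candidate acceleration held constant over the $h$ look-ahead periods, a hypothetical 10 m comfort gap inside a sigmoid risk term, a discrete candidate set, and the weights 1, 0.01 and 100. Only the aggregation rules follow the text: average for the forward reward, first period for smoothness, maximum over the horizon for the collision risk.

```python
import math

DT = 0.2       # decision interval in seconds
V_REF = 31.0   # reference speed in m/s at which the forward reward peaks


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def effective_utility(a, gap, v_ego, v_lead, a_prev, h):
    """Effective utility of holding acceleration `a` for h look-ahead periods."""
    forward, worst_risk = 0.0, 0.0
    g, v = gap, v_ego
    for _ in range(h):
        g -= DT * (v - v_lead)      # lead vehicle anticipated with zero action
        v += DT * a                 # own candidate action
        forward += 1.0 - ((v - V_REF) / V_REF) ** 2
        worst_risk = max(worst_risk, sigmoid(10.0 - g))  # most conservative
    smooth = (a - a_prev) ** 2      # smoothness penalty, first period only
    return forward / h - 0.01 * smooth - 100.0 * worst_risk


def adaptive_seek_toy(gap=40.0, v_ego=30.0, v_lead=20.0, steps=40, h=10):
    """Re-plan at every step and execute only the first period of each plan."""
    candidates = [-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0]
    a_prev, gaps, speeds = 0.0, [gap], [v_ego]
    for _ in range(steps):
        a = max(candidates,
                key=lambda c: effective_utility(c, gap, v_ego, v_lead, a_prev, h))
        gap -= DT * (v_ego - v_lead)   # realized state evolution
        v_ego += DT * a
        a_prev = a
        gaps.append(gap)
        speeds.append(v_ego)
    return gaps, speeds


gaps, speeds = adaptive_seek_toy()
print(min(gaps), speeds[-1])
```

In this sketch the follower starts braking well before the gap becomes critical, because the maximum risk over the look-ahead window enters the utility of the current decision; with $h = 1$ the same code would react much later.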
Of course, questions such as whether such a modeling strategy can truly approximate human driving behaviors well, and whether it is sufficient to avoid accidents, can only be answered empirically. Our initial applications using the solution concepts outlined in this section in a variety of settings [27, 34, 40, 41, 42, 43] indicate that the answers are affirmative, thanks to the explicit introduction of the risk premium to the utility function, the finite look-ahead anticipation, and the adaptivity.
Finally, while it is important to demand $h \gg 1$, we should not make $h$ too large. This is because the decision will be dominated by those future time periods where the assumption of $a_{-i,t>0} = 0$ becomes totally dubious when $h$ is extremely large. Generally, $h$ should be treated as a hyper-parameter of the model and tuned in each specific modeling situation. However, our experience indicates that $h \times \Delta t$ in the range of 1 to 3 seconds is a good choice, providing robustness for many different driving settings, depending on the specific traffic condition, such as local or highway, crowded or uncrowded, good or bad weather, and so on. We also found empirical evidence that $h$ should be in this range when we calibrated the utility function using traffic video data [27].

4.2 Algorithm II: adaptiveSeek
We are now ready to state the algorithm that implements the theoretical considerations in the preceding subsection.
Algorithm II: adaptiveSeek
for $\forall i \in I$ do
    initialize the simulation environment
end for
for $t = 0 : T-1$ do
    for $\forall i \in I$ do   // this is the loop over all agents
        get current state $s_{i,t}$
        for $\forall j \in I_{-i}$ do   // this is the loop over agents other than i
            if certain desired paths are assessed then
                anticipate possible path scenarios that agent $j$ is likely to take based on its state
                for $\tilde{t} = 0 : h-1$ do
                    extrapolate trajectory using a prescribed action consistent with the path scenario
                end for
            else
                assume agent $j$ keeps the natural extension of its current state
                for $\tilde{t} = 0 : h-1$ do
                    extrapolate trajectory using zero action
                end for
            end if
        end for
        calculate the effective utility given all the anticipated path scenarios:
            $\tilde{u}_{i,t}(a_{i,t} \mid s_{i,t}; h) = \sum_k w_{i,k}\, g_k\big(\phi^{(k)}_{i,t}(a_{i,t} \mid \tilde{s}_{i,t}); h\big)$
        choose the effective utility with the safest path among all the anticipated path scenarios
        do the optimization $\tilde{a}^*_{i,t}(s_{i,t}; h) = \operatorname{argmax}_{a_{i,t}} \tilde{u}_{i,t}(a_{i,t} \mid s_{i,t}; h)$
        if add noise then
            $\hat{a}_{i,t}(s_{i,t}) = \tilde{a}^*_{i,t}(s_{i,t}; h) + \epsilon_{i,t}$
        end if
    end for
    for $\forall i \in I$ do
        apply the observed action and evolve to new states: $s_{i,t+1} = f_{i,t}(\hat{a}_{i,t} \mid s_{i,t})$
    end for
end for

A few general comments are in order for the above algorithm (adaptiveSeek):
• Since the decision-making for each agent is independent, without explicit coordination, in adaptiveSeek, there is no need to assume that all agents in the game follow the same algorithm. It could be that some agents are human driven and some are autonomous with their own built-in algorithms, so long as all the implied behaviors can be reasonably anticipated. Therefore, we expect that adaptiveSeek can be used to model mixed settings where all types of agents co-exist.
• Due to the lack of common knowledge we cannot assume that each agent knows the intentions of all other agents.
Consequently, this requires some form of anticipation for the intentions of the agents in the vicinity. We choose to accomplish this with the simplest possible approach: using a prescribed action sequence consistent with the path scenario. Then, the algorithm picks the safest optimal action to execute among all the possible scenarios. In this way, we are able to handle the mixed environment. The if-else condition in the optimization process is a direct consequence of the fact that each agent does not have information about others' intentions.
• However, due to the inclusion of the anticipation for all the nearby agents in the effective utility function, adaptiveSeek is still implicitly coordinated, in contrast with betaNash where the coordination is explicit. Without some form of coordination, it is not possible to take care of the interactions among all agents.
• With this fully adaptive approach, we can easily handle any agents entering or leaving the simulation environment at any given time, which makes it more flexible than the solution concept in Algorithm betaNash.
• By construction, this algorithm is local, because the utility function depends only on the nearby agents, and hence is linear in the number of agents involved in the game. Furthermore, the inner loop can be easily parallelized, as the decision-making for each agent is practically independent from one another at each time step. If the number of processors is big enough, the algorithm is practically order N. Alternatively, this inner-loop optimization can be distributed to each agent's own on-board computer. It is these nice properties that make adaptiveSeek very fast. This in turn gives it a good chance of running in real time in realistic applications.

So far we have been concentrating solely on the deterministic limit. Now we would like to re-introduce uncertainties explicitly into our formulation.
There are several reasons that motivate us to do this: 1) checking the robustness of the deterministic modeling results against small disturbances; 2) solving the inverse problems using techniques developed in inverse reinforcement learning with maximum likelihood [36, 37] or maximum entropy methods [26]; and 3) designing good mechanisms for improving traffic rules or regulations. To that end, we need to distinguish the state perceived by the decision-maker from the actually realized state. We denote the former by $\hat{s}_{i,t}$ and continue to use $s_{i,t}$ to denote the latter. Likewise, we also need to distinguish the actually realized action, $\hat{a}_{i,t}$, from the intended optimal action $a^*_{i,t}$. Note that the intended optimal action is now a function of the perceived state, not the actual state, i.e. $a^*_{i,t} = a^*_{i,t}(\hat{s}_{i,t})$. With these new notations, we can re-state the equations related to the action optimization and state evolution for Algorithm adaptiveSeek as follows (cf. Eq.(9, 11)):
$$\begin{aligned}
\hat{s}_{i,t} &= s_{i,t} + \epsilon^{s}_{i,t}, \\
a^*_{i,t}(\hat{s}_{i,t}; h) &= \operatorname*{argmax}_{a_{i,t}} \tilde{u}_{i,t}(a_{i,t}, \hat{s}_{i,t}; h), \\
\hat{a}_{i,t}(\hat{s}_{i,t}) &= a^*_{i,t}(\hat{s}_{i,t}; h) + \epsilon^{a}_{i,t}, \\
s_{i,t+1} &= \tilde{f}_{i,t}\big(\hat{a}_{i,t}(\hat{s}_{i,t}) \mid s_{i,t}\big),
\end{aligned} \tag{12}$$
with $\epsilon^{s}_{i,t}$ and $\epsilon^{a}_{i,t}$ obeying some chosen statistical distributions. While we included the noise terms additively in the above equation for notational simplicity, it might be more appropriate to treat some of them multiplicatively. Notice that the realized action is affected by two sources of uncertainty, one explicitly through $\epsilon^{a}_{i,t}$, the other implicitly through the perceived state $\hat{s}_{i,t}$ in $a^*_{i,t}(\hat{s}_{i,t})$. Since the state evolution is supposed to be intrinsic, not from the decision-making perspective, the actual state evolves from the actual initial state $s_{i,t}$ under the realized action $\hat{a}_{i,t}(\hat{s}_{i,t})$.
The noise terms in Eq.(12) can originate either from instrument noise and from human estimation and execution errors relative to the idealized state and optimal action, or from minor uncertainties of the road, such as local slopes, bumps and potholes, poor weather and lighting conditions. The specific distributional assumption depends on the context of the modeling. The distributions can be as simple as IID normal distributions, or they can be more complex, such as containing some form of auto-regressive structure.
(Footnote: It is seemingly an order N operation in the loop over $I_{-i}$, which would in turn make adaptiveSeek an order N algorithm. Fortunately, we can make the loop over $I_{-i}$ an order N operation if we enlarge the state space by keeping track of a list of all neighboring agents and possibly their neighbors.)
In some sense, the above specification reconciles our earlier deterministic formulation a posteriori with the common MDP formulation for the game, where the state evolution is stochastic and the action can take on a mixed strategy. Viewed through a similar lens, we can also regard the deterministic formulation in subsection 2.2 as an approximation to the MDP specification using the certainty equivalent approach that is well known in the adaptive control literature [44]. The certainty equivalent approximation aims to substantially reduce the computational burden by replacing probabilistic expectations with their corresponding expected values, in the spirit of $E[V(\cdot, \epsilon)] \approx V(\cdot, E[\epsilon])$ or its variations, so that the policy optimization becomes deterministic, as we did in subsection 2.2.
Similarly, though with additional notational complication, we can also re-write the state evolution and action optimization for Algorithm betaNash in Section 3. In this case the sub-game perfect Nash equilibrium conditions need to be understood as holding under the certainty equivalent approximation mentioned above.
Even so, the equilibrium has to be solved period by period, not only to be adaptive to the realized state, which is now contaminated by noise or errors, but also to accommodate potential changes of the agents involved in the game at any given moment.
One can immediately recognize that Eq.(12), along with the kinematic bicycle model in Eq.(13, 14, 15) to be presented in the next section, is in the form of a state space model (SSM). The inverse problem, calibrating the utility function using observed action data, is well formulated and investigated in SSM, see for example [45]. The driving decisions made by the driver can be incorporated naturally as the control input in SSM. In contrast with the standard inverse reinforcement learning techniques, such as those in [25, 26, 36, 37], where coupled dynamic programming problems have to be solved, inference using SSM with adaptiveSeek as the behavioral model only requires solving decoupled static optimization problems. This dramatic computational simplification is possible because adaptiveSeek is only contingent on the state observable by each individual agent, step by step. The dynamic aspect is implicitly encoded in the sequence of states for each agent, while inter-agent interactions also reside in the observable state. In [27] we demonstrate how this is done by explicitly analyzing data from the Sugiyama experiment [28].

So far our discussion has been quite general and formal. Now we apply these concepts and algorithms to a concrete example in this section. Our aims are 1) to illustrate how the concepts and algorithms are actually used in solving specific problems; 2) to show detailed examples of the specification of utility functions; 3) to contrast the solutions derived from two very different algorithms: betaNash and adaptiveSeek.
The specific setup of the simulation experiment is partially inspired by the work in [6].
• Total planning simulation time span is 8 seconds, with decision time interval ∆ t = 0 . T = 40.
• Two lanes in parallel running from left to right, with an unexpected barrier located at $x = 0$ blocking the lower lane starting at $t = 0$.
• All lengths are in meters, all velocities are in m/s, all accelerations are in m/s$^2$, and angles are in degrees.
• There are two vehicles running along the highway with constant speed, one in each lane initially.
• The state is understood broadly to include all causal information necessary for decision-making at any given moment. This includes all the positions, velocities, accelerations, steerings, and orientations of the two vehicles in the immediate past.

5.1.2 State Evolution: Kinematic Bicycle Model
We adopt the kinematic bicycle model [46] with only front wheel steering capability to specify the state evolution. In this model the state variable of vehicle $i$ is characterized by its center of gravity coordinate $(x_{i,t}, y_{i,t})$ and the orientation (aka heading angle or yaw angle) $\psi_{i,t}$ in a fixed coordinate system. The equation of motion for vehicle $i$ is explicitly given by
$$\begin{aligned}
x_{i,t+1} &= x_{i,t} + \Delta t\, v_{i,t} \cos(\psi_{i,t} + \beta_{i,t}); \\
y_{i,t+1} &= y_{i,t} + \Delta t\, v_{i,t} \sin(\psi_{i,t} + \beta_{i,t}); \\
\psi_{i,t+1} &= \psi_{i,t} + \Delta t\, \frac{v_{i,t}}{L} \cos\beta_{i,t} \tan\delta_{i,t};
\end{aligned} \tag{13}$$
where L = 2 . m is the wheelbase, and b (chosen to be L/ β i,t is the slip angle (the angle between the velocity of the center of gravity $v_{i,t}$ and the longitudinal axis of the vehicle) defined as
$$\beta_{i,t} = \tan^{-1}\Big(\frac{b}{L} \tan\delta_{i,t}\Big), \tag{14}$$
where $\delta_{i,t}$ is the steering angle, a decision variable for vehicle $i$. The other decision variable is the acceleration of the center of gravity of the vehicle, $\alpha_{i,t}$, which acts on the velocity evolution directly as
$$v_{i,t+1} = v_{i,t} + \Delta t\, \alpha_{i,t}. \tag{15}$$

There are eight components for the utility function, whose interpretations and functional forms are described below:
• A moving forward reward that peaks at the speed limit of the highway v = 31 m/s . φ (1) i,t = 1 − ( v i,t − v v ) .
• Three roughness penalties, two for encouraging acceleration/steering smoothness over time, and one for dis-couraging hard acceleration and braking. φ (2) i,t = (cid:0) α i,t − α i,t − (cid:1) ; φ (3) i,t = (cid:0) δ i,t − δ i,t − (cid:1) ; φ (4) i,t = ln (cid:16) (cid:2) κ (4) ( α i,t − ¯ α ) (cid:3)(cid:17) + ln (cid:16) (cid:2) − κ (4) ( α i,t − α ) (cid:3)(cid:17) ;where ¯ α = 4 m/s and α = − m/s are the acceleration and braking limits respectively. The parameter κ (4) = 15 . • A lane departure penalty that incentivizes the vehicle to stay in the middle of either lanes centered at ± . m . φ (5) i,t = min (cid:104)(cid:0) y i,t − ( W/ (cid:1) / (3 W / , (cid:105) ;where W = 3 . m is the lane width, and the two lanes are centered at ± W/ • An out of road penalty that gives a huge penalty when the vehicle is off the road shoulder. φ (6) i,t = S (cid:16) κ (6) (cid:0) | y i,t | − ( W + w/ (cid:1)(cid:17) ;where w = 2 . m is the vehicle width, S ( x ) = 1 / [1+exp( − x )] the sigmoid function , and the parameter κ (6) = 3 . There should also be a similar component for penalizing hard steering. However, due to the range of steering in oursimulation experiment being fairly small, we ignore such penalty for simplicity. A crash penalty that at the location of the unexpected barrier x = 0 at the lower lane. φ (7) i,t = S (cid:16) κ (7) x ( x i,t + l (7) x ) (cid:17) · S (cid:16) − κ (7) y ( y i,t − l (7) y ) (cid:17) ;where the crash risk premium related parameters l (7) x = 5 . m and l (7) y = 1 . m , and the parameters κ (7) x = 2 . κ (7) y = 20 . • A collision penalty between (pairwise) nearby vehicles whose shapes are assumed to be rectangles. 
φ (8) ij,t = (cid:104) ˜ S (cid:16) κ (8) x (∆ x ij,t + l (8) x ) (cid:17) + ˜ S (cid:16) κ (8) x ( l (8) x − ∆ x ij,t ) (cid:17)(cid:105) · (cid:104) ˜ S (cid:16) κ (8) y (∆ y ij,t + l (8) y ) (cid:17) + ˜ S (cid:16) κ (8) y ( l (8) y − ∆ y ij,t ) (cid:17)(cid:105) ;where ∆ x ij,t = x i,t − x j,t , ∆ y ij,t = y i,t − y j,t , and ˜ S ( x ) = S ( x ) − /
2. The collision risk premium relatedparameters are l (8) x = 10 . m and l (8) y = 2 . m , with κ (8) x = 0 . κ (8) y = 9 . w = 1 . w = − . w = − . w = − . w = − . w = − . w = − .
0, and w = − .
0. For simplicity, all these values are shared by both vehicles in this simulation experiment. Note that the first six components are purely physical self-effects, and only the last two involve interactions and risk premiums.
Admittedly, the above functional forms and parameters in these components and their corresponding weights are intuitively chosen. They are also somewhat arbitrary within a range. Our simulation results are quite robust so long as the parameters do not deviate too much. Ultimately, these functional forms, parameters and weights are supposed to be calibrated using real data on the vehicles and observed driving behaviors. There are certainly other complications. For example, some of the parameters may depend on many other factors, such as vehicle speed, weather condition, lighting, and so on. Generally, how best to assess the utility function and the possible subtleties which may arise are topics in their own right and hence are beyond the scope of this paper. In two separate papers, [27, 43], we attempt to start addressing these types of issues more systematically.

betaNash
We derive the sub-game perfect Nash equilibrium solution by using algorithm betaNash to iteratively solve Eq.(8)under the following two initial conditions, designed to obtain qualitatively different merging behaviors:IC1 : (cid:110) vehicle in open lane: ( x , , y , ) = ( − , .
85) ; ( v x, , v y, ) = (31 ,
0) ;vehicle in blocked lane: ( x , , y , ) = ( − , − .
85) ; ( v x, , v y, ) = (31 , . IC2 : (cid:110) vehicle in open lane: ( x , , y , ) = ( − , .
85) ; ( v x, , v y, ) = (31 ,
0) ;vehicle in blocked lane: ( x , , y , ) = ( − , − .
85) ; ( v x, , v y, ) = (31 , . We further assume zero initial actions: α i,t = − = 0 and δ i,t = − = 0. Because we are using betaNash , which assumescommon knowledge, we have to regard the two vehicles being autonomous and controlled by the edge infrastructure.There are 80 decision variables for each vehicle ( α i,t , δ i,t ) with t ∈ { , , · · · , T − } . Intuitively, when the vehicle inthe blocked lane leads the vehicle in the open lane in IC1, a front merge is expected to be the more natural outcome.When the vehicle in the blocked lane is no longer leading in IC2, a rear merge is expected to be the more naturaloutcome. We deliberately leave enough time and space for both vehicles to maneuver so that non-calamitous solutionsare feasible. .2.2 The Sub-Game Prefect Nash Equilibrium To avoid local maximum in the best response optimization in Eq.(8), we use the
Basin-Hopping algorithm implemented in the Python package SciPy, which has some capability for handling global optimization. This appears to be sufficient for betaNash to converge in our particular setting, likely thanks to the negotiation/learning mechanism inherently embedded in the best response dynamics. The numerical solutions are depicted in Fig.2, with the upper panel for IC1 and the lower panel for IC2. The intuitively expected merging behaviors are borne out explicitly: a front merge is the endogenized outcome under IC1, whereas a rear merge is the endogenized outcome under IC2. Note that no notion of merging behavior is pre-injected before the solutions are derived. Everything is purely driven by the utility functions and the initial condition. The coordination of the two vehicles in the solutions is entirely achieved during the iterative process of the best response dynamics. The resulting vehicle trajectories and velocities are all smooth.
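As a minimal illustration of the best response dynamics described above, the following Python sketch iterates sequential best responses in a toy two-player game and then checks that no small unilateral deviation improves either player's utility. Everything here is a hypothetical stand-in rather than the betaNash implementation: the quadratic utilities, the scalar actions, the function names, and the coarse grid search used as the inner optimizer (the text uses SciPy's Basin-Hopping for that step) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Utility = Callable[[Sequence[float]], float]


def grid_argmax(f: Callable[[float], float], lo: float, hi: float, n: int = 401) -> float:
    """Return the grid point in [lo, hi] that maximizes f.

    Assumption: a plain grid scan stands in for the global optimizer
    (the paper uses Basin-Hopping for this inner step).
    """
    step = (hi - lo) / (n - 1)
    return max((lo + k * step for k in range(n)), key=f)


def best_response_dynamics(utilities: List[Utility], start: Sequence[float],
                           lo: float = -2.0, hi: float = 2.0,
                           max_iter: int = 200) -> Tuple[List[float], int]:
    """Let each player in turn best-respond to the others' current actions.

    Stops when a full sweep changes no action, or after max_iter sweeps.
    """
    actions = list(start)
    for iteration in range(1, max_iter + 1):
        largest_move = 0.0
        for i, u in enumerate(utilities):
            def u_i(a: float) -> float:
                # Utility of player i when only its own action is varied.
                return u(actions[:i] + [a] + actions[i + 1:])
            best = grid_argmax(u_i, lo, hi)
            largest_move = max(largest_move, abs(best - actions[i]))
            actions[i] = best
        if largest_move == 0.0:
            return actions, iteration
    return actions, max_iter


def max_unilateral_gain(utilities: List[Utility], actions: Sequence[float],
                        deviations: Sequence[float] = (-0.1, -0.05, 0.05, 0.1)) -> float:
    """Largest utility gain any single player obtains by deviating alone."""
    gains = []
    for i, u in enumerate(utilities):
        base = u(actions)
        for d in deviations:
            trial = list(actions)
            trial[i] += d
            gains.append(u(trial) - base)
    return max(gains)


# Hypothetical toy utilities: each player wants to track a linear
# function of the other player's action.
def u0(a: Sequence[float]) -> float:
    return -(a[0] - (1.0 - 0.5 * a[1])) ** 2


def u1(a: Sequence[float]) -> float:
    return -(a[1] - 0.5 * a[0]) ** 2


equilibrium, n_iter = best_response_dynamics([u0, u1], start=[0.0, 0.0])
print("approximate fixed point:", equilibrium, "after", n_iter, "sweeps")
print("largest unilateral gain:", max_unilateral_gain([u0, u1], equilibrium))
```

In this toy game the exact fixed point is (0.8, 0.4), so the iteration should settle within one grid step of it, and a non-positive largest gain indicates that neither player benefits from deviating alone.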
Figure 2: Vehicle trajectories $(x^*_{i,t}, y^*_{i,t})$, ∀ t ∈ {0, 1, ···}, in different merging behaviors under different initial conditions in betaNash: (a) Under IC1, the vehicle in the blocked lane (yellow dot or square with a black frame) first accelerates longitudinally and then turns to the open lane, while the vehicle in the open lane (blue dot or square without a black frame) first yields by slowing down and then accelerates to catch up. (b) Under IC2, the vehicle in the open lane first accelerates and moves to the left slightly, while the vehicle in the blocked lane slows down first, and then turns to the open lane and accelerates to catch up. The number in the boxes represents the time in seconds, and the color represents the speed.
In Fig.3 we plot the action sequences for both vehicles in the equilibrium under IC1 (front merge) and IC2 (rear merge). The solved actions are also smooth in both cases. It is interesting to observe that the vehicle in the open lane swerves to create a little more space for the vehicle in the blocked lane to merge, more so in the case of a rear merge than in that of a front merge.
Figure 3: Action sequences for both vehicles in the equilibrium solved by betaNash under IC1 (front merge) and IC2 (rear merge).
To verify that the solutions actually satisfy the equilibrium condition Eq.(5), we plot the cumulative utility as a function of the deviation of each individual decision variable $(\alpha^*_{i,t}, \delta^*_{i,t})$ from its optimal point while keeping all other decision variables at the equilibrium, for both vehicles and all time steps t ∈ {0, 1, ···, 39}. As can be seen from Fig.4, the equilibrium condition is indeed met.
Since betaNash is an iterative procedure whose outcome depends on where the search starts, there is no guarantee that the best response dynamics will converge to the same fixed point, unless the game is supermodular as required by [38]. Unfortunately, the game in our setting is unlikely to be supermodular, and multiple equilibria could exist, due to the complicated tradeoffs among the various components in the utility function. This turns out to be the case, not too surprisingly. For example, a rear merge outcome can occasionally appear even under IC1, though much less often than a front merge, if we start the best response dynamics iteration randomly. Numerically, we did not find additional equilibria other than the aforementioned front and rear merges. It is possible that the best response dynamics in our setting is locally supermodular.
adaptiveSeek
In this case, we can either regard the two vehicles as autonomous, as in the case of betaNash, or, if we assume that the parameters and weights in the utility functions are calibrated using actual data, view the solutions in this subsection as describing the driving behaviors of two human-driven vehicles.
The utility functions and parameters are identical to those described in subsection 5.1.3. In order to guarantee that both vehicles see the barrier at t = 0, we need to enlarge the risk premium parameter for crash to $l^{(7)}_x$ = 10 . m, and set the anticipation horizon to 3 seconds, implying h = 15.
Figure 4: Cumulative utility as a function of the deviation of each individual decision variable from its optimal point, while all other variables are kept at the equilibrium. The color from cold to warm corresponds to time step from low (t = 0) to high (t = 39).
The self state evolution per period is the same as in Eq.(13,14,15) when the action is given. However, to calculate the effective utility function in Eq.(10) we need to anticipate future path scenarios. In this particular example, this amounts to inferring whether the other vehicle will change lane and how. For the lane change maneuvering, we follow a slightly modified Stanley's orientation formula [47] shown below:
$$\psi^*_t = \tan^{-1}\left( \frac{\kappa\, d_t}{\sqrt{v_t}} \right), \qquad (16)$$
where $d_t$ is the distance between the center of gravity of the vehicle and the center of the lane, $v_t$ is the velocity of the vehicle, and κ = 0.15 is an empirically tuned parameter. The steering is then negatively proportional to the difference between the current orientation and the orientation prescribed by Stanley's formula, as if we were following a feedback control algorithm for the vehicle's steering. In this feedback control, Stanley's formula acts as the reference point or the guidance for setting the steering angle. Of course, any other similar formula would do the job. Stanley's formula merely provides an explicit mechanism to operationalize the path anticipation. The explicit future path anticipation during path planning is implemented by the following algorithm (Algorithm III):
Algorithm III: path anticipation in adaptiveSeek at time t
for agent $i \in I$ with the given action $a_{i,t} = (\alpha_{i,t}, \delta_{i,t})$ do
    get current state of all agents $s' = s_t$ for path anticipation
    for looking-ahead time $\tilde{t} = 1 : h$ do
        if $\tilde{t}$ < h/ then
            $a'_{i,\tilde{t}} = a_{i,t}$, $a'_{-i,\tilde{t}} = (0, 0)$ // this is to maintain action persistence briefly
        else
            for agent i:
                if agent i crosses the lane divider then
                    $a'_{i,\tilde{t}} = (\alpha_{i,t},\ \delta'_{i,\tilde{t}} = \psi^*_{i,\tilde{t}} - \psi_{i,\tilde{t}})$ // during a lane change, the steering follows Eq.(16)
                else
                    $a'_{i,\tilde{t}} = a_{i,t}$ // maintain the intended motion if no lane change
                end if
            for other agent −i:
                if agent −i crosses the lane divider then
                    $a'_{-i,\tilde{t}} = (0,\ \delta'_{-i,\tilde{t}} = \psi^*_{-i,\tilde{t}} - \psi_{-i,\tilde{t}})$ // during a lane change, the steering follows Eq.(16)
                else
                    $a'_{-i,\tilde{t}} = (0, 0)$ // maintain constant motion if no lane change
                end if
        end if
        update their state anticipation $s'_{\tilde{t}}$ according to Eq.(13)
    end for
    calculate the effective utility for agent i with the given action $a_{i,t}$ during the above path anticipation
end for
It is important to emphasize again that these scenario assumptions are only used anticipatively in calculating the effective utility in Eq.(10) up to h periods, not for deriving the optimal action itself. The latter is done according to Eq.(11).
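To make the anticipated lane-change steering concrete, the sketch below evaluates the orientation formula Eq.(16), read here as ψ* = tan⁻¹(κ d / √v), and rolls a single vehicle forward for h = 15 periods of 0.2 s (a 3-second horizon) with steering δ = ψ* − ψ, as in Algorithm III. The state update is a simplified kinematic model standing in for Eq.(13), and the steering gain, the lane-center coordinates, the sign convention for d and the function names are hypothetical choices made only for illustration; the 31 m/s speed and κ = 0.15 are taken from the text.

```python
import math
from typing import List, Tuple

KAPPA = 0.15       # empirically tuned parameter quoted in the text
DT = 0.2           # seconds per period: 3 s horizon / h = 15 periods
H = 15             # anticipation horizon in periods
STEER_GAIN = 0.5   # hypothetical: fraction of the heading error corrected per period


def stanley_orientation(d: float, v: float, kappa: float = KAPPA) -> float:
    """Reference orientation psi* = atan(kappa * d / sqrt(v)), requires v > 0."""
    return math.atan(kappa * d / math.sqrt(v))


def anticipate_lane_change(x: float, y: float, v: float, psi: float,
                           y_target: float, alpha: float = 0.0,
                           h: int = H, dt: float = DT) -> List[Tuple[float, float]]:
    """Roll one vehicle forward h periods while it steers toward y_target.

    Assumed simplified kinematics (not the paper's Eq.(13)): the heading
    moves a fixed fraction of the way toward psi* each period, and the
    position advances along the updated heading.
    """
    path = [(x, y)]
    for _ in range(h):
        d = y_target - y                         # signed lateral offset (assumed convention)
        delta = stanley_orientation(d, v) - psi  # steering as in Algorithm III
        psi += STEER_GAIN * delta
        v += alpha * dt
        x += v * math.cos(psi) * dt
        y += v * math.sin(psi) * dt
        path.append((x, y))
    return path


# Hypothetical scenario: lane centers at y = -1.85 m and y = +1.85 m,
# constant speed (alpha = 0), vehicle initially aligned with the road.
path = anticipate_lane_change(x=0.0, y=-1.85, v=31.0, psi=0.0, y_target=1.85)
print(f"lateral offset left after {H} periods: {1.85 - path[-1][1]:.2f} m")
```

Under these assumptions the anticipated path closes most of the lateral gap within the 3-second horizon, which is all the anticipation step needs: a plausible future path against which the effective utility can be evaluated.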
For this reason, the precise details of the anticipation are not too critical.
Without an explicit negotiation mechanism in adaptiveSeek, in contrast with betaNash, the landscape of the effective utility function in Eq.(10) appears to be too complex even for Basin-Hopping to find the right global maximum in Eq.(11). Fortunately, the action space in adaptiveSeek is small enough that we can always resort to a brute-force grid search, which guarantees a good approximation to the global optimum. The solutions found in this way are depicted in Fig.5, with the upper panel for IC1 and the lower panel for IC2. Remarkably, these solutions are nearly identical to those found by betaNash in Fig.2, except for slightly more lateral swerving upward in the latter.
In Fig.6 we plot the optimal action sequences for both vehicles solved from adaptiveSeek under IC1 and IC2. These sequences look very close, at least semi-quantitatively, to those derived using betaNash in Fig.3. Of course, there is no absolute reason to expect that the action sequences derived by using betaNash and adaptiveSeek should be quantitatively close, even when the same utility functions are used. This is because these algorithms correspond to very different solution concepts based on very different assumptions. On the other hand, it is possible that solutions from one may be able to approximate the solution of the other by slightly tweaking the utility functions or their parameters and weights.
Even without doing the tweaking, the fact that the solutions already appear similar to one another is very reassuring. Furthermore, although the action sequences in Fig.6 look a little bit rougher than in Fig.3, the derived vehicle trajectories $(\tilde{x}^*_{i,t}, \tilde{y}^*_{i,t})$ are as smooth as in the case of betaNash, because the trajectories are obtained from the action sequences by integrating twice with a very small time interval. Also, a comparison between Fig.3 and Fig.6 shows that the merging maneuver is done (i.e. back to zero action) in about 5 to 6 seconds for betaNash, whereas at least 2 additional seconds are needed to achieve the same maneuver in adaptiveSeek. This in turn implies that the vehicles coordinate more efficiently in the solution from betaNash than in that from adaptiveSeek.
Figure 5: Vehicle trajectories $(\tilde{x}^*_{i,t}, \tilde{y}^*_{i,t})$, ∀ t ∈ {0, 1, ···}, in different merging behaviors under different initial conditions in adaptiveSeek: (a) Under IC1, the vehicle in the blocked lane (yellow dot or circle with a black frame) first accelerates longitudinally and then turns to the open lane, while the vehicle in the open lane (blue dot or square without a black frame) first yields by slowing down and then accelerates to catch up. (b) Under IC2, the vehicle in the open lane first accelerates and moves to the left slightly, while the vehicle in the blocked lane slows down first, and then turns to the open lane and accelerates to catch up. The number in the boxes represents the time in seconds, and the color represents the speed.
It is interesting to point out that, due to the use of grid search, adaptiveSeek is essentially a deterministic algorithm, in contrast with betaNash, given the original utility function, state evolution, initial condition, anticipation assumptions, and a search grid. Consequently, the situation of "multi-equilibrium" does not seem to arise in adaptiveSeek, a nice property at least for some cases.
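The brute-force grid search over the action space can be sketched as follows. The grid bounds and resolution, the function names, and the toy multi-modal function standing in for the effective utility of Eq.(10) are hypothetical; the point is only that an exhaustive scan over a fixed grid is insensitive to local maxima and returns the same action every time it is run.

```python
import math
from itertools import product
from typing import Callable, List, Tuple


def linspace(lo: float, hi: float, n: int) -> List[float]:
    """n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + k * step for k in range(n)]


def grid_search(utility: Callable[[float, float], float],
                alphas: List[float], deltas: List[float]) -> Tuple[float, float]:
    """Return the (acceleration, steering) grid point with the highest utility."""
    return max(product(alphas, deltas), key=lambda a: utility(a[0], a[1]))


def toy_effective_utility(alpha: float, delta: float) -> float:
    """Hypothetical stand-in for Eq.(10): one global peak plus ripples
    that create spurious local maxima along the acceleration axis."""
    ripple = 0.05 * math.cos(20.0 * (alpha - 0.5))
    return -(alpha - 0.5) ** 2 - 100.0 * (delta - 0.02) ** 2 + ripple


# Hypothetical action grid: acceleration in m/s^2, steering in rad.
ALPHAS = linspace(-3.0, 3.0, 25)
DELTAS = linspace(-0.1, 0.1, 21)

best_action = grid_search(toy_effective_utility, ALPHAS, DELTAS)
print("selected action (alpha, delta):", best_action)
```

Because the scan visits every grid point in a fixed order, repeated runs with the same utility, state and grid select the same action, which is the determinism noted above.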
In this paper we propose a systematic computational framework for modeling smart vehicles in a smart world at micro level based on game theory. Markov games with deterministic state evolution are exploited, thanks to the explicit inclusion of risk premium in the utility functions. The corresponding sub-game perfect Nash equilibrium is solved via best response dynamics as a specific form of self-play reinforcement learning. We then relax some of the less realistic assumptions associated with Nash equilibrium and develop a heuristics-based adaptive optimization method that allows us to obtain solutions that are close to the Nash equilibrium, while drastically reducing the computational burden. In this framework, inter-agent interaction is at the center of the modeling: all agents are treated equally, apart from the explicit heterogeneity in preference, intention and initial condition. We then illustrate how our approach works explicitly in a concrete example of two vehicles on a double-lane highway with an unexpected barrier. The front merge and rear merge behaviors endogenized by betaNash and adaptiveSeek are shown to be very similar to each other, and both appear reasonable and intuitive. Finally, our solutions appear to be reasonably robust against minor disturbances, whether in model parameters/hyperparameters and utility weights, or from adding small random noise.
Figure 6: Optimal action sequences for both vehicles solved by adaptiveSeek under IC1 (front merge) and IC2 (rear merge).
So far, we have been concentrating exclusively on the forward problem in this paper. Even along this line there are many applications that can be pursued immediately. The specific simulation experiment in Section 5 is essentially a mandatory lane change problem, deliberately chosen to be relatively simple so that all the details can be illustrated thoroughly.
But our framework is much broader and more powerful, and also covers the inverse problem and the mechanism design problem mentioned in the Introduction. The latter problems will be tackled and illustrated in our forthcoming work. For example, we show in [27] how to use traffic video data from the Sugiyama experiment [28] for behavioral calibration with explicit heterogeneity, using adaptiveSeek as the decision-making model for human drivers. Extending this line of effort, we have started to collect naturalistic driving data using a drone at an urban roundabout, and the calibration methodology outlined is being applied [43]. We demonstrate the flexibility of our approach in handling different types of agents by considering two-way traffic with a signed or unsigned crosswalk [42], where explicit interactions among vehicles, pedestrians, and stop signs are modeled. We further show in two other separate papers how game theory based coordination can be used to improve traffic flow at a single-lane roundabout [40], and how smart algorithms for a few system-controlled CAVs can be invoked to optimally tame stop-and-go shockwaves [34], taking full advantage of connectivity/autonomy and smart infrastructure. With adaptiveSeek being able to serve as the micro path planning algorithm, we can introduce the concept of an edge-centric automated traffic system for drive-by-wire vehicles, co-mingling with human-driven vehicles. This idea will be made explicit in a coordinated autonomous valet parking facility [41] that is much more efficient in avoiding gridlock in congested places.
Acknowledgement
We thank Gint Puskorius and Jinhong Wang for several useful discussions and for their comments on the manuscript. We are also grateful to Paul Stieg for his help in literature review.
References
[1] Christos G. Cassandras. Automating mobility in smart cities. Annual Reviews in Control, 44:1–8, 2017.
[2] Andrew Y Ng, Daishi Harada, and Stuart J Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the sixteenth International Conference on Machine Learning, pages 278–287. ACM, 1999.
[3] Je Hong Yoo and Reza Langari. Stackelberg game based model of highway driving. In ASME 2012 5th Annual Dynamic Systems and Control Conference joint with the JSME 2012 11th Motion and Vibration Conference, pages 499–508. American Society of Mechanical Engineers, 2012.
[4] Changwon Kim and Reza Langari. Game theory based autonomous vehicles operation. International Journal of Vehicle Design, 65(4):360–383, 2014.
[5] Hongtao Yu, H. E. Tseng, and Reza Langari. A human-like game theory-based controller for an automatic lane changing. In Transportation Research Part C, volume 88, pages 140–158. Elsevier, 2018.
[6] Qingyu Zhang, Dimitar Filev, H. E. Tseng, Steve Szwabowski, and Reza Langari. Addressing mandatory lane change problem with game theoretic model predictive control and fuzzy markov chain. In , pages 4764–4771. IEEE, 2018.
[7] Qingyu Zhang, Reza Langari, H. E. Tseng, Dimitar Filev, Steve Szwabowski, and Serdar Coskun. A game theoretic model predictive controller with aggressiveness estimation for mandatory lane change. In IEEE Transactions on Intelligent Vehicles, volume 5, pages 75–89. IEEE, 2020.
[8] Dave W Oyler, Yildiray Yildiz, Anouck R Girard, Nan I Li, and Ilya V Kolmanovsky. A game theoretical model of traffic with multiple interacting drivers for use in autonomous vehicle development. In , pages 1705–1710. IEEE, 2016.
[9] Ran Tian, Sisi Li, Nan Li, Ilya Kolmanovsky, Anouck Girard, and Yildiray Yildiz. Adaptive game-theoretic decision making for autonomous vehicle control at roundabouts. In , pages 321–326. IEEE, 2018.
[10] Nan Li, Yu Yao, Ilya Kolmanovsky, Ella Atkins, and Anouck Girard. Game-theoretic modeling of multi-vehicle interactions at uncontrolled intersections. arXiv preprint arXiv:1904.05423, 2019.
[11] Alireza Talebpour, Hani S Mahmassani, and Samer H Hamdar. Modeling lane-changing behavior in a connected environment: A game theory approach. Transportation Research Procedia, 7:420–440, 2015.
[12] Federal Highway Administration. Next generation simulation (NGSIM), 2006.
[13] Fanlin Meng, Jinya Su, Cunjia Liu, and Wen-Hua Chen. Dynamic decision making in lane change: Game theory with receding horizon. In , pages 1–6. IEEE, 2016.
[14] Meng Wang, Serge P Hoogendoorn, Winnie Daamen, Bart van Arem, and Riender Happee. Game theoretic approach for predictive lane-changing and car-following control. Transportation Research Part C: Emerging Technologies, 58:73–92, 2015.
[15] Kuang Huang, Xuan Di, Qiang Du, and Xi Chen. A game-theoretic framework for autonomous vehicles velocity control: Bridging microscopic differential games and macroscopic mean field games. arXiv preprint:1903.06053, 2019.
[16] Alexander G Cunningham, Enric Galceran, Ryan M Eustice, and Edwin Olson. MPDM: Multipolicy decision-making in dynamic, uncertain environments for autonomous driving. In , pages 1670–1677. IEEE, 2015.
[17] Enric Galceran, Alexander G Cunningham, Ryan M Eustice, and Edwin Olson. Multipolicy decision-making for autonomous driving via changepoint-based behavior prediction: Theory and experiment. Autonomous Robots, 41(6):1367–1382, 2017.
[18] Dhanvin Mehta, Gonzalo Ferrer, and Edwin Olson. C-MPDM: Continuously-parameterized risk aware mpdm by quickly discovering contextual policies. In , pages 7547–7554. IEEE, 2018.
[19] Kurt Dresner and Peter Stone. A multiagent approach to autonomous intersection management. Journal of artificial intelligence research, 31:591–656, 2008.
[20] Mohammed Elhenawy, Ahmed A Elbery, Abdallah A Hassan, and Hesham A Rakha. An intersection game-theory-based traffic control algorithm in a connected vehicle environment. In , pages 343–347. IEEE, 2015.
[21] Jishiyu Ding, Huile Xu, Jianming Hu, and Yi Zhang. Centralized cooperative intersection control under automated vehicle environment. In , pages 972–977. IEEE, 2017.
[22] Jackeline Rios-Torres and Andreas A Malikopoulos. Automated and cooperative vehicle merging at highway on-ramps. IEEE Transactions on Intelligent Transportation Systems, 18(4):780–789, 2017.
[23] Ziran Wang, BaekGyu Kim, Hiromitsu Kobayashi, Guoyuan Wu, and Matthew J Barth. Agent-based modeling and simulation of connected and automated vehicles using game engine: A cooperative on-ramp merging study. In Transportation Research Board 98th Annual Meeting, 2019.
[24] Christos Katrakazas, Mohammed Quddus, Wen-Hua Chen, and Lipika Deka. Real-time motion planning methods for autonomous on-road driving: State-of-the-art and future research directions. Transportation Research Part C: Emerging Technologies, 60:416–442, 2015.
[25] Wilko Schwarting, Alyssa Pierson, Javier Alonso-Mora, Sertac Karaman, and Daniela Rus. Social behavior for autonomous vehicles. Proceedings of National Academy of Sciences, 116(50):24972–24978, 2019.
[26] B. D. Ziebart, A Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In , pages 1433–1438. AAAI, 2008.
[27] Qi Dai, Di Shen, Jinhong Wang, Suzhou Huang, and Dimitar Filev. Calibration of human driving behaviors using traffic video data: State space model approach. in preparation, 2020.
[28] Yuki Sugiyama et al. Traffic jams without bottlenecks - experimental evidence for the physical mechanism of the formation of a jam. New Journal of Physics, 10:033001, 2008.
[29] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv: 1610.03295, 2016.
[30] Alex Kuefler, Jeremy Morton, Tim Wheeler, and Mykel Kochenderfer. Imitating driver behavior with generative adversarial networks. IEEE Intelligent Vehicles Symposium (IV), pages 204–211, 2017.
[31] Y Zhang, P Sun, Y Yin, and X Wang. Human-like autonomous vehicle speed control by deep reinforcement learning with double q learning. IEEE Intelligent Vehicles Symposium (IV), pages 1251–1256, 2018.
[32] C J Hoel, K Wolff, and L Laine. Automated speed and lane change decision making using deep reinforcement learning. The 21st IEEE International Conference on Intelligent Transportation Systems (ITSC), pages 2148–2155, 2018.
[33] S Nageshrao, H E Tseng, and D Filev. Autonomous highway driving using deep reinforcement learning. , pages 2326–2331, 2019.
[34] Di Shen, Qi Dai, Jinhong Wang, Xinlin Song, Xiaoxuan Chen, Suzhou Huang, and Dimitar Filev. Taming stop-and-go shockwaves in a CAV modulated traffic: A computational mechanism design approach. in preparation, 2020.
[35] Drew Fudenberg and Jean Tirole. Game Theory, chapter 13. MIT Press, Cambridge, MA, fifth edition, 1996.
[36] Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In Proceedings of the seventeenth International Conference on Machine Learning, pages 663–670. ACM, 2000.
[37] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first International Conference on Machine Learning, pages 1–8. ACM, 2004.
[38] Paul Milgrom and John Roberts. Rationalizability, learning, and equilibrium in games with strategic complementarities. Econometrica: Journal of the Econometric Society, pages 1255–1277, 1990.
[39] David Silver et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
[40] Jinhong Wang, Qi Dai, Suzhou Huang, and Dimitar Filev. Improving traffic efficiency and safety using game theory based coordination. in preparation, 2020.
[41] Jinhong Wang, Xinlin Song, Qi Dai, Di Shen, Xiaoxuan Chen, Suzhou Huang, and Dimitar Filev. Coordinated autonomous valet parking: An edge-based facility for dbw vehicles. in preparation, 2020.
[42] Qi Dai, Wen Guo, Xunnong Xu, Suzhou Huang, and Dimitar Filev. Modeling vehicle-pedestrian interactions in a two-way crosswalk. in preparation, 2020.
[43] Xiaoxuan Chen, Xinlin Song, Jinhong Wang, Qi Dai, Di Shen, Suzhou Huang, and Dimitar Filev. Characterizing human driving behaviors using drone captured traffic video data at an urban roundabout. in preparation, 2020.
[44] Dimitri P Bertsekas. Dynamic Programming and Optimal Control, volume I, chapter 6.1. Athena Scientific, 1995.
[45] Robert Shumway and David Stoffer. Time Series Analysis and Its Applications, chapter 6. Springer, New York, NY, third edition, 2011.
[46] Rajesh Rajamani. Vehicle Dynamics and Control, chapter 2.2. Springer Science & Business Media, 2006.
[47] Sebastian Thrun et al. Stanley: The robot that won the darpa grand challenge. Journal of Field Robotics, 23(9):661–692, 2006.