A Survey of Deep RL and IL for Autonomous Driving Policy Learning
Zeyu Zhu, Member, IEEE, and Huijing Zhao, Senior Member, IEEE
Abstract—Autonomous driving (AD) agents generate driving policies based on online perception results, which are obtained at multiple levels of abstraction, e.g., behavior planning, motion planning and control. Driving policies are crucial to the realization of safe, efficient and harmonious driving behaviors, where AD agents still face substantial challenges in complex scenarios. Owing to their successful application in fields such as robotics and video games, the use of deep reinforcement learning (DRL) and deep imitation learning (DIL) techniques to derive AD policies has attracted vast research efforts in recent years. This paper is a comprehensive survey of this body of work, which is conducted at three levels: First, a taxonomy of the literature is constructed from the system perspective, among which five modes of integration of DRL/DIL models into an AD architecture are identified. Second, the formulations of DRL/DIL models for conducting specified AD tasks are comprehensively reviewed, where various designs of the model state and action spaces and the reinforcement learning rewards are covered. Finally, an in-depth review is conducted on how the critical issues of AD applications regarding driving safety, interaction with other traffic participants and uncertainty of the environment are addressed by the DRL/DIL models. To the best of our knowledge, this is the first survey to focus on AD policy learning using DRL/DIL that addresses the topic simultaneously from the system, task-driven and problem-driven perspectives. We share and discuss findings, which may lead to the investigation of various topics in the future.
Index Terms—deep reinforcement learning, deep imitation learning, autonomous driving policy
I. INTRODUCTION

AUTONOMOUS driving (AD) has received extensive attention in recent decades [1–4] and could be a promising solution for improving road safety [5], traffic flow [6] and fuel economy [7], among other factors. A typical architecture of an AD system is illustrated in Fig. 1, which is composed of perception, planning and control modules. An AD agent generates driving policies based on online perception results, which are obtained at multiple levels of abstraction, e.g., behavior planning, motion planning and control. The earliest autonomous vehicles can be dated back to [8–10]. One milestone was the Defense Advanced Research Projects Agency (DARPA) Grand Challenges [11, 12]. Recent years have witnessed a huge boost in AD research, and many products and prototyping systems have been developed. Despite the fast development of the field, AD still faces substantial challenges in complex scenarios for the realization of safe, efficient and harmonious driving behaviors [17, 18].

This work is supported in part by the National Natural Science Foundation of China under Grant 61973004. The authors are with the Key Laboratory of Machine Perception (MOE) and the School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China. Correspondence: H. Zhao, [email protected].
Fig. 1. Architecture of autonomous driving systems, a general abstraction based on [13–16]. Sensors (GPS/IMU, LiDAR, camera) feed the perception module; route planning, behavior planning, motion planning and control then produce, respectively, ① sequences of route points through the road network, ② behavior decisions (lane change, car follow, stop, etc.), ③ planned trajectories or paths, and ④ steer, throttle and brake commands sent to the actuators (gas/brake, steer).
Reinforcement learning (RL) is a principled mathematical framework for solving sequential decision making problems [19–21]. Imitation learning (IL), which is closely related, refers to learning from expert demonstrations. However, the early methods of both were limited to relatively low-dimensional problems. The rise of deep learning (DL) techniques [22, 23] in recent years has provided powerful solutions to this problem through the appealing properties of deep neural networks (DNNs): function approximation and representation learning. DL techniques enable the scaling of RL/IL to previously intractable problems (e.g., high-dimensional state spaces), and deep RL/IL has increased in popularity for complex locomotion [24], robotics [25] and autonomous driving [26–28] tasks. Unless otherwise stated, this survey focuses on Deep RL (DRL) and Deep IL (DIL).

A large variety of DRL/DIL models have been developed for learning AD policies, which are reviewed in this paper. Several surveys are relevant to this study. [13, 15] survey the motion planning and control methods of automated vehicles before the era of DL. [29–33] review general DRL/DIL methods without considering any particular applications. [4] addresses the deep learning techniques for AD with a focus on perception and control, while [34] addresses control only. [35] provides a taxonomy of AD tasks to which DRL models have been applied and highlights the key challenges. However, none of these studies answers the following questions:
How can DRL/DIL models be integrated into AD systems from the perspective of system architecture? How can they be formulated to accomplish specified AD tasks? How can methods be designed that address the challenging issues of AD, such as safety, interaction with other traffic participants, and uncertainty of the environment?

This study seeks answers to the above questions. To the best of our knowledge, this is the first survey to focus on AD policy learning using DRL/DIL, which is addressed from the system, task-driven and problem-driven perspectives. Our contributions are threefold:
• A taxonomy of the literature is presented from the system perspective, from which five modes of integration of DRL/DIL models into an AD architecture are identified. The studies on each mode are reviewed, and the architectures are compared. It is found that the vast research efforts have focused mainly on exploring the potential of DRL/DIL in accomplishing AD tasks, while intensive studies are needed on the optimization of the architectures of DRL/DIL-embedded systems toward real-world applications.
• The formulations of DRL/DIL models for accomplishing specified AD tasks are comprehensively reviewed, where various designs of the model state and action spaces and the reinforcement learning rewards are covered. It is found that these formulations rely heavily on empirical designs, which are brute-force approaches and lack theoretical support. Changing the designs or tuning the parameters could result in substantially different driving policies, which may pose large challenges to the AD system's stability and robustness in real-world deployment.
• The critical issues of AD applications that are addressed by the DRL/DIL models regarding driving safety, interaction with other traffic participants and uncertainty of the environment are comprehensively discussed. It is found that driving safety has been well studied. A typical strategy is to combine with traditional methods to ensure a DRL/DIL agent's functional safety; however, striking a balance between optimal policies and hard constraints remains non-trivial. The studies on the latter two issues are still highly preliminary, in which the problems have been addressed from divergent perspectives and the studies have not been conducted systematically.

The remainder of this paper is organized as follows: Sections II and III briefly introduce (D)RL and (D)IL, respectively. Section IV reviews the research on DRL/DIL in AD from a system architecture perspective. Section V reviews the task-driven methods and examines the formulations of DRL/DIL models for completing specified AD tasks. Section VI reviews problem-driven methods, in which specified autonomous vehicle problems and challenges are addressed. Section VII discusses the remaining challenges, and the conclusions of the survey are presented in Section VIII.
II. PRELIMINARIES OF REINFORCEMENT LEARNING
A. Problem Formulation
Reinforcement learning (RL) is a principled mathematical framework that is based upon the paradigm of trial-and-error learning, where an agent interacts with its environment through a trade-off between exploitation and exploration [36–38].
Markov decision processes (MDPs) are a mathematically idealized form of RL [21], which are represented as (S, A, P, R, γ), where S and A denote the sets of states and actions, respectively, and P(s_{t+1} | s_t, a_t): S × A × S → [0, 1] is the transition/dynamics function that maps state-action pairs onto a distribution over next-step states. The numerical immediate reward function R(s_t, a_t, s_{t+1}): S × A × S → ℝ serves as a learning signal. A discount factor γ ∈ [0, 1] determines the present value of future rewards (lower values encourage more myopic behaviors). MDPs' states satisfy the Markov property [21], namely, future states depend only on the immediately preceding states and actions.
Partially observable MDPs (POMDPs) extend MDPs to problems in which access to fully observable Markov-property states is not available. A POMDP has an observation set Ω and an observation function O, where O(a_t, s_{t+1}, o_{t+1}) = p(o_{t+1} | a_t, s_{t+1}) is the probability of observing o_{t+1} given that the agent has executed action a_t and reached state s_{t+1} [39]. For the theory and algorithms of POMDPs, we refer readers to [39, 40].

The MDP agent selects an action a_t ∈ A at each time step t based on the current state s_t, receives a numerical reward r_{t+1} and visits a new state s_{t+1}. The generated sequence {s_0, a_0, r_1, s_1, a_1, r_2, ...} is called a rollout or trajectory. The expected cumulative reward in the future, namely, the expected discounted return G_t after time step t, is defined as [21]:

G_t ≐ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{T} γ^k r_{t+k+1}    (1)

where T is a finite value or ∞ for finite- and infinite-horizon problems, respectively. The policy π(a|s) maps states to probabilities of selecting each possible action. The value function v_π(s) is the expected return following π from state s:

v_π(s) ≐ E_π[G_t | s_t = s]    (2)

Similarly, the action-value function q_π(s, a) is defined as:

q_π(s, a) ≐ E_π[G_t | s_t = s, a_t = a]    (3)

which satisfies the recursive Bellman equation [41]:

q_π(s_t, a_t) = E_{s_{t+1}}[r_{t+1} + γ q_π(s_{t+1}, π(s_{t+1}))]    (4)

The objective of RL is to identify the optimal policy that maximizes the expected return, π* = argmax_π E_π[G_t | s_t = s]. The methods can be divided into three groups, as shown in Fig. 2 (a).
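To make Eqns. 1–2 concrete, the following sketch (a minimal NumPy illustration; the rollout format is an assumption, not taken from any surveyed work) computes the discounted return of a finite-horizon rollout and a Monte Carlo estimate of v_π(s) by averaging the returns of rollouts that start from the same state.

import numpy as np

def discounted_return(rewards, gamma):
    # G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...   (Eqn. 1),
    # with rewards[k] holding r_{t+k+1}.
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def monte_carlo_value(rollouts, gamma):
    # Monte Carlo estimate of v_pi(s) (Eqn. 2): average the returns of
    # several rollouts that all start from the same state s.
    return np.mean([discounted_return(r, gamma) for r in rollouts])

# Example: reward sequences of three rollouts collected from state s under pi.
rollouts = [[1.0, 0.0, -1.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(monte_carlo_value(rollouts, gamma=0.9))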
B. Value-based Methods

To solve a reinforcement learning problem, one can identify an optimal action-value function and recover the optimal policy from the learned state-action values:

q_{π*}(s, a) = max_π q_π(s, a)    (5)

π*(s) = argmax_a q_{π*}(s, a)    (6)

Q-learning [42] is one of the most popular methods, which estimates Q values through temporal difference (TD):

q_π(s_t, a_t) ← q_π(s_t, a_t) + α (Y − q_π(s_t, a_t))    (7)
where Y = r_{t+1} + γ max_{a_{t+1} ∈ A} q_π(s_{t+1}, a_{t+1}) is the temporal-difference target and α is the learning rate. This can be regarded as a standard regression problem in which the error to be minimized is Y − q_π(s_t, a_t). Q-learning is off-policy since it updates q_π based on experiences that are not necessarily generated by the derived policy, while SARSA [43] is an on-policy algorithm that uses the derived policy to generate experiences. Another distinction is that SARSA uses the target Y = r_{t+1} + γ q_π(s_{t+1}, a_{t+1}). In contrast to TD methods, Monte Carlo methods estimate the expected return by averaging the results of multiple rollouts, which can be applied to non-Markovian episodic environments. TD and Monte Carlo have been further combined in TD(λ) [21].

The early methods [21, 42, 43] of this group rely on tabular representations. A major problem is the "curse of dimensionality" [44], namely, an increase in the number of state features results in exponential growth of the number of state-action pairs that must be stored. Recent methods use DNNs to approximate a parameterized value function q(s, a; ω), of which Deep Q-networks (DQNs) [45] are the most representative; they learn the values by minimizing the following loss function:

J(ω) = E_t[(Y − q(s_t, a_t; ω))²]    (8)

where Y = r_{t+1} + γ max q(s_{t+1}, a_{t+1}; ω⁻) is the target and ω⁻ denotes the parameters of the target network. The parameters ω are learned by following the gradients:

ω ← ω + α E_t[(Y − q(s_t, a_t; ω)) ∇_ω q(s_t, a_t; ω)]    (9)

The major contributions of DQN are the introduction of the target network and experience replay. To avoid rapidly fluctuating Q values and to stabilize the training, the target network is fixed for a specified number of iterations during the update of the primary Q-network and is subsequently updated to match the primary Q-network. Moreover, experience replay [46], which maintains a memory that stores transitions (s_t, a_t, s_{t+1}, r_{t+1}), increases the sample efficiency. A later study improves upon uniformly sampled experience replay by introducing priority [47]. Continuous DQN (cDQN) [48] derives a continuous variant of DQN based on normalized advantage functions (NAFs). Double DQN (DDQN) [49] addresses the overestimation problem of DQN through the use of a double estimator. Dueling-DQN [50] introduces a dueling architecture in which both the value function and the associated advantage function are estimated. QR-DQN [51] utilizes distributional reinforcement learning to learn the full value distribution rather than only the expectation.

Fig. 2. A taxonomy of the general methods of reinforcement learning (RL) and imitation learning (IL): (a) value-based, policy-based and actor-critic RL methods; (b) BC, IRL and GAIL imitation learning methods.
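As a concrete illustration of Eqns. 8–9, the sketch below (a minimal PyTorch example with assumed network and mini-batch interfaces, not the implementation of any surveyed work) computes the DQN loss against a frozen target network.

import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma):
    # batch: states [B, d], actions [B] (int64), rewards [B],
    # next states [B, d], done flags [B] (float).
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # q(s_t, a_t; w)
    with torch.no_grad():                                     # target uses frozen w^-
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, y)                    # Eqn. 8

# Every fixed number of updates of the primary network, synchronize the target:
# target_net.load_state_dict(q_net.state_dict())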
C. Policy-based Methods

Alternatively, one can directly search for and optimize a parameterized policy π_θ to maximize the expected return:

max_θ J(θ) = max_θ v_{π_θ}(s_0) = max_θ E_{π_θ}[G_t | s_t = s_0]    (10)

where θ denotes the policy parameters, which can be optimized based on the policy gradient theorem [21]:

∇J(θ) ∝ Σ_s μ(s) Σ_a q_π(s, a) ∇π_θ(a|s) = E_π[Σ_a q_π(s_t, a) ∇π_θ(a|s_t)] = E_π[G_t ∇ ln π_θ(a_t|s_t)]    (11)

where μ(s) denotes the state visitation frequency. REINFORCE [52] is a straightforward Monte Carlo policy-based method that uses the discounted return G_t obtained by following the policy π_θ to estimate the policy gradient in Eqn. 11. The parameters are updated as follows [21]:

θ ← θ + α G_t ∇ ln π_θ(a_t|s_t)    (12)

This update intuitively increases the log probability of actions that lead to higher returns. Since empirical returns are used, the resulting gradients suffer from high variances. A common technique for reducing the variance and accelerating the learning is to replace G_t in Eqns. 11 and 12 by G_t − b(s_t) [21, 52], where b(s_t) is a baseline. Alternatively, G_t can be replaced by the advantage function [53, 54] A_{π_θ}(s, a) = q_{π_θ}(s, a) − v_{π_θ}(s).

One problem of policy-based methods is that poor gradient updates may result in newly updated policies that deviate wildly from previous policies, which may decrease the policy performance. Trust region policy optimization (TRPO) [55] solves this problem through optimization of a surrogate objective function, which guarantees the monotonic improvement of policy performance. Each policy gradient update is constrained by using a quadratic approximation of the Kullback-Leibler (KL) divergence between policies. Proximal policy optimization (PPO) [56] improves upon TRPO through the application of an adaptive penalty on the KL divergence and a heuristic clipped surrogate objective function. The requirement for only a first-order gradient also renders PPO easier to implement than TRPO.
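The REINFORCE update of Eqn. 12, including the baseline subtraction mentioned above, can be sketched as follows (a minimal PyTorch example for a discrete-action policy; the rollout format and a constant baseline are simplifying assumptions).

import torch

def reinforce_update(policy_net, optimizer, rollout, gamma, baseline=0.0):
    # rollout: list of (state, action, reward) for one episode, where state is
    # an unbatched feature tensor and action an integer index. The update
    # performs theta <- theta + alpha * (G_t - b) * grad log pi(a_t | s_t)
    # (Eqn. 12) by descending the negative objective.
    returns, g = [], 0.0
    for _, _, r in reversed(rollout):           # accumulate G_t backwards in time
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    loss = 0.0
    for (s, a, _), g_t in zip(rollout, returns):
        log_prob = torch.log_softmax(policy_net(s), dim=-1)[a]
        loss = loss - (g_t - baseline) * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()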
D. Actor-Critic Methods

Actor-critic methods have the advantages of both value-based and policy-based methods, where a neural network actor π_θ(a|s) selects actions and a neural network critic q(s, a; ω) or v(s; ω) estimates the values. The actor and critic are typically updated alternately according to Eqn. 11 and Eqn. 8, respectively. Deterministic policy gradient (DPG) [57] is an off-policy actor-critic algorithm that derives deterministic policies. Compared with stochastic policies, DPG only integrates over the state space and requires fewer samples in problems with large action spaces. Deep deterministic policy gradient (DDPG) [24] utilizes DNNs to operate on high-dimensional state spaces with experience replay and a separate actor-critic target network, which is similar to DQN. Exploitation of parallel computation is an alternative to experience replay. Asynchronous advantage actor-critic (A3C) [58] uses advantage estimates rather than discounted returns in the actor-critic framework and asynchronously updates policy and value networks on multiple parallel threads of the environment. The parallel independent environments stabilize the learning process and enable more exploration. Advantage actor-critic (A2C) [59], which is the synchronous version of A3C, uses a single agent for simplicity or waits for each agent to finish its experience to collect multiple trajectories. Soft actor-critic (SAC) [60] benefits from the addition of an entropy term to the reward function to encourage better exploration.

III. PRELIMINARIES OF IMITATION LEARNING
A. Problem Formulation
Imitation learning possesses a simpler form and is based on learning from demonstrations (LfD) [61]. It is attractive for AD applications, where interaction with the real environment could be dangerous and vast amounts of human driving data can be easily collected [62]. A demonstration dataset D = {ξ_i}_{i=0..N} contains a set of trajectories, where each trajectory ξ_i = {(s^i_t, a^i_t)}_{t=0..T} is a sequence of state-action pairs, and action a^i_t ∈ A is taken by the expert at state s^i_t ∈ S under the guidance of an underlying expert policy π_E [32]. Using the collected dataset D, a common optimization-based strategy of imitation learning is to learn a policy π*: S → A that mimics the expert's policy π_E by satisfying [33]

π* = argmin_π D(π_E, π)    (13)

where D is a similarity measure between policies. The methods for solving the problem can be divided into three groups, as shown in Fig. 2 (b), which are reviewed below.

B. Behavior Cloning
Behavior cloning (BC) formulates the problem as a supervised learning process with the objective of matching the learned policy π_θ to the expert policy π_E:

min_θ E‖π_θ − π_E‖    (14)

which is typically realized by minimizing the L2 loss:

J(θ) = E_{(s,a)∼D}[(π_θ(s) − a)²]    (15)

Early research on imitation learning can be dated back to the ALVINN system [10], which used a 3-layer neural network to perform road following based on front camera images. In the most recent decade, deep imitation learning (DIL) has been conducted using DNNs as policy function approximators and has realized success in end-to-end AD systems [63–65]. BC performs well for states that are covered by the training distribution but generalizes poorly to new states due to compounding errors in the actions, which is also referred to as covariate shift [66, 67]. To overcome this problem, Ross et al. [68] proposed DAgger, which improves upon supervised learning by using a primary policy to collect training examples while running a reference policy simultaneously. In each iteration, states that are visited by the primary policy are also sent to the reference policy to output expert actions, and the newly generated demonstrations are aggregated into the training dataset. SafeDAgger [69] extends DAgger by introducing a safe policy that learns to predict the error that is made by a primary policy without querying the reference policy. In addition to dataset aggregation, data augmentation techniques such as the addition of random noise to the expert action have also been commonly used in DIL [70, 71].
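A behavior-cloning training step that minimizes the L2 loss of Eqn. 15 can be sketched as follows (a minimal PyTorch example; the feature dimensions and the two-output control head, e.g., steer and throttle, are placeholders rather than the architecture of any surveyed system).

import torch
import torch.nn as nn

# Hypothetical policy: maps a 64-dimensional perception feature vector to
# two continuous control outputs (e.g., steer and throttle).
policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_step(states, expert_actions):
    # One supervised step on a demonstration mini-batch:
    # J(theta) = E_{(s,a)~D} [ (pi_theta(s) - a)^2 ]   (Eqn. 15)
    loss = nn.functional.mse_loss(policy(states), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()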
C. Inverse Reinforcement Learning

The inverse reinforcement learning (IRL) problem, which was first formalized in the study of Ng et al. [72], is to identify a reward function r_θ for which the expert behavior is optimal:

max_θ E_{π_E}[G_t | r_θ] − E_π[G_t | r_θ]    (16)

Early studies utilized linear function approximation of reward functions and identified the optimal reward via maximum margin approaches [73, 74]. By introducing the maximum entropy principle, Ziebart et al. [75] eliminated the reward ambiguity that arises when multiple rewards may explain the expert behavior. The reward function is learned by maximizing the posterior probability of observing the expert trajectories:

J(θ) = E_{ξ_i∼D}[log P(ξ_i | r_θ)]    (17)

where the probability of a trajectory satisfies P(ξ_i | r_θ) ∝ exp(r_θ(ξ_i)). Several studies have extended the reward functions to nonlinear formulations through Gaussian processes [76] or boosting [77, 78]. However, the above methods generally operate on low-dimensional features. The use of rich and expressive function approximators, in the form of neural networks, has been proposed to learn reward functions directly on raw high-dimensional state representations [79, 80].

A problem that is encountered with IRL is that, to evaluate the reward function, a forward reinforcement learning process must be conducted to obtain the corresponding policy, thereby rendering IRL inefficient and computationally expensive. Many early approaches require solving an MDP in the inner loop of each iterative optimization [20, 72, 74, 75]. Perfect knowledge of the system dynamics and an efficient offline solver are needed in these methods, thereby limiting their applications in complex real-world scenarios such as robotic control. Finn et al. [80] proposed guided cost learning (GCL), which handles unknown dynamics in high-dimensional complex systems and learns complex neural network cost functions through an efficient sample-based approximation.
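A sample-based approximation of the maximum-entropy IRL objective in Eqn. 17 can be sketched as follows (a simplified PyTorch illustration in the spirit of guided cost learning [80]; the reward network and trajectory format are assumptions, and the importance weights used by the full method are omitted for brevity).

import torch

def maxent_irl_loss(reward_net, expert_trajs, sampled_trajs):
    # Negative log-likelihood of expert trajectories under
    # P(xi | r_theta) ∝ exp(r_theta(xi))   (Eqn. 17).
    # Each trajectory is a tensor of per-step features; reward_net returns a
    # per-step reward, and the trajectory reward is the sum over steps.
    expert_returns = torch.stack([reward_net(x).sum() for x in expert_trajs])
    sampled_returns = torch.stack([reward_net(x).sum() for x in sampled_trajs])
    # The partition function Z is estimated from sampled (background) trajectories.
    log_z = torch.logsumexp(sampled_returns, dim=0) - torch.log(
        torch.tensor(float(len(sampled_trajs))))
    return -(expert_returns.mean() - log_z)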
D. Generative Adversarial Imitation Learning

Generative adversarial imitation learning (GAIL) [81] directly learns a policy from expert demonstrations while requiring neither the reward design of RL nor the expensive RL process in the inner loop of IRL. GAIL establishes an analogy between imitation learning and generative adversarial networks (GANs) [82]. The generator π_θ serves as a policy for imitating expert behavior by matching the state-action distribution of the demonstrations, while the discriminator D_ω(s, a) ∈ (0, 1) serves as a surrogate reward function for measuring the similarity between generated and demonstration samples. The objective function of GAIL is formulated in the following min-max form:

min_{π_θ} max_{D_ω} E_{π_θ}[log D_ω(s, a)] + E_{π_E}[log(1 − D_ω(s, a))] − λ H(π_θ)    (18)

where H(π) is a regularization entropy term. The generator and the discriminator are updated with the following gradients:

∇_θ J(θ) = E_π[∇_θ log π_θ(a|s) Q(s, a)] − λ ∇_θ H(π_θ)
∇_ω J(ω) = E_π[∇_ω log D_ω(s, a)] + E_{π_E}[∇_ω log(1 − D_ω(s, a))]    (19)

Finn et al. [83] mathematically proved the connection among GANs, IRL and energy-based models. Fu et al. [84] proposed adversarial inverse reinforcement learning (AIRL) based on an adversarial reward learning formulation, which can recover reward functions that are robust to dynamics changes.
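One GAIL iteration alternating the two updates of Eqn. 19 can be sketched as follows (a minimal PyTorch illustration; the discriminator interface and data format are assumptions, and in practice the generator is usually updated with TRPO or PPO using the surrogate reward below).

import torch
import torch.nn as nn

def discriminator_step(disc, disc_opt, expert_sa, policy_sa):
    # Update D_omega to separate generated from expert state-action pairs
    # (second line of Eqn. 19); disc outputs D_omega(s, a) in (0, 1).
    bce = nn.functional.binary_cross_entropy
    loss = bce(disc(policy_sa), torch.ones(len(policy_sa), 1)) + \
           bce(disc(expert_sa), torch.zeros(len(expert_sa), 1))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()

def surrogate_reward(disc, policy_sa):
    # Reward signal for the generator pi_theta: large when the discriminator
    # cannot tell a generated pair from an expert pair. The policy is then
    # improved with any policy-gradient method (first line of Eqn. 19).
    with torch.no_grad():
        return -torch.log(disc(policy_sa) + 1e-8)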
IV. ARCHITECTURES OF DRL/DIL INTEGRATED AD SYSTEMS
AD systems have been studied for decades [13–16] and are commonly composed of modular pipelines, as illustrated in Fig. 1. How can DRL/DIL models be integrated into an AD system and collaborate with other modules? This section reviews the literature from the system architecture perspective, from which five modes are identified, as illustrated in Fig. 3. A classification of the studies in each mode is presented in Table I, along with the exploited DRL/DIL methods, the upstream module for perception, the targeted AD tasks, and the advantages and disadvantages of the architectures, among other information. Below, we detail each mode of the architectures, which is followed by a comparison of the number of studies that correspond to each mode or to the use of DRL or DIL methods.
A. Mode 1. DRL/DIL Integrated Control
Many studies have applied DRL/DIL to control, which can be abstracted as the architecture of Mode 1, as illustrated in Fig. 3. Bojarski et al. [63] proposed a well-known end-to-end self-driving control framework. They trained a nine-layer CNN by supervised learning to learn the steering policy without explicit manual decomposition into sequential modules. However, their model only adapts to lane keeping. An alternative option is to feed traditional perception results into the DNN control module. Tang et al. [96] proposed the use of an environmental rasterization encoding, along with the relative distance, velocity and acceleration, as input to a two-branch neural network, which was trained via proximal policy optimization.

Although Mode 1 features a simple structure and adapts to a large variety of learning methods, it is limited to specified tasks; thus, it has difficulty addressing scenarios in which multiple driving skills that are conditioned on various situations are needed. Moreover, bypassing and ignoring the behavior planning or motion planning processes may weaken the model's interpretability and performance in complex tasks (e.g., urban navigation).
B. Mode 2. Extension of Mode 1 with High-level Command
As illustrated in Fig. 3, Mode 2 extends Mode 1 by considering the high-level planning output. The control module is composed of either a general model for all behaviors [104, 110] or a series of models for distinct behaviors [26, 70, 105, 106, 108]. Chen et al. [110] projected the detected surrounding vehicles and the routing onto a bird's-eye-view road map, which served as the input of a policy network. Liang et al. [26] built on conditional imitation learning (CIL) [70] and proposed the branched actor network, as illustrated in Fig. 3. These methods learn several control submodules for distinct behaviors. A gating control command from the high-level route planning and behavior planning modules is responsible for the selection of the corresponding control submodule.

Although Mode 2 mitigates the problems that are encountered with Mode 1, it has its own limitations. A general model may not be sufficient for capturing diverse behaviors. However, learning a model for each behavior increases the demand for training data. Moreover, Mode 2 may not be as computationally efficient as Mode 1 since it requires the high-level planning modules to be determined in advance to guide the control module.
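The command-conditioned branching used by these Mode 2 methods can be sketched as follows (a minimal PyTorch illustration of the general CIL-style idea [26, 70]; the encoder, feature sizes and command set are placeholders rather than the architectures of the cited works).

import torch
import torch.nn as nn

class BranchedPolicy(nn.Module):
    # A shared encoder with one control branch per high-level command
    # (e.g., follow-lane, turn-left, turn-right, go-straight); the command
    # from the upstream planning module gates which branch produces the output.
    def __init__(self, obs_dim=64, feat_dim=128, n_commands=4, n_controls=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        self.branches = nn.ModuleList(
            [nn.Linear(feat_dim, n_controls) for _ in range(n_commands)])

    def forward(self, obs, command):
        # obs: [B, obs_dim]; command: [B] int64 index of the high-level decision.
        feats = self.encoder(obs)
        outs = torch.stack([b(feats) for b in self.branches], dim=1)  # [B, C, n_controls]
        idx = command.view(-1, 1, 1).expand(-1, 1, outs.size(-1))
        return outs.gather(1, idx).squeeze(1)      # control of the commanded branch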
C. Mode 3. DRL/DIL Integrated Motion Planning
Mode 3 integrates DRL/DIL into the motion planning module, and its architecture is illustrated in Fig. 3. Utilizing the planning output (e.g., routes and driving decisions) from high-level modules, along with the current perception output, DNNs are trained to predict future trajectories or paths. DIL models [71, 111–113] are the mainstream choices for implementing this architecture. As illustrated in Fig. 3, Sun et al. [112] proposed training a neural network that imitates long-term MPC via Sampled-DAgger, where the policy network's input came from perception (obstacles, environment, and current states) and decision-making (driving decisions). Alternatively, Wulfmeier et al. [113] proposed projecting the LiDAR point cloud onto a grid map, which is sent to the DNN. The DNN is responsible for learning a cost function that guides the motion planning. The control part in Mode 3 typically utilizes traditional control techniques such as PID [71] or MPC [112]. One major disadvantage of Mode 3 is that, although DNN-planned trajectories can imitate human trajectories, their safety and feasibility cannot be guaranteed.
Fig. 3. Integration modes of DRL/DIL models into AD architectures (Modes 1–5), each shown with its module layout and representative examples from the literature (Bojarski et al., 2016; Tang et al., 2019; Liang et al., 2018; Chen et al., 2019; Sun et al., 2018; Yuan et al., 2019; Wulfmeier et al., 2017; Mirchevska et al., 2018; Rezaee et al., 2019; Shi et al., 2019).
TABLE I
A TAXONOMY OF THE LITERATURE BY THE INTEGRATION MODES OF DRL OR DIL MODELS INTO AD ARCHITECTURES
(Each entry lists: studies | methods | perception | tasks | remarks)

Mode 1. DRL/DIL Integrated Control
Advantages: features a simple structure and adapts to various learning methods.
Disadvantages: limited to specified tasks or skills; bypassing planning modules weakens the model's interpretability and capability.
[10], [85], [63], [64], [86], [87], [65], [88] | BC | D | road/lane following, urban driving | safety: [69]
[69], [89] | DAgger | D | road/lane following
[90], [91], [92], [93], [94], [95] | NQL, DQN | T | lane changing, traffic merging, imminent events, intersection | safety: [90]; interaction: [95]
[96], [97] | PPO | T | traffic merging, urban driving
[28], [98] | DDPG | D | road/lane following, imminent events
[99], [100], [101], [102], [103] | DDPG | T | road/lane following, lane changing, overtaking, imminent events

Mode 2. Extension of Mode 1 with High-level Command
Advantages: considers both high-level planning and perception; generates distinct control behaviors according to the high-level decisions.
Disadvantages: a single model may not capture sufficiently diverse behaviors; learning a model for each behavior increases the training cost and the demand for data.
[104] | BC | D | urban driving
[105], [70], [106] | CIL | D | urban navigation
[107], [108] | UAIL | D | urban navigation | uncertainty: [107], [108]
[109] | DDPG | T | parking
[26] | DDPG | D | urban navigation
[110] | DDQN/TD3/SAC | T | roundabout

Mode 3. DRL/DIL Integrated Motion Planning
Advantages: learns to imitate human trajectories; efficient forward prediction process.
Disadvantages: no guarantee on safety or feasibility.
[111], [71] | BC | T | urban driving | safety: [71]
[112] | DAgger | T | highway driving
[113] | MaxEnt DIRL | D | urban driving
[114] | DDQN/DQfD | T | valet parking | safety: [114]
[115] | SAC | T | traffic merge

Mode 4. DRL/DIL Integrated Behavior Planning
Advantages: the DNNs need only plan high-level behavioral actions.
Disadvantages: simple and few actions limit the control precision and the diversity of driving styles; complicated and too many actions increase the training cost.
[116] | AIRL | T | lane change
[117] | IRL | D | lane change
[118], [119], [120], [121], [122], [123], [124], [125] | DQN | T | lane change, intersection
[126], [127] | DQN | D | lane change
[128], [129] | Q-Learning | T | lane changing
[130], [131] | DDQN | T | lane changing, urban driving
[132] | DQfD | T | lane changing | safety: [132]
[133] | QR-DQN | D | highway driving

Mode 5. DRL/DIL Integrated Hierarchical Planning and Control
Advantages: simultaneously plans at various levels of abstraction.
Disadvantages: hierarchical DNNs increase the training cost and decrease the convergence speed.
[134], [135] | DQN | T | cruise control, lane changing
[136] | Hierarchical policy gradient | T | traffic light passing
[27] | DDPG | D | lane changing
[137] | DDPG | T | intersection
[138] | DDPG | T | urban driving

Notes: the perception column gives the type of upstream perception module, (D)eep learning method or (T)raditional method. For detailed information about the AD tasks, see Table III. Studies that also address the safety (Sec. VI-A), interaction (Sec. VI-B) and uncertainty (Sec. VI-C) problems are labelled in the remarks.
D. Mode 4. DRL/DIL Integrated Behavior Planning
Many studies focus on integrating DRL/DIL into the behavior planning module and deriving high-level driving policies. The corresponding architecture is presented in Fig. 3, where DNNs decide behavioral actions and the subsequent motion planning and control modules typically utilize traditional methods. Many studies build upon DQN and its variants [118–126]. As illustrated in Fig. 3, Yuan et al. [126] decomposed the action space into five typical behavioral decisions on a highway. Compared with DRL, studies that employ DIL to learn high-level policies are limited. A recent study by Wang et al. [116] proposed the use of augmented adversarial inverse reinforcement learning (AIRL) to learn human-like decision-making on highways, where the action space consists of all possible combinations of lateral and longitudinal decisions.

In Mode 4, the design of behavioral actions is nontrivial, and one must balance the training cost and the diversity of driving styles. Using only a few simple behavioral actions limits the control precision and the diversity of driving styles, whereas many sophisticated actions increase the training cost.
TABLE II
COMPARISON OF THE LITERATURE BY DRL/DIL INTEGRATION MODES

Rows: type of upstream perception (traditional / DNN), with subtotal and total rows. Columns: numbers and percentages of DRL and DIL papers from Table I for control (Modes 1&2), motion planning (Mode 3), behavior planning (Mode 4) and hierarchical planning and control (Mode 5). For control (Modes 1&2), the subtotals are 18 (52.9%) DRL and 16 (47.1%) DIL papers, 34 (53.1%) of all papers in total.
TABLE III
A TAXONOMY OF THE LITERATURE BY SCENARIOS AND AD TASKS

Urban — Intersection: learn to drive through intersections (while interacting and negotiating with other traffic participants). DRL: DQN [90, 119–121], cDQN [138], DDPG [137, 138]. DIL: —
Urban — Roundabout: learn to drive through roundabouts (while interacting and negotiating with other traffic participants). DRL: DDQN [110], TD3 [110], SAC [110]. DIL: Horizon GAIL [139]
Urban — Urban Navigation: learn to drive in urban environments to reach specified objectives by following global routes. DRL: DDPG [26]. DIL: BC [104], CIL [70, 105, 106], UAIL [107, 108]
Urban — Urban Driving: learn to drive in urban environments without specified objectives. DRL: DDQN [122], PPO [140], policy gradient [136]. DIL: BC [65, 111], DIRL [113]
Highway — Lane Change (LC): learn to decide and perform lane changes. DRL: DQN [91, 118], DDQN [141], DDPG [27]. DIL: Projection IRL [117], AIRL [116]
Highway — Lane Keep (LK): learn to drive while maintaining the current lane. DRL: DQN [102], DDPG [28]. DIL: BC [86, 87], SafeDAgger [69]
Highway — Cruise Control: learn a policy for adaptive cruise control. DRL: NQL [92], DQN [134], policy gradient [142], actor-critic [143, 144]. DIL: —
Highway — Traffic Merging: learn to merge into highways. DRL: DQN [93], PPO [96]. DIL: —
Highway — Highway Driving: learn a general policy for driving on a highway, which may include multiple behaviors such as LC and LK. DRL: DQN [126, 145], QR-DQN [133]. DIL: GAIL [146], PS-GAIL [147], MaxEnt IRL [148]
Others — Road Following: learn to simply follow one road. DRL: —. DIL: BC [10, 63, 64, 85], DAgger [89]
Others — Imminent Events: learn to avoid or mitigate imminent events such as collisions. DRL: DQN [94], DDPG [98]. DIL: RAIL [149]
E. Mode 5. DRL/DIL Integrated Hierarchical Planning and Control
As illustrated in Fig. 3, Mode 5 blurs the lines between planning and control, where either a single DNN outputs hierarchical actions [27] based on a parameterized action space [151] or hierarchical DNNs output actions at multiple levels [134–136]. Rezaee et al. [134] proposed an architecture, illustrated in Fig. 3, in which BP (behavior planning) is used to make high-level decisions regarding transitions between discrete states and MoP (motion planning) generates a target trajectory according to the decisions made by BP. Qiao et al. [137] built upon the hierarchical MDP (HMDP) and realized their model through an options framework. In their implementation, the hierarchical options were modeled as high-level decisions (SlowForward or Forward). Based on the high-level decisions and current observations, low-level control policies were derived.

Mode 5 simultaneously plans at multiple levels, and the low-level planning process considers the high-level decisions. However, the use of hierarchical DNNs increases the training cost and potentially decreases the convergence speed, since one poorly trained high-level DNN may mislead and disturb the learning process of the low-level DNNs.
F. Statistical Comparison
A statistical comparison among the modes in terms of the number of studies is presented in Table II. Current studies on architectures are premature, unsystematic and unbalanced:
• Most studies focus on integrating DRL/DIL into control (Modes 1&2), followed by behavior planning (Mode 4).
• For Modes 1&2, DRL methods mainly adopt traditional method-based perception, while DNN-based perception is preferred in DIL methods.
• DRL seems to be more popular for high-level decision making (Modes 4&5), while DIL is chosen more frequently for low-level control (Modes 1&2).
Future studies may address the imbalance problem and identify potential new architectures.
TABLE IV
INPUTS OF THE DRL/DIL METHODS FOR AD TASKS

Ego Vehicle — Position Information: ego position. DRL: [90, 91, 93, 110, 120, 122, 136]. DIL: [111, 146, 147, 149]
Ego Vehicle — Heading Information: heading angle, orientation, steering, yaw, and yaw rate. DRL: [28, 91, 93, 119, 121, 138]. DIL: [64, 65, 139, 146, 147, 149]
Ego Vehicle — Speed Information: speed/velocity and acceleration. DRL: [26, 90, 91, 118–122, 136–138, 141], [28, 93, 94, 98, 102, 134, 144, 150]. DIL: [65, 70, 105–108, 117, 139], [89, 146–149]
Road Environment — Pixel Data: camera RGB images (DRL: [26–28, 126, 133, 145]; DIL: [65, 69, 70, 86, 87, 104–108], [10, 19, 63, 64, 85, 89]); semantically segmented images (DRL: [98]; DIL: —); 2D bird's-eye-view images (DRL: [96]; DIL: —)
Road Environment — Point Data: LiDAR sensor readings (DRL: [126, 133, 145]; DIL: [139, 146, 147, 149]); 2D LiDAR grid map (DRL: —; DIL: [113])
Road Environment — Object Data: other road users' information (relative speed, position, and distance to ego) (DRL: [110, 118–122, 138, 141], [92–94, 96, 134, 142–144]; DIL: [111, 116, 147, 149]); lane/road information (ego vehicle's distance to lane markings, road curvature, and lane width) (DRL: [91, 93, 102, 118, 122, 134, 137, 141]; DIL: [116, 146, 147, 149])
Task — Navigation Information: navigational driving commands or planned routes. DRL: [26, 110]. DIL: [70, 104–106, 111]
Task — Destination Information: destination position, distance or angle to destination. DRL: —. DIL: [139]
Task — Traffic Rule Information: traffic lights' state, speed limit, and desired speed. DRL: [136]. DIL: [111, 148]
Prior Knowledge — Road Map: 2D top-down road map images. DRL: [110]. DIL: [111]
Fig. 4. The percentages of the literature that uses certain data as input, over the input classes ego vehicle, road environment, task and prior knowledge. (a) Comparison by DRL/DIL methods. (b) Comparison by scenarios (urban, highway, others).
V. TASK-DRIVEN METHODS
DRL/DIL studies in AD can be categorized according to their application scenarios and targeted tasks, as listed in Table III. These task-driven studies use DRL/DIL to solve specified AD tasks, where the formulations of DRL/DIL can be decomposed into several key components: 1) state space and input design, 2) action space and output design, and 3) reinforcement learning reward design.

A. State Space and Input Design
Table IV classifies commonly used inputs according to the information source (ego vehicle / road environment / task / prior knowledge). A statistical comparison of the percentages of input classes that are used in DRL/DIL methods is presented in Fig. 4(a). Compared with the task and prior knowledge, the ego vehicle and road environment seem to be more popular information sources. Among all input classes, the ego vehicle speed, pixel data (e.g., camera images) and object data (e.g., other road users' relative speeds and positions) are the most commonly used. Another significant difference between DRL and DIL is that DRL models prefer object data while DIL models prefer pixel data. The selection of low-dimensional object data rather than high-dimensional pixel data as input for DRL models renders the problem more tractable and accelerates the training procedure. Fig. 4(b) presents a statistical comparison of the preferences for input classes in various scenarios. Input from the task (e.g., goal positions) and prior knowledge (e.g., road maps) is mostly used in urban scenarios. Point data (e.g., LiDAR sensor readings) and object data are more commonly used in highway scenarios.

Aside from the choice of input, the handling of a dynamic input size is an important problem since the number of cars or pedestrians in the ego vehicle's vicinity varies over time. Unfortunately, standard DRL/DIL methods rely on inputs of fixed size. The use of occupancy grid maps as inputs of CNNs is a practical solution [119, 121]. However, this solution imposes a trade-off between computational workload and expressiveness. Low-resolution grids decrease the computational burden at the cost of being imprecise in their representation of the environment, whereas for high-resolution grids, most computations may be redundant due to the sparsity of the grid maps. Furthermore, a grid map is still limited by its defined size, and agents outside this region are neglected. Alternatively, Everett et al. [152] proposed leveraging LSTMs' ability to encode a variable number of agents' observations. Huegle et al. [141] suggested the use of deep sets [153] as a flexible and permutation-invariant architecture to handle dynamic input. Dynamic input remains an open topic for future studies.
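The permutation-invariant encoding of a variable number of surrounding vehicles mentioned above can be sketched as follows (a minimal PyTorch illustration of the deep-sets idea [153], not the architecture of [141]; the feature dimensions are placeholders).

import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    # Deep-sets style encoder: each surrounding vehicle's feature vector is
    # embedded independently (phi) and the embeddings are sum-pooled, so the
    # output size does not depend on the number or ordering of vehicles.
    def __init__(self, obj_dim=4, hidden=64, out_dim=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obj_dim, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(hidden, out_dim), nn.ReLU())

    def forward(self, objects):
        # objects: [N, obj_dim], one row per detected vehicle (N varies per step).
        return self.rho(self.phi(objects).sum(dim=0))

# The fixed-size output can be concatenated with ego-vehicle features and fed
# to a standard DRL/DIL policy network.
encoder = SetEncoder()
state_feature = encoder(torch.randn(7, 4))   # e.g., 7 surrounding vehicles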
TABLE V
ACTIONS OF THE DRL/DIL METHODS FOR AD TASKS

Behavior-level actions — acceleration related: e.g., full brake, decelerate, continue, and accelerate. DRL: [119, 121, 122, 126, 133, 142, 145]. DIL: —
Behavior-level actions — lane change related: e.g., keep, LLC, and RLC; choice of lane change gaps. DRL: [118, 122, 133, 141, 145]. DIL: [116]
Behavior-level actions — turn related: e.g., straight, left-turn, right-turn, and stop. DRL: [126]. DIL: [65, 117]
Behavior-level actions — interaction related: e.g., take/give way and follow a vehicle; wait/go. DRL: [120, 121]. DIL: —
Trajectory-level actions — planned trajectory: future path 2D points. DRL: —. DIL: [111, 113]
Control-level actions — lateral: discrete steer angles (DRL: [19]; DIL: [10]); continuous steer (DRL: —; DIL: [63, 64, 85–87]); continuous angular speed or yaw acceleration (DRL: [91]; DIL: [65])
Control-level actions — longitudinal: discrete acceleration values (DRL: [90, 94]; DIL: —); continuous acceleration (DRL: [92, 143, 144]; DIL: —); continuous brake and throttle (DRL: [150]; DIL: —)
Control-level actions — simultaneous lateral & longitudinal: continuous steer/turn-rate and speed/acceleration/throttle (DRL: [28, 93, 96, 110]; DIL: [70, 89, 104, 106, 107, 146, 147]); continuous steer, acceleration/throttle and brake (DRL: [26, 98, 102]; DIL: [105, 108]); continuous steer and binary brake decision (DRL: —; DIL: [69])
Hierarchical actions — behavior & control: e.g., behavior-level pass/stop, control-level acceleration and steer. DRL: [27, 136–138]. DIL: —
Hierarchical actions — behavior & trajectory: e.g., behavior-level maintenance, LLC, RLC and trajectory-level path points. DRL: [134]. DIL: —

TABLE VI
REWARDS OF THE DRL METHODS FOR AD TASKS

Safety — avoid collision: impose penalties if a collision occurs. [26, 90, 105, 110, 116, 118–122, 137, 138], [69, 94, 96, 98, 126, 133, 143, 145, 149]
Safety — time to collision: impose penalties if the time to collision (TTC) is below a safe threshold. [118–120, 122, 144]
Safety — distance to other vehicles: impose penalties if this distance is shorter than a safe threshold. [93, 134, 142, 144]
Safety — number of lane changes: impose penalties if the number of lane changes is too large or reward a smaller number of lane changes. [118, 126, 133, 134, 141, 145]
Safety — out of road: impose penalties on driving out of the road. [19, 27, 96, 102, 118]
Efficiency — speed: reward higher speed until the maximum speed limit is reached; impose penalties if the speed is lower than the minimum speed limit. [26, 105, 110, 118, 119, 122, 136, 138, 140, 141], [27, 28, 93, 96, 102, 126, 133, 134, 145, 150]
Efficiency — success: reward the agent if it finishes the task successfully. [90, 96, 116, 118, 120, 121, 137, 138, 143]
Efficiency — number of overtakes: reward a higher number of overtakes for efficiency. [27, 126, 133, 145]
Efficiency — time: impose a negative reward in each step to encourage the agent to finish the task faster, or penalize the agent if the task cannot be finished within a time threshold. [91, 110, 121, 137]
Efficiency — distance to the destination: provide a larger reward the closer the agent is to the destination. [105, 136, 137]
Comfort — jerk: impose penalties if the longitudinal or lateral control is too urgent. [91, 93, 96, 110, 120, 136, 138, 144, 149, 150]
Traffic rules — lane mark invasion: impose penalties if the agent invades the lane marks. [26, 105, 149]
Traffic rules — distance to the lane centerlines: impose penalties if the agent deviates from the lane centerlines or routing baselines. [19, 27, 96, 110, 138, 140]
Traffic rules — wrong lane: impose penalties if the agent is in the wrong lane, e.g., staying in the left-turn lane if the assigned route is straight. [122]
Traffic rules — blocking traffic: impose penalties if the agent blocks the future paths of other vehicles that have the right of way. [137]
B. Action Space and Output Design
A self-driving DRL/DIL agent can plan at different levels of abstraction, namely, low-level control, high-level behavioral planning and trajectory planning, or even at multiple levels simultaneously. Accordingly, Table V categorizes mainstream action spaces into four groups: behavior-level actions, trajectory-level actions, control-level actions and hierarchical actions. Behavior-level actions are usually designed according to specified tasks. For speed control, acceleration-related actions (e.g., full brake, decelerate, continue, and accelerate [119]) are commonly used. Lane change actions (e.g., keep/LLC/RLC) and turn actions (e.g., turn left/right/go straight) are preferred in highway scenarios and urban scenarios, respectively. Trajectory-level actions refer to the planned or predicted trajectories/paths [111, 113], which are typically composed of future path 2D points. Control-level actions refer to low-level control commands (e.g., steer, acceleration, throttle, and brake), which are divided into three classes: lateral control, longitudinal control and simultaneous lateral and longitudinal control. Early studies focused mainly on discrete lateral [10, 19] or discrete longitudinal control [143], while continuous control was considered later [63, 64, 85, 91]. Continuous control is demonstrated in [102] to produce smoother trajectories than discrete control for lane keeping, which may make passengers feel more comfortable. Simultaneous lateral and longitudinal control has received wide attention, especially in urban scenarios [70, 105–108]. Recently, hierarchical actions have attracted more attention [27, 134, 136–138], as they provide higher robustness and interpretability.
Fig. 5. A taxonomy of the literature on how driving safety is addressed by DRL/DIL models: (a) modified methods, which modify DRL/DIL to enhance safety (e.g., Bouton et al., 2019 [90]); (b) combined methods, which combine DRL/DIL with traditional methods to enhance safety (e.g., Chen et al., 2019 [71]); and (c) hybrid methods, which integrate DRL/DIL into traditional methods (e.g., Bernhard et al. [114]).
C. Reinforcement Learning Reward Design
A major problem that limits RL’s real-world AD appli-cations is the lack of underlying reward functions. Further,the ground truth reward, if it exists, may be multi-modalsince human drivers change objectives according to the cir-cumstances. To simplify the problem, current DRL modelsfor AD tasks commonly formulate the reward function asa linear combination of factors, as presented in Fig. 6. Alarge proportion of studies consider safety and efficiency.Reward terms that are used in the literature are listed inTable VI. Collison and speed are the most common rewardterms when considering safety and efficiency, respectively.However, empirically designed reward functions rely heavilyon expert knowledge. It is difficult to balance rewards terms,which affects the trained policy performance. Recent studies
Fig. 6. Reinforcement learning rewards for AD tasks, grouped into safety, efficiency, comfort and traffic-rule terms.
VI. PROBLEM-DRIVEN METHODS
AD applications have special requirements regarding factors such as driving safety, interaction with other traffic participants and uncertainty of the environment. This section reviews the literature from the problem-driven perspective with the objectives of determining how these critical issues are addressed by the DRL/DIL models and identifying the challenges that remain.
A. Safety-enhanced DRL/DIL for AD
Although DRL/DIL can learn driving policies for complex high-dimensional problems, they only guarantee the optimality of the learned policies in a statistical sense. However, in safety-critical AD systems, one failure (e.g., a collision) could cause a catastrophe. Below, we review representative methods for enhancing the safety of DRL/DIL in the AD literature. Fig. 5 categorizes the methods into three groups: (a) modified methods: methods that modify the original DRL/DIL algorithms, (b) combined methods: methods that combine DRL/DIL with traditional methods, and (c) hybrid methods: methods that integrate DRL/DIL into traditional methods.
Fig. 7. A taxonomy of the literature on interaction-aware DRL/DIL models for AD. (a) Qi et al. [155] and (b) Chen et al. [156] are examples of explicit and implicit interactive environment encoding, respectively. (c) Hu et al. [157] is an example of an interactive learning strategy.
1) Modified Methods:
As illustrated in Fig. 5(a), modified methods modify the standard DRL/DIL algorithms to enhance safety, typically by constraining the exploration space [158–160]. A safety model checker is introduced to identify the set of actions that satisfy the safety constraints at each state. This can be realized through several approaches, such as goal reachability [90, 161], probabilistic prediction [162] and prior knowledge & constraints [163]. Bouton et al. [90, 161] use a probabilistic model checker, as illustrated in Fig. 5(a), to compute the probability of reaching the goal safely at each state-action pair. Then, safe actions are identified by applying a user-defined threshold on this probability. However, the proposed model checker requires a discretization of the state space and a full transition model. Alternatively, Isele et al. [162] proposed the use of probabilistic prediction to identify potentially dangerous actions that would cause a collision, but the safety guarantee may not be sufficiently strong if the prediction is not accurate. Prior knowledge & constraints (e.g., lane changes are disallowed if they would lead to small time gaps) are also exploited [125, 132, 163]. For DIL, Zhang et al. [69] proposed SafeDAgger, in which a safety policy is learned to predict the error made by a primary policy without querying the reference policy. If the safety policy determines that it is unsafe to let the primary policy drive, the reference policy takes over. One drawback is that the quality of the learned policy may be limited by that of the reference policy.
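A common mechanism in these modified methods is to mask the actions that a safety checker rejects before the policy selects its action. A minimal sketch (assuming a hypothetical is_safe(state, action) checker, e.g., backed by reachability analysis or rule-based constraints as in the works above, and a discrete-action Q-network):

import torch

def masked_greedy_action(q_net, state, is_safe, n_actions, fallback_action=0):
    # Greedy action selection restricted to the actions accepted by the safety
    # checker at the current state; if none passes, return a designated
    # fallback action (e.g., full brake). q_net(state) is assumed to return a
    # 1-D tensor of Q-values for a single, unbatched state.
    q_values = q_net(state).detach()
    mask = torch.tensor([is_safe(state, a) for a in range(n_actions)])
    if not mask.any():
        return fallback_action
    q_values[~mask] = -float("inf")      # unsafe actions can never be selected
    return int(torch.argmax(q_values))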
2) Combined Methods:
Various studies combine standard DRL/DIL with traditional rule-based methods to enhance safety. In contrast to the modified methods discussed above, combined methods do not modify the learning process of standard DRL/DIL. As presented in Fig. 5(b), Chen et al. [71] proposed a framework in which DIL plans trajectories, while a rule-based tracking and safe-set controller ensures safe control. Xiong et al. [164] proposed a linear combination of the control outputs from DDPG, artificial potential field and path tracking modules. According to Shalev-Shwartz et al. [165], hard constraints should be injected outside the learning framework. They decompose the double-merge problem into a composition of a learnable DRL policy and a trajectory planning module with non-learnable hard constraints. The learning part enables driving comfort, while the hard constraints guarantee safety.
3) Hybrid Methods:
Hybrid methods integrate DRL/DIL into traditional heuristic search or POMDP planning methods. As presented in Fig. 5(c), Bernhard et al. [114] integrated experiences in the form of pretrained Q-values into hybrid A* as heuristics, thereby overcoming the statistical failure rate of DRL while still benefitting computationally from the learned policy. However, the experiments are limited to stationary environments. Pusse et al. [166] presented a hybrid solution that combines DRL and approximate POMDP planning for collision-free autonomous navigation in simulated critical traffic scenarios, which benefits from the advantages of both methods.

B. Interaction-aware DRL/DIL for AD
Interaction is one of the intrinsic characteristics of traffic environments. An intelligent agent should reason beforehand about the behaviors of other traffic participants in order to passively react or actively adjust its own policy to cooperate or compete with other agents. This section reviews interaction modeling methods and two groups of interaction-aware DRL/DIL methods for AD, as presented in Fig. 7. One group of methods focuses on interactive environment encoding, while the other focuses on interactive learning strategies.
1) Interaction Modeling:
The simplest way to model the interaction between multiple agents is to use a standard Markov decision process (MDP), where the other traffic participants are treated only as part of the environment [141, 156]. The POMDP is another common interaction model [95, 155, 167], where the agent has limited sensing capabilities. A Markov game (MG) is also used for modeling interaction scenarios. According to whether the agents have the same importance, the methods can be categorized into three groups: 1) equal importance [157, 168, 169], 2) one vs. others [170], and 3) proactive-passive pair [171].
2) Interactive Environment Encoding:
Interactive encoding of the environment is a popular research direction. As presented in Fig. 7, mainstream methods can be divided into two groups. One group of methods explicitly models other agents and utilizes active reasoning about other agents in the algorithm workflow. The POMDP is a common choice for these methods, where the intentions/cooperation levels of other agents are modeled as unobservable states that must be inferred. Qi et al. [155] proposed an intent-aware multi-agent planning framework, as presented in Fig. 7(a), which decouples intent prediction, high-level reasoning and low-level planning. The maintained belief regarding other agents' intents (objectives) is considered in the planning process. Bouton et al. [95] proposed a similar method that maintains a belief regarding the cooperation levels (e.g., the willingness to yield to the ego vehicle) of other drivers.

The other group of methods focuses on utilizing special neural network architectures to capture the interplay between agents via relation or interaction representations. These methods are usually agnostic regarding the intentions of other agents. Jiang et al. [167] proposed graph convolutional reinforcement learning, in which the multi-agent environment is constructed as a graph. Agents are represented by nodes, and each node's corresponding neighbors are determined by distance or other metrics. Then, the latent features that are produced by the graph convolutional layers are exploited to learn cooperation. Similarly, Huegle et al. [172] built upon graph neural networks [173] and proposed the deep scene architecture for learning complex interaction-aware scene representations. Inspired by social pooling [174, 175] and attention models [176, 177], Chen et al. [156] proposed a socially attentive DRL method for interaction-aware robot navigation through a crowd. As illustrated in Fig. 7(b), they extracted pairwise features of the interaction between the robot and each human and captured the interactions among humans via local maps. A self-attention mechanism was subsequently used to infer the relative importance of neighboring humans and to aggregate the interaction features.
3) Interactive Learning Strategy:
Various learning strategies have been used to learn interactive policies. Curriculum learning, which decomposes a complex problem into simpler ones, has been used to learn interactive policies [157, 168]. As presented in Fig. 7(c), Hu et al. [157] proposed an interaction-aware decision making approach that leverages curriculum learning. First, a decentralized critic is learned for each agent to generate distinct behaviors, where the agent does not react to other agents and only learns how to execute rational actions to complete its own task. Second, a centralized critic is learned to enable agents to interact with each other to realize joint success and maintain smooth traffic. One limitation of these methods is that new models must be learned if the number of agents increases. Based on dynamic coordination graphs (DCG) [179], Yu et al. [180] proposed a strategic learning solution for coordinating multiple autonomous vehicles on highways, where the DCG explicitly models the continuously changing coordination dependencies among vehicles.

Another group of interaction-aware methods uses game theory [181]. Game theory has already been applied to robotics tasks such as robust control [182, 183] and motion planning [184, 185], and recent years have witnessed its increasing application to interaction-aware AD policy learning [169–171]. Li et al. [170] proposed combining hierarchical reasoning game theory (i.e., "level-k" reasoning [186]) with reinforcement learning: level-k reasoning models the interactions between intelligent vehicles in traffic, while RL evolves these interactions in a time-extended scenario (see the sketch after this paragraph). Ding et al. [171] introduced a proactive-passive game-theoretic lane changing framework, in which the proactive vehicles learn to take actions to merge while the passive vehicles learn to create merging space. Fisac et al. [169] proposed a game-theoretic real-time trajectory planning algorithm in which the dynamic game is hierarchically decomposed into a long-horizon "strategic" game and a short-horizon "tactical" game; the long-horizon interaction game is solved to guide short-horizon planning, thereby implicitly extending the planning horizon and pushing the local trajectory optimization closer to global solutions. Apart from combining game theory and RL, solving an imitation learning problem under a game-theoretic formalism is another approach. Sun et al. [187] proposed an interactive probabilistic prediction approach based on hierarchical inverse reinforcement learning (HIRL), modeling the problem as a two-agent game by explicitly considering the responses of one agent to the other. However, some of the current game-theoretic interaction-aware methods are limited by their two-vehicle settings and simulation-only experiments [169, 171, 187].
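A minimal sketch of the level-k idea used in [170] is given below. The environment handle, the level-0 rule and the RL training routine are placeholders standing in for whatever implementation is available; they are not taken from the cited work.

```python
def level_k_policies(env, base_policy, train_best_response, k_max=3):
    """Build a hierarchy of increasingly strategic policies.

    base_policy is the level-0 behavior (e.g., a simple rule such as
    "maintain speed unless a collision is imminent"); train_best_response
    stands for any RL routine that optimizes an agent's return while all
    other vehicles follow a fixed policy.
    """
    policies = {0: base_policy}
    for k in range(1, k_max + 1):
        # A level-k agent best-responds to traffic populated by level-(k-1) agents.
        opponents = policies[k - 1]
        policies[k] = train_best_response(env, opponent_policy=opponents)
    return policies
```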
C. Uncertainty-aware DRL/DIL for AD

Before deployment of a learned model, it is important to determine what the model does not understand and to estimate the uncertainty of its decision making output. As presented in Fig. 8, this section reviews uncertainty-aware DRL/DIL methods for AD from three aspects: 1) AD and deep learning uncertainty, 2) uncertainty estimation methods, and 3) multi-modal driving behavior learning.
1) Autonomous Driving and Deep Learning Uncertainty:
Autonomous driving has inherent uncertainty, while deep learning methods have their own uncertainty, and the two intersect. AD uncertainty can be categorized as follows:
• Traffic environment uncertainty [107, 108, 178, 188–190]. Stochastic and dynamic interactions among agents with distinct behaviors lead to intrinsic, irreducible randomness and uncertainty in a traffic environment.
• Driving behavior uncertainty [191]. Human driving behavior is multi-modal and stochastic (e.g., a driver may make either a left or a right lane change when coming up behind a van that is moving at a crawl).
• Partial observability and sensor noise uncertainty [192]. In real-world scenarios, the AD agent usually has only partial observability (e.g., due to occlusion), and the sensor observations are noisy.

Fig. 8. A taxonomy of the literature on uncertainty-aware DRL/DIL models. (a) Tai et al. [108] address aleatoric uncertainty, while (b) Henaff et al. [178] address both aleatoric and epistemic uncertainties.

Drawing on Bayesian deep learning approaches, Gal [193] categorized deep learning uncertainty into aleatoric/data and epistemic/model uncertainties. Aleatoric uncertainty results from incomplete knowledge about the environment (e.g., partial observability and measurement noise); it cannot be reduced through access to more or even unlimited data, but it can be explicitly modeled. In contrast, epistemic uncertainty originates from an insufficient dataset and measures what the model does not know; it can be eliminated with sufficient training data. We refer readers to [193, 194] for a deeper background on predictive uncertainty in deep neural networks. Although it is sometimes possible to develop a reasonable model using only aleatoric [108, 188, 189] or only epistemic [107, 190] uncertainty, the ideal approach is to combine the two uncertainty estimates [178, 191, 192].
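When both kinds of uncertainty are estimated together, a common recipe in the style of [194] is to draw $T$ stochastic forward passes, e.g., with MC-dropout, where the $t$-th pass outputs the predicted mean $\tilde{y}_t$ and predicted variance $\tilde{\sigma}_t$, and to decompose the predictive variance as below. This is the standard formula rather than one reproduced from the cited AD papers:
\[
\operatorname{Var}(y \mid x) \;\approx\;
\underbrace{\frac{1}{T}\sum_{t=1}^{T}\tilde{\sigma}_t}_{\text{aleatoric}}
\;+\;
\underbrace{\frac{1}{T}\sum_{t=1}^{T}\tilde{y}_t^{2} - \Big(\frac{1}{T}\sum_{t=1}^{T}\tilde{y}_t\Big)^{2}}_{\text{epistemic}} .
\]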
2) Uncertainty Estimation Methods:
Aleatoric uncertainty is usually learned by using the heteroscedastic loss function [194]. The regression task and the loss are formulated as
\[
[\tilde{y}, \tilde{\sigma}] = f_{\theta}(x) \qquad (20)
\]
\[
L(\theta) = \frac{1}{2}\,\frac{\|y - \tilde{y}\|^{2}}{\tilde{\sigma}} + \frac{1}{2}\log\tilde{\sigma} \qquad (21)
\]
where $x$ denotes the input data, $y$ and $\tilde{y}$ denote the regression ground truth and the prediction output, respectively, $\theta$ denotes the model parameters, and $\tilde{\sigma}$ is an additional output of the model that represents the variance for input $x$ (the aleatoric uncertainty). The loss can be interpreted as penalizing large prediction errors when the uncertainty is small while relaxing the constraints on the prediction error when the uncertainty is large. In practice, the network predicts the log variance $\log\tilde{\sigma}$ [194]. Tai et al. [108] proposed an end-to-end real-to-sim visual navigation deployment pipeline, as illustrated in Fig. 8(a), in which an uncertainty-aware IL policy is trained with the heteroscedastic loss and outputs actions along with the associated uncertainties. A similar technique was proposed by Lee et al. [192].

Epistemic uncertainty is usually estimated via two popular methods: Monte Carlo (MC)-dropout [195, 196] and ensembles [197, 198]. These methods are similar in that both apply probabilistic reasoning to the network weights, and the variance of the model output serves as an estimate of the model uncertainty. However, multiple stochastic forward passes through dropout sampling may be time-consuming, while ensemble methods have higher training and storage costs. Kahn et al. [190] proposed an uncertainty-aware RL method that utilizes MC-dropout and bootstrapping [199], in which the confidence for a specified obstacle is updated iteratively; guided by the uncertainty cost, the agent behaves more carefully in unfamiliar scenarios in the early training phase. As presented in Fig. 8(b), Henaff et al. [178] proposed training a driving policy by unrolling a learned dynamics model over multiple time steps while explicitly penalizing both the original policy cost and an uncertainty cost that represents the divergence from the training dataset. Their method estimates both aleatoric and epistemic uncertainties.

The uncertainty estimation methods presented above depend mainly on sampling. A sampling-free uncertainty estimation method that utilizes a mixture density network (MDN) was proposed by Choi et al. [191] for learning from complex and noisy human demonstrations. Since an MDN outputs the parameters of a Gaussian mixture model (GMM), the total variance of the GMM can be calculated analytically, and obtaining the uncertainty requires only a single forward pass. Distributional reinforcement learning [51, 200, 201] offers another approach for modeling the uncertainty associated with actions: it models the RL return $R$ as a random variable subject to the probability distribution $Z(r \mid s, a)$ and defines the Q-value as the expected return $Q(s,a) = \mathbb{E}_{r \sim Z(r \mid s,a)}[r]$. In the AD domain, Wang et al. [188] applied distributional DDPG to an energy management strategy (EMS) problem as a case study to evaluate the effect of estimating the uncertainty associated with different actions at different states. Bernhard et al. [189] presented a two-step approach for risk-sensitive behavior generation that combines offline distributional reinforcement learning with online risk assessment and increases safety in intersection crossing scenarios.
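A minimal PyTorch sketch of Eqs. (20)-(21), together with an MC-dropout estimate of the epistemic part, is given below. The backbone, dimensions and sample counts are illustrative assumptions rather than the configurations used in the cited works.

```python
import torch
import torch.nn as nn

class HeteroscedasticHead(nn.Module):
    """Regression head matching Eqs. (20)-(21): it predicts the mean and the
    log variance (the aleatoric uncertainty) of the output."""
    def __init__(self, in_dim=128, out_dim=2):
        super().__init__()
        self.mean = nn.Linear(in_dim, out_dim)      # \tilde{y}
        self.log_var = nn.Linear(in_dim, out_dim)   # log \tilde{\sigma}

    def forward(self, features):
        return self.mean(features), self.log_var(features)

def heteroscedastic_loss(y, y_pred, log_var):
    """Eq. (21): large errors are penalized when the predicted variance is
    small, while the log-variance term keeps the variance from exploding."""
    return (0.5 * (y - y_pred).pow(2) * torch.exp(-log_var)
            + 0.5 * log_var).mean()

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=20):
    """Epistemic estimate for any dropout-equipped regression model: dropout
    stays active at test time, and the spread of the sampled predictions is
    used as the model uncertainty."""
    model.train()                         # keeps dropout layers stochastic
    preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.var(0)    # predictive mean, epistemic variance
```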
3) Multi-Modal Driving Behavior Learning:
The inherent uncertainty in driving behavior results in multi-modal demonstrations, and many multi-modal imitation learning methods have been proposed. InfoGAIL [202] and Burn-InfoGAIL [203] infer latent/modal variables by maximizing the mutual information between the latent variables and state-action pairs, while VAE-GAIL [204] introduces a variational auto-encoder for inferring modal variables. However, due to the lack of labels in the demonstrations, these algorithms tend to distinguish latent labels without considering semantic information or the task context. Another direction focuses on labeled expert demonstrations: CGAIL [205] feeds the modal labels directly to the generator and the discriminator, and ACGAIL [206] introduces an auxiliary classifier for reconstructing the modal information, where the classifier cooperates with the discriminator to provide the adversarial loss to the generator. Nevertheless, the above methods mainly rely on random sampling of latent labels from a known prior distribution to distinguish the multiple modalities; the trained models depend on manually specified labels to output actions and hence cannot select modes adaptively according to the environmental scenario. Recently, Fei et al. [207] proposed Triple-GAIL, which learns adaptive skill selection and imitation jointly from expert demonstrations and generated experiences by introducing an auxiliary skill selector.
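To make the mutual-information idea concrete, the sketch below shows the kind of auxiliary posterior network that InfoGAIL-style methods [202] train alongside the policy so that the latent driving-style code stays informative. The layer sizes and interfaces are assumptions for illustration, not the original implementation.

```python
import torch
import torch.nn as nn

class LatentCodePosterior(nn.Module):
    """Auxiliary posterior q(c | s, a) over discrete driving-style codes."""
    def __init__(self, state_dim, action_dim, n_modes, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_modes))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))  # logits over modes

def mutual_information_bonus(posterior, state, action, code):
    """Lower bound on I(c; s, a): log q(c | s, a) for the sampled code,
    added to the generator's objective during adversarial training."""
    logits = posterior(state, action)
    return torch.distributions.Categorical(logits=logits).log_prob(code)
```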
VII. DISCUSSION
Although DRL and DIL attract significant interest in AD research, they remain far from ready for real-world applications, and challenges are faced at the architecture, task and algorithm levels. However, solutions remain largely underexplored. In this section, we discuss these challenges along with directions for future investigation. DRL and DIL also have their own technical challenges; we refer readers to the comprehensive discussions in [29, 31, 33].
A. System architecture
The success of modern AD systems depends on the meticulous design of their architectures, and integrating DRL/DIL methods so that they collaborate with other modules and improve system performance remains a substantial challenge. Studies have demonstrated various ways in which DRL/DIL models can be integrated into an AD system. As illustrated in Fig. 3, some studies propose new AD architectures, e.g., Modes 1&2, where an entire pipeline from the sensor/perception input to the vehicle actuator output is covered. However, the traditional modules of sequential planning are missing in these new architectures, and the driving policy is addressed at the control level only. Hence, these AD systems can adapt only to simple tasks, such as road following, that require neither the guidance of goal points nor the switching of driving behaviors; extending these architectures to accomplish more complicated AD tasks remains a substantial challenge. Other studies adopt the traditional AD architectures, e.g., Modes 3&4, where DRL/DIL models are studied as substitutes for traditional modules to improve performance in challenging scenarios. Mode 5 studies use both new and traditional architectures. Overall, the research effort to date has focused more on exploring the potential of DRL/DIL in accomplishing AD tasks, whereas the design of the system architectures has yet to be intensively investigated.
B. Formulation of driving tasks
Various DRL/DIL formulations have been established for accomplishing AD tasks. However, these formulations rely heavily on empirical designs. As reviewed in Section V, the state space and input data are designed case by case, and ad-hoc reward functions are usually adopted with hand-tuned coefficients that balance the costs regarding safety, efficiency, comfort and traffic rules, among other factors (see the sketch below). Such designs are essentially brute-force approaches that lack both theoretical justification and in-depth investigation, and changing the designs or tuning the parameters could result in substantially different driving policies. In real-world deployment, more attention should be paid to the following questions: What design would realize the optimal driving policy? Could such a design adapt to various scenes? How can the boundary conditions of a design be identified? Answering these questions requires rigorous studies with comparative experiments.
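As a concrete, deliberately simplistic illustration of such hand-designed rewards, the following sketch combines per-step metrics with manually tuned weights. The terms and coefficients are invented for the example and are not drawn from any of the cited formulations.

```python
def driving_reward(m, w_safety=10.0, w_eff=1.0, w_comfort=0.1, w_rule=1.0):
    """Weighted-sum reward of the kind criticized above; m is a dict of
    per-step measurements produced by the simulator or perception stack."""
    return (-w_safety  * m["collision"]        # safety: 1.0 on collision, else 0.0
            + w_eff    * m["progress"]         # efficiency: distance travelled along the route
            - w_comfort * abs(m["jerk"])       # comfort: penalize harsh accelerations
            - w_rule   * m["rule_violation"])  # traffic rules: e.g., red-light or speed violations
```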
C. Safe driving policy
AD applications have high safety requirements, and guaranteeing the safety of a DRL/DIL-integrated AD system is of substantial importance. Compared with traditional rule-based methods, DNNs are widely acknowledged to have poor interpretability; their "black-box" nature makes it difficult to predict when the agent may fail to generate a safe policy. Deep models for real-world AD applications must also handle unseen or rarely seen scenarios, which is difficult for DL methods because they optimize objectives in expectation over the training distribution. A general strategy for this problem is to combine traditional methods to ensure a DRL/DIL agent's functional safety. As reviewed in Fig. 5, various methods have been proposed in the literature, where the problems are usually formulated as compositions of learned policies with hard constraints [125, 132, 163], as sketched below. However, balancing between the learned optimal policy and the safety guarantee imposed by hard constraints is non-trivial and requires intensive investigation in the future.
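One common realization of such a composition is to let a rule-based safety layer veto the learned policy's action before execution. The sketch below is a generic action-masking scheme, not the specific mechanism of [125], [132] or [163]; the checker, value estimates and fallback are placeholders.

```python
def shielded_action(q_values, actions, state, is_safe, fallback_action):
    """Mask unsafe actions before the greedy choice; if no action passes the
    rule-based check, execute a conservative fallback instead."""
    safe_ids = [i for i, a in enumerate(actions) if is_safe(state, a)]
    if not safe_ids:
        return fallback_action           # e.g., comfortable braking in the current lane
    best = max(safe_ids, key=lambda i: q_values[i])
    return actions[best]
```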
D. Interaction with traffic participants

The capability of human-like interaction is required of self-driving agents that share the roads with other traffic participants. As reviewed in Section VI-B, interaction-aware DRL/DIL is a rising topic, but the following problems remain. First, current studies attempt to solve the problem from various perspectives, and systematic studies are needed. Second, few interaction-aware DIL methods are available, while interaction-aware DRL methods are limited to simplified scenarios that involve only a few agents; combining them with interaction-aware trajectory prediction methods [208–210] may be an open topic of potential value. Third, game theory and multi-agent reinforcement learning (MARL) [31] are highly correlated for interactive scenarios: MARL methods usually build on concepts from game theory (e.g., Markov games) to model the interaction process. Apart from DRL/DIL, interactive policies can also be learned through traditional game-theoretic approaches, such as Nash equilibria [211], level-k reasoning [212] and game tree search [213]. Exploiting POMDP planning to learn interactive policies is also a trend [214, 215]. These methods have satisfactory interpretability but are limited to simplified or coarse discretizations of the agents' action space [213, 215]; although the simplification reduces the computational burden, it also tends to lower the control precision. In the future, the combination of these methods with DRL/DIL may be promising.

E. Uncertainty of the environment
Decision-making under uncertainty has been studied for decades [216, 217]. Nevertheless, modeling uncertainty in DRL/DIL formulations remains challenging, especially in complex, uncertain traffic environments. Several problems have been identified in current research. First, most uncertainty-aware methods follow the style of deep learning predictive uncertainty [193] without deeper investigation: is computing the predictive uncertainty of DNNs sufficient for AD tasks? Second, can the computed uncertainty be effectively utilized to realize a better decision making policy? Some methods incorporate the uncertainty cost into the global cost function [178, 190], while others utilize uncertainty to generate risk-sensitive behavior [189]; future efforts are needed to identify more promising applications. Third, human driving behavior is uncertain, or multi-modal, whereas DIL performs well on demonstrations from a single expert rather than from multiple experts [218]. A naive solution is to neglect the multi-modality and treat the demonstrations as if they came from one expert, but the model then tends to learn an average policy rather than a multi-modal policy [202]. Thus, determining whether DRL/DIL can learn effectively from noisy, uncertain naturalistic driving data and generate multi-modal driving behavior according to various scenarios is a meaningful direction.
F. Validation and benchmarks
Validation and benchmarks are especially important for AD, but far from sufficient effort has been made in these respects. First, comparisons between DRL/DIL-integrated architectures and traditional architectures are usually neglected in the literature, although they are essential for identifying the quantitative performance gains and drawbacks of introducing DRL/DIL. Second, systematic comparison among DRL/DIL architectures is necessary. A technical barrier to both of these problems is the lack of a reasonable benchmark; high-fidelity simulators such as CARLA [105] may provide a virtual platform on which various architectures can be deployed and evaluated. Third, exhaustive validation of trained policies before deployment is of vital importance, but such validation is challenging. Real-world testing on vehicles has high costs in terms of time, finances and human labor and could be dangerous. Empirical validation through simulation can reduce the amount of required field testing and can serve as a first step for performance and safety evaluation; however, verification through simulation only ensures performance in a statistical sense, and even small discrepancies between the simulator and the real scenario can have drastic effects on the system behavior. Future studies are needed to identify practical, effective, low-risk and economical validation methods.

VIII. CONCLUSIONS
In this study, a comprehensive survey is presented that focuses on autonomous driving policy learning using DRL/DIL, which is addressed simultaneously from the system, task-driven and problem-driven perspectives. The study is conducted at three levels: First, a taxonomy of the literature is presented from the system perspective, from which five modes of integrating DRL/DIL models into an AD architecture are identified. Second, the formulations of DRL/DIL models for accomplishing specified AD tasks are comprehensively reviewed, where various designs of the model state and action spaces and the reinforcement learning rewards are covered. Finally, an in-depth review is presented of how the critical issues of AD applications regarding driving safety, interaction with other traffic participants and uncertainty of the environment are addressed by DRL/DIL models. The major findings are listed below, from which potential topics for future investigation are identified.
• DRL/DIL attract significant interest in AD research. However, studies in this scope have focused more on exploring the potential of DRL/DIL in accomplishing AD tasks, whereas the design of the system architectures remains to be intensively investigated.
• Many DRL/DIL models have been formulated for accomplishing AD tasks. However, these formulations rely heavily on empirical designs, which are brute-force approaches that lack both theoretical justification and in-depth investigation; in real-world deployment, such models may encounter substantial challenges in terms of stability and robustness.
• Driving safety, the main issue in AD applications, has received the most attention in the literature. However, studies on interaction with other traffic participants and on the uncertainty of the environment remain highly preliminary: the problems have been addressed from divergent perspectives, and they have not been investigated systematically.
REFERENCES

[1] C. Urmson and W. Whittaker, “Self-driving cars and the urban challenge,”
IEEE Intelligent Systems , vol. 23, no. 2, pp. 66–68, 2008.[2] S. Thrun, “Toward robotic cars,”
Communications of the ACM , vol. 53,no. 4, pp. 99–106, 2010.[3] A. Eskandarian,
Handbook of intelligent vehicles . Springer, 2012,vol. 2.[4] S. M. Grigorescu, B. Trasnea, T. T. Cocias, and G. Macesanu, “Asurvey of deep learning techniques for autonomous driving,”
J. FieldRobotics , vol. 37, no. 3, pp. 362–386, 2020.[5] W. H. Organization et al. , “Global status report on road safety 2018:Summary,” World Health Organization, Tech. Rep., 2018.[6] A. Talebpour and H. S. Mahmassani, “Influence of connected andautonomous vehicles on traffic flow stability and throughput,”
Trans-portation Research Part C: Emerging Technologies , vol. 71, pp. 143–163, 2016.[7] W. Payre, J. Cestac, and P. Delhomme, “Intention to use a fullyautomated car: Attitudes and a priori acceptability,”
Transportationresearch part F: traffic psychology and behaviour , vol. 27, pp. 252–263, 2014.[8] E. D. Dickmanns and A. Zapp, “Autonomous high speed road vehicleguidance by computer vision,”
IFAC Proceedings Volumes , vol. 20,no. 5, pp. 221–226, 1987.[9] C. Thorpe, M. H. Hebert, T. Kanade, and S. A. Shafer, “Vision andnavigation for the carnegie-mellon navlab,”
IEEE Transactions onPattern Analysis and Machine Intelligence , vol. 10, no. 3, pp. 362–373, 1988.[10] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neuralnetwork,” in
Advances in Neural Information Processing Systems ,1989, pp. 305–313.[11] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron,J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al. , “Stanley:The robot that won the darpa grand challenge,”
Journal of fieldRobotics , vol. 23, no. 9, pp. 661–692, 2006.[12] M. Buehler, K. Iagnemma, and S. Singh,
The DARPA urban challenge:autonomous vehicles in city traffic . springer, 2009, vol. 56.[13] D. Gonz´alez, J. P´erez, V. Milan´es, and F. Nashashibi, “A review of mo-tion planning techniques for automated vehicles,”
IEEE Transactionson Intelligent Transportation Systems , vol. 17, no. 4, pp. 1135–1145,2016.[14] X. Li, Z. Sun, D. Cao, Z. He, and Q. Zhu, “Real-time trajectoryplanning for autonomous urban driving: Framework, algorithms, andverifications,”
IEEE/ASME Transactions on Mechatronics , vol. 21,no. 2, pp. 740–753, 2016.[15] B. Paden, M. ˇC´ap, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey ofmotion planning and control techniques for self-driving urban vehicles,”
IEEE Transactions on Intelligent Vehicles , vol. 1, no. 1, pp. 33–55,2016.[16] S. Ulbrich, A. Reschka, J. Rieken, S. Ernst, G. Bagschik, F. Dierkes,M. Nolte, and M. Maurer, “Towards a functional system architecturefor automated vehicles,” arXiv preprint arXiv:1703.08557 , 2017.[17] L. Li, K. Ota, and M. Dong, “Humanlike driving: Empiricaldecision-making system for autonomous vehicles,”
IEEE Trans. Veh.Technol. , vol. 67, no. 8, pp. 6814–6823, 2018. [Online]. Available:https://doi.org/10.1109/TVT.2018.2822762 [18] W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-making for autonomous vehicles,”
Annual Review of Control, Robotics,and Autonomous Systems , vol. 1, 05 2018.[19] G. Yu and I. K. Sethi, “Road-following with continuous learning,” in the Intelligent Vehicles’ 95. Symposium . IEEE, 1995, pp. 412–417.[20] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse,E. Berger, and E. Liang, “Autonomous inverted helicopter flight viareinforcement learning,” in
International Symposium on ExperimentalRobotics , ser. Springer Tracts in Advanced Robotics, vol. 21. Springer,2004, pp. 363–372.[21] R. S. Sutton and A. G. Barto,
Reinforcement learning: an introduction .MIT Press, 2018.[22] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,”
Nature , vol.521, no. 7553, pp. 436–444, 2015.[23] I. J. Goodfellow, Y. Bengio, and A. C. Courville,
Deep Learning , ser.Adaptive computation and machine learning. MIT Press, 2016.[24] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Sil-ver, and D. Wierstra, “Continuous control with deep reinforcementlearning,” in
International Conference on Learning Representations ,2016.[25] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, andA. Farhadi, “Target-driven visual navigation in indoor scenes usingdeep reinforcement learning,” in
International Conference on Roboticsand Automation . IEEE, 2017, pp. 3357–3364.[26] X. Liang, T. Wang, L. Yang, and E. Xing, “Cirl: Controllable imitativereinforcement learning for vision-based self-driving,” in the EuropeanConference on Computer Vision (ECCV) , 2018, pp. 584–599.[27] Y. Chen, C. Dong, P. Palanisamy, P. Mudalige, K. Muelling, and J. M.Dolan, “Attention-based hierarchical deep reinforcement learning forlane change behaviors in autonomous driving,” in
IEEE Conference onComputer Vision and Pattern Recognition Workshops , 2019, pp. 0–0.[28] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V.-D. Lam, A. Bewley, and A. Shah, “Learning to drive in a day,” in
International Conference on Robotics and Automation . IEEE, 2019,pp. 8248–8254.[29] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath,“Deep reinforcement learning: A brief survey,”
IEEE Signal ProcessingMagazine , vol. 34, no. 6, pp. 26–38, 2017.[30] Y. Li, “Deep reinforcement learning: An overview,”
CoRR , vol.abs/1701.07274, 2017.[31] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforcementlearning for multi-agent systems: A review of challenges, solutions andapplications,”
CoRR , vol. abs/1812.11794, 2018.[32] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne, “Imitation learning:A survey of learning methods,”
ACM Computing Surveys (CSUR) ,vol. 50, no. 2, pp. 1–35, 2017.[33] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, andJ. Peters, “An algorithmic perspective on imitation learning,”
Found.Trends Robotics , vol. 7, no. 1-2, pp. 1–179, 2018.[34] S. Kuutti, R. Bowden, Y. Jin, P. Barber, and S. Fallah, “A survey ofdeep learning applications to autonomous vehicle control,”
CoRR , vol.abs/1912.10773, 2019.[35] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. A. Sallab, S. K.Yogamani, and P. P´erez, “Deep reinforcement learning for autonomousdriving: A survey,”
CoRR , vol. abs/2002.00444, 2020.[36] S. B. Thrun, “Efficient exploration in reinforcement learning,” 1992.[37] M. Coggan, “Exploration and exploitation in reinforcement learning,”
Research supervised by Prof. Doina Precup, CRA-W DMP Project atMcGill University , 2004.[38] Z. Hong, T. Shann, S. Su, Y. Chang, T. Fu, and C. Lee, “Diversity-driven exploration strategy for deep reinforcement learning,” in
Ad-vances in Neural Information Processing Systems , 2018.[39] G. Shani, J. Pineau, and R. Kaplow, “A survey of point-based pomdpsolvers,”
Autonomous Agents and Multi-Agent Systems , vol. 27, no. 1,pp. 1–51, 2013.[40] W. S. Lovejoy, “A survey of algorithmic methods for partially observedmarkov decision processes,”
Annals of Operations Research , vol. 28,no. 1, pp. 47–65, 1991.[41] R. Bellman and R. Kalaba, “On the role of dynamic programming instatistical communication theory,”
IRE Trans. Inf. Theory , vol. 3, no. 3,pp. 197–203, 1957.[42] C. J. C. H. Watkins and P. Dayan, “Technical note q-learning,”
Mach.Learn. , vol. 8, pp. 279–292, 1992.[43] G. A. Rummery and M. Niranjan,
On-line Q-learning using connec-tionist systems . University of Cambridge, Department of EngineeringCambridge, UK, 1994, vol. 37.[44] R. Bellman, “Dynamic programming,”
Science, vol. 153, no. 3731, pp. 34–37, 1966.[45] V. Mnih, K. Kavukcuoglu, D. Silver et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.[46] L.-J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,”
Machine learning , vol. 8, no. 3-4,pp. 293–321, 1992.[47] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experiencereplay,” in
International Conference on Learning Representations ,2016.[48] S. Gu, T. P. Lillicrap, I. Sutskever, and S. Levine, “Continuous deepq-learning with model-based acceleration,” in
International Conferenceon Machine Learning , 2016.[49] H. v. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learningwith double q-learning,” in the Thirtieth AAAI Conference on ArtificialIntelligence , 2016, pp. 2094–2100.[50] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas,“Dueling network architectures for deep reinforcement learning,” in
International Conference on Machine Learning , 2016, pp. 1995–2003.[51] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, “Dis-tributional reinforcement learning with quantile regression,” in
AAAIConference on Artificial Intelligence , 2018, pp. 2892–2901.[52] R. J. Williams, “Simple statistical gradient-following algorithms forconnectionist reinforcement learning,”
Mach. Learn. , vol. 8, pp. 229–256, 1992.[53] L. C. Baird, “Reinforcement learning in continuous time: Advantageupdating,” in
IEEE International Conference on Neural Networks ,vol. 4. IEEE, 1994, pp. 2448–2453.[54] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel,“High-dimensional continuous control using generalized advantageestimation,” in
International Conference on Learning Representations ,2016.[55] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trustregion policy optimization,” in
International Conference on MachineLearning , 2015, pp. 1889–1897.[56] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,“Proximal policy optimization algorithms,”
CoRR , vol. abs/1707.06347,2017.[57] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. A.Riedmiller, “Deterministic policy gradient algorithms,” in
InternationalConference on Machine Learning , vol. 32, 2014, pp. 387–395.[58] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley,D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep rein-forcement learning,” in
International Conference on Machine Learning ,2016, pp. 1928–1937.[59] J. Wang, Z. Kurth-Nelson, H. Soyer, J. Z. Leibo, D. Tirumala,R. Munos, C. Blundell, D. Kumaran, and M. M. Botvinick, “Learningto reinforcement learn,” in the 39th Annual Meeting of the CognitiveScience Society, CogSci 2017, London, UK, 16-29 July 2017 . cogni-tivesciencesociety.org, 2017.[60] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan,V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine, “Soft actor-critic algorithms and applications,”
CoRR , vol. abs/1812.05905, 2018.[61] B. D. Argall, S. Chernova, M. M. Veloso, and B. Browning, “A surveyof robot learning from demonstration,”
Robotics Auton. Syst. , vol. 57,no. 5, pp. 469–483, 2009.[62] H. John and C. James, “Ngsim interstate 80 freeway dataset,” USFedeal Highway Administration, FHWA-HRT-06-137, Washington,DC, USA, Tech. Rep., 2006.[63] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp,P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al. , “End toend learning for self-driving cars,” arXiv preprint arXiv:1604.07316 ,2016.[64] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner,L. Jackel, and U. Muller, “Explaining how a deep neural net-work trained with end-to-end learning steers a car,” arXiv preprintarXiv:1704.07911 , 2017.[65] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driv-ing models from large-scale video datasets,” in
IEEE conference onComputer Vision and Pattern Recognition , 2017, pp. 2174–2182.[66] D. A. Pomerleau, “Efficient training of artificial neural networks forautonomous navigation,”
Neural computation , vol. 3, no. 1, pp. 88–97,1991.[67] S. Ross and D. Bagnell, “Efficient reductions for imitation learning,” in
International Conference on Artificial Intelligence and Statistics , ser.JMLR Proceedings, vol. 9. JMLR.org, 2010, pp. 661–668. [68] S. Ross, G. J. Gordon, and D. Bagnell, “A reduction of imitationlearning and structured prediction to no-regret online learning,” in
International Conference on Artificial Intelligence and Statistics , ser.JMLR Proceedings, vol. 15, 2011, pp. 627–635.[69] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-endautonomous driving,” arXiv preprint arXiv:1605.06450 , 2016.[70] F. Codevilla, M. Miiller, A. L´opez, V. Koltun, and A. Dosovitskiy,“End-to-end driving via conditional imitation learning,” in
InternationalConference on Robotics and Automation . IEEE, 2018, pp. 1–9.[71] J. Chen, B. Yuan, and M. Tomizuka, “Deep imitation learning for au-tonomous driving in generic urban scenarios with enhanced safety,” in
IEEE/RSJ International Conference on Intelligent Robots and Systems(IROS) , 2019, pp. 2884–2890.[72] A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcementlearning,” in
International Conference on Machine Learning , 2000, pp.663–670.[73] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse rein-forcement learning,” in
International Conference on Machine Learning ,2004, p. 1.[74] N. D. Ratliff, J. A. Bagnell, and M. Zinkevich, “Maximum marginplanning,” in
International Conference on Machine Learning , vol. 148,2006, pp. 729–736.[75] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximumentropy inverse reinforcement learning,” in
Aaai , vol. 8, 2008, pp.1433–1438.[76] S. Levine, Z. Popovic, and V. Koltun, “Nonlinear inverse reinforcementlearning with gaussian processes,” in
Advances in Neural InformationProcessing Systems , 2011, pp. 19–27.[77] N. D. Ratliff, D. M. Bradley, J. A. Bagnell, and J. E. Chestnutt,“Boosting structured prediction for imitation learning,” in
Advancesin Neural Information Processing Systems , 2006, pp. 1153–1160.[78] N. D. Ratliff, D. Silver, and J. A. Bagnell, “Learning to search:Functional gradient techniques for imitation learning,”
Auton. Robots ,vol. 27, no. 1, pp. 25–53, 2009.[79] M. Wulfmeier, P. Ondruska, and I. Posner, “Maximum entropy deep in-verse reinforcement learning,” arXiv preprint arXiv:1507.04888 , 2015.[80] C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverseoptimal control via policy optimization,” in
International Conferenceon Machine Learning , 2016, pp. 49–58.[81] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in
Advances in Neural Information Processing Systems , 2016, pp. 4565–4573.[82] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
Advances in Neural Information Processing Systems , 2014, pp. 2672–2680.[83] C. Finn, P. F. Christiano, P. Abbeel, and S. Levine, “A connection be-tween generative adversarial networks, inverse reinforcement learning,and energy-based models,”
CoRR , vol. abs/1611.03852, 2016.[84] J. Fu, K. Luo, and S. Levine, “Learning robust rewards with adversarialinverse reinforcement learning,”
CoRR , vol. abs/1710.11248, 2017.[85] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun, “Off-roadobstacle avoidance through end-to-end learning,” in
Advances in neuralinformation processing systems , 2006, pp. 739–746.[86] V. Rausch, A. Hansen, E. Solowjow, C. Liu, E. Kreuzer, and J. K.Hedrick, “Learning a deep neural net policy for end-to-end control ofautonomous vehicles,” in .IEEE, 2017, pp. 4914–4919.[87] H. M. Eraqi, M. N. Moustafa, and J. Honer, “End-to-end deep learningfor steering autonomous vehicles considering temporal dependencies,” arXiv preprint arXiv:1710.03804 , 2017.[88] D. Wang, C. Devin, Q.-Z. Cai, F. Yu, and T. Darrell, “Deep object-centric policies for autonomous driving,” in
International Conferenceon Robotics and Automation . IEEE, 2019, pp. 8853–8859.[89] Y. Pan, C.-A. Cheng, K. Saigol, K. Lee, X. Yan, E. Theodorou, andB. Boots, “Agile autonomous driving using end-to-end deep imitationlearning,” arXiv preprint arXiv:1709.07174 , 2017.[90] M. Bouton, A. Nakhaei, K. Fujimura, and M. J. Kochenderfer, “Safereinforcement learning with scene decomposition for navigating com-plex urban environments,” in
Intelligent Vehicles Symposium . IEEE,2019, pp. 1469–1476.[91] P. Wang, C.-Y. Chan, and A. de La Fortelle, “A reinforcement learningbased approach for automated lane change maneuvers,” in
IntelligentVehicles Symposium . IEEE, 2018, pp. 1379–1384.[92] X. Chen, Y. Zhai, C. Lu, J. Gong, and G. Wang, “A learning modelfor personalized adaptive cruise control,” in
Intelligent Vehicles Sym-posium . IEEE, 2017, pp. 379–384. [93] P. Wang and C.-Y. Chan, “Formulation of deep reinforcement learningarchitecture toward autonomous driving for on-ramp merge,” in Inter-national Conference on Intelligent Transportation Systems . IEEE,2017, pp. 1–6.[94] H. Chae, C. M. Kang, B. Kim, J. Kim, C. C. Chung, and J. W. Choi,“Autonomous braking system via deep reinforcement learning,” in
International Conference on Intelligent Transportation Systems . IEEE,2017, pp. 1–6.[95] M. Bouton, A. Nakhaei, K. Fujimura, and M. J. Kochenderfer,“Cooperation-aware reinforcement learning for merging in dense traf-fic,” in
IEEE Intelligent Transportation Systems Conference . IEEE,2019, pp. 3441–3447.[96] Y. Tang, “Towards learning multi-agent negotiations via self-play,” in
IEEE International Conference on Computer Vision Workshops , 2019,pp. 0–0.[97] A. Folkers, M. Rick, and C. B¨uskens, “Controlling an autonomousvehicle with deep reinforcement learning,” in
Intelligent Vehicles Sym-posium . IEEE, 2019, pp. 2025–2031.[98] H. Porav and P. Newman, “Imminent collision mitigation with re-inforcement learning and vision,” in
International Conference onIntelligent Transportation Systems . IEEE, 2018, pp. 958–964.[99] S. Wang, D. Jia, and X. Weng, “Deep reinforcement learning forautonomous driving,” arXiv preprint arXiv:1811.11329 , 2018.[100] P. Wang, H. Li, and C.-Y. Chan, “Continuous control for automatedlane change behavior based on deep deterministic policy gradientalgorithm,” in
Intelligent Vehicles Symposium . IEEE, 2019, pp. 1454–1460.[101] M. Kaushik, V. Prasad, K. M. Krishna, and B. Ravindran, “Overtakingmaneuvers in simulated highway driving using deep reinforcementlearning,” in
Intelligent Vehicles Symposium . IEEE, 2018, pp. 1885–1890.[102] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “End-to-enddeep reinforcement learning for lane keeping assist,” arXiv preprintarXiv:1612.04340 , 2016.[103] R. Vasquez and B. Farooq, “Multi-objective autonomous brakingsystem using naturalistic dataset,” in
IEEE Intelligent TransportationSystems Conference . IEEE, 2019, pp. 4348–4353.[104] S. Hecker, D. Dai, and L. Van Gool, “End-to-end learning of drivingmodels with surround-view cameras and route planners,” in the Euro-pean Conference on Computer Vision (ECCV) , 2018, pp. 435–453.[105] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla:An open urban driving simulator,” arXiv preprint arXiv:1711.03938 ,2017.[106] M. Abdou, H. Kamal, S. El-Tantawy, A. Abdelkhalek, O. Adel,K. Hamdy, and M. Abaas, “End-to-end deep conditional imitationlearning for autonomous driving,” in . IEEE, 2019, pp. 346–350.[107] Y. Cui, D. Isele, S. Niekum, and K. Fujimura, “Uncertainty-aware dataaggregation for deep imitation learning,” in
International Conferenceon Robotics and Automation . IEEE, 2019, pp. 761–767.[108] L. Tai, P. Yun, Y. Chen, C. Liu, H. Ye, and M. Liu, “Visual-basedautonomous driving deployment from a stochastic and uncertainty-aware perspective,” arXiv preprint arXiv:1903.00821 , 2019.[109] M. Buechel and A. Knoll, “Deep reinforcement learning for predictivelongitudinal control of automated vehicles,” in
International Confer-ence on Intelligent Transportation Systems . IEEE, 2018, pp. 2391–2397.[110] J. Chen, B. Yuan, and M. Tomizuka, “Model-free deep reinforcementlearning for urban autonomous driving,” in
IEEE Intelligent Trans-portation Systems Conference . IEEE, 2019, pp. 2765–2771.[111] M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning todrive by imitating the best and synthesizing the worst,” arXiv preprintarXiv:1812.03079 , 2018.[112] L. Sun, C. Peng, W. Zhan, and M. Tomizuka, “A fast integratedplanning and control framework for autonomous driving via imitationlearning,” in
Dynamic Systems and Control Conference , vol. 51913.American Society of Mechanical Engineers, 2018, p. V003T37A012.[113] M. Wulfmeier, D. Rao, D. Z. Wang, P. Ondruska, and I. Posner,“Large-scale cost function learning for path planning using deepinverse reinforcement learning,”
The International Journal of RoboticsResearch , vol. 36, no. 10, pp. 1073–1087, 2017.[114] J. Bernhard, R. Gieselmann, K. Esterle, and A. Knol, “Experience-based heuristic search: Robust motion planning with deep q-learning,”in
International Conference on Intelligent Transportation Systems .IEEE, 2018, pp. 3175–3182.[115] P. Hart, L. Rychly, and A. Knoll, “Lane-merging using policy-basedreinforcement learning and post-optimization,” in
IEEE Intelligent Transportation Systems Conference . IEEE, 2019, pp. 3176–3181.[116] P. Wang, D. Liu, J. Chen, H. Li, and C.-Y. Chan, “Human-like decisionmaking for autonomous driving via adversarial inverse reinforcementlearning,” arXiv , pp. arXiv–1911, 2019.[117] S. Sharifzadeh, I. Chiotellis, R. Triebel, and D. Cremers, “Learning todrive using inverse reinforcement learning and deep q-networks,” arXivpreprint arXiv:1612.03653 , 2016.[118] A. Alizadeh, M. Moghadam, Y. Bicer, N. K. Ure, U. Yavas, andC. Kurtulus, “Automated lane change decision making using deep re-inforcement learning in dynamic and uncertain highway environment,”in
IEEE Intelligent Transportation Systems Conference . IEEE, 2019,pp. 1399–1404.[119] N. Deshpande and A. Spalanzani, “Deep reinforcement learning basedvehicle navigation amongst pedestrians using a grid-based state rep-resentation,” in
IEEE Intelligent Transportation Systems Conference .IEEE, 2019, pp. 2081–2086.[120] T. Tram, I. Batkovic, M. Ali, and J. Sj¨oberg, “Learning when to drive inintersections by combining reinforcement learning and model predic-tive control,” in
International Conference on Intelligent TransportationSystems . IEEE, 2019, pp. 3263–3268.[121] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura,“Navigating occluded intersections with autonomous vehicles usingdeep reinforcement learning,” in
International Conference on Roboticsand Automation . IEEE, 2018, pp. 2034–2039.[122] C. Li and K. Czarnecki, “Urban driving with multi-objective deepreinforcement learning,” in
International Conference on AutonomousAgents and MultiAgent Systems (AAMAS) . International Foundationfor Autonomous Agents and Multiagent Systems, 2019, pp. 359–367.[123] M. P. Ronecker and Y. Zhu, “Deep q-network based decision makingfor autonomous driving,” in
International Conference on Robotics andAutomation Sciences . IEEE, 2019, pp. 154–160.[124] P. Wolf, K. Kurzer, T. Wingert, F. Kuhnt, and J. M. Zollner, “Adaptivebehavior generation for autonomous driving using deep reinforcementlearning with compact semantic states,” in
Intelligent Vehicles Sympo-sium . IEEE, 2018, pp. 993–1000.[125] B. Mirchevska, C. Pek, M. Werling, M. Althoff, and J. Boedecker,“High-level decision making for safe and reasonable autonomous lanechanging using reinforcement learning,” in
International Conferenceon Intelligent Transportation Systems . IEEE, 2018, pp. 2156–2162.[126] W. Yuan, M. Yang, Y. He, C. Wang, and B. Wang, “Multi-rewardarchitecture based reinforcement learning for highway driving policies,”in
International Conference on Intelligent Transportation Systems .IEEE, 2019, pp. 3810–3815.[127] J. Lee and J. W. Choi, “May i cut into your lane?: A policy networkto learn interactive lane change behavior for autonomous driving,” in
IEEE Intelligent Transportation Systems Conference . IEEE, 2019, pp.4342–4347.[128] C. You, J. Lu, D. Filev, and P. Tsiotras, “Highway traffic modeling anddecision making for autonomous vehicle using reinforcement learning,”in
Intelligent Vehicles Symposium . IEEE, 2018, pp. 1227–1232.[129] L. Wang, F. Ye, Y. Wang, J. Guo, I. Papamichail, M. Papageorgiou,S. Hu, and L. Zhang, “A q-learning foresighted approach to ego-efficient lane changes of connected and automated vehicles on free-ways,” in
IEEE Intelligent Transportation Systems Conference . IEEE,2019, pp. 1385–1392.[130] C.-J. Hoel, K. Wolff, and L. Laine, “Automated speed and lane changedecision making using deep reinforcement learning,” in
InternationalConference on Intelligent Transportation Systems . IEEE, 2018, pp.2148–2155.[131] Y. Zhang, P. Sun, Y. Yin, L. Lin, and X. Wang, “Human-like au-tonomous vehicle speed control by deep reinforcement learning withdouble q-learning,” in
Intelligent Vehicles Symposium . IEEE, 2018,pp. 1251–1256.[132] D. Liu, M. Br¨annstrom, A. Backhouse, and L. Svensson, “Learningfaster to perform autonomous lane changes by constructing maneuversfrom shielded semantic actions,” in
IEEE Intelligent TransportationSystems Conference . IEEE, 2019, pp. 1838–1844.[133] K. Min, H. Kim, and K. Huh, “Deep distributional reinforcement learn-ing based high-level driving policy determination,”
IEEE Transactionson Intelligent Vehicles , vol. 4, no. 3, pp. 416–424, 2019.[134] K. Rezaee, P. Yadmellat, M. S. Nosrati, E. A. Abolfathi,M. Elmahgiubi, and J. Luo, “Multi-lane cruising using hierarchicalplanning and reinforcement learning,” in
International Conference onIntelligent Transportation Systems . IEEE, 2019, pp. 1800–1806.[135] T. Shi, P. Wang, X. Cheng, C.-Y. Chan, and D. Huang, “Drivingdecision and control for automated lane change behavior based on deepreinforcement learning,” in
IEEE Intelligent Transportation Systems Conference . IEEE, 2019, pp. 2895–2900.[136] J. Chen, Z. Wang, and M. Tomizuka, “Deep hierarchical reinforcementlearning for autonomous driving with distinct behaviors,” in
IntelligentVehicles Symposium . IEEE, 2018, pp. 1239–1244.[137] Z. Qiao, K. Muelling, J. Dolan, P. Palanisamy, and P. Mudalige,“Pomdp and hierarchical options mdp with continuous actions forautonomous driving at intersections,” in
International Conference onIntelligent Transportation Systems . IEEE, 2018, pp. 2377–2382.[138] C. Paxton, V. Raman, G. D. Hager, and M. Kobilarov, “Combiningneural networks and tree search for task and motion planning inchallenging environments,” in
IEEE/RSJ International Conference onIntelligent Robots and Systems (IROS) . IEEE, 2017, pp. 6059–6066.[139] F. Behbahani, K. Shiarlis, X. Chen, V. Kurin, S. Kasewa, C. Stirbu,J. Gomes, S. Paul, F. A. Oliehoek, J. Messias et al. , “Learning fromdemonstration in the wild,” in
International Conference on Roboticsand Automation . IEEE, 2019, pp. 775–781.[140] L. Chen, Y. Chen, X. Yao, Y. Shan, and L. Chen, “An adaptive pathtracking controller based on reinforcement learning with urban drivingapplication,” in
Intelligent Vehicles Symposium . IEEE, 2019, pp. 2411–2416.[141] M. Huegle, G. Kalweit, B. Mirchevska, M. Werling, and J. Boedecker,“Dynamic input for deep reinforcement learning in autonomous driv-ing,” in
IEEE/RSJ International Conference on Intelligent Robots andSystems (IROS) , 2019, pp. 7566–7573.[142] C. Desjardins and B. Chaib-Draa, “Cooperative adaptive cruise control:A reinforcement learning approach,”
IEEE Transactions on intelligenttransportation systems , vol. 12, no. 4, pp. 1248–1260, 2011.[143] D. Zhao, B. Wang, and D. Liu, “A supervised actor–critic approach foradaptive cruise control,”
Soft Computing , vol. 17, no. 11, pp. 2089–2099, 2013.[144] D. Zhao, Z. Xia, and Q. Zhang, “Model-free optimal control basedintelligent cruise control with hardware-in-the-loop demonstration [re-search frontier],”
IEEE Computational Intelligence Magazine , vol. 12,no. 2, pp. 56–69, 2017.[145] K. Min, H. Kim, and K. Huh, “Deep q learning based high level drivingpolicy determination,” in
Intelligent Vehicles Symposium . IEEE, 2018,pp. 226–231.[146] A. Kuefler, J. Morton, T. Wheeler, and M. Kochenderfer, “Imitatingdriver behavior with generative adversarial networks,” in
IntelligentVehicles Symposium . IEEE, 2017, pp. 204–211.[147] R. P. Bhattacharyya, D. J. Phillips, B. Wulfe, J. Morton, A. Kuefler, andM. J. Kochenderfer, “Multi-agent imitation learning for driving sim-ulation,” in
IEEE/RSJ International Conference on Intelligent Robotsand Systems (IROS) . IEEE, 2018, pp. 1534–1539.[148] M. Kuderer, S. Gulati, and W. Burgard, “Learning driving styles forautonomous vehicles from demonstration,” in
International Conferenceon Robotics and Automation . IEEE, 2015, pp. 2641–2646.[149] R. P. Bhattacharyya, D. J. Phillips, C. Liu, J. K. Gupta, K. Driggs-Campbell, and M. J. Kochenderfer, “Simulating emergent properties ofhuman driving behavior using multi-agent reward augmented imitationlearning,” in
International Conference on Robotics and Automation .IEEE, 2019, pp. 789–795.[150] Z. Huang, X. Xu, H. He, J. Tan, and Z. Sun, “Parameterized batchreinforcement learning for longitudinal control of autonomous landvehicles,”
IEEE Transactions on Systems, Man, and Cybernetics:Systems , vol. 49, no. 4, pp. 730–741, 2017.[151] M. J. Hausknecht and P. Stone, “Deep reinforcement learning inparameterized action space,” in
International Conference on LearningRepresentations , 2016.[152] M. Everett, Y. F. Chen, and J. P. How, “Motion planning amongdynamic, decision-making agents with deep reinforcement learning,” in
IEEE/RSJ International Conference on Intelligent Robots and Systems(IROS) . IEEE, 2018, pp. 3052–3059.[153] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. P´oczos, R. Salakhutdinov,and A. J. Smola, “Deep sets,” in
Advances in Neural InformationProcessing Systems 30 , 2017.[154] D. Hayashi, Y. Xu, T. Bando, and K. Takeda, “A predictive rewardfunction for human-like driving based on a transition model of sur-rounding environment,” in
International Conference on Robotics andAutomation . IEEE, 2019, pp. 7618–7624.[155] S. Qi and S.-C. Zhu, “Intent-aware multi-agent reinforcement learning,”in
International Conference on Robotics and Automation . IEEE, 2018,pp. 7533–7540.[156] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, “Crowd-robot interaction:Crowd-aware robot navigation with attention-based deep reinforcementlearning,” in
International Conference on Robotics and Automation .IEEE, 2019, pp. 6015–6022. [157] Y. Hu, A. Nakhaei, M. Tomizuka, and K. Fujimura, “Interaction-awaredecision making with adaptive strategies under merging scenarios,” arXiv preprint arXiv:1904.06025 , 2019.[158] J. Garc´ıa, Fern, and o Fern´andez, “A comprehensive survey on safe re-inforcement learning,”
Journal of Machine Learning Research , vol. 16,no. 42, pp. 1437–1480, 2015.[159] N. Jansen, B. K¨onighofer, S. Junges, and R. Bloem, “Shielded decision-making in mdps,”
CoRR , vol. abs/1807.06096, 2018.[160] N. Fulton and A. Platzer, “Safe reinforcement learning via formalmethods: Toward safe control through proof and learning,” in theThirty-Second AAAI Conference on Artificial Intelligence , 2018, pp.6485–6492.[161] M. Bouton, J. Karlsson, A. Nakhaei, K. Fujimura, M. J. Kochenderfer,and J. Tumova, “Reinforcement learning with probabilistic guaranteesfor autonomous driving,” arXiv preprint arXiv:1904.07189 , 2019.[162] D. Isele, A. Nakhaei, and K. Fujimura, “Safe reinforcement learningon autonomous vehicles,” in
IEEE/RSJ International Conference onIntelligent Robots and Systems (IROS) . IEEE, 2018, pp. 1–6.[163] M. Mukadam, A. Cosgun, A. Nakhaei, and K. Fujimura, “Tacticaldecision making for lane changing with deep reinforcement learning,”2017.[164] X. Xiong, J. Wang, F. Zhang, and K. Li, “Combining deep reinforce-ment learning and safety based control for autonomous driving,” arXivpreprint arXiv:1612.00147 , 2016.[165] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,”
CoRR , vol.abs/1610.03295, 2016.[166] F. Pusse and M. Klusch, “Hybrid online pomdp planning and deep re-inforcement learning for safer self-driving cars,” in
Intelligent VehiclesSymposium . IEEE, 2019, pp. 1013–1020.[167] J. Jiang, C. Dun, T. Huang, and Z. Lu, “Graph convolutional reinforce-ment learning,” in
International Conference on Learning Representa-tions , 2020.[168] A. Mohseni-Kabir, D. Isele, and K. Fujimura, “Interaction-aware multi-agent reinforcement learning for mobile agents with individual goals,”in
International Conference on Robotics and Automation . IEEE, 2019,pp. 3370–3376.[169] J. F. Fisac, E. Bronstein, E. Stefansson, D. Sadigh, S. S. Sastry, andA. D. Dragan, “Hierarchical game-theoretic planning for autonomousvehicles,” in
International Conference on Robotics and Automation .IEEE, 2019, pp. 9590–9596.[170] N. Li, D. W. Oyler, M. Zhang, Y. Yildiz, I. Kolmanovsky, and A. R.Girard, “Game theoretic modeling of driver and vehicle interactionsfor verification and validation of autonomous vehicle control systems,”
IEEE Transactions on control systems technology , vol. 26, no. 5, pp.1782–1797, 2017.[171] G. Ding, S. Aghli, C. Heckman, and L. Chen, “Game-theoreticcooperative lane changing using data-driven models,” in
IEEE/RSJInternational Conference on Intelligent Robots and Systems (IROS) .IEEE, 2018, pp. 3640–3647.[172] M. Huegle, G. Kalweit, M. Werling, and J. Boedecker, “Dynamicinteraction-aware scene understanding for reinforcement learning inautonomous driving,” arXiv preprint arXiv:1909.13582 , 2019.[173] T. N. Kipf and M. Welling, “Semi-supervised classification with graphconvolutional networks,” in
International Conference on LearningRepresentations , 2017.[174] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, F. Li, and S. Savarese,“Social LSTM: human trajectory prediction in crowded spaces,” in
IEEE Conference on Computer Vision and Pattern Recognition , 2016,pp. 961–971.[175] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “So-cial GAN: socially acceptable trajectories with generative adversarialnetworks,” in
IEEE Conference on Computer Vision and PatternRecognition , 2018, pp. 2255–2264.[176] A. Vemula, K. Muelling, and J. Oh, “Social attention: Modelingattention in human crowds,” in
International Conference on Roboticsand Automation . IEEE, 2018, pp. 1–7.[177] Y. Hoshen, “VAIN: attentional multi-agent predictive modeling,” in
Advances in Neural Information Processing Systems , 2017, pp. 2701–2711.[178] M. Henaff, A. Canziani, and Y. LeCun, “Model-predictive policylearning with uncertainty regularization for driving in dense traffic,”in
International Conference on Learning Representations , 2019.[179] C. Guestrin, M. G. Lagoudakis, and R. Parr, “Coordinated reinforce-ment learning,” in
International Conference on Machine Learning ,2002, pp. 227–234.[180] C. Yu, X. Wang, X. Xu, M. Zhang, H. Ge, J. Ren, L. Sun, B. Chen, and G. Tan, “Distributed multiagent coordinated learning for autonomousdriving in highways based on dynamic coordination graphs,”
IEEETransactions on Intelligent Transportation Systems , vol. 21, no. 2, pp.735–748, 2019.[181] K. Leyton-Brown and Y. Shoham,
Essentials of Game Theory: AConcise Multidisciplinary Introduction , ser. Synthesis Lectures onArtificial Intelligence and Machine Learning. Morgan & ClaypoolPublishers, 2008.[182] G. P. Papavassilopoulos and M. G. Safonov, “Robust control designvia game theoretic methods,” in
IEEE Conference on Decision andControl , 1989, pp. 382–387 vol.1.[183] U. Branch, S. Ganebnyi, S. Kumkov, V. Patsko, and S. Pyatko, “Robustcontrol in game problems with linear dynamics,”
International Journalof Mathematics, Game Theory and Algebra , vol. 3, 01 2007.[184] Hong Zhang, V. Kumar, and J. Ostrowski, “Motion planning withuncertainty,” in
International Conference on Robotics and Automation ,vol. 1, 1998, pp. 638–643 vol.1.[185] M. Zhu, M. Otte, P. Chaudhari, and E. Frazzoli, “Game theoreticcontroller synthesis for multi-robot motion planning part i: Trajectorybased algorithms,” in
International Conference on Robotics and Au-tomation , 2014, pp. 1646–1651.[186] P. Wilson and D. Stahl, “On players’ models of other players: Theoryand experimental evidence,”
Games and Economic Behavior , vol. 10,pp. 218–254, 07 1995.[187] L. Sun, W. Zhan, and M. Tomizuka, “Probabilistic prediction of interac-tive driving behavior via hierarchical inverse reinforcement learning,” in
International Conference on Intelligent Transportation Systems . IEEE,2018, pp. 2111–2117.[188] P. Wang, Y. Li, S. Shekhar, and W. F. Northrop, “Uncertainty esti-mation with distributional reinforcement learning for applications inintelligent transportation systems: A case study,” in
IEEE Intelligent Transportation Systems Conference . IEEE, 2019, pp. 3822–3827.[189] J. Bernhard, S. Pollok, and A. Knoll, “Addressing inherent uncertainty: Risk-sensitive behavior generation for automated driving using distributional reinforcement learning,” in Intelligent Vehicles Symposium . IEEE, 2019, pp. 2148–2155.[190] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, and S. Levine, “Uncertainty-aware reinforcement learning for collision avoidance,” arXiv preprint arXiv:1702.01182 , 2017.[191] S. Choi, K. Lee, S. Lim, and S. Oh, “Uncertainty-aware learning from demonstration using mixture density networks with sampling-free variance modeling,” in
[192] K. Lee, K. Saigol, and E. A. Theodorou, "Safe end-to-end imitation learning for model predictive control," arXiv preprint arXiv:1803.10231, 2018.
[193] Y. Gal, "Uncertainty in deep learning," Ph.D. dissertation, University of Cambridge, 2016.
[194] A. Kendall and Y. Gal, "What uncertainties do we need in Bayesian deep learning for computer vision?" in Advances in Neural Information Processing Systems, 2017, pp. 5574–5584.
[195] Y. Gal and Z. Ghahramani, "Bayesian convolutional neural networks with Bernoulli approximate variational inference," CoRR, vol. abs/1506.02158, 2015.
[196] Y. Gal, R. Islam, and Z. Ghahramani, "Deep Bayesian active learning with image data," in International Conference on Machine Learning, 2017, pp. 1183–1192.
[197] T. G. Dietterich, "Ensemble methods in machine learning," in Multiple Classifier Systems, First International Workshop, MCS 2000, Cagliari, Italy, June 21-23, 2000, Proceedings, ser. Lecture Notes in Computer Science, vol. 1857. Springer, 2000, pp. 1–15.
[198] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," in Advances in Neural Information Processing Systems 30, 2017.
[199] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. Springer, 1993.
[200] M. G. Bellemare, W. Dabney, and R. Munos, "A distributional perspective on reinforcement learning," in International Conference on Machine Learning, 2017, pp. 449–458.
[201] W. Dabney, G. Ostrovski, D. Silver, and R. Munos, "Implicit quantile networks for distributional reinforcement learning," in International Conference on Machine Learning, 2018, pp. 1104–1113.
[202] Y. Li, J. Song, and S. Ermon, "InfoGAIL: Interpretable imitation learning from visual demonstrations," in Advances in Neural Information Processing Systems, 2017, pp. 3812–3822.
[203] A. Kuefler and M. J. Kochenderfer, "Burn-in demonstrations for multi-modal imitation learning," arXiv preprint arXiv:1710.05090, 2017.
[204] Z. Wang, J. S. Merel, S. E. Reed, N. de Freitas, G. Wayne, and N. Heess, "Robust imitation of diverse behaviors," in Advances in Neural Information Processing Systems, 2017, pp. 5320–5329.
[205] J. Merel, Y. Tassa, D. TB, S. Srinivasan, J. Lemmon, Z. Wang, G. Wayne, and N. Heess, "Learning human behaviors from motion capture by adversarial imitation," arXiv preprint arXiv:1707.02201, 2017.
[206] J. Lin and Z. Zhang, "ACGAIL: Imitation learning about multiple intentions with auxiliary classifier GANs," in Pacific Rim International Conference on Artificial Intelligence. Springer, 2018, pp. 321–334.
[207] C. Fei, B. Wang, Y. Zhuang, Z. Zhang, J. Hao, H. Zhang, X. Ji, and W. Liu, "Triple-GAIL: A multi-modal imitation learning framework with generative adversarial nets," in International Joint Conference on Artificial Intelligence (IJCAI), 2020, pp. 2929–2935.
[208] E. Schmerling, K. Leung, W. Vollprecht, and M. Pavone, "Multimodal probabilistic model-based planning for human-robot interaction," in International Conference on Robotics and Automation. IEEE, 2018, pp. 1–9.
[209] J. Li, H. Ma, and M. Tomizuka, "Interaction-aware multi-agent tracking and probabilistic behavior prediction via adversarial learning," in International Conference on Robotics and Automation. IEEE, 2019, pp. 6658–6664.
[210] H. Ma, J. Li, W. Zhan, and M. Tomizuka, "Wasserstein generative learning with kinematic constraints for probabilistic interactive driving behavior prediction," in Intelligent Vehicles Symposium. IEEE, 2019, pp. 2477–2483.
[211] A. Turnwald, D. Althoff, D. Wollherr, and M. Buss, "Understanding human avoidance behavior: interaction-aware decision making based on game theory," International Journal of Social Robotics, vol. 8, no. 2, pp. 331–351, 2016.
[212] R. Tian, S. Li, N. Li, I. Kolmanovsky, A. Girard, and Y. Yildiz, "Adaptive game-theoretic decision making for autonomous vehicle control at roundabouts," in IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 321–326.
[213] D. Isele, "Interactive decision making for autonomous vehicles in dense traffic," in IEEE Intelligent Transportation Systems Conference. IEEE, 2019, pp. 3981–3986.
[214] H. Bai, S. Cai, N. Ye, D. Hsu, and W. S. Lee, "Intention-aware online POMDP planning for autonomous driving in a crowd," in International Conference on Robotics and Automation. IEEE, 2015, pp. 454–460.
[215] C. Hubmann, J. Schulz, G. Xu, D. Althoff, and C. Stiller, "A belief state planner for interactive merge maneuvers in congested traffic," in International Conference on Intelligent Transportation Systems. IEEE, 2018, pp. 1617–1624.
[216] C. A. Holloway, Decision Making Under Uncertainty: Models and Choices. Prentice-Hall, Englewood Cliffs, NJ, 1979, vol. 8.
[217] M. J. Kochenderfer, Decision Making Under Uncertainty: Theory and Application. MIT Press, 2015.
[218] R. Camacho and D. Michie, "Behavioral cloning: A correction," AI Mag., vol. 16, no. 2, p. 92, 1995.
Zeyu Zhu received the B.S. degree in computer science from Peking University, Beijing, China, in 2019, where he is currently pursuing the Ph.D. degree with the Key Laboratory of Machine Perception (MOE), Peking University. His research interests include intelligent vehicles, reinforcement learning, and machine learning.