A Safe Hierarchical Planning Framework for Complex Driving Scenarios based on Reinforcement Learning
Jinning Li, Liting Sun, Jianyu Chen, Masayoshi Tomizuka, Wei Zhan
Abstract — Autonomous vehicles need to handle various traffic conditions and make safe and efficient decisions and maneuvers. However, on the one hand, a single optimization/sampling-based motion planner cannot efficiently generate safe trajectories in real time, particularly when there are many interactive vehicles nearby. On the other hand, end-to-end learning methods cannot assure the safety of the outcomes. To address this challenge, we propose a hierarchical behavior planning framework with a set of low-level safe controllers and a high-level reinforcement learning algorithm (H-CtRL) as a coordinator for the low-level controllers. Safety is guaranteed by the low-level optimization/sampling-based controllers, while the high-level reinforcement learning algorithm makes H-CtRL an adaptive and efficient behavior planner. To train and test our proposed algorithm, we built a simulator that can reproduce traffic scenes using real-world datasets. The proposed H-CtRL is proved to be effective in various realistic simulation scenarios, with satisfying performance in terms of both safety and efficiency.
I. INTRODUCTION

Recently, the field of autonomous driving has witnessed rapid development. A number of novel behavior planning algorithms have been proposed, many of which are impressive achievements. However, it is almost impossible for a single behavior planner to drive through the many different real and complex scenarios. Consider, for example, a person who rides in an autonomous car in everyday life. The autonomous car should be able to drive both in cities and on highways, which include various intersections, roundabouts, and merging and following scenarios (as shown in Figure 1). A single planning algorithm, such as an optimization-based planner, may fail to find a solution in every complex traffic condition. The hyperparameters of a planner or policy may work well in intersection scenarios but are very likely to fail in highway merging scenarios, because the optimal driving settings in each scenario, e.g., the optimal speed and spacing, are difficult to adjust with only one fixed planner.

The most difficult part of this complex driving problem lies in the rapidly changing environment. The road curvature may change, and the number of obstacles may vary in each of these self-driving scenarios. For example, when an autonomous vehicle is trying to make a left turn at an intersection, it has to consider the number of oncoming vehicles, their actual speeds, and their distances to the intersection. However, when the self-driving vehicle is finding its way to merge into the next lane on the highway, although it has to analyze similar information, it needs to do so at a much larger scale for distance, velocity, etc. We therefore have to find an adaptive policy that can make high-quality decisions in various scenarios.
J. Li, L. Sun, M. Tomizuka and W. Zhan are with the Department of Mechanical Engineering, University of California, Berkeley, CA 94720, USA {jinning li, litingsun, tomizuka, wzhan}@berkeley.edu

Fig. 1: Various complex real-world traffic conditions in which autonomous vehicles (red) must drive safely and efficiently.

Each behavior and motion planner from existing work has its own advantages and disadvantages. For example, learning-based methods can recognize the specific properties of different scenarios with less effort, but it is hard to interpret the results and generalize them to other scenarios. On the other hand, non-learning-based methods can ensure the safety of the agent and deal with many similar cases without much modification, but it is difficult to tune the hyperparameters (e.g., the detailed cost function of motion planning) to be compatible with various traffic scenarios with different numbers of interactive agents. Therefore, based on the analysis of these limitations, we seek to design a hierarchical model that exploits the advantages of both reinforcement learning and non-learning-based policies.

In this paper, we make the following contributions:
• We propose a Hierarchical behavior planning framework with low-level safe Controllers and a high-level Reinforcement Learning policy (H-CtRL). It is an adaptive behavior planner that can make high-quality decisions in many complex traffic scenarios.
• A simulator that can reproduce traffic scenes from real-world traffic datasets is constructed, so that the proposed method can be trained and tested in realistic scenarios.
• We test the proposed method H-CtRL in real-world traffic conditions, and it is proved to be capable of handling different planning tasks in various scenarios.

II. RELATED WORK
A. Non-Learning Based Methods
Non-learning-based decision-making modules are considered to be safety-guaranteed and interpretable [1], [2]. However, these planners tend to be too conservative. In [3], [4], this problem was addressed by making the autonomous agent more cognizant of and reactive to obstacles. The authors of [5] proposed another framework that leverages effects on human actions to make the planner more interactive. However, the aforementioned methods suffer more or less from their long computation time. The Constrained Iterative Linear Quadratic Regulator [6] significantly reduced the computation time while preserving the safety of the agent. We aim to inherit the safety guarantees and fast computation of these methods, so we select non-learning-based planners as our low-level safe controllers.

B. Supervised Learning
The application of supervised learning to autonomous driving dates back to the work in [7]. The authors of [8] then learned a map from raw pixels directly to steering commands, where the concept of imitation learning (IL) began to surface. A more robust perception-action model was developed in [9]. To enhance the safety of IL, [10] proposed a hierarchical framework which utilized a high-level IL policy and a low-level MPC controller to improve efficiency and safety. Similarly, to make IL generalizable and able to deal with complex urban scenarios, the authors of [11] learned policies from offline connected driving data and integrated a safety controller at test time.
C. Reinforcement Learning
Reinforcement learning (RL) has also been extensively explored in autonomous driving. The algorithm in [12] adopted Recurrent Neural Networks for information integration and learned an effective driving policy in simulators. The authors of [13] proposed a driving policy that makes few assumptions about the environment. The work in [14] developed a realistic translation network to make sim2real possible. [15], [16] developed robust policies to make self-driving cars capable of driving through complex urban scenarios. We believe RL is a suitable high-level policy candidate: it can learn from experience which low-level controller is the most suitable at a specific time step.

Hierarchical Reinforcement Learning (HRL) can make the learning process more sample-efficient. The idea in HRL is to reuse the well-trained network of one sub-goal on other similar tasks [17], [18]. There are also many variants of HRL. The work in [19] integrated a sampling-based motion planner with a high-level RL policy, which can solve long-horizon problems. Similarly, [20] combined deep reinforcement learning with Monte Carlo sampling to achieve tactical decision making for autonomous vehicles. The authors of [21] developed SAVED, which augmented the safety of model-based reinforcement learning with value estimation from demonstrations. Planning only in normal scenarios is not enough for reliable self-driving cars, so the authors of [22] developed a hybrid method with RL and IL policies to plan safely in near-accident scenarios. The authors of [23] proposed an attention-based architecture that can deal with a varying number of obstacles and the interactions in between.

We adopt the basic idea of reusing low-level controllers, and aim to design a novel planning module that works in various traffic conditions. The high-level RL is trained to recognize and react to different environments, while the low-level conventional controllers fulfill the goals sent from the high-level RL and guarantee safety at the same time.
III. PROBLEM STATEMENT
Throughout the paper, we focus on the behavior planning problem in different complex urban traffic scenarios. There is one ego agent and many other obstacle cars in the environment. Each of them has its own behavior pattern. Thus, we need a mechanism to model the evolution of each scene and the interactions among agents. We formulate the problem as a Partially Observable Markov Decision Process (POMDP). A POMDP can be defined as a tuple ⟨S, A, O, T, Z, R, γ⟩. S denotes the state space and s ∈ S is a state of the environment. A is defined to be the action space and a ∈ A is an action taken by the ego agent. o ∈ O is an observation received by the ego agent. The transition model T(s, a, s′) is the probability of the transition from a current state-action pair (s, a) to s′ at the next time step. Z(s, o) denotes the observation model, which gives the probability of receiving the observation o in a state s. The reward function is defined by R(s, a), which yields a specific reward for a state-action pair (s, a). The discount factor is denoted by γ. The overall objective is to maximize the expected discounted reward and find the corresponding optimal policy

\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R\left(s_t, \pi(o_t)\right) \right]   (1)

where s_t and o_t are the state and the observation of the environment at time step t, respectively.
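For illustration, a minimal Python sketch of a Monte Carlo estimate of the objective in (1) is given below. The gym-style env interface, the policy callable, and the function name are assumptions for this sketch rather than part of the proposed method.

import numpy as np


def estimate_return(env, policy, gamma=0.99, n_episodes=20, max_steps=500):
    # Monte Carlo estimate of E[ sum_t gamma^t R(s_t, pi(o_t)) ] in Eq. (1)
    returns = []
    for _ in range(n_episodes):
        obs = env.reset()                 # initial observation o_0
        discount, total = 1.0, 0.0
        for _ in range(max_steps):
            action = policy(obs)          # a_t = pi(o_t)
            obs, reward, done, _ = env.step(action)
            total += discount * reward    # accumulate gamma^t * r_t
            discount *= gamma
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))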
IV. METHOD

A. H-CtRL Framework

We propose to solve the aforementioned POMDP with a hierarchical behavior planning framework, H-CtRL (shown in Figure 2). It can be viewed as an integrated solver for the POMDP: we input the current state of the environment into the framework, and it outputs actions to be executed by the ego agent. The hierarchical framework consists of a collection of low-level controllers with safety constraints and a high-level reinforcement learning policy to manage them all. We aim to find an optimal high-level policy π* that can take advantage of the right controller at a specific time step when doing behavior planning for the autonomous agent.

In detail, the state s_t of the problem at time step t is designed to be the low-dimensional states of all agents present in the environment at t, namely,

s_t = \left[ s_t^0 \,\|\, s_t^1 \,\|\, \dots \,\|\, s_t^m \right], \quad s_t^i = \left[ x_t^i \;\; y_t^i \;\; v_t^i \;\; \theta_t^i \right], \; i \in [0, m]   (2)

Fig. 2: The hierarchical framework with low-level safe controllers and a high-level RL algorithm.

where s_t^i, i ∈ [0, m] is the low-dimensional state of the i-th agent in the environment at time step t (note that s_t^0 denotes the state of the ego agent), and s_t is the state of the environment. The operator || here is the concatenation operator. The action at time step t is denoted by a_t = [acc_t δ_t], where acc_t and δ_t are the acceleration and the steering angle of the ego agent, respectively. We choose the states of the k nearest neighbors around the ego vehicle at time step t to be the observation, namely, o_t = [o_t^1 || o_t^2 || ... || o_t^k], where o_t^i, i ∈ [1, k] is the state of the i-th nearest obstacle around the ego car. We refer to [24] and select the bicycle model as the transition model for the ego agent. As for the other vehicles in the scene, their states evolve according to the definition by the environment or the simulator.

Low-Level Safe Controllers consist of a set of n non-learning-based controllers, denoted by {f_1, f_2, ..., f_n}. They are like the workers within the hierarchical framework, responsible for specific tasks assigned by the high-level coordinator. Each f_i, i ∈ [1, n] has its own behavior pattern, either cooperative or selfish, either defensive or aggressive. Although they can lead to different driving strategies, they all have their own safety constraints when performing motion planning. For example, sampling-based planners would assign little probability to areas where an accident is more likely to happen, and optimization-based controllers would have a huge cost in the objective function when the output lands in a dangerous area. Since the chosen low-level controller is the one whose action directly affects the evolution of the environment, these safety constraints of the low-level controllers are generally a good property that guarantees the safety of the whole hierarchical behavior planning framework. Specifically, the low-level controller takes in the observation o_t and gives back an action a_t at time step t.

The High-Level Reinforcement Learning policy is the coordinator in this framework. It makes observations from the environment and chooses the low-level controller that is the most suitable given the current observation.
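As a concrete illustration of how the observation o_t can be assembled from low-dimensional agent states, a minimal Python sketch is given below. The helper name, the padding behavior for fewer than k neighbors, and the array layout are assumptions; the ego state and goal can additionally be appended, as is done for the high-level RL input in Section V.

import numpy as np


def build_observation(ego_state, obstacle_states, k=5):
    # ego_state:       [x, y, v, theta] of the ego vehicle
    # obstacle_states: array of shape (m, 4), one row per surrounding vehicle
    ego_state = np.asarray(ego_state, dtype=float)
    obstacle_states = np.atleast_2d(obstacle_states)
    # Euclidean distance from every obstacle to the ego position
    dists = np.linalg.norm(obstacle_states[:, :2] - ego_state[:2], axis=1)
    nearest = obstacle_states[np.argsort(dists)[:k]]
    # pad with zeros when fewer than k obstacles are present
    if nearest.shape[0] < k:
        pad = np.zeros((k - nearest.shape[0], obstacle_states.shape[1]))
        nearest = np.vstack([nearest, pad])
    # o_t = [o_t^1 || o_t^2 || ... || o_t^k]
    return np.concatenate([row for row in nearest])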
Algorithm 1 H-CtRL Training Algorithm

Initialize the simulation environment env and get an initial observation.
Initialize the policy and the target Q-network with weights θ and θ⁻. Set θ⁻ ← θ.
Instantiate an empty replay buffer B with a maximum length of l_B.
for h ← 0 to N_steps do
    Select an action a_h ← arg max_{a_h} Q_θ(o_h, a_h) according to the ε-greedy policy.
    Given o_h, the chosen low-level controller corresponding to a_h acts p timesteps in env to get the next observation o_{h+1} and the reward r_h.
    Push the transition T = [o_h, a_h, r_h, o_{h+1}] into B.
    Update the weights θ of the policy Q-network using the replay buffer B.
    if h mod target_update_frequency == 0 then
        θ⁻ ← θ.copy()
    if the episode is done then
        Record the cumulative reward in this episode.
        Reset env and get an initial observation o_{h+1}.
Therefore, the action space of the high-level RL is reduced and discretized from the original continuous space to a finite set of low-level controller ids {1, 2, ..., n}. The actions of the RL can be viewed as intermediate actions within the hierarchical framework, whereas the final output actions that directly interact with and affect the environment are the outputs of the low-level controllers. Here, we should note that to reduce the cardinality of the actions within one episode and improve the stability of the algorithm, the high-level RL switches its choice of low-level controllers every p time steps, which means that within these p time steps of the environment, only the chosen low-level controller plans trajectories consistently and is not disabled by the RL. We therefore introduce a new timestep h for the high-level RL: the original time step t = h · p coincides with the time step h in RL.

The state and the observation at each time step of the RL problem are defined accordingly as s_h and o_h. One way to get s_h and o_h is to convert the time scale in RL back to the original scale; thus s_h and o_h are the state and the observation at time step t = h · p in the environment. The transition model of RL is also defined within the new time scale, namely T(s_h, a_h, s_{h+1}), where s_{h+1} is the next state of s_h after applying p actions by the low-level controller corresponding to a_h. Because of the existence of low-level controllers, we no longer need to worry about various tedious design details of the reward function R(s_h, a_h), and can simply adopt a very high-level one that encourages the completion of the planning task as fast as possible without any collision. In detail, the ego agent receives a positive reward that is proportional to its progress along its reference trajectory, and a negative reward if the episode terminates early because of a collision, a low-level controller failure, or other factors.

Generally, the high-level RL can be trained using any model-free RL algorithm. In this paper, we choose to use Double Deep Q-Network (DDQN) [25] to learn our high-level RL policy, for it is more stable and has less variance. The pseudo-code for the hierarchical planning framework is shown in Algorithm 1.
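To make the structure of Algorithm 1 concrete, a minimal Python sketch of the training loop with a double-Q update and temporal abstraction over p low-level steps is given below. The environment and controller interfaces, network objects, and hyperparameter values are assumptions for this sketch, not the exact implementation used in the experiments.

import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F


def train_h_ctrl(env, controllers, q_net, target_net, optimizer,
                 n_steps=100_000, p=10, gamma=0.99, eps=0.1,
                 batch_size=64, buffer_size=50_000, target_update=1000):
    buffer = deque(maxlen=buffer_size)      # replay buffer B
    obs = env.reset()
    for h in range(n_steps):
        # epsilon-greedy selection of a low-level controller id a_h
        if random.random() < eps:
            a = random.randrange(len(controllers))
        else:
            with torch.no_grad():
                a = int(q_net(torch.as_tensor(obs, dtype=torch.float32)).argmax())
        # the chosen controller acts for p low-level timesteps in env
        reward_h, done, next_obs = 0.0, False, obs
        for _ in range(p):
            low_level_action = controllers[a](next_obs)
            next_obs, r, done, _ = env.step(low_level_action)
            reward_h += r
            if done:
                break
        buffer.append((obs, a, reward_h, next_obs, float(done)))
        obs = env.reset() if done else next_obs
        # one double-Q update of the policy network from the replay buffer
        if len(buffer) >= batch_size:
            batch = random.sample(buffer, batch_size)
            o, act, rew, o2, d = (np.array(x) for x in zip(*batch))
            o, o2 = (torch.as_tensor(x, dtype=torch.float32) for x in (o, o2))
            act = torch.as_tensor(act, dtype=torch.int64).unsqueeze(1)
            rew, d = (torch.as_tensor(x, dtype=torch.float32) for x in (rew, d))
            with torch.no_grad():
                best = q_net(o2).argmax(dim=1, keepdim=True)
                target = rew + gamma * (1.0 - d) * target_net(o2).gather(1, best).squeeze(1)
            q = q_net(o).gather(1, act).squeeze(1)
            loss = F.smooth_l1_loss(q, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # periodically copy policy weights into the target network
        if h % target_update == 0:
            target_net.load_state_dict(q_net.state_dict())
    return q_net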
V. EXPERIMENTS

A. Simulator

We constructed our own simulation environment based on the INTERACTION Dataset [26] and the OpenAI Gym toolkit. The road maps in the simulator were loaded from the INTERACTION Dataset map collections, which contain various real-world traffic scenarios recorded in many different countries. After the simulator finished constructing the road map, we specified an initial timestep from which vehicle data were loaded. The states of all vehicles other than the ego agent were loaded from the dataset at each time step. Since these vehicle data were all collected from real-world traffic, we could simulate relatively realistic traffic conditions. The ego agent in the simulator is our self-driving car equipped with H-CtRL, and its transition model is the bicycle model. When running experiments in the simulator, we input one low-level action, a_t = [acc_t, δ_t], and the simulator takes one step and outputs a reward, an observation, and a boolean indicating whether the episode has terminated, according to the bicycle model and the dataset.
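The sketch below illustrates, under our own assumptions about class and attribute names, how such a dataset-replay simulator with a gym-style step interface could be organized: the ego agent is propagated with a simplified kinematic bicycle model, while the other vehicles are replayed from the recorded log. The reward and termination logic is left as a placeholder.

import numpy as np


class ReplaySimulator:
    # Minimal sketch: the ego follows a kinematic bicycle model, while all
    # other vehicles are replayed from the recorded dataset.
    def __init__(self, obstacle_log, ego_init, dt=0.1, wheelbase=2.7):
        self.obstacle_log = obstacle_log          # timestep -> array of [x, y, v, theta] rows
        self.ego_init = np.asarray(ego_init, dtype=float)
        self.dt, self.L = dt, wheelbase
        self.reset()

    def reset(self):
        self.ego = self.ego_init.copy()
        self.t = 0
        return self._observation()

    def step(self, action):
        acc, delta = action                       # a_t = [acc_t, delta_t]
        x, y, v, theta = self.ego
        # simplified kinematic bicycle update for the ego agent
        x += v * np.cos(theta) * self.dt
        y += v * np.sin(theta) * self.dt
        theta += v / self.L * np.tan(delta) * self.dt
        v += acc * self.dt
        self.ego = np.array([x, y, v, theta])
        self.t += 1
        reward, done = self._reward_and_done()    # placeholder for progress/collision logic
        return self._observation(), reward, done, {}

    def _observation(self):
        obstacles = self.obstacle_log.get(self.t, np.zeros((0, 4)))
        return np.concatenate([self.ego] + [row for row in obstacles])

    def _reward_and_done(self):
        return 0.0, False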
B. Scenarios

We consider two road maps from the INTERACTION Dataset and design different driving tasks in each of them.
• TC_BGR_Intersection_VA (VA). It is a map of a busy and complex intersection, which makes it difficult for the ego agent to avoid collisions.
• DR_USA_Roundabout_SR (SR). The map is collected from a real-world roundabout. It is bigger than the previous intersection map and is thus difficult to navigate through.

Since there are four directions in both scenarios, we design similar tasks in each of them. The ego agent should navigate through the traffic safely to make unprotected left turns, unprotected right turns, and straight crossings. By unprotected left or right turns, we mean that one must yield to other vehicles when turning left on green lights or turning right on red lights, according to the traffic rules.
C. Low-level Controllers
Generally speaking, we can choose any mature non-learning-based planner to make the proposed hierarchical framework inclusive and powerful. In this paper, we consider n = 9 different Constrained Iterative Linear Quadratic Regulators (CILQR) [6] as the set of low-level controllers. The objective of CILQR is to find an optimal control sequence, namely an optimal action sequence a*, given an initial observation, that minimizes a cost function:

a^*, o^* = \arg\min_{a, o} \left\{ \phi(o_N) + \sum_{t=0}^{N-1} L(o_t, a_t) \right\}
\text{s.t.} \quad o_{t+1} = f(o_t, a_t), \quad t = 0, 1, \dots, N-1
\quad\quad\;\; g_t(o_t, a_t) < 0, \quad t = 0, 1, \dots, N-1
\quad\quad\;\; g_N(o_N) < 0   (3)

where N is the planning horizon, L(·) and φ(·) are the cost functions, f(·) is the transition model, and the g_t(·)'s are the safety and dynamics constraints.

The CILQR algorithm is capable of quickly solving the optimal control problem with non-convex constraints and non-linear system dynamics. It has been tested in on-road driving scenarios and is proved to be able to avoid obstacles successfully. However, the main drawback of CILQR is that it tends to be very aggressive if its reference speed is too high. For example, when the ego agent is following slow traffic in an urban scenario, it always tries to pass the cars in front whenever it finds a gap. This maneuver style may cause serious problems, because a sudden move is highly likely to result in collisions in such dense traffic with many occlusions. The dangerous decision mainly results from the high reference speed that the controller attempts to track: since the objective function penalizes deviation from the reference trajectory, the controller is willing to sacrifice safety to bypass the obstacles.

Therefore, we apply the high-level RL policy to choose the most suitable reference and the most appropriate setting for the low-level controllers. We design a fixed finite set of candidate reference speeds for the high-level RL to choose from. The set includes 9 possible discrete speeds, ranging from v_ref = 0 m/s to v_ref = 9 m/s. Each reference speed corresponds to a different CILQR controller with a different behavior pattern. For example, the one with reference speed v_ref = 0 m/s is the most conservative, because it yields to any obstacle in the environment, whereas the one with v_ref = 9 m/s is the most aggressive, for it tries its best to track the high reference speed and safety is compromised. The high-level RL aims to balance safety and passing time given the current observations, so as to make the ego agent capable of handling various complex scenarios.
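A minimal sketch of how such a bank of low-level controllers indexed by reference speed could be instantiated is shown below; the factory function, the CILQRPlanner name, and the candidate_speeds variable are assumptions, and the CILQR solver itself is treated as a black box.

def make_controller_bank(cilqr_factory, reference_speeds):
    # cilqr_factory: callable returning a CILQR planner configured to track a
    # given reference speed (the planner itself is treated as a black box here)
    return [cilqr_factory(v_ref=v) for v in reference_speeds]


# Hypothetical usage: nine controllers, one per candidate reference speed; the
# high-level RL action a_h is simply an index into this list.
# controllers = make_controller_bank(CILQRPlanner, candidate_speeds)
# low_level_action = controllers[a_h](observation)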
D. Baseline Methods

We compared the following policies in the experiments:
• CILQR_3. The third low-level CILQR controller, with a reference speed of 3 m/s. No high-level RL policy is used.
• CILQR_9. The ninth low-level CILQR controller, with a reference speed of 9 m/s. No high-level RL policy is used.
• Random. The hierarchical framework in which the high-level RL is replaced by a random sampler, which chooses low-level controllers randomly.
• H-CtRL. Our proposed hierarchical behavior planning framework.

It would be tedious to compare and list the results of all low-level CILQR controllers, so we only choose two representative low-level controllers here, CILQR_3 and CILQR_9.
Fig. 3: The planned trajectories in both maps: (a) VA: 1-Entering, (b) VA: 2-Waiting, (c) VA: 3-Exiting, (d) SR: 1-Entering, (e) SR: 2-Waiting, (f) SR: 3-Following. The ego car is always red and the obstacles are always purple. The future trajectories for the next 10 high-level timesteps h are plotted using a line with markers on it. A green line means H-CtRL chooses a low-level CILQR planner with a high reference speed, while a red line means H-CtRL chooses one with a low reference speed. The darker the red, the lower the speed.
E. Implementation Details
We trained and tested the RL in our proposed hierarchical framework in the two maps separately. At the beginning of each episode, we initialized the position of the ego vehicle at the edge of the road map, perturbed by Gaussian noise. The initial timestep at which obstacle data are loaded from the dataset was randomly sampled from a fixed range of the original timesteps in VA and in SR. Each timestep in the simulator as well as in the CILQR controllers lasts a fixed duration, whereas each timestep of the high-level RL is p = 10 times longer. The goal of each episode was also chosen randomly: either to turn left, turn right, or go straight. The observation fed into the high-level RL was the observation given by the simulator plus the goal of each episode. The RL policy was represented by a neural network with two fully connected hidden layers. According to the high-level decision, the same observation was then fed into the low-level controller to plan executable actions in the simulator for the ego agent.
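A minimal sketch of the two-hidden-layer Q-network described above is given below; the layer widths, activation function, goal encoding, and instantiation values are assumptions chosen for illustration.

import torch
import torch.nn as nn


class HighLevelQNetwork(nn.Module):
    # Maps the RL observation (simulator observation concatenated with the
    # episode goal encoding) to Q-values, one per low-level controller.
    def __init__(self, obs_dim, n_controllers, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),   # two fully connected hidden layers
            nn.Linear(hidden, n_controllers),
        )

    def forward(self, obs):
        return self.net(obs)


# Hypothetical instantiation (obs_dim: ego + 5 neighbors, 4 features each, plus a 3-way goal):
# q_net = HighLevelQNetwork(obs_dim=4 * 6 + 3, n_controllers=9)
# target_net = HighLevelQNetwork(obs_dim=4 * 6 + 3, n_controllers=9)
# optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)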
VI. RESULTS

A. Statistics

We ran each policy for 100 episodes in VA and SR separately, and compared the average episode return, the collision rate, and the completion rate within a 50 s time limit.

TABLE I: Average episode returns, collision rate, and completion rate within the 50 s time limit, based on 100 episodes in each map. Columns: Method, Aver. Epi. Return, Collision Rate, Completion Rate; rows grouped by Intersection (VA) and Roundabout (SR).

As shown in Table I, the proposed H-CtRL has the best performance in terms of the average episode reward, which is much higher than that of the CILQR baselines.
B. Visualization
To show the details of what happened in both VA and SR, we visualize one representative episode in each map.

Figures 3a-3c show an episode in VA, where the ego car was trying to make a left turn while the traffic light was green. According to traffic rules, it must yield to oncoming cars from across the road. As we can see in Figure 3a, the high-level RL first chose to drive at a high speed into the entrance of the intersection (as shown by the green markers). Then it decided to slow down in front of the intersection (as shown by the orange markers). When cars continued to come from the opposite direction, the high-level RL planned a perfect stop to stay put (as shown in Figure 3b). After all cars finished crossing the intersection, it decided to accelerate and exit the intersection as fast as possible (as shown in Figure 3c).

One episode in SR is visualized in Figures 3d-3f. In this episode, the ego agent was trying to go straight across the roundabout. When it was approaching the roundabout, the high-level RL chose to drive at a high speed as usual (implied by the green markers in Figure 3d). When it was about to enter the roundabout, the high-level RL decided to slow down (shown in Figure 3e) so that it could observe the surroundings and avoid collisions if necessary. After it exited the roundabout (shown in Figure 3f), it first accelerated toward the goal position. But there were several obstacles driving slowly in front, so it chose a low-level controller with a very low reference speed (v_ref = 2 m/s). The low-level CILQR controller then helped the ego agent to slow down and avoid collisions.

We also visualized the velocity and the task progress of H-CtRL for the SR episode described above, and compared them with those of CILQR_3 and CILQR_9 in Figure 4.

C. Failure Cases
When training and testing our proposed method, we discovered that the ego agent braked very hard if it observed obstacles in front or if the front vehicle slowed down. We believe that this is caused by the low-level CILQR controllers, which only consider the safety constraints without any optimization of comfort.
Fig. 4: A visualization of the velocity (vs. the position along the trajectory) and the task progress (vs. the timestep) of the ego agent driving in SR. The proposed H-CtRL is compared to CILQR_3 (v_ref = 3 m/s) and CILQR_9 (v_ref = 9 m/s). The black cross means that the episode was terminated early without reaching the goal position.

Generally speaking, our proposed hierarchical framework has the ability to adopt many low-level planners. Therefore, future work is to add more low-level planners into the framework and to train them all together to obtain a more powerful and generalizable behavior planner.

VII. CONCLUSION

In this paper, we proposed a general behavior planner for autonomous vehicles based on reinforcement learning and safe motion planners. By combining the power of low-level safe controllers with a high-level reinforcement learning coordinator, various complex urban traffic conditions can be handled with this general framework. We built a simulator that can reproduce scenarios according to real-world traffic datasets. The proposed algorithm was trained and tested on such real traffic data. Compared to the baseline methods, the proposed framework achieved both a high completion rate and a low collision rate, verifying its ability to handle various traffic scenarios with satisfying performance on both safety and efficiency.

REFERENCES

[1] M. McNaughton, C. Urmson, J. M. Dolan, and J.-W. Lee, "Motion planning for autonomous driving with a conformal spatiotemporal lattice," pp. 4889-4895, IEEE, 2011.
[2] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, "A survey of motion planning and control techniques for self-driving urban vehicles," IEEE Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 33-55, 2016.
[3] W. Zhan, C. Liu, C.-Y. Chan, and M. Tomizuka, "A non-conservatively defensive strategy for urban autonomous driving," pp. 459-464, 2016.
[4] C. Peng and M. Tomizuka, "Bayesian persuasive driving," pp. 723-729, 2019.
[5] D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan, "Planning for autonomous cars that leverage effects on human actions," in Robotics: Science and Systems, vol. 2, Ann Arbor, MI, USA, 2016.
[6] J. Chen, W. Zhan, and M. Tomizuka, "Constrained iterative LQR for on-road autonomous driving motion planning," pp. 1-7, 2017.
[7] D. A. Pomerleau, "ALVINN: An autonomous land vehicle in a neural network," in Advances in Neural Information Processing Systems, pp. 305-313, 1989.
[8] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, "End to end learning for self-driving cars," 2016.
[9] H. Xu, Y. Gao, F. Yu, and T. Darrell, "End-to-end learning of driving models from large-scale video datasets," 2017.
[10] L. Sun, C. Peng, W. Zhan, and M. Tomizuka, "A fast integrated planning and control framework for autonomous driving via imitation learning," in Dynamic Systems and Control Conference, vol. 51913, p. V003T37A012, American Society of Mechanical Engineers, 2018.
[11] J. Chen, B. Yuan, and M. Tomizuka, "Deep imitation learning for autonomous driving in generic urban scenarios with enhanced safety," 2019.
[12] A. Sallab, M. Abdou, E. Perot, and S. Yogamani, "Deep reinforcement learning framework for autonomous driving," Electronic Imaging, vol. 2017, pp. 70-76, Jan. 2017.
[13] K. Makantasis, M. Kontorinaki, and I. Nikolos, "A deep reinforcement learning driving policy for autonomous road vehicles," 2019.
[14] X. Pan, Y. You, Z. Wang, and C. Lu, "Virtual to real reinforcement learning for autonomous driving," 2017.
[15] M. Bouton, A. Nakhaei, K. Fujimura, and M. J. Kochenderfer, "Scalable decision making with sensor occlusions for autonomous driving," pp. 2076-2081, 2018.
[16] J. Chen, S. E. Li, and M. Tomizuka, "Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning," 2020.
[17] Z. Qiao, Z. Tyree, P. Mudalige, J. Schneider, and J. M. Dolan, "Hierarchical reinforcement learning method for autonomous vehicle behavior planning," 2019.
[18] M. S. Nosrati, E. A. Abolfathi, M. Elmahgiubi, P. Yadmellat, J. Luo, Y. Zhang, H. Yao, H. Zhang, and A. Jamil, "Towards practical hierarchical reinforcement learning for multi-lane autonomous driving," 2018.
[19] J. Wang, Y. Wang, D. Zhang, Y. Yang, and R. Xiong, "Learning hierarchical behavior and motion planning for autonomous driving," 2020.
[20] C.-J. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J. Kochenderfer, "Combining planning and deep reinforcement learning in tactical decision making for autonomous driving," IEEE Transactions on Intelligent Vehicles, vol. 5, no. 2, pp. 294-305, 2019.
[21] B. Thananjeyan, A. Balakrishna, U. Rosolia, F. Li, R. McAllister, J. E. Gonzalez, S. Levine, F. Borrelli, and K. Goldberg, "Safety augmented value estimation from demonstrations (SAVED): Safe deep model-based RL for sparse cost robotic tasks," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3612-3619, 2020.
[22] Z. Cao, E. Bıyık, W. Z. Wang, A. Raventos, A. Gaidon, G. Rosman, and D. Sadigh, "Reinforcement learning based control of imitative policies for near-accident driving," 2020.
[23] E. Leurent and J. Mercat, "Social attention for autonomous decision-making in dense traffic," 2019.
[24] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, "Kinematic and dynamic vehicle models for autonomous driving control design," pp. 1094-1099, 2015.
[25] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," 2015.
[26] W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Kümmerle, H. Königshof, C. Stiller, A. de La Fortelle, and M. Tomizuka, "INTERACTION Dataset: An INTERnational, Adversarial and Cooperative moTION Dataset in Interactive Driving Scenarios with Semantic Maps," arXiv:1910.03088 [cs, eess], 2019.