Hierarchical Reinforcement Learning Method for Autonomous Vehicle Behavior Planning
Zhiqian Qiao, Zachariah Tyree, Priyantha Mudalige, Jeff Schneider, John M. Dolan
Abstract— In this work, we propose a hierarchical reinforcement learning (HRL) structure which is capable of performing autonomous vehicle planning tasks in simulated environments with multiple sub-goals. In this hierarchical structure, the network is capable of 1) learning one task with multiple sub-goals simultaneously; 2) extracting attentions of states according to changing sub-goals during the learning process; 3) reusing the well-trained network of sub-goals for other similar tasks with the same sub-goals. The states are defined as processed observations which are transmitted from the perception system of the autonomous vehicle. A hybrid reward mechanism is designed for the different hierarchical layers in the proposed HRL structure. Compared to traditional RL methods, our algorithm is more sample-efficient, since its modular design allows reusing the policies of sub-goals across similar tasks. The results show that the proposed method converges to an optimal policy faster than traditional RL methods.
*This work is supported by General Motors. Zhiqian Qiao is a Ph.D. student in Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, USA ([email protected]). Zachariah Tyree and Priyantha Mudalige are with Research & Development, General Motors. Jeff Schneider and John M. Dolan are faculty of The Robotics Institute, Carnegie Mellon University.

I. INTRODUCTION

In a traditional autonomous vehicle (AV) system, after receiving the processed observations coming from the perception system, the ego vehicle performs behavior planning to deal with different scenarios or environments. At the behavior planning level, algorithms generate high-level decisions such as Go, Stop, Follow front vehicle, etc. A lower-level trajectory planning system then maps those high-level decisions to trajectories according to map and dynamic object information, and a lower-level controller outputs the detailed pedal or brake inputs that allow the vehicle to follow these trajectories.

At first glance, among algorithms generating behavior decisions, rule-based algorithms [1][2] appear to describe human-like decision processes well. However, estimating other vehicles' behaviors accurately and adjusting the corresponding decisions to account for changes in the environment is difficult if the decisions of the ego car are completely hand-engineered. This is because the environment can vary across many dimensions, all relevant to the task of driving, and the number of rules necessary for planning in this nuanced setting can be untenable.

An alternative method is reinforcement learning (RL) [3][4][5]. In recent works, RL has been used to solve particular problems by designing states, actions and reward functions in a simulated environment. For example, related applications within the autonomous vehicle domain include
learning an output controller for lane-following, merging into a roundabout, traversing an intersection and lane changing. However, low stability and large computational requirements make RL difficult to apply widely to more general tasks with multiple sub-goals. Applying RL to learn the behavior planning system from scratch not only increases the difficulty of adding or deleting sub-functions within the existing behavior planning system, but also makes problems harder to debug. A hierarchical structure which is structurally similar to the heuristic-based algorithms is more feasible and can save computation time by learning different functions or tasks separately.

Fig. 1: Heuristic-based structure vs. HRL-based structure

Reinforcement learning has proven capable of solving for the optimal policy, which can map various observations to corresponding actions in complicated scenarios. In traditional RL approaches it is often necessary to train a unique policy for each task the agent may be faced with: in order to solve a new task, the entire policy must be relearned, regardless of how similar the two tasks may be. Our goal in this work is to construct a single planning algorithm based on hierarchical deep reinforcement learning (HRL) which can accomplish behavior planning in an environment where the agent must pursue multiple sub-goals, and to do so in a way in which any sub-goal policies can be reused for subsequent tasks in a modular fashion (see Figure 1). The main contributions of the work are:
• A state attention model-based HRL structure.
• A hybrid reward function mechanism which can efficiently evaluate the performance among actions of different hierarchical levels.
• A hierarchical prioritized experience replay designed for HRL.

II. RELATED WORK
This section introduces previous work related to this paper, which can be categorized as follows: 1) papers that address reinforcement learning (RL) and hierarchical reinforcement learning algorithms; 2) papers that propose self-driving behavior planning algorithms.
A. Reinforcement Learning
A number of algorithms extending RL and HRL have been proposed. [6] proposed the idea of a meta controller, which defines a policy governing when the lower-level action policy is initialized and terminated. [7] introduced a hierarchical Q-learning method called MAXQ, proved the convergence of MAXQ mathematically, and showed experimentally that it can be computed faster than original Q-learning. [8] proposed an improved MAXQ method by combining the R-MAX [9] algorithm with MAXQ; it has both the efficient model-based exploration of R-MAX and the opportunities for abstraction provided by the MAXQ framework. [10] transferred the idea of the hierarchical model into parameterized action representations, using a DRL algorithm to train high-level parameterized actions and low-level actions together in order to obtain more stable results than by learning the continuous actions directly.
B. Behavior Planning of Autonomous Vehicles
Previous work has applied heuristic-based and learning-based algorithms to the behavior planning of autonomous vehicles in different scenarios. For example, [11] proposed a slot-based approach to check whether a situation is safe for merging into a lane or crossing an intersection with moving traffic. This method is based on information about slots available for merging behavior, which may include the size of the slot in the target lane and the distance between the ego vehicle and the front vehicle. Time-to-collision (TTC) [2] is a heuristic-based algorithm which has normally been applied in intersection scenarios as a baseline algorithm. Fuzzy logic is also a very popular heuristic-based approach to model decision making and behavior planning for autonomous vehicles. In contrast to vanilla heuristic-based algorithms, fuzzy logic allows adding the uncertainty of the results into the decision process. [12] used a fuzzy logic method to control the traffic flow in urban intersection scenarios, where the vehicles have access to environment information via vehicle-to-vehicle (V2V) communication. However, the V2V system has only been applied on a small number of public roads, and few vehicle manufacturers have added V2V functions to their vehicles. In [13], the researchers developed a fuzzy logic method for steering control in roundabout scenarios.

Heuristic-based algorithms require much human effort to design the various rules needed to deal with different scenarios in urban environments. As a result, learning-based algorithms, especially reinforcement learning, have been applied to transfer multiple rules into a mapping function or a single neural network. [14] formulated the decision-making problem for autonomous vehicles under uncertain environments as a POMDP and trained a Bayesian Network representation to deal with a T-shaped intersection merging problem. [15] modeled the interaction between autonomous vehicles and human drivers by the method of Inverse Reinforcement Learning (IRL) [16] in a simulated environment; the work simulated autonomous vehicles to motivate human drivers' reactions and acquired reward functions in order to plan better decisions while controlling autonomous vehicles. [17] dealt with the traversing problem via Deep Q-Networks combined with a long-term memory component, training a state-action function Q that allows an autonomous vehicle to traverse intersections with moving traffic. [18] used a Deep Recurrent Q-Network (DRQN) with states from a bird's-eye view of the intersection to learn a policy for traversing the intersection. [19] proposed an efficient strategy to navigate through intersections with occlusion using a DRL method; their results showed better performance compared to some heuristic methods.

In our work, the main idea is to combine the heuristic-based decision-making structure with HRL-based approaches in order to integrate the advantages of both methods. We built the HRL structure according to the heuristic method (see Figure 1), so that individual functions in the system are easier to validate than in a whole neural-network black box.

III. PRELIMINARIES
In this section, the preliminary background of the problem is described. The fundamental algorithms, including Deep Q-Learning [3], Double Deep Q-Learning [20] and Hierarchical Deep Reinforcement Learning (HRL) [6], are introduced.
1) Deep Q-learning:
Since being proposed, Deep Q-Networks and Double Deep Q-Networks have been widely applied to reinforcement learning problems. In Q-learning, an action-value function $Q^{\pi}(s,a)$ is learned to obtain the optimal policy $\pi$ which maximizes the action-value function $Q^{*}(s,a)$. Hence, a parameterized action-value function $Q(s,a\,|\,\theta)$ is used with a discount factor $\gamma$, as in Equation 1.

$$Q^{*}(s,a) = \max_{\theta} Q(s,a\,|\,\theta) = r + \gamma \max_{a'} Q(s',a'\,|\,\theta) \quad (1)$$
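As a minimal illustration of the fixed point in Equation 1, the following tabular Q-learning update moves the estimate toward the bootstrapped target; the learning rate and discount are illustrative values, not taken from the paper.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q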
2) Double Deep Q-learning:
For the setting of Deep Q-learning, the network parameter $\theta$ is optimized by minimizing the loss function $L(\theta)$, defined as the difference between the predicted action-value $Q$ and the target action-value $Y^{Q}$. $\theta$ is updated with a learning rate $\alpha$, as shown in Equation 2.

$$Y^{Q}_{t} = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a\,|\,\theta_{t})$$
$$L(\theta) = \left( Y^{Q}_{t} - Q(S_{t}, A_{t}\,|\,\theta_{t}) \right)^{2}$$
$$\theta_{t+1} = \theta_{t} + \alpha \frac{\partial L(\theta)}{\partial \theta} \quad (2)$$

For the Double Deep Q-learning setting, the target action-value $Y^{Q}$ is revised according to another target Q-network $Q'$ with parameters $\theta'$:

$$Y^{Q}_{t} = R_{t+1} + \gamma\, Q\!\left(S_{t+1}, \arg\max_{a} Q(S_{t+1}, a\,|\,\theta_{t}) \,\middle|\, \theta'_{t}\right) \quad (3)$$

During the training procedure, techniques such as the $\varepsilon$-greedy approach [21] and prioritized experience replay [22] can be applied to improve training performance.
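To make the update concrete, here is a minimal PyTorch sketch of the Double DQN target and loss of Equations 2 and 3; the network modules, batch layout and discount value are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def double_dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    # Q(S_t, A_t | theta): value of the action actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Select the next action with the online network ...
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        # ... but evaluate it with the target network Q' (Eq. 3).
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)
        y = r + gamma * (1.0 - done) * q_next  # target Y^Q_t
    return F.mse_loss(q_sa, y)                 # L(theta) of Eq. (2)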
3) Hierarchical Reinforcement Learning:
For the HRL model [6] with sequential sub-goals, a meta controller $Q_{1}$ generates the sub-goal $g$ for the following steps, and a controller $Q_{2}$ outputs the actions based on this sub-goal until the next sub-goal is generated by the meta controller.

$$Y^{Q_{1}}_{t} = \sum_{t'=t+1}^{t+1+N} R_{t'} + \gamma \max_{g} Q_{1}(S_{t+1+N}, g\,|\,\theta_{1,t})$$
$$Y^{Q_{2}}_{t} = R_{t+1} + \gamma \max_{a} Q_{2}(S_{t+1}, a\,|\,\theta_{2}, g) \quad (4)$$
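A small sketch of the two targets in Equation 4, assuming a meta controller network mapping states to option values and a controller network mapping a (state, sub-goal) pair to action values; the argument layout is assumed for illustration.

import torch

def hrl_targets(meta_q, ctrl_q, ext_rewards, r_intrinsic, s_final, s_next, g,
                gamma=0.99):
    # Meta controller: N-step extrinsic return plus bootstrapped option value.
    y_meta = sum(ext_rewards) + gamma * meta_q(s_final).max(dim=1).values
    # Controller: one-step intrinsic reward, bootstrapped under sub-goal g.
    y_ctrl = r_intrinsic + gamma * ctrl_q(s_next, g).max(dim=1).values
    return y_meta, y_ctrl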
IV. METHODOLOGY

In this section we present our proposed model, which is a hierarchical RL network with an explicit attention model, a hybrid reward mechanism and a hierarchical prioritized experience replay training schema. We will refer to this model as Hybrid HRL throughout the paper.
A. Hierarchical RL with Attention
Hierarchical structures based on RL can be applied to learn a task with multiple sub-goals. For a hierarchical structure with two levels, an option set $O$ is assigned to the first level, whose objective is to select among sub-goals. The weight $\theta^{o}_{t}$ is updated according to Equation 5.

$$O^{*}_{t+1} = \arg\max_{o} Q^{o}(S_{t+1}, o\,|\,\theta^{o}_{t})$$
$$Y^{Q^{o}}_{t} = R^{o}_{t+1} + \gamma\, Q^{o}(S_{t+1}, O^{*}_{t+1}\,|\,\theta^{o'}_{t})$$
$$L(\theta^{o}) = \left( Y^{Q^{o}}_{t} - Q^{o}(S_{t}, O_{t}\,|\,\theta^{o}_{t}) \right)^{2} \quad (5)$$

After selecting an option $o$, the corresponding action set $A^{o}$ represents the action candidates that can be executed on the second level of the hierarchical structure with respect to the selected option $o$. Some previous work proposed the hierarchical Markov Decision Process (MDP), which either shares the state set $S$ among different hierarchical levels during the MDP, or designs different states for changing sub-goals and applies initial and terminating condition sets to transfer from one state set to another.

In many situations, the portion of the state set and the amount of abstraction needed to choose actions at different levels of this hierarchy can vary widely. In order to avoid designing a myriad of state representations corresponding to each hierarchy level and sub-goal, we share one state set $S$ for the whole hierarchical structure. Meanwhile, an attention model is applied to define the importance of each state element $I(s,o)$ with respect to each hierarchical level and sub-goal, and these weights are then used to reconstruct the state $s^{I}$. The weight $\theta^{a}_{t}$ is updated according to Equation 6.

$$A^{*}_{t+1} = \arg\max_{a} Q^{a}(S^{I}_{t+1}, O^{*}_{t+1}, a\,|\,\theta^{a}_{t})$$
$$Y^{Q^{a}}_{t} = R^{a}_{t+1} + \gamma\, Q^{a}(S^{I}_{t+1}, O^{*}_{t+1}, A^{*}_{t+1}\,|\,\theta^{a'}_{t})$$
$$L(\theta^{a}) = \left( Y^{Q^{a}}_{t} - Q^{a}(S^{I}_{t}, O_{t}, A_{t}\,|\,\theta^{a}_{t}) \right)^{2} \quad (6)$$

Fig. 2: Hierarchical RL Option and Action Q-Network. FC stands for a fully connected layer. Within all the FC layers, linear activation functions are used to generate the last layers in both the Option-Value and Action-Value networks; for the rest of the layers, ReLU activation functions are applied.

When implementing the attention-based HRL, we construct the option network and the action network (Figure 2), which include the attention mechanism as a softmax layer in the action-value network $Q^{a}$.
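As a rough sketch of the architecture in Figure 2, the following PyTorch module pairs an option-value head over the raw state with an action-value head over an attention-reweighted state; the hidden sizes and the one-hot option encoding are assumptions, not values from the paper.

import torch
import torch.nn as nn

class OptionActionNet(nn.Module):
    def __init__(self, state_dim, n_options, n_actions, hidden=64):
        super().__init__()
        self.option_q = nn.Sequential(              # Q^o(s, o | theta^o)
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_options))           # linear output layer
        self.attention = nn.Sequential(             # I(s, o): softmax weights
            nn.Linear(state_dim + n_options, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim), nn.Softmax(dim=-1))
        self.action_q = nn.Sequential(              # Q^a(s^I, o, a | theta^a)
            nn.Linear(state_dim + n_options, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))           # linear output layer

    def forward(self, s, o_onehot):
        q_o = self.option_q(s)
        w = self.attention(torch.cat([s, o_onehot], dim=-1))
        s_attended = w * s                          # s^I = I(s, o) * s
        q_a = self.action_q(torch.cat([s_attended, o_onehot], dim=-1))
        return q_o, q_a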
B. Hybrid Reward Mechanism

For a sequential sub-goal HRL model [6], the reward function is designed separately for the sub-goals and the main task: the extrinsic meta reward is responsible for the option-level task, while the intrinsic reward is responsible for the action-level sub-goals. For HRL with parameterized actions [23], an integrated reward is designed to evaluate both the option level and the action level together.

In our work, instead of generating one reward function which evaluates the final outputs coming from both options and actions in one step together, we designed a reward mechanism which can evaluate the goodness of the option and the action separately during the learning procedure. As a result, a hybrid reward mechanism is introduced so that: 1) the algorithm gets the information of which reward function should be triggered to receive rewards or penalties; 2) a positive reward which benefits both the option reward and the action reward occurs if and only if the whole task and the sub-goals in the hierarchical structure have all been completed. Figure 3 illustrates the hybrid reward mechanism.

Fig. 3: Hybrid Reward Mechanism
C. Hierarchical Prioritized Experience Replay
In [22] the authors propose a framework for replaying experience more efficiently during DQN training, so that stored transitions $\{s, a, r, s'\}$ with higher TD-error in the previous training iteration have a higher probability of being selected in the mini-batch for the current iteration. However, in the HRL structure, the rewards received from the whole system not only rely on the current level, but are also affected by the interactions among different hierarchical levels.

For the transitions $\{s, o, a, r^{o}, r^{a}, s'\}$ stored during the HRL process, the central observation is that if the output $o$ of the option-value network is chosen wrongly, due to high error between the predicted option-value $Q^{o}$ and the targeted option-value $r^{o} + \gamma Q^{o}(s', o')$, then the success or failure of the corresponding action-value network is inconsequential for the current transition. As a result, we propose a hierarchical prioritized experience replay (HPER) in which the priorities at the option level are based on the error directly, and the priorities at the lower level are based on the difference between the errors coming from the two levels: higher priority is assigned to the action-level experience replay if the corresponding option level has lower priority. Following Equations 5 and 6, the transition priorities for the option and action levels are given in Equation 7.

$$p^{o} = \left| Y^{Q^{o}} - Q^{o}(S, O\,|\,\theta^{o}) \right|$$
$$p^{a} = \left| Y^{Q^{a}}_{t} - Q^{a}(S^{I}_{t}, O_{t}, A_{t}\,|\,\theta^{a}_{t}) \right| - p^{o} \quad (7)$$

Based on the aforementioned approaches, Hybrid HRL is shown in Algorithms 1, 2 and 3.
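A minimal sketch of the priority computation of Equation 7, including the positivity shift of Algorithm 3 (step 6); the NumPy array layout and the small epsilon floor are assumptions for illustration.

import numpy as np

def hierarchical_priorities(y_qo, q_o, y_qa, q_a, eps=1e-6):
    p_o = np.abs(y_qo - q_o)          # option-level TD error (Eq. 7, first line)
    p_a = np.abs(y_qa - q_a) - p_o    # action error discounted by option error
    p_a = p_a - p_a.min() + eps       # shift to keep p_a > 0 (Algorithm 3, step 6)
    return p_o, p_a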
V. EXPERIMENT

In this section, we apply the proposed algorithm to the behavior planning of a self-driving car and make comparisons with competing methods.
Algorithm 1 Hierarchical RL with Attention
1: procedure HRL-AHR()
2:   Initialize option and action networks Q^o, Q^a with weights θ^o, θ^a and the target option and action networks Q^{o'}, Q^{a'} with weights θ^{o'}, θ^{a'}.
3:   Construct an empty replay buffer B with max memory length l_B.
4:   for e ← 1 to E training epochs do
5:     Get initial state s.
6:     while s is not the terminal state do
7:       Select option O_t = argmax_o Q^o(S_t, o) based on ε-greedy. O_t is the selected sub-goal that the lower-level action will execute.
8:       Apply the attention model to state S_t based on the selected option O_t: S^I_t = I(S_t, O_t).
9:       Select action A_t = argmax_a Q^a(S^I_t, O_t, a) based on ε-greedy.
10:      Execute A_t in simulation to get S_{t+1}.
11:      R^o_{t+1}, R^a_{t+1} = HybridReward(S_t, O_t, A_t).
12:      Store transition T into B: T = {S_t, O_t, A_t, R^o_{t+1}, R^a_{t+1}, S_{t+1}}.
13:      Train from the buffer: ReplayBuffer(e).
14:    if e mod n == 0 then
15:      Test without action exploration, using the weights from training, for n epochs and save the average rewards.
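For concreteness, a Python sketch of the two-level ε-greedy selection in steps 7-9 of Algorithm 1, assuming the two-head module sketched in Section IV-A; the zero option vector used when querying option values is a convenience of this sketch, not part of the paper.

import numpy as np
import torch

def select_option_action(net, s, n_options, n_actions, eps=0.1, rng=np.random):
    s_t = torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        # Option level: Q^o depends only on the raw state in this sketch,
        # so a zero option vector is passed as a placeholder.
        q_o, _ = net(s_t, torch.zeros(1, n_options))
        o = int(q_o.argmax()) if rng.rand() > eps else int(rng.randint(n_options))
        # Action level: condition the attended state on the chosen option.
        o_onehot = torch.eye(n_options)[o].unsqueeze(0)
        _, q_a = net(s_t, o_onehot)
        a = int(q_a.argmax()) if rng.rand() > eps else int(rng.randint(n_actions))
    return o, a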
Algorithm 2 Hybrid Reward Mechanism
1: procedure HybridReward()
2:   Penalize R^o_t and R^a_t with the regular step penalties (e.g., time penalty).
3:   for δ in sub-goal candidates do
4:     if δ fails then
5:       if option o_t == δ then
6:         Penalize the option reward R^o_t
7:       else
8:         Penalize the action reward R^a_t
9:   if the task succeeds (all δ succeed) then
10:    Reward both R^o_t and R^a_t.
Algorithm 3 Hierarchical Prioritized Experience Replay
1: procedure ReplayBuffer(e)
2:   Given: mini-batch size k, training size N, exponents α and β.
3:   Sample k transitions for the option and action mini-batches:
       MB^g ~ P^g = (p^g)^α / Σ_i (p^g_i)^α,  g ∈ {o, a}
4:   Compute importance-sampling weights:
       w^g = (N · P^g)^{-β} / max_i w^g_i,  g ∈ {o, a}
5:   Update transition priorities:
       p^o = | Y^{Q^o}_t − Q^o(S_t, O_t | θ^o_t) |
       p^a = | Y^{Q^a}_t − Q^a(S^I_t, O_t, A_t | θ^a_t) | − p^o
6:   Adjust the action-level priorities to be greater than 0: p^a = p^a − min(p^a).
7:   Perform gradient descent to update θ^g_t = θ^g_t + α ∂L(θ^g)/∂θ^g according to the sample weights w^g, g ∈ {o, a}.
8:   Update target network weights: θ^{g'} = θ^g, g ∈ {o, a}.
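A sketch of the sampling and importance-weight steps (steps 3 and 4 of Algorithm 3) for one level g ∈ {o, a}; the α and β defaults are common choices from [22], assumed here rather than taken from the paper.

import numpy as np

def sample_prioritized(p, k, n_total, alpha=0.6, beta=0.4, rng=np.random):
    P = p ** alpha
    P = P / P.sum()                        # P^g of step 3
    idx = rng.choice(len(p), size=k, p=P)  # mini-batch indices MB^g
    w = (n_total * P[idx]) ** (-beta)      # (N * P^g)^{-beta}
    w = w / w.max()                        # normalize by the largest weight
    return idx, w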
A. Scenario

We tested our algorithm in MSC's VIRES VTD, a complete simulation tool-chain for driving applications [24]. We designed a task in which an autonomous vehicle (green box with A) intends to stop at the stop-line behind a random number of front vehicles (pink boxes with F) which have random initial positions and behavior profiles (see Figure 4). The two sub-goals in this scenario are STOP AT STOP-LINE (SSL) and FOLLOW FRONT VEHICLE (FFV).
TABLE I: Results comparison among different behavior policies

Policy     | Option Reward r^o | Action Reward r^a | Steps | Unsmoothness | Unsafe | Collision | Not Stop | Timeout | Success
Rule 1     | -36.82 | -9.11 | 112 | 0.38 |  8.05 | 18% | 82% | 0% |  0%
Rule 2     | -28.69 |  0.33 |  53 | 0.32 |  6.41 | 89% |  0% | 0% | 11%
Rule 3     |  26.42 | 13.62 | 128 | 0.54 | 13.39 | 31% |  0% | 0% | 69%
Rule 4     |  40.02 | 17.20 | 149 | 0.58 | 16.50 | 14% |  0% | 0% | 86%
Hybrid HRL |  43.52 | 28.87 | 178 | 5.32 |  1.23 |  0% |  7% | 0% | 93%

(Unsmoothness and Unsafe are step penalties; Collision, Not Stop, Timeout and Success are performance rates.)
Fig. 4: Autonomous vehicle (green box with A) approaching a stop-sign intersection

B. Transitions

1) State:
The state used to formulate the hierarchical deep reinforcement learning includes the information of the ego car, which is useful for both sub-goals, and the related information that is needed for each sub-goal.

$$s = \left[ v_e,\, a_e,\, j_e,\, d_f,\, v_f,\, a_f,\, d_{fc},\, \frac{d_{fc}}{d_{fs}},\, d_d,\, d_{dc},\, \frac{d_{dc}}{d_{ds}} \right] \quad (8)$$

Equation 8 describes our state space, where $v_e$, $a_e$ and $j_e$ are respectively the velocity, acceleration and jerk of the ego car, while $d_f$ and $d_d$ denote the distance from the ego car to the nearest front vehicle and to the stop-line, respectively. A safety distance parameter is introduced as a nominal distance behind the target object, which can improve safety under the different sub-goals.

$$d_{fs} = \max\!\left( \frac{v_e - v_f}{a_{\max}},\, d_0 \right), \quad d_{fc} = d_f - d_{fs}$$
$$d_{ds} = \frac{v_e}{a_{\max}}, \quad d_{dc} = d_d - d_{ds} \quad (9)$$

Here $a_{\max}$ and $d_0$ denote the ego car's maximum deceleration and minimum allowable distance to the front vehicle, respectively, and $d_{fc}$ and $d_{dc}$ are the distances that can be chased by the ego car (the distance to the target minus the safety distance for that target). The initial positions of the front vehicles and the ego car are randomly selected.
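A small sketch assembling the state of Equation 8 with the safety-distance terms of Equation 9 as printed; the small positive floor on d_ds is an added assumption to avoid division by zero at standstill.

import numpy as np

def build_state(v_e, a_e, j_e, d_f, v_f, a_f, d_d, a_max, d_0):
    # Safety distances of Eq. (9), taken as printed in the text.
    d_fs = max((v_e - v_f) / a_max, d_0)  # safety distance behind front vehicle
    d_fc = d_f - d_fs                     # chaseable distance to front vehicle
    d_ds = max(v_e / a_max, 1e-3)         # stop-line safety distance; the floor
                                          # (an assumption) avoids divide-by-zero
    d_dc = d_d - d_ds                     # chaseable distance to stop-line
    return np.array([v_e, a_e, j_e, d_f, v_f, a_f,
                     d_fc, d_fc / d_fs, d_d, d_dc, d_dc / d_ds])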
2) Option and Action: The option network in this scenario outputs the selected sub-goal, SSL or FFV. Then, according to the option result, the action network generates the throttle or brake choices.
3) Reward Functions:
Assume that for one step, the selected option is denoted as $o$, $o \in \{d, f\}$. The reward function is given by:

For each step:
• Time penalty: $-\sigma_1$.
• Unsmoothness penalty if the jerk is too large: $-\mathbb{I}_{j_e > j_{th}}\, \sigma_2$.
• Unsafe penalty: $-\mathbb{I}_{d_{dc} < 0} \exp\!\left(-\frac{d_{dc}}{d_{ds}}\right) - \mathbb{I}_{d_{fc} < 0} \exp\!\left(-\frac{d_{fc}}{d_{fs}}\right)$.

For the termination conditions:
• Collision penalty: $-\mathbb{I}_{d_f = 0}\, \sigma_3$.
• Not stopping at the stop-line penalty: $-\mathbb{I}_{d_d = 0}\, v_e$.
• Timeout penalty: $-\mathbb{I}_{timeout}\, d_d$.
• Success reward: $\mathbb{I}_{d_d = 0,\, v_e = 0}\, \sigma_4$,

where the $\sigma_k$ are constants, $j_{th}$ denotes the jerk threshold, and the $\mathbb{I}_c$ are indicator functions: $\mathbb{I}_c = 1$ if condition $c$ is satisfied, otherwise $\mathbb{I}_c = 0$. With the selected option $o$, $o \in \{d, f\}$, and the unselected option $o^{-}$, $o^{-} \in \{f, d\}$:

$$sr = -\sigma_1 - \mathbb{I}_{timeout}\, d_d + \mathbb{I}_{d_d = 0,\, v_e = 0}\, \sigma_4$$
$$r_{option} = sr - \mathbb{I}_{d_{o^{-}c} < 0} \exp\!\left(-\frac{d_{o^{-}c}}{d_{o^{-}s}}\right) - \mathbb{I}_{d_{o^{-}} = 0}\, v_e$$
$$r_{action} = sr - \mathbb{I}_{j_e > j_{th}}\, \sigma_2 - \mathbb{I}_{d_{oc} < 0} \exp\!\left(-\frac{d_{oc}}{d_{os}}\right) - \mathbb{I}_{d_o = 0}\, \sigma_3$$
$$r_{task} = sr - \mathbb{I}_{d_{dc} < 0} \exp\!\left(-\frac{d_{dc}}{d_{ds}}\right) - \mathbb{I}_{d_{fc} < 0} \exp\!\left(-\frac{d_{fc}}{d_{fs}}\right) - \mathbb{I}_{j_e > j_{th}}\, \sigma_2 - \mathbb{I}_{d_f = 0}\, \sigma_3 - \mathbb{I}_{d_d = 0}\, v_e \quad (10)$$

where $sr$ represents the portion of the reward common to $r_{option}$, $r_{action}$ and $r_{task}$.

For comparison, we also formulate the problem without a hierarchical model via Double DQN; $r_{task}$ then denotes the reward for achieving the task in this flattened action space.
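The following sketch computes the shared term sr and the option- and action-level rewards of Equation 10; the σ_k constants and the jerk threshold j_max are placeholders, and the dictionary-based state layout is assumed for illustration.

import math

def hybrid_rewards(s, option, timeout, sigma=(0.1, 1.0, 10.0, 10.0), j_max=3.0):
    # Sketch of Eq. (10); sigma and j_max are placeholder constants.
    other = 'f' if option == 'd' else 'd'
    success = (s['d_d'] == 0.0 and s['v_e'] == 0.0)
    # sr: shared portion (time penalty, timeout penalty, success reward).
    sr = -sigma[0] - (s['d_d'] if timeout else 0.0) + (sigma[3] if success else 0.0)

    def unsafe(g):
        # -I[d_gc < 0] * exp(-d_gc / d_gs) for sub-goal g in {'d', 'f'}.
        return math.exp(-s['d_%sc' % g] / s['d_%ss' % g]) if s['d_%sc' % g] < 0 else 0.0

    # Option level: penalized for failures on the *unselected* sub-goal o^-.
    r_option = sr - unsafe(other) - (s['v_e'] if s['d_%s' % other] == 0.0 else 0.0)
    # Action level: penalized for jerk and failures on the selected sub-goal o.
    r_action = sr - (sigma[1] if s['j_e'] > j_max else 0.0) \
               - unsafe(option) - (sigma[2] if s['d_%s' % option] == 0.0 else 0.0)
    return r_option, r_action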
C. Results

We compare the proposed algorithm with four rule-based algorithms and the traditional RL algorithms mentioned above. Table I shows quantitative results for the average performance of each algorithm over 100 test cases. The competing methods are:
• Rule 1: always follow the option Follow Front Vehicle (FFV).
• Rule 2: always follow the option Stop at Stop-line (SSL).
• Rule 3: if d_d > (d_f + car length), select FFV; otherwise SSL.
• Rule 4: if d_f > d_fc, select FFV; otherwise SSL.

Table II lists the different HRL-based algorithms whose results are shown in Figure 5. Figure 5 compares the Hybrid HRL method with these different HRL setups. The results show that the hybrid reward mechanism performs better with the help of the hierarchical PER approach.
Fig. 5: Training results

TABLE II: Different HRL-based policies

Policy     | Hybrid Reward | Hierarchical PER | Attention Model
HRL        | × | × | ×
HRL        | √ | × | ×
HRL        | √ | √ | ×
HRL        | √ | × | √
Hybrid HRL | √ | √ | √

Figure 6 depicts a typical case of the relative speed and position of the ego vehicle with respect to the nearest front vehicle as they both approach the stop-line. In the bottom graph we see that the ego vehicle tends to close the distance to the front vehicle until a certain threshold (about 5 meters) before lowering its speed relative to the front vehicle to allow a certain buffer between them. In the top graph we see that during this time the front vehicle begins to slow rapidly for the stop-line at around 25 meters out before coming to a stop. Simultaneously, the ego vehicle opts to focus on stopping for the stop-line until it is within a certain threshold of the front vehicle, at which point it attends to the front vehicle instead. Finally, after a pause, the front vehicle accelerates through the stop-line, and at this point the ego vehicle immediately begins focusing on the stop sign once again, as desired.

Fig. 6: Velocities of ego car and front vehicle

Figure 7 shows the results extracted from the attention softmax layer; only the two state elements with the highest attention values are visualized. The upper sub-figure shows the relationship between the distance to the nearest front vehicle (y-axis) and the distance to the stop-line (x-axis). The lower sub-figure shows the attention values. When the ego car is approaching the front vehicle, the attention is mainly focused on d_fc/d_fs. When the front vehicle leaves without stopping at the stop-line, the ego car transfers more and more attention to d_dc/d_ds during the process of approaching the stop-line.

Fig. 7: Attention values extracted from the attention layer in the model. dr and fr are d_dc/d_ds and d_fc/d_fs in the introduced state, respectively.

Fig. 8: Performance rate when training only to follow front vehicles. Results from training include random actions taken according to exploration. Results from testing show average performance over 200 test cases based on the trained network after each training epoch.

Fig. 9: Performance rate when training only to choose the option between FFV and SSL based on the designed rule-based or trained action-level policies. Results from testing show average performance over 100 test cases based on the trained network after each training epoch.
For the scenario of approaching the intersection with front vehicles, one approach is to manually design all the rules. Another possibility is to design a rule-based policy for stopping at the stop-line, which is relatively easy to model, and then train a DDQN model (see Figure 8 for the training process) as the policy for following front vehicles. Based on these two action-level models, we train another DDQN model (see Figure 9 for the training process) as the policy governing which option is needed for approaching the stop-line with front vehicles. During the training process, after every training epoch, the simulation tests 500 epochs without action exploration based on the trained network.

Fig. 10: Performance rate of the Hybrid HRL training process. Results from testing show average performance over 500 test cases based on the trained network after each training epoch.

By applying the proposed Hybrid HRL, all the option-level and action-level policies can be trained together (see Figure 10 for the training process), and the trained policy can be separated if the target task only needs to achieve one of the sub-goals. For example, the action-value network of
Following Front Vehicle can be used alone, with the corresponding option input to the network; the ego car can then follow the front vehicle without stopping at the stop-line.

VI. CONCLUSIONS

In this paper, we proposed three extensions to hierarchical deep reinforcement learning aimed at improving convergence speed, sample efficiency and scalability over traditional RL approaches. Preliminary results suggest our algorithm is a promising candidate for future research, as it is able to outperform a suite of hand-engineered rules on a simulated autonomous driving task in which the agent must pursue multiple sub-goals in order to succeed.

ACKNOWLEDGMENTS
The authors would like to thank S. Bilal Mehdi of General Motors Research & Development for his assistance in implementing the VTD simulation environment used in our experiments.

REFERENCES
[1] S. Jin, Z.-y. Huang, P.-f. Tao, and D.-h. Wang, "Car-following theory of steady-state traffic flow using time-to-collision," Journal of Zhejiang University-SCIENCE A, vol. 12, no. 8, pp. 645–654, 2011.
[2] D. N. Lee, "A theory of visual control of braking based on information about time-to-collision," Perception, vol. 5, no. 4, pp. 437–459, 1976.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[4] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[5] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[6] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation," in Advances in Neural Information Processing Systems, 2016, pp. 3675–3683.
[7] T. G. Dietterich, "The MAXQ method for hierarchical reinforcement learning," in ICML, vol. 98, 1998, pp. 118–126.
[8] N. K. Jong and P. Stone, "Hierarchical model-based reinforcement learning: R-max + MAXQ," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 432–439.
[9] R. I. Brafman and M. Tennenholtz, "R-max: a general polynomial time algorithm for near-optimal reinforcement learning," Journal of Machine Learning Research, vol. 3, no. Oct, pp. 213–231, 2002.
[10] W. Masson, P. Ranchod, and G. Konidaris, "Reinforcement learning with parameterized actions," in AAAI, 2016, pp. 1934–1940.
[11] C. R. Baker and J. M. Dolan, "Traffic interaction in the urban challenge: Putting Boss on its best behavior," IEEE, 2008, pp. 1752–1758.
[12] V. Milanés, J. Pérez, E. Onieva, and C. González, "Controller for urban intersections based on wireless communications and fuzzy logic," IEEE Transactions on Intelligent Transportation Systems, vol. 11, no. 1, pp. 243–248, 2009.
[13] J. P. Rastelli and M. S. Peñas, "Fuzzy logic steering control of autonomous vehicles inside roundabouts," Applied Soft Computing, vol. 35, pp. 662–669, 2015.
[14] S. Brechtel, T. Gindele, and R. Dillmann, "Probabilistic decision-making under uncertainty for autonomous driving using continuous POMDPs," IEEE, 2014, pp. 392–399.
[15] D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan, "Planning for autonomous cars that leverage effects on human actions," in Robotics: Science and Systems, vol. 2, Ann Arbor, MI, USA, 2016.
[16] A. Y. Ng, S. J. Russell et al., "Algorithms for inverse reinforcement learning," in ICML, vol. 1, 2000, p. 2.
[17] D. Isele, A. Cosgun, and K. Fujimura, "Analyzing knowledge transfer in deep Q-networks for autonomously handling multiple intersections," arXiv preprint arXiv:1705.01197, 2017.
[18] D. Isele, A. Cosgun, K. Subramanian, and K. Fujimura, "Navigating intersections with autonomous vehicles using deep reinforcement learning," arXiv preprint arXiv:1705.01196, 2017.
[19] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura, "Navigating occluded intersections with autonomous vehicles using deep reinforcement learning," IEEE, 2018, pp. 2034–2039.
[20] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[22] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," arXiv preprint arXiv:1511.05952, 2015.
[23] M. Hausknecht and P. Stone, "Deep reinforcement learning in parameterized action space," arXiv preprint arXiv:1511.04143, 2015.