Experience-Based Heuristic Search: Robust Motion Planning with Deep Q-Learning
Julian Bernhard, Robert Gieselmann, Klemens Esterle, Alois Knoll
©2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract — Interaction-aware planning for autonomous driving requires an exploration of a combinatorial solution space when using conventional search- or optimization-based motion planners. With Deep Reinforcement Learning, optimal driving strategies for such problems can be derived also for higher-dimensional problems. However, these methods guarantee optimality of the resulting policy only in a statistical sense, which impedes their usage in safety-critical systems, such as autonomous vehicles. Thus, we propose the Experience-Based Heuristic Search algorithm, which overcomes the statistical failure rate of a Deep-reinforcement-learning-based planner and still benefits computationally from the pre-learned optimal policy. Specifically, we show how experiences in the form of a Deep Q-Network can be integrated as a heuristic into a heuristic search algorithm. We benchmark our algorithm in the field of path planning in semi-structured valet parking scenarios. There, we analyze the accuracy of such estimates and demonstrate the computational advantages and robustness of our method. Our method may encourage further investigation of the applicability of reinforcement-learning-based planning in the field of self-driving vehicles.
I. INTRODUCTION
Motion planners for self-driving vehicles frequently adhere to optimization- or search-based paradigms. At each new planning run, these methods reexamine the solution space to find an optimal motion. For higher-dimensional planning scenarios, this is computationally demanding. For instance, at the strategic level, such approaches commonly evaluate only a subset of potential maneuvers and their interaction with the traffic scene, restricting their usage to scenarios with a reduced number of participants and a limited time horizon [1, 2]. In path planning scenarios in unstructured environments, heuristic search algorithms, such as the Hybrid A* algorithm [3], fully reexplore the configuration space on every replanning task.

In contrast, humans rely on their past experiences to evaluate the safety and suitability of a maneuver. This allows them to handle complex planning problems with ease. Inspired by this, with Reinforcement Learning (RL), an optimal policy is derived by exploiting all past environmental interactions. The ongoing success in applying RL using neural networks to high-dimensional problems [4, 5] motivated its use for deriving driving policies for intersection crossing [6] or highway maneuvering [7]. However, approximate RL methods guarantee optimality of the learned policy merely in a statistical sense, impeding their usage in safety-critical systems such as autonomous vehicles.

Julian Bernhard, Robert Gieselmann and Klemens Esterle are with fortiss GmbH, An-Institut Technische Universität München, Munich, Germany. Alois Knoll is with the Chair of Robotics, Artificial Intelligence and Real-time Systems, Technische Universität München, Munich, Germany.
Fig. 1. The experience-based heuristic search algorithm relies on a pretrained Deep Q-Network to guide an incremental search. During node expansion, a single forward pass through the DQN, followed by a post-processing step, yields the heuristic costs for all expanded child nodes. Compared to baseline approaches, we benefit computationally from the optimal policy encoded within the network, even when the planning state s^Plan and training state s^MDP are defined differently.

This motivates our work: The Experience-Based Heuristic Search (EBHS) algorithm integrates experiences in the form of pretrained Q-values into a heuristic search as depicted in Figure 1. We demonstrate that our algorithm benefits computationally from the pretrained experiences. Further, it overcomes the statistical failure rate of a pure reinforcement-learning-based planner due to the added search process. Specifically, we apply Double Deep Q-Networks [5] and learning from demonstration [8] to learn the state-action values Q(s, a) for two application types in the field of path planning. The learned Q-functions are integrated into a Hybrid A* planner to replace the commonly used heuristic functions.

The main contributions of this paper are:
• An adaptation of a heuristic search algorithm to use learned experiences in the form of a Q-function as heuristic estimate.
• The evaluation of variants of Deep Q-learning algorithms and their parameters to study the accuracy of the derived heuristic estimate.
• A demonstration of the computational advantages when using our experience-based planner in semi-structured valet parking scenarios.
• A demonstration of the reliability of such an approach compared to pure reinforcement-learning-based planning.

The structure of this paper is as follows: First, we present previous work related to our field. Then, we introduce the EBHS algorithm. Next, we present the results of experience learning and, finally, an application to different planning scenarios and a statistical analysis of the robustness of our method.

II. RELATED WORK
The heuristic function plays an important role in all informed search algorithms. Previous work already combined search-based methods with learned heuristic functions, obtained either by supervised or reinforcement learning. A combination of Monte Carlo Tree Search (MCTS) with a learned policy and value network led to a major breakthrough in artificial intelligence by beating the best human players in the game of Go [9]. Paxton et al. [10] adapt this approach to discrete task planning for autonomous driving. However, as continuous state spaces remain challenging for the MCTS algorithm, their approach impedes a generation of continuous, dynamic behavior.

In the field of heuristic learning, Li et al. [11] trained a neural network with supervised learning to estimate a correction factor for a standard heuristic. Yet, their approach cannot replace the actual heuristic function. Pareekutty et al. [12] use value iteration to iteratively create a quality grid map during planning, which guides the node expansion of an RRT planner. However, their approach uses a discretized state space and does not allow pretraining of the heuristic.

Using imitation learning, Bhardwaj et al. [13] first acquire an optimal policy for a distribution of potential planning scenarios. This policy is used to guide a best-first search when planning for a specific scenario within the distribution. Similar to our approach, they encode the optimal policy with a Q-function. However, as they directly use the policy instead of calculating a heuristic from the Q-values, their algorithm sacrifices the optimality of the solution.

We benchmark our algorithm in the field of path planning in unstructured environments. A common approach in this field is the Hybrid A* algorithm, extending the standard A* algorithm towards a continuous state representation. It uses the maximum of two different heuristic functions [3]: a holonomic version considering obstacles and a non-holonomic version considering the kinematic constraints. We observed that this heuristic leads to long planning times in certain planning scenarios, since the two sub-heuristics may guide towards contradicting states.

To reduce planning time, the orientation-aware space exploration guided heuristic search algorithm creates a unified heuristic function [14]. In a pre-planning step, it performs a circle-based state exploration, leading to a decrease in planning time compared to the conventional Hybrid A* implementation. Other ways of heuristic definition are higher cost regions dependent on the amount of required additional gear shifts [15], or are based on a separation of the configuration space into visible and non-visible regions [16]. The above methods are suitable to decrease planning time in more complex, maze-like environments. In contrast, we investigate if exploiting an already learned maneuver might be more beneficial to reduce planning time in standard parking maneuvers. In semi-structured environments with given lanes, planning should consider the road geometry. Up to now, no analytical heuristic exists which estimates the non-holonomic path onto a curved lane. Instead, with a look-ahead parameter, a configuration on the curve is fixed, forming a planning problem with a single goal configuration [14, 17]. This parameter, however, does not generalize well to different situations.

Compared to existing work, we show how a learned Q-function can be used as the only heuristic in an A* algorithm to search for an optimal solution in a continuous state space.
We learn a non-holonomic heuristic for semi-structured environments, disregarding obstacles, and a unifying heuristic for standard parking scenarios considering both vehicle constraints and obstacles. Further, we show that a combination of learning- and search-based methods benefits from the optimality of the learned policy and the increase in robustness due to the additional search. This may pave the way to practical applications of machine learning algorithms for motion planning of autonomous vehicles.

III. PROBLEM DEFINITION
We want to find the sequence of actions leading from an environment start state s^Plan_start to one of several possible environment goal states S_g = {s^Plan_{goal,l}}, l ∈ {1, ..., H}. The actual end state s*_goal fulfills an optimality criterion, e.g. giving the path with minimum length.

The A* algorithm finds the minimum-cost solution by building a search tree rooted at s^Plan_start. By applying the set of possible actions A = {a_i}, i ∈ {1, ..., N}, from the current best state, new child states are expanded and the tree is iteratively grown until a goal configuration is reached. In each expansion step, the state with the lowest total cost f(s^Plan) = g(s^Plan) + h(s^Plan) is selected, with g(s^Plan) being the cost from the start state s^Plan_start to the current state s^Plan and h(s^Plan) naming the cost-to-go metric or heuristic function from the current state s^Plan to the set of goal states S_g. A closed list contains already expanded nodes. The search process is over either when the open list is empty or the number of maximum iterations is reached.

To ensure fast convergence of the search, the following conditions should hold for a heuristic function h(s^Plan):
• Admissibility h(·) ≤ h_opt(·): h(·) should never overestimate the true cost-to-go h_opt(·).
• Optimality h(·) ≈ h_opt(·): If h(·) is close to the true cost-to-go value, this speeds up goal expansion and reduces processing time.

We propose a learning-based mechanism to meet these requirements.
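For illustration, the following minimal best-first search skeleton reflects this cost structure; the expand, is_goal and heuristic callables are problem-specific placeholders, not part of the paper's implementation, and continuous states would in practice be hashed on a discretized grid before entering the closed list.

import heapq
import itertools

def best_first_search(s_start, is_goal, expand, heuristic, max_iterations=100_000):
    """Best-first search skeleton following Section III.

    expand(s) yields (child_state, step_cost) pairs, heuristic(s) returns the
    cost-to-go estimate h(s); both are placeholders for the planning problem.
    """
    tie = itertools.count()                            # tie-breaker for equal f-values
    open_list = [(heuristic(s_start), next(tie), 0.0, s_start, [s_start])]
    closed = set()
    for _ in range(max_iterations):
        if not open_list:
            return None                                # open list empty: search failed
        f, _, g, s, path = heapq.heappop(open_list)
        if is_goal(s):
            return path                                # goal expanded
        if s in closed:
            continue
        closed.add(s)
        for child, step_cost in expand(s):             # node expansion
            g_child = g + step_cost                    # cost from start
            f_child = g_child + heuristic(child)       # f(s) = g(s) + h(s)
            heapq.heappush(open_list, (f_child, next(tie), g_child, child, path + [child]))
    return None                                        # maximum number of iterations reached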
IV. EXPERIENCE-BASED HEURISTIC SEARCH

We derive how a state-action value Q(s^MDP, a) yields a heuristic function h(s^Plan) in the EBHS algorithm. In the following derivation, we set s := s^MDP for better readability.

A. Q-Learning
Reinforcement learning seeks an optimal policy for the problem of sequential decision making formulated as a Markov Decision Process (MDP). One distinguishes between value-based and policy-gradient methods. Q-learning belongs to the category of model-free, value-based reinforcement learning methods [4]. It learns the state-action value function

Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a \right],   (1)

representing the expected return when taking action a in state s and from thereon following policy π. The discount factor γ defines how future rewards r_t contribute to the current state-action value. The Bellman equation

Q^*(s, a) = \mathbb{E}_{s'} \left[ r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]   (2)

defines the fixed point of the optimal action-value function, from which the optimal policy a* = π*(s) = argmax_a Q*(s, a) is derived.

B. Q-function Integration
The MDP and planning state definitions may differ. A problem-dependent transformation s^MDP = t(s^Plan) links the two state definitions.
1) Definition of the Rewards:
The reward definition of the MDP shall simplify the heuristic calculation from the Q-function. As we will derive in the following, this requires

r(s, a, s') = \begin{cases} R_g, & \text{if } s' \in t(S_g) \\ 0, & \text{otherwise,} \end{cases}

meaning the only non-zero reward is given for a transition onto a goal state. Figure 2 visualizes this sparse reward setting.
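As an illustration of this sparse reward setting, the sketch below implements the reward above for a generic goal test; the goal test in_goal_region, the goal reward and the collision penalty used during training (Section V-A.1) are placeholders for the application-specific values listed in Table III.

def sparse_reward(s_next, in_goal_region, colliding=False, R_g=1000.0, R_collision=-100.0):
    """Sparse reward of the planning MDP (Section IV-B.1).

    Only a transition onto a goal state yields the positive reward R_g; every
    other transition yields zero. The collision penalty mirrors the training
    setup of Section V-A; its value here is a placeholder.
    """
    if colliding:
        return R_collision
    if in_goal_region(s_next):   # s' in t(S_g)
        return R_g
    return 0.0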
2) Preserving the Greedy Policy:
For the following reasoning, we assume that g(s_k) = 0, meaning that during node expansion the search algorithm ignores passed way-costs. In this theoretical setting, the order of expanded nodes of the EBHS algorithm shall resemble the state sequence of the optimal policy π*.

We achieve this by establishing an inversely proportional relationship between the heuristic value h(s^Plan_k) and the Q-function of the parent state Q(s_{k-1}, a_i), with the action a_i leading from s_{k-1} to s_k. Figure 2 shows the corresponding state transitions. For instance, if a_i is optimal in state s_{k-1}, meaning a_i = argmax_a Q(s_{k-1}, a), the heuristic h(s^Plan_k) shall take the lowest value among all nodes expanded from s_{k-1}.

Fig. 2. Visualization of the state transitions when following the optimal policy a*_k = π*(s_k) to the goal: s_{k-1} →(a_i, r=0) s_k →(a*_k, r=0) s_{k+1} →(a*_{k+1}, r=0) s_{k+2} → ... → s_{T-1} →(a*_{T-1}, r=R_g) s*_goal. We require a sparse reward setting to enable a straightforward calculation of the heuristic function from the Q-function.
Algorithm 1 ExpandNodeEBHS(s^Plan_{k-1})
  S_children ← ∅
  s^MDP_{k-1} ← t(s^Plan_{k-1})
  qvalues_{1,...,N} ← ForwardEvaluationDQN(s^MDP_{k-1})
  for a_i ∈ A, i ∈ 1, ..., N do
    s^Plan_{child,i} ← SimulateMotionSegment(s^Plan_{k-1}, a_i)
    if not colliding(s^Plan_{child,i}) then
      f(s^Plan_{child,i}) ← g(s^Plan_{child,i}) + log_γ(qvalues_i / R_g) · c_a
      S_children ← S_children ∪ {s^Plan_{child,i}}
  Output: S_children
3) Calculation of the Heuristic:
By combining the reward setting in aspect 1) with the Q-function definition in Equation 1, we get

Q^*(s_{k-1}, a_i) = 0 + \gamma Q^*(s_k, a^*_k) = 0 + \gamma \left[ \gamma Q^*(s_{k+1}, a^*_{k+1}) \right] = \dots = \gamma^L \underbrace{Q^*(s_{T-1}, a^*_{T-1})}_{= R_g} = \gamma^L \cdot R_g,   (3)

where L is the number of steps from state s_k to the goal. Solving Equation 3 for L yields the heuristic estimate

h(s^Plan_k) = L \cdot c_a = \log_\gamma \left( \frac{Q^*(s^MDP_{k-1}, a_i)}{R_g} \right) \cdot c_a.   (4)

We require a unique cost value c_a for all motion segments.
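As a concrete reading of Equation 4, the following sketch converts a single Q-value of the parent state into the child's heuristic cost; the guards for non-positive or overestimated network outputs are our own safeguards and not part of the derivation.

import math

def heuristic_from_q(q_value, R_g, gamma, c_a):
    """Heuristic estimate from Equation 4: h = log_gamma(Q / R_g) * c_a.

    q_value is the parent's state-action value Q(s_{k-1}, a_i) for the action
    leading to the child, R_g the sparse goal reward, gamma the discount
    factor and c_a the uniform cost of one motion segment.
    """
    if q_value <= 0.0:
        return float("inf")                  # undefined log: de-prioritize this child
    L = math.log(q_value / R_g, gamma)       # estimated number of remaining steps
    return max(L, 0.0) * c_a                 # clip slight overshoot when Q > R_g

During node expansion, the result is added to the way-cost g of the child, as in Algorithm 1.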
C. Deep Q-Networks for Heuristic Learning

To enable planning in a continuous state space, we must represent Q(s, a) with a function approximator, such as a neural network. Mnih et al. [4] successfully applied Q-learning to problems with higher-dimensional continuous state spaces. To overcome divergence issues when using neural networks for Q-function representation, they introduced the concepts of a target network and an experience replay buffer.

For integration of an approximated Q-function Q̃(s, a) as heuristic function, it is important to minimize its difference to the true Q-value Q*(s, a) given in Equation 1. Further, the training algorithm must be capable of dealing with sparse reward settings. Therefore, we evaluate the following algorithmic adaptations of DQN:
• Double Deep Q-learning (DDQN) [5] aims to reduce the upward bias inherent to approximated Q-values. This bias arises due to the maximum operation within the Bellman update.
• Prioritized experience replay [18] gives increased priority to experiences with high temporal-difference (TD) error, improving convergence, especially in sparse reward settings.
• We apply n-step Deep Q-learning proposed by Mnih et al. [19], but in a synchronous version. It reduces the upward bias of the Q-value estimate and speeds up the propagation of rewards to previously visited states. However, convergence is impeded due to higher variances of the TD-error estimates.
• Learning from demonstrations becomes beneficial when dealing with high-dimensional state spaces and sparse rewards. Thus, we apply Deep Q-learning from Demonstrations (DQfD) [8], which allows pretraining from an expert policy while still preserving the Bellman property.
Further algorithmic details are found in the respective publications.

We employ the standard neural network architecture for DQNs, outputting a vector of Q-values for all actions. We benefit computationally from this architecture, as we require only a single forward evaluation of the DQN in the node expansion step to retrieve the heuristic costs for all children; a sketch of this evaluation is given below. Algorithm 1 describes the node expansion process of the EBHS algorithm.
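To illustrate this architecture, the sketch below builds a fully connected Q-network (here with the NHL layer sizes from Table III as an assumed example) and converts a single forward pass into heuristic costs for all N children; the tf.keras usage is our assumption of an equivalent setup, not the authors' implementation, and the clipping of the network output is our own safeguard.

import numpy as np
import tensorflow as tf

def build_q_network(num_actions, hidden_layers=3, hidden_units=300):
    """Fully connected DQN: hidden ReLU layers, linear output with one Q-value per action."""
    layers = [tf.keras.layers.Dense(hidden_units, activation="relu") for _ in range(hidden_layers)]
    layers.append(tf.keras.layers.Dense(num_actions))          # linear output layer (NHL setting)
    return tf.keras.Sequential(layers)

def heuristic_costs_for_children(model, s_mdp, R_g, gamma, c_a):
    """Single forward pass yields the heuristic costs of all expanded children."""
    q_values = model(np.asarray(s_mdp, dtype=np.float32)[None, :]).numpy()[0]
    q_values = np.clip(q_values, 1e-6, R_g)                    # keep the logarithm well-defined
    return np.log(q_values / R_g) / np.log(gamma) * c_a        # Equation 4 for every action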
V. EXPERIMENT

We benchmark the EBHS algorithm in two applications from the field of path planning:
• Non-holonomic heuristic learning (NHL): As discussed in Section II, for semi-structured scenarios, no suitable non-holonomic heuristic exists for planning onto a continuously-curved road segment. Thus, we learn a non-holonomic heuristic estimating the optimal path onto a quadratic Bezier curve. The slope at a specific point on the curve defines the desired vehicle orientation at this point. Obstacles are considered by the EBHS algorithm and not during experience learning.
• Learning of a unified heuristic (UHL): We learn a unified heuristic for a standard parking scenario. The learned policy considers both non-holonomic constraints and obstacles. The scenario consists of two rows of four parking spaces placed opposite each other. The start configuration is arbitrarily oriented and placed between these rows. The goal is positioned in one of the eight parking spaces, and oriented forwards or backwards.
A. Experience Learning
We show how the experiences in the form of a Deep Q-function were acquired for the two applications and discuss findings of the training processes.
1) MDP Definition:
The vehicle kinematics were described by a single-track model with discretized steering angle κ and a constant speed v for forward and backward motions. We used the same motion primitives a_i = {κ_i, v} for the RL agent as used later on by the Hybrid A* algorithm for graph expansion.

We evaluated two types of representations of the vehicle configuration: a standard form with a normalized orientation c_s = (x_s, y_s, θ/π), and a trigonometric version c_t = (x_s, y_s, sin(θ), cos(θ)). The latter avoids a value jump after a full turn, which proved to be more beneficial in the UHL setting. The coordinate values x, y were normalized with respect to the workspace boundaries.

Fig. 3. The learning curves (success rate over training steps) for the best-performing hyperparameter sets for the NHL and UHL settings, averaged over three training runs. NHL used prioritized DDQN with one-step return. Due to the sparse reward setting, a long training time was required. In the UHL setting, we employed a DQfD algorithm. It initializes the policy from demonstrations, explaining the high success rate at the beginning.

Positive rewards were given when the vehicle reached a tolerance region around the goal, negative rewards for collisions with the workspace boundary, or, in case of the UHL setting, when colliding with an occupied parking lot. An episode was over either after colliding or when reaching the maximum number of allowed actions. Table III in the appendix provides the detailed MDP definitions.
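A minimal sketch of the state encoding described above, assuming the workspace bounds are passed in explicitly; concatenating the goal configuration and the one-hot parking-space encoding for UHL would follow the same pattern.

import numpy as np

def encode_configuration(x, y, theta, x_max, y_max, trigonometric=True):
    """Encode a vehicle configuration for the DQN input (Section V-A.1).

    Coordinates are normalized by the workspace extent; the orientation is
    either represented as a normalized angle (standard form c_s) or, to avoid
    the value jump after a full turn, as (sin(theta), cos(theta)) (form c_t).
    """
    xn, yn = x / x_max, y / y_max
    if trigonometric:
        return np.array([xn, yn, np.sin(theta), np.cos(theta)], dtype=np.float32)
    return np.array([xn, yn, theta / np.pi], dtype=np.float32)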
2) Deep Reinforcement Learning:
For the NHL application, we applied prioritized DDQN [18] and experimented with different n-step returns [19]. For the UHL application, we found that an initialization of the policy from expert demonstrations is beneficial to ease exploration of the higher-dimensional state space. Therefore, we employed prioritized DQfD [8] with Hybrid A* expert demonstrations. For both cases, we used ε-greedy exploration.
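For illustration, the following sketch computes the (synchronous) n-step Double DQN target used when fitting the online network; it reflects our reading of the cited training variants and uses placeholder callables for the online and target networks.

import numpy as np

def n_step_double_dqn_target(rewards, s_n, done, online_q, target_q, gamma):
    """n-step Double DQN target (Sections IV-C and V-A.2).

    rewards: the n intermediate rewards r_t, ..., r_{t+n-1}
    s_n:     state reached after n steps
    done:    True if the episode terminated within the n steps
    online_q / target_q: callables returning a vector of Q-values for a state
    """
    n = len(rewards)
    target = sum(gamma ** k * r for k, r in enumerate(rewards))  # discounted n-step return
    if not done:
        a_star = int(np.argmax(online_q(s_n)))        # action selection with the online network
        target += gamma ** n * target_q(s_n)[a_star]  # action evaluation with the target network
    return target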
3) Network Architectures:
The Q-function was approximated by a fully connected network with u hidden ReLU layers. A linear output layer for NHL and a tanh output layer for UHL, each of size N, outputted a state-action value for each of the motion primitives. The input layer had the dimensions of the MDP state space.
4) Training and Test Data:
The initial states of the MDPs were sampled from fixed training data sets at the beginning of each episode. To obtain a data set for the NHL setting, we fixed the first two Bezier curve supporting points in the left half of the workspace. Then, we sampled 100 Bezier curves by moving the third point on a half circle in the right half of the workspace. Each of the 100 Bezier curves was combined with 1000 randomly sampled vehicle start configurations, resulting in a randomized but fixed training set with 100,000 MDP states. For the UHL setting, we separated the training data into goal and start vehicle configurations and combined these sets randomly during training. The goal set consisted of one forward and one backward vehicle configuration for each of the eight parking spaces. To obtain the start configurations, we defined a grid in the configuration space with sub-meter spacings Δx, Δy and Δθ = 30°, and sorted out all colliding configurations. Combining these two configuration sets gave the UHL training set. Our equally sized test set for UHL consists of all intermediate start configurations in between the training configurations.
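The UHL start configurations can be generated on such a configuration-space grid as sketched below; the spacing values and the collision test is_colliding are placeholders, since the exact grid resolution is not reproduced here.

import numpy as np

def sample_start_grid(x_max, y_max, dx, dy, dtheta_deg, is_colliding):
    """Enumerate collision-free start configurations on a configuration-space grid.

    dx, dy are the translational spacings, dtheta_deg the angular spacing
    (30 degrees in the UHL setting); is_colliding(x, y, theta) is the
    application-specific collision test.
    """
    starts = []
    for x in np.arange(0.0, x_max, dx):
        for y in np.arange(0.0, y_max, dy):
            for theta in np.deg2rad(np.arange(0.0, 360.0, dtheta_deg)):
                if not is_colliding(x, y, theta):
                    starts.append((x, y, theta))
    return starts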
5) Results:
A random search over the most relevant hyperparameters was performed to improve the success rate of the learned policies. The success rate describes how often the goal configuration is reached on average over the most recent episodes. Figure 3 shows the success rates over the course of training for the best-performing parameters. Table III in the appendix summarizes the most relevant parameters used in the final evaluation.
B. Studying the Accuracy of Heuristic Estimates
In a next step, we investigated the effect of certain hyperparameters on the accuracy of the heuristic estimate for NHL. To simplify notation in the following, we use h(s_k) = h(s^Plan_k).

According to Anschel et al. [20], the difference between the optimal Q-function Q*(s, a), defined by the Bellman equation, and the Q-function Q̃(s, a) approximated by a neural network can be decomposed into:
• The target approximation error (TAE): During training, we minimize the temporal differences between subsequent state-action pairs. The TAE is the remaining minimization error after training. It arises due to inexact optimization of the loss functions, finite capacity of the neural network, and insufficient generalization to unseen state-action pairs.
• The overestimation error (OE): Noise during environment interaction leads to overestimations of the Q-values due to the maximum operation in the Bellman equation (Equation 2). Double Deep Q-Networks reduce this effect. But, as discussed in [20], a growth in TAE variance, a higher number of actions, or increasing the discount factor heightens this type of error. The variance of the TAE is reduced with a larger n-step return, as we bootstrap further into the future to estimate the temporal difference.
• The optimality difference is the error between standard tabular Q-learning and the optimal Q-function Q*(s, a). It is negligible in our evaluation.

To make different parameter settings comparable in our evaluation, we define a normalized TAE error as

TAE_Norm = TAE / (R_g · n).

We divide by the goal reward and, as the temporal difference error sums up with the n-step return, also by n. For different hyperparameters, Figure 4c depicts the normalized TAE after training, averaged over a mini-batch of training samples, and the corresponding final success rates. Though the size of the action set and the discount factor also influence the TAE, a change of these parameters greatly affected the success rate of the learned policy. Thus, these parameters were left out of this evaluation. We observe that the success rates resemble each other for the different parameter settings, but the TAE varies greatly.

To study the effects of the remaining TAE on the accuracy of the heuristic estimates, we defined two evaluation metrics. The single step difference

Δ_single step = h̃(s_k) − h̃(s_{k+1})

expresses how the learned heuristic estimate h̃(·) changes from a parent state s_k to a child state s_{k+1}. The total relative difference

Δ_total = (h_opt(s_k) − h̃(s_k)) / h_opt(s_k)

compares the true cost-to-go h_opt(·) to the learned heuristic estimate h̃(·). To increase convergence speed of the heuristic search algorithm, Δ_single step should be slightly lower than the cost of a motion segment c_a. Δ_total should be close to zero.

We estimate probability densities of these metrics using training samples in which the learned policy successfully reached the goal. For each of these samples, we calculated the true cost-to-go h_opt(·) by multiplying the motion cost with the true number of required steps, obtained by following the learned policy to the goal. The child state for calculation of Δ_single step was obtained by applying the greedy action from the parent state. Figure 4 compares the density estimates of these metrics for the NHL setting for different hyperparameters.

Fig. 4. Density estimates of the heuristic accuracy metrics for different hyperparameters (u: number of layers, n: length of the n-step return) and remaining steps to the goal (ranges 1-10 and 11-20) in the NHL setting: (a) single step difference, (b) total difference, (c) training results with the corresponding final success rates and TAEs.

The goal of our evaluation was to clarify why certain settings worked better in the final evaluation than others. For the three analyzed parameter settings, we summarize our main observations as follows:
• n = 3, u = 3: We obtain accurate peaks at the desired value for Δ_single step, and, as expected, the distribution has small variance. However, Δ_total shows a large offset. We assume that this is due to the high remaining TAE.
• n = 1, u = 3: We observe average performance for Δ_single step, but best performance for Δ_total, however, with an overall increase in variance.
• n = 1, u = 6: We expected the lowest TAE; however, unstable training amplified the TAE. The densities peak near the desired values, but the distributions are non-Gaussian. We assume overfitting of the Q-function, which leads to larger errors at non-frequently visited states.

Considering both evaluation metrics, one-step DQN with a small network capacity performed best. Thus, we selected its learned Q-function for NHL in the final evaluation.

We proposed two metrics which served as guidance for selecting suitable hyperparameters for experience learning. Yet, a profound study in the future should refine our metric definition and evaluate the influence of the observed variances on the performance of the EBHS algorithm. A sketch of the metric computations is given below.
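As a concrete reading of these metric definitions, the sketch below computes Δ_single step and Δ_total along one successful rollout of the learned greedy policy; the rollout and heuristic callables are placeholders.

import numpy as np

def heuristic_accuracy_metrics(states, h_learned, c_a):
    """Delta_single_step and Delta_total for one successful greedy rollout.

    states: sequence [s_0, ..., s_T] visited when following the greedy policy
            to the goal; h_learned(s) is the learned heuristic estimate.
    The true cost-to-go of s_k is approximated by the motion cost times the
    number of remaining steps along the rollout.
    """
    single_step, total = [], []
    T = len(states) - 1
    for k in range(T):
        h_k = h_learned(states[k])
        h_next = h_learned(states[k + 1])
        h_opt = c_a * (T - k)                      # true cost-to-go along the rollout
        single_step.append(h_k - h_next)           # should be slightly below c_a
        total.append((h_opt - h_k) / h_opt)        # should be close to zero
    return np.array(single_step), np.array(total)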
C. Final Evaluation of EBHS

In a final evaluation of the EBHS algorithm, we want to approach the following questions:
• How well can the EBHS algorithm benefit computationally from the pretrained experiences in comparison to baseline approaches?
• Can the EBHS algorithm generalize to scenarios not covered by the MDP state definition?
• Can the statistical failure rate of a pure reinforcement-learning-based approach be overcome with the EBHS algorithm?
1) Implementation Aspects:
We use the C++ implementation of the Hybrid A* algorithm presented in [14]. We interface with a tensorflow-based implementation in Python to estimate the heuristic costs for one expansion step and return them to the planner.

As baseline heuristic, we employ for the UHL application only the Reeds-Shepp heuristic. The additional A* heuristic worsened the performance of the baseline Hybrid A* in our experiment. We disable direct Reeds-Shepp goal expansion [3], as both EBHS and the Hybrid A* would benefit from it. For the NHL application, no analytical heuristic exists for planning onto a quadratic Bezier curve. Thus, we approximate it as follows: We sample the goal Bezier curve at equidistant points and calculate a Reeds-Shepp path to each of them. Then, we take the minimum-length path as heuristic cost, as sketched below.
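A sketch of this sampled baseline heuristic; reeds_shepp_length stands in for any Reeds-Shepp path-length routine between two configurations, and the quadratic Bezier evaluation uses the standard closed form.

import numpy as np

def quadratic_bezier(p0, p1, p2, t):
    """Point and tangent orientation of a quadratic Bezier curve at parameter t."""
    p0, p1, p2 = map(np.asarray, (p0, p1, p2))
    point = (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2
    tangent = 2 * (1 - t) * (p1 - p0) + 2 * t * (p2 - p1)
    return point, np.arctan2(tangent[1], tangent[0])   # curve slope defines the goal orientation

def sampled_bezier_heuristic(state, p0, p1, p2, reeds_shepp_length, num_samples=20):
    """Baseline NHL heuristic: minimum Reeds-Shepp path length onto the curve.

    state = (x, y, theta); reeds_shepp_length(start, goal) is a placeholder for
    a Reeds-Shepp path-length computation between two configurations.
    """
    costs = []
    for t in np.linspace(0.0, 1.0, num_samples):
        (gx, gy), gtheta = quadratic_bezier(p0, p1, p2, t)
        costs.append(reeds_shepp_length(state, (gx, gy, gtheta)))
    return min(costs)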
Fig. 5. Median planning durations [ms] with confidence bounds for the UHL scenario, estimated from 1000 test samples for each sample category (Success, 94.1%; Max. Steps, 4.3%; Collision, 1.6%; Total), for a pure DQN-based planner, the EBHS algorithm and the Hybrid A* baseline. The EBHS algorithm outperformed the baseline over the total test set and even succeeded for samples in which the DQN-based planner collided or reached the maximum number of allowed steps.

We performed the final evaluation on an Intel Core i7 @ 3.3 GHz and 16 GB RAM with disabled graphic card support to ensure the same processing conditions for all of the approaches.
2) Scenario Evaluation:
We applied the EBHS algorithm and the baseline approaches to two scenarios each for the NHL and UHL setting. For the NHL application, we selected a pullout maneuver and a parallel parking maneuver onto a curved road. Note that the obstacles in the NHL scenarios are not part of the MDP state space. The UHL setting was evaluated on a reverse parking scenario from the training data set, and a scenario where we added an obstacle not considered by the MDP state definition.

For these scenarios, Figure 6 shows the resulting paths and expanded nodes. Tables I and II provide numerical results. For all depicted scenarios, EBHS required a significantly lower number of planning iterations and a lower planning time, though one iteration in EBHS is computationally more costly due to the forward evaluation of the DQN and the Python interfacing. When adding an obstacle not included in the MDP state, the EBHS algorithm still benefited from the pretrained experiences, indicating the generalization capabilities of our approach. However, EBHS generated a longer path in this case. For the future, we plan to investigate how to improve optimality of the solution in generalization scenarios.
3) Statistical Evaluation:
Figure 3b depicts a success rate of 90% at the end of training in the UHL setting. Hence, learning of a suitable policy failed for 10% of the training data. For these fail samples, the learned policy either exceeded the maximum number of steps, due to the learned policy getting trapped in local minima, or led to a collision in case of difficult starting positions near the workspace boundary. This stochastic failure behavior occurs because the learned policy optimizes the expectation of the return over all states visited during training.

To see if EBHS can overcome the stochastic failure rate, we compared the planning durations of the EBHS algorithm, a pure DQN-based planner and the baseline Hybrid A* separately for fail and success test samples. The test data showed an equal failure rate as the training data. Figure 5 depicts the resulting median planning durations.
TABLE I
PLANNING RESULTS FOR NON-HOLONOMIC HEURISTIC LEARNING (NHL)

Scenario            Parallel Pullout        Backwards Pullout
Planner             EBHS      Baseline      EBHS      Baseline
Planning Time [s]   5.2       8.1           3.2       13.8
Expanded Nodes      2096      3467          2080      9115
Iterations          503       1630          227       1954
Path Length         9.0       9.0           12.0      9.0
TABLE II
PLANNING RESULTS FOR UNIFIED HEURISTIC LEARNING (UHL)

Scenario            Added Obstacle          Reverse
Planner             EBHS      Baseline      EBHS      Baseline
Planning Time [s]   0.2       3.9           0.5       4.5
Expanded Nodes      778       175230        2567      205862
Iterations          137       132682        457       131602
Path Length         26.6      16.9          23.9      24.3

Based on these results, we conclude that the node expansion process in the EBHS algorithm mainly follows the learned policy and spends only slight computational overhead on unnecessary node expansions. In contrast to DQN, EBHS always found a solution for the failure samples, but with a higher planning duration than the Hybrid A*. The total median duration over the whole test set outperforms that of the Hybrid A*.

Our evaluation showed that the EBHS algorithm successfully exploits learned experiences to speed up the convergence of the search process. The search process itself ensures robustness against the statistical failure rate of a pure DQN-based planner.

VI. CONCLUSION AND FUTURE WORK
We presented the EBHS algorithm, which uses experiences in the form of a Deep Q-Network as heuristic function in a heuristic search, and proposed two metrics to assess the accuracy of learned heuristic estimates for different hyperparameter settings. We demonstrated empirically that, with an additional search, we overcome the statistical failure rate of Deep-reinforcement-learning-based planning, but still benefit computationally from a pre-learned optimal policy.

Silver et al. [9] demonstrated the advantages of combining reinforcement learning with search-based algorithms for planning in discrete state spaces. The EBHS algorithm now represents a step forward in applying this principle to continuous state spaces. Yet, a better understanding of the DQN overestimation errors and the accuracy of the learned heuristics could further increase the benefits of our method. In the future, we plan to further investigate the generalization capabilities of our method, and apply it to strategic planning tasks in dynamic environments.

Fig. 6. Visualization of planned paths and expanded nodes for EBHS and baseline planners for the NHL and UHL applications: (a) NHL: Parallel Pullout, (b) NHL: Backwards Pullout, (c) UHL: Additional Obstacle, (d) UHL: Reverse. EBHS led to a lower number of expanded nodes and a faster planning time.
TABLE III
SUMMARY OF MOST RELEVANT HYPERPARAMETERS FOR LEARNING THE EXPERIENCES WITH DEEP Q-NETWORKS

MDP Definition              Non-holonomic Heuristic Learning (NHL)                       Unified Heuristic Learning (UHL)
State Space                 s^MDP = (c_start,s, P_0, P_1, P_2) with Bezier curve         s^MDP = (c_start,t, c_goal,t, o_1, ..., o_8) with one-hot
                            supporting points P_i = (x_i, y_i)                           encoding o_i ∈ {0, 1} of the parking spaces
Work Space [m]              0 ≤ x ≤ – , 0 ≤ y ≤ 30                                       0 ≤ x ≤ – , 0 ≤ y ≤ –
Action Space                seven discretized steering angles (±κ_1, ±κ_2, ±κ_3, 0°),    five discretized steering angles (±κ_1, ±κ_2, 0°),
                            constant speed v = ±v_0 m/s                                  constant speed v = ±v_0 m/s
Reward                      goal: +1000, collision: negative penalty                     goal: +1, collision: negative penalty
Time Step [s]               –                                                            –
Discount Factor             0.95                                                         0.95
Transition Model            deterministic: single-track vehicle                          deterministic: single-track vehicle

Deep Reinforcement Learning
Algorithm                   Prioritized DDQN                                             Prioritized DQfD with Hybrid A* demonstrations
Length of n-step return     1                                                            5
Hidden ReLU Layers x Units  3x300                                                        5x300
Output Layer Type           Linear                                                       Tanh

REFERENCES
[1] Hubmann, C., Becker, M., et al., "Decision making for autonomous driving considering interaction and uncertain prediction of surrounding vehicles," in IEEE Intelligent Vehicles Symposium, Jun. 2017, pp. 1671–1678.
[2] Kessler, T. and Knoll, A., "Multi vehicle trajectory coordination for automated parking," in IEEE Intelligent Vehicles Symposium, IEEE, 2017, pp. 661–666.
[3] Dolgov, D., Thrun, S., et al., "Path Planning for Autonomous Vehicles in Unknown Semi-structured Environments," The International Journal of Robotics Research, vol. 29, no. 5, pp. 485–501, Apr. 2010.
[4] Mnih, V., Kavukcuoglu, K., et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[5] Van Hasselt, H., Guez, A., et al., "Deep Reinforcement Learning with Double Q-Learning," in Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, Arizona: AAAI Press, 2016, pp. 2094–2100.
[6] Isele, D., Rahimi, R., et al., "Navigating Occluded Intersections with Autonomous Vehicles using Deep Reinforcement Learning," May 2017.
[7] Li, X., Xu, X., et al., "Reinforcement learning based overtaking decision-making for highway autonomous driving," Nov. 2015, pp. 336–342.
[8] Hester, T., Vecerik, M., et al., "Learning from Demonstrations for Real World Reinforcement Learning," arXiv:1704.03732, Apr. 2017.
[9] Silver, D., Huang, A., et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[10] Paxton, C., Raman, V., et al., "Combining neural networks and tree search for task and motion planning in challenging environments," in International Conference on Intelligent Robots and Systems, IEEE, Sep. 2017, pp. 6059–6066.
[11] Li, G., Wang, G., et al., "ANN: A Heuristic Search Algorithm Based on Artificial Neural Networks," in Proceedings of the 2016 International Conference on Intelligent Information Processing, Wuhan, China: ACM, 2016, 51:1–51:9.
[12] Pareekutty, N., James, F., et al., "RRT-HX: RRT With Heuristic Extend Operations for Motion Planning in Robotic Systems," in International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, vol. 5A: 40th Mechanisms and Robotics Conference, ASME, Aug. 2016.
[13] Bhardwaj, M., Choudhury, S., et al., "Learning Heuristic Search via Imitation," in Proceedings of the 1st Annual Conference on Robot Learning, vol. 78, PMLR, Nov. 2017, pp. 271–280.
[14] Chen, C., "Motion Planning for Nonholonomic Vehicles with Space Exploration Guided Heuristic Search," Dissertation, Technische Universität München, München, 2016.
[15] Liu, C., Wang, Y., et al., "Boundary layer heuristic for search-based nonholonomic path planning in maze-like environments," Jun. 2017, pp. 831–836.
[16] Choi, J.-W., "An Efficient Heuristic Estimate for Nonholonomic Motion Planning," vol. 10, 2012.
[17] Fassbender, D., Heinrich, B. C., et al., "Motion planning for autonomous vehicles in highly constrained urban environments," in Intelligent Robots and Systems, IEEE, 2016, pp. 4708–4713.
[18] Schaul, T., Quan, J., et al., "Prioritized experience replay," in International Conference on Learning Representations (ICLR), 2016.
[19] Mnih, V., Badia, A. P., et al., "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[20] Anschel, O., Baram, N., et al., "Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning," in International Conference on Machine Learning, 2017.