Addressing Inherent Uncertainty: Risk-Sensitive Behavior Generation for Automated Driving using Distributional Reinforcement Learning
Julian Bernhard, Stefan Pollok and Alois Knoll

©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract—For highly automated driving above SAE level 3, behavior generation algorithms must reliably consider the inherent uncertainties of the traffic environment, e.g. arising from the variety of human driving styles. Such uncertainties can generate ambiguous decisions, requiring the algorithm to appropriately balance low-probability hazardous events, e.g. collisions, and high-probability beneficial events, e.g. quickly crossing the intersection. State-of-the-art behavior generation algorithms lack a distributional treatment of decision outcomes. This impedes a proper risk evaluation in ambiguous situations, often encouraging either unsafe or conservative behavior. Thus, we propose a two-step approach for risk-sensitive behavior generation combining offline distribution learning with online risk assessment. Specifically, we first learn an optimal policy in an uncertain environment with Deep Distributional Reinforcement Learning. During execution, the optimal risk-sensitive action is selected by applying established risk criteria, such as the Conditional Value at Risk, to the learned state-action return distributions. In intersection crossing scenarios, we evaluate different risk criteria and demonstrate that our approach increases safety while maintaining an active driving style. Our approach shall encourage further studies on the benefits of risk-sensitive approaches for self-driving vehicles.
I. INTRODUCTION
For highly automated driving above SAE level 3, behavior generation algorithms must reliably consider the inherent uncertainties of the traffic environment, e.g. arising from the variety of driving styles of other participants. Such uncertainties can generate ambiguous decisions, requiring the algorithm to appropriately balance low-probability hazardous events, e.g. collisions, and high-probability beneficial events, e.g. quickly crossing the intersection. A single numeric value measuring the outcome of a decision does not appropriately characterize such an ambiguous situation, since it neglects the probability of events. Instead of using an expectation-based utility measure, humans resolve ambiguity by minimizing an adequate risk metric over an outcome distribution [1]. Such risk metrics better evaluate the potential harm of an action with respect to its probability of occurrence.

However, state-of-the-art behavior generation algorithms still lack a distributional treatment of risk. On the one hand, frequently used problem definitions for behavior generation, e.g. MDPs† [2, 3], POMDPs† [4, 5] or MAMDPs† [6–8], adhere to expectation-based return calculation, as it is the conventional definition of optimality for such problems.

Julian Bernhard and Stefan Pollok are with fortiss GmbH, An-Institut Technische Universität München, Munich, Germany. Alois Knoll is with the Chair of Robotics, Artificial Intelligence and Real-time Systems, Technische Universität München, Munich, Germany.

† MDP: Markov Decision Process, POMDP: Partially-Observable MDP, MAMDP: Multi-Agent MDP
Fig. 1. Our approach deals with an unknown episode-specific driver type of a participant sampled from a known environment-specific driver type distribution (right). Risk-neutral, state-action return distributions are trained offline for this environment and evaluated online regarding collision risk (left). This two-step risk-sensitive behavior generation approach increases safety in the face of behavioral uncertainty.

On the other hand, most problem solvers, e.g. Deep Q-Learning [2, 9, 10] and Monte Carlo Tree Search [6, 7] for MDPs or the Adaptive Belief Tree [4] for POMDPs, output the expected return instead of the return distribution.

Interestingly, recent variants of Deep Reinforcement Learning enable learning of state-action return distributions [11–13], motivating an approach for risk-sensitive behavior generation. Our two-step approach combines offline learning of the return distribution with online risk assessment (Fig. 1). It demonstrates the advantages of using risk-sensitive return metrics to increase safety in the face of behavioral uncertainty. Specifically, we use Deep Distributional Q-Learning to learn the risk-neutral, state-action return distributions in environments with an unknown episode-specific behavior type of a participant sampled from a known environment-specific behavior type distribution. During execution, the optimal action is then selected based on a distortion risk metric applied to the learned state-action return distributions.

The main contributions of this work are:
• A risk-sensitive behavior generation approach combining offline Deep Distributional Reinforcement Learning with online risk assessment.
• A benchmark of continuous observation spaces suitable for Deep Q-Learning in intersection scenarios.
• An evaluation of risk metrics applicable for behavior generation of autonomous vehicles.
• A demonstration of the safety benefits of risk-sensitive behavior generation in environments with behavioral uncertainty.

This work is structured as follows: First, we present related work and introduce our approach. Next, we present the experiment setup with a benchmark of neural network observation spaces, followed by a qualitative and quantitative evaluation of our method.
II. RELATED WORK
A behavior generation algorithm should consider the interactions between participants to successfully navigate in congested traffic, e.g. crowded intersections. Different variants exist to model interactive human behavior.
Cooperative approaches assume that all agents optimize a global cost function. Solving this multi-agent MDP either with optimization- [8, 14] or search-based methods [7, 15] yields a globally optimal solution defining also the ego agent's behavior. However, the equilibrium assumption neglects the uncertainty inherent to human interactions.
Probabilistic approaches model the uncertainty about the behavior of other participants as a hidden state in a POMDP and solve the problem mainly using sampling-based approaches, either offline [5] or online [4, 16]. Both cooperative and probabilistic approaches face the problem of a combinatorially increasing number of maneuvering options with a growing number of participants. To achieve real-time capability, these algorithms limit the planning horizon [4], consider only interactions with the nearest participants [7, 8] and apply computationally simple traffic prediction models [4, 7], e.g. the Intelligent Driver Model. Menéndez-Romero et al. [17] define a probabilistic cooperative approach for highway merging. However, their approach assumes a discrete formulation of the participants' intentions.
Deep Reinforcement Learning (DRL) promises interaction-aware decision making at lower computational cost. It learns the expected return of an action by interacting with other participants in simulation. During online planning, the agent exploits this experience. Isele et al. [10] apply Deep Q-Networks (DQN) to intersection crossing and extend it to occlusion handling in [9]. Wolf et al. [18] evaluate semantic state space definitions for DQNs in highway scenarios. Both approaches apply deterministic traffic models. Yet, even with deterministic models, a small percentage of collisions frequently remains. This epistemic or parametric uncertainty arises from imperfect information of the learning algorithm about the problem [19], e.g. coming from insufficient exploration or inexact minimization of the loss function. To overcome epistemic uncertainty when using reinforcement learning for autonomous driving behavior generation, one can combine DRL with a search process to allow escaping from local optima of the learned policy [20, 21] or add an additional safety layer to avoid insecure actions [2, 22, 23].

In contrast to epistemic uncertainty, our work deals with inherent or aleatoric uncertainty in the environment, e.g. arising from uncertainty about the behavior of other participants. To avoid unsafe decisions in such domains, risk-sensitive reinforcement learning employs an optimization criterion balancing the return and the risk of an action [24]. Such risk criteria and their application to the field of robotics are discussed in [1]. Dabney et al. [13] combine a novel, non-parametric approach for return distribution estimation using Deep Distributional Reinforcement Learning (DDRL) with risk-sensitive action selection. They outperform previous DDRL approaches in the domain of Atari games. The risk preferences of humans in driving scenarios are evaluated in [25] using Inverse Reinforcement Learning. To deal with behavioral uncertainty in DRL, Bouton et al. [3] add safety rules that block actions when they violate a safety constraint with a certain probability. The probabilistic safety measure is calculated separately for each participant in a discretized state space. At an intersection with two participants, their approach yields zero collisions. However, the approach neglects interactions between other participants and does not scale efficiently to more complex scenarios.

To the best of our knowledge, our work is the first which addresses the inherent uncertainty of traffic environments with risk-sensitive optimization criteria. Our algorithm avoids a rule-based formulation of safety or a discretization of the state space. It is solely based on the reward definition and easily interpretable risk evaluation metrics. Further, we demonstrate the advantages of Distributional Reinforcement Learning for autonomous vehicle behavior generation.
III. PROBLEM DEFINITION
There exist various types of inherent uncertainties. We focus on the inherent uncertainty that arises from the interaction with other traffic participants of varying driving styles. We formulate the problem as a Stochastic Bayesian Game (SBG) [26]. The SBG models other agents' behaviors based on a behavior type space and distribution. We can adopt this notion to our domain: A behavior type corresponds to a human driving style, e.g. "aggressive" or "passive". The type distribution models the occurrence frequencies of the driving styles in an environment. This SBG consists of:
• an environment state space S with fully observable kinodynamic states s_i of the participants.
• N traffic participants; for each participant i ∈ N:
  – an action set A_i of motion primitives,
  – a behavior type space Θ_i modeling the driving styles, e.g. Θ_i = {"aggressive", "passive"},
  – a reward function R_i : S × A × Θ_i → ℝ defining the reward after executing the joint action a ∈ A,
  – a stochastic policy π_i : H × A_i × Θ_i → [0, 1] over the sets of state-action histories H, actions A_i and behavior types Θ_i, e.g. π_i(H_t, a, "aggressive").
• a state transition function T : S × A × S → [0, 1].
• a type distribution ∆ : N × Θ → [0, 1] over the sets of participants' indices N and types Θ. In the above example, it reflects the percentage of drivers showing "aggressive" or "passive" behavior.

Before each episode of the SBG, the type θ_i for each participant i is sampled from Θ_i with probability ∆(θ_i). Based on the state-action history H_t at time step t, each participant repeatedly chooses an action according to its behavior π_i(H_t, a, θ_i) until a terminal environment state occurs. The ego-vehicle, i = 0, knows a priori the type distribution ∆ and space Θ, as well as the behaviors π_i, in our approach via inference during training in simulation. The episode-specific, sampled types θ_i of the other participants are unknown.

A behavior generation algorithm must solve the presented SBG. It shall find the optimal driving policy of the ego-vehicle π(·, ·, ·), maximizing positive return while considering the risk of negative return due to the uncertainty about the episode-specific driving styles of other participants.

IV. METHOD
We propose a risk-sensitive behavior generation approach to deal with the presented problem. It encompasses the following two steps, visualized in Fig. 1:

1) Offline Distribution Learning: The random return variable R depending on action a in environment state s is distributed according to the state-action return distribution Z(r | s, a). Using Distributional Reinforcement Learning [11], we learn Z∗(r | s, a) in simulation for a fixed behavior type space and distribution. It encodes the optimal policy of the ego-agent for such an environment.

2) Online Risk Assessment: We deviate from the standard, expectation-based selection of the optimal action with a∗ = argmax_a E_{r∼Z∗}[R]. Instead, we quantify the collision risk with distortion risk metrics [1] applied to the learned state-action value distribution. The optimal action is then selected based on the measured risk of each action.

Next, we describe the presented approach in detail.

A. Distributional Reinforcement Learning
Reinforcement learning finds an optimal policy for a Markov Decision Process (MDP). The Bellman equation defines the optimal Q-function

  Q∗(s, a) = E_{s'}[ r(s, a, s') + γ max_{a'} Q∗(s', a') | s, a ]   (1)

representing the expected return when taking action a in state s and from thereon following the optimal policy π∗(s) = a∗ = argmax_a Q∗(s, a). The discount factor γ defines how future rewards r_t ∼ R contribute to the current state-action value. Mnih et al. [27] introduced Deep Q-Networks (DQN), enabling Q-learning for problems with a higher-dimensional, continuous state space. Double Deep Q-Networks (DDQN) [28] and prioritized experience replay [29] improved convergence and optimality of DQN.

Distributional reinforcement learning models the return R as a random variable with probability distribution Z(r | s, a), the Q-value being the expected return Q(s, a) = E_{r∼Z}[R]. Bellemare et al. [11] introduced Deep Distributional Reinforcement Learning to learn Z(r | s, a) non-parametrically in a continuous state space. They proved that the Distributional Bellman equation

  Z(r | s, a) =_D R(s, a) + γ Z(r | s', a'),   s' ∼ T(· | s, a), a' ∼ π∗(· | s')   (2)

has a unique fixed point Z∗(s, a) that minimizes the maximal form of the Wasserstein metric, a distance between two probability distributions. Their proposed algorithm, C51, approximately minimizes this distance to learn the return distribution Z∗(s, a), from which the optimal policy is obtained greedily with π∗(s) = argmax_a E[Z∗(s, a)].

Quantile Regression Deep Q-Learning (QRDQN) improves the performance of the C51 algorithm by truly minimizing the Wasserstein metric [12]. It approximates the inverse cumulative distribution function (c.d.f.), or quantile function, F_Z^{-1} at discrete probabilities.
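To make the quantile-regression step concrete, the following minimal sketch shows the asymmetric Huber loss that QRDQN minimizes for one state-action pair. It is an illustration written for this text (NumPy only), not the authors' implementation; the fixed quantile midpoints, the variable names and the averaging over target samples are assumptions.

import numpy as np

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile-regression Huber loss as used by QRDQN (cf. Dabney et al. [12]).

    pred_quantiles: (N,) predicted quantile values of Z(r|s, a) for one action.
    target_samples: (M,) samples of the Bellman target r + gamma * Z(r|s', a*).
    """
    n = len(pred_quantiles)
    taus = (np.arange(n) + 0.5) / n                              # quantile midpoints
    # Pairwise TD errors between every target sample and every predicted quantile.
    u = target_samples[None, :] - pred_quantiles[:, None]        # shape (N, M)
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric weighting pushes each output towards its quantile fraction tau.
    weight = np.abs(taus[:, None] - (u < 0.0).astype(float))
    return float((weight * huber / kappa).mean())

Minimizing this loss over transitions drives the N outputs towards the quantiles of the return distribution Z∗(r | s, a), from which both the risk-neutral Q-value (the quantile mean) and the risk metrics of Sec. IV-C can be computed.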
B. Training Process

We apply the QRDQN algorithm to learn the state-action distribution Z∗(r | s, a) of the ego-agent for a fixed behavior type space and distribution in simulation. Before the start of a training episode, we sample episode-specific behavior types θ_i for all other participants i from the fixed, environment-specific type distribution ∆. The sampled types remain constant for the rest of the episode. The simulated participants then behave according to π_i(·, ·, θ_i). By seeing a multitude of episodes with different behavior types, the learning agent infers the type distribution and space, and learns a risk-neutral, optimal policy for the given SBG. After learning, Z∗(r | s, a) expresses the inherent uncertainty about the actual behavior types appearing in a specific episode.
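The training procedure can be summarized by the following sketch of one risk-neutral training episode. The environment and agent interfaces (env.reset, env.step, agent.act_epsilon_greedy, agent.store_transition, agent.update) are hypothetical placeholders for the simulator and the QRDQN learner, and the shown type distribution is the illustrative "mixed" case of Sec. V-B.

import random

# Environment-specific behavior type distribution Delta (here the "mixed" setting).
TYPE_DISTRIBUTION = {"aggressive": 0.5, "passive": 0.5}

def run_training_episode(env, agent, num_other_participants):
    """One risk-neutral training episode: behavior types are sampled once and kept fixed."""
    # Sample an episode-specific behavior type theta_i for every other participant.
    types = random.choices(list(TYPE_DISTRIBUTION),
                           weights=list(TYPE_DISTRIBUTION.values()),
                           k=num_other_participants)
    state = env.reset(behavior_types=types)          # hypothetical simulator interface
    done = False
    while not done:
        action = agent.act_epsilon_greedy(state)     # expectation-based, risk-neutral exploration
        next_state, reward, done = env.step(action)  # others follow pi_i(., ., theta_i)
        agent.store_transition(state, action, reward, next_state, done)
        agent.update()                               # quantile-regression update (cf. Sec. IV-A)
        state = next_state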
C. Risk Assessment

During execution, we quantify the risk of an action based on the learned distribution Z∗(s, a) using risk metrics. Learning of the state-action distribution occurred risk-neutrally with expectation-based action selection. Now, during execution of the learned behavior, we quantify the action risks with a distortion risk metric applied to the learned risk-neutral distribution.

Distortion risk metrics comply with six mathematical axioms and emerged from the field of finance. Their application as risk metrics in robotics is discussed by Majumdar and Pavone [1]. Sequential decision making may be temporally inconsistent when risk evaluation is not applied already during training [30]. However, the advantage of assessing risk based on the risk-neutral distribution is that the most suitable risk estimator and its parameters can be adapted online to the encountered traffic scene.

We evaluate two distortion risk metrics. For better readability, we denote Z = Z∗(r | s, a) in the following:

• Conditional Value at Risk (CVaR) [1]:

  ρ_CVaR[Z] = E_{r∼Z}[R | R < VaR_α]   (3)

  with probability parameter α and the value at risk VaR_α := F_Z^{-1}(α). Thus, α is the cumulative probability of returns smaller than VaR_α. The CVaR_α is the mean over this section of the return distribution.

• Wang [31]: Wang distorts the original cumulative distribution and uses the expectation of the resulting distribution:

  F_{Z'} = Φ[Φ^{-1}(F_Z) + β],   ρ_Wang[Z] = E_{r∼Z'}[R]   (4)

  where Φ is the standard normal c.d.f. and β a real-valued parameter. For a normal distribution, this metric shifts the mean to µ' = µ + βσ.

The calculation of the risk metrics is represented graphically in Fig. 2.

Fig. 2. Graphical representation of the calculation of the Wang and CVaR risk metrics: Wang distorts the original distribution and calculates an expectation over the resulting distribution; for a normal-like distribution it shifts the mean. CVaR considers only returns below the Value at Risk (VaR).

Action selection is greedy. For each action a_i, we calculate its expected return under the risk metric ρ and select the optimal, risk-sensitive action a∗ in state s_t with

  a∗(s_t) = argmax_{a_i ∈ A} ρ(Z(r | s_t, a_i)),   (5)

where ρ denotes the selected distortion risk metric (CVaR or Wang).
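Both metrics can be evaluated directly on the N quantile estimates that QRDQN outputs for Z∗(r | s, a). The sketch below is one possible implementation, not the authors' code: CVaR averages the lower α-tail of the sorted quantiles, and the Wang metric is computed by reading the learned quantile function at distorted fractions Φ(Φ^{-1}(τ) + β), which for a normal distribution reproduces the mean shift µ + βσ. Function names and the interpolation scheme are assumptions.

import numpy as np
from scipy.stats import norm

def cvar(quantiles, alpha):
    """CVaR_alpha (eq. 3): mean of the returns below the Value at Risk F_Z^{-1}(alpha)."""
    q = np.sort(quantiles)
    k = max(1, int(round(alpha * len(q))))       # number of quantiles in the lower alpha-tail
    return float(q[:k].mean())

def wang(quantiles, beta):
    """Wang metric (eq. 4): expectation of the return under the distorted distribution."""
    q = np.sort(quantiles)
    n = len(q)
    taus = (np.arange(n) + 0.5) / n              # quantile fractions of the QRDQN output
    distorted = norm.cdf(norm.ppf(taus) + beta)  # distorted fractions; beta < 0 is risk-averse
    # Read the learned quantile function at the distorted fractions and average.
    return float(np.interp(distorted, taus, q).mean())

def select_risk_sensitive_action(quantiles_per_action, risk_metric, **params):
    """Greedy risk-sensitive selection (eq. 5): a* = argmax_a rho(Z(r|s, a))."""
    return int(np.argmax([risk_metric(q, **params) for q in quantiles_per_action]))

# Usage: pick the CVaR-optimal action from per-action quantile estimates
# (the alpha value here is arbitrary, chosen only for illustration):
# a_star = select_risk_sensitive_action(z_quantiles, cvar, alpha=0.1)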
V. EXPERIMENT SETUP

We evaluate our approach in four turning scenarios at the T-intersection shown in Fig. 3. Below, we describe the experiment setup in detail.
A. Scenario
We consider four turning scenarios, with either a left or right turn and a varying number of participants. The other participants have the right of way, but react to the ego-vehicle. At the beginning of an episode, the ego-vehicle starts at the same point of the intersection with zero initial velocity. It succeeds when reaching the end of the turning lane without a collision. We limit the velocity of all participants to 54 km/h and bound the maximum acceleration and deceleration. The time step of the simulation is 200 ms.

B. Behavior Modeling
We define two deterministic driving styles, "passive" and "aggressive". Stochastic behavior types are planned in future work. The corresponding policies π(·, ·, "passive") and π(·, ·, "aggressive") use an Intelligent Driver Model (IDM) that also reacts to turning vehicles. π(·, ·, "aggressive") accelerates to the desired velocity and keeps the gap to other IDM vehicles, but it does not react and brake if the ego-vehicle occupies the lane.

As we want to evaluate performance with and without behavioral uncertainty, we consider two type definitions:
• Single: All other drivers behave aggressively, thus Θ_single = {"aggressive"} with ∆_single("aggressive") = 1.0.
• Mixed: Drivers act with equal percentage passively or aggressively, thus Θ_mixed = {"passive", "aggressive"} with uniform type distribution ∆_mixed("passive") = 0.5 and ∆_mixed("aggressive") = 0.5.

Before an episode, a single type is sampled from the selected type distribution. It is then used by all other participants.

C. Deep Reinforcement Learning
We train DQN and QRDQN agents separately for each scenario and for both type definitions, "single" and "mixed". We employ the standard DQN [27] and QRDQN [12] architectures with fully connected ReLU layers, outputting a single value per action for DQN and N = 200 quantiles per action for QRDQN. The input consists of the concatenated observations of all participants. Observation and action space are given below. We use prioritized experience replay [29] and Double DQN [28] for both DQN and QRDQN.

Rewards are defined for the ego-agent only. The other participants do not adhere to reward maximization; they are fully controlled by their policies π. The ego-agent receives a positive reward R_goal = 100 for reaching the goal and a negative reward R_collision for collisions. Every action additionally incurs a negative reward R_step. The discount factor γ is set to 0.95.
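A sketch of what such a QRDQN output head can look like in PyTorch is given below; the hidden layer width and depth are illustrative assumptions, since only fully connected ReLU layers and N = 200 quantiles per action are stated above.

import torch
import torch.nn as nn

class QRDQNHead(nn.Module):
    """Fully connected QRDQN network: outputs N quantile estimates per action."""

    def __init__(self, obs_dim, num_actions, num_quantiles=200, hidden=256):
        super().__init__()
        self.num_actions = num_actions
        self.num_quantiles = num_quantiles
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions * num_quantiles),
        )

    def forward(self, obs):
        # (batch, num_actions, num_quantiles): quantiles of Z(r|s, a) for every action.
        return self.net(obs).view(-1, self.num_actions, self.num_quantiles)

    def q_values(self, obs):
        # Risk-neutral Q(s, a) = E[Z(s, a)] is the mean over the quantile dimension.
        return self.forward(obs).mean(dim=-1)

The corresponding DQN baseline differs only in the last layer, which outputs a single value per action instead of a quantile vector.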
D. Training & Test Data

For each combination of scenario and type definition, we define a fixed training and a fixed test data set consisting of 100 000 episode definitions each. Each episode definition contains the environment state, specifying the initial kinodynamic vehicle states of all participants at the beginning of the episode, and the applied behavior type. The distribution of behavior types in the data set complies with the selected type distribution ∆_single or ∆_mixed.

To improve generalization, we vary the number of participants up to the maximum count of the scenario and shuffle the order of the vehicle states in the concatenated observation space. The other participants have a random velocity between 29 km/h and 36 km/h. Further, the data sets contain different initial gap sizes between two vehicles in the same lane, as depicted in Fig. 4.

Fig. 3. Intersection scenarios considered in the evaluation: (a) Turn right x 2, (b) Turn left x 2, (c) Turn left x 4, (d) Turn right platoon.

Fig. 4. Different gap sizes in the training and test data sets: (a) small gap, (b) intermediate gap, (c) large gap.

E. Evaluation Metrics and Significance Testing
The following metrics are used for evaluation:
• Success/Collision rate [%]: percentage of runs in which the ego-vehicle reached the end of the turning lane/collided.
• Max. time rate [%]: percentage of runs exceeding the maximum allowed crossing time (14 s).
• Crossing time [s]: time to reach the goal, averaged over successful runs.

We use a fixed set of 10 000 test runs per approach and scenario to calculate the metrics, employing the best performing training checkpoint after a fixed number of training steps. To check for significance of performance differences, we use paired statistical tests in which the test set index is the independent variable. For the binomial variables (success, collision, max. time) we perform a Cochran's Q test followed by pair-wise McNemar tests. For the crossing time, we use a repeated measures ANOVA followed by pair-wise dependent t-tests. The confidence level is 0.95, with Bonferroni correction for the pair-wise tests.
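As an illustration of the paired tests for the binary metrics, the following sketch implements Cochran's Q and the continuity-corrected McNemar statistic directly from their textbook formulas (SciPy is used only for the chi-square tail probability). It is not the evaluation code of the paper, and the ANOVA/t-test step for the crossing time is omitted.

import numpy as np
from scipy.stats import chi2

def cochrans_q(outcomes):
    """Cochran's Q test for k paired binary variables.

    outcomes: (n_runs, k) binary matrix, e.g. success indicators of k approaches
    evaluated on the same n_runs test episodes.
    """
    x = np.asarray(outcomes, dtype=float)
    n_runs, k = x.shape
    col = x.sum(axis=0)                    # successes per approach
    row = x.sum(axis=1)                    # successes per test episode
    q = (k - 1) * (k * np.sum(col ** 2) - col.sum() ** 2) / (k * row.sum() - np.sum(row ** 2))
    return q, chi2.sf(q, df=k - 1)

def mcnemar(a, b):
    """Pair-wise McNemar test (continuity corrected) between two approaches."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    n01 = int(np.sum(~a & b))              # discordant pairs: only b succeeded
    n10 = int(np.sum(a & ~b))              # discordant pairs: only a succeeded
    if n01 + n10 == 0:
        return 0.0, 1.0
    stat = (abs(n10 - n01) - 1) ** 2 / (n10 + n01)
    return stat, chi2.sf(stat, df=1)

# Bonferroni correction for the pair-wise follow-up tests at confidence level 0.95:
# reject only if p < 0.05 / number_of_pairwise_comparisons.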
F. Action Space
We use only longitudinal actions A_ego = (−3, 0, 2, 5) m/s along predefined left or right turning paths to facilitate an empirical analysis of the benefits of our method. A generally applicable driving policy including lateral actions is deferred to future work. The other participants have a continuous action space defined by their behavior model.

G. Observation Space
To find an appropriate observation space of the intersection scenario to be applied as input to DQN and QRDQN, we benchmarked different observation spaces. We chose a right turn scenario with a single other participant and type distribution Θ_single, and compared the success rates of a DQN agent for the following observation spaces:

1) Cartesian coordinates and velocity: (x_ego, y_ego, v_ego, x_1, y_1, v_1, ..., x_k, y_k, v_k)
2) Cartesian coordinates, velocity and binary presence value: (x_ego, y_ego, v_ego, x_1, y_1, v_1, c_1, ..., x_k, y_k, v_k, c_k)
3) Relative features to the ego-vehicle: (∆x_1, ∆y_1, ∆v_{x,1}, ∆v_{y,1}, ..., ∆x_k, ∆y_k, ∆v_{x,k}, ∆v_{y,k})
4) Relative features and ego-vehicle state: (x_ego, y_ego, v_ego, ∆x_1, ∆y_1, ∆v_{x,1}, ∆v_{y,1}, ..., ∆x_k, ∆y_k, ∆v_{x,k}, ∆v_{y,k})
5) Distance, orientation, velocity and TTC (inspired by [10]): (x_ego, y_ego, v_ego, d_1, ∆φ_1, v_1, TTC_1, ..., d_k, ∆φ_k, v_k, TTC_k)
6) Distance, signed velocity and lane: (d_1, v_{φ,1}, lane_1, ..., d_k, v_{φ,k}, lane_k)

All real-valued numbers are normalized to the range −1 to 1. The ∆ sign denotes the value difference to the ego-vehicle, d_k the Euclidean distance and TTC_k the Time-To-Collision between vehicle k and the ego-vehicle. The binary value c_k is one if the vehicle is present in the scene and zero otherwise. The signed velocity v_{φ,k} is positive when driving from left to right or bottom to top, and negative when driving from right to left. The lane index lane_k can take a value between one and four to indicate the position on one of the four available lanes. If the number of vehicles is lower than the scenario maximum, we calculate the missing features based on a vehicle with a zeroed state that does not interfere with the drivable space of the intersection, a negative TTC and lane = 0.

TABLE I. RESULTS OF THE OBSERVATION SPACE BENCHMARK.
Observation Representation            Collision Rate [%]   Max. Time Rate [%]   Crossing Time [s]
1) x, y, v
2) x, y, v, c
3) ∆x, ∆y, ∆v_x, ∆v_y
4) ∆x, ∆y, ∆v_x, ∆v_y, state_ego
5) d, ∆φ, v, TTC                      31.9                 0.0                  3.24
6) d, v_φ, lane                       2.6                  0.0                  4.79

Table I compares the results of this preliminary evaluation, without significance testing. The TTC-based representation led to an aggressive driving behavior with a high collision rate. Interestingly, sparse observation spaces also achieved an acceptable overall performance. However, relative state information in combination with the ego-vehicle state (4) outperformed the other representations in terms of collision rate and achieved a medium crossing time. Thus, we decided to use this representation in the evaluation.
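For illustration, the selected representation 4) can be assembled as in the sketch below. The dictionary-based vehicle states, the padding convention and the assumption that all features are already normalized to [−1, 1] by the caller are ours, not the paper's implementation.

import numpy as np

def build_observation(ego, others, max_vehicles):
    """Observation representation 4): ego state plus relative features of the other vehicles.

    ego / others: dicts with keys x, y, vx, vy (kinodynamic state of each vehicle).
    Missing vehicles are padded with a zeroed dummy state, as described above.
    During training, the order of `others` is shuffled to improve generalization.
    """
    obs = [ego["x"], ego["y"], float(np.hypot(ego["vx"], ego["vy"]))]   # x_ego, y_ego, v_ego
    dummy = {"x": 0.0, "y": 0.0, "vx": 0.0, "vy": 0.0}                  # zeroed padding vehicle
    vehicles = (list(others) + [dummy] * max_vehicles)[:max_vehicles]
    for veh in vehicles:
        obs += [veh["x"] - ego["x"],        # delta x
                veh["y"] - ego["y"],        # delta y
                veh["vx"] - ego["vx"],      # delta v_x
                veh["vy"] - ego["vy"]]      # delta v_y
    return np.asarray(obs, dtype=np.float32)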
Fig. 5. Success rates during training of DQN and QRDQN for the single and mixed behavior type space in the "Turn left x 2" scenario.
VI. EVALUATION
In our final evaluation, we compare the performance of the DQN baseline [2, 10, 18] with QRDQN, and QRDQN with risk-sensitive policy evaluation, in environments with the single and mixed behavior type definitions.
A. Exemplary Training Results
Fig. 5 compares the training data success rates of DQN and QRDQN over the course of training, exemplarily for the "Turn left x 2" scenario with the single or mixed behavior type space. QRDQN converged smoothly in both the single and the mixed setting. In contrast, DQN fluctuated strongly in the mixed setting over the later course of training. As expected, QRDQN thus showed improved stability of the training process in stochastic environments.

B. Risk Metric Parameterization
In a preliminary study, we coarsely evaluated the influence of the risk metric parameters α of CVaR and β of Wang using the trained QRDQN agents. We considered the average performance over all scenarios on the training data set. We discovered that Wang was very sensitive to parameter changes and started to yield conservative driving behavior for sufficiently negative β. In contrast, CVaR was more robust to parameter changes. The following evaluation uses one fixed value of α for CVaR and one fixed value of β for Wang.

C. Quantitative Analysis
First, we quantitatively compare the different approaches. We highlight in bold the best/worst result of a group if the group test and all pair-wise tests within the group were significant, as described in Sec. V-E.

We first discuss the advantage of Distributional Reinforcement Learning over standard Deep Q-Learning. Table II depicts the performance of these two algorithms for the "single" and "mixed" behavior type space, averaged over all scenarios. For Θ_single, QRDQN achieved a slightly higher success rate than standard DQN. However, for Θ_mixed, when uncertainty about the behavior of others was given, QRDQN reduced collisions by 5%. There, a local minimum in the learned policy of DQN led to a large max. time rate. The crossing time decreased with QRDQN in both the "single" and the "mixed" case. These results underline the benefits of learning state-action distributions Z(s, a) in uncertain environments. State-action values Q(s, a) do not dissolve the subtle return nuances of such domains. A detailed, more general discussion of the benefits of the distributional approach is given in [11].

Yet, a collision rate of 1.68% remained when using QRDQN with Θ_mixed. Table II also provides the results for QRDQN combined with the best performing risk measure, CVaR. Applying risk assessment during online planning outperformed the QRDQN approach significantly, halving the collision rate from 1.68% to 0.7%. The crossing time increased slightly due to conservative driving in the "Turn right platoon" scenario. Risk assessment led in that case to longer waiting times at the entrance of the intersection. When no inherent uncertainty was present (Θ_single), the learned distributions represent only model uncertainties, e.g. arising due to insufficient exploration or loss minimization. Risk assessment was still beneficial in that case.

Next, we compared the CVaR and Wang risk measures. The results for the different scenarios in the "mixed" setting are depicted in Table III. Pair-wise significance was found against the QRDQN approach (risk measure "none"), but not between CVaR and Wang. Still, we detect a tendency: Wang reduced collisions in three scenarios compared to QRDQN. In the "right platoon" scenario, Wang led to conservative driving, failing to cross the intersection within the maximum episode duration. In contrast, CVaR reduced collisions in all cases. The crossing time is comparable to QRDQN. In the platoon scenario, CVaR also drives conservatively, increasing the crossing time noticeably, but in contrast to Wang, it still managed to cross the intersection in all cases.

We conclude that CVaR is a suitable metric to evaluate risk in behavior generation algorithms of autonomous vehicles. Regarding the remaining collisions, further studies should investigate the effects of different risk metric parameterizations and evaluate the influence of epistemic uncertainties on the reliability of the proposed approach.

D. Qualitative Analysis
We pick out a single episode to qualitatively examine the reasons for the better performance with risk assessment. We consider a "Turn right x 2" episode with "passive" behavior of the other participants. In this case, risk assessment with CVaR or Wang resulted in a successful intersection crossing, whereas DQN and QRDQN collided.

Fig. 6b depicts the longitudinal position s and velocity v of the ego-agent over the course of the scenario. Slight backwards movements occurred with the distributional approaches, since, with our reward definition, the policy was optimized solely for safety. We postpone comfort constraints to later work. For specific time points in this scenario, Fig. 6a displays the current traffic situation and, for all actions, the corresponding learned distributions Z and the modified distributions Z_CVaR with returns pruned above the VaR. An arrow "⇐" highlights the selected, optimal action, respectively.

TABLE II. COMPARISON OF ALGORITHMS AVERAGED OVER ALL SCENARIOS (Θ_others: single/mixed; algorithms: DQN, QRDQN, QRDQN + CVaR; metrics: % Collisions, % Max. Time, Crossing Time [s]).

TABLE III. COMPARISON OF RISK METRICS IN THE "MIXED" TYPE SPACE.
Scenario        Risk Measure   % Collisions   % Max. Time   Crossing Time [s]
left x 4        CVaR           1.08           0.00          5.43
                None
                Wang           0.88           0.00          5.46
right platoon   CVaR           0.00           0.00
                None
                Wang                                        -
left x 2        CVaR           0.08           0.00          4.84
                None
                Wang           0.09           0.00          4.88
right x 2       CVaR           1.64           0.00          3.67
                None
                Wang           1.59           0.00          3.70

Fig. 6. For a "Turn right x 2" episode with passive behavior of the other participants, we show the evaluation of the scene and the corresponding return distributions at specific time points and compare velocity and longitudinal position for all algorithms. (a) Learned return distributions Z and modified return distributions Z_CVaR with returns pruned above the VaR at specific time points of the scenario; an arrow indicates the optimal action obtained with either a∗ = argmax_a E[Z] or a∗ = argmax_a E[Z_CVaR]. (b) Longitudinal position s and velocity v of the ego-agent over the course of the scenario. Risk-sensitive action selection with CVaR yields a more moderate acceleration at the second depicted time point than the learned optimal action of QRDQN, which chooses the highest acceleration. This subtle difference made the risk-sensitive approach successfully complete this episode, whereas DQN and QRDQN failed.

Overall, we see that the distributions mainly differ in the length of a tail of lower-probability negative returns and only marginally with respect to the higher-probability positive returns. At the beginning of the scenario at t = 0 s, risk assessment with CVaR did not change the optimality of the learned action. Braking remained optimal, since the other actions' long negative tails dominate their distributions even after pruning returns above the VaR. The second depicted time point, however, was critical for the final outcome of the episode. The learned policy of QRDQN chose the highest possible acceleration, primarily because the positive return probability in its distribution outweighs the large negative tail. In contrast, considering only the return values below the VaR for decision making yielded risk-averse action selection, with the optimal action being a lower acceleration value. This avoided the collision occurring with QRDQN and led to a successful completion of the scenario.

This example clarifies the benefits of the CVaR metric for behavior generation in uncertain environments: Due to the inherent uncertainty about a hazardous event in the environment, the distributions of riskier actions consist of a longer, low-probability tail at negative returns facing higher-probability peaks at more positive returns. The decision becomes ambiguous. The CVaR risk measure decides based on the VaR which returns to prune from the distribution, strengthening the contribution of less likely negative outcomes. Overall, this removes the ambiguity of a decision occurring with riskier actions and yields a safer driving policy.

A video comparing the performance of the evaluated algorithms and risk measures on selected episodes can be found at https://youtu.be/PSDFEG5d1xg.

VII. CONCLUSION AND FUTURE WORK
We proposed a two-step approach for risk-sensitive behavior generation, evaluating the risk of actions online based on return distributions learned offline with Deep Distributional Reinforcement Learning. We evaluated two distortion risk metrics and demonstrated that our approach increases safety in environments with inherent uncertainty about other participants' behaviors while avoiding overly conservative driving. Majumdar and Pavone [1] discussed the application of risk metrics, emerging from finance, to robotics. Our approach presents a step forward in applying risk-sensitive behavior generation to autonomous driving. Yet, a distributional consideration of risk in other methods, e.g. search-based methods, would broaden our understanding of its benefits and challenges.

To achieve a high level of safety under inherent and epistemic uncertainties, we plan to combine the approach with methods reducing epistemic uncertainty in the future, e.g. an additional search process [20].
REFERENCES

[1] Majumdar, A. and Pavone, M., "How Should a Robot Assess Risk? Towards an Axiomatic Theory of Risk in Robotics," CoRR, vol. abs/1710.11040, 2017.
[2] Mirchevska, B., Pek, C., et al., "High-Level Decision Making for Safe and Reasonable Autonomous Lane Changing Using Reinforcement Learning," IEEE, 2018.
[3] Bouton, M., Karlsson, J., et al., "Reinforcement learning with probabilistic guarantees for autonomous driving," in Workshop on Safety, Risk and Uncertainty in Reinforcement Learning, Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
[4] Hubmann, C., Schulz, J., et al., "A Belief State Planner for Interactive Merge Maneuvers in Congested Traffic," IEEE, 2018.
[5] Bouton, M., Cosgun, A., et al., "Belief state planning for autonomously navigating urban intersections," in Intelligent Vehicles Symposium (IV), IEEE, 2017.
[6] Kurzer, K., Engelhorn, F., et al., "Decentralized Cooperative Planning for Automated Vehicles with Continuous Monte Carlo Tree Search," CoRR, vol. abs/1809.03200, 2018.
[7] Lenz, D., Kessler, T., et al., "Tactical cooperative planning for autonomous highway driving using Monte-Carlo Tree Search," in Intelligent Vehicles Symposium (IV), IEEE, 2016.
[8] Burger, C. and Lauer, M., "Cooperative Multiple Vehicle Trajectory Planning using MIQP," IEEE, 2018.
[9] Isele, D., Rahimi, R., et al., "Navigating Occluded Intersections with Autonomous Vehicles Using Deep Reinforcement Learning," in International Conference on Robotics and Automation (ICRA), IEEE, 2018.
[10] ——, "Navigating Occluded Intersections with Autonomous Vehicles using Deep Reinforcement Learning," May 2017.
[11] Bellemare, M. G., Dabney, W., et al., "A distributional perspective on reinforcement learning," CoRR, vol. abs/1707.06887, 2017.
[12] Dabney, W., Rowland, M., et al., "Distributional Reinforcement Learning with Quantile Regression," CoRR, vol. abs/1710.10044, 2017.
[13] Dabney, W., Ostrovski, G., et al., "Implicit Quantile Networks for Distributional Reinforcement Learning," vol. 80, PMLR, Jul. 2018.
[14] Sadigh, D., Sastry, S., et al., "Planning for Autonomous Cars that Leverages Effects on Human Actions," in Proceedings of the Robotics: Science and Systems Conference (RSS), Jun. 2016.
[15] Kurzer, K., Zhou, C., et al., "Decentralized Cooperative Planning for Automated Vehicles with Hierarchical Monte Carlo Tree Search," in Intelligent Vehicles Symposium (IV), Jun. 2018.
[16] Bai, H., Cai, S., et al., "Intention-aware online POMDP planning for autonomous driving in a crowd," in IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2015.
[17] Menéndez-Romero, C., Sezer, M., et al., "Courtesy Behavior for Highly Automated Vehicles on Highway Interchanges," in Intelligent Vehicles Symposium (IV), IEEE, Jun. 2018.
[18] Wolf, P., Kurzer, K., et al., "Adaptive Behavior Generation for Autonomous Driving using Deep Reinforcement Learning with Compact Semantic States," IEEE, 2018.
[19] Dilokthanakul, N. and Shanahan, M., "Deep Reinforcement Learning with Risk-Seeking Exploration," in From Animals to Animats 15, Springer International Publishing, 2018.
[20] Bernhard, J., Gieselmann, R., et al., "Experience-Based Heuristic Search: Robust Motion Planning with Deep Q-Learning," IEEE, 2018.
[21] Paxton, C., Raman, V., et al., "Combining neural networks and tree search for task and motion planning in challenging environments," in International Conference on Intelligent Robots and Systems, IEEE, Sep. 2017.
[22] Mukadam, M., Cosgun, A., et al., "Tactical Decision Making for Lane Changing with Deep Reinforcement Learning," in Conference on Neural Information Processing (NIPS), 2017.
[23] Shalev-Shwartz, S., Shammah, S., et al., "Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving," CoRR, vol. abs/1610.03295, 2016.
[24] García, J. and Fernández, F., "A Comprehensive Survey on Safe Reinforcement Learning," Journal of Machine Learning Research, vol. 16, 2015.
[25] Majumdar, A., Singh, S., et al., "Risk-sensitive inverse reinforcement learning via coherent risk models," in Robotics: Science and Systems, 2017.
[26] Albrecht, S. V., Crandall, J. W., et al., "Belief and Truth in Hypothesised Behaviours," Artificial Intelligence, vol. 235, Jun. 2016.
[27] Mnih, V., Kavukcuoglu, K., et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, Feb. 2015.
[28] Van Hasselt, H., Guez, A., et al., "Deep Reinforcement Learning with Double Q-Learning," AAAI Press, 2016.
[29] Schaul, T., Quan, J., et al., "Prioritized experience replay," in International Conference on Learning Representations (ICLR), 2016.
[30] Ruszczyński, A., "Risk-averse dynamic programming for Markov decision processes," Mathematical Programming, vol. 125, no. 2, Oct. 2010.
[31] Wang, S. S., "A Class of Distortion Operators for Pricing Financial and Insurance Risks."