Addressing Inherent Uncertainty: Risk-Sensitive Behavior Generation for Automated Driving using Distributional Reinforcement Learning
Julian Bernhard, Stefan Pollok and Alois Knoll

©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract—For highly automated driving above SAE level 3, behavior generation algorithms must reliably consider the inherent uncertainties of the traffic environment, e.g. arising from the variety of human driving styles. Such uncertainties can generate ambiguous decisions, requiring the algorithm to appropriately balance low-probability hazardous events, e.g. collisions, and high-probability beneficial events, e.g. quickly crossing the intersection. State-of-the-art behavior generation algorithms lack a distributional treatment of decision outcomes. This impedes a proper risk evaluation in ambiguous situations, often encouraging either unsafe or conservative behavior. Thus, we propose a two-step approach for risk-sensitive behavior generation combining offline distribution learning with online risk assessment. Specifically, we first learn an optimal policy in an uncertain environment with Deep Distributional Reinforcement Learning. During execution, the optimal risk-sensitive action is selected by applying established risk criteria, such as the Conditional Value at Risk, to the learned state-action return distributions. In intersection crossing scenarios, we evaluate different risk criteria and demonstrate that our approach increases safety while maintaining an active driving style. Our approach shall encourage further studies on the benefits of risk-sensitive approaches for self-driving vehicles.
I. INTRODUCTION
For highly automated driving above SAE level 3, behavior generation algorithms must reliably consider the inherent uncertainties of the traffic environment, e.g. arising from the variety of driving styles of other participants. Such uncertainties can generate ambiguous decisions, requiring the algorithm to appropriately balance low-probability hazardous events, e.g. collisions, and high-probability beneficial events, e.g. quickly crossing the intersection. A single numeric value measuring the outcome of a decision does not appropriately characterize such an ambiguous situation, since it neglects the probability of events. Instead of using an expectation-based utility measure, humans resolve ambiguity by minimizing an adequate risk metric over an outcome distribution [1]. Such risk metrics better evaluate the potential harm of an action with respect to its probability of occurrence.

However, state-of-the-art behavior generation algorithms still lack a distributional treatment of risk. On the one hand, frequently used problem definitions for behavior generation, e.g. MDPs† [2, 3], POMDPs† [4, 5] or MAMDPs† [6–8], adhere to expectation-based return calculation, as it is the conventional definition of optimality for such problems.

Julian Bernhard and Stefan Pollok are with fortiss GmbH, An-Institut Technische Universität München, Munich, Germany. Alois Knoll is with the Chair of Robotics, Artificial Intelligence and Real-time Systems, Technische Universität München, Munich, Germany.

† MDP: Markov Decision Process, POMDP: Partially-Observable MDP, MAMDP: Multi-Agent MDP
Fig. 1. Our approach deals with an unknown episode-specific driver type of a participant sampled from a known environment-specific driver type distribution (right). Risk-neutral, state-action return distributions are trained offline for this environment and evaluated online regarding collision risk (left). This two-step risk-sensitive behavior generation approach increases safety in the face of behavioral uncertainty.

On the other hand, most problem solvers, e.g. Deep Q-Learning [2, 9, 10] and Monte Carlo Tree Search [6, 7] for MDPs or the Adaptive Belief Tree [4] for POMDPs, output the expected return instead of the return distribution.

Interestingly, recent variants of Deep Reinforcement Learning enable learning of state-action return distributions [11–13], motivating an approach for risk-sensitive behavior generation. Our two-step approach combines offline learning of the return distribution with online risk assessment (Fig. 1). It demonstrates the advantages of using risk-sensitive return metrics to increase safety in the face of behavioral uncertainty. Specifically, we use Deep Distributional Q-Learning to learn the risk-neutral, state-action return distributions in environments with an unknown episode-specific behavior type of a participant sampled from a known environment-specific behavior type distribution. During execution, the optimal action is then selected based on a distortion risk metric applied to the learned state-action return distributions.

The main contributions of this work are:
• A risk-sensitive behavior generation approach combining offline Deep Distributional Reinforcement Learning with online risk assessment.
• A benchmark of continuous observation spaces suitable for Deep Q-Learning in intersection scenarios.
• An evaluation of risk metrics applicable for behavior generation of autonomous vehicles.
• A demonstration of the safety benefits of risk-sensitive behavior generation in environments with behavioral uncertainty.

This work is structured as follows: First, we present related work and introduce our approach. Next, we present the experiment setup with a benchmark of neural network observation spaces, followed by a qualitative and quantitative evaluation of our method.
II. RELATED WORK
A behavior generation algorithm should consider the interactions between participants to successfully navigate in congested traffic, e.g. crowded intersections. Different variants exist to model interactive human behavior.
Cooperative approaches assume that all agents optimize a global cost function. Solving this multi-agent MDP either with optimization- [8, 14] or search-based methods [7, 15] yields a globally optimal solution defining also the ego agent's behavior. However, the equilibrium assumption neglects the uncertainty inherent to human interactions.
Probabilistic approaches model the uncertainty about the behavior of other participants as a hidden state in a POMDP and solve the problem mainly using sampling-based approaches, either offline [5] or online [4, 16]. Both cooperative and probabilistic approaches face the problem of a combinatorially increasing number of maneuvering options with a growing number of participants. To achieve real-time capability, these algorithms limit the planning horizon [4], consider only interactions with the nearest participants [7, 8] and apply computationally simple traffic prediction models [4, 7], e.g. the Intelligent Driver Model. Menéndez-Romero et al. [17] define a probabilistic cooperative approach for highway merging. However, their approach assumes a discrete formulation of the participants' intentions.
Deep Reinforcement Learning (DRL) promises interaction-aware decision making at lower computational cost. It learns the expected return of an action by interacting with other participants in simulation. During online planning, the agent exploits this experience. Isele et al. [10] apply Deep Q-Networks (DQN) to intersection crossing and extend it to occlusion handling in [9]. Wolf et al. [18] evaluate semantic state space definitions for DQNs in highway scenarios. Both approaches apply deterministic traffic models. Yet, even with deterministic models, a small percentage of collisions frequently remains. This epistemic or parametric uncertainty arises from imperfect information of the learning algorithm about the problem [19], e.g. coming from insufficient exploration or inexact minimization of the loss function. To overcome epistemic uncertainty when using reinforcement learning for autonomous driving behavior generation, one can combine DRL with a search process to allow escaping from local optima of the learned policy [20, 21] or add an additional safety layer to avoid insecure actions [2, 22, 23].

In contrast to epistemic uncertainty, our work deals with inherent or aleatoric uncertainty in the environment, e.g. arising from uncertainty about the behavior of other participants. To avoid unsafe decisions in such domains, risk-sensitive reinforcement learning employs an optimization criterion balancing the return and the risk of an action [24]. Such risk criteria and their application to the field of robotics are discussed in [1]. Dabney et al. [13] combine a novel, non-parametric approach for return distribution estimation using Deep Distributional Reinforcement Learning (DDRL) with risk-sensitive action selection. They outperform previous DDRL approaches in the domain of Atari games. The risk preferences of humans in driving scenarios are evaluated in [25] using Inverse Reinforcement Learning. To deal with behavioral uncertainty in DRL, Bouton et al. [3] add safety rules that block actions when they violate a safety constraint with a certain probability. The probabilistic safety measure is calculated separately for each participant in a discretized state space. At an intersection with two participants, their approach yields zero collisions. However, the approach neglects interactions between other participants and does not scale efficiently to more complex scenarios.

To the best of our knowledge, our work is the first which addresses the inherent uncertainty of traffic environments with risk-sensitive optimization criteria. Our algorithm avoids a rule-based formulation of safety or a discretization of the state space. It is solely based on the reward definition and easily interpretable risk evaluation metrics. Further, we demonstrate the advantages of Distributional Reinforcement Learning for autonomous vehicle behavior generation.
III. PROBLEM DEFINITION
There exist various types of inherent uncertainties. We focus on the inherent uncertainty that arises from the interaction with other traffic participants of varying driving styles. We formulate the problem as a Stochastic Bayesian Game (SBG) [26]. The SBG models other agents' behaviors based on a behavior type space and distribution. We can adopt this notion to our domain: A behavior type corresponds to a human driving style, e.g. "aggressive" or "passive". The type distribution models the occurrence frequencies of the driving styles in an environment. This SBG consists of:
• an environment state space S with fully observable kinodynamic states s_i of the participants.
• N traffic participants; for each participant i ∈ N:
  – an action set A_i of motion primitives,
  – a behavior type space Θ_i modeling the driving styles, e.g. Θ_i = {"aggressive", "passive"},
  – a reward function R_i : S × A × Θ_i → ℝ defining the reward after executing the joint action a ∈ A,
  – a stochastic policy π_i : H × A_i × Θ_i → [0, 1] over the sets of state-action histories H, actions A_i and behavior types Θ_i, e.g. π_i(H_t, a, "aggressive").
• a state transition function T : S × A × S → [0, 1].
• a type distribution ∆ : N × Θ → [0, 1] over the sets of participants' indices N and types Θ. In the above example, it reflects the percentage of drivers showing "aggressive" or "passive" behavior.

Before each episode of the SBG, the type θ_i for each participant i is sampled from Θ_i with probability ∆(θ_i). Based on the state-action history H_t at time step t, each participant repeatedly chooses an action according to its behavior π_i(H_t, a, θ_i) until a terminal environment state occurs. The ego-vehicle, i = 0, knows a priori the type distribution ∆ and space Θ, as well as the behaviors π_i, in our approach via inference during training in simulation. The episode-specific, sampled types θ_i of the other participants are unknown.

A behavior generation algorithm must solve the presented SBG. It shall find the optimal driving policy of the ego-vehicle π(·, ·, ·), maximizing positive return while considering the risk of negative return due to the uncertainty about the episode-specific driving styles of other participants.

IV. METHOD
We propose a risk-sensitive behavior generation approach to deal with the presented problem. It encompasses the following two steps, visualized in Fig. 1:

1) Offline Distribution Learning: The random return variable R depending on action a in environment state s is distributed according to the state-action return distribution Z(r | s, a). Using Distributional Reinforcement Learning [11], we learn Z∗(r | s, a) in simulation for a fixed behavior type space and distribution. It encodes the optimal policy of the ego-agent for such an environment.

2) Online Risk Assessment: We deviate from the standard, expectation-based selection of the optimal action with a∗ = argmax_a E_{r∼Z∗}[R]. Instead, we quantify the collision risk with distortion risk metrics [1] applied to the learned state-action value distribution. The optimal action is then selected based on the measured risk of each action.

Next, we describe the presented approach in detail.

A. Distributional Reinforcement Learning
Reinforcement learning finds an optimal policy for a Markov Decision Process (MDP). The Bellman equation defines the optimal Q-function

  Q∗(s, a) = E_{s'}[ r(s, a, s') + γ max_{a'} Q∗(s', a') | s, a ]   (1)

representing the expected return when taking action a in state s and from thereon following the optimal policy π∗(s) = a∗ = argmax_a Q∗(s, a). The discount factor γ defines how future rewards r_t ∼ R contribute to the current state-action value. Mnih et al. [27] introduced Deep Q-Networks (DQN), enabling Q-learning for problems with a higher-dimensional, continuous state space. Double Deep Q-Networks (DDQN) [28] and prioritized experience replay [29] improved convergence and optimality of DQN.

Distributional reinforcement learning models the return R as a random variable with probability distribution Z(r | s, a), the Q-value being the expected return Q(s, a) = E_{r∼Z}[R]. Bellemare et al. [11] introduced Deep Distributional Reinforcement Learning to learn Z(r | s, a) non-parametrically in a continuous state space. They proved that the Distributional Bellman equation

  Z(r | s, a) =_D R(s, a) + γ Z(r | s', a'),   s' ∼ T(· | s, a), a' ∼ π∗(· | s')   (2)

has a unique fixed point Z∗(s, a) that minimizes the maximal form of the Wasserstein metric, a distance between two probability distributions. Their proposed algorithm, C51, approximately minimizes this distance to learn the return distribution Z∗(s, a), from which the optimal policy is obtained greedily with π∗(s) = argmax_a E[Z∗(s, a)].

Quantile Regression Deep Q-Learning (QRDQN) improves the performance of the C51 algorithm by truly minimizing the Wasserstein metric [12]. It approximates the inverse cumulative distribution function (c.d.f.), or quantile function, F_Z^{-1} at discrete probabilities.
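To make the quantile-regression step concrete, the following minimal sketch shows the asymmetric Huber loss that QRDQN minimizes for one state-action pair. It is an illustration written for this text (NumPy only), not the authors' implementation; the fixed quantile midpoints, the variable names and the averaging over target samples are assumptions.

import numpy as np

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile-regression Huber loss as used by QRDQN (cf. Dabney et al. [12]).

    pred_quantiles: (N,) predicted quantile values of Z(r|s, a) for one action.
    target_samples: (M,) samples of the Bellman target r + gamma * Z(r|s', a*).
    """
    n = len(pred_quantiles)
    taus = (np.arange(n) + 0.5) / n                              # quantile midpoints
    # Pairwise TD errors between every target sample and every predicted quantile.
    u = target_samples[None, :] - pred_quantiles[:, None]        # shape (N, M)
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric weighting pushes each output towards its quantile fraction tau.
    weight = np.abs(taus[:, None] - (u < 0.0).astype(float))
    return float((weight * huber / kappa).mean())

Minimizing this loss over transitions drives the N outputs towards the quantiles of the return distribution Z∗(r | s, a), from which both the risk-neutral Q-value (the quantile mean) and the risk metrics of Sec. IV-C can be computed.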
B. Training Process

We apply the QRDQN algorithm to learn the state-action distribution Z∗(r | s, a) of the ego-agent for a fixed behavior type space and distribution in simulation. Before the start of a training episode, we sample episode-specific behavior types θ_i for all other participants i from the fixed, environment-specific type distribution ∆. The sampled types remain constant for the rest of the episode. The simulated participants then behave according to π_i(·, ·, θ_i). By seeing a multitude of episodes with different behavior types, the learning agent infers the type distribution and space, and learns a risk-neutral, optimal policy for the given SBG. After learning, Z∗(r | s, a) expresses the inherent uncertainty about the actual behavior types appearing in a specific episode.
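The training procedure can be summarized by the following sketch of one risk-neutral training episode. The environment and agent interfaces (env.reset, env.step, agent.act_epsilon_greedy, agent.store_transition, agent.update) are hypothetical placeholders for the simulator and the QRDQN learner, and the shown type distribution is the illustrative "mixed" case of Sec. V-B.

import random

# Environment-specific behavior type distribution Delta (here the "mixed" setting).
TYPE_DISTRIBUTION = {"aggressive": 0.5, "passive": 0.5}

def run_training_episode(env, agent, num_other_participants):
    """One risk-neutral training episode: behavior types are sampled once and kept fixed."""
    # Sample an episode-specific behavior type theta_i for every other participant.
    types = random.choices(list(TYPE_DISTRIBUTION),
                           weights=list(TYPE_DISTRIBUTION.values()),
                           k=num_other_participants)
    state = env.reset(behavior_types=types)          # hypothetical simulator interface
    done = False
    while not done:
        action = agent.act_epsilon_greedy(state)     # expectation-based, risk-neutral exploration
        next_state, reward, done = env.step(action)  # others follow pi_i(., ., theta_i)
        agent.store_transition(state, action, reward, next_state, done)
        agent.update()                               # quantile-regression update (cf. Sec. IV-A)
        state = next_state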
C. Risk Assessment

During execution, we quantify the risk of an action based on the learned distribution Z∗(s, a) using risk metrics. Learning of the state-action distribution occurred risk-neutrally with expectation-based action selection. Now, during execution of the learned behavior, we quantify the action risks with a distortion risk metric applied to the learned risk-neutral distribution.

Distortion risk metrics comply with six mathematical axioms and emerged from the field of finance. Their application as risk metrics in robotics is discussed by Majumdar and Pavone [1]. Sequential decision making may be temporally inconsistent when risk evaluation is not applied already during training [30]. However, the advantage of assessing risk based on the risk-neutral distribution is that the most suitable risk estimator and its parameters can be adapted online to the encountered traffic scene.

We evaluate two distortion risk metrics. For better readability, we denote Z = Z∗(r | s, a) in the following:

• Conditional Value at Risk (CVaR) [1]:

  ρ_CVaR[Z] = E_{r∼Z}[R | R < VaR_α]   (3)

  with probability parameter α and the value at risk VaR_α := F_Z^{-1}(α). Thus, α is the cumulative probability of returns smaller than VaR_α. The CVaR_α is the mean over this section of the return distribution.

• Wang [31]: Wang distorts the original cumulative distribution and uses the expectation of the resulting distribution:

  F_{Z'} = Φ[Φ^{-1}(F_Z) + β],   ρ_Wang[Z] = E_{r∼Z'}[R]   (4)

  where Φ is the standard normal c.d.f. and β a real-valued parameter. For a normal distribution, this metric shifts the mean to µ' = µ + βσ.

The calculation of the risk metrics is represented graphically in Fig. 2.

Fig. 2. Graphical representation of the calculation of the Wang and CVaR risk metrics: Wang distorts the original distribution and calculates an expectation over the resulting distribution; for a normal-like distribution it shifts the mean. CVaR considers only returns below the Value at Risk (VaR).

Action selection is greedy. For each action a_i, we calculate its expected return under the risk metric ρ and select the optimal, risk-sensitive action a∗ in state s_t with

  a∗(s_t) = argmax_{a_i ∈ A} ρ(Z(r | s_t, a_i)),   (5)

where ρ denotes the selected distortion risk metric (CVaR or Wang).
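Both metrics can be evaluated directly on the N quantile estimates that QRDQN outputs for Z∗(r | s, a). The sketch below is one possible implementation, not the authors' code: CVaR averages the lower α-tail of the sorted quantiles, and the Wang metric is computed by reading the learned quantile function at distorted fractions Φ(Φ^{-1}(τ) + β), which for a normal distribution reproduces the mean shift µ + βσ. Function names and the interpolation scheme are assumptions.

import numpy as np
from scipy.stats import norm

def cvar(quantiles, alpha):
    """CVaR_alpha (eq. 3): mean of the returns below the Value at Risk F_Z^{-1}(alpha)."""
    q = np.sort(quantiles)
    k = max(1, int(round(alpha * len(q))))       # number of quantiles in the lower alpha-tail
    return float(q[:k].mean())

def wang(quantiles, beta):
    """Wang metric (eq. 4): expectation of the return under the distorted distribution."""
    q = np.sort(quantiles)
    n = len(q)
    taus = (np.arange(n) + 0.5) / n              # quantile fractions of the QRDQN output
    distorted = norm.cdf(norm.ppf(taus) + beta)  # distorted fractions; beta < 0 is risk-averse
    # Read the learned quantile function at the distorted fractions and average.
    return float(np.interp(distorted, taus, q).mean())

def select_risk_sensitive_action(quantiles_per_action, risk_metric, **params):
    """Greedy risk-sensitive selection (eq. 5): a* = argmax_a rho(Z(r|s, a))."""
    return int(np.argmax([risk_metric(q, **params) for q in quantiles_per_action]))

# Usage: pick the CVaR-optimal action from per-action quantile estimates
# (the alpha value here is arbitrary, chosen only for illustration):
# a_star = select_risk_sensitive_action(z_quantiles, cvar, alpha=0.1)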
V. EXPERIMENT SETUP

We evaluate our approach in four turning scenarios at the T-intersection shown in Fig. 3. Below, we describe the experiment setup in detail.
A. Scenario
We consider four turning scenarios, with either a left or right turn and a varying number of participants. The other participants have the right of way, but react to the ego-vehicle. At the beginning of an episode, the ego-vehicle starts at the same point of the intersection with zero initial velocity. It succeeds when reaching the end of the turning lane without a collision. We limit the velocity of all participants to 54 km/h and bound the maximum acceleration and deceleration. The time step of the simulation is 200 ms.

B. Behavior Modeling
We define two deterministic driving styles, "passive" and "aggressive". Stochastic behavior types are planned in future work. The corresponding policies π(·, ·, "passive") and π(·, ·, "aggressive") use an Intelligent Driver Model (IDM) that also reacts to turning vehicles. π(·, ·, "aggressive") accelerates to the desired velocity and keeps the gap to other IDM vehicles, but it does not react and brake if the ego-vehicle occupies the lane.

As we want to evaluate performance with and without behavioral uncertainty, we consider two type definitions:
• Single: All other drivers behave aggressively, thus Θ_single = {"aggressive"} with ∆_single("aggressive") = 1.0.
• Mixed: Drivers act with equal percentage passively or aggressively, thus Θ_mixed = {"passive", "aggressive"} with uniform type distribution ∆_mixed("passive") = 0.5 and ∆_mixed("aggressive") = 0.5.

Before an episode, a single type is sampled from the selected type distribution. It is then used by all other participants.

C. Deep Reinforcement Learning
We train DQN and QRDQN agents separately for each scenario and for both type definitions, "single" and "mixed". We employ the standard DQN [27] and QRDQN [12] architectures with fully connected ReLU layers, outputting a single value per action for DQN and N = 200 quantiles per action for QRDQN. The input consists of the concatenated observations of all participants. Observation and action space are given below. We use prioritized experience replay [29] and Double DQN [28] for both DQN and QRDQN.

Rewards are defined for the ego-agent only. The other participants do not adhere to reward maximization; they are fully controlled by their policies π. The ego-agent receives a positive reward R_goal = 100 for reaching the goal and a negative reward R_collision for collisions. Every action additionally incurs a negative reward R_step. The discount factor γ is set to 0.95.
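A sketch of what such a QRDQN output head can look like in PyTorch is given below; the hidden layer width and depth are illustrative assumptions, since only fully connected ReLU layers and N = 200 quantiles per action are stated above.

import torch
import torch.nn as nn

class QRDQNHead(nn.Module):
    """Fully connected QRDQN network: outputs N quantile estimates per action."""

    def __init__(self, obs_dim, num_actions, num_quantiles=200, hidden=256):
        super().__init__()
        self.num_actions = num_actions
        self.num_quantiles = num_quantiles
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions * num_quantiles),
        )

    def forward(self, obs):
        # (batch, num_actions, num_quantiles): quantiles of Z(r|s, a) for every action.
        return self.net(obs).view(-1, self.num_actions, self.num_quantiles)

    def q_values(self, obs):
        # Risk-neutral Q(s, a) = E[Z(s, a)] is the mean over the quantile dimension.
        return self.forward(obs).mean(dim=-1)

The corresponding DQN baseline differs only in the last layer, which outputs a single value per action instead of a quantile vector.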
D. Training & Test Data

For each combination of scenario and type definition, we define a fixed training and a fixed test data set consisting of 100 000 episode definitions each. Each episode definition contains the environment state, specifying the initial kinodynamic vehicle states of all participants at the beginning of the episode, and the applied behavior type. The distribution of behavior types in the data set complies with the selected type distribution ∆_single or ∆_mixed.

To improve generalization, we vary the number of participants up to the maximum count of the scenario and shuffle the order of the vehicle states in the concatenated observation space. The other participants have a random velocity between 29 km/h and 36 km/h. Further, the data sets contain different initial gap sizes between two vehicles in the same lane, as depicted in Fig. 4.

Fig. 3. Intersection scenarios considered in the evaluation: (a) Turn right x 2, (b) Turn left x 2, (c) Turn left x 4, (d) Turn right platoon.

Fig. 4. Different gap sizes in the training and test data sets: (a) small gap, (b) intermediate gap, (c) large gap.

E. Evaluation Metrics and Significance Testing
The following metrics are used for evaluation:
• Success/Collision rate [%]: percentage of runs in which the ego-vehicle reached the end of the turning lane/collided.
• Max. time rate [%]: percentage of runs exceeding the maximum allowed crossing time (14 s).
• Crossing time [s]: time to reach the goal, averaged over successful runs.

We use a fixed set of 10 000 test runs per approach and scenario to calculate the metrics, employing the best performing training checkpoint after a fixed number of training steps. To check for significance of performance differences, we use paired statistical tests in which the test set index is the independent variable. For the binomial variables (success, collision, max. time) we perform a Cochran's Q test followed by pair-wise McNemar tests. For the crossing time, we use a repeated measures ANOVA followed by pair-wise dependent t-tests. The confidence level is 0.95, with Bonferroni correction for the pair-wise tests.
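As an illustration of the paired tests for the binary metrics, the following sketch implements Cochran's Q and the continuity-corrected McNemar statistic directly from their textbook formulas (SciPy is used only for the chi-square tail probability). It is not the evaluation code of the paper, and the ANOVA/t-test step for the crossing time is omitted.

import numpy as np
from scipy.stats import chi2

def cochrans_q(outcomes):
    """Cochran's Q test for k paired binary variables.

    outcomes: (n_runs, k) binary matrix, e.g. success indicators of k approaches
    evaluated on the same n_runs test episodes.
    """
    x = np.asarray(outcomes, dtype=float)
    n_runs, k = x.shape
    col = x.sum(axis=0)                    # successes per approach
    row = x.sum(axis=1)                    # successes per test episode
    q = (k - 1) * (k * np.sum(col ** 2) - col.sum() ** 2) / (k * row.sum() - np.sum(row ** 2))
    return q, chi2.sf(q, df=k - 1)

def mcnemar(a, b):
    """Pair-wise McNemar test (continuity corrected) between two approaches."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    n01 = int(np.sum(~a & b))              # discordant pairs: only b succeeded
    n10 = int(np.sum(a & ~b))              # discordant pairs: only a succeeded
    if n01 + n10 == 0:
        return 0.0, 1.0
    stat = (abs(n10 - n01) - 1) ** 2 / (n10 + n01)
    return stat, chi2.sf(stat, df=1)

# Bonferroni correction for the pair-wise follow-up tests at confidence level 0.95:
# reject only if p < 0.05 / number_of_pairwise_comparisons.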
F. Action Space
We use only longitudinal actions A_ego = (−3, 0, 2, 5) m/s along predefined left or right turning paths to facilitate an empirical analysis of the benefits of our method. A generally applicable driving policy including lateral actions is deferred to future work. The other participants have a continuous action space defined by their behavior model.

G. Observation Space
To find an appropriate observation space of the intersection scenario to be applied as input to DQN and QRDQN, we benchmarked different observation spaces. We chose a right turn scenario with a single other participant and type distribution Θ_single, and compared the success rates of a DQN agent for the following observation spaces:

1) Cartesian coordinates and velocity: (x_ego, y_ego, v_ego, x_1, y_1, v_1, ..., x_k, y_k, v_k)
2) Cartesian coordinates, velocity and binary presence value: (x_ego, y_ego, v_ego, x_1, y_1, v_1, c_1, ..., x_k, y_k, v_k, c_k)
3) Relative features to the ego-vehicle: (∆x_1, ∆y_1, ∆v_{x,1}, ∆v_{y,1}, ..., ∆x_k, ∆y_k, ∆v_{x,k}, ∆v_{y,k})
4) Relative features and ego-vehicle state: (x_ego, y_ego, v_ego, ∆x_1, ∆y_1, ∆v_{x,1}, ∆v_{y,1}, ..., ∆x_k, ∆y_k, ∆v_{x,k}, ∆v_{y,k})
5) Distance, orientation, velocity and TTC (inspired by [10]): (x_ego, y_ego, v_ego, d_1, ∆φ_1, v_1, TTC_1, ..., d_k, ∆φ_k, v_k, TTC_k)
6) Distance, signed velocity and lane: (d_1, v_{φ,1}, lane_1, ..., d_k, v_{φ,k}, lane_k)

All real-valued numbers are normalized to the range −1 to 1. The ∆ sign denotes the value difference to the ego-vehicle, d_k the Euclidean distance and TTC_k the Time-To-Collision between vehicle k and the ego-vehicle. The binary value c_k is one if the vehicle is present in the scene and zero otherwise. The signed velocity v_{φ,k} is positive when driving from left to right or bottom to top, and negative when driving from right to left. The lane index lane_k can take a value between one and four to indicate the position on one of the four available lanes. If the number of vehicles is lower than the scenario maximum, we calculate the missing features based on a vehicle with a zeroed state that does not interfere with the drivable space of the intersection, a negative TTC and lane = 0.

TABLE I. RESULTS OF THE OBSERVATION SPACE BENCHMARK.
Observation Representation            Collision Rate [%]   Max. Time Rate [%]   Crossing Time [s]
1) x, y, v
2) x, y, v, c
3) ∆x, ∆y, ∆v_x, ∆v_y
4) ∆x, ∆y, ∆v_x, ∆v_y, state_ego
5) d, ∆φ, v, TTC                      31.9                 0.0                  3.24
6) d, v_φ, lane                       2.6                  0.0                  4.79

Table I compares the results of this preliminary evaluation, without significance testing. The TTC-based representation led to an aggressive driving behavior with a high collision rate. Interestingly, sparse observation spaces also achieved an acceptable overall performance. However, relative state information in combination with the ego-vehicle state (4) outperformed the other representations in terms of collision rate and achieved a medium crossing time. Thus, we decided to use this representation in the evaluation.
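For illustration, the selected representation 4) can be assembled as in the sketch below. The dictionary-based vehicle states, the padding convention and the assumption that all features are already normalized to [−1, 1] by the caller are ours, not the paper's implementation.

import numpy as np

def build_observation(ego, others, max_vehicles):
    """Observation representation 4): ego state plus relative features of the other vehicles.

    ego / others: dicts with keys x, y, vx, vy (kinodynamic state of each vehicle).
    Missing vehicles are padded with a zeroed dummy state, as described above.
    During training, the order of `others` is shuffled to improve generalization.
    """
    obs = [ego["x"], ego["y"], float(np.hypot(ego["vx"], ego["vy"]))]   # x_ego, y_ego, v_ego
    dummy = {"x": 0.0, "y": 0.0, "vx": 0.0, "vy": 0.0}                  # zeroed padding vehicle
    vehicles = (list(others) + [dummy] * max_vehicles)[:max_vehicles]
    for veh in vehicles:
        obs += [veh["x"] - ego["x"],        # delta x
                veh["y"] - ego["y"],        # delta y
                veh["vx"] - ego["vx"],      # delta v_x
                veh["vy"] - ego["vy"]]      # delta v_y
    return np.asarray(obs, dtype=np.float32)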
Fig. 5. Success rates during training of DQN and QRDQN for the single and mixed behavior type space in the "Turn left x 2" scenario.
VI. EVALUATION
In our final evaluation, we compare the performance of the DQN baseline [2, 10, 18] with QRDQN, and QRDQN with risk-sensitive policy evaluation, in environments with the single and mixed behavior type definitions.
A. Exemplary Training Results
Fig. 5 compares the training data success rates of DQN and QRDQN over the course of training, exemplarily for the "Turn left x 2" scenario with the single or mixed behavior type space. QRDQN converged smoothly in both the single and the mixed setting. In contrast, DQN fluctuated strongly in the mixed setting over the later course of training. As expected, QRDQN thus showed improved stability of the training process in stochastic environments.

B. Risk Metric Parameterization
In a preliminary study, we coarsely evaluated the influence of the risk metric parameters α of CVaR and β of Wang using the trained QRDQN agents. We considered the average performance over all scenarios on the training data set. We discovered that Wang was very sensitive to parameter changes and started to yield conservative driving behavior for sufficiently negative β. In contrast, CVaR was more robust to parameter changes. The following evaluation uses one fixed value of α for CVaR and one fixed value of β for Wang.

C. Quantitative Analysis
First, we quantitatively compare the different approaches. We highlight in bold the best/worst result of a group if the group test and all pair-wise tests within the group were significant, as described in Sec. V-E.

We first discuss the advantage of Distributional Reinforcement Learning over standard Deep Q-Learning. Table II depicts the performance of these two algorithms for the "single" and "mixed" behavior type space, averaged over all scenarios. For Θ_single, QRDQN achieved a slightly higher success rate than standard DQN. However, for Θ_mixed, when uncertainty about the behavior of others was given, QRDQN reduced collisions by 5%. There, a local minimum in the learned policy of DQN led to a large max. time rate. The crossing time decreased with QRDQN in both the "single" and the "mixed" case. These results underline the benefits of learning state-action distributions Z(s, a) in uncertain environments. State-action values Q(s, a) do not dissolve the subtle return nuances of such domains. A detailed, more general discussion of the benefits of the distributional approach is given in [11].

Yet, a collision rate of 1.68% remained when using QRDQN with Θ_mixed. Table II also provides the results for QRDQN combined with the best performing risk measure, CVaR. Applying risk assessment during online planning outperformed the QRDQN approach significantly, halving the collision rate from 1.68% to 0.7%. The crossing time increased slightly due to conservative driving in the "Turn right platoon" scenario. Risk assessment led in that case to longer waiting times at the entrance of the intersection. When no inherent uncertainty was present (Θ_single), the learned distributions represent only model uncertainties, e.g. arising due to insufficient exploration or loss minimization. Risk assessment was still beneficial in that case.

Next, we compared the CVaR and Wang risk measures. The results for the different scenarios in the "mixed" setting are depicted in Table III. Pair-wise significance was found against the QRDQN approach (risk measure "none"), but not between CVaR and Wang. Still, we detect a tendency: Wang reduced collisions in three scenarios compared to QRDQN. In the "right platoon" scenario, Wang led to conservative driving, failing to cross the intersection within the maximum episode duration. In contrast, CVaR reduced collisions in all cases. The crossing time is comparable to QRDQN. In the platoon scenario, CVaR also drives conservatively, increasing the crossing time noticeably, but in contrast to Wang, it still managed to cross the intersection in all cases.

We conclude that CVaR is a suitable metric to evaluate risk in behavior generation algorithms of autonomous vehicles. Regarding the remaining collisions, further studies should investigate the effects of different risk metric parameterizations and evaluate the influence of epistemic uncertainties on the reliability of the proposed approach.

D. Qualitative Analysis
We pick out a single episode to qualitatively examine the reasons for the better performance with risk assessment. We consider a "Turn right x 2" episode with "passive" behavior of the other participants. In this case, risk assessment with CVaR or Wang resulted in a successful intersection crossing, whereas DQN and QRDQN collided.

Fig. 6b depicts the longitudinal position s and velocity v of the ego-agent over the course of the scenario. Slight backwards movements occurred with the distributional approaches, since, with our reward definition, the policy was optimized solely for safety. We postpone comfort constraints to later work. For specific time points in this scenario, Fig. 6a displays the current traffic situation and, for all actions, the corresponding learned distributions Z and the modified distributions Z_CVaR with returns pruned above the VaR. An arrow "⇐" highlights the selected, optimal action, respectively.

TABLE II. COMPARISON OF ALGORITHMS AVERAGED OVER ALL SCENARIOS (Θ_others: single/mixed; algorithms: DQN, QRDQN, QRDQN + CVaR; metrics: % Collisions, % Max. Time, Crossing Time [s]).

TABLE III. COMPARISON OF RISK METRICS IN THE "MIXED" TYPE SPACE.
Scenario        Risk Measure   % Collisions   % Max. Time   Crossing Time [s]
left x 4        CVaR           1.08           0.00          5.43
                None
                Wang           0.88           0.00          5.46
right platoon   CVaR           0.00           0.00
                None
                Wang                                        -
left x 2        CVaR           0.08           0.00          4.84
                None
                Wang           0.09           0.00          4.88
right x 2       CVaR           1.64           0.00          3.67
                None
                Wang           1.59           0.00          3.70

Fig. 6. For a "Turn right x 2" episode with passive behavior of the other participants, we show the evaluation of the scene and the corresponding return distributions at specific time points and compare velocity and longitudinal position for all algorithms. (a) Learned return distributions Z and modified return distributions Z_CVaR with returns pruned above the VaR at specific time points of the scenario; an arrow indicates the optimal action obtained with either a∗ = argmax_a E[Z] or a∗ = argmax_a E[Z_CVaR]. (b) Longitudinal position s and velocity v of the ego-agent over the course of the scenario. Risk-sensitive action selection with CVaR yields a more moderate acceleration at the second depicted time point than the learned optimal action of QRDQN, which chooses the highest acceleration. This subtle difference made the risk-sensitive approach successfully complete this episode, whereas DQN and QRDQN failed.

Overall, we see that the distributions mainly differ in the length of a tail of lower-probability negative returns and only marginally with respect to the higher-probability positive returns. At the beginning of the scenario at t = 0 s, risk assessment with CVaR did not change the optimality of the learned action. Braking remained optimal, since the other actions' long negative tails dominate their distributions even after pruning returns above the VaR. The second depicted time point, however, was critical for the final outcome of the episode. The learned policy of QRDQN chose the highest possible acceleration, primarily because the positive return probability in its distribution outweighs the large negative tail. In contrast, considering only the return values below the VaR for decision making yielded risk-averse action selection, with the optimal action being a lower acceleration value. This avoided the collision occurring with QRDQN and led to a successful completion of the scenario.

This example clarifies the benefits of the CVaR metric for behavior generation in uncertain environments: Due to the inherent uncertainty about a hazardous event in the environment, the distributions of riskier actions consist of a longer, low-probability tail at negative returns facing higher-probability peaks at more positive returns. The decision becomes ambiguous. The CVaR risk measure decides based on the VaR which returns to prune from the distribution, strengthening the contribution of less likely negative outcomes. Overall, this removes the ambiguity of a decision occurring with riskier actions and yields a safer driving policy.

A video comparing the performance of the evaluated algorithms and risk measures on selected episodes can be found at https://youtu.be/PSDFEG5d1xg.

VII. CONCLUSION AND FUTURE WORK
We proposed a two-step approach for risk-sensitive behavior generation, evaluating the risk of actions online based on return distributions learned offline with Deep Distributional Reinforcement Learning. We evaluated two distortion risk metrics and demonstrated that our approach increases safety in environments with inherent uncertainty about other participants' behaviors while avoiding overly conservative driving. Majumdar and Pavone [1] discussed the application of risk metrics, emerging from finance, to robotics. Our approach presents a step forward in applying risk-sensitive behavior generation to autonomous driving. Yet, a distributional consideration of risk in other methods, e.g. search-based methods, would broaden our understanding of its benefits and challenges.

To achieve a high level of safety under inherent and epistemic uncertainties, we plan to combine the approach with methods reducing epistemic uncertainty in the future, e.g. an additional search process [20].
REFERENCES

[1] Majumdar, A. and Pavone, M., "How Should a Robot Assess Risk? Towards an Axiomatic Theory of Risk in Robotics," CoRR, vol. abs/1710.11040, 2017.
[2] Mirchevska, B., Pek, C., et al., "High-Level Decision Making for Safe and Reasonable Autonomous Lane Changing Using Reinforcement Learning," IEEE, 2018.
[3] Bouton, M., Karlsson, J., et al., "Reinforcement learning with probabilistic guarantees for autonomous driving," in Workshop on Safety, Risk and Uncertainty in Reinforcement Learning, Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
[4] Hubmann, C., Schulz, J., et al., "A Belief State Planner for Interactive Merge Maneuvers in Congested Traffic," IEEE, 2018.
[5] Bouton, M., Cosgun, A., et al., "Belief state planning for autonomously navigating urban intersections," in Intelligent Vehicles Symposium (IV), IEEE, 2017.
[6] Kurzer, K., Engelhorn, F., et al., "Decentralized Cooperative Planning for Automated Vehicles with Continuous Monte Carlo Tree Search," CoRR, vol. abs/1809.03200, 2018.
[7] Lenz, D., Kessler, T., et al., "Tactical cooperative planning for autonomous highway driving using Monte-Carlo Tree Search," in Intelligent Vehicles Symposium (IV), IEEE, 2016.
[8] Burger, C. and Lauer, M., "Cooperative Multiple Vehicle Trajectory Planning using MIQP," IEEE, 2018.
[9] Isele, D., Rahimi, R., et al., "Navigating Occluded Intersections with Autonomous Vehicles Using Deep Reinforcement Learning," in International Conference on Robotics and Automation (ICRA), IEEE, 2018.
[10] ——, "Navigating Occluded Intersections with Autonomous Vehicles using Deep Reinforcement Learning," May 2017.
[11] Bellemare, M. G., Dabney, W., et al., "A distributional perspective on reinforcement learning," CoRR, vol. abs/1707.06887, 2017.
[12] Dabney, W., Rowland, M., et al., "Distributional Reinforcement Learning with Quantile Regression," CoRR, vol. abs/1710.10044, 2017.
[13] Dabney, W., Ostrovski, G., et al., "Implicit Quantile Networks for Distributional Reinforcement Learning," vol. 80, PMLR, Jul. 2018.
[14] Sadigh, D., Sastry, S., et al., "Planning for Autonomous Cars that Leverages Effects on Human Actions," in Proceedings of the Robotics: Science and Systems Conference (RSS), Jun. 2016.
[15] Kurzer, K., Zhou, C., et al., "Decentralized Cooperative Planning for Automated Vehicles with Hierarchical Monte Carlo Tree Search," in Intelligent Vehicles Symposium (IV), Jun. 2018.
[16] Bai, H., Cai, S., et al., "Intention-aware online POMDP planning for autonomous driving in a crowd," in IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2015.
[17] Menéndez-Romero, C., Sezer, M., et al., "Courtesy Behavior for Highly Automated Vehicles on Highway Interchanges," in Intelligent Vehicles Symposium (IV), IEEE, Jun. 2018.
[18] Wolf, P., Kurzer, K., et al., "Adaptive Behavior Generation for Autonomous Driving using Deep Reinforcement Learning with Compact Semantic States," IEEE, 2018.
[19] Dilokthanakul, N. and Shanahan, M., "Deep Reinforcement Learning with Risk-Seeking Exploration," in From Animals to Animats 15, Springer International Publishing, 2018.
[20] Bernhard, J., Gieselmann, R., et al., "Experience-Based Heuristic Search: Robust Motion Planning with Deep Q-Learning," IEEE, 2018.
[21] Paxton, C., Raman, V., et al., "Combining neural networks and tree search for task and motion planning in challenging environments," in International Conference on Intelligent Robots and Systems, IEEE, Sep. 2017.
[22] Mukadam, M., Cosgun, A., et al., "Tactical Decision Making for Lane Changing with Deep Reinforcement Learning," in Conference on Neural Information Processing (NIPS), 2017.
[23] Shalev-Shwartz, S., Shammah, S., et al., "Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving," CoRR, vol. abs/1610.03295, 2016.
[24] García, J. and Fernández, F., "A Comprehensive Survey on Safe Reinforcement Learning," Journal of Machine Learning Research, vol. 16, 2015.
[25] Majumdar, A., Singh, S., et al., "Risk-sensitive inverse reinforcement learning via coherent risk models," in Robotics: Science and Systems, 2017.
[26] Albrecht, S. V., Crandall, J. W., et al., "Belief and Truth in Hypothesised Behaviours," Artificial Intelligence, vol. 235, Jun. 2016.
[27] Mnih, V., Kavukcuoglu, K., et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, Feb. 2015.
[28] Van Hasselt, H., Guez, A., et al., "Deep Reinforcement Learning with Double Q-Learning," AAAI Press, 2016.
[29] Schaul, T., Quan, J., et al., "Prioritized experience replay," in International Conference on Learning Representations (ICLR), 2016.
[30] Ruszczyński, A., "Risk-averse dynamic programming for Markov decision processes," Mathematical Programming, vol. 125, no. 2, Oct. 2010.
[31] Wang, S. S., "A Class of Distortion Operators for Pricing Financial and Insurance Risks."