Risk-Constrained Interactive Safety under Behavior Uncertainty for Autonomous Driving
Julian Bernhard and Alois Knoll

Abstract: Balancing safety and efficiency when planning in dense traffic is challenging. Interactive behavior planners incorporate the prediction uncertainty and interactivity inherent to these traffic situations. Yet, their use of single-objective optimality impedes interpretability of the resulting safety goal. Safety envelopes which restrict the allowed planning region yield interpretable safety under the presence of behavior uncertainty, yet they sacrifice efficiency in dense traffic due to conservative driving. Studies show that humans balance safety and efficiency in dense traffic by accepting a probabilistic risk of violating the safety envelope. In this work, we adopt this safety objective for interactive planning. Specifically, we formalize this safety objective, present the Risk-Constrained Robust Stochastic Bayesian Game modeling interactive decisions satisfying a maximum risk of violating a safety envelope under uncertainty of other traffic participants' behavior, and solve it using our variant of Multi-Agent Monte Carlo Tree Search. We demonstrate in simulation that our approach outperforms baseline approaches and, by reaching the specified violation risk level over driven simulation time, provides an interpretable and tunable safety objective for interactive planning.
I. INTRODUCTION
Behavior planners for autonomous vehicles must be able to solve dense driving situations in close interaction with humans, thereby being confronted with a variety of behavioral variations and limited knowledge about the true human driving behavior.

Interactive behavior planners have become more and more advanced to plan in such situations. Popular approaches to predict other participants use cooperative [1-3] or probabilistic models [4, 5]. The authors presented a behavior space approach in [6] to deal with the variety of human behavior variations in interactive planning. Yet, existing approaches leave open how to specify a meaningful safety objective under the presence of behavior uncertainty. They use single-objective optimality criteria which are not interpretable with respect to a meaningful safety goal, e.g. a maximum collision probability P_col.

Restricting the ego motion to stay within a safety envelope, e.g. defined using reachability analysis [7-9] or other forms of safe distance measures [10, 11], circumvents the problem of prediction model inaccuracy. However, since interactions are neglected, this approach becomes problematic in crowded traffic, where it leads to conservative driving or the freezing vehicle symptom.

Julian Bernhard is with fortiss GmbH, An-Institut Technische Universität München, Munich, Germany. Alois Knoll is with the Chair of Robotics, Artificial Intelligence and Real-time Systems, Technische Universität München, Munich, Germany.
Fig. 1 (panels: safety envelope restriction, interactive planning, risk-constrained interactive safety): Restricting the ego motion to stay within a safety envelope (restricted region, only the stopping trajectory available) fosters the freezing vehicle symptom in dense traffic scenarios. Objective functions of interactive planners employable in dense traffic are not interpretable with respect to the collision probability P_col. Therefore, we transfer a human-related safety objective for dense traffic to interactive planning: our planner generates a policy which satisfies a specified maximum risk level β of violating a safety envelope (P_violate ≤ β).

It seems that humans follow a different safety goal than realized by existing behavior planners. Evaluation over real-world driving data showed that humans violate safe distance measures with a certain percentage averaged over driven time [12], with increasing violations occurring in dense traffic during rush hours [13]. It seems that the safety objective of human drivers balances safety and efficiency in a comprehensible way.

In this work, we therefore adopt this safety objective for interactive planning. Our approach generates a policy which satisfies a maximum risk of violating a safety envelope under uncertainty of other traffic participants' behavior. Specifically, we contribute
• a formalization of this safety objective,
• the Risk-Constrained Robust Stochastic Bayesian Game (RC-RSBG) modeling risk-constrained interactive decisions under behavior uncertainty,
• a Multi-Agent Monte Carlo Tree Search (MA-MCTS) planner to solve the RC-RSBG,
• a simulative analysis showing that our approach outperforms baseline approaches while reaching the specified violation risk level averaged over driven time.
Fig. 1 visualizes our contribution. We start with related work. Next, we formalize the problem and present our method, followed by the experimental evaluation.

II. RELATED WORK
We present related work on risk metrics and levels, interactive planning, and risk-constrained planning.
A. Risk Metrics and Levels
There exist probabilistic and non-probabilistic risk definitions. The latter define a metric and an accompanying risk threshold. Data-related metrics use human driving data to parameterize distance functions, e.g. to fit Gaussian metrics [11] or human safety corridors from feedback in simulation [14]. Kinematics-based metrics model physics-based risk, e.g. time-to-collision [15] or potential fields to model braking forces [16], and are closely related to safety envelope definitions [10, 11], whereby a kinematic risk level of zero corresponds to completely staying within the safety envelope. Non-probabilistic risk is straightforward to define and interpret. Yet, it does not reveal information about the probability of occurrence of a harmful event and neglects uncertainty information, e.g. beliefs about the behavior of other participants.

In the functional safety sense, risk is probabilistic and defined as the combination of the probability of occurrence of harm and the severity of that harm [17]. Existing probabilistic risk definitions often consider collisions as harmful events in risk-based planning approaches [9, 17-19], using the state probability, i.e. the probability of spatial overlap at discrete times [20]. Value-based probabilistic risk definitions arose in finance and are applied to robotics and autonomous driving [21-23]. Defining risk based on collisions is infeasible considering the vast number of samples required to approximate the very small human collision probabilities P_col. In contrast to previous work, we propose a notion of risk coupled to how human drivers might balance safety and efficiency in interactive situations, based on the risk of violating a safety envelope. By formalizing risk as an event probability, with the harmful event occurring over a period of time [20], correspondence between the specified risk level and the observed risk is achieved.

B. Interactive Planning
In [24], Bernhard et al. define the behavior of an autonomous vehicle as its desired future sequence of physical states encoding the agent's strategy to reach a short-term goal, e.g. changing lane. A behavior planner creates a behavior trajectory and passes it to a controller. In this work, we focus on modeling the risk induced by unknown behaviors of other traffic participants and drop localization and execution uncertainties. Interactive planning algorithms incorporate potential reactions of other participants into their plan to successfully navigate in congested traffic [25]. We briefly summarize existing approaches to predict other participants' behavior within an interactive planning process.
Cooperative approaches assume that participants act according to a global cost function in a Multi-Agent Markov Decision Process (MA-MDP) [1-3] and allow for explainable parameterization of the model. However, the assumption that traffic follows a globally optimal solution neglects the uncertainty inherent to human interactions.
Markov approaches represent participants' behavior via environment transitions based on the current observable state in a Markov Decision Process (MDP) [22, 26]. Though MDPs model uncertainty in state transitions, they neglect information from past states, facilitating inefficient and unsafe decisions. In contrast,
Belief-based approaches use observations from past states to gather information about the true behavior of other participants, modeled as a Partially Observable Markov Decision Process (POMDP) [4, 5, 27, 28] or a Bayesian game [6], to improve efficiency in dense traffic scenarios.

Existing interactive planners employ a single-objective optimality criterion with manual or data-based cost tuning to avoid collisions [1-5, 22, 27]. Compared to previous work, we employ a multi-objective optimality criterion to model risk over behavior uncertainty. For this, we combine our decision model, the Robust Stochastic Bayesian Game (RSBG) presented in [6], with the Cost-Constrained Partially Observable Markov Decision Process (CC-POMDP) [29, 30].
C. Risk-Constrained Planning
Planning under non-probabilistic risk definitions equals finding an ego trajectory within the space spanned by risk metric and level, e.g. by using graph search on a discretized state space [11] or Model Predictive Control (MPC) [31]. Probabilistic collision risk is used in [23] to model lane changing as a conditional-value-at-risk MDP, in [18] to model on-ramp merging as a CC-POMDP, and in [17] to incorporate various uncertainties into an MPC algorithm. Related to our risk definition, Müller et al. [32] constrain the risk of violating the safe distance. The previously described work in risk-constrained planning employs long-term, i.e. maneuver-based, prediction of other participants, neglecting interactions. Interactive risk-constrained planning using reinforcement learning has been investigated in [22] using conditional value at risk, and in [26] using a discretized state space.

The presented approaches allow for the specification of a risk constraint, yet reveal missing correspondence between specified and observed risk in the experimental evaluation. In this work, we demonstrate that with our variant of MA-MCTS the observed risk in simulation corresponds to the specified constraint, yielding an interpretable and tunable safety objective for interactive planning.

III. PROBLEM FORMULATION
Evaluation over real-world driving data showed that humans violate safe distance measures with a certain percentage [12, 13]. It seems that humans
1) stay within their safety envelope, e.g. spanned by the current safe distance, in most cases to avoid unsafe behavior due to modeling uncertainty, yet
2) accept the risk* of violating the safety envelope. They behave such that a safety envelope is violated with not more than probability β under consideration of prediction uncertainty.

*Modeling severity of envelope violations is considered future work.

By adjusting β, humans tune safety versus efficiency to avoid conservative driving in congested, rush hour traffic [13]. Based on these considerations, we informally define a human-related safety objective for interactive behavior planning as: "The ego vehicle behaves such that the percentage of time the safety envelope is violated is smaller than a given threshold". Next, we formalize this safety goal.

We model the traffic environment in a game-theoretic manner [33]. It consists of N_j other agents, i.e. traffic participants, each observing a joint environment state o^t = (o_1^t, o_2^t, ..., o_N^t) ∈ O with physical dynamic and static properties, e.g. participant positions and velocities, map information, etc. Agents choose their next actions based on their policy a_j^t ∼ π_j(a_j^t | o^t, ξ), defined over a continuous action space A_j, e.g. a 2-dimensional space bounding maximum longitudinal acceleration and steering angle, where ξ denotes hidden policy inputs. We control a single agent i, the autonomous vehicle, which selects actions from a discrete set of actions A_i according to a_i ∼ π_i. It reasons about the behavior of the other agents j. Agent i knows the action space and can observe past actions of the other agents. The true policy π_j and the hidden inputs ξ are unknown to i. Deterministic transitions to the next environment state, o^{t+1} = T(o^t, a^t), result from the agents' joint action a^t = (a_i^t, a_{-i}^t) and their kinematic models, e.g. single track, applied for the action duration τ_a to o^t. An index -i indicates the joint action of all agents except agent i. The process continues until some terminal criterion applies, e.g. collision or goal reached.
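To make this environment model concrete, the following minimal sketch implements a deterministic joint transition of the kind described above. The state layout, the constant-acceleration kinematics, and all names (AgentState, kinematic_step, transition) are illustrative stand-ins, not the paper's implementation, which uses richer kinematic models such as the single track.

from dataclasses import dataclass
from typing import Dict

@dataclass
class AgentState:
    s: float  # longitudinal position [m]
    v: float  # velocity [m/s]

# Joint environment state o^t: one physical state per agent index.
EnvState = Dict[int, AgentState]
Action = float  # here: a longitudinal acceleration [m/s^2]

def kinematic_step(x: AgentState, a: Action, tau_a: float) -> AgentState:
    """Apply a constant acceleration over the action duration tau_a."""
    return AgentState(s=x.s + x.v * tau_a + 0.5 * a * tau_a ** 2,
                      v=max(0.0, x.v + a * tau_a))

def transition(o: EnvState, a_joint: Dict[int, Action],
               tau_a: float) -> EnvState:
    """Deterministic joint transition o^{t+1} = T(o^t, a^t)."""
    return {i: kinematic_step(x, a_joint[i], tau_a) for i, x in o.items()}

# Toy usage: the ego (index 0) accelerates while the other agent brakes.
o_next = transition({0: AgentState(0.0, 10.0), 1: AgentState(30.0, 8.0)},
                    {0: 1.0, 1: -0.5}, tau_a=0.2)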
Using this environment model, we define a violation risk.

Definition 3.1 (Violation risk): Given the current environment state o^t ∈ O, behavior policies π_i and π_j, and a safety violation indicator f: O → {0, 1} indicating a safety violation in an observation o' ∈ O, the violation risk is defined as

\rho(o^t, \pi_i, \pi_j, f) = \mathbb{E}_{F_o \sim P_{o^t, \pi_i, \pi_j}} \left[ \frac{\sum_{z=1}^{|F_o|-1} f(F_o(z)) \cdot \tau_a}{|F_o| \cdot \tau_a} \right] \quad (1)

The expectation is defined over the distribution P_{o^t, π_i, π_j} over future observation sequences F_o = (o^t, o^{t+τ_a}, o^{t+2τ_a}, ...) starting from the current environment state o^t. It is influenced by the ego and the other agents' policies. The numerator represents the violation duration within the observation sequence F_o, with |F_o| being the length of the sequence and F_o(z) giving the z-th observation within the sequence. The denominator is the total duration of the sequence. The fraction of these terms yields the percentage of time the safety envelope is violated for one sequence. A sequence ends in a terminal state. The temporal resolution of this fraction is determined by the action duration τ_a. For a fixed ego policy π_i, the expectation provides the time-based violation risk under unknown behavior of other participants π_j.
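Eq. (1) lends itself to a Monte Carlo estimate by sampling observation sequences and averaging the per-sequence fraction of violating time; since every step lasts τ_a, the duration ratio reduces to a count ratio. The sketch below is illustrative; sample_sequence is a hypothetical callable standing in for a simulator of P_{o^t, π_i, π_j}.

import random
from typing import Callable, Sequence

def violation_risk(sample_sequence: Callable[[], Sequence[object]],
                   f: Callable[[object], int],
                   num_samples: int = 1000) -> float:
    """Monte Carlo estimate of eq. (1): the expected fraction of time the
    indicator f fires along sampled future observation sequences F_o.
    tau_a cancels because all steps share the same duration."""
    risk = 0.0
    for _ in range(num_samples):
        seq = sample_sequence()              # F_o ~ P_{o^t, pi_i, pi_j}
        violating = sum(f(o) for o in seq[:len(seq) - 1])
        risk += violating / len(seq)         # per-sequence time fraction
    return risk / num_samples

# Toy usage: boolean per-step violation flags, roughly 30% violating steps.
demo = lambda: [random.random() < 0.3 for _ in range(10)]
print(violation_risk(demo, f=int))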
Using eq. (1), we formalize risk-constrained interactive safety against behavior uncertainty.

Definition 3.2 (Risk-constrained safety): Given an indicator function for safety envelope violations f_envelope, the behavior planner generates a goal-directed policy π_i in the current environment state o^t under unknown behavior π_j of other participants which achieves a safety envelope violation risk lower than a specified allowed risk level β:

\rho(o^t, \pi_i, \pi_j, f_{\mathrm{envelope}}) \,\hat{=}\, \rho_{\mathrm{env}}(o^t, \pi_i, \pi_j) \overset{!}{\leq} \beta \quad (2)

Our risk formulation is independent of the goal formalism, e.g. based on rewards, and of the actual safety envelope definition, e.g. being based on safe distance measures or reachability analysis. Since the collision risk is difficult to interpret and infeasible to calculate exactly, we require it to be close to zero, ρ(o^t, π_i, π_j, f_collision) ≙ ρ_col(o^t, π_i, π_j) ≈ 0, with f_collision denoting a collision indicator. Fig. 2 provides an example of envelope and collision risk calculations.

Fig. 2: Example of envelope violation and collision risk calculation: four future observation sequences F_o^1, ..., F_o^4 starting from state o^t are sampled with different probabilities, with F_o^4 ending early in a terminal collision state. The envelope risk ρ_env(·) is the probability-weighted average over the sequences of the fraction of envelope-violating time per sequence; since only F_o^4 contains a collision state, the collision risk ρ_col(·) stems from that sequence alone.
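For a worked instance of eq. (1) in the style of Fig. 2, assume for illustration (these numbers are not the figure's originals) four sequences with probabilities P(F_o^1) = P(F_o^2) = P(F_o^3) = 0.3 and P(F_o^4) = 0.1, lengths |F_o^{1,2,3}| = 5 and |F_o^4| = 3 (terminal collision), and 2, 1, 1, and 0 envelope-violating steps, respectively:

\rho_{\mathrm{env}} = 0.3 \cdot \tfrac{2\tau_a}{5\tau_a} + 0.3 \cdot \tfrac{1\tau_a}{5\tau_a} + 0.3 \cdot \tfrac{1\tau_a}{5\tau_a} + 0.1 \cdot \tfrac{0\tau_a}{3\tau_a} = 0.24, \qquad \rho_{\mathrm{col}} = 0.1 \cdot \tfrac{1\tau_a}{3\tau_a} \approx 0.033.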
To satisfy our problem definition, the planner must correctly approximate the observation sequence distribution under unknown behavior of other agents and optimize a multi-objective criterion integrating the safety envelope and collision risk constraints and the goal-directed optimality criterion. In the next section, we propose a game-theoretic model, the RC-RSBG, and a variant of MA-MCTS to solve the defined problem.

IV. RISK-CONSTRAINED INTERACTIVE PLANNING
We present a game-theoretic model, the Risk-Constrained Robust Stochastic Bayesian Game (RC-RSBG), which combines
• the RSBG presented by Bernhard et al. [6] to approximate the observation sequence distribution in eq. (1) given unknown behavior of other participants π_j,
• and the Cost-Constrained Partially Observable Markov Decision Process (CC-POMDP) to incorporate risk constraints;
we further adapt an MA-MCTS planner to solve the RC-RSBG. We detail each of these aspects in the following.

A. Robust Stochastic Bayesian Game (RSBG)

In [6], the authors presented the Robust Stochastic Bayesian Game (RSBG), a game-theoretic model for interactive planning to cover the physically feasible continuous variations inherent to human behavior, considering both inter-driver and intra-driver variabilities [34]. The model uses a predefined set of hypothetical behavior types θ^k ∈ Θ and behavior hypotheses a_j^t ∼ π_{θ^k}(a_j^t | H_o^t). To predict the others' behavior, one tracks the posterior beliefs Pr(θ^k | H_o^t, j) over hypothesized types for each agent based on the action-observation history H_o^t ∈ H_o.

To obtain behavior types and hypotheses, it defines a hypothetical policy

\pi^*: H_o \times B_j^t \rightarrow A_j \quad (3)

for a specific task, with b_j^t ∈ B_j^t being the j-th agent's behavior state at time t and B_j^t ⊂ R^{N_B} its behavior space of dimension N_B. A behavior state b_j^t is a physically interpretable quantity describing the j-th agent's continuous behavior variations. The other agents' behavior spaces B_j^t and their current behavior states b_j^t are unknown. By using the property of physical interpretability of b_j^t, an expert can define a hypothesized behavior space B, comprising the individual behavior spaces B_j (B_j ⊂ B), by looking at the physically realistic situations. For instance, it is straightforward to define the physical boundaries of a behavior state modeling the desired gap between vehicles j and i at the time point of merging onto another lane with the one-dimensional behavior space B = {b | b ∈ [0, d_max]}, where d_max is the maximum sensor range. The approach then uses a partitioning of the full behavior space B = B_1 ∪ B_2 ∪ ... ∪ B_K, with B_l ∩ B_k = ∅ for all l ≠ k. This yields K hypotheses π_{θ^k} by adopting a uniform distribution over behavior states within each behavior space partition B_k.

The RSBG computes an optimal ego policy integrating the current posterior type belief Pr(θ^k | H_o^t, j) according to π_i = argmax_{π_i} Q^{π_i}(H_o^t), where

Q^{\pi_i}(H_o^t) = \mathbb{E}_{\pi_i,\, (\theta_j^k, \ldots)_{\forall j} \sim \Pr(\cdot \mid H_o^t, j)} \Big[ \sum_{t'=t}^{\infty} \gamma^{t'} r(o^{t'}, a^{t'}) \Big], \quad a^{t'} = (a_i^{t'}, a_{-i}^{t'}),\; a_i^{t'} \sim \pi_i(\cdot \mid o^{t'}),\; a_{-i}^{t'} = (a_j^{t'}, \ldots)_{\forall j} \sim \pi_{\theta_j^k} \quad (4)

is the expected cumulative reward of agent i in state o^t with history H_o^t. Future rewards r(o^t, a^t) are discounted by γ. We denote independent sampling from P and sample concatenation for all other agents j with (·, ...)_{∀j} ∼ P.
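A minimal sketch of the two RSBG ingredients above: the uniform partitioning of a one-dimensional behavior space into K hypotheses, and one Bayes step of the posterior belief Pr(θ^k | H_o^t, j). The likelihood values below are placeholders; the actual hypothesis policies and their likelihood evaluation are defined in [6].

import numpy as np

def partition_behavior_space(b_min: float, b_max: float, K: int) -> list:
    """Split B = [b_min, b_max] into K disjoint partitions B_k; each
    hypothesis is uniform over behavior states within its partition."""
    edges = np.linspace(b_min, b_max, K + 1)
    return [(edges[k], edges[k + 1]) for k in range(K)]

def update_belief(prior: np.ndarray, likelihoods: np.ndarray) -> np.ndarray:
    """Pr(theta^k | H_o^t, j) is proportional to p(a_j^t | theta^k) * prior,
    where likelihoods[k] is the probability of agent j's observed action
    under hypothesis policy pi_{theta^k}."""
    posterior = prior * likelihoods
    return posterior / posterior.sum()

# Toy usage: K = 4 hypotheses over a desired-gap behavior space [0, d_max].
partitions = partition_behavior_space(0.0, 50.0, K=4)
belief = np.full(4, 0.25)
belief = update_belief(belief, np.array([0.1, 0.6, 0.2, 0.1]))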
B. Risk-Constrained Robust Stochastic Bayesian Game

Next, we integrate the risk constraints for safety envelope violation and collision into the RSBG. For this, we approximate the unknown behavior of other agents π_j in the constraint of eq. (1) using a mixture distribution π̂_j combining hypotheses and posterior beliefs for each other agent j:

\hat{\pi}_j(a_j \mid H_o^t) = \sum_{\forall k} \Pr(\theta^k \mid H_o^t, j) \cdot \pi_{\theta^k}(a_j \mid H_o^t) \quad (5)
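Sampling from the mixture in eq. (5) amounts to drawing a hypothesis index from the posterior belief and then an action from that hypothesis policy. A sketch under the assumption that hypothesis policies are available as plain callables:

import random
from typing import Callable, List, Sequence

def sample_mixture_action(beliefs: Sequence[float],
                          policies: List[Callable[[object], float]],
                          history: object) -> float:
    """Draw a_j from pi_hat_j(. | H_o^t) per eq. (5): first k ~
    Pr(theta^k | H_o^t, j), then a_j ~ pi_{theta^k}(. | H_o^t)."""
    k = random.choices(range(len(beliefs)), weights=beliefs)[0]
    return policies[k](history)

# Toy usage: two hypotheses, each returning a constant acceleration.
a_j = sample_mixture_action([0.7, 0.3], [lambda h: -1.0, lambda h: 0.5], None)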
Optimality of RC-RSBGs is then defined by extending eq. (4) with the risk constraints.

Definition 4.1 (Optimality of RC-RSBGs): The optimal ego policy π_i of the RC-RSBG maximizes the expected cumulative reward Q^{π_i}(H_o^t) defined in eq. (4) subject to

\rho_{\mathrm{env}}(o^t, \pi_i, \hat{\pi}_j) \overset{!}{\leq} \beta, \qquad \rho_{\mathrm{col}}(o^t, \pi_i, \hat{\pi}_j) \overset{!}{\approx} 0 \quad (6)

Next, we present a Multi-Agent Monte Carlo Tree Search (MA-MCTS) approach which solves the RC-RSBG.

C. Monte Carlo Tree Search for the RC-RSBG
Our MA-MCTS approach is shown in Alg. 1. To solve the RSBG, as presented in [6], we use:
• Stage-wise action selection: Similar to [2, 3], agents select actions independently in stages. Their joint action yields the next environment state. In contrast to previous work [2, 3], we define separate selection mechanisms for the ego and the other agents in EGOACTIONSELECTION and OTHERACTIONSELECTION, each returning actions a_i or a_j at stage nodes ⟨H_o⟩. Ego actions are discrete. Other agents' actions are continuous, and their selection strategy represents the agents' behavior types.
• Hypothesis-belief-sampling: At the beginning of each search iteration, we sample a hypothesis for each other agent j from the posterior belief, θ'_j ∼ Pr(θ^k | H_o^t, j), and use it within the function OTHERACTIONSELECTION for the selection, expansion, and roll-out steps.

To extend the MA-MCTS approach to solve the RC-RSBG, we develop an ego action selection mechanism EGOACTIONSELECTION based on solutions to CC-POMDPs. The POMDP is a well-known single-agent framework to model sequential decisions under partially observable environment states. An optimal policy of a Cost-Constrained Partially Observable Markov Decision Process (CC-POMDP) [29, 30] not only maximizes the expected cumulative reward but also constrains M expected cumulative costs, Q_{C_m} ≤ ĉ_m, m ∈ M. Lee et al. [30] use Monte Carlo Tree Search (MCTS) to solve CC-POMDPs by reformulating the problem as an unconstrained POMDP. They introduce Lagrange multipliers and express the optimal value function as Q*_λ = Q_R − Σ_{∀m} λ*_m · Q_{C_m} for optimal λ*_m. A solution is found by updating the Lagrange multipliers iteratively at the beginning of each MCTS iteration using a gradient estimate Δλ_m ∼ Q_{C_m} − ĉ_m, ∀m.

We adapt their approach to solve the RC-RSBG. Yet, instead of cumulative costs, we maintain ego envelope violation and collision action-risks ρ_env(⟨H_o⟩, a_i) and ρ_col(⟨H_o⟩, a_i) in each stage node. For this, we separately backpropagate the violation durations for safety envelope T_env and collision T_col, and the total time T_tot accrued within this iteration's selection, expansion, and rollout steps in the SIMULATE() function.

Algorithm 1: Multi-Agent Monte Carlo Tree Search for the Risk-Constrained Robust Stochastic Bayesian Game

function SEARCH(o^t, Pr(θ^k | H_o^t, ·))
    λ ← RANDOMINIT()
    repeat
        for j = 1 ... N_j do
            θ'_j ∼ Pr(θ^k | H_o^t, j)
        SIMULATE(⟨o^t⟩, θ'_j ∀j, 1)
        a_i ∼ EGOACTIONSELECTION(⟨o^t⟩, 0)
        λ_env ← λ_env + α_n [ρ_env(⟨o^t⟩, a_i) − β]
        λ_col ← λ_col + α_n [ρ_col(⟨o^t⟩, a_i) − 0]
        Clip λ_env, λ_col to range [0, λ_max]
    until MAXITERATIONS()
    return EGOACTIONSELECTION(⟨o^t⟩, 0)

function OTHERACTIONSELECTION(⟨H_o⟩, j, θ_j)
    if PROGRESSIVEWIDENING() then
        return a_j ← π_{θ_j}(a_j | H_o)
    else
        return argmax_{a_j} Q_C(⟨H_o⟩, a_j, j)

function EGOACTIONSELECTION(⟨H_o⟩, κ)
    Q⊕_λ(⟨H_o⟩, a) ← Q_R(⟨H_o⟩, a) − λ_env · ρ_env(⟨H_o⟩, a) − λ_col · ρ_col(⟨H_o⟩, a) + κ √(log N(⟨H_o⟩) / N(⟨H_o⟩, a, i))
    a* ← argmax_a Q⊕_λ(⟨H_o⟩, a)
    A* ← add other actions to a* to consider exploration differences
    π_i ← solve linear program with A* to obtain stochastic policy
    return a_i ∼ π_i

function SIMULATE(⟨H_o⟩, θ'_j ∀j, d)
    if d > d_max or ISTERMINAL(⟨H_o⟩) then
        return [0, 0, 0, 0, 0]
    if FIRSTNODEVISIT(⟨H_o⟩) then
        return RANDOMROLLOUT(⟨H_o⟩, θ'_j ∀j, d)
    a_i ← EGOACTIONSELECTION(⟨H_o⟩, κ)
    for l = 1 ... N_j do
        a_{j_l} ← OTHERACTIONSELECTION(⟨H_o⟩, l, θ'_l)
    τ_predict ← d · τ_a
    (o', r) ← ENVIRONMENTMOVE(H_o, (a_i, a_j), τ_predict)
    [R', T'_env, T'_col, T'_tot, C'] ← SIMULATE(⟨H_o, (a_i, a_j), o'⟩, θ'_j ∀j, d + 1)
    [R, T_env, T_col, T_tot] ← [r + γ·R', T'_env + f_envelope(o')·τ_predict, T'_col + f_collision(o')·τ_predict, T'_tot + τ_predict]
    N(⟨H_o⟩) ← N(⟨H_o⟩) + 1
    N(⟨H_o⟩, a_i, i) ← N(⟨H_o⟩, a_i, i) + 1
    Q_R(⟨H_o⟩, a_i) ← Q_R(⟨H_o⟩, a_i) + (R − Q_R(⟨H_o⟩, a_i)) / N(⟨H_o⟩, a_i, i)
    ρ_env(⟨H_o⟩, a_i) ← ρ_env(⟨H_o⟩, a_i) + (T_env/T_tot − ρ_env(⟨H_o⟩, a_i)) / N(⟨H_o⟩, a_i, i)
    ρ_col(⟨H_o⟩, a_i) ← ρ_col(⟨H_o⟩, a_i) + (T_col/T_tot − ρ_col(⟨H_o⟩, a_i)) / N(⟨H_o⟩, a_i, i)
    C ← f_envelope(o') + f_collision(o') + γ·C'
    for l = 1 ... N_j do
        N(⟨H_o⟩, a_{j_l}, l) ← N(⟨H_o⟩, a_{j_l}, l) + 1
        Q_C(⟨H_o⟩, a_{j_l}, l) ← Q_C(⟨H_o⟩, a_{j_l}, l) + (C − Q_C(⟨H_o⟩, a_{j_l}, l)) / N(⟨H_o⟩, a_{j_l}, l)
    return [R, T_env, T_col, T_tot, C]
These terms correspond to the numerator and denominator of the ratio defined in eq. (1) and can be used to update ρ_env(⟨H_o⟩) and ρ_col(⟨H_o⟩) in each iteration. Return estimates are updated as usual. The prediction time τ_predict increases with search depth d.

In the SEARCH method, we perform a gradient update of the Lagrange multipliers λ_env and λ_col using the root node's risk estimates ρ_env(⟨o^t⟩, a_i) and ρ_col(⟨o^t⟩, a_i) and the desired risk constraints β and 0, with decreasing step size α_n ∼ 1/NUMITERATIONS(). In EGOACTIONSELECTION, we calculate the combined action-value based on the estimated Lagrange multipliers. As proposed in [30], we account for inaccuracies in return and risk estimates and form a set of equal-valued actions A* maximizing the action-value by accepting a tolerance based on action selection counts. An optimal policy of a CC-POMDP is generally stochastic. In our case, this requires

\sum_{a_i \in A^*} \pi_i(a_i \mid o^t) \cdot \rho_{\mathrm{env}}(\langle o^t \rangle, a_i) \overset{!}{\leq} \beta, \qquad \sum_{a_i \in A^*} \pi_i(a_i \mid o^t) \cdot \rho_{\mathrm{col}}(\langle o^t \rangle, a_i) \overset{!}{=} 0 \quad (7)

We solve the linear program defined in [30] to obtain a stochastic policy π_i satisfying eqs. (4), (6), and (7).

We apply worst-case action selection in combination with progressive widening for other agents in OTHERACTIONSELECTION. Other agents maintain separate combined ego cost estimates Q_C(⟨H_o⟩, a_j, j) during back-propagation for their own selected actions. Progressive widening ensures that new actions are explored. Otherwise, other agents select actions maximizing the combined envelope violation and collision cost of the ego agent.
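The multiplier update in SEARCH is, in effect, projected dual ascent with the root-node action-risk estimates as gradient surrogates. A minimal sketch, assuming the risk estimates come from the search tree and using an illustrative clipping bound lam_max:

def update_multipliers(lam_env: float, lam_col: float,
                       rho_env_root: float, rho_col_root: float,
                       beta: float, n: int, lam_max: float = 100.0) -> tuple:
    """One dual-ascent step with decreasing step size alpha_n ~ 1/n: a
    multiplier grows while the root action-risk exceeds its target (beta
    for envelope violations, 0 for collisions), else it shrinks."""
    alpha_n = 1.0 / n
    lam_env += alpha_n * (rho_env_root - beta)
    lam_col += alpha_n * (rho_col_root - 0.0)
    clip = lambda x: min(max(x, 0.0), lam_max)  # projection onto [0, lam_max]
    return clip(lam_env), clip(lam_col)

# Toy usage inside the search loop, with fixed risk estimates for brevity.
lam_env, lam_col = 0.5, 0.5
for n in range(1, 5):
    lam_env, lam_col = update_multipliers(lam_env, lam_col, 0.2, 0.01,
                                          beta=0.1, n=n)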
As shown in [6], this worst-case selection concept reduces sample complexity when other agents select actions from a continuous action space. In a constrained setting, it additionally improves convergence by better exploring joint actions which violate the given risk constraints.

V. EXPERIMENT
In our experiment, we evaluate quantitatively over many driven scenarios with simulated behavior uncertainty
• if the presented planning approach respects the specified risk level β,
• how the risk parameter β balances safety and efficiency,
• and how our approach compares against baseline interactive planners.
We also qualitatively analyze the computed policy and posterior beliefs in specific scenarios.

A. Scenarios
We use the open-source behavior benchmarking environment BARK [24] for simulating two dense traffic scenarios:
• Freeway enter: In a double merge, the ego vehicle wants to enter the freeway on the occupied left lane.
• Left turn: The ego vehicle wants to turn left from a side road into a main road, having to cross two occupied lanes.
We uniformly sample initial starting conditions, e.g. the distances between vehicles from [15 m, 30 m] and the velocities from a fixed interval. Fig. 3 provides examples of the scenarios. A scenario terminates successfully when the ego vehicle is close to and oriented towards the goal while its velocity is within the sampling bounds.

Fig. 3: Scenario analysis at two time steps per scenario (freeway enter and left turn) with posterior beliefs Pr(θ^k | H_o^t, j) for the four nearest agents in the respective color. The planned stochastic policy π_i balances the action-risk estimates ρ_env(⟨o^t⟩, a_i) and ρ_col(⟨o^t⟩, a_i), yielding an expected envelope risk ρ_exp.env fulfilling the risk constraint β while the expected planned collision risk ρ_exp.col is close to zero and higher returns Q_R(⟨H_o⟩, a_i) are preferred.

B. Behavior Simulation & Space
The Intelligent Driver Model (IDM) [34] defines both the simulated unknown behavior π_j of other participants and the hypothetical policy π* used for hypotheses design. For simulation, we define a 5-dimensional true behavior space B*_5D over the IDM parameters and uniformly draw unknown boundaries of behavior variations [b^l_{j,min}, b^l_{j,max}], l ∈ {1, ..., 5}, for each agent and trial (B_j ⊆ B*_5D). We introduce the parameters minimum and maximum boundary widths Δ_min/Δ_max to specify minimum and maximum time-dependent variations of behavior states in simulation. This avoids unrealistically large variations of behavior parameters. For hypotheses design, we employ a 1D behavior space since, as shown in [6], lower-dimensional behavior spaces can capture the uncertainty occurring with the 5-dimensional true behavior space. Tab. I depicts both the simulated 5D and the hypothesized 1D behavior space used in our experiment. We simulate the other agents by sampling a new behavior state b_j^t at every time step from the unknown boundaries of behavior variations B_j and then use the IDM with these parameters to choose their actions, as sketched below. Fixing the random seeds for all sampling operations ensures equal conditions for all evaluations.
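A sketch of this behavior simulation, with the parameter ranges taken from Tab. I; the IDM step itself is omitted and the function names are illustrative.

import random

# True 5D behavior space B*_5D: (lower, upper, width_min, width_max), cf. Tab. I.
B_5D = {
    "v_desired": (8.0, 14.0, 0.1, 0.1),
    "T_desired": (0.5, 2.0, 0.1, 0.3),
    "s_min":     (2.0, 2.5, 0.1, 0.3),
    "dv_factor": (1.5, 2.0, 0.1, 0.3),
    "dv_comft":  (1.7, 2.0, 0.1, 0.3),
}

def draw_agent_boundaries(space: dict) -> dict:
    """Per agent and trial: draw unknown boundaries [b_min, b_max] inside
    the true space, with widths limited to [width_min, width_max]."""
    bounds = {}
    for name, (lo, hi, w_min, w_max) in space.items():
        width = random.uniform(w_min, w_max)
        start = random.uniform(lo, hi - width)
        bounds[name] = (start, start + width)
    return bounds

def sample_behavior_state(bounds: dict) -> dict:
    """Per time step: draw a fresh behavior state b_j^t for the IDM."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in bounds.items()}

agent_bounds = draw_agent_boundaries(B_5D)
b_t = sample_behavior_state(agent_bounds)  # IDM parameters for this step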
C. Safety Violation Indicators

Rizaldi et al. [35] propose a model for longitudinal safe distances between vehicles, which we use to define the indicator f_envelope to return a violation if the ego vehicle violates the safe distance to other vehicles. The response time of other vehicles is fixed. For the ego vehicle, we use one response time in freeway enter and, due to the higher traffic density, a different one in left turn. The deceleration limit equals that of the IDM. The indicator f_collision indicates a collision when another vehicle overlaps a static safety boundary around the ego vehicle. We introduce a static safety boundary instead of using the exact geometric collision check to account for potential inaccuracy due to sampling with MCTS approaches.

TABLE I: Boundaries of the simulated true behavior space B*_5D and the hypothesized behavior space B_1D,Head. for the IDM parameters desired velocity v_desired, desired time headway T_desired, minimum spacing s_min, acceleration factor v̇_factor, and comfortable braking v̇_comft.

Param b^l          | B*_5D [b^l_min, b^l_max] | Δ_min / Δ_max | B_1D,Head.
v_desired [m/s]    | [8.0, 14.0]              | 0.1 / 0.1     | 11.0
T_desired [s]      | [0.5, 2.0]               | 0.1 / 0.3     | [0.0, 4.0]
s_min [m]          | [2.0, 2.5]               | 0.1 / 0.3     | 2.25
v̇_factor [m/s²]    | [1.5, 2.0]               | 0.1 / 0.3     | 1.75
v̇_comft [m/s²]     | [1.7, 2.0]               | 0.1 / 0.3     | 1.85
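A minimal sketch of a longitudinal safe-distance check in the spirit of [35]: the envelope is violated when the actual gap is smaller than the distance the ego needs to react and brake to a stop behind the braking leader. This is the standard worst-case stopping-distance argument, not necessarily the exact formalization of [35], and the default parameter values are illustrative only.

def safe_longitudinal_distance(v_ego: float, v_lead: float, delta: float,
                               a_ego: float, a_lead: float) -> float:
    """Gap required so the ego (reaction time delta, braking deceleration
    a_ego > 0) stops behind a leader braking with deceleration a_lead > 0."""
    d = v_ego * delta + v_ego ** 2 / (2 * a_ego) - v_lead ** 2 / (2 * a_lead)
    return max(0.0, d)

def f_envelope(gap: float, v_ego: float, v_lead: float, delta: float = 0.2,
               a_ego: float = 8.0, a_lead: float = 8.0) -> int:
    """Safety violation indicator f: O -> {0, 1}, longitudinal case only."""
    return int(gap < safe_longitudinal_distance(v_ego, v_lead, delta,
                                                a_ego, a_lead))

# Toy usage: a 10 m gap at 15 m/s behind a leader driving 10 m/s.
print(f_envelope(gap=10.0, v_ego=15.0, v_lead=10.0))  # -> 1 (violation)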
D. Planning Algorithms

We compare the RC-RSBG and the variant RC-RSBGFullInfo, which employs the true simulated behavior policies π_j for prediction, to several baseline algorithms. We focus on interactive planners suitable for dense traffic. RC-RSBG and RC-RSBGFullInfo use a simplistic reward function r(·) = 1.0 · GOALREACHED(·). To model risk-awareness with the single-objective baselines, we define the reward function

r(·) = c_goal · GOALREACHED(·) − c_goal · f_envelope(·) · τ_predict / (β · T_Plan) − c_col · f_collision(·)

with constant weights c_goal and c_col. It is designed such that it fully erases the goal reward when the predicted envelope violation duration Σ_{∀t'} f_envelope(·) · τ_predict exceeds the allowed duration β · T_Plan, with T_Plan being the maximum planning horizon; a sketch of this shaping follows at the end of this subsection. Using this reward function, we define the following baselines:
• RSBG uses an MA-MCTS with hypothesis-based prediction and reward-based worst-case action selection for other agents.
• MDP does not incorporate belief information over hypotheses. Instead, it predicts other participants by sampling behavior states from the full hypothesized behavior space B_1D,Head..
• Cooperative does not use the behavior space model for prediction. In its MA-MCTS, other agents select actions based on a global cost function which weights subjective and others' rewards with a cooperation factor c after evaluating r(·) for each agent.

All baselines use an Upper Confidence Bound (UCB) action selection strategy for the ego agent, the cooperative approach additionally for the other agents. All planners use an equal ego action space consisting of the macro actions lane changing, lane keeping at constant accelerations v̇_i (six discrete values for freeway enter, five for left turn), and gap keeping based on the IDM for freeway enter. The cooperative approach employs this action space additionally for the other agents. All planners use a fixed action duration τ_a and a maximum search depth d_max = 10, yielding T_Plan = 11 s. We obtain K = 16 hypotheses π_{θ^k}, k ∈ {1, ..., K}, by equally partitioning the hypothesized behavior space B_1D,Head.. In this work, the focus is on evaluating the proposed risk formulation and decision model, the RC-RSBG, and not on working towards real-time capability with MA-MCTS. All planners use an equal number of 20000 iterations to minimize the influence of sampling inaccuracies.

Fig. 4: Performance metrics P_suc [%], T_suc [s], and P_col [%] for the RC-RSBG planner and baselines over increasing envelope violation risk levels β ∈ {0.1, 0.2, 0.4, 0.6, 0.8, 1.0} in the freeway enter and left turn scenarios.
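The following sketch shows the described baseline reward shaping; the weights c_goal and c_col stand in for the paper's numeric coefficients and are chosen here only to satisfy the stated design property that the accumulated envelope penalty cancels the goal reward exactly at a violation duration of β · T_Plan.

def baseline_reward(goal_reached: bool, envelope_violated: bool,
                    collided: bool, tau_predict: float, beta: float,
                    t_plan: float, c_goal: float = 0.5,
                    c_col: float = 1.0) -> float:
    """Single-objective shaping: accumulating the envelope penalty over a
    violation duration of beta * T_Plan erases the goal reward c_goal."""
    r = c_goal * float(goal_reached)
    r -= c_goal * float(envelope_violated) * tau_predict / (beta * t_plan)
    r -= c_col * float(collided)
    return r

# Toy usage: one violating step of tau_predict = 1.0 s at beta = 0.1.
print(baseline_reward(False, True, False, 1.0, beta=0.1, t_plan=11.0))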
E. Results
First, we qualitatively analyze the RC-RSBG planner in Fig. 3 for two time steps in each scenario. The ego vehicle is shown in red. The posterior beliefs Pr(θ^k | H_o^t, j), given for the four nearest other vehicles in the respective color, qualitatively reflect the desired distance to the respective leading agent. The posterior is uniformly distributed for vehicles without a leading vehicle. We see that the planned stochastic policy π_i correctly balances the envelope and collision action-risk estimates ρ_env(⟨H_o^t⟩, a_i) and ρ_col(⟨H_o^t⟩, a_i) such that the expected planned envelope risk ρ_exp.env (dashed red) fulfills the constraint β (blue) while the expected planned collision risk ρ_exp.col (dotted red) is close to zero. Given these constraints, the planned policy prefers actions with higher expected action-return values Q_R(⟨H_o⟩, a_i). Our qualitative analysis reveals that the RC-RSBG planner correctly implements the risk-constrained optimality criteria from Sec. IV-B using a stochastic policy.

Next, we quantitatively analyze the percentage of trials over 100 scenarios where the ego vehicle reaches the goal (P_suc), collides (P_col), or exceeds the maximum allowed simulation time to solve a scenario (P_max). We evaluate over increasing envelope risk constraints β. For successful trials, we calculate the average time to reach the goal, T_suc. Results are given in Fig. 4. The RC-RSBG planner outperforms the baselines with increasing β regarding P_suc and T_suc.
The MDP and Cooperative planners rarely succeed (P_suc ≪ 1), emphasizing the advantage of belief-based prediction in denser traffic. With higher β, the RC-RSBG planner relies increasingly on the accuracy of the prediction model and less on the safety provided by the envelope restriction. In the case of prediction model inaccuracies, this provokes collisions for large β. This tendency is not observed with the baseline planners, which show collisions also for lower β. The RC-RSBG-FullInfo planner, having access to the true behavior of other vehicles, does not suffer from model inaccuracies and thus does not provoke any collision. These findings indicate the usefulness of multi-objective optimality to integrate risk constraints and support our problem formulation considering risk to balance prediction model inaccuracies and safety, as discussed in Sec. III.

To further analyze how β balances safety and efficiency, we introduce two metrics, the observed envelope violation risk β* and the expected scenario waiting time t_w. The observed envelope violation risk is the percentage of simulation time the envelope is violated,

\beta^* = \frac{\sum_{\forall \mathcal{S}} \sum_{o^t \in \mathcal{S}} f_{\mathrm{envelope}}(o^t) \cdot \tau_a}{\sum_{\forall \mathcal{S}} L(\mathcal{S}) \cdot \tau_a},

with o^t ∈ S giving the simulated states and L(S) the length of scenario S. The expected waiting time,

t_w = \sum_{k=0}^{\infty} (T_{\mathrm{max}} \cdot k + T_{\mathrm{suc}}) \cdot (P_{\mathrm{suc}} \cdot P_{\mathrm{max}}^k),

defines the expected time to solve a scenario. The calculation assumes that the ego vehicle encounters solvable scenarios with probability P_suc and duration T_suc, and unsolvable scenarios with probability P_max and duration equal to the allowed simulation time T_max = 6 s. Fig. 5 shows both metrics over risk level β.

Fig. 5: Analysis of the observed envelope violation risk β* and the expected scenario waiting time t_w [s] over the specified risk level β for the freeway enter and left turn scenarios; the equality line marks β* = β, and ∞ indicates unbounded waiting times.
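Both metrics are straightforward to compute from the simulation logs; a sketch, with the waiting-time series evaluated in closed form via the geometric sums Σ_k x^k = 1/(1−x) and Σ_k k·x^k = x/(1−x)²:

def observed_risk(scenarios: list) -> float:
    """beta*: fraction of simulated time with a violated envelope. Each
    scenario is a list of per-step violation flags; tau_a cancels."""
    violating = sum(sum(flags) for flags in scenarios)
    total = sum(len(flags) for flags in scenarios)
    return violating / total

def expected_waiting_time(p_suc: float, p_max: float,
                          t_suc: float, t_max: float) -> float:
    """t_w = sum_{k>=0} (T_max*k + T_suc) * P_suc * P_max^k in closed form."""
    return p_suc * (t_max * p_max / (1.0 - p_max) ** 2 + t_suc / (1.0 - p_max))

# Toy usage with assumed numbers: three logged scenarios; P_suc = 0.8, etc.
print(observed_risk([[0, 0, 1, 0], [0, 1, 1, 0, 0], [0, 0, 0]]))
print(expected_waiting_time(p_suc=0.8, p_max=0.2, t_suc=4.0, t_max=6.0))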
In the freeway enter scenario, the RC-RSBG planners fully exploit the allowed risk (β* ≈ β) for small β, indicating that our risk formulation and planning approach reflects the observed risk. For larger β, β* decreases; we assume that the scenario ending becomes less risky once a higher β is allowed initially. The opposite case occurs in the left turn scenario, where a low β prevents entering the intersection, impeding full exploitation of the allowed β. The baseline approaches do not show any interpretable correlation between β* and β. The waiting time t_w indicates efficiency, while P_col from Fig. 4 defines safety. Interestingly, our risk formulation suggests a small β in the freeway enter scenario to prevent collisions, which resembles the safety envelope violation risk β_human of humans during lane changing [12, 13], and introduces a natural trade-off between safety and efficiency for larger β. With our risk formulation, waiting times stay within a realistic human-like range of up to roughly 15 s in dense traffic, whereas the baseline approaches mostly show larger waiting times.

We conclude that the presented risk formulation and interactive planner balance safety and efficiency according to human-related risk criteria in dense traffic. They outperform interactive baseline planners regarding efficiency and interpretability of the employed safety objectives.

VI. CONCLUSION
This work formalizes a novel safety objective for interactive planning which balances safety and efficiency in dense traffic scenarios with uncertainty about other traffic participants' behavior by accepting a specifiable risk of violating a safety envelope. We propose a decision-theoretic framework under this safety objective, the Risk-Constrained Robust Stochastic Bayesian Game (RC-RSBG), and an accompanying interactive planner based on a variant of Multi-Agent Monte Carlo Tree Search. In two types of traffic scenarios, we demonstrate that the RC-RSBG planner outperforms baseline planners and provides an interpretable and tunable safety objective.

This work reveals that a combination of uncertainty- and prediction-based interactive planners with safety envelope restrictions is a promising direction for future research. We will further invest in the real-time capability of the planner, improve safety envelope definitions, and analyze in detail the correspondence between human risk concepts and the proposed risk level formalism.

VII. ACKNOWLEDGEMENT
This research was funded by the Bavarian Ministry of Economic Affairs, Regional Development and Energy, project Dependable AI.

REFERENCES

[1] Y. Wang et al., "Enabling courteous vehicle interactions through game-based and dynamics-aware intent inference," IEEE Transactions on Intelligent Vehicles, vol. 5, no. 2, 2020.
[2] D. Lenz et al., "Tactical cooperative planning for autonomous highway driving using Monte-Carlo Tree Search," in Intelligent Vehicles Symposium (IV), IEEE, 2016.
[3] K. Kurzer et al., "Decentralized Cooperative Planning for Automated Vehicles with Hierarchical Monte Carlo Tree Search," in Intelligent Vehicles Symposium (IV), Jun. 2018.
[4] C. Hubmann et al., "A Belief State Planner for Interactive Merge Maneuvers in Congested Traffic," IEEE, 2018.
[5] Y. Lu et al., "Safe mission planning under dynamical uncertainties," in IEEE International Conference on Robotics and Automation (ICRA), 2020.
[6] J. Bernhard et al., "Robust Stochastic Bayesian Games for Behavior Space Coverage," in Robotics: Science and Systems (RSS), Workshop on Interaction and Decision-Making in Autonomous-Driving, 2020.
[7] C. Pek et al., "Using online verification to prevent autonomous vehicles from causing accidents," Nature Machine Intelligence, vol. 2, Sep. 2020.
[8] K. Leung et al., "On infusing reachability-based safety assurance within planning frameworks for human–robot vehicle interactions," International Journal of Robotics Research, vol. 39, no. 10-11, 2020.
[9] M.-Y. Yu et al., "Risk assessment and planning with bidirectional reachability for autonomous driving," 2020.
[10] S. Shalev-Shwartz et al., "On a Formal Model of Safe and Scalable Self-driving Cars," 2017. arXiv: 1708.06374.
[11] A. Pierson et al., "Learning risk level set parameters from data sets for safer driving," 2019.
[12] K. Esterle et al., "Formalizing traffic rules for machine interpretability," 2020. arXiv: 2007.00330.
[13] C. Pek et al., "Verifying the safety of lane change maneuvers of self-driving vehicles based on formalized traffic rules," 2017.
[14] C. Wei et al., "Risk-based autonomous vehicle motion control with considering human driver's behaviour," Transportation Research Part C: Emerging Technologies, vol. 107, 2019.
[15] D. Iberraken et al., "Safe autonomous overtaking maneuver based on inter-vehicular distance prediction and multi-level bayesian decision-making," 2018.
[16] Y. Akagi et al., "Stochastic driver speed control behavior modeling in urban intersections using risk potential-based motion planning framework," Jun. 2015.
[17] C. M. Hruschka et al., "Uncertainty-adaptive, risk based motion planning in automated driving," 2019.
[18] X. Huang et al., "Hybrid Risk-Aware Conditional Planning with Applications in Autonomous Vehicles," Dec. 2018.
[19] M. Yu et al., "Occlusion-aware risk assessment for autonomous driving in urban environments," IEEE Robotics and Automation Letters, vol. 4, no. 2, 2019.
[20] A. Philipp et al., "Analytic collision risk calculation for autonomous vehicle navigation," 2019.
[21] A. Majumdar et al., "How Should a Robot Assess Risk? Towards an Axiomatic Theory of Risk in Robotics," CoRR, vol. abs/1710.11040, 2017.
[22] J. Bernhard et al., "Addressing Inherent Uncertainty: Risk-Sensitive Behavior Generation for Automated Driving using Distributional Reinforcement Learning," IEEE, 2019.
[23] J. I. Ge et al., "Risk-aware motion planning for automated vehicle among human-driven cars," 2019.
[24] J. Bernhard et al., "BARK: Open behavior benchmarking in multi-agent environments," 2020.
[25] W. Schwarting et al., "Planning and Decision-Making for Autonomous Vehicles," Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, no. 1, May 2018.
[26] M. Bouton et al., "Reinforcement learning with probabilistic guarantees for autonomous driving," in Workshop on Safety, Risk and Uncertainty in Reinforcement Learning, Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
[27] H. Bai et al., "Intention-aware online POMDP planning for autonomous driving in a crowd," in IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2015.
[28] M. Bouton et al., "Belief state planning for autonomously navigating urban intersections," in Intelligent Vehicles Symposium (IV), IEEE, 2017.
[29] J. D. Isom et al., "Piecewise linear dynamic programming for constrained POMDPs," in Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 1, AAAI Press, 2008.
[30] J. Lee et al., "Monte-Carlo Tree Search for Constrained POMDPs," in Advances in Neural Information Processing Systems 31, Curran Associates, Inc., 2018.
[31] J. Zhou et al., "Gap acceptance based safety assessment of autonomous overtaking function," 2019.
[32] J. Müller et al., "A risk and comfort optimizing motion planning scheme for merging scenarios," 2019.
[33] S. V. Albrecht et al., "Belief and Truth in Hypothesised Behaviours," Artificial Intelligence, vol. 235, Jun. 2016.
[34] A. M. Uhrmacher et al., Multi-Agent Systems: Simulation and Applications, 1st ed. CRC Press, Inc., 2009.
[35] A. Rizaldi et al., "Formalising and monitoring traffic rules for autonomous vehicles in Isabelle/HOL."