An Interaction-aware Evaluation Method for Highly Automated Vehicles
Xinpeng Wang, Songan Zhang, Kuan-Hui Lee, Huei Peng

Xinpeng Wang ([email protected]), Songan Zhang, and Huei Peng are with the Department of Mechanical Engineering, University of Michigan, Ann Arbor, MI 48109, USA. Kuan-Hui Lee is with the Toyota Research Institute, Los Altos, CA 94022, USA.

Abstract: It is important to build a rigorous verification and validation (V&V) process to evaluate the safety of highly automated vehicles (HAVs) before their wide deployment on public roads. In this paper, we propose an interaction-aware framework for HAV safety evaluation that is suitable for highly interactive driving scenarios such as highway merging and roundabout entering. Contrary to existing approaches, where the primary other vehicle (POV) takes predetermined maneuvers, we model the POV as a game-theoretic agent. To capture a wide variety of interactions between the POV and the vehicle under test (VUT), we characterize the interactive behavior using level-k game theory and social value orientation, and train a diverse set of POVs using reinforcement learning. Moreover, we propose an adaptive test case sampling scheme based on Gaussian process regression to generate customized and diverse challenging cases. Highway merging is used as the example scenario. We found that the proposed method is able to capture a wide range of POV behaviors and achieves better coverage of the failure modes of the VUT compared with other evaluation approaches.

I. INTRODUCTION
Highly automated vehicles (HAVs) are under rapid development all over the world. They have the potential to transform ground transportation by liberating people from tedious driving tasks and to improve road safety by avoiding human errors. It is crucial to conduct verification and validation of their safety before their wide deployment.

Safety evaluation of HAVs has been conducted in multiple traffic scenarios, including unprotected left-turn, cut-in, and pedestrian crossing scenarios [1]-[3]. These can be characterized as reactive tests, where the vehicle under test (VUT) is challenged by the primary other vehicle (POV) "in a surprise". The test case is fully defined by the initial condition of the challenge. Due to the short duration of the challenge, the POV is typically not programmed to interact with the VUT, and a predetermined trajectory is assumed. For SAE level 3 and above automated vehicles [4], the operational design domain (ODD) can include dynamic and complex scenarios, such as highway merging, roundabout entering, and turning at unsignalized intersections. For these scenarios, the existing methods show their limitations. For example, in highway merging, the VUT attempts to merge from the ramp onto the main road while another vehicle, which serves as the POV, is present. The merging ramp gives enough time for the two vehicles to interact, and thus the assumption of "no interaction between the POV and
VUT" becomes unrealistic. Moreover, in the role of the POV,different human drivers may exhibit different behaviors underthe same initial conditions, including coasting, acceleratingto pull ahead, decelerating to yield, etc. These diversebehaviors pose a novel challenge for the motion predictionand decision-making module of the VUT, and should beincorporated into the evaluation framework.We propose an interaction-aware evaluation methodologyin this paper. It consists of two parts: first, we create a testcase pool, in which we model a set of interactive POVs usinglevel- k game theory and social value orientation (SVO).Second, we propose an adaptive sampling scheme basedon Gaussian process regression to generate challenging testcases for a given VUT. In this paper, we focus on thehighway merging scenario, but this methodology can beapplied to other scenarios including roundabout entering,turning at unsignalized intersections, etc. This is the firsteffort to comprehensively identify the failure modes of aVUT in an interactive scenario.The paper is organized as follows: Section 2 introducesrelated works; Sections 3 to 5 introduce the proposed methodby first formulating the evaluation problem, then introducesthe POV library construction and finally the adaptive test casegeneration procedure. Section 6 discusses the implementationdetails for the highway merging scenario; Section 7 showsthe simulated testing results and the comparison with othersampling methods; finally concluding remarks are made inSection 8. II. R ELATED WORK
Evaluation of the safety of HAVs has been an active research area in recent years, and many test procedures have been proposed. Test matrices have been used to evaluate advanced driver assistance systems (ADAS) [5]. However, the VUT can be tuned to pass the predefined test cases, yet may fail under broader conditions in real-world driving. Worst-case evaluation methods attempt to generate adversarial situations or POV inputs to create edge cases. [6] and [7] used reachability analysis to find test cases where the VUT has a minimal solution space. [8] applied simulation-based falsification to find failure cases for a given VUT. However, the assumption of adversarial POVs may not be reasonable. [9] applied reinforcement learning to create adversarial yet socially acceptable POV behaviors in a highway driving scenario, but the diversity of the challenging scenarios was not discussed. On the other hand, Monte Carlo sampling-based evaluation methods have been proposed to generate test cases to estimate the real-world performance of the VUT. Some research uses importance sampling [1], [2] or subset simulation [10] to efficiently estimate the crash rate of the VUT, while other works customize test cases to identify the failure modes of the VUT using adaptive sampling methods [11]-[13]. The interactions between the POV and the VUT have not been considered in these works.

Fig. 1. Pipeline of the interaction-aware evaluation method.

To model the interactive nature of human driving behavior, game theory has been widely applied, in which humans are modeled as utility-maximizing rational agents. Nash [14] and Stackelberg [15], [16] equilibrium models have been applied to model human driving behaviors. They rely on the assumption that each agent has an infinite level of rationality, which could be too strict considering that human drivers have to make quick decisions in a complex and dynamic environment. Therefore, other researchers have assumed bounded rationality of human drivers and applied level-k game theory [17], quantal response [18], or cumulative prospect theory [19] to model human driving behaviors. In addition, [14] and [20] considered the altruism of human driving behaviors in a game-theoretic setting. Despite the richness of game-theoretic models, they have yet to be comprehensively considered for HAV evaluation. Filling this gap is the focus of this work.

III. PROBLEM FORMULATION
We aim to systematically generate test cases for a given VUT in interactive scenarios. The tasks are two-fold: first, we create a test case pool for the target scenario; second, we propose a mechanism to sample test cases from the pool. The test cases can be characterized by two sets of attributes: the first set defines the initial condition of the scenario; the second set describes the interactive and behavioral properties of the POV, which determine its driving policy. The test case sampling procedure aims to evaluate the safety performance of a black-box VUT by finding its failure modes through efficient sampling schemes. The overall concept of the proposed interaction-aware evaluation method is shown in Figure 1.

IV. POV LIBRARY CONSTRUCTION
The POV library needs to capture a diverse set of POV behaviors. On the one hand, the POV model should approximate the decision-making procedure of human drivers. Therefore, we assume that the POVs are game-theoretic agents, which take (near) optimal actions according to their utility functions and their assumptions about the opponent. On the other hand, the modeling framework should have the flexibility to describe a wide range of possible driving behaviors. We model the POVs as agents that hold different assumptions about the VUT and have different utility functions for their own behavior. Specifically, we adopt the ideas of level-k game theory and social value orientation to describe the diversified POVs.

Fig. 2. (a) The SVO ring: we focus on ψ ∈ [0, π/2]. (b) The hierarchy of level-k agents. The level-2 POV has an extra parameter ψ characterizing its SVO angle.

A. Level-k game formulation

The level-k game theory model is based on the idea that intelligent agents (such as human drivers) have a finite level of rationality. The model first assumes a known level-0 agent, which is a naive agent that behaves non-cooperatively. Then, a level-k agent (k > 0) assumes that all of its opponents are level-(k−1) and behaves optimally according to this assumption. Using the level-0 policy as the starting point, the optimal policy for a level-k agent can be generated sequentially. According to an experimental study in economics [21], human decision-makers usually reason at most as level-2 thinkers. Therefore, we only consider agents up to level-2 in this research to illustrate the concept.

B. Social value orientation
Social value orientation (SVO) is a concept from the social psychology literature that quantifies an agent's degree of selfishness [22]. It can be represented as an orientation angle ψ indicating the agent's preference over the outcome for itself versus for others, as shown in Figure 2(a), where different ψ represent personalities such as egoistic, prosocial, altruistic, and competitive. In a common game-theoretic setting, an agent is egoistic and solely optimizes its own utility function, i.e., ψ = 0. However, when the SVO variable is combined with a game-theoretic driver model, as shown in [14], it can significantly improve the accuracy of trajectory prediction for human drivers and thus better explain human driving behaviors. Moreover, agents with different SVO can represent a continuous spectrum of human drivers, which complements the level-k framework, where humans only have discrete types. In this work, we combine SVO with level-k game theory to capture richer POV behaviors.

C. POV library construction using reinforcement learning
Based on level-k game theory and SVO, we create a library with the following POV agents: a level-0 POV, a level-1 POV, and level-2 POVs with varying social value orientation. Due to the non-competitive nature of driving tasks, we only consider SVO angles in the first quadrant for simplicity, i.e., 0 ≤ ψ < π/2. The reasons that we do not consider SVO for the lower-level POVs are that a level-0 POV is non-cooperative, so SVO cannot be defined, and a level-1 POV assumes its opponent is level-0 and non-cooperative, so SVO is not defined either.

To construct the POV library, we first design the policy for a level-0 POV as a baseline. Next, to generate the policy for a level-k POV (k > 0), a level-(k−1) VUT is needed in advance. Therefore, we start with a level-0 VUT policy, and then generate higher-level POVs and VUTs sequentially in a double-helix structure, as shown in Figure 2(b). Although the targets are level-k POVs, we still need to compute level-k VUTs as stepping stones to obtain higher-level POVs. In simulated and real tests, the VUTs we evaluate are not these model VUTs.

The level-0 POV and VUT behave non-cooperatively with fixed speed profiles, which capture the behavior of inattentive drivers. For the level-k POV and VUT (k > 0), we use reinforcement learning (RL) to compute their driving policies. To train a level-k POV, we model it as an agent operating in an environment with a level-(k−1) VUT; the same procedure applies to the VUT. To incorporate SVO, we consider the SVO angle as an extra state of the model when the level-2 POV is trained, which generates a continuum of level-2 POVs.
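The double-helix construction can be summarized as a short training loop. The sketch below is illustrative: `train_agent` stands in for a full DDQN training run (described later in this section), and all function names are placeholders rather than the authors' code.

```python
# Minimal sketch of the sequential level-k training procedure (Fig. 2(b)).

def level0_policy(state):
    # Non-cooperative baseline: ignore the opponent, keep a fixed profile
    # (constant speed for the POV, constant acceleration for the VUT).
    return 0.0

def train_agent(role, opponent_policy):
    # Placeholder: run RL (e.g., DDQN) for `role` in an environment whose
    # other vehicle follows the frozen `opponent_policy`; return the policy.
    def learned_policy(state):
        return 0.0  # dummy action; a real implementation queries the Q-net
    return learned_policy

pov = {0: level0_policy}
vut = {0: level0_policy}
for k in (1, 2):                       # level-k best-responds to level-(k-1)
    pov[k] = train_agent("POV", vut[k - 1])
    vut[k] = train_agent("VUT", pov[k - 1])
# For the level-2 POV, the SVO angle psi is added to the state so a single
# trained agent represents a continuum of level-2 behaviors.
```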
1) Reinforcement learning basics:
Computing a rational agent can be modeled as a Markov decision process (MDP), which is defined by M = (X, U, P, R, γ), with the state space X ⊆ R^n, the action space U ⊆ R^m, the transition dynamics of the environment P : X × U → X, the reward function R : X × U → R, and the discount factor γ ∈ [0, 1].

At each state x_t, the agent tries to compute the best action u_t from the state-action mapping, i.e., the policy π(x_t) = u_t, that maximizes the expected cumulative reward E_π[Σ_{t=0}^{t_f} γ^t r(t)], where t_f is the end time. To learn the optimal policy π*, we use the Q-learning technique. We first define the action-value function Q:

Q(x, u | π) = E_π [ Σ_{t=0}^{t_f} γ^t r_t | x_0 = x, u_0 = u ]    (1)

Then π* is learned by training the agent to learn the optimal Q function, i.e., Q*(x, u | π*), which satisfies the Bellman equation. For details, please refer to [23].
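As a concrete (tabular) illustration of the Q-learning idea behind (1), the update below regresses Q(x, u) toward the Bellman target r + γ max over u′ of Q(x′, u′); in the actual method a neural network replaces the table (see below). The environment interface is assumed.

```python
import random
from collections import defaultdict

def q_learning_step(Q, x, u, r, x_next, actions, gamma=0.95, alpha=0.1):
    # One Bellman-backup step toward Q*(x,u) = E[r + gamma * max_u' Q*(x',u')].
    target = r + gamma * max(Q[(x_next, un)] for un in actions)
    Q[(x, u)] += alpha * (target - Q[(x, u)])

def epsilon_greedy(Q, x, actions, eps=0.1):
    # Behavior policy: explore with probability eps, otherwise exploit Q.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda u: Q[(x, u)])

Q = defaultdict(float)  # tabular stand-in for the Q-network
```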
2) Reinforcement learning formulation:
For a level-k VUT, the state space includes all continuous physical states of the POV and the VUT, denoted as X (X_VUT = X). For POVs, the SVO angle is included in the state space and is held constant within each episode, i.e., X_POV = X × [0, π/2). The action space is a discrete set of acceleration or steering inputs.

The reward function reflects the goal of driving for each agent. We assume that the reward function can be represented as

r(x, u) = W^T Φ(x, u) = Σ_{i=1}^{k} w_i φ_i(x, u)    (2)

which is a linear combination of multiple terms, each representing a different attribute of driving. There are three categories:
1) Ego reward for the POV: r_POV,e = W_POV,e^T Φ_POV,e.
2) Ego reward for the VUT: r_VUT,e = W_VUT,e^T Φ_VUT,e.
3) Safety reward for both: r_safe = W_safe^T Φ_safe.

The final reward function for the VUT is

r_VUT = r_safe + r_VUT,e    (3)

For a POV with SVO angle ψ, the reward function is

r_POV = r_safe + r_POV,e cos(ψ) + r_VUT,e sin(ψ)    (4)

where ψ modulates the reward between the POV and the VUT. For a level-1 POV, ψ ≡ 0.
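Equations (2)-(4) translate directly into code. The sketch below assumes the feature vectors Φ and weights W are given; it only shows how the SVO angle ψ blends the two ego rewards.

```python
import numpy as np

def linear_reward(w, phi):
    # r(x,u) = w^T * phi(x,u), cf. (2)
    return float(np.dot(w, phi))

def vut_reward(r_safe, r_vut_ego):
    return r_safe + r_vut_ego                      # cf. (3)

def pov_reward(r_safe, r_pov_ego, r_vut_ego, psi):
    # The SVO angle psi in [0, pi/2) trades off the POV's own reward against
    # the VUT's reward, cf. (4); psi = 0 recovers a purely egoistic POV.
    return r_safe + r_pov_ego * np.cos(psi) + r_vut_ego * np.sin(psi)
```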
3) Training POV & VUT agents using DDQN:
In this work, since the state space is continuous, we use an artificial neural network as the function approximator for the optimal action-value function Q*. The reinforcement learning algorithm we use is the Double Deep Q-Network (DDQN) [24]. DDQN is based on the Deep Q-Network (DQN) [25] method. It addresses DQN's overestimation of future returns by decoupling action selection and action evaluation into max operations over two different Q-networks. For an MDP with a discrete action space and a low-dimensional state space, more advanced RL methods are not necessarily better than the DDQN approach; for other applications, DDQN can be replaced by another appropriate RL method.
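The decoupling that distinguishes DDQN from DQN fits in a few lines: the online network selects the greedy action at the next state, while the target network evaluates it. The sketch below assumes the two networks' Q-value vectors are already computed; it is framework-agnostic.

```python
import numpy as np

def ddqn_target(r, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double-DQN bootstrap target for one transition.

    q_online_next / q_target_next: Q-value vectors over the discrete action
    set at the next state, from the online and target networks respectively.
    """
    if done:
        return r
    a_star = int(np.argmax(q_online_next))     # action selection: online net
    return r + gamma * q_target_next[a_star]   # action evaluation: target net
```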
V. ADAPTIVE TEST CASE GENERATION

A. Problem formulation for adaptive testing
In the previous section, we systematically generated the interactive POV library, which is characterized by the SVO angle ψ and the rationality level L. Combined with the initial condition of the scenario x₀, we can build the test case pool, denoted as S, where each case is s = [x₀^T, ψ, L]^T. Then, we need a mechanism to pick a set of N test cases s = [s₁, ..., s_N] from the pool to identify the failure modes of the VUT. The main challenge is that different VUTs may have different performance profiles and weaknesses, and thus the failure modes are unknown at the beginning of the V&V process. Therefore, we need a sampling scheme that selects new cases based on past test results to adaptively search for the weaknesses of each VUT as the testing proceeds. The goals of the test case generation process are two-fold:
1) Challenge: find cases where the VUT performs poorly (i.e., identify its weaknesses).
2) Coverage: identify (possibly disconnected) regions of weak performance.

For a test run with case s, the performance of a VUT can be evaluated by a function P, which takes the VUT trajectory τ = [x(0), u(0), x(1), u(1), ..., x(t_f − 1), u(t_f − 1), x(t_f)] as input and computes the performance score:

P(s) = f(τ) = μ₁ I_crash + μ₂ P_safety + μ₃ P_task    (5)

where I_crash is the indicator function for collision; P_safety is the safety score; P_task is the score on task accomplishment (successful highway merge, smooth acceleration, etc.); and μ₁, μ₂, μ₃ are weighting factors.

To describe the aforementioned two goals, we propose the criterion of failure mode coverage (FMC) for evaluating the quality of test samples:

M(s, ρ, λ) = ∫_{∪ B(ρ, s_I(λ))} dv    (6)

where B(ρ, s) is a hyper-ball of radius ρ centered around case s, and s_I(λ) is the subset of s such that ∀s ∈ s_I(λ), P(s) < λ. The FMC evaluates the volume of the union of hyper-balls centered around test cases for which the VUT behaves poorly (P(s) < λ), which characterizes the coverage of the failure modes of the VUT. Here, all dimensions are normalized to [0, 1]. Figure 3 is a graphical illustration of the FMC in 1-D.

Fig. 3. Measuring the FMC of test samples: a 1-D illustration. The blue curve represents the performance score P(s), the red dashed line shows the performance threshold λ, and the regions of the curve below the threshold are the failure modes. The FMC is computed as M(s, ρ, λ) = l₁ + l₂ + l₃ + l₄.
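The integral in (6) can be estimated numerically. One simple approach, sketched below under the stated normalization to the unit cube, is Monte Carlo: draw probe points uniformly and count the fraction covered by at least one hyper-ball around a failing case. The helper is an illustration, not the authors' implementation.

```python
import numpy as np

def fmc_monte_carlo(cases, scores, rho, lam, n_mc=100_000, seed=0):
    """Monte Carlo estimate of the failure mode coverage M(s, rho, lambda).

    cases : (N, d) array of test cases, each dimension normalized to [0, 1]
    scores: (N,) performance scores P(s); failures satisfy P(s) < lambda
    """
    rng = np.random.default_rng(seed)
    fail = cases[scores < lam]                 # the subset s_I(lambda)
    if len(fail) == 0:
        return 0.0
    pts = rng.random((n_mc, cases.shape[1]))   # uniform probes of unit cube
    # A probe is covered if it lies in any ball B(rho, s) around a failure.
    dist = np.linalg.norm(pts[:, None, :] - fail[None, :, :], axis=-1)
    covered = (dist <= rho).any(axis=1)
    return covered.mean()                      # volume of the union of balls
```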
B. Adaptive testing method overview

To meet the goals of adaptive testing, we apply an adaptive sampling method. We generate N test cases sequentially in batches, with batch size n. For each case in S, the last attribute L is a categorical variable, while all the others are continuous variables, i.e., S = S̄ × {0, 1, 2}, where S̄ denotes the continuous part. Therefore, we separate the sampling scheme for each batch into two stages, as shown in the lower part of Figure 1. In the first stage, we allocate the number of samples to the different POV levels, i.e., we assign n_k^i cases to be tested with the level-k POV in batch i. In the second stage, we generate new test cases within each POV level from S̄ using Gaussian process regression (GPR). We elaborate on the two stages in the remainder of this section.

C. Intra-level adaptive sampling
Within each POV level, we conduct adaptive sampling using Gaussian process regression (GPR). GPR is a non-parametric probabilistic model [26]. The key idea is to maintain and update a GPR-based meta-model fitted to the existing samples, and to use the meta-model to guide the generation of each new batch of samples.
1) Gaussian process regression:
A Gaussian process (GP) is a stochastic process for which the joint distribution of every finite collection of random variables follows a multivariate Gaussian distribution. A GP, as shown in (7), is characterized by its mean function m(x) and its covariance function (kernel) k(x, x′):

f(x) ∼ GP(m(x), k(x, x′))    (7)

In this work, we use a GP to model the performance surface of each VUT, as shown in (8):

P(s) = ε + f(s), where f(s) ∼ GP(0, k(s, s′ | θ)), ε ∼ N(β, σ²)    (8)

where (β, σ, θ) are the parameters of the model. We use a zero mean function and a squared-exponential kernel for the GPR model, and the model parameters are optimized by maximum likelihood estimation. The procedure of adaptive sampling is illustrated in Algorithm 1, and some details are explained below.
Algorithm 1: Intra-level adaptive sampling
Input: batch number i; batch size n_k^i; previous GPR model P̂_k^{i−1}; exploration factor ε.
Output: test cases with level-k POV s_k^i and test results y_k^i; updated GPR model P̂_k^i.
1: if i = 1 then
2:   Sample the initial test batch s_k^1 uniformly from S̄.
3: else
4:   Uniformly sample p candidate queries š from S̄ (p ≫ n_k^i).
5:   ε ← ε₀ α^{i−1}.
6:   Pick (1 − ε) n_k^i queries from š according to q_exploit(s) as s_exploit.
7:   Pick ε n_k^i queries according to q_explore(s) as s_explore.
8:   s_k^i = [s_exploit, s_explore].
9: end if
10: Execute test cases s_k^i and acquire results y_k^i.
11: Fit/update the GPR model: y = P̂_k^i(s) = P̂(s | s_k^i, y_k^i).
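The meta-model update at the end of Algorithm 1 can be realized with an off-the-shelf GPR library. The sketch below uses scikit-learn (a toolchain assumption; the paper does not name one for this step) with a squared-exponential kernel and maximum-likelihood hyperparameter fitting, matching (8).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

def fit_meta_model(s_tested, y_scores):
    """Fit P_hat(s) ~ GP(0, k_SE) + noise to the tested cases, cf. (8).

    Kernel hyperparameters (length scales, signal and noise variances) are
    optimized by maximum likelihood inside `fit`.
    """
    kernel = ConstantKernel() * RBF() + WhiteKernel()   # SE kernel + noise
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gpr.fit(np.asarray(s_tested), np.asarray(y_scores))
    return gpr

# mu_hat, sigma_hat = gpr.predict(queries, return_std=True) then feeds the
# acquisition metrics (9)-(10) discussed next.
```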
2) Balancing between exploration and exploitation:
To achieve good coverage of the failure modes, we need to balance exploration and exploitation. On the one hand, it is desirable to explore regions with high uncertainty and pick more informative samples for a more accurate meta-model, which helps with coverage. On the other hand, samples with a low predicted P̂(s) represent more challenging cases, which are preferred for the goal of challenge. We resolve this dilemma in an ε-greedy way: on lines 6 and 7 of Algorithm 1, we evaluate the queries with two query quality metrics derived from the GPR model, q_exploit(s) and q_explore(s):

q_exploit(s) = −μ̂(s) + z₁ σ̂(s)    (9)

q_explore(s) = −μ̂(s) + z₂ σ̂(s)    (10)

where μ̂(s) = E[P̂(s)] and σ̂²(s) = Var[P̂(s)]. The two metrics differ in their parameters: the former prefers exploitation by emphasizing a low predicted score (z₁ < z₂), while the latter focuses on exploration by emphasizing high uncertainty (z₂ > z₁). For each batch, we pick the cases that maximize one of these metrics. The portion of cases devoted to exploration versus exploitation is determined by the parameter ε, which gradually decreases across batches at the rate α (α ∈ (0, 1)), so that the procedure starts with more exploration and biases towards exploitation as more data are collected and a better meta-model is built.
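A sketch of the batch selection on lines 6 and 7 of Algorithm 1, written against the confidence-bound form of (9)-(10); the weight values z₁ = 0.5 and z₂ = 2.0 and the helper names are illustrative assumptions.

```python
import numpy as np

def acquisition(mu, sigma, z):
    # Confidence-bound style metric: reward a low predicted score (challenge)
    # plus z times the predictive std (uncertainty); the weight z differs
    # between exploitation (small z) and exploration (large z).
    return -mu + z * sigma

def select_batch(gpr, queries, n_batch, eps, z_exploit=0.5, z_explore=2.0):
    """Pick (1-eps)*n exploitation and eps*n exploration cases (Alg. 1)."""
    mu, sigma = gpr.predict(queries, return_std=True)
    n_explore = int(round(eps * n_batch))
    n_exploit = n_batch - n_explore
    idx_exploit = np.argsort(-acquisition(mu, sigma, z_exploit))[:n_exploit]
    idx_explore = np.argsort(-acquisition(mu, sigma, z_explore))[:n_explore]
    return queries[np.concatenate([idx_exploit, idx_explore])]

# Across batches, eps decays geometrically: eps_i = eps_0 * alpha**(i-1).
```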
D. Inter-level ratio adjustment

In this stage, we consider all the POV levels together by distributing cases to each level based on the results from the previous batch. To maximize the expected coverage of the failure region, the strategy is to invest more samples in the better-performing POV levels while still exploring the other options. Specifically, we implement the following softmax decision rule for batch allocation:

n_k^{i+1} = π_k^{i+1} n = [ exp(ξ U(i, k)) / Σ_{j=0}^{2} exp(ξ U(i, j)) ] n    (11)

where U(i, k) is the number of cases with P(s) below the threshold divided by n_k^i, and Σ_{j=0}^{2} π_j^i = 1. For the first batch, we distribute the cases equally among all POV levels. After that, the cases are distributed according to the ratio of challenging cases found within each level in the previous batch. The parameter ξ controls how "greedy" the decision rule is.
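The allocation rule (11) in code; U(i, k) is taken as the ratio of challenging cases found with the level-k POV in the previous batch, as described above.

```python
import numpy as np

def allocate_next_batch(n_fail, n_run, n_total, xi=5.0):
    """Softmax allocation of the next batch across POV levels, cf. (11).

    n_fail[k]: challenging cases found with the level-k POV in this batch
    n_run[k]:  cases run with the level-k POV in this batch
    """
    U = np.asarray(n_fail, dtype=float) / np.asarray(n_run)
    pi = np.exp(xi * U) / np.exp(xi * U).sum()   # allocation ratios, sum to 1
    # Simple rounding; a real implementation should repair the total so the
    # per-level allocations still sum exactly to n_total.
    return np.rint(pi * n_total).astype(int)

# If level 1 produced the most failures in batch i, it receives the largest
# share of batch i+1, while levels 0 and 2 keep being explored.
```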
VI. IMPLEMENTATION ON HIGHWAY MERGING SCENARIO

A. Scenario model
The highway merging scenario is the focus of this paper; its configuration is illustrated in Figure 4. The VUT attempts to merge onto the highway, while the POV is driving in the main lane of the road. We make the following assumptions for simplification:
1) The POV and the VUT can see and interact with each other throughout the simulation horizon.
2) The POV is not able to change lanes to yield to the VUT; the VUT can only merge at the merge point M, which is the origin of the lane-fixed coordinates for both the ramp and the main lane.
3) There is only one POV in the main lane and there is no vehicle in front of the VUT on the ramp.
4) The scenario ends when the VUT reaches point M.

Fig. 4. The configuration of the highway merging scenario.

We model both vehicles as double integrators that only move longitudinally in their own lanes. The equations of motion are:

x_POV(t+1) = x_POV(t) + v_POV(t) δt
v_POV(t+1) = v_POV(t) + a_POV(t) δt
x_VUT(t+1) = x_VUT(t) + v_VUT(t) δt
v_VUT(t+1) = v_VUT(t) + a_VUT(t) δt    (12)

where x_POV, x_VUT are the longitudinal positions and v_POV, v_VUT are the longitudinal speeds of the POV and the VUT in their lanes. The input of each vehicle is its longitudinal acceleration, which ranges between [a_min, a_max]. The initial condition is characterized by (x⁰_POV, v⁰_POV, x⁰_VUT, v⁰_VUT). Without loss of generality, we assume x⁰_VUT is fixed. Moreover, v⁰_VUT is observed rather than determined by the test conductor. Therefore, the initial condition to sample from is x₀ = [x⁰_POV, v⁰_POV]^T.
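The one-step dynamics (12) as a function; δt and the acceleration bounds are parameters, with placeholder defaults.

```python
def step(x, v, a, dt=0.1, a_min=-4.0, a_max=2.0):
    # One step of the longitudinal double-integrator dynamics, cf. (12).
    # Positions are measured along each vehicle's own lane, with the merge
    # point M as the origin. dt and the bounds a_min/a_max are placeholders.
    a = min(max(a, a_min), a_max)
    return x + v * dt, v + a * dt

# Applied independently to (x_POV, v_POV) and (x_VUT, v_VUT) each time step.
```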
B. Level-0 policy

For the highway merging scenario, a level-0 POV is assumed to keep a constant speed regardless of the VUT. A level-0 VUT accelerates at a constant rate until it reaches the assumed highway speed.

C. Training RL agents at the highway merging scenario
When applied to the highway merging scenario, the physical state space of the MDP is X = R⁴, where each state is x = [x_POV, v_POV, x_VUT, v_VUT]. The transition dynamics are given in (12), with the opponent's action governed by the level-(k−1) policy. The actions are discrete acceleration choices between a_min and a_max for both the POV and the VUT, i.e., u = a_POV or a_VUT ∈ U = {a_min, a_min + 1, ..., 0, +1, +2} (m/s²). Each episode terminates when the VUT reaches the merge point, i.e., x_VUT(t_f) = 0.

The detailed definitions of the three categories of reward mentioned in Section IV-C for the highway merging scenario are as follows:

Φ_POV,e = [φ_acc, φ_vHW]^T, where φ_acc penalizes acceleration actions, and φ_vHW penalizes speeds outside the highway speed limits (below v_HW,min or above v_HW,max). The parameter values are shown in Table I.

Φ_VUT,e = [φ_acc, φ_vmin, φ_vend]^T, where φ_acc is the same as in Φ_POV,e; φ_vmin penalizes speeds lower than a minimum speed v_min during the episode; φ_vend penalizes a final merging speed of the VUT that is faster or slower than the highway speed limits.

TABLE I. Parameters for reward design: v_HW,max, v_HW,min, v_min, TTC_min, Δx_crash, Δx_critical.

Φ_safe = [φ_TTC, φ_Δx, φ_crash]^T are the safety terms evaluated at the end of the episode t_f. We define:

Δx = x_POV(t_f) − x_VUT(t_f)
Δv = v_POV(t_f) − v_VUT(t_f)
TTC = −Δx/Δv when Δx·Δv < 0, and TTC = ∞ otherwise

where φ_TTC gives a penalty when TTC < TTC_min; φ_Δx rewards a large |Δx| and gives a penalty when |Δx| < Δx_critical; and φ_crash gives a heavy penalty when |Δx| < Δx_crash.

Finally, the DDQN algorithm for training the level-k POVs and VUTs is implemented using the MATLAB Reinforcement Learning Toolbox and Simulink.
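As a language-agnostic sketch of the safety terms (the actual training runs in MATLAB/Simulink, as noted above), the end-of-episode features can be computed as follows; the unit penalty magnitudes are placeholders for the weighted terms in (2).

```python
import math

def ttc(dx, dv):
    # Time-to-collision at the end of the episode: defined only when the
    # gap dx is closing (dx and dv have opposite signs), cf. Section VI-C.
    return -dx / dv if dx * dv < 0 else math.inf

def safety_features(x_pov, v_pov, x_vut, v_vut,
                    ttc_min, dx_critical, dx_crash):
    dx = x_pov - x_vut
    dv = v_pov - v_vut
    phi_ttc = -1.0 if ttc(dx, dv) < ttc_min else 0.0      # TTC penalty
    phi_dx = abs(dx) if abs(dx) >= dx_critical else -1.0  # gap reward/penalty
    phi_crash = -1.0 if abs(dx) < dx_crash else 0.0       # crash penalty
    return [phi_ttc, phi_dx, phi_crash]  # combined linearly via W_safe, cf. (2)
```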
VII. SIMULATION RESULTS
We conduct interaction-aware testing on several baseline VUTs in simulation to validate the performance and benefits of the proposed method.
A. Baseline VUT
For the highway merging scenario, we design a rule-based algorithm for the merging vehicle (the VUT). Its decision-making has three stages (a code sketch follows the list):
1) The VUT starts by following the speed profile of the level-0 VUT policy π⁰_VUT. It goes to stage 2 when it is within a distance x_rb,1 of the merge point M.
2) The VUT predicts Δx relative to the POV at its arrival at M, assuming the POV keeps a constant speed and the VUT follows π⁰_VUT. If the gap is too small, it switches to coasting; otherwise, it keeps following π⁰_VUT. It goes to stage 3 when it is within a distance x_rb,2 of M (x_rb,2 < x_rb,1).
3) The VUT again predicts Δx relative to the POV at its arrival at M, assuming the POV maintains a constant speed and the VUT follows π⁰_VUT. If the gap is too small, it switches to PID control on acceleration; otherwise, it follows π⁰_VUT.

By adjusting the parameters, we can manipulate the VUT to exhibit different failure modes.
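A sketch of the three-stage rule above. The thresholds x_rb,1, x_rb,2, and the gap threshold are the tunable parameters, and the constant-speed POV prediction is made explicit; the specific numbers and the simplified arrival-time estimate are our assumptions, not the paper's values.

```python
def rule_based_vut(x_vut, v_vut, x_pov, v_pov, pi0_accel=2.0,
                   x_rb1=120.0, x_rb2=60.0, gap_min=15.0):
    # Three-stage rule-based merging policy (illustrative thresholds).
    # x < 0 before the merge point M, which sits at the origin.
    dist_to_M = -x_vut
    if dist_to_M > x_rb1:
        return pi0_accel                      # stage 1: follow level-0 profile
    # Stages 2-3: predict the gap |dx| at M assuming the POV keeps constant
    # speed; the arrival-time estimate below ignores further acceleration.
    t_arrive = dist_to_M / max(v_vut, 1e-3)
    gap_at_M = abs(x_pov + v_pov * t_arrive)
    if dist_to_M > x_rb2:                     # stage 2: coast if too close
        return 0.0 if gap_at_M < gap_min else pi0_accel
    # Stage 3: a simple braking action stands in for the paper's
    # "PID control on acceleration" branch.
    return -1.0 if gap_at_M < gap_min else pi0_accel
```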
B. Various interactive test cases
In this section, we present exemplar test cases with different interactions between the POV and the VUT. The road geometry of the highway merging scenario is based on an entrance ramp on US 23 North near Exit 41. The VUT started from a fixed initial position x⁰_VUT on the ramp. In Figure 5, we present three test cases with different POVs and VUTs; the initial conditions are the same in all cases. In the first case, shown in Figures 5(a) and 5(b), the level-0 VUT accelerates non-cooperatively, while the POV yields by reducing its speed to let the VUT merge first. In the second case (Figures 5(c) and 5(d)), the level-1 VUT yields by starting its acceleration phase later, while the level-2 POV with a cooperative SVO accelerates to leave room for the VUT to merge behind it. In the third case (Figures 5(e) and 5(f)), the same level-2 POV yields to let the VUT enter first. However, the rule-based VUT fails to understand the POV's intention: it starts to accelerate, then coasts, and even decelerates hard before it crashes into the POV. This last case shows a "stalemate" situation, in which both agents try to yield to each other and create an inefficient and dangerous scene. These three test cases capture different interactions, which makes the evaluation scenarios diverse.

Fig. 5. Results with the same initial condition (v⁰_POV = 33 m/s, v⁰_VUT = 18 m/s): (a)-(b) level-1 POV & level-0 VUT; (c)-(d) level-2 POV & level-1 VUT; (e)-(f) level-2 POV & rule-based VUT. Blue for the VUT, red for the POV; the numbers show time lapses in seconds.
C. Test results comparison

1) Results with a single POV level:
We first show the results of simulated testing with a fixed POV level. Each VUT is put through N = 400 test cases. A case with a score P(s) below a negative collision threshold indicates that a collision has occurred, and it is thus deemed a failure case. We compare the proposed GPR-based adaptive sampling scheme with other test case generation schemes, including uniform sampling, simulated annealing [27], and subset simulation [10]. The FMC M is the criterion for comparing their capability of discovering failure cases. For the VUT, two rule-based algorithm designs with different parameter settings are selected.
TABLE II. Adaptive testing results comparison: FMC M(s, ρ, λ) achieved by each sampling method (GPR-based adaptive sampling, subset simulation, simulated annealing, uniform sampling) under the L-0, L-1, and L-2 POV.

The GPR-based adaptive sampling achieves the highest FMC among all the methods, and is also closest to the ground truth while using only 4% of the cases. Specifically, Figure 6 compares the results of the different sampling schemes for one VUT design with the level-1 POV.
2) Results with multiple POV levels:
Finally, we simulate the adaptive testing procedure with all the POV levels on one VUT design, using N = 800 cases. Figure 9 shows the change of sample allocation across the different POV levels. The sample sizes start evenly, but since more failure cases were found with the level-1 POV, and none with the level-0 POV, the sample size grows for level-1 in later batches while it shrinks for level-0. This demonstrates that the proposed method is able to focus on the more promising interactive POV level for efficient identification of challenging test cases, while still exploring the under-performing ones.

Fig. 6. Simulated testing results with the level-1 POV on one VUT design (k = 20 batches of n = 20 cases): (a) ground truth, (b) GPR-based adaptive sampling, (c) subset simulation, (d) simulated annealing, (e) uniform sampling.

Fig. 8. Simulated testing results with the level-2 POV on one VUT design (k = 20, n = 20): (a) GPR-based adaptive sampling, (b) subset simulation.

Fig. 9. Simulated testing results with the full POV library on one VUT design (k = 20, n = 40): (a) batch sample allocation across POV levels in all batches; (b) average performance score for each POV level in all batches.

VIII. CONCLUSIONS

In this paper, we study the evaluation problem for black-box HAVs in scenarios with significant human interactions. We apply two game-theoretic methodologies, level-k game theory and social value orientation, to model the interactive POV driving policies and incorporate them into our test case pool design. We then design an adaptive test case sampling scheme based on Gaussian process regression and propose a failure mode coverage (FMC) metric to measure the quality of the test samples. We verify the proposed method by running simulated tests on several baseline VUTs. The POV library is able to emulate a wide variety of interactive behaviors, and the sampling method can customize test cases to discover the failure modes of the VUTs using only a fraction of the number of cases required by the ground truth. It outperforms other sampling methods according to the FMC metric.

ACKNOWLEDGMENT

Toyota Research Institute (TRI) provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. We thank Shaobing Xu, Geunseob Oh, and Yuanxin Zhong for their insightful suggestions and help.

REFERENCES

[1] D. Zhao, H. Lam, H. Peng, S. Bao, D. J. LeBlanc, K. Nobukawa, and C. S. Pan, "Accelerated Evaluation of Automated Vehicles Safety in Lane-Change Scenarios Based on Importance Sampling Techniques,"
IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 3, pp. 595-607, Mar. 2017.
[2] X. Wang, H. Peng, and D. Zhao, "Combining reachability analysis and importance sampling for accelerated evaluation of highly automated vehicles at pedestrian crossing," in ASME 2019 Dynamic Systems and Control Conference (DSCC 2019), vol. 3. American Society of Mechanical Engineers, Oct. 2019.
[3] X. Wang, Y. Dong, S. Xu, H. Peng, F. Wang, and Z. Liu, "Behavioral Competence Tests for Highly Automated Vehicles," in IEEE Intelligent Vehicles Symposium (accepted).
[4] SAE International, "Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles," SAE Standard J3016.
[5] …
[6] …, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2906-2917, 2018.
[7] M. Althoff and S. Lutz, "Automatic Generation of Safety-Critical Test Scenarios for Collision Avoidance of Road Vehicles," in IEEE Intelligent Vehicles Symposium (IV), Jun. 2018, pp. 1326-1333.
[8] C. E. Tuncali, T. P. Pavlic, and G. Fainekos, "Utilizing S-TaLiRo as an automatic test generation framework for autonomous vehicles," in IEEE Conference on Intelligent Transportation Systems (ITSC), 2016, pp. 1470-1475.
[9] S. Zhang, H. Peng, S. Nageshrao, and H. E. Tseng, "Generating socially acceptable perturbations for efficient evaluation of autonomous vehicles," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 1341-1347.
[10] S. Zhang, H. Peng, D. Zhao, and H. E. Tseng, "Accelerated Evaluation of Autonomous Vehicles in the Lane Change Scenario Based on Subset Simulation Technique," in IEEE Conference on Intelligent Transportation Systems (ITSC), Nov. 2018, pp. 3935-3940.
[11] G. E. Mullins, P. G. Stankiewicz, and S. K. Gupta, "Automated generation of diverse and challenging scenarios for test and evaluation of autonomous vehicles," in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 1443-1450.
[12] Z. Huang, H. Lam, and D. Zhao, "Towards Affordable On-track Testing for Autonomous Vehicle - A Kriging-based Statistical Approach," in IEEE Conference on Intelligent Transportation Systems (ITSC), 2017, pp. 1-6.
[13] S. Feng, Y. Feng, H. Sun, Y. Zhang, and H. X. Liu, "Testing Scenario Library Generation for Connected and Automated Vehicles: An Adaptive Framework," IEEE Transactions on Intelligent Transportation Systems, pp. 1-10, 2020.
[14] W. Schwarting, A. Pierson, J. Alonso-Mora, S. Karaman, and D. Rus, "Social behavior for autonomous vehicles," Proceedings of the National Academy of Sciences of the United States of America, vol. 116, no. 50, pp. 24972-24978, 2019.
[15] J. F. Fisac, E. Bronstein, E. Stefansson, D. Sadigh, S. S. Sastry, and A. D. Dragan, "Hierarchical game-theoretic planning for autonomous vehicles," in IEEE International Conference on Robotics and Automation (ICRA), 2019, pp. 9590-9596.
[16] J. H. Yoo and R. Langari, "A Stackelberg game theoretic driver model for merging," in ASME 2013 Dynamic Systems and Control Conference (DSCC 2013), vol. 2, 2013, pp. 1-8.
[17] N. Li, D. W. Oyler, M. Zhang, Y. Yildiz, I. Kolmanovsky, and A. R. Girard, "Game theoretic modeling of driver and vehicle interactions for verification and validation of autonomous vehicle control systems," IEEE Transactions on Control Systems Technology, vol. 26, no. 5, pp. 1782-1797, 2018.
[18] A. Sarkar and K. Czarnecki, "A behavior driven approach for sampling rare event situations for autonomous vehicles," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nov. 2019, pp. 6407-6414.
[19] L. Sun, W. Zhan, Y. Hu, and M. Tomizuka, "Interpretable Modelling of Driving Behaviors in Interactive Driving Scenarios based on Cumulative Prospect Theory," in IEEE Conference on Intelligent Transportation Systems (ITSC), Oct. 2019, pp. 4329-4335.
[20] L. Sun, W. Zhan, M. Tomizuka, and A. D. Dragan, "Courteous Autonomous Cars," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 663-670.
[21] M. A. Costa-Gomes, V. P. Crawford, and N. Iriberri, "Comparing Models of Strategic Thinking in Van Huyck, Battalio, and Beil's Coordination Games," Journal of the European Economic Association, vol. 7, no. 2-3, pp. 365-376, Apr. 2009.
[22] C. G. McClintock and S. T. Allison, "Social Value Orientation and Helping Behavior," Journal of Applied Social Psychology, vol. 19, no. 4, pp. 353-362, Mar. 1989.
[23] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[24] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the AAAI Conference on Artificial Intelligence, 2016, pp. 2094-2100.
[25] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[26] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.
[27] L. C. W. Dixon and G. P. Szegö, Towards Global Optimisation. Amsterdam: North-Holland, 1978.