Interpretable Safety Validation for Autonomous Vehicles
Anthony Corso and Mykel J. Kochenderfer

Abstract — An open problem for autonomous driving is how to validate the safety of an autonomous vehicle in simulation. Automated testing procedures can find failures of an autonomous system, but these failures may be difficult to interpret due to their high dimensionality and may be so unlikely as to not be important. This work describes an approach for finding interpretable failures of an autonomous system. The failures are described by signal temporal logic expressions that can be understood by a human, and are optimized to produce failures that have high likelihood. Our methodology is demonstrated for the safety validation of an autonomous vehicle in the context of an unprotected left turn and a crosswalk with a pedestrian. Compared to a baseline importance sampling approach, our methodology finds more failures with higher likelihood while retaining interpretability.
I. INTRODUCTION
A major challenge for autonomous driving is how to validate the safety of an autonomous vehicle (AV) before deploying it on the road. A common practice is to test the AV using an adversarial driving simulator that controls stochastic disturbances in the environment over time to find failures [1]–[3]. The disturbance trajectories that lead to failure are often high-dimensional and can therefore be difficult to interpret. In this work, we try to address this problem by developing a technique for interpretable black-box safety validation. We seek to generate descriptions of disturbance trajectories that lead to failure so that the descriptions may be used to understand weaknesses in the autonomous system and produce many related failure examples for further analysis.

The safety validation problem we study involves finding failures of an autonomous system (system-under-test or SUT) that operates in a stochastic environment. The state of the SUT and the environment is s ∈ S, and the disturbances x ∈ X are stochastic changes in the environment that can influence the SUT. For autonomous driving, the state is the pose of each agent and the internal state of the AV policy, while the disturbances may include the behavior of other drivers, the behavior of pedestrians, and sensor noise. A disturbance trajectory x = {x_1, ..., x_N} has a probability p(x), which models the stochastic elements of the environment. The set of all trajectories that end in a failure of the AV is denoted E, and x ∈ E means that the sequence of disturbances x causes the SUT to fail.

Black-box falsification techniques try to find any x such that x ∈ E. Falsification can be cast as an optimization problem over the space of disturbance trajectories [4], which can be solved using various optimization algorithms [5]–[9].

A. Corso and M. J. Kochenderfer are with the Aeronautics and Astronautics Department, Stanford University.
e-mail: {acorso, mykel}@stanford.edu

These approaches do not scale well when x is very high-dimensional, and failure examples might be extremely rare according to p(x). Adaptive stress testing [1], [2], [10], [11] tries to find the most likely failure by maximizing p(x) subject to x ∈ E and frames the falsification problem as a Markov decision process so it can be solved using reinforcement learning. Failures can also be found using importance sampling [12]–[15] or with rapidly exploring random trees [16]. Unlike the previous approaches, our work seeks to find an interpretable description of the failure trajectory x rather than just a single example.

For an algorithm to be interpretable, its results must be readily understood by a human. Interpretable models must choose an output representation that is sufficiently expressive to compactly convey the desired information while having direct mappings to words and concepts. Decision trees use a branching set of rules to classify data [17] or represent a reinforcement learning policy [18]. Simple mathematical expressions can be used for regression [19], governing equation discovery [20], and as a reinforcement learning policy [21]. Signal temporal logic (STL) is well-suited for high-dimensional time series data [22], [23] because it describes logical relationships between temporal variables and has a natural-language description that is easily understood.

Our approach seeks to find a description ϕ such that any disturbance trajectory that satisfies the description will cause the SUT to fail. Additionally, we want to search for the most likely failure example by maximizing the probability of the failure trajectories. These goals can be combined into the optimization problem

    max_ϕ  E_{x ∼ p(x | x ∈ ϕ)} [p(x)]   s.t.   x ∈ E    (1)

where p(x | x ∈ ϕ) is the probability density of disturbance trajectories that satisfy the description.
Taking inspiration from related work, we will represent ϕ using STL and perform the optimization using genetic programming. We use our approach to find the most-likely failures of an autonomous vehicle in two driving scenarios: 1) an unprotected left turn scenario with a small discrete disturbance space, and 2) a crosswalk scenario with a continuous and high-dimensional disturbance space to show that the approach scales. The first scenario assumes that the disturbances are independent, while the second scenario models the joint distribution over disturbance trajectories as a Gaussian process. In both scenarios, we find interpretable failure modes of the AV for several different initial conditions. Additionally, our approach finds an order of magnitude more failures than the importance sampling baseline, and the failures that are found have a much higher likelihood. The main contributions of this paper are:

1) An approach for generating interpretable trajectory descriptions.
2) Application of the approach to finding the most likely failures of an autonomous vehicle.
3) A comparison to an existing validation approach for two driving scenarios.

The paper is organized as follows. Section II describes the necessary technical background for our approach. Section III gives details of the algorithm. Section IV describes two experiments with safety validation of autonomous vehicles. Section V concludes and gives future directions.

II. BACKGROUND
This section discusses signal temporal logic, context-free grammars, genetic programming for expression optimization, and Gaussian processes.
A. Signal Temporal Logic
Signal temporal logic (STL) is a logical formalism that is used to describe the behavior of time-varying systems [24], [25]. STL uses the basic logical propositions and (∧), or (∨), and not (¬), as well as temporal propositions such as eventually (♦) and always (□). Temporal logic operators are evaluated on series of Booleans. Continuous and discrete time series data are converted to Boolean statements using the comparison operators ≤, ≥, and =. A time series x satisfies an STL expression ψ if ψ(x) evaluates to TRUE. STL expressions can also include an explicit dependence on time. The proposition □_[t1,t2] (x < c) means always x < c for t ∈ [t1, t2], and ♦_[t1,t2] (x = 13) means eventually x = 13 for t ∈ [t1, t2]. The flexibility of STL and the ease with which STL expressions can be converted into natural language can make it a suitable choice for interpretable modeling of heterogeneous time series [22], [23].

B. Context-Free Grammar
A context-free grammar is a set of rules that define a formal language such as STL. Each rule in the grammar is composed of an output type and a set of inputs consisting of types and symbols. To generate an expression, a starting type is chosen and types are recursively expanded until only symbols remain.

The grammar for the STL language described in section II-A is given in eq. (2) for a single time series variable x. The type B is used for Boolean scalars, S is for Boolean series, T is for time, and X is for values of x. The logical operators apply element-wise on series. The symbol | separates rules on the same line and : indicates a range of symbols. In this work, time is discrete and the variable x may be discrete or continuous. When x is continuous, X is sampled uniformly at random in the range [x_min, x_max].

    B ↦ B ∧ B | B ∨ B | ¬B | □_[T,T](S) | ♦_[T,T](S)
    T ↦ 1 : T_max
    S ↦ S ∧ S | S ∨ S | ¬S | x ≤ X | x ≥ X | x = X
    X ↦ x_min : x_max                                    (2)

C. Genetic Programming
Genetic programming is a population-based optimization technique for trees such as expressions generated from a grammar [26], [27]. Trees are evaluated according to a fitness function to determine the quality of an individual. To start the optimization, a population of N_pop trees is randomly sampled. Then, the following operations are performed randomly at each iteration (or generation) until convergence:

• Reproduction: The fittest tree is selected from a subset of the population and progresses to the next iteration.
• Mutation: A random node in the tree is replaced with a random tree of the same type from the grammar.
• Crossover: Two individuals are selected at random and mixed to create a child. A random subtree from the first individual replaces a random node of the same type in the second individual.
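The three operations above can be sketched concretely. The following is a minimal, illustrative Python sketch (not the paper's Julia/ExprOptimization.jl implementation) that evolves arithmetic expression trees toward a target function; the toy grammar, fitness function, operation probabilities, and all names are assumptions for illustration only.

```python
import random

# A node is ("+", l, r), ("*", l, r), ("x",), or ("const", c).

def random_tree(rng, depth=0, max_depth=4):
    """Sample a random expression tree; terminals are forced at max depth."""
    if depth >= max_depth or rng.random() < 0.3:
        return ("x",) if rng.random() < 0.5 else ("const", rng.randint(0, 9))
    op = rng.choice(["+", "*"])
    return (op, random_tree(rng, depth + 1, max_depth),
                random_tree(rng, depth + 1, max_depth))

def evaluate(node, x):
    if node[0] == "x": return x
    if node[0] == "const": return node[1]
    l, r = evaluate(node[1], x), evaluate(node[2], x)
    return l + r if node[0] == "+" else l * r

def nodes(tree, path=()):
    """Yield (path, subtree) pairs; paths index into child slots."""
    yield path, tree
    if tree[0] in ("+", "*"):
        yield from nodes(tree[1], path + (1,))
        yield from nodes(tree[2], path + (2,))

def replace(tree, path, sub):
    if not path: return sub
    children = list(tree)
    children[path[0]] = replace(tree[path[0]], path[1:], sub)
    return tuple(children)

def mutate(rng, tree):
    """Mutation: replace a random node with a fresh random subtree."""
    path, _ = rng.choice(list(nodes(tree)))
    return replace(tree, path, random_tree(rng, depth=2))

def crossover(rng, a, b):
    """Crossover: graft a random subtree of a onto a random node of b."""
    _, sub = rng.choice(list(nodes(a)))
    path, _ = rng.choice(list(nodes(b)))
    return replace(b, path, sub)

def cost(tree):
    # Fitness: squared error against a toy target f(x) = 3x + 2.
    return sum((evaluate(tree, x) - (3 * x + 2)) ** 2 for x in range(5))

def evolve(rng, n_pop=30, n_gen=20):
    pop = [random_tree(rng) for _ in range(n_pop)]
    best = min(pop, key=cost)
    for _ in range(n_gen):
        new = [best]  # elitism: keep the fittest individual
        while len(new) < n_pop:
            r = rng.random()
            if r < 0.3:    # reproduction via a small tournament
                new.append(min(rng.sample(pop, 3), key=cost))
            elif r < 0.6:  # mutation
                new.append(mutate(rng, rng.choice(pop)))
            else:          # crossover
                new.append(crossover(rng, rng.choice(pop), rng.choice(pop)))
        pop = new
        best = min(best, min(pop, key=cost), key=cost)
    return best

best = evolve(random.Random(0))
print("best cost:", cost(best))
```

Because the best-so-far individual is carried forward (elitism), the cost of the returned tree never increases across generations; the same loop structure applies when the individuals are STL expression trees and the fitness is the trajectory-based cost from section III.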
D. Gaussian Processes

A Gaussian process [28] is a stochastic process where any finite set of sample points t = [t_1, ..., t_m] have values x = [x_1, ..., x_m] that are distributed according to a multivariate normal distribution x ∼ N(μ(t), K(t, t)), where μ_i(t) = μ(t_i) for mean function μ and K_ij(t, t′) = k(t_i, t′_j) for kernel function k. A common choice for k is the squared exponential kernel with variance σ² and characteristic length ℓ, given by k(t, t′) = σ² exp(−(t − t′)² / (2ℓ²)). One strength of Gaussian processes is the ability to easily compute a posterior distribution after observing some values x_o at observation points t_o.

To apply linear constraints on a Gaussian process, we use the procedure of Jidling et al. [29], who apply a transformation to sample points from a truncated multivariate normal distribution. Sampling from a truncated multivariate distribution can be done efficiently with the minimax tilting approach [30]. For notational convenience, we write a Gaussian process with mean μ, kernel k, observed points x_o, lower-bound linear constraints l, and upper-bound linear constraints u as GP(μ, k, x_o, l, u).

III. METHODS
This section provides details for our approach to interpretable validation. We give an overview of the validation algorithm and then discuss how expressions and trajectories are sampled.
A. Algorithm Overview
The procedure for interpretable safety validation is summarized in algorithm 1. The algorithm takes as input a grammar G_STL that specifies the STL rules (eq. (2)) and a cost function c, which can be any function that leads to failures when minimized. In this work, c(x) = −p(x) 1{x ∈ E}, so that trajectories that end in failure and have high likelihood correspond with low cost. A population of expressions ϕ_j is sampled from the grammar (line 2), and then the algorithm iterates until the computational budget is exhausted. On each iteration, the cost for each expression ϕ_j is computed by evaluating c(x) for N trajectories that satisfy ϕ_j and taking the average (line 6). The details of how trajectories are sampled are given in sections III-C and III-D. Then, we perform the operations of genetic programming (reproduction, crossover, and mutation) to evolve the population of expressions to a lower cost (lines 7 to 9). After looping, we return the highest performing description (line 10).

Algorithm 1 Interpretable Validation
1:  function INTERPRETABLEVALIDATION(G_STL, c)
2:    Sample {ϕ_1, ..., ϕ_M} from G_STL
3:    loop
4:      for each expression ϕ_j
5:        Sample {x_1, ..., x_N} from p(x) s.t. x_i ∈ ϕ_j
6:        c_{ϕ_j} ← (1/N) Σ_{i=1}^{N} c(x_i)
7:      Select expressions {ϕ_1, ..., ϕ_M} based on c_{ϕ_j}
8:      Crossover expressions
9:      Mutate expressions
10:   return best ϕ_j

B. Sampling Expressions
Expressions need to be sampled from the STL grammar during the initialization of the genetic programming optimization scheme and during the mutation procedure. Expressions are constructed as a tree of nodes where each node is a rule from the grammar. Each node in the expression tree is recursively expanded by choosing a rule uniformly at random from the grammar until each leaf has a terminal rule. A maximum depth of 10 is imposed on all expressions to keep them from becoming too large. The grammar is defined and sampled from using ExprRules.jl. For a more detailed analysis of sampling expressions from a grammar, see Kochenderfer and Wheeler [26].
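The recursive expansion described above can be sketched as follows. This is an illustrative Python analogue of what ExprRules.jl provides, using a simplified text-based version of grammar (2); the bounds T_MAX, X_MIN, and X_MAX and the fallback rule at the depth cap are assumptions, not the paper's implementation.

```python
import random

# Illustrative stand-ins for the grammar's bounds.
T_MAX, X_MIN, X_MAX = 10, -1.0, 1.0

# Each type maps to a list of productions; production items are either
# terminal strings or type names ("B", "S", "T", "X") to expand further.
GRAMMAR = {
    "B": [["(", "B", " and ", "B", ")"],
          ["(", "B", " or ", "B", ")"],
          ["not ", "B"],
          ["always[", "T", ",", "T", "](", "S", ")"],
          ["eventually[", "T", ",", "T", "](", "S", ")"]],
    "S": [["(", "S", " and ", "S", ")"],
          ["(", "S", " or ", "S", ")"],
          ["not ", "S"],
          ["x <= ", "X"], ["x >= ", "X"], ["x == ", "X"]],
}

def sample(rng, typ="B", depth=0, max_depth=10):
    """Recursively expand a type, forcing terminal rules at the depth cap."""
    if typ == "T":
        return str(rng.randint(0, T_MAX))
    if typ == "X":
        return f"{rng.uniform(X_MIN, X_MAX):.2f}"
    rules = GRAMMAR[typ]
    if depth >= max_depth:
        # Keep only rules with no recursive types (the comparisons for S).
        rules = [r for r in rules if not any(item in GRAMMAR for item in r)]
        if not rules:  # B has no terminal rule; wrap a comparison (assumed fallback)
            return "always[0," + str(T_MAX) + "](" + sample(rng, "S", depth) + ")"
    rule = rng.choice(rules)
    return "".join(
        sample(rng, item, depth + 1, max_depth)
        if item in GRAMMAR or item in ("T", "X") else item
        for item in rule)

rng = random.Random(0)
print(sample(rng))
```

Choosing rules uniformly at random and capping the depth mirrors the sampling procedure used both to initialize the population and to generate replacement subtrees during mutation.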
C. Converting Expressions into Stochastic Constraints
To generate trajectories that satisfy an expression, one could sample x ∼ p(x) until a sample satisfies the given expression. If the expression places very restrictive requirements on the trajectory, however, then a large number of samples would be required to find one that satisfies the expression. To mitigate this problem, we construct linear constraints that, when imposed on samples from p(x), enforce the satisfaction of the STL expression. There are efficient algorithms for sampling from certain linearly-constrained distributions [29], which makes this approach tractable.

There are many different constraints that could enforce a time series to satisfy an STL expression. We seek to find the set of constraints that impose the minimum restriction on the time series while enforcing STL satisfaction, and choose one of these constraints at random. The procedure for sampling the minimally-restrictive constraints is given in algorithm 2. The algorithm takes as input an expression ex, an initially empty vector V to store the linear constraints, and the desired output of the expression out. Depending on the type of the expression, out can be a scalar or a vector whose values can be TRUE, FALSE, or ARBITRARY, where ARBITRARY means that there is no restriction placed on what the expression should evaluate to. If the expression is a leaf (e.g. a direct comparison such as x < c), then the expression and the output are stored as a pair in the vector of constraints (line 3). If the expression is not a leaf, then it can be decomposed into subexpressions (e.g. x < a ∨ x > b yields the two subexpressions x < a and x > b) (line 5). The output of each of those subexpressions is determined by the output of their parent expression (line 6). A set of subexpression outputs is sampled from the rules shown in table I. For each subexpression, the algorithm recurses (line 8) until all leaf nodes of the original expression are converted to constraints. The next section describes how to use those constraints to sample a disturbance trajectory from certain distributions.

TABLE I: The minimally-restrictive set of subexpression outputs for the desired output of a parent expression. T, F, and A are TRUE, FALSE, and ARBITRARY. To place the minimal amount of restriction on the time series, an output of ARBITRARY is used whenever possible.

Expression    Output   Subexpression Outputs
∧             TRUE     (T, T)
              FALSE    (F, A) or (A, F)
∨             TRUE     (T, A) or (A, T)
              FALSE    (F, F)
¬             TRUE     F
              FALSE    T
□_[t1,t2]     TRUE     [A, ..., T, ..., T, A, ...] with T over all of [t1, t2]
              FALSE    [A, ..., F, A, ...] with F at some t ∈ [t1, t2]
♦_[t1,t2]     TRUE     [A, ..., T, A, ...] with T at some t ∈ [t1, t2]
              FALSE    [A, ..., F, ..., F, A, ...] with F over all of [t1, t2]

Algorithm 2
Sampling constraints for STL satisfaction
1:  function SAMPLECONSTRAINTS(ex, out, V)
2:    if ex is a leaf
3:      push(V, (ex, out))
4:    else
5:      E ← SubExpressions(ex)
6:      O ← SubExpressionOutputs(ex, out)
7:      for i in 1:length(E)
8:        SampleConstraints(E[i], O[i], V)

D. Sampling Multivariate Time Series with Constraints
Let [x^(1), ..., x^(m)] be a multivariate time series with m samples of x ∈ R^n. Suppose we are given a set of constraint functions l, u : R → R^n such that l_j(t) ≤ x_j(t) ≤ u_j(t) and l_j(t) ≤ u_j(t) for j = 1, ..., n. The most general model of this time series is given by the joint probability distribution

    p(x^(1), ..., x^(m))   s.t.   l(t_i) ≤ x^(i) ≤ u(t_i)    (3)

where t_i corresponds to the time of the sample point x^(i). The model p captures the correlations in the data over time and between variables.

If time series samples and components of x are independent, then the probability distribution decomposes to

    p(x^(1), ..., x^(m)) = ∏_{i=1}^{m} p(x^(i)) = ∏_{i=1}^{m} ∏_{j=1}^{n} p(x^(i)_j).    (4)

If the probability model is uniform, the jth component of the sample at t_i is distributed as

    x^(i)_j ∼ U(l_j(t_i), u_j(t_i)).    (5)

If the probability model of the jth component is normal with mean μ̃_j and variance σ̃_j², then the sample at t_i is distributed as a truncated normal distribution

    x^(i)_j ∼ TN_{l_j(t_i), u_j(t_i)}(μ̃_j, σ̃_j²).    (6)

When the time series is jointly distributed as a Gaussian process with mean function μ_j and kernel k_j for the jth component, then samples are distributed according to

    [x^(1)_j, ..., x^(m)_j] ∼ GP(μ_j, k_j, l_j(t_o), l_j(t_c), u_j(t_c)).    (7)

The inequality constraints are separated into two sets: t_o, where l_j(t_o) = u_j(t_o), and t_c, where l_j(t_c) < u_j(t_c).

IV. EXPERIMENTS
This section describes experiments where interpretable safety validation is used to find failures of an autonomous vehicle in two driving scenarios: 1) an unprotected left turn, and 2) a vehicle approaching a crosswalk with a pedestrian. The system-under-test is an AV (referred to as the ego vehicle) that follows a modified version of the intelligent driver model (IDM) [31], a vehicle-following algorithm that tries to drive at a specified velocity while avoiding collisions with leading vehicles or pedestrians. The IDM is parameterized by a desired velocity of 29 m/s, a minimum spacing, a maximum acceleration, and a comfortable braking deceleration. In the left-turn scenario, the IDM is augmented with a rules-based algorithm for navigating the intersection that predicts the behavior of oncoming traffic based on turn signals and right-of-way norms.

In both driving scenarios, we aim to find the most likely failure (a collision between the ego vehicle and another agent) of the SUT based on the models of the environment. The performance of our algorithm is evaluated on three metrics: 1) the rate of failures generated (evaluated from 500 trials), 2) the likelihood of the failure trajectories (evaluated from 500 failure trajectories), and 3) the interpretability of the STL expression. As a baseline, we use importance sampling to make finding failures more likely. The proposal distribution is chosen for each scenario separately.

In all of the experiments, the STL optimization was performed with genetic programming over a population of individuals evolved for a fixed number of generations, with fixed probabilities of reproduction, crossover, and mutation, implemented in ExprOptimization.jl.

A. Scenario 1: Unprotected Left Turn
As shown in figs. 1 to 3, the first scenario is an interactionbetween the ego vehicle (in blue), which is attempting toturn left, and one adversarial vehicle (in red) which isinitially driving straight through the intersection. The yellowdot represents a turn signal. The adversarial vehicle hasa discretized disturbance space with seven possible distur-bances: no disturbance ∅ , a medium slowdown d med , amajor slowdown d maj , a medium speedup a med , a majorspeedup a maj , toggle turn signal S , and toggle turn intention L . The medium accelerations are ± . / s and the majoraccelerations are ± . / s . The disturbances are added tothe normal behavior of the vehicle dictated by the IDM,and have an associated probability with larger disturbancesbeing less likely. The medium disturbances have probability − , the major disturbances (including toggling turn signaland turn intention) have probability − . The rest of theprobability ( . ) is assigned to no disturbance. Under thisprobability model, failures are very unlikely to occur so weuse importance sampling (IS) as a baseline. The proposaldistribution was chosen to be uniform over the disturbances,a choice that significantly increases the likelihood of failure.The simulation timestep was set to ∆ t = 0 .
18 s .Three initial conditions of the left-turn (LT) scenario wereinvestigated. Each initial condition is represented by the tuple( s ego , v ego , s adv , v adv ) where s is the distance from thecenter of the intersection and v is the vehicle velocity. Theinitial conditions and corresponding nominal behavior of theego vehicle are given below. • LT1: (
15 m , / s ,
29 m ,
10 m / s ). The nominal behav-ior is for the ego vehicle to take the left turn before theother vehicle arrives at the intersection. • LT2: (
15 m , / s ,
29 m ,
20 m / s ). The nominal behav-ior is for the ego vehicle to wait for other vehicle topass through the intersection before turning left. • LT3: (
19 m , / s ,
43 m ,
29 m / s ). The nominal behav-ior is the same as in LT2.The results for the three initial conditions are shown intable II. Trajectories that were sampled from the optimalSTL expression produced at least an order of magnitudemore failures than the rollouts from the IS distribution. Thismeans that once the optimal expression is computed, it isvery effective at generating failures. Additionally, we see thatthe average disturbance probability is much higher for the in-terpretable failures than for the IS failures. The interpretablefailures place minimal constraints on the trajectories so manyof the disturbances are selected according to the true modelof the environment, leading to the relatively high likelihoodof the trajectory. With importance sampling, the distributioncauses rare disturbances to be chosen with regularity, evenwhen they are not critical for causing a failure. Many rareevents cause the IS trajectories to have very low likelihood.Finally, we can interpret the generated failure expressions.For LT1, we see that the optimal STL expression enforces amajor acceleration in the first two timesteps of the simula-tion. This acceleration allows the adversarial vehicle to justABLE II: Comparing interpretable validation (IV) to im-portance sampling (IS) for three different initial conditionsof the left-turn scenario. Method Fail Rate Likelihood Expression
IV (LT1) . ± .
003 0 . ± . (cid:3) [0 ,. a maj IS (LT1) . ± .
008 0 . ± . -IV (LT2) . ± .
001 0 . ± . (cid:3) [0 , B IS (LT2) . ± .
63 0 . ± . -IV (LT3) . ± .
004 0 . ± . (cid:3) [0 ,. B ∨ d maj IS (LT3) . ± . . ± . - Fig. 1: Collision for LT1 at t = (1 .
08 s , .
98 s) .reach the ego vehicle as it completes its left turn as shownin fig. 1. The ego vehicle initiated the turn based on theinitial velocity of the other vehicle and did not predict sucha significant speedup. For LT2, the generated expression hasthe adversarial vehicle toggle its turn signal but continuestraight through the intersection as shown in fig. 2. Theego vehicle expects the adversarial vehicle to turn, so itinitiates the left turn and causes a collision. Lastly, for LT3,the generated expression has the adversary either turn onits turn signal or perform a major deceleration in the firsttwo timesteps of the simulation (fig. 3) leading to a similarfailure as in LT2. From these experiments, we conclude thatour approach is able to find interpretable expressions thatreliable produce likely failures of the SUT.
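Optimized descriptions of this form are easy to check against a disturbance sequence with a discrete-time monitor for the two temporal operators. The sketch below is illustrative Python, not the authors' code; the disturbance symbol names and window bounds are assumptions modeled on the scenario description.

```python
# A trace is a list of disturbance symbols, one per timestep. The symbols
# ("a_maj", "S", "d_maj", "none") mirror the scenario's disturbance
# alphabet and are illustrative.

def always(pred, trace, t1, t2):
    """True iff pred holds at every step t in [t1, t2]."""
    return all(pred(trace[t]) for t in range(t1, min(t2 + 1, len(trace))))

def eventually(pred, trace, t1, t2):
    """True iff pred holds at some step t in [t1, t2]."""
    return any(pred(trace[t]) for t in range(t1, min(t2 + 1, len(trace))))

# LT1-style description: a major speedup over the first two timesteps.
lt1 = lambda trace: always(lambda d: d == "a_maj", trace, 0, 1)
# LT3-style description: a turn-signal toggle or major slowdown early on.
lt3 = lambda trace: always(lambda d: d in ("S", "d_maj"), trace, 0, 1)

print(lt1(["a_maj", "a_maj", "none", "none"]))  # True
print(lt1(["a_maj", "none", "none"]))           # False
print(lt3(["S", "d_maj", "none"]))              # True
```

A monitor like this is what makes the descriptions useful downstream: any trajectory generator can be filtered through it to produce an unlimited supply of related failure examples for further analysis.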
B. Scenario 2: Pedestrian Crossing
As shown in figs. 4 and 5, the second driving scenario has an autonomous vehicle approach a crosswalk while a pedestrian tries to cross, as done by Koren et al. [1]. The disturbances are the pedestrian acceleration (a_x, a_y), the sensor noise on the pedestrian position (n_x, n_y), and the sensor noise on the pedestrian velocity (n_vx, n_vy). The sensor noise is modeled as independent samples from a zero-mean Gaussian with standard deviation σ_pos for the position noise in both directions and σ_vel for the velocity noise in both directions. The pedestrian acceleration in both directions was modeled as a zero-mean Gaussian process with a squared-exponential kernel. The characteristic length was chosen to be short so that the pedestrian cannot change acceleration too quickly. The standard deviation of the kernel is σ_acc. The importance sampling distribution used the same disturbance models but doubled the standard deviation parameters to make larger disturbances more likely and increase the failure rate. The comparison values in the STL expressions were sampled uniformly from symmetric ranges for the acceleration, the position noise, and the velocity noise.

Fig. 2: Collision for LT2.
Fig. 3: Collision for LT3.

For these experiments, the vehicle starts 35 m to the left of the crosswalk, and the pedestrian starts below the center of the lane, moving to cross the street. The correct behavior for the ego vehicle is to come to a safe stop before the crosswalk as the pedestrian crosses. The difference between the pedestrian crosswalk (PC) scenarios is in the choice of probability parameters. The first set of parameters (PC1) makes pedestrian motion more volatile than sensor noise, with σ_acc = 1 m/s² and smaller values of σ_pos and σ_vel. The second set (PC2) causes the noise to be a larger factor, with σ_acc = 1 m/s², σ_pos = 1 m, and σ_vel = 1 m/s.

The results of the experiments are shown in table III. In scenarios PC1 and PC2, our approach found STL expressions that produced failures an order of magnitude more often than the importance sampling approach. Additionally, the failures found via STL expressions had much higher likelihood than those found using importance sampling. In PC1, we see that the optimal STL expression prescribes a specific acceleration for the pedestrian that causes the pedestrian to enter the street and hit the car right as the car passes by the crosswalk, as shown in fig. 4. Note that the actual pedestrian position and orientation is shown in red and the vehicle's perceived position and orientation is in gray. In PC2, the expression also includes some prescribed noise that causes the AV to misjudge the location of the pedestrian, as shown in fig. 5. We conclude from these results that our approach performs well with continuous disturbance spaces and complex probability models. It is able to find STL expressions that reliably produce likelier failures than the importance sampling baseline.

TABLE III: Comparing interpretable validation (IV) to importance sampling (IS) for two different models of the pedestrian crosswalk scenario.

Method      Fail Rate   Log-Likelihood   Expression
IV (PC1)    ·           ·                □ (a_x = · ∧ a_y = −·)
IS (PC1)    ·           ·                N/A
IV (PC2)    ·           ·                □ (a_x = −· ∧ a_y = −· ∧ n_vy = ·)
IS (PC2)    ·           ·                N/A

V. CONCLUSION
We presented an approach for the interpretable safety validation of an autonomous system based on signal temporal logic. The approach uses genetic programming to optimize an STL expression so that it describes disturbance trajectories that cause an autonomous system to fail. We demonstrated the approach on two autonomous driving scenarios with discrete and continuous disturbance spaces. We also demonstrated that the technique works in the simple case of disturbances that are independent at each timestep, and in the more complex situation where the disturbances are jointly distributed as a Gaussian process. Compared to an importance sampling baseline, our approach found an order of magnitude more failures of the autonomous vehicle, and those failures had a higher likelihood. The failure descriptions helped identify flaws in the autonomous vehicle under test. Future work will include techniques for ensuring good coverage of the disturbance space, improved optimization techniques, and more complex simulators.

Fig. 4: Collision for PC1.
Fig. 5: Collision for PC2.

ACKNOWLEDGMENT

The authors gratefully acknowledge the financial support from the Stanford Center for AI Safety.

REFERENCES
[1] M. Koren, S. Alsaif, R. Lee, and M. J. Kochenderfer, "Adaptive stress testing for autonomous vehicles," in IEEE Intelligent Vehicles Symposium (IV), 2018.
[2] A. Corso, P. Du, K. Driggs-Campbell, and M. J. Kochenderfer, "Adaptive stress testing with reward augmentation for autonomous vehicle validation," in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2019.
[3] Y. Abeysirigoonawardena, F. Shkurti, and G. Dudek, "Generating adversarial driving scenarios in high-fidelity simulators," in International Conference on Robotics and Automation (ICRA), 2019.
[4] A. Donzé and O. Maler, "Robust satisfaction of temporal logic over real-valued signals," in International Conference on Formal Modeling and Analysis of Timed Systems (FORMATS), 2010.
[5] L. Mathesen, S. Yaghoubi, G. Pedrielli, and G. Fainekos, "Falsification of cyber-physical systems with robustness uncertainty quantification through stochastic optimization with adaptive restart," in IEEE Conference on Automation Science and Engineering (CASE), Aug. 2019.
[6] T. Akazaki, S. Liu, Y. Yamagata, Y. Duan, and J. Hao, "Falsification of cyber-physical systems using deep reinforcement learning," in International Symposium on Formal Methods, 2018.
[7] Q. Zhao, B. H. Krogh, and P. Hubbard, "Generating test inputs for embedded control systems," IEEE Control Systems Magazine, vol. 23, no. 4, pp. 49–57, 2003.
[8] Z. Zhang, G. Ernst, S. Sedwards, P. Arcaini, and I. Hasuo, "Two-layered falsification of hybrid systems guided by Monte Carlo tree search," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 37, no. 11, pp. 2894–2905, 2018.
[9] S. Sankaranarayanan and G. Fainekos, "Falsification of temporal properties of hybrid systems using the cross-entropy method," in ACM International Conference on Hybrid Systems: Computation and Control (HSCC), 2012.
[10] R. Lee, M. J. Kochenderfer, O. J. Mengshoel, G. P. Brat, and M. P. Owen, "Adaptive stress testing of airborne collision avoidance systems," in Digital Avionics Systems Conference (DASC), 2015.
[11] M. Koren and M. Kochenderfer, "Efficient autonomy validation in simulation with adaptive stress testing," in IEEE International Conference on Intelligent Transportation Systems (ITSC), Oct. 2019.
[12] Z. Huang, M. Arief, H. Lam, and D. Zhao, "Evaluation uncertainty in data-driven self-driving testing," in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2019.
[13] M. O'Kelly, A. Sinha, H. Namkoong, R. Tedrake, and J. C. Duchi, "Scalable end-to-end autonomous vehicle testing via rare-event simulation," in Advances in Neural Information Processing Systems (NIPS), 2018.
[14] J. Uesato, A. Kumar, C. Szepesvari, T. Erez, A. Ruderman, K. Anderson, N. Heess, P. Kohli, et al., "Rigorous agent evaluation: An adversarial approach to uncover catastrophic failures," ArXiv, 2018.
[15] D. Zhao, H. Lam, H. Peng, S. Bao, D. J. LeBlanc, K. Nobukawa, and C. S. Pan, "Accelerated evaluation of automated vehicles safety in lane-change scenarios based on importance sampling techniques," IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 3, pp. 595–607, 2016.
[16] M. Koschi, C. Pek, S. Maierhofer, and M. Althoff, "Computationally efficient safety falsification of adaptive cruise control systems," in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2019.
[17] L. Breiman, Classification and Regression Trees. Routledge, 2017.
[18] I. D. J. Rodriguez, T. Killian, S.-H. Son, and M. Gombolay, "Interpretable reinforcement learning via differentiable decision trees," ArXiv, no. 1903.09338, 2019.
[19] H. Schielzeth, "Simple means to improve the interpretability of regression coefficients," Methods in Ecology and Evolution, vol. 1, no. 2, pp. 103–113, 2010.
[20] S. L. Brunton, J. L. Proctor, and J. N. Kutz, "Discovering governing equations from data by sparse identification of nonlinear dynamical systems," Proceedings of the National Academy of Sciences, vol. 113, no. 15, pp. 3932–3937, 2016.
[21] D. Hein, A. Hentschel, T. Runkler, and S. Udluft, "Particle swarm optimization for generating interpretable fuzzy reinforcement learning policies," Engineering Applications of Artificial Intelligence, vol. 65, pp. 87–98, 2017.
[22] R. Lee, M. J. Kochenderfer, O. J. Mengshoel, and J. Silbermann, "Interpretable categorization of heterogeneous time series data," in SIAM International Conference on Data Mining, 2018.
[23] M. Vazquez-Chanlatte, J. V. Deshmukh, X. Jin, and S. A. Seshia, "Logical clustering and learning for time-series data," in International Conference on Computer Aided Verification, 2017.
[24] O. Maler and D. Nickovic, "Monitoring temporal properties of continuous signals," in Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems, Springer, 2004, pp. 152–166.
[25] C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008.
[26] M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019.
[27] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992, vol. 1.
[28] C. K. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning. MIT Press, 2006.
[29] C. Jidling, N. Wahlström, A. Wills, and T. B. Schön, "Linearly constrained Gaussian processes," in Advances in Neural Information Processing Systems (NIPS), 2017.
[30] Z. I. Botev, "The normal law under linear restrictions: Simulation and estimation via minimax tilting," Journal of the Royal Statistical Society: Series B, vol. 79, no. 1, pp. 125–148, 2017.
[31] M. Treiber, A. Hennecke, and D. Helbing, "Congested Traffic States in Empirical Observations and Microscopic Simulations,"