Grounded Relational Inference: Domain Knowledge Driven Explainable Autonomous Driving
Chen Tang, Nishan Srishankar, Sujitha Martin, Masayoshi Tomizuka
Affiliations: Honda Research Institute, CA, USA; Department of Mechanical Engineering, University of California, Berkeley, CA, USA; Department of Robotics Engineering, Worcester Polytechnic Institute, MA, USA. †Work done by Chen Tang and Nishan Srishankar during internships at Honda Research Institute.

Abstract—Explainability is essential for autonomous vehicles and other robotics systems interacting with humans and other objects during operation. Humans need to understand and anticipate the actions taken by the machines for trustful and safe cooperation. In this work, we aim to enable the explainability of an autonomous driving system at the design stage by incorporating expert domain knowledge into the model. We propose Grounded Relational Inference (GRI). It models an interactive system's underlying dynamics by inferring an interaction graph representing the agents' relations. We ensure an interpretable interaction graph by grounding the relational latent space into semantic behaviors defined with expert domain knowledge. We demonstrate that it can model interactive traffic scenarios under both simulation and real-world settings, and generate interpretable graphs explaining the vehicle's behavior by their interactions.
Index Terms—intelligent transportation system, learning from demonstration, deep learning in robotics and automation, explainable AI
I. INTRODUCTION

Deep learning has been utilized to address various autonomous driving problems [1], [2], [3]. However, deep neural networks lack the transparency that helps people understand their underlying mechanism. It is a crucial drawback for safety-critical applications with humans involved (e.g., autonomous vehicles). Humans need to understand and anticipate the actions taken by the machines for trustful and safe cooperation. In response to this problem, the concept of explainable AI (XAI) was introduced. It refers to machine learning techniques that provide details and reasons that make a model's mechanism easy to understand [4]. Most of the existing works for deep learning models focus on post-hoc explanations [4]. They enhance model explainability by unraveling the underlying mechanisms of a trained model: vision-based approaches, such as visual attention [5] and deconvolution [6], illustrate which segments of the input image affect the outputs; interaction-aware models, such as social LSTM with social attention [7], [8] and graph neural networks (GNN) with graph attention [9], [10], [11], [12], identify the agents that are critical to the decision-making procedure.

Although promising, post-hoc explanations could be ambiguous and falsely interpreted by humans because of the non-interpretable nature of deep neural networks. Unless the model is interpretable by design, it is deceiving to claim that the generated post-hoc explanation can capture the model's underlying mechanism. In this work, we aim to improve interpretability at the design stage and develop a model that can generate interpretable explanations clearly defined in human domain knowledge and operate as the explanations suggest. We consider the problem of interactive system modeling, which is the foundation behind interaction-aware prediction and control models for autonomous vehicles, and follow the practice in Neural Relational Inference (NRI) [12] to model an interactive system by explicitly inferring its inherent interactions. Similar to NRI, our model outputs an interaction graph with discrete edge variables corresponding to a cluster of pairwise interactions between the agents. However, unlike NRI, which learns the latent space in an unsupervised manner, we aim to ground it in a set of interactive behaviors defined with expert domain knowledge.

As a running example, consider the scenario depicted in Fig. 1, where we ask different models to control the red vehicle. Attention mechanisms can indicate the critical pixels or agents, but they cannot recognize different effects: the two cars are mutually important but affect each other in distinct ways. The NRI model can distinguish between different interactive behaviors. Still, the latent space does not have explicit semantic meaning. In contrast, our model should determine the interaction graph with a latent space grounded in yielding and cutting-in behaviors. It learns control policies that generate behaviors consistent with their definitions in domain knowledge (e.g., traffic rules) and executes the corresponding policies according to the inferred edge types.
As a result, we ensure a semantic interaction graph, which illustrates the model's understanding of the scenario and explains the action it takes.

A straightforward way to enable semantic relations is supervision. Interaction labels can be obtained either from human experts [13] or from heuristic labeling functions [14]. However, accurate and unbiased labels are practically prohibitive because human intentions are intricate and unobservable. Inaccurate labels could introduce bias and limit model capacity. Moreover, it is unclear if the model can understand the semantic meaning behind the labels and synthesize the right behaviors. Instead, we recast relational inference into an inverse reinforcement learning (IRL) problem and introduce structured reward functions to ground the latent space. Concretely, the system is modeled as a multi-agent Markov decision process (MDP), where the agents share a reward function that depends on the relational latent space. We design structured reward functions based on expert domain knowledge to explicitly define the interactive behaviors corresponding to the latent space.
Fig. 1: A motivating lane-changing scenario where we ask different models to control the red vehicle. All the models generate deceleration commands but have different intermediate outputs. With the aid of visual attention, we generate a heat map indicating the critical pixels of the input image. A graph attention network assigns edge weights ω_i to specify the importance of the surrounding vehicles to the controlled vehicle. However, the attention mechanisms cannot recognize different effects: the two cars are mutually important but affect each other in distinct ways. The NRI model can distinguish between different interactive behaviors by assigning different values to the latent variables z_i in the interaction graph. Still, the latent space does not have explicit semantic meaning. In contrast, our model ensures a semantic interaction graph, which illustrates the model's understanding of the scenario and explains the action it takes. It determines the interaction graph with a latent space grounded in yielding and cutting-in behaviors. It learns control policies that generate behaviors consistent with their definitions in domain knowledge (e.g., traffic rules) and executes the corresponding policies according to the inferred edge types.

Compared to direct supervision, we merely specify the function space of the reward for each type of interaction, but leave the reward parameters and the interaction graph (namely, which reward function each agent follows) to be learned from data without supervision signals.

To solve the formulated IRL problem, we propose Grounded Relational Inference (GRI). It has the variational-autoencoder-like (VAE) GNN of NRI [12] as the backbone model. Additionally, we incorporate the structured reward functions into the model as a decoder. A variational extension of the adversarial inverse reinforcement learning (AIRL) algorithm is derived to train all the modules simultaneously. Experiments show that GRI can model interactive traffic scenarios under simulation and real-world settings, and generate interpretable graphs explaining the vehicle's behavior by their interactions. Moreover, the semantically meaningful latent space enables humans to govern the model and ensure safety under unfamiliar situations.

To sum up, our contributions are as follows:
• We reformulate relational inference into a multi-agent IRL problem with a relational latent space, and introduce structured reward functions as a systematic and principled manner to incorporate expert domain knowledge into an interactive driving behavior model.
• We propose the Grounded Relational Inference model to solve the formulated multi-agent IRL problem. It learns to infer the agents' relations and model the underlying dynamics of the interactive system based on a semantic relational latent space grounded by domain knowledge.
• We apply the proposed framework to some simple traffic scenarios in both simulation and real-world settings for evaluation. We show that the interaction graphs inferred by GRI have explicit semantic meanings and therefore improve the explainability of the overall model.

II. RELATED WORK
Our model combines graph neural networks and adversarial inverse reinforcement learning for interactive system modeling. This section gives a concise review of these two topics and summarizes the existing works closely related to ours. We also discuss some additional works on explainable driving models as a complement to the discussion in Sec. I.
Interaction modeling using GNN.
GNN has been widely applied for interactive system modeling in recent years [11], [15], [16]. One category of models we find particularly interesting is those with a graph attention mechanism. One seminal work is Graph Attention Network (GAT) [10], which performed well on large-scale inductive classification problems. VAIN [9] applied attention in multi-agent modeling. The attention map unravels the interior interaction structure to some extent, which improves the explainability of VAIN. An approach closely related to ours is NRI [12], which modeled the interaction structure explicitly with a discrete relational latent space, as opposed to continuous graph attention. We explain the differences between NRI and our proposed method in Sec. I and Sec. V. Another related approach in the autonomous driving domain is [14], which also modeled interactive driving behavior with semantically meaningful interactions, but in a supervised manner.
Adversarial inverse reinforcement learning.
Our work is related to two types of IRL methods: multi-agent and latent AIRL algorithms. Yu et al. [17] proposed a multi-agent AIRL framework for Markov games under correlated equilibrium. It is capable of modeling general heterogeneous multi-agent interactions. The PS-GAIL algorithm [18] considered a multi-agent environment in the driving domain that is similar to ours (homogeneous agents with a shared policy under centralized control) and extended GAIL [19] to model the interactive behaviors. In [20], the reward in PS-GAIL was augmented as a principled way to specify prior knowledge, which shares the same spirit as the structured reward functions in GRI. Latent AIRL models integrate a VAE into either the discriminator or the generator for different purposes. Wang et al. [21] conditioned the discriminator on the embeddings generated by a VAE trained separately using behavior cloning. The VAE encodes trajectories into a low-dimensional space, enabling the generator to produce diverse behaviors from limited demonstrations. VDB [22] constrained the information contained in the discriminator's internal representation to balance the training procedure for adversarial learning algorithms. The PEMIRL framework [23] achieved meta-IRL by encoding demonstrations into a contextual latent space. Though studied in a different context, PEMIRL is conceptually similar to our framework, as both its generator and discriminator depend on the inferred context variables.
Explainable Autonomous Driving.
At the end of this section, we discuss some additional works related to explainable autonomous driving as a complement to those we have mentioned in Sec. I. They addressed some shortcomings of the discussed approaches, especially the methods based on attention mechanisms. Kim et al. [24] trained a textual explanation generator concurrently with a visual-attention-based controller in a supervised manner. It generates sentences explaining the control action as a consequence of certain objects highlighted in the attention map, which can be easily interpreted compared to visual attention. Another issue of attention that has been raised in the literature is causal confusion [25]. The model does not necessarily assign high attention weights to objects or regions that influence the control actions. In [5], a fine-grained decoder was proposed to refine visual attention maps and detect critical regions through causality tests. In [26], Li et al. adopted a similar idea for object-level reasoning. Causal inference was applied to identify risk objects in driving scenes. One interesting observation was that the detection accuracy was improved with intervention during the training stage, i.e., augmenting the training data by masking out non-causal objects. However, intervention requires explicit prior knowledge of the causal relations to label the causal and non-causal objects in a scene. Similar to intention labels, such labels are generally prohibitive to obtain due to the intricate nature of human cognition.

III. BACKGROUND
In this section, we briefly summarize two algorithms that are closely related to our approach, in order to prepare the readers for the core technical content.
A. Neural Relational Inference (NRI)
Kipf et al. [12] represent an interacting system with N objects as a complete bi-directed graph \mathcal{G}_{scene} = (\mathcal{V}, \mathcal{E}) with vertices \mathcal{V} = \{v_i\}_{i=1}^{N} and edges \mathcal{E} = \{e_{i,j} = (v_i, v_j) \mid i \neq j\}. The edge e_{i,j} refers to the one pointing from vertex v_i to v_j. Each vertex corresponds to an object in the system. The NRI model is formalized as a VAE with a GNN encoder inferring the underlying interactions and a GNN decoder synthesizing the system dynamics given the interactions.

Formally, the model aims to reconstruct a given state trajectory, denoted by x = (x^0, \ldots, x^{T-1}), where T is the number of timesteps and x^t = \{x^t_1, \ldots, x^t_N\}. The vector x^t_i \in \mathbb{R}^n denotes the state of object v_i at time t. Alternatively, the trajectory can be decomposed into x = (x_1, \ldots, x_N), where x_i = \{x^0_i, \ldots, x^{T-1}_i\}. The encoder operates over \mathcal{G}_{scene}, with x_i as the node feature of v_i. It infers the posterior distribution of the edge type z_{i,j} for all the edges, collected into a single vector z. The decoder operates over an interaction graph \mathcal{G}_{interact} and reconstructs x. The graph \mathcal{G}_{interact} is constructed by assigning the sampled z to the edges of \mathcal{G}_{scene} and assigning the initial state to the nodes of \mathcal{G}_{scene}. If \mathcal{G}_{interact} represents the interactions sufficiently, the decoder should be able to reconstruct the trajectory accurately.

The model is trained by maximizing the evidence lower bound (ELBO):

\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\gamma(x|z)] - D_{KL}[q_\phi(z|x) \,\|\, p(z)],

where q_\phi(z|x) is the encoder output, which can be factorized as

q_\phi(z|x) = \prod_{i=1}^{N} \prod_{j=1, j \neq i}^{N} q_\phi(z_{i,j}|x),   (1)

where \phi refers to the parameters of the encoder. The decoder output p_\gamma(x|z) can be written as

p_\gamma(x|z) = \prod_{t=0}^{T-2} p_\gamma(x^{t+1}|x^t, \ldots, x^0, z),

where \gamma refers to the parameters of the decoder.
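To make the training objective concrete, the sketch below assembles the negative ELBO with a Gumbel-softmax relaxation of the discrete edge variables, as used in the NRI paper. The module interfaces (`encoder`, `decoder`) and tensor shapes are our own illustrative assumptions, not the authors' implementation, and the Gaussian reconstruction term is taken up to a constant.

```python
import torch
import torch.nn.functional as F

def nri_negative_elbo(encoder, decoder, x, prior_logits, tau=0.5):
    """Negative ELBO for an NRI-style VAE over discrete edge types.

    x:            (batch, N, T, n) state trajectories.
    prior_logits: (K,) log-probabilities of the edge-type prior p(z),
                  e.g., a sparse prior favoring the "no interaction" type.
    """
    # Encoder: posterior logits of q_phi(z_ij | x) for every directed edge.
    edge_logits = encoder(x)                       # (batch, N*(N-1), K)

    # Differentiable sample of the discrete edge types (Gumbel-softmax).
    z = F.gumbel_softmax(edge_logits, tau=tau, hard=False)

    # Decoder: predict x^{t+1} from x^t given the sampled interaction graph.
    x_pred = decoder(x[..., :-1, :], z)            # (batch, N, T-1, n)
    # Gaussian log-likelihood up to a constant, i.e., log p_gamma(x | z).
    recon = -F.mse_loss(x_pred, x[..., 1:, :], reduction="sum")

    # KL divergence between the factorized edge posterior and the prior.
    log_q = F.log_softmax(edge_logits, dim=-1)
    kl = (log_q.exp() * (log_q - prior_logits)).sum()

    return -(recon - kl)
```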
B. Adversarial Inverse Reinforcement Learning (AIRL)

The AIRL algorithm follows the principle of maximum entropy IRL [27]. Consider an MDP defined by (\mathcal{X}, \mathcal{A}, \mathcal{T}, r), where \mathcal{X} and \mathcal{A} are the state space and action space respectively. In the rest of the paper, we use x and a with any superscript or subscript to represent a state and an action in \mathcal{X} and \mathcal{A}. \mathcal{T} is the transition operator given by x^{t+1} = f(a^t, x^t) (the transition is assumed deterministic to simplify the notation; a more general form of the algorithm can be derived for stochastic systems, which is essentially the same as the deterministic case), and r: \mathcal{X} \times \mathcal{A} \to \mathbb{R} is the reward function. The maximum entropy IRL framework assumes a suboptimal expert policy \pi_E(a|x). The demonstration trajectories generated with the expert policy, \mathcal{D}_E = \{\tau^E_1, \ldots, \tau^E_M\} where \tau^E_i = (x^{E,0}_i, a^{E,0}_i, \ldots, x^{E,T-1}_i, a^{E,T-1}_i), have probabilities increasing exponentially with the cumulative reward. Concretely, they follow a Boltzmann distribution:

\tau^E_i \sim \pi_E(\tau) = \frac{1}{Z} \exp\left(\sum_{t=0}^{T-1} r_\lambda(x^t, a^t)\right),

where r_\lambda is the reward function with parameters denoted by \lambda. Maximum entropy IRL aims to infer the underlying reward function parameters of the expert policy. It is formalized as a maximum likelihood problem:

\lambda^* = \arg\max_\lambda \mathbb{E}_{\tau^E \sim \pi_E(\tau)}\left[\sum_{t=0}^{T-1} r_\lambda(x^E_t, a^E_t)\right] - \log Z.

To derive a feasible algorithm to solve the problem, we need to estimate the partition function Z. One practical solution is co-training a policy model with the currently estimated reward function through reinforcement learning [28]. Finn et al. [29] found the equivalence between this procedure and a special form of generative adversarial network (GAN). The policy model is the generator, whereas a structured discriminator is defined with the reward function to distinguish a generated trajectory \tau^G from a demonstrated one \tau^E. Fu et al. [30] proposed the AIRL algorithm based on it, using a discriminator that identifies generated samples based on pairs of states and actions instead of the entire trajectory, in order to reduce variance:

D_{\lambda,\eta}(x, a) = \frac{\exp\{r_\lambda(x, a)\}}{\exp\{r_\lambda(x, a)\} + \pi_\eta(a|x)},   (2)

where \pi_\eta(a|x) is the policy model with parameters denoted by \eta. The models D_{\lambda,\eta} and \pi_\eta are trained adversarially by solving the following min-max optimization problem:

\min_\eta \max_\lambda \; \mathbb{E}_{x^E, a^E \sim \pi_E(x,a)}\left[\log D_{\lambda,\eta}(x^E, a^E)\right] + \mathbb{E}_{x^G, a^G \sim \pi_\eta(x,a)}\left[\log\left(1 - D_{\lambda,\eta}(x^G, a^G)\right)\right],   (3)

where \pi_E(x, a) denotes the distribution of states and actions induced by the expert policy, and \pi_\eta(x, a) is the distribution induced by the learned policy.
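In practice, the discriminator in Eqn. (2) never has to be materialized as a separate network: given the reward value and the policy's log-density, it reduces to a logistic function of r_\lambda(x, a) - \log \pi_\eta(a|x). A minimal sketch (function and variable names are ours):

```python
import torch.nn.functional as F

def airl_logit(reward, log_pi):
    """AIRL discriminator of Eqn. (2) in logit form: since
    D = exp(r) / (exp(r) + pi), we have logit(D) = r - log pi."""
    return reward - log_pi

def airl_discriminator_loss(reward_E, log_pi_E, reward_G, log_pi_G):
    """Objective of Eqn. (3): -log D on expert pairs, -log(1 - D) on
    generated pairs, using the stable identities
    log D = -softplus(-logit) and log(1 - D) = -softplus(logit)."""
    return (F.softplus(-airl_logit(reward_E, log_pi_E)).mean()
            + F.softplus(airl_logit(reward_G, log_pi_G)).mean())
```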
IV. PROBLEM FORMULATION

Our GRI model grounds the relational latent space in a cluster of interpretable interactions by reformulating the relational inference problem into a multi-agent IRL problem. Since the framework has the potential to be generalized to interactive systems in other domains apart from autonomous driving, we introduce our approach in a general tone. However, it should be noted that we limit our discussion in this paper to autonomous driving problems, without claiming that the method can be directly applied to other domains. GRI relies on expert domain knowledge to identify all possible semantic behaviors and design the corresponding reward functions. There exists a broad range of literature on interactive driving behavior modeling [13], [31], which we can refer to when designing the rewards. We can extend the proposed framework to other fields if proper domain knowledge is available, which is left for future investigation.

We start with modeling the interactive system as a multi-agent MDP with a graph representation. As in NRI, the system has an underlying interaction graph \mathcal{G}_{interact}. The discrete latent variable z_{i,j} takes a value from \{0, 1, \ldots, K-1\}, where K is the number of interaction types. It indicates the type of relation between v_i and v_j with respect to its effect on v_j. Additionally, we assume the objects of the system are homogeneous intelligent agents who make decisions based on their interactions with others.

Concretely, each of them is modeled with identical state space \mathcal{X}, action space \mathcal{A}, transition operator \mathcal{T}, and reward function r: \mathcal{X} \times \mathcal{A} \to \mathbb{R}. At time step t, the reward of agent v_j depends on the states and actions of itself and the pairwise interactions between itself and all its neighbors:

r_{\xi,\psi}(v^t_j, z_j) = r^n_\xi(x^t_j, a^t_j) + \sum_{i \in \mathcal{N}_j} \sum_{k=1}^{K-1} \mathbb{1}(z_{i,j} = k)\, r^{e,k}_{\psi_k}(x^t_i, a^t_i, x^t_j, a^t_j),   (4)

where z_j is the collection of \{z_{i,j}\}_{i \in \mathcal{N}_j}, r^n_\xi is the node reward function parameterized by \xi, \mathcal{N}_j is the set of v_j's neighbouring nodes, \mathbb{1} is the indicator function, and r^{e,k}_{\psi_k} is the edge reward function parameterized by \psi_k for the k-th type of interaction. We utilize expert domain knowledge to design r^{e,k}_{\psi_k}, so that the corresponding interactive behavior emerges by maximizing the rewards. In particular, the edge reward equals zero for k = 0, indicating that the action taken by v_j does not depend on its interaction with v_i.

We assume the agents act cooperatively to maximize the cumulative reward of the system:

R_{\xi,\psi}(\tau, z) = \sum_{t=0}^{T-1} r_{\xi,\psi}(x^t, a^t, z) = \sum_{t=0}^{T-1} \sum_{j=1}^{N} r_{\xi,\psi}(v^t_j, z_j),

with a joint policy denoted by \pi_\eta(a^t | x^t, z). The cooperative assumption is not necessarily valid for generic traffic scenarios [17], but it simplifies the training procedure significantly. We will leave the extension of the proposed method to non-cooperative interactive traffic scenarios as future work.

Given a demonstration dataset, we aim to infer the underlying reward function and policy. Different from a typical IRL problem, both r_{\xi,\psi} and \pi_\eta depend on z. Therefore, we need to infer the distribution p(z|\tau) to solve the IRL problem.
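A sketch of how the structured reward in Eqn. (4) can be evaluated for one agent, assuming the node and edge reward functions are given as callables; the k = 0 edge type contributes nothing by construction:

```python
def agent_reward(j, x, a, z, node_reward, edge_rewards, neighbors):
    """Evaluate Eqn. (4) for agent v_j at a single time step.

    x, a:         dicts mapping agent index -> state / action at time t.
    z:            dict mapping directed edge (i, j) -> type in {0, ..., K-1}.
    node_reward:  callable r^n(x_j, a_j).
    edge_rewards: list [None, r^{e,1}, ..., r^{e,K-1}]; type 0 means
                  "no interaction" and contributes zero reward.
    """
    r = node_reward(x[j], a[j])
    for i in neighbors[j]:
        k = z[(i, j)]
        if k > 0:  # the indicator 1(z_ij = k) selects exactly one edge reward
            r += edge_rewards[k](x[i], a[i], x[j], a[j])
    return r
```

Summing this quantity over agents and time steps gives the cumulative system reward R_{\xi,\psi}(\tau, z) above.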
V. GROUNDED RELATIONAL INFERENCE

We now present the Grounded Relational Inference model to solve the IRL problem specified in Sec. IV. The model consists of three modules modeled by message-passing GNNs [32]: an encoder inferring the posterior distribution of edge types, a policy decoder generating control actions conditioned on the edge variables sampled from the posterior distribution, and a reward decoder modeling the rewards conditioned on the inferred edge types.
A. Architecture
The overall model structure is illustrated in Fig. 2. Given a demonstration trajectory \tau^E \in \mathcal{D}_E, the encoder operates over \mathcal{G}_{scene} and approximates the posterior distribution p(z|\tau^E) with q_\phi(z|\tau^E). The policy decoder operates over a \mathcal{G}_{interact} sampled from the inferred q_\phi(z|\tau^E) and models the policy \pi_\eta(a^t|x^t, z). Given an initial state, we can generate a trajectory by sequentially sampling a^t from \pi_\eta(a^t|x^t, z) and propagating the state. The state is propagated with either the transition operator \mathcal{T} if given, or a simulating environment if \mathcal{T} is not accessible. We denote a generated trajectory given the initial state of \tau^E as \tau^G. Since these two modules are essentially the same as in NRI, we omit the detailed model structures here and include them in Appx. VIII-A.

The reward decoder computes the reward of a state-action pair given the sampled edge variables. We use it to compute the cumulative rewards of \tau^G and \tau^E conditioned on the sampled \mathcal{G}_{interact}. The reward decoder is in the form of Eqn. (4). Additionally, we augment the functions r^n_\xi and r^{e,k}_{\psi_k} with MLP shaping terms to mitigate the reward shaping effect [30], resulting in:

f^n_{\xi,\omega}(x^t_j, a^t_j, x^{t+1}_j) = r^n_\xi(x^t_j, a^t_j) + h^n_\omega(x^{t+1}_j) - h^n_\omega(x^t_j),

and

f^{e,k}_{\psi_k,\chi_k}(x^t_i, a^t_i, x^{t+1}_i, x^t_j, a^t_j, x^{t+1}_j) = r^{e,k}_{\psi_k}(x^t_i, a^t_i, x^t_j, a^t_j) + h^{e,k}_{\chi_k}(x^{t+1}_i, x^{t+1}_j) - h^{e,k}_{\chi_k}(x^t_i, x^t_j),

where h^n_\omega and h^{e,k}_{\chi_k} are MLPs with parameters denoted by \omega and \chi_k respectively. We denote the shaped reward function of agent v_j by f_{\xi,\omega,\psi,\chi}(x^t, a^t, x^{t+1}, z), which is defined as in Eqn. (4) but with r^n_\xi and r^{e,k}_{\psi_k} substituted by the augmented rewards. The shaped reward function, together with the policy model, defines the discriminator which distinguishes \tau^G from \tau^E:

D_{\xi,\omega,\psi,\chi,\eta}(x^t, a^t, x^{t+1}, z) = \frac{\exp\{f_{\xi,\omega,\psi,\chi}(x^t, a^t, x^{t+1}, z)\}}{\exp\{f_{\xi,\omega,\psi,\chi}(x^t, a^t, x^{t+1}, z)\} + \pi_\eta(a^t | x^t, z)}.
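Both shaping terms are potential-based corrections: a learned potential evaluated at the next state minus the same potential at the current state. A minimal sketch, with `h` standing in for the MLPs h^n_\omega and h^{e,k}_{\chi_k}:

```python
def shaped_node_reward(r_node, h, x_t, a_t, x_next):
    """f^n = r^n(x^t, a^t) + h(x^{t+1}) - h(x^t): a potential-based term
    that absorbs shaping ambiguity without changing the optimal policy."""
    return r_node(x_t, a_t) + h(x_next) - h(x_t)

def shaped_edge_reward(r_edge, h, xi_t, ai_t, xi_next, xj_t, aj_t, xj_next):
    """Edge counterpart: the potential h acts on the pair of node states."""
    return r_edge(xi_t, ai_t, xj_t, aj_t) + h(xi_next, xj_next) - h(xi_t, xj_t)
```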
B. Training

We aim to train the three modules simultaneously. Consequently, we incorporate the encoder model q_\phi(z|\tau^E) into the objective function of AIRL, resulting in the optimization problem (6). The encoder is integrated into the minimization problem because the reward function has a direct dependence on the latent space. The model is then trained by solving problem (6) in an adversarial scheme: we alternate between training the encoder and reward for the minimization problem, and training the policy for the maximization problem. Specifically, the objective for the encoder and reward is the following minimization problem given fixed \eta:

\min_{\xi,\omega,\psi,\chi,\phi} J(\xi,\omega,\psi,\chi,\phi,\eta) \quad \text{s.t.} \quad \mathbb{E}\{D_{KL}[q_\phi(z|\tau^E) \,\|\, p(z)]\} \leq I_c.   (5)

The objective for the policy is maximizing J(\xi,\omega,\psi,\chi,\phi,\eta) with fixed \xi, \omega, \psi, \chi, and \phi.

The objective function in problem (6) is essentially the expectation of the objective function in problem (3) over the inferred posterior distribution q_\phi(z|\tau^E) and the demonstration distribution \pi_E(\tau). The constraint enforces an upper bound I_c on the KL-divergence between q_\phi(z|\tau^E) and the prior distribution p(z). A sparse prior is chosen to encourage sparsity in \mathcal{G}_{interact}. It has a similar regularization effect as the D_{KL} term in the ELBO. We borrow its format from the variational discriminator bottleneck (VDB) [22]. VDB improves adversarial training by constraining the information flow from the input to the discriminator. The KL-divergence constraint is derived as a variational approximation to the information bottleneck [33]. Although having a different motivation, we adopt it for two reasons. First, the proposed model is not generative, because our goal is not synthesizing trajectories from the prior p(z) but inferring the posterior p(z|\tau^E). Therefore, regularization derived from the information bottleneck is more sensible compared to the ELBO. Second, the constrained problem (5) can be relaxed by introducing a Lagrange multiplier \beta. During training, \beta is updated through dual gradient descent as follows:

\beta \leftarrow \max\left(0, \beta + \alpha_\beta\left(\mathbb{E}\{D_{KL}[q_\phi(z|\tau^E) \,\|\, p(z)]\} - I_c\right)\right).   (7)

We find the adaptation scheme particularly advantageous. The model can focus on inferring z for reward learning after satisfying the sparsity constraint, because the magnitude of \beta decreases towards zero once the constraint is satisfied. However, it is worth noting that our framework does not rely on the bottleneck constraint to induce an interpretable latent space as in [34]. In contrast, GRI relies on the structured reward functions to ground the latent space into semantic interactive behaviors. The bottleneck serves as a regularization to find the minimal interaction graph that represents the interactions. In fact, we trained the baseline NRI models with the same constraint and weight update scheme. The experimental results show that the constraint itself is not sufficient to induce a sparse and interpretable interaction graph.

In general, when the dynamics \mathcal{T} is unknown or non-differentiable, maximum entropy RL algorithms [35] are adopted to optimize the policy. In this work, we assume known and differentiable dynamics, which is a reasonable assumption for the investigated scenarios. It allows us to directly backpropagate through the trajectory for gradient estimation, which simplifies the training procedure.
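The multiplier update of Eqn. (7) is plain dual gradient ascent with a projection onto \beta \geq 0; a one-line sketch, where `alpha_beta` is the dual step size:

```python
def update_beta(beta, kl_estimate, i_c, alpha_beta):
    """Dual update of Eqn. (7): beta grows while the constraint
    E[KL] <= I_c is violated, and decays toward zero once satisfied."""
    return max(0.0, beta + alpha_beta * (kl_estimate - i_c))
```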
VI. EXPERIMENTS

We evaluated the proposed GRI model on a synthetic dataset as well as a naturalistic traffic dataset. The synthetic data were generated using policy models trained given the ground-truth reward function and interaction graph. We intend to verify if GRI can induce an interpretable relational latent space and infer the underlying relations precisely. The naturalistic traffic data were extracted from the NGSIM dataset. We aim to validate if GRI can model real-world traffic scenarios effectively with the grounded interpretable latent space. Unlike synthetic agents, we do not have the privilege to access the graphs governing human drivers' interactions. Instead, we constructed hypothetical graphs after analyzing the segmented data. The hypotheses reflect humans' understanding of the traffic scenarios. We would like to see if GRI can model real-world interactive systems in the same way as humans. We claim the model is interpretable if the inferred interaction graphs are consistent with the hypotheses. In each setting, we consider two traffic scenarios, car-following (CF) and lane-changing (LC).

Fig. 2: Architecture of the grounded relational inference model. Given a demonstration trajectory \tau^E \in \mathcal{D}_E, the encoder operates over \mathcal{G}_{scene} and approximates the distribution p(z|\tau^E) with q_\phi(z|\tau^E). The policy decoder operates over a \mathcal{G}_{interact} sampled from the inferred q_\phi(z|\tau^E) and models the policy \pi_\eta(a^t|x^t, z). Given the initial state of \tau^E, we sample a trajectory \tau^G by sequentially sampling a^t from \pi_\eta(a^t|x^t, z) and propagating the state. Finally, we use the reward GNN to compute the cumulative rewards of \tau^G and \tau^E conditioned on the sampled \mathcal{G}_{interact}.

The overall min-max problem is:

\max_\eta \min_{\xi,\omega,\psi,\chi,\phi} J(\xi,\omega,\psi,\chi,\phi,\eta) = \mathbb{E}_{\tau^E \sim \pi_E(\tau)}\left\{\mathbb{E}_{z \sim q_\phi(z|\tau^E)}\left[-\sum_{t=0}^{T-2}\log D_{\xi,\omega,\psi,\chi,\eta}(x^{E,t}, a^{E,t}, x^{E,t+1}, z) - \mathbb{E}_{\tau^G \sim \pi_\eta(\tau|z)}\sum_{t=0}^{T-2}\log\left(1 - D_{\xi,\omega,\psi,\chi,\eta}(x^{G,t}, a^{G,t}, x^{G,t+1}, z)\right)\right]\right\},
\text{s.t.} \; \mathbb{E}_{\tau^E \sim \pi_E(\tau)}\{D_{KL}[q_\phi(z|\tau^E) \,\|\, p(z)]\} \leq I_c.   (6)
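One training iteration for problem (6) alternates a discriminator step (encoder + reward) and a policy step, followed by the dual update of Eqn. (7). The sketch below is schematic only: the wrappers `encoder`, `reward_gnn`, and `policy` with the methods shown are our own assumed interfaces around the three GNN modules, not the paper's code.

```python
def gri_training_step(batch, encoder, reward_gnn, policy,
                      disc_opt, policy_opt, beta, i_c, alpha_beta):
    """One alternating update for the min-max problem (6). Schematic."""
    # Discriminator step: update encoder + shaped reward, minimizing J + beta*KL.
    z, kl = encoder.sample_posterior(batch.expert_traj)      # z ~ q_phi(z | tau^E)
    gen_traj = policy.rollout(batch.initial_state, z)        # tau^G ~ pi_eta(. | z)
    j = -(reward_gnn.log_d(batch.expert_traj, z, policy).sum(dim=-1)
          + reward_gnn.log_one_minus_d(gen_traj, z, policy).sum(dim=-1)).mean()
    disc_opt.zero_grad()
    (j + beta * kl.mean()).backward()
    disc_opt.step()

    # Policy step: maximize J over eta, i.e., minimize log(1 - D) on rollouts.
    z, _ = encoder.sample_posterior(batch.expert_traj)
    gen_traj = policy.rollout(batch.initial_state, z.detach())
    policy_loss = reward_gnn.log_one_minus_d(
        gen_traj, z.detach(), policy).sum(dim=-1).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Dual update of the bottleneck multiplier, Eqn. (7).
    return max(0.0, beta + alpha_beta * (kl.mean().item() - i_c))
```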
A. Baselines

The main question of interest is whether GRI can induce interpretable and semantic interaction graphs. To answer this question, the most important baseline model for comparison is NRI, because GRI shares the same prior distribution of latent variables with NRI. Comparing the posterior distributions provides insights into whether the structured reward functions can ground the latent space into semantic interactive behaviors. In each experiment, the baseline NRI model has the same encoder and policy decoder as the GRI model. Besides, as stated in Sec. V, the same bottleneck constraint and weight update scheme in Eqn. (7) were applied as regularization for a minimal representation.

Another model for comparison is a supervised policy decoder. We assume that the ground-truth graphs or human hypotheses are available. Therefore, we can directly train a policy decoder in a supervised way. The ground-truth graph is fed to the policy decoder as a substitute for the interaction graph sampled from the encoder output q_\phi(z|\tau^E). The training of the decoder becomes a regression problem. We used the mean square error as the loss function to train it.

As additional information is granted, it is unfair to directly compare the performance of GRI with the supervised policy model. However, the supervised baseline provides some useful insights. Since the supervised model is trained with the ground-truth interaction graphs governing the systems, it is expected to achieve smaller reconstruction errors. However, as we argue in Sec. I, even if interaction labels are available, the supervised model is not guaranteed to understand the semantic meaning behind the labels and synthesize the right behaviors. We demonstrate the advantage of GRI over both baselines on this problem in some simple out-of-distribution experiments. The details are discussed in Sec. VI-E. Additionally, in the naturalistic traffic scenarios, the supervised model gives us some insight into whether the human hypotheses are reasonable. If the supervised model can reconstruct the trajectories precisely, it justifies our practice of adopting graph accuracy as one of the evaluation metrics.
Fig. 3: Test scenarios with the underlying interaction graphs. In the synthetic scenarios, the graphs are the ground-truth ones governing the synthetic experts. In the naturalistic traffic scenarios, the graphs are human hypotheses reflecting humans' understanding of the traffic scenarios.

There exist other alternatives for the purpose of trajectory reconstruction. However, it is not our goal in this paper to find an expressive model for accurate reconstruction. Therefore, we do not consider other baselines from this perspective. For the task of grounding the latent space into semantic interactive driving behaviors, we did not find any exact alternatives in the literature. There could exist some heuristic or rule-based approaches to directly determine an interpretable interaction graph, especially for the specific scenarios studied in this paper. However, they are not within the scope of discussion for this paper, because we are interested in a data-driven framework that can be integrated into a learning-based autonomous driving model and has the potential to be generalized to complicated driving scenarios and systems in other domains.
B. Evaluation Metrics
To evaluate a trained model, we sample a \tau^E from the test dataset and extract the maximum a posteriori (MAP) estimate of the edge variables, \hat{z}, from q_\phi(z|\tau^E). Afterward, we obtain a single sample of trajectories \hat{\tau} by executing the mean value of the policy output. The root mean square errors (RMSE) of the states and the accuracy of \mathcal{G}_{interact} are selected as the evaluation metrics, which are computed based on \hat{z}, \hat{\tau}, \tau^E, and the ground-truth or hypothetical latent variables denoted by z^E:

\text{RMSE}_\epsilon = \sqrt{\frac{1}{NT}\sum_{j=1}^{N}\sum_{t=0}^{T-1}\left(\epsilon^{E,t}_j - \hat{\epsilon}^t_j\right)^2}, \qquad \text{Accuracy} = \frac{\sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} \mathbb{1}(z^E_{i,j} = \hat{z}_{i,j})}{N(N-1)}.

If multiple edge types exist, we test all the possible permutations of edge types and report the one with the highest graph accuracy for NRI.
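Both metrics, including the permutation search used for NRI, are straightforward to compute; a sketch (whether the null edge type is included in the permutation is our own assumption):

```python
import itertools
import numpy as np

def rmse(traj_true, traj_pred):
    """RMSE over N agents and T steps for one state channel."""
    return np.sqrt(np.mean((traj_true - traj_pred) ** 2))

def graph_accuracy(z_true, z_hat):
    """Fraction of correctly inferred off-diagonal edge variables."""
    n = z_true.shape[0]
    mask = ~np.eye(n, dtype=bool)
    return np.mean(z_true[mask] == z_hat[mask])

def best_permutation_accuracy(z_true, z_hat, num_types):
    """For NRI, edge-type labels are arbitrary: report the best accuracy
    over permutations of the non-null edge types (type 0 held fixed here,
    which is an assumption on our part)."""
    best = 0.0
    for perm in itertools.permutations(range(1, num_types)):
        relabel = np.array([0, *perm])
        best = max(best, graph_accuracy(z_true, relabel[z_hat]))
    return best
```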
C. Synthetic Scenes
As mentioned above, we designed two synthetic scenarios, car-following and lane-changing. The two scenes and their underlying interaction graphs are illustrated in Fig. 3. In both scenarios, we have a leading vehicle whose behavior does not depend on the others. Its trajectory is given without the need of reconstruction. We simply assume it runs at constant velocity. The other vehicles interact with each other and the leading one in different ways. In CF, we model the system with two types of edges: z_{i,j} = 1 means that Vehicle j follows Vehicle i; z_{i,j} = 0 means that Vehicle j does not interact with Vehicle i. In LC, two additional edge types are introduced: z_{i,j} = 2 means that Vehicle j yields to Vehicle i; z_{i,j} = 3 means that Vehicle j cuts in front of Vehicle i.

The MDPs for the tested scenarios are specified as follows. In CF, since the vehicles mainly interact in the longitudinal direction, we only model their longitudinal dynamics to simplify the problem. For all j \in \{0, 1, 2\}, the state vector of Vehicle j consists of three states: x^t_j = [x^t_j \; v^t_j \; a^t_j]^\top, where x^t_j is the longitudinal coordinate, v^t_j is the velocity, and a^t_j is the acceleration. There is only one control input, which is the jerk. We denote it as \delta a^t_j. The dynamics is governed by a 1D point-mass model:

x^{t+1}_j = x^t_j + v^t_j \Delta t + \frac{1}{2} a^t_j \Delta t^2,
v^{t+1}_j = v^t_j + a^t_j \Delta t,
a^{t+1}_j = a^t_j + \delta a^t_j \Delta t,

where \Delta t is the sampling time. In LC, we consider both longitudinal and lateral motions. The state vector consists of six states instead: x^t_j = [x^t_j \; y^t_j \; v^t_j \; \theta^t_j \; a^t_j \; \omega^t_j]^\top. The three additional states are the lateral coordinate y^t_j, the yaw angle \theta^t_j, and the yaw rate \omega^t_j. There is one additional action, which is the yaw acceleration, denoted by \delta\omega^t_j. We model the vehicle as a Dubins' car:

x^{t+1}_j = x^t_j + v^t_j \cos(\theta^t_j)\Delta t,
y^{t+1}_j = y^t_j + v^t_j \sin(\theta^t_j)\Delta t,
v^{t+1}_j = v^t_j + a^t_j \Delta t,
\theta^{t+1}_j = \theta^t_j + \omega^t_j \Delta t,
a^{t+1}_j = a^t_j + \delta a^t_j \Delta t,
\omega^{t+1}_j = \omega^t_j + \delta\omega^t_j \Delta t.
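Both transition models are cheap to roll out directly; a sketch following the state layouts and update equations above:

```python
import numpy as np

def step_point_mass_1d(state, jerk, dt):
    """CF dynamics: state = [x, v, a], control = jerk (delta a)."""
    x, v, a = state
    return np.array([x + v * dt + 0.5 * a * dt ** 2,
                     v + a * dt,
                     a + jerk * dt])

def step_dubins(state, jerk, yaw_acc, dt):
    """LC dynamics: state = [x, y, v, theta, a, omega],
    controls = (delta a, delta omega)."""
    x, y, v, th, a, om = state
    return np.array([x + v * np.cos(th) * dt,
                     y + v * np.sin(th) * dt,
                     v + a * dt,
                     th + om * dt,
                     a + jerk * dt,
                     om + yaw_acc * dt])
```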
The structured reward functions were designed based on expert domain knowledge (e.g., transportation studies [31], [36]). We mainly referred to [13] in this paper. For the car-following behavior, its reward function is defined as follows:

r^{e,1}_{\psi_1}(x^t_i, x^t_j) = -(1 + \exp(\psi_{1,1}))\, g_{IDM}(x^t_i, x^t_j) - (1 + \exp(\psi_{1,2}))\, g_{dist}(x^t_i, x^t_j) - (1 + \exp(\psi_{1,3}))\, g_{lat}(x^t_i, x^t_j),

where the features are defined as:

g_{IDM}(x^t_i, x^t_j) = \left(\max(x^t_i - x^t_j, 0) - \Delta x^{IDM,t}_{i,j}\right)^2,   (8)
g_{dist}(x^t_i, x^t_j) = \exp\left(-\frac{\left(\max(x^t_i - x^t_j, 0)\right)^2}{\zeta}\right),   (9)
g_{lat}(x^t_i, x^t_j) = \left(y^t_j - g_{center}(y^t_i)\right)^2.

The feature g_{IDM} suggests a spatial headway \Delta x^{IDM,t}_{i,j} derived from the intelligent driver model (IDM) [31]. The feature g_{dist} ensures a minimum collision-free distance. We penalize the following vehicle for surpassing the preceding one with the \max(\cdot, 0) operations in Eqn. (8) and Eqn. (9). The last feature g_{lat} exists only in LC. It regulates the following vehicle to stay in the same lane as the preceding one with the help of g_{center}, which determines the lateral coordinate of the corresponding centerline based on the position of the preceding vehicle.

The reward function for yielding is defined as:

r^{e,2}_{\psi_2}(x^t_i, x^t_j) = -(1 + \exp(\psi_{2,1}))\, g_{yield}(x^t_i, x^t_j) - (1 + \exp(\psi_{2,2}))\, g_{dist}(x^t_i, x^t_j).

The feature g_{dist} is defined in Eqn. (9). The other feature g_{yield} suggests an appropriate spatial headway for yielding:

g_{yield}(x^t_i, x^t_j) = \mathbb{1}\left(g_{center}(y^t_j) = g_{center}(y^t_i)\right) g_{IDM}(x^t_i, x^t_j) + \mathbb{1}\left(g_{center}(y^t_j) \neq g_{center}(y^t_i)\right) g_{goal}(x^t_i, x^t_j),
g_{goal}(x^t_i, x^t_j) = \left(\max(x^t_i - x^t_j - \Delta x_{yield}, 0)\right)^2.   (10)

The suggested headway is set to a constant value, \Delta x_{yield}, when the other vehicle is merging, and switches to \Delta x^{IDM,t}_{i,j} once the merging vehicle enters the same lane, where its behavior becomes consistent with car following.

The reward function for cutting-in is quite similar:

r^{e,3}_{\psi_3}(x^t_i, x^t_j) = -(1 + \exp(\psi_{3,1}))\, g_{goal}(x^t_j, x^t_i) - (1 + \exp(\psi_{3,2}))\, g_{dist}(x^t_j, x^t_i),

where the features are defined as in Eqn. (9) and Eqn. (10), but with the input arguments switched, because the merging vehicle should stay in front of the yielding one.

Apart from the edge rewards, all the agents share the same node reward function. The following one is adopted for LC:

r^n_\xi(x^t_j, a^t_j) = -(1 + \exp(\xi_1))\, f_v(x^t_j) - (1 + \exp(\xi_2))^\top f_{state}(x^t_j) - (1 + \exp(\xi_3))^\top f_{action}(a^t_j) - (1 + \exp(\xi_4))\, f_{lane}(x^t_j),

where f_{state} and f_{action} take the element-wise squares of [a^t_j \; \theta^t_j \; \omega^t_j] and [\delta a^t_j \; \delta\omega^t_j] respectively. The feature f_v is the squared error between v^t_j and the speed limit v_{lim}. The last term f_{lane} penalizes the vehicle for staying close to the lane boundaries. For CF, we simply remove the terms that are irrelevant to 1D motion. In all the reward functions, the parameters collected in \psi and \xi are unknown during training and inferred by GRI. We take the exponential of each parameter and add one to the result, which enforces the model to use the features with strictly positive weights when modeling the corresponding interactions.

With the scenarios defined above, we aim to generate one dataset for each scenario. For each scenario, we randomly sampled the initial states of the vehicles and trained an expert policy given the ground-truth reward functions and the interaction graph. Afterwards, we used the trained policy to generate the dataset. The same sampling scheme was used to sample the initial states.
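For concreteness, the CF features (8)-(9) can be written as below. The desired-gap formula inside `idm_gap` is the standard IDM expression, which is our assumption here since the text only cites [31] for it, and the parameters (s0, t_hw, a_max, b, zeta) are illustrative:

```python
import numpy as np

def idm_gap(v_follow, v_lead, s0=2.0, t_hw=1.5, a_max=1.5, b=2.0):
    """Desired spatial headway from the intelligent driver model (assumed form)."""
    dv = v_follow - v_lead
    return s0 + max(0.0, v_follow * t_hw + v_follow * dv / (2 * np.sqrt(a_max * b)))

def g_idm(x_lead, x_follow, v_lead, v_follow):
    """Eqn. (8): squared deviation from the IDM-suggested headway."""
    gap = max(x_lead - x_follow, 0.0)
    return (gap - idm_gap(v_follow, v_lead)) ** 2

def g_dist(x_lead, x_follow, zeta=25.0):
    """Eqn. (9): penalty that peaks as the collision-free distance vanishes."""
    gap = max(x_lead - x_follow, 0.0)
    return np.exp(-gap ** 2 / zeta)
```

Note how clipping the gap at zero makes both features maximally punitive once the follower passes the leader, which is exactly the surpassing penalty described above.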
TABLE I: Performance Comparison on Synthetic Dataset

Model      | Car Following (∆t = 0.… s, T = 20)       | Lane Changing (∆t = 0.… s, T = 30)
           | RMSE_x (m) | RMSE_v (m/s) | Accuracy (%) | RMSE_x (m) | RMSE_y (m) | RMSE_v (m/s) | Accuracy (%)
GRI        | …          | …            | …            | …          | …          | …            | …
NRI        | …          | …            | 66.… ± …     | …          | …          | …            | 66.… ± …
Supervised | …          | …            | -            | …          | …          | …            | -

The data is presented in the form of mean ± std.

Results. On each dataset, we trained a GRI model with the policy decoder (16)-(18) introduced in Appx. VIII-A. The results are summarized in Table I. The NRI model can reconstruct the trajectories with errors close to the supervised policy. However, it learns a relational latent space that is different from the one underlying the demonstration; therefore, the edge variables cannot be interpreted as those semantically meaningful behaviors. In contrast, our GRI model interprets the interactions consistently with the domain knowledge inherent in the demonstration, and recovers the interaction graph with high accuracy.

To further evaluate the explainability of the inferred graphs, we computed the empirical distribution of the estimated edge variables \hat{z} over the test dataset. The results are summarized in Fig. 4. It shows the empirical distribution in multiple adjacency matrices corresponding to different edge types. The distribution concentrates on a single interaction graph for both models in both scenarios (as opposed to the case on the naturalistic traffic dataset introduced in the next section), because the synthetic agents have consistent interaction patterns over all the samples. In CF, the interaction graph of the NRI model has two additional edges compared to the ground-truth one. One of them is fairly reasonable, because Vehicle 2 affects Vehicle 0 in an indirect way. The other, however, is not consistent with the inherent causality and cannot be interpreted as car-following like the other edges. In LC, the NRI model assigns the same type to three distinct edges, which makes it difficult to interpret the semantic meaning behind them, because these edges correspond to distinct interactive behaviors in the expert demonstration.

D. Naturalistic Traffic Scenes
To evaluate the proposed method in real-world traffic scenarios, we investigated the same scenarios as in the synthetic case, car-following and lane-changing.
Fig. 4: The empirical distribution of estimated edge variables \hat{z} over the test dataset in the synthetic scenarios. We summarize the results in multiple adjacency matrices corresponding to different edge types. In the adjacency matrix corresponding to the k-th type of interaction, the element A_{i,j} indicates the relative frequency of z_{j,i} = k, where z_{j,i} is the latent variable for the edge from node j to node i.

We segmented data from the Highway-101 and I-80 datasets of NGSIM. Afterwards, we further screened the data to select interactive samples and to ensure that no erratic swerving or multiple lane changes occur. Unlike the synthetic agents, human agents do not have a ground-truth interaction graph that governs their interactions. Instead, we constructed hypothetical \mathcal{G}_{interact} after analyzing the segmented data. The hypotheses reflect humans' understanding of the traffic scenarios. We would like to see if GRI can model the real-world interactive systems in a way consistent with humans. The hypotheses for the two scenarios are depicted in Fig. 3. The one for CF is identical to the ground-truth interaction graph we designed for the synthetic agents. However, we proposed a different hypothesis for LC. We excluded the cutting-in relation to reduce the number of edge types and therefore simplify the training procedure. Moreover, we differentiated distinct interactions according to the vehicles' lateral positions. We say that a vehicle yields to its preceding vehicle if they drive in neighbouring lanes, whereas it follows the preceding one if they drive in the same lane.

The node dynamics is the same as in the synthetic scene for CF. For LC, since we did not have accurate heading information, we adopted a 2D point-mass model instead. Since the behavior of human drivers is much more complicated than that of the synthetic agents, we designed reward functions with larger model capacity using neural networks. In CF, the reward functions are defined as follows:

r^{e,1}_{\psi_1}(x^t_i, x^t_j) = -(1 + \exp(\psi_{1,1}))\, g_{NN_v}(x^t_i, x^t_j) - (1 + \exp(\psi_{1,2}))\, g_{NN_s}(x^t_i, x^t_j),
r^n_\xi(x^t_j, a^t_j) = -(1 + \exp(\xi_1))\, f_{NN_v}(x^t_j) - (1 + \exp(\xi_2))\, f_{acc}(x^t_j) - (1 + \exp(\xi_3))\, f_{jerk}(x^t_j, a^t_j),

where the features are defined as:

f_{NN_v}(x^t_j) = \left(v^t_j - h_1(x^t_j)\right)^2,
g_{NN_v}(x^t_i, x^t_j) = \left(v^t_j - h_2(x^t_i, x^t_j)\right)^2,
g_{NN_s}(x^t_i, x^t_j) = \text{ReLU}\left(h_3(x^t_i, x^t_j) - x^t_i + x^t_j\right).

The features f_{acc} and f_{jerk} penalize the squared magnitudes of acceleration and jerk. The functions h_1, h_2, and h_3 are neural networks with ReLU output activation. The feature g_{NN_s} is the critical component that shapes the car-following behavior. It learns a non-negative reference headway and penalizes the following vehicle for violating it. The features g_{NN_v} and f_{NN_v} suggest reference velocities considering the interaction and merely the vehicle itself, respectively.

In LC, the edge reward function for car-following and the node reward function are similar to those in CF, with additional terms for the lateral position, velocity, and acceleration. In particular, the node reward for the lateral position encourages the vehicles to drive on the target lane, i.e., the lane where the leading vehicle is driving. To design the yielding reward, we define a collision point of two vehicles based on their states. We approximate the vehicles' trajectories as piecewise-linear between sequential timesteps, and compute the collision point as the intersection between their trajectories (Fig. 5).
We threshold the point if it exceeds a hard-coded range of interest (e.g., if it is behind the vehicles or farther than a certain distance). Afterwards, we define the distance-to-collision (d_{poc}) as the longitudinal distance from the vehicle to the collision point, and the time-to-collision (T_{col}) as the time to reach the collision point, calculated by dividing d_{poc} by the velocity of the vehicle.

Fig. 5: Collision point diagram. At every timestep, the heading vectors of the agents can be calculated by approximating the motion as linear. The intersection between these vectors is taken to be the collision point where the agents would collide if a yield action is not taken.

Then the yielding reward function is defined as follows:

r^{e,2}_{\psi_2}(x^t_i, x^t_j) = -(1 + \exp(\psi_{2,1}))\, g_{NN_{spatial}}(x^t_i, x^t_j) - (1 + \exp(\psi_{2,2}))\, g_{NN_{time}}(x^t_i, x^t_j),

where

g_{NN_{spatial}}(x^t_i, x^t_j) = \text{ReLU}\left((x_j - x_{poc}) - h_{d_{poc}}(x^t_i, x^t_j)\right),
g_{NN_{time}}(x^t_i, x^t_j) = \text{ReLU}\left(h_{T_{col}}(x^t_i, x^t_j) - (T^{col}_i - T^{col}_j)\right).

The functions h_{d_{poc}} and h_{T_{col}} are neural networks with ReLU output activation. The g_{NN_{spatial}} term learns a spatial aspect of the yielding behavior: it compares the agent's distance from the estimated collision point with an NN-learned safe reference within which the LC maneuver can be done. The second term g_{NN_{time}} adds a temporal aspect to yielding: it compares a learned safe headway time to the difference in time-to-collision for the two vehicles. The intuition behind both terms is to ensure that the vehicles do not occupy the same position at the same time.
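The collision-point construction reduces to intersecting two rays: each vehicle's motion is linearized along its current velocity vector and the intersection is solved as a 2x2 linear system, after which d_{poc} and T_{col} follow. A sketch (thresholds and names are ours):

```python
import numpy as np

def collision_point(p0, v0, p1, v1, eps=1e-6):
    """Intersection of the rays p0 + s*v0 and p1 + u*v1, approximating the
    motion as linear. Returns None if near-parallel or behind a vehicle."""
    A = np.array([[v0[0], -v1[0]],
                  [v0[1], -v1[1]]])
    if abs(np.linalg.det(A)) < eps:
        return None
    s, u = np.linalg.solve(A, p1 - p0)
    if s < 0 or u < 0:  # collision point behind one of the vehicles
        return None
    return p0 + s * v0

def d_poc_and_t_col(p, v, poc):
    """Longitudinal distance to the collision point along the heading,
    and the time to reach it (d_poc divided by the speed)."""
    speed = np.linalg.norm(v)
    d = np.dot(poc - p, v / speed)
    return d, d / speed
```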
Results. For each scenario, we trained a GRI model with the recurrent policy decoder (19)-(22) in Appx. VIII-A. The results are summarized in Table II. In CF, the NRI model still performs better on trajectory reconstruction, but the GRI model achieves comparable RMSE on the NGSIM dataset. Moreover, we observed that the NRI model overfitted to the training dataset, whereas the GRI model performs consistently on both the training and test datasets. It shows that incorporating domain knowledge in a principled manner is an effective regularization to avoid overfitting. In LC, the comparison is consistent: the NRI model slightly outperforms our model in trajectory reconstruction, while our model dominates the NRI model in graph accuracy. The supervised policy has the lowest reconstruction error in LC, and its performance in CF is comparable to the NRI model. It implies that the human hypotheses are reasonable assumptions which are capable of modeling the interactions between human drivers.

We visualize the interaction graphs in Fig. 6. One interesting observation is that the graphs inferred by NRI have more edges in general. We want to emphasize that both models were trained under the same sparsity constraint. The results imply that we can guide the model to explore a clean and sparse representation of interactions by incorporating relevant domain knowledge, whereas the sparsity regularization itself is not sufficient to serve the purpose. Moreover, the NRI model assigns the same edge type to both edges between a pair of agents. It makes the graphs less interpretable, because the vehicles ought to affect each other in different ways. On the other hand, even if different from the hypotheses, our GRI model tends to infer sparse graphs with directional edges.
Fig. 6: The empirical distribution of estimated edge variables \hat{z} over the test dataset in the naturalistic traffic scenarios. We summarize the results in multiple adjacency matrices corresponding to different edge types. In the adjacency matrix corresponding to the k-th type of interaction, the element A_{i,j} indicates the relative frequency of z_{j,i} = k, where z_{j,i} is the latent variable for the edge from node j to node i.

E. Out-of-distribution Experiments
Because of the smaller reconstruction errors, it appears that the NRI model discovers a relational latent space that can effectively model the interactions, which makes it a more favorable option. The interpretability of the inferred graphs seems less important when the reconstruction is substantially accurate. However, we would like to emphasize that a semantically meaningful latent space that is consistent with humans' prior knowledge is necessary, especially if we want to deploy the models in real-world settings for applications in human-robot interaction. We demonstrate it with the following out-of-distribution tests.
TABLE II: Performance Comparison on Naturalistic Traffic Dataset

Model      | Car Following (∆t = 0.… s, T = 30)       | Lane Changing (∆t = 0.… s, T = 40)
           | RMSE_x (m) | RMSE_v (m/s) | Accuracy (%) | RMSE_x (m) | RMSE_y (m) | RMSE_v (m/s) | Accuracy (%)
GRI        | …          | …            | …            | …          | …          | …            | …
NRI        | …          | …            | …            | …          | …          | …            | …
Supervised | …          | …            | -            | …          | …          | …            | -

The data is presented in the form of mean ± std.
Fig. 7: Out-of-distribution scenarios. We removed one vehiclefrom the nominal scenes and shifted the initial longitudinalheadway ∆ x to unseen values.of-distribution tests .For the synthetic scenarios, we removed one vehicle fromeach scene: Vehicle 0 in CF and Vehicle 1 in LC, resulting intwo interaction graphs consisting merely of following relations(Fig. 7). Also, we decreased the initial longitudinal headway tovalues unseen during the training stage. The initial longitudinalheadway is defined as ∆ x = x − x , namely the longitudinaldistance from Vehicle 1 to Vehicle 0 at the first time step.During the training stage, we sampled ∆ x from uniform dis-tributions: In CF, ∆ x ∼ unif(4 , ; In LC, ∆ x ∼ unif(8 , .In the out-of-distribution experiments, we gradually decreased ∆ x from the lower bound to some negative value, whichmeans Vehicle 0 is placed in front of Vehicle 1.Instead of evaluating the entire model, we enforced theground-truth G interact , and ran the policy decoders to generatetrajectories. The experiment is analogous to the case when thevehicle encounters an unfamiliar situation. The safety driversor passengers have the privilege to override the inferred graph For clarification, the models used in this section are the same as thoseintroduced in Sec. VI-C. We merely designed additional out-of-distributioncases for testing. For the NRI model, since the edge types are not defined explicitly, weuse the permutation found with the highest graph accuracy to find out thecorresponding edge type, when multiple edge types exist. to let the desired behavior emerge if the model misunderstandsthe scenario. Such kind of safety assurance could help buildingup a safe and trustworthy cooperation between humans andthe autonomous vehicles. It is then crucial that the latentspace possesses explicit semantic meanings and correspondsto a cluster of semantically meaningful interactive behaviors.Therefore, we are curious about if the models can generatetrajectories meeting the characteristics of the car-followingbehavior in these unseen scenarios—scenarios with a differentnumber of vehicles and distorted state distribution. We con-sider three metrics for quantitative evaluation: • Final headway: ∆ x f = x T − x T , (11) • Lateral distance: ∆ y = (cid:12)(cid:12) y T − y T (cid:12)(cid:12) − (cid:12)(cid:12) y − y (cid:12)(cid:12) , (12) • Minimum distance: d min = min i (cid:113)(cid:12)(cid:12) x i − x i (cid:12)(cid:12) + (cid:12)(cid:12) y i − y i (cid:12)(cid:12) . (13)We intend to quantify three typical characteristics of thefollowing behavior with the metrics defined above: 1) stayingbehind the leading vehicle; 2) keeping in the same lane asthe leading vehicle; 3) maintaining a substantial safe distancefrom the leading vehicle. All metrics were applied in LC,but we only adopted the final headway in CF. Since only thelongitudinal dynamics is modeled in CF, ∆ y is not applicable.For the same reason, if their initial positions are too close orthe following vehicle located ahead of the leading one initially,the following vehicle will inevitably crush into the leadingvehicle, which results in d min = 0 . Therefore, we only careabout the first characteristic and its corresponding metric.The results are summarized in Fig. 8 and Fig. 9, wherewe plot the mean values of the evaluated metrics versus ∆ x ,with error bands denoting confidence interval. 
We areparticularly interested in the cases when ∆ x becomes negative,which changes the spatial relations between the vehicles.In CF, the NRI policy does not slow down Vehicle 0 tofollow Vehicle 1 when ∆ x becomes negative, resulting innegative ∆ x f . In contrast, the supervised policy and GRIpolicy maintain a positive average ∆ x f , which means theyyield Vehicle 0 to follow Vehicle 1. However, the GRI policyattains a larger ∆ y f and the margin becomes larger withdecreasing ∆ y . We visualize a marginal example in Fig.12,where both the NRI policy and the supervised one fail tomaintain a positive final headway.In LC, the pattern of the final headway is the same. TheGRI policy maintains a consistent ∆ x f over all tested valuesof ∆ x . For the other two models, the values of ∆ x f decrease Fig. 8: Results in out-of-distribution synthetic car-followingscenario. We plot the mean values of ∆ x f versus ∆ x witherror bands denoting confidence interval. A positiveheadway means that Vehicle 0 stays behind Vehicle 1, whereasa negative headway means that Vehicle 0 locates in front ofVehicle 1.with decreasing ∆ x . The average ∆ x f of the NRI policyturns negative when the magnitude of ∆ x becomes sufficientlylarge. In terms of ∆ y , all models tend to reduce the lateraldistance between the vehicles which is consistent with thesecond characteristic of the following behavior. However, wefound that the GRI policy attains an average ∆ y with smallermagnitude and the magnitude decreases with decreasing ∆ x .It implies that the GRI policy changes its strategy when theinitial position of Vehicle 0 is ahead of Vehicle 1. In order tokeep a proper safe distance, Vehicle 0 does not change its laneuntil Vehicle 1 surpasses itself. On the other hand, the lateralbehavior is unchanged for the other two models. However, thevehicle cannot maintain a substantial safe distance if it changesits lane too early, which is verified by the plot of d min versus ∆ x . The difference in their strategies is further demonstratedby the example visualized in Fig. 12.We repeated the experiment on the NGSIM datasets. Similarto the case of synthetic dataset, we removed one vehiclefrom each scene, resulting in interaction graphs consistingof a single edge (Fig. 7). The only difference is that theremaining edge in LC is of yielding type. However, accordingto our definition of yielding relation, we considered the samecharacteristics and adopted the same metrics defined in Eqn.(11)-(13) for evaluation. Since we do not have control overthe data generation procedure, we generate out-of-distributiontest samples with different levels of discrepancy by controllingthe ratio of longitudinal headway change. Given a sample fromthe original test dataset, we generate its corresponding out-of-distribution sample by shifting its initial longitudinal headway ∆ x by a certain ratio, denoted by δ , resulting in a new valueof longitudinal headway ∆ x (cid:48) : ∆ x (cid:48) = (1 − δ )∆ x. We evaluated the models on datasets generated with differentvalues of δ . We are particularly interested in the cases when δ (cid:62) , which leads to a negative initial headway. We presentthe results in Fig. 10 and , where we plot the mean Fig. 9: Results in out-of-distribution synthetic lane-changingscenario. We plot the mean values of ∆ x f , ∆ y , and d min versus ∆ x with error bands denoting confidence interval.A positive headway means that Vehicle 0 stays behind Vehicle1, whereas a negative headway means that Vehicle 0 locatesin front of Vehicle 1. 
The comparison is quite consistent with the synthetic scenarios: compared to the other baselines, our GRI policy can synthesize trajectories that satisfy the desired semantic properties over a larger range of distribution shift. The results suggest that even though the NRI model can accurately reconstruct the trajectories, its unsupervised latent space does not necessarily model the underlying interactions precisely. The latent space and the corresponding policies do not capture the semantic meanings behind the interactions. As a result, the model is prone to failure in unseen scenarios, and its non-interpretable nature prohibits effective human intervention in these circumstances. In contrast, the semantically meaningful latent space and policy of GRI enable safe and trustworthy human cooperation, which helps the model generalize to unseen situations even if it might misinterpret the relations.

Fig. 10: Results in the out-of-distribution naturalistic traffic car-following scenario. We plot the mean values of ∆x_f versus the ratio of change of the initial longitudinal distance, δ, with error bands denoting confidence intervals.

Another useful insight we draw from the experiment is that interaction labels alone are not sufficient to induce an explainable model with a semantic latent space. Even though the supervised policy utilizes additional information on the ground-truth interaction graph, it fails to synthesize the following behavior in novel scenarios. Although the GRI model still has a considerable gap in reconstruction performance compared to the supervised baseline, it provides a promising and principled way to incorporate prior knowledge into a learning-based autonomous driving system and induce an explainable model.

VII. DISCUSSION AND CONCLUSION
In this work, we propose Grounded Relational Inference (GRI), which models an interactive system's underlying dynamics by inferring the agents' semantic relations. By incorporating structured reward functions, we ground the relational latent space in semantically meaningful behaviors defined with expert domain knowledge, thereby ensuring an interpretable interaction graph at the design stage. We demonstrate that GRI can model simple traffic scenarios under both simulation and real-world settings, and generate interpretable graphs explaining the vehicles' behaviors by their interactions.

Although we limit our experimental study to the autonomous driving domain, the model itself is formulated without specifying the context. As long as proper domain knowledge is available, the proposed method extends naturally to other fields (e.g., human-robot interaction). However, there are several technical gaps we need to bridge before extending the current framework to more complicated traffic scenarios and interactive systems in other fields. One gap between the current model and these practical modules is graph dynamics: throughout the paper, we assume a static interaction graph over the time horizon. We will investigate how to incorporate dynamic graph modeling into the current framework. Another gap is the cooperative assumption, which we would like to remove in the future so that the framework can be generalized to non-cooperative scenarios. In addition, as mentioned before, the GRI model still has a considerable gap in reconstruction performance compared to the other baselines. In future work, we will improve the model architecture and training algorithm to close the performance gap while maintaining the advantages of GRI as an explainable model.

Fig. 11: Results in the out-of-distribution naturalistic traffic lane-changing scenario. We plot the mean values of ∆x_f, ∆y, and d_min versus the ratio of change of the initial longitudinal distance, δ, with error bands denoting confidence intervals.

VIII. APPENDIX
A. Graph Neural Network Model Details
In terms of model structure, both the encoder and the policy decoder are built upon node-to-node message passing [32], consisting of a node-to-edge message passing and an edge-to-node message passing:

v → e: $h_{i,j}^l = f_e^l\left(h_i^l, h_j^l, x_{i,j}\right)$,  (14)

e → v: $h_j^{l+1} = f_v^l\left(\sum_{i \in \mathcal{N}_j} h_{i,j}^l,\; x_j\right)$,  (15)

where $h_i^l$ is the embedded hidden state of node $v_i$ in the $l$-th layer and $h_{i,j}^l$ is the embedded hidden state of the edge $e_{i,j}$. The features $x_j$ and $x_{i,j}$ are assigned to the node $v_j$ and the edge $e_{i,j}$, respectively, as inputs. $\mathcal{N}_j$ denotes the set of indices of $v_j$'s neighbouring nodes connected by an incoming edge. The functions $f_e^l$ and $f_v^l$ are neural networks for edges and nodes, respectively, shared across the graph within the $l$-th layer of node-to-node message passing.

Fig. 12: Examples where the leading car is placed behind the following one at the initial time step. The trajectories are visualized as sequences of rectangles. Each rectangle represents a vehicle at a specific time step. The vehicles are driving along the positive direction of the x-axis. The GRI policy still prompts the car-following behavior: it slows down the vehicle until the leading one surpasses it. Meanwhile, the NRI policy and the supervised one do not behave as G_interact suggests.
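A minimal PyTorch sketch of one node-to-node message-passing layer, Eqns. (14)-(15), is given below. The module structure, the use of MLPs for $f_e^l$ and $f_v^l$, and all dimensions are our assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One node-to-node message-passing layer: v -> e followed by e -> v."""

    def __init__(self, node_dim, node_feat_dim, edge_feat_dim, hidden_dim):
        super().__init__()
        # f_e^l: maps (h_i, h_j, x_ij) to an edge embedding, Eqn. (14).
        self.f_e = nn.Sequential(
            nn.Linear(2 * node_dim + edge_feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        # f_v^l: maps (aggregated incoming messages, x_j) to the next node
        # embedding, Eqn. (15).
        self.f_v = nn.Sequential(
            nn.Linear(hidden_dim + node_feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, node_dim))

    def forward(self, h, x_node, x_edge, senders, receivers):
        """h: (N, node_dim); x_node: (N, node_feat_dim);
        x_edge: (E, edge_feat_dim); senders/receivers: (E,) long tensors
        listing the tail and head node of each directed edge (i, j)."""
        # v -> e: one message per directed edge.
        m = self.f_e(torch.cat([h[senders], h[receivers], x_edge], dim=-1))
        # e -> v: sum messages over each node's incoming edges (the set N_j).
        agg = torch.zeros(h.size(0), m.size(-1), device=h.device)
        agg.index_add_(0, receivers, m)
        return self.f_v(torch.cat([agg, x_node], dim=-1))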
GNN Encoder. The GNN encoder is essentially the same as in NRI. It models the posterior distribution $q_\phi(z \mid \tau)$ with the following operations:

$h_j^1 = f_{\mathrm{emb}}(x_j)$,

v → e: $h_{i,j}^1 = f_e^1\left(h_i^1, h_j^1\right)$,

e → v: $h_j^2 = f_v^1\left(\sum_{i \neq j} h_{i,j}^1\right)$,

v → e: $h_{i,j}^2 = f_e^2\left(h_i^2, h_j^2\right)$,

$q_\phi(z_{i,j} \mid \tau) = \mathrm{softmax}\left(h_{i,j}^2\right)$,

where $f_e^1$, $f_v^1$, and $f_e^2$ are fully-connected networks (MLPs) and $f_{\mathrm{emb}}$ is a 1D convolutional network (CNN) with attentive pooling.
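The encoder stack could be sketched as follows. For brevity we replace the CNN trajectory embedding with an MLP stand-in; all names and layer sizes are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

def mlp(d_in, d_hidden, d_out):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_out))

class GNNEncoder(nn.Module):
    """Two rounds of message passing producing edge-type posteriors q_phi(z|tau)."""

    def __init__(self, traj_dim, hidden_dim, n_edge_types):
        super().__init__()
        self.f_emb = mlp(traj_dim, hidden_dim, hidden_dim)  # stand-in for the CNN
        self.f_e1 = mlp(2 * hidden_dim, hidden_dim, hidden_dim)
        self.f_v1 = mlp(hidden_dim, hidden_dim, hidden_dim)
        self.f_e2 = mlp(2 * hidden_dim, hidden_dim, n_edge_types)

    def forward(self, tau, senders, receivers):
        """tau: (N, traj_dim) flattened trajectory per agent; senders and
        receivers enumerate the directed edges of the fully connected graph."""
        h1 = self.f_emb(tau)
        # v -> e, then e -> v (sum over incoming edges), then v -> e again.
        m1 = self.f_e1(torch.cat([h1[senders], h1[receivers]], dim=-1))
        h2 = self.f_v1(torch.zeros_like(h1).index_add_(0, receivers, m1))
        logits = self.f_e2(torch.cat([h2[senders], h2[receivers]], dim=-1))
        return torch.softmax(logits, dim=-1)  # per-edge q_phi(z_ij | tau)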
GNN Policy Decoder. The policy operates over G_interact and models the distribution $\pi_\eta(a^t \mid x^t, z)$, which can be factorized with $\pi_\eta(a_j^t \mid x^t, z)$ as in Eqn. (1). We model $\pi_\eta$ as a Gaussian distribution with the mean value parameterized by the following GNN:

v → e: $\tilde{h}_{i,j}^t = \sum_{k=0}^{K} \mathbb{1}(z_{i,j} = k)\, \tilde{f}_e^k\left(x_i^t, x_j^t\right)$,  (16)

e → v: $\mu_j^t = \tilde{f}_v\left(\sum_{i \neq j} \tilde{h}_{i,j}^t\right)$,  (17)

$\pi_\eta\left(a_j^t \mid x^t, z\right) = \mathcal{N}\left(\mu_j^t, \sigma^2 I\right)$.  (18)

Alternatively, the model capacity is improved by using a recurrent policy $\pi_\eta\left(a_j^t \mid x^t, \ldots, x^1, z\right)$; namely, the agents take actions according to the historical trajectories of the system. We follow the practice in [12] and add a GRU unit to obtain the following recurrent model:

v → e: $\tilde{h}_{i,j}^t = \sum_{k=0}^{K} \mathbb{1}(z_{i,j} = k)\, \tilde{f}_e^k\left(\tilde{h}_i^t, \tilde{h}_j^t\right)$,  (19)

e → v: $\tilde{h}_j^{t+1} = \mathrm{GRU}\left(\sum_{i \neq j} \tilde{h}_{i,j}^t,\; x_j^t,\; \tilde{h}_j^t\right)$,  (20)

$\mu_j^t = f_{\mathrm{out}}\left(\tilde{h}_j^{t+1}\right)$,  (21)

$\pi_\eta\left(a_j^t \mid x^t, \ldots, x^1, z\right) = \mathcal{N}\left(\mu_j^t, \sigma^2 I\right)$,  (22)

where $\tilde{h}_i^t$ is the recurrent hidden state encoding the historical information up to time step t − 1.
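The Markovian decoder of Eqns. (16)-(18) could be sketched as follows, with one edge network per grounded edge type gated by the one-hot edge variable. Names, sizes, and the fixed variance are illustrative assumptions; the recurrent variant would additionally thread a GRU state through Eqns. (19)-(22).

import torch
import torch.nn as nn

def mlp(d_in, d_hidden, d_out):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_out))

class GNNPolicyDecoder(nn.Module):
    """Edge-type-gated decoder producing the mean of a Gaussian policy."""

    def __init__(self, state_dim, hidden_dim, action_dim, n_edge_types, sigma=0.1):
        super().__init__()
        # One message network per grounded edge type, \tilde{f}_e^k in Eqn. (16).
        self.f_e = nn.ModuleList(
            [mlp(2 * state_dim, hidden_dim, hidden_dim) for _ in range(n_edge_types)])
        self.f_v = mlp(hidden_dim, hidden_dim, action_dim)  # \tilde{f}_v, Eqn. (17)
        self.sigma = sigma

    def forward(self, x_t, z, senders, receivers):
        """x_t: (N, state_dim) agent states at time t; z: (E, K) one-hot
        edge types; senders/receivers: (E,) long tensors of edge endpoints."""
        pair = torch.cat([x_t[senders], x_t[receivers]], dim=-1)
        # The one-hot gate keeps only the message of each edge's own type, so
        # every semantic relation is realized by its own policy network.
        msg = sum(z[:, k:k + 1] * f_k(pair) for k, f_k in enumerate(self.f_e))
        agg = torch.zeros(x_t.size(0), msg.size(-1), device=x_t.device)
        agg.index_add_(0, receivers, msg)
        mu = self.f_v(agg)  # mean action for every agent, Eqn. (17)
        # Eqn. (18): Gaussian policy with fixed isotropic variance sigma^2.
        return torch.distributions.Normal(mu, self.sigma)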
REFERENCES

[1] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.
[2] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1907–1915.
[3] C. Tang, Z. Xu, and M. Tomizuka, "Disturbance-observer-based tracking controller for neural network driving policy transfer," IEEE Transactions on Intelligent Transportation Systems, 2019.
[4] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., "Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI," Information Fusion, vol. 58, pp. 82–115, 2020.
[5] J. Kim and J. Canny, "Interpretable learning for self-driving cars by visualizing causal attention," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2942–2950.
[6] M. Bojarski, A. Choromanska, K. Choromanski, B. Firner, L. J. Ackel, U. Muller, P. Yeres, and K. Zieba, "VisualBackProp: Efficient visualization of CNNs for autonomous driving," IEEE, 2018, pp. 1–8.
[7] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: Human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 961–971.
[8] A. Vemula, K. Muelling, and J. Oh, "Social attention: Modeling attention in human crowds," IEEE, 2018, pp. 1–7.
[9] Y. Hoshen, "VAIN: Attentional multi-agent predictive modeling," in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 2701–2711.
[10] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," International Conference on Learning Representations (ICLR), 2018.
[11] S. Sukhbaatar, A. Szlam, and R. Fergus, "Learning multiagent communication with backpropagation," in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 2244–2252.
[12] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel, "Neural relational inference for interacting systems," International Conference on Machine Learning (ICML), 2018.
[13] L. Sun, W. Zhan, and M. Tomizuka, "Probabilistic prediction of interactive driving behavior via hierarchical inverse reinforcement learning," IEEE, 2018, pp. 2111–2117.
[14] D. Lee, Y. Gu, J. Hoang, and M. Marchetti-Bowick, "Joint interaction and trajectory prediction for autonomous driving using graph neural networks," arXiv preprint arXiv:1912.07882, 2019.
[15] S. Van Steenkiste, M. Chang, K. Greff, and J. Schmidhuber, "Relational neural expectation maximization: Unsupervised discovery of objects and their interactions," International Conference on Learning Representations (ICLR), 2018.
[16] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al., "Interaction networks for learning about objects, relations and physics," in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 4502–4510.
[17] L. Yu, J. Song, and S. Ermon, "Multi-agent adversarial inverse reinforcement learning," International Conference on Learning Representations (ICLR), 2019.
[18] R. P. Bhattacharyya, D. J. Phillips, B. Wulfe, J. Morton, A. Kuefler, and M. J. Kochenderfer, "Multi-agent imitation learning for driving simulation," IEEE, 2018, pp. 1534–1539.
[19] J. Ho and S. Ermon, "Generative adversarial imitation learning," in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 4565–4573.
[20] R. P. Bhattacharyya, D. J. Phillips, C. Liu, J. K. Gupta, K. Driggs-Campbell, and M. J. Kochenderfer, "Simulating emergent properties of human driving behavior using multi-agent reward augmented imitation learning," IEEE, 2019, pp. 789–795.
[21] Z. Wang, J. S. Merel, S. E. Reed, N. de Freitas, G. Wayne, and N. Heess, "Robust imitation of diverse behaviors," in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 5320–5329.
[22] X. B. Peng, A. Kanazawa, S. Toyer, P. Abbeel, and S. Levine, "Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow," International Conference on Learning Representations (ICLR), 2019.
[23] L. Yu, T. Yu, C. Finn, and S. Ermon, "Meta-inverse reinforcement learning with probabilistic context variables," in Advances in Neural Information Processing Systems (NIPS), 2019, pp. 11772–11783.
[24] J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata, "Textual explanations for self-driving vehicles," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 563–578.
[25] P. de Haan, D. Jayaraman, and S. Levine, "Causal confusion in imitation learning," in Advances in Neural Information Processing Systems (NIPS), 2019, pp. 11698–11709.
[26] C. Li, S. H. Chan, and Y.-T. Chen, "Who make drivers stop? Towards driver-centric risk assessment: Risk object identification via causal inference," arXiv preprint arXiv:2003.02425, 2020.
[27] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, "Maximum entropy inverse reinforcement learning," in Proceedings of the AAAI Conference on Artificial Intelligence, 2008.
[28] C. Finn, S. Levine, and P. Abbeel, "Guided cost learning: Deep inverse optimal control via policy optimization," in International Conference on Machine Learning (ICML), 2016, pp. 49–58.
[29] C. Finn, P. Christiano, P. Abbeel, and S. Levine, "A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models," arXiv preprint arXiv:1611.03852, 2016.
[30] J. Fu, K. Luo, and S. Levine, "Learning robust rewards with adversarial inverse reinforcement learning," arXiv preprint arXiv:1710.11248, 2017.
[31] A. Kesting, M. Treiber, and D. Helbing, "Enhanced intelligent driver model to access the impact of driving strategies on traffic capacity," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 368, no. 1928, pp. 4585–4605, 2010.
[32] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proceedings of the 34th International Conference on Machine Learning (ICML) - Volume 70, JMLR.org, 2017, pp. 1263–1272.
[33] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, "Deep variational information bottleneck," International Conference on Learning Representations (ICLR), 2017.
[34] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," International Conference on Learning Representations (ICLR), 2017.
[35] S. Levine, "Reinforcement learning and control as probabilistic inference: Tutorial and review," arXiv preprint arXiv:1805.00909, 2018.
[36] M. Treiber, A. Hennecke, and D. Helbing, "Congested traffic states in empirical observations and microscopic simulations," Physical Review E, vol. 62, no. 2, pp. 1805–1824, 2000.