GRIP++: Enhanced Graph-based Interaction-aware Trajectory Prediction for Autonomous Driving
Xin Li, Xiaowen Ying, Mooi Choo Chuah
Department of Computer Science and Engineering, Lehigh University
[email protected], [email protected], [email protected]
Abstract—Despite the advancement in the technology of autonomous driving cars, the safety of a self-driving car is still a challenging problem that has not been well studied. Motion prediction is one of the core functions of an autonomous driving car. Previously, we proposed a novel scheme called GRIP, which is designed to efficiently predict trajectories for traffic agents around an autonomous car. GRIP uses a graph to represent the interactions of close objects, applies several graph convolutional blocks to extract features, and subsequently uses an encoder-decoder long short-term memory (LSTM) model to make predictions. Even though our experimental results show that GRIP improves the prediction accuracy of the state-of-the-art solution by 30%, GRIP still has some limitations. GRIP uses a fixed graph to describe the relationships between different traffic agents and hence may suffer performance degradation when it is used in urban traffic scenarios. Hence, in this paper, we describe an improved scheme called GRIP++ in which we use both fixed and dynamic graphs for trajectory prediction of different types of traffic agents. Such an improvement can help autonomous driving cars avoid many traffic accidents. Our evaluations using a recently released urban traffic dataset, namely ApolloScape, show that GRIP++ achieves better prediction accuracy than state-of-the-art schemes. At the time of writing, GRIP++ ranked #1 on the leaderboard of the ApolloScape trajectory competition.

I. INTRODUCTION
Nowadays, high-quality and affordable cameras are available in many gadgets that humans own, e.g., smartphones, wireless cameras, and autonomous vehicles. Analyzing images/videos captured by these cameras impacts our daily lives. For example, smartphones use face recognition algorithms [1], [2], [3] to analyze frames captured by front cameras (RGB or infrared) to recognize users, which improves the security and usability of smartphones. Smart surveillance video systems which can detect and identify suspects [4], [5], [6], [7], [8], [9], [10], [11] help law enforcement personnel maintain a safer living environment. Hand gesture recognition algorithms [12], [13], [14], [15] provide a brand new way for human-computer interaction interfaces to be designed. Model decomposition solutions [16], [17], [18], [19], [20] make deep learning models run much faster on resource-constrained devices.

Recent technology advancements in the fields of computer vision, sensor signal processing, hardware design, etc. have enabled autonomous driving technology to go from the "likely feasible" to the "commercially available" state. However, recent traffic accidents involving autonomous driving cars from Tesla and Uber in 2018 raised people's concern about the safety of self-driving vehicles. Thus, it is extremely important to improve the performance of the intelligent algorithms running on autonomous driving cars. One important example of such intelligent algorithms is the prediction of the future trajectories of the surrounding traffic agents, e.g., vehicles, pedestrians, bicycles, etc. One can avoid traffic accidents if each autonomous driving car involved could precisely predict the locations of its surrounding objects.

Accurately predicting the motion of surrounding objects is an extremely challenging task, considering that many factors can affect the future trajectory of an object. Prior works [21], [22], [23], [24], [25] proposed to predict future locations by recognizing maneuvers (change lanes, brake, keep going, etc.). However, these methods fail to predict the positions of objects accurately when they wrongly infer the type of maneuver. Typically such wrong inferences happen when a scheme makes a prediction based only on sensors like GPS that miss visual clues, e.g., turn signals. Karasev et al. [26] proposed to predict the motion of pedestrians by modeling their behaviors as jump-Markov processes. Unfortunately, their proposed method requires a semantic map and knowledge of one or several goals of the pedestrian, which is not useful in the autonomous driving scenario because an autonomous driving car cannot know the destination of a pedestrian (or other objects) in advance. Bhattacharyya et al. [27] tried to predict the bounding boxes of objects in RGB camera frames by predicting the future vehicle odometry sequence. Yet, the predicted bounding boxes in RGB frames still need to be mapped to the coordinate system of the self-driving car to allow the self-driving car to respond correctly to these predicted locations.

Besides, few of the schemes we discussed above take the states of surrounding objects into account. We argue that the motion states of surrounding objects are crucial for motion prediction, especially in the field of autonomous driving. In autonomous driving scenarios, there are different types of nearby traffic agents, e.g., cars, pedestrians, bicycles, and buses. These traffic agents have various shapes, dynamics, and different movement patterns.
To ensure safe operations of autonomous vehicles, their perception and navigation systems should be able to analyze the motion patterns of surrounding traffic agents and predict their future locations so that autonomous vehicles can make better driving decisions.

In [28], we proposed a robust and efficient object trajectory prediction scheme for autonomous driving cars, namely GRIP, which can infer future locations of nearby objects simultaneously and is trainable end-to-end. Our preliminary results using two large highway datasets show that our approach performs better than existing schemes. However, we did not evaluate our scheme in an urban driving environment. Driving in an urban environment is much more challenging than driving on a highway. Urban traffic has more uncertainties and may have more complex road conditions and diverse traffic agents. Different types of traffic agents have varying motion patterns, and their behaviors affect one another. In addition, in [28], we use a fixed graph to represent the relationships between traffic agents. Such an approach may suffer from performance degradation when it is used in urban traffic scenarios. Thus, in this paper, we propose an improved scheme called GRIP++ which utilizes both fixed and dynamic graphs to capture the complex interactions between different types of traffic agents for better trajectory prediction accuracy.

In summary, the contributions of this paper include:
• An improved object trajectory prediction scheme to precisely predict future locations of various types of traffic agents surrounding an autonomous driving car.
• The proposed scheme considers the impact of inter-agent interactions on motion.
• Extensive evaluations using both highway and urban traffic datasets show that our scheme achieves higher accuracy and runs an order of magnitude faster than existing schemes.

The rest of this paper is organized as follows. In Section II, we briefly discuss related work, followed by the problem formulation in Section III. In Section IV, we describe our proposed object trajectory prediction scheme and implementation details. We report our experimental results in Section V. Finally, we conclude this paper in Section VI.

II. RELATED WORK
A. Conventional Methods on Trajectory Prediction
The problem of trajectory prediction has been extensively studied by researchers over many years. Classical approaches include Monte Carlo simulation [29], Bayesian networks [30], Hidden Markov Models (HMM) [31], etc. These methods typically focus on analyzing objects based on their previous movements and can only be used in simple traffic scenarios with few interactions among vehicles; they may not work well in scenarios involving heterogeneous types of vehicles and pedestrians. Other traditional motion prediction methods are either based on Markovian maneuver intention estimation [23], [24] or on prototype trajectories [32]. Such methods have limitations, e.g., they fail to predict the intent of traffic agents accurately if they infer the wrong type of maneuver, or they are computationally very expensive. The authors in [33] combine the two techniques to develop a dictionary learning algorithm called Augmented Semi Non-Negative Sparse Coding (ASNSC). However, ASNSC predicts intents based only on spatial features while ignoring the environmental context that may influence an object's intent.

Researchers have also attempted to predict trajectories for crowds by modeling pedestrians' behaviors and interactions. For example, [34], [35] combine an Ensemble Kalman Filter and a human motion model to predict the trajectories of crowds. Ma et al. [36] extend such methods to general traffic scenarios where they predict the trajectories of multiple traffic agents by considering kinematic and dynamic constraints. However, they assume perfect sensing, shape, and dynamics information for all traffic agents, which often is not available in real life.
B. Recent Deep Learning Based Models for Trajectory Prediction
In recent years, deep learning based methods, e.g., Long Short-Term Memory (LSTM) based methods, have been proposed for maneuver classification and trajectory prediction, e.g., [37], [38]. Typically such methods require ideal road conditions, e.g., clear road lanes, no other types of traffic agents, or perfect knowledge of surrounding objects. For example, [39] used one LSTM-based encoder to learn the pattern underlying the past trajectory and another LSTM decoder to predict the future trajectory, but they assume the ego vehicle knows the relative speeds and locations of nearby vehicles. Recently, researchers have realized such limitations and started exploring possible solutions. Thus, we merely summarize here the more recent works that take inter-object interactions into account.

In [40], the authors presented an LSTM-CNN hybrid network called TraPhic for trajectory prediction. Specifically, they take into account heterogeneous interactions that implicitly account for the varying dynamics and behaviors of different road agents, and they use a semi-elliptical region (horizon) in front of each road agent to model horizon-based interactions, which implicitly models the driving behavior of each road agent. An LSTM is used to model each road agent. A horizon map is created by pooling together the hidden states of the horizon agents, and a neighborhood map is created using the hidden states of all agents in the defined neighborhood. Such a scheme is computationally expensive, and certain movement information is lost by using a CNN to pool the hidden states of nearby agents, which limits the accuracy it can achieve. To overcome such limitations, another scheme is proposed in [41], where an instance layer is used to learn instances' movements and interactions and a category layer is used to learn the similarities of instances belonging to the same type of traffic agent to refine the prediction. This scheme performs better than TraPhic, but its computation cost remains high since an LSTM is used for each traffic agent in the neighborhood.

Luo et al. proposed a convolutional network for fast object detection, tracking, and motion forecasting in [42]. Their model takes bird's eye view LiDAR data as input and applies 3D convolutions across space and time. Then, two extra branches of convolutional layers are added: one of them calculates the probability of being a vehicle at a given location, and the other predicts the bounding box over the current frame as well as several frames in the future. They believe that such a structure can forecast motion because the model can learn velocity and acceleration features from the input of multiple frames. However, the forecasting branch simply takes the 3D convolutional feature map as input, so the visual features of all objects are represented in the same feature map. This results in the model losing track of objects, and hence it cannot perform well in a scene with crowded objects.

In addition, Deo et al. [43], [44] proposed a unified framework for surrounding vehicles' maneuver classification and motion prediction on freeways. First, an LSTM model is used to represent the track histories and relative positions of all observed cars (the one being predicted and its nearby vehicles) as a context vector. Then, this context is used for maneuver classification, and another LSTM is used to predict the vehicle's future position. Considering that the LSTM model fails to capture the interdependencies of the motion of all cars in the scene, they later enhanced their scheme by adding convolutional social pooling layers in [45].
This improved model has access to the motion states of surrounding objects and their spatial relationships and hence improves the accuracy of future motion prediction. However, all of these models merely predict the trajectory of one specific car (the one in the middle position) each time. Hence, these existing approaches require intensive computation power if they want to predict trajectories of all surrounding objects, which is highly inefficient. Besides, these schemes are maneuver based, and hence their performance suffers when wrong classifications of the maneuver types occur.

III. PROBLEM FORMULATION
Before introducing our proposed scheme, we formulate the trajectory prediction problem as one of estimating the future positions of all objects in a scene based on their trajectory histories. Specifically, the input X of our model consists of the trajectory histories (over t_h time steps) of all observed objects:

$$X = [p^{(1)}, p^{(2)}, \cdots, p^{(t_h)}] \qquad (1)$$

where

$$p^{(t)} = [x_0^{(t)}, y_0^{(t)}, x_1^{(t)}, y_1^{(t)}, \cdots, x_n^{(t)}, y_n^{(t)}] \qquad (2)$$

are the coordinates of all observed objects at time t, and n is the number of observed objects. This format is the same as what Deo et al. defined in [43] and [45]. Global coordinates are used here. Using relative measurements in the ego-vehicle-based coordinate system would improve the prediction accuracy, but this is left for future work.

Considering that we feed the track histories of all observed objects into the model, we argue that it makes more sense to predict future positions for all of them simultaneously for an autonomous driving car. Thus, instead of only predicting the position of one particular object as is done in [43] and [45], our proposed model outputs Y, the predicted future positions of all observed objects from time step t_h + 1 to t_h + t_f:

$$Y = [p^{(t_h+1)}, p^{(t_h+2)}, \cdots, p^{(t_h+t_f)}] \qquad (3)$$

where p^(t) is the same as in Eq. (2) and t_f is the prediction horizon.
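To make the data layout concrete, the following minimal sketch (our own illustration, not code from the paper; NumPy is assumed, and the sizes are hypothetical) builds an input X in the (n × t_h × c) layout used later and shows the matching output shape:

```python
import numpy as np

# Hypothetical sizes: n = 3 observed objects, t_h = 6 history steps,
# t_f = 6 future steps, c = 2 channels (the x, y coordinates).
n, t_h, t_f, c = 3, 6, 6, 2

# Eq. (2): p^(t) = [x_0, y_0, x_1, y_1, ..., x_n, y_n] at one time step.
history = [np.random.randn(2 * n) for _ in range(t_h)]

X = np.stack(history)        # Eq. (1): (t_h, 2n) stack of p^(1)..p^(t_h)
X = X.reshape(t_h, n, c)     # split the flat vector into per-object (x, y)
X = X.transpose(1, 0, 2)     # (n, t_h, c) layout of F_input in Section IV

# Eq. (3): the model outputs positions for ALL n objects over t_f steps.
Y_shape = (n, t_f, c)
print(X.shape, Y_shape)      # (3, 6, 2) (3, 6, 2)
```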
IV. PROPOSED SCHEME

To address the limitations of existing approaches, we propose GRIP++, a novel deep learning model for object trajectory prediction, in this section. Our model, illustrated in Figure 1, consists of three components: (1) Input Preprocessing Model, (2) Graph Convolutional Model, and (3) Trajectory Prediction Model.

Fig. 1: The architecture of the proposed scheme.

A. Input Preprocessing Model

1) Input Representation: Before feeding the trajectory data of objects into our model, we convert the raw data into a specific format for subsequent efficient computation. Assuming that n objects in a traffic scene were observed in the past t_h time steps, we represent such information in a 3D array F_input with a size of (n × t_h × c) (as shown in Figure 1). In this paper, we set c = 2 to indicate the x and y coordinates of an object. Considering that it is easier to predict the velocity of an object than to predict its location, we calculate velocities (p^(t+1) − p^(t)) before feeding the data into our model.

2) Graph Construction: In the autonomous driving application scenario, the motion of an object is profoundly impacted by the movements of its surrounding objects. This is highly similar to people's behaviors on a social network (a person is usually impacted by his/her friends). This inspires us to represent the inter-object interactions using an undirected graph G = {V, E}, as researchers have done for social networks.

In this graph, each node in the node set V corresponds to an object in a traffic scene. Considering that each object may have different states at different time steps, the node set V is defined as V = {v_it | i = 1, ..., n, t = 1, ..., t_h}, where n is the number of observed objects in a scene and t_h is the number of observed time steps. The feature vector v_it on a node is the coordinate of the i-th object at time t.

At each time step t, objects that have interactions should be connected with edges. In the autonomous driving application scenario, such an interaction happens when two objects are close to each other. Thus, the edge set E is composed of two parts: (1) The first part describes the interaction information between two objects in spatial space at time t. We call it a "spatial edge" and denote it as E_S = {v_it v_jt | (i, j) ∈ D}, where D is a set in which objects are close to each other. In this paper, we define that two objects are close if their distance is shorter than a threshold D_close. In Figure 1, we demonstrate this concept on "Raw Data" using two blue circles with a radius of D_close. All objects within a blue circle are regarded as close to the one located in the middle of the circle. Thus, the top object has three close neighbors, and the lower one only has one neighbor. (2) The second part is the inter-frame edges, which represent the historical information frame by frame in temporal space. Each observed object at one time step is connected to itself at another time step via a temporal edge, and such edges are denoted as E_F = {v_it v_i(t+1)}. Thus, all edges in E_F of one particular object represent its trajectory over the time steps.

To make the computation more efficient, we represent this graph using an adjacency matrix A = {A_0, A_1}, where A_0 is an identity matrix I representing self-connections in temporal space, and A_1 is a spatial connection adjacency matrix. Thus, at any time t,

$$A_0[i][j] \; (\text{or } A_1[i][j]) = \begin{cases} 1, & \text{if edge } \langle v_{it}, v_{jt} \rangle \in E \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$
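As a concrete illustration of the preprocessing and graph construction above (a minimal sketch with names of our own choosing, not the paper's code), the velocity input and the adjacency matrices A_0 and A_1 can be computed as follows; for brevity the spatial edges are built from the last observed frame only, whereas the description above applies per time step:

```python
import numpy as np

def preprocess_and_build_graph(positions, d_close=25.0):
    """positions: (n, t_h, 2) array of x, y coordinates of n objects.

    Returns the velocity input (n, t_h - 1, 2) and the graph A = {A0, A1}:
    A0 is the identity (temporal self-connections), and A1 holds the
    spatial edges between objects closer than d_close (here measured on
    the last observed frame for brevity).
    """
    velocities = positions[:, 1:, :] - positions[:, :-1, :]  # p^(t+1) - p^(t)

    last = positions[:, -1, :]                               # (n, 2)
    dist = np.linalg.norm(last[:, None, :] - last[None, :, :], axis=-1)
    A1 = (dist < d_close).astype(np.float32)                 # Eq. (4)
    np.fill_diagonal(A1, 0.0)                                # self-loops live in A0
    A0 = np.eye(positions.shape[0], dtype=np.float32)
    return velocities, A0, A1
```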
Both A_0 and A_1 have a size of (n × n), where n equals the number of observed objects in a scene. Such a graph is constructed based on a manually designed rule, so it is fixed once the input data are given and does not change during the training phase. Thus, we call it the "Fixed Graph" (the blue graph symbol in Figure 1).

B. Graph Convolutional Model
Given the preprocessed input data (Input Representation) F_input ∈ R^{n × t_h × c}, the Graph Convolutional Model first passes it through a 2D convolutional layer with a (1 × 1) kernel ("Conv2D (1x1)" in Figure 1) to increase the number of channels. It maps the 2-dimensional input data (x, y coordinates) into a higher-dimensional space, which helps the model learn a good representation for the trajectory prediction task. Thus, its output has a shape of (n × t_h × C), where C is the new number of channels (C = 64 in Figure 1).

After that, the data are fed into several graph operations as well as temporal convolutions. The graph operations are designed to handle the inter-object interactions in spatial space, and the temporal convolutions are used to capture useful temporal features, e.g., the motion pattern of one object. Thus, as shown in Figure 1 (3 Graph Operation layers and 3 Temporal Convolution layers are illustrated), one Temporal Convolution layer is added after each Graph Operation layer in this Graph Convolutional Model, so that the input data are processed spatially and temporally in an alternating fashion.

Batch Normalization layers are employed to improve the training stability of our model. Besides, skip connections (green polylines in Figure 1) are used to make sure that the model can propagate larger gradients to the initial layers, so that these layers can learn as fast as the final layers.

1) Graph Operation Layer: A graph operation layer takes the interactions of surrounding objects into account. Each Graph Operation layer consists of two graphs: (i) a Fixed Graph (adjacency matrix A described in the previous section, blue graph symbols in Figure 1) constructed based on the current input, and (ii) a trainable graph (denoted as G_train, shown as orange graph symbols in the Graph Operation block in Figure 1) with the same shape as the Fixed Graph.

To make sure the value range of the feature maps remains unchanged after performing graph operations, we normalize the Fixed Graph A using the following equation:

$$G^j_{fixed} = \Lambda_j^{-\frac{1}{2}} A_j \Lambda_j^{-\frac{1}{2}} \qquad (5)$$

where the diagonal matrix $\Lambda_j$ is computed as:

$$\Lambda_j^{ii} = \sum_k (A_j^{ik}) + \alpha \qquad (6)$$

We set α = 0.001 to avoid empty rows in A_j.

Considering that the Fixed Graph G_fixed is constructed based on a manually designed rule, it may not be able to represent the interactions of objects properly. In this paper, to solve this problem, we sum the Fixed Graph with the trainable graph, so that the trainable graph can be trained to alleviate the limitations of the Fixed Graph. Thus, once a Graph Operation layer takes an input f_conv from its previous layer, the output feature map f_graph is calculated as:

$$f_{graph} = \sum_{j=0}^{1} (G^j_{fixed} + G^j_{train}) f_{conv} \qquad (7)$$

Graph Operation layers do not change the size of the features, so f_graph has a size of (n × t_h × C).

2) Temporal Convolutional Layer: Then, we feed the generated feature f_graph ∈ R^{n × t_h × C} to a Temporal Convolutional layer. We set the kernel size of a Temporal Convolutional layer to (1 × 3) to force it to process the data along the temporal dimension (the second dimension). Appropriate paddings and strides are added to make sure that each layer has an output feature map of the expected size.
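The sketch below is our own, hedged PyTorch reconstruction of Eqs. (5)-(7); the class and argument names are ours, and the tensor layout (batch, channels, time, objects) is an assumption. It shows how one Graph Operation layer can combine the normalized fixed graphs with a trainable graph:

```python
import torch
import torch.nn as nn

class GraphOperation(nn.Module):
    """One Graph Operation layer, implementing Eqs. (5)-(7)."""

    def __init__(self, n_nodes, alpha=0.001):
        super().__init__()
        self.alpha = alpha
        # G_train: one trainable graph per fixed graph (A0 and A1), Eq. (7).
        self.g_train = nn.Parameter(torch.zeros(2, n_nodes, n_nodes))

    def normalize(self, A):
        # Eqs. (5)-(6): Lambda^{-1/2} A Lambda^{-1/2}, with alpha added to
        # the row sums so that empty rows do not cause division by zero.
        d = (A.sum(-1) + self.alpha).pow(-0.5)
        return d.unsqueeze(-1) * A * d.unsqueeze(-2)

    def forward(self, f_conv, A0, A1):
        # f_conv: (batch, C, t_h, n) feature map from the previous layer.
        out = torch.zeros_like(f_conv)
        for j, A in enumerate((A0, A1)):
            G = self.normalize(A) + self.g_train[j]     # fixed + trainable
            # Mix features across objects (last dimension), Eq. (7).
            out = out + torch.einsum('bctn,nm->bctm', f_conv, G)
        return out   # same size (n x t_h x C) as the input
```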
C. Trajectory Prediction Model

The Trajectory Prediction Model consists of several networks. These networks share the same Seq2Seq structure but are trained to have different weights. In Figure 1, we show two such Seq2Seq networks. Each network takes the Graph Feature (generated by the Graph Convolutional Model) as its input. The feature vectors (at each temporal position) of the Graph Feature are fed into the corresponding input cells of the Encoder GRU (gray arrows in Figure 1).

Then, the hidden feature of the Encoder GRU, as well as the coordinates of the objects at the previous time step, are fed into a Decoder GRU to predict the position coordinates at the current time step. Specifically, the input of the first decoding step (gray "Last History" boxes in Figure 1) is the coordinates of the objects at the "Last History" step (corresponding to the gray column of the Input Representation in Figure 1), and the output of the current step is fed into the next GRU cell. Such a decoding process is repeated until the model has predicted positions for all expected time steps (t_f) in the future.

Because few traffic agents move at a constant velocity, we force the model to predict the change of velocity by adding a residual connection (blue dashed lines in Figure 1) between the input and the output of each cell of the Decoder GRU. The impact of using such a residual connection is discussed in the Experiments section (Section V).

Finally, once we get the predicted results of these Seq2Seq networks, we average the results (predicted velocities) at each time step. After getting the averaged predicted velocities, we add them (Δx, Δy) back to the last historical location p^(t_h) to convert the predicted results to (x, y) coordinates.

The key differences between GRIP++ and GRIP are:
• GRIP++ takes velocities (Δx, Δy) as input, while GRIP takes (x, y) coordinates as input.
• GRIP++ considers both fixed and trainable graphs, while GRIP merely considers fixed graphs in the graph convolution submodule.
• GRIP++ uses 3 blocks in the graph convolution model and adds batch normalization, while GRIP uses 10 blocks in the graph convolution model without batch normalization layers. In addition, GRIP++ uses skip connections.
• GRIP++ uses GRU networks, while GRIP uses LSTM networks. GRIP++ also uses three encoder-decoder blocks for trajectory prediction and averages their results, while GRIP merely uses a single encoder-decoder block for trajectory prediction.

A sketch of one such encoder-decoder branch is given below.
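This is a minimal, hedged sketch (our naming and layer sizes; we feed the last observed velocity as the first decoder input, consistent with the velocity-based input of GRIP++) of one encoder-decoder branch with the residual connection:

```python
import torch
import torch.nn as nn

class Seq2SeqBranch(nn.Module):
    """One encoder-decoder branch of the Trajectory Prediction Model."""

    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hidden_dim, num_layers=2, batch_first=True)
        self.decoder = nn.GRU(out_dim, hidden_dim, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, graph_feature, last_step, t_f):
        # graph_feature: (batch, t_h, in_dim); last_step: (batch, out_dim),
        # the "Last History" input of the first decoding step.
        _, hidden = self.encoder(graph_feature)
        inp, outputs = last_step.unsqueeze(1), []
        for _ in range(t_f):
            dec_out, hidden = self.decoder(inp, hidden)
            # Residual connection: the cell predicts only the CHANGE of
            # velocity, which is added back to its input.
            step = inp + self.proj(dec_out)
            outputs.append(step)
            inp = step            # current output feeds the next GRU step
        return torch.cat(outputs, dim=1)   # (batch, t_f, out_dim) velocities
```

In GRIP++, three such branches are run, their predicted velocities are averaged, and the averages are cumulatively added to the last observed position p^(t_h) to obtain (x, y) coordinates.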
D. Implementation Details

Our scheme is implemented using the Python programming language and the PyTorch library [46]. We report the implementation details of our scheme and the settings of the important parameters as follows.
Input Preprocessing Model: In this paper, we process a traffic scene within 180 feet (±90 feet). All objects within this region are observed and predicted. While constructing the graph, we consider two objects to be close if their distance is less than 25 feet (D_close = 25). Thus, any pair of objects within 25 feet is connected by a spatial edge e_s ∈ E_S. Please refer to our ablation study in Section V-C for more details.
Graph Convolutional Model: As shown in Figure 1, we use a (1 × 1) convolutional layer to increase the number of channels of the input data to 64. The Graph Convolutional Model consists of 3 Graph Operation layers, each of which is followed by a Temporal Convolution layer. All Temporal Convolution layers have a convolutional kernel of size (1 × 3). We set stride = 1 with appropriate padding to maintain the shape of the feature maps. Thus, the output of the Graph Convolutional Model has a size of (n × t_h × 64). To avoid overfitting, we randomly drop out features (with 0.5 probability) after each graph operation.
Trajectory Prediction Model: Both the encoder and the decoder of this prediction model are two-layer GRU (gated recurrent unit) networks. We set the number of hidden units of these two GRUs equal to r times the output dimension (r × 2 × n, where r is used to improve the representation ability, n is the number of objects, and 2 corresponds to the x, y coordinates). In this paper, we choose r = 30 for its best performance (please refer to the Experiments section for more discussion). The input of the encoder has 64 channels, the same as the output of the Graph Convolutional Model.
Optimization: We train our model as a regression task at each time step. The overall loss can be computed as:
$$Loss = \frac{1}{t_f} \sum_{t=1}^{t_f} loss_t \qquad (8)$$

$$= \frac{1}{t_f} \sum_{t=1}^{t_f} \left\| Y^t_{pred} - Y^t_{GT} \right\| \qquad (9)$$

where t_f is the number of time steps in the future (in Figure 1, t_f = 4), loss_t is the loss at time t, and Y_pred and Y_GT are the predicted positions and the ground truth, respectively. The model is trained to minimize this loss.
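A direct PyTorch transcription of this loss (a sketch; the tensor layout is our assumption) is:

```python
import torch

def overall_loss(y_pred, y_gt):
    """Eqs. (8)-(9): average displacement error over the t_f future steps.
    y_pred, y_gt: (batch, t_f, 2 * n) predicted / ground-truth positions
    of the n observed objects."""
    per_step = torch.norm(y_pred - y_gt, dim=-1)   # ||Y^t_pred - Y^t_GT||
    return per_step.mean()                         # mean over t and batch
```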
Training Process: We train the model using the Adam optimizer with the default settings in the PyTorch library. We set batch size = 64 during training.

V. EXPERIMENTS
We run our scheme on a desktop running Ubuntu 16.04 with a 4.0 GHz Intel Core i7 CPU, 32 GB of memory, and an NVIDIA Titan Xp graphics card.

A. Datasets
We evaluate our scheme on three well-known trajectory prediction datasets: NGSIM I-80 [47], US-101 [48], and the ApolloScape Trajectory dataset [41].
NGSIM Datasets: Both the NGSIM I-80 and US-101 datasets were captured at 10 Hz over 45 minutes and segmented into 15-minute subsets of mild, moderate, and congested traffic conditions. These two datasets consist of trajectories of vehicles in real freeway traffic. Coordinates of cars in a local coordinate system are provided.

We follow Deo et al. [43], [44], [45] in splitting these two datasets into training and testing sets. One-fourth of the data from each of the three subsets (mild, moderate, and congested traffic conditions) is selected for testing. Each trajectory is segmented into 8-second clips in which the first 3 seconds are used as the observed track history and the remaining 5 seconds are the prediction ground truth. For a fair comparison, we also downsample each segment by a factor of 2 as Deo et al. did, i.e., to 5 frames per second. The code for dataset segmentation can be downloaded from their GitHub repository (https://github.com/nachiket92/conv-social-pooling).
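In other words, after downsampling, each 8-second clip contains 15 observed frames and 25 future frames; a sketch of this split (our own helper, not the released segmentation code) is:

```python
FPS = 10 // 2        # 10 Hz data downsampled by a factor of 2 -> 5 fps
T_H = 3 * FPS        # 3 s of observed track history -> 15 frames
T_F = 5 * FPS        # 5 s of prediction ground truth -> 25 frames

def split_clip(clip):
    """clip: list of per-frame coordinates of one downsampled 8 s segment."""
    assert len(clip) >= T_H + T_F
    return clip[:T_H], clip[T_H:T_H + T_F]   # (history, future ground truth)
```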
ApolloScape Trajectory Dataset: The ApolloScape Trajectory dataset is collected by running the Apollo acquisition car [49] in urban areas during rush hours. Traffic data, including camera-based images and LiDAR-based point clouds, are collected, and object trajectories are computed using object detection and tracking algorithms. In total, the trajectory dataset consists of 53 minutes of training sequences and 50 minutes of testing sequences captured at 2 frames per second. Object id, object type, object position, object size, heading angle, etc. are provided. Because the data are collected in urban areas, five different object types are involved: small vehicles, big vehicles, pedestrians, motorcyclists and bicyclists, and others. This particular dataset allows researchers to stress test the trajectory prediction schemes they design, as the various types of traffic agents with different behaviors create additional design challenges. In Fig. 2, we highlight some of the challenges that any object trajectory prediction scheme faces in urban traffic scenarios. In Fig. 2(a), traffic agents move at various speeds in different directions, while in Fig. 2(b), three traffic agents of different sizes interact, with one traffic agent trying to squeeze through a tight space.

Fig. 2: ApolloScape urban traffic scenes [49]: (a) Urban Traffic Scene 1; (b) Urban Traffic Scene 2.

During the training phase, we choose 20% of the sequences from the training subset for validation and train our model using the remaining 80% of the sequences. Once the model is trained, we generate predictions on the testing sequences and submit the results to the ApolloScape website for evaluation.

B. Metrics
RMSE:
For the NGSIM I-80 and US-101 datasets, we use the same experimental settings and evaluation metrics as [45] and [50]. In this paper, we report our results in terms of the root mean squared error (RMSE) of the predicted trajectories over a 5-second horizon in the future. The RMSE at time t can be computed as follows:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Y^t_{pred}[i] - Y^t_{GT}[i] \right)^2} \qquad (10)$$

where n is the number of observed (predicted) objects, and Y^t_pred and Y^t_GT are the predicted results and the ground truth at time t, respectively.
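Eq. (10) translates directly into a few lines (a sketch; the array shapes are our assumption):

```python
import numpy as np

def rmse_at_t(y_pred_t, y_gt_t):
    """y_pred_t, y_gt_t: (n, 2) predicted / ground-truth positions of the
    n objects at one future time step t."""
    sq_err = np.sum((y_pred_t - y_gt_t) ** 2, axis=-1)   # squared distances
    return np.sqrt(sq_err.mean())                        # Eq. (10)
```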
WSADE and WSFDE: For the ApolloScape Trajectory dataset, we use both the Weighted Sum of Average Displacement Error (WSADE) and the Weighted Sum of Final Displacement Error (WSFDE) metrics to evaluate performance. As described on the ApolloScape website, the average displacement error (ADE) measures the mean Euclidean distance over all the predicted positions and ground truth positions during the prediction time, and the final displacement error (FDE) is the mean Euclidean distance between the final predicted positions and the corresponding ground truth locations. Because the trajectories of cars, bicyclists, and pedestrians have different scales, the following weighted sums of ADE (WSADE) and FDE (WSFDE) are used as metrics:
$$WSADE = D_v \cdot ADE_v + D_p \cdot ADE_p + D_b \cdot ADE_b \qquad (11)$$

$$WSFDE = D_v \cdot FDE_v + D_p \cdot FDE_p + D_b \cdot FDE_b \qquad (12)$$

where D_v, D_p, and D_b are related to the reciprocals of the average velocities of vehicles, pedestrians, and cyclists in the dataset; their values are set to 0.20, 0.58, and 0.22, respectively.
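For reference, the two metrics reduce to simple weighted sums once the per-class errors are known (a sketch using the published weights):

```python
# ApolloScape weights, tied to the reciprocal average class velocities.
D_V, D_P, D_B = 0.20, 0.58, 0.22

def wsade(ade_v, ade_p, ade_b):
    return D_V * ade_v + D_P * ade_p + D_B * ade_b   # Eq. (11)

def wsfde(fde_v, fde_p, fde_b):
    return D_V * fde_v + D_P * fde_p + D_B * fde_b   # Eq. (12)
```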
C. Ablation Study

In this subsection, we conduct two ablation studies:

(1) We defined a threshold D_close in Section IV-A2. Two objects within D_close of each other are regarded as close. We first explore how this threshold impacts the performance of our model. In Figure 3, we compare the results when D_close is set to different values. One can see that the prediction error when D_close = 0 (when none of the surrounding objects are considered, blue bars in Figure 3) is higher than the results when D_close > 0 (taking nearby objects into account). Thus, considering the surrounding objects indeed helps our model make better predictions.

Fig. 3: Comparison among various D_close values.

We also notice that the prediction error increases when D_close increases from 25 feet (orange bars) to 50 feet (green bars). This is because more objects are used to predict the motion of an object with a larger D_close. In real life, a traffic agent is more likely to be impacted only by its closest objects. Thus, considering too many surrounding objects does not help to improve the prediction accuracy. Based on this observation, in this paper, we set D_close = 25 feet as our default setting unless specified otherwise.

(2) Given an input stream consisting of observed objects' past trajectories, our model is able to predict future trajectories for all observed objects. Thus, in Figure 4, we report the prediction error for objects at different locations within the observed area (from −90 to 90 feet). In Figure 4, traffic agents are moving from location −90 toward location 90 (left to right).

Fig. 4: Prediction error at different locations.

First, one may notice that the prediction error decreases over the first part of the observed region and then increases toward the right end. Such an observation is most obvious on the top 3 curves ("Future 5/4/3 second"). This is caused by the clue information available from surrounding objects. Because objects are moving from left to right in Figure 4, objects located at the right end (90) can only observe objects behind them, while objects at the left end (−90) can only see objects in front of them. That the prediction error at −90 is lower than the error at 90 suggests that objects in front are more important than objects behind for our trajectory prediction model. This is also the reason why the prediction error increases toward the right end of the region (fewer and fewer front objects are observed from left to right).

In addition, predicting the motion of an object in the far future is difficult. Thus, in Figure 4, the error of a long-horizon prediction ("Future 5 second") is higher than that of a shorter one ("Future 1 second").

D. Experiments on the NGSIM Datasets
In this subsection, we compare our proposed scheme to the following baselines (as done in [45]) and some existing solutions using the NGSIM datasets:
• Constant Velocity (CV): A baseline that only uses a constant-velocity Kalman filter to predict trajectories in the future.
• Vanilla LSTM (V-LSTM): A baseline that feeds the track history of the predicted object to an LSTM model to predict a distribution over its future position.
• C-VGMM + VIM: In [44], Deo et al. propose a maneuver-based variational Gaussian mixture model with a Markov random field based vehicle interaction module.
• GAIL-GRU: Kuefler et al. [50] use a generative adversarial imitation learning model for vehicle trajectory prediction. However, they use ground truth data for the surrounding vehicles as input during the prediction phase.
• CS-LSTM(M): An LSTM model with convolutional social pooling layers proposed by Deo et al. in [45]. A maneuver classifier is included.
• CS-LSTM: The CS-LSTM model without the maneuver classifier described in [45].

Comparison results are reported in Table I. Our model can predict the trajectories of all observed objects simultaneously, while the other schemes listed in Table I only predict one specific object (in the middle position) each time. Thus, to make a fair comparison, we compute the RMSE for the same objects as the other schemes and report the result in the last column, "GRIP++ (Δ CS-LSTM)", of Table I. Compared to the existing state-of-the-art result (CS-LSTM [45]), our proposed GRIP++ improves the prediction performance by at least 30%. One may notice that, for predictions 3 or more seconds into the future, the prediction error of GRIP++ is at least half a meter lower than that of CS-LSTM [45]. We believe that such an improvement can help an autonomous driving car avoid many traffic accidents.

Comparing the result of CS-LSTM(M) to CS-LSTM, one can see that CS-LSTM makes slightly better predictions than CS-LSTM(M). This is consistent with our argument in Section II that a wrong classification of the maneuver type has an adverse effect on the trajectory prediction.

Besides, compared to our previous work (GRIP, the second column from the right in Table I), GRIP++ achieves comparable results for short-term prediction (the first three seconds) and better results for long-term forecasts (at 4 and 5 seconds). This shows that GRIP++ has a better capability to extract useful features from historical trajectories and make long-term predictions. The NGSIM datasets only consist of trajectories of vehicles in freeway traffic, which means the motion patterns are similar and straightforward. Thus, GRIP and GRIP++ have similar performance on the NGSIM datasets (especially

TABLE I: Root Mean Square Error (RMSE) for trajectory prediction on the NGSIM I-80 and US-101 datasets. Data are converted into meters. All results except ours are extracted from [45]. The smaller the value, the better.
Prediction Horizon (s) | CV | V-LSTM | C-VGMM+VIM [44] | GAIL-GRU [50] | CS-LSTM(M) [45] | CS-LSTM [45] | GRIP [28] | GRIP++ (Δ CS-LSTM)
1 | 0.73 | 0.68 | 0.66 | 0.69 | 0.62 | 0.61 | 0.37 | 0.38 (38%↑, -0.23)
2 | 1.78 | 1.65 | 1.56 | 1.51 | 1.29 | 1.27 | 0.86 | 0.89 (30%↑, -0.38)
3 | 3.13 | 2.91 | 2.75 | 2.55 | 2.13 | 2.09 | 1.45 | 1.45 (31%↑, -0.64)
4 | 4.78 | 4.46 | 4.24 | 3.65 | 3.20 | 3.10 | 2.21 | 2.14 (31%↑, -0.96)
5 | 6.68 | 6.27 | 5.99 | 4.71 | 4.52 | 4.37 | 3.16 | 2.94 (33%↑, -1.43)

for short-term prediction). However, predicting trajectories in urban scenarios is much more complicated and difficult than in highway scenarios. Thus, in the next subsection, we evaluate the performance of our proposed scheme using a dataset collected in urban areas.

E. Experiments on the ApolloScape Trajectory Datasets
In Table II, we compare our proposed scheme to the other methods on the ApolloScape leaderboard that have publications. GRIP++ achieves much better prediction results than TrafficPredict [51] (an 85% improvement), and it also achieves better results than StarNet [52], which previously ranked first on the leaderboard.

To understand how different design choices affect the performance of our model, we adjust one setting at a time and report the resulting WSADE in Table III. The settings are as follows:
• BatchNorm: As shown in Figure 1, we add a Batch Normalization layer after each Graph Operation layer and Temporal Convolution layer. In Table III, "Y" indicates that Batch Normalization layers are included, and "N" indicates that they are excluded.
• Input: Two types of input processing are tested: (1) use the position coordinates as the input of our model while normalizing them to the range [-1, 1] by dividing all data by the maximum value in the training set (marked as "Norm" in Table III); (2) use the velocities of the objects as the input of our model (marked as "Velocity" in Table III).
• RNN Type: There are different types of RNN networks. In this paper, we tried Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU).
• GCN Layers: We also tried different numbers of GCN layers. A value of 10 means that 10 Graph Operation layers and 10 Temporal Convolution layers are used.
• RNN In+Out: We argue that it is easier to predict the velocity of an object than its location, and we add a residual connection between the input and the output of the Decoder GRU. In Table III, "Y" indicates that residual connections are used, and "N" indicates no residual connections in the Decoder GRU.
• GCN Graph: In each Graph Operation layer, we add the Fixed Graph and a trainable graph before performing the graph operation. We evaluate the effectiveness of the trainable graph.
• RNN Num: The number of Seq2Seq networks in the Trajectory Prediction Model is also explored. A value of 3 indicates that 3 Seq2Seq networks with the same structure are used and their results are averaged.
• RNN Size (r): In Section IV-D, we set the hidden size of the RNN networks to r times the output dimension. Here, we explore the impact of using different values of r.
• Data Aug.: To train a model with better generalization capability, we applied data augmentation to the input data, e.g., randomly rotating the input data or enhancing the model using the observed testing data. "Y" indicates that data augmentation is applied, and "N" indicates no data augmentation.

From Table III, we observe the following:
• 1. Comparing B3 to B2, one can see a significant improvement from using velocity as input instead of normalized positions. This verifies our argument that predicting the velocity of an object is easier than predicting its location. Predicting the physical position of an object is hard because the position values (x, y coordinates) can be arbitrary, so they have very large norms and the model cannot easily learn good weights to handle them. The velocity of an object, however, is more nearly constant, no matter where the object is located.
• 2. From B5 to B6, another big improvement is achieved due to the residual connection between the input and the output of the Decoder GRU. This result shows that the residual connection indeed helps the model learn to adjust the velocity. After adding the residual connection, the model just needs to learn the change of velocity (acceleration). Because the change of velocity (acceleration) is more nearly constant than the velocity itself, predicting acceleration is an easier task for the model to learn, which results in better performance.
• 3. Compared to B6, B7 includes a trainable graph in each Graph Operation layer. That B7 makes a better prediction than B6 shows that the trainable graphs are indeed trained to cover the shortcomings of the Fixed Graphs.
• 4. At B4, we change the LSTM to a GRU. The scale of the improvement is surprising. Usually, GRUs train faster and perform better than LSTMs on less training data. If this experience holds for this task, it means that the amount of data in the ApolloScape dataset is not enough for the LSTM model we use.
• 5. Then, at B5, we reduce the number of GCN layers (both Graph Operation layers and Temporal Convolution layers) from 10 to 3. The simplified model achieves similar performance. Thus, we use the model with fewer layers for faster training and testing speed.
TABLE II: Competition results on the ApolloScape Trajectory dataset.

Method | WSADE | ADEv | ADEp | ADEb | WSFDE | FDEv | FDEp | FDEb
TrafficPredict [51] | 8.5881 | 7.9467 | 7.1811 | 12.8805 | 24.2262 | 12.7757 | 11.1210 | 22.7912
StarNet [52] | 1.3425 | 2.3860 | 0.7854 | 1.8628 | 2.4984 | 4.2857 | 1.5156 | 3.4645
GRIP++ (Ours) | 1.2588 | 2.2400 | 0.7142 | 1.8024 | 2.3631 | 4.0762 | 1.3732 | 3.4155
TABLE III: Changes in performance (in terms of WSADE) while adjusting the model. Each time, we only change one setting. The smaller the value, the better; a dash marks a value that could not be recovered.

Index | BatchNorm | Input | RNN Type | GCN Layers | RNN In+Out | GCN Graph | RNN Num | RNN Size (r) | Data Aug. | WSADE
B1 | N | Norm | LSTM | 10 | N | Fixed Only | 1 | 2 | N | –
B2 | Y | Norm | LSTM | 10 | N | Fixed Only | 1 | 2 | N | 6.9971
B3 | Y | Velocity | LSTM | 10 | N | Fixed Only | 1 | 2 | N | 2.6679
B4 | Y | Velocity | GRU | 10 | N | Fixed Only | 1 | 2 | N | 2.0743
B5 | Y | Velocity | GRU | 3 | N | Fixed Only | 1 | 2 | N | 2.0034
B6 | Y | Velocity | GRU | 3 | Y | Fixed Only | 1 | 2 | N | 1.5207
B7 | Y | Velocity | GRU | 3 | Y | Fixed + Train | 1 | 2 | N | 1.3863
B10 | Y | Velocity | GRU | 3 | Y | Fixed + Train | 3 | – | N | 1.3245
B11 | Y | Velocity | GRU | 3 | Y | Fixed + Train | 3 | – | N | 1.3227
B12 | Y | Velocity | GRU | 3 | Y | Fixed + Train | 3 | 30 | N | 1.2803
B13 (GRIP++) | Y | Velocity | GRU | 3 | Y | Fixed + Train | 3 | 30 | Y | 1.2588
• 6. From B8 to B12, different hidden sizes for our Seq2Seq networks are explored. One can see that the performance initially improves as r increases, until r = 30. After that, the performance degrades (B11 performs worse than B12). With increasing r, we have more parameters to train. The increasing number of parameters helps to capture more information, but it will eventually cause an overfitting problem.
• 7. Besides the above insights, we also notice that adding Batch Normalization layers, using more Seq2Seq models, and applying data augmentation each help a little, but not too much, in the performance of the model.

Comparing B13 to B1 in Table III, one can see that GRIP++ achieves much better prediction results (an 83% improvement) on the ApolloScape Trajectory dataset. As we mentioned above, predicting trajectories in urban scenarios is difficult. This result shows that the proposed GRIP++ is more robust and useful in real-world scenarios.

F. Computation Time
Computation efficiency is one of the important performance indicators of an algorithm for autonomous driving cars. Thus, we evaluate the computation time of our proposed GRIP and GRIP++ and report the results in Table IV.

To make a fair comparison, we downloaded the code of CS-LSTM [45] and ran it on our machine to collect its computation time. CS-LSTM, GRIP, and GRIP++ are all implemented using PyTorch.

From Table IV, one can see that, when using a batch size of 128, CS-LSTM [45] needs 0.29 s to predict trajectories for 1000 objects, GRIP takes 0.05 s (5.8x faster), and our proposed GRIP++ takes only 0.02 s (14.5x faster than CS-LSTM). In the autonomous driving application scenario, considering the limited on-board resources, we can only set batch size = 1, so we report those results in the last column of Table IV. They show that GRIP runs 5.5 times faster and GRIP++ runs 21.7 times faster than CS-LSTM [45].

TABLE IV: Computation time (per 1000 predicted objects).

Scheme | batch size = 128 | batch size = 1 (speedup vs. CS-LSTM)
CS-LSTM [45] | 0.29 s | 1.0x
GRIP [28] | 0.05 s (5.8x) | 5.5x
GRIP++ (Ours) | 0.02 s (14.5x) | 21.7x

The primary reason that GRIP and GRIP++ run faster than CS-LSTM is that both GRIP and GRIP++ predict trajectories for all observed objects simultaneously, while CS-LSTM only predicts for one object at a time. Besides, GRIP++ consists of many fewer layers than GRIP, which results in its faster speed: we use only 3 Graph Operation layers and 3 Temporal Convolution layers in GRIP++, while GRIP has 10 Graph Operation layers and 10 Temporal Convolution layers.
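For reproducibility, latency numbers of this kind can be collected with a simple harness such as the sketch below (our own measurement loop, not the exact one used for Table IV):

```python
import time
import torch

@torch.no_grad()
def mean_latency(model, batch, n_runs=100, warmup=10):
    """Average per-call inference time of `model` on a prepared `batch`."""
    model.eval()
    for _ in range(warmup):
        model(batch)                  # warm-up: caches, cuDNN autotuning
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # wait for queued GPU kernels
    start = time.time()
    for _ in range(n_runs):
        model(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.time() - start) / n_runs
```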
G. Visualization of Prediction Results

In Figure 5, we visualize several prediction results under mild, moderate, and congested traffic conditions (from left to right) using the NGSIM I-80 and US-101 datasets. After observing 3 seconds of trajectory history, our model predicts the trajectories over a 5-second horizon in the future. From Figure 5, one can notice the following:
• 1. From Figure 5a to Figure 5c, it is obvious that the green dashed lines (CS-LSTM) are longer than the yellow dashed lines (ours) and farther from the red dashed lines (ground truth). This shows that, when the same trajectory histories (of all objects in the scene) are fed to the models, our proposed GRIP++ makes a better prediction for the central object than CS-LSTM.
Fig. 5: Visualized prediction results (panels (a)-(e)). Blue rectangles are the cars located in the middle, i.e., the cars that CS-LSTM [45] tries to predict. Black boxes are surrounding cars. Black solid lines are the observed history, red dashed lines are the ground truth in the future, yellow dashed lines are the predicted results (5 seconds) of our GRIP++ (GRIP has similar performance on this dataset), and green dashed lines are the predicted results (5 seconds) of CS-LSTM [45]. The region from −90 to 90 feet is the observed area.
• 2. In Figure 5b, our model precisely predicts the trajectory of the top car even though it is going to change lanes in the next 5 seconds. In addition, the car in the left lane is affected by the top car, and our model still successfully predicts the trajectory of the car in the left lane.
• 3. Our proposed GRIP++ can predict all objects in the scene simultaneously, while CS-LSTM can only predict the one located in the middle. In particular, in Figure 5e, we show a prediction result for a scene that involves 15 cars. In this scene, although some cars move slowly (vehicles in the middle lane) while others move faster (cars in the right lane), our proposed GRIP++ model is able to predict their future trajectories correctly and simultaneously.

Based on these observations from the visualized results, we can conclude that our proposed scheme, GRIP++, indeed improves trajectory prediction performance compared to the existing methods. Even though Figure 5 only shows straight highway scenarios, our approach works equally well for curved roads.

VI. CONCLUSION
In this paper, we propose GRIP++ for autonomous driving cars to predict object trajectories in the future. The proposed model uses a graph to represent the interactions among all close objects and employs an encoder-decoder GRU-based model to make predictions. Unlike some existing solutions that only predict the future trajectory for a single traffic agent each time, GRIP++ is able to predict trajectories for all observed objects simultaneously. The experimental results on two well-known highway datasets and one urban traffic dataset show that our proposed model achieves much better prediction results than existing methods and runs 21.7 times faster than one state-of-the-art scheme. Compared to our previous work, GRIP [28], GRIP++ achieves similar performance in highway scenarios and an 83% improvement in urban scenarios. We also conduct extensive ablation studies to understand how different design choices affect the trajectory prediction accuracy. In the near future, we hope to integrate GRIP++ into a route planning module and combine it with a deep learning based perception module to further evaluate the overall performance of these two modules. Subsequently, we intend to run this integrated perception and navigation module in prototype robotic cars in a testbed.
Acknowledgement: This work is partially supported by a Qualcomm gift, an NSF CNS award 1931867, and a GPU donated by NVIDIA.

REFERENCES
[1] Y. Shen, W. Hu, M. Yang, B. Wei, S. Lucey, and C. T. Chou, "Face recognition on smartphones via optimised sparse representation classification," in Proceedings of the 13th International Symposium on Information Processing in Sensor Networks. IEEE Press, 2014, pp. 237–248.
[2] K. Patel, H. Han, and A. K. Jain, "Secure face unlock: Spoof detection on smartphones," IEEE Transactions on Information Forensics and Security, vol. 11, no. 10, pp. 2268–2283, 2016.
[3] X. Ying, X. Li, and M. C. Chuah, "Liveface: A multi-task cnn for fast face-authentication." IEEE, 2018, pp. 955–960.
[4] X. Li and M. Choo Chuah, "Sbgar: Semantics based group activity recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2876–2885.
[5] H. Fan, L. Zheng, C. Yan, and Y. Yang, "Unsupervised person re-identification: Clustering and fine-tuning," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 14, no. 4, p. 83, 2018.
[6] X. Li, Q. Xue, and M. C. Chuah, "Casheirs: Cloud assisted scalable hierarchical encrypted based image retrieval system," in IEEE INFOCOM 2017 - IEEE Conference on Computer Communications. IEEE, 2017, pp. 1–9.
[7] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, "Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 994–1003.
[8] X. Li and M. C. Chuah, "Rehar: Robust and efficient human activity recognition." IEEE, 2018, pp. 362–371.
[9] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, Z. Hu, C. Yan, and Y. Yang, "Improving person re-identification by attribute and identity learning," Pattern Recognition, 2019.
[10] X. Li, "Robust and efficient activity recognition from videos," Ph.D. dissertation, Lehigh University, 2020.
[11] X. Li, M. C. Chuah, and S. Bhattacharya, "Uav assisted smart parking solution." IEEE, 2017, pp. 1006–1013.
[12] B. Feng, F. He, X. Wang, Y. Wu, H. Wang, S. Yi, and W. Liu, "Depth-projection-map-based bag of contour fragments for robust hand gesture recognition," IEEE Transactions on Human-Machine Systems, vol. 47, no. 4, pp. 511–523, 2016.
[13] S. S. Rautaray and A. Agrawal, "Real time hand gesture recognition system for dynamic applications," International Journal of UbiComp, vol. 3, no. 1, p. 21, 2012.
[14] J. H. Mosquera, H. Loaiza, S. E. Nope, and A. D. Restrepo, "Identifying facial gestures to emulate a mouse: navigation application on facebook," IEEE Latin America Transactions, vol. 15, no. 1, pp. 121–128, 2017.
[15] G. Zhu, L. Zhang, P. Shen, and J. Song, "Multimodal gesture recognition using 3-d convolution and convolutional lstm," IEEE Access, vol. 5, pp. 4517–4524, 2017.
[16] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," arXiv preprint arXiv:1405.3866, 2014.
[17] X. Zhang, J. Zou, K. He, and J. Sun, "Accelerating very deep convolutional networks for classification and detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1943–1955, 2016.
[18] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned cp-decomposition," arXiv preprint arXiv:1412.6553, 2014.
[19] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications," arXiv preprint arXiv:1511.06530, 2015.
[20] X. Li, S. Zhang, B. Jiang, Y. Qi, M. C. Chuah, and N. Bi, "Dac: Data-free automatic acceleration of convolutional networks." IEEE, 2019, pp. 1598–1606.
[21] R. Toledo-Moreo and M. A. Zamora-Izquierdo, "Imm-based lane-change prediction in highways with low-cost gps/ins," IEEE Transactions on Intelligent Transportation Systems, vol. 10, no. 1, pp. 180–185, 2009.
[22] A. Houenou, P. Bonnifait, V. Cherfaoui, and W. Yao, "Vehicle trajectory prediction based on motion model and maneuver recognition." IEEE, 2013, pp. 4363–4369.
[23] M. Schreier, V. Willert, and J. Adamy, "Bayesian, maneuver-based, long-term trajectory prediction and criticality assessment for driver assistance systems." IEEE, 2014, pp. 334–341.
[24] Q. Tran and J. Firl, "Online maneuver recognition and multimodal trajectory prediction for intersection assistance using non-parametric regression." IEEE, 2014, pp. 918–923.
[25] J. Schlechtriemen, F. Wirthmueller, A. Wedel, G. Breuel, and K.-D. Kuhnert, "When will it change the lane? a probabilistic regression approach for rarely occurring events." IEEE, 2015, pp. 1373–1379.
[26] V. Karasev, A. Ayvaci, B. Heisele, and S. Soatto, "Intent-aware long-term prediction of pedestrian motion." IEEE, 2016, pp. 2543–2549.
[27] A. Bhattacharyya, M. Fritz, and B. Schiele, "Long-term on-board prediction of people in traffic scenes under uncertainty," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4194–4202.
[28] X. Li, X. Ying, and M. Chuah, "Grip: Graph-based interaction aware trajectory prediction," in Proceedings of IEEE ITSC, 2019.
[29] S. Danielsson, L. Petersson, and A. Eidehall, "Monte carlo based threat assessment: Analysis and improvements," in Proceedings of IEEE IV, 2007.
[30] S. Lefevre, C. Laugier, and J. Ibanez-Guzman, "Exploiting map information for driver intention estimation at road intersections," in Proceedings of IEEE IV, 2011.
[31] J. Firl, H. Stubing, S. A. Huss, and C. Stiller, "Predictive maneuver evaluation for enhancement of car-to-x mobility data," in Proceedings of IEEE IV, 2012.
[32] S. Ferguson, B. Luders, R. C. Grande, and J. P. How, "Real-time predictive modeling and robust avoidance of pedestrians with uncertain, changing intentions," in Algorithmic Foundations of Robotics XI, 2015, pp. 161–177.
[33] E. Chen, X. Bai, L. Gao, H. C. Tinega, and Y. Ding, "Augmented dictionary learning for motion prediction," in Proceedings of the International Conference on Robotics and Automation (ICRA), 2016.
[34] A. Bera, S. Kim, T. Randhavane, S. Pratapa, and D. Manocha, "Glmp - realtime pedestrian path prediction using global and local movement patterns," in IEEE ICRA, 2016, pp. 5528–5535.
[35] A. Bera, T. Randhavane, and D. Manocha, "Aggressive, tense or shy?" in IJCAI, vol. 17, 2017, pp. 112–118.
[36] Y. Ma, D. Manocha, and W. Wang, "Autorvo: Local navigation with dynamic constraints in dense heterogeneous traffic," in CSCS, 2018.
[37] A. Khoroshahi, "Learning, classification and prediction of maneuvers of surround vehicles at intersections using lstms," Ph.D. thesis, UCSD, 2017.
[38] F. Altche and A. De La Fortelle, "An lstm network for highway trajectory prediction," in IEEE ITSC, 2017, pp. 353–359.
[39] S. Park, B. Kim, C. M. Kang, C. C. Chung, and J. W. Choi, "Sequence-to-sequence prediction of vehicle trajectory via lstm encoder-decoder architecture," arXiv:1802.06338, 2018.
[40] R. Chandra, U. Bhattacharya, A. Bera, and D. Manocha, "Traphic: Trajectory prediction in dense and heterogeneous traffic using weighted interactions," arXiv:1812.04767, 2018.
[41] Y. Ma, X. Zhu, S. Zhang, R. Yang, W. Wang, and D. Manocha, "Trafficpredict: Trajectory prediction for heterogeneous traffic-agents," arXiv preprint arXiv:1811.02146, 2018.
[42] W. Luo, B. Yang, and R. Urtasun, "Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3569–3577.
[43] N. Deo and M. M. Trivedi, "Multi-modal trajectory prediction of surrounding vehicles with maneuver based lstms." IEEE, 2018, pp. 1179–1184.
[44] N. Deo, A. Rangesh, and M. M. Trivedi, "How would surround vehicles move? a unified framework for maneuver classification and motion prediction," IEEE Transactions on Intelligent Vehicles, vol. 3, no. 2, pp. 129–140, 2018.
[45] N. Deo and M. M. Trivedi, "Convolutional social pooling for vehicle trajectory prediction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1468–1476.
[46] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in pytorch," 2017.
[47] J. Colyar and J. Halkias, "US highway 80 dataset," Federal Highway Administration (FHWA), Tech. Rep. FHWA-HRT-07-030, 2007.
[48] J. Colyar and J. Halkias, "US highway 101 dataset," Federal Highway Administration (FHWA), Tech. Rep. FHWA-HRT-07-030, 2007.
[49] X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, "The apolloscape open dataset for autonomous driving and its application," 2018.
[50] A. Kuefler, J. Morton, T. Wheeler, and M. Kochenderfer, "Imitating driver behavior with generative adversarial networks." IEEE, 2017, pp. 204–211.
[51] Y. Ma, X. Zhu, S. Zhang, R. Yang, W. Wang, and D. Manocha, "Trafficpredict: Trajectory prediction for heterogeneous traffic-agents," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6120–6127.
[52] Y. Zhu, D. Qian, D. Ren, and H. Xia, "Starnet: Pedestrian trajectory prediction using deep neural network in star topology," arXiv preprint arXiv:1906.01797, 2019.