ParkPredict: Motion and Intent Prediction of Vehicles in Parking Lots
Xu Shen, Ivo Batkovic, Vijay Govindarajan, Paolo Falcone, Trevor Darrell, Francesco Borrelli
Abstract—We investigate the problem of predicting driver behavior in parking lots, an environment which is less structured than typical road networks and features complex, interactive maneuvers in a compact space. Using the CARLA simulator, we develop a parking lot environment and collect a dataset of human parking maneuvers. We then study the impact of model complexity and feature information by comparing a multi-modal Long Short-Term Memory (LSTM) prediction model and a Convolutional Neural Network LSTM (CNN-LSTM) to a physics-based Extended Kalman Filter (EKF) baseline. Our results show that 1) intent can be estimated well (roughly 85% top-1 accuracy and nearly 100% top-3 accuracy with the LSTM and CNN-LSTM models); 2) knowledge of the human driver's intended parking spot has a major impact on predicting the parking trajectory; and 3) the semantic representation of the environment improves long-term predictions.
I. INTRODUCTION
While autonomous driving technologies have advanced by leaps and bounds, autonomous vehicles (AVs) still face great challenges. Robust perception [1], prediction [2], and interaction in real traffic scenarios with other participants [3], [4], especially human-driven vehicles, are difficult. Depending on the estimated behavior of the others, an AV's behavior may be too conservative, too aggressive, or, in the worst case, unsafe.

This problem is especially difficult in compact and unstructured domains like parking lots, which feature numerous interactions with human agents in close proximity [5] and may lead to congestion under suboptimal policies [6]. The potential to equip sensing and communicating infrastructure in these environments may help enable algorithmic solutions for connected AVs [7].

Extensive research has been conducted to investigate the problem of vehicle motion prediction, and algorithms have been developed upon various levels of accessible information. Physics-based methods use the pose history to provide intuitive short-term predictions [8], where Kalman Filters [9] are used to propagate the vehicle dynamics forward in time and predict a trajectory or reachable set [10]. On the other hand, data-driven methods like Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM) networks, can learn a motion model without explicit knowledge of the vehicle parameters or input profile. The encoder of the LSTM processes a pose history and produces a summary of this sequence, which is then passed to a decoder for prediction [11], [12].

Since vehicle motion is directly influenced by the intent, such as turning, lane keeping, and lane changing, embedding the intent information with the motion information can improve prediction performance [13]. The LSTM encoder learns a representation of the motion data to predict intent and generate interpretable multi-modal predictions [14]. For highway or urban driving, intent is usually classified based on lane graphs [15], [16], [17].

Not requiring explicit processing of trajectory or environment structure, Convolutional Neural Networks (CNNs) have been used in conjunction with LSTMs to synthesize information and make predictions from image sequences [18]. In recent work, these architectures have been applied to bird's-eye view (BEV) semantic images [19], including vehicle geometry, dynamics [20], and collision avoidance constraints for multimodal motion prediction [21].

Most existing studies discussed above have focused on structured environments, i.e., those with a well-defined road network [22] consisting of intersections, lanes, and traffic lights. However, for environments such as parking lots, the following challenges arise:
1) only limited public datasets of human-driven vehicles inside parking lots are available [23];
2) the parking maneuver is typically more complex [24] and challenging than highway driving;
3) the compact space and proximity of surrounding objects increase the risk of collision and congestion.

In this work, we focus on the problem of predicting a human driver's intended parking spot and future trajectory, given a set of features and levels of model abstraction. This paper offers the following contributions:
1) we develop a parking lot simulation environment using the CARLA simulator and a racing wheel interface;
2) we collect an annotated dataset of human parking behaviors including both forward and reverse parking maneuvers, as well as various parking spot selections;
3) we propose a hierarchical LSTM model structure which can provide both intent and multi-modal trajectory predictions;
4) we propose a nested CNN-LSTM model structure with a visual encoder applied to semantic BEV images capturing the parking lot geometry;
5) we compare a physics-based Kalman Filter baseline against the higher-complexity LSTM and CNN-LSTM models to investigate the impact of model complexity, feature complexity, and amount of accessible information on prediction performance.

This paper is organized as follows. Section II discusses the experimental design and dataset generation. Section III elaborates on the algorithms designed for intent classification and motion prediction. Section IV discusses the results of the prediction algorithms. Finally, Section V summarizes our key findings and ideas for future work.

This work is partly supported by Berkeley DeepDrive (BDD) and the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. ⋆ indicates equal contribution. University of California, Berkeley, CA, USA ({xu shen, govvijay, fborrelli, trevor}@berkeley.edu). Chalmers University of Technology, Gothenburg, Sweden ({ivo.batkovic, falcone}@chalmers.se). Zenuity AB, Gothenburg, Sweden ([email protected]).

Fig. 1. Parking lot scenario. (a) Bird's-eye view. (b) Driver view.

II. EXPERIMENTAL DESIGN AND DATASET
This section provides an overview of the parking lot simulation environment and experimental setup. We generated a dataset from human driving demonstrations, consisting of trajectories, final parked spot locations, and signaled intent. This dataset was then used for intent and motion prediction.
A. Parking Lot Scenario
In order to collect parking demonstrations in a controlled fashion, we used the CARLA simulator and CARLA ROS bridge [25] with a custom parking lot map modified from Town04. The parking lot consists of 4 rows with 16 spots each. In each trial, static vehicles were spawned into parking spots such that only 8 free spot options, located in the middle two rows, were available. The specific locations of these free spots were varied across trials to gather a diverse range of parking demonstrations from the subject. A free spot configuration example can be seen in Fig. 1(a).

Given this setting, the subject was instructed to park into a free spot of their choosing, following a specified forward or reverse parking maneuver. When the subject selected the parking spot, he or she was instructed to press a button to signal a determined intent. The subject used a Logitech G27 racing wheel to control the brake, throttle, and steering of the ego vehicle. In this experiment, only the ego vehicle was moving; all other vehicles in the scene remained static and parked. The driver view is visualized in Fig. 1(b).

Each subject performed 30 forward parking and 30 reverse parking demonstrations. In each demonstration, the kinematic motion state history of the ego vehicle was recorded, as well as intent signals to know when a parking spot had been selected. In addition, the locations of each parking spot were collected. All demonstrations containing collisions or without intent signaling were filtered out. Furthermore, the configuration of the parking lot was recorded together with the bounding boxes of the ego vehicle and all other parked vehicles.

B. Dataset Generation
We first introduce some notation used for the data included in each demonstration.
• The vehicle pose at time $t$ is denoted by $\vec{z}(t) = \begin{bmatrix} x(t) & y(t) & \theta(t) \end{bmatrix}^\top$. We assume the vehicle lies on the $xy$-plane and ignore variation in altitude, pitch, and roll.
• We denote the parking spot occupancy at time $t$ as
$$O(t) = \begin{bmatrix} x_{s,1}(t) & y_{s,1}(t) & f_{s,1}(t) \\ \vdots & \vdots & \vdots \\ x_{s,G}(t) & y_{s,G}(t) & f_{s,G}(t) \end{bmatrix},$$
which is a matrix of size $G \times 3$, where $G$ is the number of parking spots. In each row $j$, the first two entries, $x_{s,j}(t), y_{s,j}(t)$, are the spot location, and the last entry, $f_{s,j}(t)$, is a binary variable set to 1 if the spot is free.
• The ground truth distribution of the driver intent is denoted by $g(t)$, which is a one-hot vector of size $G+1$. The $(G+1)$-th element corresponds to undetermined intent. After the intent is signaled, $g(t)$ corresponds to the spot where the subject finally parks.
• The parking bird's-eye view $I(t)$ is a semantic image of shape $H \times W \times 3$ representing the parking environment in Fig. 1(a). The three channels correspond to the parking spot markings, the static vehicle bounding boxes, and the ego vehicle bounding box, respectively.

For each demonstration, as shown in Fig. 2, the time intervals before the subject started driving and after the vehicle was parked were removed. Given a fixed timestep $\Delta t$, the remaining portion of each demonstration was processed into short snippets with a history horizon $N_{hist} = 5$ and a prediction horizon $N_{pred} = 20$. Each snippet was further processed to generate the following features:
1) motion history of $\vec{z}(t)$: $Z_{hist}(t) \in \mathbb{R}^{N_{hist} \times 3}$;
2) parking spot and occupancy $O(t) \in \mathbb{R}^{G \times 3}$;
3) image history of $I(t)$: $I_{hist}(t) \in \mathbb{R}^{N_{hist} \times H \times W \times 3}$;
and labels:
1) future motion of $\vec{z}(t)$: $Z_{future}(t) \in \mathbb{R}^{N_{pred} \times 3}$;
2) ground truth driver intent: $g(t)$.

Note that all history features are sampled backward in time from $t$ with horizon $N_{hist}$ and step size $\Delta t$. Similarly, the prediction label is sampled forward in time from $t$ with horizon $N_{pred}$ and step size $\Delta t$.

This processing resulted in a dataset $\mathcal{D} = \{(Z_{hist}^{(i)}, O^{(i)}, I_{hist}^{(i)}), (Z_{future}^{(i)}, g^{(i)})\}_{i=1}^{M}$. Here, the superscript $(i)$ corresponds to the $i$-th dataset instance, and the total number of instances is $M = 20850$. The first tuple corresponds to a feature and the second tuple corresponds to a label, where we drop the time dependence of each entry going forward. More details, as well as the processed dataset, can be found at https://bit.ly/parkpredict.

Fig. 2. Sample demonstration focused on the middle two rows. The shaded bounding boxes represent parked vehicles. At first, the human driver has an undetermined intent (yellow section). Then the driver decides the spot intent, and the trajectory is shown in blue.
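The snippet extraction described above can be sketched as follows. This is a hypothetical numpy helper of ours (`make_snippets` and the toy demonstration are not the released dataset code): each snippet pairs a 5-step pose history with a 20-step future label.

```python
import numpy as np

N_HIST, N_PRED = 5, 20

def make_snippets(poses):
    """Slice one demonstration (T x 3 array of [x, y, theta] poses)
    into (Z_hist, Z_future) pairs."""
    snippets = []
    T = len(poses)
    # t indexes the "current" time: we need N_HIST samples up to and
    # including t, and N_PRED samples strictly after t.
    for t in range(N_HIST - 1, T - N_PRED):
        z_hist = poses[t - N_HIST + 1 : t + 1]    # (N_HIST, 3)
        z_future = poses[t + 1 : t + 1 + N_PRED]  # (N_PRED, 3)
        snippets.append((z_hist, z_future))
    return snippets

demo = np.zeros((40, 3))          # a 40-step toy demonstration
pairs = make_snippets(demo)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)  # → 16 (5, 3) (20, 3)
```

A 40-step demonstration yields 16 overlapping snippets; in the paper this slicing is applied to every filtered demonstration to produce the $M = 20850$ instances.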
III. METHODOLOGY
Given the dataset $\mathcal{D}$, we break the prediction problem into intent and trajectory estimation subproblems. Using the input history $X_{hist}^{(i)}$ and occupancy $O^{(i)}$, we generate the distribution of predicted driver intent $\hat{g}^{(i)}$ and the predicted future $N_{pred}$-step trajectory $\hat{Z}_{future}^{(i)}$. We evaluate three models that vary in both model and feature complexity: an EKF baseline, an LSTM network, and a CNN-LSTM network. The following sections describe the model structures and the input history features $X_{hist}^{(i)}$ provided to each model.

A. Extended Kalman Filter (EKF) with Constant Velocity
As a baseline, we use an EKF with a constant velocity assumption, for which $X_{hist}^{(i)} = (Z_{hist}^{(i)})$. The following state dynamics and measurement model are used.
1) State Dynamics:
$$\begin{bmatrix} x_{k+1} \\ y_{k+1} \\ \theta_{k+1} \\ v_{k+1} \\ \omega_{k+1} \end{bmatrix} = \begin{bmatrix} x_k + v_k \cos(\theta_k)\,\Delta t \\ y_k + v_k \sin(\theta_k)\,\Delta t \\ \theta_k + \omega_k\,\Delta t \\ v_k \\ \omega_k \end{bmatrix} + q_k, \quad q_k \sim \mathcal{N}(0, Q)$$
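A minimal numpy sketch of this constant-velocity time update and its use for trajectory extrapolation (our illustration; the variable names and the timestep value are assumptions, not the paper's exact settings):

```python
import numpy as np

DT = 0.1  # assumed sampling time

def f(s, dt=DT):
    """Unicycle dynamics with constant speed v and yaw rate w,
    for the state s = [x, y, theta, v, w]."""
    x, y, th, v, w = s
    return np.array([x + v * np.cos(th) * dt,
                     y + v * np.sin(th) * dt,
                     th + w * dt,
                     v,
                     w])

def F_jac(s, dt=DT):
    """Jacobian of f, used in the EKF covariance update P <- F P F^T + Q."""
    _, _, th, v, _ = s
    F = np.eye(5)
    F[0, 2], F[0, 3] = -v * np.sin(th) * dt, np.cos(th) * dt
    F[1, 2], F[1, 3] = v * np.cos(th) * dt, np.sin(th) * dt
    F[2, 4] = dt
    return F

def extrapolate(s, P, Q, n_pred):
    """Run the time update n_pred times; return the predicted poses."""
    poses = []
    for _ in range(n_pred):
        P = F_jac(s) @ P @ F_jac(s).T + Q  # propagate uncertainty
        s = f(s)                           # propagate the state
        poses.append(s[:3])                # keep the pose [x, y, theta]
    return np.array(poses)

s0 = np.array([0.0, 0.0, 0.0, 2.0, 0.0])  # driving along +x at 2 m/s
traj = extrapolate(s0, np.eye(5) * 0.01, np.eye(5) * 1e-4, 20)
print(traj.shape, traj[-1, 0])
```

With zero yaw rate the extrapolation is a straight line along $+x$, covering $v \cdot n_{pred} \cdot \Delta t = 4$ m over the 20-step horizon.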
2) Measurement Model:
$$\hat{\vec{z}}_k = \begin{bmatrix} x_k & y_k & \theta_k & v_k & \omega_k \end{bmatrix}^\top + r_k, \quad r_k \sim \mathcal{N}(0, R)$$

The disturbance covariance $Q$ was estimated using the pose and velocity information provided in each demonstration. The noise covariance $R$ was chosen as a diagonal matrix with small entries to reflect the ground truth pose measurement. We first run the time and measurement updates over the pose history $Z_{hist}$ and then extrapolate $\hat{Z}_{future}^{(i)}$ with the time update alone. Then, $\hat{g}^{(i)}$ is estimated by assigning probability to free spots (i.e., using $O^{(i)}$) based on the inverse of the Euclidean distance to the final predicted position. If a given spot exceeds a distance threshold (20 m), then the corresponding probability is added to the undetermined category, the $(G+1)$-th element of $\hat{g}^{(i)}$.

B. Long Short-Term Memory Network (LSTM)
Our proposed LSTM model is shown in Fig. 3. As with the EKF, $X_{hist}^{(i)} = (Z_{hist}^{(i)})$, but intent and trajectory estimation are addressed in the reverse order. In particular, unlike the EKF, the model predicts multimodal, intent-conditioned trajectories with the following hierarchical structure.
1) Intent Prediction Module:
The intent prediction module takes $X_{hist}^{(i)}$ and $O^{(i)}$ as inputs and estimates the intent probability distribution:
$$\hat{g}^{(i)} = F_{intent}(X_{hist}^{(i)}, O^{(i)}).$$

The pose history and parking spot occupancy are first passed into the encoder LSTM stack and then processed by a fully connected layer. A softmax output layer produces the predicted intent distribution, $\hat{g}^{(i)}$.

The objective function we minimize for intent prediction has three components, where $j \in \{1, \ldots, G+1\}$ denotes the intent index.
• Cross entropy between the prediction $\hat{g}^{(i)}$ and the ground truth $g^{(i)}$, to drive the predicted distribution closer to the ground truth label:
$$J_{intent1} = -\sum_{j=1}^{G+1} g_j^{(i)} \log(\hat{g}_j^{(i)}).$$
• Negative entropy of the predicted distribution $\hat{g}^{(i)}$, to account for the stochasticity of the driver:
$$J_{intent2} = -\mathcal{H}(\hat{g}^{(i)}) = \sum_{j=1}^{G+1} \hat{g}_j^{(i)} \log(\hat{g}_j^{(i)}).$$
• A penalty for predicting already occupied parking spots:
$$J_{intent3} = \sum_{j=1}^{G} \max\{\hat{g}_j^{(i)} - f_{s,j}^{(i)},\, 0\}.$$

Therefore, the final objective function is $J_{intent} = J_{intent1} + J_{intent2} + J_{intent3}$.
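The three loss terms above can be written out in a short numpy stand-in (our illustration; the function name and toy distributions are assumptions, and a training-framework version would operate on batched tensors):

```python
import numpy as np

def intent_loss(g_hat, g, f_free):
    """g_hat, g: length G+1 predicted / one-hot distributions;
    f_free: length G binary free-spot flags (1 = free)."""
    eps = 1e-12                                       # numerical safety
    j1 = -np.sum(g * np.log(g_hat + eps))             # cross entropy
    j2 = np.sum(g_hat * np.log(g_hat + eps))          # negative entropy
    # penalty applies only to the G spot entries, not "undetermined";
    # for a free spot (f = 1) the max(.) clips the term to zero
    j3 = np.sum(np.maximum(g_hat[:-1] - f_free, 0.0))
    return j1 + j2 + j3

G = 4
g = np.array([0, 1, 0, 0, 0.0])             # ground truth: spot 2
g_hat = np.array([0.1, 0.6, 0.1, 0.1, 0.1]) # predicted distribution
f_free = np.array([1, 1, 0, 0.0])           # spots 3 and 4 are occupied
print(intent_loss(g_hat, g, f_free))
```

Here the occupied-spot penalty contributes $0.1 + 0.1 = 0.2$, since the model places probability mass on the two occupied spots.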
2) Trajectory Prediction Module:
The trajectory prediction module takes $X_{hist}^{(i)}$ and the intent index $j \in \{1, \ldots, G+1\}$ as input and estimates the future trajectory for $N_{pred}$ timesteps:
$$\hat{Z}_{future,j}^{(i)} = F_{traj}(X_{hist}^{(i)}, j).$$

This module works sequentially with the intent prediction module to generate multimodal predictions. The encoder first processes $X_{hist}^{(i)}$. The encoder's final hidden state and cell state are used to initialize the decoder. Then, for each intent index $j$, the decoder and the subsequent fully connected layer return a predicted future trajectory $\hat{Z}_{future,j}^{(i)}$, which is associated with probability $\hat{g}_j^{(i)}$.

However, during training, we decouple this module from intent prediction and only use the ground truth label $g^{(i)}$:
$$\hat{Z}_{future,gt}^{(i)} = F_{traj}(X_{hist}^{(i)}, \operatorname{argmax}_j\, g_j^{(i)}).$$

Fig. 3. Multi-modal LSTM prediction model. The intent prediction module $F_{intent}(\cdot)$ passes $X_{hist}^{(i)}$ and $O^{(i)}$ through an encoder LSTM, a fully connected layer, and a softmax to produce $\hat{g}^{(i)}$; the trajectory prediction module $F_{traj}(\cdot)$ passes $X_{hist}^{(i)}$ and each top-$n$ intent index through an encoder-decoder LSTM and a fully connected layer to produce $\hat{Z}_{future,j}^{(i)}$.

Fig. 4. CNN architecture. The CNN preprocessing block applies a stride-1 convolution (8 filters), a stride-2 convolution (16 filters), a flatten layer, and dropout to each image in $I_{hist}^{(i)}$; the resulting features are concatenated with $Z_{hist}^{(i)}$ to form $X_{hist}^{(i)}$.
The objective function $J_{traj}$ is defined using mean squared error (MSE) on position. Concretely, let $\vec{z}_k^{(i)}$ and $\hat{\vec{z}}_k^{(i)}$ be the $k$-th rows of $Z_{future}^{(i)}$ and $\hat{Z}_{future}^{(i)}$ respectively; then:
$$J_{traj} = \frac{1}{N_{pred}} \sum_{k=1}^{N_{pred}} \sqrt{(\vec{z}_k^{(i)} - \hat{\vec{z}}_k^{(i)})^\top \operatorname{diag}(1, 1, 0)\, (\vec{z}_k^{(i)} - \hat{\vec{z}}_k^{(i)})}.$$
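A numpy rendering of this position-masked error (our sketch; the $\operatorname{diag}(1,1,0)$ mask drops the heading component of the pose difference):

```python
import numpy as np

def traj_loss(z_future, z_hat):
    """Mean Euclidean position error over the prediction horizon.
    Both inputs are (N_pred, 3) arrays of [x, y, theta] rows."""
    diff = z_future - z_hat
    mask = np.diag([1.0, 1.0, 0.0])  # ignore the heading component
    # per-timestep quadratic form d^T M d, then its square root
    d = np.sqrt(np.einsum('ki,ij,kj->k', diff, mask, diff))
    return d.mean()

z = np.zeros((20, 3))
z_hat = np.tile([3.0, 4.0, 9.9], (20, 1))  # constant 5 m position offset
print(traj_loss(z, z_hat))                  # → 5.0 (heading error ignored)
```

Even though the heading differs by 9.9 rad at every step, only the 3-4-5 position offset contributes to the loss.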
The CNN-LSTM model uses the same structure as the LSTM model in Fig. 3. However, as depicted in Fig. 4, a single CNN preprocesses each image in $I_{hist}^{(i)}$. The generated image features are then concatenated with the motion features. Hence, for the CNN-LSTM model, $X_{hist}^{(i)} = (F_{CNN}(I_{hist}^{(i)}), Z_{hist}^{(i)})$, where $F_{CNN}(\cdot)$ represents the visual encoding operation performed by the CNN. This is inspired by the approaches used in [21], [26], which fuse motion and visual features with an LSTM temporal encoder.

D. Prediction Evaluation
To compare the performance of each prediction algorithm, we use 5-fold cross validation, where the LSTM and CNN-LSTM are trained for 200 epochs with batch size 32. The variable $\tilde{M}$ below corresponds to the cardinality of the hold-out set being evaluated.

1) Top-$n$ Accuracy: Let the set $\mathcal{G}_n(\hat{g}^{(i)})$ include the $n$ most likely intent categories in the predicted intent distribution $\hat{g}^{(i)}$. Then, the top-$n$ accuracy is:
$$A_n = \frac{1}{\tilde{M}} \sum_{i=1}^{\tilde{M}} \sum_{j=1}^{G+1} g_j^{(i)}\, \mathbb{I}\big(j \in \mathcal{G}_n(\hat{g}^{(i)})\big),$$
where $\mathbb{I}(\cdot)$ is the indicator function.

2) Mean Distance Error: Given a predicted trajectory $\hat{Z}_{future}$ and the actual trajectory $Z_{future}$, we can look at the position error as a function of the prediction timestep. The mean distance error at timestep $k$, $d_k$, is:
$$d_k = \frac{1}{\tilde{M}} \sum_{i=1}^{\tilde{M}} \sqrt{(\vec{z}_k^{(i)} - \hat{\vec{z}}_k^{(i)})^\top \operatorname{diag}(1, 1, 0)\, (\vec{z}_k^{(i)} - \hat{\vec{z}}_k^{(i)})}.$$

IV. RESULTS
In this section, we compare the intent prediction capability and the trajectory prediction error of each algorithm using the selected evaluation metrics. We investigate the impact of the information level and multimodality on prediction performance. For brevity, CNN-LSTM is shortened to CNN in the subsequent figures.
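The two evaluation metrics from Section III-D can be sketched in numpy as follows (our illustration; the function names and toy inputs are assumptions):

```python
import numpy as np

def top_n_accuracy(g, g_hat, n):
    """g, g_hat: (M, G+1) one-hot labels and predicted distributions."""
    hits = 0
    for gi, ghi in zip(g, g_hat):
        top_n = np.argsort(ghi)[-n:]          # indices of the n largest
        hits += int(np.argmax(gi) in top_n)   # one-hot label -> true index
    return hits / len(g)

def mean_distance_error(Z, Z_hat, k):
    """d_k over a hold-out set: Z, Z_hat are (M, N_pred, 3) pose arrays;
    diag(1, 1, 0) reduces to taking only the x, y columns."""
    diff = Z[:, k, :2] - Z_hat[:, k, :2]
    return np.linalg.norm(diff, axis=1).mean()

g = np.array([[0, 1, 0], [1, 0, 0.0]])
g_hat = np.array([[0.2, 0.5, 0.3], [0.1, 0.6, 0.3]])
print(top_n_accuracy(g, g_hat, 1), top_n_accuracy(g, g_hat, 2))  # → 0.5 0.5

Z = np.zeros((2, 20, 3))
Z_hat = np.zeros((2, 20, 3))
Z_hat[:, 5, :2] = [3.0, 4.0]          # 5 m offset at timestep 5
print(mean_distance_error(Z, Z_hat, 5))  # → 5.0
```

Note that the second sample is counted as a miss even at $n = 2$, since its true spot is never among the two most likely predictions.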
A. Intent Classification
Fig. 5. Top-$n$ intent classification accuracy, $A_n$.

Fig. 5 shows the top-$n$ accuracy $A_n$ across all models evaluated. As expected, the LSTM and CNN-LSTM outperform the EKF baseline at every $n$, achieving roughly 85% top-1 accuracy and nearly 100% top-3 or top-5 accuracy. This follows from the fact that the LSTM and CNN-LSTM are trained specifically on the intent prediction task, while a heuristic inverse-distance approach is used after the trajectory prediction step of the EKF.

B. Trajectory Prediction
In this section, we look at the following models and levels of information, where $\star \in \{\text{LSTM}, \text{CNN}\}$:
• $\star$ no intent: intent-agnostic model, where the trajectory module $F_{traj}(\cdot)$ is re-trained and applied with a zeroed intent input.
• $\star$ gt intent: intent-conditioned model, where $F_{traj}(\cdot)$ is decoupled from the intent module $F_{intent}(\cdot)$, predicting $\hat{Z}_{future,gt}^{(i)}$ using only the ground truth intent $g^{(i)}$.
• $\star$ multimodal: intent-conditioned, multimodal model, where $F_{traj}(\cdot)$ is applied to the top-$n$ entries of $\hat{g}^{(i)}$ from $F_{intent}(\cdot)$. Here we select $n = 3$.

Fig. 6 captures the impact of intent knowledge and information level on the mean distance error $d_k$ at each timestep $k$. For both the CNN-LSTM and LSTM models, we observe that they outperform the EKF for long-term predictions, as they learn a more nuanced motion model. For the LSTM model, the benefit of intent knowledge is minimal, as the no intent model is on par with the ground truth model. However, in the case of the CNN-LSTM, the additional knowledge of intent aids the prediction, as seen at later timesteps. This suggests that the semantic BEV images can help the model better understand the intent label for prediction.

Fig. 7 shows the benefit of multimodal predictions on the mean distance error $d_k$ at each timestep $k$. Note that for the multimodal predictions, we take the top $n = 3$ intent labels and generate 3 separate rollouts. The results reported here are computed by finding the rollout nearest the ground truth trajectory (i.e., using the mean squared error on position, $J_{traj}$) and then using this single rollout to evaluate $d_k$.

We note that every intent-conditioned model is on par with or outperforms the model without intent knowledge over the prediction horizon. This shows that even if the intent is predicted and not known precisely, this additional information can still reduce trajectory prediction error compared to intent-agnostic predictions. This is likely due, in part, to the multimodal nature of the trajectory prediction, which can capture more evolution possibilities and driver model stochasticity relative to a unimodal trajectory prediction. We also observe that the CNN-LSTM models are well-suited for long-term predictions, for which the geometry of the parking lot is likely more informative.

Fig. 6. Mean distance error $d_k$ vs. timestep $k$ across varying models and levels of intent knowledge.

Fig. 7. Mean distance error $d_k$ vs. timestep $k$, comparing intent-agnostic unimodal predictions to the best rollout among multimodal, intent-conditioned predictions.

C. Scenario Examples

(a) EKF. (b) LSTM. (c) CNN-LSTM.
Fig. 8. Prediction example across models. The black curve represents the pose history $Z_{hist}^{(i)}$; the blue pentagon and curve stand for the ground truth intent $g^{(i)}$ and future motion $Z_{future}^{(i)}$, respectively; the orange, green, and purple pentagons and curves correspond to the top-3 intent and trajectory predictions. Their order and transparency levels reflect the probability $\hat{g}_j^{(i)}$, which is also visualized by the cyan bars in the spots. When a pentagon is marked on the trajectory, it corresponds to the undetermined intent.

We show how the prediction algorithms compare for a sample prediction instance in Fig. 8. The three sub-figures depict the 2-D layout of the parking lot, where the filled shaded boxes are occupied parking spots.

For intent prediction, the proximity-based heuristic of the EKF prioritizes the nearest spots in front of the vehicle, but misses the case that the vehicle may reverse into the ground truth spot. Instead, the LSTM and CNN-LSTM capture the ground truth in their top-3 candidates, meaning that they also learned from the data that the driver might choose to back up in their maneuver.

For trajectory prediction, the EKF extrapolates the dynamics, so it only offers a single trajectory prediction with a significant delay. The LSTM and CNN-LSTM fit the ground truth better and offer multi-modal behaviors for other likely nearby spots. The top-3 candidates from the LSTM and CNN-LSTM are relatively more aware of the obstacles, as the LSTM better leverages the occupancy information and the CNN-LSTM learns a more detailed obstacle representation through semantic image inputs. Therefore, both models place more emphasis on the reverse maneuver.

V. DISCUSSION
This work investigated the problem of predicting a human driver's parking intent and maneuver with varying levels of feature and model complexity. A custom CARLA simulator parking lot environment was constructed and used to generate a dataset of human parking maneuvers. We compared intent-conditioned LSTM and CNN-LSTM prediction models against an EKF baseline and noted the benefits of providing intent information for trajectory prediction, even if estimated. Additionally, by encoding obstacles and parking lot geometry, the semantic BEV images help improve prediction performance in the long term. The hierarchical framework is capable of offering multi-modal driver behavior predictions in a relatively unstructured environment like a parking lot.

For future work, we would like to investigate the cost function used for the CNN-LSTM model by incorporating a penalty according to kinematic constraints, as in [20]. In addition, by expanding the dataset to collect a wider variation of real-world behaviors, we hope to apply the multimodal predictions in a stochastic control framework for autonomous parking in multi-agent settings.

REFERENCES

[1] J. Janai, F. Güney, A. Behl, and A. Geiger, "Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art," arXiv preprint arXiv:1704.05519, 2017.
[2] S. Lefèvre, D. Vasquez, and C. Laugier, "A survey on motion prediction and risk assessment for intelligent vehicles," ROBOMECH Journal, June 2019, pp. 256–262.
[4] K. Driggs-Campbell, V. Govindarajan, and R. Bajcsy, "Integrating intuitive driver models in autonomous planning for interactive maneuvers," IEEE Transactions on Intelligent Transportation Systems.
Annual Reviews in Control, vol. 45, pp. 18–40, 2018.
[8] A. Houenou, P. Bonnifait, V. Cherfaoui, and W. Yao, "Vehicle trajectory prediction based on motion model and maneuver recognition," IEEE, 2013, pp. 4363–4369.
[9] R. Schubert, E. Richter, and G. Wanielik, "Comparison and evaluation of advanced motion models for vehicle tracking," IEEE, 2008, pp. 1–6.
[10] M. Althoff, O. Stursberg, and M. Buss, "Model-based probabilistic collision detection in autonomous driving," IEEE Transactions on Intelligent Transportation Systems, vol. 10, no. 2, pp. 299–310, 2009.
[11] S. H. Park, B. Kim, C. M. Kang, C. C. Chung, and J. W. Choi, "Sequence-to-sequence prediction of vehicle trajectory via LSTM encoder-decoder architecture," IEEE Intelligent Vehicles Symposium, 2018, pp. 1672–1678.
[12] B. D. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, "Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network," IEEE Conference on Intelligent Transportation Systems (ITSC), 2018, pp. 399–404.
[13] Y. Hu, W. Zhan, and M. Tomizuka, "Probabilistic prediction of vehicle semantic intention and motion," June 2018, pp. 307–313.
[14] N. Deo and M. M. Trivedi, "Multi-modal trajectory prediction of surrounding vehicles with maneuver based LSTMs," IEEE Intelligent Vehicles Symposium, 2018, pp. 1179–1184.
[15] T. Streubel and K. H. Hoffmann, "Prediction of driver intended path at intersections," IEEE Intelligent Vehicles Symposium, 2014, pp. 134–139.
[16] J. Schulz, C. Hubmann, J. Lochner, and D. Burschka, "Interaction-aware probabilistic behavior prediction in urban environments," IEEE International Conference on Intelligent Robots and Systems, 2018, pp. 3999–4006.
[17] T. Gindele, S. Brechtel, and R. Dillmann, "Learning driver behavior models from traffic observations for decision making and planning," IEEE Intelligent Transportation Systems Magazine, vol. 7, no. 1, pp. 69–79, 2015.
[18] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
[19] S. Casas, W. Luo, and R. Urtasun, "IntentNet: Learning to predict intention from raw sensor data," CoRL, vol. 87, pp. 947–956, 2018.
[20] H. Cui, T. Nguyen, F.-C. Chou, T.-H. Lin, J. Schneider, D. Bradley, and N. Djuric, "Deep kinematic models for physically realistic prediction of vehicle trajectories," arXiv preprint arXiv:1908.00219, 2019.
[21] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K. Huang, J. Schneider, and N. Djuric, "Multimodal trajectory predictions for autonomous driving using deep convolutional networks," 2019, pp. 2090–2096.
[22] C. Vallon, Z. Ercan, A. Carvalho, and F. Borrelli, "A machine learning approach for personalized autonomous lane change initiation and control," IEEE, 2017, pp. 1590–1595.
[23] D. A. Vasquez Govea, T. Fraichard, and C. Laugier, "Growing hidden Markov models: An incremental tool for learning and predicting human and vehicle motion," The International Journal of Robotics Research, vol. 28, no. 11-12, pp. 1486–1506, Nov. 2009.
[24] X. Zhang, A. Liniger, A. Sakai, and F. Borrelli, "Autonomous parking using optimization-based collision avoidance," IEEE, Dec. 2019, pp. 4327–4332.
[25] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in Proceedings of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.
[26] H. Xu, Y. Gao, F. Yu, and T. Darrell, "End-to-end learning of driving models from large-scale video datasets," in