Human Driver Behavior Prediction based on UrbanFlow
Zhiqian Qiao, Jing Zhao, Zachariah Tyree, Priyantha Mudalige, Jeff Schneider, John M. Dolan
Abstract — How autonomous vehicles and human drivers share public transportation systems is an important problem, as fully automatic transportation environments are still a long way off. Understanding human drivers' behavior can be beneficial for autonomous vehicle decision making and planning, especially when the autonomous vehicle is surrounded by human drivers who have various driving behaviors and patterns of interaction with other vehicles. In this paper, we propose an LSTM-based trajectory prediction method for human drivers which can help the autonomous vehicle make better decisions, especially in urban intersection scenarios. Meanwhile, in order to collect human drivers' driving behavior data in urban scenarios, we describe a system called UrbanFlow which includes the whole procedure from raw bird's-eye view data collection via drone to the final processed trajectories. The system is mainly intended for urban scenarios but can be extended for use in any traffic scenario.

* This work is supported by General Motors. Zhiqian Qiao is a Ph.D. student in Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, USA, [email protected]. The remaining authors are with Mechanical Engineering, Carnegie Mellon University; Research & Development, General Motors; and The Robotics Institute, Carnegie Mellon University.
I. INTRODUCTION

A major challenge in recent work on autonomous vehicles is making proper decisions about how to deal with interactions with human-driven vehicles. However, interactions among human drivers are hard to model via equations directly. To address this problem, learning-based methods for characterizing human-driver behavior become good choices and make it easier to simulate a human driver's behavior in simulators such as CARLA [1], VTD [2], etc. However, such methods require a large amount of driving data in order to learn human drivers' diverse behavior. For a long time, NGSIM [3] was the only public trajectory-based dataset from which human driver behavior could be extracted. In 2018, highD [4] became available, but it only includes highway scenarios. Moreover, how to extract and classify human driver behavior without manually labeling a large amount of ground-truth data is another time-consuming challenge when dealing with raw human driver data.
The current state of the art in acquiring and using such data faces several problems. First, some published work relies on privately collected datasets, the inaccessibility of which makes them impossible to use as benchmarks for comparisons between various algorithms. Second, some datasets are collected by autonomous vehicles from the perspective of the ego vehicle. Although this perspective is ultimately the one available to an autonomous vehicle, it is difficult for it to provide full sequences showing the social behavior of
surrounding vehicles. To derive models for such behavior, bird's-eye view datasets are useful. In response to these problems, this paper constructs a method for benchmarking human driver behavior based on a bird's-eye-view data collection system via drone.

Fig. 1: The UrbanFlow dataset processing pipeline. The pipeline includes the drone data collection and the processing flow from raw video data to the final trajectory data.

Figure 1 shows the pipeline of the data processing procedure. On the other hand, based on the dataset, predicting other vehicles' intentions or trajectories is an essential step in the behavior planning of autonomous vehicles during decision making or trajectory planning. In motion planning practice, accurately predicting human drivers' behavior can help the ego car make better decisions. In Pittsburgh, most traffic lights control
Going Straight (GS), Turning Left (TL) and Turning Right (TR) with a single light, with the result that at urban intersections, many interactions occur between vehicles approaching from opposite directions with intention pairs of GS and TL or TL and TR. In these situations, 'who will go first' between the two interacting vehicles is a key problem even for human drivers.
The main contributions of this work are:
• A drone-based data collection and processing system to analyze bird's-eye view trajectory data of human drivers.
• An algorithm which can predict interacting human drivers' intentions as well as trajectories based on the historical trajectories over a given period of time when approaching an urban intersection.

II. RELATED WORK
This section introduces previous work related to this paper, which can be categorized as follows: 1) papers that address algorithms forming part of the traffic data collection procedure; 2) papers that propose intention and trajectory prediction of human drivers.

A. Data Extraction

With the current popularity of autonomous driving, various datasets are available for researchers to develop and test their algorithms. These datasets can be categorized into two classes. The first is traffic-flow-based datasets, which focus on a particular scene and simultaneously capture all the vehicles within it. This type of dataset uses a bird's-eye view to observe vehicle trajectories within the scene. The NGSIM dataset [3] is the best-known such dataset and includes highway and urban scenarios. Last year, RWTH Aachen University released the highD dataset [4], which used advanced computer vision technology to improve the data collection mechanism based on the NGSIM dataset. Another kind of dataset is based on the sensors mounted on the ego vehicle, with data collected while driving the ego car over a given route. Most such datasets create various vision-based benchmarks for further study. The KITTI dataset [5] offers a vision benchmark for different autonomous vehicle-related tasks. The Oxford RobotCar dataset [6] collected 20 million images from 6 cameras mounted on the vehicle, along with LIDAR, GPS, and INS ground truth. Recently, UC Berkeley released BDD100K [7], which includes diverse driving videos collected from a camera mounted on the vehicle with scalable annotation tooling.
In the current work, in order to gain a comprehensive view of the traffic situation, we use a bird's-eye-view method to collect traffic-flow-based data via drone. The portable end-to-end system allows researchers to collect their own data from any site of interest, unlike the NGSIM system, which depended on the installation of a fixed camera. While our data collection method is similar to that used for the highD dataset, our method focuses primarily on urban intersections, which are more challenging than the highway scenarios that the highD dataset covers.
B. Prediction
Liebner et al. [8] proposed an explicit model to extract characteristic desired velocity profiles from real-world data that allow the Intelligent Driver Model (IDM) to account for turn-related deceleration, representing both car-following and turning behavior. Phillips et al. [9] used LSTMs to classify vehicle maneuvers at intersections. They predicted whether a driver would turn left, turn right, or continue straight with consistent accuracy up to 150 m before reaching the intersection, with the mean cross-validated prediction accuracy averaging over 85% for both three- and four-way intersections. There are other works on predicting complete trajectories using Hidden Markov Models, Gaussian Processes, Dynamic Bayesian Networks, Support Vector Machines, and inverse reinforcement learning.
Compared with [8], besides the velocity profile, multiple factors are added in our models, such as yaw variation, target motion features, etc., which contain information on environmental changes for the ego vehicle. The work concentrates on ego and target car pairs by studying their interaction with each other. Meanwhile, we introduce the idea of direction intention prediction and use its result to determine a more detailed trajectory prediction. The main difficulty we tackle in this work is that human drivers' intentions and vehicle trajectories are much more variable when approaching an urban intersection with heavy traffic flow than in highway situations.

III. PRELIMINARIES
In this section, the preliminary background of the problem is described. The fundamental algorithms used for video stabilization, object detection and tracking are included here.
A. Enhanced Correlation Coefficient
The two main challenges for video stabilization are the robustness and the speed of the alignment. Feature-based alignment is fast and able to align images with large displacement; however, its robustness is susceptible to the quality and distribution of the detected feature points. On the other hand, an image alignment algorithm like the Enhanced Correlation Coefficient (ECC) [10] can be used to process every pair of consecutive frames in the video, but each alignment iteration needs to search all the pixels, which is computationally expensive. Moreover, ECC also fails to align frames without a good initial guess at the homography matrix when the two frames have low similarity. As a result, a combination of feature-matching-based and homography-based methods is used to reap the advantages of both.
Feature-based alignment involves detecting key-points, finding key-point correspondences, and computing the image transformation using the Random Sample Consensus (RANSAC) [11] algorithm. The standard ECC alignment uses normalized intensity with zero mean so that the similarity measurement is invariant to contrast and brightness changes [10]. Each frame is first warped using the homography calculated from the feature-based method for a rough alignment, then warped with the homography calculated from the ECC alignment.
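As an illustrative sketch of this two-stage alignment, assuming OpenCV (the detector choice, match strategy, and iteration settings here are ours, not necessarily the original implementation):

    import cv2
    import numpy as np

    def feature_homography(ref_gray, frame_gray):
        # Rough alignment: ORB key-points, brute-force matching, and a
        # RANSAC homography mapping reference coordinates to the frame.
        orb = cv2.ORB_create(2000)
        kp_ref, des_ref = orb.detectAndCompute(ref_gray, None)
        kp_frm, des_frm = orb.detectAndCompute(frame_gray, None)
        matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_ref, des_frm)
        pts_ref = np.float32([kp_ref[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        pts_frm = np.float32([kp_frm[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(pts_ref, pts_frm, cv2.RANSAC, 5.0)
        return H.astype(np.float32)

    def stabilize_frame(ref_gray, frame_gray):
        # Refine the feature-based guess with ECC, then warp the frame
        # into the reference frame's coordinates.
        H = feature_homography(ref_gray, frame_gray)
        criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-6)
        score, H = cv2.findTransformECC(ref_gray, frame_gray, H,
                                        cv2.MOTION_HOMOGRAPHY, criteria, None, 5)
        h, w = ref_gray.shape
        warped = cv2.warpPerspective(frame_gray, H, (w, h),
                                     flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
        return warped, H, score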
B. RetinaNet
In the proposed pipeline, RetinaNet [12] is used for detecting vehicles in the images. RetinaNet has a backbone network which is responsible for computing the convolutional feature map over an entire input image. A class sub-net is responsible for predicting the class of each object; in our case, there is only one class, 'car'. A box regression sub-net predicts the location and size of each vehicle. ResNet-50 was used as the backbone for the forward pass of the FPN architecture, since the residual learning framework promotes easier convergence.
C. Kalman Filter
A Kalman filter [13] is used for tracking and trajectory smoothing. Based on the car's dynamic model and the characteristics of the system and measurement noise, the measurement variables are used as the input signal, and the estimation variables we need are the output of the filter. The whole filtering process is composed of a prediction equation and an update equation as follows:

X(n) = F X(n-1) + V_q(n-1)
Y(n) = H X(n) + V_p(n)                (1)

where X(n) and Y(n) are the estimated state variable and measurement variable at frame n, respectively.
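As a concrete illustration of Eq. (1), a constant-velocity filter over 2-D positions can be written as follows; the frame rate and noise covariances are assumed values for the sketch, not taken from the paper.

    import numpy as np

    dt = 1 / 30.0                    # assumed frame interval (30 fps)
    F = np.array([[1, 0, dt, 0],     # state X = [x, y, vx, vy]
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]])
    H = np.array([[1, 0, 0, 0],      # only position is measured
                  [0, 1, 0, 0]])
    Q = np.eye(4) * 1e-2             # system noise covariance (V_q)
    R = np.eye(2) * 1.0              # measurement noise covariance (V_p)

    def kalman_step(x, P, z):
        # Prediction: X(n) = F X(n-1) + V_q(n-1)
        x = F @ x
        P = F @ P @ F.T + Q
        # Update with measurement Y(n) = H X(n) + V_p(n)
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        return x, P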
IV. METHODOLOGY

In this section we propose UrbanFlow, a procedure for processing the collected bird's-eye view data. Then, based on UrbanFlow, we propose a method for predicting human drivers' intentions as well as their trajectories.
A. UrbanFlow

1) Video Stabilization:
In this paper, we propose several steps for video stabilization in order to deal with the displacement of the drone during the data collection process.

Fig. 2: Optimized stabilization method flow.

Figure 2 visualizes the flow of the stabilization method. For each frame f_n at time step n, the algorithm chooses a reference frame f_ref according to the alignment evaluation score obtained from the result of the previous time step and the corresponding homography matrix, in order to get the stabilized frame. Firstly, a re-alignment is performed when the ECC alignment score is lower than a threshold; ECC takes a long time to converge and cannot align the current frame with a reference frame when their similarity is too low. Secondly, since the alignment is time-consuming, it is only performed when a reference frame needs to be re-chosen; the homography matrix is re-used for the following frames until the evaluation score drops below the threshold and a new reference frame is chosen. Then, the homography matrix calculated from the ECC alignment in the previous step is used to initialize the guess for ECC in the next step to speed up convergence. Lastly, images are down-sampled [14] so that ECC uses fewer pixels during the calculation.
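This reference-frame logic can be sketched as follows, reusing stabilize_frame from the earlier alignment sketch; the 0.7 score threshold is an assumed value, not the paper's.

    import cv2

    def stabilize_video(gray_frames, threshold=0.7):
        ref, H = gray_frames[0], None
        out = [ref]
        for frame in gray_frames[1:]:
            if H is not None:
                # Re-use the last homography while it still scores well.
                warped = cv2.warpPerspective(
                    frame, H, ref.shape[::-1],
                    flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
                if cv2.computeECC(ref, warped) >= threshold:
                    out.append(warped)
                    continue
            # Score dropped (or first pass): re-choose the reference frame
            # and run the full feature + ECC alignment, keeping H for re-use.
            ref = out[-1]
            warped, H, _ = stabilize_frame(ref, frame)
            out.append(warped)
        return out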
2) Object Detection:
The training dataset contains all the bounding boxes and their corresponding labels for each image. The input images are re-sized to ensure that the detected objects are larger than 32-by-32 pixels while remaining small enough to fit the GPU's computational capability. Images are masked to crop out the roads in order to make detection easier. RetinaNet was fine-tuned using pre-trained weights from the COCO dataset [15].
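A short sketch of this pre-processing, assuming OpenCV; the road polygon and the scale factor are illustrative inputs rather than values from the paper.

    import cv2
    import numpy as np

    def preprocess(image, road_polygon, scale=1.5):
        # Re-size so the smallest vehicles exceed the 32-by-32-pixel floor,
        # then black out everything outside the road polygon.
        image = cv2.resize(image, None, fx=scale, fy=scale)
        mask = np.zeros(image.shape[:2], dtype=np.uint8)
        cv2.fillPoly(mask, [np.int32(road_polygon * scale)], 255)
        return cv2.bitwise_and(image, image, mask=mask)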
3) Map Construction and Coordinate Transition:
The first step in the creation of the map is to crop the area of interest, which in this case is the roads. To address this problem, we took advantage of the image segmentation network "U-Net", described in Ronneberger et al. [16], with just a few adjustments based on the work of Iglovikov et al. [17]. We preserve the decoder section of the network because, by adding a large number of feature channels, it allows the network to propagate context information to the higher-resolution layers. The important change was in the encoder section, which was replaced by the down-sampling elements of the VGG16 architecture in order to take advantage of weights pre-trained on ImageNet [18], given the limited quantity of collected data.

Fig. 3: Transition from the original image-based coordinates to road-based coordinates.

After detecting the road and applying a color filter to detect the lane markings on the road, the work transforms all detected vehicle positions from the original image-based coordinates to road-based coordinates. Figure 3 shows the method used to generate the road-based coordinates for an arbitrary road geometry that may occur in the real world. The method first chooses an origin and then obtains the x axis and y axis along the lane markings which separate the opposite directions of moving vehicles. For the given vehicles v1 and v2, the figure lists two examples of how to extract the road-based positions; a projection sketch follows the list below. Finally, each vehicle's information is represented with the following items:
• Local x and y based on the road-based coordinates
• Vehicle length and width
• Section ID i
• Lane ID l
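The projection referenced above can be sketched as an arc-length/lateral-offset computation against the dividing lane marking; this is a generic Frenet-style projection we assume for illustration, not necessarily the authors' exact procedure.

    import numpy as np

    def to_road_coords(polyline, p):
        # polyline: (N, 2) array of points along the dividing lane marking,
        # p: (2,) vehicle position, both in stabilized image coordinates.
        best_d2, road_x, road_y = np.inf, 0.0, 0.0
        s = 0.0                                   # arc length walked so far
        for a, b in zip(polyline[:-1], polyline[1:]):
            seg = b - a
            length = np.linalg.norm(seg)
            t = np.clip(np.dot(p - a, seg) / (length ** 2), 0.0, 1.0)
            foot = a + t * seg                    # closest point on this segment
            d2 = np.dot(p - foot, p - foot)
            if d2 < best_d2:
                # Sign of the 2-D cross product tells which side of the divider.
                side = np.sign(seg[0] * (p - a)[1] - seg[1] * (p - a)[0])
                best_d2, road_x, road_y = d2, s + t * length, side * np.sqrt(d2)
            s += length
        return road_x, road_y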
4) Vehicle Tracking and Trajectory Smoothing:
After the positions of the vehicles have been transformed into the local (road) coordinates, we apply the tracking algorithm to track each car. Meanwhile, we smooth each vehicle's trajectory. In the system, we use the vehicle position as the state variable. F is the state transition matrix and H is the measurement matrix. V_q(n) and V_p(n) represent the system noise and measurement noise, respectively.

B. Driving Behavior Prediction

1) Intention Prediction:
For the driving behavior task, we construct a network with an LSTM layer which outputs the Direction Intention d and the Yield Intention y (see Figure 4).

Fig. 4: Intention prediction network structure.

The direction intentions include Going Straight (GS), Turning Left (TL) and Turning Right (TR). The yield intention indicates the prediction of which car will go through the potential crash point first. For interacting driver pairs with intentions of GS and TL or TL and TR, the input states include the positions, velocities, heading angles and relative distances to the intersection center of both cars in each pair. During the interaction, the yield motion also changes based on the counterpart's behavior. This serves as a key factor for the next-step motion planning module and helps generate a safer and more feasible trajectory.
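A minimal sketch of such a two-headed LSTM in PyTorch follows; the layer sizes and the 12-dimensional paired-state layout are our assumptions, not the paper's hyper-parameters.

    import torch
    import torch.nn as nn

    class IntentionNet(nn.Module):
        # LSTM over the interacting pair's state sequence with two heads:
        # direction intention (GS / TL / TR) and binary yield intention.
        def __init__(self, state_dim=12, hidden_dim=64):
            super().__init__()
            self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
            self.direction_head = nn.Linear(hidden_dim, 3)  # GS, TL, TR
            self.yield_head = nn.Linear(hidden_dim, 2)      # ego first / target first
        def forward(self, states):
            # states: (batch, time, state_dim), e.g. (x, y, vx, vy, heading,
            # distance-to-intersection) per car for the pair (assumed layout).
            _, (h, _) = self.lstm(states)
            return self.direction_head(h[-1]), self.yield_head(h[-1])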
2) Trajectories Prediction:
Based on the results of the direction and yield predictions, a more detailed trajectory prediction procedure incorporates more information on the future trajectories.

Fig. 5: Trajectory prediction network structure.

In Figure 5, P_t includes information on the velocities and positions of the target car. A reference trajectory is first selected according to the intention prediction results.

Fig. 6: Reference trajectories according to the direction intention, following the centers of the lanes.
According to the reference trajectories (see Figure 6) with intersection geometry information, as well as the velocities, heading angles and relative distances to the intersection center of both cars, the network can predict the future trajectories.
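A corresponding sketch of the trajectory network in Figure 5, again with assumed dimensions; we assume the intention-selected reference trajectory is resampled to the length of the observed history before being concatenated with the state sequence.

    import torch
    import torch.nn as nn

    class TrajectoryNet(nn.Module):
        # Regresses future (x, y) positions from the target car's past states
        # P_t concatenated with the reference trajectory chosen from the
        # predicted intention. The 30-step horizon is illustrative.
        def __init__(self, state_dim=12, ref_dim=2, hidden_dim=64, horizon=30):
            super().__init__()
            self.lstm = nn.LSTM(state_dim + ref_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, horizon * 2)
            self.horizon = horizon
        def forward(self, states, reference):
            # states: (batch, time, state_dim); reference: (batch, time, 2)
            x = torch.cat([states, reference], dim=-1)
            _, (h, _) = self.lstm(x)
            return self.head(h[-1]).view(-1, self.horizon, 2)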
V. EXPERIMENT

In this section, we show the results of the methods corresponding to the different data processing procedures.
A. UrbanFlow

1) Video Stabilization:
In the previous section, we introduced the combination of feature-matching-based and homography-based alignment methods. Here we compare different combinations of feature-matching-based and homography-based video stabilization algorithms with various down-sampling ratios. Table I shows the results of the different choices of algorithm and the corresponding structural similarity (SSIM) score, which is used to calculate the similarity between any two images; a higher SSIM score indicates a better stabilization result.

TABLE I: Comparison between different stabilization methods. DS means down-sampled; the three DS rows per detector correspond to successively larger down-sampling ratios.

  Method               Processing Time (s/frame)   SSIM
  ORB + ECC w/o DS     1.7609                      0.8032
  ORB + ECC, DS        0.6724                      0.7759
  ORB + ECC, DS        0.4779                      0.7324
  ORB + ECC, DS        0.4071                      0.7160
  SURF + ECC w/o DS    0.6599                      0.81896
  SURF + ECC, DS       0.6627                      0.7758
  SURF + ECC, DS       0.4960                      0.7336
  SURF + ECC, DS       0.3750                      0.7166
  ECC, DS              9.3450                      0.9404
  ORB                  3.4665                      0.8278
  SIFT                 13.575                      0.8370
  SURF                 2.1060                      0.8390

We finally chose the Speeded Up Robust Features (SURF) detector combined with ECC and down-sampling to get a relatively good tradeoff between stabilization quality and computational efficiency. We visualize images with and without stabilization in Figure 7 with four sub-figures.

Fig. 7: Two blended frames before stabilization and after stabilization.

The Reference Frame shows the anchor frame for the stabilization. Ideally, the roads would be perfectly aligned in the Reference Frame and the Target Frame. Before stabilization, the Reference Frame and the Target Frame are blended, which is shown as the Target Blended Frame. It is obvious that the two frames have a large misalignment. After stabilization of the frame, the result is shown as the Stabilized Frame, and the new blended result is shown as the Stabilized Blended Frame.
2) Vehicle Detection:
Using RetinaNet for vehicle detection, we trained a good model to detect vehicles from a bird's-eye view. In testing, only 97 out of 2322 vehicles were not detected, giving an accuracy of 96%, and the average intersection over union is 92%. This accuracy is high because the vehicles in the test cases are similar to the ones seen during training. False positives were removed using non-maximum suppression and by thresholding the confidence score of each prediction. If vehicles such as a bus appear in testing but never appeared in training, they will not be detected, since they are too different from what the network has learned in both size and color. Images with incorrect detections were relabeled to fine-tune the network. For each frame, RetinaNet is applied to detect vehicles.

Fig. 8: Vehicle detection results of the RetinaNet algorithm for a selected frame.

Figure 8 visualizes one of the testing images after applying the vehicle detection method. The original image-based positions of all the red bounding boxes detected as vehicles are saved. Table II shows the quantitative results of training and testing, where GT abbreviates the area of Ground Truth and PR the area of the Predicted Result; the overlap measures used are (GT ∩ PR)/(GT ∪ PR) (intersection over union), (GT ∩ PR)/GT, and (GT ∩ PR)/PR.

TABLE II: Detection results.

  P / T     Car (Train / Test)   No Car (Train / Test)
  Car       387 / 2369           1 / 0
  No Car    3 / 72               0 / 0
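The per-box overlap used above can be computed with a straightforward sketch, assuming axis-aligned (x1, y1, x2, y2) boxes:

    def box_iou(gt, pr):
        # Intersection over union of ground-truth and predicted boxes.
        ix = max(0.0, min(gt[2], pr[2]) - max(gt[0], pr[0]))
        iy = max(0.0, min(gt[3], pr[3]) - max(gt[1], pr[1]))
        inter = ix * iy
        union = ((gt[2] - gt[0]) * (gt[3] - gt[1])
                 + (pr[2] - pr[0]) * (pr[3] - pr[1]) - inter)
        return inter / union if union > 0 else 0.0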
3) Trajectory Smoothing:
Most existing vehicle trajectory datasets, such as NGSIM [3], only provide raw trajectories, which are noisy and therefore hard to use directly.

Fig. 9: Comparison between vehicle trajectories with and without smoothing.

Figure 9 visualizes the trajectories of one of the vehicles with and without smoothing. Figure 9(a) shows the result with equal scaling of the x and y axes; it is hard to see the difference between the trajectories with (red) and without (green) smoothing. However, when the x axis is enlarged in Figure 9(b), the trajectory without smoothing (green) is much jerkier than the one with smoothing (red). Finally, a video (https://youtu.be/oTPgLUdN_cU) includes all the dynamic results produced by the pipeline.

B. Prediction

1) Scenario:
We tested the algorithm on the UrbanFlow dataset. We selected pairs of interacting vehicles with the driving directions of GS and TL or TL and TR from the dataset.

Fig. 10: Scenario of an interacting vehicle pair.

Figure 10 shows a pair of interacting vehicles. The blue rectangle labeled E is the ego car and the green rectangle labeled T is the target vehicle. The input state includes the velocities, heading angles and relative distances to the intersection center of both the ego and target cars.
2) Intention Prediction:
According to the described state, the intention network predicts the direction intention as well as the yield intention.

Fig. 11: The direction and yield prediction results of selected interacting pairs. GT means that the corresponding intention is the ground truth of the selected pair. The direction intention of the ego car is noted in parentheses in the ego-car legend.

Figure 11 visualizes the results of the direction and yield intentions. According to the coordinate transformation, all the ego cars approach the intersection (the intersection center is at coordinate (0, 0)) from the bottom. Different colors with marker • show the direction prediction results when the target vehicle reaches that position, and the other colors with marker X present the yield intention prediction results.

Fig. 12: Direction and yield intention accuracy as well as the MSE of the trajectory prediction for the target vehicle.

Figure 12 shows the prediction accuracy with respect to the distance to the start of the intersection for the target vehicle.
3) Trajectory Prediction:
We compared the mean squared error (MSE) between trajectory predictions with and without the intention results and reference trajectories.

TABLE III: MSE of trajectory predictions.

  Method                                        Average MSE (m)
  LSTM                                          3.71
  LSTM w/ intention                             0.89
  LSTM w/ intention and reference trajectory    0.18

Table III presents the average MSE for the different methods, and Figure 12 shows the MSE with respect to the distance to the start of the intersection for the target vehicle. After a vehicle passes the start position of the intersection, the trajectories become diverse due to the various direction intentions; as a result, the MSE of the trajectory prediction increases.

VI. CONCLUSIONS

In this paper, we propose a pipeline called UrbanFlow for processing traffic data collected by drones in urban environments. The raw data are processed through video stabilization, vehicle detection, map construction and coordinate transformation, vehicle tracking, and trajectory smoothing. Moreover, the paper proposes a method for driving behavior prediction and tests it on the UrbanFlow dataset. Future work on improving the dataset will focus on increasing its quantity. More types of urban scenarios, such as T-intersections, stop-sign intersections and yield intersections, will be included.
REFERENCES

[1] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in Proceedings of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.
[2] "VTD homepage," 2019. [Online]. Available: https://vires.com/vtd-vires-virtual-test-drive
[3] "NGSIM homepage," FHWA, 2005–2006. [Online]. Available: http://ngsim.fhwa.dot.gov
[4] R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein, "The highD dataset: A drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC), 2018.
[5] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[6] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, "1 year, 1000 km: The Oxford RobotCar dataset," The International Journal of Robotics Research (IJRR), vol. 36, no. 1, pp. 3–15, 2017. [Online]. Available: http://dx.doi.org/10.1177/0278364916679498
[7] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, "BDD100K: A diverse driving video database with scalable annotation tooling," arXiv preprint arXiv:1805.04687, 2018.
[8] M. Liebner, M. Baumann, F. Klanner, and C. Stiller, "Driver intent inference at urban intersections using the intelligent driver model," in 2012 IEEE Intelligent Vehicles Symposium. IEEE, 2012, pp. 1162–1167.
[9] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, "Generalizable intention prediction of human drivers at intersections," in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017, pp. 1665–1670.
[10] G. D. Evangelidis and E. Z. Psarakis, "Parametric image alignment using enhanced correlation coefficient maximization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 10, pp. 1858–1865, 2008.
[11] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," in Readings in Computer Vision. Elsevier, 1987, pp. 726–740.
[12] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[13] Y. Chan, A. Hu, and J. Plant, "A Kalman filter based tracking scheme with input estimation," IEEE Transactions on Aerospace and Electronic Systems, no. 2, pp. 237–244, 1979.
[14] Stack Overflow, "cv2 motion euclidean for the warpMode in ECC image alignment method."
[15] H. Caesar, J. Uijlings, and V. Ferrari, "COCO-Stuff: Thing and stuff classes in context," in Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018.
[16] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[17] V. Iglovikov and A. Shvets, "TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation," arXiv preprint arXiv:1801.05746, 2018.
[18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.