An Evaluation of Trajectory Prediction Approaches and Notes on the TrajNet Benchmark
Stefan Becker*, Ronny Hug*, Wolfgang Hübner and Michael Arens
Fraunhofer Institute for Optronics, System Technologies, and Image Exploitation IOSB
Gutleuthausstr. 1, 76275 Ettlingen, Germany

ABSTRACT
In recent years, there has been a shift from modeling the tracking problem in a Bayesian formulation towards using deep neural networks. Towards this end, this paper evaluates the effectiveness of various deep neural networks for predicting future pedestrian paths. Like the traditional approaches, the analyzed deep networks rely solely on observed tracklets, without human-human interaction information. The evaluation is done on the publicly available TrajNet benchmark dataset, which builds up a repository of considerable and popular datasets for trajectory-based activity forecasting. We show that a recurrent encoder with a dense layer stacked on top, referred to as RED-predictor, is able to achieve results on par with elaborated models in such scenarios. Further, we investigate failure cases, give explanations for the observed phenomena, and make recommendations for overcoming the demonstrated shortcomings.

Keywords: Trajectory Forecasting, Path Prediction, Trajectory-based Activity Forecasting
1. INTRODUCTION
The prediction of possible future paths is a central building block for automated risk assessment. The applications cover a wide range from mobile robot navigation, including autonomous driving, and smart video surveillance to object tracking. The many variants of forecasting approaches can be roughly divided by asking how the problem is addressed or what kind of information is provided. Firstly, the ways of addressing the problem reach from traditional approaches like the Kalman filter, linear or Gaussian regression models, auto-regressive models, and time-series analysis to optimal control theory, deep learning combined with game theory, or the application of deep convolutional networks and recurrent neural networks (RNNs) treating prediction as a sequence generation problem. Secondly, the grouping can be done by the provided information. On the one hand, the approaches can rely solely on observations of consecutive positions extracted by visual tracking; on the other hand, they can use richer context information, for example human-human interactions or human-space interactions, or additional visually extracted information like pedestrian head orientation or head poses. Representative approaches which model human-human interactions are the works of Helbing and Molnár and Coscia et al., or approaches in combination with RNNs.10,11
The spatial context of motion can in principle be learned by training a model on observed positions of a particular scene, but it is not guaranteed that the model successfully captures spatial points of interest; it may only implicitly keep spatial information by performing path integration in order to predict new positions. Nevertheless, here we distinguish such approaches from approaches where scene context is provided as a further cue, for example by semantic labeling or scene encoding. The challenges of the Trajectory Forecasting Benchmarking (TrajNet) are designed to cover some inherent properties of human motion in crowded scenes. The World H-H TrajNet challenge in particular looks at predicting motions in world plane coordinates of human-human interactions. The aim of this paper is to find an effective baseline predictor based only on the partial history and to find the maximum achievable prediction accuracy for this challenge. Achieving this objective involves an evaluation of different deep neural networks for trajectory prediction and an analysis of the dataset properties. Further, we propose small changes and pre-processing steps to modify a standard RNN prediction model, resulting in a simple but effective RNN architecture that obtains performance comparable to more elaborated models which additionally capture the interpersonal aspect of human-human interaction.
The paper is structured as follows. Firstly, the properties of the TrajNet benchmark dataset are analyzed in section 2. Then, some basic deep neural networks are shortly described and evaluated, and the modifications made in order to increase the prediction performance are presented (section 3). The achieved results and an additional failure analysis are presented in section 4. Finally, a conclusion is given in section 5.

* Equal contribution.
2. TRAJNET BENCHMARK DATASET ANALYSIS
The trajectory forecasting challenges of TrajNet provide the community with a defined and repeatable way of comparing path prediction approaches as well as a common platform for discussions in the field. In this section, some properties of the current repository of popular datasets for trajectory-based activity forecasting, as used for the World H-H TRAJ challenge, are analyzed, and design choices for the proposed predictor are deduced from them.
Table 1. Training (green) and test (cyan) datasets of the world plane human-human dataset challenge (adapted from the TrajNet website).

Name                          Resolution  # Tracklets  FPS  Reference
Training:
BIWI Hotel                    720×576     389          2.5  Pellegrini et al.
Crowds Zara                   720×576     204          2.5  Lerner et al.
Crowds Students               720×576     415          2.5  Lerner et al.
Crowds Arxiepiskopi           720×576     24           2.5  Lerner et al.
PETS 2009                     768×576     19           2.5  Ferryman et al.
Stanford Drone Dataset (SDD)  varying *                     Robicquet et al.
Test:
BIWI ETH                      640×480     360          2.5  Pellegrini et al.
Crowds Zara                   720×576     148          2.5  Lerner et al.
Crowds Uni Examples           720×576     118          2.5  Lerner et al.
Stanford Drone Dataset (SDD)  varying *                     Robicquet et al.

In most datasets, the scene is observed from a bird's eye view, but there are also scenarios where the scene is observed under a higher depression angle. The selected surveillance datasets cover real-world scenarios with varying crowd densities and varying complexity of trajectory patterns. Details of the datasets are summarized in table 1 (adapted from the TrajNet website). The selection includes the following datasets. The BIWI Walking Pedestrians Dataset, also sometimes referenced as ETH Walking Pedestrians (EWAP), is split into two sets (ETH and Hotel). The Crowds dataset, also called the UCY "Crowds-by-Example" dataset, contains three scenes from an oblique view: the first (Zara) shows a part of a shopping street, the second (Students / Uni Examples) captures a part of the university campus, and the third scene (Arxiepiskopi) captures a different part of the campus. Then, the Stanford Drone Dataset (SDD) consists of multiple aerial images capturing different locations around the Stanford campus. And finally, in the PETS 2009 dataset, different outdoor crowd activities are observed by multiple static cameras. Sample images with full trajectories and tracklets are shown in figure 1.
It is common and good practice to apply cross-validation. For the TrajNet challenge, this is done by omitting complete datasets for testing. Because the behavior of humans in crowds is scene-independent, and for measuring the generalization capabilities of various approaches across datasets, this is very reasonable, in particular for providing a benchmark for human-human interactions. Nevertheless, by combining all training sets, the spatial context of scene-specific motion and the reference systems are lost. When relying only on observed motion trajectories, positional information is crucial in order to learn spatio-temporal variation. For example, the sidewalks in the Hyang sequences (see figure 1) lead to a spatially dependent change in the curvature of a trajectory. Since our focus is on deep neural networks including RNNs, the shift from position information to higher-order motion helps to overcome some drawbacks. Before RNNs were successfully applied for tracking pedestrians in surveillance scenarios, they gained attention due to their success in tasks like speech recognition23,24 and caption generation.25,26
These domains differ from trajectory prediction in certain aspects; in particular, position-dependent movement plays no important role there. Accordingly, RNNs can benefit from conditioning on previous offsets for scene-independent motion prediction. This insight is not new, yet utilizing offsets not only helps stabilizing the learning process but also improves the prediction performance for the evaluated networks. This shift to offsets, or rather velocities, has also been successfully applied, for example, to the prediction of human poses based on RNNs. In the context of deep networks, the same effect can be achieved by adding residual connections, which have been shown to improve the performance of deep convolutional networks. Presumably due to the limitation of the input and output spaces, for the TrajNet challenge predicting the following offsets (where will the person go next)12,29 instead of predicting the next position (where will the person be next) also contributed to increased prediction accuracy. This becomes immediately apparent by looking at the complete tracklets of the training and test set (see figure 2). Firstly, it takes considerably more modeling effort to represent all possible positions than to model particular velocities. Further, input data outside the training range can lead to undefined states in the deep network, which result in an unreasonably random output. Some of the initialization tracklets clearly lie outside the training input space. Also, approaches which profit from human-human interaction10,11,14,30 in combination with deep networks lack information here about surrounding persons to interact with, so that decoding relative distances is not possible because of the reduced person density.

Figure 1. Example trajectories from the BIWI ETH dataset and example tracklets from the sequence Hyang 07 from the Stanford Drone Dataset (SDD).

Another factor for improving the prediction performance becomes apparent when contemplating the offset distribution of the data. Figure 3 shows the offset histograms for x and y separately. Due to the loss of the reference system, it is impossible to assume a reasonable location distribution a-priori. In contrast, the offset and magnitude distributions clearly reflect the preferred walking speeds in the data. The histograms also show that a large number of persons are standing. In the recent work of Hasan et al., it was emphasized that forecasting errors are in general higher when the speed of persons is lower, and it was argued that when persons are walking slowly their behavior becomes less predictable, due to physical reasons (less inertia). During our testing we observed the same phenomenon. In particular, RNN-based networks tend to overestimate slow velocities and sometimes do not accurately identify standing behavior. Despite this problem, the range of offsets is very limited compared to the location distribution and shows a clear tendency towards expected a-priori values. Common techniques for sequence prediction problems are normalization and standardization of the input data. Whereas normalization plays a similar role for the position data, applying standardization to position input data shows no benefit. In our experiments, standardization of offsets worked slightly better than normalization or an embedding layer for input encoding. Although the effect on the performance is quite low for the TrajNet challenge, our best result is achieved using standardized offsets as input. It is rarely strictly necessary to standardize the inputs, but there are practical reasons like accelerating the training or reducing the chances of getting stuck in local optima. Predicting offsets also guarantees that the output conforms better with the range of common activation functions.

Figure 2. (Left) Visualization of all tracklets of the training set from the TrajNet dataset collection. (Right) Visualization of all initialization tracklets of the test set.

Figure 3. (Left, Middle) Offset histograms of the training set. (Right) Magnitude histogram of the offsets.
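The pre-processing discussed above, conditioning on offsets instead of positions and standardizing them, can be sketched in a few lines. The snippet below is a minimal, framework-free illustration in plain numpy; the tracklet is synthetic, and in practice the standardization statistics would be fitted over all training tracklets:

```python
import numpy as np

def to_offsets(track):
    """(T, 2) positions -> (T-1, 2) offsets between consecutive positions."""
    return np.diff(track, axis=0)

def fit_standardizer(all_offsets):
    """Mean/std per coordinate, estimated from training offsets."""
    mean = all_offsets.mean(axis=0)
    std = all_offsets.std(axis=0) + 1e-8   # avoid division by zero
    return mean, std

def standardize(offsets, mean, std):
    return (offsets - mean) / std

# Synthetic, roughly straight tracklet: 8 observed positions in world
# coordinates (meters), matching the TrajNet initialization length.
track = np.stack([np.linspace(0.0, 3.5, 8), np.linspace(0.0, 0.7, 8)], axis=1)
offs = to_offsets(track)                   # shape (7, 2)
mean, std = fit_standardizer(offs)         # here fitted on one tracklet only
net_input = standardize(offs, mean, std)   # what the network is conditioned on

print(offs.shape, net_input.shape)
```

At prediction time, the predicted offsets are mapped back to positions by cumulative summation starting from the last observed position.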
Without discretization artifacts, the dynamics of humans are smooth and persistent. The trajectory data from the TrajNet dataset includes varying discretization artifacts and noise levels, resulting from the different methods with which the ground truth data was generated: part of the ground truth trajectories are generated by a visual tracker, others are manually annotated. To approximate the amount of noise in the datasets, the distance between a smoothed spline fit through the complete tracklets and the provided ground truth tracklet points is compared. The spline fitting is done with a polynomial of degree k = 4, independently for the x and y values. If the smoothing is too strong, the fit can drift too far away from the actual data. Nevertheless, the fitted trajectories form smooth and natural paths and are used as a rough assessment of the noise levels in the ground truth trajectory data. The results for the training set are summarized in table 2.
The approximated noise levels clearly show the variation in the ground truth data. In order to outperform a linear baseline predictor, the learned model must be able to successfully model different velocity profiles and capture curved paths from input data with different noise levels. Due to the varying noise levels, initial experiments that trained solely on smoothed fitted trajectories with synthetic noise performed worse. Nevertheless, for the prediction of the future steps the best performing predictor is trained to forecast smoothed paths.
Before the different evaluated models are introduced, the last data analysis of the training set is intended to assess the linearity of the tracklets. R² for a linear interpolation is calculated separately for the x and y values. This linear interpolation serves as the baseline predictor for the TrajNet challenge. The histograms of R² for the training set are shown in figure 4. R² is the percentage of the variation that is explained by the model and is used here to determine the suitability of the regression fit as a linearity measure. The average R² values are summarized in table 2. It can be seen that for most tracklets a linear interpolation works very well. In order to outperform the linear interpolation baseline, it is crucial not only to cover a variety of complex observed motions, but also to produce robust results in simpler situations. As mentioned above, the person's velocity has to be effectively captured by the model.

Table 2. Standard deviation of the distance between a smoothed spline fit and the ground truth trajectory data (σx,spline [m], σy,spline [m]), together with the average R² score (R̄²x, R̄²y, overall) for all tracklets in the subsets: BIWI Hotel; Crowds Zara 02, 03; Crowds Students 01, 03; Crowds Arxiepiskopi 01; PETS 2009 S2L1; SDD Bookstore 00-03; SDD Coupa 03; SDD Deathcircle 00-04; SDD Gates 00, 01, 03-08; SDD Hyang 04-07, 09; SDD Nexus 00-04, 07-09.

Figure 4. Coefficient of determination R² for x and y for all training tracklets of the World H-H TRAJ challenge.
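Both tracklet statistics used in this analysis, the smoothing-based noise estimate and the linearity measure R², can be sketched as follows. To keep the example self-contained, the degree-4 smoothing-spline fit is approximated with a degree-4 polynomial fit over the frame index via numpy, and the noisy tracklet is synthetic:

```python
import numpy as np

def noise_sigma(track, deg=4):
    """Std. dev. of the distance between a smooth degree-4 fit
    (per axis, over the frame index) and the tracklet points."""
    t = np.arange(len(track))
    fit = np.stack([np.polyval(np.polyfit(t, track[:, d], deg), t)
                    for d in range(2)], axis=1)
    return np.linalg.norm(track - fit, axis=1).std()

def r2_linear(track):
    """Coefficient of determination of a linear fit, per coordinate."""
    t = np.arange(len(track))
    r2 = []
    for d in range(2):
        y = track[:, d]
        pred = np.polyval(np.polyfit(t, y, 1), t)
        ss_res = ((y - pred) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2.append(1.0 - ss_res / ss_tot)
    return r2

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 20)
clean = np.stack([3.5 * t, 0.5 * t ** 2], axis=1)   # smooth, slightly curved path
noisy = clean + rng.normal(0, 0.05, clean.shape)    # synthetic annotation noise

print(round(noise_sigma(noisy), 3))                 # rough noise-level estimate
print([round(v, 3) for v in r2_linear(noisy)])      # near 1 for near-linear motion
```

The same R² computation, applied to a linear interpolation of each tracklet, yields the linearity histograms of figure 4.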
3. MODELS AND EVALUATION
The goal of this work is to reach the maximum achievable prediction accuracy with basic networks, without further cues like human-human interaction or human-space interaction, using a coarse-to-fine searching strategy. Towards this end, we started with a set of networks with a limited set of hyper-parameters and narrowed it down to one network, in order to then extend the hyper-parameter set for a more exhaustive tuning. The multi-modal aspect of trajectory prediction is hardly considerable when there is no fixed reference system. Thus, in accordance with the community, the performance is compared with the two error metrics of the average displacement error (ADE) and the final displacement error (FDE) (see for example 10,14,18,19,30,33). The average of both combined values is then used as an overall average to rank the approaches. The ADE is defined as the average L2 distance between ground truth and prediction over all predicted time steps, and the FDE is defined as the L2 distance between the predicted final position and the true final position. For the World H-H TrajNet challenge, the unit of the error metrics is meters. For all experiments, 8 (3.2 seconds) consecutive positions are observed and the following 12 (4.8 seconds) positions have to be predicted. For the World H-H TrajNet challenge, the following basic neural networks are selected for a coarse evaluation:
Multi-Layer-Perceptron (MLP):
The MLP is tested with different linear and non-linear activation functions. One variation concatenates all inputs and predicts the 24 outputs directly. Further, cascaded architectures with a step-wise prediction are examined. We vary between different coordinate systems, Euclidean and polar. As mentioned in section 2, positions and offsets (also orientation-normalized) are considered as inputs and outputs.
RNN-MLP:
RNNs extend feed-forward networks, or rather the MLP model, with recurrent connections between hidden units. Vanilla RNNs produce an output at each time step. For the evaluation of the RNN-MLP, we vary only the MLP layer which is used for decoding the positions and offsets.
RNN-Encoder-MLP:
In contrast to the RNN-MLP network, the complete initialization tracklet is used to generate the internal representation before a prediction is made. The RNN-Encoder-MLP is varied by alternating activation functions for the MLP and by alternatively predicting the complete future path/offsets instead of only the next steps. As a further alternative, the full path is predicted as offsets to one reference point instead of applying path integration in order to predict the final position.
RNN-Encoder-Decoder-Model (Seq2Seq):
In addition to the RNN-Encoder-MLP, a Seq2Seq model includes a second network: the decoder network takes the internal representation of the encoder and then starts predicting the next steps. The different settings for the evaluation of this model consist of alternating activation functions for the MLP on top of the decoder RNN.
Temporal Convolutional Networks (TCN):
As an alternative to RNNs and based on WaveNet, Bai et al. introduced a general convolutional architecture for sequence prediction. We tested their standard architecture and an extended architecture with a gating mechanism (GTCN). For a more detailed description, we refer to the original papers.
All networks were trained with a varying number of layers (1 to 5) and hidden units (4 to 64) using stochastic gradient descent with a fixed learning rate, and have been implemented in Tensorflow. Firstly, only standard RNN cells are used for the experiments. Later, we also tested the RNN variants Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). As loss, the mean squared error between the predicted and the ground truth positions or offsets over all time steps is used.
In order to emphasize trends, a part of the results of the first experiments is summarized in table 3 (highlighted in gray). The best results were achieved with the RNN-Encoder-MLP. However, in most cases the different architectures perform very similarly. These initial results also show that the best performing networks lie close to the result achieved with linear interpolation. Outlier weak performances are due to some strong overestimation of slow person velocities and some undefined random predictions when using positions. Hasan et al. reduced this effect by integrating head pose information. We can only remark that for the tested networks this effect can also differ between runs. Naturally, it is important that during training the networks see enough samples of standing or slow-moving situations. Excluding such samples through heuristic or probabilistic filtering only helps during application.
There is no network that clearly performs best, and even the gap between an MLP predictor and a Seq2Seq model is very narrow in the test scenarios. However, besides the factors derived from the data analysis, a prediction of the full path instead of a step-wise prediction helps to overcome an accumulation of errors that are fed back into the networks.
For the TrajNet challenge with a fixed prediction horizon, we thus prefer the RNN-Encoder-MLP over a Seq2Seq model. In the domain of human pose prediction based on RNNs, Li et al. reduced this problem with an auto-conditioned RNN, and Martinez et al. propose using a Seq2Seq model along with a sampling-based loss. The TCNs perform similarly to RNNs here. Since RNNs are more common, also as part of architectures which model interactions (see 10,11,14,18) to represent single motion, we keep the RNN-Encoder-MLP as our favored model.
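The two evaluation metrics used throughout this section follow directly from their definitions. A minimal numpy sketch, with a synthetic ground truth/prediction pair:

```python
import numpy as np

def ade(pred, gt):
    """Average displacement error: mean L2 distance over all predicted steps."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def fde(pred, gt):
    """Final displacement error: L2 distance at the last predicted step."""
    return np.linalg.norm(pred[-1] - gt[-1])

# Synthetic example: 12 predicted steps; the prediction is shifted
# by a constant 0.3 m in x, so both metrics report 0.3.
gt = np.stack([np.linspace(0.4, 4.8, 12), np.zeros(12)], axis=1)
pred = gt + np.array([0.3, 0.0])

print(ade(pred, gt), fde(pred, gt))
```

The overall TrajNet ranking score is then the mean of the two values.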
RNN-Encoder-MLP → RED-predictor:
According to the training set analysis and the comparison of architectures, the model selected for the TrajNet challenge, modeling only single human motion, is an RNN-Encoder-MLP. The RNN encoder can generalize to deal with varying noisy inputs and thus is able to capture the person's motion better than the linear interpolation baseline. The main insight is that motion continuity is easier to express in offsets or velocities, because it takes considerably more modeling effort to represent all possible conditioning positions. Especially for the World H-H TRAJ challenge, with the different ranges of positions in the training and test set, this has a significant influence on whether a good performance can be obtained. In combination with simple but helpful data pre-processing, the transfer to directly using smoothed trajectories as desired output, and a full path prediction to prevent error accumulation during a step-wise prediction, our simple but effective baseline predictor for the TrajNet challenge is ready. As a recurrent encoder with a dense MLP layer stacked on top, the predictor is referred to as RED-predictor. The last input position is used as reference point for all predicted offsets of the full smooth path. Full path integration worked similarly well, but here offsets to the reference position are predicted.
The best achieved result is highlighted in red in table 3. After a fine search for this network, the shown result is produced with an LSTM cell (state size of 32) and one recurrent layer. The proposed predictor was able to produce competitive results compared to elaborated models which additionally rely on interaction information, like the model from Helbing and Molnár and the Social-LSTM. The Social-LSTM is one of the first proposed RNN-based architectures which include human-human interaction, and it laid the basis for architectures like those presented in the work of Hasan et al. or Xue et al. Single motion is modeled with an LSTM network. By applying some of the proposed factors to the model, it is expected that the model, and accordingly model extensions, are able to outperform the proposed single motion predictor.

Table 3. Results for the world plane human-human dataset challenge (World H-H TRAJ challenge). Results highlighted in blue are taken from the TrajNet website (http://trajnet.stanford.edu/, accessed 19.05.2018).

Approach                        Overall Average ↓  FDE [m] ↓  ADE [m] ↓  Reference
RED (proposed)
Social Forces (ATTR)            0.904              1.395      0.412      Helbing and Molnár
social lstm v2                  1.387              2.098      0.675      Alahi et al.
social lstm                     1.563              2.299      0.826      Alahi et al.
social lstm v3                  2.874              4.323      1.424      Alahi et al.
Interactive Gaussian Processes  1.642              1.038      2.245      Ellis et al.
Linear Interpolation            0.894              1.359      0.429
Linear MLP (Pos)                1.041              1.592      0.491
Linear MLP (Off)                0.896              1.384      0.407
Non-Linear MLP (Off)            2.103              3.181      1.024
Linear RNN                      0.951              1.482      0.420
Non-Linear RNN
Gated TCN                       0.947              1.468      0.426      Bai et al.
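The data flow of the RED-predictor described above can be sketched as follows. To keep the example self-contained and framework-free, the LSTM encoder is replaced by an untrained vanilla tanh RNN cell in plain numpy; the layer sizes and random weights are illustrative, and only the structure follows the text: conditioning on offsets, encoding the full initialization tracklet, and one dense layer emitting all 12 future offsets relative to the last observed position.

```python
import numpy as np

OBS_LEN, PRED_LEN, STATE = 8, 12, 32
rng = np.random.default_rng(0)

# Encoder weights (vanilla tanh RNN cell) and dense output-layer weights.
W_in  = rng.normal(0, 0.1, (2, STATE))        # input: 2-d offset per step
W_rec = rng.normal(0, 0.1, (STATE, STATE))
b_rec = np.zeros(STATE)
W_out = rng.normal(0, 0.1, (STATE, PRED_LEN * 2))
b_out = np.zeros(PRED_LEN * 2)

def red_predict(track):
    """track: (OBS_LEN, 2) observed positions -> (PRED_LEN, 2) future positions."""
    offsets = np.diff(track, axis=0)          # condition on offsets, not positions
    h = np.zeros(STATE)
    for o in offsets:                         # encode the whole initialization tracklet
        h = np.tanh(o @ W_in + h @ W_rec + b_rec)
    # One dense layer predicts all future offsets at once (full-path prediction),
    # taken relative to the last observed position as reference point.
    pred_off = (h @ W_out + b_out).reshape(PRED_LEN, 2)
    return track[-1] + pred_off

track = np.stack([np.linspace(0, 3.5, OBS_LEN), np.zeros(OBS_LEN)], axis=1)
future = red_predict(track)
print(future.shape)
```

In the actual model, the tanh cell is an LSTM with state size 32, the weights are trained with an MSE loss against the smoothed future paths, and the inputs are standardized offsets.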
4. DISCUSSION AND FAILURE CASES
After emphasizing the factors needed in order to achieve competitive results based on standard neural networks in the sections above, in this section we discuss some failure cases.
Without exploiting scene-specific knowledge for trajectory prediction, some particular changes of behavior in human motion are not predictable. For example, in the shown tracklet from SDD Hyang (see figure 5), there is no cue for a turning maneuver in the initialization tracklet. In order to correct the prediction, new observations are required. All methods tend to predict a relatively straight line in such a situation, resulting in a high prediction error. A scene-independent motion representation is pursued in order to generalize better, but to overcome this limitation in the achievable prediction accuracy, the spatial context is required. The sample tracklet also illustrates the multi-modal nature of the prediction problem: while the person is making a left turn, a right turn would also be possible. With a single maximum-likelihood path, the multi-modality of a motion and the uncertainty in the prediction are not covered. The prediction uncertainty can be considered by using the normalized estimation error squared (NEES), also known as the Mahalanobis distance, which corresponds to a weighted Euclidean distance of the errors. But most methods are designed as regression models, so for a unified evaluation system the Mahalanobis distance is not applicable. As mentioned, there are a few approaches which include the multi-modal aspect of the problem.7,29,43 Without additional cues from the current scene, these approaches are limited to a fixed scene.

Figure 5. Example where the scene context strongly influences the person's trajectory. The initialization tracklet (solid line) delivers no evidence for a turning maneuver at the intersection. This also shows the multi-modal nature of the prediction problem.

Independent of the question of how to include all aspects of the problem in a unified benchmark, they strongly influence the possibly achievable results. The results presented in section 3 show that, independent of the model complexity, approaches restricted to observing only information from one trajectory are within range of their reachable performance limit on the current dataset repository. Of course, due to the fast development in the field of deep neural networks there is still space for improvement, but the current benchmark cannot be completely solved. However, the TrajNet challenges also provide human-human and human-space information, and recent work like the approaches of Gupta et al. (human-human) or Xue et al. and Sadeghian et al. (human-human, human-space) show possibilities of how to further improve the prediction accuracy.
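The NEES mentioned above weights the prediction error by the inverse of a predictive covariance. A minimal numpy sketch, where the prediction, ground truth, and covariance are synthetic placeholders:

```python
import numpy as np

def nees(pred, gt, cov):
    """Normalized estimation error squared: the squared Mahalanobis
    distance of the error under the predicted covariance."""
    e = pred - gt
    return float(e @ np.linalg.inv(cov) @ e)

# Synthetic 2-d position error with an anisotropic predictive covariance:
# the (hypothetical) predictor is more uncertain along x than along y.
pred = np.array([2.0, 1.0])
gt   = np.array([1.6, 1.1])
cov  = np.diag([0.16, 0.01])

# The larger x-error counts the same as the small y-error, because the
# predictor declared more uncertainty along x.
print(nees(pred, gt, cov))   # ≈ 2.0
```

Using such a measure, however, requires every method to output a covariance, which pure regression models do not provide.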
5. CONCLUSION
In this paper, we presented an evaluation of deep learning approaches for trajectory prediction on the TrajNet benchmark dataset. The initial results showed that without further cues like human-human interaction or human-space interaction, most basic networks achieve similar results in a small range close to a maximum achievable prediction accuracy. By modifying a standard RNN prediction model, we were able to provide a simple but effective RNN architecture that achieves a performance comparable to more elaborated models.
Acknowledgements: The authors thank the organizers of the TrajNet challenge for providing a framework towards a more meaningful, standardized trajectory prediction benchmarking.
REFERENCES
1. Sadeghian, A., Kosaraju, V., Gupta, A., Savarese, S., and Alahi, A., “Trajnet: Towards a benchmark forhuman trajectory prediction,” arXiv preprint (2018).2. Kalman, R. E., “A new approach to linear filtering and prediction problems,”
ASME Journal of BasicEngineering (1960).3. McCullagh, P. and Nelder, J. A., [ Generalized Linear Models ], Chapman & Hall , CRC, London (1989).4. Williams, C. K. I., [
Prediction with Gaussian Processes: From Linear Regression to Linear Prediction andBeyond ], 599–621, Springer Netherlands, Dordrecht (1998).5. Akaike, H., “Fitting autoregressive models for prediction,”
Annals of the Institute of Statistical Mathemat-ics (1), 243–247 (1969).6. Priestley, M. B., [ Spectral analysis and time series ], Academic Press, London ; New York : (1981).. Kitani, K. M., Ziebart, B. D., Bagnell, J. A., and Hebert, M., “Activity forecasting,” in [
European Conferenceon Computer Vision (ECCV) ], 201–214, Springer Berlin Heidelberg, Berlin, Heidelberg (2012).8. Ma, W., Huang, D., Lee, N., and Kitani, K. M., “Forecasting interactive dynamics of pedestrians withfictitious play,” in [
Conference on Computer Vision and Pattern Recognition (CVPR) ], 4636–4644, IEEE(2017).9. Huang, S., Li, X., Zhang, Z., He, Z., Wu, F., Liu, W., Tang, J., and Zhuang, Y., “Deep learning drivenvisual path prediction from a single image,”
IEEE Transactions on Image Processing (12), 5892–5904(2016).10. Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., and Savarese, S., “Social LSTM: Humantrajectory prediction in crowded spaces,” in [ Conference on Computer Vision and Pattern Recognition(CVPR) ], 961–971, IEEE (2016).11. Alahi, A., Ramanathan, V., Goel, K., Robicquet, A., Sadeghian, A., Fei-Fei, L., and Savarese, S., “Learningto predict human behaviour in crowded scenes,” in [
Group and Crowd Behavior for Computer Vision ],Elsevier (2017).12. Hug, R., Becker, S., H¨ubner, W., and Arens, M., “On the reliability of lstm-mdl models for predictingpedestrian trajectories,” in [
Representations, Analysis and Recognition of Shape and Motion from ImagingData (RFMI) ], (2017).13. Kooij, J. F. P., Schneider, N., Flohr, F., and Gavrila, D. M., “Context-based pedestrian path prediction,”in [
European Conference on Computer Vision (ECCV) ], 618–633, Springer International Publishing (2014).14. Hasan, I., Setti, F., Tsesmelis, T., Bue, A. D., Galasso, F., and Cristani, M., “MX-LSTM: mixing trackletsand vislets to jointly forecast trajectories and head poses,” in [
Conference on Computer Vision and PatternRecognition (CVPR) ], IEEE (2018).15. Helbing, D. and Moln´ar, P., “Social force model for pedestrian dynamics,”
Phys. Rev. E , 4282–4286(1995).16. Coscia, P., Castaldo, F., Palmieri, F. A., Alahi, A., Savarese, S., and Ballan, L., “Long-term path predictionin urban scenarios using circulardistributions,” Image and Vision Computing , 81–91 (2018).17. Ballan, L., Castaldo, F., Alahi, A., Palmieri, F., and Savarese, S., [ Knowledge Transfer for Scene-SpecificMotion Prediction ], 697–713, Springer International Publishing (2016).18. Xue, H., Q., D., and Reynolds, H. M., “SS-LSTM: A hierarchical LSTM model for pedestrian trajectoryprediction,” in [
Winter Conference on Applications of Computer Vision (WACV)], IEEE (2018).
19. Pellegrini, S., Ess, A., Schindler, K., and van Gool, L., “You’ll never walk alone: Modeling social behavior for multi-target tracking,” in [International Conference on Computer Vision], 261–268, IEEE (2009).
20. Lerner, A., Chrysanthou, Y., and Lischinski, D., “Crowds by example,” Computer Graphics Forum (3), 655–664 (2007).
21. Ferryman, J. and Shahrokni, A., “PETS2009: Dataset and challenge,” in [IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS)], 1–6 (2009).
22. Robicquet, A., Sadeghian, A., Alahi, A., and Savarese, S., “Learning social etiquette: Human trajectory understanding in crowded scenes,” in [Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII], Leibe, B., Matas, J., Sebe, N., and Welling, M., eds., 549–565, Springer International Publishing (2016).
23. Graves, A., Mohamed, A., and Hinton, G., “Speech recognition with deep recurrent neural networks,” in [International Conference on Acoustics, Speech and Signal Processing], 6645–6649 (2013).
24. Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A., and Bengio, Y., “A recurrent latent variable model for sequential data,” in [Advances in Neural Information Processing Systems (NIPS)], (2015).
25. Donahue, J., Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T., “Long-term recurrent convolutional networks for visual recognition and description,” in [Conference on Computer Vision and Pattern Recognition], IEEE (2015).
26. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y., “Show, attend and tell: Neural image caption generation with visual attention,” in [International Conference on Machine Learning], Bach, F. and Blei, D., eds.,
Proceedings of Machine Learning Research, 2048–2057, PMLR, Lille, France (2015).
27. Martinez, J., Black, M. J., and Romero, J., “On human motion prediction using recurrent neural networks,” in [Conference on Computer Vision and Pattern Recognition (CVPR)], 4674–4683, IEEE (2017).
28. He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” in [
Conference on Computer Vision and Pattern Recognition (CVPR)], 770–778 (2016).
29. Hug, R., Becker, S., Hübner, W., and Arens, M., “Particle-based pedestrian path prediction using LSTM-MDL models,” in [IEEE International Conference on Intelligent Transportation Systems (ITSC) (accepted)], (2018).
30. Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., and Alahi, A., “Social GAN: Socially acceptable trajectories with generative adversarial networks,” in [Conference on Computer Vision and Pattern Recognition (CVPR)], IEEE (2018).
31. Brownlee, J., [Introduction to Time Series Forecasting with Python: How to Prepare Data and Develop Models to Predict the Future], Jason Brownlee (2017).
32. Draper, N. R. and Smith, H., [Applied Regression Analysis], Wiley Series in Probability and Mathematical Statistics, Wiley, New York (1966).
33. Vemula, A., Muelling, K., and Oh, J., “Modeling cooperative navigation in dense human crowds,” in [International Conference on Robotics and Automation (ICRA)], 1685–1692, IEEE (May 2017).
34. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K., “WaveNet: A generative model for raw audio,” arXiv preprint abs/1609.03499 (2016).
35. Bai, S., Kolter, J. Z., and Koltun, V., “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint abs/1803.01271 (2018).
36. Kingma, D. P. and Ba, J., “Adam: A method for stochastic optimization,” International Conference for Learning Representations (ICLR) (2015).
37. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X., “TensorFlow: Large-scale machine learning on heterogeneous systems,” (2015). Software available from tensorflow.org.
38. Hochreiter, S. and Schmidhuber, J., “Long Short-Term Memory,” Neural Computation (8), 1735–1780 (1997).
39. Cho, K., van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y., “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in [Conference on Empirical Methods in Natural Language Processing (EMNLP)], 1724–1734, Association for Computational Linguistics, Doha, Qatar (2014).
40. Ellis, D., Sommerlade, E., and Reid, I., “Modelling pedestrian trajectory patterns with Gaussian processes,” in [International Conference on Computer Vision Workshops (ICCVW)], 1229–1234, IEEE (2009).
41. Li, Z., Zhou, Y., Xiao, S., He, C., and Li, H., “Auto-conditioned LSTM network for extended complex human motion synthesis,” arXiv preprint abs/1707.05363 (2017).
42. Huber, M., Nonlinear Gaussian Filtering: Theory, Algorithms, and Applications, PhD thesis, Karlsruhe Institute of Technology (KIT) (2015).
43. Lee, N., Choi, W., Vernaza, P., Choy, C. B., Torr, P. H. S., and Chandraker, M., “DESIRE: Distant future prediction in dynamic scenes with interacting agents,” in [
Conference on Computer Vision and Pattern Recognition (CVPR)], IEEE (2017).
44. Sadeghian, A., Kosaraju, V., Sadeghian, A., Hirose, N., and Savarese, S., “SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints,” arXiv preprint abs/1806.01482 (2018).