An Evaluation of Trajectory Prediction Approaches and Notes on the TrajNet Benchmark
Stefan Becker*, Ronny Hug*, Wolfgang Hübner and Michael Arens
Fraunhofer Institute for Optronics, System Technologies, and Image Exploitation IOSB
Gutleuthausstr. 1, 76275 Ettlingen, Germany

ABSTRACT
In recent years, there has been a shift from modeling the tracking problem in a Bayesian formulation towards using deep neural networks. Towards this end, this paper evaluates the effectiveness of various deep neural networks for predicting future pedestrian paths. Like the traditional approaches, the analyzed deep networks rely solely on observed tracklets, without human-human interaction information. The evaluation is done on the publicly available TrajNet benchmark dataset, which builds up a repository of considerable and popular datasets for trajectory-based activity forecasting. We show that a recurrent encoder with a dense layer stacked on top, referred to as RED-predictor, is able to achieve results on par with elaborated models in such scenarios. Further, we investigate failure cases, give explanations for the observed phenomena, and make recommendations for overcoming the demonstrated shortcomings.

Keywords: Trajectory Forecasting, Path Prediction, Trajectory-based Activity Forecasting
1. INTRODUCTION
The prediction of possible future paths is a central building block for automated risk assessment. The applications cover a wide range from mobile robot navigation, including autonomous driving, and smart video surveillance to object tracking. The many variants of forecasting approaches can be roughly divided by asking how the problem is addressed or what kind of information is provided. Firstly, the ways of addressing the problem reach from traditional approaches like the Kalman filter, linear or Gaussian regression models, auto-regressive models, and time-series analysis to optimal control theory, deep learning combined with game theory, or the application of deep convolutional networks and recurrent neural networks (RNNs) treating prediction as a sequence generation problem. Secondly, the grouping can be done by the provided information. On the one hand, the approaches can rely solely on observations of consecutive positions extracted by visual tracking; on the other hand, they can use richer context information, for example human-human interactions or human-space interactions, or additional visually extracted information like pedestrian head orientation or head poses. Representative approaches which model human-human interactions are the works of Helbing and Molnár and Coscia et al., or approaches in combination with RNNs.10,11
The spatial context of motion can in principle be learned by training a model on observed positions of a particular scene, but it is not guaranteed that the model successfully captures spatial points of interest; it may only implicitly keep spatial information by performing path integration in order to predict new positions. Nevertheless, here we distinguish such approaches from approaches where scene context is provided as a further cue, for example by semantic labeling or scene encoding. The challenges of the Trajectory Forecasting Benchmarking (TrajNet) are designed to cover some inherent properties of human motion in crowded scenes. The World H-H TrajNet challenge in particular looks at predicting motions in world plane coordinates of human-human interactions. The aim of this paper is to find an effective baseline predictor based only on the partial history and to find the maximum achievable prediction accuracy for this challenge. Achieving this objective involves an evaluation of different deep neural networks for trajectory prediction and an analysis of the dataset properties. Further, we propose small changes and pre-processing steps to modify a standard RNN prediction model, resulting in a simple but effective RNN architecture that obtains performance comparable to more elaborated models which additionally capture the interpersonal aspect of human-human interaction.
The paper is structured as follows. Firstly, the properties of the TrajNet benchmark dataset are analyzed in section 2. Then, some basic deep neural networks are shortly described and evaluated, and the modifications made in order to increase the prediction performance are presented (section 3). The achieved results and an additional failure analysis are presented in section 4. Finally, a conclusion is given in section 5.

* Equal contribution.
2. TRAJNET BENCHMARK DATASET ANALYSIS
The trajectory forecasting challenges of TrajNet provide the community with a defined and repeatable way of comparing path prediction approaches as well as a common platform for discussions in the field. In this section, some properties of the current repository of popular datasets for trajectory-based activity forecasting, as used for the World H-H TRAJ challenge, are analyzed, and design choices for the proposed predictor are deduced from them.
Table 1. Training (green) and test (cyan) datasets of the world plane human-human dataset challenge (adapted from the TrajNet website).

Name                          Resolution  # Tracklets  FPS  Reference
Training:
BIWI Hotel                    720×576     389          2.5  Pellegrini et al.
Crowds Zara                   720×576     204          2.5  Lerner et al.
Crowds Students               720×576     415          2.5  Lerner et al.
Crowds Arxiepiskopi           720×576     24           2.5  Lerner et al.
PETS 2009                     768×576     19           2.5  Ferryman et al.
Stanford Drone Dataset (SDD)  varying *                     Robicquet et al.
Test:
BIWI ETH                      640×480     360          2.5  Pellegrini et al.
Crowds Zara                   720×576     148          2.5  Lerner et al.
Crowds Uni Examples           720×576     118          2.5  Lerner et al.
Stanford Drone Dataset (SDD)  varying *                     Robicquet et al.

In most datasets, the scene is observed from a bird's eye view, but there are also scenarios where the scene is observed under a higher depression angle. The selected surveillance datasets cover real-world scenarios with varying crowd densities and varying complexity of trajectory patterns. Details of the datasets are summarized in table 1 (adapted from the TrajNet website). The selection includes the following datasets. The BIWI Walking Pedestrians Dataset, also sometimes referenced as ETH Walking Pedestrians (EWAP), is split into two sets (ETH and Hotel). The Crowds dataset, also called the UCY "Crowds-by-Example" dataset, contains three scenes from an oblique view: the first (Zara) shows a part of a shopping street, the second (Students / Uni Examples) captures a part of the university campus, and the third scene (Arxiepiskopi) captures a different part of the campus. Then, the Stanford Drone Dataset (SDD) consists of multiple aerial images capturing different locations around the Stanford campus. And finally, in the PETS 2009 dataset, different outdoor crowd activities are observed by multiple static cameras. Sample images with full trajectories and tracklets are shown in figure 1.
It is common and good practice to apply cross-validation. For the TrajNet challenge, this is done by omitting complete datasets for testing. Because the behavior of humans in crowds is scene-independent, and for measuring the generalization capabilities of various approaches across datasets, this is very reasonable, in particular for providing a benchmark for human-human interactions. Nevertheless, by combining all training sets, the spatial context of scene-specific motion and the reference systems are lost. When relying only on observed motion trajectories, positional information is crucial in order to learn spatio-temporal variation. For example, the sidewalks in the Hyang sequences (see figure 1) lead to a spatially dependent change in the curvature of a trajectory. Since our focus is on deep neural networks including RNNs, the shift from position information to higher-order motion helps to overcome some drawbacks. Before RNNs were successfully applied for tracking pedestrians in surveillance scenarios, they gained attention due to their success in tasks like speech recognition23,24 and caption generation.25,26
These domains differ from trajectory prediction in certain aspects; in particular, position-dependent movement plays no important role there. Accordingly, RNNs can benefit from conditioning on previous offsets for scene-independent motion prediction. This insight is not new, yet utilizing offsets not only helps stabilizing the learning process but also improves the prediction performance for the evaluated networks. This shift to offsets, or rather velocities, has also been successfully applied, for example, to the prediction of human poses based on RNNs. In the context of deep networks, the same effect can be achieved by adding residual connections, which have been shown to improve the performance of deep convolutional networks. Presumably due to the limitation of the input and output spaces, for the TrajNet challenge predicting the following offsets (where will the person go next)12,29 instead of predicting the next position (where will the person be next) also contributed to increased prediction accuracy. This becomes immediately apparent by looking at the complete tracklets of the training and test set (see figure 2). Firstly, it takes considerably more modeling effort to represent all possible positions than to model particular velocities. Further, input data outside the training range can lead to undefined states in the deep network, which result in an unreasonably random output. Some of the initialization tracklets clearly lie outside the training input space. Also, approaches which profit from human-human interaction10,11,14,30 in combination with deep networks lack information here about surrounding persons to interact with, so that decoding relative distances is not possible because of the reduced person density.

Figure 1. Example trajectories from the BIWI ETH dataset and example tracklets from the sequence Hyang 07 from the Stanford Drone Dataset (SDD).

Another factor for improving the prediction performance becomes apparent when contemplating the offset distribution of the data. Figure 3 shows the offset histograms for x and y separately. Due to the loss of the reference system, it is impossible to assume a reasonable location distribution a-priori. In contrast, the offset and magnitude distributions clearly reflect the preferred walking speeds in the data. The histograms also show that a large number of persons are standing. In the recent work of Hasan et al., it was emphasized that forecasting errors are in general higher when the speed of persons is lower, and it was argued that when persons are walking slowly their behavior becomes less predictable, due to physical reasons (less inertia). During our testing we observed the same phenomenon. In particular, RNN-based networks tend to overestimate slow velocities and sometimes do not accurately identify standing behavior. Despite this problem, the range of offsets is very limited compared to the location distribution and shows a clear tendency towards expected a-priori values. Common techniques for sequence prediction problems are normalization and standardization of the input data. Whereas normalization plays a similar role for the position data, applying standardization to position input data shows no benefit. In our experiments, standardization of offsets worked slightly better than normalization or an embedding layer for input encoding. Although the effect on the performance is quite low for the TrajNet challenge, our best result is achieved using standardized offsets as input. It is rarely strictly necessary to standardize the inputs, but there are practical reasons like accelerating the training or reducing the chances of getting stuck in local optima. Predicting offsets also guarantees that the output conforms better with the range of common activation functions.

Figure 2. (Left) Visualization of all tracklets of the training set from the TrajNet dataset collection. (Right) Visualization of all initialization tracklets of the test set.

Figure 3. (Left, Middle) Offset histograms of the training set. (Right) Magnitude histogram of the offsets.
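The pre-processing discussed above, conditioning on offsets instead of positions and standardizing them, can be sketched in a few lines. The snippet below is a minimal, framework-free illustration in plain numpy; the tracklet is synthetic, and in practice the standardization statistics would be fitted over all training tracklets:

```python
import numpy as np

def to_offsets(track):
    """(T, 2) positions -> (T-1, 2) offsets between consecutive positions."""
    return np.diff(track, axis=0)

def fit_standardizer(all_offsets):
    """Mean/std per coordinate, estimated from training offsets."""
    mean = all_offsets.mean(axis=0)
    std = all_offsets.std(axis=0) + 1e-8   # avoid division by zero
    return mean, std

def standardize(offsets, mean, std):
    return (offsets - mean) / std

# Synthetic, roughly straight tracklet: 8 observed positions in world
# coordinates (meters), matching the TrajNet initialization length.
track = np.stack([np.linspace(0.0, 3.5, 8), np.linspace(0.0, 0.7, 8)], axis=1)
offs = to_offsets(track)                   # shape (7, 2)
mean, std = fit_standardizer(offs)         # here fitted on one tracklet only
net_input = standardize(offs, mean, std)   # what the network is conditioned on

print(offs.shape, net_input.shape)
```

At prediction time, the predicted offsets are mapped back to positions by cumulative summation starting from the last observed position.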
Without discretization artifacts, the dynamics of humans are smooth and persistent. The trajectory data from the TrajNet dataset includes varying discretization artifacts and noise levels, resulting from the different methods with which the ground truth data was generated: part of the ground truth trajectories are generated by a visual tracker, others are manually annotated. To approximate the amount of noise in the datasets, the distance between a smoothed spline fit through the complete tracklets and the provided ground truth tracklet points is compared. The spline fitting is done with a polynomial of degree k = 4, independently for the x and y values. If the smoothing is too strong, the fit can drift too far away from the actual data. Nevertheless, the fitted trajectories form smooth and natural paths and are used as a rough assessment of the noise levels in the ground truth trajectory data. The results for the training set are summarized in table 2.
The approximated noise levels clearly show the variation in the ground truth data. In order to outperform a linear baseline predictor, the learned model must be able to successfully model different velocity profiles and capture curved paths from input data with different noise levels. Due to the varying noise levels, initial experiments that trained solely on smoothed fitted trajectories with synthetic noise performed worse. Nevertheless, for the prediction of the future steps the best performing predictor is trained to forecast smoothed paths.
Before the different evaluated models are introduced, the last data analysis of the training set is intended to assess the linearity of the tracklets. R² for a linear interpolation is calculated separately for the x and y values. This linear interpolation serves as the baseline predictor for the TrajNet challenge. The histograms of R² for the training set are shown in figure 4. R² is the percentage of the variation that is explained by the model and is used here to determine the suitability of the regression fit as a linearity measure. The average R² values are summarized in table 2. It can be seen that for most tracklets a linear interpolation works very well. In order to outperform the linear interpolation baseline, it is crucial not only to cover a variety of complex observed motions, but also to produce robust results in simpler situations. As mentioned above, the person's velocity has to be effectively captured by the model.

Table 2. Standard deviation of the distance between a smoothed spline fit and the ground truth trajectory data (σx,spline [m], σy,spline [m]), together with the average R² score (R̄²x, R̄²y, overall) for all tracklets in the subsets: BIWI Hotel; Crowds Zara 02, 03; Crowds Students 01, 03; Crowds Arxiepiskopi 01; PETS 2009 S2L1; SDD Bookstore 00-03; SDD Coupa 03; SDD Deathcircle 00-04; SDD Gates 00, 01, 03-08; SDD Hyang 04-07, 09; SDD Nexus 00-04, 07-09.

Figure 4. Coefficient of determination R² for x and y for all training tracklets of the World H-H TRAJ challenge.
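Both tracklet statistics used in this analysis, the smoothing-based noise estimate and the linearity measure R², can be sketched as follows. To keep the example self-contained, the degree-4 smoothing-spline fit is approximated with a degree-4 polynomial fit over the frame index via numpy, and the noisy tracklet is synthetic:

```python
import numpy as np

def noise_sigma(track, deg=4):
    """Std. dev. of the distance between a smooth degree-4 fit
    (per axis, over the frame index) and the tracklet points."""
    t = np.arange(len(track))
    fit = np.stack([np.polyval(np.polyfit(t, track[:, d], deg), t)
                    for d in range(2)], axis=1)
    return np.linalg.norm(track - fit, axis=1).std()

def r2_linear(track):
    """Coefficient of determination of a linear fit, per coordinate."""
    t = np.arange(len(track))
    r2 = []
    for d in range(2):
        y = track[:, d]
        pred = np.polyval(np.polyfit(t, y, 1), t)
        ss_res = ((y - pred) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2.append(1.0 - ss_res / ss_tot)
    return r2

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 20)
clean = np.stack([3.5 * t, 0.5 * t ** 2], axis=1)   # smooth, slightly curved path
noisy = clean + rng.normal(0, 0.05, clean.shape)    # synthetic annotation noise

print(round(noise_sigma(noisy), 3))                 # rough noise-level estimate
print([round(v, 3) for v in r2_linear(noisy)])      # near 1 for near-linear motion
```

The same R² computation, applied to a linear interpolation of each tracklet, yields the linearity histograms of figure 4.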
3. MODELS AND EVALUATION
The goal of this work is to reach the maximum achievable prediction accuracy with basic networks, without further cues like human-human interaction or human-space interaction, using a coarse-to-fine searching strategy. Towards this end, we started with a set of networks with a limited set of hyper-parameters and narrowed it down to one network, in order to then extend the hyper-parameter set for a more exhaustive tuning. The multi-modal aspect of trajectory prediction is hardly considerable when there is no fixed reference system. Thus, in accordance with the community, the performance is compared with the two error metrics of the average displacement error (ADE) and the final displacement error (FDE) (see for example 10,14,18,19,30,33). The average of both combined values is then used as an overall average to rank the approaches. The ADE is defined as the average L2 distance between ground truth and prediction over all predicted time steps, and the FDE is defined as the L2 distance between the predicted final position and the true final position. For the World H-H TrajNet challenge, the unit of the error metrics is meters. For all experiments, 8 (3.2 seconds) consecutive positions are observed and the following 12 (4.8 seconds) positions have to be predicted. For the World H-H TrajNet challenge, the following basic neural networks are selected for a coarse evaluation:
Multi-Layer-Perceptron (MLP):
The MLP is tested with different linear and non-linear activation functions. One variation concatenates all inputs and predicts the 24 outputs directly. Further, cascaded architectures with a step-wise prediction are examined. We vary between different coordinate systems, Euclidean and polar. As mentioned in section 2, positions and offsets (also orientation-normalized) are considered as inputs and outputs.
RNN-MLP:
RNNs extend feed-forward networks, or rather the MLP model, with recurrent connections between hidden units. Vanilla RNNs produce an output at each time step. For the evaluation of the RNN-MLP, we vary only the MLP layer which is used for decoding the positions and offsets.
RNN-Encoder-MLP:
In contrast to the RNN-MLP network, the complete initialization tracklet is used to generate the internal representation before a prediction is made. The RNN-Encoder-MLP is varied by alternating activation functions for the MLP and by alternatively predicting the complete future path/offsets instead of only the next steps. As a further alternative, the full path is predicted as offsets to one reference point instead of applying path integration in order to predict the final position.
RNN-Encoder-Decoder-Model (Seq2Seq):
In addition to the RNN-Encoder-MLP, a Seq2Seq model includes a second network: the decoder network takes the internal representation of the encoder and then starts predicting the next steps. The different settings for the evaluation of this model consist of alternating activation functions for the MLP on top of the decoder RNN.
Temporal Convolutional Networks (TCN):
As an alternative to RNNs and based on WaveNet, Bai et al. introduced a general convolutional architecture for sequence prediction. We tested their standard architecture and an extended architecture with a gating mechanism (GTCN). For a more detailed description, we refer to the original papers.
All networks were trained with a varying number of layers (1 to 5) and hidden units (4 to 64) using stochastic gradient descent with a fixed learning rate, and have been implemented in Tensorflow. Firstly, only standard RNN cells are used for the experiments. Later, we also tested the RNN variants Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). As loss, the mean squared error between the predicted and the ground truth positions or offsets over all time steps is used.
In order to emphasize trends, a part of the results of the first experiments is summarized in table 3 (highlighted in gray). The best results were achieved with the RNN-Encoder-MLP. However, in most cases the different architectures perform very similarly. These initial results also show that the best performing networks lie close to the result achieved with linear interpolation. Outlier weak performances are due to some strong overestimation of slow person velocities and some undefined random predictions when using positions. Hasan et al. reduced this effect by integrating head pose information. We can only remark that for the tested networks this effect can also differ between runs. Naturally, it is important that during training the networks see enough samples of standing or slow-moving situations. Excluding such samples through heuristic or probabilistic filtering only helps during application.
There is no network that clearly performs best, and even the gap between an MLP predictor and a Seq2Seq model is very narrow in the test scenarios. However, besides the factors derived from the data analysis, a prediction of the full path instead of a step-wise prediction helps to overcome an accumulation of errors that are fed back into the networks.
For the TrajNet challenge with a fixed prediction horizon, we thus prefer the RNN-Encoder-MLP over a Seq2Seq model. In the domain of human pose prediction based on RNNs, Li et al. reduced this problem with an auto-conditioned RNN, and Martinez et al. propose using a Seq2Seq model along with a sampling-based loss. The TCNs perform similarly to RNNs here. Since RNNs are more common, also as part of architectures which model interactions (see 10,11,14,18) to represent single motion, we keep the RNN-Encoder-MLP as our favored model.
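The two evaluation metrics used throughout this section follow directly from their definitions. A minimal numpy sketch, with a synthetic ground truth/prediction pair:

```python
import numpy as np

def ade(pred, gt):
    """Average displacement error: mean L2 distance over all predicted steps."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def fde(pred, gt):
    """Final displacement error: L2 distance at the last predicted step."""
    return np.linalg.norm(pred[-1] - gt[-1])

# Synthetic example: 12 predicted steps; the prediction is shifted
# by a constant 0.3 m in x, so both metrics report 0.3.
gt = np.stack([np.linspace(0.4, 4.8, 12), np.zeros(12)], axis=1)
pred = gt + np.array([0.3, 0.0])

print(ade(pred, gt), fde(pred, gt))
```

The overall TrajNet ranking score is then the mean of the two values.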
RNN-Encoder-MLP → RED-predictor:
According to the training set analysis and the comparison of architectures, the model selected for the TrajNet challenge, modeling only single human motion, is an RNN-Encoder-MLP. The RNN encoder can generalize to deal with varying noisy inputs and thus is able to capture the person's motion better than the linear interpolation baseline. The main insight is that motion continuity is easier to express in offsets or velocities, because it takes considerably more modeling effort to represent all possible conditioning positions. Especially for the World H-H TRAJ challenge, with the different ranges of positions in the training and test set, this has a significant influence on whether a good performance can be obtained. In combination with simple but helpful data pre-processing, the transfer to directly using smoothed trajectories as desired output, and a full path prediction to prevent error accumulation during a step-wise prediction, our simple but effective baseline predictor for the TrajNet challenge is ready. As a recurrent encoder with a dense MLP layer stacked on top, the predictor is referred to as RED-predictor. The last input position is used as reference point for all predicted offsets of the full smooth path. Full path integration worked similarly well, but here offsets to the reference position are predicted.
The best achieved result is highlighted in red in table 3. After a fine search for this network, the shown result is produced with an LSTM cell (state size of 32) and one recurrent layer. The proposed predictor was able to produce competitive results compared to elaborated models which additionally rely on interaction information, like the model from Helbing and Molnár and the Social-LSTM. The Social-LSTM is one of the first proposed RNN-based architectures which include human-human interaction, and it laid the basis for architectures like those presented in the work of Hasan et al. or Xue et al. Single motion is modeled with an LSTM network. By applying some of the proposed factors to the model, it is expected that the model, and accordingly model extensions, are able to outperform the proposed single motion predictor.

Table 3. Results for the world plane human-human dataset challenge (World H-H TRAJ challenge). Results highlighted in blue are taken from the TrajNet website (http://trajnet.stanford.edu/, accessed 19.05.2018).

Approach                        Overall Average ↓  FDE [m] ↓  ADE [m] ↓  Reference
RED (proposed)
Social Forces (ATTR)            0.904              1.395      0.412      Helbing and Molnár
social lstm v2                  1.387              2.098      0.675      Alahi et al.
social lstm                     1.563              2.299      0.826      Alahi et al.
social lstm v3                  2.874              4.323      1.424      Alahi et al.
Interactive Gaussian Processes  1.642              1.038      2.245      Ellis et al.
Linear Interpolation            0.894              1.359      0.429
Linear MLP (Pos)                1.041              1.592      0.491
Linear MLP (Off)                0.896              1.384      0.407
Non-Linear MLP (Off)            2.103              3.181      1.024
Linear RNN                      0.951              1.482      0.420
Non-Linear RNN
Gated TCN                       0.947              1.468      0.426      Bai et al.
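The data flow of the RED-predictor described above can be sketched as follows. To keep the example self-contained and framework-free, the LSTM encoder is replaced by an untrained vanilla tanh RNN cell in plain numpy; the layer sizes and random weights are illustrative, and only the structure follows the text: conditioning on offsets, encoding the full initialization tracklet, and one dense layer emitting all 12 future offsets relative to the last observed position.

```python
import numpy as np

OBS_LEN, PRED_LEN, STATE = 8, 12, 32
rng = np.random.default_rng(0)

# Encoder weights (vanilla tanh RNN cell) and dense output-layer weights.
W_in  = rng.normal(0, 0.1, (2, STATE))        # input: 2-d offset per step
W_rec = rng.normal(0, 0.1, (STATE, STATE))
b_rec = np.zeros(STATE)
W_out = rng.normal(0, 0.1, (STATE, PRED_LEN * 2))
b_out = np.zeros(PRED_LEN * 2)

def red_predict(track):
    """track: (OBS_LEN, 2) observed positions -> (PRED_LEN, 2) future positions."""
    offsets = np.diff(track, axis=0)          # condition on offsets, not positions
    h = np.zeros(STATE)
    for o in offsets:                         # encode the whole initialization tracklet
        h = np.tanh(o @ W_in + h @ W_rec + b_rec)
    # One dense layer predicts all future offsets at once (full-path prediction),
    # taken relative to the last observed position as reference point.
    pred_off = (h @ W_out + b_out).reshape(PRED_LEN, 2)
    return track[-1] + pred_off

track = np.stack([np.linspace(0, 3.5, OBS_LEN), np.zeros(OBS_LEN)], axis=1)
future = red_predict(track)
print(future.shape)
```

In the actual model, the tanh cell is an LSTM with state size 32, the weights are trained with an MSE loss against the smoothed future paths, and the inputs are standardized offsets.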
4. DISCUSSION AND FAILURE CASES
After emphasizing the factors needed in order to achieve competitive results based on standard neural networks in the sections above, in this section we discuss some failure cases.
Without exploiting scene-specific knowledge for trajectory prediction, some particular changes of behavior in human motion are not predictable. For example, in the shown tracklet from SDD Hyang (see figure 5), there is no cue for a turning maneuver in the initialization tracklet. In order to correct the prediction, new observations are required. All methods tend to predict a relatively straight line in such a situation, resulting in a high prediction error. A scene-independent motion representation is pursued in order to generalize better, but to overcome this limitation in the achievable prediction accuracy, the spatial context is required. The sample tracklet also illustrates the multi-modal nature of the prediction problem: while the person is making a left turn, a right turn would also be possible. With a single maximum-likelihood path, the multi-modality of a motion and the uncertainty in the prediction are not covered. The prediction uncertainty can be considered by using the normalized estimation error squared (NEES), also known as the Mahalanobis distance, which corresponds to a weighted Euclidean distance of the errors. But most methods are designed as regression models, so for a unified evaluation system the Mahalanobis distance is not applicable. As mentioned, there are a few approaches which include the multi-modal aspect of the problem.7,29,43 Without additional cues from the current scene, these approaches are limited to a fixed scene.

Figure 5. Example where the scene context strongly influences the person's trajectory. The initialization tracklet (solid line) delivers no evidence for a turning maneuver at the intersection. This also shows the multi-modal nature of the prediction problem.

Independent of the question of how to include all aspects of the problem in a unified benchmark, they strongly influence the possibly achievable results. The results presented in section 3 show that, independent of the model complexity, approaches restricted to observing only information from one trajectory are within range of their reachable performance limit on the current dataset repository. Of course, due to the fast development in the field of deep neural networks there is still space for improvement, but the current benchmark cannot be completely solved. However, the TrajNet challenges also provide human-human and human-space information, and recent work like the approaches of Gupta et al. (human-human) or Xue et al. and Sadeghian et al. (human-human, human-space) show possibilities of how to further improve the prediction accuracy.
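The NEES mentioned above weights the prediction error by the inverse of a predictive covariance. A minimal numpy sketch, where the prediction, ground truth, and covariance are synthetic placeholders:

```python
import numpy as np

def nees(pred, gt, cov):
    """Normalized estimation error squared: the squared Mahalanobis
    distance of the error under the predicted covariance."""
    e = pred - gt
    return float(e @ np.linalg.inv(cov) @ e)

# Synthetic 2-d position error with an anisotropic predictive covariance:
# the (hypothetical) predictor is more uncertain along x than along y.
pred = np.array([2.0, 1.0])
gt   = np.array([1.6, 1.1])
cov  = np.diag([0.16, 0.01])

# The larger x-error counts the same as the small y-error, because the
# predictor declared more uncertainty along x.
print(nees(pred, gt, cov))   # ≈ 2.0
```

Using such a measure, however, requires every method to output a covariance, which pure regression models do not provide.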
5. CONCLUSION
In this paper, we presented an evaluation of deep learning approaches for trajectory prediction on the TrajNet benchmark dataset. The initial results showed that without further cues like human-human interaction or human-space interaction, most basic networks achieve similar results in a small range close to a maximum achievable prediction accuracy. By modifying a standard RNN prediction model, we were able to provide a simple but effective RNN architecture that achieves a performance comparable to more elaborated models.
Acknowledgements: The authors thank the organizers of the TrajNet challenge for providing a framework towards a more meaningful, standardized trajectory prediction benchmarking.
REFERENCES
1. Sadeghian, A., Kosaraju, V., Gupta, A., Savarese, S., and Alahi, A., “Trajnet: Towards a benchmark forhuman trajectory prediction,” arXiv preprint (2018).2. Kalman, R. E., “A new approach to linear filtering and prediction problems,”
ASME Journal of BasicEngineering (1960).3. McCullagh, P. and Nelder, J. A., [ Generalized Linear Models ], Chapman & Hall , CRC, London (1989).4. Williams, C. K. I., [
Prediction with Gaussian Processes: From Linear Regression to Linear Prediction andBeyond ], 599–621, Springer Netherlands, Dordrecht (1998).5. Akaike, H., “Fitting autoregressive models for prediction,”
Annals of the Institute of Statistical Mathemat-ics (1), 243–247 (1969).6. Priestley, M. B., [ Spectral analysis and time series ], Academic Press, London ; New York : (1981).. Kitani, K. M., Ziebart, B. D., Bagnell, J. A., and Hebert, M., “Activity forecasting,” in [
European Conferenceon Computer Vision (ECCV) ], 201–214, Springer Berlin Heidelberg, Berlin, Heidelberg (2012).8. Ma, W., Huang, D., Lee, N., and Kitani, K. M., “Forecasting interactive dynamics of pedestrians withfictitious play,” in [
Conference on Computer Vision and Pattern Recognition (CVPR) ], 4636–4644, IEEE(2017).9. Huang, S., Li, X., Zhang, Z., He, Z., Wu, F., Liu, W., Tang, J., and Zhuang, Y., “Deep learning drivenvisual path prediction from a single image,”
IEEE Transactions on Image Processing (12), 5892–5904(2016).10. Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., and Savarese, S., “Social LSTM: Humantrajectory prediction in crowded spaces,” in [ Conference on Computer Vision and Pattern Recognition(CVPR) ], 961–971, IEEE (2016).11. Alahi, A., Ramanathan, V., Goel, K., Robicquet, A., Sadeghian, A., Fei-Fei, L., and Savarese, S., “Learningto predict human behaviour in crowded scenes,” in [
Group and Crowd Behavior for Computer Vision ],Elsevier (2017).12. Hug, R., Becker, S., H¨ubner, W., and Arens, M., “On the reliability of lstm-mdl models for predictingpedestrian trajectories,” in [
Representations, Analysis and Recognition of Shape and Motion from ImagingData (RFMI) ], (2017).13. Kooij, J. F. P., Schneider, N., Flohr, F., and Gavrila, D. M., “Context-based pedestrian path prediction,”in [
European Conference on Computer Vision (ECCV) ], 618–633, Springer International Publishing (2014).14. Hasan, I., Setti, F., Tsesmelis, T., Bue, A. D., Galasso, F., and Cristani, M., “MX-LSTM: mixing trackletsand vislets to jointly forecast trajectories and head poses,” in [
Conference on Computer Vision and PatternRecognition (CVPR) ], IEEE (2018).15. Helbing, D. and Moln´ar, P., “Social force model for pedestrian dynamics,”
Phys. Rev. E , 4282–4286(1995).16. Coscia, P., Castaldo, F., Palmieri, F. A., Alahi, A., Savarese, S., and Ballan, L., “Long-term path predictionin urban scenarios using circulardistributions,” Image and Vision Computing , 81–91 (2018).17. Ballan, L., Castaldo, F., Alahi, A., Palmieri, F., and Savarese, S., [ Knowledge Transfer for Scene-SpecificMotion Prediction ], 697–713, Springer International Publishing (2016).18. Xue, H., Q., D., and Reynolds, H. M., “SS-LSTM: A hierarchical LSTM model for pedestrian trajectoryprediction,” in [
Winter Conference on Applications of Computer Vision (WACV)], IEEE (2018).
19. Pellegrini, S., Ess, A., Schindler, K., and van Gool, L., “You’ll never walk alone: Modeling social behavior for multi-target tracking,” in [International Conference on Computer Vision], 261–268, IEEE (2009).
20. Lerner, A., Chrysanthou, Y., and Lischinski, D., “Crowds by example,” Computer Graphics Forum (3), 655–664 (2007).
21. Ferryman, J. and Shahrokni, A., “PETS2009: Dataset and challenge,” in [IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS)], 1–6 (2009).
22. Robicquet, A., Sadeghian, A., Alahi, A., and Savarese, S., “Learning social etiquette: Human trajectory understanding in crowded scenes,” in [Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII], Leibe, B., Matas, J., Sebe, N., and Welling, M., eds., 549–565, Springer International Publishing (2016).
23. Graves, A., Mohamed, A., and Hinton, G., “Speech recognition with deep recurrent neural networks,” in [International Conference on Acoustics, Speech and Signal Processing], 6645–6649 (2013).
24. Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A., and Bengio, Y., “A recurrent latent variable model for sequential data,” in [Advances in Neural Information Processing Systems (NIPS)], (2015).
25. Donahue, J., Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T., “Long-term recurrent convolutional networks for visual recognition and description,” in [Conference on Computer Vision and Pattern Recognition], IEEE (2015).
26. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y., “Show, attend and tell: Neural image caption generation with visual attention,” in [International Conference on Machine Learning], Bach, F. and Blei, D., eds.,
Proceedings of Machine Learning Research, 2048–2057, PMLR, Lille, France (2015).
27. Martinez, J., Black, M. J., and Romero, J., “On human motion prediction using recurrent neural networks,” in [Conference on Computer Vision and Pattern Recognition (CVPR)], 4674–4683, IEEE (2017).
28. He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” in [
Conference on Computer Vision and Pattern Recognition (CVPR)], 770–778 (2016).
29. Hug, R., Becker, S., Hübner, W., and Arens, M., “Particle-based pedestrian path prediction using LSTM-MDL models,” in [IEEE International Conference on Intelligent Transportation Systems (ITSC) (accepted)], (2018).
30. Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., and Alahi, A., “Social GAN: Socially acceptable trajectories with generative adversarial networks,” in [Conference on Computer Vision and Pattern Recognition (CVPR)], IEEE (2018).
31. Brownlee, J., [Introduction to Time Series Forecasting with Python: How to Prepare Data and Develop Models to Predict the Future], Jason Brownlee (2017).
32. Draper, N. R. and Smith, H., [Applied Regression Analysis], Wiley Series in Probability and Mathematical Statistics, Wiley, New York (1966).
33. Vemula, A., Muelling, K., and Oh, J., “Modeling cooperative navigation in dense human crowds,” in [International Conference on Robotics and Automation (ICRA)], 1685–1692, IEEE (May 2017).
34. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K., “WaveNet: A generative model for raw audio,” arXiv preprint abs/1609.03499 (2016).
35. Bai, S., Kolter, J. Z., and Koltun, V., “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint abs/1803.01271 (2018).
36. Kingma, D. P. and Ba, J., “Adam: A method for stochastic optimization,” International Conference for Learning Representations (ICLR) (2015).
37. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X., “TensorFlow: Large-scale machine learning on heterogeneous systems,” (2015). Software available from tensorflow.org.
38. Hochreiter, S. and Schmidhuber, J., “Long Short-Term Memory,” Neural Computation (8), 1735–1780 (1997).
39. Cho, K., van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y., “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in [Conference on Empirical Methods in Natural Language Processing (EMNLP)], 1724–1734, Association for Computational Linguistics, Doha, Qatar (2014).
40. Ellis, D., Sommerlade, E., and Reid, I., “Modelling pedestrian trajectory patterns with Gaussian processes,” in [International Conference on Computer Vision Workshops (ICCVW)], 1229–1234, IEEE (2009).
41. Li, Z., Zhou, Y., Xiao, S., He, C., and Li, H., “Auto-conditioned LSTM network for extended complex human motion synthesis,” arXiv preprint abs/1707.05363 (2017).
42. Huber, M., Nonlinear Gaussian Filtering: Theory, Algorithms, and Applications, PhD thesis, Karlsruhe Institute of Technology (KIT) (2015).
43. Lee, N., Choi, W., Vernaza, P., Choy, C. B., Torr, P. H. S., and Chandraker, M., “DESIRE: Distant future prediction in dynamic scenes with interacting agents,” in [
Conference on Computer Vision and Pattern Recognition (CVPR)], IEEE (2017).
44. Sadeghian, A., Kosaraju, V., Sadeghian, A., Hirose, N., and Savarese, S., “SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints,” arXiv preprint abs/1806.01482 (2018).