CoMoGCN: Coherent Motion Aware Trajectory Prediction with Graph Representation
Yuying Chen* [email protected]
Congcong Liu* [email protected]
Bertram Shi [email protected]
Ming Liu [email protected]
Robotics Institute, Hong Kong University of Science and Technology, Hong Kong, China
Abstract
Forecasting human trajectories is critical for tasks such as robot crowd navigation and autonomous driving. Modeling social interactions is of great importance for accurate group-wise motion prediction. However, most existing methods do not consider information about coherence within the crowd, but rather only pairwise interactions. In this work, we propose a novel framework, the coherent motion aware graph convolutional network (CoMoGCN), for trajectory prediction in crowded scenes with group constraints. First, we cluster pedestrian trajectories into groups according to motion coherence. Then, we use graph convolutional networks to aggregate crowd information efficiently. CoMoGCN also takes advantage of variational autoencoders to capture the multimodal nature of human trajectories by modeling the trajectory distribution. Our method achieves state-of-the-art performance on several different trajectory prediction benchmarks, and the best average performance among all benchmarks considered.
1 Introduction

Forecasting human trajectories is of great importance for tasks such as robot navigation in crowds, autonomous driving, and crowd surveillance. For autonomous robot systems, predicting human motion enables feasible and efficient planning and control.

However, making accurate trajectory predictions is still a challenging task, because pedestrian trajectories can be affected by many factors, such as the topology of the environment, intended goals, and social relationships and interactions [19]. Furthermore, the highly dynamic and multimodal properties inherent in human motion must also be considered.

Multimodality in trajectory prediction has been studied recently [2, 7, 12, 13, 20]. Most past work uses generative adversarial networks (GANs) to generate multiple predictions. However, GANs suffer from the instability of adversarial training, which is sensitive to hyperparameters and structure [26]. As an alternative, the variational autoencoder (VAE) is relatively more stable. Lee et al. present a CVAE based framework to predict future object locations [13]. A recent work adopted the CVAE for trajectory prediction [10]. This paper takes advantage of the VAE to capture the multimodality of human trajectories.

Recently, some works have proposed to model the dynamic interactions of pedestrians by combining information from pairwise interactions through pooling mechanisms such as max-pooling [7] and self-attention pooling [20]. However, those works do not completely capture important information about the geometric configuration of the crowd. Furthermore, they rely on ad-hoc rules to handle varying numbers of agents, such as setting a maximum on the number of agents and using dummy values for non-existing agents [20]. To avoid such ad-hoc assumptions, Chen et al. [5] propose to use graph convolutional networks (GCNs) to aggregate information about neighboring humans for robot crowd navigation tasks. The GCN can handle varying numbers of neighbors naturally, and can be extended to modulate interactions by changing its adjacency matrix. In this paper, we use a similar graph structure for crowd information aggregation in a different task: trajectory prediction.

Most previous work has focused only on the interactions between pairs of humans. Coherent motion patterns of pedestrian groups, which encode rich information about implicit social rules, have rarely been considered. This lack of attention may be due in part to the lack of information about social grouping in current benchmark datasets, such as the commonly used
ETH [18] and UCY [14] datasets for trajectory prediction. To address this unavailability, we add coherent motion cluster labels to trajectory prediction datasets using a coherent filtering method [29], and leverage DBSCAN clustering to compensate for the drawbacks of coherent filtering in small group detection. These coherent motion labels provide a mid-level representation of crowd dynamics, which is very useful for crowd analysis. We incorporate the coherent motion constraints into our model by using GCNs for intergroup and intragroup relationship modeling.

The main contributions of our work are:

• We introduce graph convolutional networks (GCNs) to better model social interactions within human crowds. The use of GCNs enables our approach to handle varying crowd sizes in a principled way. Interactions between humans can be controlled easily by modifying the adjacency matrix.

• Unlike past work that considered only pairwise interactions between individuals, we take into account coherent motion constraints inside crowds to better capture social interactions.

• We developed a hybrid labeling method to add coherent motion labels to trajectory prediction datasets. We will release the re-labelled dataset publicly for use by other researchers (https://sites.google.com/view/comogcn).

• We take advantage of the VAE to handle multimodality in trajectory modeling.

• With the above mechanisms, CoMoGCN achieves state-of-the-art performance on several different trajectory prediction benchmarks, and the best average performance across all datasets considered.

2 Related Work

A pioneering work for crowd interaction modeling, the Social Force Model (SFM) proposed by [9], has been applied successfully to many applications such as abnormal crowd behavior detection [16] and multi-object tracking [18]. However, as discussed in [1], the social force model can capture simple interactions but fails to model complex crowd interactions. There are also other hand-crafted feature based models, such as continuum dynamics [22], discrete choice [3], and Gaussian Process models [24]. However, all the above methods are based on hand-crafted energy functions and specific rules, which limit their performance.
Recently, Recurrent Neural Networks (RNNs), such as the Long Short-Term Memory (LSTM), have achieved many successes in trajectory prediction tasks [1, 8, 15, 21, 27, 28]. Alahi et al. proposed a social pooling layer to model neighboring humans [1]. Gupta et al. proposed a pooling module, which consists of an MLP followed by max-pooling, to aggregate information from all other humans [7]. Sadeghian et al. [20] adopted a soft attention module to aggregate information across agents. More recent work uses GCNs to aggregate information by treating humans as nodes and modeling interaction through edge strength for robot navigation [5]. Similarly, a variant of the GCN, the Graph Attention Network (GAT), has been used to model social interactions [12]. However, the use of multi-head attention in the GAT increases the number of parameters and the computational complexity in comparison to the GCN. In this work, we integrate information across humans using GCNs, which enables our method to handle varying crowd sizes.
Most previous work pays attention only to interactions among pairs of pedestrians. However, pedestrian trajectories are also influenced by more complex social relations between humans. Coherent motion patterns inside crowds, which encode implicit social information, have been shown to be useful in many applications, such as crowd activity recognition [25]. Bisagno et al. [4] considered intragroup interactions for trajectory prediction, but neglected intergroup interactions. Current benchmark datasets for trajectory prediction do not provide coherent motion labels.

Several works have addressed detecting coherent motions [29] and measuring the collectiveness of crowds [17]. Zhou et al. [29] proposed coherent filtering, which detects the invariant neighbors of every individual and measures velocity correlations for motion clustering. It shows good performance on collective motion benchmarks and can detect coherent motions given crowd trajectories over a short time window. In this paper, we use the coherent filtering method to label trajectory prediction datasets. In addition, we leverage DBSCAN clustering to compensate for the disadvantages of coherent filtering in small group detection. Based on the labels, we incorporate the coherent motion information into our model for better interaction modeling.
Figure 1: System overview. There are three procedures: 1. We obtain coherent motion labels for each human in an offline data pre-processing procedure. 2. Based on the coherent motion labels for each human, we establish graphs capturing intergroup and intragroup relationships. The encoder LSTM takes past trajectories as input and feeds the encoded features into two GCNs. 3. The embeddings from the two GCNs are concatenated and forwarded to an MLP to create a distribution with mean $\mu_z$ and variance $\Sigma_z$. Then, features are sampled from the distribution and fed into a decoder LSTM for trajectory prediction.

3 Method

The goal of this work is to generate the future trajectories of all humans in a scene at the same time. The trajectory of a person $i$ is defined using $x_{rel_i}^t = (x_i^t, y_i^t)$, which denotes the relative position of human $i$ at time step $t$ with respect to the position at $t-1$. Consistent with previous works [7, 20], the observed trajectory of all humans in a scene is defined as $x_{rel_{1,\dots,N}}^{(1:t_{obs})}$ for time steps $t = 1, \dots, t_{obs}$; the future trajectory to be predicted is defined as $x_{rel_{1,\dots,N}}^{(t_{obs}+1:t_{obs}+T)}$ for time steps $t = t_{obs}+1, \dots, t_{obs}+T$, where the number of humans $N$ may change dynamically. The model aims to generate trajectories $\hat{x}_{rel_{1,\dots,N}}^{(t_{obs}+1:t_{obs}+T)}$ whose distribution matches that of the ground truth future trajectories of all humans $x_{rel_{1,\dots,N}}^{(t_{obs}+1:t_{obs}+T)}$.

Figure 1 shows the overall framework of our method for trajectory prediction. Data pre-processing is applied offline to obtain the coherent motion pattern for each human. For feature extraction, we first use a single-layer MLP (FC) to encode each pedestrian's relative displacements as a fixed-length embedding. These embeddings are fed to an LSTM as shown below:

$$e_i = LSTM_{en}(MLP_{enc}(x_{rel_i}; W_{enc}), h_{enc_i}; W_{en}) \quad (1)$$

where $W_{enc}$ is the weight of the FC layer and $W_{en}$ is the weight of the encoding LSTM. In addition, for a specific person $i$, the relative positions of the other humans are fed into an FC layer to obtain social information $p_i$, similar to the pooling module in Social GAN [7]. Then the features from all pedestrians $e_{1,\dots,N}$ and the person's social information $p_i$ are concatenated together as the input to the two GCNs for intergroup and intragroup interaction aggregation:

$$V_{intra_i} = GCN_{intra}([e_{1,\dots,N}, p_i], A_{intra}, W_{intra}) \quad (2)$$

$$V_{inter_i} = GCN_{inter}([e_{1,\dots,N}, p_i], A_{inter}, W_{inter}) \quad (3)$$

where $A_{intra}$ and $A_{inter}$ denote the adjacency matrices described in more detail in Section 3.4, and $W_{intra}$ and $W_{inter}$ are weight matrices. We extract the feature of node $i$ after the final graph convolutional layer as the features $V_{intra_i}$ and $V_{inter_i}$.

The features computed by the two GCNs are then concatenated and input to an MLP, which computes the mean and variance of a distribution over the feature vectors to be input to the decoder:

$$\mu_z, \Sigma_z = MLP_{vae}([V_{intra_i}, V_{inter_i}]; W_{vae}) \quad (4)$$

where $W_{vae}$ is the weight matrix. We sample an input feature vector $z$ for the decoder stage from this distribution, $z \sim \mathcal{N}(\mu_z, \Sigma_z)$, and concatenate it with the embedding computed from the last predicted state. The resulting features $c$ are fed into the decoder LSTM cell for trajectory prediction:

$$\hat{x}_{rel_i} = MLP_{dec}(LSTM_{de}(c, h_{de_i}; W_{de}); W_{dec}) \quad (5)$$

where $W_{de}$ is the weight of the decoder LSTM and $W_{dec}$ is the weight of the decoder MLP.
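To make the pipeline of Eqs. (1)-(5) concrete, the sketch below implements the same encoder-GCN-VAE-decoder flow in PyTorch. It is a minimal reading of the equations, not the authors' released code: the class and variable names are ours, the social feature $p_i$ is omitted for brevity, and the two GCNs are stood in by single masked aggregation layers (a fuller GCN sketch appears alongside Eq. (6) below). Dimensions follow the implementation details given later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoMoGCNSketch(nn.Module):
    """Schematic forward pass for Eqs. (1)-(5); names and shapes are assumptions."""

    def __init__(self, embed=16, hidden=32, gcn_out=8, z_dim=8):
        super().__init__()
        self.mlp_enc = nn.Linear(2, embed)                 # FC embedding of (dx, dy)
        self.lstm_en = nn.LSTM(embed, hidden)              # encoder LSTM, Eq. (1)
        # Stand-ins for the two GCNs of Eqs. (2)-(3) (single layer each here;
        # the paper uses two-layer GCNs and also appends a social feature p_i).
        self.w_intra = nn.Linear(hidden, gcn_out)
        self.w_inter = nn.Linear(hidden, gcn_out)
        self.mlp_vae = nn.Linear(2 * gcn_out, 2 * z_dim)   # mu_z, log Sigma_z, Eq. (4)
        self.lstm_de = nn.LSTMCell(z_dim + embed, hidden)  # decoder LSTM, Eq. (5)
        self.mlp_dec = nn.Linear(hidden, 2)                # next relative step

    def forward(self, x_rel, a_intra, a_inter, t_pred=12):
        # x_rel: (t_obs, N, 2) relative displacements; a_*: (N, N) row-normalized.
        _, (h, _) = self.lstm_en(self.mlp_enc(x_rel))
        e = h.squeeze(0)                                       # (N, hidden), Eq. (1)
        v_intra = F.relu(self.w_intra(a_intra @ e))            # Eq. (2)
        v_inter = F.relu(self.w_inter(a_inter @ e))            # Eq. (3)
        mu, logvar = self.mlp_vae(torch.cat([v_intra, v_inter], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # z ~ N(mu_z, Sigma_z)
        last = x_rel[-1]                                       # last observed step
        hx = torch.zeros(e.size(0), self.lstm_de.hidden_size)
        cx = torch.zeros(e.size(0), self.lstm_de.hidden_size)
        preds = []
        for _ in range(t_pred):                                # autoregressive decoding
            c = torch.cat([z, self.mlp_enc(last)], -1)
            hx, cx = self.lstm_de(c, (hx, cx))
            last = self.mlp_dec(hx)                            # Eq. (5)
            preds.append(last)
        return torch.stack(preds), mu, logvar                  # (t_pred, N, 2)
```

A call such as `CoMoGCNSketch()(torch.randn(8, 5, 2), torch.eye(5), torch.eye(5))` yields 12 predicted relative steps for 5 pedestrians; each forward pass draws a fresh $z$, which is what produces the multimodal samples evaluated later.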
For coherent motion detection, we use the coherent filtering proposed by [29]. The process takes the positions of humans over a window of consecutive frames and generates a clustering index for each human and for each frame. Humans sharing the same index are considered to have coherent motion. Coherent filtering mainly includes three steps: a) finding the K nearest neighbors; b) finding the invariant neighbors of an individual; c) measuring the time-averaged velocity correlations of the invariant neighbors to the individual. Among these individual-neighbor pairs, pairs with correlation intensity above a threshold are marked as coherent pairs.

Though this method is effective for crowds with large densities, it performs poorly for sparse crowds and fails to detect small groups. To compensate, we apply an extra clustering step, the DBSCAN method [6], for the unlabeled humans.
As a density-based clustering method, DBSCAN relies on a distance function to find neighbors. We account for the moving direction and calculate the angular distance between each pair of humans. These differences are used to classify humans into clusters.

Our hybrid labeling method improves the labeling yield and generates better labels than coherent filtering alone. Figure 1 of the supplementary file shows examples of detection by coherent filtering on each dataset. Quantitative evaluations of coherent filtering and of our hybrid labeling method are shown in Tables 2 and 3 of the supplementary file.
Figure 2 of the supplementary file shows a qualitative comparison between coherent filtering and our method. The parameter settings are shown in Table 1 of the supplementary file.
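As an illustration of how the two stages fit together, here is a compressed sketch of the hybrid labeling, assuming trajectories are given as a (T, N, 2) array. It is our simplification, not the reference implementation of [29]: coherent filtering is reduced to invariant k-nearest-neighbors plus a velocity-correlation threshold, and the DBSCAN feature mixing position and heading is a stand-in for the angular/lateral/longitudinal distances detailed in the supplementary.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import DBSCAN

def hybrid_group_labels(pos, k=5, lam=0.8, eps=0.5, min_pts=2):
    """Coherent filtering first, then DBSCAN for whoever stays unlabeled.
    pos: (T, N, 2) positions; returns one group id per pedestrian.
    Parameter names loosely follow supplementary Table 1."""
    T, N, _ = pos.shape
    vel = np.diff(pos, axis=0)                              # (T-1, N, 2) velocities
    # a) k nearest neighbors per frame; keep pairs that are neighbors in
    #    every frame of the window ("invariant neighbors").
    neigh = np.ones((N, N), dtype=bool)
    for t in range(T):
        d = np.linalg.norm(pos[t, :, None] - pos[t, None, :], axis=-1)
        nn_t = np.zeros((N, N), dtype=bool)
        idx = np.argsort(d, axis=1)[:, 1:k + 1]             # skip self at column 0
        np.put_along_axis(nn_t, idx, True, axis=1)
        neigh &= nn_t
    # b)+c) time-averaged velocity-direction correlation of invariant pairs.
    u = vel / (np.linalg.norm(vel, axis=-1, keepdims=True) + 1e-9)
    corr = np.einsum('tid,tjd->ij', u, u) / (T - 1)
    adj = neigh & (corr > lam)                              # coherent pairs
    n_comp, labels = connected_components((adj | adj.T).astype(int), directed=False)
    # Singleton components are "unlabeled"; pass them to DBSCAN on a crude
    # feature mixing last position and heading angle (our stand-in for the
    # angular / lateral / longitudinal distances of the supplementary).
    lone = np.bincount(labels)[labels] == 1
    if lone.any():
        heading = np.arctan2(u[-1, lone, 1], u[-1, lone, 0])[:, None]
        feat = np.hstack([pos[-1][lone], heading])
        db = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(feat)
        labels[lone] = np.where(db >= 0, db + n_comp, labels[lone])  # keep noise lone
    return labels
```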
Dealing with the large and varying numbers of humans in a scene is one of the main challenges for multi-human trajectory prediction. Previous works adopted ad-hoc solutions, such as setting a maximum number of humans [20]. In this work, we address this problem in a simpler and more principled way through graph representations. Nodes in the graph denote humans in the crowd. In the following, we denote the number of humans in the crowd by $N$.

We adopt two-layer graph convolutional networks (GCNs) [11] to aggregate information in crowds. To each node in the network, we associate a feature vector, which contains important information about the node. The graph convolutional layer is the main building block of GCNs. It takes input feature vectors for each node and converts them to output feature vectors for each node by integrating information both within and across nodes. We use $I$ to denote the dimension of the input feature vectors and $O$ to denote the dimension of the output feature vectors. The input feature vectors of layer $l$ are represented by a matrix $H^l \in \mathbb{R}^{N \times I}$. The input feature matrix is converted to output vectors represented by a matrix $H^{l+1} \in \mathbb{R}^{N \times O}$ based on the layer-wise forward rule:

$$H^{l+1} = \sigma\left(A H^l W^l\right) \quad (6)$$

where $W^l \in \mathbb{R}^{I \times O}$ is a trainable weight matrix for layer $l$, and $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix of the graph, whose values determine how information from different nodes is aggregated. Each row of $A$ is normalized to sum to one, and $\sigma(\cdot)$ is the ReLU activation function.

The adjacency matrix reflects the connections between nodes of the graph. The vanilla GCN assumes that the qualitative influence of each human on another (as determined by $W^l$) is the same and that only the strength of that influence can be modulated (through the adjacency matrix). However, we argue that the qualitative effects of the humans in the crowd on a particular human's trajectory differ depending on whether the humans are in the same group or not. A single GCN cannot handle this, so we propose to use two GCNs. As shown in Fig. 2, for each human we modulate the adjacency matrix by multiplication with two coherence masks which encode the intergroup and intragroup labels. We obtain two adjacency matrices, denoting the intergroup ($A_{inter}$) and intragroup ($A_{intra}$) connections separately for each human, by elementwise multiplication of the adjacency matrix ($A$) with the masks. We set the values in the adjacency matrix by first constructing a binary matrix denoting connections between nodes, and then normalizing each row.

By modulating the adjacency matrices of the GCNs with coherent motion information, we incorporate implicit social relations into our network for better interaction modeling.

Figure 2: An example of how the adjacency matrices of the GCNs for crowd information aggregation are determined (panels: adjacency matrix, coherency mask, modulated adjacency matrix). The example considers the adjacency matrices for the GCNs of human i = 3, who is in the same cluster as humans 1, 2 and 4, but not humans 5, 6 and 7.

We trained the network with the Adam optimizer. The mini-batch size is 64 and the learning rate is 1e-4. The models were trained for 200 epochs. The encoder encodes the relative trajectories by a single-layer MLP ($MLP_{enc}$) with a dimension of 16, followed by an LSTM ($LSTM_{en}$) with a hidden dimension of 32. The embedding output from the LSTM is then concatenated with the features extracted from the relative positions of the other humans by a single-layer MLP with a dimension of 16. The concatenated features are then fed into the two GCNs for feature integration. The two graph convolutional layers have hidden dimensions of 72 and 8, respectively. An MLP ($MLP_{vae}$) then takes the state of the humans to create a distribution with mean and variance. We sample $z$, with a dimension of 8, from this distribution and feed it into an LSTM ($LSTM_{de}$) with a hidden dimension of 32, followed by an MLP ($MLP_{dec}$) with an output dimension of 2 for decoding.
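A minimal sketch of Eq. (6) and the mask construction, using the Figure 2 example. The group assignment for pedestrians 5-7 and the feature-size wiring are our assumptions; the paper only states that pedestrian 3 shares a cluster with 1, 2 and 4 and that the two graph convolutional layers have dimensions 72 and 8.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_adjacency(labels):
    """Intra/inter adjacency matrices from coherent-motion labels (N,).
    Binary connectivity is masked by same-group membership, then each row
    is normalized to sum to one, as described in Section 3.4."""
    same = labels[:, None] == labels[None, :]              # coherence mask
    a_intra = same.float()
    a_inter = (~same).float()
    a_intra = a_intra / a_intra.sum(1, keepdim=True).clamp(min=1e-9)
    a_inter = a_inter / a_inter.sum(1, keepdim=True).clamp(min=1e-9)
    return a_intra, a_inter

class GCNLayer(nn.Module):
    """One graph convolution, Eq. (6): H^{l+1} = ReLU(A H^l W^l)."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.w = nn.Linear(dim_in, dim_out, bias=False)    # W^l
    def forward(self, h, a):
        return F.relu(self.w(a @ h))                       # sigma(A H W)

# Figure 2's example: pedestrian 3 (index 2) shares a cluster with 1, 2, 4;
# the grouping of pedestrians 5-7 is not specified, so it is assumed here.
labels = torch.tensor([0, 0, 0, 0, 1, 2, 2])
a_intra, a_inter = masked_adjacency(labels)

h = torch.randn(7, 48)                                   # node features (dim assumed)
gcn_intra = nn.ModuleList([GCNLayer(48, 72), GCNLayer(72, 8)])  # dims 72, 8
gcn_inter = nn.ModuleList([GCNLayer(48, 72), GCNLayer(72, 8)])  # separate weights
v_intra = gcn_intra[1](gcn_intra[0](h, a_intra), a_intra)       # Eq. (2) features
v_inter = gcn_inter[1](gcn_inter[0](h, a_inter), a_inter)       # Eq. (3) features
```

Note that `a_intra` keeps self-connections (a pedestrian is trivially coherent with itself) while `a_inter` zeroes them; this is one natural reading of the binary-then-normalize construction described above.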
4 Experiments

In this section, we evaluate our method on two public datasets, ETH [18] and UCY [14]. The ETH dataset contains two scenes (ETH and Hotel), while the UCY dataset contains three scenes (Zara1, Zara2, and Univ). There are five sets of data with four different scenarios and 1536 pedestrians in total.
Following the setting in [7], we adopt the leave-one-out approach, i.e., we train with four sets and test on the remaining set. We take trajectories with 8 time steps as observations and evaluate trajectory predictions over the next 12 time steps.
Similar to previous works [7, 12, 20], we adopt two standard metrics: Average Displacement Error (ADE) and Final Displacement Error (FDE), both in meters.
ADE: Mean L2 distance between the ground truth and the prediction, averaged over all time steps.

FDE: Mean L2 distance between the ground truth and the prediction at the final time step.
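For reference, the two metrics reduce to a few lines of NumPy; the array layout (time, pedestrian, xy) is our choice:

```python
import numpy as np

def ade_fde(pred, gt):
    """ADE and FDE in meters. pred, gt: (T_pred, N, 2) absolute positions."""
    err = np.linalg.norm(pred - gt, axis=-1)   # (T_pred, N) per-step L2 distance
    return err.mean(), err[-1].mean()          # mean over all steps; final step only
```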
Dataset | S-GAN | Sophie | Trajectron | Social-BiGAT | MLP | GCN | GAT | GCN+group(CF) | GCN+group(Hybrid)
ETH | 0.81/1.52 | 0.70/1.43 | – | – | – | – | – | 0.71/1.28 | 0.70/1.26
HOTEL | – | – | – | – | – | – | – | 0.37/0.76 | 0.37/0.75
UNIV | 0.60/1.26 | 0.54/1.24 | 0.59/1.21 | 0.55/1.32 | 0.61/1.31 | 0.55/1.18 | 0.55/1.19 | 0.55/1.19 | 0.53/1.16
ZARA1 | 0.34/0.69 | 0.30/0.63 | 0.55/1.09 | – | – | – | – | 0.34/0.72 | 0.34/0.71
ZARA2 | – | – | – | – | – | – | – | 0.32/0.68 | 0.31/0.67
AVG | 0.58/1.18 | 0.54/1.15 | 0.53/1.06 | 0.48/1.00 | 0.49/1.01 | 0.47/0.94 | 0.47/0.96 | 0.46/0.93 | 0.45/0.91
Table 1: Quantitative results. We adopted two metrics, Average Displacement Error (ADE) and Final Displacement Error (FDE), for evaluation over five different datasets (ADE/FDE in meters). S-GAN, Sophie, Trajectron, and Social-BiGAT are baselines; the remaining columns are our models. Our full model (GCN+group(Hybrid)) achieves state-of-the-art results, outperforming all baseline methods (lower values denote better performance).
We compare our work with the following recent works based on generative models:
Social GAN (S-GAN) [7]: A generative model using a GAN to generate multimodal predictions. It utilizes a global pooling module to combine crowd interactions via an MLP followed by a max-pooling layer.
Sophie [20]: An improved GAN-based model which considers both social interactions and physical interactions with the scene context.
Trajectron [10]: A generative model based on a CVAE for multimodal predictions with spatiotemporal graphs.
Social-BiGAT [12]: A generative model using Bicycle-GAN for multimodal prediction and a GAT for crowd interaction modeling.
As shown in Table 1, we compare our models with various baselines. The average displacement error (ADE) and final displacement error (FDE) are reported across the five datasets. Following the settings of the baselines, we draw 20 samples for evaluation.

Our final model with the GCN and coherent motion constraints beats all baselines and obtains more consistent results in both ADE and FDE. Compared to Social GAN, we achieve a 22.4% improvement in ADE and a 22.9% improvement in FDE on average. Compared to Sophie, which uses additional scene context information, we achieve a 16.7% improvement in ADE and a 20.9% improvement in FDE on average. Compared to Trajectron, which also uses a VAE as the backbone network, we achieve a 15.1% improvement in ADE and a 14.2% improvement in FDE on average. Compared to Social-BiGAT, which also uses a graph structure for interaction modeling, we achieve a 6.3% improvement in ADE and a 9.0% improvement in FDE on average.
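The 20-sample protocol inherited from the baselines scores each scene by its best sample. A sketch of our reading of that protocol (some implementations take the minimum per pedestrian rather than per scene):

```python
import numpy as np

def best_of_k_ade_fde(samples, gt):
    """samples: (K, T_pred, N, 2) sampled predictions, e.g. K = 20; gt: (T_pred, N, 2)."""
    err = np.linalg.norm(samples - gt[None], axis=-1)  # (K, T_pred, N)
    ade = err.mean(axis=(1, 2))                        # ADE of each sample
    fde = err[:, -1].mean(axis=1)                      # FDE of each sample
    return ade.min(), fde.min()                        # keep the best sample
```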
We conduct several ablation studies to validate the benefits of the GCN and of the coherent motion information. To show the benefit of the GCN, we investigated another model that replaces the GCN with an MLP (followed by max-pooling, similar to the pooling module in Social GAN [7]), as shown in Table 1.
Figure 3: Visualization of generated human trajectories for S-GAN and our model across several scenes (a)-(d). Observed trajectories are shown as solid lines, ground truth future trajectories as wide dashed lines, and the 20 generated samples per model as thin dashed lines. The dot-dashed lines denote the predictions of our VAE-based model obtained by applying the mean value ($\mu_z$) of the distribution. Different humans are denoted by different colors.

When comparing the model using the GCN with the one using the MLP, we can see that the GCN achieves a 4.1% improvement in ADE and a 6.9% improvement in FDE.

To show the benefit of incorporating coherent motion information, we compare our full model with one that does not consider coherence information (using the GCN only) and with a GAT (same implementation as [23]). When comparing the full model with the one using the GCN only, our full model with coherent motion information achieves a 4.3% improvement in ADE and a 3.2% improvement in FDE. When comparing the full model with the GAT, the full model achieves a 4.3% improvement in ADE and a 5.2% improvement in FDE. These ablation studies clearly demonstrate the benefits of the GCN and of the introduction of coherent motion information.

We further investigated the trajectory prediction performance of models with different coherent motion detection methods: the coherent filtering method (CF) [29] vs. our hybrid labeling method (Hybrid). The model with our hybrid coherent detection method (coherent filtering + DBSCAN) outperforms the model with coherent filtering alone by 2.2% in ADE and 2.2% in FDE on average. The improvements are consistent over all five datasets.

To better understand the benefits of our model in capturing social interactions between humans, we visualize several examples of generated trajectories across the testing sets, as shown in Fig. 3. From the four examples, we can see that the predictions of our model generally have lower variance than those of S-GAN, meaning our model generates accurate predictions more efficiently. The examples also show that our model better captures the interactions of pedestrians walking in crowds and obtains more accurate predictions (as shown in (d)). Our model generates more realistic predictions that avoid collisions, as shown in example (b). Besides, S-GAN tends to predict slower motion on the HOTEL dataset (as shown in (c)). For qualitative results of the ablation study, please refer to Fig. 3 in the supplementary file; we observe results consistent with the quantitative evaluation, with the proposed full model making more accurate and realistic predictions.
5 Conclusion

In this paper, we propose a novel VAE-based generative model for trajectory prediction which outperforms state-of-the-art methods. We introduce graph convolutional networks (GCNs) for efficient crowd interaction aggregation. Furthermore, we provide coherent motion information for the trajectory prediction datasets; these coherent motion labels, which significantly enrich the social information of the commonly used ETH and UCY datasets, will be released to the research community. We incorporate this coherent motion information, which carries rich implicit social relationships among humans, into our method. We show that the introduction of GCNs and coherent motion information significantly improves trajectory prediction performance.
References

[1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.
[2] Javad Amirian, Jean-Bernard Hayet, and Julien Pettré. Social ways: Learning multi-modal distributions of pedestrian trajectories with GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[3] Gianluca Antonini, Michel Bierlaire, and Mats Weber. Discrete choice models of pedestrian walking behavior. Transportation Research Part B: Methodological, 40(8):667–687, 2006.
[4] Niccolo Bisagno, Bo Zhang, and Nicola Conci. Group LSTM: Group trajectory prediction in crowded scenarios. In The European Conference on Computer Vision Workshops, September 2018.
[5] Yuying Chen, Congcong Liu, Bertram E. Shi, and Ming Liu. Robot navigation in crowds by graph convolutional networks with attention learned from human gaze. IEEE Robotics and Automation Letters, 5(2):2754–2761, 2020.
[6] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, volume 96, pages 226–231, 1996.
[7] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2255–2264, 2018.
[8] Irtiza Hasan, Francesco Setti, Theodore Tsesmelis, Alessio Del Bue, Fabio Galasso, and Marco Cristani. MX-LSTM: Mixing tracklets and vislets to jointly forecast trajectories and head poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6067–6076, 2018.
[9] Dirk Helbing and Peter Molnar. Social force model for pedestrian dynamics. Physical Review E, 51(5):4282, 1995.
[10] Boris Ivanovic and Marco Pavone. The Trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the IEEE International Conference on Computer Vision, pages 2375–2384, 2019.
[11] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[12] Vineet Kosaraju, Amir Sadeghian, Roberto Martín-Martín, Ian Reid, Hamid Rezatofighi, and Silvio Savarese. Social-BiGAT: Multimodal trajectory forecasting using Bicycle-GAN and graph attention networks. In Advances in Neural Information Processing Systems, pages 137–146, 2019.
[13] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, and Manmohan Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 336–345, 2017.
[14] Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. Crowds by example. In Computer Graphics Forum, volume 26, pages 655–664. Wiley Online Library, 2007.
[15] Matteo Lisotto, Pasquale Coscia, and Lamberto Ballan. Social and scene-aware trajectory prediction in crowded spaces. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[16] Ramin Mehran, Alexis Oyama, and Mubarak Shah. Abnormal crowd behavior detection using social force model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 935–942. IEEE, 2009.
[17] Ling Mei, Jianghuang Lai, Zeyu Chen, and Xiaohua Xie. Measuring crowd collectiveness via global motion correlation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[18] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 261–268. IEEE, 2009.
[19] Andrey Rudenko, Luigi Palmieri, Michael Herman, Kris M. Kitani, Dariu M. Gavrila, and Kai O. Arras. Human motion trajectory prediction: A survey. arXiv preprint arXiv:1905.06113, 2019.
[20] Amir Sadeghian, Vineet Kosaraju, Ali Sadeghian, Noriaki Hirose, Hamid Rezatofighi, and Silvio Savarese. SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1349–1358, 2019.
[21] Hang Su, Jun Zhu, Yinpeng Dong, and Bo Zhang. Forecast the plausible paths in crowd scenes. In International Joint Conferences on Artificial Intelligence, 2017.
[22] Adrien Treuille, Seth Cooper, and Zoran Popović. Continuum crowds. In ACM Transactions on Graphics (TOG), volume 25, pages 1160–1168. ACM, 2006.
[23] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[24] Jack M. Wang, David J. Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2007.
[25] Xiaogang Wang, Xiaoxu Ma, and W. Eric L. Grimson. Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(3):539–555, 2008.
[26] Zhengwei Wang, Qi She, and Tomas E. Ward. Generative adversarial networks: A survey and taxonomy. arXiv preprint arXiv:1906.01529, 2019.
[27] Yanyu Xu, Zhixin Piao, and Shenghua Gao. Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5275–5284, 2018.
[28] Pu Zhang, Wanli Ouyang, Pengfei Zhang, Jianru Xue, and Nanning Zheng. SR-LSTM: State refinement for LSTM towards pedestrian trajectory prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12085–12094, 2019.
[29] Bolei Zhou, Xiaoou Tang, and Xiaogang Wang. Coherent filtering: Detecting coherent motions from crowd clutters. In European Conference on Computer Vision, pages 857–871. Springer, 2012.
Supplementary Material for CoMoGCN: Coherent Motion Aware Trajectory Prediction with Graph Representation
Yuying Chen* [email protected]
Congcong Liu* [email protected]
Bertram Shi [email protected]
Ming Liu [email protected]
Robotics Institute, Hong Kong University of Science and Technology, Hong Kong, China
Table 1 shows the parameters used for coherent filtering (CF) as well as DBSCAN. The coherent filtering method is sensitive to the chosen parameters, which are carefully tuned for each dataset. As coherent filtering can easily detect coherent motions in dense environments and induce false positives, we set a larger frame window size to ensure accuracy. In addition, the angular difference is limited to a smaller value for DBSCAN for accurate detection. Typical DBSCAN applies the Euclidean distance as the distance function. However, we consider the angular distance, lateral distance, and longitudinal distance for better coherent motion clustering.

Dataset | d+2 | K_max | λ | θ | s_lateral | s_longitudinal | minPts
ETH, HOTEL, ZARA1, ZARA2 | 5 | 5 | 0.8 | 0.5 | 2 | 5 | 2
UNIV | 8 | 5 | 0.8 | 0.2 | 2 | 5 | 2

Table 1: Parameters used for coherent motion clustering. d+2 frames indicates the frame window size. K_max indicates the maximum number of nearest neighbors; coherent filtering considers K nearest neighbors, and we set K = min(K_max, neighborhood size). λ is the correlation threshold. θ is the angular distance, in radians. s_lateral is the lateral distance of potential neighbors to the pedestrian considered and s_longitudinal is the longitudinal distance, both in meters. minPts is the minimum number of points.
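Table 1's θ, s_lateral and s_longitudinal suggest a thresholded pairwise test rather than a plain Euclidean distance. The sketch below wires such a test into scikit-learn's DBSCAN as a callable metric; how the three distances are actually fused in the paper is not specified, so the all-thresholds-must-pass rule is our assumption.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def coherence_distance(a, b, theta=0.5, s_lat=2.0, s_lon=5.0):
    """Pairwise 'distance' combining heading-angle, lateral and longitudinal
    separation (thresholds from Table 1; the binarized combination is our
    guess). Each sample is (x, y, heading) for one pedestrian."""
    dang = np.abs(np.arctan2(np.sin(a[2] - b[2]), np.cos(a[2] - b[2])))
    off = b[:2] - a[:2]
    head = np.array([np.cos(a[2]), np.sin(a[2])])      # a's heading direction
    lon = np.abs(off @ head)                           # along-heading offset
    lat = np.abs(off @ np.array([-head[1], head[0]]))  # perpendicular offset
    ok = (dang < theta) and (lat < s_lat) and (lon < s_lon)
    return 0.0 if ok else 1.0                          # neighbors iff all tests pass

# feats: (M, 3) rows of (x, y, heading) for the still-unlabeled pedestrians
feats = np.array([[0.0, 0.0, 0.10], [0.5, 0.2, 0.15], [10.0, 9.0, 2.0]])
labels = DBSCAN(eps=0.5, min_samples=2, metric=coherence_distance).fit_predict(feats)
# -> the first two pedestrians cluster together; the third is noise (-1)
```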
Figure 1: Representative examples and coherent motion clustering results for the five datasets (ETH, HOTEL, ZARA1, ZARA2, UNIV). The same color denotes the same group; black denotes an individual with no coherence with others. Circles show the current positions and dots show the trajectory history used for clustering. Coherent motions of both groups of humans and individual humans are detected. Best viewed in color.
Typical examples of the coherent motion detection for each dataset can be seen in Fig. 1, which shows the good performance of the coherent motion detection.

Fig. 2 compares coherent motion detection by coherent filtering [2] and by our method (which leverages DBSCAN clustering to compensate for the drawbacks of coherent filtering). The clustering results clearly demonstrate the performance of coherent filtering when applied to the trajectory datasets. It performed well for detecting coherent motions in dense crowds despite differences in motion directions and spatial separations. However, it showed poor performance for detecting coherence in sparse trajectory datasets that consist of small groups. Pedestrians with similar moving patterns are labeled as having no coherent motion when the number of coherent pedestrians is small. This causes the low labeling rates shown in Table 2; e.g., in the HOTEL dataset, only around 10% of all motions are labeled as coherent. Besides, some static pedestrians are mislabeled (Fig. 2b) and become false positives for some clusters.

To compensate, we applied DBSCAN to detect the small pedestrian clusters. As shown in Fig. 2, small groups of pedestrians with similar moving directions are detected and labeled as the same motion cluster. Through this, the percentage of labeled coherent motions over all motions is increased to a reasonable value. To better show the improvement of the clustering method, we compared the Fréchet distance [1] of intra-group and inter-group trajectory pairs clustered by coherent filtering alone or with DBSCAN. The results are shown in Table 3. With better coherence clustering of small groups, the Fréchet distance of trajectories classified as coherent by coherent filtering plus DBSCAN becomes smaller, while it becomes larger for trajectories with little coherence. A lower Fréchet distance between two trajectories denotes higher similarity, which indicates the improved coherence clustering of our proposed method.
Table 4 shows the prediction error (ADE and FDE) of models with different coherent motion detection methods. Consistent with the coherent motion detection performance, the trajectory prediction model with coherent motions detected by CF plus DBSCAN achieves better prediction performance.

Dataset | CF | CF + DBSCAN
ETH | 41.0% | 77.3%
HOTEL | 12.4% | 77.6%
UNIV | 35.0% | 80.6%
ZARA1 | 38.9% | 83.9%
ZARA2 | 45.6% | 89.1%

Table 2: Percentage of labeled coherent motions over all motions.

Dataset | CF (Intra Group) | CF (Inter Group) | CF + DBSCAN (Intra Group) | CF + DBSCAN (Inter Group)
ETH | 3.58 | 7.30 | 3.21 | 8.59
HOTEL | 4.10 | 5.08 | 2.90 | 5.69
UNIV | 2.82 | 7.28 | 2.54 | 7.67
ZARA1 | 3.60 | 5.59 | 2.73 | 7.64
ZARA2 | 3.57 | 5.37 | 1.70 | 6.06
AVG | 3.53 | 6.12 | 2.62 | 7.13

Table 3: Similarity of intra-group and inter-group trajectories for the two coherent detection methods, measured by the Fréchet distance.
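The discrete Fréchet distance used in Table 3 has a standard dynamic-programming form; the sketch below is our implementation of the common coupling recurrence, not code from [1]:

```python
import numpy as np

def discrete_frechet(p, q):
    """Discrete Fréchet distance between trajectories p, q of shape (T, 2).
    Recurrence: c[i,j] = max(d[i,j], min(c[i-1,j], c[i,j-1], c[i-1,j-1]))."""
    d = np.linalg.norm(p[:, None] - q[None, :], axis=-1)  # pairwise point distances
    c = np.full_like(d, np.inf)
    c[0, 0] = d[0, 0]
    for i in range(len(p)):
        for j in range(len(q)):
            if i == 0 and j == 0:
                continue
            prev = min(c[i - 1, j] if i else np.inf,
                       c[i, j - 1] if j else np.inf,
                       c[i - 1, j - 1] if i and j else np.inf)
            c[i, j] = max(prev, d[i, j])
    return c[-1, -1]
```

Intra-group pairs should yield small values and inter-group pairs large ones, matching the trend in Table 3.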
Figure 3 shows pedestrian trajectory prediction results for different models. We observe results consistent with the quantitative evaluation. Compared to S-GAN, our models often generate more accurate predictions with lower variance. We also observed that the model using the MLP, tested on the HOTEL and UNIV datasets, tends to predict slower motion than in the real situations, similar to S-GAN. The model using the GAT is more likely to produce unexpected predictions, such as the sharp turns shown in the second column of the figure. Among our models, the proposed full model makes the most accurate and realistic predictions.
Model | ETH | HOTEL | UNIV | ZARA1 | ZARA2 | AVG
GCN+Group (CF alone) | 0.71/1.28 | 0.37/0.76 | 0.55/1.19 | 0.34/0.72 | 0.32/0.68 | 0.46/0.93
GCN+Group (CF + DBSCAN) | 0.70/1.26 | 0.37/0.75 | 0.53/1.16 | 0.34/0.71 | 0.31/0.67 | 0.45/0.91

Table 4: Prediction performance (ADE/FDE in meters) of models with different coherent motion clustering methods.
Figure 2: Representative examples (a)-(d) for the two coherent motion clustering methods, coherent filtering alone and coherent filtering + DBSCAN. The same color denotes the same group; black denotes an individual with no coherence with others. Circles show the current positions and dots show the trajectory history used for clustering. Best viewed in color.
References

[1] Jiang Bian, Dayong Tian, Yuanyan Tang, and Dacheng Tao. A survey on trajectory clustering analysis. arXiv preprint arXiv:1802.06971, 2018.
[2] Bolei Zhou, Xiaoou Tang, and Xiaogang Wang. Coherent filtering: Detecting coherent motions from crowd clutters. In European Conference on Computer Vision, pages 857–871. Springer, 2012.
Figure 3: Visualization of predicted trajectories for different models. Observed trajectories are shown as solid lines, ground truth future trajectories as wide dashed lines, and the 20 generated samples per model as thin dashed lines. The dot-dashed lines denote the predictions of our VAE-based model obtained by applying the mean value ($\mu_z$) of the distribution.