Simple means Faster: Real-Time Human Motion Forecasting in Monocular First Person Videos on CPU
Junaid Ahmed Ansari and Brojeshwar Bhowmick
TCS Research and Innovation Labs, Kolkata, India
{junaidahmed.ansari, b.bhowmick}@tcs.com
Abstract — We present a simple, fast, and light-weight RNN based framework for forecasting future locations of humans in first person monocular videos. The primary motivation for this work was to design a network which could accurately predict future trajectories at a very high rate on a CPU. Typical applications of such a system would be a social robot or a visual assistance system "for all", as neither can afford high compute power without becoming heavier, less power efficient, and costlier. In contrast to many previous methods which rely on multiple types of cues such as camera ego-motion or the 2D pose of the human, we show that a carefully designed network model which relies solely on bounding boxes can not only perform better but also predict trajectories at a very high rate while being quite small, approximately 17 MB in size. Specifically, we demonstrate that having an auto-encoder in the encoding phase of the past information and a regularizing layer at the end boosts the accuracy of predictions with negligible overhead. We experiment with three first person video datasets: CityWalks, FPL and JAAD. Our simple method trained on CityWalks surpasses the prediction accuracy of the state-of-the-art method (STED) while being 9.6x faster on a CPU (STED runs on a GPU). We also demonstrate that our model transfers zero-shot, or after fine-tuning on just 15% of the training data, to other similar datasets and performs on par with the state-of-the-art methods on such datasets (FPL and DTP). To the best of our knowledge, we are the first to accurately forecast trajectories at a very high prediction rate of 78 trajectories per second on a CPU.
I. INTRODUCTION
We address the problem of forecasting human motion in first person monocular videos using neural networks. We are particularly interested in developing a network which can accurately predict future locations of humans in real-time on machines with low compute and memory capability. To this end, we propose a Recurrent Neural Network (RNN) based framework which relies only on detection bounding boxes of tracked moving humans in the scene.

While such system requirements as mentioned above, i.e. lightweight and simple, are not very relevant from the perspective of current autonomous cars, they are certainly crucial for systems which have power, weight, size, or monetary constraints. Modern autonomous cars have access to powerful computers with multiple CPUs and GPUs, and therefore their systems' real-time performance is not affected by the complexity of the network model used.

Take, for example, a social robot "for all" which cannot afford high end sensors and heavy compute resources because of monetary and size constraints. In addition, it cannot spend most of its compute power on just one task (forecasting pedestrians' motion) if it has to be operational for long. Similarly, consider a visual assistance system for people who are visually challenged: again, in addition to it being accessible to all regardless of their economic standing, it has to be small in size so that it can be carried for long and be power efficient for operational longevity. All the aforementioned constraints mean fewer sensors, and low compute and memory resources.

Fig. 1: Illustration of the task at hand: given past detection bounding boxes of moving humans, we aim to forecast their locations in future frames using the proposed RNN based network.

Current state-of-the-art works [1]–[4] in this context rely on multiple types of information related to the scene (including the target humans) and/or the camera motion. For example, [1] relies on the ego-motion of the camera and the 2D body pose and location of the humans in the scene. Similarly, [2] and [3] predict the centroids of future bounding boxes by feeding either only optical flow vectors (within the bounding box) or bounding box coordinates in addition to the optical flow, respectively. Using depth [5], [6] also helps the problem, but obtaining such depth reliably is difficult.

In this work, we pose human motion forecasting as a sequence-to-sequence learning [7] problem and propose a network model essentially comprising RNNs. As the trajectory of humans, when perceived in some reference frame, can be considered a sequence of locations in time and space, RNNs are the most natural choice. In this work we use LSTMs [8] as RNN units.

We design a network model that relies solely on past observations of detection bounding boxes to forecast their future locations (illustrated in Fig. 1). Our model is therefore light-weight to the extent that the whole model is only about 17 MB in size. The network consists of three blocks: 1) an encoder block that consumes the past bounding box observations and produces a fixed size representation that summarizes the entire history, 2) a decoder block which takes the representation vector from the previous block to predict the future velocities, and 3) a trajectory concatenation layer that converts the predicted future velocities into future target locations. The last layer acts as a regularizer and does not have any parameters to learn (see Sec. IV-A.3). The entire network is trained in an end-to-end fashion with two loss terms: the auto-encoder loss (Eq. 5, Sec. IV-A.1) and the future target location loss (Eq. 11, Sec. IV-A.3).
The primary motivation for this design choice of having two decoder LSTMs (one in the auto-encoder and one for future decoding) was to ensure that the network learns to encode the past input observations into a representation that not only extracts all the information needed to extrapolate the future instances but is also actually representative of the entire input. This design is inspired by the composite model of the LSTM Encoder-Decoder framework proposed in [9].

We evaluate our model by training it on the recently introduced CityWalks dataset [3], which was captured by a person holding a camera while moving through multiple cities under different weather conditions. The network is fed solely with detection bounding boxes and no other information. We show that our trained model not only surpasses the prediction accuracy of the current state-of-the-art [3] on the CityWalks dataset [3], but also transfers zero-shot to another first person video dataset, First Person Locomotion (FPL) [1], which is captured at a different frame rate. To the best of our knowledge, none of the previous works have demonstrated a zero-shot transfer to a different dataset, with [3] being one exception. However, [3] relies on optical flow in addition to the bounding boxes, while we rely only on bounding boxes.

We also test the same model (trained on CityWalks [3]) on a different type of dataset, Joint Attention for Autonomous Driving (JAAD) [10], [11], meant for behavioural understanding of traffic participants. We notice, as expected, that our model does not transfer zero-shot to this dataset. Interestingly, however, if we fine-tune our trained model on just 15% of the train samples of JAAD [10], its performance is on par with the state-of-the-art [2] on this dataset.

In summary, the contributions of this work are as follows:

• We show that a simple but carefully designed network architecture (Sec. IV-A) is sufficient to accurately forecast human motion in first person videos (see Sec. V), which we demonstrate by evaluating it on three datasets: CityWalks [3], FPL [1] and JAAD [10], [11].

• We show that having an extra layer at the end, the trajectory concatenation layer, with no learnable units improves the overall accuracy of prediction (Sec. IV-A.3).

• We demonstrate that our simple and light-weight model trained on one dataset (CityWalks [3]) is capable of transferring zero-shot to another similar dataset (FPL [1]), and when fine-tuned with only 15% of the train samples of a not so similar dataset (JAAD [10], [11]), it performs on par with the state-of-the-art [2] which was trained on that dataset.

• We show that, by virtue of the simplicity of our network model, we can achieve a real-time performance of predicting
∼39 Trajectories Per Second (TPS) on a single core of a CPU, which is 4.8x faster than our competitor [3], while being only
∼17 MB in size. On the entire CPU (> 4 cores), our model can predict 78 TPS, i.e. 9.6x faster than the state-of-the-art [3].

II. RELATED WORKS
Predicting different aspects of human dynamics has been the focus of the computer/robotics vision community for quite some time now. Specifically, in the past decade and a half, we have seen remarkable developments in the areas of human activity forecasting (e.g. [12]–[17]), pose forecasting (e.g. [18]–[20]), and human trajectory forecasting (e.g. [1]–[4], [21]–[32]), thanks to the incredible developments in CNNs and RNNs. In this work, as we are concerned with the task of predicting the motion of humans in the scene, we constrain our discussion in this section to human trajectory forecasting only.

Most of the work in the context of human trajectory forecasting has been done from the perspective of surveillance, social interaction, crowd behaviour analysis, or sports (e.g. [33]–[36]). Most of these works either rely on a bird's eye view of the scene or depend on a multi-camera setup. One of the pioneering works, SocialLSTM, introduced in [37], used LSTMs to forecast human motion with social pooling. Social pooling was introduced so that the network learns the interactions between different pedestrians in the scene while predicting their futures. Following it, many similar models were introduced which predict future locations based on social, physical, or semantic constraints. For example, SocialGAN [27] uses a Generative Adversarial Network (GAN) framework [38] for predicting socially plausible human trajectories, SoPhie [26] bases its trajectories on social and physical attention mechanisms in addition to a GAN in order to produce realistic trajectories, and [22] performs navigation, social, and semantic pooling to predict semantically compliant future trajectories. Similarly, [39] trains on both scene and pedestrian location information to predict future motion. On the other hand, [24] proposes to use temporal correlations of interactions with other pedestrians along with spatial ones. A few researchers have also considered other information related to the pedestrian; for example, [28]–[30] use a sequence of head poses in addition to other relevant cues. For a comprehensive collection of research activities in this area, refer to [40]. We are distinct from all the aforementioned works in two ways: 1) we rely only on the front view, and 2) our model depends solely on detection bounding boxes and no other information whatsoever.

Very recently, we have seen a few exciting works in the area of human trajectory forecasting in first person perspective [1]–[4]. However, all of them rely on multiple types of information related to the pedestrian whose motion is to be forecasted, the scene in which the camera and pedestrians are moving, and the ego-motion of the camera. For example, [1] relies on camera ego-motion and the 2D pose of the pedestrians to forecast their motion. Similarly, [2]–[4] all rely on optical flow information and the detection bounding boxes. A slightly different work which forecasts the motion of individual skeleton joint locations of the pedestrian was proposed in [41]. This work too relies on multiple cues such as 2D pose (i.e. skeleton joint information), camera ego-motion, and the 3D structure of the scene. In this setting, we differ from all the aforementioned works in multiple ways: 1) we rely only on detection bounding boxes of the pedestrians, 2) our method can predict a very large number of trajectories per second on a CPU, e.g. our model predicts at 78 TPS, which is 9.6x faster than the state-of-the-art on CityWalks [3], 3) we train and test only on CPUs, and 4) our network is extremely light-weight, at only
∼17 MB in size.

III. OVERVIEW
In this section we start by formulating the problem of forecasting human motion in first person videos and show that this task, by construction, is a sequence-to-sequence problem. Then, we discuss the evaluation metrics common to all the datasets on which we evaluate.
A. Problem Formulation
Consider a scene with a moving human. Assume that the scene is being captured in first person perspective by a freely moving monocular camera. Let us say that at time $t$ we are at frame $f$ in the video sequence. Assume that the human in the scene is being detected and tracked, i.e. we have a detection bounding box for each frame in the video sequence along with its track ID. The task at hand is as follows: given only the detection bounding boxes of the human in the scene for the past $k$ frames $\{f-k, f-k+1, f-k+2, \ldots, f\}$, we aim to predict the bounding boxes for the $p$ future frames. Formally, given $B \equiv \{b_{f-k}, b_{f-k+1}, \ldots, b_f\}$, a sequence of bounding boxes over the past $k$ frames relative to frame $f$ (inclusive of $f$), we want to obtain $P \equiv \{b_{f+1}, b_{f+2}, \ldots, b_{f+p}\}$, a sequence of bounding boxes for the $p$ future frames. In this work, we use human, person and pedestrian interchangeably.

B. Evaluation Metrics
We adopt the two evaluation metrics commonly used in the trajectory forecasting setting from [37]: Average Displacement Error (ADE) and Final Displacement Error (FDE). ADE is defined as the mean Euclidean distance between the predicted and the ground truth bounding box centroids over all the predicted bounding boxes, and FDE is defined similarly for the bounding box centroid at the final prediction step only.
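For reference, both metrics reduce to a few lines of code. The sketch below is ours (the function name and array layout are assumptions, not from the paper), taking the predicted and ground truth centroids as (p, 2) arrays over the p predicted frames:

```python
import numpy as np

def ade_fde(pred_centroids, gt_centroids):
    """ADE/FDE between predicted and ground truth centroids.
    Both arguments are (p, 2) arrays of (cx, cy), one row per predicted frame."""
    # per-step Euclidean distance between corresponding centroids
    dists = np.linalg.norm(pred_centroids - gt_centroids, axis=1)
    ade = dists.mean()   # mean over all predicted steps
    fde = dists[-1]      # error at the final prediction step only
    return ade, fde
```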
IV. METHODOLOGY
A. Architecture
Our network model, as shown in Fig. 2, can be understood as a sequence of three blocks: 1) an encoder block that processes the information from the input sequence and produces a fixed length representation, 2) a decoder block that takes as input the learned representation from the previous block and predicts the information for the future bounding boxes, and 3) a trajectory concatenation layer which converts the predicted outputs from the decoder block into actual future locations. The entire model is trained end-to-end and no pre-processing of any sort is performed on the input sequence.
1) Encoder Block:
This block of the model is essentially an LSTM based auto-encoder. Let us say that at time $t$ we are at frame $f$ in the video sequence. The input, $I$, to this block is a sequence of $k$ 8-dimensional vectors, $I \in \mathbb{R}^{k \times 8}$, which combines different aspects of the human's bounding box information observed over the past $k$ frames. Each vector in the sequence is made up of the centroid of the bounding box, $C_{xy} \in \mathbb{R}^2$, its dimensions in terms of width ($w$) and height ($h$), where $h, w \in \mathbb{R}^+$, and the frame-to-frame change in the centroid and the dimensions. Formally, the input sequence can be written as $I \equiv \{B_i\}_{i=f-k}^{f} = \{B_{f-k}, B_{f-k+1}, B_{f-k+2}, \ldots, B_f\}$, where each $B_i$ is an 8-dimensional vector, $B_i = (cx_i, cy_i, w_i, h_i, \Delta cx_i, \Delta cy_i, \Delta w_i, \Delta h_i)$, i.e. $B_i \in \mathbb{R}^8$. The $\Delta$ (or change) terms are computed as $\Delta U_i = U_i - U_{i-1}$ for all $U \in \{cx, cy, w, h\}$.

The Encoder LSTM of this block, $E_{enc}$, runs through the input sequence $I$ and generates a final hidden state vector, $H^e_f$, which summarizes the complete sequence of bounding box information. The final state vector is then fed to a fully connected layer, $FC^e_{enc}$, which maps it to a latent vector, $Z_f$.

The Decoder LSTM, $D_{enc}$, on the other hand, takes the encoded representation $Z_f$ and runs $k$ times, taking the same $Z_f$ as input at every iteration, to produce $k$ hidden states, $\{H^d_i\}_{i=f}^{f-k}$, one per iteration, which are then passed through $FC^d_{enc}$ to map them back to the input dimension, i.e. $\mathbb{R}^{k \times 8}$. Note that we intentionally reproduce the input sequence in the reverse direction, as this has proven to be useful [7]. The above process can be formalized as follows (we do not explicitly show the reliance of the LSTMs on the hidden states of their previous iterations):

$H^e_f = E_{enc}(I)$ (1)

$Z_f = FC^e_{enc}(H^e_f)$ (2)

$\{H^d_i\}_{i=f}^{f-k} = D_{enc}(Z_f)$ (3)

$\hat{I} = FC^d_{enc}(\{H^d_i\}_{i=f}^{f-k})$ (4)

where $\hat{I} \in \mathbb{R}^{k \times 8}$.

Why have an auto-encoder?
As the decoder LSTM in this block of the model tries to reproduce the input itself, we can have an objective function (Eq. 5) which penalizes the model based on how far it is from the actual input. This objective function makes sure that the encoder learns the right representation, in this case $Z_f$, which adequately captures the past information of the bounding boxes. We show the effect of this empirically in the Results section (see Table V or Sec. V).
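A minimal PyTorch sketch of such an encoder block is shown below. The class name and the hidden/latent sizes are our illustrative placeholders, not the paper's exact values; the ReLU before the FC layer follows the implementation details in Sec. IV-C:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """LSTM auto-encoder over past bounding box features (Eqs. 1-4).
    Layer sizes are illustrative placeholders, not the paper's exact values."""
    def __init__(self, in_dim=8, hidden_dim=512, latent_dim=64, k=30):
        super().__init__()
        self.k = k
        self.enc_lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)      # E_enc
        self.fc_enc = nn.Linear(hidden_dim, latent_dim)                    # FC^e_enc
        self.dec_lstm = nn.LSTM(latent_dim, hidden_dim, batch_first=True)  # D_enc
        self.fc_dec = nn.Linear(hidden_dim, in_dim)                        # FC^d_enc

    def forward(self, I):                      # I: (batch, k, 8)
        _, (h_f, c_f) = self.enc_lstm(I)       # final state H^e_f summarizes the past
        z = self.fc_enc(torch.relu(h_f[-1]))   # latent code Z_f (Eq. 2)
        # feed the same Z_f at each of the k decoding steps (Eq. 3)
        z_seq = z.unsqueeze(1).repeat(1, self.k, 1)
        h_dec, _ = self.dec_lstm(z_seq)
        I_hat = self.fc_dec(h_dec)             # reconstruction of the reversed input (Eq. 4)
        return z, (h_f, c_f), I_hat
```

The final encoder state is returned so the future decoder can be initialized from it, as described in Sec. IV-A.2.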
Fig. 2: The proposed RNN based network architecture:
The encoder block consumes the bounding box information for the past frames and produces a fixed size representation. The decoder block reads this representation and predicts the velocities of the centroid and the change in dimensions of the bounding boxes in future frames. The last layer converts the velocities into locations.

$\mathcal{L}_{auto\text{-}enc} = \frac{\sum_{i=f-k}^{f} |\hat{I} \ominus I|}{k \times 8}$ (5)

where $\ominus$ represents the element-wise vector subtraction operation. There are two things to note here: a) we reverse the input sequence, $I$, and add a negative sign to the components corresponding to the velocities and the change in dimensions of the bounding boxes, and b) the auto-encoder is not pre-trained; this objective is a part of the overall objective function (Eq. 12).
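Note (a) above prescribes the reconstruction target: the input sequence reversed in time, with the four delta components negated so that the "velocities" remain consistent with the reversed direction. A small sketch of this target construction (the helper name is ours; the feature order follows Sec. IV-A.1):

```python
import torch

def autoencoder_target(I):
    """Target for Eq. 5. I: (batch, k, 8) with features ordered as
    (cx, cy, w, h, dcx, dcy, dw, dh); the last four are the delta terms."""
    rev = torch.flip(I, dims=[1])                            # reverse the time axis
    return torch.cat([rev[..., :4], -rev[..., 4:]], dim=-1)  # negate the deltas
```

The auto-encoder loss of Eq. 5 is then the mean absolute difference between the reconstruction and this target.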
2) Decoder Block:
This block comprises an LSTM, $D_{dec}$, and an FC layer, $FC_{dec}$, which work in a similar fashion to the decoder of the encoder block, with two important differences: a) it runs to predict the future and not to reproduce the input, and b) it predicts only the velocity and dimension change components, i.e. only the future instances of $\Delta cx$, $\Delta cy$, $\Delta w$ and $\Delta h$.

The working of this block is as follows. It takes the latent representation $Z_f$ from the encoder block and runs $p$ times, taking the same $Z_f$ as the input. At every iteration, it produces a hidden state vector $H_i$, where $i \in \{f+1, f+2, \ldots, f+p\}$, which is then fed to $FC_{dec}$, which maps it to a 4-dimensional vector, $V = (\Delta cx, \Delta cy, \Delta w, \Delta h)$, i.e. $V \in \mathbb{R}^4$. Formally, the above process can be defined as follows:

$\{H_i\}_{i=f+1}^{f+p} = D_{dec}(Z_f)$ (6)

$\{\hat{V}_i\}_{i=f+1}^{f+p} = FC_{dec}(\{H_i\}_{i=f+1}^{f+p})$ (7)

If we choose to apply supervision at this stage of the model, the supervising objective would be the following:

$\mathcal{L}_{traj\text{-}del} = \frac{\sum_{i=f+1}^{f+p} |\hat{V} \ominus V|}{p \times 4}$ (8)

where $V \in \mathbb{R}^{p \times 4}$ is the ground truth for the predicted velocity of the centroid ($\Delta cx$, $\Delta cy$) and dimension change ($\Delta w$, $\Delta h$) of the bounding boxes for the future $p$ frames.

Every time the decoder LSTM, $D_{dec}$, starts to decode the future sequence of a trajectory, its hidden state, $H_f$, is initialized with the final hidden state of the encoder LSTM, $H^e_f$, i.e. $H_f = H^e_f$. The motivation for doing this is that, as the future motion of the human is not going to be much different from its past motion, which is already encoded in the hidden state of the encoder, it makes sense to load the decoder LSTM with this knowledge before it even starts decoding. Informally, consider it as a way of transferring the physics of motion to the future decoder.
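A matching sketch of this block is given below. The paper specifies copying the encoder's final hidden state; carrying over the cell state as well is our assumption, as are the names and layer sizes:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Future decoder (Eqs. 6-7): runs p steps on the same latent Z_f and
    predicts per-step velocities (dcx, dcy, dw, dh). Sizes are illustrative."""
    def __init__(self, latent_dim=64, hidden_dim=512, out_dim=4, p=60):
        super().__init__()
        self.p = p
        self.dec_lstm = nn.LSTM(latent_dim, hidden_dim, batch_first=True)  # D_dec
        self.fc = nn.Linear(hidden_dim, out_dim)                           # FC_dec

    def forward(self, z, enc_state):
        # enc_state = (h, c) from the encoder LSTM: initializing with it
        # "transfers the physics of motion" already encoded from the past
        z_seq = z.unsqueeze(1).repeat(1, self.p, 1)   # same Z_f at every step
        h_dec, _ = self.dec_lstm(z_seq, enc_state)
        return self.fc(h_dec)                         # (batch, p, 4) velocities
```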
3) Trajectory Concatenation Layer:
This is the last layer of our model. It does not predict any new information and acts more like a regularizer. It is composed of a multivariate differentiable function, $G$, which converts the predicted future velocities of the centroids and the change in dimensions of the detection bounding boxes into a sequence of locations and dimensions of the future bounding boxes:

$\{\hat{O}_i\}_{i=f+1}^{f+p} = G(\{\hat{V}_i\}_{i=f+1}^{f+p}, I^{\nabla}_f)$ (9)

$\hat{O}_{f+i} = \begin{cases} I^{\nabla}_f \oplus \hat{V}_{f+i}, & \text{for } i = 1 \\ \hat{O}_{f+i-1} \oplus \hat{V}_{f+i}, & \text{for } i = 2, \ldots, p \end{cases}$ (10)

where, with a slight abuse of notation, $I^{\nabla}_f \in I$ represents the centroid and dimensions ($w$ and $h$) of the bounding box of the last input frame $f$, i.e. $I^{\nabla}_f = (cx_f, cy_f, w_f, h_f)$, and $\hat{O} \in \mathbb{R}^{p \times 4}$ is the sequence of centroid and dimension information of the predicted future $p$ bounding boxes; $\oplus$ represents element-wise vector addition.

The supervision is applied on the result of this layer. The presence of this layer yields better prediction accuracy (see Sec. V-D) with negligible overhead, as it does not have any learnable parameters. The supervising objective function is as follows:

$\mathcal{L}_{traj} = \frac{\sum_{i=f+1}^{f+p} |\hat{O} \ominus O|}{p \times 4}$ (11)

where $O \in \mathbb{R}^{p \times 4}$ is the ground truth centroid ($cx$, $cy$) and dimensions ($w$, $h$) of the bounding boxes in the predicted sequence of $p$ future frames.

Why does it help?
Supervising on this layer gives us better results because this layer, which is nothing but a multivariate differentiable function, generates multiple new constraints for every predicted vector without adding any extra free parameters to be learned. We believe that this helps the network learn better and hence produce better predictions (see Table V).
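Since Eq. 10 is a running sum of predicted velocities anchored at the last observed box, the whole layer reduces to a cumulative sum, which keeps it both differentiable and parameter-free. A sketch (function name ours):

```python
import torch

def trajectory_concatenation(v_hat, last_box):
    """Eqs. 9-10. v_hat: (batch, p, 4) predicted velocities;
    last_box: (batch, 4) = (cx_f, cy_f, w_f, h_f) of the last input frame."""
    # O_{f+i} = last_box + V_{f+1} + ... + V_{f+i}, i.e. an offset cumulative sum
    return last_box.unsqueeze(1) + torch.cumsum(v_hat, dim=1)
```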
B. Training Objective Function
We train our network model end-to-end by minimizing the following objective function, which is a linear combination of the objectives discussed above, $\mathcal{L}_{auto\text{-}enc}$ and $\mathcal{L}_{traj}$, in Eq. 5 and Eq. 11, respectively:

$\mathcal{L} = \alpha \cdot \mathcal{L}_{auto\text{-}enc} + \beta \cdot \mathcal{L}_{traj}$ (12)

where $\alpha \in \mathbb{R}^+$ and $\beta \in \mathbb{R}^+$ are hyper-parameters which decide the importance of the corresponding loss terms. We have also conducted an ablation study (see Sec. V-D) to evaluate the performance of the different objective functions (see Eq. 5, Eq. 8 and Eq. 11) and their combinations.
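A sketch of the combined objective is below; the default α = 1.0 and β = 2.0 follow the training details in Sec. V-B, while the function name and the use of plain mean absolute errors for Eqs. 5 and 11 are our reading of the L1 losses described there:

```python
import torch

def total_loss(I_hat, I_target, O_hat, O_gt, alpha=1.0, beta=2.0):
    """Eq. 12: weighted sum of the auto-encoder L1 loss (Eq. 5)
    and the future trajectory L1 loss (Eq. 11)."""
    l_auto = torch.mean(torch.abs(I_hat - I_target))  # Eq. 5
    l_traj = torch.mean(torch.abs(O_hat - O_gt))      # Eq. 11
    return alpha * l_auto + beta * l_traj
```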
C. Network Implementation Details

We implemented our network model using the PyTorch [42] deep learning framework. All the RNNs are implemented as single layer LSTMs with zero dropout. The final hidden state of the encoder LSTM of the encoder block is passed through a ReLU unit before being fed into the FC layer of the encoder block, which maps this hidden vector to the latent vector $Z_f$. This latent vector is then read by both decoder LSTMs, following which two FC layers map the decoder hidden states to the corresponding output dimensions: the FC after the decoder LSTM of the encoder block maps its hidden vectors back to 8-dimensional vectors resembling the input, whereas the FC after the decoder LSTM of the decoder block maps its hidden state vectors to 4-dimensional vectors. The 4-dimensional vectors are fed to our trajectory concatenation layer to produce the actual locations of the future bounding boxes. The auto-encoder branch is active only during training and is switched off at test time to save unnecessary time and compute.

V. RESULTS
A. Datasets
We evaluate our model on three recently proposed datasets: CityWalks [3], First Person Locomotion (FPL) [1] and Joint Attention for Autonomous Driving (JAAD) [10], all of which are captured in first person perspective. While we train only on CityWalks [3], we evaluate on all three. While CityWalks and FPL are quite similar in nature, JAAD serves a different purpose: it was mainly created for the study of the behaviour of traffic participants.
1) CityWalks [3]:
The CityWalks dataset comprises 358 video sequences captured by a hand-held camera in first person perspective. The video sequences are captured in 21 different cities of 10 European countries under varying weather conditions. All the videos are shot at a 30 Hz frame rate. The dataset provides two sets of detections: one obtained from the YOLOv3 [43] detector and the other acquired using Mask-RCNN [44]. In addition to detection bounding boxes, it also provides tracking information. We use only the Mask-RCNN detections for evaluating our work.
2) First Person Locomotion (FPL) [1]:
This dataset comprises multiple video sequences captured by people wearing chest-mounted cameras while walking in diverse environments with multiple moving humans, collectively spanning several hours of video with a large number of person observations. The video sequences are captured at 10 Hz (i.e. 10 frames per second). This dataset does not provide detection bounding boxes. However, it does provide the 2D poses of all the humans. As our model relies solely on detection bounding boxes, to use this dataset for our evaluation we convert the 2D poses into detection bounding boxes.
3) Joint Attention for Autonomous Driving (JAAD) [10]:
This video dataset was primarily created for the study of the behaviour of traffic participants. It is made up of videos captured by a wide-angle camera mounted behind the windshield, below the rear-view mirror, of two cars. All the video sequences are captured at a real-time frame rate, i.e. at 30 Hz. The dataset also comes with detection bounding boxes and track information for each pedestrian in the scene. As the dataset was created for behavioural analysis of traffic participants, it contains pedestrians exhibiting different kinds of motion behaviour: pedestrians may stop while walking and start walking again, or reduce or increase speed in the course of motion, and hence this dataset is not very close to our setting. However, we evaluate our network model on this dataset too by fine-tuning our network (trained on CityWalks [3]) on just 15% of its train set.
We trained our network on the CityWalks dataset [3]. Similar to [3], we split the entire dataset into three folds and perform 3-fold cross validation. At every evaluation, two of the three folds serve as the training set and the other as the test set. We tune the hyper-parameters only on the train folds and test on the test fold which that particular network has never seen. The reported performance is the average over all three train-test combinations (see Table I).

For training, the bounding box tracks in the train set are split into multiple 90-frame mini-tracks by sliding over each track with a stride of 30 frames. This way we obtain mini-trajectories of 3 second length. We train our model to predict the location and dimensions of bounding boxes 2 seconds into the future by observing the past 1 second of data. In other words, the network is trained to take the bounding box information of the past 30 frames and predict the centroid locations in the 60 future frames. The network is supervised with the training objective discussed in Sec. IV-B.

The entire network is trained end-to-end on a CPU (Intel Xeon CPU E5-2650 v4 at 2.20GHz) with 24 cores, without pre-training of any component. We do not use any GPU for training or testing. The network is trained in batches of 200 for 30 epochs with a starting learning rate of 0.00141. The learning rate is halved every 5 epochs. The hyper-parameters α and β in Eq. 12 were set to 1.0 and 2.0, respectively. The model is optimized with an L1 loss using the Adam optimizer [45] with no momentum or weight decay.

C. Performance Evaluation

1) Baseline models:

• Spatio-Temporal Encoder-Decoder (STED) [3]: A GRU and CNN based encoder-decoder architecture which relies on bounding box and optical flow information to forecast future bounding boxes. We train on CityWalks [3] to compare with this state-of-the-art model on CityWalks.

• First Person Localization (FPL) [1]: The model introduced in this work relies on the 2D pose of pedestrians extracted using OpenPose [46] and on ego-motion estimates of the camera obtained using [47] to predict the future locations of the humans. We compare with [1] by transferring zero-shot to the FPL dataset. One important thing to note is that this dataset is captured at 10 Hz while our model was trained on CityWalks, captured at 30 Hz.

• Dynamic Trajectory Prediction (DTP) [2]: Uses a CNN to forecast the future trajectory from past optical flow frames. To compare with DTP [2], we fine-tune our network on just 15% of the training samples of the JAAD dataset [10].
2) Results on CityWalks dataset [3]:
The performance of our model on the CityWalks dataset is presented in Table I, where we compare with all the models proposed by the current state-of-the-art [3] on this dataset. Similar to [3], our model was trained to predict a sequence of bounding box centroids for 60 time steps into the future by observing the bounding boxes of the past 30 time steps, i.e. we predict 2 seconds into the future; as discussed earlier, in contrast to us, [3] also takes optical flow as input.

It is clear from Table I that our simple RNN based architecture (trained on Mask-RCNN [44] detections) consistently performs better than the STED model [3] and all its variants. In the table, we also report how much we improve over the corresponding model in percentage (%), shown in the columns Improvement (ADE) and Improvement (FDE); this metric is computed as $|dm - om|/dm$, where $dm$ and $om$ are the performances of the other model and of our model, respectively. While we surpass the prediction metrics for all variants proposed in [3], it is interesting to see our model performing approximately 27% (ADE) and 16% (FDE) better than the BB-encoder variant of [3], as, just like us, this variant does not use optical flow and relies solely on bounding boxes. We believe this performance is mainly due to the presence of the extra decoder in the encoding phase and the trajectory concatenation layer.

TABLE I: Results on the CityWalks dataset [3]. BB-encoder and OF-encoder of [3] take only bounding box and only optical flow information, respectively. The term "both" means the results are averaged over networks trained with YOLOv3 and Mask-RCNN detections. The Improvement columns report how much better our model is compared to each model on the corresponding metric. All the models observe 1 s of past data and predict 2 s into the future at a frame rate of 30 Hz.

Model | ADE | FDE | Improvement (ADE) | Improvement (FDE)
STED (Mask-RCNN) | 26.0 | 46.9 | 16.88% | 4.54%
STED (YOLOv3) | 27.4 | 49.8 | 21.13% | 10.1%
BB-encoder (both) | 29.6 | 53.2 | 27.0% | 15.85%
OF-encoder (both) | 27.5 | 50.3 | 21.41% | 11.0%
Ours (Mask-RCNN) | 21.61 | 44.77 | – | –
3) Zero-shot transfer on FPL dataset [1]:
To demonstrate the efficacy of our model trained on CityWalks [3], we directly deploy it on the test set of the FPL dataset [1] and compare with the models proposed in [1] (see Table II). One important thing to note is that this dataset is captured at 10 Hz while CityWalks [3] is captured at 30 Hz. To evaluate, just like [1], we take a sequence of bounding boxes from the past 10 frames and predict for 10 and 20 future frames. As presented in Table II, we perform better than the constant velocity, nearest neighbour, and Social LSTM [37] based methods by a considerable margin (note that these numbers were directly acquired from [1] as we test on the same test set provided by [1]). Additionally, we also perform better than a variant (FPL (Lin)) of the FPL (Main) model [1] which takes only the centroids of the pedestrians. We, however, underperform when compared to the main model of [1], FPL (Main), which takes the 2D pose, centroid location and scale of the pedestrian, and the ego-motion of the camera; surprisingly, we outperform the FPL (Main) model [1] when predicting a longer sequence.
4) Results on JAAD dataset [10]:
The primary objective of evaluating our model on this (not so similar) dataset was to see how well our model handles different kinds of behaviour driven motion of pedestrians. Note that this dataset was created for studying the behaviour of traffic participants (we consider only humans in this context). In this dataset, humans can be observed moving in ways which one does not encounter in the FPL [1] or CityWalks [3] datasets, e.g. humans slowing down or stopping after walking some distance, or accelerating after a few time steps. As expected, our model does not directly transfer to this dataset, as shown in Table III. However, after fine-tuning our model with just 15% of the training samples (randomly sampled from sequences 1-250), it performs on par with the state-of-the-art method [2] on the test set (sequences 251-346) of this dataset. The method proposed in [2] takes optical flow frames to predict future locations of pedestrians. Again, as the test sets for us and [2] are the same, we directly acquire the prediction performance for the Constant Acceleration and Constant Velocity methods from [2].
TABLE II: Results for zero-shot transfer on the FPL dataset [1], i.e. our network did not see the FPL dataset during training. FPL (Main) is the primary model of [1] that takes camera ego-motion, scale, centroid and pose of the pedestrian as input. FDE@t is the Euclidean distance between the prediction and the ground truth at the t-th future frame. All the models below observe 1 s of past data (in our case only bounding boxes) and predict 1 s and 2 s into the future, as FPL is recorded at a 10 Hz frame rate.

Model | FDE@10 | FDE@20
ConstVel | 107.15 | –
NNeighbor | 98.38 | –
SocialLSTM | 118.10 | 223.16
FPL (Lin) | 88.16 | –
FPL (Xin) | 81.86 | –
FPL (Main) | |
Ours (zero-shot) | 85.28 |
TABLE III: Results on the JAAD dataset [10] with 15% fine-tuning. We compare with the DTP model [2]. FDE@t is the Euclidean distance between the prediction and the ground truth at the t-th future frame. To compare with DTP [2], we down-sampled the frame rate to 15 Hz. All the errors are reported in pixels.

Model | FDE@5 | FDE@10 | FDE@15
Constant Acceleration (CA) | 15.3 | 28.3 | 52.8
Constant Velocity (CV) | 16.0 | 26.4 | 47.5
DTP (5 optical flow frames) | 9.4 | |
Ours (10 bounding boxes, zero-shot transfer) | 20.39 | 43.88 | 70.41
Ours (6 bounding boxes, 15% fine-tuning) | | |
Ours (10 bounding boxes, 15% fine-tuning) | | |
5) Time and memory efficiency:
Our network is capable of forecasting trajectories at a rate of 78 trajectories per second, or 4684 fps, on a CPU (Intel Xeon CPU E5-2650 v4 at 2.20GHz) with more than 4 cores (see Table IV). This is an extremely high rate when compared with the state-of-the-art [3], which includes a CNN for computing optical flow that by itself takes considerable time per frame. In other words, even if we ignore the overhead of the other components of STED [3], it still runs at only 8.1 trajectories per second, meaning we are approximately 9.6x faster than STED [3] while also performing better. At the same time, our model is extremely light-weight, at only 17.4 MB in size.
TABLE IV: Time efficiency of our model for different numbers of CPU cores.

CPU (cores) | Trajectories Per Second (TPS) | Faster than SOTA [3] (TPS) | FPS
1 | 38.91 | 4.79x | 2334
2 | 54.05 | 6.65x | 3243
4 | 65.87 | 8.10x | 3952
> 4 | 78 | 9.6x | 4684

TABLE V: Ablation study of our model on the CityWalks dataset [3]. L_traj-del: neither the auto-encoder nor the trajectory concatenation layer is active; L_traj: the trajectory concatenation layer is active but the auto-encoder is off; Ours (L_traj + L_auto-enc): both are active.

Input | Predicted | L_traj-del (ADE / FDE) | L_traj (ADE / FDE) | Ours, L_traj + L_auto-enc (ADE / FDE)
30 | 15 | 6.49 / 11.17 | – / 10.97 | 6.46 / –
30 | 30 | 11.23 / 20.22 | 10.99 / 19.61 | – / –
30 | 45 | 16.24 / 31.78 | 15.81 / 30.92 | – / –
30 | 60 | 21.77 / 44.45 | 21.27 / 44.14 | – / –

D. Ablation Study
In this section we perform a thorough ablation study to understand the impact of the different components of our model on CityWalks [3] (see Table V). Specifically, we train three models: 1) L_traj-del: with no decoder in the encoder block, i.e. without any auto-encoder loss, and no trajectory concatenation layer; 2) L_traj: with the trajectory concatenation layer but without the decoder in the encoder block (Sec. IV-A.1); and 3) Ours (L_traj + L_auto-enc): our main model comprising all the components. We take a sequence of bounding boxes for the past 30 time steps and predict the future 15, 30, 45, and 60 frames. Note that we report the results for the best split. Table V shows that each component adds to the performance and reduces the displacement error in all cases.

VI. CONCLUSIONS
We presented a simple yet efficient and light-weight RNN based network architecture for predicting the motion of humans in first person monocular videos. We discussed how having an auto-encoder in the encoding phase and a regularizing layer at the end helps us achieve better accuracy. We showed that our method, which relies entirely on detection bounding boxes, not only performs better on the dataset on which it is trained, but is also capable of transferring zero-shot to a different dataset. We also demonstrated that by fine-tuning on 15% of the train set of a not so similar dataset, our model is capable of performing on par with (even marginally better than) the state-of-the-art method which was trained on that dataset. Also, by virtue of the simplicity of our network, our model can predict more accurate trajectories almost 9.6x faster than the state-of-the-art while running only on a CPU, and yet be only 17.4 MB in size.
REFERENCES

[1] T. Yagi, K. Mangalam, R. Yonetani, and Y. Sato, "Future person localization in first-person videos," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[2] O. Styles, A. Ross, and V. Sanchez, "Forecasting pedestrian trajectory with machine-annotated training data," in IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019, pp. 716–721.
[3] O. Styles, V. Sanchez, and T. Guha, "Multiple object forecasting: Predicting future object locations in diverse environments," in The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 690–699.
[4] M. Huynh and G. Alaghband, "AOL: Adaptive online learning for human trajectory prediction in dynamic video scenes," arXiv preprint arXiv:2002.06666, 2020.
[5] V. Prasad and B. Bhowmick, "SfMLearner++: Learning monocular depth and ego-motion using meaningful geometric constraints," in WACV, 2019.
[6] A. Mallik, B. Bhowmick, and S. Alam, "A multi-sensor information fusion approach for efficient 3D reconstruction in smart phone," in IPCV, 2015.
[7] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[8] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[9] N. Srivastava, E. Mansimov, and R. Salakhudinov, "Unsupervised learning of video representations using LSTMs," in International Conference on Machine Learning, 2015, pp. 843–852.
[10] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, "Are they going to cross? A benchmark dataset and baseline for pedestrian crosswalk behavior," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 206–213.
[11] ——, "It's not all about size: On the role of data properties in pedestrian detection," in ECCVW, 2018.
[12] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, "Activity forecasting," in European Conference on Computer Vision. Springer, 2012, pp. 201–214.
[13] C. Fan, J. Lee, and M. S. Ryoo, "Forecasting hand and object locations in future frames," CoRR, abs/1705.07328, vol. 2, 2017.
[14] J. Liang, L. Jiang, J. C. Niebles, A. G. Hauptmann, and L. Fei-Fei, "Peeking into the future: Predicting future person activities and locations in videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5725–5734.
[15] Y. B. Ng and B. Fernando, "Forecasting future sequence of actions to complete an activity," arXiv preprint arXiv:1912.04608, 2019.
[16] C. Sun, A. Shrivastava, C. Vondrick, R. Sukthankar, K. Murphy, and C. Schmid, "Relational action forecasting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 273–283.
[17] N. Rhinehart and K. M. Kitani, "First-person activity forecasting with online inverse reinforcement learning," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3696–3705.
[18] W. Mao, M. Liu, M. Salzmann, and H. Li, "Learning trajectory dependencies for human motion prediction," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[19] H.-k. Chiu, E. Adeli, B. Wang, D.-A. Huang, and J. C. Niebles, "Action-agnostic human pose forecasting," in WACV. IEEE, 2019, pp. 1423–1432.
[20] D. Pavllo, D. Grangier, and M. Auli, "QuaterNet: A quaternion-based recurrent model for human motion," arXiv preprint arXiv:1805.06485, 2018.
[21] H. Xue, D. Q. Huynh, and M. Reynolds, "SS-LSTM: A hierarchical LSTM model for pedestrian trajectory prediction," in WACV. IEEE, 2018, pp. 1186–1194.
[22] M. Lisotto, P. Coscia, and L. Ballan, "Social and scene-aware trajectory prediction in crowded spaces," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[23] N. Nikhil and B. Tran Morris, "Convolutional neural network for trajectory prediction," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[24] Y. Huang, H. Bi, Z. Li, T. Mao, and Z. Wang, "STGAT: Modeling spatial-temporal interactions for human trajectory prediction," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[25] H. Xue, D. Huynh, and M. Reynolds, "Location-velocity attention for pedestrian trajectory prediction," in WACV. IEEE, 2019, pp. 2038–2047.
[26] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese, "SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[27] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, "Social GAN: Socially acceptable trajectories with generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2255–2264.
[28] I. Hasan, F. Setti, T. Tsesmelis, A. Del Bue, F. Galasso, and M. Cristani, "MX-LSTM: Mixing tracklets and vislets to jointly forecast trajectories and head poses," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6067–6076.
[29] I. Hasan, F. Setti, T. Tsesmelis, V. Belagiannis, S. Amin, A. Del Bue, M. Cristani, and F. Galasso, "Forecasting people trajectories and head poses by jointly reasoning on tracklets and vislets," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[30] I. Hasan, F. Setti, T. Tsesmelis, A. Del Bue, M. Cristani, and F. Galasso, "'Seeing is believing': Pedestrian trajectory forecasting using visual frustum of attention," in WACV. IEEE, 2018, pp. 1178–1185.
[31] C. Choi and B. Dariush, "Looking to relations for future trajectory forecast," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 921–930.
[32] V. Karasev, A. Ayvaci, B. Heisele, and S. Soatto, "Intent-aware long-term prediction of pedestrian motion," in ICRA. IEEE, 2016, pp. 2543–2549.
[33] H. Su, J. Zhu, Y. Dong, and B. Zhang, "Forecast the plausible paths in crowd scenes," in Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017, pp. 2772–2778.
[34] A. Vemula, K. Muelling, and J. Oh, "Modeling cooperative navigation in dense human crowds," in ICRA. IEEE, 2017, pp. 1685–1692.
[35] ——, "Social attention: Modeling attention in human crowds," in ICRA. IEEE, 2018, pp. 1–7.
[36] P. Felsen, P. Lucey, and S. Ganguly, "Where will they go? Predicting fine-grained adversarial multi-agent motion using conditional variational autoencoders," in The European Conference on Computer Vision (ECCV), September 2018.
[37] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: Human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971.
[38] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[39] H. Manh and G. Alaghband, "Scene-LSTM: A model for human trajectory prediction," arXiv preprint arXiv:1808.04018, 2018.
[40] A. Rudenko, L. Palmieri, M. Herman, K. M. Kitani, D. M. Gavrila, and K. O. Arras, "Human motion trajectory prediction: A survey," arXiv preprint arXiv:1905.06113, 2019.
[41] K. Mangalam, E. Adeli, K.-H. Lee, A. Gaidon, and J. C. Niebles, "Disentangling human dynamics for pedestrian locomotion forecasting with noisy supervision," in The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 2784–2793.
[42] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
[43] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[44] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[45] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[46] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields," arXiv preprint arXiv:1812.08008, 2018.
[47] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.