Towards Robust Human Activity Recognition from RGB Video Stream with Limited Labeled Data
Krishanu Sarker
Dept. of Computer Science, Georgia State University, Atlanta, GA, [email protected]
Mohamed Masoud
Dept. of Computer Science, Georgia State University, Atlanta, GA, [email protected]
Saeid Belkasim
Dept. of Computer Science, Georgia State University, Atlanta, GA, [email protected]
Shihao Ji
Dept. of Computer Science, Georgia State University, Atlanta, GA, [email protected]
Abstract—Human activity recognition based on video streams has received considerable attention in recent years. Due to the lack of depth information, RGB video based activity recognition performs poorly compared to RGB-D video based solutions. On the other hand, acquiring depth information, inertial data, etc. is costly and requires special equipment, whereas RGB video streams are available in ordinary cameras. Hence, our goal is to investigate whether similar or even higher accuracy can be achieved with the RGB-only modality. In this regard, we propose a novel framework that couples skeleton data extracted from RGB video and a deep Bidirectional Long Short Term Memory (BLSTM) model for activity recognition. A big challenge of training such a deep network is the limited training data, and exploring the RGB-only stream significantly exacerbates the difficulty. We therefore propose a set of algorithmic techniques to train this model effectively, e.g., data augmentation, dynamic frame dropout and gradient injection. The experiments demonstrate that our RGB-only solution surpasses the state-of-the-art approaches that all exploit RGB-D video streams by a notable margin. This makes our solution widely deployable with ordinary cameras.
Keywords: human action recognition; computer vision; deep learning; LSTM; limited data; RGB
I. INTRODUCTION
Human action is inherently complex due to inter-class affinity and intra-class diversity. Recognizing activity is hence a difficult task, which has attracted numerous researchers' attention [1], [2], [3]. Even though state-of-the-art image classification methods have surpassed human-level accuracy [4], the performance of methods proposed in the literature for activity recognition/classification is still unsatisfactory, especially methods based on RGB video streams [1].

Despite many efforts, human action recognition from RGB video streams still lags in accuracy behind the progress made with multi-modal data that includes depth-enabled RGB-D video. One of the reasons is that multi-modal datasets provide a larger quantity of information, and the extracted depth information gives precise detection of movement in the scene. However, depth-enabled cameras are expensive and require special settings for many possible use-cases of human action recognition. The economic factor and the installation complexity are the main reasons that most surveillance systems use RGB cameras. Therefore, focusing our attention on the more popular RGB videos for detecting and classifying human motion would benefit users and real-life applications.

Traditionally, works on RGB video streams are based on handcrafted features [5], [6], [7], [8], [9]. These approaches are highly data dependent. Due to this problem, these methods are very brittle and hard to deploy in real life in spite of the higher accuracy they achieve. With the advent of deep learning, methods were proposed where features could be automatically extracted [10], [11], [12], [13]. The successful use of deep learning for image classification inspired researchers to deploy such methods in video classification [2]. These methods use raw RGB frames, often coupled with motion, to learn the temporal features. Even though these methods automate the feature extraction task, they often struggle to achieve high performance due to complex backgrounds and partial occlusion of subjects in video streams. Hence, a more robust, automated action recognition system is yet to be developed.

Approaches based on multiple modalities of data [14], [15], [16], [17], however, achieve higher accuracy even with complex actions. In these approaches, skeleton information extracted from depth images has proven very efficient in extracting important features of an action. Inspired by this, in this paper we propose a technique that aims at separating salient features from the scene by extracting skeleton key-points from RGB-only video streams. This is a distinct departure from all previous approaches that either use the raw RGB video stream as input directly or use skeleton key-points extracted from depth information for activity recognition. Specifically, we utilize the OpenPose API [18] as a black box to extract the skeleton key-points from each RGB frame. These key-point features are then fed into a Bidirectional Long Short Term Memory (BLSTM) based model to learn the spatio-temporal representations, which are subsequently classified by a softmax classifier.

We use the RGB-only modality for our experimental evaluations, whereas state-of-the-art methods utilize multiple available modalities (depth, inertial and skeleton data). This essentially reduces the training data to one fourth for our experiments. Hence, we are dealing with one of the key challenges of deep learning, i.e., training with limited labeled data. To train the deep network effectively, we explore data
augmentation and a few algorithmic approaches. Experiments on two popular and challenging benchmarks validate the effectiveness of these techniques, and our RGB-only solution even surpasses the state-of-the-art approaches that all exploit RGB-D videos. We believe that the proposed RGB-only scheme is more cost effective and highly competitive compared to RGB-D based solutions and therefore widely deployable. Our key contributions are summarized in the following:
• There exist previous methods in the literature that are either based on skeletons extracted from depth data or purely based on raw RGB data for human activity recognition. To the best of our knowledge, we are the first to leverage skeleton key-points extracted from RGB-only videos for human activity recognition.
• We leverage data augmentation to tackle the problem of limited labeled data in deep learning and compensate for the data sparsity issue caused by using the RGB-only modality.
• Additionally, we explore a few algorithmic approaches such as Dynamic Frame Dropout and Gradient Injection to effectively train the deep architecture.
• We evaluate our proposed framework on two popular and challenging benchmarks, and demonstrate for the first time that using RGB-only streams we can surpass the state-of-the-art RGB-D based solutions, which makes our RGB-only solution widely deployable.
The rest of the paper is organized as follows. Related works are discussed in Section II. We present our proposed architecture in Section III and its effective training in Section IV. Experimental results with comparison to the state of the art are presented in Section V, followed by a discussion and future work in Section VI.
II. RELATED WORKS
Human activity recognition has been extensively studied in recent years [1], [2], [3]. Many state-of-the-art methods extract handcrafted features from RGB videos and rely on traditional shallow classifiers for activity classification [5], [6], [7], [8], [9]. For example, Schuldt et al. [5] present a method that identifies spatio-temporal interest points and classifies actions using SVMs. Zhang et al. [6] introduce the concept of motion context to capture spatio-temporal structure. Liu and Shah [7] consider the correlation among features. Bregonzio et al. [8] propose to calculate the difference between subsequent frames to estimate the focus of attention. These methods often achieve very high accuracy. However, since handcrafted features are highly data dependent, these methods are not very robust to changes of environment. We instead utilize OpenPose to extract the salient skeleton features from raw RGB frames, which makes the proposed method less data dependent, robust to different environments and therefore widely deployable in real-life applications.

Deep learning based approaches for human activity recognition have also been explored extensively [10], [11], [12]. For example, Baccouche et al. [10] propose to use a Convolutional Neural Network (CNN) to extract spatial features and then use an LSTM to learn the temporal features. Ji et al. [11] present a 3D CNN to classify actions, which learns the inherent temporal structure among consecutive frames. A two-stream CNN based method is proposed in [12]. In contrast to the state-of-the-art handcrafted feature design approaches, deep learning based approaches use an end-to-end learning pipeline and extract feature representations automatically from data. However, these methods often fail to achieve higher accuracy as the high-level features extracted from a CNN are blurry and incapable of capturing the sharp changes in video streams. This is primarily because convolution and pooling try to accurately capture the overall structure, while repetitive convolution and pooling operations often ignore the fine-grained details.

In order to solve the aforementioned issues, skeleton information from RGB-D video has been widely studied to improve recognition accuracy [15], [16], [17], [19], [20]. Observations from the seminal work by Johansson [21] suggest that the movement of a few human joints is sufficient to recognize an action. Recently, Liu et al. [20] propose a CNN based approach leveraging skeleton data. In [19] the authors propose a hierarchical bidirectional Recurrent Neural Network (RNN) to classify human actions. Methods proposed in [22] and [15] utilize skeleton data on three CNN streams that are pretrained on the large ImageNet dataset [23]. Li et al. [16] use view-invariant features from skeleton data to improve over [22] and [15], using similar four-stream pretrained models. All these methods utilize skeleton data, either extracted from depth data or captured by Kinect. Inspired by these works, we adopt a bidirectional LSTM in our method; instead of extracting skeleton data from depth information as in other methods, we extract skeleton key-points from RGB frames, which are available in ordinary digital cameras.

In addition, there exist a few CNN and LSTM based approaches for activity recognition from RGB-only data [24], [25]. However, none of them pay special attention to the issue of training deep networks effectively on limited labeled data.
We emphasize algorithmic approaches to address the training issues of deep networks with limited training data, alleviating the overfitting and vanishing gradient problems. Enhanced by these techniques, our RGB-only solution is able to surpass the state-of-the-art methods that all exploit RGB-D streams.
III. METHODOLOGY
In this section, we present an end-to-end framework for human activity recognition from RGB video containing human silhouettes. To make our discussion self-contained, we review some important concepts in the following subsections.
Figure 1. Overview of the proposed method.
A. Overview
Our proposed architecture aims to classify human actions from RGB-only streams to make our approach most amenable to ordinary cameras. We formulate our problem as learning the mapping F : x → ℓ, where x is the raw video and ℓ is the collection of action categories. After training, F is used to classify the test samples.

Fig. 1 shows the overall pipeline of the proposed method. First, we extract pose key-points of the human silhouette from the input raw RGB video using the OpenPose API [18]. We then preprocess the extracted pose key-points to improve the quality of the feature representations. After preprocessing, we use a variety of data augmentation techniques on the extracted key-points to increase the training data size (and therefore mitigate the problem of data scarcity). In the end, the augmented training set is used to train our classifier.

We use a deep BLSTM [26] network coupled with an MLP [27] as our classifier. Overfitting is a major drawback of LSTMs when dealing with small datasets. In addition to data augmentation, we therefore explore additional regularization techniques, such as dropout and L2 regularization, to prevent our model from overfitting. We also propose Dynamic Frame Dropout to reduce the redundant frames of a video and improve the robustness of the BLSTM classifier. To mitigate the vanishing gradient issue of LSTM, we introduce Gradient Injection to improve gradient flow. We discuss each of these components in greater detail in the following subsections.

B. OpenPose
OpenPose [18] is an open source API that can be used to detect the 2D poses of multiple human subjects in an image. The API leverages a novel two-stream multi-stage CNN, which allows it to work in real time. The methodology proposed in [18] was ranked number one in the COCO 2016 keypoints challenge. The input of the architecture is a raw RGB image and the output is 15 or 18 pose key-points along with the part-joining edges. More details about the architecture and working principle can be found in [18]. In our work, we treat OpenPose as a black box with raw video frames as inputs and 18 pose key-points per person as output.
C. LSTM
Long Short-Term Memory (LSTM) [26] is a descendant of the Recurrent Neural Network (RNN), especially designed to capture long-range dependencies when modeling sequential data. RNNs, in general, have proven very successful in modeling sequences that have strong temporal dependency. However, the vanishing gradient problem makes vanilla RNNs hard to train [28]. LSTM mitigates this issue by introducing non-linear gates regulating the information flow. In addition, a vanilla LSTM can only learn from past context, whereas a Bidirectional LSTM (BLSTM) [29] can learn both from past and from future context by utilizing forward and backward layers. For the human activity recognition task, we found that BLSTM is a more suitable architecture than vanilla LSTM, as incorporating long-term dependency in both directions in general helps improve the learning of sequential data.
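To make the difference concrete, the following minimal Keras snippet (layer sizes are illustrative, not our exact configuration) shows that a BLSTM concatenates the forward and backward hidden states, doubling the per-timestep feature dimension compared to a vanilla LSTM.

```python
# Tiny sketch contrasting a vanilla LSTM with a Bidirectional LSTM in Keras.
# Shapes and unit counts are illustrative assumptions.
import tensorflow as tf

x = tf.random.normal((1, 30, 36))          # (batch, timesteps, features)
lstm = tf.keras.layers.LSTM(64, return_sequences=True)
blstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))
print(lstm(x).shape)    # (1, 30, 64)  -> past context only
print(blstm(x).shape)   # (1, 30, 128) -> past and future context concatenated
```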
D. Preprocessing
The preprocessing step is the first step of our end-to-end pipeline, where the raw video frames are fed into the OpenPose API. The output of OpenPose for each video frame is a matrix of shape (n_pose, (a, b), c). Here, n_pose is the number of pose key-points, (a, b) are the coordinates of a key-point in the Cartesian plane and c is the confidence score of the respective key-point. To simplify our problem, we put a constraint that each frame can contain at most one person, and hence the value of n_pose here is 18. When all pose key-points are extracted from a video, we use a filter to set the pose key-point values that have confidence lower than a threshold value, Θ, to zero. Later, we mask these zero-valued key-points in order to avoid learning from these points. Afterwards, the pose matrix is flattened and converted into a vector, Λ, of size n_pose × 2, excluding the confidence value. We concatenate the pose vectors of all frames into a 2-dimensional matrix of shape (n_frame, v), where n_frame is the number of frames in the video and v is the length of the pose vector Λ.
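The sketch below illustrates this preprocessing, assuming the OpenPose output has already been parsed into one (18, 3) array of (x, y, confidence) rows per frame; the threshold value THETA is a placeholder, since its value is not fixed here.

```python
# Sketch of the preprocessing step, assuming parsed OpenPose output.
# THETA is an assumed confidence threshold, not a value from the paper.
import numpy as np

N_POSE = 18
THETA = 0.1  # assumed confidence threshold

def preprocess_video(frames_keypoints):
    """frames_keypoints: list of (N_POSE, 3) arrays, one per video frame.
    Returns a (n_frame, 2 * N_POSE) matrix of (x, y) coordinates."""
    pose_vectors = []
    for kp in frames_keypoints:
        kp = kp.copy()
        # Zero out low-confidence key-points; these zeros are masked later
        # so the model does not learn from unreliable detections.
        kp[kp[:, 2] < THETA, :2] = 0.0
        # Drop the confidence column and flatten the (x, y) pairs into a vector.
        pose_vectors.append(kp[:, :2].reshape(-1))
    return np.stack(pose_vectors, axis=0)

# Example: a 40-frame clip with random key-points.
video = [np.random.rand(N_POSE, 3) for _ in range(40)]
X = preprocess_video(video)   # shape (40, 36)
```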
E. Proposed Network Architecture
Our proposed deep architecture combines deep BLSTM layers and an MLP. We use five consecutive BLSTM layers with dropout to regularize the model training. We apply Batch Normalization (BN) after each BLSTM layer to keep the data normalized throughout the pipeline. We feed the output of the deep BLSTM layers to the MLP consisting of two Dense layers. For the intermediate hidden BLSTM and Dense layers, we utilize the Parametric Rectified Linear Unit (PReLU) [30] activation. We use the softmax function for the final output layer to produce a probabilistic score for each class. Categorical cross-entropy is used to measure the loss of our proposed network, and the RMSprop optimizer [31] is used to minimize the loss function.
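A minimal Keras sketch of such an architecture is given below; the unit counts, dropout rate and sequence length are illustrative assumptions (not the values used in our experiments), and masking of zero-valued key-points as well as the Gradient Injection head of Section IV are omitted for brevity.

```python
# Illustrative sketch: five BLSTM layers with dropout and Batch Normalization,
# followed by an MLP with PReLU and a softmax output, trained with
# categorical cross-entropy and RMSprop. Hyper-parameters are assumed values.
import tensorflow as tf

def build_model(timesteps=100, features=36, num_classes=27,
                lstm_units=128, dropout=0.5):
    inputs = tf.keras.Input(shape=(timesteps, features))
    x = inputs
    for i in range(5):
        return_seq = i < 4  # the last BLSTM layer collapses the time axis
        x = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(lstm_units, return_sequences=return_seq,
                                 dropout=dropout))(x)
        x = tf.keras.layers.BatchNormalization()(x)
    # MLP head: one hidden Dense layer with PReLU, then a softmax output layer.
    x = tf.keras.layers.Dense(128)(x)
    x = tf.keras.layers.PReLU()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.RMSprop(),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```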
Figure 2. BLSTM with Dynamic Frame Dropout and Gradient Injection.
IV. EFFECTIVE TRAINING OF BLSTM
It is challenging to train the BLSTM architecture, as the number of parameters can easily exceed several hundred million, while the number of training samples available for model parameter estimation is typically very small (e.g., 2-3 orders of magnitude lower). Therefore, special attention is needed to train the BLSTM effectively; otherwise overfitting and vanishing gradients can quickly plague the learning process. In the following, we discuss a few techniques we explored to train the deep model effectively, with the high-level scheme illustrated in Fig. 2.
A. Dynamic Frame Dropout
We propose Dynamic Frame Dropout (DFD) to reduce data redundancy. Different actions require different time spans, and there is often redundant information in consecutive frames, so taking all frames into account actually obscures crucial information and hampers the learning. Techniques like randomly dropping frames or dropping every n-th frame are often used in state-of-the-art methods. However, doing so naively may result in the loss of important information. Instead of randomly dropping frames, DFD drops frames only if they contain information that is almost redundant with their preceding frames. This not only reduces the computational complexity but also introduces stochasticity and helps regularize the model training in a similar spirit to dropout.

Specifically, we measure the redundancy between two consecutive frames by computing the Euclidean distance between their feature vectors, i.e., the pose key-points of the two consecutive frames. A lower distance corresponds to similarity, and a higher distance means the frames actually have meaningful differences. Empirically, we set a cutoff threshold ĉ = 15. If d is the distance between a frame and its preceding frame and d < ĉ, then we drop that frame. According to our experiments that follow, this setting of ĉ drops 20 to 25 frames per video that carry information with minimal significance.
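A short sketch of DFD under these definitions is shown below; it operates on the pose matrix produced by the preprocessing step and uses the cutoff ĉ = 15 stated above.

```python
# Sketch of Dynamic Frame Dropout: drop a frame when its pose vector is
# nearly identical (Euclidean distance below c_hat) to the preceding frame.
import numpy as np

C_HAT = 15.0  # cutoff threshold from the text

def dynamic_frame_dropout(pose_matrix, c_hat=C_HAT):
    """pose_matrix: (n_frame, v) array of flattened pose key-points.
    Returns the matrix with near-redundant consecutive frames removed."""
    kept = [pose_matrix[0]]
    for prev, curr in zip(pose_matrix[:-1], pose_matrix[1:]):
        d = np.linalg.norm(curr - prev)
        if d >= c_hat:          # keep only frames that differ meaningfully
            kept.append(curr)
    return np.stack(kept, axis=0)
```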
B. Gradient Injection
Although LSTM addresses the vanishing gradient problem of the vanilla RNN, it itself faces this issue to some degree when training deep models [32]. A many-to-one LSTM architecture is often used as the final layer of a network for video classification. This creates a dependency on processing the whole video sequence before classification can be performed. However, a video can often be clearly classified without having to see all the frames till the end. Hence, to improve gradient flow and avoid the vanishing gradient problem, we connect the MLP classification layer to the last K time steps, which allows the model to classify a video by incorporating information from multiple steps. When back-propagating the error to update model parameters, this also allows the gradient to be propagated to earlier time steps and mitigates the vanishing gradient problem. We call this technique Gradient Injection (GI). In other words, we utilize a many-to-many architecture of LSTM at the top layer to allow gradients to flow from multiple time steps, consequently reducing the vanishing gradient problem. Moreover, as outputs from multiple time steps are now available, this creates an ensemble of multiple outputs and reduces the dependency on all the video frames.
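The sketch below illustrates one way to realize Gradient Injection on top of the BLSTM output sequence: a shared classifier is attached to the last K time steps and the K predictions are averaged into an ensemble, so the loss gradient enters the recurrence at multiple points. The value of K and the feature size are assumptions, not values reported here.

```python
# Hypothetical Gradient Injection head: classify from the last K time steps
# of the recurrent output and average the per-step predictions.
import tensorflow as tf

def gradient_injection_head(timesteps=100, features=256, num_classes=27, k=8):
    seq = tf.keras.Input(shape=(timesteps, features))   # BLSTM output sequence
    # Keep only the last K time steps.
    last_k = tf.keras.layers.Lambda(lambda t: t[:, -k:, :])(seq)
    # One shared softmax classifier applied at each of the K steps
    # (a many-to-many head instead of many-to-one).
    per_step = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(num_classes, activation="softmax"))(last_k)
    # Averaging the K predictions forms an ensemble; during training the loss
    # gradient flows back through all K time steps, not only the final one.
    probs = tf.keras.layers.GlobalAveragePooling1D()(per_step)
    return tf.keras.Model(seq, probs)

head = gradient_injection_head()
head.summary()
```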
C. Data Augmentation
Training a deep network with a limited amount of labeled training data is a major challenge in the supervised learning paradigm. Our goal of achieving state-of-the-art performance with the RGB-only data modality faces the same brick wall: insufficient training data. According to our problem formulation, we only leverage the RGB data modality. Data augmentation has proven very successful in supervised learning for image analysis. Inspired by this, we have explored several data augmentation techniques to address the data scarcity problem. In our case, instead of the raw input video, we take the skeleton key-point features as the input for data augmentation. We use translation, scaling and random noise to augment the skeleton data. To keep the augmentation consistent throughout a single sample, we apply the same transformation to every key-point frame of that sample. In the experiments that follow we evaluate the significance of each of these techniques when training deep networks using limited training data.
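A sketch of this augmentation is given below; the translation, scale and noise ranges are illustrative assumptions, and one transformation is drawn per sample so that all frames of the sample are transformed consistently.

```python
# Skeleton-level augmentation: random translation, scaling and additive noise,
# drawn once per sample and applied to every frame of that sample.
# Parameter ranges below are assumptions, not values from the paper.
import numpy as np

def augment_sample(pose_matrix, rng=None):
    """pose_matrix: (n_frame, 2 * n_pose) array of flattened (x, y) key-points."""
    rng = rng or np.random.default_rng()
    n_frame, v = pose_matrix.shape
    xy = pose_matrix.reshape(n_frame, v // 2, 2)
    shift = rng.uniform(-10.0, 10.0, size=2)        # translation (assumed range)
    scale = rng.uniform(0.9, 1.1)                   # isotropic scaling (assumed range)
    noise = rng.normal(0.0, 0.5, size=(v // 2, 2))  # per-joint jitter shared by all frames
    # Note: in a full implementation, masked (zero-valued) key-points should stay zero.
    augmented = xy * scale + shift + noise          # broadcast over frames
    return augmented.reshape(n_frame, v)

def augment_dataset(samples, copies=3):
    """Return the original samples plus `copies` augmented versions of each."""
    out = list(samples)
    for s in samples:
        out.extend(augment_sample(s) for _ in range(copies))
    return out
```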
V. EXPERIMENTAL RESULTS
The primary goal of this paper is to show that, by using the RGB-only data modality with limited training data, we can achieve similar or higher accuracy on the action recognition task than the state-of-the-art methods that use RGB-D video streams. We have tested our proposed method on two widely used datasets, KTH [5] and UTD-MHAD [14]. We focus on UTD-MHAD as it is a complex dataset offering multiple modalities, and the current state-of-the-art methods utilize data modalities that include depth information to classify actions. Extensive experiments show that with the effective training techniques discussed in Section IV, i.e., data augmentation, dynamic frame dropout and gradient injection, the proposed RGB-only solution surpasses the state-of-the-art methods by a notable margin. On an RGB-only dataset such as KTH, our method outperforms all the other methods reported in the literature, demonstrating the versatility of the proposed architecture and techniques for human activity recognition.

We implemented our system in Python with a TensorFlow backend on a GPU cluster with an Intel Xeon CPU E5-2667 v4 @ 3.20GHz with 504 GB of RAM and an NVIDIA TITAN Xp with 12 GB of RAM and 3840 CUDA cores. In our experiments, we empirically set the learning rate lr = 5e− for the RMSprop optimizer. We report confidence intervals based on 50 bootstrap trials. More details about the datasets we evaluated our model on and comparative experimental studies with the state of the art are presented next.
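As a side note on the confidence intervals mentioned above, the following sketch shows one way a bootstrap interval over test accuracy can be computed from per-sample correctness; the 50 trials follow the text, while the 95% level and the resampling scheme are assumptions.

```python
# Sketch of a bootstrap confidence interval over test-set accuracy.
# The 95% level and the resampling-with-replacement scheme are assumptions.
import numpy as np

def bootstrap_ci(correct, trials=50, alpha=0.05, rng=None):
    """correct: boolean array, one entry per test sample (prediction correct?)."""
    rng = rng or np.random.default_rng(0)
    n = len(correct)
    accs = [np.mean(rng.choice(correct, size=n, replace=True))
            for _ in range(trials)]
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(correct)), (float(lo), float(hi))

acc, (lo, hi) = bootstrap_ci(np.random.rand(430) > 0.1)  # dummy correctness flags
print(f"accuracy {acc:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```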
A. Dataset
KTH [5] is an RGB-only video dataset containing six action classes (walking, jogging, running, boxing, hand-waving, and hand-clapping), performed by 25 subjects in various conditions. The KTH dataset provides the full silhouette figure in all the sequences, which satisfies our requirements for pose-based activity recognition. We have followed the same experimental setup stated in [5].

UTD-MHAD [14] is a multi-modal action dataset containing 27 actions performed by 8 subjects (4 males and 4 females), each performing the same action 4 times, for a total of 861 sequences. This dataset provides four temporally synchronized data modalities: RGB videos, depth videos, skeleton positions, and inertial signals from a Kinect camera and a wearable inertial sensor. We follow a 50-50 train-test split similar to [14]. In the experiments we only use the RGB modality to evaluate our proposed method.

B. Experiments on UTD-MHAD dataset
We first explore the choice of network depth. We test our baseline BLSTM model with three settings: 3-Layer,
(To extract pose key-points reliably, the subjects in the video streams need to expose their full silhouettes.)
Figure 3. Accuracy comparison on the UTD-MHAD dataset for our models with different numbers of LSTM layers.
Figure 4. Accuracy comparison of the different design choices on the UTD-MHAD dataset.
Figure 5. Accuracy comparison on the UTD-MHAD dataset. [14], [15], [16], [22] and [33] use depth-enabled modalities, while our method uses the RGB-only modality (the confidence interval of our method is also included).
As we increase the training dataset by data augmentation, the top-1 error rate is consistently reduced. We also notice that data augmentation does not have much effect on the top-3 error rate, indicating that data augmentation mainly boosts correct answers from top-3 positions to top-1 positions.
Table I. Effect of Data Augmentation on the UTD-MHAD Dataset.
Augment Size | Top-1 Error (%) | Top-3 Error (%)
Finally, we summarize the results of our proposed method and the state-of-the-art methods [14], [15], [22], [16], [33] in Fig. 5. Most of these methods use depth or inertial data modalities or both (Section II). These data modalities are only available from depth-enabled cameras and provide more precise information about the motions related to actions. In contrast, we use the RGB-only modality to train our model from scratch. It can be seen from Fig. 5 that our method achieves an accuracy of 90.95%, which outperforms all the state-of-the-art methods.
C. Experiments on KTH dataset
To further strengthen our hypothesis, we then compare our proposed method with the state of the art on the RGB-only dataset, KTH [5], with the results presented in Fig. 6. We utilized the same training-testing split of the data as suggested in [5] to obtain the reported results. The CNN based hybrid model proposed by Lei et al. [34] achieves 91.41% accuracy, which is outperformed by most of the state-of-the-art handcrafted feature based methods. On the other hand, [9], [35], [6], and [8], using handcrafted features, achieve competitive accuracy. However, these methods are extremely data dependent; the handcrafted feature extractors proposed in these methods cannot
robustly work on heterogeneous data. Hence, these methods are not suitable for real-world deployment. Our proposed method with data augmentation and dynamic frame dropout achieves 96.07% accuracy, outperforming all the others.
Figure 6. Accuracy comparison on the KTH dataset with the state of the art. [34] uses a CNN based method, while [6], [8], [9] and [35] utilize handcrafted features (the confidence interval of our method is also shown).
VI. CONCLUSION AND FUTURE WORK
We propose an end-to-end framework that utilizes pose key-points extracted from OpenPose coupled with a BLSTM for human activity recognition. A major difference from the state-of-the-art methods is that we use the RGB-only modality while all the other methods use the RGB-D modality. Effective training of deep networks in our setting is the major technical challenge, as we typically have very limited training data and exploiting the RGB-only modality exacerbates the difficulty even further. We therefore explore a number of algorithmic techniques like Dynamic Frame Dropout, Gradient Injection and Data Augmentation to train our framework effectively. Extensive experiments demonstrate the effectiveness of our BLSTM model and training methodologies, among which data augmentation is the most effective one. In the end, our RGB-only solution surpasses all the state-of-the-art methods that exploit RGB-D streams. This makes our solution cost effective and widely deployable with ordinary digital cameras.

Our experiments were conducted on the KTH and UTD-MHAD datasets, where there is only one person present per action and the whole silhouette is visible. In the future, we would like to extend our method to multi-person datasets where some body parts can be partially occluded, which happens more often in real video surveillance applications.
ACKNOWLEDGMENT
The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
REFERENCES
[1] G. Cheng, Y. Wan, A. N. Saudagar, K. Namuduri, and B. P. Buckles, "Advances in human action recognition: A survey," arXiv preprint arXiv:1501.05964, 2015.
[2] S. Herath, M. Harandi, and F. Porikli, "Going deeper into action recognition: A survey," Image and Vision Computing, vol. 60, pp. 4–21, 2017.
[3] C. Chen, R. Jafari, and N. Kehtarnavaz, "A survey of depth and inertial sensor fusion for human action recognition," Multimedia Tools and Applications, vol. 76, no. 3, pp. 4405–4425, 2017.
[4] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[5] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, vol. 3. IEEE, 2004, pp. 32–36.
[6] Z. Zhang, Y. Hu, S. Chan, and L.-T. Chia, "Motion context: A new representation for human action recognition," Computer Vision–ECCV 2008, pp. 817–829, 2008.
[7] J. Liu and M. Shah, "Learning human actions via information maximization," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
[8] M. Bregonzio, S. Gong, and T. Xiang, "Recognising action as clouds of space-time interest points," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1948–1955.
[9] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International Journal of Computer Vision, vol. 103, no. 1, pp. 60–79, 2013.
[10] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, "Sequential deep learning for human action recognition," in International Workshop on Human Behavior Understanding. Springer, 2011, pp. 29–39.
[11] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[12] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
[13] M. E. Masoud, K. Sarker, S. Belkasim, and I. Chahine, "Automatically generated semantic tags of art images," in IEEE International Conference on Signal and Image Processing Applications (ICSIPA), 2017.
[14] C. Chen, R. Jafari, and N. Kehtarnavaz, "UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor," in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 168–172.
[15] P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016, pp. 102–106.
[16] C. Li, Y. Hou, P. Wang, and W. Li, "Joint distance maps based action recognition with convolutional neural networks," IEEE Signal Processing Letters, vol. 24, no. 5, pp. 624–628, 2017.
[17] H. Rahmani and M. Bennamoun, "Learning action recognition model from depth and skeleton videos," in The IEEE International Conference on Computer Vision (ICCV), 2017.
[18] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in CVPR, 2017.
[19] Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110–1118.
[20] M. Liu, H. Liu, and C. Chen, "Enhanced skeleton visualization for view invariant human action recognition," Pattern Recognition, vol. 68, pp. 346–362, 2017.
[21] G. Johansson, "Visual perception of biological motion and a model for its analysis," Perception & Psychophysics, vol. 14, no. 2, pp. 201–211, Jun 1973.
[22] Y. Hou, Z. Li, P. Wang, and W. Li, "Skeleton optical spectra based action recognition using convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, 2016.
[23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
[24] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
[25] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4694–4702.
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[27] W. S. Sarle, "Neural networks and statistical models," 1994.
[28] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[29] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[30] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[31] T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
[32] W.-N. Hsu, Y. Zhang, A. Lee, and J. Glass, "Exploiting depth and highway connections in convolutional recurrent deep neural networks for speech recognition," cell, vol. 50, p. 1, 2016.
[33] B. Zhang, Y. Yang, C. Chen, L. Yang, J. Han, and L. Shao, "Action recognition using 3D histograms of texture and a multi-class boosting classifier," IEEE Transactions on Image Processing, vol. 26, no. 10, pp. 4648–4660, 2017.
[34] J. Lei, G. Li, S. Li, D. Tu, and Q. Guo, "Continuous action recognition based on hybrid CNN-LDCRF model," in Image, Vision and Computing (ICIVC), International Conference on. IEEE, 2016, pp. 63–69.
[35] L. Liu, L. Shao, X. Li, and K. Lu, "Learning spatio-temporal representations for action recognition: A genetic programming approach,"