A Fine-to-Coarse Convolutional Neural Network for 3D Human Action Recognition
Thao Minh Le [email protected]
Nakamasa Inoue [email protected]
Koichi Shinoda [email protected]
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
Abstract
This paper presents a new framework for human action recognition from a 3D skeleton sequence. Previous studies do not fully utilize the temporal relationships between video segments in a human action. Some studies successfully used very deep Convolutional Neural Network (CNN) models but often suffer from data insufficiency. In this study, we first segment a skeleton sequence into distinct temporal segments in order to exploit the correlations between them. The temporal and spatial features of the skeleton sequence are then extracted simultaneously by a fine-to-coarse (F2C) CNN architecture optimized for human skeleton sequences. We evaluate our proposed method on NTU RGB+D and the SBU Kinect Interaction dataset. It achieves accuracies of 79.6% and 84.6% on NTU RGB+D with the cross-subject and cross-view protocols, respectively, which are almost identical to the state-of-the-art performance. In addition, our method significantly improves the accuracy for actions in two-person interactions.
1 Introduction

In the past few years, human action recognition has become an intensive area of research as a result of the dramatic growth of societal applications in areas including security surveillance systems, human-computer-interaction-based games, and the healthcare industry. The conventional approach based on RGB data is not robust against intra-class variations and illumination changes. With the advancement of 3D sensing technologies, in particular affordable RGB-D cameras such as the Microsoft Kinect, these problems have been remedied to some extent, and human action recognition studies utilizing 3D skeleton data have drawn a great deal of attention [4, 22].

Human action recognition based on 3D skeleton data is a time-series problem, and accordingly, a great body of previous studies has focused on extracting motion patterns from a skeleton sequence. Earlier methods utilized hand-crafted features to represent the intra-frame relationships through the skeleton sequences [29, 30]. With deep learning, end-to-end learning based on Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) has been utilized to learn the temporal dynamics [2, 14, 15, 24, 26, 36]. Recent studies have shown the superiority of Convolutional Neural Networks (CNNs) over RNNs with LSTM for this task [8, 9, 17, 18]. Most of the CNN-based studies encode the trajectories of human joints in an image space representing the spatio-temporal information of the skeleton data. The encoded feature is then fed into a deep CNN pre-trained on a large-scale image dataset, for example ImageNet [23], under the notion of transfer learning [21]. This CNN-based approach is, however, weak in handling long temporal sequences, and thus it often fails to distinguish actions with similar distance variations but different durations, such as “handshaking” and “giving something to other persons”.

Motivated by the success of the generative model for CAPTCHA images [3], we believe 3D human action recognition systems can also benefit from a network structure specific to this application domain. The first step is to segment a given skeleton sequence into different temporal segments; here, we assume that temporal features at different time-steps have different correlations. We further utilize a tailor-made F2C CNN-based network architecture to model high-level features. By utilizing both the temporal relationships between temporal segments and the spatial connectivities among human body parts, our method is expected to outperform naive deep CNN networks. To the best of our knowledge, this is the first attempt to use an F2C network for 3D human action recognition.

The paper is organized as follows. In Section 2, we discuss related studies. In Section 3, we explain our proposed network architecture in detail. We then show experimental results to justify our motivations in Section 4. Finally, we conclude our study in Section 5.
Figure 1: Overview of the Proposed Method. It consists of two parts: (a) feature representation and (b) high-level feature learning with an F2C CNN-based network architecture. A skeleton from an input video sequence is represented by whole-body-based (WB) features and body-part-based (BP) features. These features are transformed into skeleton images that contain both the spatial structure of the human body and the temporal sequence of the action. The skeleton images are then fed into the F2C convolutional neural network for high-level feature learning. Finally, the CNN features are concatenated before being passed to two fully connected layers and a softmax layer for the final classification.
2 Related Work

Deep learning techniques have drawn great attention in the field of 3D human action recognition. In particular, end-to-end network architectures can discriminate actions from raw skeleton data without any handcrafted features. Zhu et al. [36] adopted three LSTM layers to exploit the co-occurrence features of skeleton joints at different layers. Du et al. [2] proposed a hierarchical RNN to exploit the spatio-temporal features of a skeleton sequence. They divided the skeleton joints into five subsets corresponding to five body parts before feeding them independently into five bidirectional recurrent neural networks for local feature extraction; the relationships between body parts were then modeled in later layers by hierarchically fusing them together. LSTMs were deliberately used in the last layer to tackle the vanishing-gradient problem of a vanilla RNN.

The use of deep learning techniques in this area of research exploded when the NTU RGB+D dataset [24] was released. Shahroudy et al. [24] introduced a part-aware LSTM to learn the long-term dynamics of a long skeleton sequence from multimodal inputs extracted from human body parts. Liu et al. [14], on the other hand, employed a spatio-temporal LSTM (ST-LSTM) to handle both the spatial and the temporal dependency. ST-LSTM is further enhanced with a tree-structure-based traversal method for transmitting the input data of each frame into the network; in addition, a trust gate mechanism excludes noisy data from the input. Zhang et al. [33] proposed a view adaptation scheme for 3D skeleton data and integrated it into an end-to-end LSTM network for sequential data modeling and feature extraction.

CNNs are powerful for detecting objects in images, and transfer learning techniques enable them to perform well even with a limited number of data samples [19, 28]. Motivated by this, Ke et al. [8] were the first to apply transfer learning to 3D human action recognition. They used a VGG model [1] pre-trained on ImageNet to extract high-level features from cosine distance features between joint vectors and their normalized magnitudes. Ke et al. [9] further transformed the cylindrical coordinates of an original skeleton sequence into three clips of gray-scale images, which are processed by a pre-trained VGG19 model [25] to extract image features. Multi-task learning was also proposed in [9] for the final classification, which achieved the state-of-the-art performance on the NTU RGB+D dataset.

Our study addresses two problems of the previous studies: (1) the loss of temporal information of a skeleton sequence during training, and (2) the need for a CNN structure specific to skeleton data. We believe that very deep CNN models such as VGG [25], AlexNet [13], or ResNet [5] are overqualified for data as sparse as human skeletons. Moreover, the available skeleton datasets are relatively small compared to image datasets. We thus believe a network architecture that leverages the geometric dependencies of human joints is promising for solving this issue.
3 Proposed Method

This section presents our proposed method for 3D skeleton-based action recognition, which exploits the geometric dependency of human body parts and the temporal relationships in a time sequence of skeletons (Figure 1). It consists of two phases: feature representation and high-level feature learning with an F2C network architecture.
Figure 2: Feature Generation. (a) illustrates the procedure for generating WB features, obtained by transforming the joint positions from the camera coordinate system to the hip-based coordinate system. In (b), we arrange BP features side by side to obtain one 2D feature array before projecting the coordinates in Euclidean space into RGB image space using a linear transformation and further up-scaling with cubic interpolation.
3.1 Feature Representation

We encode the geometry of the human body, originally given in an image space, into local coordinate systems to extract the relative geometric relationships among human joints in a video frame. We select six joints of the human skeleton as reference joints to generate whole-body-based (WB) features and body-part-based (BP) features. The hip joint is chosen as the origin of the coordinate system for the WB features, while the other reference joints, namely the head, the left shoulder, the right shoulder, the left hip, and the right hip, are selected exactly as in [8] to represent the BP features. The WB features represent the motions of human joints around the base of the spine, while the BP features represent the variation of appearance and deformation of the human pose when viewed from different body parts. We believe that the combined use of WB and BP features is robust against coordinate transformations.

Different from other studies using BP features [8, 14, 24], we extract a velocity together with a joint position from each joint of the raw skeleton. The velocity represents variations over time and has been widely employed in previous studies, mostly in handcrafted-feature-based approaches [10, 32, 34]. It is robust against speed changes and, accordingly, is effective for discriminating actions with similar distance variations but different speeds, such as punching and pushing.

In the t-th frame of a sequence of skeletons with n joints, the 3D position of the i-th joint is

    p_i(t) = [p_i^x(t), p_i^y(t), p_i^z(t)]^\top.    (1)

The relative inter-joint positions are highly discriminative for human actions [20]. The relative position of joint i at time t is

    \hat{p}_i(t) = p_i(t) - p_{ref}(t),    (2)

where p_{ref}(t) denotes the position of a selected reference joint. The velocity feature \hat{v}_i(t) at time frame t is defined as the first derivative of the relative position feature \hat{p}_i(t). Zanfir et al. [32] showed that it is effective to compute the derivatives of the instantaneous human pose, represented by the joint locations at a given frame t, over a time segment. The velocity feature is therefore

    \hat{v}_i(t) \approx \hat{p}_i(t+1) - \hat{p}_i(t-1).    (3)

As mentioned above, we choose the hip joint as the reference joint for the WB features (see Figure 2(a)). In addition, we follow the limb normalization procedure of [32] to reduce the problem caused by variations in body size among human subjects. We first compute the average bone length of each pair of connected joints over the training dataset, and then use these averages to normalize each subject's bones. To put it differently, we stretch each bone of a subject to the normalized length while keeping the joint angles between bones unchanged.

To extract the spatial features of a human skeleton at time t over the set of joints, we first define a spatial configuration of a joint chain. We believe that the order of joints greatly affects the learning ability of a 2D CNN, since joints in adjacent body parts share more spatial relations than a random pair of joints. For example, in most actions, the joints of the right arm are more correlated with those of the left arm than the joints of the left leg are. With this intention, we concatenate joints in the following order: left arm, right arm, torso, left leg, right leg.
Note that the torso in the context of this paper includes the head joint of the human skeleton. Let T be the number of frames in a given skeleton sequence. In the next step, we compute each feature of the skeleton data over the T frames and stack them as feature rows. Consequently, we obtain the WB features as two 2D arrays, one corresponding to the joint locations and the other to the velocities. Finally, we project these 2D feature arrays into RGB image space using a linear transformation: each of the three components (x, y, z) of each skeleton joint is represented as one of the three corresponding components (R, G, B) of a pixel in a color image by normalizing the (x, y, z) values to the range 0 to 255. The two sets of color images are further up-scaled using cubic spline interpolation, a commonly used technique in image processing that minimizes the interpolation error [6]. We call these two RGB images skeleton images.

To represent the BP features, we choose five joints corresponding to the five human body parts as reference joints: the head, the left shoulder, the right shoulder, the left hip, and the right hip, as in [8]. They are relatively stable in most actions. We calculate joint position features and velocity features for each reference joint in the above order independently. As a result, for each skeleton at time t, we obtain five feature vectors of joint locations and five of velocities, corresponding to the five distinct reference joints. We then place all BP features side by side to produce one feature row and stack these rows along the temporal axis to obtain a 2D feature array. Finally, we apply a linear transformation to represent these feature arrays as RGB images and further up-scale them using cubic spline interpolation. In the end, we obtain two BP-based skeleton images from each skeleton sequence, one corresponding to the joint locations and the other to the velocities. The whole process is illustrated in Figure 2(b).
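To make the pipeline concrete, the following is a minimal NumPy/SciPy sketch of the feature generation described above. It is illustrative only: the function names, the array layout (time along rows, joints along columns), and the hip index are our assumptions, not code from the paper.

```python
import numpy as np
from scipy import ndimage

def wb_features(seq, hip_idx=0):
    """Whole-body features: joint positions relative to the hip (Eq. 2)
    and their central-difference velocities (Eq. 3).

    seq: (T, n_joints, 3) array of 3D joint positions, already reordered
    along the joint chain (left arm, right arm, torso, left leg, right leg).
    """
    rel = seq - seq[:, hip_idx:hip_idx + 1, :]       # \hat{p}_i(t)
    vel = np.zeros_like(rel)
    vel[1:-1] = rel[2:] - rel[:-2]                   # \hat{v}_i(t)
    return rel, vel

def to_skeleton_image(feat, out_size=224):
    """Map a (T, n_joints, 3) feature array to an RGB skeleton image:
    time along the rows, joints along the columns, (x, y, z) -> (R, G, B)
    by linear rescaling to [0, 255], then cubic-spline up-scaling."""
    lo, hi = feat.min(), feat.max()
    img = 255.0 * (feat - lo) / (hi - lo + 1e-8)
    zoom = (out_size / img.shape[0], out_size / img.shape[1], 1)
    return ndimage.zoom(img, zoom, order=3).astype(np.uint8)  # cubic spline
```

The BP skeleton images would be produced in the same way, computing the features once per reference joint and concatenating the resulting arrays side by side before the projection to RGB.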
Figure 3: Proposed Fine-to-Coarse Network Architecture. Blue arrows show pairs of slices which are concatenated along each dimension before being passed to a convolutional block.
3.2 Fine-to-Coarse Network Architecture

In this section, we explain the details of our proposed F2C network architecture for high-level feature learning. Figure 3 illustrates the network structure in three dimensions.

Our F2C network takes the three color channels of the skeleton images generated in the feature representation phase as inputs. Accordingly, the input of our F2C network spans two dimensions: the spatial dimension, which describes the geometric dependencies of human joints along the joint chain, and the temporal dimension of the time-feature representation over the T frames of a skeleton sequence. Let m be the number of segments along the temporal axis and n the number of body parts (n = 5), so that the input consists of m × n slices (Figure 3). Let T_seg (T = m × T_seg) be the number of frames in one temporal segment and l_bp the dimension of one body part along the spatial dimension; each input slice then has size l_bp × T_seg. In the next step, we simultaneously concatenate the slices over both the spatial and the temporal axis. In other words, along the spatial dimension we concatenate each body part belonging to the human limbs (arms and legs) with the torso, while concatenating two consecutive temporal segments together. Each concatenated 2D feature array is then passed through a convolutional layer and a max pooling layer. The same fusion procedure is applied before the next convolutional layer. In short, our F2C network consists of three layer-concatenation steps and, accordingly, three convolutional blocks. In the last step, the extracted image features are flattened to obtain a 1D feature array as output.

Both WB-based and BP-based skeleton images are fed into the proposed F2C network in the same way. While it is natural to feed BP features into our network for high-level feature learning, we believe WB features also benefit from going through the network, since the spatial dimension of WB features, formed by the pre-defined joint chain, includes the intrinsic relationships between body parts.

Our network can be viewed as a procedure for eliminating unwanted connections between layers of a conventional CNN. We believe traditional CNN models include redundant connections for capturing human-body-geometric features. Many actions only require the movement of the upper body (e.g., hand waving, clapping) or the lower body (e.g., sitting, kicking), while others require the movement of the whole body (e.g., moving towards another person, picking up something). For this reason, the bottom layers of our proposed network can discriminate “fine” actions, which require the movements of certain body parts, while the top layers are discriminative for “coarse” actions involving the movements of the whole body. A sketch of this fine-to-coarse fusion is given below.
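The following Keras sketch illustrates one possible reading of the fine-to-coarse fusion, using a reduced configuration (m = 4 temporal segments, n = 5 body parts, small slices) rather than the exact configuration of Table 1. The grouping of limbs with the torso into upper- and lower-body regions and the filter counts (64/128/256) are our assumptions for illustration.

```python
from tensorflow.keras import layers, models

T_SEG, L_BP, M = 8, 5, 4        # frames per segment, rows per part, segments
PARTS = ["l_arm", "r_arm", "torso", "l_leg", "r_leg"]
LIMBS = ["l_arm", "r_arm", "l_leg", "r_leg"]

def block(x, filters):
    # A fresh Conv2D per call: weights are NOT shared between slices.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.MaxPooling2D(2)(x)

def cat(a, b, axis):
    return layers.Concatenate(axis=axis)([a, b])

inp = layers.Input(shape=(M * T_SEG, len(PARTS) * L_BP, 3))

# Cut the skeleton image into m x n slices (time segments x body parts);
# axis 1 is time, axis 2 is the joint chain.
sl = {(i, p): inp[:, i * T_SEG:(i + 1) * T_SEG, j * L_BP:(j + 1) * L_BP, :]
      for i in range(M) for j, p in enumerate(PARTS)}

# Level 1 (fine): pair each limb with the torso along the joint axis while
# pairing consecutive temporal segments along the time axis.
lvl1 = {}
for i in range(0, M, 2):
    for limb in LIMBS:
        s = cat(cat(sl[(i, limb)], sl[(i, "torso")], axis=2),
                cat(sl[(i + 1, limb)], sl[(i + 1, "torso")], axis=2), axis=1)
        lvl1[(i // 2, limb)] = block(s, 64)

# Level 2: merge the arms into an upper-body slice and the legs into a
# lower-body slice, again fusing temporal pairs.
lvl2 = {}
for region, (a, b) in {"upper": ("l_arm", "r_arm"),
                       "lower": ("l_leg", "r_leg")}.items():
    s = cat(cat(lvl1[(0, a)], lvl1[(0, b)], axis=2),
            cat(lvl1[(1, a)], lvl1[(1, b)], axis=2), axis=1)
    lvl2[region] = block(s, 128)

# Level 3 (coarse): the whole body over the whole sequence, then flatten.
whole = block(cat(lvl2["upper"], lvl2["lower"], axis=2), 256)
model = models.Model(inp, layers.Flatten()(whole))
```

Because `block` creates a fresh `Conv2D` at every call, no weights are shared between slices, matching the training detail described in the experiments below.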
Figure 4: Examples of Generated Skeleton Images. “Standing up” and “take off jacket” are single-person actions, while “point finger at the other person” and “handshaking” are two-person interaction actions.

4 Experiments

We conduct experiments on two publicly available skeleton benchmark datasets: NTU RGB+D [24] and the SBU Kinect Interaction Dataset [31]. As the method proposed by [8] is closely related to this paper, we employ it as our baseline. We also compare our proposed method with other state-of-the-art methods reported on the same datasets.
NTU RGB+D Dataset is currently the largest skeleton-based human action dataset, with 56,880 sequences. The skeleton data were collected with Microsoft Kinect v2 sensors, and each skeleton contains 25 human joints. The dataset covers 60 distinct action classes in three groups: daily actions, health-related actions, and two-person interactive actions. All actions are performed by 40 distinct subjects and recorded simultaneously by three cameras located at different angles: −45°, 0°, and 45°. The dataset is challenging due to its large variations in viewpoints and sequence lengths. In our experiments, we use the two standard evaluation protocols proposed by the original study [24], namely cross-subject (CS) and cross-view (CV).
Table 1: Network Configuration. The input is a 224 × 224 RGB image cut into 35 input slices of 32 × 32; the convolutional blocks use 3 × 3 kernels, starting with conv3-64.
Table 2: Classification Performance on NTU RGB+D Dataset

Methods                                    CS     CV
Lie Group [27]                             50.1   52.8
Part-aware LSTM [24]                       62.9   70.3
ST-LSTM + Trust Gate [14]                  69.2   77.7
Temporal Perceptive Network [7]            75.3   84.0
Context-aware attention LSTM [16]          76.1   84.0
Enhanced skeleton visualization [18]       76.0   82.6
Temporal CNNs [11]                         74.3   83.1
Clips+CNN+Concatenation [9]                77.1   81.1
Clips+CNN+MTLN [9]
SkeletonNet [8]                            75.9   81.2
(WB + BP) + VGG                            68.1   72.4
BP + F2C network                           78.2   81.9
(WB + BP) w/o velocity + F2C network       76.6   81.7
F2CSkeleton (Proposed)                     79.6   84.6
SBU Kinect Interaction Dataset is another skeleton-based dataset, collected using the Microsoft Kinect sensor. It contains 282 skeleton sequences divided into 21 subsets, covering eight types of two-person interactions: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands. Each skeleton contains 15 joints, and seven subjects performed the actions in the same laboratory environment. We augment the data as in [8] before performing five-fold cross-validation: each skeleton image is first resized to 250 × 250 and then randomly cropped into 20 sub-images of size 224 × 224.
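As a usage illustration, this crop augmentation could be implemented as follows. The function name and the use of NumPy's default RNG are our choices, assuming crop positions are sampled uniformly (the paper only states the crop count and sizes).

```python
import numpy as np

def random_crops(img, n=20, size=224, rng=None):
    """Augment one 250 x 250 skeleton image into n random crops of
    size x size, as done for the SBU experiments."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    crops = []
    for _ in range(n):
        top = int(rng.integers(0, h - size + 1))
        left = int(rng.integers(0, w - size + 1))
        crops.append(img[top:top + size, left:left + size])
    return np.stack(crops)
```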
Implementation Details

The proposed model was implemented using Keras (https://github.com/keras-team/keras) with the TensorFlow backend. For a fair comparison with previous studies, transfer learning is applied to improve the classification performance: our F2C network is first trained on ImageNet with the input image dimension set to 224 × 224, and is then fine-tuned on the NTU RGB+D dataset. Considering that body part features contribute differently to an action, we do not share weights between input slices during training. This increases the number of parameters but yields better generalization ability. Table 1 shows the details of our network configuration.
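A minimal sketch of this two-stage transfer learning in Keras is shown below. The `build_f2c()` stand-in, the 512-unit FC width, and the Adam optimizer are assumptions for illustration: Figure 1 specifies two fully connected layers and a softmax, but not their sizes or the optimizer.

```python
from tensorflow.keras import layers, models

def build_f2c():
    # Stand-in for the F2C backbone sketched in Section 3.2; any model
    # ending in a flattened feature vector works here.
    inp = layers.Input(shape=(224, 224, 3))
    x = layers.Conv2D(64, 3, activation="relu")(inp)
    x = layers.GlobalAveragePooling2D()(x)
    return models.Model(inp, x)

def add_classifier(backbone, n_classes):
    """Attach two FC layers and a softmax head (Figure 1(b));
    the 512-unit width is an assumption."""
    x = layers.Dense(512, activation="relu")(backbone.output)
    x = layers.Dense(512, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(backbone.input, out)

# Stage 1: pre-train the backbone on ImageNet (1000 classes, 224 x 224).
backbone = build_f2c()
pretrain = add_classifier(backbone, 1000)
pretrain.compile(optimizer="adam", loss="categorical_crossentropy")
# pretrain.fit(imagenet_images, imagenet_labels, ...)

# Stage 2: swap the classifier head and fine-tune on skeleton images
# (60 action classes for NTU RGB+D); the backbone weights carry over.
finetune = add_classifier(backbone, 60)
finetune.compile(optimizer="adam", loss="categorical_crossentropy")
# finetune.fit(skeleton_images, action_labels, ...)
```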
Table 3: Classification Performance with Two-person Interactions, NTU RGB+D Dataset, CV Protocol

Actions                   SkeletonNet       F2CSkeleton
                          Prec.    Rec.     Prec.    Rec.
Punching/slapping         59.2     56.0
Kicking                   46.8     64.9
Pushing                   69.7     72.2
Pat on back               54.7     46.2
Point finger              42.8     72.8
Hugging                   77.6     83.5
Giving something          72.5     72.5
Touch other's pocket      66.9     50.6
Handshaking               83.1     82.6
Walking towards           66.2     82.3
Walking apart             61.8
Table 4: Classification Performance on SBU Dataset

Methods                              Acc.
Deep LSTM+Co-occurrence [36]         90.4
ST-LSTM+Trust Gate [14]              93.3
SkeletonNet [8]                      93.5
Clips+CNN+Concatenation [9]          92.9
Clips+CNN+MTLN [9]                   93.6
Context-aware attention LSTM [16]    94.9
VA-LSTM [33]                         97.2
F2CSkeleton (Proposed)               99.1
NTU RGB+D Dataset
We compare the performance of our method with previous studies in Table 2. Classification accuracy is used as the evaluation metric.

(WB + BP) + VGG
In this experiment, we use VGG16 pre-trained on the ImageNet dataset instead of our F2C network, to examine the significance of the proposed F2C network for high-level feature learning against conventional deep CNN models.
BP + F2C network
In this experiment, we feed only the skeleton images generated from BP features into the proposed F2C network architecture, to justify the contribution of the WB features going through our F2C network.

(WB + BP) w/o velocity + F2C network
In this experiment, only joint position features are fed into the proposed F2C network architecture, to examine the importance of incorporating the velocity features into the final classification.
WB + BP + F2C network (F2CSkeleton)
This is our proposed method.

As shown in Table 2, our proposed method outperforms the results reported in [7, 8, 14, 16, 18, 24, 27] under the same testing conditions. In particular, we gain over 3.0 points of improvement over our baseline [8] on both the CS and CV protocols. Similarly, our method is around 2.5 points better than the feature-concatenation method of [9]. However, [9] with the Multi-Task Learning Network (MTLN) obtained slightly better performance than our method on the CV protocol. MTLN works as a hierarchical method to effectively learn the intrinsic correlations between multiple related tasks [35], and thus outperforms mere concatenation; we believe our method can also benefit from MTLN and will include this in future work to improve our network. Note also that while our method outperforms [33] on the CS protocol, their view adaptation scheme achieves about 3 points better performance on the CV protocol in a multiple-view environment.

Table 2 also shows that our F2C network performs significantly better than VGG16: it improves the accuracy from 68.1% to 79.6% on the CS protocol and from 72.4% to 84.6% on the CV protocol. Incorporating velocity improves the performance by about 3.0 points on both protocols. Moreover, combining WB and BP features improves the accuracies from 78.2% to 79.6% and from 81.9% to 84.6% on the CS and CV protocols, respectively.

Our method outperforms SkeletonNet on all the two-person interactions; Table 3 shows the classification performance with the CV protocol. Two-person interactions usually require the movement of the whole body, and the top layers of our tailored network architecture can learn whole-body motion better than naive CNN models originally designed for detecting generic objects in a still image.

On the other hand, our method performs poorly on two classes, namely “brushing teeth” (58.3%) and “brushing hair” (47.6%). The confusion matrix reveals that “brushing teeth” is often misclassified as either “cheer up” or “hand waving”, while “brushing hair” is misclassified as “hand waving”. This may be because the head joint, selected as the reference joint for the torso, is not as stationary as the other reference joints in these action types.
SBU Kinect Interaction Dataset
Table 4 compares our proposed method with previous studies on the SBU dataset. As can be seen, our method achieves the best performance on this dataset over all previous methods: it gains more than 5.0 points over the two state-of-the-art CNN-based methods [8, 9], about 4.0 points over [16], and approximately 2.0 points over [33]. These results again confirm the superior performance of our method on two-person interaction actions.
5 Conclusion

This paper addresses two problems of previous studies: the loss of temporal information in a skeleton sequence when modeling with CNNs, and the need for a network model specific to human skeleton sequences. We first propose to segment a skeleton sequence to retrieve the dependencies between temporal segments in an action. We also propose an F2C CNN architecture for exploiting the spatio-temporal features of skeleton data. As a result, our method, with only three network blocks, shows generalization ability superior to very deep CNN models. We achieve accuracies of 79.6% and 84.6% on the large skeleton dataset NTU RGB+D with the cross-subject and cross-view protocols, respectively, which reaches the state of the art.

In the future, as noted above, we will adopt multi-task learning. In addition, since we do not share weights between input slices during training, our network has more trainable parameters than general CNN models with the same input size and number of filters; we believe our method will work better if we reduce the number of feature maps in the convolutional layers. The current skeleton data is also very challenging due to noisy joints: by manually checking the skeleton data from the first data collection setup of NTU RGB+D, we found that about 8.8% of detections were noisy. Because our method does not apply any algorithm to remove this noise from the input, handling it is a promising direction for better performance.
Acknowledgments
This work was supported by JSPS KAKENHI 15K12061 and by JST CREST Grant Number JPMJCR1687, Japan.
References

[1] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference (BMVC), 2014.
[2] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proc. of Computer Vision and Pattern Recognition (CVPR), pages 1110–1118, 2015.
[3] Dileep George, Wolfgang Lehrach, Ken Kansky, Miguel Lázaro-Gredilla, Christopher Laan, Bhaskara Marthi, Xinghua Lou, Zhaoshi Meng, Yi Liu, Huayan Wang, et al. A generative vision model that trains with high data efficiency and breaks text-based captchas. Science, 358(6368):eaag2612, 2017.
[4] Fei Han, Brian Reily, William Hoff, and Hao Zhang. Space-time representation of people based on 3d skeletal data: A review. Computer Vision and Image Understanding, 158:85–105, 2017.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. of Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[6] Hsieh Hou and H. Andrews. Cubic splines for image interpolation and digital filtering. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(6):508–517, 1978.
[7] Yueyu Hu, Chunhui Liu, Yanghao Li, Sijie Song, and Jiaying Liu. Temporal perceptive network for skeleton-based action recognition. In Proc. of British Machine Vision Conference (BMVC), pages 1–2, 2017.
[8] Qiuhong Ke, Senjian An, Mohammed Bennamoun, Ferdous Sohel, and Farid Boussaid. Skeletonnet: Mining deep part features for 3-d action recognition. IEEE Signal Processing Letters, 24(6):731–735, 2017.
[9] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid. A new representation of skeleton sequences for 3d action recognition. In Proc. of Computer Vision and Pattern Recognition (CVPR), pages 4570–4579. IEEE, 2017.
[10] Tommi Kerola, Nakamasa Inoue, and Koichi Shinoda. Graph regularized implicit pose for 3d human action recognition. In Proc. of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 1–4. IEEE, 2016.
[11] Tae Soo Kim and Austin Reiter. Interpretable 3d human action analysis with temporal convolutional networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1623–1631. IEEE, 2017.
[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. of International Conference on Learning Representations (ICLR), 2015.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. of Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
[14] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In Proc. of European Conference on Computer Vision (ECCV), pages 816–833. Springer, 2016.
[15] Jun Liu, Amir Shahroudy, Dong Xu, Alex Kot Chichung, and Gang Wang. Skeleton-based action recognition using spatio-temporal lstm network with trust gates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[16] Jun Liu, Gang Wang, Ling-Yu Duan, Kamila Abdiyeva, and Alex C. Kot. Skeleton-based human action recognition with global context-aware attention lstm networks. IEEE Transactions on Image Processing, 27(4):1586–1599, 2018.
[17] Mengyuan Liu, Chen Chen, and Hong Liu. 3d action recognition using data visualization and convolutional neural networks. In IEEE International Conference on Multimedia and Expo (ICME), pages 925–930. IEEE, 2017.
[18] Mengyuan Liu, Hong Liu, and Chen Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:346–362, 2017.
[19] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In Proc. of the 32nd International Conference on Machine Learning (ICML), pages 97–105, 2015.
[20] Jiajia Luo, Wei Wang, and Hairong Qi. Group sparsity and geometry constrained dictionary learning for action recognition from depth maps. In Proc. of International Conference on Computer Vision (ICCV), pages 1809–1816. IEEE, 2013.
[21] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[22] Liliana Lo Presti and Marco La Cascia. 3d skeleton-based human action classification: A survey. Pattern Recognition, 53:130–147, 2016.
[23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
[24] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3d human activity analysis. In Proc. of Computer Vision and Pattern Recognition (CVPR), June 2016.
[25] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[26] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proc. of Association for the Advancement of Artificial Intelligence (AAAI), volume 1, page 7, 2017.
[27] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa. Human action recognition by representing 3d skeletons as points in a lie group. In Proc. of Computer Vision and Pattern Recognition (CVPR), pages 588–595, 2014.
[28] Raimar Wagner, Markus Thom, Roland Schweiger, Gunther Palm, and Albrecht Rothermel. Learning convolutional neural networks from few samples. In International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2013.
[29] Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. Mining actionlet ensemble for action recognition with depth cameras. In Proc. of Computer Vision and Pattern Recognition (CVPR), pages 1290–1297. IEEE, 2012.
[30] Xiaodong Yang and YingLi Tian. Effective 3d action recognition using eigenjoints. Journal of Visual Communication and Image Representation, 25(1):2–11, 2014.
[31] Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay, Tamara L. Berg, and Dimitris Samaras. Two-person interaction detection using body-pose features and multiple instance learning. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 28–35. IEEE, 2012.
[32] Mihai Zanfir, Marius Leordeanu, and Cristian Sminchisescu. The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In Proc. of International Conference on Computer Vision (ICCV), pages 2752–2759, 2013.
[33] Pengfei Zhang, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jianru Xue, and Nanning Zheng. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proc. of International Conference on Computer Vision (ICCV), pages 2136–2145, 2017.
[34] Songyang Zhang, Xiaoming Liu, and Jun Xiao. On geometric features for skeleton-based action recognition using multilayer lstm networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 148–157. IEEE, 2017.
[35] Yu Zhang and Dit-Yan Yeung. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):12, 2014.
[36] Wentao Zhu, Cuiling Lan, Junliang Xing, Wenjun Zeng, Yanghao Li, Li Shen, Xiaohui Xie, et al. Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In Proc. of Association for the Advancement of Artificial Intelligence (AAAI), 2016.