SC4D: A Sparse 4D Convolutional Network for Skeleton-Based Action Recognition
Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu
NLPR & AIRIA, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; CAS Center for Excellence in Brain Science and Intelligence Technology
Abstract.
In this paper, a new perspective is presented for skeleton-based action recognition. Specifically, we regard the skeletal sequence as a spatial-temporal point cloud and voxelize it into a 4-dimensional grid. A novel sparse 4D convolutional network (SC4D) is proposed to directly process the generated 4D grid for high-level perception. Without manually designed transformation rules, it makes better use of the advantages of the convolutional network, resulting in a more concise, general and robust framework for skeletal data. Besides, by processing space and time simultaneously, it largely keeps the spatial-temporal consistency of the skeletal data, and thus brings better expressiveness. Moreover, with the help of the sparse tensor, it can be executed efficiently with fewer computations. To verify the superiority of SC4D, extensive experiments are conducted on two challenging datasets, namely, NTU-RGBD and SHREC, where SC4D achieves state-of-the-art performance on both of them.
Keywords:
Skeleton, Action Recognition, 4D Convolution.
Action recognition is a popular research topic because it can be applied to many practical fields such as human-computer interaction and video surveillance [2,22,5,19]. In recent years, skeleton-based action recognition has drawn considerable attention due to its small amount of data, higher-level semantic information and strong robustness in complicated environments [32,12,21]. In detail, the skeletal data is generally a sequence of frames, each of which contains the position information of the human body joints, expressed as 2D/3D coordinates in the camera coordinate system. It totally removes the background information, and thus focuses more on the human body itself. Recently, with the success of deep learning, data-driven methods have become the mainstream for skeleton-based action recognition. In most of the existing neural-network-based approaches, the joint coordinate is viewed as the attribute of each element, and various hand-crafted strategies are designed to transform the joints into specific forms such as pseudo-images or graphs, which are fed into RNNs [8,26], CNNs [12,1] or GCNs [32,20] for feature extraction.
Generally speaking, the GCN-based methods, which structure the skeletal data as a spatiotemporal graph and feed it into a graph convolutional network, show great effectiveness and better performance. However, recent works found that parameterizing the graph topology performs better than using the fixed human-body-based graph [21,9]. It means the prior-knowledge-based graph structure is not the key to the success of the GCN-based methods for skeletal data. Their success may largely owe to the great generalizability and robustness of the convolutional network. Thus, the GCN-based methods are essentially the same as the CNN-based methods, which try to design rules to transform the skeletal data into a CNN-suitable form. However, there are three limitations of these methods: (1) Manually designed rules are not guaranteed to be an optimal choice due to the human factor. Although some works propose to learn these rules during training, this still more or less limits the flexibility of the model. For example, some works try to learn the graph topology, but the learned graph is still shifted from the human-body-based graph and is limited by the hand-crafted graph generation mechanism [21,9]. Besides, the raw skeletons exist in a metric space (the 3D coordinate system) with class-specific patterns, which is naturally suitable for grid-based convolutional networks. It is unnecessary to force a learning of transformation rules for convolutions in such a data-driven framework. (2) Skeletal data is more fixed compared with other graph-structured data such as social networks or physical molecules. Every element possesses a specific physical/semantic meaning, i.e., a particular human joint such as a foot or a hand, which is constant across data samples.
However, the convolution is weight-shared. Thus, the fixed rules force the convolutional model to mine constant interaction patterns between each joint and its neighbors across different samples, which limits the model's flexibility and capacity. There are also works trying to avoid constant contributions by permuting the arrangement of joints [1] or multiplying attention weights for every joint [32], but they treat the symptoms rather than the root cause. (3) In previous works for skeletal data, the position information is employed as features while hand-crafted rules are used to organize the convolutional operations. This somewhat conflicts with the translation-invariant character of the convolutional network. Since the input is position-variant, once the skeleton is shifted a little bit, the extracted features become completely different. This reduces the robustness and the generalizability of the model. Instead, a new perspective is proposed in this work for skeleton-based action recognition. We regard the skeletal sequence as a spatial-temporal point cloud and voxelize it into a 4D grid. Similar to the RGB values of image pixels, each voxel is attached with a feature vector that denotes whether there is a joint in this position and which joint it is. Different from previous works, it needs no manual design of transformation rules and can better utilize the power of the data-driven mechanism. By performing convolution on the 4D grid, the weight-sharing mechanism no longer operates on a fixed relational structure, which avoids constant contributions between each joint and its neighbors across different samples. Besides, the position information is not included in the feature vector, and thus the model is more robust to position translation.
Besides, different from most existing methods that process the skeletal sequence frame by frame, we directly construct a 4D convolutional network to process the generated 4D grid. It hierarchically extracts spatial-temporal features from low levels to high levels, which largely keeps the spatial-temporal consistency of the skeletal data. To construct a 4D convolutional network, a naive method is to directly expand the 3D convolutional kernels to 4D. However, due to the additional temporal dimension of the kernel, this incurs a large computational cost. Since the skeletal data is very sparse, we propose to employ the sparse tensor and the sparse convolution to reduce the computational cost, inspired by [4]. It only tracks the non-empty voxels of the 4D grid, along with their position information and associated feature vectors. The convolutional operation is performed on these sparse voxels hierarchically, which generates the sparse output accordingly. It produces results similar to the dense convolution, but it is more efficient and needs fewer computations. Moreover, to further improve the performance, we propose three data augmentation techniques. First, we suggest interpolating points along the bones, which utilizes the prior knowledge of the human body and makes the pattern more distinctive. The number of interpolations is determined based on the average length of the bones in the human body. Second, since the skeletal joints are very sparse, a dilation technique is used to enhance the spatial pattern to ease the recognition. The values of the newly added points are reduced proportionally to distinguish them from the original joints. Third, as shown in previous methods [21], the bone information and the motion information are effective for skeleton-based action recognition. To also exploit them in our voxelization-based method, we propose to transform the input from the coordinate space into the bone space and the motion space.
By modeling the information of the three spaces with multiple streams and finally fusing the results, the performance is further improved. The proposed method, namely, the sparse 4D convolutional network (SC4D), is a general framework for skeletal data. To verify the effectiveness of SC4D, extensive experiments are conducted on two popular datasets for different tasks, i.e., NTU-RGBD for action recognition and SHREC for gesture recognition. Although it is the first attempt to voxelize skeletons into a 4D grid, SC4D achieves state-of-the-art performance on both datasets, which illustrates that it is an effective method and a perspective worthy of further research. Overall, our contributions lie in three aspects:
1. We propose a new perspective for skeleton-based action recognition, where the skeletal data is voxelized into a 4D grid and is directly processed with a sparse 4D convolutional network. The proposed framework is effective, concise, robust, and efficient.
2. Two data enhancement strategies are proposed to augment the generated skeletal grid to ease the recognition. We further transform the coordinate data into two other spaces, i.e., the bone space and the motion space, to utilize their complementarity to better recognize the human action.
3. The final model achieves state-of-the-art performance on two challenging datasets for different tasks. The code will be released to facilitate future works.
Skeleton-based action recognition has been studied for decades. Early-stage approaches concentrate on designing various hand-crafted features [31]. With the rise of deep learning, the mainstream approaches in recent years lie in three aspects: (1) the RNN-based approaches, where the skeleton sequence is fed into RNN models and the joints are modeled in a predefined traversal order [35,10,25,24]; (2) the CNN-based approaches, where the skeleton sequence is transformed into a pseudo-image and fed into CNNs for recognition [8,12,1]; (3) the GCN-based approaches, where the skeleton sequence is encoded into a spatiotemporal graph according to the physical structure of the human body and is modeled with GCNs [32,28,21,20]. In contrast to previous works, we project the sparse joints into a 4D grid and employ a sparse 4D convolutional network to extract features and make predictions.
Since we regard the skeletal data as a point cloud, we introduce some recent works on perception tasks based on the 3D point cloud. There have been many works investigating how to model the point cloud with deep neural networks. Some works exploit the metric-space distance to aggregate features from local to global in a hierarchical manner [17,23]. Some works transform the original data into other forms such as surfaces [14], octrees [18] or sparse lattices [27], which are modeled with low-dimensional neural networks. Some works adopt a volumetric representation of the point cloud. For example, Tchapmi et al. [29] project the raw point cloud data into 3D volumes and employ 3D convolutional networks for 3D segmentation. It should be noted that for many tasks, the point cloud is video-based, and thus has an additional temporal dimension. For these tasks, one strategy is to process the video frames sequentially and finally aggregate the temporal features. For example, You et al. [34] represent multiple people's actions captured by depth cameras as a sequence of point-cloud-based volumes and process the volumes frame by frame with a 3D convolutional network. Another strategy is to directly model the videos in a 4D manner. For example, Choy et al. [4] project the sequence of 3D scans into a 4D tesseract and introduce a 4D sparse spatiotemporal convolutional network to extract features for 3D video segmentation. This work follows the voxelization-based methods.
Fig. 1. Pipeline overview: skeletons obtained by pose estimation are voxelized into a 4D grid and fed to a sparse 4D CNN for action classification (e.g., clapping, selfie, drink, walk).
Fig. 1 shows the pipeline of our method. Skeletal data can be obtained by motion-capture devices or by pose estimation algorithms from videos. It is first normalized and voxelized into a 4D grid. Then the generated grid is directly processed with a sparse 4D convolutional network, which hierarchically extracts semantics from the 4D grid and outputs high-level feature maps. Finally, these feature maps are global-average-pooled and classified with a SoftMax classifier. The following sections will go over these steps.
The raw skeletal data is a sequence of frames, each of which records the Cartesian coordinates of the human joints in the current frame. It is formulated as a set of joint coordinates $\{R^{raw}_{t,m,n} : R^{raw}_{t,m,n} \in \mathbb{R}^{C_{coor}}, t = 1, 2, \cdots, T, m = 1, 2, \cdots, M, n = 1, 2, \cdots, N\}$, where $T$, $M$ and $N$ denote the number of frames, persons and joints defined in the data acquisition system, respectively. In the rest of the paper, $C_{coor}$ defaults to 3, i.e., the joints are given in a 3D coordinate system.

Voxelizing these joint coordinates into a 4D grid is equivalent to calculating the new coordinates of these joints in the 4D-grid-based coordinate system. First, we add the temporal coordinate for every joint, i.e., $R^{raw}_{t,m,n} \in \mathbb{R}^3 \rightarrow R^{d}_{t,m,n} \in \mathbb{R}^4$. The dimension order of the 4D coordinate system defaults to time-z-y-x, which means $R^{d}_{t,m,n}(1)$, $R^{d}_{t,m,n}(2)$, $R^{d}_{t,m,n}(3)$ and $R^{d}_{t,m,n}(4)$ denote the t-coordinate, z-coordinate, y-coordinate and x-coordinate, respectively. The temporal coordinate is set as the index of frames, i.e., $R^{d}_{t,m,n}(1) = t$.

Then, we normalize the coordinates into the range from zero to the grid size. In detail, we use $S \in \mathbb{R}^4$ to denote the grid shape, e.g., $S = [32, 32, 32, 32]$. $S(i)$ denotes the grid size of coordinate dimension $i$, where $i = 1, 2, 3, 4$. $R^{d}_{t,m,n}$ is normalized to $R^{nor}_{t,m,n}$ by:

$$R^{nor}_{t,m,n}(i) = \frac{R^{d}_{t,m,n}(i) - \min_{\forall t,m,n}\{R^{d}_{t,m,n}(i)\}}{\max_{\forall t,m,n}\{R^{d}_{t,m,n}(i)\} - \min_{\forall t,m,n}\{R^{d}_{t,m,n}(i)\}} \times S(i) \quad (1)$$

Finally, since the coordinates should be integers, the final coordinate of the joints in the 4D grid, $R^{jpt}_{t,m,n}$, is obtained by $R^{jpt}_{t,m,n} = floor(R^{nor}_{t,m,n} + 0.5)$.

Fig. 2. Examples of two data enhancement techniques: bone interpolation and dilation.

After the voxelization process, we know the positions of all the joints in the 4D grid. Now we attach a feature vector to every joint to identify it. This is similar to the word embedding process in the NLP field, where words or phrases from the vocabulary are mapped to vectors of real numbers. In detail, $R^{jpt}_{t,m,n}$ is attached with a feature vector $F^{jpt}_{t,m,n} \in \mathbb{R}^{C_{fea}}$. In this work, we propose three strategies for the embedding. The first strategy sets the feature of the voxels occupied by joints to 1 and the feature of the other voxels to 0, i.e., $C_{fea} = 1$. However, this strategy cannot tell which joint occupies the current voxel, and thus the information of the joint semantics is lost. The second strategy uses $C_{fea} = M + N$. The first $M$ elements construct a one-hot vector that indicates whether the current voxel is occupied by a joint of the $m$-th person. Similarly, the last $N$ elements indicate whether it is the $n$-th joint. In formulation, $F^{jpt}_{t,m,n}(i) = \mathbb{I}\{i == m \,||\, i == M + n\}$, where $i = 1, 2, \cdots, C_{fea}$ and $||$ denotes "or". If the expression is true, $\mathbb{I}\{expression\} = 1$, otherwise $\mathbb{I}\{expression\} = 0$. However, it cannot distinguish the situation where the joints of multiple persons fall into the same voxel. The third strategy uses $C_{fea} = MN$. If the current voxel is the $n$-th joint of the $m$-th person, the corresponding element is set to 1, otherwise 0. In formulation, $F^{jpt}_{t,m,n}(i) = \mathbb{I}\{i == (n + m \times N)\}$. The third strategy is competent for more situations, but it needs a larger data volume.
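As a concrete illustration, the normalization of Eq. (1) and the third embedding strategy can be sketched as follows. This is a hypothetical numpy sketch; the function names and the toy input are our own assumptions, not the authors' code, and the indices here are 0-based unlike the 1-based notation in the text.

```python
import numpy as np

def voxelize(joints, grid_size):
    """Map raw (t, z, y, x) joint coordinates into integer 4D-grid indices.

    joints: (P, 4) array of per-point coordinates, time first.
    grid_size: length-4 sequence, e.g. (32, 32, 32, 32).
    Note that with Eq. (1) as written, the per-dimension maximum maps
    exactly to S(i), so indices range over 0..S(i) inclusive.
    """
    joints = np.asarray(joints, dtype=np.float64)
    lo = joints.min(axis=0)
    hi = joints.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)                 # avoid div by zero
    normed = (joints - lo) / span * np.asarray(grid_size)  # Eq. (1)
    return np.floor(normed + 0.5).astype(np.int64)         # round to voxels

def one_hot_feature(m, n, M, N):
    """Third embedding strategy: one-hot of size M*N for joint n of person m
    (0-based, so the hot index is m*N + n)."""
    f = np.zeros(M * N)
    f[m * N + n] = 1.0
    return f
```

For example, with two persons and 25 joints per person the feature vector has 50 elements, and joint 2 of person 1 lights up index 27.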
In the rest of the paper, we use the third strategy.

The number of human joints is generally small. We propose two spatial feature enhancement techniques to augment the local patterns, as shown in Fig. 2. The first technique interpolates points along the human bones, which are defined as the natural connections of the human body. Usually the human body can be viewed as a tree structure [33,11]; thus, the number of bones is one less than the number of joints. Fig. 4 shows the definition of the joints and bones for the two datasets. First, we compute the average length of every bone in the dataset. The bone with the shortest average length is defined as the basic bone, which means the number of interpolated points for this bone is one. Then, the number of interpolated points for each other bone is determined by the ratio of its length to the length of the basic bone. The points are uniformly interpolated in the segment between the two end joints of the bone. For example, if the length of the basic bone is 2 and the length of the forearm is 10, we uniformly interpolate 5 points in the segment between the wrist and the elbow. As for the features of the newly interpolated points, they are calculated as the weighted sum of the features of the two end joints, where the weight is inversely proportional to the distance from the corresponding end joint. In formulation, the number of bones is $B = N - 1$. The number of interpolated joints for every bone, i.e., $J \in \mathbb{R}^B$, is calculated by the above strategy. $J(b)$ denotes the number of interpolated joints of the $b$-th bone, where $b = 1, 2, \cdots, B$ is the bone index. A map $I \in \mathbb{R}^{B \times 2}$ is defined to record the indices of the two end joints of every bone. $I(b, 1)$ and $I(b, 2)$ are the indices of the two end joints of the $b$-th bone, i.e., $I(b, 1), I(b, 2) \in \{1, 2, \cdots, N\}$. The coordinate and the feature of an interpolated point are represented as $R^{inter}_{t,m,b,j} \in \mathbb{R}^4$ and $F^{inter}_{t,m,b,j} \in \mathbb{R}^{C_{fea}}$, respectively, where $j = 1, 2, \cdots, J(b)$ indexes the interpolated points of the $b$-th bone. Given $R^{jpt}_{t,m,n}$ and $F^{jpt}_{t,m,n}$, they are calculated as:

$$R^{inter}_{t,m,b,j} = R^{jpt}_{t,m,I(b,1)} + floor\left(\frac{R^{jpt}_{t,m,I(b,2)} - R^{jpt}_{t,m,I(b,1)}}{J(b) + 1} \times j + 0.5\right)$$
$$F^{inter}_{t,m,b,j} = F^{jpt}_{t,m,I(b,1)} \times \left(1 - \frac{j}{J(b)}\right) + F^{jpt}_{t,m,I(b,2)} \times \frac{j}{J(b)} \quad (2)$$

The second technique is the spatial dilation. We expand the coordinate of one point along all its spatial dimensions according to the dilation number. Using 1D data as an example, assume there is one point at position 2; after dilating it with dilation number 1, there will be three points, at positions 1, 2 and 3. For simplicity, when performing dilation along one dimension, the coordinates of the other dimensions are kept the same as before. The features of the newly dilated points are the same as those of the original points, but divided by a scale equal to the distance from the original point, so farther dilated points get proportionally weaker features. In formulation, given the dilation value $\Sigma$, the number of newly added points of one person in one frame along coordinate dimension $c$ is $D_c = 2\Sigma N'$. $c \in \{2, 3, 4\}$ because the dilation is only performed along the spatial dimensions. $N'$ denotes the number of original points, e.g., $N' = N + \sum_{b=1}^{B} J(b)$ if the bone interpolation is performed. Then, the coordinate and the feature of a dilated point are represented as $R^{dil}_{t,m,c,d} \in \mathbb{R}^4$ and $F^{dil}_{t,m,c,d} \in \mathbb{R}^{C_{fea}}$, where $d = 1, 2, \cdots, D_c$ indexes the newly dilated points along coordinate dimension $c$. Given $R^{jpt}_{t,m,n}$ and $F^{jpt}_{t,m,n}$, they are calculated as:

$$\sigma = ceil\left(\frac{d}{2N'}\right), \quad \varphi = ceil\left(\frac{(d - 1)\,\%\,2N' + 1}{2}\right)$$
$$R^{dil}_{t,m,c,d}(i) = R^{jpt}_{t,m,\varphi}(i) + (-1)^d \times \sigma \times \mathbb{I}\{i == c\}$$
$$F^{dil}_{t,m,c,d} = F^{jpt}_{t,m,\varphi} / \sigma \quad (3)$$

where $i = 1, 2, 3, 4$. $\sigma \in \{1, 2, \ldots, \Sigma\}$ denotes the distance from the newly dilated point to the original point, and $\varphi \in \{1, 2, \ldots, N'\}$ denotes the index of the corresponding original point in the current dilation step. $\%$ denotes the remainder operation, and $ceil$ rounds float coordinates up to integers.

After the voxelization and the data enhancement, we remove the duplicate points that fall into the same voxel. In detail, we simply keep the first one encountered during the traversal of points. Now the raw skeletal joint coordinates have been transformed into coordinate vectors of the 4D grid, $R^{in}_{n} \in \mathbb{R}^4$, and corresponding feature vectors $F^{in}_{n} \in \mathbb{R}^{C_{fea}}$, where $n = 1, 2, \cdots, N^{in}$ and $N^{in}$ denotes the total number of points after performing the data enhancement techniques and removing the duplicates. Then, the problem is how to process them with neural networks. The conventional method is updating the traditional 3D CNNs to 4D CNNs by expanding the convolutional kernel dimension to 4, which we call dense convolution. Specifically, we create a 5D tensor $F^{den} \in \mathbb{R}^{S(1) \times S(2) \times S(3) \times S(4) \times C_{fea}}$ to represent a 4D grid and fill its values based on $R^{in}_{n}$ and $F^{in}_{n}$, where $S \in \mathbb{R}^4$ denotes the grid shape. $F^{den}$ can be directly fed into a 4D CNN. However, because most of the elements in the generated 4D grid are 0, it is unnecessary to perform convolutions for every element, which in practice wastes computation and GPU memory. Instead, we use the sparse tensor to reduce the computation, following the method introduced in [4]. To perform convolution or other operations sparsely, the key is to obtain a mapping $\mathcal{M}$ that identifies which input affects which output, according to the input coordinates $R^{in}_{n}$ and the operation definitions.
The mapping is defined as triplets of input indices $I^{in} \in \mathbb{R}^{N^{in}}$, output indices $I^{out} \in \mathbb{R}^{N^{out}}$ and (optionally) weight indices $I^{wei} \in \mathbb{R}^{N^{wei}}$, i.e., $\mathcal{M} = \{(I^{in}(i), I^{out}(j), I^{wei}(k))\}$, where $i \in \{1, 2, \cdots, N^{in}\}$, $j \in \{1, 2, \cdots, N^{out}\}$ and $k \in \{1, 2, \cdots, N^{wei}\}$. Then the output is calculated according to the inputs ($R^{in}_{n}$ and $F^{in}_{n}$), the optional weights $W$, the mapping $\mathcal{M}$ and the definition of the operation $f$, i.e., $R^{out}_{n}, F^{out}_{n} \leftarrow f(R^{in}_{n}, F^{in}_{n}, W, \mathcal{M})$. For the convolutional operation, the input features are multiplied with the corresponding weights and then added to the corresponding output features based on $\mathcal{M}$. For pooling-based operations such as max-pooling and global-average-pooling, weights are not needed: the input features are gathered and directly reduced based on $\mathcal{M}$ to get the output features. For non-spatial functions such as Batch Normalization and ReLU, we can directly use the 1D dense operation on the input features.
Fig. 3. Architecture of the SC4D. Each convolution is followed by a Batch Normalization layer and a ReLU layer. K denotes the kernel size; C denotes the basic number of filters.
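To make the mapping-driven computation concrete, here is a hypothetical toy implementation of a stride-1 sparse convolution over non-empty voxels only. All names, shapes and the brute-force lookup are our own simplifications for illustration; a real implementation (e.g., the generalized sparse convolution of [4]) builds the mapping far more efficiently.

```python
import numpy as np

def sparse_conv(coords, feats, weights, offsets):
    """coords: (P, 4) int voxel coordinates; feats: (P, Cin) features;
    weights: (K, Cin, Cout), one Cin x Cout matrix per kernel offset;
    offsets: (K, 4) int kernel offsets. Output sites are taken to be the
    input sites (stride 1). Only occupied voxels are ever touched."""
    index = {tuple(c): i for i, c in enumerate(coords.tolist())}
    out = np.zeros((len(coords), weights.shape[2]))
    for k, off in enumerate(offsets.tolist()):
        for i, c in enumerate(coords.tolist()):
            src = tuple(np.array(c) + off)   # which input feeds output i
            j = index.get(src)
            if j is not None:                # a mapping triplet (j, i, k)
                out[i] += feats[j] @ weights[k]
    return out
```

The dictionary lookup plays the role of the mapping $\mathcal{M}$: each hit contributes one (input, output, weight) triplet, and empty voxels never enter the computation.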
After defining these operations, we can build the network entirely with generalized sparse operations. Guided by extensive experiments, the architecture of the sparse 4D convolutional network (SC4D) for skeleton-based action recognition is built as shown in Fig. 3. It is inspired by the C3D network [30]. All layers are built with sparse operations. There are nine sparse convolutional layers in total. Each sparse convolution is followed by a sparse Batch Normalization layer and a sparse ReLU layer. The kernel size is 1 for the first layer and K for the others. The numbers of filters are C, C, 2C, 4C, 4C, 8C, 8C, 8C and 8C, respectively. Both K and C can be adjusted to balance the model size and the performance. The first sparse convolutional layer serves as an embedding layer that embeds the one-hot features into the feature space. Residual connections are added for every convolution except the first one to ease the gradient flow, following [6]. A sparse max-pooling layer is added after the 2nd, 3rd, 5th and 7th convolutional layers. The stride of the max-pooling is 2 for all dimensions. If the input grid size is too small, the first several pooling layers are removed to preserve the resolution. A sparse global-average-pooling layer and a sparse fully-connected layer are added at the end to make predictions. Dropout is used before the fully-connected layer to avoid overfitting.

Previous methods have shown that apart from the position information, the bone information and the motion information of the skeletal data are also helpful for action recognition [21]. Here, we transform the raw skeletal data from the coordinate space into the bone space and the motion space to utilize these two types of information.
In detail, for the bone information, given the raw joint coordinates $R^{raw}_{t,m,n} \in \mathbb{R}^3$ of the original space, we first calculate the corresponding raw coordinates $R^{braw}_{t,m,b} \in \mathbb{R}^3$ of the bone space by:

$$R^{braw}_{t,m,b} = R^{raw}_{t,m,I(b,2)} - R^{raw}_{t,m,I(b,1)} \quad (4)$$

where $b = 1, 2, \cdots, B$ and the map $I \in \mathbb{R}^{B \times 2}$ records the indices of the two end joints of every bone. Then, similar to the procedure for voxelizing $R^{raw}_{t,m,n}$ introduced in Sec. 3.2, $R^{braw}_{t,m,b}$ is voxelized into a bone-space-based 4D grid, resulting in the sparse bone-space-based coordinate vector $R^{bone}_{t,m,b}$ and the corresponding feature vector $F^{bone}_{t,m,b}$. $F^{bone}_{t,m,b}$ is constructed the same way as $F^{jpt}_{t,m,n}$. Similarly, the raw coordinates $R^{mraw}_{\tau,m,n} \in \mathbb{R}^3$ of the motion space are obtained by:

$$R^{mraw}_{\tau,m,n} = R^{raw}_{t+1,m,n} - R^{raw}_{t,m,n} \quad (5)$$

where $\tau = 1, 2, \cdots, T - 1$. It is also voxelized into a motion-space-based 4D grid, resulting in $R^{motion}_{\tau,m,n}$ and $F^{motion}_{\tau,m,n}$. The bone information and the motion information are modeled with two additional networks with the same architecture as SC4D. The SoftMax scores of the three streams are averaged to get the final prediction.

We specifically select two datasets, namely, NTU-RGBD and SHREC, with different tasks to show the generalizability of our model.
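The bone- and motion-space transforms of Eqs. (4) and (5) above can be sketched as plain coordinate differences. This is a hedged numpy sketch; the function names and the toy bone map in the example are our own assumptions, not a dataset's real skeleton definition.

```python
import numpy as np

def to_bone_space(raw, bone_pairs):
    """raw: (T, M, N, 3) joint coordinates; bone_pairs: (B, 2) end-joint
    index pairs I(b, .) (0-based here). Returns (T, M, B, 3) bone vectors,
    i.e. second end joint minus first end joint (Eq. 4)."""
    raw = np.asarray(raw, dtype=float)
    i1, i2 = bone_pairs[:, 0], bone_pairs[:, 1]
    return raw[:, :, i2] - raw[:, :, i1]

def to_motion_space(raw):
    """Frame-to-frame coordinate differences (Eq. 5): (T-1, M, N, 3)."""
    raw = np.asarray(raw, dtype=float)
    return raw[1:] - raw[:-1]
```

Each transformed array is then voxelized exactly as the joint coordinates are, and feeds its own SC4D stream.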
NTU-RGBD consists of 56,000 action clips in 60 action classes. Each action is captured by 3 cameras. It provides the 3D locations of 25 joints detected by the Kinect-V2 depth sensor, as shown in Fig. 4, left. Each video has no more than 2 subjects. Because the accuracy on the cross-view benchmark is nearly saturated, we conduct experiments on the cross-subject benchmark of the dataset, where the training and testing sets are split based on different subjects. Since the whole dataset is large and the training is time-consuming, we extract a subset of NTU-RGBD, namely, NTU-RGBD-SUB, for the ablation studies. Specifically, because the samples are captured by 3 cameras, the samples captured by the first camera are used to form NTU-RGBD-SUB.
SHREC contains 2800 gesture sequences performed by 28 subjects in two ways: using one finger to perform the gestures or using the whole hand. It provides the 3D coordinates of 22 hand joints captured by an Intel RealSense depth camera, as shown in Fig. 4, right. This dataset was used for the SHREC'17 competition held in conjunction with the Eurographics 3DOR'2017 Workshop, and thus it reflects the highest level in this field.

All experiments are conducted with the PyTorch deep learning framework [15]. Stochastic gradient descent (SGD) with Nesterov momentum (0.9) is applied as the optimization strategy.
Fig. 4. Illustration of the joint indices and the bones for (a) the NTU-RGBD dataset and (b) the SHREC dataset.

The batch size is set to 32. Cross-entropy is selected as the loss function to back-propagate the gradients. The weight decay is set to 0.

Ablation studies are conducted on NTU-RGBD-SUB. First, we investigate the feature representation strategies introduced in Sec. 3.3. We update the C3D network [30] to C4D by simply expanding the dimension of the kernel from 3 to 4; everything else is kept the same as before. Due to GPU memory limitations, we set the grid size to 16 (the same for all dimensions) and the number of basic channels to 16, i.e., the output channels of the 8 convolutional layers in C4D are 16, 32, 64, 64, 128, 128, 128 and 128. Tab. 1 shows the results of the three strategies, where S3 performs best, as expected.

Then, we replace the dense convolution with the sparse convolution to see the difference. To keep the comparison fair, instead of using the final SC4D shown in Fig. 3, we use a C4D-comparable model named "Sparse C4D" for the comparison, whose architecture is the same as C4D except for the sparse operations.
Table 1. Action recognition performance for different feature representation strategies. S1, S2 and S3 correspond to the strategies introduced in Sec. 3.3, in order.

Strategy  Acc. (%)
S1        80.6
S2        81.7
S3        82.3

As shown in Tab. 2, dense C4D achieves only slightly better accuracy than sparse C4D, but sparse C4D largely saves computation. Dense C4D needs 4 TITAN XP GPUs to train the model with the PyTorch framework, while sparse C4D needs only one tenth of the memory of a single GPU.
Table 2. Comparison of the dense convolution and the sparse convolution.

Method      Acc. (%)
Dense C4D   82.3
Sparse C4D  82.1
Now that a large amount of GPU memory can be saved by using the sparse convolution, we use the network architecture shown in Fig. 3, i.e., SC4D, for the rest of the experiments. Compared with sparse C4D, residual connections are added for every convolutional layer and a feature embedding layer is added at the beginning. The basic number of filters is set to 32, i.e., C=32, and the basic kernel size is 3, i.e., K=3. With these design choices, SC4D performs better than sparse C4D. Then, we investigate the effect of the grid size, as shown in Tab. 3. "Overlap" denotes the ratio of the number of points in the sparse tensor to the number of original points. It reflects the degree to which different joints fall into the same voxel, which causes a loss of information; it is 100% when no two joints fall into the same voxel. The results show that properly increasing the grid size improves the performance (Size=16 vs Size=32 vs Size=64). We found that using a larger grid reduces the overlaps of the joints and thus keeps more useful information. However, the grid size cannot be increased indefinitely (Size=64 vs Size=128), because along with the increase of the grid size, the distance between the points also grows, which brings difficulties for relation modeling with the fixed-size convolutional kernel.

The two most important factors that affect the model performance are the kernel size and the number of filters. Tab. 4 lists the performance of different configurations of the kernel size and the number of filters. The grid size is fixed to 32. It shows that increasing the kernel size improves the performance, especially when the original kernel size is small (K=3 vs K=4 vs K=5). We believe this is because a larger kernel covers more points and thus captures more information in one convolutional step. The improvement decreases or even becomes negative when the kernel size is already large enough (K=5 vs K=6).
This is because the larger kernel size also brings more parameters, which causes difficulty for network training. A similar phenomenon is also observed for the number of filters (C=32 vs C=64 vs C=128).

Table 3. Action recognition performance for different grid sizes. "Overlap" denotes the ratio of the number of points in the sparse tensor to the number of original points. It is 100% when no two joints fall into the same voxel.

Grid size  Acc. (%)  Overlap (%)
Size=16    86.6      93.1
Size=32    87.9      98.6
Size=64    88.8      99.7
Size=128   88.5      99.9
Table 4. Action recognition performance for different kernel sizes (K) and numbers of filters (C). K and C correspond to Fig. 3.

  Configuration   Acc. (%)
  K=3, C=32       87.9
  K=4, C=32       89.0
  K=5, C=32       90.2
  K=6, C=32       89.8
  K=3, C=64       89.1
  K=3, C=128      89.4

Next, we test the effectiveness of the two data enhancement strategies introduced in Sec. 3.3. The grid size is increased to 64 to make the grid sparser. Tab. 5 shows that both strategies bring improvement. However, since the number of points is also increased, the data becomes denser and requires more computation and memory.
Table 5. Action recognition performance for different data enhancement strategies.

  Strategy         Acc. (%)
  no enhancement   88.8
  +edge            88.9
  +edge & dilate   89.7
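The "+edge" strategy densifies the skeleton by inserting interpolated points along the bones. A minimal sketch of this idea (our own hedged illustration, assuming bones are given as joint-index pairs; not the paper's exact procedure):

```python
import numpy as np

def densify_with_edges(joints, bones, n_interp=2):
    """Insert extra points along each bone to densify the skeleton.

    joints: (J, 3) array of 3D joint coordinates for one frame.
    bones:  list of (parent, child) joint-index pairs.
    n_interp: number of evenly spaced points added per bone.
    Returns the original joints plus the interpolated edge points.
    """
    joints = np.asarray(joints, dtype=np.float64)
    extra = []
    for a, b in bones:
        for k in range(1, n_interp + 1):
            t = k / (n_interp + 1)  # fraction along the bone
            extra.append((1 - t) * joints[a] + t * joints[b])
    return np.vstack([joints] + extra) if extra else joints

# Toy skeleton: 3 joints in a chain, 2 bones.
joints = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]])
bones = [(0, 1), (1, 2)]
dense = densify_with_edges(joints, bones, n_interp=2)
print(dense.shape)  # (3 + 2*2, 3) -> (7, 3)
```

The extra points fill in the spatial pattern of the limbs, which is what makes the enlarged grid in Tab. 5 easier to recognize.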
Finally, we investigate the effectiveness of the three streams introduced in Sec. 3.7. They are processed by three networks with the same architecture, i.e., SC4D, and the SoftMax scores are fused to obtain the final prediction. Tab. 6 shows that using the bone information or the motion information alone performs worse than the original joint data, but fusing the three streams largely improves the performance.
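The bone and motion modalities referred to here are typically derived from the joint coordinates. A hedged sketch of the two derivations (the bone list and all names are illustrative, not the paper's exact definitions):

```python
import numpy as np

def bone_stream(joints, bones):
    """Bone vectors: child joint minus parent joint, per frame.

    joints: (T, J, 3) sequence of joint coordinates.
    bones:  list of (parent, child) index pairs, one per bone.
    """
    return np.stack([joints[:, c] - joints[:, p] for p, c in bones], axis=1)

def motion_stream(joints):
    """Temporal differences between consecutive frames (zero-padded)."""
    motion = np.zeros_like(joints)
    motion[1:] = joints[1:] - joints[:-1]
    return motion

T, J = 5, 3
joints = np.arange(T * J * 3, dtype=np.float64).reshape(T, J, 3)
bones = [(0, 1), (1, 2)]
print(bone_stream(joints, bones).shape)   # (5, 2, 3)
print(motion_stream(joints).shape)        # (5, 3, 3)
```

Each derived sequence is voxelized and fed to its own SC4D, exactly like the joint stream.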
Table 6. Action recognition performance for different streams.

  Modality   Acc. (%)
  joint      88.8
  bone       86.9
  motion     81.5
  fusion     90.7
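The score-level fusion in Tab. 6 can be sketched as a simple average of per-stream SoftMax scores (an illustrative sketch with random placeholder logits; names are our own):

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_streams(stream_logits):
    """Average the softmax scores of several streams and predict classes.

    stream_logits: list of (batch, num_classes) logit arrays, one per
    stream (e.g. joint, bone, motion).
    """
    scores = [softmax(l) for l in stream_logits]
    fused = np.mean(scores, axis=0)
    return fused.argmax(axis=1), fused

rng = np.random.default_rng(1)
joint, bone, motion = (rng.normal(size=(4, 60)) for _ in range(3))
pred, fused = fuse_streams([joint, bone, motion])
print(pred.shape)  # (4,)
```

Because the streams make complementary errors, the fused score in Tab. 6 exceeds any single stream.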
Although many strategies have been shown to be effective for improving the performance, such as increasing the kernel size or the number of filters, we have to make some trade-offs due to the GPU memory limitations. Finally, the kernel size of SC4D is set to 3 for the temporal dimension and 5 for the spatial dimensions, and the basic number of filters is 64, i.e., C=64. The grid size is 64 for the NTU-RGBD dataset and 32 for the SHREC dataset, because the "overlap" (introduced in Tab. 3) of the 32 × 32 × 32 × 32 grid already reaches 99.7% for SHREC. The bone interpolation strategy is used for the joint stream. We test our final model on two datasets: NTU-RGBD for action recognition and SHREC for gesture recognition. The results are shown in Tab. 7 and Tab. 8, where SC4D achieves state-of-the-art performance on both datasets. More comparisons are shown in the supplementary material. This verifies the effectiveness and generalizability of our method.
Table 7. Comparison with the state-of-the-art methods on the NTU-RGBD dataset.

  Method              Year   Acc. (%)
  AGC-LSTM [24]       2019   89.2
  2s-AGCN [21]        2019   88.5
  DGNN [20]           2019   89.9
  Bayesian-GCN [36]   2020   81.8
  NAS [16]            2020   89.4
  SC4D (ours)         -

Table 8. Comparison with the state-of-the-art methods on the SHREC dataset.

  Method               Year   14 gestures   28 gestures
  ST-GCN [32]          2018   92.7          87.7
  STA-Res-TCN [7]      2018   93.6          90.7
  ST-TS-HGR-NET [13]   2019   94.3          89.4
  DG-STA [3]           2019   94.4          90.7
  SC4D (ours)          -
In this work, we propose a new perspective for skeleton-based action recognition, where the skeletal data is viewed as a point cloud and voxelized into a 4D grid for recognition. A novel sparse 4D convolutional network (SC4D) is proposed to directly model the 4D grid, which largely keeps the spatial-temporal consistency of the skeletal data. The overall framework is concise, robust, and efficient thanks to the sparse operations. Besides, two data enhancement techniques are introduced to augment the spatial pattern and ease recognition, and the data is additionally projected into two other spaces to exploit the bone information and the motion information for better performance. Our method achieves state-of-the-art performance on two challenging datasets for different tasks, which confirms its effectiveness and generalizability.
References
1. Cao, C., Lan, C., Zhang, Y., Zeng, W., Lu, H., Zhang, Y.: Skeleton-Based Action Recognition with Gated Convolutional Neural Networks. IEEE Transactions on Circuits and Systems for Video Technology pp. 3247–3257 (2018)
2. Carreira, J., Zisserman, A.: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6299–6308 (2017)
3. Chen, Y., Zhao, L., Peng, X., Yuan, J., Metaxas, D.N.: Construct Dynamic Graphs for Hand Gesture Recognition via Spatial-Temporal Attention. In: BMVC (2019)
4. Choy, C., Gwak, J., Savarese, S.: 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. arXiv preprint arXiv:1904.08755 (2019)
5. Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast Networks for Video Recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6202–6211 (2019)
6. He, K., Zhang, X., Ren, S., Sun, J.: Identity Mappings in Deep Residual Networks. In: The European Conference on Computer Vision (ECCV). pp. 630–645 (2016)
7. Hou, J., Wang, G., Chen, X., Xue, J.H., Zhu, R., Yang, H.: Spatial-Temporal Attention Res-TCN for Skeleton-Based Dynamic Hand Gesture Recognition. In: The European Conference on Computer Vision (ECCV) (2018)
8. Li, C., Zhong, Q., Xie, D., Pu, S.: Skeleton-Based Action Recognition with Convolutional Neural Networks. In: IEEE International Conference on Multimedia & Expo Workshops (ICMEW). pp. 597–600 (2017)
9. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition. pp. 3595–3603 (2019)
10. Li, S., Li, W., Cook, C., Zhu, C., Gao, Y.: Independently Recurrent Neural Network (IndRNN): Building a Longer and Deeper RNN. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5457–5466 (2018)
11. Liu, J., Shahroudy, A., Xu, D., Wang, G.: Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In: The European Conference on Computer Vision (ECCV). vol. 9907, pp. 816–833 (2016)
12. Liu, M., Liu, H., Chen, C.: Enhanced Skeleton Visualization for View Invariant Human Action Recognition. Pattern Recognition 68