Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action Recognition
Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu
NLPR & AIRIA, Institute of Automation, Chinese Academy of Sciences
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
CAS Center for Excellence in Brain Science and Intelligence Technology
Abstract.
Dynamic skeletal data, represented as the 2D/3D coordinates of human joints, has been widely studied for human action recognition due to its high-level semantic information and environmental robustness. However, previous methods heavily rely on designing hand-crafted traversal rules or graph topologies to draw dependencies between the joints, which limits their performance and generalizability. In this work, we present a novel decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition. It involves solely the attention blocks, allowing for modeling spatial-temporal dependencies between joints without the requirement of knowing their positions or mutual connections. Specifically, to meet the specific requirements of the skeletal data, three techniques are proposed for building attention blocks, namely, spatial-temporal attention decoupling, decoupled position encoding and spatial global regularization. Besides, from the data aspect, we introduce a skeletal data decoupling technique to emphasize the specific characteristics of space/time and different motion scales, resulting in a more comprehensive understanding of human actions. To test the effectiveness of the proposed method, extensive experiments are conducted on four challenging datasets for skeleton-based gesture and action recognition, namely, SHREC, DHG, NTU-60 and NTU-120, where DSTA-Net achieves state-of-the-art performance on all of them.
Keywords:
Skeleton, Action Recognition, Attention.
1 Introduction

Human action recognition has been studied for decades since it can be widely used for many applications such as human-computer interaction and abnormal behavior monitoring [3,27,8,24]. Recently, skeletal data has drawn increasingly more attention because it contains higher-level semantic information in a small amount of data and has strong adaptability to dynamic circumstances [33,26,25]. The raw skeletal data is a sequence of frames, each of which contains a set of points. Each point represents a joint of the human body in the form of 2D/3D coordinates. Previous data-driven methods for skeleton-based action recognition rely on manual designs of traversal rules or graph topologies to transform the raw skeletal data into a meaningful form such as a point sequence, a pseudo-image or a graph, so that it can be fed into deep networks such as RNNs, CNNs and GCNs for feature extraction [33,21,34]. However, there is no guarantee that the hand-crafted rule is the optimal choice for modeling global dependencies of joints, which limits the performance and generalizability of previous approaches.

Recently, the transformer [31,5] has achieved great success in the NLP field, whose basic block is the self-attention mechanism. It can learn the global dependencies between the input elements with less computational complexity and better parallelizability. For skeletal data, employing the self-attention mechanism has an additional advantage: there is no requirement of knowing the intrinsic relations between the elements, thus it provides more flexibility for discovering useful patterns. Besides, since the number of joints of the human body is limited, the extra cost of applying the self-attention mechanism is relatively small.

Inspired by the above observations, we propose a novel decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition. It is based solely on the self-attention mechanism, without using the structure-relevant RNNs, CNNs or GCNs. However, it is not straightforward to apply a pure attention network to skeletal data, as shown in the following three aspects: (1) The input of the original self-attention mechanism is sequential data, while the skeletal data exists in both the spatial and temporal dimensions. A naive method is simply flattening the spatial-temporal data into a single sequence like [32]. However, it is not reasonable to treat time and space equivalently because they contain totally different semantics [8]. Besides, the simple flattening operation increases the sequence length, which greatly increases the computation cost due to the dot-product operation of the self-attention mechanism. Instead, we propose to decouple the self-attention mechanism into the spatial attention and the temporal attention, applied sequentially. Three strategies are specially designed to balance the independence and the interaction between space and time. (2) There are no predefined orders or structures when feeding the skeletal joints into the attention networks. To provide unique markers for every joint, a position encoding technique is introduced. For the same reason as before, it is also decoupled into the spatial encoding and the temporal encoding. (3) It has been verified that adding proper regularization based on prior knowledge can effectively reduce the over-fitting problem and improve the model generalizability. For example, due to the translation-invariant structure of images, CNNs exploit the local-weight-sharing mechanism to force the model to learn more general filters for different regions of images. As for skeletal data, each joint of the skeleton has a specific physical/semantic meaning (e.g., head or hand), which is fixed for all the frames and is consistent for all the data samples. Based on this prior knowledge, a spatial global regularization is proposed to force the model to learn more general attentions for different samples. Note that the regularization is not suitable for the temporal dimension because there is no such semantic alignment property.

Besides, from the data aspect, the most discriminative pattern is distinct for different actions. We claim that two properties should be considered. One property is whether the action is motion-relevant or motion-irrelevant, which aims
to choose the specific characteristics of space and time. For example, to classify the gestures of "waving up" versus "waving down", the global trajectory of the hand is more important than the hand shape, but when recognizing gestures like "point with one finger" versus "point with two fingers", the spatial pattern is more important than the hand motion. Based on this observation, we propose to decouple the data into the spatial and temporal dimensions, where the spatial stream contains only the motion-irrelevant features and the temporal stream contains only the motion-relevant features. By modeling these two streams separately, the model can better focus on spatial/temporal features and identify specific patterns. Finally, by fusing the two streams, it can obtain a more comprehensive understanding of the human actions. Another property is the sensitivity to motion scales. For the temporal stream, the classification of some actions may rely on the motion mode of a few consecutive frames, while others may rely on the overall movement trend. For example, to classify the gestures of "clapping" versus "put two hands together", the short-term motion detail is essential. But for "waving up" versus "waving down", the long-term motion trend is more important. Thus, inspired by [8], we split the temporal information into a fast stream and a slow stream based on the sampling rate. The low-frame-rate stream can capture more global motion information and the high-frame-rate stream can focus more on the detailed movements. Similarly, the two streams are fused to improve the recognition performance.

We conduct extensive experiments on four datasets, including two hand gesture recognition datasets, i.e., SHREC and DHG, and two human action recognition datasets, i.e., NTU-60 and NTU-120. Without the need of hand-crafted traversal rules or graph topologies, our method achieves state-of-the-art performance on all these datasets, which demonstrates the effectiveness and generalizability of the proposed method.

Overall, our contributions lie in four aspects:

1. To the best of our knowledge, we are the first to propose a decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition, which is built with pure attention modules without manual designs of traversal rules or graph topologies.
2. We propose three effective techniques for building attention networks to meet the specific requirements of skeletal data, namely, spatial-temporal attention decoupling, decoupled position encoding and spatial global regularization.
3. We propose to decouple the data into four streams, namely, the spatial-temporal stream, spatial stream, slow-temporal stream and fast-temporal stream, each focusing on a specific aspect of the skeleton sequence. By fusing the different types of features, the model can have a more comprehensive understanding of human actions.
4. On four challenging datasets for action recognition, our method achieves state-of-the-art performance with a significant margin. DSTA-Net outperforms the state-of-the-art by 2.6%/3.2% and 1.9%/2.9% on the 14-class/28-class benchmarks of SHREC and DHG, respectively. It achieves 91.5%/96.4% on the CS/CV benchmarks of NTU-60 and 86.6%/89.0% on the CS/CE benchmarks of NTU-120. The code will be released for future works.
2 Related Work

Skeleton-based action recognition has been widely studied for decades. The mainstream methods lie in three branches: (1) RNN-based methods that formulate the skeletal data as sequential data with predefined traversal rules and feed it into RNN-based models such as the LSTM [34,13,29,28]; (2) CNN-based methods that convert the input skeletons into a pseudo-image with hand-crafted transformation rules and model it with various successful networks from the image classification field [12,16,2]; (3) GCN-based methods that encode the skeletal data into a predefined spatial-temporal graph and model it with graph convolutional networks [33,30,26]. In this work, instead of formulating the skeletal data as images or graphs, we directly model the dependencies of the joints with pure attention blocks. Our model is more concise and general, without the need to design hand-crafted transformation rules, and it outperforms previous methods with a significant margin.
The self-attention mechanism is the basic block of the transformer [31,5], which is the mainstream method in the NLP field. Its input consists of a set of queries Q, keys K of dimension C, and values V, which are packaged in matrix form for fast computation. It first computes the dot products of the query with all keys, divides each by √C, and applies a softmax function to obtain the weights on the values [31]. In formulation:

Attention(Q, K, V) = softmax(QK^T / √C) V    (1)

A similar idea has also been used for many computer vision tasks such as relation modeling [22], detection [11] and semantic segmentation [9]. To the best of our knowledge, we are the first to apply pure attention networks to skeletal data, and we further propose several improvements to meet the specific requirements of skeletons. A minimal sketch of Eq. (1) is given below.
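To make Eq. (1) concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch; the function name and shapes are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Eq. (1): softmax(Q K^T / sqrt(C)) V, with Q, K, V of shape (N, C)."""
    c = q.size(-1)
    attn = F.softmax(q @ k.transpose(-2, -1) / c ** 0.5, dim=-1)  # (N, N) attention map
    return attn @ v                                               # weighted sum of the values

# Toy self-attention over 25 joints with 64-channel features (Q = K = V = X).
x = torch.randn(25, 64)
out = scaled_dot_product_attention(x, x, x)  # (25, 64)
```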
3 Method

In this section, we first propose three techniques for building attention blocks in Sec. 3.1, Sec. 3.2 and Sec. 3.3. Then, the basic attention module and the detailed architecture of the network are introduced in Sec. 3.4 and Sec. 3.5. Finally, the data decoupling and the overall multi-stream solution are described in Sec. 3.6.

3.1 Spatial-Temporal Attention Decoupling

The original transformer is fed with sequential data, i.e., a matrix X ∈ R^{N×C}, where N denotes the number of elements and C denotes the number of channels.

Fig. 1. Illustration of the three decoupling strategies. We use the spatial attention strategy as an example; the temporal attention strategy is analogous. N and T denote the number of joints and frames, respectively.
For dynamic skeletal data, the input is a third-order tensor X ∈ R^{N×T×C}, where T denotes the number of frames. It is worth investigating how to deal with the relationship between time and space. Wang et al. [32] propose to ignore the difference between time and space and regard the input as sequential data X ∈ R^{N̂×C}, where N̂ = NT. However, the temporal dimension and the spatial dimension are totally different, as introduced in Sec. 1, and it is not reasonable to treat them equivalently. Besides, the computational complexity of calculating the attention map in this strategy is O(T^2 N^2 C) (using the naive matrix multiplication algorithm), which is too large. Instead, we propose to decouple the spatial and temporal dimensions, which largely reduces the computational complexity and improves the performance.

We design three strategies for decoupling, as shown in Fig. 1. Using the spatial attention as an example, the first strategy (Fig. 1, a) calculates the attention maps frame by frame, and each frame uses a unique attention map:

A_t = softmax(σ(X_t) φ(X_t)′)    (2)

where A_t ∈ R^{N×N} is the attention map for frame t, X_t ∈ R^{N×C}, σ and φ are two embedding functions, and ′ denotes the matrix transpose. This strategy only considers the dependencies of joints within a single frame and thus lacks modeling capacity. The computational complexity of calculating the spatial attention with this strategy is O(T N^2 C). For the temporal attention, the attention map of joint n is A_n ∈ R^{T×T} and the input data is X_n ∈ R^{T×C}; its calculation is analogous to the spatial attention. Considering both the spatial and the temporal attention, the computational complexity of the first strategy for all frames is O(T N^2 C + N T^2 C).

The second strategy (Fig. 1, b) calculates the relations of two joints between all of the frames, which means both the intra-frame relations and the inter-frame relations of two joints are taken into account simultaneously. The attention map is shared over all frames. In formulation:

A = softmax(Σ_{t=1}^{T} Σ_{τ=1}^{T} σ(X_t) φ(X_τ)′)    (3)
The computational complexity of this strategy is O(T^2 N^2 C + N^2 T^2 C).

The third strategy (Fig. 1, c) is a compromise, where only the joints in the same frame are considered when calculating the attention map, but the obtained attention maps of all frames are averaged and shared. It is equivalent to adding a time-consistency restriction on the attention computation, which can somewhat reduce the overfitting problem caused by the element-wise relation modeling of the second strategy:

A = softmax(Σ_{t=1}^{T} σ(X_t) φ(X_t)′)    (4)

By concatenating the frames into an N × TC matrix, the summation of matrix multiplications can be efficiently implemented with one big matrix multiplication, as illustrated by the snippet below. The computational complexity of this strategy is O(T N^2 C + N T^2 C). As shown in the ablation study (Sec. 4.3), we finally use strategy (c) in the model.
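A small sketch verifying this equivalence, with the embeddings σ and φ omitted for brevity; all names and sizes are illustrative:

```python
import torch

# Strategy (c): summing per-frame N x N dot products equals one matmul on
# the frames concatenated into an N x (T*C) matrix.
N, T, C = 22, 8, 16
x = torch.randn(T, N, C)

per_frame = sum(x[t] @ x[t].transpose(0, 1) for t in range(T))   # sum_t X_t X_t'
concat = x.permute(1, 0, 2).reshape(N, T * C)                    # N x (T*C)
one_matmul = concat @ concat.transpose(0, 1)

print(torch.allclose(per_frame, one_matmul, atol=1e-5))          # True
```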
3.2 Decoupled Position Encoding

The skeletal joints are organized as a tensor to be fed into the neural networks. Because there are no predefined orders or structures for each element of the tensor to show its identity (e.g., the joint index or frame index), we need a position encoding module to provide unique markers for every joint. Following [31], we use the sine and cosine functions with different frequencies as the encoding functions:

PE(p, 2i) = sin(p / 10000^{2i/C_in})
PE(p, 2i+1) = cos(p / 10000^{2i/C_in})    (5)

where p denotes the position of the element and i indexes the dimensions of the position encoding vector. However, different from [31], the input skeletal data has two dimensions, i.e., space and time. One strategy for position encoding is unifying the spatial and temporal dimensions and encoding them sequentially. For example, assuming there are three joints, the positions of the joints in the first frame are 1, 2, 3, and in the second frame they are 4, 5, 6. This strategy cannot well distinguish the same joint in different frames. Another strategy is decoupling the process into a spatial position encoding and a temporal position encoding. Using the spatial position encoding as an example, the joints in the same frame are encoded sequentially and the same joint in different frames has the same encoding. In the above example, this means the positions in the first frame are 1, 2, 3, and in the second frame they are also 1, 2, 3. The temporal position encoding is reversed and analogous, which means the joints in the same frame have the same encoding and the same joint in different frames is encoded sequentially. Finally, the position features are added to the input data as shown in Fig. 2. In this way, each element is aligned with a unique marker that helps learning the mutual relations between the joints, and the difference between space and time is also well expressed.
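A possible sketch of the decoupled encoding in PyTorch, assuming an even channel count; `sinusoidal_encoding` and `add_decoupled_positions` are hypothetical names, and in the model the spatial/temporal encodings would be added before the corresponding attention modules:

```python
import torch

def sinusoidal_encoding(length, channels):
    """Eq. (5) encoding table of shape (length, channels); channels must be even."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # (length, 1)
    i = torch.arange(0, channels, 2, dtype=torch.float32)          # even dimensions
    freq = torch.pow(10000.0, i / channels)                        # base 10000 follows [31]
    pe = torch.zeros(length, channels)
    pe[:, 0::2] = torch.sin(pos / freq)
    pe[:, 1::2] = torch.cos(pos / freq)
    return pe

def add_decoupled_positions(x):
    """x: (N, T, C) features. The spatial code is shared over frames, the
    temporal code is shared over joints, matching the decoupling above."""
    n, t, c = x.shape
    spatial = sinusoidal_encoding(n, c).unsqueeze(1)    # (N, 1, C), broadcast over T
    temporal = sinusoidal_encoding(t, c).unsqueeze(0)   # (1, T, C), broadcast over N
    return x + spatial, x + temporal                    # inputs for the two modules
```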
Fig. 2.
Illustration of the attention module. We show the spatial attention module as an example; the temporal attention module is analogous. The purple rounded rectangle represents a single-head self-attention module. There are S self-attention modules in total, whose outputs are concatenated and fed into two linear layers to obtain the output. LReLU represents the leaky ReLU [18].

3.3 Spatial Global Regularization

As explained in Sec. 1, each joint has a specific meaning. Based on this prior knowledge, we propose to add a spatial global regularization to force the model to learn more general attentions for different samples. In detail, a global attention map (an N × N matrix) is added to the attention map (an N × N matrix) learned by the dot-product attention mechanism introduced in Sec. 3.1. The global attention map is shared across all data samples and represents a unified intrinsic relationship pattern of the human joints. We set it as a parameter of the network and optimize it together with the model. A coefficient α is multiplied to balance the strength of the spatial global regularization. This module is simple and lightweight, but it is effective, as shown in the ablation study. Note that the regularization is only added for the spatial attention computation because the temporal dimension has no such semantic alignment property; forcing a global regularization on the temporal attention is not reasonable and harms the performance.

3.4 Attention Module

Because the spatial attention module and the temporal attention module are analogous, we select the spatial module as an example for detailed introduction. The complete attention module is shown in Fig. 2. The procedures inside the purple rounded rectangle illustrate the process of the single-head attention calculation. The input X ∈ R^{N×TC_in} is first added with the spatial position encoding. Then it is embedded with two linear mapping functions into X ∈ R^{N×TC_e}. C_e is usually smaller than C_out to remove feature redundancy and reduce computation. The attention map is calculated with strategy (c) of Fig. 1 and added with the spatial global regularization. Note that we found Tanh works better than SoftMax when computing the attention map. We believe this is because the output of Tanh is not restricted to positive values, and can thus generate negative relations and provide more flexibility. Finally, the attention map is matrix-multiplied with the original input to get the output features.
Fig. 3.
Illustration of the overall architecture of the DSTA-Net. N, T and C denote the number of joints, frames and channels, respectively. The red rounded rectangle represents one spatial-temporal attention layer; there are L layers in total. The final output features are global-average-pooled (GAP) and fed into a fully-connected layer (FC) to make the prediction.
To allow the model to jointly attend to information from different representation sub-spaces, there are S heads in total for the attention calculation in the module. The results of all heads are concatenated and mapped to the output space R^{N×TC_out} with a linear layer. Similar to the transformer, a point-wise feed-forward layer is added at the end to obtain the final output. We use the leaky ReLU as the non-linear function. There are two residual connections in the module, as shown in Fig. 2, to stabilize the network training and integrate different features. Finally, all of the procedures inside the green rounded rectangle represent one whole attention module. A minimal sketch of such a module is given below.
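The following PyTorch sketch combines the pieces described above — strategy (c) attention via one big matrix multiplication, Tanh in place of SoftMax, and the learned global map of Sec. 3.3. Class and argument names are our own, and the residual connections and the feed-forward layer of Fig. 2 are omitted for brevity:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """One spatial attention module over x of shape (B, N, T*C_in):
    strategy (c) via a single matmul, Tanh attention, plus the learned
    global map (SGR) scaled by alpha."""

    def __init__(self, num_joints, in_dim, embed_dim, out_dim, num_heads=3, alpha=1.0):
        super().__init__()
        self.num_heads, self.alpha = num_heads, alpha
        self.query = nn.Linear(in_dim, embed_dim * num_heads)   # plays the role of sigma
        self.key = nn.Linear(in_dim, embed_dim * num_heads)     # plays the role of phi
        # Global attention map shared by all samples, one per head (Sec. 3.3).
        self.global_attn = nn.Parameter(torch.zeros(num_heads, num_joints, num_joints))
        self.out = nn.Linear(in_dim * num_heads, out_dim)

    def forward(self, x):                                        # x: (B, N, T*C_in)
        b, n, _ = x.shape
        q = self.query(x).view(b, n, self.num_heads, -1).transpose(1, 2)  # (B, H, N, E)
        k = self.key(x).view(b, n, self.num_heads, -1).transpose(1, 2)
        # Tanh instead of SoftMax: negative relations between joints are allowed.
        attn = torch.tanh(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5)    # (B, H, N, N)
        attn = attn + self.alpha * self.global_attn                       # SGR
        heads = attn @ x.unsqueeze(1)                                     # (B, H, N, T*C_in)
        return self.out(heads.transpose(1, 2).reshape(b, n, -1))          # (B, N, out_dim)
```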
3.5 Network Architecture

Fig. 3 shows the overall architecture of our method. The input is a skeleton sequence with N joints, T frames and C channels. In each layer, we first regard the input as an N × TC matrix, i.e., N elements with TC channels, and feed it into the spatial attention module (introduced in Fig. 2) to model the spatial relations between the joints. Then, we transpose the output matrix and regard it as T elements, each with NC channels, and feed it into the temporal attention module to model the temporal relations between the frames. There are L such layers stacked to update the features. The final output features are global-average-pooled and fed into a fully-connected layer to obtain the classification scores. A compact sketch of this pipeline follows.
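A compact sketch of the layer stack, reusing the `SpatialAttention` sketch above for both the spatial and temporal steps; all names are illustrative, and the channel widths follow Sec. 4.2:

```python
import torch
import torch.nn as nn

class DSTALayer(nn.Module):
    """One spatial-temporal layer (red box in Fig. 3): spatial attention over
    N elements with T*C channels, then temporal attention over T elements
    with N*C channels."""

    def __init__(self, n, t, c_in, c_out, embed_dim=16):
        super().__init__()
        self.n, self.t, self.c = n, t, c_out
        self.spatial = SpatialAttention(n, t * c_in, embed_dim, t * c_out)
        self.temporal = SpatialAttention(t, n * c_out, embed_dim, n * c_out)

    def forward(self, x):                                          # x: (B, N, T, C_in)
        b = x.size(0)
        x = self.spatial(x.reshape(b, self.n, -1))                 # (B, N, T*C_out)
        x = x.view(b, self.n, self.t, self.c).transpose(1, 2)      # (B, T, N, C_out)
        x = self.temporal(x.reshape(b, self.t, -1))                # (B, T, N*C_out)
        return x.view(b, self.t, self.n, self.c).transpose(1, 2)   # (B, N, T, C_out)

class DSTANet(nn.Module):
    def __init__(self, n=25, t=128, c=3, num_classes=60,
                 channels=(64, 64, 128, 128, 256, 256, 256, 256)):  # Sec. 4.2
        super().__init__()
        blocks, c_in = [], c
        for c_out in channels:
            blocks.append(DSTALayer(n, t, c_in, c_out))
            c_in = c_out
        self.blocks = nn.Sequential(*blocks)
        self.fc = nn.Linear(c_in, num_classes)

    def forward(self, x):                                          # x: (B, N, T, C)
        x = self.blocks(x)
        return self.fc(x.mean(dim=(1, 2)))                         # GAP over joints/frames, then FC
```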
3.6 Data Decoupling

The action can be decoupled into two dimensions, the spatial dimension and the temporal dimension, as illustrated in Fig. 4 (a, b and c). The spatial information is the difference between two different joints in the same frame, which mainly contains the relative position relationship between different joints. To reduce redundant information, we only calculate the spatial information along the human bones. The temporal information is the difference between two joints with the same spatial meaning in different frames, which mainly describes the motion trajectory of one joint along the temporal dimension. When we recognize gestures like "point with one finger" versus "point with two fingers", the spatial information is more important. However, when we recognize gestures like "waving up" versus "waving down", the temporal information is more essential.

Fig. 4. For simplicity, we draw two joints in two consecutive frames in a 2D coordinate system to illustrate the data decoupling. As shown in (a), P^i_{t_k} denotes joint i in frame t_k. Assume that joint i and joint j are the two end joints of one bone. (a) shows the raw data, i.e., the spatial-temporal information. The orange and blue dotted lines denote the decoupled spatial and temporal information, shown in (b) and (c), respectively. (d) illustrates the difference between the fast-temporal information (blue arrows) and the slow-temporal information (orange arrows).

Besides, for the temporal stream, different actions have different sensitivities to the motion scale. For some actions such as "clapping" versus "put two hands together", the short-term motion detail is essential. But for actions like "waving up" versus "waving down", the long-term movement trend is more important. Inspired by [8], we propose to calculate the temporal motion with both high-frame-rate sampling and low-frame-rate sampling, as shown in Fig. 4 (d). The two generated streams are called the fast-temporal stream and the slow-temporal stream, respectively.

Finally, we have four streams in total, namely, the spatial-temporal stream (original data), the spatial stream, the fast-temporal stream and the slow-temporal stream. We separately train four models with the same architecture, one per stream, and the classification scores are averaged to obtain the final result. A sketch of the decoupling is given below.
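A sketch of the four-stream decoupling under our reading of Sec. 3.6; the bone list and function name are hypothetical, and the exact sampling scheme for the two temporal streams is an assumption:

```python
import torch

# Hypothetical bone list: (child, parent) joint-index pairs of the skeleton.
BONES = [(1, 0), (2, 1), (3, 2)]  # ...extend with one pair per bone

def decouple_streams(x, bones=BONES, interval=2):
    """x: (N, T, C) joint coordinates (the spatial-temporal stream).
    Returns the four streams of Sec. 3.6; the two temporal streams differ
    only in the frame interval used to compute motion (an assumption)."""
    spatial = torch.zeros_like(x)
    for child, parent in bones:                     # differences along the bones
        spatial[child] = x[child] - x[parent]
    temporal_1 = torch.zeros_like(x)
    temporal_1[:, :-1] = x[:, 1:] - x[:, :-1]       # motion between consecutive frames
    temporal_2 = torch.zeros_like(x)
    temporal_2[:, :-interval] = x[:, interval:] - x[:, :-interval]  # coarser-scale motion
    return x, spatial, temporal_1, temporal_2       # each is trained with its own model

# The four per-stream classification scores are averaged at test time.
```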
4 Experiments

To verify the generalizability of the model, we use two datasets for hand gesture recognition (DHG [6] and SHREC [7]) and two datasets for human action recognition (NTU-60 [23] and NTU-120 [14]). We first perform exhaustive ablation studies on SHREC to verify the effectiveness of the proposed model components. Then, we evaluate our model on all four datasets to compare with the state-of-the-art methods.

4.1 Datasets

DHG:
The DHG [6] dataset contains 2800 video sequences of 14 hand gestures, each performed 5 times by 20 subjects. The gestures are performed in two ways: using one finger and using the whole hand. It therefore has two benchmarks: 14 gestures for coarse classification and 28 gestures for fine-grained classification. The 3D coordinates of 22 hand joints in real-world space are captured by an Intel RealSense camera. The dataset uses the leave-one-subject-out cross-validation strategy for evaluation.
SHREC:
The SHREC [7] dataset contains 2800 gesture sequences performed 1 to 10 times by 28 participants, in two ways like the DHG dataset. It splits the sequences into 1960 training sequences and 840 test sequences. The length of the sample gestures ranges from 20 to 50 frames. This dataset was used for the SHREC'17 competition held in conjunction with the Eurographics 3DOR'2017 Workshop.
NTU-60:
NTU-60 [23] is the most widely used indoor-captured action recognition dataset, containing 56,000 action clips in 60 action classes. The clips are performed by 40 volunteers and captured by 3 Kinect V2 cameras from different views. The dataset provides 25 joints for each subject in the skeleton sequences. It recommends two benchmarks: cross-subject (CS) and cross-view (CV), where the subjects and the cameras used in the training/test splits are different, respectively.
NTU-120:
NTU-120 [14] is similar to NTU-60 but larger. It contains 114,480 action clips in 120 action classes. The clips are performed by 106 volunteers in 32 camera setups. It recommends two benchmarks: cross-subject (CS) and cross-setup (CE), where cross-setup means using samples with odd setup IDs for training and the others for testing.
4.2 Training Details

To show the generalizability of our method, we use the same configuration for all experiments. The network is stacked from 8 DSTA blocks with 3 heads each. The output channels are 64, 64, 128, 128, 256, 256, 256 and 256, respectively. The input video is randomly/uniformly sampled to 150 frames and then randomly/centrally cropped to 128 frames for the training/test splits. For the fast-temporal features, the sampling interval is 2. When training, the initial learning rate is 0.1 and is divided by 10 at epochs 60 and 90; training ends at epoch 120. The batch size is 32 and weight decay is applied. We use SGD with momentum (0.9) as the optimizer and cross-entropy as the loss function. A sketch of this setup is given below.
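A minimal PyTorch sketch of the reported training setup; the weight decay value is unclear in the text and is therefore omitted, and `DSTANet` refers to the sketch in Sec. 3.5 above:

```python
from torch import nn, optim

model = DSTANet()                               # sketch from Sec. 3.5
criterion = nn.CrossEntropyLoss()
# Weight decay is used in the paper; its exact value is unclear in the text.
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Divide the learning rate by 10 at epochs 60 and 90; train for 120 epochs.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 90], gamma=0.1)

for epoch in range(120):
    # ...iterate mini-batches of size 32: forward, criterion, backward, step...
    scheduler.step()
```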
4.3 Ablation Studies

In this section, we investigate the effectiveness of the proposed components of the network and of the different data modalities. We conduct experiments on the SHREC dataset. Except for the component under investigation, all other settings are kept the same for fair comparison.
Network architectures
We first investigate the effect of the position encoding. As shown in Tab. 1, removing the position encoding seriously harms the performance. Decoupling the spatial and temporal dimensions (DPE) is better than not doing so (UPE). This is because the spatial and temporal dimensions have different properties, and treating them equivalently confuses the model.

Then we investigate the effect of the proposed spatial global regularization (SGR). By adding the SGR, the performance is improved from 94.5% to 96.3%, but if we meanwhile regularize the temporal dimension, the performance drops. This is reasonable since there are no fixed semantic meanings along the temporal dimension, and forcing the learning of a unified pattern causes a gap between the training set and the testing set.

Finally, we compare the three strategies introduced in Fig. 1. Strategy (a) obtains the lowest performance. We conjecture that this is because it only considers the intra-frame relations and ignores the inter-frame relations. Modeling the inter-frame relations exhaustively (strategy b) improves the performance, and the compromise (strategy c) obtains the best performance, possibly because the compromise strategy can somewhat reduce the overfitting problem.
Table 1.
Ablation studies on the model architecture on the SHREC dataset. ST-Att-c denotes the spatial-temporal attention network with attention strategy (c) introduced in Fig. 1. PE denotes position encoding. UPE/DPE denote unified/decoupled position encoding for the spatial and temporal dimensions. SGR/STGR denote spatial / spatial-temporal global regularization for computing the attention maps.

Method                  Accuracy (%)
ST-Att-c w/o PE         89.4
ST-Att-c + UPE          93.2
ST-Att-c + DPE          94.5
ST-Att-c + DPE + SGR    96.3
ST-Att-c + DPE + STGR   94.6
ST-Att-a + DPE + SGR    94.6
ST-Att-b + DPE + SGR    95.1
We show the learned attention maps of different layers (layer 1 and layer 8) in Fig. 5.

Fig. 5. Examples of the learned attention maps for different layers. T, I, M, R and L denote the thumb, index finger, middle finger, ring finger and little finger, respectively. As for the articulations, T1 denotes the base of the thumb and T4 denotes the tip of the thumb.
Data decoupling
To show the necessity of decoupling the raw data into four streams as introduced in Sec. 3.6, we report the results of using the four streams separately and the result of their fusion in Tab. 2. The accuracies of the decoupled streams are not as good as that of the raw data because some information is lost. However, since the four streams focus on different aspects and are complementary to each other, fusing them together improves the performance significantly.
Table 2.
Ablation studies on feature fusion on the SHREC dataset. Spatial-temporal denotes the raw data, i.e., the joint coordinates. The other types of features are introduced in Sec. 3.6.

Method            Accuracy (%)
spatial-temporal  96.3
spatial           95.1
fast-temporal     94.5
slow-temporal     93.7
Fusion            97.0
As shown in Fig. 6, we plot the per-class accuracies of the four streams to show their complementarity clearly. We also plot the differences in accuracy between streams, represented as the dotted lines. For spatial versus temporal information, the orange dotted line shows that the network with spatial information obtains higher accuracies mainly on classes that are closely related to shape changes, such as "grab", "expand" and "pinch", while the network with temporal information obtains higher accuracies mainly on classes that are closely related to positional changes, such as "swipe", "rot" and "shake". As for the different frame-rate sampling, the red dotted line shows that the slow-temporal stream performs better for classes such as "expand" and "tap", while the fast-temporal stream performs better for classes such as "swipe" and "rot". These phenomena verify the complementarity of the four modalities.

Fig. 6. Per-class accuracies for the different modalities on the SHREC-14 dataset. The dotted lines show the differences between two modalities.
Table 3.
Recognition accuracy comparison of our method and state-of-the-art methods on the SHREC and DHG datasets.

Method               Year  SHREC 14  SHREC 28  DHG 14  DHG 28
ST-GCN [33]          2018  92.7      87.7      91.2    87.1
STA-Res-TCN [10]     2018  93.6      90.7      89.2    85.0
ST-TS-HGR-NET [19]   2019  94.3      89.4      87.3    83.4
DG-STA [4]           2019  94.4      90.7      91.9    88.0
DSTA-Net (ours)      -     97.0      93.9      93.8    90.9

4.4 Comparison with the State-of-the-Art

We compare our model with state-of-the-art methods for skeleton-based action recognition on all four datasets, where our model significantly outperforms the other methods. Due to the space restriction, we only show some representative works; more comparisons are shown in the supplementary material. On the SHREC/DHG datasets for skeleton-based hand gesture recognition (Tab. 3), our model brings 2.6%/1.9% and 3.2%/2.9% improvements on the 14-gesture and 28-gesture benchmarks compared with the state-of-the-art. Note that the state-of-the-art accuracies are already very high (94.4%/91.9% and 90.7%/88.0% for the 14-gesture and 28-gesture benchmarks, respectively).
Table 4.
Recognition accuracy comparison of our method and state-of-the-art methods on the NTU-60 dataset. CS and CV denote the cross-subject and cross-view benchmarks, respectively.

Method            Year  CS (%)  CV (%)
ST-GCN [33]       2018  81.5    88.3
SRN+TSL [29]      2018  84.8    92.4
2s-AGCN [26]      2019  88.5    95.1
DGNN [25]         2019  89.9    96.1
NAS [20]          2020  89.4    95.7
DSTA-Net (ours)   -     91.5    96.4

Table 5.
Recognition accuracy comparison of our method and state-of-the-art methods on the NTU-120 dataset. CS and CE denote the cross-subject and cross-setup benchmarks, respectively.

Method                           Year  CS (%)  CE (%)
Two-Stream Attention LSTM [15]   2017  61.2    63.3
Body Pose Evolution Map [17]     2018  64.6    66.9
SkeleMotion [1]                  2019  67.7    66.9
DSTA-Net (ours)                  -     86.6    89.0

On the NTU-60 dataset (Tab. 4), our model brings 1.6% and 0.3% improvements. The performance on the CV benchmark is nearly saturated. For both the CS and CV benchmarks, we visualize the wrongly classified examples and find that many of them are nearly impossible even for a human to recognize using only the skeletal data. For example, for the two classes of reading and writing, the subject is in the same posture (standing or sitting) and holding a book in both; the only difference is whether there is a pen in the hand, which cannot be captured from the skeletal data. On the NTU-120 dataset (Tab. 5), our model also achieves state-of-the-art performance. Since this dataset was released recently, our method can provide a new baseline on it.
5 Conclusion

In this paper, we propose a novel decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition. It is a unified framework based solely on the attention mechanism, with no need to design hand-crafted traversal rules or graph topologies. We propose three techniques for building the DSTA-Net to meet the specific requirements of skeletal data, including spatial-temporal attention decoupling, decoupled position encoding and spatial global regularization. Besides, we introduce a skeleton-decoupling method to emphasize the spatial/temporal variations and motion scales of the skeletal data, resulting in a more comprehensive understanding of human actions and gestures. To verify the effectiveness and generalizability of the DSTA-Net, extensive experiments are conducted on four large datasets for both gesture and action recognition, where the DSTA-Net achieves state-of-the-art performance on all of them.
References
1. Caetano, C., Sena, J., Brémond, F., Dos Santos, J.A., Schwartz, W.R.: SkeleMotion: A new representation of skeleton joint sequences based on motion information for 3D action recognition. In: 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). pp. 1–8 (Sep 2019)
2. Cao, C., Lan, C., Zhang, Y., Zeng, W., Lu, H., Zhang, Y.: Skeleton-based action recognition with gated convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology pp. 3247–3257 (2018)
3. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6299–6308 (Jul 2017)
4. Chen, Y., Zhao, L., Peng, X., Yuan, J., Metaxas, D.N.: Construct dynamic graphs for hand gesture recognition via spatial-temporal attention. In: BMVC (2019)
5. Dai, Z., Yang, Z., Yang, Y., Cohen, W.W., Carbonell, J., Le, Q.V., Salakhutdinov, R.: Transformer-XL: Attentive language models beyond a fixed-length context. arXiv:1901.02860 (2019)
6. De Smedt, Q., Wannous, H., Vandeborre, J.P.: Skeleton-based dynamic hand gesture recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 1206–1214 (Jun 2016)
7. De Smedt, Q., Wannous, H., Vandeborre, J.P., Guerry, J., Le Saux, B., Filliat, D.: SHREC'17 track: 3D hand gesture recognition using a depth and skeletal dataset. In: Pratikakis, I., Dupont, F., Ovsjanikov, M. (eds.) Eurographics Workshop on 3D Object Retrieval. pp. 1–6 (2017)
8. Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6202–6211 (2019)
9. Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H.: Dual attention network for scene segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3146–3154 (2019)
10. Hou, J., Wang, G., Chen, X., Xue, J.H., Zhu, R., Yang, H.: Spatial-temporal attention res-TCN for skeleton-based dynamic hand gesture recognition. In: The European Conference on Computer Vision (ECCV) Workshops (2018)
11. Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2018)
12. Li, C., Zhong, Q., Xie, D., Pu, S.: Skeleton-based action recognition with convolutional neural networks. In: IEEE International Conference on Multimedia & Expo Workshops (ICMEW). pp. 597–600 (2017)
13. Li, S., Li, W., Cook, C., Zhu, C., Gao, Y.: Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5457–5466 (2018)
14. Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.Y., Kot, A.C.: NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 1–1 (2019)
15. Liu, J., Wang, G., Duan, L.Y., Abdiyeva, K., Kot, A.C.: Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Transactions on Image Processing (4), 1586–1599 (2017)
16. Liu, M., Liu, H., Chen, C.: Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition 68