Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action Recognition
Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu
NLPR & AIRIA, Institute of Automation, Chinese Academy of Sciences
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
CAS Center for Excellence in Brain Science and Intelligence Technology
Abstract.
Dynamic skeletal data, represented as the 2D/3D coordinates of human joints, has been widely studied for human action recognition due to its high-level semantic information and environmental robustness. However, previous methods heavily rely on designing hand-crafted traversal rules or graph topologies to draw dependencies between the joints, which limits their performance and generalizability. In this work, we present a novel decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition. It involves solely the attention blocks, allowing for modeling spatial-temporal dependencies between joints without the requirement of knowing their positions or mutual connections. Specifically, to meet the specific requirements of the skeletal data, three techniques are proposed for building attention blocks, namely, spatial-temporal attention decoupling, decoupled position encoding and spatial global regularization. Besides, from the data aspect, we introduce a skeletal data decoupling technique to emphasize the specific characteristics of space/time and different motion scales, resulting in a more comprehensive understanding of human actions. To test the effectiveness of the proposed method, extensive experiments are conducted on four challenging datasets for skeleton-based gesture and action recognition, namely, SHREC, DHG, NTU-60 and NTU-120, where DSTA-Net achieves state-of-the-art performance on all of them.
Keywords:
Skeleton, Action Recognition, Attention.
1 Introduction

Human action recognition has been studied for decades since it can be widely used for many applications such as human-computer interaction and abnormal behavior monitoring [3,27,8,24]. Recently, skeletal data has drawn increasingly more attention because it contains higher-level semantic information in a small amount of data and has strong adaptability to dynamic circumstances [33,26,25]. The raw skeletal data is a sequence of frames, each of which contains a set of points. Each point represents a joint of the human body in the form of 2D/3D coordinates. Previous data-driven methods for skeleton-based action recognition rely on manual designs of traversal rules or graph topologies to transform the raw skeletal data into a meaningful form such as a point sequence, a pseudo-image or a graph, so that it can be fed into deep networks such as RNNs, CNNs and GCNs for feature extraction [33,21,34]. However, there is no guarantee that the hand-crafted rule is the optimal choice for modeling global dependencies of joints, which limits the performance and generalizability of previous approaches.

Recently, the transformer [31,5] has achieved great success in the NLP field, whose basic block is the self-attention mechanism. It can learn the global dependencies between the input elements with less computational complexity and better parallelizability. For skeletal data, employing the self-attention mechanism has an additional advantage: there is no requirement of knowing the intrinsic relations between the elements, thus it provides more flexibility for discovering useful patterns. Besides, since the number of joints of the human body is limited, the extra cost of applying the self-attention mechanism is relatively small.

Inspired by the above observations, we propose a novel decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition. It is based solely on the self-attention mechanism, without using the structure-relevant RNNs, CNNs or GCNs. However, it is not straightforward to apply a pure attention network to skeletal data, as shown in the following three aspects: (1) The input of the original self-attention mechanism is sequential data, while the skeletal data exists in both the spatial and temporal dimensions. A naive method is simply flattening the spatial-temporal data into a single sequence like [32]. However, it is not reasonable to treat time and space equivalently because they contain totally different semantics [8]. Besides, the simple flattening operation increases the sequence length, which greatly increases the computation cost due to the dot-product operation of the self-attention mechanism. Instead, we propose to decouple the self-attention mechanism into the spatial attention and the temporal attention, applied sequentially. Three strategies are specially designed to balance the independence and the interaction between space and time. (2) There are no predefined orders or structures when feeding the skeletal joints into the attention networks. To provide unique markers for every joint, a position encoding technique is introduced. For the same reason as before, it is also decoupled into the spatial encoding and the temporal encoding. (3) It has been verified that adding proper regularization based on prior knowledge can effectively reduce the over-fitting problem and improve the model generalizability. For example, due to the translation-invariant structure of images, CNNs exploit the local-weight-sharing mechanism to force the model to learn more general filters for different regions of images. As for skeletal data, each joint of the skeleton has a specific physical/semantic meaning (e.g., head or hand), which is fixed for all the frames and is consistent for all the data samples. Based on this prior knowledge, a spatial global regularization is proposed to force the model to learn more general attentions for different samples. Note that the regularization is not suitable for the temporal dimension because there is no such semantic alignment property.

Besides, from the data aspect, the most discriminative pattern is distinct for different actions. We claim that two properties should be considered. One property is whether the action is motion-relevant or motion-irrelevant, which aims
to choose the specific characteristics of space and time. For example, to classify the gestures of "waving up" versus "waving down", the global trajectory of the hand is more important than the hand shape, but when recognizing gestures like "point with one finger" versus "point with two fingers", the spatial pattern is more important than the hand motion. Based on this observation, we propose to decouple the data into the spatial and temporal dimensions, where the spatial stream contains only the motion-irrelevant features and the temporal stream contains only the motion-relevant features. By modeling these two streams separately, the model can better focus on spatial/temporal features and identify specific patterns. Finally, by fusing the two streams, it can obtain a more comprehensive understanding of the human actions. Another property is the sensitivity to motion scales. For the temporal stream, the classification of some actions may rely on the motion mode of a few consecutive frames, while others may rely on the overall movement trend. For example, to classify the gestures of "clapping" versus "put two hands together", the short-term motion detail is essential. But for "waving up" versus "waving down", the long-term motion trend is more important. Thus, inspired by [8], we split the temporal information into a fast stream and a slow stream based on the sampling rate. The low-frame-rate stream can capture more global motion information and the high-frame-rate stream can focus more on the detailed movements. Similarly, the two streams are fused to improve the recognition performance.

We conduct extensive experiments on four datasets, including two hand gesture recognition datasets, i.e., SHREC and DHG, and two human action recognition datasets, i.e., NTU-60 and NTU-120. Without the need of hand-crafted traversal rules or graph topologies, our method achieves state-of-the-art performance on all these datasets, which demonstrates the effectiveness and generalizability of the proposed method.

Overall, our contributions lie in four aspects:

1. To the best of our knowledge, we are the first to propose a decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition, which is built with pure attention modules without manual designs of traversal rules or graph topologies.
2. We propose three effective techniques for building attention networks to meet the specific requirements of skeletal data, namely, spatial-temporal attention decoupling, decoupled position encoding and spatial global regularization.
3. We propose to decouple the data into four streams, namely, the spatial-temporal stream, spatial stream, slow-temporal stream and fast-temporal stream, each focusing on a specific aspect of the skeleton sequence. By fusing the different types of features, the model can have a more comprehensive understanding of human actions.
4. On four challenging datasets for action recognition, our method achieves state-of-the-art performance with a significant margin. DSTA-Net outperforms the state-of-the-art by 2.6%/3.2% and 1.9%/2.9% on the 14-class/28-class benchmarks of SHREC and DHG, respectively. It achieves 91.5%/96.4% on the CS/CV benchmarks of NTU-60 and 86.6%/89.0% on the CS/CE benchmarks of NTU-120. The code will be released for future works.
2 Related Work

Skeleton-based action recognition has been widely studied for decades. The mainstream methods lie in three branches: (1) RNN-based methods that formulate the skeletal data as sequential data with predefined traversal rules and feed it into RNN-based models such as the LSTM [34,13,29,28]; (2) CNN-based methods that convert the input skeletons into a pseudo-image with hand-crafted transformation rules and model it with various successful networks from the image classification field [12,16,2]; (3) GCN-based methods that encode the skeletal data into a predefined spatial-temporal graph and model it with graph convolutional networks [33,30,26]. In this work, instead of formulating the skeletal data as images or graphs, we directly model the dependencies of the joints with pure attention blocks. Our model is more concise and general, without the need to design hand-crafted transformation rules, and it outperforms previous methods with a significant margin.
The self-attention mechanism is the basic block of the transformer [31,5], which is the mainstream method in the NLP field. Its input consists of a set of queries Q, keys K of dimension C, and values V, which are packaged in matrix form for fast computation. It first computes the dot products of the query with all keys, divides each by √C, and applies a softmax function to obtain the weights on the values [31]. In formulation:

Attention(Q, K, V) = softmax(QK^T / √C) V    (1)

A similar idea has also been used for many computer vision tasks such as relation modeling [22], detection [11] and semantic segmentation [9]. To the best of our knowledge, we are the first to apply pure attention networks to skeletal data, and we further propose several improvements to meet the specific requirements of skeletons. A minimal sketch of Eq. (1) is given below.
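To make Eq. (1) concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch; the function name and shapes are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Eq. (1): softmax(Q K^T / sqrt(C)) V, with Q, K, V of shape (N, C)."""
    c = q.size(-1)
    attn = F.softmax(q @ k.transpose(-2, -1) / c ** 0.5, dim=-1)  # (N, N) attention map
    return attn @ v                                               # weighted sum of the values

# Toy self-attention over 25 joints with 64-channel features (Q = K = V = X).
x = torch.randn(25, 64)
out = scaled_dot_product_attention(x, x, x)  # (25, 64)
```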
3 Method

In this section, we first propose three techniques for building attention blocks in Sec. 3.1, Sec. 3.2 and Sec. 3.3. Then, the basic attention module and the detailed architecture of the network are introduced in Sec. 3.4 and Sec. 3.5. Finally, the data decoupling and the overall multi-stream solution are described in Sec. 3.6.

3.1 Spatial-Temporal Attention Decoupling

The original transformer is fed with sequential data, i.e., a matrix X ∈ R^{N×C}, where N denotes the number of elements and C denotes the number of channels.

Fig. 1. Illustration of the three decoupling strategies. We use the spatial attention strategy as an example; the temporal attention strategy is analogous. N and T denote the number of joints and frames, respectively.
For dynamic skeletal data, the input is a third-order tensor X ∈ R^{N×T×C}, where T denotes the number of frames. It is worth investigating how to deal with the relationship between time and space. Wang et al. [32] propose to ignore the difference between time and space and regard the input as sequential data X ∈ R^{N̂×C}, where N̂ = NT. However, the temporal dimension and the spatial dimension are totally different, as introduced in Sec. 1, and it is not reasonable to treat them equivalently. Besides, the computational complexity of calculating the attention map in this strategy is O(T^2 N^2 C) (using the naive matrix multiplication algorithm), which is too large. Instead, we propose to decouple the spatial and temporal dimensions, which largely reduces the computational complexity and improves the performance.

We design three strategies for decoupling, as shown in Fig. 1. Using the spatial attention as an example, the first strategy (Fig. 1, a) calculates the attention maps frame by frame, and each frame uses a unique attention map:

A_t = softmax(σ(X_t) φ(X_t)′)    (2)

where A_t ∈ R^{N×N} is the attention map for frame t, X_t ∈ R^{N×C}, σ and φ are two embedding functions, and ′ denotes the matrix transpose. This strategy only considers the dependencies of joints within a single frame and thus lacks modeling capacity. The computational complexity of calculating the spatial attention with this strategy is O(T N^2 C). For the temporal attention, the attention map of joint n is A_n ∈ R^{T×T} and the input data is X_n ∈ R^{T×C}; its calculation is analogous to the spatial attention. Considering both the spatial and the temporal attention, the computational complexity of the first strategy for all frames is O(T N^2 C + N T^2 C).

The second strategy (Fig. 1, b) calculates the relations of two joints between all of the frames, which means both the intra-frame relations and the inter-frame relations of two joints are taken into account simultaneously. The attention map is shared over all frames. In formulation:

A = softmax(Σ_{t=1}^{T} Σ_{τ=1}^{T} σ(X_t) φ(X_τ)′)    (3)
The computational complexity of this strategy is O(T^2 N^2 C + N^2 T^2 C).

The third strategy (Fig. 1, c) is a compromise, where only the joints in the same frame are considered when calculating the attention map, but the obtained attention maps of all frames are averaged and shared. It is equivalent to adding a time-consistency restriction on the attention computation, which can somewhat reduce the overfitting problem caused by the element-wise relation modeling of the second strategy:

A = softmax(Σ_{t=1}^{T} σ(X_t) φ(X_t)′)    (4)

By concatenating the frames into an N × TC matrix, the summation of matrix multiplications can be efficiently implemented with one big matrix multiplication, as illustrated by the snippet below. The computational complexity of this strategy is O(T N^2 C + N T^2 C). As shown in the ablation study (Sec. 4.3), we finally use strategy (c) in the model.
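A small sketch verifying this equivalence, with the embeddings σ and φ omitted for brevity; all names and sizes are illustrative:

```python
import torch

# Strategy (c): summing per-frame N x N dot products equals one matmul on
# the frames concatenated into an N x (T*C) matrix.
N, T, C = 22, 8, 16
x = torch.randn(T, N, C)

per_frame = sum(x[t] @ x[t].transpose(0, 1) for t in range(T))   # sum_t X_t X_t'
concat = x.permute(1, 0, 2).reshape(N, T * C)                    # N x (T*C)
one_matmul = concat @ concat.transpose(0, 1)

print(torch.allclose(per_frame, one_matmul, atol=1e-5))          # True
```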
3.2 Decoupled Position Encoding

The skeletal joints are organized as a tensor to be fed into the neural networks. Because there are no predefined orders or structures for each element of the tensor to show its identity (e.g., the joint index or frame index), we need a position encoding module to provide unique markers for every joint. Following [31], we use the sine and cosine functions with different frequencies as the encoding functions:

PE(p, 2i) = sin(p / 10000^{2i/C_in})
PE(p, 2i+1) = cos(p / 10000^{2i/C_in})    (5)

where p denotes the position of the element and i indexes the dimensions of the position encoding vector. However, different from [31], the input skeletal data has two dimensions, i.e., space and time. One strategy for position encoding is unifying the spatial and temporal dimensions and encoding them sequentially. For example, assuming there are three joints, the positions of the joints in the first frame are 1, 2, 3, and in the second frame they are 4, 5, 6. This strategy cannot well distinguish the same joint in different frames. Another strategy is decoupling the process into a spatial position encoding and a temporal position encoding. Using the spatial position encoding as an example, the joints in the same frame are encoded sequentially and the same joint in different frames has the same encoding. In the above example, this means the positions in the first frame are 1, 2, 3, and in the second frame they are also 1, 2, 3. The temporal position encoding is reversed and analogous, which means the joints in the same frame have the same encoding and the same joint in different frames is encoded sequentially. Finally, the position features are added to the input data as shown in Fig. 2. In this way, each element is aligned with a unique marker that helps learning the mutual relations between the joints, and the difference between space and time is also well expressed.
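A possible sketch of the decoupled encoding in PyTorch, assuming an even channel count; `sinusoidal_encoding` and `add_decoupled_positions` are hypothetical names, and in the model the spatial/temporal encodings would be added before the corresponding attention modules:

```python
import torch

def sinusoidal_encoding(length, channels):
    """Eq. (5) encoding table of shape (length, channels); channels must be even."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # (length, 1)
    i = torch.arange(0, channels, 2, dtype=torch.float32)          # even dimensions
    freq = torch.pow(10000.0, i / channels)                        # base 10000 follows [31]
    pe = torch.zeros(length, channels)
    pe[:, 0::2] = torch.sin(pos / freq)
    pe[:, 1::2] = torch.cos(pos / freq)
    return pe

def add_decoupled_positions(x):
    """x: (N, T, C) features. The spatial code is shared over frames, the
    temporal code is shared over joints, matching the decoupling above."""
    n, t, c = x.shape
    spatial = sinusoidal_encoding(n, c).unsqueeze(1)    # (N, 1, C), broadcast over T
    temporal = sinusoidal_encoding(t, c).unsqueeze(0)   # (1, T, C), broadcast over N
    return x + spatial, x + temporal                    # inputs for the two modules
```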
Fig. 2.
Illustration of the attention module. We show the spatial attention module as an example; the temporal attention module is analogous. The purple rounded rectangle represents a single-head self-attention module. There are S self-attention modules in total, whose outputs are concatenated and fed into two linear layers to obtain the output. LReLU represents the leaky ReLU [18].

3.3 Spatial Global Regularization

As explained in Sec. 1, each joint has a specific meaning. Based on this prior knowledge, we propose to add a spatial global regularization to force the model to learn more general attentions for different samples. In detail, a global attention map (an N × N matrix) is added to the attention map (an N × N matrix) learned by the dot-product attention mechanism introduced in Sec. 3.1. The global attention map is shared across all data samples and represents a unified intrinsic relationship pattern of the human joints. We set it as a parameter of the network and optimize it together with the model. A coefficient α is multiplied to balance the strength of the spatial global regularization. This module is simple and lightweight, but it is effective, as shown in the ablation study. Note that the regularization is only added for the spatial attention computation because the temporal dimension has no such semantic alignment property; forcing a global regularization on the temporal attention is not reasonable and harms the performance.

3.4 Attention Module

Because the spatial attention module and the temporal attention module are analogous, we select the spatial module as an example for detailed introduction. The complete attention module is shown in Fig. 2. The procedures inside the purple rounded rectangle illustrate the process of the single-head attention calculation. The input X ∈ R^{N×TC_in} is first added with the spatial position encoding. Then it is embedded with two linear mapping functions into X ∈ R^{N×TC_e}. C_e is usually smaller than C_out to remove feature redundancy and reduce computation. The attention map is calculated with strategy (c) of Fig. 1 and added with the spatial global regularization. Note that we found Tanh works better than SoftMax when computing the attention map. We believe this is because the output of Tanh is not restricted to positive values, and can thus generate negative relations and provide more flexibility. Finally, the attention map is matrix-multiplied with the original input to get the output features.
Fig. 3.
Illustration of the overall architecture of the DSTA-Net. N, T and C denote the number of joints, frames and channels, respectively. The red rounded rectangle represents one spatial-temporal attention layer; there are L layers in total. The final output features are global-average-pooled (GAP) and fed into a fully-connected layer (FC) to make the prediction.
To allow the model to jointly attend to information from different representation sub-spaces, there are S heads in total for the attention calculation in the module. The results of all heads are concatenated and mapped to the output space R^{N×TC_out} with a linear layer. Similar to the transformer, a point-wise feed-forward layer is added at the end to obtain the final output. We use the leaky ReLU as the non-linear function. There are two residual connections in the module, as shown in Fig. 2, to stabilize the network training and integrate different features. Finally, all of the procedures inside the green rounded rectangle represent one whole attention module. A minimal sketch of such a module is given below.
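The following PyTorch sketch combines the pieces described above — strategy (c) attention via one big matrix multiplication, Tanh in place of SoftMax, and the learned global map of Sec. 3.3. Class and argument names are our own, and the residual connections and the feed-forward layer of Fig. 2 are omitted for brevity:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """One spatial attention module over x of shape (B, N, T*C_in):
    strategy (c) via a single matmul, Tanh attention, plus the learned
    global map (SGR) scaled by alpha."""

    def __init__(self, num_joints, in_dim, embed_dim, out_dim, num_heads=3, alpha=1.0):
        super().__init__()
        self.num_heads, self.alpha = num_heads, alpha
        self.query = nn.Linear(in_dim, embed_dim * num_heads)   # plays the role of sigma
        self.key = nn.Linear(in_dim, embed_dim * num_heads)     # plays the role of phi
        # Global attention map shared by all samples, one per head (Sec. 3.3).
        self.global_attn = nn.Parameter(torch.zeros(num_heads, num_joints, num_joints))
        self.out = nn.Linear(in_dim * num_heads, out_dim)

    def forward(self, x):                                        # x: (B, N, T*C_in)
        b, n, _ = x.shape
        q = self.query(x).view(b, n, self.num_heads, -1).transpose(1, 2)  # (B, H, N, E)
        k = self.key(x).view(b, n, self.num_heads, -1).transpose(1, 2)
        # Tanh instead of SoftMax: negative relations between joints are allowed.
        attn = torch.tanh(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5)    # (B, H, N, N)
        attn = attn + self.alpha * self.global_attn                       # SGR
        heads = attn @ x.unsqueeze(1)                                     # (B, H, N, T*C_in)
        return self.out(heads.transpose(1, 2).reshape(b, n, -1))          # (B, N, out_dim)
```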
3.5 Network Architecture

Fig. 3 shows the overall architecture of our method. The input is a skeleton sequence with N joints, T frames and C channels. In each layer, we first regard the input as an N × TC matrix, i.e., N elements with TC channels, and feed it into the spatial attention module (introduced in Fig. 2) to model the spatial relations between the joints. Then, we transpose the output matrix and regard it as T elements, each with NC channels, and feed it into the temporal attention module to model the temporal relations between the frames. There are L such layers stacked to update the features. The final output features are global-average-pooled and fed into a fully-connected layer to obtain the classification scores. A compact sketch of this pipeline follows.
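A compact sketch of the layer stack, reusing the `SpatialAttention` sketch above for both the spatial and temporal steps; all names are illustrative, and the channel widths follow Sec. 4.2:

```python
import torch
import torch.nn as nn

class DSTALayer(nn.Module):
    """One spatial-temporal layer (red box in Fig. 3): spatial attention over
    N elements with T*C channels, then temporal attention over T elements
    with N*C channels."""

    def __init__(self, n, t, c_in, c_out, embed_dim=16):
        super().__init__()
        self.n, self.t, self.c = n, t, c_out
        self.spatial = SpatialAttention(n, t * c_in, embed_dim, t * c_out)
        self.temporal = SpatialAttention(t, n * c_out, embed_dim, n * c_out)

    def forward(self, x):                                          # x: (B, N, T, C_in)
        b = x.size(0)
        x = self.spatial(x.reshape(b, self.n, -1))                 # (B, N, T*C_out)
        x = x.view(b, self.n, self.t, self.c).transpose(1, 2)      # (B, T, N, C_out)
        x = self.temporal(x.reshape(b, self.t, -1))                # (B, T, N*C_out)
        return x.view(b, self.t, self.n, self.c).transpose(1, 2)   # (B, N, T, C_out)

class DSTANet(nn.Module):
    def __init__(self, n=25, t=128, c=3, num_classes=60,
                 channels=(64, 64, 128, 128, 256, 256, 256, 256)):  # Sec. 4.2
        super().__init__()
        blocks, c_in = [], c
        for c_out in channels:
            blocks.append(DSTALayer(n, t, c_in, c_out))
            c_in = c_out
        self.blocks = nn.Sequential(*blocks)
        self.fc = nn.Linear(c_in, num_classes)

    def forward(self, x):                                          # x: (B, N, T, C)
        x = self.blocks(x)
        return self.fc(x.mean(dim=(1, 2)))                         # GAP over joints/frames, then FC
```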
3.6 Data Decoupling

The action can be decoupled into two dimensions, the spatial dimension and the temporal dimension, as illustrated in Fig. 4 (a, b and c). The spatial information is the difference between two different joints in the same frame, which mainly contains the relative position relationship between different joints. To reduce redundant information, we only calculate the spatial information along the human bones. The temporal information is the difference between two joints with the same spatial meaning in different frames, which mainly describes the motion trajectory of one joint along the temporal dimension. When we recognize gestures like "point with one finger" versus "point with two fingers", the spatial information is more important. However, when we recognize gestures like "waving up" versus "waving down", the temporal information is more essential.

Fig. 4. For simplicity, we draw two joints in two consecutive frames in a 2D coordinate system to illustrate the data decoupling. As shown in (a), P^i_{t_k} denotes joint i in frame t_k. Assume that joint i and joint j are the two end joints of one bone. (a) shows the raw data, i.e., the spatial-temporal information. The orange and blue dotted lines denote the decoupled spatial and temporal information, shown in (b) and (c), respectively. (d) illustrates the difference between the fast-temporal information (blue arrows) and the slow-temporal information (orange arrows).

Besides, for the temporal stream, different actions have different sensitivities to the motion scale. For some actions such as "clapping" versus "put two hands together", the short-term motion detail is essential. But for actions like "waving up" versus "waving down", the long-term movement trend is more important. Inspired by [8], we propose to calculate the temporal motion with both high-frame-rate sampling and low-frame-rate sampling, as shown in Fig. 4 (d). The two generated streams are called the fast-temporal stream and the slow-temporal stream, respectively.

Finally, we have four streams in total, namely, the spatial-temporal stream (original data), the spatial stream, the fast-temporal stream and the slow-temporal stream. We separately train four models with the same architecture, one per stream, and the classification scores are averaged to obtain the final result. A sketch of the decoupling is given below.
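A sketch of the four-stream decoupling under our reading of Sec. 3.6; the bone list and function name are hypothetical, and the exact sampling scheme for the two temporal streams is an assumption:

```python
import torch

# Hypothetical bone list: (child, parent) joint-index pairs of the skeleton.
BONES = [(1, 0), (2, 1), (3, 2)]  # ...extend with one pair per bone

def decouple_streams(x, bones=BONES, interval=2):
    """x: (N, T, C) joint coordinates (the spatial-temporal stream).
    Returns the four streams of Sec. 3.6; the two temporal streams differ
    only in the frame interval used to compute motion (an assumption)."""
    spatial = torch.zeros_like(x)
    for child, parent in bones:                     # differences along the bones
        spatial[child] = x[child] - x[parent]
    temporal_1 = torch.zeros_like(x)
    temporal_1[:, :-1] = x[:, 1:] - x[:, :-1]       # motion between consecutive frames
    temporal_2 = torch.zeros_like(x)
    temporal_2[:, :-interval] = x[:, interval:] - x[:, :-interval]  # coarser-scale motion
    return x, spatial, temporal_1, temporal_2       # each is trained with its own model

# The four per-stream classification scores are averaged at test time.
```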
4 Experiments

To verify the generalizability of the model, we use two datasets for hand gesture recognition (DHG [6] and SHREC [7]) and two datasets for human action recognition (NTU-60 [23] and NTU-120 [14]). We first perform exhaustive ablation studies on SHREC to verify the effectiveness of the proposed model components. Then, we evaluate our model on all four datasets to compare with the state-of-the-art methods.

4.1 Datasets

DHG:
The DHG [6] dataset contains 2800 video sequences of 14 hand gestures, each performed 5 times by 20 subjects. The gestures are performed in two ways: using one finger and using the whole hand. It therefore has two benchmarks: 14 gestures for coarse classification and 28 gestures for fine-grained classification. The 3D coordinates of 22 hand joints in real-world space are captured by an Intel RealSense camera. The dataset uses the leave-one-subject-out cross-validation strategy for evaluation.
SHREC:
The SHREC [7] dataset contains 2800 gesture sequences performed 1 to 10 times by 28 participants, in two ways like the DHG dataset. It splits the sequences into 1960 training sequences and 840 test sequences. The length of the sample gestures ranges from 20 to 50 frames. This dataset was used for the SHREC'17 competition held in conjunction with the Eurographics 3DOR'2017 Workshop.
NTU-60:
NTU-60 [23] is the most widely used indoor-captured action recognition dataset, containing 56,000 action clips in 60 action classes. The clips are performed by 40 volunteers and captured by 3 Kinect V2 cameras from different views. The dataset provides 25 joints for each subject in the skeleton sequences. It recommends two benchmarks: cross-subject (CS) and cross-view (CV), where the subjects and the cameras used in the training/test splits are different, respectively.
NTU-120:
NTU-120 [14] is similar to NTU-60 but larger. It contains 114,480 action clips in 120 action classes. The clips are performed by 106 volunteers in 32 camera setups. It recommends two benchmarks: cross-subject (CS) and cross-setup (CE), where cross-setup means using samples with odd setup IDs for training and the others for testing.
4.2 Training Details

To show the generalizability of our method, we use the same configuration for all experiments. The network is stacked from 8 DSTA blocks with 3 heads each. The output channels are 64, 64, 128, 128, 256, 256, 256 and 256, respectively. The input video is randomly/uniformly sampled to 150 frames and then randomly/centrally cropped to 128 frames for the training/test splits. For the fast-temporal features, the sampling interval is 2. When training, the initial learning rate is 0.1 and is divided by 10 at epochs 60 and 90; training ends at epoch 120. The batch size is 32 and weight decay is applied. We use SGD with momentum (0.9) as the optimizer and cross-entropy as the loss function. A sketch of this setup is given below.
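A minimal PyTorch sketch of the reported training setup; the weight decay value is unclear in the text and is therefore omitted, and `DSTANet` refers to the sketch in Sec. 3.5 above:

```python
from torch import nn, optim

model = DSTANet()                               # sketch from Sec. 3.5
criterion = nn.CrossEntropyLoss()
# Weight decay is used in the paper; its exact value is unclear in the text.
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Divide the learning rate by 10 at epochs 60 and 90; train for 120 epochs.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 90], gamma=0.1)

for epoch in range(120):
    # ...iterate mini-batches of size 32: forward, criterion, backward, step...
    scheduler.step()
```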
4.3 Ablation Studies

In this section, we investigate the effectiveness of the proposed components of the network and of the different data modalities. We conduct experiments on the SHREC dataset. Except for the component under investigation, all other settings are kept the same for fair comparison.
Network architectures
We first investigate the effect of the position encoding. As shown in Tab. 1, removing the position encoding seriously harms the performance. Decoupling the spatial and temporal dimensions (DPE) is better than not doing so (UPE). This is because the spatial and temporal dimensions have different properties, and treating them equivalently confuses the model.

Then we investigate the effect of the proposed spatial global regularization (SGR). By adding the SGR, the performance is improved from 94.5% to 96.3%, but if we meanwhile regularize the temporal dimension, the performance drops. This is reasonable since there are no fixed semantic meanings along the temporal dimension, and forcing the learning of a unified pattern causes a gap between the training set and the testing set.

Finally, we compare the three strategies introduced in Fig. 1. Strategy (a) obtains the lowest performance. We conjecture that this is because it only considers the intra-frame relations and ignores the inter-frame relations. Modeling the inter-frame relations exhaustively (strategy b) improves the performance, and the compromise (strategy c) obtains the best performance, possibly because the compromise strategy can somewhat reduce the overfitting problem.
Table 1.
Ablation studies on the model architecture on the SHREC dataset. ST-Att-c denotes the spatial-temporal attention network with attention strategy (c) introduced in Fig. 1. PE denotes position encoding. UPE/DPE denote unified/decoupled position encoding for the spatial and temporal dimensions. SGR/STGR denote spatial / spatial-temporal global regularization for computing the attention maps.

Method                  Accuracy (%)
ST-Att-c w/o PE         89.4
ST-Att-c + UPE          93.2
ST-Att-c + DPE          94.5
ST-Att-c + DPE + SGR    96.3
ST-Att-c + DPE + STGR   94.6
ST-Att-a + DPE + SGR    94.6
ST-Att-b + DPE + SGR    95.1
We show the learned attention maps of different layers (layer 1 and layer 8) in Fig. 5.

Fig. 5. Examples of the learned attention maps for different layers. T, I, M, R and L denote the thumb, index finger, middle finger, ring finger and little finger, respectively. As for the articulations, T1 denotes the base of the thumb and T4 denotes the tip of the thumb.
Data decoupling
To show the necessity of decoupling the raw data into four streams as introduced in Sec. 3.6, we report the results of using the four streams separately and the result of their fusion in Tab. 2. The accuracies of the decoupled streams are not as good as that of the raw data because some information is lost. However, since the four streams focus on different aspects and are complementary to each other, fusing them together improves the performance significantly.
Table 2.
Ablation studies on feature fusion on the SHREC dataset. Spatial-temporal denotes the raw data, i.e., the joint coordinates. The other types of features are introduced in Sec. 3.6.

Method            Accuracy (%)
spatial-temporal  96.3
spatial           95.1
fast-temporal     94.5
slow-temporal     93.7
Fusion            97.0
As shown in Fig. 6, we plot the per-class accuracies of the four streams to show their complementarity clearly. We also plot the differences in accuracy between streams, represented as the dotted lines. For spatial versus temporal information, the orange dotted line shows that the network with spatial information obtains higher accuracies mainly on classes that are closely related to shape changes, such as "grab", "expand" and "pinch", while the network with temporal information obtains higher accuracies mainly on classes that are closely related to positional changes, such as "swipe", "rot" and "shake". As for the different frame-rate sampling, the red dotted line shows that the slow-temporal stream performs better for classes such as "expand" and "tap", while the fast-temporal stream performs better for classes such as "swipe" and "rot". These phenomena verify the complementarity of the four modalities.

Fig. 6. Per-class accuracies for the different modalities on the SHREC-14 dataset. The dotted lines show the differences between two modalities.
Table 3.
Recognition accuracy comparison of our method and state-of-the-art methods on the SHREC and DHG datasets.

Method               Year  SHREC 14  SHREC 28  DHG 14  DHG 28
ST-GCN [33]          2018  92.7      87.7      91.2    87.1
STA-Res-TCN [10]     2018  93.6      90.7      89.2    85.0
ST-TS-HGR-NET [19]   2019  94.3      89.4      87.3    83.4
DG-STA [4]           2019  94.4      90.7      91.9    88.0
DSTA-Net (ours)      -     97.0      93.9      93.8    90.9

4.4 Comparison with the State-of-the-Art

We compare our model with state-of-the-art methods for skeleton-based action recognition on all four datasets, where our model significantly outperforms the other methods. Due to the space restriction, we only show some representative works; more comparisons are shown in the supplementary material. On the SHREC/DHG datasets for skeleton-based hand gesture recognition (Tab. 3), our model brings 2.6%/1.9% and 3.2%/2.9% improvements on the 14-gesture and 28-gesture benchmarks compared with the state-of-the-art. Note that the state-of-the-art accuracies are already very high (94.4%/91.9% and 90.7%/88.0% for the 14-gesture and 28-gesture benchmarks, respectively).
Table 4.
Recognition accuracy comparison of our method and state-of-the-art methods on the NTU-60 dataset. CS and CV denote the cross-subject and cross-view benchmarks, respectively.

Method            Year  CS (%)  CV (%)
ST-GCN [33]       2018  81.5    88.3
SRN+TSL [29]      2018  84.8    92.4
2s-AGCN [26]      2019  88.5    95.1
DGNN [25]         2019  89.9    96.1
NAS [20]          2020  89.4    95.7
DSTA-Net (ours)   -     91.5    96.4

Table 5.
Recognition accuracy comparison of our method and state-of-the-art methods on the NTU-120 dataset. CS and CE denote the cross-subject and cross-setup benchmarks, respectively.

Method                           Year  CS (%)  CE (%)
Two-Stream Attention LSTM [15]   2017  61.2    63.3
Body Pose Evolution Map [17]     2018  64.6    66.9
SkeleMotion [1]                  2019  67.7    66.9
DSTA-Net (ours)                  -     86.6    89.0

On the NTU-60 dataset (Tab. 4), our model brings 1.6% and 0.3% improvements. The performance on the CV benchmark is nearly saturated. For both the CS and CV benchmarks, we visualize the wrongly classified examples and find that many of them are nearly impossible even for a human to recognize using only the skeletal data. For example, for the two classes of reading and writing, the subject is in the same posture (standing or sitting) and holding a book in both; the only difference is whether there is a pen in the hand, which cannot be captured from the skeletal data. On the NTU-120 dataset (Tab. 5), our model also achieves state-of-the-art performance. Since this dataset was released recently, our method can provide a new baseline on it.
5 Conclusion

In this paper, we propose a novel decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition. It is a unified framework based solely on the attention mechanism, with no need to design hand-crafted traversal rules or graph topologies. We propose three techniques for building the DSTA-Net to meet the specific requirements of skeletal data, including spatial-temporal attention decoupling, decoupled position encoding and spatial global regularization. Besides, we introduce a skeleton-decoupling method to emphasize the spatial/temporal variations and motion scales of the skeletal data, resulting in a more comprehensive understanding of human actions and gestures. To verify the effectiveness and generalizability of the DSTA-Net, extensive experiments are conducted on four large datasets for both gesture and action recognition, where the DSTA-Net achieves state-of-the-art performance on all of them.
References
1. Caetano, C., Sena, J., Brémond, F., Dos Santos, J.A., Schwartz, W.R.: SkeleMotion: A new representation of skeleton joint sequences based on motion information for 3D action recognition. In: 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). pp. 1–8 (Sep 2019)
2. Cao, C., Lan, C., Zhang, Y., Zeng, W., Lu, H., Zhang, Y.: Skeleton-based action recognition with gated convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology pp. 3247–3257 (2018)
3. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6299–6308 (Jul 2017)
4. Chen, Y., Zhao, L., Peng, X., Yuan, J., Metaxas, D.N.: Construct dynamic graphs for hand gesture recognition via spatial-temporal attention. In: BMVC (2019)
5. Dai, Z., Yang, Z., Yang, Y., Cohen, W.W., Carbonell, J., Le, Q.V., Salakhutdinov, R.: Transformer-XL: Attentive language models beyond a fixed-length context. arXiv:1901.02860 (2019)
6. De Smedt, Q., Wannous, H., Vandeborre, J.P.: Skeleton-based dynamic hand gesture recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 1206–1214 (Jun 2016)
7. De Smedt, Q., Wannous, H., Vandeborre, J.P., Guerry, J., Le Saux, B., Filliat, D.: SHREC'17 track: 3D hand gesture recognition using a depth and skeletal dataset. In: Pratikakis, I., Dupont, F., Ovsjanikov, M. (eds.) Eurographics Workshop on 3D Object Retrieval. pp. 1–6 (2017)
8. Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6202–6211 (2019)
9. Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H.: Dual attention network for scene segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3146–3154 (2019)
10. Hou, J., Wang, G., Chen, X., Xue, J.H., Zhu, R., Yang, H.: Spatial-temporal attention res-TCN for skeleton-based dynamic hand gesture recognition. In: The European Conference on Computer Vision (ECCV) Workshops (2018)
11. Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2018)
12. Li, C., Zhong, Q., Xie, D., Pu, S.: Skeleton-based action recognition with convolutional neural networks. In: IEEE International Conference on Multimedia & Expo Workshops (ICMEW). pp. 597–600 (2017)
13. Li, S., Li, W., Cook, C., Zhu, C., Gao, Y.: Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5457–5466 (2018)
14. Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.Y., Kot, A.C.: NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 1–1 (2019)
15. Liu, J., Wang, G., Duan, L.Y., Abdiyeva, K., Kot, A.C.: Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Transactions on Image Processing (4), 1586–1599 (2017)
16. Liu, M., Liu, H., Chen, C.: Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition 68