Signal, Image and Video Processing | 2021
View transform graph attention recurrent networks for skeleton-based action recognition
Abstract
Skeleton-based human action recognition has recently attracted considerable attention owing to the accessibility and popularity of 3D skeleton data. However, it is difficult to represent spatial–temporal skeleton sequences effectively, given the large variations in action representations captured from different viewpoints. To obtain a better representation of spatial–temporal skeletal features, this paper introduces a view transform graph attention recurrent network (VT+GARN) for view-invariant human action recognition. We design a sequence-based view-invariant transform strategy to reduce the influence of viewpoint on the spatial–temporal positions of skeleton joints. A graph attention recurrent network then automatically computes the attention coefficients, learns the representation of the transformed spatiotemporal skeletal features, and outputs the classification result. Ablation studies and extensive experiments on three challenging datasets, Northwestern-UCLA, NTU RGB+D and UWA3DII, demonstrate the effectiveness and superiority of our method.
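The sequence-based view-invariant transform can be illustrated with a minimal NumPy sketch: translate the whole sequence so the first-frame root joint sits at the origin, then rotate so the first-frame spine vector aligns with a fixed axis. The joint indices, the choice of the y-axis, and the function name below are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def view_normalize(seq, root=0, spine=1):
    """Hedged sketch of a sequence-level view-invariant transform.

    seq: (T, J, 3) array of 3D joint positions over T frames.
    The root/spine joint indices and the target y-axis are
    illustrative assumptions, not the paper's exact choices.
    """
    seq = seq - seq[0, root]              # translate: first-frame root -> origin
    v = seq[0, spine] - seq[0, root]      # first-frame spine vector
    v = v / np.linalg.norm(v)
    y = np.array([0.0, 1.0, 0.0])
    # Build the rotation taking v onto y via Rodrigues' formula.
    axis = np.cross(v, y)
    s = np.linalg.norm(axis)              # sin of the angle between v and y
    c = v @ y                             # cos of the angle
    if s < 1e-8:
        # Already aligned, or exactly opposite (flip 180° about x).
        R = np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    else:
        k = axis / s
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        R = np.eye(3) + s * K + (1.0 - c) * (K @ K)
    return seq @ R.T                      # apply R to every joint in every frame
```

Applying this per sequence removes the camera-dependent global translation and tilt before the graph attention recurrent network sees the joints, so sequences of the same action captured from different viewpoints map to a common canonical frame (up to rotation about the vertical axis).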