Multimedia Tools and Applications | 2021

A deep multimodal network based on bottleneck layer features fusion for action recognition

 
 

Abstract


Human Activity Recognition (HAR) in videos using convolutional neural networks has become the preferred choice for researchers due to the tremendous success of deep learning models in visual recognition applications. Following the advent of low-cost depth sensors, many multimodal activity recognition systems were developed over the past decade; nevertheless, recognizing complex human activities in videos remains challenging. In this work, we propose a deep bottleneck multimodal feature fusion (D-BMFF) framework that fuses three modalities, RGB, depth (RGB-D), and 3D skeletal coordinates, for activity classification, making full use of the information a depth sensor provides simultaneously. During training, RGB and depth frames are sampled at regular intervals from each activity video, while the 3D coordinates are first converted into a single RGB skeleton motion history image (RGB-SklMHI). Features are extracted from the multimodal inputs using recent pre-trained deep network architectures. The multimodal features obtained from the bottleneck layers, just before the top layer, are fused using multiset discriminant correlation analysis (M-DCA), which allows for robust visual action modeling. Finally, the fused features are classified with a linear multiclass support vector machine (SVM). The proposed approach is evaluated on four standard RGB-D datasets: UT-Kinect, CAD-60, Florence 3D, and SBU Interaction. Our framework produces excellent results and outperforms state-of-the-art methods.
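To make the pipeline concrete, below is a minimal sketch of the stages the abstract describes, under stated assumptions: a ResNet50 backbone stands in for the paper's pre-trained networks, synthetic arrays stand in for real RGB, depth, and RGB-SklMHI inputs, and plain concatenation stands in for M-DCA fusion (the paper's actual fusion step, which is not reproduced here). All names and shapes are illustrative, not the authors' implementation.

# Sketch of the D-BMFF pipeline: bottleneck features per modality,
# a placeholder fusion step, and a linear multiclass SVM classifier.
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from sklearn.svm import LinearSVC

# Pre-trained backbone with the top (classification) layer removed;
# global average pooling yields the 2048-d bottleneck feature vector.
backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def bottleneck_features(frames):
    """Extract bottleneck-layer features for a batch of 224x224 RGB images."""
    return backbone.predict(preprocess_input(frames.astype("float32")), verbose=0)

# Synthetic stand-ins: per video, RGB and depth frames sampled at regular
# intervals, plus one RGB-SklMHI image derived from the 3D joint coordinates.
n_videos, n_frames = 8, 4
rgb = np.random.randint(0, 256, (n_videos, n_frames, 224, 224, 3))
depth = np.random.randint(0, 256, (n_videos, n_frames, 224, 224, 3))  # depth maps rendered as 3-channel images
sklmhi = np.random.randint(0, 256, (n_videos, 224, 224, 3))
labels = np.arange(n_videos) % 3  # dummy activity labels

video_feats = []
for v in range(n_videos):
    f_rgb = bottleneck_features(rgb[v]).mean(axis=0)    # average over sampled frames
    f_depth = bottleneck_features(depth[v]).mean(axis=0)
    f_skl = bottleneck_features(sklmhi[v:v + 1])[0]
    # Placeholder fusion: the paper fuses the three modality features with
    # M-DCA; concatenation is used here only to keep the sketch short.
    video_feats.append(np.concatenate([f_rgb, f_depth, f_skl]))

clf = LinearSVC()  # linear multiclass SVM over the fused features
clf.fit(np.stack(video_feats), labels)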

DOI 10.1007/s11042-021-11415-9
Language English
Journal Multimedia Tools and Applications
