Multimedia Tools and Applications | 2021

A multi-modal emotion fusion classification method combined expression and speech based on attention mechanism

Abstract


This paper investigates how to use an attention mechanism to fuse the time-series information of facial expressions and speech, and proposes a multi-modal feature-fusion emotion recognition model based on an attention mechanism. First, facial expression features and speech features are extracted. Facial expression feature extraction is based on a C3D-LSTM hybrid model, which can effectively capture the temporal and spatial expression features in videos. For speech feature extraction, Mel-Frequency Cepstral Coefficients (MFCCs) are used to extract the initial speech features, and a convolutional neural network extracts higher-level features from them. Then, a facial expression and speech recognition method based on an attention mechanism is proposed. By performing attention analysis on the fused features, the proposed method captures the relationships among the features, so that noise-free, highly discriminative features receive larger weights while noisy features receive smaller ones. Finally, the method is applied to fused facial expression and speech recognition. Experimental results show that the proposed multi-modal emotion classification model outperforms those reported in other literature on the RML dataset, with an average recognition rate of 81.18%.
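
As a rough illustration of the two technical steps the abstract names, the Python sketch below shows (1) the initial MFCC extraction for the speech branch and (2) an attention-weighted fusion of the two modality feature vectors. Everything here is an assumption inferred from the abstract alone: librosa is assumed for MFCCs, PyTorch as the framework, and the file name, layer sizes, and exact attention form are illustrative stand-ins, not the paper's settings (six output classes match the RML dataset's six emotion categories).

# Hedged sketch, not the paper's implementation. Assumes librosa and PyTorch;
# all dimensions and parameters below are illustrative.
import librosa
import torch
import torch.nn as nn

# --- Step 1: initial speech features via MFCC (speech branch) ---
# "utterance.wav" is a hypothetical input; 13 coefficients is a common
# default, not necessarily the paper's value.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, n_frames)
# The abstract says a CNN then learns further features from this matrix.

# --- Step 2: attention-based fusion of the two modalities ---
class AttentionFusion(nn.Module):
    """Weights concatenated expression/speech features with learned attention."""

    def __init__(self, expr_dim=512, speech_dim=256, num_classes=6):
        super().__init__()
        fused_dim = expr_dim + speech_dim
        # Scores each fused feature dimension; a softmax turns the scores
        # into weights, so discriminative dimensions are emphasized and
        # noisy ones are suppressed, as the abstract describes.
        self.attention = nn.Sequential(
            nn.Linear(fused_dim, fused_dim),
            nn.Tanh(),
            nn.Linear(fused_dim, fused_dim),
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, expr_feat, speech_feat):
        fused = torch.cat([expr_feat, speech_feat], dim=-1)     # (B, fused_dim)
        weights = torch.softmax(self.attention(fused), dim=-1)  # per-dimension weights
        return self.classifier(weights * fused)                 # logits over emotions

# Stand-ins for the two branches: a C3D-LSTM would produce expr_feat from
# video clips, and the MFCC-CNN branch would produce speech_feat.
expr_feat = torch.randn(4, 512)
speech_feat = torch.randn(4, 256)
logits = AttentionFusion()(expr_feat, speech_feat)
print(logits.shape)  # torch.Size([4, 6])

Whether the paper's attention operates per feature dimension, per modality, or over time steps cannot be determined from the abstract; a per-dimension softmax is used here as the simplest reading of "features with strong distinguishability obtain more weight."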

DOI 10.1007/s11042-021-11260-w
Language English
Journal Multimedia Tools and Applications
