2021 International Wireless Communications and Mobile Computing (IWCMC) | 2021
Speech Emotion Recognition Model with Time-Scale-Invariance MFCCs as Input
Abstract
Speech emotion recognition (SER) is an important task in human communication. In recent years, Mel-frequency cepstral coefficients (MFCCs) have been widely used as features in SER and related tasks. In this study, we develop a multi-head-attention CNN model with gender recognition as an auxiliary task. Based on the proposed model, we explore how MFCCs computed at different time scales, and different combinations of them, affect the model's performance when used as input. Experimental results show that, within a moderate range, using MFCCs with higher time resolution as input helps the model achieve better SER performance. Appropriately combining MFCCs at different time scales also improves performance.
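To make the idea of MFCCs at different time scales concrete, the following is a minimal sketch, not the authors' pipeline. It assumes that "time scale" refers to the analysis window and hop length used when framing the signal; the abstract does not state this explicitly. The helper names (`mfcc`, `multi_scale_mfcc`), the window/hop pairs, the 40 mel bands, and the 13 coefficients are all illustrative assumptions, and the input is synthetic noise rather than speech.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def dct2_matrix(n_out, n_in):
    # Unnormalized DCT-II basis, shape (n_out, n_in).
    n = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))

def mfcc(signal, sr, win_length, hop_length, n_mels=40, n_mfcc=13):
    # Frame the signal, window each frame, and take the power spectrum.
    n_frames = 1 + (len(signal) - win_length) // hop_length
    idx = np.arange(win_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(win_length)
    power = np.abs(np.fft.rfft(frames, n=win_length, axis=1)) ** 2
    # Mel filterbank energies -> log -> DCT gives the cepstral coefficients.
    log_mel = np.log(power @ mel_filterbank(n_mels, win_length, sr).T + 1e-10)
    return log_mel @ dct2_matrix(n_mfcc, n_mels).T  # shape (n_frames, n_mfcc)

def multi_scale_mfcc(signal, sr, scales_ms=((25, 10), (50, 20), (100, 40))):
    # scales_ms holds hypothetical (window, hop) pairs in milliseconds.
    feats = {}
    for win_ms, hop_ms in scales_ms:
        win, hop = sr * win_ms // 1000, sr * hop_ms // 1000
        feats[(win_ms, hop_ms)] = mfcc(signal, sr, win, hop)
    return feats

# Synthetic one-second signal at 16 kHz (a stand-in for a speech utterance).
sr = 16000
signal = np.random.default_rng(0).standard_normal(sr)
features = multi_scale_mfcc(signal, sr)
for scale, m in features.items():
    print(scale, m.shape)
```

Shorter windows and hops yield more frames per utterance, that is, higher time resolution. Because the frame counts differ across scales, any combination of them as model input needs an alignment step (for example cropping, padding, or resampling along time); the abstract does not say which approach the authors use.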