2020 28th European Signal Processing Conference (EUSIPCO) | 2021

Adieu recurrence? End-to-end speech emotion recognition using a context stacking dilated convolutional network


Abstract


In state-of-the-art end-to-end Speech Emotion Recognition (SER) systems, Convolutional Neural Network (CNN) layers are typically used to extract affective features, while Long Short-Term Memory (LSTM) layers model long-term temporal dependencies. However, these systems suffer from several problems: 1) the model largely ignores temporal structure in speech due to the limited receptive field of the CNN layers; 2) the model inherits the drawbacks of Recurrent Neural Networks (RNNs), e.g. the exploding/vanishing gradient problem, the polynomial growth of computation time with the input sequence length, and the lack of parallelizability. In this work, we propose a novel end-to-end SER structure that does not contain any recurrent or fully connected layers. By leveraging dilated causal convolution, the receptive field of the proposed model grows substantially at reasonably low computational cost. By also using context stacking, the proposed model is capable of exploiting long-term temporal dependencies and can serve as an alternative to RNNs. Experiments on the publicly available partition of the RECOLA database show improved results compared to a state-of-the-art system. We also verify that both the proposed model and the state-of-the-art model, when trained on short sequences (i.e. 20 s), can make accurate predictions for very long sequences (e.g. ≥ 75 s).
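
To illustrate why dilated causal convolution enlarges the receptive field cheaply, the following is a minimal sketch and not the authors' implementation. The kernel size of 2 and the doubling dilation schedule are assumptions borrowed from common practice; the abstract does not state the actual configuration. The function names `receptive_field` and `causal_dilated_conv` are hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of a stack of dilated causal convolutions.

    Each layer with dilation d adds (kernel_size - 1) * d samples of past context.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """1-D causal dilated convolution over a list of floats.

    The output at time t depends only on x[t], x[t - dilation], ...,
    so no future sample is used. Positions before the start count as zero.
    """
    k = len(weights)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - (k - 1 - i) * dilation  # the last weight aligns with x[t]
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


# Assumed schedule: dilation doubles at every layer (1, 2, 4, ..., 512).
dilations = [2 ** i for i in range(10)]
rf_dilated = receptive_field(2, dilations)   # 10 layers with doubling dilation
rf_plain = receptive_field(2, [1] * 10)      # 10 layers without dilation
```

Under these assumptions, ten dilated layers cover 1024 samples, whereas ten undilated layers with the same number of weights cover only 11. Context stacking, as named in the abstract, is a separate mechanism and is not modelled in this sketch.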

Pages 1-5
DOI 10.23919/Eusipco47968.2020.9287667
Language English
Journal 2020 28th European Signal Processing Conference (EUSIPCO)

Full Text