A Transfer Learning Method for Speech Emotion Recognition from Automatic Speech Recognition
Sitong Zhou, Homayoon Beigi
Columbia University; Recognition Technologies, Inc. and Columbia University
[email protected], [email protected]

Abstract
This paper presents a transfer learning method for speech emotion recognition based on a Time-Delay Neural Network (TDNN) architecture. A major challenge in current speech-based emotion detection research is data scarcity. The proposed method addresses this problem by applying transfer learning techniques to leverage data from the automatic speech recognition (ASR) task, for which ample data is available. Our experiments also show the advantage of speaker-class adaptation modeling techniques by adopting identity-vector (i-vector) based features in addition to standard Mel-Frequency Cepstral Coefficient (MFCC) features [1]. We show that the transfer learning models significantly outperform the corresponding methods without pretraining on ASR. The experiments were performed on the publicly available IEMOCAP dataset, which provides 12 hours of emotional speech data. The transfer learning was initialized using the TED-LIUM v.2 speech dataset, which provides 207 hours of audio with corresponding transcripts. We achieve significantly higher accuracy than the state of the art, using five-fold cross validation. Using only speech, we obtain an accuracy of 71.7% over anger, excitement, sadness, and neutral emotional content.
Index Terms: transfer learning, emotion recognition, IEMOCAP, time-delay neural network
1. Introduction
Detecting emotions from speech has attracted attention for its use in enhancing natural human-computer interaction. The ability to understand human emotional state helps machines bring empathy to various applications.

Speech emotion recognition suffers from an insufficiency of labeled data. Though several emotion datasets have been released [2][3][4], emotion datasets are relatively small due to expensive collection costs, compared with the plentiful data available for tasks like automatic speech recognition (ASR) [5] and speaker recognition [6]. Most speech emotion detection models are trained from scratch on a single dataset [7][8][9][10], and therefore cannot successfully adapt to novel scenarios that were not encountered during training. One possible solution is to transfer knowledge acquired from large-scale datasets of relevant speech tasks to the emotion recognition domain. Although some efforts have applied transfer learning to categorical emotion detection from other paralinguistic tasks, such as speaker and gender recognition and emotional attribute prediction [11][12], they do not choose source domains with large-scale datasets and have not shown significant improvement over non-transfer learning methods.

Previous research has shown that speech emotion detection can be improved by incorporating textual data [13]. Multi-modal methods can significantly improve emotion detection performance by combining lexical features from given transcripts with acoustic features from the audio [14][15][16]. However, in real application scenarios, transcripts are often absent. Although ASR can provide transcripts of emotional speech data in real time [16], it requires loading large language models and is computationally costly when decoding sequences. Therefore, transfer learning using ASR as the source domain may be an efficient way to incorporate textual information into emotion detection through the high-level features extracted by ASR models.

Another challenge for emotion recognition is that speakers express emotions in different ways; in addition, environments can affect acoustic features. Speaker adaptation is useful for encapsulating speaker- and environment-specific information in acoustic features. i-vector [17] based adaptation has been shown to be fast and efficient in speech recognition [18]. We employ i-vector based speaker adaptation in emotion detection.

This paper proposes a transfer learning method that adapts ASR models to the emotion recognition domain. The model is pre-trained on the TED-LIUM v.2 dataset [5], with over 207 hours of data, and fine-tuned on 12 hours of emotional speech. The model architecture is TDNN-based [19][1][18], with speaker-adapted MFCC [1] features as input. Our experiments show improvements in emotion detection using transfer learning from ASR to speech emotion recognition combined with speaker adaptation. Performance is evaluated on the benchmark Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [2], achieving 71.7% unweighted accuracy over angry, happy, sad, and neutral under a 5-fold cross validation strategy. Our method significantly outperforms the state-of-the-art strategy [9].
2. Related Work
The most effective models in prior work are based on deep learning, which can learn high-level features from low-level acoustic features. Lee's work [7] showed the importance of long-range context effects and significantly improved RNN results over a DNN model. This work remained the state of the art for years, with 63.89% unweighted accuracy (UA), until it was surpassed by a convolutional LSTM, a hybrid approach of convolution layers and LSTM [9], with 68.8% accuracy. Previous literature has seldom discussed the TDNN architecture for emotion recognition. A TDNN can capture temporal information as RNNs and LSTMs do, but is faster due to its parallelizability and lower computational cost during training [18], which is a desirable property when training on large-scale ASR data.

Many approaches [7][10] are speaker independent, where features are normalized per individual, resulting in information loss, while our work is conducted in a speaker-dependent context using full raw MFCC features combined with i-vectors containing speaker characteristics [17]. Peddinti [18] proposed an efficient TDNN-based architecture for ASR with the combination of MFCCs and i-vectors as features, which efficiently learns representations robust across various speakers and environments. Our study uses bottleneck layers of this TDNN architecture to obtain high-level feature representations that reflect insights from ASR tasks, and fine-tunes them on emotion datasets.
3. Method
The emotion recognition problem is a classification problem when we represent emotion as categories rather than dimensional representations. Given a dataset

D = {(X, z)},    (1)

where X is the acoustic feature input and z is the categorical output corresponding to the emotion prediction, we want to find a function

f : X → z    (2)

that maps features to categories. The model is trained on frame-level labels, and predicts utterance labels by aggregating frame-level predictions through maximum likelihood, summing the results over frames.

Full MFCC features with all 40 coefficients, computed at each time index, are used as input to the neural network. Instead of mean normalization of the MFCCs, a 100-dimensional i-vector is appended to the MFCC features at each frame to encode mean-offset information. The i-vector extraction model is trained as described in [18].
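Written out explicitly, the frame-to-utterance aggregation mentioned above is (our notation; p_k(x_t) denotes the frame-level class score for emotion class k at frame t, and T is the number of frames in the utterance):

$$\hat{z} = \arg\max_{k \in \{1,\dots,4\}} \sum_{t=1}^{T} p_k(x_t)$$

That is, the per-frame output vectors are summed over the utterance and the highest-valued dimension of the sum is taken as the utterance label.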
The TDNN is designed to capture long-term temporal dependencies at lower computational cost than an RNN. It operates similarly to a feed-forward DNN architecture, where lower layers focus on input content in narrow windows and higher layers connect windows of selected previous-layer nodes to process information from a wider context. Its deeper layers can therefore learn effective long-term temporal dependencies without the recurrent connections that hinder parallel computation.

The pretraining on ASR follows the Kaldi recipe for the TED-LIUM tasks [20], which uses 13 TDNN layers, each consisting of 1024 activation nodes. The time stride of each layer, which defines the window over which nodes at neighboring time steps in the previous layer are combined, is set to 0 for the 1st and 5th TDNN layers, to 1 for the 2nd through 4th layers, and to 3 for the 6th layer onward. A fully connected prefinal layer of 1024 dimensions follows the 13th TDNN layer before decoding the output sequences. The model is trained with a sequence-level objective function, the lattice-free version of the maximum mutual information criterion (LF-MMI) [21], which maximizes the log-likelihood of the correct sequences.
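As an illustration of the layer layout just described, the following is a minimal sketch of the TDNN stack in PyTorch (our own illustration, not the Kaldi implementation: dilated 1-D convolutions stand in for spliced TDNN layers, the ReLU/batch-norm combination is an assumption, and the 140-dimensional input corresponds to the 40 MFCCs plus the 100-dimensional i-vector):

```python
import torch
import torch.nn as nn

def tdnn_layer(in_dim, out_dim, time_stride):
    # time_stride 0: the layer sees a single frame (a 1x1 convolution);
    # otherwise a context of {-t, 0, +t} is emulated with dilation t.
    if time_stride == 0:
        conv = nn.Conv1d(in_dim, out_dim, kernel_size=1)
    else:
        conv = nn.Conv1d(in_dim, out_dim, kernel_size=3,
                         dilation=time_stride, padding=time_stride)
    return nn.Sequential(conv, nn.ReLU(), nn.BatchNorm1d(out_dim))

strides = [0, 1, 1, 1, 0] + [3] * 8    # 13 layers, strides as described above
dims = [140] + [1024] * 13             # 40 MFCCs + 100-dim i-vector = 140 inputs

tdnn = nn.Sequential(
    *[tdnn_layer(dims[i], dims[i + 1], s) for i, s in enumerate(strides)],
    nn.Conv1d(1024, 1024, kernel_size=1),  # fully connected prefinal layer
)

x = torch.randn(8, 140, 200)   # (batch, feature_dim, num_frames)
h = tdnn(x)                    # frame-level activations, shape (8, 1024, 200)
```

The widening dilation pattern is what lets the deeper layers cover a wide temporal context without recurrent connections.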
Emotion labels are given per utterance in the dataset. We label all frames with the label of the utterance in which they lie. To train for emotion detection, the 12th and 13th TDNN layers as well as the ASR prefinal layer are selected to produce bottleneck embeddings as high-level features learnt from ASR; a new fully connected dense layer is appended after the embedding layer for predicting frame-level emotion labels, and a softmax layer with four output dimensions is used to predict the frame-level emotion. The model uses cross-entropy as the objective function for frame-level classification.

Since the output of this model is per frame rather than per utterance, we aggregate frame-level predictions using maximum likelihood by adding the output vectors over frames: the prediction corresponds to the highest-valued dimension of the sum of the softmax layer outputs over all frames within the utterance.
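A minimal sketch of the fine-tuning head and the utterance-level decoding described above, in PyTorch (our illustration; the width of the appended dense layer is an assumption, and `frames` stands for the bottleneck embeddings extracted from the pretrained TDNN):

```python
import torch
import torch.nn as nn

NUM_CLASSES = 4          # angry, happy, sad, neutral
BOTTLENECK_DIM = 1024    # width of the selected TDNN bottleneck layer

# Layers appended after the pretrained TDNN bottleneck.
head = nn.Sequential(
    nn.Linear(BOTTLENECK_DIM, 1024),  # new dense layer (width assumed)
    nn.ReLU(),
    nn.Linear(1024, NUM_CLASSES),     # four-dimensional frame-level output
)
criterion = nn.CrossEntropyLoss()     # frame-level cross-entropy objective

# Training: every frame inherits the label of its utterance.
frames = torch.randn(200, BOTTLENECK_DIM)  # embeddings for 200 frames
labels = torch.full((200,), 2)             # e.g. the whole utterance is "sad"
loss = criterion(head(frames), labels)

# Decoding: sum the frame-level softmax outputs over the utterance and
# pick the highest-valued dimension of the sum.
with torch.no_grad():
    utterance_scores = torch.softmax(head(frames), dim=1).sum(dim=0)
    prediction = utterance_scores.argmax().item()
```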
4. Experiment
This work uses IEMOCAP, which contains 12 hours of audio data with scripted and improvised speech, performed by ten actors, one male and one female paired in each of five sessions. For training and testing, four of the ten categories are selected: angry, happy, sad, and neutral, giving a more balanced and efficient dataset, and resulting in a final collection of 4936 utterances, each with a unique emotion label. The dataset consists of five sessions, and the category distribution is shown in Table 1.

Table 1: Emotion category distribution in IEMOCAP

session    ang    exc    neu    sad    total
ses1       229    143    384    194     950
ses2       137    210    362    197     906
ses3       240    151    320    305    1016
ses4       327    238    258    143     966
ses5       170    299    384    245    1098
total     1103   1041   1708   1084    4936
For pretraining on ASR, we use the feed-forward TDNN to capture long-term temporal dependencies from short-term feature representations. Hidden activations are sub-sampled in order to speed up training [18]. The model is pre-trained on ASR data, with 13 TDNN layers and output layers for decoding sequences based on acoustic models. As neighboring activations share largely overlapping input contexts, sub-sampling the activations reduces computational cost without sacrificing the coverage range over input frames. The hyperparameters for the model architecture and training are chosen according to the Kaldi TED-LIUM v.2 TDNN recipe [20], which has been tuned properly on the TED-LIUM v.2 dataset, achieving a 7.6% word error rate (WER) on the test set after six epochs of training. The parameters are optimized through preconditioned stochastic gradient descent (SGD) updates, following the training recipe detailed in [22].
After obtaining the pretrained model, we use the 12th, 13th, and prefinal layers as the bottleneck features for the appended fully connected layer and the softmax layer that predict frame emotions. We test on session 5 after training on sessions 1-4, and find that the 12th TDNN layer gives the best performance; therefore we use the 12th layer output as the bottleneck embedding in later experiments.

4.4. Evaluation Method

For parallel comparison with other methods [7][9][10], we train under a 5-fold cross validation strategy where each time we choose one session from IEMOCAP for testing and the other four for training. The results are evaluated by the average unweighted accuracy over the five cross validation runs.
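The evaluation protocol can be sketched as follows (`train_model` and `sessions` are hypothetical stand-ins for the fine-tuning procedure and the IEMOCAP session splits; we take unweighted accuracy to be the unweighted mean of per-class recalls, a common definition):

```python
import numpy as np

def unweighted_accuracy(y_true, y_pred, num_classes=4):
    # Unweighted mean of per-class recalls, so every emotion class
    # counts equally regardless of how many utterances it has.
    recalls = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.any():
            recalls.append(float((y_pred[mask] == c).mean()))
    return float(np.mean(recalls))

def five_fold_cv(sessions, train_model):
    """Leave-one-session-out cross validation over five IEMOCAP sessions."""
    scores = []
    for held_out in range(len(sessions)):
        train = [s for i, s in enumerate(sessions) if i != held_out]
        model = train_model(train)                  # fine-tune on 4 sessions
        y_true, y_pred = model.predict(sessions[held_out])
        scores.append(unweighted_accuracy(y_true, y_pred))
    return float(np.mean(scores))                   # reported 5-fold CV UA
```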
5. Results
Comparing the prefinal, 12th, and 13th layers as in Table 2, we found that the 12th layer has the best performance and the prefinal layer the worst. We hypothesize that this is because, in ASR, the prefinal and 13th layers are more specialized for speech recognition as they are closer to the final output layer, while the 12th layer learns general high-level acoustic features that help emotion recognition.

Table 2: Test accuracy on session 5 with different bottleneck layers
Bottleneck Layer    Test Accuracy on Ses 5 (%)
Prefinal                      -
12th TDNN                     -
13th TDNN                     -

Our model using the 12th TDNN layer outperforms, to the best of our knowledge, the other current state-of-the-art methods, as shown in Table 3. The 5-fold cross validation unweighted accuracy is improved from 68.8% [9] to 71.7%.

Table 3: Model comparison on unweighted accuracy (UA) in % from five-fold cross validation
Model                    Single Session UA (ses1-ses5)    5-fold CV UA
ASR Transfer Learning     -      -      -      -      -       71.7
RNN-ELM [7]                                                   63.89
Conv-LSTM [9]                                                 68.8
CTC-BLSTM [10]                                                -

To study the model performance within each emotion category, we present a confusion matrix, shown in Figure 1, calculated as the average confusion matrix over the five cross validation experiments for our best model architecture, using the 12th TDNN layer as the bottleneck layer. The excitement category has a lower accuracy than the other categories, with 28% of its samples confused with neutral utterances. We observed that neutral is the most likely wrong prediction for all non-neutral categories. This might be due to the fact that non-neutral utterances usually contain a large proportion of frames carrying no emotional content. We also found that the model confuses neutrality with sadness, which may be because many neutral utterances are spoken in a low tone, as sad ones are.

Figure 1: Confusion Matrix for Transfer Learning Method Predictions
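For reference, the averaged confusion matrix can be computed as in the following sketch (our illustration; `fold_results` is a hypothetical list of per-fold label/prediction pairs):

```python
import numpy as np

def normalized_confusion(y_true, y_pred, num_classes=4):
    # Entry (i, j) is the fraction of class-i utterances predicted as class j.
    m = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m / np.maximum(m.sum(axis=1, keepdims=True), 1)

# Average the per-fold confusion matrices over the five CV runs.
fold_results = [(np.random.randint(0, 4, 100), np.random.randint(0, 4, 100))
                for _ in range(5)]  # placeholder labels/predictions per fold
avg_cm = np.mean([normalized_confusion(t, p) for t, p in fold_results], axis=0)
print(np.round(avg_cm, 2))
```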
6. Conclusion
Our study shows that transfer learning from ASR is a good strategy for emotion classification, and indicates potential feature overlap between speech-to-text and emotion recognition. Our method is limited to frame-level prediction, where frames are predicted first and then aggregated into utterance-level labels. This frame-level structure ignores sequential information when decoding emotion labels. In the future, we expect sequence models that predict at the utterance level to bring further performance improvements by considering sequential information during sequence decoding.
7. References

[1] H. Beigi, Fundamentals of Speaker Recognition. New York: Springer, 2011.
[2] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008, SpringerLink Online: DOI 10.1007/s10579-008-9076.
[3] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proceedings of the 9th European Conference on Speech Communication and Technology, vol. 5, Lisbon, Portugal, 2006.
[4] P. Jackson and S. ul Haq, "Surrey audio-visual expressed emotion (SAVEE) database," Apr. 2011.
[5] A. Rousseau, P. Deléglise, and Y. Estève, "Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks," May 2014.
[6] A. Nagrani, S. Albanie, and A. Zisserman, "Seeing voices and hearing faces: Cross-modal biometric matching," Jun. 18-22, 2018.
[7] J. Lee and I. Tashev, "High-level feature representation using recurrent neural network for speech emotion recognition," Sep. 2015.
[8] Y. Kim, H. Lee, and E. M. Provost, "Deep learning for robust feature generation in audiovisual emotion recognition," pp. 3687–3691, May 26-31, 2013.
[9] A. Satt, S. Rozenberg, and R. Hoory, "Efficient emotion recognition from speech using deep learning on spectrograms," Aug. 20-24, 2017.
[10] V. Chernykh, G. Sterling, and P. Prihodko, "Emotion recognition from speech with recurrent neural networks," arXiv preprint arXiv:1701.08071, Jul. 2018.
[11] S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, "Transfer learning for improving speech emotion classification accuracy," Sep. 2-6, 2018.
[12] S. Ghosh, E. Laksana, L.-P. Morency, and S. Scherer, "Representation learning for speech emotion recognition," in Interspeech, Sep. 2016.
[13] S. Poria, E. Cambria, R. Bajpai, and A. Hussain, "A review of affective computing: From unimodal analysis to multimodal fusion," Information Fusion, vol. 37, pp. 98–126, Sep. 2017.
[14] S. Tripathi and H. Beigi, "Multi-modal emotion recognition on IEMOCAP dataset using deep learning," arXiv:1804.05788v1 [cs.AI], Apr. 2018.
[15] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou, "End-to-end multimodal emotion recognition using deep neural networks," IEEE Journal of Selected Topics in Signal Processing, vol. 11, pp. 1301–1309, Sep. 2017.
[16] J. Cho, R. Pappagari, P. Kulkarni, J. Villalba, Y. Carmiel, and N. Dehak, "Deep neural networks for emotion recognition combining audio and transcripts," Sep. 2-6, 2018.
[17] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," vol. 19, no. 4, pp. 788–798, May 2011.
[18] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Interspeech, Sep. 2015.
[19] M. Sugiyama, H. Sawai, and A. Waibel, "Review of TDNN (time delay neural network) architectures for speech recognition," vol. 1, pp. 582–585, Jun. 1991.
[20] Kaldi-ASR GitHub repository, "A TDNN recipe for automatic speech recognition training on TED-LIUM." [Online]. Available: https://github.com/kaldi-asr/kaldi/blob/master/egs/tedlium/s5_r2/local/chain/tuning/run_tdnn_1g.sh
[21] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Interspeech, Sep. 2016.