A Transfer Learning Method for Speech Emotion Recognition from Automatic Speech Recognition
Sitong Zhou, Homayoon Beigi
Columbia University; Recognition Technologies, Inc. and Columbia University
[email protected], [email protected]

Abstract
This paper presents a transfer learning method for speech emotion recognition based on a Time-Delay Neural Network (TDNN) architecture. A major challenge in current speech-based emotion detection research is data scarcity. The proposed method addresses this problem by applying transfer learning techniques to leverage data from the automatic speech recognition (ASR) task, for which ample data is available. Our experiments also show the advantage of speaker-class adaptation modeling techniques by adopting identity-vector (i-vector) based features in addition to standard Mel-Frequency Cepstral Coefficient (MFCC) features [1]. We show that the transfer learning models significantly outperform the corresponding methods without pretraining on ASR. The experiments were performed on the publicly available IEMOCAP dataset, which provides 12 hours of emotional speech data. The transfer learning was initialized using the TED-LIUM v.2 speech dataset, which provides 207 hours of audio with corresponding transcripts. We achieve significantly higher accuracy than the state of the art, using five-fold cross validation. Using only speech, we obtain an accuracy of 71.7% over anger, excitement, sadness, and neutral emotional content.
Index Terms: transfer learning, emotion recognition, IEMOCAP, time-delay neural network
1. Introduction
Detecting emotions from speech has attracted attention for its use in enhancing natural human-computer interaction. The ability to understand human emotional state helps machines bring empathy to various applications.

Speech emotion recognition suffers from an insufficiency of labeled data. Though several emotion datasets have been released [2][3][4], emotion datasets are relatively small due to expensive collection costs, compared with the plentiful data available for tasks like automatic speech recognition (ASR) [5] and speaker recognition [6]. Most speech emotion detection models are trained from scratch on a single dataset [7][8][9][10], and therefore cannot successfully adapt to novel scenarios that were not encountered during training. One possible solution is to transfer knowledge acquired from large-scale datasets of relevant speech tasks to the emotion recognition domain. Although some efforts have applied transfer learning to categorical emotion detection from other paralinguistic tasks, such as speaker and gender recognition and emotional attribute prediction [11][12], they do not choose source domains with large-scale datasets and have not shown significant improvement over non-transfer learning methods.

Previous research has shown that speech emotion detection can be improved by incorporating textual data [13]. Multi-modal methods can significantly improve emotion detection performance by combining lexical features from given transcripts with acoustic features from the audio [14][15][16]. However, in real application scenarios, transcripts are often absent. Although ASR can provide transcripts of emotional speech data in real time [16], it requires loading large language models and is computationally costly when decoding sequences. Therefore, transfer learning using ASR as the source domain may be an efficient way to incorporate textual information into emotion detection through the high-level features extracted by ASR models.

Another challenge for emotion recognition is that speakers express emotions in different ways; in addition, environments can affect acoustic features. Speaker adaptation is useful for encapsulating speaker- and environment-specific information in acoustic features. i-vector [17] based adaptation has been shown to be fast and efficient in speech recognition [18]. We employ i-vector based speaker adaptation in emotion detection.

This paper proposes a transfer learning method that adapts ASR models to the emotion recognition domain. The model is pre-trained on the TED-LIUM v.2 dataset [5], with over 207 hours of data, and fine-tuned on 12 hours of emotional speech. The model architecture is TDNN-based [19][1][18], with speaker-adapted MFCC [1] features as input. Our experiments show improvements in emotion detection using transfer learning from ASR to speech emotion recognition combined with speaker adaptation. Performance is evaluated on the benchmark Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [2], achieving 71.7% unweighted accuracy over angry, happy, sad, and neutral under a 5-fold cross validation strategy. Our method significantly outperforms the state-of-the-art strategy [9].
2. Related Work
The most effective models in prior work are based on deep learning, which can learn high-level features from low-level acoustic features. Lee's work [7] showed the importance of long-range context effects and significantly improved RNN results over a DNN model. This work remained the state of the art for years, with 63.89% unweighted accuracy (UA), until it was surpassed by a convolutional LSTM, a hybrid approach of convolution layers and LSTM [9], with 68.8% accuracy. Previous literature has seldom discussed the TDNN architecture for emotion recognition. A TDNN can capture temporal information as RNNs and LSTMs do, but is faster due to its parallelizability and lower computational cost during training [18], which is a desirable property when training on large-scale ASR data.

Many approaches [7][10] are speaker independent, where features are normalized per individual, resulting in information loss, while our work is conducted in a speaker-dependent context using full raw MFCC features combined with i-vectors containing speaker characteristics [17]. Peddinti [18] proposed an efficient TDNN-based architecture for ASR with the combination of MFCCs and i-vectors as features, which efficiently learns representations robust across various speakers and environments. Our study uses bottleneck layers of this TDNN architecture to obtain high-level feature representations that reflect insights from ASR tasks, and fine-tunes them on emotion datasets.
3. Method
The emotion recognition problem is a classification problem when we represent emotion as categories rather than dimensional representations. Given a dataset

D = {(X, z)},    (1)

where X is the acoustic feature input and z is the categorical output corresponding to the emotion prediction, we want to find a function

f : X → z    (2)

that maps features to categories. The model is trained on frame-level labels, and predicts utterance labels by aggregating frame-level predictions through maximum likelihood, summing the results over frames.

Full MFCC features with all 40 coefficients, computed at each time index, are used as input to the neural network. Instead of mean normalization of the MFCCs, a 100-dimensional i-vector is appended to the MFCC features at each frame to encode mean-offset information. The i-vector extraction model is trained as described in [18].
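Written out explicitly, the frame-to-utterance aggregation mentioned above is (our notation; p_k(x_t) denotes the frame-level class score for emotion class k at frame t, and T is the number of frames in the utterance):

$$\hat{z} = \arg\max_{k \in \{1,\dots,4\}} \sum_{t=1}^{T} p_k(x_t)$$

That is, the per-frame output vectors are summed over the utterance and the highest-valued dimension of the sum is taken as the utterance label.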
The TDNN is designed to capture long-term temporal dependencies at lower computational cost than an RNN. It operates similarly to a feed-forward DNN architecture, where lower layers focus on input content in narrow windows and higher layers connect windows of selected previous-layer nodes to process information from a wider context. Its deeper layers can therefore learn effective long-term temporal dependencies without the recurrent connections that hinder parallel computation.

The pretraining on ASR follows the Kaldi recipe for the TED-LIUM tasks [20], which uses 13 TDNN layers, each consisting of 1024 activation nodes. The time stride of each layer, which defines the window over which nodes at neighboring time steps in the previous layer are combined, is set to 0 for the 1st and 5th TDNN layers, to 1 for the 2nd through 4th layers, and to 3 for the 6th layer onward. A fully connected prefinal layer of 1024 dimensions follows the 13th TDNN layer before decoding the output sequences. The model is trained with a sequence-level objective function, the lattice-free version of the maximum mutual information criterion (LF-MMI) [21], which maximizes the log-likelihood of the correct sequences.
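As an illustration of the layer layout just described, the following is a minimal sketch of the TDNN stack in PyTorch (our own illustration, not the Kaldi implementation: dilated 1-D convolutions stand in for spliced TDNN layers, the ReLU/batch-norm combination is an assumption, and the 140-dimensional input corresponds to the 40 MFCCs plus the 100-dimensional i-vector):

```python
import torch
import torch.nn as nn

def tdnn_layer(in_dim, out_dim, time_stride):
    # time_stride 0: the layer sees a single frame (a 1x1 convolution);
    # otherwise a context of {-t, 0, +t} is emulated with dilation t.
    if time_stride == 0:
        conv = nn.Conv1d(in_dim, out_dim, kernel_size=1)
    else:
        conv = nn.Conv1d(in_dim, out_dim, kernel_size=3,
                         dilation=time_stride, padding=time_stride)
    return nn.Sequential(conv, nn.ReLU(), nn.BatchNorm1d(out_dim))

strides = [0, 1, 1, 1, 0] + [3] * 8    # 13 layers, strides as described above
dims = [140] + [1024] * 13             # 40 MFCCs + 100-dim i-vector = 140 inputs

tdnn = nn.Sequential(
    *[tdnn_layer(dims[i], dims[i + 1], s) for i, s in enumerate(strides)],
    nn.Conv1d(1024, 1024, kernel_size=1),  # fully connected prefinal layer
)

x = torch.randn(8, 140, 200)   # (batch, feature_dim, num_frames)
h = tdnn(x)                    # frame-level activations, shape (8, 1024, 200)
```

The widening dilation pattern is what lets the deeper layers cover a wide temporal context without recurrent connections.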
Emotion labels are given per utterance in the dataset. We label all frames with the label of the utterance in which they lie. To train for emotion detection, the 12th and 13th TDNN layers as well as the ASR prefinal layer are selected to produce bottleneck embeddings as high-level features learnt from ASR; a new fully connected dense layer is appended after the embedding layer for predicting frame-level emotion labels, and a softmax layer with four output dimensions is used to predict the frame-level emotion. The model uses cross-entropy as the objective function for frame-level classification.

Since the output of this model is per frame rather than per utterance, we aggregate frame-level predictions using maximum likelihood by adding the output vectors over frames: the prediction corresponds to the highest-valued dimension of the sum of the softmax layer outputs over all frames within the utterance.
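A minimal sketch of the fine-tuning head and the utterance-level decoding described above, in PyTorch (our illustration; the width of the appended dense layer is an assumption, and `frames` stands for the bottleneck embeddings extracted from the pretrained TDNN):

```python
import torch
import torch.nn as nn

NUM_CLASSES = 4          # angry, happy, sad, neutral
BOTTLENECK_DIM = 1024    # width of the selected TDNN bottleneck layer

# Layers appended after the pretrained TDNN bottleneck.
head = nn.Sequential(
    nn.Linear(BOTTLENECK_DIM, 1024),  # new dense layer (width assumed)
    nn.ReLU(),
    nn.Linear(1024, NUM_CLASSES),     # four-dimensional frame-level output
)
criterion = nn.CrossEntropyLoss()     # frame-level cross-entropy objective

# Training: every frame inherits the label of its utterance.
frames = torch.randn(200, BOTTLENECK_DIM)  # embeddings for 200 frames
labels = torch.full((200,), 2)             # e.g. the whole utterance is "sad"
loss = criterion(head(frames), labels)

# Decoding: sum the frame-level softmax outputs over the utterance and
# pick the highest-valued dimension of the sum.
with torch.no_grad():
    utterance_scores = torch.softmax(head(frames), dim=1).sum(dim=0)
    prediction = utterance_scores.argmax().item()
```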
4. Experiment
This work uses IEMOCAP, which contains 12 hours of audio data with scripted and improvised speech, performed by ten actors, one male and one female paired in each of five sessions. For training and testing, four of the ten categories are selected: angry, happy, sad, and neutral, giving a more balanced and efficient dataset, and resulting in a final collection of 4936 utterances, each with a unique emotion label. The dataset consists of five sessions, and the category distribution is shown in Table 1.

Table 1: Emotion category distribution in IEMOCAP

session    ang    exc    neu    sad    total
ses1       229    143    384    194     950
ses2       137    210    362    197     906
ses3       240    151    320    305    1016
ses4       327    238    258    143     966
ses5       170    299    384    245    1098
total     1103   1041   1708   1084    4936
For pretraining on ASR, we use the feed-forward TDNN to capture long-term temporal dependencies from short-term feature representations. Hidden activations are sub-sampled in order to speed up training [18]. The model is pre-trained on ASR data, with 13 TDNN layers and output layers for decoding sequences based on acoustic models. As neighboring activations share largely overlapping input contexts, sub-sampling the activations reduces computational cost without sacrificing the coverage range over input frames. The hyperparameters for the model architecture and training are chosen according to the Kaldi TED-LIUM v.2 TDNN recipe [20], which has been tuned properly on the TED-LIUM v.2 dataset, achieving a 7.6% word error rate (WER) on the test set after six epochs of training. The parameters are optimized through preconditioned stochastic gradient descent (SGD) updates, following the training recipe detailed in [22].
After obtaining the pretrained model, we use the 12th, 13th, and prefinal layers as the bottleneck features for the appended fully connected layer and the softmax layer that predict frame emotions. We test on session 5 after training on sessions 1-4, and find that the 12th TDNN layer gives the best performance; therefore we use the 12th layer output as the bottleneck embedding in later experiments.

4.4. Evaluation Method

For parallel comparison with other methods [7][9][10], we train under a 5-fold cross validation strategy where each time we choose one session from IEMOCAP for testing and the other four for training. The results are evaluated by the average unweighted accuracy over the five cross validation runs.
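The evaluation protocol can be sketched as follows (`train_model` and `sessions` are hypothetical stand-ins for the fine-tuning procedure and the IEMOCAP session splits; we take unweighted accuracy to be the unweighted mean of per-class recalls, a common definition):

```python
import numpy as np

def unweighted_accuracy(y_true, y_pred, num_classes=4):
    # Unweighted mean of per-class recalls, so every emotion class
    # counts equally regardless of how many utterances it has.
    recalls = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.any():
            recalls.append(float((y_pred[mask] == c).mean()))
    return float(np.mean(recalls))

def five_fold_cv(sessions, train_model):
    """Leave-one-session-out cross validation over five IEMOCAP sessions."""
    scores = []
    for held_out in range(len(sessions)):
        train = [s for i, s in enumerate(sessions) if i != held_out]
        model = train_model(train)                  # fine-tune on 4 sessions
        y_true, y_pred = model.predict(sessions[held_out])
        scores.append(unweighted_accuracy(y_true, y_pred))
    return float(np.mean(scores))                   # reported 5-fold CV UA
```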
5. Results
Comparing the prefinal, 12th, and 13th layers as in Table 2, we found that the 12th layer has the best performance and the prefinal layer the worst. We hypothesize that this is because, in ASR, the prefinal and 13th layers are more specialized for speech recognition as they are closer to the final output layer, while the 12th layer learns general high-level acoustic features that help emotion recognition.

Table 2: Test accuracy on session 5 with different bottleneck layers
Bottleneck Layer    Test Accuracy on Ses 5 (%)
Prefinal                      -
12th TDNN                     -
13th TDNN                     -

Our model using the 12th TDNN layer outperforms, to the best of our knowledge, the other current state-of-the-art methods, as shown in Table 3. The 5-fold cross validation unweighted accuracy is improved from 68.8% [9] to 71.7%.

Table 3: Model comparison on unweighted accuracy (UA) in % from five-fold cross validation
Model                    Single Session UA (ses1-ses5)    5-fold CV UA
ASR Transfer Learning     -      -      -      -      -       71.7
RNN-ELM [7]                                                   63.89
Conv-LSTM [9]                                                 68.8
CTC-BLSTM [10]                                                -

To study the model performance within each emotion category, we present a confusion matrix, shown in Figure 1, calculated as the average confusion matrix over the five cross validation experiments for our best model architecture, using the 12th TDNN layer as the bottleneck layer. The excitement category has a lower accuracy than the other categories, with 28% of its samples confused with neutral utterances. We observed that neutral is the most likely wrong prediction for all non-neutral categories. This might be due to the fact that non-neutral utterances usually contain a large proportion of frames carrying no emotional content. We also found that the model confuses neutrality with sadness, which may be because many neutral utterances are spoken in a low tone, as sad ones are.

Figure 1: Confusion Matrix for Transfer Learning Method Predictions
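For reference, the averaged confusion matrix can be computed as in the following sketch (our illustration; `fold_results` is a hypothetical list of per-fold label/prediction pairs):

```python
import numpy as np

def normalized_confusion(y_true, y_pred, num_classes=4):
    # Entry (i, j) is the fraction of class-i utterances predicted as class j.
    m = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m / np.maximum(m.sum(axis=1, keepdims=True), 1)

# Average the per-fold confusion matrices over the five CV runs.
fold_results = [(np.random.randint(0, 4, 100), np.random.randint(0, 4, 100))
                for _ in range(5)]  # placeholder labels/predictions per fold
avg_cm = np.mean([normalized_confusion(t, p) for t, p in fold_results], axis=0)
print(np.round(avg_cm, 2))
```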
6. Conclusion
Our study shows that transfer learning from ASR is a good strategy for emotion classification, and indicates potential feature overlap between speech-to-text and emotion recognition. Our method is limited to frame-level prediction, where frames are predicted first and then aggregated into utterance-level labels. This frame-level structure ignores sequential information when decoding emotion labels. In the future, we expect sequence models that predict at the utterance level to bring further performance improvements by considering sequential information during sequence decoding.
7. References

[1] H. Beigi, Fundamentals of Speaker Recognition. New York: Springer, 2011.
[2] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008, SpringerLink Online: DOI 10.1007/s10579-008-9076.
[3] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proceedings of the 9th European Conference on Speech Communication and Technology, vol. 5, Lisbon, Portugal, 2006.
[4] P. Jackson and S. ul Haq, "Surrey audio-visual expressed emotion (SAVEE) database," Apr. 2011.
[5] A. Rousseau, P. Deléglise, and Y. Estève, "Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks," May 2014.
[6] A. Nagrani, S. Albanie, and A. Zisserman, "Seeing voices and hearing faces: Cross-modal biometric matching," Jun. 18-22, 2018.
[7] J. Lee and I. Tashev, "High-level feature representation using recurrent neural network for speech emotion recognition," Sep. 2015.
[8] Y. Kim, H. Lee, and E. M. Provost, "Deep learning for robust feature generation in audiovisual emotion recognition," pp. 3687–3691, May 26-31, 2013.
[9] A. Satt, S. Rozenberg, and R. Hoory, "Efficient emotion recognition from speech using deep learning on spectrograms," Aug. 20-24, 2017.
[10] V. Chernykh, G. Sterling, and P. Prihodko, "Emotion recognition from speech with recurrent neural networks," arXiv preprint arXiv:1701.08071, Jul. 2018.
[11] S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, "Transfer learning for improving speech emotion classification accuracy," Sep. 2-6, 2018.
[12] S. Ghosh, E. Laksana, L.-P. Morency, and S. Scherer, "Representation learning for speech emotion recognition," in Interspeech, Sep. 2016.
[13] S. Poria, E. Cambria, R. Bajpai, and A. Hussain, "A review of affective computing: From unimodal analysis to multimodal fusion," Information Fusion, vol. 37, pp. 98–126, Sep. 2017.
[14] S. Tripathi and H. Beigi, "Multi-modal emotion recognition on IEMOCAP dataset using deep learning," arXiv:1804.05788v1 [cs.AI], Apr. 2018.
[15] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou, "End-to-end multimodal emotion recognition using deep neural networks," IEEE Journal of Selected Topics in Signal Processing, vol. 11, pp. 1301–1309, Sep. 2017.
[16] J. Cho, R. Pappagari, P. Kulkarni, J. Villalba, Y. Carmiel, and N. Dehak, "Deep neural networks for emotion recognition combining audio and transcripts," Sep. 2-6, 2018.
[17] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," vol. 19, no. 4, pp. 788–798, May 2011.
[18] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Interspeech, Sep. 2015.
[19] M. Sugiyama, H. Sawai, and A. Waibel, "Review of TDNN (time delay neural network) architectures for speech recognition," vol. 1, pp. 582–585, Jun. 1991.
[20] Kaldi-ASR GitHub repository, "A TDNN recipe for automatic speech recognition training on TED-LIUM." [Online]. Available: https://github.com/kaldi-asr/kaldi/blob/master/egs/tedlium/s5_r2/local/chain/tuning/run_tdnn_1g.sh
[21] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Interspeech, Sep. 2016.