Speech Emotion Recognition with Multiscale Area Attention and Data Augmentation
Mingke Xu, Fan Zhang, Xiaodong Cui, Wei Zhang
Nanjing Tech University, China; IBM Data and AI, USA; IBM Research AI, USA
ABSTRACT
In Speech Emotion Recognition (SER), emotional characteristics often appear in diverse forms of energy patterns in spectrograms. Typical attention neural network classifiers for SER are usually optimized on a fixed attention granularity. In this paper, we apply multiscale area attention in a deep convolutional neural network to attend to emotional characteristics with varied granularities, so that the classifier can benefit from an ensemble of attentions with different scales. To deal with data sparsity, we conduct data augmentation with vocal tract length perturbation (VTLP) to improve the generalization capability of the classifier. Experiments are carried out on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. We achieved 79.34% weighted accuracy (WA) and 77.54% unweighted accuracy (UA), which, to the best of our knowledge, is the state of the art on this dataset.
Index Terms: speech emotion recognition, convolutional neural network, attention mechanism, data augmentation
1. INTRODUCTION
Speech is an important carrier of emotions in human communication. Speech Emotion Recognition (SER) has wide application prospects in psychological assessment [1], robots [2], mobile services [3], etc. For example, a psychologist can design a treatment plan according to the emotions hidden or expressed in a patient's speech. Deep learning has accelerated the progress of recognizing human emotions from speech [4-9], but there are still deficiencies in SER research, such as data shortage and insufficient model accuracy.

Recently, we proposed the Head Fusion Net [10] (code released at github.com/lessonxmk/head_fusion), which achieved state-of-the-art performance on the IEMOCAP dataset. However, it does not fully address the above problems. In SER, emotion may display distinct energy patterns in spectrograms at varied granularities of areas. However, typical attention models in SER are usually optimized on a fixed scale, which may limit the model's capability to deal with diverse areas and granularities. Therefore, in this paper, we introduce multiscale area attention into a deep convolutional neural network model based on Head Fusion to improve model accuracy. Furthermore, data augmentation is used to address the data scarcity issue.

Our main contributions are as follows:
• To the best of our knowledge, this is the first attempt to apply multiscale area attention to SER.
• We performed data augmentation on the IEMOCAP dataset with vocal tract length perturbation (VTLP) and achieved an accuracy improvement of about 0.5% absolute.
• With area attention and VTLP-based data augmentation, we achieved the state of the art on the IEMOCAP dataset with a WA of 79.34% and a UA of 77.54%.
2. RELATED WORK
In 2014, the first SER model based on deep learning was proposed by Han et al. [4]. Recently, for the same purpose, M. Chen et al. [5] combined convolutional neural networks (CNN) and Long Short-Term Memory (LSTM); X. Wu et al. [6] replaced the CNN with capsule networks (CapsNet); Y. Xu et al. [7] used Gated Recurrent Units (GRU) to calculate features at the frame and utterance levels; and S. Parthasarathy [11] used ladder networks to combine an unsupervised auxiliary task with the primary task of predicting emotional attributes.

There is a recent resurgence of interest in attention-based SER models [8, 9, 12]. However, those attention mechanisms can only be calculated with a preset granularity, which may not adapt dynamically to different areas of interest in a spectrogram. Y. Li et al. [13] proposed area attention, which allows the model to calculate attention with multiple granularities concurrently, an idea that has not yet been explored in SER.

Insufficient data hinders progress in SER. Data augmentation has become a popular method to increase training data [14-17] in the related field of Automatic Speech Recognition (ASR). Yet, it has not enjoyed broad attention for SER.

In this paper, we extend multiscale area attention to SER with data augmentation. We introduce our method in Section 3 and experimental results in Section 4, followed by the conclusion in Section 5. The code is released at github.com/lessonxmk/Optimized_attention_for_SER.

Fig. 1. The architecture of the CNN with attention used as a classifier in this work.
3. METHODOLOGY
We first introduce our base convolutional neural network, which shares similarities with our Head Fusion Net [10], then the newly introduced multiscale area attention that enhances it, and finally the data augmentation technique.
We designed an attention-based convolutional neural network with 5 convolutional layers, an attention layer, and a fully connected layer. Fig. 1 shows the detailed model structure. First, the Librosa audio processing library [18] is used to extract the logMel spectrogram as features, which are fed into two parallel convolutional layers to extract textures from the time axis and the frequency axis, respectively. The result is fed into four consecutive convolutional layers to generate an 80-channel representation. The attention layer then attends over this representation and sends its output to the fully connected layer for classification. Batch normalization is applied after each convolutional layer.
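For illustration, the following is a minimal PyTorch sketch of such a classifier. Only the overall layout follows the description above (two parallel convolutions over the time and frequency axes, four further convolutions producing an 80-channel representation, an attention layer, a fully connected classifier, and batch normalization after each convolution); the kernel sizes, intermediate channel widths, pooling, and head count are illustrative assumptions, and ordinary multi-head self-attention stands in for the (area) attention layer.

```python
# Sketch of the attention-based CNN classifier described in the text.
# Kernel sizes, channel widths (other than the final 80), pooling, and the
# head count are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionCNN(nn.Module):
    def __init__(self, num_classes=4, channels=80, num_heads=4):
        super().__init__()
        # Two parallel convolutions: one elongated along time, one along frequency.
        self.conv_time = nn.Conv2d(1, 16, kernel_size=(1, 9), padding=(0, 4))
        self.conv_freq = nn.Conv2d(1, 16, kernel_size=(9, 1), padding=(4, 0))
        self.bn_time = nn.BatchNorm2d(16)
        self.bn_freq = nn.BatchNorm2d(16)
        # Four consecutive convolutional layers ending in an 80-channel representation.
        convs, bns = [], []
        in_ch = 32  # concatenation of the two 16-channel parallel branches
        for out_ch in (32, 48, 64, channels):
            convs.append(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1))
            bns.append(nn.BatchNorm2d(out_ch))
            in_ch = out_ch
        self.convs = nn.ModuleList(convs)
        self.bns = nn.ModuleList(bns)
        # Plain multi-head self-attention stands in for the (area) attention layer.
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads,
                                          batch_first=True)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, logmel):               # logmel: (batch, 1, freq, time)
        xt = F.relu(self.bn_time(self.conv_time(logmel)))
        xf = F.relu(self.bn_freq(self.conv_freq(logmel)))
        x = torch.cat([xt, xf], dim=1)       # merge the two parallel branches
        for conv, bn in zip(self.convs, self.bns):
            # Pooling is an assumption added to keep this sketch small.
            x = F.max_pool2d(F.relu(bn(conv(x))), 2)
        b, c, f, t = x.shape
        tokens = x.reshape(b, c, f * t).transpose(1, 2)   # (batch, positions, channels)
        attended, _ = self.attn(tokens, tokens, tokens)   # self-attention over positions
        return self.fc(attended.mean(dim=1))              # pool and classify


if __name__ == "__main__":
    dummy = torch.randn(2, 1, 40, 126)       # e.g. 40 mel bins x ~2 s of frames
    print(AttentionCNN()(dummy).shape)       # torch.Size([2, 4])
```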
In this section, we extend the area attention of Y. Li et al. [13] to SER. The attention mechanism can be regarded as a soft addressing operation that uses key-value pairs to represent the content stored in memory, with each element composed of an address (key) and a value. A query is matched against the keys, and the corresponding values are retrieved from memory according to the degree of correlation between the query and each key. The query, key, and value are usually first multiplied by parameter matrices W to obtain Q, K, and V. Eq. 1 shows the calculation of the attention score, where d_k is the dimension of K and the scaling by √d_k prevents the result from becoming too large [19]:

Q = W_q · query,  K = W_k · key,  V = W_v · value
Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

In self-attention, the query, key, and value all come from the same input X. By calculating self-attention, the model can focus on the connections between different parts of the input. In SER, the distribution of emotional characteristics often spans a larger scale, and using self-attention in speech emotion recognition improves accuracy.

However, under conventional attention, the model only uses a preset granularity as the basic unit of calculation, e.g., a word for a word-level translation model or a grid cell for an image-based model. Yet it is hard to know which granularity is most suitable for a complex task.

Area attention allows the model to attend at multiple scales and granularities and to learn the most appropriate granularities. As shown in Fig. 2, for a continuous memory block, multiple areas can be created to accommodate different granularities, e.g., 1x2, 2x1, 2x2, etc. In order to calculate attention in units of areas, we need to define the key and value of an area. For example, we can define the mean of an area as its key and the sum of an area as its value, so that attention can be evaluated in a way similar to ordinary attention (Eq. 1).

Exhaustive evaluation of attention on a large memory block may be computationally prohibitive, so a maximum length and width are set for the areas under investigation.
Fig. 2. Generating areas. Multiple areas can be generated by combining adjacent items in a continuous memory block. For a 3x3 memory block, if we set the max area size to 2x2, the memory block can be divided into areas of size 1x1, 1x2, 2x1, and 2x2.
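The following is a minimal sketch of 2-D area attention as described above: keys and values are derived from every rectangular area up to a maximum size (mean of the area as key, sum as value), and ordinary scaled dot-product attention is then computed over the enlarged memory. The looped enumeration and function names are ours, chosen for clarity rather than efficiency, and are not the authors' implementation.

```python
# Sketch of 2-D area attention: enumerate all rectangular areas up to a
# maximum size, use the mean of each area as its key and the sum as its
# value, then apply ordinary scaled dot-product attention.
import torch
import torch.nn.functional as F


def build_area_memory(memory, max_h=3, max_w=3):
    """memory: (batch, H, W, d) grid of features -> area keys/values of shape (batch, N, d)."""
    b, H, W, d = memory.shape
    keys, values = [], []
    for ah in range(1, max_h + 1):
        for aw in range(1, max_w + 1):
            for i in range(H - ah + 1):
                for j in range(W - aw + 1):
                    area = memory[:, i:i + ah, j:j + aw, :].reshape(b, -1, d)
                    keys.append(area.mean(dim=1))   # key = mean of the area
                    values.append(area.sum(dim=1))  # value = sum of the area
    return torch.stack(keys, dim=1), torch.stack(values, dim=1)


def area_attention(query, memory, max_h=3, max_w=3):
    """query: (batch, Lq, d); memory: (batch, H, W, d)."""
    K, V = build_area_memory(memory, max_h, max_w)
    d_k = query.size(-1)
    scores = query @ K.transpose(1, 2) / d_k ** 0.5   # (batch, Lq, num_areas)
    return F.softmax(scores, dim=-1) @ V


if __name__ == "__main__":
    grid = torch.randn(2, 5, 6, 80)      # e.g. an 80-channel CNN output on a 5x6 grid
    q = grid.reshape(2, -1, 80)          # self-attention: queries are the grid cells
    print(area_attention(q, grid).shape)  # torch.Size([2, 30, 80])
```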
Given the limited amount of training data in IEMOCAP, we use vocal tract length perturbation (VTLP) [14] as a means of data augmentation. VTLP effectively increases the number of speakers by perturbing the vocal tract length. We generated 7 additional replicas of the original data with the nlpaug library [20]. The augmented data is used only for training.
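A sketch of generating VTLP replicas with nlpaug is shown below. The exact VtlpAug signature and return type vary across nlpaug versions, so the call and its parameters should be treated as assumptions rather than the authors' exact augmentation script.

```python
# Sketch of producing VTLP-augmented training copies with nlpaug.
# The VtlpAug constructor arguments and return type may differ by nlpaug version.
import librosa
import nlpaug.augmenter.audio as naa

NUM_REPLICAS = 7  # the paper adds 7 VTLP replicas of the original training data


def make_vtlp_replicas(wav_path, sample_rate=16000):
    audio, sr = librosa.load(wav_path, sr=sample_rate)
    aug = naa.VtlpAug(sampling_rate=sr)   # perturbs the vocal tract length factor
    replicas = []
    for _ in range(NUM_REPLICAS):
        out = aug.augment(audio)
        # Recent nlpaug versions return a list; unwrap it if so.
        replicas.append(out[0] if isinstance(out, list) else out)
    return audio, replicas  # augmented replicas are used for training only
```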
Fig. 3. Results of modifying the max area size: (a) on the original data set; (b) on the augmented data set; (c) comparison of ACC. When trained on the original data set, the model with a max area size of 4x4 achieved the highest ACC, followed by 3x3; when trained on the augmented data set, the model with a max area size of 3x3 achieved the highest ACC. In most cases, the use of augmented data brings an accuracy increase of more than 0.5%. The best model achieved 78.44% ACC, with WA = 79.34% and UA = 77.54%.
4. EXPERIMENTS

4.1. Dataset
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [21] is the most widely used dataset in the SER field. It contains 12 hours of emotional speech performed by 10 actors from the Drama Department of the University of Southern California. The recordings are divided into two parts, improvised and scripted, according to whether the actors performed from a fixed script. The utterances are labeled with 9 types of emotion: anger, happiness, excitement, sadness, frustration, fear, surprise, other, and neutral state.

Due to the imbalance of the dataset, researchers usually choose the most common emotions, such as neutral state, happiness, sadness, and anger. Because excitement and happiness are similar to a certain degree and there are too few happy utterances, researchers sometimes replace happiness with excitement or combine excitement and happiness to increase the amount of data [22-24]. In addition, previous studies have shown that accuracy on improvised data is higher than on scripted data [12, 22], which may be because actors pay more attention to expressing their emotions than to the script during improvisation.

In this paper, following other published work, we use the improvised data in the IEMOCAP dataset with four types of emotion: neutral state, excitement, sadness, and anger.
We use weighted accuracy (WA) and unweighted accuracy (UA) for evaluation, both of which are broadly employed in the SER literature. Considering that WA and UA may not reach their maxima in the same model, we calculate the average of WA and UA as the final evaluation criterion (denoted ACC below), i.e., we save the model with the largest average of WA and UA as the optimal model.
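A small sketch of this criterion, assuming the usual SER definitions of WA (overall accuracy) and UA (mean of per-class recalls):

```python
# Weighted accuracy, unweighted accuracy, and their average (ACC),
# the quantity used here for model selection.
import numpy as np


def wa_ua_acc(y_true, y_pred, num_classes=4):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))                      # overall accuracy
    recalls = [np.mean(y_pred[y_true == c] == c)
               for c in range(num_classes) if np.any(y_true == c)]
    ua = float(np.mean(recalls))                               # mean per-class recall
    return wa, ua, (wa + ua) / 2
```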
We randomly divide the dataset into a training set (80% of the data) and a test set (20% of the data) for 5-fold cross-validation. Each utterance is divided into 2-second segments, with 1 second (in training) or 1.6 seconds (in testing) of overlap between segments. Although the utterances are segmented, testing is still performed at the utterance level: the prediction results of all segments from the same utterance are averaged to obtain the utterance-level prediction. Experience shows that a larger overlap makes the utterance-level recognition result more stable in testing.
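A sketch of the segmentation and utterance-level averaging, with illustrative sample-rate bookkeeping (the paper does not give exact code):

```python
# 2-second segmentation with overlap, and averaging of segment-level
# posteriors back to a single utterance-level decision.
import numpy as np


def segment(signal, sr=16000, seg_sec=2.0, overlap_sec=1.0):
    """overlap_sec is 1.0 in training and 1.6 in testing, per the text."""
    seg_len = int(seg_sec * sr)
    hop = seg_len - int(overlap_sec * sr)
    starts = range(0, max(len(signal) - seg_len, 0) + 1, hop)
    return [signal[s:s + seg_len] for s in starts] or [signal]


def utterance_prediction(segment_probs):
    """Average the class posteriors of all segments from one utterance."""
    return int(np.argmax(np.mean(np.asarray(segment_probs), axis=0)))
```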
Fig. 4. Result of modifying the amount of data with VTLP. The horizontal axis is the total amount of data used for training; for example, 8 on the horizontal axis represents the original data plus 7 replicas of augmented data. The more augmented data added, the higher the accuracy.
Optimal maximum area size
The optimal max area size is investigated on both the original data and the VTLP-augmented data. The results are shown in Fig. 3. When trained on the original data set, the model with a max area size of 4x4 achieved the highest ACC, followed by 3x3. When trained on the augmented data set, the model with a max area size of 3x3 achieved the highest ACC. In most cases, the use of augmented data brings an accuracy increase of more than 0.5% absolute. Therefore, we suggest using a max area size of 3x3 and using VTLP for data augmentation.

Table 1. Result of area feature selection

(a) WA
Key \ Value    Max      Mean     Sum
Max            0.7869   0.7880   0.7869
Mean           0.7864   0.7808   0.7846
Sample

(b) UA
Key \ Value    Max      Mean     Sum
Max            0.7642   0.7627   0.7686
Mean           0.7639   0.7608   0.7675
Sample

(c) ACC
Key \ Value    Max      Mean     Sum
Max            0.7755   0.7753   0.7777
Mean           0.7751   0.7708   0.7760
Sample

Fig. 5. Feature representation: (a) original logMel, (b) CNN, (c) area attention. The CNN attends more to areas with high energy or sharp energy changes, which benefits sudden and intense emotions. The area attention model not only attends to these areas but also extends its attention along the time axis (horizontal), which enables it to distinguish long-term emotions.

Table 2. Ablation experiment
Model                  WA      UA      ACC
CNN                    0.7467  0.7222  0.7345
CNN+VTLP               0.7891  0.7683  0.7787
Attention              0.7807  0.7628  0.7718
Attention+VTLP         0.7879  0.7734  0.7807
Area attention         0.7911  0.7705  0.7808
Area attention+VTLP    0.7934  0.7754  0.7844
Selection of area features
Experiments are conducted to investigate the performance of various area features. For the key, we selected Max, Mean, and Sample; for the value, we selected Max, Mean, and Sum. Sample refers to adding, during training, a perturbation proportional to the standard deviation on the basis of Mean, calculated according to Eq. 2, where x is a sampled key, µ and σ are the mean and standard deviation of the area, and ξ is a random variable following the standard normal distribution. We use K-V to denote the model that uses K as the key and V as the value.

x = µ + σ · ξ,  where ξ ∼ N(0, 1)    (2)

Table 1 shows the result. It can be observed that Sample-Max achieved the highest ACC and Sample-Mean the lowest, with little difference in ACC among the other cases. We speculate that this is because a perturbed key introduces greater randomness in training.
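A minimal sketch of the Sample key of Eq. 2, assuming it is applied per area during training and falls back to the plain mean at test time:

```python
# "Sample" area key (Eq. 2): the area mean perturbed by its standard
# deviation times a standard normal variable during training.
import torch


def sample_key(area, training=True):
    """area: (..., n_items, d) tensor holding the items inside one area."""
    mu = area.mean(dim=-2)
    if not training:
        return mu
    sigma = area.std(dim=-2, unbiased=False)
    xi = torch.randn_like(mu)          # xi ~ N(0, 1)
    return mu + sigma * xi             # x = mu + sigma * xi  (Eq. 2)
```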
Amount of augmented data
Experiments are also carried out to study the impact of the amount of VTLP-augmented data on SER performance, as shown in Fig. 4. It can be observed that the more replicas of augmented data are added to training, the higher the accuracy.
Ablation study
We conducted ablation experiments with the model without the attention layer (CNN only) and the model with an ordinary attention layer (equivalent to a 1x1 max area size). Table 2 shows the result. It can be seen that area attention and VTLP together enable the model to achieve the highest accuracy. As a case study, we visualize in Fig. 5 the feature representations of an input logMel spectrogram right before the fully connected layer of the learned model. It clearly shows that, compared to the conventional CNN with its more localized representation, area attention tends to cover a wide context along the time axis, which is one of the reasons it can outperform the CNN. Also, from Table 2 we can see that as the model becomes stronger, the improvement brought by VTLP decreases. This is because VTLP conducts label-preserving perturbation to improve the robustness of the classifier; when the model becomes stronger with attention or multiscale area attention, the model itself becomes more robust, which may offset to a certain degree the impact of VTLP.
Table 3 . Accuracy comparison with existing SER results
Method                                                        WA(%)   UA(%)   Year
Attention pooling (P. Li et al.) [22]                         71.75   68.06   2018
CTC + Attention (Z. Zhao et al.) [23]                         67.00   69.00   2019
Self attention (L. Tarantino et al.) [12]                     70.17   70.85   2019
BiGRU (Y. Xu et al.) [7]                                      66.60   70.50   2020
Multitask learning + Attention (A. Nediyanchath et al.) [9]   76.40   70.10   2020
Head fusion (Ours) [10]                                       76.18   76.36   2020
Area attention (Ours)                                         79.34   77.54   2021
Comparison with existing results
As shown in Table 3, we compare our accuracy with other SER results published in recent years. These results use the same data set and evaluation metrics as our experiments.
5. CONCLUSION
In this paper, we have applied multiscale area attention to SER, designed an attention-based convolutional neural network, conducted experiments on the IEMOCAP data set with VTLP augmentation, and obtained 79.34% WA and 77.54% UA, which is the state of the art. In future research, we will continue along these lines by improving the application of attention in SER and applying more data augmentation methods to SER.
6. REFERENCES

[1] Lu-Shih Alex Low, Namunu C Maddage, Margaret Lech, Lisa B Sheeber, and Nicholas B Allen, "Detection of clinical depression in adolescents' speech during family interactions," IEEE Transactions on Biomedical Engineering, vol. 58, no. 3, pp. 574–586, 2010.
[2] Xu Huahu, Gao Jue, and Yuan Jian, "Application of speech emotion recognition in intelligent household robot," IEEE, 2010, vol. 1, pp. 537–541.
[3] Won-Joong Yoon, Youn-Ho Cho, and Kyu-Sik Park, "A study of speech emotion recognition and its application to mobile services," in International Conference on Ubiquitous Intelligence and Computing. Springer, 2007, pp. 758–766.
[4] Kun Han, Dong Yu, and Ivan Tashev, "Speech emotion recognition using deep neural network and extreme learning machine," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[5] Mingyi Chen, Xuanji He, Jing Yang, and Han Zhang, "3-D convolutional recurrent neural networks with attention model for speech emotion recognition," IEEE Signal Processing Letters, vol. 25, no. 10, pp. 1440–1444, 2018.
[6] Xixin Wu, Songxiang Liu, Yuewen Cao, Xu Li, Jianwei Yu, Dongyang Dai, Xi Ma, Shoukang Hu, Zhiyong Wu, Xunying Liu, et al., "Speech emotion recognition using capsule networks," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6695–6699.
[7] Yunfeng Xu, Hua Xu, and Jiyun Zou, "HGFM: A hierarchical grained and feature model for acoustic emotion recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6499–6503.
[8] Darshana Priyasad, Tharindu Fernando, Simon Denman, Sridha Sridharan, and Clinton Fookes, "Attention driven fusion for multi-modal emotion recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 3227–3231.
[9] Anish Nediyanchath, Periyasamy Paramasivam, and Promod Yenigalla, "Multi-head attention for speech emotion recognition with auxiliary learning of gender recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7179–7183.
[10] Mingke Xu, Fan Zhang, and Samee U Khan, "Improve accuracy of speech emotion recognition with attention head fusion," IEEE, 2020, pp. 1058–1064.
[11] Srinivas Parthasarathy and Carlos Busso, "Semi-supervised speech emotion recognition with ladder networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.
[12] Lorenzo Tarantino, Philip N Garner, and Alexandros Lazaridis, "Self-attention for speech emotion recognition," Proc. Interspeech 2019, pp. 2578–2582, 2019.
[13] Yang Li, Lukasz Kaiser, Samy Bengio, and Si Si, "Area attention," in International Conference on Machine Learning, 2019, pp. 3846–3855.
[14] Navdeep Jaitly and Geoffrey E Hinton, "Vocal tract length perturbation (VTLP) improves speech recognition," in Proc. ICML Workshop on Deep Learning for Audio, Speech and Language, 2013, vol. 117.
[15] Xiaodong Cui, Vaibhava Goel, and Brian Kingsbury, "Data augmentation for deep convolutional neural network acoustic modeling," IEEE, 2015, pp. 4545–4549.
[16] Xiaodong Cui, Vaibhava Goel, and Brian Kingsbury, "Data augmentation for deep neural network acoustic modeling," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 9, pp. 1469–1477, 2015.
[17] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv preprint arXiv:1904.08779, 2019.
[18] Brian McFee, Colin Raffel, Dawen Liang, Daniel Ellis, and Oriol Nieto, "librosa: Audio and music signal analysis in Python," in Python in Science Conference, 2015.
[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[20] Edward Ma, "NLP augmentation," https://github.com/makcedward/nlpaug, 2019.
[21] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335, 2008.
[22] Pengcheng Li, Yan Song, Ian Vince McLoughlin, Wu Guo, and Lirong Dai, "An attention pooling based representation learning method for speech emotion recognition," in Interspeech, 2018, pp. 3087–3091.
[23] Ziping Zhao, Zhongtian Bao, Zixing Zhang, Nicholas Cummins, Haishuai Wang, and Björn Schuller, "Attention-enhanced connectionist temporal classification for discrete speech emotion recognition," Proc. Interspeech 2019, pp. 206–210, 2019.
[24] Michael Neumann and Ngoc Thang Vu, "Improving speech emotion recognition with unsupervised representation learning on unlabeled speech," in