Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition
Shuiyang Mao, P. C. Ching, C.-C. Jay Kuo and Tan Lee
Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong
Ming Hsieh Department of Electrical Engineering, University of Southern California, USA
[email protected], {pcching, tanlee}@ee.cuhk.edu.hk, [email protected]

Abstract
Categorical speech emotion recognition is typically performed as a sequence-to-label problem, i.e., to determine the discrete emotion label of the input utterance as a whole. One of the main challenges in practice is that most of the existing emotion corpora do not give ground truth labels for each segment; instead, we only have labels for whole utterances. To extract segment-level emotional information from such weakly labeled emotion corpora, we propose using multiple instance learning (MIL) to learn segment embeddings in a weakly supervised manner. Also, for a sufficiently long utterance, not all of the segments contain relevant emotional information. In this regard, three attention-based neural network models are then applied to the learned segment embeddings to attend to the most salient parts of a speech utterance. Experiments on the CASIA corpus and the IEMOCAP database show better or highly competitive results compared with other state-of-the-art approaches.
Index Terms: categorical speech emotion recognition, weak labeling, multiple instance learning, attention modeling
1. Introduction
Automatic speech emotion recognition (ASER) aims to decode emotional content from audio signals. It has constituted an active research topic in the field of human-computer interaction (HCI). Detection of lies, monitoring of call centers, and medical diagnosis are also considered promising application scenarios of speech emotion recognition.

Categorical speech emotion recognition at the utterance level can be formulated as a sequence-to-label problem. The input utterance is divided into a sequence of acoustic segments, and the output is a single label of emotion type. A few previous studies explored the use of segment units for categorical speech emotion recognition and demonstrated that combining segment-level prediction results led to superior performance [1, 2, 3]. These prior works were mostly based on conventional models, such as the support vector machine (SVM) and
K-nearest neighbors (K-NN), for segment classification. In this paper, a convolutional neural network (CNN) is applied to extract emotionally relevant features, i.e., to learn emotion-relevant segment embeddings. Also, most of the existing emotion corpora do not provide ground truth labels at the segment level; instead, they only have labels for whole utterances. A viable solution is to learn local concepts from global annotations, which is the main idea of multiple instance learning (MIL) [4, 5]. MIL has been successfully applied to sound event detection (SED) [6, 7], speech recognition [8], and image analysis [5]. In the MIL problem statement, the training set contains labeled bags that comprise many unlabeled instances, and the task is to predict the labels of unseen bags and instances.

Figure 1: Illustration of the proposed framework for categorical speech emotion recognition.

For categorical speech emotion recognition, each utterance is treated as a bag, and the segments within the utterance as instances. One main feature of this work is the application of deep learning of feature representation in the MIL framework to learn segment embeddings from weak labels at the utterance level. Compared with raw features such as MFCCs, energy, or pitch, the learned segment embeddings are more tied to the task of interest, thus naturally highlighting salient portions of the data, which we conjecture offers an advantage in the final classification.

A key question is how to enable a deep learning model to identify and focus on the most salient parts of a speech utterance when making an utterance-level decision with the learned segment embeddings. In this regard, attention neural network models are investigated. The key idea behind the attention mechanism is to align the input-output sequences such that, in the decoding phase, the major contribution comes from the corresponding encoded information, while the effect of irrelevant information is minimized. In this work, attention modeling is expected to facilitate a structurally meaningful composition of the utterance representation from the learned emotionally relevant segment embeddings. This is the first attempt to combine MIL-based deep learning of segment embeddings with attention modeling for categorical speech emotion recognition.
2. Methodology
Figure 1 illustrates a schematic overview of the proposed method. It comprises a CNN model trained to learn emotionally salient segment embeddings from log-Mel filterbank features of individual segments. The learned segment embeddings are then used as inputs for utterance-level emotion recognition, which is achieved by a dense-layer neural network implemented with various attention mechanisms.

Figure 2: An example of CNN outputs for the audio file "Happy_liuchanhg_382.wav" from the CASIA corpus.
We formulate our segment-based approach as a MIL problem following the instance-space paradigm [4]. Each utterance (bag) is first divided into a sequence of segments (instances). These individual segments are then used to train a CNN model. The learned CNN aims to generate emotionally salient embeddings for each segment.
For the segment-level features, we use 64-bin log Mel filterbanks, which have been extensively evaluated in the existing literature [9, 10, 11]. They are computed by the short-time Fourier transform (STFT) with a window length of 25 ms, a hop length of 10 ms, and an FFT length of 512. Subsequently, 64-bin log Mel filterbanks are derived from each short-time frame, and the frame-level features are combined to form a time-frequency matrix representation of the segment.
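As an illustration, the following minimal Python sketch computes such features with librosa; the file name and the 16 kHz sampling rate are assumptions for demonstration and are not specified in the paper.

import librosa

# Load one utterance (assumed 16 kHz); y is the waveform, sr the sampling rate.
y, sr = librosa.load("utterance.wav", sr=16000)

# 25 ms window, 10 ms hop, 512-point FFT, 64 Mel bins, as described above.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=512,
    win_length=int(0.025 * sr),
    hop_length=int(0.010 * sr),
    n_mels=64)
log_mel = librosa.power_to_db(mel)   # (64, n_frames) time-frequency matrix
print(log_mel.shape)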
Our segment-based method must address how to train a segment-level model without access to a training set of labeled segments. To address this problem, we follow the most straightforward approach, called Single Instance Learning (SIL) [12], i.e., each segment inherits the label of the utterance in which it lies. A CNN is then trained on the resulting dataset. The outputs of the penultimate layer of the trained CNN, which we refer to as segment embeddings in this work, are stored and employed as inputs to the subsequent recognition part. Besides, a softmax layer sits on top of the CNN model and aims to predict a probability distribution P as follows:

P = [p(e_1), p(e_2), ..., p(e_K)]^T    (1)

where K denotes the number of possible emotions.

Figure 2 shows an example of a probability distribution predicted by the trained CNN for the audio file "Happy_liuchanhg_382.wav" from the CASIA corpus. It can be observed that: (1) the probability distribution of each segment changes across the whole utterance; (2) most of the segments convey information that conforms to the utterance in which they lie; and (3) there are segments within one utterance that do not convey any information about the target emotion class or that are more related to other classes, which constitutes confusing information. If we can place additional focus on the more relevant segments, system performance might be improved. In this regard, we have developed three attention-based neural networks, which are described in detail in the following section.

Attention neural networks assume that the bag-level prediction can be constructed as a weighted sum of the instance-level predictions. Herein, three attention-based neural network models are investigated and compared, i.e., decision-level single attention (D-Single-Att.) [13], decision-level multiple attention (D-Multi-Att.) [14], and feature-level attention (Feature-Att.) [15], as shown in Figure 3(a)-(c), respectively. We denote the input segment embeddings within a certain speech utterance as X ∈ R^{T×M}, where T is the number of segments and M represents the dimension of the segment embeddings. The output of the second fully-connected (FC) layer is denoted as h, which has a dimension of H; H is set to 120 in Figure 3.

In the decision-level single attention model (as shown in Figure 3(a)), an attention function is applied to the predictions of the instances to obtain the bag-level prediction:

F(B)_k = Σ_{h∈B} w(h)_k f(h)_k    (2)

where k denotes the k-th emotion class of the instance-level prediction f(h) ∈ [0, 1]^K and of the bag-level prediction F(B) ∈ [0, 1]^K, and w(h)_k ∈ [0, 1] is a weight of f(h)_k that we refer to as a decision-level attention function:

w(h)_k = s(h)_k / Σ_{h∈B} s(h)_k    (3)

where s(.) can be any non-negative function (e.g., a softmax nonlinearity) to ensure that the attention w(.) is normalized. Both the attention function w(.) and the instance-level classifier f(.) depend on a set of learnable parameters.

The decision-level multiple attention model is an extension of the above decision-level single attention model. It consists of several single attention modules (we herein use two attention modules, as shown in Figure 3(b)) applied to intermediate neural network layers. The outputs of these attention modules are concatenated.
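To make Eqs. (2)-(3) concrete, the following NumPy sketch computes the bag-level prediction from given instance-level predictions and attention scores; the toy shapes and random inputs are illustrative assumptions, not values from the paper.

import numpy as np

def decision_level_attention_pooling(f, s):
    # f: (T, K) instance-level predictions f(h), each entry in [0, 1]
    # s: (T, K) non-negative attention scores s(h), e.g., softmax outputs
    w = s / s.sum(axis=0, keepdims=True)   # Eq. (3): normalize scores over the T segments
    return (w * f).sum(axis=0)             # Eq. (2): weighted sum -> F(B) of shape (K,)

# Toy example: T = 5 segments, K = 6 emotion classes
rng = np.random.default_rng(0)
f = rng.random((5, 6))
s = np.exp(rng.standard_normal((5, 6)))    # any non-negative scoring function works
print(decision_level_attention_pooling(f, s))

In the full model, both f(.) and s(.) are realized by dense layers learned jointly with the rest of the network; the decision-level multiple attention variant concatenates the outputs of two such modules attached to different intermediate layers.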
The limitation of the above decision-level attention neural networks is that the attention function w(.) is only applied to the predictions of the instances f(h). To address this constraint, we also investigate the effect of applying the attention function to the outputs of the hidden layers, which we refer to as feature-level attention (as shown in Figure 3(c)), in which the bag-level representation U can be modeled as:

U_d = Σ_{h∈B} v(h)_d q(h)_d    (4)

where d denotes the d-th dimension of the hidden-layer output q(h) ∈ R^D and of the bag-level representation U ∈ R^D, and v(h)_d ∈ [0, 1] is a weight of q(h)_d that we refer to as a feature-level attention function:

v(h)_d = u(h)_d / Σ_{h∈B} u(h)_d    (5)

where u(.) can be any non-negative function (e.g., a sigmoid nonlinearity) to ensure that the attention v(.) is normalized. Both the attention function v(.) and the instance-level feature mapping function q(.) depend on a set of learnable parameters. The prediction of a bag B can then be obtained by classifying the bag-level representation U as follows:

F(B) = g(U)    (6)

where g(.) is the final classifier that corresponds to the last neural network layer.

Figure 3: (a) Decision-level single attention neural network; (b) decision-level multiple attention neural network; (c) feature-level attention neural network. (◦: Hadamard product; Σ: element-wise summation; T: length of input sequence; M: dimension of input bottleneck features; H: dimension of FC layer; K: number of emotion classes; D: dimension of feature-level attention function.)
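Analogously, a minimal NumPy sketch of Eqs. (4)-(6) follows; the dimension D = 128 and the toy linear classifier standing in for g(.) are assumptions made purely for illustration.

import numpy as np

def feature_level_attention_pooling(q, u):
    # q: (T, D) hidden-layer outputs q(h)
    # u: (T, D) non-negative attention scores u(h), e.g., sigmoid outputs
    v = u / u.sum(axis=0, keepdims=True)   # Eq. (5): normalize per feature dimension
    return (v * q).sum(axis=0)             # Eq. (4): bag-level representation U of shape (D,)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy example: T = 5 segments, D = 128 hidden units, K = 6 classes
rng = np.random.default_rng(0)
q = rng.standard_normal((5, 128))
u = 1.0 / (1.0 + np.exp(-rng.standard_normal((5, 128))))  # sigmoid scores
U = feature_level_attention_pooling(q, u)
W, b = rng.standard_normal((6, 128)), np.zeros(6)
print(softmax(W @ U + b))                  # Eq. (6): F(B) = g(U) with a toy linear g(.)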
3. Emotion Corpora
Two different emotion corpora are used to evaluate the validity of the proposed method, namely a Chinese emotional corpus (CASIA) [16] and an English emotional database (IEMOCAP) [17], both of which have been extensively evaluated in the literature.

Specifically, the CASIA corpus [16] contains 9,600 utterances that are simulated by four subjects (two males and two females) in six different emotional states, i.e., angry, fear, happy, neutral, sad, and surprise. In our experiments, we only use the 7,200 utterances corresponding to 300 linguistically neutral sentences whose textual content is identical across the emotion categories. All of the emotion categories are selected.

The IEMOCAP database [17] was collected using motion capture and audio/video recording over five dyadic sessions with 10 subjects. At least three evaluators annotated each utterance in the database with categorical emotion labels chosen from the set: angry, disgusted, excited, fear, frustrated, happy, neutral, sad, surprise, and others. We consider only the utterances with majority agreement (i.e., at least two out of three evaluators assigned the same emotion label) over the emotion classes of angry, happy (combined with the "excited" category), neutral, and sad, which results in 5,531 utterances in total.
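The IEMOCAP selection rule described above can be sketched as follows; the label names, utterance identifier, and helper data structure are illustrative assumptions rather than code from the paper.

from collections import Counter

TARGET = {"angry", "happy", "excited", "neutral", "sad"}

def select(utterances):
    # Keep utterances whose majority label (at least 2 of 3 evaluators agree)
    # falls in the four target classes, merging "excited" into "happy".
    kept = []
    for utt_id, evaluator_labels in utterances:
        label, votes = Counter(evaluator_labels).most_common(1)[0]
        if votes >= 2 and label in TARGET:
            kept.append((utt_id, "happy" if label == "excited" else label))
    return kept

print(select([("Ses01F_impro01_F000", ["excited", "excited", "neutral"])]))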
4. Experiments
In our experiments, the size of each speech segment is set to 32 frames, i.e., the total length of a segment is 10 ms × 32 + (25 − 10) ms = 335 ms, with a shift of 60 ms each time. In this way, we collected approximately 200,000 segments for the CASIA corpus and 300,000 segments for the IEMOCAP database, respectively. Moreover, since the input length for our attention neural networks has to be equal for all samples, we heuristically set the maximal length of each speech utterance to the average duration of the corresponding dataset, i.e., 2.07 s for CASIA and 4.55 s for IEMOCAP, respectively. Longer speech utterances are cut at the maximal length, and shorter ones are padded with zeros.
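The segmentation arithmetic can be sketched as follows; with a 10 ms frame hop, a 60 ms shift corresponds to 6 frames, and the helper function and toy utterance length below are illustrative assumptions.

import numpy as np

def slice_segments(log_mel, seg_frames=32, shift_frames=6):
    # log_mel: (64, n_frames) log-Mel matrix of one utterance (25 ms window, 10 ms hop).
    # A 32-frame segment spans 10 ms * 32 + (25 - 10) ms = 335 ms;
    # a 60 ms shift equals 6 frames at the 10 ms hop.
    starts = range(0, log_mel.shape[1] - seg_frames + 1, shift_frames)
    return np.stack([log_mel[:, s:s + seg_frames] for s in starts])

# Toy utterance of about 2.07 s (~207 frames); under SIL every segment
# inherits the emotion label of this utterance.
segments = slice_segments(np.zeros((64, 207)))
print(segments.shape)   # (n_segments, 64, 32)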
The architecture of the CNN model is similar to the SegCNN model used in our previous work [18]. The only change we made was to the last three FC layers, whose unit counts were chosen such that the penultimate layer matches the dimension of the segment embeddings and the final layer has K units, i.e., the number of possible emotions. In the training stage, for both the CNN and the attention neural networks, the ADAM [19] optimizer with the default settings in TensorFlow [20] was used, together with an exponential learning-rate decay applied every two epochs. Early stopping was utilized to mitigate overfitting.

For the CASIA corpus, we perform leave-one-fold-out ten-fold cross-validation experiments. For the IEMOCAP database, leave-one-session-out five-fold cross-validation is carried out. For both datasets, a second cross-validation is performed since we need to utilize the segment-level results to train the attention neural networks. The results are presented in terms of unweighted accuracy (UA).

In the CNN-MAX-RF baseline, maxout pooling is directly applied to the segment embeddings X ∈ R^{T×M}:

U_m = max_{1≤t≤T} {X_{t,m}}    (7)

A Random Forest (RF) is then used to make the utterance-level prediction based on the resulting utterance representation U. The CNN-AVG-RF baseline is similar to the above CNN-MAX-RF baseline; the only difference is that the maxout pooling in CNN-MAX-RF is replaced by average pooling in CNN-AVG-RF.
In the CNN-MP baseline, maxout pooling is applied to the instance-level predictions f(h) ∈ [0, 1]^K across a given speech utterance to obtain the bag-level prediction:

F(B)_k = max_{h∈B} {f(h)_k}    (8)
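For reference, a minimal NumPy sketch of the two pooling strategies used by the baselines follows, covering Eq. (7) over embeddings and Eq. (8) over predictions; replacing the max with a mean yields the CNN-AVG-RF and CNN-AP variants (Eq. (9) below). The shapes and random inputs are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
T, M, K = 5, 120, 6                 # toy sizes: segments, embedding dim, emotion classes
X = rng.standard_normal((T, M))     # segment embeddings of one utterance
f = rng.random((T, K))              # instance-level predictions of one utterance

U = X.max(axis=0)                   # Eq. (7): maxout pooling -> utterance representation,
                                    # classified by a Random Forest in CNN-MAX-RF
F = f.max(axis=0)                   # Eq. (8): CNN-MP bag-level prediction
print(U.shape, F.argmax())          # predicted emotion class index under CNN-MP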
Similarly, for the CNN-AP baseline, average pooling is applied to the instance-level predictions f(h) ∈ [0, 1]^K across a particular speech utterance to obtain the bag-level prediction:

F(B)_k = mean_{h∈B} {f(h)_k}    (9)

Table 1: Comparison of UAs on the CASIA corpus.

Methods for comparison          UA [%]
ELM-Decision Tree [21]          .
DNN-HMM [22]                    .
LSTM-TF-Att. [23]               .
CNN-MAX-RF (baseline)           .
CNN-AVG-RF (baseline)           .
CNN-MP (baseline)               .
CNN-AP (baseline)               .
CNN-D-Single-Att. (ours)        .
CNN-D-Multi-Att. (ours)         .
CNN-Feature-Att. (ours)         .

Table 2: Comparison of UAs on the IEMOCAP database.

Methods for comparison          UA [%]
CNN-LSTM [10]                   .
DNN-HMM [22]                    .
FCN-Att. [24]                   .
CNN-MAX-RF (baseline)           .
CNN-AVG-RF (baseline)           .
CNN-MP (baseline)               .
CNN-AP (baseline)               .
CNN-D-Single-Att. (ours)        .
CNN-D-Multi-Att. (ours)         .
CNN-Feature-Att. (ours)         .

Tables 1-2 show the experimental results on the two emotion corpora, respectively. The following can be observed: (1) our baseline systems achieved respectable results on both datasets, which demonstrates the effectiveness of the MIL-based framework; (2) the last two baselines (i.e., the CNN-MP baseline and the CNN-AP baseline) consistently outperformed the first two baselines (i.e., the CNN-MAX-RF baseline and the CNN-AVG-RF baseline). This performance gain might derive from the joint optimization of the aggregation strategy of segment-level representations and the utterance-level decision making in the last two baselines; (3) the attention-based methods substantially improved upon the baseline systems overall, which is mainly attributed to the effectiveness of the attention modeling; (4) owing to the positive combination of different attention modules, decision-level multiple attention modeling achieved noticeably better performance than decision-level single attention modeling on both datasets; (5) feature-level attention modeling outperformed the decision-level attention neural networks by a significant margin. This is because the dimension of v(h) (i.e., D) can be any value, while the dimension of w(h) is fixed to the number of emotion classes K; with an increase in the dimension of v(h), the capacity of the feature-level attention neural network increases; and (6) for the CASIA corpus, our feature-level attention-based system achieved the highest recognition accuracy among the compared methods, establishing a new benchmark (to the best of our knowledge). For the IEMOCAP database, which might constitute a more challenging dataset, our methods also achieved competitive results. Figure 4 shows the confusion matrices obtained using feature-level attention modeling on the two datasets, respectively.

Figure 4: Confusion matrices obtained using feature-level attention modeling for: (a) the CASIA corpus and (b) the IEMOCAP database.
5. Conclusion
In this paper, we proposed to combine multiple instance learning with attention neural networks for better modeling of categorical speech emotion recognition. Three attention-based neural network models were investigated and compared. Experimental results on two well-known emotion corpora showed competitive outcomes. Since we herein blindly used all segments to train the segment-level classifier, it is anticipated that better results can be obtained with a proper segment selection strategy. More advanced neural network architectures and better algorithm optimization will also be investigated in the near future.

6. References

[1] J. H. Jeon, R. Xia, and Y. Liu, "Sentence level emotion recognition based on decisions from subsentence segments," in Proc. ICASSP, 2011, pp. 4940–4943.
[2] M. T. Shami and M. S. Kamel, "Segment-based approach to the recognition of emotions in speech," in Proc. ICME, 2005, p. 4.
[3] B. Schuller and G. Rigoll, "Timing levels in segment-based speech emotion recognition," in Proc. INTERSPEECH, 2006, pp. 1818–1821.
[4] J. Amores, "Multiple instance classification: Review, taxonomy and comparative study," Artificial Intelligence, vol. 201, pp. 81–105, 2013.
[5] Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai, I. Eric, and C. Chang, "Deep learning of feature representation with multiple instance learning for medical image analysis," in Proc. ICASSP, 2014, pp. 1626–1630.
[6] T.-W. Su, J.-Y. Liu, and Y.-H. Yang, "Weakly-supervised audio event detection using event-specific gaussian filters and fully convolutional networks," in Proc. ICASSP, 2017, pp. 791–795.
[7] A. Kumar and B. Raj, "Audio event detection using weakly labeled data," in Proc. ACM International Conference on Multimedia, 2016, pp. 1038–1047.
[8] Y. Wang, J. Li, and F. Metze, "Comparing the max and noisy-or pooling functions in multiple instance learning for weakly supervised sequence learning tasks," in Proc. INTERSPEECH, 2018, pp. 1339–1343.
[9] Z. Huang, M. Dong, Q. Mao, and Y. Zhan, "Speech emotion recognition using CNN," in Proc. ACM International Conference on Multimedia, 2014, pp. 801–804.
[10] A. Satt, S. Rozenberg, and R. Hoory, "Efficient emotion recognition from speech using deep learning on spectrograms," in Proc. INTERSPEECH, 2017, pp. 1089–1093.
[11] L. Zhang, L. Wang, J. Dang, L. Guo, and H. Guan, "Convolutional neural network with spectrogram and perceptual features for speech emotion recognition," in Proc. ICONIP, 2018, pp. 62–71.
[12] R. C. Bunescu and R. J. Mooney, "Multiple instance learning for sparse positive bags," in Proc. ICML, 2007, pp. 105–112.
[13] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "Audio set classification with attention model: A probabilistic perspective," in Proc. ICASSP, 2018, pp. 316–320.
[14] C. Yu, K. S. Barsim, Q. Kong, and B. Yang, "Multi-level attention model for weakly supervised audio classification," arXiv preprint arXiv:1803.02353, 2018.
[15] Q. Kong, C. Yu, Y. Xu, T. Iqbal, W. Wang, and M. D. Plumbley, "Weakly labelled audioset tagging with attention neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, pp. 1791–1802, 2019.
[16] J. Tao, F. Liu, M. Zhang, and H. Jia, "Design of speech corpus for mandarin text to speech," in Proc. the 4th Workshop on Blizzard Challenge, 2005.
[17] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
[18] S. Mao, P. C. Ching, and T. Lee, "Deep learning of segment-level feature representation with multiple instance learning for utterance-level speech emotion recognition," in Proc. INTERSPEECH, 2019, pp. 1686–1690.
[19] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[20] M. Abadi et al., "Tensorflow: A system for large-scale machine learning," in Proc. OSDI, 2016, pp. 265–283.
[21] Z.-T. Liu, M. Wu, W.-H. Cao, J.-W. Mao, J.-P. Xu, and G.-Z. Tan, "Speech emotion recognition based on feature selection and extreme learning machine decision tree," Neurocomputing, vol. 273, pp. 271–280, 2018.
[22] S. Mao, D. Tao, G. Zhang, P. C. Ching, and T. Lee, "Revisiting hidden markov models for speech emotion recognition," in Proc. ICASSP, 2019, pp. 6715–6719.
[23] Y. Xie, R. Liang, Z. Liang, C. Huang, C. Zou, and B. Schuller, "Speech emotion classification using attention-based LSTM," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, pp. 1675–1685, 2019.
[24] Y. Zhang, J. Du, Z. Wang, J. Zhang, and Y. Tu, "Attention based fully convolutional network for speech emotion recognition," in