Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition
Shuiyang Mao, P. C. Ching, C.-C. Jay Kuo and Tan Lee
Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong
Ming Hsieh Department of Electrical Engineering, University of Southern California, USA
[email protected], {pcching, tanlee}@ee.cuhk.edu.hk, [email protected]

Abstract
Categorical speech emotion recognition is typically performed as a sequence-to-label problem, i.e., to determine the discrete emotion label of the input utterance as a whole. One of the main challenges in practice is that most of the existing emotion corpora do not give ground truth labels for each segment; instead, we only have labels for whole utterances. To extract segment-level emotional information from such weakly labeled emotion corpora, we propose using multiple instance learning (MIL) to learn segment embeddings in a weakly supervised manner. Also, for a sufficiently long utterance, not all of the segments contain relevant emotional information. In this regard, three attention-based neural network models are then applied to the learned segment embeddings to attend to the most salient parts of a speech utterance. Experiments on the CASIA corpus and the IEMOCAP database show better or highly competitive results compared with other state-of-the-art approaches.
Index Terms: categorical speech emotion recognition, weak labeling, multiple instance learning, attention modeling
1. Introduction
Automatic speech emotion recognition (ASER) aims to decode emotional content from audio signals. It has constituted an active research topic in the field of human-computer interaction (HCI). Detection of lies, monitoring of call centers, and medical diagnosis are also considered promising application scenarios of speech emotion recognition.

Categorical speech emotion recognition at the utterance level can be formulated as a sequence-to-label problem. The input utterance is divided into a sequence of acoustic segments, and the output is a single label of emotion type. A few previous studies explored the use of segment units for categorical speech emotion recognition and demonstrated that combining segment-level prediction results led to superior performance [1, 2, 3]. These prior works were mostly based on conventional models, such as the support vector machine (SVM) and
K-nearest neighbors (K-NN), for segment classification. In this paper, a convolutional neural network (CNN) is applied to extract emotionally relevant features, i.e., to learn emotion-relevant segment embeddings. Also, most of the existing emotion corpora do not provide ground truth labels at the segment level; instead, they only have labels for whole utterances. A viable solution is to learn local concepts from global annotations, which is the main idea of multiple instance learning (MIL) [4, 5]. MIL has been successfully applied to sound event detection (SED) [6, 7], speech recognition [8], and image analysis [5]. In the MIL problem statement, the training set contains labeled bags that comprise many unlabeled instances, and the task is to predict the labels of unseen bags and instances.

Figure 1: Illustration of the proposed framework for categorical speech emotion recognition.

For categorical speech emotion recognition, each utterance is treated as a bag, and the segments within the utterance as instances. One main feature of this work is the application of deep learning of feature representation in the MIL framework to learn segment embeddings from weak labels at the utterance level. Compared with raw features such as MFCCs, energy, or pitch, the learned segment embeddings are more tied to the task of interest, thus naturally highlighting salient portions of the data, which we conjecture offers an advantage in the final classification.

A key question is how to enable a deep learning model to identify and focus on the most salient parts of a speech utterance when making an utterance-level decision with the learned segment embeddings. In this regard, attention neural network models are investigated. The key idea behind the attention mechanism is to align the input-output sequences such that, in the decoding phase, the major contribution comes from the corresponding encoded information, while the effect of irrelevant information is minimized. In this work, attention modeling is expected to facilitate a structurally meaningful composition of the utterance representation from the learned emotionally relevant segment embeddings. This is the first attempt to combine MIL-based deep learning of segment embeddings with attention modeling for categorical speech emotion recognition.
2. Methodology
Figure 1 illustrates a schematic overview of the proposed method. It comprises a CNN model trained to learn emotionally salient segment embeddings from log-Mel filterbank features of individual segments. The learned segment embeddings are then used as inputs for utterance-level emotion recognition, which is achieved by a dense-layer neural network implemented with various attention mechanisms.

Figure 2: An example of CNN outputs for the audio file "Happy_liuchanhg_382.wav" from the CASIA corpus.
We formulate our segment-based approach as a MIL problem following the instance-space paradigm [4]. Each utterance (bag) is first divided into a sequence of segments (instances). These individual segments are then used to train a CNN model. The learned CNN aims to generate emotionally salient embeddings for each segment.
For the segment-level features, we use 64-bin log Mel filterbanks, which have been extensively evaluated in the existing literature [9, 10, 11]. They are computed by the short-time Fourier transform (STFT) with a window length of 25 ms, a hop length of 10 ms, and an FFT length of 512. Subsequently, 64-bin log Mel filterbanks are derived from each short-time frame, and the frame-level features are combined to form a time-frequency matrix representation of the segment.
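As an illustration, the following minimal Python sketch computes such features with librosa; the file name and the 16 kHz sampling rate are assumptions for demonstration and are not specified in the paper.

import librosa

# Load one utterance (assumed 16 kHz); y is the waveform, sr the sampling rate.
y, sr = librosa.load("utterance.wav", sr=16000)

# 25 ms window, 10 ms hop, 512-point FFT, 64 Mel bins, as described above.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=512,
    win_length=int(0.025 * sr),
    hop_length=int(0.010 * sr),
    n_mels=64)
log_mel = librosa.power_to_db(mel)   # (64, n_frames) time-frequency matrix
print(log_mel.shape)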
Our segment-based method must address how to train a segment-level model without access to a training set of labeled segments. To address this problem, we follow the most straightforward approach, called Single Instance Learning (SIL) [12], i.e., each segment inherits the label of the utterance in which it lies. A CNN is then trained on the resulting dataset. The outputs of the penultimate layer of the trained CNN, which we refer to as segment embeddings in this work, are stored and employed as inputs to the subsequent recognition part. Besides, a softmax layer sits on top of the CNN model and aims to predict a probability distribution P as follows:

P = [p(e_1), p(e_2), ..., p(e_K)]^T    (1)

where K denotes the number of possible emotions.

Figure 2 shows an example of a probability distribution predicted by the trained CNN for the audio file "Happy_liuchanhg_382.wav" from the CASIA corpus. It can be observed that: (1) the probability distribution of each segment changes across the whole utterance; (2) most of the segments convey information that conforms to the utterance in which they lie; and (3) there are segments within one utterance that do not convey any information about the target emotion class or that are more related to other classes, which constitutes confusing information. If we can place additional focus on the more relevant segments, system performance might be improved. In this regard, we have developed three attention-based neural networks, which are described in detail in the following section.

Attention neural networks assume that the bag-level prediction can be constructed as a weighted sum of the instance-level predictions. Herein, three attention-based neural network models are investigated and compared, i.e., decision-level single attention (D-Single-Att.) [13], decision-level multiple attention (D-Multi-Att.) [14], and feature-level attention (Feature-Att.) [15], as shown in Figure 3(a)-(c), respectively. We denote the input segment embeddings within a certain speech utterance as X ∈ R^{T×M}, where T is the number of segments and M represents the dimension of the segment embeddings. The output of the second fully-connected (FC) layer is denoted as h, which has a dimension of H; H is set to 120 in Figure 3.

In the decision-level single attention model (as shown in Figure 3(a)), an attention function is applied to the predictions of the instances to obtain the bag-level prediction:

F(B)_k = Σ_{h∈B} w(h)_k f(h)_k    (2)

where k denotes the k-th emotion class of the instance-level prediction f(h) ∈ [0, 1]^K and of the bag-level prediction F(B) ∈ [0, 1]^K, and w(h)_k ∈ [0, 1] is a weight of f(h)_k that we refer to as a decision-level attention function:

w(h)_k = s(h)_k / Σ_{h∈B} s(h)_k    (3)

where s(.) can be any non-negative function (e.g., a softmax nonlinearity) to ensure that the attention w(.) is normalized. Both the attention function w(.) and the instance-level classifier f(.) depend on a set of learnable parameters.

The decision-level multiple attention model is an extension of the above decision-level single attention model. It consists of several single attention modules (we herein use two attention modules, as shown in Figure 3(b)) applied to intermediate neural network layers. The outputs of these attention modules are concatenated.
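To make Eqs. (2)-(3) concrete, the following NumPy sketch computes the bag-level prediction from given instance-level predictions and attention scores; the toy shapes and random inputs are illustrative assumptions, not values from the paper.

import numpy as np

def decision_level_attention_pooling(f, s):
    # f: (T, K) instance-level predictions f(h), each entry in [0, 1]
    # s: (T, K) non-negative attention scores s(h), e.g., softmax outputs
    w = s / s.sum(axis=0, keepdims=True)   # Eq. (3): normalize scores over the T segments
    return (w * f).sum(axis=0)             # Eq. (2): weighted sum -> F(B) of shape (K,)

# Toy example: T = 5 segments, K = 6 emotion classes
rng = np.random.default_rng(0)
f = rng.random((5, 6))
s = np.exp(rng.standard_normal((5, 6)))    # any non-negative scoring function works
print(decision_level_attention_pooling(f, s))

In the full model, both f(.) and s(.) are realized by dense layers learned jointly with the rest of the network; the decision-level multiple attention variant concatenates the outputs of two such modules attached to different intermediate layers.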
The limitation of the above decision-level attention neural networks is that the attention function w(.) is only applied to the predictions of the instances f(h). To address this constraint, we also investigate the effect of applying the attention function to the outputs of the hidden layers, which we refer to as feature-level attention (as shown in Figure 3(c)), in which the bag-level representation U can be modeled as:

U_d = Σ_{h∈B} v(h)_d q(h)_d    (4)

where d denotes the d-th dimension of the hidden-layer output q(h) ∈ R^D and of the bag-level representation U ∈ R^D, and v(h)_d ∈ [0, 1] is a weight of q(h)_d that we refer to as a feature-level attention function:

v(h)_d = u(h)_d / Σ_{h∈B} u(h)_d    (5)

where u(.) can be any non-negative function (e.g., a sigmoid nonlinearity) to ensure that the attention v(.) is normalized. Both the attention function v(.) and the instance-level feature mapping function q(.) depend on a set of learnable parameters. The prediction of a bag B can then be obtained by classifying the bag-level representation U as follows:

F(B) = g(U)    (6)

where g(.) is the final classifier that corresponds to the last neural network layer.

Figure 3: (a) Decision-level single attention neural network; (b) decision-level multiple attention neural network; (c) feature-level attention neural network. (◦: Hadamard product; Σ: element-wise summation; T: length of input sequence; M: dimension of input bottleneck features; H: dimension of FC layer; K: number of emotion classes; D: dimension of feature-level attention function.)
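Analogously, a minimal NumPy sketch of Eqs. (4)-(6) follows; the dimension D = 128 and the toy linear classifier standing in for g(.) are assumptions made purely for illustration.

import numpy as np

def feature_level_attention_pooling(q, u):
    # q: (T, D) hidden-layer outputs q(h)
    # u: (T, D) non-negative attention scores u(h), e.g., sigmoid outputs
    v = u / u.sum(axis=0, keepdims=True)   # Eq. (5): normalize per feature dimension
    return (v * q).sum(axis=0)             # Eq. (4): bag-level representation U of shape (D,)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy example: T = 5 segments, D = 128 hidden units, K = 6 classes
rng = np.random.default_rng(0)
q = rng.standard_normal((5, 128))
u = 1.0 / (1.0 + np.exp(-rng.standard_normal((5, 128))))  # sigmoid scores
U = feature_level_attention_pooling(q, u)
W, b = rng.standard_normal((6, 128)), np.zeros(6)
print(softmax(W @ U + b))                  # Eq. (6): F(B) = g(U) with a toy linear g(.)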
3. Emotion Corpora
Two different emotion corpora are used to evaluate the validity of the proposed method, namely a Chinese emotional corpus (CASIA) [16] and an English emotional database (IEMOCAP) [17], both of which have been extensively evaluated in the literature.

Specifically, the CASIA corpus [16] contains 9,600 utterances that are simulated by four subjects (two males and two females) in six different emotional states, i.e., angry, fear, happy, neutral, sad, and surprise. In our experiments, we only use the 7,200 utterances corresponding to 300 linguistically neutral sentences whose textual content is identical across the emotion categories. All of the emotion categories are selected.

The IEMOCAP database [17] was collected using motion capture and audio/video recording over five dyadic sessions with 10 subjects. At least three evaluators annotated each utterance in the database with categorical emotion labels chosen from the set: angry, disgusted, excited, fear, frustrated, happy, neutral, sad, surprise, and others. We consider only the utterances with majority agreement (i.e., at least two out of three evaluators assigned the same emotion label) over the emotion classes of angry, happy (combined with the "excited" category), neutral, and sad, which results in 5,531 utterances in total.
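The IEMOCAP selection rule described above can be sketched as follows; the label names, utterance identifier, and helper data structure are illustrative assumptions rather than code from the paper.

from collections import Counter

TARGET = {"angry", "happy", "excited", "neutral", "sad"}

def select(utterances):
    # Keep utterances whose majority label (at least 2 of 3 evaluators agree)
    # falls in the four target classes, merging "excited" into "happy".
    kept = []
    for utt_id, evaluator_labels in utterances:
        label, votes = Counter(evaluator_labels).most_common(1)[0]
        if votes >= 2 and label in TARGET:
            kept.append((utt_id, "happy" if label == "excited" else label))
    return kept

print(select([("Ses01F_impro01_F000", ["excited", "excited", "neutral"])]))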
4. Experiments
In our experiments, the size of each speech segment is set to 32 frames, i.e., the total length of a segment is 10 ms × 32 + (25 − 10) ms = 335 ms, with a shift of 60 ms each time. In this way, we collected approximately 200,000 segments for the CASIA corpus and 300,000 segments for the IEMOCAP database, respectively. Moreover, since the input length for our attention neural networks has to be equal for all samples, we heuristically set the maximal length of each speech utterance to the average duration of the corresponding dataset, i.e., 2.07 s for CASIA and 4.55 s for IEMOCAP, respectively. Longer speech utterances are cut at the maximal length, and shorter ones are padded with zeros.
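The segmentation arithmetic can be sketched as follows; with a 10 ms frame hop, a 60 ms shift corresponds to 6 frames, and the helper function and toy utterance length below are illustrative assumptions.

import numpy as np

def slice_segments(log_mel, seg_frames=32, shift_frames=6):
    # log_mel: (64, n_frames) log-Mel matrix of one utterance (25 ms window, 10 ms hop).
    # A 32-frame segment spans 10 ms * 32 + (25 - 10) ms = 335 ms;
    # a 60 ms shift equals 6 frames at the 10 ms hop.
    starts = range(0, log_mel.shape[1] - seg_frames + 1, shift_frames)
    return np.stack([log_mel[:, s:s + seg_frames] for s in starts])

# Toy utterance of about 2.07 s (~207 frames); under SIL every segment
# inherits the emotion label of this utterance.
segments = slice_segments(np.zeros((64, 207)))
print(segments.shape)   # (n_segments, 64, 32)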
The architecture of the CNN model is similar to the SegCNN model used in our previous work [18]. The only change we made was to the last three FC layers, whose unit counts were chosen such that the penultimate layer matches the dimension of the segment embeddings and the final layer has K units, i.e., the number of possible emotions. In the training stage, for both the CNN and the attention neural networks, the ADAM [19] optimizer with the default settings in TensorFlow [20] was used, together with an exponential learning-rate decay applied every two epochs. Early stopping was utilized to mitigate overfitting.

For the CASIA corpus, we perform leave-one-fold-out ten-fold cross-validation experiments. For the IEMOCAP database, leave-one-session-out five-fold cross-validation is carried out. For both datasets, a second cross-validation is performed since we need to utilize the segment-level results to train the attention neural networks. The results are presented in terms of unweighted accuracy (UA).

In the CNN-MAX-RF baseline, maxout pooling is directly applied to the segment embeddings X ∈ R^{T×M}:

U_m = max_{1≤t≤T} {X_{t,m}}    (7)

A Random Forest (RF) is then used to make the utterance-level prediction based on the resulting utterance representation U. The CNN-AVG-RF baseline is similar to the above CNN-MAX-RF baseline; the only difference is that the maxout pooling in CNN-MAX-RF is replaced by average pooling in CNN-AVG-RF.
In the CNN-MP baseline, maxout pooling is applied to the instance-level predictions f(h) ∈ [0, 1]^K across a given speech utterance to obtain the bag-level prediction:

F(B)_k = max_{h∈B} {f(h)_k}    (8)
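For reference, a minimal NumPy sketch of the two pooling strategies used by the baselines follows, covering Eq. (7) over embeddings and Eq. (8) over predictions; replacing the max with a mean yields the CNN-AVG-RF and CNN-AP variants (Eq. (9) below). The shapes and random inputs are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
T, M, K = 5, 120, 6                 # toy sizes: segments, embedding dim, emotion classes
X = rng.standard_normal((T, M))     # segment embeddings of one utterance
f = rng.random((T, K))              # instance-level predictions of one utterance

U = X.max(axis=0)                   # Eq. (7): maxout pooling -> utterance representation,
                                    # classified by a Random Forest in CNN-MAX-RF
F = f.max(axis=0)                   # Eq. (8): CNN-MP bag-level prediction
print(U.shape, F.argmax())          # predicted emotion class index under CNN-MP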
Similarly, for the CNN-AP baseline, average pooling is applied to the instance-level predictions f(h) ∈ [0, 1]^K across a particular speech utterance to obtain the bag-level prediction:

F(B)_k = mean_{h∈B} {f(h)_k}    (9)

Table 1: Comparison of UAs on the CASIA corpus.

Methods for comparison          UA [%]
ELM-Decision Tree [21]          .
DNN-HMM [22]                    .
LSTM-TF-Att. [23]               .
CNN-MAX-RF (baseline)           .
CNN-AVG-RF (baseline)           .
CNN-MP (baseline)               .
CNN-AP (baseline)               .
CNN-D-Single-Att. (ours)        .
CNN-D-Multi-Att. (ours)         .
CNN-Feature-Att. (ours)         .

Table 2: Comparison of UAs on the IEMOCAP database.

Methods for comparison          UA [%]
CNN-LSTM [10]                   .
DNN-HMM [22]                    .
FCN-Att. [24]                   .
CNN-MAX-RF (baseline)           .
CNN-AVG-RF (baseline)           .
CNN-MP (baseline)               .
CNN-AP (baseline)               .
CNN-D-Single-Att. (ours)        .
CNN-D-Multi-Att. (ours)         .
CNN-Feature-Att. (ours)         .

Tables 1-2 show the experimental results on the two emotion corpora, respectively. The following can be observed: (1) our baseline systems achieved respectable results on both datasets, which demonstrates the effectiveness of the MIL-based framework; (2) the last two baselines (i.e., the CNN-MP baseline and the CNN-AP baseline) consistently outperformed the first two baselines (i.e., the CNN-MAX-RF baseline and the CNN-AVG-RF baseline). This performance gain might derive from the joint optimization of the aggregation strategy of segment-level representations and the utterance-level decision making in the last two baselines; (3) the attention-based methods substantially improved upon the baseline systems overall, which is mainly attributed to the effectiveness of the attention modeling; (4) owing to the positive combination of different attention modules, decision-level multiple attention modeling achieved noticeably better performance than decision-level single attention modeling on both datasets; (5) feature-level attention modeling outperformed the decision-level attention neural networks by a significant margin. This is because the dimension of v(h) (i.e., D) can be any value, while the dimension of w(h) is fixed to the number of emotion classes K; with an increase in the dimension of v(h), the capacity of the feature-level attention neural network increases; and (6) for the CASIA corpus, our feature-level attention-based system achieved the highest recognition accuracy among the compared methods, establishing a new benchmark (to the best of our knowledge). For the IEMOCAP database, which might constitute a more challenging dataset, our methods also achieved competitive results. Figure 4 shows the confusion matrices obtained using feature-level attention modeling on the two datasets, respectively.

Figure 4: Confusion matrices obtained using feature-level attention modeling for: (a) the CASIA corpus and (b) the IEMOCAP database.
5. Conclusion
In this paper, we proposed to combine multiple instance learning with attention neural networks for better modeling of categorical speech emotion recognition. Three attention-based neural network models were investigated and compared. Experimental results on two well-known emotion corpora showed competitive outcomes. Since we herein blindly used all segments to train the segment-level classifier, it is anticipated that better results can be obtained with a proper segment selection strategy. More advanced neural network architectures and better algorithm optimization will also be investigated in the near future.

6. References

[1] J. H. Jeon, R. Xia, and Y. Liu, "Sentence level emotion recognition based on decisions from subsentence segments," in Proc. ICASSP, 2011, pp. 4940–4943.
[2] M. T. Shami and M. S. Kamel, "Segment-based approach to the recognition of emotions in speech," in Proc. ICME, 2005, p. 4.
[3] B. Schuller and G. Rigoll, "Timing levels in segment-based speech emotion recognition," in Proc. INTERSPEECH, 2006, pp. 1818–1821.
[4] J. Amores, "Multiple instance classification: Review, taxonomy and comparative study," Artificial Intelligence, vol. 201, pp. 81–105, 2013.
[5] Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai, I. Eric, and C. Chang, "Deep learning of feature representation with multiple instance learning for medical image analysis," in Proc. ICASSP, 2014, pp. 1626–1630.
[6] T.-W. Su, J.-Y. Liu, and Y.-H. Yang, "Weakly-supervised audio event detection using event-specific gaussian filters and fully convolutional networks," in Proc. ICASSP, 2017, pp. 791–795.
[7] A. Kumar and B. Raj, "Audio event detection using weakly labeled data," in Proc. ACM International Conference on Multimedia, 2016, pp. 1038–1047.
[8] Y. Wang, J. Li, and F. Metze, "Comparing the max and noisy-or pooling functions in multiple instance learning for weakly supervised sequence learning tasks," in Proc. INTERSPEECH, 2018, pp. 1339–1343.
[9] Z. Huang, M. Dong, Q. Mao, and Y. Zhan, "Speech emotion recognition using CNN," in Proc. ACM International Conference on Multimedia, 2014, pp. 801–804.
[10] A. Satt, S. Rozenberg, and R. Hoory, "Efficient emotion recognition from speech using deep learning on spectrograms," in Proc. INTERSPEECH, 2017, pp. 1089–1093.
[11] L. Zhang, L. Wang, J. Dang, L. Guo, and H. Guan, "Convolutional neural network with spectrogram and perceptual features for speech emotion recognition," in Proc. ICONIP, 2018, pp. 62–71.
[12] R. C. Bunescu and R. J. Mooney, "Multiple instance learning for sparse positive bags," in Proc. ICML, 2007, pp. 105–112.
[13] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "Audio set classification with attention model: A probabilistic perspective," in Proc. ICASSP, 2018, pp. 316–320.
[14] C. Yu, K. S. Barsim, Q. Kong, and B. Yang, "Multi-level attention model for weakly supervised audio classification," arXiv preprint arXiv:1803.02353, 2018.
[15] Q. Kong, C. Yu, Y. Xu, T. Iqbal, W. Wang, and M. D. Plumbley, "Weakly labelled audioset tagging with attention neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, pp. 1791–1802, 2019.
[16] J. Tao, F. Liu, M. Zhang, and H. Jia, "Design of speech corpus for mandarin text to speech," in Proc. the 4th Workshop on Blizzard Challenge, 2005.
[17] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
[18] S. Mao, P. C. Ching, and T. Lee, "Deep learning of segment-level feature representation with multiple instance learning for utterance-level speech emotion recognition," in Proc. INTERSPEECH, 2019, pp. 1686–1690.
[19] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[20] M. Abadi et al., "Tensorflow: A system for large-scale machine learning," in Proc. OSDI, 2016, pp. 265–283.
[21] Z.-T. Liu, M. Wu, W.-H. Cao, J.-W. Mao, J.-P. Xu, and G.-Z. Tan, "Speech emotion recognition based on feature selection and extreme learning machine decision tree," Neurocomputing, vol. 273, pp. 271–280, 2018.
[22] S. Mao, D. Tao, G. Zhang, P. C. Ching, and T. Lee, "Revisiting hidden markov models for speech emotion recognition," in Proc. ICASSP, 2019, pp. 6715–6719.
[23] Y. Xie, R. Liang, Z. Liang, C. Huang, C. Zou, and B. Schuller, "Speech emotion classification using attention-based LSTM," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, pp. 1675–1685, 2019.
[24] Y. Zhang, J. Du, Z. Wang, J. Zhang, and Y. Tu, "Attention based fully convolutional network for speech emotion recognition," in