Contrastive Unsupervised Learning for Speech Emotion Recognition
Mao Li*, Bo Yang†, Joshua Levy†, Andreas Stolcke†, Viktor Rozgic†, Spyros Matsoukas†, Constantinos Papayiannis†, Daniel Bone†, Chao Wang†

* Department of Computer Science, University of Illinois at Chicago
† Amazon Alexa

* [email protected], † {amzbyang, levyjos, stolcke, rozgicv, matsouka, papayiac, danibone, wngcha}@amazon.com

ABSTRACT
Speech emotion recognition (SER) is a key technology to enable more natural human-machine communication. However, SER has long suffered from a lack of public large-scale labeled datasets. To circumvent this problem, we investigate how unsupervised representation learning on unlabeled datasets can benefit SER. We show that the contrastive predictive coding (CPC) method can learn salient representations from unlabeled datasets, which improves emotion recognition performance. In our experiments, this method achieved state-of-the-art concordance correlation coefficient (CCC) performance for all emotion primitives (activation, valence, and dominance) on IEMOCAP. Additionally, on the MSP-Podcast dataset, our method obtained considerable performance improvements compared to baselines.
Index Terms — Speech emotion recognition, Contrastive predictive coding, Unsupervised pre-training.
1. INTRODUCTION
Speech emotion recognition (SER) aims at discerning the emotional state of a speaker, thus enabling more human-like interactions between humans and machines. An agent can understand a human's command better if it is able to interpret the emotional state of the speaker as well. Moreover, a digital assistant can prove to be a human-like companion when equipped with the capability of recognizing emotions. These applications provide key motivations underpinning the fast-growing research interest in this area [1, 2].

Despite the substantial interest from both academia and industry, SER has not found many real-world applications. One possible reason is the unsatisfactory performance of existing systems. The difficulty is caused by, and contributes to, the relatively small public datasets [3, 4] in this domain. The lack of large-scale emotion-annotated data hinders the application of deep learning methods, from which many other speech-related tasks (e.g., automatic speech recognition [5]) have benefited greatly.

In order to circumvent the data sparsity issue of SER, we investigate the use of unsupervised pre-training. Unsupervised pre-training techniques have received increased attention over the last few years. The research interest in this direction is well-motivated: while deep-learning (DL) based methods achieve state-of-the-art results across multiple domains, these methods tend to be data-intensive. Training a large and deep neural network usually requires very large labeled datasets. The cost of data labeling has thus become a major obstacle for applying DL techniques to real-world applications, and SER is no exception. Motivated by recent developments in unsupervised representation learning, we leverage an unsupervised pre-training approach for SER.

The proposed method shows great performance improvements on two widely used public benchmarks. The improvements on recognizing valence (positivity/negativity of the tone of voice) are particularly encouraging, as valence is known to be very hard to predict from speech data alone; see, e.g., [6, 7]. Furthermore, our analysis shows that, even without explicit supervision in training, emotion clusters emerge in the embedding space of the pre-trained model, confirming the suitability of unsupervised pre-training for SER.
2. RELATED WORK
Recent studies on unsupervised representation learning have achieved great success in natural language processing [8, 9] and computer vision [10, 11]. While leveraging unsupervised learning for SER has been investigated relatively little, previous attempts using autoencoders have been successful [12, 13]. More recently, it has been shown that learning to predict future information in a time series is a useful pre-training mechanism [14].

Unsupervised methods based on contrastive learning have recently established strong baselines in many domains. For instance, contrastive predictive coding (CPC) [11] is able to extract useful representations from sequential data and achieves competitive performance on various tasks, including phone and speaker classification in speech. Our work relies on the use of a CPC network for learning acoustic representations from large unlabeled speech datasets.
3. BACKGROUND
The primary goal of this study is to learn, without supervision, representations that encode emotional attributes shared across frames of speech audio. We start by reviewing relevant concepts in emotion representation; then we give a brief review of the contrastive predictive coding (CPC) method.
3.1. Emotion representation

In general, there are two widely used approaches to represent emotion: by emotion categories (happiness, sadness, anger, etc.) or by dimensional emotion metrics (a.k.a. emotion primitives) [3, 4, 15]. Albeit intuitive, the category-based representation may miss subtleties of emotion "strength", e.g., annoyance versus rage. The dimensional emotion metrics typically include activation (a.k.a. arousal; very calm versus very active), valence (level of positivity or negativity), and dominance (very weak versus very strong). In this work, we mainly focus on predicting dimensional emotion metrics from speech. Since emotion representation is an active research topic, we refer interested readers to [15, 16].
3.2. Contrastive predictive coding

As the name suggests, CPC falls into the contrastive learning paradigm: positive and negative examples are constructed, and the loss function encourages separation of positive from negative examples. We give a detailed description of CPC below.

For an audio sequence $X = (x_1, x_2, ..., x_n)$, CPC uses a nonlinear encoder $f$ to project each observation $x_t \in \mathbb{R}^{D_x}$ to its latent representation $z_t = f(x_t)$, where $z_t \in \mathbb{R}^{D_z}$. Then an autoregressive model $g$ is adopted to aggregate the consecutive latent representations from the past into a contextual representation $c_t = g(z_{\le t})$, where $c_t \in \mathbb{R}^{D_c}$.

Since $c_t$ summarizes the past, it should be able to infer the latent representation $z_{t+k}$ of a future observation $x_{t+k}$, for a small $k$. For this purpose, a prediction function $h_k$ for a specific $k$ takes the context representation as input to predict the future representation:

$$\hat{z}_{t+k} = h_k(c_t) = h_k(g(z_{\le t})). \quad (1)$$

To form a contrastive learning problem, some negative samples (i.e., other observations $x$) are drawn, either from the same sequence or from other sequences, and their latent representations $z$ are computed. Assuming $N-1$ negatives are randomly sampled for each context representation, the positive and the negatives form a set of $N$ samples that contains only one positive and $N-1$ negatives. To guide feature learning, the CPC method proposes to discriminate the positive from the negatives, which boils down to an $N$-way classification problem. CPC uses the infoNCE loss function: for an audio segment and a time step $t$, the infoNCE loss is defined as

$$\mathcal{L} = -\sum_{m=1}^{k} \left[ \log \frac{\exp(\hat{z}_{t+m}^\top z_{t+m} / \tau)}{\exp(\hat{z}_{t+m}^\top z_{t+m} / \tau) + \sum_{i=1}^{N-1} \exp(\hat{z}_{t+m}^\top z_i / \tau)} \right], \quad (2)$$

where $\tau$ is a scaling factor (a.k.a. temperature) that controls the concentration level of the feature distribution, and $k$ is the upper bound on time extrapolation. Notice that the summation over $i$ assumes that the randomly drawn negative samples are indexed as $\{1, ..., N-1\}$, and that these are different for each $z_{t+m}$. In addition, the loss function considers all future time extrapolations up to $k$. Clearly, the loss (2) is additive across different audio segments and time steps; hence, in training, the loss (2) is usually computed for batches of audio segments and all possible time steps in those segments, to utilize the mini-batch-based Adam [17] optimizer.

Optimizing (2) results in a larger inner product between a latent representation and its predicted counterpart than between any of the negatives (mismatched latent representations and predictions). Theoretical justification for the optimization objective (2) can be found in [11] and [18].
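To make the loss in (2) concrete, the following is a minimal PyTorch sketch of an infoNCE computation. This is our own illustration under assumed tensor shapes; the function and variable names (info_nce_loss, z_pred, z_pos, z_neg) are ours, not from [11] or from the implementation described in this paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_pred, z_pos, z_neg, tau=0.1):
    """infoNCE loss of Eq. (2), cast as an N-way classification problem.

    z_pred: (B, k, D)       predictions hat{z}_{t+m}, m = 1..k
    z_pos:  (B, k, D)       true future latents z_{t+m}
    z_neg:  (B, k, N-1, D)  latents of the N-1 negative samples
    """
    # Temperature-scaled inner products between predictions and candidates.
    pos = torch.einsum('bkd,bkd->bk', z_pred, z_pos) / tau        # (B, k)
    neg = torch.einsum('bkd,bknd->bkn', z_pred, z_neg) / tau      # (B, k, N-1)
    logits = torch.cat([pos.unsqueeze(-1), neg], dim=-1)          # (B, k, N)
    # The positive sits at index 0, so the target class is always 0;
    # cross-entropy then reproduces the -log softmax ratio in Eq. (2).
    labels = torch.zeros(logits.shape[:2], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())
```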
4. PROPOSED METHOD
The proposed method consists of two stages: pre-training a "feature extractor" model with CPC on a large unlabeled dataset, and training an emotion recognizer on features learned in the first stage. In this section, we introduce the emotion recognizer and the training loss function.
The output of CPC is a sequence of encoded vectors $C = \{c_1, c_2, ..., c_L\}$, $C \in \mathbb{R}^{L \times D_c}$. To predict primitive emotions for a given speech utterance, an utterance-level embedding is desired. Since certain parts of an utterance are often more emotionally salient than others, we adopt a self-attention mechanism to focus on these periods and utilize the relevant features. Specifically, a structured self-attention layer [19] aggregates information from the output of CPC and produces a fixed-length vector $u$ as the representation of the speech utterance.

Given $C$ as input of the emotion recognizer, we follow [19] to compute the scaled dot-product attention representation $H$ as

$$H = \mathrm{softmax}\left( C W_Q (C W_K)^\top / \sqrt{D_{attn}} \right) C W_V, \quad (3)$$

where $W_Q$, $W_K$, and $W_V$ are trainable parameters, all of shape $D_c \times D_{attn}$. The subscripts $Q$, $K$, and $V$ stand for query, key, and value, as defined in [19].

In order to learn an embedding that captures multiple aspects, we use a multi-headed mechanism to process the input multiple times in parallel. The independent attention outputs are simply concatenated and linearly transformed:

$$H^j = \mathrm{softmax}\left( C W_Q^j (C W_K^j)^\top / \sqrt{D_{attn}} \right) C W_V^j, \quad (4)$$

$$U = \mathrm{Concat}(H^1, H^2, ..., H^n)\, W_O, \quad (5)$$

where $W_O \in \mathbb{R}^{n D_{attn} \times D_u}$ is another trainable weight matrix, and $U \in \mathbb{R}^{L \times D_u}$ is the sequence representation after the multi-headed attention layer.

Following the multi-headed attention layer, we compute the mean and standard deviation along the time dimension and concatenate them as the utterance representation

$$u = [\mathrm{mean}(U); \mathrm{std}(U)]. \quad (6)$$

Subsequently, two dense layers with ReLU activations are used, followed by a dropout layer with a small dropout probability. The final output layer is a dense layer whose number of hidden units equals the number of emotion attributes (e.g., three dimensions corresponding to activation, valence, and dominance, respectively).

Following [20], we build a loss function based on the concordance correlation coefficient (CCC, [21]). For two random variables $X$ and $Y$, the CCC is defined as

$$\mathrm{CCC}(X, Y) = \frac{2 \rho \sigma_X \sigma_Y}{\sigma_X^2 + \sigma_Y^2 + (\mu_X - \mu_Y)^2}, \quad (7)$$

where $\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$ is the Pearson correlation coefficient, and $\mu$ and $\sigma$ are the mean and standard deviation, respectively. As can be seen from (7), CCC measures the alignment of two random variables. In our setting, model predictions and data labels assume the roles of $X$ and $Y$ in (7).

Since the emotion recognizer predicts activation, valence, and dominance at the same time, we use a loss function that combines the $\mathrm{CCC}_{act}$, $\mathrm{CCC}_{val}$, and $\mathrm{CCC}_{dom}$ values for activation, valence, and dominance, respectively:

$$\mathcal{L} = 1 - \alpha\, \mathrm{CCC}_{act} - \beta\, \mathrm{CCC}_{val} - \gamma\, \mathrm{CCC}_{dom}. \quad (8)$$

We set the trade-off parameters $\alpha = \beta = \gamma = 1/3$ in all our experiments.
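As a concrete illustration of Eqs. (3)-(8), the sketch below implements an attention-pooling head and the CCC loss in PyTorch. It is a minimal reading of the description above, not the authors' code: we substitute torch.nn.MultiheadAttention for the structured self-attention of [19], and the hyperparameter defaults are our assumptions.

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Multi-head attention pooling (Eqs. 3-6) followed by dense layers."""

    def __init__(self, d_c=512, n_heads=8, d_hidden=128, n_outputs=3, p_drop=0.2):
        super().__init__()
        # Stand-in for the structured self-attention layer of Eqs. (4)-(5).
        self.attn = nn.MultiheadAttention(d_c, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_c, d_hidden), nn.ReLU(),   # mean+std doubles the width
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(d_hidden, n_outputs),            # activation, valence, dominance
        )

    def forward(self, c):                              # c: (B, L, D_c) CPC outputs
        u_seq, _ = self.attn(c, c, c)                  # (B, L, D_c)
        # Eq. (6): concatenate mean and std over the time dimension.
        u = torch.cat([u_seq.mean(dim=1), u_seq.std(dim=1)], dim=-1)
        return self.mlp(u)                             # (B, n_outputs)

def ccc(x, y, eps=1e-8):
    """Concordance correlation coefficient, Eq. (7)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(unbiased=False), y.var(unbiased=False)
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2 + eps)

def ccc_loss(pred, target):
    """Eq. (8) with alpha = beta = gamma = 1/3."""
    return 1 - sum(ccc(pred[:, i], target[:, i]) for i in range(3)) / 3
```

Note that, because CCC involves means and variances, the loss is computed over a batch of utterances rather than per example.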
5. SPEECH CORPORA
For unsupervised pre-training, we train the CPC model on the LibriSpeech dataset [22], a large-scale corpus originally created for automatic speech recognition (ASR). It contains 1000 hours of English audiobook speech, sampled at 16 kHz. In our experiments, due to computational limitations, we use the official "train-clean-100" subset containing 100 hours of clean speech for unsupervised pre-training. In this subset, 126 male and 125 female speakers were assigned to the training set. For each speaker, the amount of speech was limited to 25 minutes to avoid imbalances in per-speaker duration.

To evaluate empirical emotion recognition performance, we perform experiments on the widely used MSP-Podcast [4] and IEMOCAP [3] datasets. MSP-Podcast is a database of spontaneous emotional speech. In our work, we used version 1.6 of the corpus, which contains 50,362 utterances amounting to 84 hours of audio recordings. Each utterance contains a single speaker, with durations between 2.75 s and 11 s. We follow the official partition of the dataset, which has 34,280, 5,958, and 10,124 utterances in the training, validation, and test sets, respectively. The dataset provides scores for activation, valence, and dominance, as well as categorical emotion labels.

IEMOCAP is a widely used corpus in SER research. It has audio-visual recordings from five male and five female actors. The actors were instructed to either improvise or act out certain specific emotions. The dataset contains 5,531 utterances grouped into 5 sessions, which amount to about 12 hours of audio. Similar to MSP-Podcast, this dataset provides categorical and dimensional emotion labels. In this work, we focus on predicting the dimensional emotion metrics from the speech data.
6. EXPERIMENT RESULTS

6.1. Setups
Our experiments investigate four different setups:

a) Supervised only (Sup): As a simple baseline, an emotion recognizer was trained and tested on 40-dimensional log filterbank energy (LFBE) features of IEMOCAP and MSP-Podcast, respectively. LFBE features have been tested in a wide variety of applications.

b) Joint CPC + supervised (jointCPC): JointCPC trains the CPC model and the emotion recognizer in an end-to-end manner, where the CPC model aims to learn features from the raw audio directly, while the Sup setup uses hand-crafted features for the supervised task. We included this baseline to test whether it is possible to learn better features when the feature extraction part is aware of the downstream task.

c) MiniCPC: Compared with jointCPC, miniCPC trains the CPC model and the emotion recognizer in two separate stages on the same datasets. With this setup, we can verify whether the CPC model can learn universal representations that facilitate various downstream tasks.

d) CPC pre-train + supervised (preCPC): We first pre-trained a CPC model on a 100-hour subset of the LibriSpeech dataset. Then an attention-based emotion recognizer was trained on features extracted from the learned CPC model, on MSP-Podcast and IEMOCAP, respectively. Since the training corpus for CPC is much larger than the labeled datasets, we can test whether introducing a large out-of-domain dataset for unsupervised pre-training is useful.

For the CPC model used in the above settings, we use a four-layer CNN with strides [5, 4, 4, 2], filter sizes [10, 8, 8, 4], and 128 hidden units with ReLU activations to encode the 16 kHz audio waveform inputs. A unidirectional gated recurrent unit (GRU) network with 256 hidden dimensions is used as the autoregressive model. For each output of the GRU, we predict 12 timesteps into the future, using 50 negative samples, drawn from the same sequence, for each prediction. We train the CPC model with fixed-length utterances of 10 s duration: longer utterances are cut at 10 s, and shorter ones are padded by repeating themselves.

For the emotion recognizer, an 8-head attention layer with 512-dimensional hidden states is used. The outputs of the attention layer have the same dimension as the inputs. The two fully connected layers have 128 hidden units. The dropout probability is set to 0.2 for the dropout layers.

Our model was implemented in PyTorch, and all methods were run on 8 GPUs, each with a minibatch size of 8 examples for CPC pre-training. We use the Adam optimizer with a weight decay of 0.00001 and a learning rate of 0.0002. We trained for 50 epochs and saved the model that performed best on the validation set for testing.

To evaluate on the IEMOCAP dataset, we used 5-fold cross-validation. All experiments were run five times to produce the means and standard deviations.
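For reference, the following is a PyTorch sketch of the CPC backbone just described (four-layer strided CNN encoder plus unidirectional GRU). It reflects our own reading of the stated hyperparameters, not the authors' code; the class and argument names are ours, and the prediction heads h_k and the negative sampling are omitted.

```python
import torch
import torch.nn as nn

class CPCBackbone(nn.Module):
    """Encoder f (strided CNN) and autoregressive model g (GRU)."""

    def __init__(self, d_z=128, d_c=256):
        super().__init__()
        strides = [5, 4, 4, 2]     # overall downsampling 5*4*4*2 = 160 samples,
        kernels = [10, 8, 8, 4]    # i.e. one latent frame per 10 ms at 16 kHz
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers += [nn.Conv1d(in_ch, d_z, kernel_size=k, stride=s), nn.ReLU()]
            in_ch = d_z
        self.encoder = nn.Sequential(*layers)          # f: waveform -> z_t
        self.gru = nn.GRU(d_z, d_c, batch_first=True)  # g: z_{<=t} -> c_t

    def forward(self, wav):                  # wav: (B, T) raw 16 kHz samples
        z = self.encoder(wav.unsqueeze(1))   # (B, d_z, T')
        c, _ = self.gru(z.transpose(1, 2))   # (B, T', d_c)
        return z.transpose(1, 2), c          # latent and context sequences
```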
6.2. Results

Tables 1 and 2 present the performance in terms of CCC for activation, valence, and dominance on the IEMOCAP and MSP-Podcast corpora, respectively. As shown in these tables, on both datasets preCPC consistently outperforms the other setups. PreCPC achieves higher CCC values for all metrics than Sup, which implies that the representations learned by CPC are superior to hand-crafted features for the speech emotion recognition task. Surprisingly, even though it pre-trains the CPC model on a small dataset, miniCPC still performs better than jointCPC on both datasets. We hypothesize that this is because unsupervised pre-training learns universal representations that are less specialized toward solving a certain task; hence, it produces representations with better generalization, which can facilitate various downstream tasks. For the jointCPC method, by contrast, a trade-off has to be made between emotion prediction capability and representation learning. Also notice that preCPC outperforms miniCPC by a large margin. This confirms our intuition that exposing the model to more diverse acoustic conditions and speaker variations is beneficial for learning robust features.

We also plot the representations extracted by CPC from IEMOCAP to examine how suitable these representations are for emotion.
Table 1: CCC scores (mean/std) on the IEMOCAP dataset

Methods  | CCC_avg     | CCC_act     | CCC_val     | CCC_dom
Sup      | .664 ± .007 | .638 ± .017 | .718 ± .004 | .635 ± .009
jointCPC | .562 ± .012 | .549 ± .032 | .642 ± .013 | .491 ± .016
miniCPC  | .660 ± .005 | .673 ± .028 | .702 ± .009 | .606 ± .019
preCPC   | .731 ± .003 | .752 ± .014 | .752 ± .009 | .691 ± .009

Table 2: CCC scores (mean/std) on the MSP-Podcast dataset

Methods  | CCC_avg     | CCC_act     | CCC_val     | CCC_dom
Sup      | .458 ± .005 | .596 ± .007 | .266 ± .004 | .501 ± .013
jointCPC | .491 ± .008 | .628 ± .006 | .280 ± .006 | .568 ± .007
miniCPC  | .549 ± .006 | .688 ± .009 | .345 ± .005 | .615 ± .011
preCPC   | .571 ± .004 | .706 ± .006 | .377 ± .008 | .639 ± .012
[Fig. 1: Visualization of the learned representations. The figure shows a 2-D scatter of CPC embeddings from IEMOCAP, with points labeled anger and sadness.]

For visualization purposes, we used the categorical emotion labels when making the figure. As can be seen from Figure 1, the CPC model representation is capable of separating sadness from anger to a good extent, even though it is trained without emotion labels.
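A plot in the style of Figure 1 can be produced by projecting the utterance-level embeddings to two dimensions. The paper does not state the projection method used, so the sketch below substitutes t-SNE from scikit-learn as one common choice; the function name plot_embeddings is ours.

```python
# Hypothetical recreation of a Fig. 1 style plot, assuming t-SNE projection.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(features, labels):
    """features: (N, D) array of utterance embeddings; labels: N emotion strings."""
    xy = TSNE(n_components=2).fit_transform(features)  # project to 2-D
    for emo in set(labels):
        idx = [i for i, l in enumerate(labels) if l == emo]
        plt.scatter(xy[idx, 0], xy[idx, 1], label=emo, s=8)
    plt.legend()
    plt.show()
```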
7. CONCLUSION
Our experimental results demonstrate that CPC can learn useful features from unlabeled speech corpora that benefit emotion recognition. We observed significant performance improvements on widely used public benchmarks under various experiment setups, compared to baseline methods. Further, we presented a visualization that confirms the discriminative nature, with respect to emotion classes, of the CPC-learned representations.

So far we have mainly conducted experiments on LibriSpeech for pre-training. In the future, it would be interesting to investigate the impact of other corpora for pre-training. In particular, corpora that have more varied and expressive emotions might yield representations that are even more relevant for SER.

8. REFERENCES

[1] George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A. Nicolaou, Björn Schuller, and Stefanos Zafeiriou, "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in ICASSP. IEEE, 2016, pp. 5200–5204.

[2] Björn W. Schuller, "Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends," Communications of the ACM, vol. 61, no. 5, pp. 90–99, 2018.

[3] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, pp. 335–359, 2008.

[4] Reza Lotfian and Carlos Busso, "Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings," IEEE Transactions on Affective Computing, vol. 10, pp. 471–483, 2019.

[5] Dong Yu and Li Deng, Automatic Speech Recognition, Springer, 2016.

[6] Alan Hanjalic, "Extracting moods from pictures and sounds: Towards truly personalized TV," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 90–100, 2006.

[7] Emily Mower, Angeliki Metallinou, Chi-Chun Lee, Abe Kazemzadeh, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan, "Interpreting ambiguous emotional expressions," in ACII. IEEE, 2009, pp. 1–8.

[8] Tom B. Brown et al., "Language models are few-shot learners," ArXiv, vol. abs/2005.14165, 2020.

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT, 2019.

[10] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick, "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.

[11] Aäron van den Oord, Yazhe Li, and Oriol Vinyals, "Representation learning with contrastive predictive coding," ArXiv, vol. abs/1807.03748, 2018.

[12] Sefik Emre Eskimez, Zhiyao Duan, and Wendi Heinzelman, "Unsupervised learning approach to feature analysis for automatic speech emotion recognition," in ICASSP. IEEE, 2018, pp. 5099–5103.

[13] Jun Deng, Rui Xia, Zixing Zhang, Yang Liu, and Björn Schuller, "Introducing shared-hidden-layer autoencoders for transfer learning and their application in acoustic emotion recognition," in ICASSP. IEEE, 2014, pp. 4818–4822.

[14] Zheng Lian, Jianhua Tao, Bin Liu, and Jian Huang, "Unsupervised representation learning with future observation prediction for speech emotion recognition," arXiv preprint arXiv:1910.13806, 2019.

[15] Roddy Cowie and Randolph R. Cornelius, "Describing the emotional states that are expressed in speech," Speech Communication, vol. 40, no. 1-2, pp. 5–32, 2003.

[16] Georgios N. Yannakakis, Roddy Cowie, and Carlos Busso, "The ordinal nature of emotions: An emerging approach," IEEE Transactions on Affective Computing, 2018.

[17] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[18] Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi, and George Tucker, "On variational bounds of mutual information," in ICML, 2019.

[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[20] Felix Weninger, Fabien Ringeval, Erik Marchi, and Björn W. Schuller, "Discriminatively trained recurrent neural networks for continuous dimensional emotion recognition from audio," in IJCAI, 2016, pp. 2196–2202.

[21] Lawrence I-Kuei Lin, "A concordance correlation coefficient to evaluate reproducibility," Biometrics, pp. 255–268, 1989.

[22] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.