Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial Networks
Jonathan Chang
Harvey Mudd College, Computer Science Department, 1250 North Dartmouth Ave, Claremont, CA 91711
Stefan Scherer
University of Southern California, Institute for Creative Technologies, 12015 Waterfront Dr, Playa Vista, CA 90094
ABSTRACT
Automatically assessing emotional valence in human speech has historically been a difficult task for machine learning algorithms. The subtle changes in the voice of the speaker that are indicative of positive or negative emotional states are often "overshadowed" by voice characteristics relating to emotional intensity or emotional activation. In this work we explore a representation learning approach that automatically derives discriminative representations of emotional speech. In particular, we investigate two machine learning strategies to improve classifier performance: (1) utilization of unlabeled data using a deep convolutional generative adversarial network (DCGAN), and (2) multitask learning. Within our extensive experiments we leverage a multitask annotated emotional corpus as well as a large unlabeled meeting corpus (around 100 hours). Our speaker-independent classification experiments show that, in particular, the use of unlabeled data in our investigations improves the performance of the classifiers, and both fully supervised baseline approaches are outperformed considerably. We improve the classification of emotional valence on a discrete 5-point scale to 43.88% and on a 3-point scale to 49.80%, which is competitive with state-of-the-art performance.
Index Terms — Machine Learning, Affective Computing, Semi-supervised Learning, Deep Learning
1. INTRODUCTION
Machine learning in general, and affective computing in particular, rely on good data representations or features that have a good discriminatory faculty in classification and regression experiments, such as emotion recognition from speech. To derive efficient representations of data, researchers have adopted two main strategies: (1) carefully crafted and tailored feature extractors designed for a particular task [1] and (2) algorithms that learn representations automatically from the data itself [2]. The latter approach, called Representation Learning (RL), has received growing attention in the past few years and is highly reliant on large quantities of data. Most approaches for emotion recognition from speech still rely on the extraction of standard acoustic features such as pitch, shimmer, jitter, and MFCCs (Mel-Frequency Cepstral Coefficients), with a few notable exceptions [3, 4, 5, 6]. In this work we leverage RL strategies and automatically learn representations of emotional speech from the spectrogram directly, using a deep convolutional neural network (CNN) architecture.

To learn strong representations of speech we seek to leverage as much data as possible. However, emotion annotations are difficult to obtain and scarce [7]. We leverage the USC-IEMOCAP dataset, which comprises around 12 hours of highly emotional and partly acted data from 10 speakers [8]. In addition, we aim to improve the learned representations of emotional speech with unlabeled speech data from an unrelated meeting corpus, which consists of about 100 hours of data [9]. While the meeting corpus is qualitatively quite different from the highly emotional USC-IEMOCAP data (the AMI meeting corpus is not highly emotional, and strong emotions such as anger do not appear), we believe that the learned representations will improve through the use of these additional data. This combination of two separate data sources leads to a semi-supervised machine learning task, and we extend the CNN architecture to a deep convolutional generative adversarial network (DCGAN) that can be trained in an unsupervised fashion [10].

Within this work, we particularly target emotional valence as the primary task, as it has been shown to be the most challenging emotional dimension for acoustic analyses in a number of studies [11, 12]. Apart from solely targeting valence classification, we further investigate the principle of multitask learning. In multitask learning, a set of related tasks (e.g., emotional activation) is learned along with a primary task (e.g., emotional valence); both tasks share parts of the network topology and are hence jointly trained, as depicted in Figure 1. It is expected that the data for the secondary task models information that is also discriminative in learning the primary task. In fact, this approach has been shown to improve generalizability across corpora [13].

The remainder of this paper is organized as follows: First we introduce the DCGAN model and discuss prior work in Section 2. Then we describe our specific multitask DCGAN model in Section 3, introduce the datasets in Section 4, and describe our experimental design in Section 5. Finally, we report our results in Section 6 and discuss our findings in Section 7.

This material is based upon work supported by the U.S. Army Research Laboratory under contract number W911NF-14-D-0005. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Government, and no official endorsement should be inferred.
2. RELATED WORK
The proposed model builds upon previous results in the field of emotion recognition and leverages prior work in representation learning. Multitask learning has been effective in some prior experiments on emotion detection. In particular, Xia and Liu proposed a multitask model for emotion recognition which, like the investigated model, has activation and valence as targets [14]. Their work uses a Deep Belief Network (DBN) architecture to classify the emotion of audio input, with valence and activation as secondary tasks. Their experiments indicate that the use of multitask learning produces improved unweighted accuracy on the emotion classification task. Like Xia and Liu, the proposed model uses multitask learning with valence and activation as targets. Unlike them, however, we are primarily interested not in emotion classification but in valence classification as a primary task. Thus, our multitask model has valence as a primary target and activation as a secondary target. Also, while our experiments use the IEMOCAP database like Xia and Liu do, our method of speaker split differs from theirs. Xia and Liu use a leave-one-speaker-out cross validation scheme with separate train and test sets that have no speaker overlap. This method lacks a distinct validation set; instead, they validate and test on the same set. Our experimental setup, on the other hand, splits the data into distinct train, validation, and test sets, still with no speaker overlap. This is described in greater detail in Section 5.

The unsupervised learning part of the investigated model builds upon an architecture known as the deep convolutional generative adversarial network, or DCGAN. A DCGAN consists of two components, known as the generator and the discriminator, which are trained against each other in a minimax setup. The generator learns to map samples from a random distribution to output matrices of some pre-specified form. The discriminator takes an input which is either a generator output or a "real" sample from a dataset, and learns to classify the input as either generated or real [10].

For training, the discriminator uses a cross entropy loss function based on how many inputs were correctly classified as real and how many were correctly classified as generated. The cross entropy loss between true labels $y$ and predictions $\hat{y}$ is defined as

$$L(w) = -\frac{1}{N}\sum_{n=1}^{N}\left[y_n \log \hat{y}_n + (1 - y_n)\log(1 - \hat{y}_n)\right] \qquad (1)$$

where $w$ is the learned vector of weights and $N$ is the number of samples. For purposes of this computation, labels are represented as numerical values of 1 for real and 0 for generated. Then, letting $\hat{y}_r$ represent the discriminator's predictions for all real inputs, the cross entropy for correct predictions of "real" simplifies to

$$L_r(w) = -\frac{1}{N}\sum_{n=1}^{N}\log \hat{y}_{r,n}$$

because in this case the correct predictions are all ones. Similarly, letting $\hat{y}_g$ represent the discriminator's predictions for all generated inputs, the cross entropy for correct predictions of "generated" simplifies to

$$L_f(w) = -\frac{1}{N}\sum_{n=1}^{N}\log(1 - \hat{y}_{g,n})$$

because here the correct predictions are all zeroes.
The total loss for the discriminator is given by the sum of the previous two terms, $L_d = L_r + L_f$.

The generator also uses a cross entropy loss, but its loss is defined in terms of how many generated outputs are incorrectly classified as real:

$$L_g(w) = -\frac{1}{N}\sum_{n=1}^{N}\log(\hat{y}_{g,n})$$

Thus, the generator's loss gets lower the better it is able to produce outputs that the discriminator thinks are real. Given sufficient training iterations, this leads the generator to eventually produce outputs that look like real samples of speech.
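To make these loss terms concrete, the following NumPy sketch (our own illustration, not the authors' released implementation) computes $L_d$ and $L_g$ for a batch of discriminator outputs; the variable names d_real and d_gen are ours.

```python
import numpy as np

def discriminator_loss(d_real, d_gen, eps=1e-12):
    """L_d = L_r + L_f: cross entropy over real and generated batches.

    d_real: discriminator probabilities for real spectrogram crops (label 1)
    d_gen:  discriminator probabilities for generator outputs (label 0)
    """
    L_r = -np.mean(np.log(d_real + eps))       # correct "real" predictions
    L_f = -np.mean(np.log(1.0 - d_gen + eps))  # correct "generated" predictions
    return L_r + L_f

def generator_loss(d_gen, eps=1e-12):
    """L_g: low when generated outputs are (wrongly) classified as real."""
    return -np.mean(np.log(d_gen + eps))

# toy batch: a discriminator that is fairly confident on both classes
d_real = np.array([0.90, 0.80, 0.95])
d_gen = np.array([0.10, 0.20, 0.05])
print(discriminator_loss(d_real, d_gen), generator_loss(d_gen))
```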
3. MULTITASK DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORK
The investigated multitask model is based upon the DCGAN architecture described in Section 2 and is implemented in TensorFlow (we will make the code available on GitHub upon publication). For emotion classification, a fully connected layer is attached to the final convolutional layer of the DCGAN's discriminator. The output of this layer is then fed to two separate fully connected layers, one of which outputs a valence label and the other of which outputs an activation label. This setup is shown visually in Figure 1. Through this setup, the model is able to take advantage of unlabeled data during training by feeding it through the DCGAN layers in the model, and is also able to take advantage of multitask learning by training the valence and activation outputs simultaneously.

In particular, the model is trained by iteratively running the generator, discriminator, valence classifier, and activation classifier, and back-propagating the error for each component through the network. The loss functions for the generator and discriminator are unaltered and remain as shown in Section 2. Both the valence classifier and the activation classifier use the cross entropy loss of Equation 1.

Since the valence and activation classifiers share layers with the discriminator, the model learns features and convolutional filters that are effective for the tasks of valence classification, activation classification, and discriminating between real and generated samples.
Fig. 1. Visual representation of the deep convolutional generative adversarial network with multitask valence and activation classifiers.
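A minimal tf.keras sketch of the topology in Figure 1 is shown below, assuming the four-layer, stride-2 discriminator body described in Section 5; the filter counts, layer widths, and activations are our assumptions, not the authors' configuration.

```python
import tensorflow as tf

def build_multitask_discriminator(input_shape=(128, 128, 1), num_classes=5):
    """Discriminator body with a shared layer and three output heads (cf. Fig. 1)."""
    spec = tf.keras.Input(shape=input_shape, name="spectrogram")
    x = spec
    # four strided convolutions; stride 2 halves the feature map at each step
    for filters in (64, 128, 256, 512):          # filter counts are assumptions
        x = tf.keras.layers.Conv2D(filters, 5, strides=2, padding="same",
                                   activation=tf.nn.leaky_relu)(x)
    feats = tf.keras.layers.Flatten()(x)          # output of the final convolutional layer
    # real-vs-generated decision of the DCGAN discriminator
    real_vs_gen = tf.keras.layers.Dense(1, activation="sigmoid",
                                        name="real_vs_generated")(feats)
    # shared fully connected layer feeding the two classification heads
    shared = tf.keras.layers.Dense(256, activation="relu", name="shared_layer")(feats)
    valence = tf.keras.layers.Dense(num_classes, activation="softmax",
                                    name="valence")(shared)
    activation = tf.keras.layers.Dense(num_classes, activation="softmax",
                                       name="activation")(shared)
    return tf.keras.Model(spec, [real_vs_gen, valence, activation])

model = build_multitask_discriminator()
model.summary()
```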
4. DATA CORPUS
Due to the semi-supervised nature of the proposed Multitask DCGAN model, we utilize both labeled and unlabeled data. For the unlabeled data, we use audio from the AMI [9] and IEMOCAP [8] datasets. For the labeled data, we use audio from the IEMOCAP dataset, which comes with labels for activation and valence, both measured on a 5-point Likert scale by three distinct annotators. Although IEMOCAP provides per-word activation and valence labels, in practice these labels do not generally change over time in a given audio file, and so for simplicity we label each audio clip with the average valence and activation. Since valence and activation are both measured on a 5-point scale, the labels are encoded as 5-element one-hot vectors. For instance, a valence of 5 is represented with the vector [0, 0, 0, 0, 1]. The one-hot encoding can be thought of as a probability distribution representing the likelihood of the correct label being some particular value. Thus, in cases where the annotators disagree on the valence or activation label, this can be represented by assigning probabilities to multiple positions in the label vector. For instance, a label of 4.5 conceptually means that the "correct" valence is either 4 or 5 with equal probability, so the corresponding vector would be [0, 0, 0, 0.5, 0.5]. These "fuzzy labels" have been shown to improve classification performance in a number of applications [15, 16]. It should be noted here that we generally had greater success with this fuzzy label method than with training the neural network model on the valence label directly, i.e., a classification task rather than a regression task.
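As a small illustration of this encoding, the sketch below (our own; the function name fuzzy_label is hypothetical) converts an averaged annotator rating on the 5-point scale into such a fuzzy 5-element vector.

```python
import numpy as np

def fuzzy_label(avg_label, num_classes=5):
    """Encode an averaged 1-5 rating as a 'fuzzy' one-hot probability vector.

    Whole numbers give ordinary one-hot vectors; a value such as 4.5 splits
    the probability mass between the two neighboring classes.
    """
    vec = np.zeros(num_classes)
    lower = int(np.floor(avg_label))   # e.g. 4 for a label of 4.5
    upper = int(np.ceil(avg_label))    # e.g. 5 for a label of 4.5
    frac = avg_label - lower           # e.g. 0.5
    vec[lower - 1] += 1.0 - frac       # labels are 1-based
    if upper != lower:
        vec[upper - 1] += frac
    return vec

print(fuzzy_label(5))    # [0. 0. 0. 0. 1.]
print(fuzzy_label(4.5))  # [0.  0.  0.  0.5 0.5]
```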
Pre-processing.
Audio data is fed to the network models in the form of spectrograms. The spectrograms are computed using a short-time Fourier transform with a window size of 1024 samples, which at the 16 kHz sampling rate is equivalent to 64 ms. Each spectrogram is 128 pixels high, representing the frequency range 0-11 kHz. Due to the varying lengths of the IEMOCAP audio files, the spectrograms vary in width, which poses a problem for the batching process of the neural network training. To compensate for this, the model randomly crops a region of each input spectrogram. The crop width is determined in advance. To ensure that the selected crop region contains at least some data (i.e., is not entirely silence), cropping uses the following procedure: a random word in the transcript of the audio file is selected and the corresponding time range is looked up. A random point within this time range is selected, which is then treated as the center line of the crop. The crop is then made using the region defined by the center line and the crop width.

Early on, we found that there is a noticeable imbalance in the valence labels for the IEMOCAP data, in that the labels skew heavily towards the neutral (2-3) range. In order to prevent the model from overfitting to this distribution during training, we normalize the training data by oversampling underrepresented valence data, such that the overall distribution of valence labels is more even.
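The following sketch illustrates this pre-processing under stated assumptions: it computes a log-magnitude spectrogram with a 1024-sample STFT window via SciPy and crops a fixed-width region centered on a random point inside a randomly chosen word. The word-timing format, hop size, and helper names are our assumptions rather than the authors' pipeline.

```python
import numpy as np
from scipy.signal import stft

def spectrogram(audio, sr=16000, n_fft=1024):
    """Log-magnitude spectrogram; 1024 samples at 16 kHz is a 64 ms window."""
    _, _, Z = stft(audio, fs=sr, nperseg=n_fft)
    return np.log1p(np.abs(Z))                   # shape: (freq_bins, time_frames)

def random_word_crop(spec, words, crop_width, frames_per_sec, rng=np.random):
    """Crop a fixed-width region centered on a random point inside a random word.

    words: list of (start_sec, end_sec) tuples from the transcript (assumed format).
    frames_per_sec: spectrogram frames per second of audio.
    """
    start, end = words[rng.randint(len(words))]              # pick a random word
    center = int(rng.uniform(start, end) * frames_per_sec)   # random point inside it
    left = center - crop_width // 2
    left = max(0, min(left, spec.shape[1] - crop_width))     # keep the crop in bounds
    return spec[:, left:left + crop_width]
```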
5. EXPERIMENTAL DESIGN
Investigated Models.
We investigate the impact of both unlabeled data for improved emotional speech representations and multitask learning on emotional valence classification performance. To this end, we compared four different neural network models:

1. BasicCNN: A fully supervised, single-task valence classifier.
2. MultitaskCNN: A fully supervised, multitask valence classifier with activation as a secondary task.
3. BasicDCGAN: A semi-supervised DCGAN with a single-task valence classifier.
4. MultitaskDCGAN: A semi-supervised DCGAN with a multitask valence classifier and activation as a secondary task.

The BasicCNN represents a "bare minimum" valence classifier and thus sets a lower bound for expected performance. Comparison with MultitaskCNN indicates the effect of the inclusion of a secondary task, i.e., emotional activation recognition. Comparison with BasicDCGAN indicates the effect of the incorporation of unlabeled data during training.

For fairness, the architectures of all three baselines are based upon the full MultitaskDCGAN model. BasicDCGAN, for example, is simply the MultitaskDCGAN model with the activation layer removed, while the two fully supervised baselines were built by taking the convolutional layers from the discriminator component of the MultitaskDCGAN and adding fully connected layers for valence and activation output. Specifically, the discriminator contains four convolutional layers; there is no explicit pooling, but the kernel stride size is 2, so the image size gets halved at each step. Thus, by design, all four models have this same convolutional structure. This is to ensure that potential performance gains do not stem from a larger complexity or higher number of trainable weights within the DCGAN models, but rather stem from improved representations of speech.

Parameter        BasicCNN   MultitaskCNN   BasicDCGAN   MultitaskDCGAN
Crop Width       128        64             128          64
Learning Rate    e−…        e−…            e−…          e−…
Batch Size       64         128            64           128
Filter Size      15         8              9            6
No. of Filters   84         32             72           88

Table 1. Final parameters used for each model as found by random parameter search.
Experimental Procedure.
The parameters for each model, including batch size, filter size (for convolution), and learning rate, were determined by randomly sampling different parameter combinations, training the model with those parameters, and computing accuracy on a held-out validation set. For each model, we kept the parameters that yielded the best accuracy on the held-out set. This procedure ensures that each model is fairly represented during evaluation. Our hyper-parameters included the crop width of the input signal ∈ {64, 128}, the convolutional layer filter sizes ∈ [2, w/8] (where w is the selected crop width, divided by 8 to account for each halving of the image size in the three convolutional layers leading up to the last one), the number of convolutional filters ∈ [32, …] (step size 4), the batch size ∈ {…, …, …}, and the learning rate ∈ {…, …, …}. The identified parameters per model are shown in Table 1.

For evaluation, we utilized a 5-fold leave-one-session-out validation. Each fold leaves one of the five sessions in the labeled IEMOCAP data out of the training set entirely. From this left-out conversation, one speaker's audio is used as a validation set, while the other speaker's audio is used as a test set.

For each fold, the evaluation procedure is as follows: the model being evaluated is trained on the training set, and after each full pass through the training set, accuracy is computed on the validation set. This process continues until the accuracy on the validation set is found to no longer increase; in other words, we locate a local maximum in validation accuracy. To increase the certainty that this local maximum is truly representative of the model's best performance, we continue to run more iterations after a local maximum is found and look for 5 consecutive iterations with lower accuracy values. If, in the course of these 5 iterations, a higher accuracy value is found, that is treated as the new local maximum and the search restarts from there. Once a best accuracy value is found in this manner, we restore the model's weights to those of the iteration corresponding to the best accuracy and evaluate the accuracy on the test set. A sketch of this stopping rule is given below.

We evaluated each model on all 5 folds using the methodology described above, recording test accuracies for each fold.
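A schematic version of this stopping rule is sketched below; train_epoch, evaluate, the DummyModel class, and the Keras-style get_weights/set_weights accessors are placeholders we introduce for illustration, not the authors' code.

```python
class DummyModel:
    """Stand-in with Keras-style weight accessors, used only for the demo below."""
    def __init__(self): self.w = 0
    def get_weights(self): return self.w
    def set_weights(self, w): self.w = w

def train_with_patience(model, train_epoch, evaluate, patience=5):
    """Train until validation accuracy stops improving for `patience` passes,
    then restore the weights from the best pass (schematic sketch)."""
    best_acc, best_weights, since_best = -1.0, model.get_weights(), 0
    while since_best < patience:
        train_epoch(model)                 # one full pass over the training set
        acc = evaluate(model)              # validation accuracy after this pass
        if acc > best_acc:                 # new local maximum in validation accuracy
            best_acc, best_weights, since_best = acc, model.get_weights(), 0
        else:
            since_best += 1                # one of the 5 "look-ahead" iterations
    model.set_weights(best_weights)        # roll back to the best iteration
    return best_acc

# toy demo: validation accuracy peaks, then drops for 5 straight passes
accs = iter([0.30, 0.42, 0.45, 0.44, 0.46, 0.43, 0.42, 0.41, 0.40, 0.39, 0.38])
m = DummyModel()
print(train_with_patience(m, train_epoch=lambda model: None,
                          evaluate=lambda model: next(accs)))   # -> 0.46
```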
Evaluation Strategy.
We collected several statistics about our models' performances. We were primarily interested in the unweighted per-class accuracy. In addition, we converted the network's output from probability distributions back into numerical labels by taking the expected value, that is, $v = \sum_{i=1}^{5} i \, p_i$, where $p$ is the model's prediction in its original form as a 5-element probability vector. We then used this to compute the Pearson correlation (ρ measure) between predicted and actual labels.

Some pre-processing was needed to obtain accurate measures. In particular, in cases where human annotators were perfectly split on what the correct label for a particular sample should be, both possibilities should be accepted as correct predictions. For instance, if the correct label is 4.5 (vector form [0, 0, 0, 0.5, 0.5]), a correct prediction could be either 4 or 5.

Model            Accuracy (5-class)   Accuracy (3-class)   Pearson Correlation (ρ)
BasicCNN         ….52%                46.59%               0.…
MultitaskCNN     ….78%                40.57%               0.…
BasicDCGAN       43.88%               49.80%               0.…
MultitaskDCGAN   ….69%                48.88%               0.…

Table 2. Evaluation metrics for all four models, averaged across 5 test folds. Speaker-independent unweighted accuracies in % for both 5-class and 3-class valence performance as well as the Pearson correlation ρ are reported.
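For concreteness, the expected-value conversion and correlation described above can be computed as in the following sketch (our own; SciPy's pearsonr is an assumed tool choice, and the prediction vectors are made-up toy values).

```python
import numpy as np
from scipy.stats import pearsonr

def expected_label(pred):
    """Expected value v = sum_i i * p_i of a 5-element probability vector."""
    return float(np.dot(np.arange(1, 6), pred))

# made-up toy predictions and targets, purely for illustration
preds = np.array([[0.1, 0.2, 0.4, 0.2, 0.1],
                  [0.0, 0.0, 0.1, 0.4, 0.5],
                  [0.3, 0.5, 0.2, 0.0, 0.0]])
targets = np.array([3.0, 4.5, 2.0])
v = np.array([expected_label(p) for p in preds])
rho, p_value = pearsonr(v, targets)   # correlation between predicted and actual labels
print(v, rho)
```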
6. RESULTS
Table 2 shows the unweighted per-class accuracies and Pearson correlation coefficients (ρ values) between actual and predicted labels for each model. All values shown are averages across the test sets of all 5 folds.

The results indicate that the use of unsupervised learning yields a clear improvement in performance. Both BasicDCGAN and MultitaskDCGAN have considerably better accuracies and linear correlations than the fully supervised CNN models. This is a strong indication that the use of large quantities of task-unrelated speech data improved the filter learning in the CNN layers of the DCGAN discriminator.

Multitask learning, on the other hand, does not appear to have any positive impact on performance. Comparing the two CNN models, the addition of multitask learning actually appears to impair performance, with MultitaskCNN doing worse than BasicCNN in all three metrics. The difference is smaller when comparing BasicDCGAN and MultitaskDCGAN, and may not be enough to decidedly conclude that the use of multitask learning has a net negative impact there, but certainly there is no indication of a net positive impact. The observed performance of both BasicDCGAN and MultitaskDCGAN using 3 classes is comparable to the state of the art, with 49.80% compared to the 49.99% reported in [17]. It needs to be noted that in [17] data from the test speaker's session partner was utilized in the training of the model. Our models, in contrast, are trained on only four of the five sessions, as discussed in Section 5. Further, the models presented here are trained on the raw spectrograms of the audio, and no feature extraction was employed whatsoever. This representation learning approach is employed in order to allow the DCGAN component of the model to train on vast amounts of unsupervised speech data.

We further report the confusion matrix of the best performing model, BasicDCGAN, in Table 3. It is noted that the "negative" class (i.e., the second row) is classified best. However, it appears that this class is picked overly frequently by the model, resulting in a high recall of 0.7701 and a low precision of 0.3502. The class with the highest F1 score is "very positive" (i.e., the last row), with F1 = 0.…. The confusion of "very negative" valence with "very positive" valence in the top right corner is interesting and has been previously observed [5].

Table 3. Confusion matrix for 5-class valence classification with the BasicDCGAN model. Predictions are reported in columns and actual targets in rows. Valence classes are sorted from very negative to very positive; these classes correspond to the numeric labels 1 through 5.
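Since the matrix entries themselves did not survive extraction, the sketch below only illustrates generically how per-class recall, precision, and F1 are derived from a confusion matrix with targets in rows and predictions in columns; the toy matrix is invented for illustration and is not the paper's data.

```python
import numpy as np

def per_class_scores(cm):
    """Per-class recall, precision, and F1 from a confusion matrix
    with actual targets in rows and predictions in columns."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    recall = tp / cm.sum(axis=1)      # true positives / actual class count
    precision = tp / cm.sum(axis=0)   # true positives / predicted class count
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# toy 3x3 matrix, purely for illustration
toy = [[5, 2, 0],
       [1, 7, 1],
       [0, 3, 6]]
print(per_class_scores(toy))
```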
7. CONCLUSIONS
We investigated the use of unsupervised and multitask learning to improve the performance of an emotional valence classifier. Overall, we found that unsupervised learning yields considerable improvements in classification accuracy for the emotional valence recognition task. The best performing model achieves 43.88% in the 5-class case and 49.80% in the 3-class case, with a significant Pearson correlation between the continuous target label and the prediction of ρ = 0.… (p < .…). There is no indication that multitask learning provides any advantage.

The results for multitask learning are somewhat surprising. It may be that the valence and activation classification tasks are not sufficiently related for multitask learning to yield improvements in accuracy. Alternatively, a different neural network architecture may be needed for multitask learning to work. Further, the alternating update strategy employed in the present work might not have been the optimal strategy for training: the iterative swapping of the target tasks valence/activation might have created instabilities in the weight updates of the backpropagation algorithm. There may yet be other explanations; further investigation may be warranted.

Lastly, it is important to note that this model's performance only approaches the state of the art, which employs potentially better suited sequential classifiers such as Long Short-Term Memory (LSTM) networks [18]. However, basic LSTMs are not suited to learning from entirely unsupervised data, which we leveraged for the proposed DCGAN models. For future work, we hope to adapt the technique of using unlabeled data to sequential models, including LSTMs. We expect that combining our work here with the advantages of sequential models may result in further performance gains, which may be more competitive with today's leading models and potentially outperform them. For the purposes of this investigation, the key takeaway is that the use of unsupervised learning yields clear performance gains on the emotional valence classification task, and that this represents a technique that may be adapted to other models to achieve even higher classification accuracies.

8. REFERENCES

[1] Piotr Dollár, Zhuowen Tu, Hai Tao, and Serge Belongie, "Feature mining for image classification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2007, pp. 1–8.

[2] Yoshua Bengio, Aaron Courville, and Pascal Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.

[3] George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A. Nicolaou, Stefanos Zafeiriou, et al., "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5200–5204.

[4] Sayan Ghosh, Eugene Laksana, Louis-Philippe Morency, and Stefan Scherer, "Learning representations of affect from speech," CoRR, vol. abs/1511.04747, 2015.

[5] S. Ghosh, E. Laksana, L.-P. Morency, and S. Scherer, "Representation learning for speech emotion recognition," in Proceedings of Interspeech 2016, 2016.

[6] S. Ghosh, L.-P. Morency, E. Laksana, and S. Scherer, "An unsupervised approach to glottal inverse filtering," in Proceedings of EUSIPCO 2016, 2016.

[7] Gary McKeown, Michel Valstar, Roddy Cowie, Maja Pantic, and Marc Schröder, "The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent," IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 5–17, 2012.

[8] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.

[9] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al., "The AMI meeting corpus: A pre-announcement," in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28–39.

[10] Alec Radford, Luke Metz, and Soumith Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.

[11] T. Dang, V. Sethu, and E. Ambikairajah, "Factor analysis based speaker normalisation for continuous emotion prediction," in Proceedings of Interspeech 2016, 2016.

[12] Carlos Busso and Tauhidur Rahman, "Unveiling the acoustic properties that describe the valence dimension," in INTERSPEECH, 2012, pp. 1179–1182.

[13] Sayan Ghosh, Eugene Laksana, Stefan Scherer, and Louis-Philippe Morency, "A multi-label convolutional neural network approach to cross-domain action unit detection," in Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on. IEEE, 2015, pp. 609–615.

[14] Rui Xia and Yang Liu, "A multi-task learning framework for emotion recognition using 2D continuous space," IEEE Transactions on Affective Computing, 2015.

[15] S. Scherer, J. Kane, C. Gobl, and F. Schwenker, "Investigating fuzzy-input fuzzy-output support vector machines for robust voice quality classification," Computer Speech and Language, vol. 27, no. 1, pp. 263–287, 2013.

[16] C. Thiel, S. Scherer, and F. Schwenker, "Fuzzy-input fuzzy-output one-against-all support vector machines," 2007, vol. 3 of Lecture Notes in Artificial Intelligence, pp. 156–165, Springer.

[17] Angeliki Metallinou, Martin Wöllmer, Athanasios Katsamanis, Florian Eyben, Björn Schuller, and Shrikanth Narayanan, "Context-sensitive learning for enhanced audiovisual emotion classification," IEEE Transactions on Affective Computing, vol. 3, no. 2, pp. 184–198, 2012.

[18] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.