A Siamese Neural Network with Modified Distance Loss For Transfer Learning in Speech Emotion Recognition
Kexin Feng and Theodora Chaspari
HUman Bio-Behavioral Signals (HUBBS) Lab, Department of Computer Science and Engineering, Texas A&M University
{kexin0814, chaspari}@tamu.edu

Abstract.
Automatic emotion recognition plays a significant role in the process of human-computer interaction and the design of Internet of Things (IoT) technologies. Yet, a common problem in emotion recognition systems lies in the scarcity of reliable labels. By modelling pairwise differences between samples of interest, a Siamese network can help to mitigate this challenge, since it requires fewer samples than traditional deep learning methods. In this paper, we propose a distance loss, which can be applied to Siamese network fine-tuning, optimizing the model based on the relative distance between same-class and different-class pairs. Our system uses samples from the source data to pre-train the weights of the proposed Siamese neural network, which are then fine-tuned based on the target data. We present an emotion recognition task that uses speech, since it is one of the most ubiquitous and frequently used bio-behavioral signals. Our target data comes from the RAVDESS dataset, while CREMA-D and eNTERFACE05 are used as source data. Our results indicate that the proposed distance loss is able to greatly benefit the fine-tuning process of the Siamese network. Also, the selection of source data has a larger effect on the Siamese network performance than the number of frozen layers. These findings suggest the great potential of applying the Siamese network and modelling pairwise differences in the field of transfer learning for automatic emotion recognition.
Keywords:
Emotion recognition · Speech · Transfer learning · Fine-tuning · Siamese neural network.
Automatic emotion recognition refers to identifying emotions using various human-related signals such as facial expressions, physiological signals, and speech [11]. It can potentially benefit many applications related to human-computer interaction, health informatics, or even the design of smart cities and communities. Among these signals, speech data is largely explored due to its relatively high availability and ease of collection. Acquiring reliable annotations for such large amounts of audio clips can be extremely hard, posing a significant impediment to the reliable training of emotion recognition systems.

A great number of machine learning approaches have been proposed to address this challenge, and transfer learning has been shown to be one of the most promising directions. Transfer learning methods such as fine-tuning [1] make use of a model that is well trained on another emotion dataset. Also, progressive neural networks (PNN) [5], which are less forgetful when applied to the target domain, have been proposed and have obtained good performance in leveraging knowledge across various conditions. However, these methods might be less effective when there is a very limited amount of data in the target domain, preventing adequate training of the corresponding machine learning models.

Modelling the pairwise differences between samples of interest (e.g., through Siamese networks [7]) could be a potential solution to address small amounts of labelled target data in various applications. In this paper, we propose the use of the Siamese network structure for the task of transfer learning in speech emotion recognition, trained using the fine-tuning method and further optimized using a distance loss that incorporates the relative distance among pairs. A publicly available dataset, RAVDESS [9], was used as target data due to its relatively small number of speakers and sample size. Two other emotional datasets, CREMA-D [3] and eNTERFACE05 [10], are used as source data to compare the influence of different source domains. Our results indicate that the selection of source data can have a significant impact on Siamese network fine-tuning, and that our distance loss can significantly benefit the Siamese network fine-tuning process, yielding an improvement of up to 7% compared to fine-tuning the Siamese network without the proposed distance loss.
The Siamese network is a type of neural network which takes a pair of data samples as input and decides whether the corresponding samples belong to the same or different classes. This network structure was first introduced by Bromley et al. for the task of signature verification, by comparing whether two signatures are from the same person [2]. In a Siamese network, each data sample in a pair is an input to the network, and the network outputs an extracted feature vector for it. To the best of our knowledge, this is the first time the Siamese network has been applied and further optimized for an emotion-related transfer learning task.

The target data comes from the RAVDESS dataset [9], which includes 47 minutes of audio from 24 actors; each audio sample is further labeled by 247 annotators. The source data comes from two different publicly available datasets: eNTERFACE05 [10] and CREMA-D [3]. eNTERFACE05 contains 45 minutes of data from 42 participants, who were asked to express their emotions in scripted sentences after listening to specific stories. The CREMA-D dataset contains 165 minutes of audio data from 91 actors, who were asked to perform until they were approved by a director; each audio clip is further labeled by human annotators.

Speech samples depicting four common emotions (anger, happiness, sadness, and fear) across all datasets were selected and processed using the openSMILE toolkit [4]. A 64-dimensional feature set, part of the INTERSPEECH09 emotion challenge feature set [13], is extracted from each audio segment; it includes the means and standard deviations of the speech intensity, zero-crossing rate, voicing probability, fundamental frequency, and the first 12 Mel-frequency cepstral coefficients (MFCCs). The first-order derivative of each of these features is also computed.
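As a rough, self-contained illustration of this kind of segment-level feature set, the sketch below approximates it in Python with librosa rather than the openSMILE IS09 configuration actually used in this work; the sampling rate, F0 range, and the specific librosa estimators (RMS energy as an intensity proxy, pYIN voicing probability) are assumptions and will not reproduce the exact openSMILE values.

```python
import numpy as np
import librosa

def is09_style_features(path, sr=16000):
    """Approximate 64-dim segment features: mean and std of intensity (RMS),
    zero-crossing rate, voicing probability, F0, and 12 MFCCs, plus the same
    functionals of their first-order deltas (16 LLDs x 2 x 2 = 64 values)."""
    y, sr = librosa.load(path, sr=sr)
    rms = librosa.feature.rms(y=y)                         # intensity proxy, shape (1, T)
    zcr = librosa.feature.zero_crossing_rate(y)            # (1, T)
    f0, _, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = np.nan_to_num(f0)[np.newaxis, :]                  # unvoiced frames -> 0
    voiced_prob = voiced_prob[np.newaxis, :]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)     # (12, T)

    functionals = []
    for lld in (rms, zcr, voiced_prob, f0, mfcc):
        for track in (lld, librosa.feature.delta(lld)):    # static + first-order delta
            functionals.append(track.mean(axis=1))
            functionals.append(track.std(axis=1))
    return np.concatenate(functionals)                     # 64-dimensional vector
```

Applying such a function to every selected audio clip would produce the segment-level feature matrix used in the experiments below, under the stated assumptions.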
In this section, we discuss our three baseline methods: in-domain training, out-of-domain training, and Siamese network fine-tuning. We then introduce the proposed distance loss and how it is applied to the Siamese network. All the methods are based on an optimized Siamese network structure, which contains 64, 32, and 16 nodes with ReLU activation for the three feature-extractor layers, followed by a 16-node decision-making layer. During the training process, the cross-entropy loss is calculated and used to update the weights of the model. After the Siamese network is trained, each test sample is compared with all the available labeled target data, and a final classification decision is generated based on the sum of log-similarities with each class. The unweighted average recall (UAR) is used to evaluate the performance of each method.
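A minimal PyTorch sketch of this structure and decision rule is given below. The stated layer sizes and the L1 comparison follow the description above; the exact decision-layer activations, the sigmoid similarity output, and the optimizer settings are our assumptions, since the text only specifies the node counts and the use of a cross-entropy loss.

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Shared 64-32-16 ReLU feature extractor, L1 difference of the two
    feature vectors, and a 16-node decision layer for same/different pairs."""
    def __init__(self, in_dim=64):
        super().__init__()
        self.extractor = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
        )
        self.decision = nn.Sequential(
            nn.Linear(16, 16), nn.ReLU(),
            nn.Linear(16, 1),                                # logit for "same-class pair"
        )

    def embed(self, x):
        return self.extractor(x)

    def forward(self, x1, x2):
        diff = torch.abs(self.embed(x1) - self.embed(x2))    # L1 difference of embeddings
        return self.decision(diff).squeeze(-1)

def classify(model, x_test, ref_x, ref_y, classes):
    """Decision rule sketch: compare the test sample with every labeled target
    sample and pick the class with the largest sum of log-similarities.
    ref_y: tensor of class ids for the labeled target samples."""
    with torch.no_grad():
        sims = torch.sigmoid(model(x_test.expand(len(ref_x), -1), ref_x))
    scores = {c: torch.log(sims[ref_y == c] + 1e-8).sum() for c in classes}
    return max(scores, key=scores.get)

# Toy pair-based training step with a (binary) cross-entropy loss.
model = SiameseNet()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x1, x2 = torch.randn(32, 64), torch.randn(32, 64)            # batch of feature pairs
same = torch.randint(0, 2, (32,)).float()                    # 1 = same emotion, 0 = different
loss = criterion(model(x1, x2), same)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```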
In order to assess whether the proposed loss benefits the knowledge transfer efficiency when limited data samples are available, we propose three baseline methods. The first baseline performs out-of-domain training (OODT). A Siamese network is trained on all the source data and tested on the target dataset without any adaptation. For this baseline, the data samples from the source are used to determine the final classes, since no labeled target data is available.
Fig. 1. Schematic representation of the proposed Siamese network fine-tuning with modified distance loss. (Diagram: the two input samples pass through weight-sharing feature extractors; the L1-norm of the two feature vectors feeds a decision-making layer that predicts whether the pair is same or different; a cross-entropy loss and the distance loss are applied.)
The second baseline is in-domain training (IDT), which is trained using a sufficient amount of target data. A leave-one-subject-out (LOSO) cross-validation is performed. More specifically, samples from a given speaker are used for testing, and all other data are used for training of the Siamese network. This process is repeated until all the speakers have been used for testing. This baseline serves as an upper limit of potential knowledge transfer.

The third baseline is the traditional fine-tuning method from the field of transfer learning, used to further assess the benefit provided by our proposed distance loss. The models trained in OODT are used for fine-tuning. One random data sample for each emotion from 2 randomly chosen speakers is selected as labeled data, yielding a total of 8 samples (2 speakers × 4 emotions). Different numbers of adopted speakers, up to 20, are tested to evaluate in detail the knowledge transfer efficiency.
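The two data-handling steps just described can be sketched as follows; the toy arrays, the scikit-learn LeaveOneGroupOut splitter, and the seeded random selection are illustrative assumptions rather than the authors' exact protocol.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
emotions = np.array(["anger", "happiness", "sadness", "fear"])

# Toy stand-ins for the 64-dim features, emotion labels, and speaker ids.
X = rng.normal(size=(240, 64))
y = np.tile(emotions, 60)                      # every speaker covers all four emotions
speakers = np.repeat(np.arange(24), 10)        # 24 speakers x 10 samples

# In-domain training (IDT) baseline: leave-one-subject-out cross-validation.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    pass  # train a Siamese network on train_idx, compute UAR on test_idx

# Fine-tuning baseline: one random sample per emotion from each of n_speakers
# randomly chosen target speakers (2 speakers x 4 emotions = 8 labeled samples).
def select_labeled_target(y, speakers, n_speakers=2):
    chosen = rng.choice(np.unique(speakers), size=n_speakers, replace=False)
    idx = [rng.choice(np.where((speakers == spk) & (y == emo))[0])
           for spk in chosen for emo in emotions]
    return np.asarray(idx)

labeled_idx = select_labeled_target(y, speakers)   # indices of the adapted target samples
```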
Inspired by the fact that the traditional fine-tuning method fails to efficiently leverage the knowledge when limited target data is available, we designed a distance loss to maximize the difference between the distances of same-class pairs and different-class pairs. Let X_s be the set of same-class pairs in a batch, and X_d the set of different-class pairs. Let g_W(x) be the function, parameterized by the weights W of the Siamese network, that performs the transformation between the original input x and the extracted feature vector. The parameter W is learned by minimizing the average relative distance between pairs X_s of the same class and maximizing the distance between pairs X_d of different classes:

W^* = \arg\min_W \frac{\frac{1}{|X_d|}\sum_{(x,x') \in X_d} \|g_W(x) - g_W(x')\| + \frac{1}{|X_s|}\sum_{(x,x') \in X_s} \|g_W(x) - g_W(x')\|}{\frac{1}{|X_d|}\sum_{(x,x') \in X_d} \|g_W(x) - g_W(x')\| - \frac{1}{|X_s|}\sum_{(x,x') \in X_s} \|g_W(x) - g_W(x')\|}    (1)

As indicated in the previous description, this loss is only applied to the feature extraction layers at the end of a batch in the Siamese network fine-tuning process, in order to minimize the possible influence on the decision-making process (Figure 1). Apart from this loss, the rest of the method remains the same as the fine-tuning baseline.

In an effort to discuss the domain differences between the speech emotion datasets, we first examine the unweighted average recall (UAR) for the four-emotion classification task with the OODT baseline. This yields a 32.8% UAR on the RAVDESS dataset when using eNTERFACE05 as source and 29.3% when using CREMA-D as source. We then examine the upper limit of classification performance using the proposed features and model structure, and obtain a UAR of 50.0% with in-domain training.

Fine-tuning of the Siamese neural network without the proposed loss is used to illustrate the knowledge transfer efficiency. This approach results in a minor improvement when using eNTERFACE05 as source data, at 32.9%, and a relatively large improvement when using CREMA-D as source data, at 37.8%. We finally added our proposed distance loss to the fine-tuning process and obtained a significant improvement, with 39.9% using eNTERFACE05 and 43.3% using CREMA-D. A comparison of the performance of the different methods can be found in Table 1. Our results also indicate that, compared with the number of frozen layers, the choice of source data may play a more important role in Siamese network fine-tuning (Figure 2).

Table 1. Unweighted average recall (UAR%) of the out-of-domain training (OODT), in-domain training (IDT), and the best results obtained among the different numbers of frozen layers and adopted speakers using Siamese NN fine-tuning with/without the modified loss.

source                                      eNTERFACE05   CREMA-D
OODT                                        32.8          29.3
IDT                                         50.0          50.0
Siamese NN fine-tuning                      32.9          37.8
Siamese NN fine-tuning with modified loss   39.9          43.4
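For concreteness, a minimal sketch of the distance loss follows, under the reading of Eq. (1) given above. The epsilon guard and the restriction of the resulting gradient to the feature-extractor parameters (the text states the loss is only applied to the feature extraction layers; the mechanism shown here is our assumption) are illustrative choices.

```python
import torch

def modified_distance_loss(f1, f2, same, eps=1e-8):
    """Distance loss per the reading of Eq. (1): ratio of the sum to the
    difference of the average pairwise embedding distances of different-class
    and same-class pairs. Minimizing it pulls same-class pairs together and
    pushes different-class pairs apart, assuming the different-class distance
    stays larger than the same-class one.
    f1, f2: embeddings of the two pair members; same: 1 for same-class pairs."""
    dist = torch.norm(f1 - f2, dim=1)
    d_same = dist[same == 1].mean()
    d_diff = dist[same == 0].mean()
    return (d_diff + d_same) / (d_diff - d_same + eps)

# To apply the loss to the feature extractor only at the end of a batch, one
# could compute f1 = model.embed(x1), f2 = model.embed(x2), evaluate the loss,
# and step an optimizer that holds only model.extractor.parameters().
```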
Our results indicate that including 3-5 speakers in the Siamese NN fine-tuning with modified loss may result in the best trade-off between performance and the number of data samples used for training.
Fig. 2. Unweighted average recall (UAR) for different numbers of frozen layers and different numbers of speakers adopted in the fine-tuning process, with and without the distance loss: (a) no frozen layers, (b) first layer frozen, (c) first two layers frozen. Each panel compares Siamese NN fine-tuning and Siamese NN fine-tuning with modified loss, for the eNTERFACE05 and CREMA-D sources.
This can potentially be explained by the fact that this number of speakers might not be enough to fully capture the distribution of the target dataset. Performance is degraded when a smaller number of target speakers is included. Also, if the data distribution includes a lot of variability, the Siamese NN with modified loss will be greatly restricted, since pairwise differences are less likely to express the class information.

There has not been much research on few-shot emotion recognition with speech signals, where data from only a small number of speakers is included in the target domain. The most comparable results are from Gideon et al., whose approach based on progressive neural networks achieved a limited improvement (around 3%) compared to traditional fine-tuning [5]. Our proposed distance loss is able to bring a relatively larger improvement of about 7% compared to fine-tuning, without a significant increase in computational cost. Even though Siamese NN fine-tuning with the modified loss still fails to outperform in-domain training, it shows great potential. We will attempt to modify the proposed methodology and explore whether it can reach in-domain performance as part of our future work.
We propose a distance loss which is based on the relative distance between same-class and different-class pairs. Such a loss can raise the upper limit of performance and increase knowledge transfer efficiency when very limited target data is available. Our results also indicate that the selection of source data plays a more important role than the number of frozen layers. Findings of this work can be applied to other tasks using Siamese networks, fine-tuning, or few-shot learning. The use of distances between different emotion classes can be a fundamental step in understanding the differences between emotion categories.

As part of our future work, we plan to explore the usage of pairs in addressing domain mismatch and to mitigate the influence of different source data in the process of knowledge transfer. We will further perform multi-source transfer learning by using pairwise information to select proper source data from each dataset. Finally, we plan to understand the distances between different emotions with the help of the pairing information.
References
1. Badshah, A.M., Ahmad, J., Rahim, N., Baik, S.W.: Speech emotion recognition from spectrograms with deep convolutional neural network. In: 2017 International Conference on Platform Technology and Service (PlatCon). pp. 1–5. IEEE (2017)
2. Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a "siamese" time delay neural network. In: Advances in Neural Information Processing Systems. pp. 737–744 (1994)
3. Cao, H., Cooper, D.G., Keutmann, M.K., Gur, R.C., Nenkova, A., Verma, R.: CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing 5(4), 377–390 (2014)
4. Eyben, F., Wöllmer, M., Schuller, B.: openSMILE: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia. pp. 1459–1462. ACM (2010)
5. Gideon, J., Khorram, S., Aldeneh, Z., Dimitriadis, D., Provost, E.M.: Progressive neural networks for transfer learning in emotion recognition. arXiv preprint arXiv:1706.03256 (2017)
6. Huang, J., Li, Y., Tao, J., Lian, Z., et al.: Speech emotion recognition from variable-length inputs with triplet loss function. In: Interspeech. pp. 3673–3677 (2018)
7. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop. vol. 2 (2015)
8. Lian, Z., Li, Y., Tao, J., Huang, J.: Speech emotion recognition via contrastive loss under siamese networks. In: Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and First Multi-Modal Affective Computing of Large-Scale Multimedia Data. pp. 21–26. ACM (2018)
9. Livingstone, S.R., Peck, K., Russo, F.A.: RAVDESS: The Ryerson audio-visual database of emotional speech and song. In: 22nd Annual Meeting of the Canadian Society for Brain, Behaviour and Cognitive Science (CSBBCS). pp. 1459–1462 (2012)
10. Martin, O., Kotsia, I., Macq, B., Pitas, I.: The eNTERFACE'05 audio-visual emotion database. In: 22nd International Conference on Data Engineering Workshops (ICDEW'06). pp. 8–8. IEEE (2006)
11. Picard, R.W.: Affective computing. MIT Press (2000)
12. Sabri, M., Kurita, T.: Facial expression intensity estimation using Siamese and triplet networks. Neurocomputing 313