A Siamese Neural Network with Modified Distance Loss For Transfer Learning in Speech Emotion Recognition
Kexin Feng and Theodora Chaspari
HUman Bio-Behavioral Signals (HUBBS) Lab, Department of Computer Science and Engineering, Texas A&M University
{kexin0814, chaspari}@tamu.edu

Abstract.
Automatic emotion recognition plays a significant role in the process of human-computer interaction and the design of Internet of Things (IoT) technologies. Yet, a common problem in emotion recognition systems lies in the scarcity of reliable labels. By modelling pairwise differences between samples of interest, a Siamese network can help to mitigate this challenge, since it requires fewer samples than traditional deep learning methods. In this paper, we propose a distance loss, which can be applied to Siamese network fine-tuning, optimizing the model based on the relative distance between same-class and different-class pairs. Our system uses samples from the source data to pre-train the weights of the proposed Siamese neural network, which are then fine-tuned based on the target data. We present an emotion recognition task that uses speech, since it is one of the most ubiquitous and frequently used bio-behavioral signals. Our target data comes from the RAVDESS dataset, while CREMA-D and eNTERFACE05 are used as source data. Our results indicate that the proposed distance loss is able to greatly benefit the fine-tuning process of the Siamese network. Also, the selection of source data has a larger effect on the Siamese network performance than the number of frozen layers. These findings suggest the great potential of applying the Siamese network and modelling pairwise differences in the field of transfer learning for automatic emotion recognition.
Keywords:
Emotion recognition · Speech · Transfer learning · Fine-tuning · Siamese neural network.
Automatic emotion recognition refers to identifying emotions using various human-related signals such as facial expressions, physiological signals, and speech [11]. It can potentially benefit many applications related to human-computer interaction, health informatics, or even the design of smart cities and communities. Among these signals, speech data is largely explored due to its relatively high availability and ease of collection. Acquiring reliable annotations for such large amounts of audio clips can be extremely hard, posing a significant impediment to the reliable training of emotion recognition systems.

A great number of machine learning approaches have been proposed to address this challenge, and transfer learning has been shown to be one of the most promising directions. Transfer learning methods such as fine-tuning [1] make use of a model that is well trained on another emotion dataset. Also, progressive neural networks (PNN) [5], which are less forgetful when applied to the target domain, have been proposed and have obtained good performance in leveraging knowledge across various conditions. However, these methods might be less effective when there is a very limited amount of data in the target domain, preventing adequate training of the corresponding machine learning models.

Modelling the pairwise differences between samples of interest (e.g., through Siamese networks [7]) could be a potential solution to address small amounts of labelled target data in various applications. In this paper, we propose the use of the Siamese network structure for the task of transfer learning in speech emotion recognition, trained using the fine-tuning method and further optimized using a distance loss that incorporates the relative distance among pairs. A publicly available dataset, RAVDESS [9], was used as target data due to its relatively small number of speakers and sample size. Two other emotional datasets, CREMA-D [3] and eNTERFACE05 [10], are used as source data to compare the influence of different source domains. Our results indicate that the selection of source data can have a significant impact on Siamese network fine-tuning, and that our distance loss can significantly benefit the Siamese network fine-tuning process, yielding an improvement of up to 7% compared to fine-tuning the Siamese network without the proposed distance loss.
The Siamese network is a type of neural network which takes a pair of data samples as input and decides whether the corresponding samples belong to the same or different classes. This network structure was first introduced by Bromley et al. for the task of signature verification, by comparing whether two signatures are from the same person [2]. In a Siamese network, each data sample in a pair is an input to the network, and the network outputs an extracted feature vector for it. To the best of our knowledge, this is the first time the Siamese network has been applied and further optimized for an emotion-related transfer learning task.

The target data comes from the RAVDESS dataset [9], which includes 47 minutes of audio from 24 actors; each audio sample is further labeled by 247 annotators. The source data comes from two different publicly available datasets: eNTERFACE05 [10] and CREMA-D [3]. eNTERFACE05 contains 45 minutes of data from 42 participants, who were asked to express their emotions in scripted sentences after listening to specific stories. The CREMA-D dataset contains 165 minutes of audio data from 91 actors, who were asked to perform until they were approved by a director; each audio clip is further labeled by human annotators.

Speech samples depicting four common emotions (anger, happiness, sadness, and fear) across all datasets were selected and processed using the openSMILE toolkit [4]. A 64-dimensional feature set, part of the INTERSPEECH09 emotion challenge feature set [13], is extracted from each audio segment; it includes the means and standard deviations of the speech intensity, zero-crossing rate, voicing probability, fundamental frequency, and the first 12 Mel-frequency cepstral coefficients (MFCCs). The first-order derivative of each of these features is also computed.
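As a rough, self-contained illustration of this kind of segment-level feature set, the sketch below approximates it in Python with librosa rather than the openSMILE IS09 configuration actually used in this work; the sampling rate, F0 range, and the specific librosa estimators (RMS energy as an intensity proxy, pYIN voicing probability) are assumptions and will not reproduce the exact openSMILE values.

```python
import numpy as np
import librosa

def is09_style_features(path, sr=16000):
    """Approximate 64-dim segment features: mean and std of intensity (RMS),
    zero-crossing rate, voicing probability, F0, and 12 MFCCs, plus the same
    functionals of their first-order deltas (16 LLDs x 2 x 2 = 64 values)."""
    y, sr = librosa.load(path, sr=sr)
    rms = librosa.feature.rms(y=y)                         # intensity proxy, shape (1, T)
    zcr = librosa.feature.zero_crossing_rate(y)            # (1, T)
    f0, _, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = np.nan_to_num(f0)[np.newaxis, :]                  # unvoiced frames -> 0
    voiced_prob = voiced_prob[np.newaxis, :]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)     # (12, T)

    functionals = []
    for lld in (rms, zcr, voiced_prob, f0, mfcc):
        for track in (lld, librosa.feature.delta(lld)):    # static + first-order delta
            functionals.append(track.mean(axis=1))
            functionals.append(track.std(axis=1))
    return np.concatenate(functionals)                     # 64-dimensional vector
```

Applying such a function to every selected audio clip would produce the segment-level feature matrix used in the experiments below, under the stated assumptions.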
In this section, we discuss our three baseline methods: in-domain training, out-of-domain training, and Siamese network fine-tuning. We then introduce the proposed distance loss and how it is applied to the Siamese network. All the methods are based on an optimized Siamese network structure, which contains 64, 32, and 16 nodes with ReLU activation for the three feature-extractor layers, followed by a 16-node decision-making layer. During the training process, the cross-entropy loss is calculated and used to update the weights of the model. After the Siamese network is trained, each test sample is compared with all the available labeled target data, and a final classification decision is generated based on the sum of log-similarities with each class. The unweighted average recall (UAR) is used to evaluate the performance of each method.
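A minimal PyTorch sketch of this structure and decision rule is given below. The stated layer sizes and the L1 comparison follow the description above; the exact decision-layer activations, the sigmoid similarity output, and the optimizer settings are our assumptions, since the text only specifies the node counts and the use of a cross-entropy loss.

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Shared 64-32-16 ReLU feature extractor, L1 difference of the two
    feature vectors, and a 16-node decision layer for same/different pairs."""
    def __init__(self, in_dim=64):
        super().__init__()
        self.extractor = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
        )
        self.decision = nn.Sequential(
            nn.Linear(16, 16), nn.ReLU(),
            nn.Linear(16, 1),                                # logit for "same-class pair"
        )

    def embed(self, x):
        return self.extractor(x)

    def forward(self, x1, x2):
        diff = torch.abs(self.embed(x1) - self.embed(x2))    # L1 difference of embeddings
        return self.decision(diff).squeeze(-1)

def classify(model, x_test, ref_x, ref_y, classes):
    """Decision rule sketch: compare the test sample with every labeled target
    sample and pick the class with the largest sum of log-similarities.
    ref_y: tensor of class ids for the labeled target samples."""
    with torch.no_grad():
        sims = torch.sigmoid(model(x_test.expand(len(ref_x), -1), ref_x))
    scores = {c: torch.log(sims[ref_y == c] + 1e-8).sum() for c in classes}
    return max(scores, key=scores.get)

# Toy pair-based training step with a (binary) cross-entropy loss.
model = SiameseNet()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x1, x2 = torch.randn(32, 64), torch.randn(32, 64)            # batch of feature pairs
same = torch.randint(0, 2, (32,)).float()                    # 1 = same emotion, 0 = different
loss = criterion(model(x1, x2), same)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```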
In order to assess whether the proposed loss benefits the knowledge transfer efficiency when limited data samples are available, we propose three baseline methods. The first baseline performs out-of-domain training (OODT). A Siamese network is trained on all the source data and tested on the target dataset without any adaptation. For this baseline, the data samples from the source are used to determine the final classes, since no labeled target data is available.
Fig. 1. Schematic representation of the proposed Siamese network fine-tuning with modified distance loss. (Diagram: the two input samples pass through weight-sharing feature extractors; the L1-norm of the two feature vectors feeds a decision-making layer that predicts whether the pair is same or different; a cross-entropy loss and the distance loss are applied.)
The second baseline is in-domain training (IDT), which is trained using a sufficient amount of target data. A leave-one-subject-out (LOSO) cross-validation is performed. More specifically, samples from a given speaker are used for testing, and all other data are used for training of the Siamese network. This process is repeated until all the speakers have been used for testing. This baseline serves as an upper limit of potential knowledge transfer.

The third baseline is the traditional fine-tuning method from the field of transfer learning, used to further assess the benefit provided by our proposed distance loss. The models trained in OODT are used for fine-tuning. One random data sample for each emotion from 2 randomly chosen speakers is selected as labeled data, yielding a total of 8 samples (2 speakers × 4 emotions). Different numbers of adopted speakers, up to 20, are tested to evaluate in detail the knowledge transfer efficiency.
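The two data-handling steps just described can be sketched as follows; the toy arrays, the scikit-learn LeaveOneGroupOut splitter, and the seeded random selection are illustrative assumptions rather than the authors' exact protocol.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
emotions = np.array(["anger", "happiness", "sadness", "fear"])

# Toy stand-ins for the 64-dim features, emotion labels, and speaker ids.
X = rng.normal(size=(240, 64))
y = np.tile(emotions, 60)                      # every speaker covers all four emotions
speakers = np.repeat(np.arange(24), 10)        # 24 speakers x 10 samples

# In-domain training (IDT) baseline: leave-one-subject-out cross-validation.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    pass  # train a Siamese network on train_idx, compute UAR on test_idx

# Fine-tuning baseline: one random sample per emotion from each of n_speakers
# randomly chosen target speakers (2 speakers x 4 emotions = 8 labeled samples).
def select_labeled_target(y, speakers, n_speakers=2):
    chosen = rng.choice(np.unique(speakers), size=n_speakers, replace=False)
    idx = [rng.choice(np.where((speakers == spk) & (y == emo))[0])
           for spk in chosen for emo in emotions]
    return np.asarray(idx)

labeled_idx = select_labeled_target(y, speakers)   # indices of the adapted target samples
```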
Inspired by the fact that the traditional fine-tuning method fails to efficiently leverage the knowledge when limited target data is available, we designed a distance loss to maximize the difference between the distances of same-class pairs and different-class pairs. Let X_s be the set of same-class pairs in a batch, and X_d the set of different-class pairs. Let g_W(x) be the function, parameterized by the weights W of the Siamese network, that performs the transformation between the original input x and the extracted feature vector. The parameter W is learned by minimizing the average relative distance between pairs X_s of the same class and maximizing the distance between pairs X_d of different classes:

W^* = \arg\min_W \frac{\frac{1}{|X_d|}\sum_{(x,x') \in X_d} \|g_W(x) - g_W(x')\| + \frac{1}{|X_s|}\sum_{(x,x') \in X_s} \|g_W(x) - g_W(x')\|}{\frac{1}{|X_d|}\sum_{(x,x') \in X_d} \|g_W(x) - g_W(x')\| - \frac{1}{|X_s|}\sum_{(x,x') \in X_s} \|g_W(x) - g_W(x')\|}    (1)

As indicated in the previous description, this loss is only applied to the feature extraction layers at the end of a batch in the Siamese network fine-tuning process, in order to minimize the possible influence on the decision-making process (Figure 1). Apart from this loss, the rest of the method remains the same as the fine-tuning baseline.

In an effort to discuss the domain differences between the speech emotion datasets, we first examine the unweighted average recall (UAR) for the four-emotion classification task with the OODT baseline. This yields a 32.8% UAR on the RAVDESS dataset when using eNTERFACE05 as source and 29.3% when using CREMA-D as source. We then examine the upper limit of classification performance using the proposed features and model structure, and obtain a UAR of 50.0% with in-domain training.

Fine-tuning of the Siamese neural network without the proposed loss is used to illustrate the knowledge transfer efficiency. This approach results in a minor improvement when using eNTERFACE05 as source data, at 32.9%, and a relatively large improvement when using CREMA-D as source data, at 37.8%. We finally added our proposed distance loss to the fine-tuning process and obtained a significant improvement, with 39.9% using eNTERFACE05 and 43.3% using CREMA-D. A comparison of the performance of the different methods can be found in Table 1. Our results also indicate that, compared with the number of frozen layers, the choice of source data may play a more important role in Siamese network fine-tuning (Figure 2).

Table 1. Unweighted average recall (UAR%) of the out-of-domain training (OODT), in-domain training (IDT), and the best results obtained among the different numbers of frozen layers and adopted speakers using Siamese NN fine-tuning with/without the modified loss.

source                                      eNTERFACE05   CREMA-D
OODT                                        32.8          29.3
IDT                                         50.0          50.0
Siamese NN fine-tuning                      32.9          37.8
Siamese NN fine-tuning with modified loss   39.9          43.4
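For concreteness, a minimal sketch of the distance loss follows, under the reading of Eq. (1) given above. The epsilon guard and the restriction of the resulting gradient to the feature-extractor parameters (the text states the loss is only applied to the feature extraction layers; the mechanism shown here is our assumption) are illustrative choices.

```python
import torch

def modified_distance_loss(f1, f2, same, eps=1e-8):
    """Distance loss per the reading of Eq. (1): ratio of the sum to the
    difference of the average pairwise embedding distances of different-class
    and same-class pairs. Minimizing it pulls same-class pairs together and
    pushes different-class pairs apart, assuming the different-class distance
    stays larger than the same-class one.
    f1, f2: embeddings of the two pair members; same: 1 for same-class pairs."""
    dist = torch.norm(f1 - f2, dim=1)
    d_same = dist[same == 1].mean()
    d_diff = dist[same == 0].mean()
    return (d_diff + d_same) / (d_diff - d_same + eps)

# To apply the loss to the feature extractor only at the end of a batch, one
# could compute f1 = model.embed(x1), f2 = model.embed(x2), evaluate the loss,
# and step an optimizer that holds only model.extractor.parameters().
```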
Our results indicate that including 3-5 speakers in the Siamese NN fine-tuning with modified loss may result in the best trade-off between performance and the number of data samples used for training.
Fig. 2. Unweighted average recall (UAR) for different numbers of frozen layers and different numbers of speakers adopted in the fine-tuning process, with and without the distance loss: (a) no frozen layers, (b) first layer frozen, (c) first two layers frozen. Each panel compares Siamese NN fine-tuning and Siamese NN fine-tuning with modified loss, for the eNTERFACE05 and CREMA-D sources.
This can potentially be explained by the fact that this number of speakers might not be enough to fully capture the distribution of the target dataset. Performance is degraded when a smaller number of target speakers is included. Also, if the data distribution includes a lot of variability, the Siamese NN with modified loss will be greatly restricted, since pairwise differences are less likely to express the class information.

There has not been much research on few-shot emotion recognition with speech signals, where data from only a small number of speakers is included in the target domain. The most comparable results are from Gideon et al., whose approach based on progressive neural networks achieved a limited improvement (around 3%) compared to traditional fine-tuning [5]. Our proposed distance loss is able to bring a relatively larger improvement of about 7% compared to fine-tuning, without a significant increase in computational cost. Even though Siamese NN fine-tuning with the modified loss still fails to outperform in-domain training, it shows great potential. We will attempt to modify the proposed methodology and explore whether it can reach in-domain performance as part of our future work.
We propose a distance loss which is based on the relative distance between same-class and different-class pairs. Such a loss can raise the upper limit of performance and increase knowledge transfer efficiency when very limited target data is available. Our results also indicate that the selection of source data plays a more important role than the number of frozen layers. Findings of this work can be applied to other tasks using Siamese networks, fine-tuning, or few-shot learning. The use of distances between different emotion classes can be a fundamental step in understanding the differences between emotion categories.

As part of our future work, we plan to explore the usage of pairs in addressing domain mismatch and to mitigate the influence of different source data in the process of knowledge transfer. We will further perform multi-source transfer learning by using pairwise information to select proper source data from each dataset. Finally, we plan to understand the distances between different emotions with the help of the pairing information.
References
1. Badshah, A.M., Ahmad, J., Rahim, N., Baik, S.W.: Speech emotion recognition from spectrograms with deep convolutional neural network. In: 2017 International Conference on Platform Technology and Service (PlatCon). pp. 1–5. IEEE (2017)
2. Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a "siamese" time delay neural network. In: Advances in Neural Information Processing Systems. pp. 737–744 (1994)
3. Cao, H., Cooper, D.G., Keutmann, M.K., Gur, R.C., Nenkova, A., Verma, R.: CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing 5(4), 377–390 (2014)
4. Eyben, F., Wöllmer, M., Schuller, B.: openSMILE: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia. pp. 1459–1462. ACM (2010)
5. Gideon, J., Khorram, S., Aldeneh, Z., Dimitriadis, D., Provost, E.M.: Progressive neural networks for transfer learning in emotion recognition. arXiv preprint arXiv:1706.03256 (2017)
6. Huang, J., Li, Y., Tao, J., Lian, Z., et al.: Speech emotion recognition from variable-length inputs with triplet loss function. In: Interspeech. pp. 3673–3677 (2018)
7. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop. vol. 2 (2015)
8. Lian, Z., Li, Y., Tao, J., Huang, J.: Speech emotion recognition via contrastive loss under siamese networks. In: Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and First Multi-Modal Affective Computing of Large-Scale Multimedia Data. pp. 21–26. ACM (2018)
9. Livingstone, S.R., Peck, K., Russo, F.A.: RAVDESS: The Ryerson audio-visual database of emotional speech and song. In: 22nd Annual Meeting of the Canadian Society for Brain, Behaviour and Cognitive Science (CSBBCS). pp. 1459–1462 (2012)
10. Martin, O., Kotsia, I., Macq, B., Pitas, I.: The eNTERFACE'05 audio-visual emotion database. In: 22nd International Conference on Data Engineering Workshops (ICDEW'06). pp. 8–8. IEEE (2006)
11. Picard, R.W.: Affective computing. MIT Press (2000)
12. Sabri, M., Kurita, T.: Facial expression intensity estimation using Siamese and triplet networks. Neurocomputing 313