Fixed-MAML for Few-Shot Classification in Multilingual Speech Emotion Recognition
Anugunj Naman, IIIT Guwahati, [email protected]
Liliana Mancini, Cardiff University, UK, [email protected]
Abstract
In this paper, we analyze the feasibility of applying few-shot learning to the speech emotion recognition (SER) task. Current speech emotion recognition models work exceptionally well but fail when the input is multilingual. Moreover, such models perform adequately only when the training corpus is vast. The need for a large training corpus is a significant problem when working with a language that is unpopular or obscure. We attempt to solve the challenges of multilingualism and data scarcity by recasting the problem as a few-shot learning problem. We suggest relaxing the assumption that all N classes in an N-way, K-shot problem be new, and define an N+F-way problem where N and F are the numbers of emotion classes and predefined fixed classes, respectively. We propose this modification to the Model-Agnostic Meta-Learning (MAML) algorithm and call the resulting model F-MAML. This modification performs better than the original MAML and outperforms it on the EmoFilm dataset.
1. Introduction
Emotion recognition plays a significant role in many intelligent interfaces [1]. Even with recent advances in machine learning, it remains a challenging task, mainly because most publicly available annotated datasets in this domain are small in scale, which makes deep learning models prone to over-fitting. Another essential aspect of emotion recognition is the inherent multi-modality of emotional expression [2]. Emotional information can be captured through many modalities, including facial expressions, body postures, and EEG [3]. Of these, speech is arguably the most accessible. In addition to accessibility, speech signals contain many other emotional cues [4]. We therefore use speech signals as the basis for predicting emotion.

Generally, in the speech emotion recognition (SER) task, conventional supervised learning solves the problem efficiently given sufficient training data. Several studies on SER for different single corpora have been conducted over the past decades using language-dependent optimal acoustic feature sets. Such systems operate only in mono-lingual scenarios; changing the source corpus requires re-selecting the optimal acoustic features and re-training the system. Human emotion perception, however, has proved to be cross-lingual, even without understanding the language used [5]. An SER system is expected to recognize emotions in the same way.

However, an automatic SER system faces two significant problems. First, the training corpora available for many languages are very limited. Second, it is not clear which standard features are efficient in detecting emotions across different cultures. Commonalities and differences in human emotion perception across languages in the valence-activation (V-A) space have recently been studied [5]. It was revealed that the direction and distance from neutral to other emotions are similar across languages, while the neutral positions themselves are language-dependent.
In this paper, motivated by the above challenges, we want to simulate a scenario where one can provide a few labeled speech samples in any language and train a model on that language for a few iterations to get a robust SER system. This proposed scenario removes the requirement for a large amount of data, identifies the standard features efficient in detecting emotions for that culture, and fine-tunes to it accordingly.

Supervised learning has been extremely successful in computer vision, speech, and machine translation tasks, thanks to improvements in optimization technology, larger datasets, and streamlined designs of deep convolutional and recurrent architectures. Despite these successes, this learning setup does not cover many settings where learning is possible and desirable. One such instance is learning from very few examples, the so-called few-shot learning task [6]. Rather than depending on regularization to compensate for the lack of training data, researchers have explored ways to leverage the distribution of similar tasks, inspired by human learning [7]. Many useful solutions have been developed, and the most popular ones use meta-learning.

Meanwhile, most studies on few-shot learning are conducted on image tasks. We here attempt to apply those meta-learning solutions to SER systems. We formulate the problem mentioned above as a few-shot learning problem and analyze the performance of state-of-the-art model-level few-shot learning algorithms.

Meta-learning, also known as 'learning to learn,' aims to make quick adaptation to new tasks with only a few examples. Recently, many different meta-learning solutions have been proposed to solve few-shot learning problems. These solutions differ in whether they learn a shared metric [8, 9, 10, 11], a generic inference network [12, 13], a shared optimization algorithm [14, 15], or a shared initialization for the model parameters [16, 17, 18].
In this paper, we use the Model-Agnostic Meta-Learning (MAML) approach [16] for the following reasons:

1. It is a model-agnostic general framework that can easily be used on a new task.
2. It achieves state-of-the-art performance on existing few-shot learning tasks.

Few-shot learning is often defined as an N-way, K-shot problem, where N is the number of class labels in the target task and K is the number of examples of each class. Most previous studies assume that all N classes or labels are new. However, in real-life applications, these classes need not all be new. Thus, we further define an N+F-way, K-shot problem where N and F are the numbers of new classes and fixed classes, respectively. In this newly devised task, the model has to classify among both new classes and fixed classes. We propose a modification to the original MAML algorithm to solve this problem and call the new model F-MAML.

We conduct our experiments on the EmoFilm dataset [19] to simulate a scenario in SER. We compare our approach with two baselines: the conventional supervised learning approach and the MAML approach. Experimental results show that MAML and F-MAML lead to clear improvements over supervised learning, with F-MAML performing better than MAML. Our contributions in this paper are summarized here:

1. We analyze the feasibility of few-shot learning for training SER models.
2. We propose a more efficient method than MAML (F-MAML) to train future SER models for any language with few training examples.

The rest of the paper is organized as follows: In section 2, we discuss the background of our work. In section 3, we discuss our proposed method. In section 4, we describe the experiments in detail and report the results. In section 5, we conclude.

Figure 1. The MAML algorithm learns a good parameter initializer θ* by training across various meta-tasks such that it can adapt quickly to new tasks.
2. Background
In this section, we briefly introduce MAML, the basis of and motivation for our solution.

Model-Agnostic Meta-Learning (MAML) is one of the most popular meta-learning algorithms that aim to solve the few-shot learning problem. The main goal of MAML is to train a model initializer that can adapt to any new task using very few labeled examples and training iterations [16]. To reach this goal, the model is trained across several tasks, treating each entire task as a training example. The model is required to face different tasks so that it gets used to adapting to new ones. In this section, we describe the MAML training framework. As shown in Figure 1, the optimization procedure consists of two stages: a meta-learning stage on the training data and a fine-tuning stage on the testing tasks.
Given that the target evaluation task is an N-way, K-shot task, the model is trained across a set of tasks T, where each task T_i is also an N-way, K-shot task. In each iteration, a learning task, i.e., the meta-task T_i, is sampled according to a distribution over tasks p(T). Each T_i consists of a support set S_i and a query set Q_i.

Consider a model represented by a parametrized function f_θ with parameters θ. θ'_i is computed from θ through adaptation to task T_i. A loss function L_{S_i}(f_θ), the cross-entropy loss over support-set examples, guides the computation of θ'_i:

    L_{S_i}(f_θ) = − Σ_{(x_j, y_j) ∈ S_i} y_j log f_θ(x_j).    (1)

The one-step gradient update is as follows:

    θ'_i = θ − α ∇_θ L_{S_i}(f_θ).    (2)

Here, α is the learning rate, which can be a fixed hyperparameter or learned as in Meta-SGD [17]. This gradient update may be repeated for multiple steps.

After this, the model parameters are optimized with respect to θ based on the performance of f_{θ'_i} evaluated on the query set Q_i. L_{Q_i}(f_{θ'_i}) is another cross-entropy loss, over query-set examples:

    L_{Q_i}(f_{θ'_i}) = − Σ_{(x'_u, y'_u) ∈ Q_i} y'_u log f_{θ'_i}(x'_u).    (3)

Broadly speaking, MAML aims to optimize the model parameters such that a few gradient steps on a new task lead to maximally effective behavior on that task. At the end of each training iteration, the parameters θ are updated as follows:

    θ ← θ − β ∇_θ L_{Q_i}(f_{θ'_i}).    (4)

Here, β is the learning rate of the meta-learner. To increase the stability of training, a batch of tasks, rather than a single task, is sampled in each iteration, and the optimization averages the loss across the tasks. Thus, equation (4) generalizes to:

    θ ← θ − β ∇_θ Σ_i L_{Q_i}(f_{θ'_i}).    (5)

Fine-tuning is performed before the evaluation.
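The inner update of equation (2) and the meta-update of equation (5) can be sketched in a few lines of numpy for a linear softmax classifier. This is an illustrative sketch, not the authors' code: it uses the first-order approximation that ignores second derivatives (as in first-order MAML), and all function names are ours.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def xent(W, X, Y):
    """Cross-entropy loss of a linear softmax classifier (eqs. 1 and 3)."""
    P = softmax(X @ W)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def xent_grad(W, X, Y):
    """Analytic gradient of the cross-entropy loss w.r.t. W."""
    P = softmax(X @ W)
    return X.T @ (P - Y) / len(X)

def maml_outer_step(W, tasks, alpha=0.1, beta=0.01):
    """One meta-iteration: inner step per task (eq. 2), then an averaged
    meta-update on the query losses (eq. 5), first-order approximation."""
    meta_grad = np.zeros_like(W)
    for (Xs, Ys, Xq, Yq) in tasks:                 # support / query per task
        W_i = W - alpha * xent_grad(W, Xs, Ys)     # inner step, eq. (2)
        meta_grad += xent_grad(W_i, Xq, Yq)        # outer gradient, eq. (5)
    return W - beta * meta_grad
```

At fine-tuning time, only the inner step (eq. 2) is applied to the meta-learned W on the target task's support set.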
In an N-way, K-shot task, K examples from each of the N class labels are available at this stage in the target task's support set. The model trained in the meta-learning stage above is fine-tuned according to equation (2) for a few iterations. The updated model is then evaluated on the remaining unlabeled examples (the target task's query set).
3. Proposed Method
In the original MAML, it is assumed that all class labels in the target task are new. However, these labels do not necessarily need to all be new. In real-life applications, some class labels are already known, so more examples of those labels can be used in the meta-learning stage. In this paper we call them fixed classes, as we later fix their output positions in the neural network classifier. We call this task, which requires classifying among both new and fixed classes, an N+F-way, K-shot problem, where N, F, and K are the numbers of new emotion class labels, fixed class labels, and examples per new class label for fine-tuning, respectively. This problem of simultaneously classifying unseen and seen class labels has not been investigated in the original MAML. In our solution, we tackle it by proposing modifications to the MAML training framework. We believe that the N+F-way, K-shot problem is more realistic and that our modification to MAML applies to various tasks. We now describe our methodology for a few-shot SER task.

Figure 2. Framework of our F-MAML approach for few-shot SER.
Although the N+F-way, K-shot problem can be regarded as a specific form of the normal N-way, K-shot problem, solving it with the original MAML framework leads to performance degradation. Using the prior information about the F fixed classes, we modify the MAML framework in the following ways:

1. We fix the output positions of the fixed classes in the neural network classifier, i.e., their positions in the output layer are the same for every task.
2. These fixed classes occur in every meta-task T_i in the meta-learning stage.
3. Adaptation of the fixed classes is not needed in the fine-tuning stage, as they have already been learned in the meta-learning stage.

These three modifications to the original MAML make the proposed framework more effective in real applications.

We formulate a scenario for SER as an N+F-way, K-shot classification task. N is the number of emotions one wishes to recognize, and one should provide K speech audio samples for each such emotion. The fixed labels here are silence and neutral.

Figure 2 illustrates the framework of the F-MAML approach. The target data contains audio samples from one language not in the source data, while the source data contains audio examples from all other languages. The fixed classes are the same in the target and source data. In the meta-learning stage, several N+2-way, K-shot meta-tasks are sampled from the source data for each language. Each meta-task is similar to the target task. We expect to learn a model initializer that can adapt to the target task using the provided speech samples and emotion labels. We exclude the fixed class labels from the support set in both the meta-learning and fine-tuning stages. Since we can assume the availability of more training examples for the fixed classes, we keep them in the meta-tasks' query sets in the meta-learning stage. Moreover, the positions of the silence and neutral classes are fixed to the last positions of the network output (the orange area). Thus, we force our model to "recall" the fixed classes without the need for adaptation.
Algorithm 1: F-MAML approach for few-shot SER

Require: p(T): distribution over tasks
Require: X: training dataset
Require: Sil: silence class set; Neu: neutral class set
Require: S_i ⊂ X: support set; Q_i ⊂ [X ∪ Sil ∪ Neu] \ S_i: query set
Require: α, β: learning rates

1. Randomly initialize base model parameters θ.
2. while not done do
3.   Sample a batch of meta-tasks T_i ~ p(T).
4.   for all T_i do
5.     Sample a support set S_i ⊂ X.
6.     Compute the gradient of L_{S_i}(f_θ) using S_i, as shown in equation (1).
7.     Update the base model parameters with gradient descent: θ'_i = θ − α ∇_θ L_{S_i}(f_θ). (Steps 6-7 can be repeated several times.)
8.     Sample a query set Q_i from the union [X ∪ Sil ∪ Neu] \ S_i. (The emotion labels selected from X in Q_i and S_i within T_i are the same.)
9.     Compute the loss L_{Q_i}(f_{θ'_i}) using Q_i and the updated model f_{θ'_i}.
10.  end for
11.  Update the parameters θ using each Q_i and L_{Q_i}(f_{θ'_i}): θ ← θ − β ∇_θ Σ_i L_{Q_i}(f_{θ'_i}).
12. end while

Algorithm 1 summarizes the details of our approach. The algorithm described here is based on the work of [16] but differs in how the support set and the query set are sampled during the meta-learning stage, as introduced earlier in this section.
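The sampling scheme in Algorithm 1 can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name, episode format, and class names are ours. It builds one N+2-way, K-shot meta-task in which the two fixed classes always occupy the last output positions and appear only in the query set:

```python
import random

def sample_meta_task(source, fixed, n_way=5, k_shot=5, q_per_class=5, seed=None):
    """Build one (N+F)-way, K-shot episode for F-MAML.

    source: dict mapping emotion label -> list of examples
    fixed:  dict mapping fixed label ('silence', 'neutral') -> list of examples
    The N emotion classes get output indices 0..n_way-1; the fixed classes
    always sit at the last indices n_way..n_way+len(fixed)-1 and appear
    only in the query set (no adaptation is needed for them).
    """
    rng = random.Random(seed)
    emotions = rng.sample(sorted(source), n_way)
    support, query = [], []
    for idx, emo in enumerate(emotions):
        pool = rng.sample(source[emo], k_shot + q_per_class)
        support += [(x, idx) for x in pool[:k_shot]]
        query += [(x, idx) for x in pool[k_shot:]]
    for offset, lab in enumerate(sorted(fixed)):
        # fixed classes: query-only, fixed output position
        query += [(x, n_way + offset) for x in rng.sample(fixed[lab], q_per_class)]
    return support, query
```

Fixing the output indices of the fixed classes across all episodes is what lets the classifier "recall" them at fine-tuning time without adapting their weights.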
Figure 3. Spectrograms for different emotions in each language.
4. Experimentation
We conduct our experiments on the EmoFilm dataset [19]. It consists of 1115 clips with a mean length of 3.5 seconds: 341 English audio clips, with an average of 34.3 utterances per emotion; 410 Italian audio clips, with an average of 41.3 utterances per emotion; and 356 Spanish clips, with an average of 35.9 utterances per emotion (std 9). The higher number of Italian clips might be due to Italian being a more 'emotionally expressive' language; this could also relate to the pre-test made by Italian listeners, who may be better at perceiving emotions in their language [19]. The dataset is categorized into five emotion labels: happiness, sadness, anger, fear, and disgust. We formulate three 5-way, K-shot tasks using the same setup as the audio recognition tutorial in the official PyTorch documentation. Table 1 gives the total number of samples for each emotion in each language. The mel spectrograms for all five emotions in each language are shown in Figure 3. We perform three experiments:

1. The first experiment is SER in English, where English is used as the testing set while Spanish and Italian are used in training.
2. The second experiment is SER in Italian, where Italian is used as the testing set while English and Spanish are used in training.
3. The third experiment is SER in Spanish, where Spanish is used as the testing set while English and Italian are used in training.

The testing language is unseen in the meta-learning stage, and only K labeled examples of each label are available in the fine-tuning stage. The initialized model is fine-tuned on the labeled examples and evaluated on the unlabeled examples. The samples for the silence and neutral classes were self-generated with a mean length of 3.5 seconds.

Table 1. Dataset Details

Language | Total Samples | Samples per Emotion
English  | 341 | Fear 72, Disgust 50, Happiness 69, Anger 76, Sadness 74
Italian  | 410 | Fear 83, Disgust 68, Happiness 93, Anger 73, Sadness 93
Spanish  | 356 | Fear 63, Disgust 50, Happiness 76, Anger 82, Sadness 85
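The three leave-one-language-out experiments above can be sketched as follows (an illustrative reconstruction, not the authors' code; the per-emotion counts are copied from Table 1 and the split format is ours):

```python
# Per-emotion sample counts from Table 1
# (order: fear, disgust, happiness, anger, sadness).
PER_EMOTION = {
    "English": [72, 50, 69, 76, 74],
    "Italian": [83, 68, 93, 73, 93],
    "Spanish": [63, 50, 76, 82, 85],
}

def language_splits(langs):
    """Leave-one-language-out: each language is held out for
    fine-tuning/testing while the others form the source data."""
    return [{"test": t, "train": [l for l in langs if l != t]}
            for t in langs]

totals = {lang: sum(v) for lang, v in PER_EMOTION.items()}
splits = language_splits(list(PER_EMOTION))
```

Summing the rows recovers the per-language totals (341, 410, 356), and each split pairs one held-out test language with the other two as source data.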
The 3-4 second clips are sampled at 16 kHz. We use Mel-frequency cepstral coefficient (MFCC) features. For each clip, we extract 40-dimensional MFCCs with a frame length of 30 ms and a frame step of 10 ms. A convolutional neural network is adopted as the base model, containing 4 convolutional blocks. Each block comprises a 3×3 convolution with 64 filters, followed by ReLU and batch normalization [20]. The flattened layer after the convolutional blocks contains 576 neurons and is fully connected to the output layer with a linear function. We avoided using a ResNet architecture because it overfitted very quickly. The model is trained with a mini-batch size of 16 for 5-, 10-, and 20-shot classification. We set the learning rate α to 0.1 and β to 0.001; the learning rates were found using a grid search.

We compare our proposed approach with two baselines: the conventional supervised learning approach, which trains the model on the support set of the target task only (the detailed model for the supervised baseline is inspired from here), and the original MAML, which treats the 5+2-way problem as a 7-way problem. In the evaluation, we sample K examples from each class for fine-tuning the model and 25 examples per label for evaluation. We run 100 random tests and evaluate the different approaches on accuracy.

Table 2. Accuracy in 5-shot learning (rows: Supervised, MAML, F-MAML; columns: English, Italian, Spanish)

Table 3. Accuracy in 10-shot learning (rows: Supervised, MAML, F-MAML; columns: English, Italian, Spanish)

Table 4. Accuracy in 20-shot learning (rows: Supervised, MAML, F-MAML; columns: English, Italian, Spanish)
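The MFCC framing described above (30 ms windows with a 10 ms step at 16 kHz) fixes the time dimension of the CNN input. A minimal sketch of the frame-count arithmetic (illustrative only; the authors' exact feature pipeline is not given):

```python
def num_frames(num_samples, sr=16000, frame_ms=30, step_ms=10):
    """Number of full analysis frames for the given framing parameters."""
    frame_len = int(sr * frame_ms / 1000)   # 480 samples per 30 ms frame
    hop = int(sr * step_ms / 1000)          # 160 samples per 10 ms step
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop

# a 3.5 s clip at 16 kHz: 56000 samples -> 348 frames
frames = num_frames(int(3.5 * 16000))
```

Each frame yields one 40-dimensional MFCC vector, so a 3.5 s clip maps to roughly a 40×348 feature matrix before being fed to the convolutional blocks.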
Tables 2, 3, and 4 list the performance of the 5-, 10-, and 20-shot tasks on SER in the English, Italian, and Spanish languages, respectively. Not surprisingly, the MAML-based approaches perform much better than conventional supervised learning in the few-shot setting. This improvement comes from a good initialization of the model's parameters, which enables fast learning on a new task within a few gradient steps while avoiding the overfitting that can occur with a small dataset. Finally, our proposed F-MAML outperforms the original MAML. We attribute this improvement to the prior information about the fixed classes, which allows more efficient fine-tuning to new tasks than the original MAML. Figure 4 shows the loss of the original MAML compared to F-MAML on 5-shot learning. F-MAML converges more quickly, in fewer steps, than the original MAML.
5. Conclusions and Future Work
In this paper, we simulated a scenario of SER as a few-shot learning problem. We defined it as an N+F-way, K-shot problem and proposed a modification to the Model-Agnostic Meta-Learning (MAML) algorithm in which the F classes are kept fixed. Experiments conducted on the EmoFilm dataset show that our approach performs best compared to the baselines. In the future, we will test the feasibility of the approach on Indic languages and Mandarin-derived languages, since these languages differ vastly from each other.

Figure 4. Convergence comparison of MAML vs. F-MAML.
References

[1] R. W. Picard. Affective Computing. MIT Press, 2000.
[2] Jeng-Lin Li and Chi-Chun Lee. Attentive to individual: A multimodal emotion recognition network with personalized attention profile. In Gernot Kubin and Zdravko Kacic, editors, Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pages 211-215. ISCA, 2019.
[3] Nicu Sebe, Ira Cohen, Theo Gevers, and Thomas S. Huang. Multimodal approaches for emotion recognition: a survey. In Simone Santini, Raimondo Schettini, and Theo Gevers, editors, Internet Imaging VI, volume 5670, pages 56-67. International Society for Optics and Photonics, SPIE, 2005.
[4] John Kim and Rif A. Saurous. Emotion recognition from human speech using temporal information and deep learning. In Proc. Interspeech 2018, pages 937-940, 2018.
[5] Xingfeng Li and Masato Akagi. Multilingual speech emotion recognition system based on a three-layer model. In Interspeech 2016, pages 3608-3612, 2016.
[6] Victor Garcia Satorras and Joan Bruna Estrach. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
[7] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.
[8] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 3637-3645, Red Hook, NY, USA, 2016. Curran Associates Inc.
[9] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 4080-4090, Red Hook, NY, USA, 2017. Curran Associates Inc.
[10] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1199-1208, 2018.
[11] T. Ko, Y. Chen, and Q. Li. Prototypical networks for small footprint text-independent speaker verification. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6804-6808, 2020.
[12] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of ICML'16, pages 1842-1850. JMLR.org, 2016.
[13] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
[14] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of ICML'17, pages 2554-2563. JMLR.org, 2017.
[15] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
[16] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of ICML'17, pages 1126-1135. JMLR.org, 2017.
[17] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning, 2017.
[18] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms, 2018.
[19] Emilia Parada-Cabaleiro, Giovanni Costantini, Anton Batliner, Alice Baird, and Björn Schuller. Categorical vs dimensional perception of Italian emotional speech. In Proc. Interspeech 2018, pages 3638-3642, 2018.
[20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Volume 37, ICML'15, pages 448-456. JMLR.org, 2015.