Fixed-MAML for Few-Shot Classification in Multilingual Speech Emotion Recognition
Anugunj Naman, IIIT Guwahati, [email protected]
Liliana Mancini, Cardiff University, UK, [email protected]
Abstract
In this paper, we analyze the feasibility of applying few-shot learning to the speech emotion recognition (SER) task. Current speech emotion recognition models work exceptionally well but fail when the input is multilingual. Moreover, such models perform adequately only when the training corpus is vast. The need for a large training corpus is a significant problem when working with a language that is unpopular or obscure. We attempt to solve the challenges of multilingualism and data scarcity by recasting the problem as a few-shot learning problem. We suggest relaxing the assumption that all N classes in an N-way, K-shot problem be new, and define an N+F-way problem where N and F are the numbers of emotion classes and predefined fixed classes, respectively. We propose this modification to the Model-Agnostic Meta-Learning (MAML) algorithm and call the resulting model F-MAML. This modification performs better than the original MAML and outperforms it on the EmoFilm dataset.
1. Introduction
Emotion recognition plays a significant role in many intelligent interfaces [1]. Even with recent advances in machine learning, it remains a challenging task, mainly because most publicly available annotated datasets in this domain are small in scale, which makes deep learning models prone to over-fitting. Another essential aspect of emotion recognition is the inherent multi-modality of emotional expression [2]. Emotional information can be captured through many modalities, including facial expressions, body postures, and EEG [3]. Of these, speech is arguably the most accessible. In addition to accessibility, speech signals contain many other emotional cues [4]. We therefore use speech signals as the basis for predicting emotion.

Generally, in the speech emotion recognition (SER) task, conventional supervised learning solves the problem efficiently given sufficient training data. Several studies on SER for different single corpora have been conducted over the past decades using language-dependent optimal acoustic feature sets. Such systems operate only in mono-lingual scenarios; changing the source corpus requires re-selecting the optimal acoustic features and re-training the system. Human emotion perception, however, has proved to be cross-lingual, even without understanding the language used [5]. An SER system is expected to recognize emotions in the same way.

However, an automatic SER system faces two significant problems. First, the training corpora available for many languages are very limited. Second, it is not clear which standard features are efficient in detecting emotions across different cultures. Commonalities and differences in human emotion perception across languages in the valence-activation (V-A) space have recently been studied [5]. It was revealed that the direction and distance from neutral to other emotions are similar across languages, while the neutral positions themselves are language-dependent.
In this paper, motivated by the above challenges, we want to simulate a scenario where one can provide a few labeled speech samples in any language and train a model on that language for a few iterations to get a robust SER system. This proposed scenario removes the requirement for a large amount of data, identifies the standard features efficient in detecting emotions for that culture, and fine-tunes to it accordingly.

Supervised learning has been extremely successful in computer vision, speech, and machine translation tasks, thanks to improvements in optimization technology, larger datasets, and streamlined designs of deep convolutional and recurrent architectures. Despite these successes, this learning setup does not cover many settings where learning is possible and desirable. One such instance is learning from very few examples, the so-called few-shot learning task [6]. Rather than depending on regularization to compensate for the lack of training data, researchers have explored ways to leverage the distribution of similar tasks, inspired by human learning [7]. Many useful solutions have been developed, and the most popular ones use meta-learning.

Meanwhile, most studies on few-shot learning are conducted on image tasks. We here attempt to apply those meta-learning solutions to SER systems. We formulate the problem mentioned above as a few-shot learning problem and analyze the performance of state-of-the-art model-level few-shot learning algorithms.

Meta-learning, also known as 'learning to learn,' aims to make quick adaptation to new tasks with only a few examples. Recently, many different meta-learning solutions have been proposed to solve few-shot learning problems. These solutions differ in whether they learn a shared metric [8, 9, 10, 11], a generic inference network [12, 13], a shared optimization algorithm [14, 15], or a shared initialization for the model parameters [16, 17, 18].
In this paper, we use the Model-Agnostic Meta-Learning (MAML) approach [16] for the following reasons:

1. It is a model-agnostic general framework that can easily be used on a new task.
2. It achieves state-of-the-art performance on existing few-shot learning tasks.

Few-shot learning is often defined as an N-way, K-shot problem, where N is the number of class labels in the target task and K is the number of examples of each class. Most previous studies assume that all N classes or labels are new. However, in real-life applications, these classes need not all be new. Thus, we further define an N+F-way, K-shot problem where N and F are the numbers of new classes and fixed classes, respectively. In this newly devised task, the model has to classify among both new classes and fixed classes. We propose a modification to the original MAML algorithm to solve this problem and call the new model F-MAML.

We conduct our experiments on the EmoFilm dataset [19] to simulate a scenario in SER. We compare our approach with two baselines: the conventional supervised learning approach and the MAML approach. Experimental results show that MAML and F-MAML lead to clear improvements over supervised learning, with F-MAML performing better than MAML. Our contributions in this paper are summarized here:

1. We analyze the feasibility of few-shot learning for training SER models.
2. We propose a more efficient method than MAML (F-MAML) to train future SER models for any language with few training examples.

The rest of the paper is organized as follows: In section 2, we discuss the background of our work. In section 3, we discuss our proposed method. In section 4, we describe the experiments in detail and report the results. In section 5, we conclude.

Figure 1. The MAML algorithm learns a good parameter initializer θ* by training across various meta-tasks such that it can adapt quickly to new tasks.
2. Background
In this section, we briefly introduce MAML, the basis of and motivation for our solution.

Model-Agnostic Meta-Learning (MAML) is one of the most popular meta-learning algorithms that aim to solve the few-shot learning problem. The main goal of MAML is to train a model initializer that can adapt to any new task using very few labeled examples and training iterations [16]. To reach this goal, the model is trained across several tasks, treating each entire task as a training example. The model is required to face different tasks so that it gets used to adapting to new ones. In this section, we describe the MAML training framework. As shown in Figure 1, the optimization procedure consists of two stages: a meta-learning stage on the training data and a fine-tuning stage on the testing tasks.
Given that the target evaluation task is an N-way, K-shot task, the model is trained across a set of tasks T, where each task T_i is also an N-way, K-shot task. In each iteration, a learning task, i.e., the meta-task T_i, is sampled according to a distribution over tasks p(T). Each T_i consists of a support set S_i and a query set Q_i.

Consider a model represented by a parametrized function f_θ with parameters θ. θ'_i is computed from θ through adaptation to task T_i. A loss function L_{S_i}(f_θ), the cross-entropy loss over support-set examples, guides the computation of θ'_i:

    L_{S_i}(f_θ) = − Σ_{(x_j, y_j) ∈ S_i} y_j log f_θ(x_j).    (1)

The one-step gradient update is as follows:

    θ'_i = θ − α ∇_θ L_{S_i}(f_θ).    (2)

Here, α is the learning rate, which can be a fixed hyperparameter or learned as in Meta-SGD [17]. This gradient update may be repeated for multiple steps.

After this, the model parameters are optimized with respect to θ based on the performance of f_{θ'_i} evaluated on the query set Q_i. L_{Q_i}(f_{θ'_i}) is another cross-entropy loss, over query-set examples:

    L_{Q_i}(f_{θ'_i}) = − Σ_{(x'_u, y'_u) ∈ Q_i} y'_u log f_{θ'_i}(x'_u).    (3)

Broadly speaking, MAML aims to optimize the model parameters such that a few gradient steps on a new task lead to maximally effective behavior on that task. At the end of each training iteration, the parameters θ are updated as follows:

    θ ← θ − β ∇_θ L_{Q_i}(f_{θ'_i}).    (4)

Here, β is the learning rate of the meta-learner. To increase the stability of training, a batch of tasks, rather than a single task, is sampled in each iteration, and the optimization averages the loss across the tasks. Thus, equation (4) generalizes to:

    θ ← θ − β ∇_θ Σ_i L_{Q_i}(f_{θ'_i}).    (5)

Fine-tuning is performed before the evaluation.
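The inner update of equation (2) and the meta-update of equation (5) can be sketched in a few lines of numpy for a linear softmax classifier. This is an illustrative sketch, not the authors' code: it uses the first-order approximation that ignores second derivatives (as in first-order MAML), and all function names are ours.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def xent(W, X, Y):
    """Cross-entropy loss of a linear softmax classifier (eqs. 1 and 3)."""
    P = softmax(X @ W)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def xent_grad(W, X, Y):
    """Analytic gradient of the cross-entropy loss w.r.t. W."""
    P = softmax(X @ W)
    return X.T @ (P - Y) / len(X)

def maml_outer_step(W, tasks, alpha=0.1, beta=0.01):
    """One meta-iteration: inner step per task (eq. 2), then an averaged
    meta-update on the query losses (eq. 5), first-order approximation."""
    meta_grad = np.zeros_like(W)
    for (Xs, Ys, Xq, Yq) in tasks:                 # support / query per task
        W_i = W - alpha * xent_grad(W, Xs, Ys)     # inner step, eq. (2)
        meta_grad += xent_grad(W_i, Xq, Yq)        # outer gradient, eq. (5)
    return W - beta * meta_grad
```

At fine-tuning time, only the inner step (eq. 2) is applied to the meta-learned W on the target task's support set.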
In an N-way, K-shot task, K examples from each of the N class labels are available at this stage in the target task's support set. The model trained in the meta-learning stage above is fine-tuned according to equation (2) for a few iterations. The updated model is then evaluated on the remaining unlabeled examples (the target task's query set).
3. Proposed Method
In the original MAML, it is assumed that all class labels in the target task are new. However, these labels do not necessarily need to all be new. In real-life applications, some class labels are already known, so more examples of those labels can be used in the meta-learning stage. In this paper we call them fixed classes, as we later fix their output positions in the neural network classifier. We call this task, which requires classifying among both new and fixed classes, an N+F-way, K-shot problem, where N, F, and K are the numbers of new emotion class labels, fixed class labels, and examples per new class label for fine-tuning, respectively. This problem of simultaneously classifying unseen and seen class labels has not been investigated in the original MAML. In our solution, we tackle it by proposing modifications to the MAML training framework. We believe that the N+F-way, K-shot problem is more realistic and that our modification to MAML applies to various tasks. We now describe our methodology for a few-shot SER task.

Figure 2. Framework of our F-MAML approach for few-shot SER.
Although the N+F-way, K-shot problem can be regarded as a specific form of the normal N-way, K-shot problem, solving it with the original MAML framework leads to performance degradation. Using the prior information about the F fixed classes, we modify the MAML framework in the following ways:

1. We fix the output positions of the fixed classes in the neural network classifier, i.e., their positions in the output layer are the same for every task.
2. These fixed classes occur in every meta-task T_i in the meta-learning stage.
3. Adaptation of the fixed classes is not needed in the fine-tuning stage, as they have already been learned in the meta-learning stage.

These three modifications to the original MAML make the proposed framework more effective in real applications.

We formulate a scenario for SER as an N+F-way, K-shot classification task. N is the number of emotions one wishes to recognize, and one should provide K speech audio samples for each such emotion. The fixed labels here are silence and neutral.

Figure 2 illustrates the framework of the F-MAML approach. The target data contains audio samples from one language not in the source data, while the source data contains audio examples from all other languages. The fixed classes are the same in the target and source data. In the meta-learning stage, several N+2-way, K-shot meta-tasks are sampled from the source data for each language. Each meta-task is similar to the target task. We expect to learn a model initializer that can adapt to the target task using the provided speech samples and emotion labels. We exclude the fixed class labels from the support set in both the meta-learning and fine-tuning stages. Since we can assume the availability of more training examples for the fixed classes, we keep them in the meta-tasks' query sets in the meta-learning stage. Moreover, the positions of the silence and neutral classes are fixed to the last positions of the network output (the orange area). Thus, we force our model to "recall" the fixed classes without the need for adaptation.
Algorithm 1: F-MAML approach for few-shot SER

Require: p(T): distribution over tasks
Require: X: training dataset
Require: Sil: silence class set; Neu: neutral class set
Require: S_i ⊂ X: support set; Q_i ⊂ [X ∪ Sil ∪ Neu] \ S_i: query set
Require: α, β: learning rates

1. Randomly initialize base model parameters θ.
2. while not done do
3.   Sample a batch of meta-tasks T_i ~ p(T).
4.   for all T_i do
5.     Sample a support set S_i ⊂ X.
6.     Compute the gradient of L_{S_i}(f_θ) using S_i, as shown in equation (1).
7.     Update the base model parameters with gradient descent: θ'_i = θ − α ∇_θ L_{S_i}(f_θ). (Steps 6-7 can be repeated several times.)
8.     Sample a query set Q_i from the union [X ∪ Sil ∪ Neu] \ S_i. (The emotion labels selected from X in Q_i and S_i within T_i are the same.)
9.     Compute the loss L_{Q_i}(f_{θ'_i}) using Q_i and the updated model f_{θ'_i}.
10.  end for
11.  Update the parameters θ using each Q_i and L_{Q_i}(f_{θ'_i}): θ ← θ − β ∇_θ Σ_i L_{Q_i}(f_{θ'_i}).
12. end while

Algorithm 1 summarizes the details of our approach. The algorithm described here is based on the work of [16] but differs in how the support set and the query set are sampled during the meta-learning stage, as introduced earlier in this section.
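The sampling scheme in Algorithm 1 can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name, episode format, and class names are ours. It builds one N+2-way, K-shot meta-task in which the two fixed classes always occupy the last output positions and appear only in the query set:

```python
import random

def sample_meta_task(source, fixed, n_way=5, k_shot=5, q_per_class=5, seed=None):
    """Build one (N+F)-way, K-shot episode for F-MAML.

    source: dict mapping emotion label -> list of examples
    fixed:  dict mapping fixed label ('silence', 'neutral') -> list of examples
    The N emotion classes get output indices 0..n_way-1; the fixed classes
    always sit at the last indices n_way..n_way+len(fixed)-1 and appear
    only in the query set (no adaptation is needed for them).
    """
    rng = random.Random(seed)
    emotions = rng.sample(sorted(source), n_way)
    support, query = [], []
    for idx, emo in enumerate(emotions):
        pool = rng.sample(source[emo], k_shot + q_per_class)
        support += [(x, idx) for x in pool[:k_shot]]
        query += [(x, idx) for x in pool[k_shot:]]
    for offset, lab in enumerate(sorted(fixed)):
        # fixed classes: query-only, fixed output position
        query += [(x, n_way + offset) for x in rng.sample(fixed[lab], q_per_class)]
    return support, query
```

Fixing the output indices of the fixed classes across all episodes is what lets the classifier "recall" them at fine-tuning time without adapting their weights.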
Figure 3. Spectrograms for different emotions in each language.
4. Experimentation
We conduct our experiments on the EmoFilm dataset [19]. It consists of 1115 clips with a mean length of 3.5 seconds: 341 English audio clips, with an average of 34.3 utterances per emotion; 410 Italian audio clips, with an average of 41.3 utterances per emotion; and 356 Spanish clips, with an average of 35.9 utterances per emotion (std 9). The higher number of Italian clips might be due to Italian being a more 'emotionally expressive' language; this could also relate to the pre-test made by Italian listeners, who may be better at perceiving emotions in their language [19]. The dataset is categorized into five emotion labels: happiness, sadness, anger, fear, and disgust. We formulate three 5-way, K-shot tasks using the same setup as the audio recognition tutorial in the official PyTorch documentation. Table 1 gives the total number of samples for each emotion in each language. The mel spectrograms for all five emotions in each language are shown in Figure 3. We perform three experiments:

1. The first experiment is SER in English, where English is used as the testing set while Spanish and Italian are used in training.
2. The second experiment is SER in Italian, where Italian is used as the testing set while English and Spanish are used in training.
3. The third experiment is SER in Spanish, where Spanish is used as the testing set while English and Italian are used in training.

The testing language is unseen in the meta-learning stage, and only K labeled examples of each label are available in the fine-tuning stage. The initialized model is fine-tuned on the labeled examples and evaluated on the unlabeled examples. The samples for the silence and neutral classes were self-generated with a mean length of 3.5 seconds.

Table 1. Dataset Details

Language | Total Samples | Samples per Emotion
English  | 341 | Fear 72, Disgust 50, Happiness 69, Anger 76, Sadness 74
Italian  | 410 | Fear 83, Disgust 68, Happiness 93, Anger 73, Sadness 93
Spanish  | 356 | Fear 63, Disgust 50, Happiness 76, Anger 82, Sadness 85
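The three leave-one-language-out experiments above can be sketched as follows (an illustrative reconstruction, not the authors' code; the per-emotion counts are copied from Table 1 and the split format is ours):

```python
# Per-emotion sample counts from Table 1
# (order: fear, disgust, happiness, anger, sadness).
PER_EMOTION = {
    "English": [72, 50, 69, 76, 74],
    "Italian": [83, 68, 93, 73, 93],
    "Spanish": [63, 50, 76, 82, 85],
}

def language_splits(langs):
    """Leave-one-language-out: each language is held out for
    fine-tuning/testing while the others form the source data."""
    return [{"test": t, "train": [l for l in langs if l != t]}
            for t in langs]

totals = {lang: sum(v) for lang, v in PER_EMOTION.items()}
splits = language_splits(list(PER_EMOTION))
```

Summing the rows recovers the per-language totals (341, 410, 356), and each split pairs one held-out test language with the other two as source data.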
The 3-4 second clips are sampled at 16 kHz. We use Mel-frequency cepstral coefficient (MFCC) features. For each clip, we extract 40-dimensional MFCCs with a frame length of 30 ms and a frame step of 10 ms. A convolutional neural network is adopted as the base model, containing 4 convolutional blocks. Each block comprises a 3×3 convolution with 64 filters, followed by ReLU and batch normalization [20]. The flattened layer after the convolutional blocks contains 576 neurons and is fully connected to the output layer with a linear function. We avoided using a ResNet architecture because it overfitted very quickly. The model is trained with a mini-batch size of 16 for 5-, 10-, and 20-shot classification. We set the learning rate α to 0.1 and β to 0.001; the learning rates were found using a grid search.

We compare our proposed approach with two baselines: the conventional supervised learning approach, which trains the model on the support set of the target task only (the detailed model for the supervised baseline is inspired from here), and the original MAML, which treats the 5+2-way problem as a 7-way problem. In the evaluation, we sample K examples from each class for fine-tuning the model and 25 examples per label for evaluation. We run 100 random tests and evaluate the different approaches on accuracy.

Table 2. Accuracy in 5-shot learning (rows: Supervised, MAML, F-MAML; columns: English, Italian, Spanish)

Table 3. Accuracy in 10-shot learning (rows: Supervised, MAML, F-MAML; columns: English, Italian, Spanish)

Table 4. Accuracy in 20-shot learning (rows: Supervised, MAML, F-MAML; columns: English, Italian, Spanish)
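The MFCC framing described above (30 ms windows with a 10 ms step at 16 kHz) fixes the time dimension of the CNN input. A minimal sketch of the frame-count arithmetic (illustrative only; the authors' exact feature pipeline is not given):

```python
def num_frames(num_samples, sr=16000, frame_ms=30, step_ms=10):
    """Number of full analysis frames for the given framing parameters."""
    frame_len = int(sr * frame_ms / 1000)   # 480 samples per 30 ms frame
    hop = int(sr * step_ms / 1000)          # 160 samples per 10 ms step
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop

# a 3.5 s clip at 16 kHz: 56000 samples -> 348 frames
frames = num_frames(int(3.5 * 16000))
```

Each frame yields one 40-dimensional MFCC vector, so a 3.5 s clip maps to roughly a 40×348 feature matrix before being fed to the convolutional blocks.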
Tables 2, 3, and 4 list the performance of the 5-, 10-, and 20-shot tasks on SER in the English, Italian, and Spanish languages, respectively. Not surprisingly, the MAML-based approaches perform much better than conventional supervised learning in the few-shot setting. This improvement comes from a good initialization of the model's parameters, which enables fast learning on a new task within a few gradient steps while avoiding the overfitting that can occur with a small dataset. Finally, our proposed F-MAML outperforms the original MAML. We attribute this improvement to the prior information about the fixed classes, which allows more efficient fine-tuning to new tasks than the original MAML. Figure 4 shows the loss of the original MAML compared to F-MAML on 5-shot learning. F-MAML converges more quickly, in fewer steps, than the original MAML.
5. Conclusions and Future Work
In this paper, we simulated a scenario of SER as a few-shot learning problem. We defined it as an N+F-way, K-shot problem and proposed a modification to the Model-Agnostic Meta-Learning (MAML) algorithm in which the F classes are kept fixed. Experiments conducted on the EmoFilm dataset show that our approach performs best compared to the baselines. In the future, we will test the feasibility of the approach on Indic languages and Mandarin-derived languages, since these languages differ vastly from each other.

Figure 4. Convergence comparison of MAML vs. F-MAML.
References

[1] R. W. Picard. Affective Computing. MIT Press, 2000.
[2] Jeng-Lin Li and Chi-Chun Lee. Attentive to individual: A multimodal emotion recognition network with personalized attention profile. In Gernot Kubin and Zdravko Kacic, editors, Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pages 211-215. ISCA, 2019.
[3] Nicu Sebe, Ira Cohen, Theo Gevers, and Thomas S. Huang. Multimodal approaches for emotion recognition: a survey. In Simone Santini, Raimondo Schettini, and Theo Gevers, editors, Internet Imaging VI, volume 5670, pages 56-67. International Society for Optics and Photonics, SPIE, 2005.
[4] John Kim and Rif A. Saurous. Emotion recognition from human speech using temporal information and deep learning. In Proc. Interspeech 2018, pages 937-940, 2018.
[5] Xingfeng Li and Masato Akagi. Multilingual speech emotion recognition system based on a three-layer model. In Interspeech 2016, pages 3608-3612, 2016.
[6] Victor Garcia Satorras and Joan Bruna Estrach. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
[7] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.
[8] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 3637-3645, Red Hook, NY, USA, 2016. Curran Associates Inc.
[9] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 4080-4090, Red Hook, NY, USA, 2017. Curran Associates Inc.
[10] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1199-1208, 2018.
[11] T. Ko, Y. Chen, and Q. Li. Prototypical networks for small footprint text-independent speaker verification. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6804-6808, 2020.
[12] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of ICML'16, pages 1842-1850. JMLR.org, 2016.
[13] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
[14] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of ICML'17, pages 2554-2563. JMLR.org, 2017.
[15] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
[16] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of ICML'17, pages 1126-1135. JMLR.org, 2017.
[17] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning, 2017.
[18] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms, 2018.
[19] Emilia Parada-Cabaleiro, Giovanni Costantini, Anton Batliner, Alice Baird, and Björn Schuller. Categorical vs dimensional perception of Italian emotional speech. In Proc. Interspeech 2018, pages 3638-3642, 2018.
[20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Volume 37, ICML'15, pages 448-456. JMLR.org, 2015.