Semi Supervised Learning For Few-shot Audio Classification By Episodic Triplet Mining
Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu
TCS Research and Innovation, Mumbai, India
email: {bhosale.swapnil2, rupayan.chakraborty, sunilkumar.kopparapu}@tcs.com

ABSTRACT
Few-shot learning aims to generalize to unseen classes that appear during testing but are unavailable during training. Prototypical networks incorporate few-shot metric learning by constructing a class prototype as the mean vector of the embedded support points within a class. The performance of prototypical networks in extreme few-shot scenarios (like one-shot) degrades drastically, mainly because the variations within the clusters are ignored while constructing the prototypes. In this paper, we propose to replace the typical prototypical loss function with an Episodic Triplet Mining (ETM) technique. Conventional triplet selection leads to overfitting, because all possible combinations are used during training. We incorporate episodic training for mining the semi hard positive and the semi hard negative triplets to overcome this overfitting. We also propose an adaptation that makes use of unlabeled training samples for better modeling. Experiments on two different audio processing tasks, namely speaker recognition and audio event detection, show improved performance and hence the efficacy of ETM over the prototypical loss function and other meta-learning frameworks. Further, we show improved performance when unlabeled training data are used.
Index Terms — Few-shot learning, Episodic training, Speaker recognition, Audio event detection, Semi supervised learning
1. INTRODUCTION
Supervised deep learning models rely heavily on the availability of a substantial amount of labeled training data. However, in classification tasks such as identification of rare objects (unique species of birds) [1], diagnosing uncommon diseases [2], authenticating a new employee in a large enterprise (i.e., speech biometry) [3], or detection of rare acoustic events (e.g., in audio surveillance) [4–6], it is a challenge to create generalizable models using traditional deep neural networks. Humans, on the other hand, utilize their past experience in order to learn new concepts across different domains, whereas most deep learning models are able to learn high level features and extract complex characteristics only when a sufficient amount of labeled data is available for supervised training [7]. Few-shot learning has the potential to create generalizable models, as it re-frames the learning paradigm: the model is not trained to classify a sample into one of the categories seen during training, but is instead optimized such that, given a pair of samples, it can predict whether the two samples are similar or dissimilar [8–10]. An extreme case of few-shot learning is one-shot learning, where a single sample (i.e., a reference) for each of the unseen classes is available for making an inference about a test sample. Many few-shot learning problems utilize meta learning algorithms that learn a mapping to an embedding space in which samples belonging to the same class are closer than those belonging to different classes [11]. One such popular framework is Prototypical networks, which learn an embedding space where the samples within the same class form clusters around a single prototypical point, represented by the mean of the individual samples within the cluster. During inference, a query sample is assigned the label corresponding to the nearest prototype in the embedding space. There are two prominent problems with such an approach.
First, variance in the data can easily affect the relative positions of the prototypes, since the model relies on the unweighted average of the samples. Second, for extreme cases like one-shot, each individual sample is assumed to be a separate cluster, with the sample itself being the prototype. Hence, the performance of prototypical networks trained on multiple shots degrades drastically in a one-shot inference setup. To address these problems, we propose to use an episodic triplet loss instead of the typical prototypical loss function. In addition, conventional triplet selection during the training process is cumbersome, since choosing all possible selections can easily lead to over-fitting. Therefore, we incorporate episodic training for mining the semi hard positive and the semi hard negative triplets. To validate our proposal, we conducted experiments on two different speech processing tasks, namely, (1) speaker recognition (using the VCTK corpus) and (2) audio event classification (using the Freesound dataset, 2018). Our results show improved performance, especially in extreme few-shot test setups, and indicate the efficacy of using the episodic triplet loss over the prototypical loss function. To summarize, the main contributions of this paper are: (a) a novel few-shot learning approach that replaces the conventional prototypical loss function with an Episodic Triplet Mining (ETM) technique, which is more useful in extreme few-shot scenarios (e.g., one-shot learning); (b) episodic training for mining the semi hard positive and the semi hard negative triplets, particularly to avoid the over-fitting that arises from the usual all-possible triplet mining strategy; (c) an adaptation of our proposed approach to effectively use the unlabeled samples available during training in a semi supervised paradigm.
We experimentally validate the proposed approach in semi supervised scenarios, and show that our model trained using a subset of labeled training data performs competitively with models trained in a supervised manner on the complete set of labeled data. The rest of the paper is organized as follows. In Section 2, we provide a brief review of the related work in this area. In Section 3, we explain the system design and the proposed approach in detail. The experimental details and results are presented in Section 4, followed by the conclusion in Section 5.

2. RELATED WORK

Prototypical networks [10] and Matching networks [12] are the most popular metric learning based methods for few-shot learning. In prototypical networks, there exists an embedding space in which samples belonging to the same class form distinct clusters. During the training process, prototypes for each class are computed as the average of the embeddings of all samples having that class as their label. That is why the prototypes do not effectively capture the variations in the data. Moreover, extensions of the prototypical loss to extreme few-shot cases, such as one-shot, assume each point to be an individual cluster with its embedding as the prototype. [13] and [14] adapted prototypical networks for audio event detection and showed generalization to unseen audio events in real time. In [15], the authors proposed a prototypical loss based few-shot learning architecture for speaker recognition, where capsule networks are used to extract audio embeddings from input Mel-spectrogram features. In contrast, we use a simpler embedding network with 1D convolutional layers, apply self-attention on the outputs of the convolution layers, and use the average of the attention heads as the embedding of the input spectrogram. The authors in [16] compare the triplet loss with the prototypical loss for the speaker recognition task.
Although their results favor the prototypical loss over the triplet loss, we hypothesize that the triplet mining strategy could severely affect the performance of such networks. In traditional triplet frameworks, given a triplet, the interaction of the anchor point is limited to a single positive and a single negative point. In order to optimize the anchor embedding with respect to different points, the same point must appear as the anchor in multiple triplets. As a result, the number of possible training triplets can grow rapidly with the size of the dataset, making this impractical. In this direction, we propose an episodic training framework for the triplet loss, wherein each episode optimizes the triplet loss with respect to each query point (as an anchor). By doing so, in each episode a single anchor point simultaneously interacts with all the positive and negative samples present in the support set. Recently, in [17], [18] and [19], authors have extended prototypical networks to incorporate unlabeled data samples by adapting a semi supervised paradigm. To amplify the efficacy of the triplet loss trained with the episodic framework, we adapt our approach to semi supervised paradigms as well. During the training process, we first incorporate pseudo-labeling for the unlabeled samples in each episode, and then combine them with the existing support set. The query points are then optimized by comparing the distances of positive and negative samples from the new support set. Through extensive experimentation, we empirically show the influence of the amount of unlabeled samples available in each episode on the performance of the system.
3. SYSTEM DESIGN
Consider an N-shot, K-way learning problem in which N samples from each of K unique classes are provided. Each sample is represented by an F-dimensional feature vector x_i ∈ R^F, with label y_i ∈ {1, ..., K}. A support set is defined as {(x_i, y_i)}, i = 1, ..., N_S, where N_S is the number of support samples. Similarly, a query set is defined as {x_i}, i = 1, ..., N_Q, where N_Q is the number of queries. The embedding function f_φ(·), parameterized by weights φ, projects each sample x_i to an M-dimensional embedding in the latent space. f_φ(·) is optimized to reduce the distance between the embeddings of queries and support samples belonging to the same class, and to increase the distance between queries and support samples belonging to different classes.

Fig. 1. Selection of hard positives and hard negatives from the distance matrix D for a single episode.

E_S and E_Q denote the embeddings of all samples within the support set and the query set, respectively. A distance matrix D is formed with the embeddings in E_Q and E_S placed along the rows and columns, respectively, where the value of each cell is D_{i,j} = d_e(E_Q[i], E_S[j]), with i ∈ {1, ..., N_Q}, j ∈ {1, ..., N_S}, and d_e(·) the Euclidean distance. For each query q_i ∈ E_Q, the support samples belonging to the same class as q_i constitute the set of positive samples, and all remaining support samples form the set of negative samples (see Figure 1). The triplet loss function compares the distance of an input sample (the anchor) to a positive input (same class as the anchor) and to a negative input (different class from the anchor). Within a training episode, the triplet loss is calculated with each query sample q_i treated as the anchor, and f_φ(·) is optimized such that (d^N_{q_i} − d^P_{q_i}) > z, where d^N_{q_i} is the distance between q_i and the negative set, d^P_{q_i} is the distance between q_i and the positive set, and z is the margin. Considering q_i ∈ E_Q, d^P_{q_i} and d^N_{q_i} can be written as

    d^P_{q_i} = μ(Λ_max(d_e(q_i, s_j), n_P)),  ∀ s_j ∈ E_S | l(s_j) = l(q_i)
    d^N_{q_i} = μ(Λ_min(d_e(q_i, s_j), n_N)),  ∀ s_j ∈ E_S | l(s_j) ≠ l(q_i)

where μ is the mean operator, Λ_max(A, n) selects the n highest values from set A, Λ_min(A, n) selects the n lowest values from set A, and l(·) returns the corresponding class label. Instead of choosing only the farthest positive and the nearest negative sample, we average the n_P positive samples that are farthest from q_i and the n_N negative samples that are nearest to q_i. In the traditional triplet loss [20], the number of possible triplets to be passed through f_φ(·) grows quadratically with the number of samples, due to which the same anchor point gets repeated in multiple triplets, leading to overfitting. Conversely, in ETM, each anchor simultaneously interacts with all negative samples in the episode, which gives a more stable update and faster convergence. The n_N parameter in ETM chooses semi hard negatives among the hard negatives of the individual classes in each episode. Algorithm 1 explains the procedure for computing the loss for a single episode in detail. Similarly to the training phase, during inference we identify, for each query sample, the n_P nearest samples from the support set and assign the label accordingly.

Algorithm 1: Triplet loss for a single episode.
Input: Train set X = {(x_1, y_1), ..., (x_N, y_N)}, where y_i ∈ {1, ..., K}. X_k denotes the subset of X containing all elements (x_i, y_i) such that y_i = k. N_c: number of unique classes in an episode. N_S: number of support samples for each class. N_Q: number of query samples for each class. Λ(X, n): randomly samples a subset of size n from set X.
Output: triplet loss for the episode.

V ← Λ({1, ..., K}, N_c)
for k in {1, ..., N_c} do
    S_k ← Λ(X_{V_k}, N_S)
    Q_k ← Λ(X_{V_k} \ S_k, N_Q)
end for
S ← {S_1, ..., S_{N_c}}
Q ← {Q_1, ..., Q_{N_c}}
E_S ← f_φ(x_i) ∀ x_i ∈ S
E_Q ← f_φ(x_i) ∀ x_i ∈ Q
D_{i,j} ← d_e(E_Q[i], E_S[j]) ∀ i ∈ {1, ..., N_Q}, ∀ j ∈ {1, ..., N_S}
for q_i ∈ E_Q do
    loss_i ← max(d^P_{q_i} − d^N_{q_i} + z, 0)
end for
loss ← Σ_i loss_i

Fig. 2. A single training episode for semi supervised few-shot learning.
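As an illustration, the per-episode loss and the nearest-neighbour inference rule described above can be sketched in NumPy as follows. This assumes the embeddings E_Q and E_S have already been produced by f_φ; the function names `etm_episode_loss` and `predict` are ours, not from the paper.

```python
import numpy as np

def etm_episode_loss(E_Q, y_Q, E_S, y_S, n_p=3, n_n=5, z=0.3):
    """Episodic triplet loss for one episode (a sketch of Algorithm 1).

    E_Q: (N_Q, M) query embeddings with labels y_Q; E_S: (N_S, M) support
    embeddings with labels y_S; z is the margin.
    """
    # Pairwise Euclidean distance matrix D (queries along rows, supports along columns).
    D = np.linalg.norm(E_Q[:, None, :] - E_S[None, :, :], axis=-1)
    losses = []
    for i, label in enumerate(y_Q):
        pos = np.sort(D[i, y_S == label])   # distances to positive supports, ascending
        neg = np.sort(D[i, y_S != label])   # distances to negative supports, ascending
        d_p = pos[-n_p:].mean()             # mean of the n_p farthest positives
        d_n = neg[:n_n].mean()              # mean of the n_n nearest (semi hard) negatives
        losses.append(max(d_p - d_n + z, 0.0))
    return float(np.sum(losses))

def predict(E_Q, E_S, y_S, n_p=3):
    """Few-shot inference: majority label among the n_p nearest supports."""
    D = np.linalg.norm(E_Q[:, None, :] - E_S[None, :, :], axis=-1)
    preds = []
    for row in D:
        nearest = y_S[np.argsort(row)[:n_p]]
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)
```

If fewer than n_p positives are available, the slice simply takes all of them; the paper's experiments use n_P = 3, n_N = 5 and z = 0.3.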
In a semi supervised system, along with a support set S and a query set Q, we are also provided with an unlabeled set R. The unlabeled samples can be used for regularization, and to overcome overfitting when labeled training samples are scarce. We adopt a pseudo-labeling technique wherein, in each training episode, we first infer the label of each sample in R (considering it as a query) based on the labels of the nearest samples within S, and denote the result as R̃. Next, we combine S and R̃ to form a new support set, S̃, which is used to calculate the loss over each query in Q. We hypothesize that the pseudo-labeling in the first step of every episode induces an error in the system. This error gets further propagated while optimizing the triplet loss over the query samples with S̃ as the support set. As a result, overfitting is delayed, since there is always some loss that gets backpropagated through the embedding network. Figure 2 depicts the steps in a single training episode for semi-supervised learning. We choose two different scenarios: (a) weakly labeled and (b) completely unlabeled. In the weakly labeled setup, for a particular episode, the unlabeled samples R are chosen from the same classes that are present in S. On the other hand, the completely unlabeled setup resembles a more realistic scenario, in which no metadata is involved while choosing the samples in R; the classes in R may be present within S or may be completely disjoint. In contrast to [21], we refrain from using distractor classes, since choosing distractor samples requires prior information about labels, which may not always be available.
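The pseudo-labeling step of a semi supervised episode can be illustrated in the same style (again a NumPy sketch under our own naming; the paper does not prescribe this exact implementation): each unlabeled embedding is assigned the majority label of its nearest support samples, and the pseudo-labeled set R̃ is merged with S to form S̃.

```python
import numpy as np

def pseudo_label_support(E_S, y_S, E_R, k=3):
    """Assign pseudo-labels to unlabeled embeddings E_R by the majority label
    of their k nearest support samples, then merge into the support set.

    Returns (E_S_tilde, y_S_tilde): the augmented support embeddings and labels.
    """
    D = np.linalg.norm(E_R[:, None, :] - E_S[None, :, :], axis=-1)
    y_R = []
    for row in D:
        nearest = y_S[np.argsort(row)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        y_R.append(vals[np.argmax(counts)])
    # S~ = S ∪ R~, used as the support set for the episode's triplet loss.
    return np.vstack([E_S, E_R]), np.concatenate([y_S, np.array(y_R)])
```

In the completely unlabeled scenario some samples in R may belong to classes absent from S; this sketch then simply assigns them the nearest available label, which is the source of the pseudo-labeling error the paper discusses.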
4. EXPERIMENTS
To validate our proposed approach, experiments are conducted on two different tasks, namely, (1) speaker recognition (VCTK corpus [22]) and (2) audio event classification (Freesound Dataset, 2018 (FSD) [23]).
Speaker Recognition task:
The VCTK corpus is an English multi-speaker dataset with 44 hours of audio spoken by 109 native English speakers. We split the dataset into a random 70:20:10 train-test-validation split, such that the sets of speakers in the train, test and validation sets are completely disjoint. We down-sampled each audio to 16 kHz and split it into audio segments of 3 seconds each. Mel-spectrograms are extracted as initial features from each segment and used as input to the embedding network. The embedding network is constructed using two layers of 1-D convolutions, each with a kernel of size 3 and 128 filters. The use of 1-D convolutions helps learn the temporal context between adjacent frames. Each convolution layer is followed by a max-pooling layer with a kernel of size 3. Additionally, batch normalization is performed over the output of each convolution layer. We apply a multi-head self-attention mechanism [24] over the output of the second convolution layer and average the output of each head to obtain a 128 dimensional embedding. The semi supervised learning experiments are conducted considering the availability of two different portions of the training data as the labeled train set: (a) 33% of the samples of each speaker present in the training data, and (b) 66% of the samples of each speaker present in the training data. In both (a) and (b), the remaining samples from the training data are used as the unlabeled train set.
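A shape-level NumPy sketch of this embedding network may help fix ideas: two 1-D convolutions (kernel size 3, 128 filters), max pooling of size 3, and multi-head self-attention whose heads are averaged into a 128-dimensional embedding. All names and the random parameters are ours; the ReLU nonlinearity is our assumption, and batch normalization and training logic are omitted for brevity.

```python
import numpy as np

def conv1d(x, w):
    """'Valid' 1-D convolution. x: (T, C_in), w: (k, C_in, C_out) -> (T-k+1, C_out)."""
    k = w.shape[0]
    return np.stack([np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0] - k + 1)])

def maxpool1d(x, k=3):
    """Non-overlapping temporal max pooling; trailing frames are dropped."""
    T = (x.shape[0] // k) * k
    return x[:T].reshape(-1, k, x.shape[1]).max(axis=1)

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def embed(mel, params):
    """Mel-spectrogram (T, F) -> 128-d embedding: two conv+pool blocks, then
    multi-head self-attention averaged over heads and time."""
    h = np.maximum(conv1d(mel, params["w1"]), 0)   # conv1 + ReLU (assumed)
    h = maxpool1d(h)
    h = np.maximum(conv1d(h, params["w2"]), 0)     # conv2 + ReLU (assumed)
    h = maxpool1d(h)                               # (T', 128)
    heads = []
    for wq, wk, wv in params["attn"]:              # one (Wq, Wk, Wv) per head
        q, k_, v = h @ wq, h @ wk, h @ wv
        a = softmax(q @ k_.T / np.sqrt(q.shape[1]))
        heads.append(a @ v)                        # (T', 128) per head
    return np.mean(heads, axis=0).mean(axis=0)     # average heads, then time
```

The sketch only demonstrates the data flow and output dimensionality; in practice the parameters would be learned by backpropagating the episodic triplet loss.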
Audio Event Classification task:
The Freesound dataset (2018) consists of 18,873 audio files, wherein each audio file is assigned one of 41 unique audio events from Google's Audioset Ontology [25]. In each of the 3 folds of our experiment, we randomly choose 10 classes with all their corresponding audio files as our test set, and split the remaining classes into train classes and validation classes in a 90:10 ratio, such that the audio events in train, test and validation are always disjoint. All audio files are down-sampled to 16 kHz and split into 1 second chunks. We use a VGGish architecture [26] as our embedding model, which outputs a 128 dimensional embedding for each chunk of audio. For the semi supervised learning experiments, we randomly choose 50% of the samples from each audio event within the training data and mark them as our labeled training set. The rest of the samples form our unlabeled train set. The test set and validation set for the supervised and semi supervised experiments for each task are kept fixed.
Episode construction
For speaker recognition, we found N_S = 20, N_Q = 15 and N_c = 5 to give the best performance across all testing configurations.

Table 1. Performance comparison for the speaker recognition task (entire train data is labeled).

loss function        5-way 1-shot   5-way 5-shot   20-way 1-shot   20-way 5-shot
Matching Network     82.77          93.79          55.02           76.29
Relation Network     82.68          94.31          54.55           76.12
Prototypical loss    83.42          —              —               —

Table 2. Speaker recognition performance of the semi supervised learning system using weakly labeled and completely unlabeled data (only 33% of the training data is labeled).
Type                                     5-way 1-shot   5-way 5-shot   20-way 1-shot   20-way 5-shot
Supervised (baseline)                    76.13          87.90          49.66           65.45
Semi supervised (weakly labeled)         —              —              —               —
Semi supervised (completely unlabeled)   85.41          —              —               —
Table 3. Speaker recognition performance of the semi supervised learning system using weakly labeled and completely unlabeled data (only 66% of the training data is labeled).
Type                                     5-way 1-shot   5-way 5-shot   20-way 1-shot   20-way 5-shot
Supervised (baseline)                    84.12          92.14          63.53           76.32
Semi supervised (weakly labeled)         85.28          92.34          —               —
Top-line (100% labeled data)             86.66          93.04          67.40           79.15
Similarly, for the audio event classification task, we found N_S = 10, N_Q = 5 and N_c = 5 to give the best results. In all tasks, we choose z (i.e., the margin) as 0.3. Also, we train a single model and test it against all test configurations (5-way 1-shot, 5-way 5-shot, and so on). We use n_P = 3 and n_N = 5 for all tasks. For the speaker recognition and audio event classification tasks, we train our model for 10,000 episodes. We use the Adam optimizer with an initial learning rate of −, which gets reduced by half after every 1000 episodes. While testing, we use N_Q = 15 and report the average accuracy over 1000 test episodes. For the semi supervised setup, we found that pre-training the model for a few initial episodes (50 episodes for both speaker recognition and audio event classification) in a completely supervised setup results in slightly smoother training curves. This might be due to the fact that incorporating the unlabeled data from the first episode itself might result in a higher value of the error, thus making convergence difficult. We vary the amount of unlabeled samples per class in each episode from 1 to 5. We employ a testing setup similar to the supervised setup, keeping the test set unchanged. Tables 1-3 and Tables 4-5 show the results for the speaker recognition and audio event classification tasks, respectively. We evaluate the performance of all the experiments in terms of average accuracy across all test episodes. Table 1 shows the results for few-shot speaker recognition on the entire training data. We compare our approach with the prototypical loss, matching network and relation network, all with the same embedding module. We surpass the accuracy of the model trained using the prototypical loss across the test configurations.
In particular, in two of the test configurations, our approach outperforms a model trained using the prototypical loss with an identical embedding network by an absolute 11.17% and 3.24%, respectively. In the case of semi supervised few-shot learning, while using only 33% of the training data as labeled (see Table 2), our model achieves significantly better results over the baseline (which undergoes supervised training on 33% of the training data), thereby reducing the gap with respect to the top-line across all four setups by an average of 11.45%. Also, while using 66% of the training data as labeled (see Table 3), we achieve an average increase of 1.66% across all four setups. Specifically, in one configuration, our approach achieves an absolute improvement of 0.39% over a model trained in a supervised manner using 100% of the labeled training data. The pseudo-labeling technique is an iterative process which, instead of diffusing the probability spread over multiple classes, assigns a high probability to one particular class [27]. This helps by reducing the density (or entropy [28]) around the decision boundaries. Note that the pseudo-label is assigned using a few-shot inference. Table 4 compares the performance of episodic triplet mining with existing meta-learning frameworks for the audio event classification task in various one-shot scenarios. We achieve an average improvement of 3.72% in terms of accuracy when training the model using the ETM technique, across all three one-shot scenarios. Similarly to the speaker recognition task, Table 5 shows the performance of ETM extended to the semi supervised domain for the audio event classification task. For both the weakly labeled and completely unlabeled experiments, we use 50% of the samples belonging to each class in the train set as the unlabeled training set and the remaining samples as the labeled training set.

Table 4. Performance comparison for the audio event classification task (entire train data is labeled).

loss function                   5-way 1-shot   7-way 1-shot   10-way 1-shot
Matching Network                75.11          72.88          63.96
Relation Network                74.29          73.13          64.61
Prototypical loss               75.33          72.95          63.33
Episodic Triplet Mining (ETM)   —              —              —

Table 5. Audio event classification performance of the semi supervised learning system using weakly labeled and completely unlabeled data with ETM (only 50% of the training data is labeled).

Type                                     5-way 1-shot   7-way 1-shot   10-way 1-shot
Supervised (baseline)                    76.56          72.34          61.44
Semi supervised (weakly labeled)         —              —              —
Semi supervised (completely unlabeled)   77.73          —              —
5. CONCLUSION
In this paper, we proposed a novel Episodic Triplet Mining (ETM) technique for few-shot learning and compared it with conventional prototypical loss models. We incorporate episodic training for mining the semi hard positive and the semi hard negative triplets, precisely to avoid the over-fitting which arises from the usual strategy of all-possible triplet mining. The proposed ETM outperformed the prototypical loss based model, importantly in the case of one-shot learning. Moreover, we validated the usefulness of ETM over the prototypical loss, especially in the presence of unlabeled samples. The ETM model adapted to semi supervised learning surpasses the baseline by a large margin in both cases: (a) in the presence of weakly labeled data (i.e., when the unlabeled data is chosen based on the metadata), and (b) in the presence of completely unlabeled data. The performance variations with respect to the amount of unlabeled data available in each training episode were also investigated. The ability of deep learning models to generalize to new classes without the need for re-training or fine tuning is extremely useful. In addition, there is always a constraint on how many reference samples the user can be requested to provide, which makes generalizability in the presence of minimal reference samples all the more crucial. The better performance of ETM, specifically in extreme few-shot scenarios (like one-shot), is observed consistently in our work. The ETM adaptation to the semi supervised domain leverages unlabeled data, which is present in abundance.
REFERENCES

[1] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang, "A Closer Look at Few-Shot Classification," arXiv preprint arXiv:1904.04232, 2019.
[2] Xiaomeng Li, Lequan Yu, Chi-Wing Fu, and Pheng-Ann Heng, "Difficulty-Aware Meta-Learning for Rare Disease Diagnosis," arXiv preprint arXiv:1907.00354, 2019.
[3] Jacob Baldwin, Ryan Burnham, Andrew Meyer, Robert Dora, and Robert Wright, "Beyond Speech: Generalizing D-Vectors for Biometric Verification," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 842–849.
[4] Keming Zhang, Y. Cai, Y. Ren, Ruida Ye, and L. He, "MTF-CRNN: Multiscale Time-Frequency Convolutional Recurrent Neural Network for Sound Event Detection," IEEE Access, vol. 8, pp. 147337–147348, 2020.
[5] A. Mesaros, A. Diment, B. Elizalde, T. Heittola, E. Vincent, B. Raj, and T. Virtanen, "Sound Event Detection in the DCASE 2017 Challenge," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 6, pp. 992–1006, 2019.
[6] Jeong-Sik Park and Seok-Hoon Kim, "Sound Learning based Event Detection for Acoustic Surveillance Sensors," Multimedia Tools and Applications, vol. 79, pp. 16127–16139, 2019.
[7] Maryam M Najafabadi, Flavio Villanustre, Taghi M Khoshgoftaar, Naeem Seliya, Randall Wald, and Edin Muharemagic, "Deep Learning Applications and Challenges in Big Data Analytics," Journal of Big Data, vol. 2, no. 1, pp. 1, 2015.
[8] Eleni Triantafillou, Richard Zemel, and Raquel Urtasun, "Few-Shot Learning Through an Information Retrieval Lens," in Advances in Neural Information Processing Systems, pp. 2255–2265. Curran Associates, Inc., 2017.
[9] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales, "Learning to Compare: Relation Network for Few-Shot Learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
[10] Jake Snell, Kevin Swersky, and Richard Zemel, "Prototypical Networks for Few-Shot Learning," in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
[11] Artsiom Sanakoyeu, Vadim Tschernezki, Uta Buchler, and Bjorn Ommer, "Divide and Conquer the Embedding Space for Metric Learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 471–480.
[12] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al., "Matching Networks for One Shot Learning," in Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.
[13] Yu Wang, Justin Salamon, Nicholas J Bryan, and Juan Pablo Bello, "Few-Shot Sound Event Detection," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 81–85.
[14] Bowen Shi, Ming Sun, Krishna C Puvvada, Chieh-Chi Kao, Spyros Matsoukas, and Chao Wang, "Few-Shot Acoustic Event Detection via Meta Learning," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 76–80.
[15] Prashant Anand, Ajeet Kumar Singh, Siddharth Srivastava, and Brejesh Lall, "Few Shot Speaker Recognition using Deep Neural Networks," arXiv preprint arXiv:1904.08775, 2019.
[16] Jixuan Wang, Kuan-Chieh Wang, Marc T Law, Frank Rudzicz, and Michael Brudno, "Centroid-based Deep Metric Learning for Speaker Recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3652–3656.
[17] Rinu Boney and Alexander Ilin, "Semi-Supervised Few-Shot Learning with Prototypical Networks," CoRR abs/1711.10856, 2017.
[18] Ahmed Ayyad, Nassir Navab, Mohamed Elhoseiny, and Shadi Albarqouni, "Semi-Supervised Few-Shot Learning with Local and Global Consistency," arXiv preprint arXiv:1903.02164, 2019.
[19] Xinzhe Li, Qianru Sun, Yaoyao Liu, Qin Zhou, Shibao Zheng, Tat-Seng Chua, and Bernt Schiele, "Learning to Self-Train for Semi-Supervised Few-Shot Classification," in Advances in Neural Information Processing Systems, 2019, pp. 10276–10286.
[20] Florian Schroff, Dmitry Kalenichenko, and James Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[21] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel, "Meta-Learning for Semi-Supervised Few-Shot Classification," arXiv preprint arXiv:1803.00676, 2018.
[22] Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al., "Superseded-CSTR VCTK Corpus: English Multi-Speaker Corpus for CSTR Voice Cloning Toolkit," 2016.
[23] Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel PW Ellis, Xavier Favory, Jordi Pons, and Xavier Serra, "General-Purpose Tagging of Freesound Audio with Audioset Labels: Task Description, Dataset, and Baseline," arXiv preprint arXiv:1807.09902, 2018.
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is All You Need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[25] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter, "Audio Set: An Ontology and Human-Labeled Dataset for Audio Events," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
[26] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., "CNN Architectures for Large-Scale Audio Classification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 131–135.
[27] Dong-Hyun Lee, "Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks," in Workshop on Challenges in Representation Learning, ICML, 2013, vol. 3, p. 2.
[28] Yves Grandvalet and Yoshua Bengio, "Semi-Supervised Learning by Entropy Minimization," in