A Study of Few-Shot Audio Classification
Piper Wolters‡∗, Chris Careaga‡∗, Brian Hutchinson‡† and Lauren Phillips†
‡Computer Science Department, Western Washington University, Bellingham, WA
†Pacific Northwest National Laboratory, Richland, WA
∗These authors contributed equally. This work was funded by the U.S. Government.
Abstract—Advances in deep learning have resulted in state-of-the-art performance for many audio classification tasks but, unlike humans, these systems traditionally require large amounts of data to make accurate predictions. Not every person or organization has access to those resources, and the organizations that do, like our field at large, do not reflect the demographics of our country. Enabling people to use machine learning without significant resource hurdles is important, because machine learning is an increasingly useful tool for solving problems, and can solve a broader set of problems when put in the hands of a broader set of people. Few-shot learning is a type of machine learning designed to enable the model to generalize to new classes with very few examples. In this research, we address two audio classification tasks (speaker identification and activity classification) with the Prototypical Network few-shot learning algorithm, and assess performance of various encoder architectures. Our encoders include recurrent neural networks, as well as one- and two-dimensional convolutional neural networks. We evaluate our model for speaker identification on the VoxCeleb dataset and ICSI Meeting Corpus, obtaining 5-shot 5-way accuracies of 93.5% and 54.0%, respectively. We also evaluate for activity classification from audio using few-shot subsets of the Kinetics 600 dataset and AudioSet, both drawn from YouTube videos, obtaining 51.5% and 35.2% accuracy, respectively.
I. INTRODUCTION
The speech and signal processing communities were among the earliest adopters of deep learning, which is now used extensively there for tasks ranging from speech recognition [1] to speaker and language identification [2], to audio event detection [3]. To work well, however, these methods typically require very large amounts of training data, at a great cost. In order to mitigate this limitation, researchers have tried well-known strategies, including unsupervised and semi-supervised learning [4], transfer learning [5], and data augmentation. In recent years, many few-shot learning methods have been introduced, designed to generalize effectively to unseen classes with only a handful of examples for each class. For example, consider a medical practitioner who wishes to build a cough classifier that, given an audio recording, can classify which type of cough (if any) is present in a recording. Such a classifier could help to automate triage by phone in underserved areas by detecting various cough types (e.g. dry, wet, whooping, barking, croupy) that may be indicative of different diseases. In contrast to classical machine learning techniques, using few-shot learning means the practitioner need only collect a very small (∼5) set of examples of the types of cough of interest, significantly lowering the barrier to creating such a system. As another example, one could train a few-shot classifier to identify linguistic or non-linguistic cues of sexual harassment (e.g. phrases or catcalls), to enable detection of sexual harassment from audio.

One approach to few-shot learning is metric learning, which involves learning an embedding space in which to compare classes. Notable metric-learning few-shot algorithms include Siamese Networks [6], Matching Networks [7], Relation Networks [8], and Prototypical Networks [9]. These methods have been primarily developed in the computer vision field, but recent work has begun to address few-shot classification of audio. Pons et al. [10] experiment with prototypical networks, transfer learning, and the combination thereof in order to improve the performance of audio classifiers provided with small labeled datasets. They evaluate on the UrbanSound8k [11] and TUT [12] datasets, reporting 5-way classification accuracies of up to ∼ % and ∼ %, respectively. Anand et al. [13] propose utilizing an autoencoder to learn generalized feature embeddings from class-specific embeddings obtained from a capsule network. Performing exhaustive experiments on the VoxCeleb [14] and VCTK datasets, they obtain 5-way speaker classification accuracies of 91.5% and 96.5%, respectively.

While other work has evaluated the effectiveness of few-shot learning on audio with a limited set of encoders and a limited set of datasets, we conduct a study of audio few-shot classification contrasting five audio encoder architectures and reporting results on four widely used audio datasets. We use the prototypical network few-shot algorithm due to its strong performance on image datasets.

Fig. 1. Few-shot pipeline (3-shot 3-way setup shown). Prototypical Network diagram inspired by a figure in [9].

II. METHODS
Our pipeline begins with audio input, either raw waveforms or log-scaled mel filterbank features, which is fed into an encoder; the encoder outputs embeddings that are used by the prototypical network. The pipeline is shown in Fig. 1.
A. Feature Extraction

Each datapoint is a waveform audio clip with a 16kHz sampling rate. The SincNet acts directly upon this raw waveform. For the LSTM and VGG11 encoders, we further break the waveform into overlapping 25ms frames with a 10ms offset using a sliding Hamming window. From each frame we extract 64 mel filterbank features (the mel scale is a non-linear frequency scaling aligned with human perception of frequency). The features are then log scaled.
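The paper includes no code; the following is a minimal sketch of this front end, assuming librosa for feature extraction. The FFT size and the epsilon added before the log are assumptions not stated in the text.

```python
# Minimal sketch of the log-mel front end described above (not the authors' code).
# Assumed: librosa, an FFT size of 512, and a small epsilon before the log.
import numpy as np
import librosa

def log_mel_features(path, sr=16000, n_mels=64):
    """Return a (num_frames, 64) array of log-scaled mel filterbank features."""
    y, _ = librosa.load(path, sr=sr)                # waveform at 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=512,                                  # assumed FFT size
        win_length=int(0.025 * sr),                 # 25 ms frames
        hop_length=int(0.010 * sr),                 # 10 ms offset
        window="hamming",                           # sliding Hamming window
        n_mels=n_mels)                              # 64 mel filterbank features
    return np.log(mel + 1e-6).T                     # log scale; shape (frames, 64)
```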
B. Encoders

1) VGG11: We utilize a VGGish model (https://github.com/tensorflow/models/tree/master/research/audioset/vggish), based on the popular VGG [15] 2d convolutional neural network (CNN) that has performed well for computer vision tasks. Among the model configurations we tried, including deeper variants, VGG11 performed the best. We feed it windows consisting of 96 consecutive frames, each window offset by 48 frames from the previous. The 3072-dimensional per-window outputs of VGG11 are averaged to produce the representation of the entire audio clip.
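To make the windowing-and-averaging scheme concrete, here is a minimal sketch. The 2-D CNN itself is abstracted as `cnn` and assumed to map a (1, 96, 64) window to a 3072-dimensional vector; the paper does not give the full layer configuration.

```python
# Illustrative sketch of the window-and-average pooling described above;
# `cnn` stands in for the VGG11 body and is assumed to return 3072-d vectors.
import torch

def vgg_clip_embedding(cnn, log_mel, win=96, hop=48):
    """log_mel: (num_frames, 64) tensor; returns a (3072,) clip embedding."""
    windows = [log_mel[s:s + win].unsqueeze(0)             # (1, 96, 64) window
               for s in range(0, log_mel.shape[0] - win + 1, hop)]
    x = torch.stack(windows)                               # (num_windows, 1, 96, 64)
    return cnn(x).mean(dim=0)                              # average per-window outputs
```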
2) LSTM: Long Short-Term Memory networks [16] are a popular type of recurrent neural network; with LSTMs, predictions at timestep t can in theory leverage information contained in all inputs up to and including time t. We use a single-layer LSTM with a hidden size of 4096 and an output size of 2048. At each timestep, we feed in one frame. The per-frame outputs are averaged to produce the clip embedding.
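A minimal PyTorch sketch of this encoder follows. The linear projection from the 4096-dimensional hidden state down to the 2048-dimensional output is an assumption; the paper states only the two sizes.

```python
# Sketch of the LSTM encoder: one frame per timestep, per-frame outputs averaged.
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    def __init__(self, n_mels=64, hidden=4096, out=2048):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=1, batch_first=True)
        self.proj = nn.Linear(hidden, out)          # assumed projection to the output size

    def forward(self, log_mel):                     # log_mel: (batch, frames, 64)
        h, _ = self.lstm(log_mel)                   # per-frame hidden states
        return self.proj(h).mean(dim=1)             # average over frames -> (batch, 2048)
```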
3) SincNet: The SincNet [17] is a 1d CNN that acts upon the raw waveform. It is intended to learn meaningful filters in the first layer of the model; specifically, built off of parameterized sinc functions, it learns high and low cutoff frequencies of a set of frequency bins. Prior to training, the model is initialized to the same mel-scale bins used in the mel spectrogram. We also develop two variants, SincNet+LSTM and SincNet+VGG11, in which the output of the SincNet is fed into LSTM and VGG11 encoders, respectively.
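To illustrate the sinc parameterization, below is a simplified re-implementation of such a first layer, not the authors' code: the filter count, kernel size, and initialization are placeholder assumptions, and the mel-scale initialization and the rest of the network are omitted.

```python
# Illustrative sinc-parameterized band-pass layer in the spirit of SincNet [17].
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    def __init__(self, n_filters=80, kernel_size=251, sr=16000):
        super().__init__()
        self.sr = sr
        # Learnable low cutoff and bandwidth (Hz) for each band-pass filter.
        self.f_low = nn.Parameter(torch.linspace(30.0, 4000.0, n_filters))
        self.band = nn.Parameter(torch.full((n_filters,), 400.0))
        self.register_buffer("n", torch.arange(kernel_size).float() - kernel_size // 2)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, wav):                                   # wav: (batch, 1, samples)
        f1 = torch.abs(self.f_low) / self.sr                  # normalized low cutoffs
        f2 = f1 + torch.abs(self.band) / self.sr              # normalized high cutoffs
        t = self.n.unsqueeze(0)                               # (1, kernel_size)
        low = 2 * f1.unsqueeze(1) * torch.sinc(2 * f1.unsqueeze(1) * t)
        high = 2 * f2.unsqueeze(1) * torch.sinc(2 * f2.unsqueeze(1) * t)
        filters = (high - low) * self.window                  # band-pass = difference of sincs
        return F.conv1d(wav, filters.unsqueeze(1))            # (batch, n_filters, time)
```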
C. Few-shot Learning and Prototypical Network
Many few-shot learning methods, including the prototypical network we use here, are trained and evaluated using the concept of an episode [7]. An episode consists of a support set and a query set. To build an episode, we first randomly draw k classes. The support set consists of n examples from each of these classes, and the query set consists of examples drawn from each of these classes that are not in the support set. The corresponding task is sometimes referred to as an n-shot k-way classification problem. Episodes function like batches, and training proceeds by repeatedly sampling an episode, computing the gradients of the loss function on the query set, and taking a gradient step to adjust model parameters. The parameters of the overall network are the weights of the encoder that embed each data point into the feature space (the Prototypical Network itself is non-parametric). In these networks, each class is defined by a prototype, which is simply the average of the embeddings for all support set datapoints belonging to that class. The probability the model assigns to each class for some novel data point x depends on the squared Euclidean distance between the embedding of x and the prototypes for each class (see Fig. 1).
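The per-episode computation can be written in a few lines. The sketch below follows the prototypical network of Snell et al. [9]; the shapes and helper names are illustrative assumptions (in particular, queries are assumed to be grouped by class in equal numbers), not the authors' code.

```python
# Sketch of one prototypical-network episode: prototypes are class means of the
# support embeddings, and queries are classified by (negative) squared distance.
import torch
import torch.nn.functional as F

def episode_loss(encoder, support, query, k, n):
    """support: (k*n, ...) audio grouped by class; query: (num_query, ...) audio,
    also grouped by class with an equal number of queries per class."""
    z_support = encoder(support)                        # (k*n, d) embeddings
    z_query = encoder(query)                            # (num_query, d)
    prototypes = z_support.view(k, n, -1).mean(dim=1)   # (k, d): one mean per class
    dists = torch.cdist(z_query, prototypes) ** 2       # squared Euclidean distances
    logits = -dists                                     # closer prototype -> higher score
    labels = torch.arange(k).repeat_interleave(z_query.shape[0] // k)
    return F.cross_entropy(logits, labels)              # softmax over classes
```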
III. EXPERIMENTS

A. Datasets

1) Kinetics 600: The Kinetics 600 [18] dataset consists of 10 second clips of distinct actions (e.g. hugging baby, opening wine bottle). We use train/validation/test splits that are suitable for a few-shot setting, proposed by [19]. There are two test sets defined: one with randomly selected held-out classes, and the other with held-out musical classes (instruments, singing, etc.) which are more likely to be discriminable by audio.
2) VoxCeleb: VoxCeleb [20] is a dataset containing hundreds of thousands of utterances of celebrity speech. These audio clips are recorded in a diverse set of acoustic environments, ranging from outdoor stadiums to indoor studios, with varying quality. We create splits suitable for the few-shot setting, with speakers as classes.
3) ICSI Meeting Corpus: The ICSI Meeting Corpus [21] consists of natural meetings held at the International Computer Science Institute in Berkeley, California. In order to utilize this data for the few-shot setting, we use a subset of the corpus containing meetings that only have speakers that are also present in other audio clips (so that we can build a support set for each query speaker). In total, we use 64 of the meetings, and segment them using ground truth segmentations to produce the datapoints used in the support and query sets. Due to the limited amount of data, we pretrain our VGG11 model on VoxCeleb and evaluate on ICSI without further training.
4) AudioSet: AudioSet [22] is a collection of audio from YouTube videos, specifically chosen for acoustic content. Because there are multiple positive labels per audio clip, we create a subset of AudioSet suitable for a few-shot setting. With a discrete optimization algorithm, we find an approximately optimal subset of classes that maximizes the number of audio clips containing only a single positive label among the subset of classes chosen. This yields a set of 150 classes, each having at least 378 examples, which is then split into train, validation and test sets.
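The paper does not state which discrete optimization algorithm was used. Purely to illustrate the selection objective, here is one simple greedy heuristic (a hypothetical stand-in, not the authors' method), assuming each clip is represented by the set of its positive labels.

```python
# Hypothetical greedy heuristic for the class-subset selection objective above.
def greedy_class_subset(clip_labels, target_size=150):
    """clip_labels: list of sets, one set of positive class labels per clip."""
    def score(subset):
        # Number of clips with exactly one positive label inside the subset.
        return sum(1 for labels in clip_labels if len(labels & subset) == 1)

    all_classes = set().union(*clip_labels)
    chosen = set()
    while len(chosen) < min(target_size, len(all_classes)):
        # Add the class that most increases the single-label clip count.
        best = max(all_classes - chosen, key=lambda c: score(chosen | {c}))
        chosen.add(best)
    return chosen
```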
B. Training Details
All experiments have k = 5 classes in each episode, with n = 1 or n = 5 labeled datapoints per class in the support set. We perform additional experiments with k = 10 for AudioSet. We train for 25,000 episodes, but evaluate validation performance every 500 episodes and use early stopping to terminate training if no progress has been made on the validation set for 10 consecutive checks. We use the Adam optimizer with a fixed learning rate. Little to no hyper-parameter tuning is performed. After training converges, we evaluate on 1000 randomly selected episodes from the test set and report average performance across these episodes.
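A schematic of this training loop is sketched below. `sample_episode` and `evaluate` are placeholders for dataset-specific code, `loss_fn` could be the episode loss sketched earlier, and the learning rate value is an assumption (it is not recoverable from this text).

```python
# Schematic episodic training loop with early stopping, as described above.
# `sample_episode`, `evaluate`, and `loss_fn` are placeholders; lr is assumed.
import torch

def train_few_shot(encoder, sample_episode, evaluate, loss_fn,
                   max_episodes=25000, check_every=500, patience=10, lr=1e-3):
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    best_val, bad_checks = 0.0, 0
    for episode in range(1, max_episodes + 1):
        support, query, k, n = sample_episode(split="train")
        loss = loss_fn(encoder, support, query, k, n)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if episode % check_every == 0:                 # validation check every 500 episodes
            val_acc = evaluate(encoder, split="val")
            if val_acc > best_val:
                best_val, bad_checks = val_acc, 0
            else:
                bad_checks += 1
                if bad_checks >= patience:             # early stopping after 10 flat checks
                    return encoder
    return encoder
```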
C. Results and Discussion

We evaluate each of the encoders on Kinetics 600 and VoxCeleb and report the results in Tables I and II. The results show that VGG11 is the best performing model across both tasks (activity classification and speaker identification). None of the SincNet variants outperforms VGG11 alone, but SincNet+LSTM does outperform the LSTM and SincNet individually. Note that the models perform better on our Kinetics 600 Test Set 2, consisting of different musical instruments, than on Test Set 1, as would be expected of a set whose classes are defined by acoustic cues. The speaker identification task (Table II) is significantly easier than the activity classification task, with very strong performance on VoxCeleb.

TABLE I
KINETICS 600 TEST SET 1 / TEST SET 2 ACCURACIES

                 1-shot, 5-way     5-shot, 5-way
VGG11                 /                  /
LSTM             27.7% / 30.2%     38.1% / 42.2%
SincNet          27.5% / 31.4%     34.7% / 41.5%
SincNet+VGG11    31.2% / 34.9%     44.9% / 48.4%
SincNet+LSTM     29.5% / 30.6%     38.5% / 46.3%

TABLE II
VOXCELEB TEST SET ACCURACIES

                 1-shot, 5-way     5-shot, 5-way
VGG11
LSTM             68.4%             86.5%
SincNet          37.5%             49.9%
SincNet+VGG11    64.8%             82.1%
SincNet+LSTM     70.5%             88.3%

Given that VGG11 yields the best performance, we evaluate it on the ICSI Meeting Corpus. Results are reported in Table III, which lists speaker identification accuracies as a function of the number of speakers in the file (averaging across files with the same number of speakers). As expected, accuracy generally gets better as the number of speakers decreases. Finally, our results on AudioSet are shown in Table IV. Performance is very similar to Kinetics. On one hand, this is expected, as both AudioSet and Kinetics are derived from YouTube, and thus highly noisy and variable. On the other hand, this is somewhat surprising because the AudioSet clips were specifically chosen to be characterized by audio, and we would therefore anticipate better performance on it.
TABLE III
ICSI MEETING CORPUS ACCURACIES WITH VGG11

TABLE IV
AUDIOSET TEST SET ACCURACIES WITH VGG11

1-shot, 5-way    5-shot, 5-way    1-shot, 10-way    5-shot, 10-way
31.0%            35.2%            19.3%             25.0%
IV. CONCLUSIONS
Few-shot learning methods aid efforts to democratize machine learning, giving people the ability to construct classifiers that solve problems that matter to them, with fewer resource hurdles. This should boost machine learning's applicability to a long tail of tasks with societal impact. In this paper, we provide a study of few-shot learning applied to audio data. Our experiments cover four different datasets, split between speaker identification and activity classification tasks. We compare the performance of three existing audio encoder models, and propose two new variations (SincNet+VGG11 and SincNet+LSTM). We find that the VGG-based model performs the best across all datasets and tasks. Useful extensions left to future work include varying the few-shot method itself, and applying these findings to problems that directly impact underrepresented groups in technology.

REFERENCES
[1] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pages 4960–4964, 2016.
[2] F. Richardson, D. Reynolds, and N. Dehak. Deep neural network approaches to speaker and language recognition. IEEE Signal Processing Letters, 22(10):1671–1675, 2015.
[3] Y. Wang and F. Metze. Connectionist temporal localization for sound event detection with sequential labeling. In Proc. ICASSP, 2019.
[4] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky. Deep neural network features and semi-supervised training for low resource speech recognition. In Proc. ICASSP, pages 6704–6708, 2013.
[5] J. Kunze, L. Kirsch, I. Kurenkov, A. Krug, J. Johannsmeier, and S. Stober. Transfer learning for speech recognition on a budget. In Proc. ACL Workshop on Representation Learning for NLP, pages 168–177, 2017.
[6] G. Koch. Siamese neural networks for one-shot image recognition. In Proc. ICML, 2015.
[7] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. arXiv:1606.04080, 2016.
[8] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. Torr, and T. Hospedales. Learning to compare: Relation network for few-shot learning. In Proc. CVPR, 2017.
[9] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Proc. NIPS, 2017.
[10] J. Pons, J. Serrà, and X. Serra. Training neural audio classifiers with few data. arXiv:1810.10274, 2018.
[11] J. Salamon, C. Jacoby, and J. P. Bello. A dataset and taxonomy for urban sound research. In Proc. ACM MM, pages 1041–1044, 2014.
[12] A. Mesaros, T. Heittola, and T. Virtanen. TUT database for acoustic scene classification and sound event detection. In Proc. EUSIPCO, 2016.
[13] P. Anand, A. K. Singh, S. Srivastava, and B. Lall. Few shot speaker recognition using deep neural networks. arXiv:1904.08775, 2019.
[14] A. Nagrani, J. S. Chung, and A. Zisserman. VoxCeleb: A large-scale speaker identification dataset. arXiv:1706.08612, 2017.
[15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[17] M. Ravanelli and Y. Bengio. Speaker recognition from raw waveform with SincNet. In Proc. SLT, 2018.
[18] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about Kinetics-600. arXiv:1808.01340, 2018.
[19] C. Careaga, B. Hutchinson, N. Hodas, and L. Phillips. Metric-based few-shot learning for video action recognition. arXiv:1909.09602, 2019.
[20] J. S. Chung, A. Nagrani, and A. Zisserman. VoxCeleb2: Deep speaker recognition. In Proc. Interspeech, 2018.
[21] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters. The ICSI meeting corpus. In Proc. ICASSP, 2003.
[22] J. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. ICASSP, 2017.