A Study of Few-Shot Audio Classification
Piper Wolters‡∗, Chris Careaga‡∗, Brian Hutchinson‡† and Lauren Phillips†
‡Computer Science Department, Western Washington University, Bellingham, WA
†Pacific Northwest National Laboratory, Richland, WA
∗These authors contributed equally. This work was funded by the U.S. Government.
Abstract—Advances in deep learning have resulted in state-of-the-art performance for many audio classification tasks but, unlike humans, these systems traditionally require large amounts of data to make accurate predictions. Not every person or organization has access to those resources, and the organizations that do, like our field at large, do not reflect the demographics of our country. Enabling people to use machine learning without significant resource hurdles is important, because machine learning is an increasingly useful tool for solving problems, and can solve a broader set of problems when put in the hands of a broader set of people. Few-shot learning is a type of machine learning designed to enable the model to generalize to new classes with very few examples. In this research, we address two audio classification tasks (speaker identification and activity classification) with the Prototypical Network few-shot learning algorithm, and assess performance of various encoder architectures. Our encoders include recurrent neural networks, as well as one- and two-dimensional convolutional neural networks. We evaluate our model for speaker identification on the VoxCeleb dataset and ICSI Meeting Corpus, obtaining 5-shot 5-way accuracies of 93.5% and 54.0%, respectively. We also evaluate for activity classification from audio using few-shot subsets of the Kinetics 600 dataset and AudioSet, both drawn from YouTube videos, obtaining 51.5% and 35.2% accuracy, respectively.
I. INTRODUCTION
The speech and signal processing communities were among the earliest adopters of deep learning, which is now used extensively there for tasks ranging from speech recognition [1] to speaker and language identification [2], to audio event detection [3]. To work well, however, these methods typically require very large amounts of training data, at a great cost. In order to mitigate this limitation, researchers have tried well-known strategies, including unsupervised and semi-supervised learning [4], transfer learning [5], and data augmentation. In recent years, many few-shot learning methods have been introduced, designed to generalize effectively to unseen classes with only a handful of examples for each class. For example, consider a medical practitioner who wishes to build a cough classifier that, given an audio recording, can classify which type of cough (if any) is present in a recording. Such a classifier could help to automate triage by phone in underserved areas by detecting various cough types (e.g. dry, wet, whooping, barking, croupy) that may be indicative of different diseases. In contrast to classical machine learning techniques, using few-shot learning means the practitioner need only collect a very small (∼5) set of examples of the types of cough of interest, significantly lowering the barrier to creating such a system. As another example, one could train a few-shot classifier to identify linguistic or non-linguistic cues of sexual harassment (e.g. phrases or catcalls), to enable detection of sexual harassment from audio.

One approach to few-shot learning is metric learning, which involves learning an embedding space in which to compare classes. Notable metric-learning few-shot algorithms include Siamese Networks [6], Matching Networks [7], Relation Networks [8], and Prototypical Networks [9]. These methods have been primarily developed in the computer vision field, but recent work has begun to address few-shot classification of audio. Pons et al. [10] experiment with prototypical networks, transfer learning, and the combination thereof in order to improve the performance of audio classifiers provided with small labeled datasets. They evaluate on the UrbanSound8k [11] and TUT [12] datasets, reporting 5-way classification accuracies of up to ∼ % and ∼ %, respectively. Anand et al. [13] propose utilizing an autoencoder to learn generalized feature embeddings from class-specific embeddings obtained from a capsule network. Performing exhaustive experiments on the VoxCeleb [14] and VCTK datasets, they obtain 5-way speaker classification accuracies of 91.5% and 96.5%, respectively.

While other work has evaluated the effectiveness of few-shot learning on audio with a limited set of encoders and a limited set of datasets, we conduct a study of audio few-shot classification contrasting five audio encoder architectures and reporting results on four widely used audio datasets. We use the prototypical network few-shot algorithm due to its strong performance on image datasets.

Fig. 1. Few-shot pipeline (3-shot 3-way setup shown). Prototypical Network diagram inspired by a figure in [9].

II. METHODS
Our pipeline begins with audio input, either raw waveforms or log-scaled mel filterbank features, which is fed into an encoder; the encoder outputs embeddings that are used by the prototypical network. The pipeline is shown in Fig. 1.
A. Feature Extraction

Each datapoint is a waveform audio clip with a 16kHz sampling rate. The SincNet acts directly upon this raw waveform. For the LSTM and VGG11 encoders, we further break the waveform into overlapping 25ms frames with a 10ms offset using a sliding Hamming window. From each frame we extract 64 mel filterbank features (the mel scale is a non-linear frequency scaling aligned with human perception of frequency). The features are then log scaled.
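The paper includes no code; the following is a minimal sketch of this front end, assuming librosa for feature extraction. The FFT size and the epsilon added before the log are assumptions not stated in the text.

```python
# Minimal sketch of the log-mel front end described above (not the authors' code).
# Assumed: librosa, an FFT size of 512, and a small epsilon before the log.
import numpy as np
import librosa

def log_mel_features(path, sr=16000, n_mels=64):
    """Return a (num_frames, 64) array of log-scaled mel filterbank features."""
    y, _ = librosa.load(path, sr=sr)                # waveform at 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=512,                                  # assumed FFT size
        win_length=int(0.025 * sr),                 # 25 ms frames
        hop_length=int(0.010 * sr),                 # 10 ms offset
        window="hamming",                           # sliding Hamming window
        n_mels=n_mels)                              # 64 mel filterbank features
    return np.log(mel + 1e-6).T                     # log scale; shape (frames, 64)
```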
B. Encoders

1) VGG11: We utilize a VGGish model (https://github.com/tensorflow/models/tree/master/research/audioset/vggish), based on the popular VGG [15] 2d convolutional neural network (CNN) that has performed well for computer vision tasks. Among the model configurations we tried, including deeper variants, VGG11 performed the best. We feed it windows consisting of 96 consecutive frames, each window offset by 48 frames from the previous. The 3072-dimensional per-window outputs of VGG11 are averaged to produce the representation of the entire audio clip.
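To make the windowing-and-averaging scheme concrete, here is a minimal sketch. The 2-D CNN itself is abstracted as `cnn` and assumed to map a (1, 96, 64) window to a 3072-dimensional vector; the paper does not give the full layer configuration.

```python
# Illustrative sketch of the window-and-average pooling described above;
# `cnn` stands in for the VGG11 body and is assumed to return 3072-d vectors.
import torch

def vgg_clip_embedding(cnn, log_mel, win=96, hop=48):
    """log_mel: (num_frames, 64) tensor; returns a (3072,) clip embedding."""
    windows = [log_mel[s:s + win].unsqueeze(0)             # (1, 96, 64) window
               for s in range(0, log_mel.shape[0] - win + 1, hop)]
    x = torch.stack(windows)                               # (num_windows, 1, 96, 64)
    return cnn(x).mean(dim=0)                              # average per-window outputs
```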
2) LSTM: Long Short-Term Memory networks [16] are a popular type of recurrent neural network; with LSTMs, predictions at timestep t can in theory leverage information contained in all inputs up to and including time t. We use a single-layer LSTM with a hidden size of 4096 and an output size of 2048. At each timestep, we feed in one frame. The per-frame outputs are averaged to produce the clip embedding.
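A minimal PyTorch sketch of this encoder follows. The linear projection from the 4096-dimensional hidden state down to the 2048-dimensional output is an assumption; the paper states only the two sizes.

```python
# Sketch of the LSTM encoder: one frame per timestep, per-frame outputs averaged.
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    def __init__(self, n_mels=64, hidden=4096, out=2048):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=1, batch_first=True)
        self.proj = nn.Linear(hidden, out)          # assumed projection to the output size

    def forward(self, log_mel):                     # log_mel: (batch, frames, 64)
        h, _ = self.lstm(log_mel)                   # per-frame hidden states
        return self.proj(h).mean(dim=1)             # average over frames -> (batch, 2048)
```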
3) SincNet: The SincNet [17] is a 1d CNN that acts upon the raw waveform. It is intended to learn meaningful filters in the first layer of the model; specifically, built off of parameterized sinc functions, it learns high and low cutoff frequencies of a set of frequency bins. Prior to training, the model is initialized to the same mel-scale bins used in the mel spectrogram. We also develop two variants, SincNet+LSTM and SincNet+VGG11, in which the output of the SincNet is fed into LSTM and VGG11 encoders, respectively.
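To illustrate the sinc parameterization, below is a simplified re-implementation of such a first layer, not the authors' code: the filter count, kernel size, and initialization are placeholder assumptions, and the mel-scale initialization and the rest of the network are omitted.

```python
# Illustrative sinc-parameterized band-pass layer in the spirit of SincNet [17].
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    def __init__(self, n_filters=80, kernel_size=251, sr=16000):
        super().__init__()
        self.sr = sr
        # Learnable low cutoff and bandwidth (Hz) for each band-pass filter.
        self.f_low = nn.Parameter(torch.linspace(30.0, 4000.0, n_filters))
        self.band = nn.Parameter(torch.full((n_filters,), 400.0))
        self.register_buffer("n", torch.arange(kernel_size).float() - kernel_size // 2)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, wav):                                   # wav: (batch, 1, samples)
        f1 = torch.abs(self.f_low) / self.sr                  # normalized low cutoffs
        f2 = f1 + torch.abs(self.band) / self.sr              # normalized high cutoffs
        t = self.n.unsqueeze(0)                               # (1, kernel_size)
        low = 2 * f1.unsqueeze(1) * torch.sinc(2 * f1.unsqueeze(1) * t)
        high = 2 * f2.unsqueeze(1) * torch.sinc(2 * f2.unsqueeze(1) * t)
        filters = (high - low) * self.window                  # band-pass = difference of sincs
        return F.conv1d(wav, filters.unsqueeze(1))            # (batch, n_filters, time)
```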
C. Few-shot Learning and Prototypical Network
Many few-shot learning methods, including the prototypical network we use here, are trained and evaluated using the concept of an episode [7]. An episode consists of a support set and a query set. To build an episode, we first randomly draw k classes. The support set consists of n examples from each of these classes, and the query set consists of examples drawn from each of these classes that are not in the support set. The corresponding task is sometimes referred to as an n-shot k-way classification problem. Episodes function like batches, and training proceeds by repeatedly sampling an episode, computing the gradients of the loss function on the query set, and taking a gradient step to adjust model parameters. The parameters of the overall network are the weights of the encoder that embed each data point into the feature space (the Prototypical Network itself is non-parametric). In these networks, each class is defined by a prototype, which is simply the average of the embeddings for all support set datapoints belonging to that class. The probability the model assigns to each class for some novel data point x depends on the squared Euclidean distance between the embedding of x and the prototypes for each class (see Fig. 1).
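The per-episode computation can be written in a few lines. The sketch below follows the prototypical network of Snell et al. [9]; the shapes and helper names are illustrative assumptions (in particular, queries are assumed to be grouped by class in equal numbers), not the authors' code.

```python
# Sketch of one prototypical-network episode: prototypes are class means of the
# support embeddings, and queries are classified by (negative) squared distance.
import torch
import torch.nn.functional as F

def episode_loss(encoder, support, query, k, n):
    """support: (k*n, ...) audio grouped by class; query: (num_query, ...) audio,
    also grouped by class with an equal number of queries per class."""
    z_support = encoder(support)                        # (k*n, d) embeddings
    z_query = encoder(query)                            # (num_query, d)
    prototypes = z_support.view(k, n, -1).mean(dim=1)   # (k, d): one mean per class
    dists = torch.cdist(z_query, prototypes) ** 2       # squared Euclidean distances
    logits = -dists                                     # closer prototype -> higher score
    labels = torch.arange(k).repeat_interleave(z_query.shape[0] // k)
    return F.cross_entropy(logits, labels)              # softmax over classes
```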
III. EXPERIMENTS

A. Datasets

1) Kinetics 600: The Kinetics 600 [18] dataset consists of 10 second clips of distinct actions (e.g. hugging baby, opening wine bottle). We use train/validation/test splits that are suitable for a few-shot setting, proposed by [19]. There are two test sets defined: one with randomly selected held-out classes, and the other with held-out musical classes (instruments, singing, etc.) which are more likely to be discriminable by audio.
2) VoxCeleb: VoxCeleb [20] is a dataset containing hundreds of thousands of utterances of celebrity speech. These audio clips are recorded in a diverse set of acoustic environments, ranging from outdoor stadiums to indoor studios, with varying quality. We create splits suitable for the few-shot setting, with speakers as classes.
3) ICSI Meeting Corpus: The ICSI Meeting Corpus [21] consists of natural meetings held at the International Computer Science Institute in Berkeley, California. In order to utilize this data for the few-shot setting, we use a subset of the corpus containing meetings that only have speakers that are also present in other audio clips (so that we can build a support set for each query speaker). In total, we use 64 of the meetings, and segment them using ground truth segmentations to produce the datapoints used in the support and query sets. Due to the limited amount of data, we pretrain our VGG11 model on VoxCeleb and evaluate on ICSI without further training.
4) AudioSet: AudioSet [22] is a collection of audio from YouTube videos, specifically chosen for acoustic content. Because there are multiple positive labels per audio clip, we create a subset of AudioSet suitable for a few-shot setting. With a discrete optimization algorithm, we find an approximately optimal subset of classes that maximizes the number of audio clips containing only a single positive label among the subset of classes chosen. This yields a set of 150 classes, each having at least 378 examples, which is then split into train, validation and test sets.
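The paper does not state which discrete optimization algorithm was used. Purely to illustrate the selection objective, here is one simple greedy heuristic (a hypothetical stand-in, not the authors' method), assuming each clip is represented by the set of its positive labels.

```python
# Hypothetical greedy heuristic for the class-subset selection objective above.
def greedy_class_subset(clip_labels, target_size=150):
    """clip_labels: list of sets, one set of positive class labels per clip."""
    def score(subset):
        # Number of clips with exactly one positive label inside the subset.
        return sum(1 for labels in clip_labels if len(labels & subset) == 1)

    all_classes = set().union(*clip_labels)
    chosen = set()
    while len(chosen) < min(target_size, len(all_classes)):
        # Add the class that most increases the single-label clip count.
        best = max(all_classes - chosen, key=lambda c: score(chosen | {c}))
        chosen.add(best)
    return chosen
```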
B. Training Details
All experiments have k = 5 classes in each episode, with n = 1 or n = 5 labeled datapoints per class in the support set. We perform additional experiments with k = 10 for AudioSet. We train for 25,000 episodes, but evaluate validation performance every 500 episodes and use early stopping to terminate training if no progress has been made on the validation set for 10 consecutive checks. We use the Adam optimizer with a fixed learning rate. Little to no hyper-parameter tuning is performed. After training converges, we evaluate on 1000 randomly selected episodes from the test set and report average performance across these episodes.
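A schematic of this training loop is sketched below. `sample_episode` and `evaluate` are placeholders for dataset-specific code, `loss_fn` could be the episode loss sketched earlier, and the learning rate value is an assumption (it is not recoverable from this text).

```python
# Schematic episodic training loop with early stopping, as described above.
# `sample_episode`, `evaluate`, and `loss_fn` are placeholders; lr is assumed.
import torch

def train_few_shot(encoder, sample_episode, evaluate, loss_fn,
                   max_episodes=25000, check_every=500, patience=10, lr=1e-3):
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    best_val, bad_checks = 0.0, 0
    for episode in range(1, max_episodes + 1):
        support, query, k, n = sample_episode(split="train")
        loss = loss_fn(encoder, support, query, k, n)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if episode % check_every == 0:                 # validation check every 500 episodes
            val_acc = evaluate(encoder, split="val")
            if val_acc > best_val:
                best_val, bad_checks = val_acc, 0
            else:
                bad_checks += 1
                if bad_checks >= patience:             # early stopping after 10 flat checks
                    return encoder
    return encoder
```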
C. Results and Discussion

We evaluate each of the encoders on Kinetics 600 and VoxCeleb and report the results in Tables I and II. The results show that VGG11 is the best performing model across both tasks (activity classification and speaker identification). None of the SincNet variants outperforms VGG11 alone, but SincNet+LSTM does outperform the LSTM and SincNet individually. Note that the models perform better on our Kinetics 600 Test Set 2, consisting of different musical instruments, than on Test Set 1, as would be expected of a set whose classes are defined by acoustic cues. The speaker identification task (Table II) is significantly easier than the activity classification task, with very strong performance on VoxCeleb.

TABLE I
KINETICS 600 TEST SET 1 / TEST SET 2 ACCURACIES

                 1-shot, 5-way     5-shot, 5-way
VGG11                 /                  /
LSTM             27.7% / 30.2%     38.1% / 42.2%
SincNet          27.5% / 31.4%     34.7% / 41.5%
SincNet+VGG11    31.2% / 34.9%     44.9% / 48.4%
SincNet+LSTM     29.5% / 30.6%     38.5% / 46.3%

TABLE II
VOXCELEB TEST SET ACCURACIES

                 1-shot, 5-way     5-shot, 5-way
VGG11
LSTM             68.4%             86.5%
SincNet          37.5%             49.9%
SincNet+VGG11    64.8%             82.1%
SincNet+LSTM     70.5%             88.3%

Given that VGG11 yields the best performance, we evaluate it on the ICSI Meeting Corpus. Results are reported in Table III, which lists speaker identification accuracies as a function of the number of speakers in the file (averaging across files with the same number of speakers). As expected, accuracy generally gets better as the number of speakers decreases. Finally, our results on AudioSet are shown in Table IV. Performance is very similar to Kinetics. On one hand, this is expected, as both AudioSet and Kinetics are derived from YouTube, and thus highly noisy and variable. On the other hand, this is somewhat surprising because the AudioSet clips were specifically chosen to be characterized by audio, and we would therefore anticipate better performance on it.
TABLE III
ICSI MEETING CORPUS ACCURACIES WITH VGG11

TABLE IV
AUDIOSET TEST SET ACCURACIES WITH VGG11

1-shot, 5-way    5-shot, 5-way    1-shot, 10-way    5-shot, 10-way
31.0%            35.2%            19.3%             25.0%
IV. CONCLUSIONS
Few-shot learning methods aid efforts to democratize machine learning, giving people the ability to construct classifiers that solve problems that matter to them, with fewer resource hurdles. This should boost machine learning's applicability to a long tail of tasks with societal impact. In this paper, we provide a study of few-shot learning applied to audio data. Our experiments cover four different datasets, split between speaker identification and activity classification tasks. We compare the performance of three existing audio encoder models, and propose two new variations (SincNet+VGG11 and SincNet+LSTM). We find that the VGG-based model performs the best across all datasets and tasks. Useful extensions left to future work include varying the few-shot method itself, and applying these findings to problems that directly impact underrepresented groups in technology.

REFERENCES
[1] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pages 4960–4964, 2016.
[2] F. Richardson, D. Reynolds, and N. Dehak. Deep neural network approaches to speaker and language recognition. IEEE Signal Processing Letters, 22(10):1671–1675, 2015.
[3] Y. Wang and F. Metze. Connectionist temporal localization for sound event detection with sequential labeling. In Proc. ICASSP, 2019.
[4] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky. Deep neural network features and semi-supervised training for low resource speech recognition. In Proc. ICASSP, pages 6704–6708, 2013.
[5] J. Kunze, L. Kirsch, I. Kurenkov, A. Krug, J. Johannsmeier, and S. Stober. Transfer learning for speech recognition on a budget. In Proc. ACL Workshop on Representation Learning for NLP, pages 168–177, 2017.
[6] G. Koch. Siamese neural networks for one-shot image recognition. In Proc. ICML, 2015.
[7] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. arXiv:1606.04080, 2016.
[8] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. Torr, and T. Hospedales. Learning to compare: Relation network for few-shot learning. In Proc. CVPR, 2017.
[9] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Proc. NIPS, 2017.
[10] J. Pons, J. Serrà, and X. Serra. Training neural audio classifiers with few data. arXiv:1810.10274, 2018.
[11] J. Salamon, C. Jacoby, and J. P. Bello. A dataset and taxonomy for urban sound research. In Proc. ACM MM, pages 1041–1044, 2014.
[12] A. Mesaros, T. Heittola, and T. Virtanen. TUT database for acoustic scene classification and sound event detection. In Proc. EUSIPCO, 2016.
[13] P. Anand, A. K. Singh, S. Srivastava, and B. Lall. Few shot speaker recognition using deep neural networks. arXiv:1904.08775, 2019.
[14] A. Nagrani, J. S. Chung, and A. Zisserman. VoxCeleb: A large-scale speaker identification dataset. arXiv:1706.08612, 2017.
[15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[17] M. Ravanelli and Y. Bengio. Speaker recognition from raw waveform with SincNet. In Proc. SLT, 2018.
[18] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about Kinetics-600. arXiv:1808.01340, 2018.
[19] C. Careaga, B. Hutchinson, N. Hodas, and L. Phillips. Metric-based few-shot learning for video action recognition. arXiv:1909.09602, 2019.
[20] J. S. Chung, A. Nagrani, and A. Zisserman. VoxCeleb2: Deep speaker recognition. In Proc. Interspeech, 2018.
[21] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters. The ICSI meeting corpus. In Proc. ICASSP, 2003.
[22] J. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. ICASSP, 2017.