Few-Shot Keyword Spotting With Prototypical Networks
Archit Parnami, Minwoo Lee
The University of North Carolina at Charlotte [email protected], [email protected]
Abstract
Keyword spotting, the task of recognizing a particular command or keyword, is widely used in voice interfaces such as Amazon's Alexa and Google Home. To recognize a set of keywords, most recent deep learning based approaches use a neural network trained with a large number of samples to identify certain pre-defined keywords. This restricts the system from recognizing new, user-defined keywords. Therefore, we first formulate this problem as few-shot keyword spotting and approach it using metric learning. To enable this research, we also synthesize and publish a Few-Shot Google Speech Commands dataset. We then propose a solution to the few-shot keyword spotting problem using temporal and dilated convolutions on prototypical networks. Our comparative experimental results demonstrate keyword spotting of new keywords using just a small number of samples.
1. Introduction
Most smart devices these days have an inbuilt voice recognition system which is mainly used for taking voice input from a user. This requires the voice recognition system to detect specific words (keywords/commands), also known as the Keyword Spotting (KWS) problem. Most approaches use either Large Vocabulary Continuous Speech Recognition (LVCSR) based models [1, 2] or lightweight deep neural network based models [3]. The former, LVCSR, demands substantial resources and computation power and is hence deployed in the cloud, raising privacy concerns and latency issues. The latter models are trained to recognize a set of pre-defined keywords using thousands of training examples. However, with smart devices becoming more personalized, there is a growing need for such systems 1) to recognize custom or new keywords on-device and 2) to quickly adapt from a small number of user samples, as the existing approaches require a large number of training samples. Therefore, we attempt to solve this problem of recognizing new keywords given a few samples, hereon referred to as Few-Shot Keyword Spotting (FS-KWS).

Current approaches to KWS involve extracting audio features from the input keyword and then passing them as input to a Deep Neural Network (DNN) for classification [4, 3, 5, 6, 7]. In particular, the use of convolutional neural networks (CNNs) [8] in conjunction with Mel-frequency Cepstral Coefficients (MFCC) as speech features has been shown to produce remarkable results [3, 4, 7, 9, 10].

Due to the data-hungry nature of DNNs, the field of Few-Shot Learning has recently gained a lot of attention. Specifically, Few-Shot Classification (FSC) [11] aims to learn a classifier that can recognize new classes (not seen during training) when given limited, labeled examples for each new class. Broadly, there are two approaches to FSC. First, metric learning based approaches [12, 13, 14] try to learn a good embedding function which places examples of the same class close to each other, and far from examples of different classes, in an embedding space under a metric (distance function). Second, optimization based approaches [15, 16] attempt to learn good initialization parameters for a classifier such that it can be fine-tuned with a few gradient descent steps on examples from new classes to classify them correctly. Both approaches train the classifier with a new set of classes in each training episode so that it can classify another new set of classes at test time.

Previously, [17] attempted to solve FS-KWS using model-agnostic meta learning (MAML) [16], an optimization based approach to FSC. However, since KWS is deployed on small devices with limited computation capability, an optimization based approach that requires fine-tuning may not always be feasible. Hence, we approach FS-KWS using a metric learning based approach, specifically Prototypical Networks [14], which can perform inference in an end-to-end manner. The following summarizes our main contributions:

• We propose a keyword spotting system that can classify new keywords from limited samples by a few-shot formulation of keyword spotting with metric learning.
• We propose a temporally dilated CNN architecture as a better embedding function for FS-KWS.
• We release a FS-KWS dataset synthesized from Google's Speech Commands dataset [18]. To make it more challenging, we also incorporate background noise and detection of silence and unknown (negative) keywords.
2. Few-Shot Keyword Spotting (FS-KWS) Problem
Consider a set $S$ of user-defined keywords such that $S = \{(s_i, y_i)\}_{i=1}^{N \times K}$, where $s_i$ is a keyword sample (voice input) and $y_i$ is its label. The set $S$ contains $N$ keywords, each keyword having $K$ samples, where $K$ is a small number (e.g., 1, 2, or 5). Then, given a user query $q$, the objective of an FS-KWS system is to classify $q$ into one of the $N$ keyword classes. The user-defined keywords in $S$ could be new, i.e., never seen during the training of the FS-KWS system. Yet, the system should be able to detect $q$, given $S$.
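For concreteness, here is a minimal sketch of this task structure in Python; the keyword names, array shapes, and placeholder waveforms are purely illustrative and not from the paper.

```python
import numpy as np

# Illustrative 3-way 2-shot FS-KWS task: S holds N x K labeled samples.
N, K, SR = 3, 2, 16000                      # N keywords, K samples each, 16 kHz audio
keywords = ["lumos", "abraxas", "sesame"]   # hypothetical user-defined keywords
support = [(np.zeros(SR, dtype=np.float32), label)  # (one-second waveform s_i, label y_i)
           for label in range(N) for _ in range(K)]
query = np.zeros(SR, dtype=np.float32)      # the system must map q to one of the N classes,
                                            # even though none of them was seen in training
```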
3. FS-KWS Framework
We base our framework (Figure 1) on Prototypical Networks [14] for building the FS-KWS system.

[Figure 1: Few-Shot Keyword Spotting pipeline.]

The FS-KWS model is trained on a labeled dataset $D_{train}$ and tested on $D_{test}$. The sets of keywords present in $D_{train}$ and $D_{test}$ are disjoint. The test set has only a few labeled samples per keyword. We follow an episodic training paradigm in which, in each episode, the model is trained to solve an $N$-way $K$-shot FS-KWS task. Each episode $e$ is created by first sampling $N$ categories from the training set and then sampling two sets of examples from these categories: (1) the support set $S_e = \{(s_i, y_i)\}_{i=1}^{N \times K}$ containing $K$ examples for each of the $N$ categories, and (2) the query set $Q_e = \{(q_j, y_j)\}_{j=1}^{N \times Q}$ containing $Q$ different examples from the same $N$ categories. The episodic training for FS-KWS minimizes, for each episode, the loss of the prediction on samples in the query set, given the support set. The model is a parameterized function, and the loss is the negative log likelihood of the true class of each query sample:

$$L(\theta) = -\sum_{t=1}^{|Q_e|} \log P_\theta(y_t \mid q_t, S_e), \qquad (1)$$

where $(q_t, y_t) \in Q_e$ and $S_e$ are, respectively, the sampled query and support set at episode $e$, and $\theta$ are the parameters of the model.

Prototypical Networks make use of the support set to compute a centroid (prototype) for each category (in the sampled episode), and query samples are classified based on their distance to each prototype. The model is a CNN $f: \mathbb{R}^{n_v} \rightarrow \mathbb{R}^{n_p}$, parameterized by $\theta_f$, that learns an $n_p$-dimensional space where $n_v$-dimensional input samples of the same category are close and those of different categories are far apart. For every episode $e$, each embedding prototype $p_c$ (of category $c$) is computed by averaging the embeddings of all support samples of class $c$:

$$p_c = \frac{1}{|S_e^c|} \sum_{(s_i, y_i) \in S_e^c} f(s_i),$$

where $S_e^c \subset S_e$ is the subset of support examples belonging to class $c$. Given a distance function $d$, the distance of the query $q_t$ to each of the class prototypes $p_c$ is calculated. By taking a softmax [19] over the measured (negative) distances, the model produces a distribution over the $N$ categories in each episode:

$$P(y = c \mid q_t, S_e, \theta) = \frac{\exp(-d(f(q_t), p_c))}{\sum_{n} \exp(-d(f(q_t), p_n))},$$

where the metric $d$ is the Euclidean distance, and the parameters $\theta$ of the model are updated with stochastic gradient descent by minimizing Equation (1). Once the training finishes, the parameters $\theta$ of the network are frozen. Then, given any new FS-KWS task, the category corresponding to the maximum $P$ is the predicted category for the input query $q_t$.
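The episode-level computation above is compact enough to show in full. Below is a minimal PyTorch sketch of the prototypical loss, assuming an embedding network `f` and already-batched support/query tensors; the function and variable names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def prototypical_loss(f, support, support_y, query, query_y, n_classes):
    """Episode loss from Equation (1): NLL of the softmax over
    negative Euclidean distances to the class prototypes."""
    z_s = f(support)                     # (N*K, n_p) support embeddings
    z_q = f(query)                       # (N*Q, n_p) query embeddings
    # Prototype of each class = mean of its support embeddings.
    protos = torch.stack([z_s[support_y == c].mean(dim=0)
                          for c in range(n_classes)])   # (N, n_p)
    # Euclidean distance, as in the paper (squared Euclidean is also common).
    dists = torch.cdist(z_q, protos)                    # (N*Q, N)
    log_p = F.log_softmax(-dists, dim=1)                # log P(y = c | q_t, S_e)
    return F.nll_loss(log_p, query_y)
```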
In each episode, we first obtain Mel-frequency Cepstral Coefficient (MFCC) features for all the examples in the support set and the query set, which then act as input to the embedding network as shown in Figure 1. Following [5], we extract 40 MFCC features from speech frames of length 40 ms with a stride of 20 ms (see Figure 2).

[Figure 2: Example transformation of input speech to MFCC features: (a) input speech, (b) MFCC features.]

[Figure 3: Reshaping MFCC features for time convolution.]
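As a concrete illustration, the following sketch extracts the same features with librosa and reshapes them for time convolution (Figure 3), with the 40 MFCC coefficients treated as input channels and frames as the convolved axis; the library choice and names are our assumptions, not part of the paper.

```python
import librosa
import numpy as np

def extract_features(wav_path, sr=16000, n_mfcc=40):
    """40 MFCCs over 40 ms frames with a 20 ms stride, as in [5]."""
    audio, _ = librosa.load(wav_path, sr=sr)  # one-second keyword clip
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.040 * sr),       # 40 ms frame
                                hop_length=int(0.020 * sr))  # 20 ms stride
    # Reshape for time convolution: coefficients -> channels, frames -> width.
    return mfcc[np.newaxis, :, :]  # (batch=1, channels=40, time)
```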
Choi et al. [9] demonstrated improved performance on KWS with temporal convolutions by reshaping the input MFCC features (Figure 3). Also, Coucke et al. [10] have shown that dilated convolutions are helpful in the processing of keyword signals. Therefore, we combine both techniques by first reshaping the input MFCC features and then performing temporal convolutions with dilation. We modify the TC-ResNet8 [9] architecture, reducing the kernel size and using dilations of 1, 2, and 4 with stride 1 in the three ResNet blocks, respectively. This proposed architecture, TD-ResNet7 (Figure 4), is then used to embed the reshaped input MFCC features (Figure 3); an illustrative sketch follows Figure 4.
[Figure 4: The proposed dilated time convolutional neural network for embedding: (a) block, (b) TD-ResNet7.]
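A minimal PyTorch sketch of the idea is given below. The kernel size, channel widths (borrowed from TC-ResNet8's progression), and global pooling are our assumptions; treat this as an illustration of dilated temporal residual blocks, not the paper's exact TD-ResNet7.

```python
import torch
import torch.nn as nn

class TDResBlock(nn.Module):
    """One dilated temporal residual block (sketch; kernel size 3 is assumed)."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        pad = dilation  # keeps the time length unchanged for kernel size 3
        self.conv1 = nn.Conv1d(in_ch, out_ch, 3, padding=pad, dilation=dilation, bias=False)
        self.bn1 = nn.BatchNorm1d(out_ch)
        self.conv2 = nn.Conv1d(out_ch, out_ch, 3, padding=pad, dilation=dilation, bias=False)
        self.bn2 = nn.BatchNorm1d(out_ch)
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv1d(in_ch, out_ch, 1, bias=False))
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.skip(x))

class TDResNet7Sketch(nn.Module):
    """Embedding network: three blocks with dilations 1, 2, 4 and stride 1."""
    def __init__(self, n_mfcc=40, emb_dim=64):
        super().__init__()
        self.stem = nn.Conv1d(n_mfcc, 16, 3, padding=1, bias=False)
        self.blocks = nn.Sequential(
            TDResBlock(16, 24, dilation=1),
            TDResBlock(24, 32, dilation=2),
            TDResBlock(32, 48, dilation=4),
        )
        self.head = nn.Linear(48, emb_dim)

    def forward(self, x):               # x: (batch, n_mfcc, time)
        h = self.blocks(self.stem(x))   # (batch, 48, time)
        h = h.mean(dim=2)               # global average pool over time
        return self.head(h)             # n_p-dimensional embedding
```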
4. Few-Shot Google Speech Command Dataset
Google's Speech Commands dataset [18] has been used previously [5, 9] for the keyword spotting problem. The dataset has a total of 35 keywords and contains multiple utterances of each keyword by multiple speakers. Each utterance is stored as a one-second (or less) WAVE format file, with the sample data encoded as linear 16-bit single-channel PCM values at a 16 kHz rate. We curate a FS-KWS dataset from this dataset by performing the following preprocessing steps:

1. Filtering: We filter out all the utterances which are less than one second long. This ensures the consistency of the output MFCC feature matrix obtained from each audio file.

2. Grouping: To train our KWS system to detect whether an input query is an unknown keyword (not present in S), we group the keywords into two categories: Core and Unknown. Keywords having more than 1000 speakers are considered core words, and the rest are put in the category of unknown words.

3. Balancing: Next, we balance the dataset so that all keywords in a group have the same number of samples. As a result, we have 30 core keywords, each with 1062 samples, and 5 unknown keywords, each with 386 samples, where all samples for a particular keyword come from different speakers. Keyword statistics are given in Table 1.

4. Splitting: (a) Core keywords are randomly split into sets of 20, 5, and 5 for training, validation, and testing, respectively. Note that these splits do not have any classes (keywords) in common. (b) Unknown keywords are used for detecting negative inputs. Since we have only 5 keywords in the unknown category, we utilize them in all three phases of training, validation, and testing. For each keyword in the unknown category, 60% of its samples are used for training, 20% for validation, and 20% for testing. Note that in this case, all the training, validation, and test phases use the same 5 keywords as an unknown class, but the samples are still from different speakers.

5. Mixing Background Noise: The original Speech Commands dataset [18] comes with a collection of sounds (6 WAVE files) that can be mixed with the one-second utterances of keywords to simulate background noise. Following the implementation of [20] for mixing background noise, small snippets of these files are chosen at random and mixed at a low volume into audio samples during training (a minimal mixing sketch follows Figure 5 below). The loudness is also chosen randomly, controlled by a hyper-parameter expressed as a proportion where 0 is silence and 1 is full volume. In our experiments, we set the background volume to 0.1 and conduct experiments with both the presence and absence of background noise.

6. Detecting Silence: Apart from the core classes and the unknown class, we curate another class, silence, to detect the absence of keywords. Again following the implementation of [20], we randomly sample 1000 one-second-long sections of data from the background sounds. Since there is never complete silence in real environments, we have to supply examples with quiet and irrelevant audio. We conduct experiments in both the presence and absence of samples from the silence class.

Table 1: Keyword statistics.

Keyword | Speakers | Utterances (Min / Max / Mean)
Core:
down | 1465 | 1 / 14 / 2.44
zero | 1450 | 1 / 13 / 2.59
seven | 1450 | 1 / 11 / 2.53
nine | 1443 | 1 / 12 / 2.51
five | 1442 | 1 / 19 / 2.58
yes | 1422 | 1 / 20 / 2.6
four | 1421 | 1 / 14 / 2.39
left | 1416 | 1 / 12 / 2.47
stop | 1413 | 1 / 22 / 2.52
six | 1411 | 1 / 14 / 2.55
right | 1409 | 1 / 15 / 2.45
on | 1403 | 1 / 19 / 2.47
three | 1401 | 1 / 11 / 2.43
off | 1387 | 1 / 16 / 2.47
dog | 1385 | 1 / 5 / 1.31
marvin | 1378 | 1 / 6 / 1.33
one | 1376 | 1 / 12 / 2.54
go | 1372 | 1 / 12 / 2.53
no | 1368 | 1 / 18 / 2.59
two | 1367 | 1 / 15 / 2.58
eight | 1358 | 1 / 15 / 2.53
house | 1357 | 1 / 5 / 1.35
wow | 1336 | 1 / 5 / 1.35
happy | 1332 | 1 / 7 / 1.33
bird | 1315 | 1 / 7 / 1.34
cat | 1300 | 1 / 5 / 1.32
up | 1291 | 1 / 17 / 2.53
sheila | 1291 | 1 / 6 / 1.36
bed | 1257 | 1 / 6 / 1.34
tree | 1062 | 1 / 6 / 1.39
Unknown:
visual | 412 | 1 / 7 / 3.57
forward | 397 | 1 / 10 / 3.66
backward | 396 | 1 / 23 / 3.93
follow | 387 | 1 / 11 / 3.76
learn | 386 | 1 / 24 / 3.69

We provide a script to synthesize this Few-Shot Speech Commands dataset at our repository: https://github.com/ArchitParnami/Few-Shot-KWS

[Figure 5: Training cases demonstrated for 3-way FS-KWS. (a) Core: In each task T_i, 3 core classes are randomly sampled from D_train. Then, for each core class C_n, s support examples C_n^s and q query examples C_n^q (different from the support examples) are sampled. For testing, a new task T_new is constructed which contains new classes C_i, C_j, C_k sampled from D_test. (b) Core + Background: Here, each keyword sample is mixed with background noise. (c) Core + Optional: An optional class (O) is present along with the core classes both during training and testing. (d) Core + Unknown + Background + Silence: Two optional classes, i.e., Unknown (U) and Silence (S), are present, and the samples are also mixed with background noise. (Note: in our experiments, the position of the optional classes in (c) and (d) is random and not always the last position as presented in this figure.)]
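The noise-mixing step (item 5 above) amounts to adding a randomly chosen snippet of a background file at a fixed proportion of its volume. A minimal numpy sketch, with hypothetical names and assuming waveforms normalized to [-1, 1]:

```python
import numpy as np

def mix_background(sample, noise, volume=0.1, rng=None):
    """Mix a random one-second snippet of a background noise file into a
    one-second keyword sample. `volume` follows the paper's convention:
    0 is silence, 1 is full volume (0.1 in our experiments)."""
    rng = rng or np.random.default_rng()
    start = rng.integers(0, len(noise) - len(sample) + 1)
    snippet = noise[start:start + len(sample)]
    mixed = sample + volume * snippet
    return np.clip(mixed, -1.0, 1.0)  # keep the waveform in a valid range
```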
5. Experiments
To test the effectiveness of our approach, we divide our experiments into four cases (Figure 5):

(a) Core - Pure Keyword Detection: Both during training and testing, the keyword samples in the support (S) and query (Q) sets are from core keywords, without any background noise.

(b) Core + Background: Same as (a), except the keyword samples are now mixed with random background noise.

(c) Core + Optional: To account for scenarios where the input query is not from any of the keywords present in the provided support set, or where there is simply no input, we train and test in the presence of an optional class. This optional class is unknown keywords when we want to detect negatives, and silence when we want to detect the absence of any spoken keyword.

(d) Core + Unknown + Silence + Background: Samples from both optional classes, i.e., Unknown and Silence, are present and are also mixed with background noise. This case simulates more realistic scenarios where the input is often mixed with background noise and could be an unknown word or just silence.

In each of the above cases, we train and test in an N-way K-shot manner, where N refers to the number of core classes and K refers to the number of training examples per class in each episode, as explained in Section 3. In cases where an optional class (Silence or Unknown) is used, we add K support examples for the optional class to the support sets both during training and testing. We perform episodic training as suggested in [14] and train all our models for 200 epochs, where each epoch has 200 training episodes and 100 validation (test) episodes. We use the Adam optimizer [21] and cut the learning rate in half every 20 epochs. We conduct experiments with N ∈ {2, 4} and a range of K values for all the mentioned cases. The model is trained on the loss computed from 5 queries per class in each episode and evaluated more strictly with 15 queries per class during testing; a minimal sketch of this episodic loop appears below.
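To make the protocol concrete, here is a sketch of episode sampling and one training step, reusing the `prototypical_loss` sketch from Section 3; the dataset layout and helper names are our assumptions.

```python
import random
import torch

def sample_episode(data_by_class, n_way=4, k_shot=5, n_query=5):
    """data_by_class: dict mapping keyword -> list of MFCC tensors."""
    classes = random.sample(list(data_by_class), n_way)
    support, support_y, query, query_y = [], [], [], []
    for label, kw in enumerate(classes):
        samples = random.sample(data_by_class[kw], k_shot + n_query)
        support += samples[:k_shot];  support_y += [label] * k_shot
        query   += samples[k_shot:];  query_y   += [label] * n_query
    return (torch.stack(support), torch.tensor(support_y),
            torch.stack(query), torch.tensor(query_y))

# One training step (f is the embedding network, e.g., the TD-ResNet7 sketch):
# opt = torch.optim.Adam(f.parameters())
# sched = torch.optim.lr_scheduler.StepLR(opt, step_size=20, gamma=0.5)  # halve LR every 20 epochs
# s, sy, q, qy = sample_episode(train_data, n_way=4, k_shot=5)
# loss = prototypical_loss(f, s, sy, q, qy, n_classes=4)
# loss.backward(); opt.step(); opt.zero_grad()
```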
As we formulate and propose a new FS-KWS problem, there is a lack of prior research and of a standard FS-KWS dataset. Thus, to show the effectiveness of the proposed framework, we employ three different existing architectures as the embedding network in our FS-KWS framework and examine the performance of the proposed approach. The baseline embedding networks are:

• cnn_trad_fpool3 [3] was originally proposed for the KWS problem. It has two convolutional layers followed by a linear, a dense, and a softmax layer. We use the output of the dense layer as the network embeddings.

• C64 [14] is the original 4-layer CNN used in Prototypical Networks for few-shot image classification on miniImageNet [13].

• TC-ResNet8 [9] has demonstrated great results on KWS. We remove the last fully connected and softmax layers and use the remaining architecture as our embedding network in the FS-KWS framework.

[Figure 6: Comparing test accuracy of embedding network architectures on 4-way FS-KWS as we increase the number of support examples, for all four cases mentioned in Section 5: (a) Core, (b) Core + Background, (c) Core + Unknown, (d) Core + Unknown + Background + Silence.]
Table 2 lists the results for the three baselines and our proposed architecture on the experiments mentioned above. Given a new 2-way 5-shot KWS task with keywords not seen during training, our TD-ResNet7 model can classify an input query with ∼94% accuracy using the proposed FS-KWS pipeline. This is not feasible with classical deep learning solutions without the FS-KWS formulation.

The TD-ResNet7 architecture also outperforms all the existing baseline architectures on all the test cases, except in (b) Core + Background, where the performance of TC-ResNet8 on 2-way 5-shot KWS is slightly better; however, this difference is not statistically significant under ANOVA, whereas the differences in the other cases are. These results are illustrated in Figure 6. As we increase the number of shots (samples per class), the overall performance improves for all architectures, yet the TD-ResNet7 architecture consistently outperforms the other baselines. All accuracy results are averaged over 100 test episodes and are reported with 95% confidence intervals.
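For completeness, the reported 95% confidence intervals can be computed from per-episode accuracies as in the sketch below; the normal-approximation estimator is our assumption, as the paper does not specify one.

```python
import numpy as np

def mean_and_ci95(episode_accs):
    """Mean accuracy over test episodes with a 95% confidence interval
    (normal approximation: 1.96 * standard error of the mean)."""
    accs = np.asarray(episode_accs)
    half_width = 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))
    return accs.mean(), half_width
```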
6. Conclusion
In this work, we attempted to solve the keyword spotting problem using only limited samples of each keyword. We demonstrated that prototypical networks with our proposed embedding model, which uses temporal and dilated convolutions, can produce significant results with only a few examples. We also synthesize and release a Few-Shot Google Speech Commands dataset for future research on Few-Shot Keyword Spotting.
Performance comparison of different embedding networks when plugged into FS-KWS pipeline for 4 different cases.
7. References

[1] P. Motlicek, F. Valente, and I. Szoke, "Improving acoustic based keyword spotting using LVCSR lattices," in ICASSP. IEEE, 2012, pp. 4413–4416.
[2] D. Can and M. Saraclar, "Lattice indexing for spoken term detection," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2338–2347, 2011.
[3] T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[4] G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in ICASSP. IEEE, 2014, pp. 4087–4091.
[5] Y. Zhang, N. Suda, L. Lai, and V. Chandra, "Hello edge: Keyword spotting on microcontrollers," arXiv preprint arXiv:1711.07128, 2017.
[6] R. Tang and J. Lin, "Deep residual learning for small-footprint keyword spotting," in ICASSP. IEEE, 2018, pp. 5484–5488.
[7] D. C. de Andrade, S. Leo, M. L. D. S. Viana, and C. Bernkopf, "A neural attention model for speech command recognition," arXiv preprint arXiv:1808.08929, 2018.
[8] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[9] S. Choi, S. Seo, B. Shin, H. Byun, M. Kersner, B. Kim, D. Kim, and S. Ha, "Temporal convolution for real-time keyword spotting on mobile devices," arXiv preprint arXiv:1904.03814, 2019.
[10] A. Coucke, M. Chlieh, T. Gisselbrecht, D. Leroy, M. Poumeyrol, and T. Lavril, "Efficient keyword spotting using dilated convolutions and gating," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6351–6355.
[11] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, "A closer look at few-shot classification," ArXiv, vol. abs/1904.04232, 2019.
[12] G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese neural networks for one-shot image recognition," in ICML Deep Learning Workshop, vol. 2, 2015.
[13] O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra, "Matching networks for one shot learning," in NIPS, 2016.
[14] J. Snell, K. Swersky, and R. Zemel, "Prototypical networks for few-shot learning," in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
[15] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," in ICLR, 2017.
[16] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in ICML, 2017.
[17] Y. Chen, T. Ko, L. Shang, X. Chen, X. Jiang, and Q. Li, "Meta learning for few-shot keyword spotting," arXiv preprint arXiv:1812.10233, 2018.
[18] P. Warden, "Speech commands: A dataset for limited-vocabulary speech recognition," arXiv preprint arXiv:1804.03209, 2018.
[19] J. S. Bridle, "Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition," in Neurocomputing. Springer, 1990, pp. 227–236.
[20] P. Warden, "Launching the Speech Commands dataset," 2017. [Online]. Available: https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html
[21] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.