Accurate Detection of Wake Word Start and End Using a CNN

Christin Jose*, Yuriy Mishchenko*, Thibaud Senechal, Anish Shah, Alex Escott, Shiv Vitaladevuni
Amazon Alexa Research
{chrjse,yuriym,thibauds,anishsh,escottal,shivnaga}@amazon.com

Abstract
Small footprint embedded devices require keyword spotters (KWS) with small model size and low detection latency for enabling voice assistants. Such a keyword is often referred to as a wake word, as it is used to wake up voice assistant enabled devices. Together with wake word detection, accurate estimation of wake word endpoints (start and end) is an important task of KWS. In this paper, we propose two new methods for detecting the endpoints of wake words in neural KWS that use single-stage word-level neural networks. Our results show that the new techniques give superior accuracy for detecting wake words' endpoints, with standard error as low as 50 msec versus human annotations, on par with the conventional Acoustic Model plus HMM forced alignment. To our knowledge, this is the first study of wake word endpoint detection methods for single-stage neural KWS.
Index Terms: keyword spotting, multi-label training, speech recognition, wake word detection, deep neural network, convolutional neural network, keyword endpoints, keyword start
1. Introduction
Keyword spotting is the task of detecting keywords of interest in a continuous audio stream. It has recently been an active research area in speech recognition and its applications. With the recent increase in the popularity of voice assistants such as Alexa, Hey Google, and Siri, KWS have attracted much attention in the context of on-device wake word (WW) spotting. Accurate detection of WW endpoints in an audio stream is an important feature of KWS in WW applications. Voice assistant enabled devices only start streaming audio to the cloud when the keyword spotter detects a WW, and streaming must start from the WW start point.

Conventional WW detection methods use 2-stage models comprising a first-stage Deep Neural Network acoustic model (AM DNN) and a Hidden Markov Model (HMM) [1, 2, 3, 4, 5]. Such a keyword spotter may also have additional classifiers after the HMM, such as an SVM, to increase the accuracy of WW detection [6]. These KWS naturally provide the endpoints of the WW via the HMM output. Specifically, during runtime, these systems perform Viterbi decoding of the WW senone sequences, which produces the times of the WW's start and end senones in the input audio. However, this procedure can be computationally expensive, depending on the HMM topology, and training of such a keyword spotter is very complex.

A recent work investigated KWS based on a single-stage feed-forward DNN [7]. That DNN is trained to predict sub-keyword targets and has been shown to outperform a Keyword/Filler HMM approach. Such DNNs are also attractive for running on hardware-limited devices since the size of the model can be easily controlled to fit the devices' CPU and memory budget by changing the number of parameters in the DNN.

* Equal contribution
Convolutional Neural Networks (CNNs) have also become popular for acoustic modeling and have shown improvements over fully connected feed-forward DNNs as KWS [8].

An important functionality of WW KWS is the ability to determine the endpoints of a WW in the audio. WW endpoints are used to decide which audio will be sent to the cloud, which helps to protect user privacy and reduces cloud-side processing costs. The recent neural keyword spotting methods discussed above aim at improving WW detection accuracy. However, the detection of WW endpoints becomes a greater problem with those methods. For example, compared with the conventional AM+HMM approach, no output such as the HMM senone states is available in neural KWS. The Deep keyword spotter [7] can estimate the endpoints of the keyword in audio based on the rise of the model's posterior corresponding to the WW sub-word labeling. However, this may not detect the endpoints of the WW very accurately.

In this paper, we consider WW spotter models designed as a single-stage feed-forward CNN operating on an audio context of up to a second, or word-level KWS. We introduce two methods for detecting keyword endpoints in such word-level KWS. The first method uses a second regression model trained on intermediate representations of the keyword spotting CNN in order to predict the keyword endpoints inside the input window. The second method uses a novel approach of a multi-aligned CNN model trained to detect the keyword in different alignments inside the input window, such as towards the start or the end of the input window. To our knowledge, these are the first methods in the literature for keyword endpointing in word-level KWS. Likewise, the approach of multi-aligned keyword modeling is novel and may be of interest to other applications. The described methods improve the standard error for keyword endpoints by up to 60% compared to a constant offset algorithm.
Our methods have a standard error equivalent to that of the gold-standard AM+HMM model, while allowing significantly simpler model training and inference as well as better keyword spotting accuracy.
2. Word-level Keyword Spotting model
The word-level (WL) keyword spotter considered here is a CNN WW detector similar to the Deep KWS [7] and the CNN KWS [8]. However, differently from those, here the input window encompasses the entire audio context of one WW; the input window can range up to 1 sec. Specifically, the model is trained using a set of positive examples (i.e., audio fragments containing the WW) and negative examples (i.e., audio fragments not containing the WW). In the WW-positive examples, the WW is consistently aligned in the long CNN input window, such as centered. The model learns to produce an output representing the posterior probability of finding the WW in a given input audio fragment.

Figure 1: Word-level CNN architecture.

Our CNN operates on 64-dimensional Log mel Filter-Bank
Energy (LFBE) features calculated over the standard 25 msec frames with a 10 msec shift. The CNN architecture is five convolutional layers plus three fully connected (FC) layers, with max-pooling after the first layer and 3-stride convolution in the second layer. See Fig. 1 for specific architecture details. Dropout and batch normalization are used with all hidden layers. The output layer is a softmax over two outputs comprising "WW" and "non-WW". The model is trained using cross-entropy loss on data prepared as a mixture of WW-positive and WW-negative word-level examples, as described above, with labels in one-hot encoding. The described CNN architecture achieves superior baseline accuracy for neural WW spotting.
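As a sanity check of the feature framing arithmetic above (a sketch; the 1 sec figure is the stated upper bound on the CNN input window):

```python
def num_lfbe_frames(window_ms, frame_ms=25, shift_ms=10):
    """Count 25 msec analysis frames at a 10 msec shift that fit in a window."""
    if window_ms < frame_ms:
        return 0
    return 1 + (window_ms - frame_ms) // shift_ms

# A 1 sec input window yields a 98 x 64 LFBE feature map for the CNN.
frames = num_lfbe_frames(1000)  # -> 98
```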
3. Baseline Methods for WW endpoints detection
Choosing baselines for the WW endpoints detection algorithm in the context of WL KWS is difficult, as no methods for WW endpointing exist for that setting. Unlike "frame-level" models such as AM+HMM (i.e., KWS that work on sub-word input), neural WL KWS by design does not offer natural means for detecting the WW endpoints in the audio, because such models are trained to detect varying-length WWs in the input window uniformly. That is, one cannot know from a detection event alone where in the input window the WW starts and ends, without additional information about at least the WW length. We consider two baselines in this work: the AM+HMM KWS as the industry's gold standard for keyword endpointing, and a constant offset method, which is the most straightforward algorithm for WW endpointing in WL KWS.
The 2-stage AM+HMM KWS [1, 2, 3, 4, 5] is the de-facto gold standard for keyword endpointing in the industry. In that algorithm, posteriors are produced by an Acoustic Model DNN for a set of senones, based on an input audio stream, and an HMM is tuned to force-align the sequence of senones expected in the keyword to those detected. This is the classical approach used in ASR. The keyword endpoints are naturally produced in that algorithm as the times of the first and the last senones in the HMM state sequence corresponding to a keyword detection [9]. In this paper, we use the 2-stage WW model from [6] as such an endpointing baseline.
The constant offset method is the simplest algorithm for detecting WW endpoints in WL KWS. In this approach, we start from the observation that the WW has a relatively small variation in length, such as due to pronunciation by different speakers or differences in speed of speech. For example, for "Alexa" the 10-90th percentile variability in keyword length across speakers is 500 to 900 msec. In that setting, we may estimate the WW start and end points by using a suitably chosen constant offset from the time of the WW detection event in WL KWS, given a known typical WW duration and the expected alignment in the keyword spotter input window. For different WWs, an optimal offset can be chosen based on a measured mean or median duration of those WWs. While simple in principle, the accuracy of this method may not be satisfactory for WWs whose length varies greatly from the mean or median value.
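A minimal sketch of this baseline, assuming a center-aligned spotter whose detection event coincides with the WW center, and an illustrative 700 msec median duration (the text reports only the 500-900 msec percentile range; both the alignment and the median here are assumptions):

```python
def constant_offset_endpoints(detection_time_s, median_ww_s=0.7):
    """Estimate WW start/end by a fixed offset from the detection event,
    assuming the event marks the WW center (an illustrative assumption)."""
    half = median_ww_s / 2.0
    return detection_time_s - half, detection_time_s + half

# A detection firing at t = 3.0 s maps to a fixed-length estimate.
start, end = constant_offset_endpoints(3.0)  # -> (2.65, 3.35)
```

Because the returned interval always has the same length, the method's error grows with how far the actual WW duration deviates from the assumed median, which is exactly the weakness noted above.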
4. WW endpoints detection in WL KWS
4.1. Start-end regression model

In this method, we add a second regression model that runs in parallel with the main detector WL CNN. The regression model uses intermediate feature representations from the hidden layers of the WL CNN as its input and is trained to output the relative offsets of the WW endpoints in the input window against ground truth, Fig. 2. The ground truth is prepared via pseudo-labels produced by the AM+HMM KWS [6]. The model is trained using the mean square error loss.
Figure 2: Endpointing in WL KWS using a second start-end regression model.
More specifically, we first train the main WW detector CNN using data prepared as a mixture of WW-positive and WW-negative WL examples, as described in Section 2. We then add the second branch of the start-end regression model. The key idea is that the intermediate representations from the WL CNN can allow predicting the positions of the WW endpoints. We experimented with training the main detector CNN and the start-end regression model simultaneously, in a multi-task manner, versus freezing the main CNN weights and training the start-end model separately. We found that freezing the main CNN was best for WW detection accuracy.

The start-end regression model comprises one convolution layer of dimensions (5, 5, 200) and one fully connected layer with two outputs. We experimented with which of the detector CNN's hidden layers serves as the input for the start-end regression model and found convolution layer 4 to produce the best results. The two outputs encode the start and end offsets of the WW inside the input window, measured in units relative to the input window length. That is, output 0 corresponds to the beginning of the input window, and output 1 corresponds to the end of the input window. The WW can be longer than the input window; in that case, the start can be negative, and the end can be greater than 1.
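The mapping from the regression head's relative outputs back to stream time can be sketched as follows (the 1 sec window length is the upper bound stated in Section 2; window_start_s stands for whatever stream offset the current input window begins at):

```python
def offsets_to_times(rel_start, rel_end, window_start_s, window_len_s=1.0):
    """Convert relative endpoint offsets (0 = window start, 1 = window end;
    values outside [0, 1] are allowed when the WW overflows the window)
    into absolute times in the audio stream."""
    return (window_start_s + rel_start * window_len_s,
            window_start_s + rel_end * window_len_s)

# A WW starting 50 msec before the window and ending 0.9 into it:
start, end = offsets_to_times(-0.05, 0.9, window_start_s=10.0)  # (9.95, 10.9)
```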
4.2. Multi-aligned output model

In this method, we train the WL KWS with additional outputs that detect different alignments of the WW inside the input window, see Fig. 3 (i). That is, one output of the CNN may be detecting a WW centrally positioned inside the input window, while another output may be detecting a WW positioned to start at a given frame in the input window, and yet another may be detecting a WW positioned to end at a given frame of the input window. The times of the peaks of each output's posteriors then allow us to detect the WW center, start, and end time points, see Fig. 3 (ii).
Figure 3: Endpointing in WL KWS using the multi-aligned output model. (i) Training. (ii) Inference.
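The inference step of Fig. 3 (ii) can be sketched as simple peak picking over the per-output posterior traces. We assume here that each trace is indexed by the time of the input window's trailing edge at a 10 msec frame shift, that the post-center output peaks when the WW start sits at the window middle, and that the end-aligned output peaks when the WW end sits at the window end (these indexing conventions are our reading of the alignments, not details spelled out in the text):

```python
def endpoints_from_posteriors(post_center_trace, end_aligned_trace,
                              frame_shift_s=0.01, window_len_s=1.0):
    """Peak-pick multi-aligned outputs to recover WW start and end times.
    Traces are indexed by the input window's trailing-edge time."""
    t_pc = max(range(len(post_center_trace)),
               key=post_center_trace.__getitem__) * frame_shift_s
    t_end = max(range(len(end_aligned_trace)),
                key=end_aligned_trace.__getitem__) * frame_shift_s
    ww_start = t_pc - window_len_s / 2.0  # WW start was at mid-window
    ww_end = t_end                        # WW end was at the window end
    return ww_start, ww_end

# Toy traces peaking at frames 80 and 100 (0.8 s and 1.0 s):
pc = [0.0] * 120; pc[80] = 0.95
ea = [0.0] * 120; ea[100] = 0.9
start, end = endpoints_from_posteriors(pc, ea)
```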
More specifically, we add two outputs to the main detector CNN output in the softmax layer, Fig. 4. The outputs are for detecting the start and end alignments of the WW inside the input window. This is in addition to the main detector output, which is centrally aligned. We found such an output configuration to perform best for WW detection in our experiments. It is also possible to use the information from all outputs together to generate WW detection events; however, we did not experiment with this option specifically.
Figure 4: Architecture of the multi-aligned output WL KWS model.
To train the model, we prepare WL training examples with the WW differently aligned in the input window. An important point for start-aligned examples is that we align the WW start with the middle of the input window, the "post-center-aligned" output in Fig. 3. Post-center alignment is introduced instead of a more straightforward WW alignment with the start of the input in order to reduce the latency of WW start detection. Because WL KWS has a long input window (e.g., 1 sec), the start-aligned posterior may peak significantly after the WW end for WWs that are shorter than the WL input window. We avoid this complication with post-center alignment; in this case, the WW start-aligned output becomes available before the other WW outputs. For the end-aligned WW output, we prepare WL examples such that the WW end is aligned with the end of the input window up to a small margin, Fig. 3.

We train the model by mixing differently aligned WL examples in the training minibatch. Specifically, we used minibatches in proportions 25%:12.5%:12.5%:50% with respect to the center-aligned, start-aligned, end-aligned, and negative WW examples, respectively, which we found to work best in our experiments. A larger weight is given to the centrally aligned WW in the minibatch to ensure better WW detection performance. We generate the minibatch in the described manner dynamically, during training. That is, WW examples are first prepared with a context of about 2 sec and central alignment of the WW inside the examples. That allows selecting post-center and end-aligned WW examples later during training. During training, the WW alignment is randomly chosen during the formation of the minibatch, according to the biased die above. Random jitter is applied to the training examples by shifting the WW position slightly, to improve generalization. The examples are labeled in a one-hot manner according to their WW alignment and the no-WW label.
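The biased-die minibatch composition described above can be sketched as follows (only the alignment sampling is shown; jitter and the actual example slicing are omitted):

```python
import random

# Proportions from the text: 25% center, 12.5% start (post-center),
# 12.5% end aligned, and 50% negative examples.
ALIGNMENTS = ["center", "start", "end", "negative"]
WEIGHTS = [0.25, 0.125, 0.125, 0.50]

def sample_minibatch_alignments(batch_size, seed=0):
    """Draw a per-example alignment label for one minibatch."""
    rng = random.Random(seed)
    return rng.choices(ALIGNMENTS, weights=WEIGHTS, k=batch_size)

# One 4k minibatch (the batch size used in Section 5).
labels = sample_minibatch_alignments(4000)
```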
5. Experiments and results
All models were trained for the keyword "Alexa", which is the WW for the Echo family of devices at Amazon. We used 12M WW-positive and 5M WW-negative examples prepared from a corpus of annotated audio representing the far-field speech normally observed by Alexa devices in the en-US locale. For training either the WW start-end regression model or the WW multi-aligned output model, WW endpoint labels are necessary. Manually annotating WW endpoints in raw audio is extremely difficult and laborious. For that reason, we used WW endpoints generated by an AM+HMM KWS [6] as pseudo-ground truth for both training and evaluation. Larger fragments of audio of 2 sec containing the centered WW were prepared based on the AM+HMM KWS endpoints, as described. The negative examples comprised the audio of AM+HMM KWS detections on data with negative annotations, as well as central fragments of audio with negative annotations and without AM+HMM detections. The WL CNN (Section 2) and the multi-aligned CNN were trained using cross-entropy loss and the Adam optimizer in Tensorflow on that data for 2M steps using random initialization, a mini-batch size of 4k, and a learning rate of 0.001. The start-end regression model was trained for 50k steps using the frozen WL CNN model.

We tested the WW endpoints detection accuracy on 33k held-out streams. For that evaluation, we compare WW endpoint detections with the pseudo-ground truth produced by the AM+HMM KWS, referred to as the "online dataset" below. Additionally, we evaluated WW start point accuracy versus human annotation on a smaller dataset. This dataset consists of 1100 "Alexa" invocations in 5 different noise conditions, including household noises, external music, pink noise, and pink noise plus music. This dataset is referred to as the "human-annotated dataset" below. We also evaluated the WW detection accuracy with respect to possible degradation from the addition of the WW endpoints detection models. These results are presented in Table 1.
The notation used in that table is as follows. The baseline cnn_const is the constant offset baseline described in Section 3. cnn_align is the multi-aligned outputs model. cnn_regression_multi_task is the WW start-end regression model where the main CNN detector and the start-end model were trained simultaneously. cnn_regression_thres_crossing and cnn_regression_local_max are the versions where the main CNN detector's weights were frozen for the training of the start-end regression model; they differ in the way the output of the start-end regression model is used during inference on streaming audio. Specifically, local_max uses the start-end outputs at the time where the raw WW posterior achieves its local maximum for calculating the WW endpoints, while thres_crossing uses the first point of crossing of the detection threshold by the smoothed posteriors, for reduced detection latency.

Table 1: WW start and end detection accuracy for the different WW start-end detection models in this work.

                                  All streams                 Long WW streams (> 800 ms)
Detector name                     Start Std Err  End Std Err  Start Std Err  End Std Err  WW detection (FRR at fixed FAR)
cnn_const                         46 ms          107.2 ms     55.7 ms        116.7 ms     9.3%
cnn_regression_multi_task         17 ms          100.1 ms     18 ms          75.6 ms      10.23%
cnn_regression_thres_crossing     16.4 ms        95.3 ms      -              -            -
cnn_regression_local_max          -              55 ms        20.1 ms        113.3 ms     9.3%
cnn_align                         22.4 ms        -            -              49 ms        -

The metric used to quantify the accuracy of WW endpoints detection is the standard deviation of WW start or end errors (STD). This is calculated with respect to the AM+HMM endpoints in the "online dataset" and human annotations in the "human-annotated dataset". We present the WW endpoints accuracy split by WW length, because WW endpoints detection accuracy may be different for very long WWs. The metrics for the long WW category are calculated using "Alexa" examples with WW duration exceeding 800 msec (1.3k streams out of 33k in the "online dataset"). The WW duration of 800 msec is used here as the longest ten percentile for "Alexa".

We now examine the results in Table 1. For WW start, the multi-aligned outputs CNN makes a 60% improvement vs. the cnn_const baseline. With the cnn_regression_local_max model, we obtain a 65% improvement vs. cnn_const. In all models, the WW start accuracy with respect to the AM+HMM KWS on the "online dataset" is about 20 msec. For WW end, the multi-aligned output CNN makes a 65% improvement vs. the cnn_const baseline and close to a 55% improvement on long WWs. For the cnn_regression_local_max model, we obtain a 49% improvement vs. cnn_const. In absolute terms, the multi-aligned output CNN provides WW end detection accuracy close to 40 msec, and the start-end regression model 55-110 msec, depending on the prescription for which time point to use for the streaming start-end model outputs.

In summary, both proposed methods show superior accuracy for detecting the WW start in WL KWS. However, for WW end detection, the multi-aligned output CNN performs significantly better.
One reason for that may be the uncertainty related to using the WW start-end regression model's outputs in streaming audio. In streaming audio, the WW start-end regression model produces outputs continuously. Using different time points for collecting the start-end outputs (a smoothed posteriors' peak, the raw posteriors' peak, or the first threshold crossing) leads to different WW endpoint prediction accuracy. We find that using the start-end model's outputs when the raw WW posterior reaches its local maximum provided the best accuracy for WW start and end detections. However, this prescription leads to the worst performance on long WWs; see cnn_regression_local_max in Table 1.

We now discuss the impact of adding WW endpoint models on the accuracy of the main WW detection, the last column in Table 1. The metric used here is the False Rejection Rate (FRR) at a fixed False Acceptance Rate (FAR). In the cnn_regression_thres_crossing and cnn_regression_local_max detectors, since the weights of the detector CNN are frozen, the accuracy of WW detection is guaranteed to remain the same. However, if the WW start-end model is trained simultaneously with the main detector CNN (cnn_regression_multi_task), we observe that WW detection accuracy can degrade by as much as a relative 10%. The multi-aligned output method, cnn_align, achieves similar WW detection accuracy as the standard CNN WL KWS.

Finally, we evaluated the WW start detection accuracy using a smaller human-annotated dataset. Because "start" annotation is laborious, this dataset is small. However, we can see in Table 2 that the accuracy of WW start detection for all our WL keyword spotters is, in fact, identical to that of the gold-standard AM+HMM KWS on human-annotated data. This accuracy is about 50 msec STD. This accuracy is comparable with, albeit worse than, the human annotators' accuracy, which we estimated to be about 30 msec.
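The FRR-at-fixed-FAR operating point can be sketched as a threshold sweep over detection scores (a generic sketch of the metric; the text does not specify its thresholding procedure, so the sweep below is an assumption):

```python
def frr_at_fixed_far(pos_scores, neg_scores, target_far):
    """Return (FRR, threshold) at the lowest detection threshold whose
    false-acceptance rate does not exceed target_far (detect iff score >= t)."""
    for t in sorted(set(pos_scores) | set(neg_scores)):
        far = sum(s >= t for s in neg_scores) / len(neg_scores)
        if far <= target_far:
            frr = sum(s < t for s in pos_scores) / len(pos_scores)
            return frr, t
    return 1.0, None  # no threshold meets the FAR target

# Toy scores: at threshold 0.6 one of four negatives still fires (FAR 25%)
# and one of four positives is rejected (FRR 25%).
frr, thr = frr_at_fixed_far([0.9, 0.8, 0.7, 0.2], [0.05, 0.1, 0.3, 0.6], 0.25)
```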
Table 2: WW start detection accuracy in WL KWS and the 2-stage AM+HMM model using a smaller human-annotated dataset.

Detector name           Start Std Error
AM+HMM                  52 ms
cnn align/regression    52 ms
2nd human annotation    31.2 ms
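The endpoint-error STD reported in Tables 1 and 2 can be sketched as follows. Note that a standard deviation of errors, read literally, discounts any constant bias between predicted and reference endpoints; that reading is ours, not a detail the text spells out:

```python
import math

def endpoint_std_error(predicted_s, reference_s):
    """Standard deviation of (predicted - reference) endpoint errors."""
    errs = [p - r for p, r in zip(predicted_s, reference_s)]
    mean = sum(errs) / len(errs)
    return math.sqrt(sum((e - mean) ** 2 for e in errs) / len(errs))

# Predictions that are consistently 20 msec late have (near-)zero STD,
# since the constant bias is absorbed by the mean.
std = endpoint_std_error([1.02, 2.02, 3.02], [1.0, 2.0, 3.0])
```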
6. Conclusions and discussions
We present two new methods for WW endpoint detection in WL keyword spotters, namely the second WW start-end regression model and the multi-aligned output modeling method. For WW start point detection, the accuracy for the "Alexa" WW is found to be within 20 msec of the gold-standard AM+HMM keyword spotter and within 50 msec of human annotation, the latter being identical for the new methods and the industry gold-standard AM+HMM keyword spotter. For WW end point detection, the accuracy for the "Alexa" WW is found to be within 40 msec for the multi-aligned output CNN and within 60 msec for the WW start-end regression model with respect to the gold-standard AM+HMM keyword spotter.

The multi-aligned output method provides overall superior accuracy for WW endpoints with guaranteed KWS accuracy. It allows using a single CNN both for WW detection and WW start-end detection, thus simplifying model training and inference. On the other hand, the WW start-end regression model can guarantee that the WW detection accuracy remains constant when the WW endpoints model is added, because it can be trained with the main detector KWS weights frozen. In the multi-aligned output KWS, because the same feature representations are used for all WW alignments and WW detection, WW detection cannot in principle be decoupled from endpoint detection, although we did not find that to create issues in practice.

Multi-aligned output models allow for early or low-latency detection of the WW, i.e., by using the post-center aligned output. In particular, this output can produce a lower-accuracy WW detection by seeing a partial WW aligned post-center in the input window. Otherwise, the WL KWS incurs a constant latency of WW detection with respect, e.g., to the WW center point. This feature is of great interest in cascade KWS or VAD settings.
A unique advantage of the multi-aligned outputs approach is also that it can be straightforwardly generalized to other applications, including ASR endpointing and WW detection using recurrent deep learning models such as LSTMs. In this work, we considered a WW composed of a single keyword, "Alexa". It is straightforward to extend both presented methods to phrase WWs such as "Ok Google".

7. References

[1] S. Panchapagesan, M. Sun, A. Khare, S. Matsoukas, A. Mandal, B. Hoffmeister, and S. Vitaladevuni, "Multi-task learning and weighted cross-entropy for DNN-based keyword spotting," in INTERSPEECH, 2016.
[2] M. Sun, D. Snyder, Y. Gao, V. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni, "Compressed time delay neural network for small-footprint keyword spotting," in INTERSPEECH, 2017, pp. 3607-3611.
[3] M. Sun, A. Schwarz, M. Wu, N. Strom, S. Matsoukas, and S. Vitaladevuni, "An empirical study of cross-lingual transfer learning techniques for small-footprint keyword spotting," 2017, pp. 255-260.
[4] K. Kumatani, S. Panchapagesan, M. Wu, M. Kim, N. Strom, G. Tiwari, and A. Mandal, "Direct modeling of raw audio with DNNs for wake word detection," 2017, pp. 252-257.
[5] J. Guo, K. Kumatani, M. Sun, M. Wu, A. Raju, N. Strom, and A. Mandal, "Time-delayed bottleneck highway networks using a DFT feature for keyword spotting," 2018.
[6] M. Wu, S. Panchapagesan, M. Sun, J. Gu, R. Thomas, S. N. Prasad Vitaladevuni, B. Hoffmeister, and A. Mandal, "Monophone-based background modeling for two-stage on-device wake word detection," in ICASSP, 2018, pp. 5494-5498.
[7] G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in ICASSP, 2014, pp. 4087-4091.
[8] T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting," in INTERSPEECH, 2015.