Streaming Voice Query Recognition using Causal Convolutional Recurrent Neural Networks
Raphael Tang, Gefei Yang, Hong Wei, Yajie Mao, Ferhan Ture, Jimmy Lin
Raphael Tang∗, Gefei Yang, Hong Wei∗, Yajie Mao, Ferhan Ture, Jimmy Lin
David R. Cheriton School of Computer Science, University of Waterloo
Comcast Applied AI Research Lab, Comcast Corporation
{r33tang,jimmylin}@uwaterloo.ca, {gefei yang,hong wei,yajie mao,ferhan ture}@comcast.com
ABSTRACT
Voice-enabled commercial products are ubiquitous, typically enabled by lightweight on-device keyword spotting (KWS) and full automatic speech recognition (ASR) in the cloud. ASR systems require significant computational resources in training and for inference, not to mention copious amounts of annotated speech data. KWS systems, on the other hand, are less resource-intensive but have limited capabilities. On the Comcast Xfinity X1 entertainment platform, we explore a middle ground between ASR and KWS: We introduce a novel, resource-efficient neural network for voice query recognition that is much more accurate than state-of-the-art CNNs for KWS, yet can be easily trained and deployed with limited resources. On an evaluation dataset representing the top 200 voice queries, we achieve a low false alarm rate of 1% and a query error rate of 6%. Our model performs inference 8.24× faster than the current ASR system.
Index Terms: streaming voice query recognition, convolutional recurrent neural networks
1. INTRODUCTION
Most voice-enabled intelligent agents, such as Apple’s Siri and the Amazon Echo, are powered by a combination of two technologies: lightweight keyword spotting (KWS) to detect a few pre-defined phrases within streaming audio (e.g., “Hey Siri”) and full automatic speech recognition (ASR) to transcribe complete user utterances. In this work, we explore a middle ground: techniques for voice query recognition capable of handling a couple of hundred commands.
Why is this an interesting point in the design space? On the one hand, this task is much more challenging than the (at most) couple of dozen keywords handled by state-of-the-art KWS systems [1, 2]. Their highly constrained vocabulary limits application to wake-word and simple command recognition. Furthermore, their use is constrained to detecting whether some audio contains a phrase, not producing the exact transcriptions needed for voice query recognition. For example, if “YouTube” were the keyword, KWS systems would make no distinction between the phrases “quit YouTube” and “open YouTube”; this is obviously not sufficient, since they correspond to different commands. On the other hand, our formulation of voice query recognition was specifically designed to be far more lightweight than full ASR models, typically recurrent neural networks that comprise tens of millions of parameters, take weeks to train and fine-tune, and require enormous investment in gathering training data. Thus, full ASR typically incurs high computational costs at inference time and has a large memory footprint [3].
The context of our work is the Comcast Xfinity X1 entertainment platform, which provides a “voice remote” that accepts spoken queries from users. A user, for example, might initiate a voice query with a button push on the remote and then say “CNN” as an alternative to remembering the exact channel number or flipping through channel guides. Voice queries are a powerful feature, since modern entertainment packages typically have hundreds of channels and remote controls have become too complicated for many users to operate. On average, X1 accepts tens of millions of voice queries per day, totaling 1.7 terabytes of audio, equal to 15,000 spoken hours.
A middle ground between KWS and full ASR is particularly interesting in our application because of the Zipfian distribution of users’ queries. The 200 most popular queries cover a significant portion of monthly voice traffic and account for millions of queries per day. The key contribution of this work is a novel, resource-efficient architecture for streaming voice query recognition on the Comcast X1. We show that existing KWS models are insufficient for this task, and that our models answer queries more than eight times faster than the current full ASR system, with a low false alarm rate (FAR) of 1.0% and query error rate (QER) of 6.0%.
∗ Work done while interning at Comcast Labs in Washington, D.C.
2. RELATED WORK
The typical approach to voice query recognition is to develop a full automatic speech recognition (ASR) system [4]. Open-source toolkits like Kaldi [5] provide ASR models to researchers; however, state-of-the-art commercial systems frequently require thousands of hours of training data [6] and dozens of gigabytes for the combined acoustic and language models [3]. Furthermore, we argue that these systems are excessive for usage scenarios characterized by Zipf’s Law, such as those often encountered in voice query recognition: for example, on the X1, the top 200 queries cover a significant, disproportionate amount of our entire voice traffic. Thus, to reduce computational requirements associated with training and running a full ASR system, we propose to develop a lightweight model for handling the top-K queries only.
While our task is related to keyword spotting, KWS systems strictly detect only the occurrence of a phrase within audio, not the exact transcription, as in our task. Neural networks with both convolutional and recurrent components have been successfully used in keyword spotting [2, 7]; others use only convolutional neural networks (CNNs) [8, 1] and popular image classification models [9].
Fig. 1. Illustration of our architecture. The labels are as follows: (A) raw audio waveform (B) streaming Mel–PCEN filterbank (C) PCEN features (D) causal convolution (E) GRU layer (F) feature extraction convolution (G) max-pool across time (H) output concatenation (I) (J) DNN classifier (K) long-term context modeling (L) short-term context modeling.
3. TASK AND MODEL
Our precise task is to classify an audio clip as one of N + 1 classes, with N labels denoting N different voice queries and a single unknown label representing everything else. To improve responsiveness and hence the user experience, we impose the constraint that model inference executes in an online, streaming manner, defined as predictions that occur every 100 milliseconds and in constant time and space with respect to the total audio input length. This enables software applications to display on-the-fly transcriptions of real-time speech, which is important for user satisfaction: we immediately begin processing speech input when the user depresses the trigger button on the X1 voice remote.
First, we apply dataset augmentation to reduce generalization error in speech recognition models [10]. In our work, we randomly apply noise, band-pass filtering, and pitch shifting to each audio sample. Specifically, we add a mixture of Gaussian and salt-and-pepper noise; the latter is specifically chosen because the voice remote microphone introduces such artifacts, which we notice as “clicks” while listening to audio samples. For band-pass filtering, we suppress by a factor of 0.5 the frequencies outside a range with random endpoints [a, b], where a and b roughly correspond to frequencies drawn uniformly from [0 , . kHz and [1 . , . kHz, respectively. For pitch shifting, we apply a random shift of ±33 Hz. The augmentation procedure was verified by ear to be reasonable.
We then preprocess the dataset from raw audio waveform to forty-dimensional per-channel energy normalized (PCEN) [11] frames, with a window size of 30 milliseconds and a frame shift of 10 milliseconds. PCEN provides robustness to per-channel energy differences between near-field and far-field speech applications, where it is used to achieve the state of the art in keyword spotting [2, 11]. Conveniently, it handles streaming audio; in our application, the user’s audio is streamed in real time to our platform. As is standard in speech recognition applications, all audio is recorded in 16 kHz, 16-bit mono-channel PCM format.
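As a rough illustration of the framing arithmetic and of why PCEN suits streaming input, the sketch below counts frames for the stated 30 ms window and 10 ms shift at 16 kHz, and normalizes one frame at a time while keeping only a per-channel smoother as state. It assumes the 40 filterbank energies per frame are already computed. The class name `StreamingPCEN`, the smoother initialization, and the constants `s`, `alpha`, `delta`, `r`, and `eps` are illustrative assumptions (commonly used PCEN values), not settings reported in this paper.

```python
SAMPLE_RATE = 16_000                      # 16 kHz audio, as stated above
WINDOW = SAMPLE_RATE * 30 // 1000         # 30 ms window in samples
SHIFT = SAMPLE_RATE * 10 // 1000          # 10 ms frame shift in samples


def num_frames(n_samples):
    """Number of complete frames that fit in n_samples of audio."""
    if n_samples < WINDOW:
        return 0
    return 1 + (n_samples - WINDOW) // SHIFT


class StreamingPCEN:
    """Per-channel energy normalization applied one frame at a time."""

    def __init__(self, n_channels=40, s=0.025, alpha=0.98,
                 delta=2.0, r=0.5, eps=1e-6):
        self.n_channels = n_channels
        self.s, self.alpha, self.delta, self.r, self.eps = s, alpha, delta, r, eps
        self.smooth = None                # running per-channel energy estimate

    def step(self, energies):
        """Normalize one frame of non-negative filterbank energies."""
        if len(energies) != self.n_channels:
            raise ValueError("unexpected number of channels")
        if self.smooth is None:
            # Assumption: start the smoother at the first observed frame.
            self.smooth = list(energies)
        else:
            # First-order IIR smoother: M_t = (1 - s) * M_{t-1} + s * E_t
            self.smooth = [(1 - self.s) * m + self.s * e
                           for m, e in zip(self.smooth, energies)]
        # PCEN_t = (E_t / (eps + M_t)^alpha + delta)^r - delta^r
        return [(e / (self.eps + m) ** self.alpha + self.delta) ** self.r
                - self.delta ** self.r
                for e, m in zip(energies, self.smooth)]


# Two constant streams that differ in energy by a factor of 100.
quiet, loud = StreamingPCEN(), StreamingPCEN()
for _ in range(num_frames(SAMPLE_RATE)):  # one second of audio
    quiet_out = quiet.step([1.0] * 40)
    loud_out = loud.step([100.0] * 40)
```

The only state carried between frames is the 40-element smoother, so memory use does not grow with the length of the audio. In this toy run the two streams end up with nearly equal normalized values despite the 100-fold energy gap, which is the gain-normalizing behavior the text attributes to PCEN.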
We draw inspiration from convolutional recurrent neural networks (ConvRNN) for text modeling [12], where they have achieved the state of the art in sentence classification. However, the model cannot be applied as-is to our task, since the bi-directional components violate our streaming constraint, and it was originally designed for no more than five output labels. Thus, we begin with this model as a template only.
We illustrate our architecture in Figure 1, where the model can be best described as having three sequential components: first, it uses causal convolutions to model short-term speech context. Next, it feeds the short-term context into a gated recurrent unit (GRU) [13] layer and pools across time to model long-term context. Finally, it feeds the long-term context into a deep neural network (DNN) classifier for our N + 1 voice query labels.
Short-term context modeling.
Given 40-dimensional PCEN inputs $x_1, \ldots, x_t$, we first stack the frames to form a 2D input $x_{1:t} \in \mathbb{R}^{t \times 40}$; see Figure 1, label C, where the x-axis represents 40-dimensional features and the y-axis time. Then, to model short-term context, we use a 2D causal convolution layer (Figure 1, label D) to extract feature vectors $s_1, \ldots, s_t$ for $s_t = W \cdot x_{t-m+1:t} + b$, where $W \in \mathbb{R}^{c \times (m \times n)}$ is the convolution weight, $x_{-m+2:0}$ is silence padding in the beginning, $\cdot$ denotes valid convolution, and $s_i$ is a context vector in $\mathbb{R}^{c \times f}$. Finally, we pass the outputs into a rectified linear (ReLU) activation and then a batch normalization layer, as is standard in image classification. Since causal convolutions use a fixed number of past and current inputs only, the streaming constraint is necessarily maintained.
Long-term context modeling.
To model long-term context, we first flatten the short-term context vector per time step from $s_i \in \mathbb{R}^{c \times f}$ to $\mathbb{R}^{cf}$. Then, we feed them into a single uni-directional GRU layer (see Figure 1, label E) consisting of k hidden units, yielding hidden outputs $h_1, \ldots, h_t$, $h_i \in \mathbb{R}^{k}$. Following text modeling work [12], we then use a 1D convolution filter $W \in \mathbb{R}^{d \times k}$ with ReLU activation to extract features from the hidden outputs, where d is the number of output channels. We max-pool these features across time (see Figure 1, label G) to obtain a fixed-length context $c_{\max} \in \mathbb{R}^{d}$. Finally, we concatenate $c_{\max}$ and $h_t$ for the final context vector, $c \in \mathbb{R}^{k+d}$, as shown in Figure 1, label H.
Clearly, these operations maintain the streaming constraint, since uni-directional GRUs and max-pooling across time require the storage of only the last hidden and maximum states, respectively. We also experimentally find that the max-pooling operation helps to propagate across time the strongest activations, which may be “forgotten” if only the last hidden output from the GRU were used as the context.
DNN classifier.
Finally, we feed the context vector c into a small DNN with one hidden layer with ReLU activation and a softmax output across the N + 1 voice query labels. For inference on streaming audio, we merely execute the DNN on the final context vector at a desired interval, such as every 100 milliseconds; in our models, we choose the number of hidden units r so that the classifier is sufficiently lightweight.
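To make the constant-space claim concrete, the three components can be sketched as a frame-by-frame update in plain Python. This is a toy under stated assumptions, not the authors' implementation: the weights are random stand-ins for trained parameters, batch normalization is omitted, the GRU uses one common gate formulation with zero biases, and the class name `StreamingCRNN`, its default sizes, and `state_size` are hypothetical names chosen for illustration.

```python
import math
import random


def affine(layer, x):
    """Apply y = W x + b for a (W, b) pair stored as nested lists."""
    W, b = layer
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def rand_layer(rng, n_out, n_in):
    """Random (W, b) pair standing in for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


class StreamingCRNN:
    """Toy causal conv -> GRU -> max-pool -> DNN with constant-size state."""

    def __init__(self, c=4, m=3, n=20, k=8, d=5, r=6, n_labels=11,
                 n_feat=40, freq_stride=10, seed=0):
        rng = random.Random(seed)
        self.n, self.freq_stride = n, freq_stride
        self.f = (n_feat - n) // freq_stride + 1         # frequency bands
        self.conv = rand_layer(rng, c, m * n)            # W in R^{c x (m x n)}
        self.gru_z = rand_layer(rng, k, c * self.f + k)  # update gate
        self.gru_r = rand_layer(rng, k, c * self.f + k)  # reset gate
        self.gru_h = rand_layer(rng, k, c * self.f + k)  # candidate state
        self.feat = rand_layer(rng, d, k)                # W in R^{d x k}
        self.hidden = rand_layer(rng, r, k + d)
        self.out = rand_layer(rng, n_labels, r)
        # Persistent state: m - 1 past frames (zeros act as silence padding),
        # the GRU hidden state, and the running maximum.
        self.past = [[0.0] * n_feat for _ in range(m - 1)]
        self.h = [0.0] * k
        self.c_max = [0.0] * d   # ReLU features are >= 0, so 0 is a safe start

    def step(self, frame):
        """Consume one PCEN frame and update the persistent state."""
        window = self.past + [list(frame)]               # last m frames
        s = []                                           # flattened s_t in R^{cf}
        for row, bias in zip(*self.conv):
            for band in range(self.f):
                lo = band * self.freq_stride
                patch = [v for fr in window for v in fr[lo:lo + self.n]]
                s.append(max(0.0, sum(w * v for w, v in zip(row, patch)) + bias))
        self.past = window[1:]
        z = [sigmoid(a) for a in affine(self.gru_z, s + self.h)]
        g = [sigmoid(a) for a in affine(self.gru_r, s + self.h)]
        gated = [gi * hi for gi, hi in zip(g, self.h)]
        cand = [math.tanh(a) for a in affine(self.gru_h, s + gated)]
        self.h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, self.h, cand)]
        feats = relu(affine(self.feat, self.h))          # 1D conv features
        self.c_max = [max(a, b) for a, b in zip(self.c_max, feats)]

    def classify(self):
        """Run the DNN on c = [c_max; h_t] and return softmax probabilities."""
        logits = affine(self.out, relu(affine(self.hidden, self.c_max + self.h)))
        top = max(logits)
        exps = [math.exp(a - top) for a in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def state_size(self):
        """Number of floats kept between frames, independent of audio length."""
        return sum(len(fr) for fr in self.past) + len(self.h) + len(self.c_max)


model = StreamingCRNN()
frame_rng = random.Random(1)
outputs = []
for t in range(1, 31):                    # 30 frames = 300 ms at a 10 ms shift
    model.step([frame_rng.random() for _ in range(40)])
    if t % 10 == 0:                       # classify every 100 ms
        outputs.append(model.classify())
```

With 40 input features, a filter length of n = 20, and a frequency stride of 10, the band count is (40 - 20) / 10 + 1 = 3, and three 30 ms windows spaced 10 ms apart span 50 ms. Counting the state the same way with the sizes reported in Section 4 (two past frames of 40 values, k = 750, d = 350) and 4-byte floats gives 4 × (80 + 750 + 350) = 4,720 bytes, consistent with the 320-byte, 3 KB, and 1.4 KB figures given there.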
4. EVALUATION
On our specific task, we choose N = 200, representing the top 200 queries on the Xfinity X1 platform, altogether covering a significant portion of all voice traffic; this subset corresponds to hundreds of millions of queries to the system per month. For each positive class, we collected 1,500 examples consisting of anonymized real data. For the negative class, we collected a larger set of 670K examples not containing any of the positive keywords. Thus, our dataset contains a total of 970K examples (200 × 1,500 positive plus 670K negative). [...] ± .% (95% confidence interval) word-error rate (WER) on our dataset; this choice is reasonable because the WER of human annotations is similar [14], and our deployment approach is to short-circuit and replace the current third-party ASR system where possible.
Table 1. Model footprint and hyperparameters.
Component                          Hyperparameters
Short-term context modeling        c, m, n = 250, 3, 20
Long-term context modeling         k = 750, d = 350
DNN classifier (100ms interval)    r = 768, N + 1 = 201
Total:
For the causal convolution layer, we choose c = 250 output channels, m = 3 width in time, and n = 20 length in frequency. We then stride the entire filter across time and frequency by one and ten steps, respectively. This configuration yields a receptive field of 50 milliseconds across f = 3 different frequency bands, which roughly correspond to highs, mids, and lows. For long-term context modeling, we choose k = 750 hidden dimensions and d = 350 convolution filters. Finally, we choose the hidden layer of the classifier to have 768 units. Table 1 summarizes the footprint and hyperparameters of our architecture; we name this model crnn-750m, with the first “c” representing the causal convolution layer and the trailing “m” max pooling.
During training, we feed only the final context vector of the entire audio sample into the DNN classifier. For each sample, we obtain a single softmax output across the 201 targets for the cross entropy loss. The model is then trained using stochastic gradient descent with a momentum of 0.9, batch size of 48, L weight decay of − , and an initial learning rate of − . At epochs 9 and 13, the learning rate decreases to − and − , respectively, before training finishes for a total of 16 epochs.
Model Variants.
As a baseline, we adapt the previous state-of-the-art KWS model res8 [1] to our task by increasing the number of outputs in the final softmax to 201 classes. This model requires fixed-length audio, so we pad and trim audio input to a length that is sufficient to cover most of the audio in our dataset. We choose this length to be eight seconds, since 99.9% of queries are shorter.
To examine the effect of the causal convolution layer, we train a model without it, feeding the PCEN inputs directly to the GRU layer. We also examine the contribution of max-pooling across time by removing it: we name these variants rnn-750m and crnn-750, respectively.
Table 2. Comparison of model results. [...] res8, we report the number on eight seconds of audio, since fixed-length input is expected. Best results are bolded. Columns: Model; Val. FAR and QER; Test FAR and QER; Footprint. Rows: res8, crnn-750m, crnn-750, rnn-750m.
The model runs quickly on a commodity GPU machine with one Nvidia GTX 1080: to classify one second of streaming audio, our model takes 68 milliseconds. Clearly, the model is also much more lightweight than a full ASR system, occupying only 19 MB of disk space for the weights and 5 KB of RAM for the persistent state per audio stream. The state consists of the two previous PCEN frames for the causal convolution layer (320 bytes; all zeros for the first two padding frames), the GRU hidden state (3 KB), and the last maximum state for max-pooling across time (1.4 KB).
In our system, we define a false alarm (FA) as a negative misclassification. In other words, a model prediction is counted as an FA if it is misclassified and the prediction is one of the known 200 queries. This is reasonable, since we fall back to the third-party ASR system if the voice query is classified as unknown. We also define a query error (QE) as any misclassified example; then, false alarm rate (FAR) and query error rate (QER) correspond to the number of FAs and QEs, respectively, divided by the number of examples. Thus, the overall query accuracy rate is 1 − QER.
Initially, the best model, crnn-750m, attains an FAR and QER of . and . , respectively.
This FAR is higher than our production target of 1%; thus, we further threshold the predictions to adjust the specificity of the model. Used also in our previous work [1], a simple approach is to classify as unknown all predictions whose probability outputs fall below some global threshold α. In Table 2, we report the results corresponding to our target FAR of 1%, with the α determined from the validation set. To draw ROC curves (see Figure 2) on the test set, we sweep α from 0 to 0.9999, where QER is analogous to false reject rate (FRR) in the classic keyword spotting literature. We omit res8 due to it having a QER of 29%, which is unusable in practice.
Fig. 2. ROC curves for our models. [Plot title: ROC (test set); axis: Query Error Rate; legend: crnn-750m, rnn-750m, crnn-750.]
After thresholding, our best model with max pooling and causal convolutions (crnn-750m) achieves an FAR of 1% and QER of 6% on both the validation and test sets, as shown in Table 2, row 2. Max-pooling across time is effective, resulting in a QER improvement of 0.5% over the ablated model (crnn-750; see row 3). The causal convolution layer is effective as well, though slightly less than max-pooling is; for the same QER (6.4%) on the validation set, the model without the causal convolution layer, rnn-750m, uses 87M fewer multiplies per second than crnn-750 does (presented in row 4), due to the large decrease in the number of parameters for the GRU, which uses an input of size 40 in rnn-750m, compared to 750 in crnn-750. We have similar findings for the ROC curves (see Figure 2), where crnn-750m outperforms crnn-750 and rnn-750m, and the ablated models yield similar curves. All of these models greatly outperform res8, which was originally designed for keyword spotting.
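The thresholding rule and the FA and QE definitions from this section can be sketched in a few lines. The snippet below is a minimal illustration on made-up probabilities for a three-label toy problem; the function names, the toy labels, and the grid search in `pick_alpha` are assumptions for illustration, not the procedure used to produce Table 2.

```python
UNKNOWN = "unknown"


def threshold_prediction(probs, labels, alpha):
    """Return the arg-max label, or UNKNOWN if its probability is below alpha."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= alpha else UNKNOWN


def far_qer(predictions, truths):
    """False alarm rate and query error rate, both over all examples."""
    n = len(truths)
    pairs = list(zip(predictions, truths))
    query_errors = sum(p != t for p, t in pairs)
    # A false alarm is a misclassification whose prediction is a known query.
    false_alarms = sum(p != t and p != UNKNOWN for p, t in pairs)
    return false_alarms / n, query_errors / n


def evaluate(scored, labels, alpha):
    """FAR and QER of (probabilities, truth) pairs at threshold alpha."""
    preds = [threshold_prediction(p, labels, alpha) for p, _ in scored]
    return far_qer(preds, [t for _, t in scored])


def pick_alpha(scored, labels, target_far, grid):
    """Smallest alpha on the grid whose FAR meets the target, else None."""
    for alpha in sorted(grid):
        far, _ = evaluate(scored, labels, alpha)
        if far <= target_far:
            return alpha
    return None


LABELS = ["cnn", "netflix", UNKNOWN]
# Hypothetical softmax outputs paired with their true labels.
VALIDATION = [
    ([0.90, 0.05, 0.05], "cnn"),
    ([0.55, 0.40, 0.05], "netflix"),
    ([0.50, 0.10, 0.40], UNKNOWN),
    ([0.10, 0.20, 0.70], UNKNOWN),
    ([0.58, 0.02, 0.40], "cnn"),
]

no_threshold = evaluate(VALIDATION, LABELS, 0.0)
strict = evaluate(VALIDATION, LABELS, 0.6)
chosen = pick_alpha(VALIDATION, LABELS, target_far=0.2,
                    grid=[0.0, 0.2, 0.4, 0.52, 0.6])
```

Raising α can only turn predictions into unknown, so FAR never increases as α grows, which is why a simple upward scan suffices. QER can move in either direction: in this toy data the stricter threshold removes both false alarms but also rejects one correct low-confidence query, the trade-off that the ROC sweep over α traces out.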
5. CONCLUSION AND FUTURE WORK
We describe a novel resource-efficient model for the task of voice query recognition on streaming audio, achieving an FAR and QER of 1% and 6%, respectively, while performing more than eight times faster than the current third-party ASR system. One potential extension to this paper is to explore the application of neural network compression techniques, such as intrinsic sparse structures [15] and binary quantization [16], which could further decrease the footprint of our model.
6. REFERENCES
[1] Raphael Tang and Jimmy Lin, “Deep residual learning for small-footprint keyword spotting,” in Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] Sercan O Arik, Markus Kliegl, Rewon Child, Joel Hestness, Andrew Gibiansky, Chris Fougner, Ryan Prenger, and Adam Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” arXiv:1703.05390, 2017.
[3] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Katya Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” arXiv:1712.01769, 2017.
[4] William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” 2016.
[5] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, “The Kaldi speech recognition toolkit,” 2011.
[6] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, “The Microsoft 2016 conversational speech recognition system,” in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[7] Dong Wang, Shaohe Lv, Xiaodong Wang, and Xinye Lin, “Gated convolutional LSTM for speech commands recognition,” in ICCS 2018, 2018.
[8] Tara N Sainath and Carolina Parada, “Convolutional neural networks for small-footprint keyword spotting,” in INTERSPEECH-2015, 2015.
[9] Brian McMahan and Delip Rao, “Listening to the world improves speech command recognition,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[10] Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, “Audio augmentation for speech recognition,” in INTERSPEECH-2015, 2015.
[11] Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F Lyon, and Rif A Saurous, “Trainable frontend for robust and far-field keyword spotting,” in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[12] Chenglong Wang, Feijun Jiang, and Hongxia Yang, “A hybrid framework for text modeling with convolutional RNN,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017.
[13] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[14] Andreas Stolcke and Jasha Droppo, “Comparing human and machine errors in conversational speech transcription,” arXiv:1708.08615, 2017.
[15] Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, and Hai Li, “Learning intrinsic sparse structures within long short-term memory,” in International Conference on Learning Representations, 2018.
[16] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv:1602.02830, 2016.