Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering
Saurabh Adya, Vineet Garg, Siddharth Sigtia, Pramod Simha, Chandra Dhir
Apple
{sadya,vineetgarg,sidsigtia,psimha,cdhir}@apple.com
Abstract
We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass that are used to re-score candidate segments obtained from the first pass. Our baseline is an acoustic model (AM), with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention networks yield better accuracy while requiring fewer parameters. We add an autoregressive decoder network on top of the self-attention layers and jointly minimize the CTC loss on the encoder and the cross-entropy loss on the decoder. This design yields further improvements over the baseline. We retrain all the models above in a multi-task learning (MTL) setting, where one branch of a shared network is trained as an AM, while the second branch classifies the whole sequence as a true trigger or not. Results demonstrate that networks with self-attention layers yield ∼60% relative reduction in false reject rates for a given false alarm rate, while requiring 10% fewer parameters. When trained in the MTL setup, self-attention networks yield further accuracy improvements. On-device measurements show a 70% relative reduction in inference time. Additionally, the proposed network architectures are ∼5X faster to train.
Index Terms: Keyword Spotting, Speech Recognition, Acoustic Modeling, Neural Networks, Deep Learning
1. Introduction
There are a growing number of devices with speech as the primary means of user input, e.g. smart speakers, headphones and watches. As a result, voice trigger detection systems have become an important component of the user interaction pipeline as they signal the start of an interaction between the user and a device. Since these systems are deployed entirely on-device, there are several considerations like privacy, latency, accuracy and battery/power consumption that inform their design. We employ a two-stage architecture for the trigger detectors [1, 2, 3], where a low-power first-pass detector receives streaming input from the microphone and is always running [4]. If a detection is made at this stage, larger, more complex models are used to re-score the candidate acoustic segments from the first pass [5]. This design offers a balance between power/battery consumption, which is determined by the first pass, and overall accuracy, which is determined by the larger models in the second pass.

This paper aims to improve the architecture of the second-pass detectors in order to make better use of the available on-device hardware. Recent approaches to this problem have explored a number of neural network architectures like DNNs [6, 7], CNNs [8, 9, 10] and RNNs [11, 12, 13, 5]. Here we experiment with using stacks of self-attention layers [14, 15, 16] to replace bidirectional LSTM (BiLSTM) layers. This design is motivated by two observations. Firstly, the second-pass models receive the entire input audio for re-scoring at once and do not need to be run in a streaming setting. Previously [4] we took advantage of this fact by using BiLSTM layers to read the input from both directions. However, this arrangement requires sequential computations at every layer in the network, which can be slow. Self-attention layers, on the other hand, process the entire input sequence with feed-forward matrix multiplications (c.f. Section 2). Secondly, we can improve training and inference times significantly, because the feed-forward computations in the self-attention layers, with their large matrix multiplication operations, can be easily parallelized on the available hardware.

In previous work [5], we argued that there are two natural ways to design a second-pass voice trigger detector. The first method is to train a monophone AM and use this model to compute the probability of the phone sequence in a trigger phrase given an acoustic segment from the first pass. The second method is to directly train a binary classifier to discriminate between true examples of the trigger phrase and false examples, including easily confusable/phonetically similar examples. The first method has the advantage that we can use the large transcribed training sets collected for the main speech recognizer for a given language, but suffers from the fact that phonetically confusable utterances are assigned similar scores. The second method has the advantage that we train using exactly the correct objective function for the task at hand. However, collecting large training sets for this discriminative task in a privacy-preserving way is extremely challenging. We proposed to combine the useful properties of both approaches using multi-task learning (MTL) and observed significant improvements in accuracy. In the present work, we build on these ideas by replacing the stack of BiLSTM layers with stacks of self-attention layers in order to better utilize the on-device hardware during inference.
Our results show that the self-attention networks yield similar accuracies to the models in [5], without requiring the additional discriminative training data and with 10% fewer model parameters. This result is a significant improvement, as it allows us to train more accurate models for languages where we only have access to general AM training data but do not have a dataset of true triggers and false alarms. We also show that adding discriminative training data to these networks yields further improvements, significantly improving over the baselines presented in [5]. Finally, we measure the inference speed of the proposed models on some recent hardware devices and find that inference time with the proposed networks can be reduced by up to 70%.
2. Model Architectures
In this section we present details of the baseline architecture and the proposed modifications. To recap, our motivation is two-fold: to improve the accuracy of the models and to make better use of the available on-device hardware. We start by replacing the BiLSTM layers in the baseline with self-attention layers. We find that this modification yields better accuracies on two evaluation sets while requiring fewer parameters. Next, we add an auto-regressive decoder as an additional/auxiliary loss (Figure 1). We find that jointly minimizing the connectionist temporal classification (CTC) loss and the cross-entropy loss yields further improvements compared to minimizing only the CTC loss. Note that during inference, we only use the encoder part of the network (Figure 1), to avoid sequential computations in the auto-regressive decoder. Therefore the transformer decoder can be seen as regularizing the CTC loss. Alternatively, this setup can be viewed as an instance of multi-task learning where we jointly minimize two different losses.

2.1. Baseline

We use the same baseline BiLSTM architecture as in [5]. We compute 40-dimensional mel-filterbank features from the audio at 100 frames per second (FPS). At every time-step we splice 7 frames together to form a 280-dimensional input window, and we subsample the sequence of windows by a factor of 3. The inputs are presented to a stack of 4 BiLSTM layers with 256 units each, resulting in 5.4 million trainable weights. The output layer comprises an affine transformation followed by the softmax non-linearity, resulting in 54 outputs which span the set of context-independent phones (monophones) plus sentence and word boundaries. The network is trained by minimizing the CTC loss given a large training set containing pairs of speech utterances and their corresponding text transcriptions.
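To make the baseline concrete, the following PyTorch sketch shows one way to implement the splicing/subsampling front end and the BiLSTM/CTC model described above. The hyper-parameters follow the text; the padding scheme, module layout and blank index are our assumptions, not the original implementation.

```python
# Minimal PyTorch sketch of the baseline front end and BiLSTM/CTC acoustic
# model. Hyper-parameters follow the text (7-frame splicing, 3x subsampling,
# 4 BiLSTM layers of 256 units, 54 outputs); padding scheme, module layout
# and blank index are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def splice_and_subsample(feats, context=3, stride=3):
    """feats: (T, 40) mel-filterbank frames at 100 FPS. Splices each frame
    with +/-3 neighbours into a 280-dim window, then keeps every 3rd one."""
    padded = F.pad(feats.T, (context, context), mode="replicate").T  # (T+6, 40)
    windows = padded.unfold(0, 2 * context + 1, 1)                   # (T, 40, 7)
    windows = windows.transpose(1, 2).reshape(feats.shape[0], -1)    # (T, 280)
    return windows[::stride]

class BiLSTMAcousticModel(nn.Module):
    def __init__(self, in_dim=280, hidden=256, layers=4, n_out=54):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_out)  # monophones + boundaries

    def forward(self, x):                        # x: (B, T, 280)
        h, _ = self.lstm(x)
        return self.out(h).log_softmax(dim=-1)   # CTC expects log-probs

model = BiLSTMAcousticModel()
feats = splice_and_subsample(torch.randn(300, 40)).unsqueeze(0)  # (1, 100, 280)
log_probs = model(feats).transpose(0, 1)                         # (T, B, C)
targets = torch.randint(1, 54, (1, 20))                          # phone labels
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.tensor([log_probs.shape[0]]),   # input lengths
                           torch.tensor([20]))                   # target lengths
loss.backward()
```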
2.2. Self-Attention Encoder

Next, we replace the BiLSTM layers with a stack of self-attention layers [16]. We process the inputs as before, but we add a 280-dimensional fixed positional encoding to each input frame. We use exactly the same positional encoding scheme of alternating sine and cosine waves with varying wavelengths as proposed in [16]. We use a stack of 6 self-attention layers; the computation performed by each layer is depicted in Figure 1. We use 4 heads for each of the self-attention transforms, with each head yielding 64-dimensional key, query and value vectors. The head outputs are concatenated to form a 256-dimensional vector. We use a hidden size of 1024 dimensions for the feed-forward layer. We also use skip connections for both the self-attention and the feed-forward layers, and we apply Layer Normalization [17] to the outputs of both transforms. The resulting network contains 4.8 million trainable weights, which is 10% smaller than the BiLSTM baseline. We train the network by minimizing the CTC loss as before, using the same training data. We refer to this configuration as the Self Attention Encoder in the rest of the paper.
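A minimal sketch of this configuration follows, assuming a linear projection takes the 280-dimensional spliced inputs to the 256-dimensional model width (the text does not spell out this step):

```python
# Sketch of the Self Attention Encoder: 6 post-norm transformer layers with
# 4 heads (64-dim each -> 256-dim model width) and a 1024-dim feed-forward
# layer. The sinusoidal positional encoding follows [16]; the 280->256 input
# projection is an assumption about how inputs reach the model width.
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(T, d):
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2).float() * (-math.log(10000.0) / d))
    pe = torch.zeros(T, d)
    pe[:, 0::2] = torch.sin(pos * div)   # alternating sine ...
    pe[:, 1::2] = torch.cos(pos * div)   # ... and cosine waves
    return pe

class SelfAttentionEncoder(nn.Module):
    def __init__(self, in_dim=280, d_model=256, heads=4, ffn=1024,
                 layers=6, n_out=54):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)   # assumed input projection
        layer = nn.TransformerEncoderLayer(d_model, heads, dim_feedforward=ffn,
                                           batch_first=True)  # skip + LayerNorm
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.out = nn.Linear(d_model, n_out)

    def forward(self, x):                        # x: (B, T, 280), spliced
        x = x + sinusoidal_encoding(x.shape[1], x.shape[2]).to(x.device)
        h = self.encoder(self.proj(x))
        return self.out(h).log_softmax(dim=-1)   # trained with the CTC loss
```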
Figure 1: Proposed network architecture. The encoder comprises a stack of self-attention layers and is trained by minimizing the CTC loss (left). The dashed arrow shows an optional decoder that can be added during training (right). Note that for all configurations, only the encoder branch (left) is used during inference.

2.3. Transformer Encoder

Recently, there have been several studies [18, 19, 20] suggesting that the accuracy of a sequence-to-sequence architecture (an encoder and an autoregressive decoder with cross-attention) for speech recognition [21] can be improved by jointly minimizing the cross-entropy loss on the decoder and the CTC loss acting on the outputs of the encoder. The intuition is that these two losses regularize each other, resulting in faster training convergence and more accurate models. We experiment with such an architecture (dashed arrow in Figure 1). At inference, we only use the encoder branch of the network, so effectively the decoder trained to minimize the cross-entropy loss acts as an additional regularization term added to the network described in Section 2.2. Since we do not use the decoder at inference, the number of parameters of this model and the one described above remain the same. The architecture of the decoder is depicted in Figure 1. We use a stack of 6 layers in the decoder, keeping the parameters of the self-attention and the feed-forward layers exactly the same as in the encoder. We linearly combine the cross-entropy loss and the CTC loss for every utterance with unity coefficients. We refer to this configuration as the Transformer Encoder in the rest of the paper.
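The joint objective can be sketched as follows, using a generic transformer decoder with teacher forcing; the decoder wiring and start-of-sequence handling are our assumptions, not the paper's exact recipe:

```python
# Sketch of the joint objective: CTC on the encoder outputs plus
# cross-entropy on an autoregressive decoder, combined with unit weights.
# The decoder wiring (generic nn.TransformerDecoder, start-of-sequence
# token, teacher forcing) is an assumption.
import torch
import torch.nn as nn

d_model, n_phones = 256, 54
dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4,
                                       dim_feedforward=1024, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
embed = nn.Embedding(n_phones, d_model)
dec_out = nn.Linear(d_model, n_phones)

def joint_loss(enc_h, enc_logprobs, in_lens, targets, tgt_lens, sos=0):
    """enc_h: (B, T, 256) encoder states; enc_logprobs: (T, B, C) for CTC;
    targets: (B, S) phone labels."""
    ctc = nn.CTCLoss(blank=0)(enc_logprobs, targets, in_lens, tgt_lens)
    # Teacher forcing: shift the targets right to form the decoder inputs.
    dec_in = torch.cat([torch.full_like(targets[:, :1], sos), targets[:, :-1]], 1)
    mask = nn.Transformer.generate_square_subsequent_mask(dec_in.shape[1])
    logits = dec_out(decoder(embed(dec_in), enc_h, tgt_mask=mask))
    ce = nn.CrossEntropyLoss()(logits.reshape(-1, n_phones), targets.reshape(-1))
    return ctc + ce                              # unity coefficients

enc_h = torch.randn(2, 100, d_model)             # toy encoder states
logp = torch.randn(100, 2, n_phones).log_softmax(-1)
tgt = torch.randint(1, n_phones, (2, 12))
loss = joint_loss(enc_h, logp, torch.tensor([100, 100]), tgt,
                  torch.tensor([12, 12]))
```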
3. Multi-task Learning
The model architectures outlined in Section 2 are all monophone AMs, trained to minimize the CTC loss or, for the models in Section 2.3, a combination of the CTC loss and the cross-entropy loss. As argued in [5], this training objective does not match the final objective we care about, which is to discriminate between examples of true triggers and phonetically similar acoustic segments. Previously we showed that we can achieve significant performance improvements in trigger detection by adding a relatively small amount of trigger-phrase-specific discriminative data and fine-tuning a pre-trained phonetic AM to minimize the CTC loss and the discriminative loss simultaneously [5]. We apply this idea to the models described in Section 2. For each of the model architectures, we take the encoder branch of the model and add an additional output layer (affine transformation + softmax non-linearity) with 2 output units at the end of the encoder network. One unit corresponds to the trigger phrase, while the other unit corresponds to the negative class. The objective for the discriminative branch is as follows: for positive examples we minimize the loss $C = -\max_t \log y_t^P$, where $y_t^P$ is the network output at time $t$ for the positive class. This loss function encourages the network to yield a high score independent of the temporal position; note that this is only useful for networks that read the entire input at once. For negative examples, the loss function is $C = -\sum_t \log y_t^N$, where $y_t^N$ is the network output for the negative class at time $t$. This loss forces the network to output a high score for the negative class at every frame.
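These two loss functions translate directly into code. In the sketch below, the convention that output index 0 is the trigger class and index 1 the negative class is our assumption:

```python
# Direct translation of the two discriminative losses. `disc_logprobs` holds
# per-frame log-probabilities from the 2-unit softmax head; the class-index
# convention (0 = trigger, 1 = negative) is an assumption.
import torch

def positive_loss(disc_logprobs):
    """C = -max_t log y_t^P: a high trigger score at *some* frame suffices."""
    return -disc_logprobs[:, 0].max()

def negative_loss(disc_logprobs):
    """C = -sum_t log y_t^N: demand a high negative score at *every* frame."""
    return -disc_logprobs[:, 1].sum()

scores = torch.randn(50, 2).log_softmax(dim=-1)  # (T, 2) for one segment
print(positive_loss(scores), negative_loss(scores))
```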
4. Model Training
We follow a similar pipeline to [5] for preparing the training data for the monophone AMs described in Section 2. We start with a clean dataset of about 2700 hours of transcribed audio. These examples are recorded on mobile phones and are therefore assumed to be near field. We augment each utterance in the dataset by convolving the audio with a room impulse response (RIR) randomly selected from a set of 3000 RIRs. This process yields a reverberated copy of the original dataset. Next, we collect over 400,000 examples of echo residuals from various devices playing music, podcasts and text-to-speech at varying volumes [22]. We then mix each example in the reverberated dataset with a randomly selected echo residual from this corpus, resulting in over 8700 hours of transcribed and augmented training data for the AMs. We pick training examples that explicitly do not contain the trigger phrase in order to avoid biasing the AMs.
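A sketch of this augmentation, assuming unit mixing gains and looping the residual to length (the text specifies neither):

```python
# Sketch of the augmentation pipeline: reverberate each clean utterance with
# a random RIR, then mix in a random echo residual. Unit mixing gains and
# looping the residual to length are assumptions.
import numpy as np

def augment(clean, rirs, residuals, rng):
    """clean: 1-D waveform; rirs / residuals: lists of 1-D arrays."""
    rir = rirs[rng.integers(len(rirs))]
    reverbed = np.convolve(clean, rir)[: len(clean)]  # reverberated copy
    res = residuals[rng.integers(len(residuals))]
    return reverbed + np.resize(res, len(reverbed))   # add echo residual

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                    # 1 s at 16 kHz
rirs = [rng.standard_normal(4000) * 0.1 for _ in range(3)]
residuals = [rng.standard_normal(8000) for _ in range(3)]
noisy = augment(clean, rirs, residuals, rng)
```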
We use the same dataset as described in [5] for the MTL experiments: 40,000 examples that falsely trigger the baseline system and another 140,000 examples of true trigger phrases. We run a first-pass DNN-HMM detector on the audio to obtain trigger start and end boundaries for each utterance. We then extract only these segments from each utterance, which results in 90 hours of audio. For the MTL experiments, we concatenate the AM training dataset and the discriminative dataset and randomly sample mini-batches from the combined dataset.
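A minimal sketch of this mixing strategy, using placeholder datasets in place of the real corpora:

```python
# Sketch of the MTL data mixing: concatenate the AM and discriminative
# datasets and draw shuffled mini-batches from the union. The TensorDatasets
# are placeholders for the real corpora.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

am_data = TensorDataset(torch.randn(1000, 280), torch.zeros(1000))
disc_data = TensorDataset(torch.randn(90, 280), torch.ones(90))

loader = DataLoader(ConcatDataset([am_data, disc_data]),
                    batch_size=32, shuffle=True)      # random mixing
x, y = next(iter(loader))
```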
We use exactly the same hyper-parameters for training all the models: mini-batches of 32 utterances per GPU, with 16 GPUs in parallel, and an initial learning rate of 5e-5. We use the Adam optimizer [23] and synchronous gradient updates. We stop training if the validation loss does not improve for 8 consecutive training epochs.
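A skeleton of this training configuration; the model and validation routine are stubs:

```python
# Skeleton of the training configuration: Adam at 5e-5 and early stopping
# after 8 epochs without validation improvement. The model and validation
# routine are stubs.
import torch

model = torch.nn.Linear(280, 54)                      # stand-in for the AM
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

def validate(model):                                  # stub: returns the
    return torch.rand(()).item()                      # validation loss

best, stale = float("inf"), 0
for epoch in range(1000):
    # ... one synchronous pass over the training data goes here ...
    val = validate(model)
    if val < best:
        best, stale = val, 0
    else:
        stale += 1
        if stale >= 8:                                # patience of 8 epochs
            break
```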
5. Evaluation
We use the same datasets for evaluation as described in [5], with no changes, so the results are directly comparable. Both datasets were internally collected specifically for the purpose of evaluating voice trigger models, using smart speakers in a variety of environments and conditions to simulate real-world usage. The first, structured, dataset contains utterances from 100 participants, approximately evenly divided between male and female. Each subject speaks a series of prompted voice commands, where each command is preceded by the trigger phrase. The recordings are made in 4 different acoustic settings: a quiet room, external noise from a TV or kitchen appliances, music playing from the recording device at medium volume, and music playback from the device at loud volume. The final condition is the most difficult, since the input signals contain a considerable amplitude of residual noise. We collect 13,000 such positive utterances. These examples allow us to measure the number of false rejections (FRs) made by the system. We also use a set of 2000 hours of audio recordings comprising podcasts, audiobooks, TV playback etc. This audio does not contain the trigger phrase and acts as a negative set that allows us to estimate the number of false alarms (FAs) per hour of active audio.

We also conduct a second, unstructured, data collection at home by our employees. We ask 42 participants to use a smart speaker daily for 2 weeks. We enable extra logging and review by the users in order to allow them to choose which recordings they want to delete. This setup allows us to collect data that represents more spontaneous device usage (non-stationary sources, non-stationary noise, children's speech, overlapping speech etc.). Continuous recording for 2 weeks on a device is not possible, therefore we use a first-pass DNN-HMM system [1] with a low threshold that detects audio segments phonetically similar to the trigger phrase, allowing us to measure (almost) unbiased false reject rates for realistic in-home usage. (With customer data, the audio sent to the server has already triggered the device, making it impossible to measure false reject rates.)
Figure 2 presents modified detection error tradeoff (DET) curves for all the models evaluated on the structured evaluation dataset. The (log) X-axis represents the number of false alarms (FAs) per hour of active audio, while the Y-axis represents the proportion of false rejects (FRs) by the system (lower is better). Solid curves represent the baseline models trained only on the large AM training dataset, while the dashed curves represent the MTL versions of these models trained on the dataset described in Section 4. From Figure 2, note that the model with self-attention layers trained with the CTC objective function (red) yields much better accuracy than the baseline BiLSTM model (blue). In fact, this model yields similar accuracies to the MTL version of the BiLSTM model (dashed blue). This is a significant result, as the new model was trained without any additional discriminative data, and therefore the improvements can be attributed entirely to the change of layer architecture. Additionally, the self-attention model has 10% fewer parameters than the baseline BiLSTM model. Next, the model with self-attention layers trained with a decoder loss (green) yields more than 50% relative improvement over the BiLSTM baseline (Table 1). This model is trained only on the AM training set, but it is more accurate than the MTL versions of both the baseline BiLSTM (dashed blue) and the self-attention + CTC model (dashed red). This result suggests that adding the decoder as an additional loss yields significant improvements without adding any extra parameters to the model (since we only use the encoder part of the network). Again, this result is practically useful: collecting a dataset of difficult negative examples is challenging and such data is not available for many languages, yet these new models yield better results than the MTL versions of the baseline without requiring any discriminative training data. Finally, for all architectures, the MTL versions of the models (dashed curves) always yield significant improvements over the baselines. Table 1 presents FR rates at an operating point of 1 FA per 100 hours for the different models.

Figure 2: DET curves for the structured evaluation set.

Model Architecture        Phonetic Training FRR (%)   MTL Training FRR (%)
BiLSTM                    11.7                        8.9
Self Attention Encoder    8.9                         5.6
Transformer Encoder       4.9                         3.2

Table 1: False Reject Rate (FRR) at an operating point of 1 FA/100 hrs on the structured evaluation set.

Figure 3 presents DET curves for the unstructured evaluation dataset. We observe a similar trend as above. The self-attention network trained with the CTC loss (red) improves over the baseline BiLSTM network (blue). Next, the self-attention network trained with both the CTC loss and the additional decoder yields further improvements (green), notably yielding better accuracies than the MTL version of the BiLSTM baseline (dashed blue). Finally, the MTL versions of both self-attention networks yield significant improvements over all the baselines. Table 2 presents the FR rate at 100 FAs for the different architectures considered.

Figure 3: DET curves for the take-home evaluation set.

Model Architecture        Phonetic Training FRR (%)   MTL Training FRR (%)
BiLSTM                    3.4                         1.9
Self Attention Encoder    2.6                         1.6
Transformer Encoder       2.2                         1.6

Table 2: False Reject Rate (FRR) at an operating point of 100 FAs on the unstructured take-home evaluation set.
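For reference, the sketch below shows one way an FRR at a fixed FA budget, such as the 1 FA/100 hrs operating point in Table 1, can be read off from detection scores; the score arrays are stand-ins:

```python
# Sketch of reading off an operating point: pick the threshold that admits
# a fixed FA budget on the negative audio (1 FA / 100 hrs by default) and
# report the fraction of rejected positives. Scores here are stand-ins.
import numpy as np

def frr_at_fa_budget(pos_scores, neg_scores, neg_hours, fa_per_hr=0.01):
    allowed = int(fa_per_hr * neg_hours)         # e.g. 20 FAs over 2000 hrs
    thresh = np.sort(neg_scores)[::-1][allowed]  # admits `allowed` FAs
    return float((pos_scores <= thresh).mean())  # rejected positives

rng = np.random.default_rng(0)
pos = rng.normal(3.0, 1.0, 13000)                # trigger-segment scores
neg = rng.normal(0.0, 1.0, 200000)               # scores from 2000 h of audio
print(frr_at_fa_budget(pos, neg, neg_hours=2000))
```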
We compare the on-device inference times for the baseline BiLSTM model (5.5M parameters) and the Transformer Encoder (4.8M parameters) in Table 3. At inference, the Self Attention Encoder and the Transformer Encoder perform exactly the same computation. Inference was performed on 1.8 seconds of audio sampled at 16 kHz. The compute platform was a 2019 smartphone with fixed CPU core and frequency. Compared to the baseline 8-bit 5.5M BiLSTM model, the 8-bit quantized 4.8M Transformer Encoder model yields a 70% improvement in inference speed. The large improvement in runtime can be attributed to the self-attention layers not having sequential dependencies, making them more parallelizable than BiLSTMs; self-attention layers can therefore make better use of the specialized hardware on a modern processor for accelerating matrix computations. Table 4 shows that the compute speed advantage of the Transformer Encoder model is smaller on a variety of older platforms. In addition to faster inference, Table 5 shows that the Transformer Encoder models are also significantly faster (∼5X) to train.

Metric             BiLSTM (5.5M)   Transformer Encoder (4.8M)   Impr. (%)
Network runtime    95 ms           28 ms                        70

Table 3: Network runtime on a 2019 smartphone.

Platform           % Improvement in Network Runtime
Smart Speaker      20.8%
Smart Watch        30.5%
2015 Smart Phone   17.8%

Table 4: Network runtime improvements of the 4.8M Transformer Encoder over the BiLSTM on various platforms.

Model Architecture        Utterances per second   Epochs   Train Time (minutes)
BiLSTM                    210                     115      10080
Self Attention Encoder    1121                    77       1304
Transformer Encoder       788                     75       1862

Table 5: Average AM training time statistics over 5 runs.
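The paper does not describe the quantization toolchain used for deployment; as an illustration only, the sketch below applies PyTorch dynamic 8-bit quantization to a stand-in network and times a forward pass over roughly 1.8 seconds of subsampled input:

```python
# Illustration only: applies PyTorch dynamic 8-bit quantization to a
# stand-in network and times a forward pass over ~1.8 s of (subsampled)
# input. The deployed models may use a different toolchain entirely.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(280, 256), nn.ReLU(),
                      nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 54))
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)      # 8-bit weights

x = torch.randn(1, 60, 280)                     # 1.8 s at 100 FPS / 3
with torch.no_grad():
    t0 = time.perf_counter()
    for _ in range(100):
        qmodel(x)
print(f"{(time.perf_counter() - t0) * 10:.2f} ms per call")
```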
6. Conclusions
In this work we study the problem of designing a hardware-efficient voice trigger detection system. We start with a BiLSTM network trained to minimize the CTC loss. We explore replacing the BiLSTM layers with self-attention layers and show improvements in accuracy and inference times. We propose to regularize the training process by adding an auto-regressive transformer decoder with a cross-entropy loss and show significant improvements in accuracy. We then improve the results further by using MTL on the encoder outputs, with the additional task being a true-trigger/false-trigger classifier. We show that compared to the baseline BiLSTM approach, the hybrid transformer/CTC setup significantly improves the FRR, by ∼60% for a given FAR (1 FA/100 hrs), with 10% fewer model parameters. Additionally, the proposed approach reduces the on-device inference time by 70% and is ∼5X faster to train.

7. References

[1] Apple Machine Learning Blog, "Hey Siri: An On-device DNN-powered Voice Trigger for Apple's Personal Assistant," https://machinelearning.apple.com/2017/10/01/hey-siri.html, October 2017.
[2] A. Gruenstein, R. Alvarez, C. Thornton, and M. Ghodrat, "A Cascade Architecture for Keyword Spotting on Mobile Devices," arXiv preprint arXiv:1712.03603, 2017.
[3] M. Wu, S. Panchapagesan, M. Sun, J. Gu, R. Thomas, S. N. P. Vitaladevuni, B. Hoffmeister, and A. Mandal, "Monophone-Based Background Modeling for Two-Stage On-Device Wake Word Detection," in ICASSP. IEEE, 2018, pp. 5494–5498.
[4] S. Sigtia, R. Haynes, H. Richards, E. Marchi, and J. Bridle, "Efficient Voice Trigger Detection for Low Resource Hardware," in INTERSPEECH, 2018, pp. 2092–2096.
[5] S. Sigtia, P. Clark, R. Haynes, H. Richards, and J. Bridle, "Multi-Task Learning for Voice Trigger Detection," in ICASSP, 2020, pp. 7449–7453.
[6] G. Chen, C. Parada, and G. Heigold, "Small-Footprint Keyword Spotting Using Deep Neural Networks," in ICASSP. IEEE, 2014, pp. 4087–4091.
[7] S. Choi, S. Seo, B. Shin, H. Byun, M. Kersner, B. Kim, D. Kim, and S. Ha, "Temporal Convolution for Real-time Keyword Spotting on Mobile Devices," INTERSPEECH, 2019.
[8] T. N. Sainath and C. Parada, "Convolutional Neural Networks for Small-Footprint Keyword Spotting," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[9] S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, "Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting," INTERSPEECH, 2017.
[10] C.-C. Kao, M. Sun, Y. Gao, S. Vitaladevuni, and C. Wang, "Sub-band Convolutional Neural Networks for Small-footprint Spoken Term Classification," INTERSPEECH, 2019.
[11] S. Fernández, A. Graves, and J. Schmidhuber, "An Application of Recurrent Neural Networks to Discriminative Keyword Spotting," in International Conference on Artificial Neural Networks. Springer, 2007, pp. 220–229.
[12] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw, "Streaming Small-Footprint Keyword Spotting Using Sequence-to-Sequence Models," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017.
[13] T. Yamamoto, R. Nishimura, M. Misaki, and N. Kitaoka, "Small-Footprint Magic Word Detection Method Using Convolutional LSTM Neural Network," INTERSPEECH, pp. 2035–2039, 2019.
[14] J. Cheng, L. Dong, and M. Lapata, "Long Short-Term Memory-Networks for Machine Reading," arXiv preprint arXiv:1601.06733, 2016.
[15] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A Structured Self-Attentive Sentence Embedding," arXiv preprint arXiv:1703.03130, 2017.
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention Is All You Need," in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 5998–6008.
[17] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer Normalization," arXiv preprint arXiv:1607.06450, 2016.
[18] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[19] Z. Yuan, Z. Lyu, J. Li, and X. Zhou, "An Improved Hybrid CTC-Attention Model for Speech Recognition," arXiv preprint arXiv:1810.12020, 2018.
[20] Z. Xiao, Z. Ou, W. Chu, and H. Lin, "Hybrid CTC-Attention based End-to-End Speech Recognition Using Subword Units," April 2019.
[21] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition," in ICASSP. IEEE, 2016, pp. 4960–4964.
[22] Apple Machine Learning Blog, "Optimizing Siri on HomePod in Far-Field Settings," https://machinelearning.apple.com/2018/12/03/optimizing-siri-on-homepod-in-far-field-settings.html, December 2018.
[23] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," arXiv preprint arXiv:1412.6980, 2014.
IEEE Journal of Selected Topics in Signal Processing ,vol. 11, no. 8, pp. 1240–1253, 2017. [19] Z. Yuan, Z. Lyu, J. Li, and X. Zhou, “An Improved HybridCTC-Attention Model for Speech Recognition,” arXiv preprintarXiv:1810.12020 , 2018.[20] Z. Xiao, Z. Ou, W. Chu, and H. Lin, “Hybrid CTC-Attentionbased End-to-End Speech Recognition Using Subword Units,” 042019.[21] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, Attend andSpell: A Neural Network for Large Vocabulary ConversationalSpeech Recognition,” in . IEEE, 2016,pp. 4960–4964.[22] Apple Machine Learning Blog, “Optimizing Siri on HomePod inFarField Settings,” https://machinelearning.apple.com/2018/12/03/optimizing-siri-on-homepod-in-far-/field-settings.html, De-cember 2018.[23] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Opti-mization,” arXiv preprint arXiv:1412.6980arXiv preprint arXiv:1412.6980