Complementary Language Model and Parallel Bi-LRNN for False Trigger Mitigation
Rishika Agarwal, Xiaochuan Niu, Pranay Dighe, Srikanth Vishnubhotla, Sameer Badaskar, Devang Naik
Apple, One Apple Park Way, Cupertino, CA, USA
{rishika_agarwal, xniu, pdighe, svishnubhotla, badaskar, naik.d}@apple.com

Abstract
False triggers in voice assistants are unintended invocations of the assistant, which not only degrade the user experience but may also compromise privacy. False trigger mitigation (FTM) is the process of detecting false trigger events and responding appropriately to the user. In this paper, we propose a novel solution to the FTM problem by introducing a parallel ASR decoding process with a special language model trained from “out-of-domain” data sources. Such a language model is complementary to the existing language model optimized for the assistant task. A bidirectional lattice RNN (Bi-LRNN) classifier trained from the lattices generated by the complementary language model shows a . % relative reduction of the false trigger (FT) rate at the fixed false suppression (FS) rate of 0.4% of correct invocations, compared to the current Bi-LRNN model. In addition, we propose to train a parallel Bi-LRNN model based on the decoding lattices from both language models, and examine various ways of implementing it. The resulting model leads to a further reduction of the false trigger rate by . %.

Index Terms: Voice Trigger Detection, False Trigger Mitigation, Lattice RNN, Language Model
1. Introduction
(To appear in Proceedings of InterSpeech 2020.)

Voice trigger detection is a vital part of current voice assistant products. In such systems, one or multiple trigger phrases are defined for users to invoke the device to process voice requests. The design of a trigger detector is often constrained by the limited computation resources and power consumption of the hardware; therefore we often adopt simple DSP and acoustic models [1, 2]. In practice, a trigger detector is usually operated in a low-false-rejection mode in order to allow most acoustic samples to be passed to downstream processes. However, such a design may cause the assistant to (wrongly) respond to unintended acoustic inputs. There are also cases where users accidentally invoke the assistant through a UI element such as a button press or a particular gesture. Such unintended invocations of voice assistants are referred to as “false triggers”. To mitigate the false trigger cases, one can introduce an extra process to determine whether an acoustic sample is intended or not, which is in essence a binary classification problem.

The false trigger mitigation process can make use of both acoustic and linguistic clues from the input sample. When the errors are due to the voice trigger detector, an intuitive approach is to feed the acoustic sample into an ASR system and check for the existence of trigger phrases in the 1-best output [3]. In more general cases, the text output contains the intent information from the user, and can therefore be used as input to the classifier. In [4], the 1-best output is encoded as an LSTM embedding to represent the linguistic feature. It is combined with the LSTM embedding of the acoustic features, and with decoder features including trellis entropy, Viterbi cost, confidence, and average number of arcs, as the final input feature set to the classifier. Considering that the ASR results may contain errors, the decoder features are designed explicitly to capture the ambiguity of the decoding process. A recent follow-up work [5] focuses on improving the acoustic features by incorporating utterance-level representations. It also introduces dialog-type information to help the classifier make better decisions.

To build an intent classifier, the authors of [6] propose a condensed representation of lattices from the ASR decoder, called “Lattice RNN” (LRNN). By introducing a pooling operation over the incoming arcs of each node in the lattice, and a propagation operation over the outgoing arcs of the nodes, the authors are able to construct a neural network on a lattice and encode the whole lattice information as the vector output from the final node of the lattice. The LRNN embedding is used as the input vector of the intent classifier, which achieves better accuracy and faster run-time compared to the baseline model running on N-best results. A similar approach can be applied to the FTM task. Our previous work [7] redefined the feature set attached to each arc in the decoder lattice, and extended the network to a bidirectional one (Bi-LRNN). The decoding lattice is encoded as the concatenation of hidden layers from the start and end nodes of the lattice. A classifier built on top of the Bi-LRNN is thus able to mitigate the false trigger cases significantly. A recent work [8] explored the use of graph neural networks (GNN) to encode the decoding lattice, which achieves accuracy similar to the Bi-LRNN representation with more efficient training.

In this paper, we investigate the impact of the decoder’s language model (LM) on false trigger mitigation. Considering that the voice assistant’s LM is usually well trained with in-domain data, and that the LM also tends to see more usage data with the trigger phrase at the beginning, it is likely that the LM is biased towards the in-domain data, which thereby biases it towards detecting the trigger phrase. This bias may reduce the power of the decoding lattice in mitigating false triggers. In our study, we train a new LM that is not biased towards the trigger phrase and in-domain data. We compare the mitigation performance between the Bi-LRNN classifiers built from the lattice outputs of the different LMs. We further investigate how to make use of the complementary information in the two language models, and propose approaches to build a parallel Bi-LRNN, which leads to further improvement in false trigger mitigation.
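The simplest FTM heuristic mentioned above — checking the ASR 1-best hypothesis for the trigger phrase [3] — can be sketched in a few lines. This is a minimal illustration, not the production logic; the trigger phrase and the sample hypotheses are assumptions for the example:

```python
def likely_intended(one_best: str, trigger_phrase: str = "hey assistant") -> bool:
    """Crude FTM heuristic: accept the invocation only if the ASR 1-best
    hypothesis starts with the trigger phrase (after lowercasing)."""
    words = one_best.lower().split()
    trigger = trigger_phrase.lower().split()
    return words[: len(trigger)] == trigger

# A sample beginning with the trigger phrase is treated as intended,
# while background speech without it is suppressed.
print(likely_intended("hey assistant what time is it"))   # True
print(likely_intended("and then he said something funny"))  # False
```

The limitation, as the text notes, is that this uses only the single 1-best string; the lattice-based methods below exploit the full decoder ambiguity instead.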
2. Method
In our baseline Bi-LRNN system, we obtain the word hypothesis lattice L for an acoustic sample X from the ASR decoder. The lattice consists of a start node, an end node, and other intermediate nodes. The nodes are connected via arcs, and each arc has a feature vector associated with it. The Bi-LRNN computes a forward and a backward latent embedding for each node in the lattice (refer to [7] for more details). The final outputs of the Bi-LRNN are the forward latent embedding of the end node, h_f(s_end), and the backward latent embedding of the start node, h_b(s_start), where s_start and s_end denote the lattice’s start and end nodes. A feed-forward classifier then takes [h_f(s_end), h_b(s_start)] as input. The classifier gives a real-valued output y ∈ [0, 1], which is converted to a label l_pred ∈ {0, 1} by choosing a threshold t. The threshold can be kept fixed at a certain value, or can be evaluated empirically on the cross-validation set to achieve the desired False Suppression (FS) rate of invocations.

A typical ASR decoding process can be formulated as searching for the best word sequence W* that maximizes (1), where P(X | W) denotes the acoustic model (AM), representing the conditional probability of acoustic features X given a word sequence W, and P(W) denotes the language model (LM), representing the probability of any word sequence W. Ideally, the LM of an ASR system should approximate the distribution of all the word sequences that could reach the decoder. However, in practice, the voice assistant application is only designed to respond to a relevant set of user requests, so the LM is usually trained to maximize the likelihood of in-domain sentences. If we refer to the in-domain sentences as a class L_D, and out-of-domain sentences as a class L_O, the ASR LM trained from in-domain data can be explicitly represented as P(W | L_D) in (2), with P(L_D) denoting the prior probability of in-domain usage.
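The node-wise recursion underlying the Bi-LRNN can be illustrated on a toy lattice. This is only a sketch of the propagate-and-pool pattern: the scalar arc features, the mean pooling, and the trivial additive update are illustrative assumptions, not the actual learned architecture of [7]:

```python
# Toy bidirectional lattice recursion: each node pools contributions over
# its incoming arcs (forward pass) or outgoing arcs (backward pass).
# Node states and arc features are scalars purely for brevity.

def run_pass(order, incoming):
    """order: nodes in processing order; incoming[n]: (predecessor, arc_feature)
    pairs for node n in this direction. Returns a dict of node states."""
    h = {order[0]: 0.0}  # the first node in this direction starts at 0
    for node in order[1:]:
        # mean-pool the contributions of all arcs entering this node
        contribs = [h[src] + feat for (src, feat) in incoming[node]]
        h[node] = sum(contribs) / len(contribs)
    return h

# Lattice: node 0 is the start, node 3 is the end; arcs carry a feature value.
fwd_in = {1: [(0, 0.5)], 2: [(0, 0.1)], 3: [(1, 0.3), (2, 0.7)]}
bwd_in = {2: [(3, 0.7)], 1: [(3, 0.3)], 0: [(1, 0.5), (2, 0.1)]}

h_f = run_pass([0, 1, 2, 3], fwd_in)   # forward: start -> end
h_b = run_pass([3, 2, 1, 0], bwd_in)   # backward: end -> start

# The classifier input is the pair (forward state at the end node,
# backward state at the start node).
print(h_f[3], h_b[0])
```

A real Bi-LRNN replaces the additive update with learned transforms and vector states, but the traversal order and the pooling over arcs are the same idea.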
W* = argmax_W { P(X | W) P(W) }    (1)
   = argmax_W { P(X | W, L_D) P(W | L_D) P(L_D) }    (2)

To use the ASR decoding information to determine whether a sequence of acoustic features represents an unintended invocation, we can compute the probability of in-domain usage given the acoustic observation, P(L_D | X). This measurement can be expanded as in (4), in which the first factor is the summation of AM and LM probabilities over all sentence hypotheses. An approximation can be made by applying the summation over the resulting lattice paths during decoding, ignoring the low-likelihood word sequences pruned in the process. The Bi-LRNN embedding can be interpreted as an implicit representation of such a measurement with more flexibility and modeling capacity [7].

P(L_D | X) = P(X | L_D) P(L_D) / P(X)    (3)
           = Σ_i { P(X | W_i, L_D) P(W_i | L_D) } P(L_D) / P(X)    (4)

The drawback of the measurement in (4) is that P(W | L_D) only contains in-domain information, so its power to reject falsely triggered samples may be limited. If we have a good estimate of the distribution of out-of-domain sentences with an LM P(W | L_O), we can construct a complementary measurement, P(L_O | X), which in theory should have more power to reject false triggers. Equation (5) implies that we run ASR decoding with a different LM, P(W | L_O), to generate lattices different from the default ones. We can apply the same Bi-LRNN operation on the out-of-domain lattices for more modeling capacity.

P(L_O | X) = Σ_i { P(X | W_i, L_O) P(W_i | L_O) } P(L_O) / P(X)    (5)

Furthermore, we can derive a probability ratio measurement as shown in (6), which adopts the ratio between the in-domain and out-of-domain probabilities given the acoustic observation to balance the suppression/trigger decision. This measurement implies that two ASR decoders can be run in parallel to produce two different lattices from the same acoustic input.
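The lattice-restricted sums in (4) and (5) are typically computed in the log domain with log-sum-exp. A minimal sketch, where the path log-scores (joint AM+LM, log P(X | W_i, L) + log P(W_i | L)) and the class priors are hypothetical values, not real decoder outputs:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Hypothetical joint AM+LM log-scores of the paths kept in each lattice.
in_domain_paths  = [-12.3, -14.1, -15.8]   # paths in the in-domain lattice
out_domain_paths = [-11.9, -13.0]          # paths in the out-of-domain lattice
log_prior_d = math.log(0.9)                # assumed prior P(L_D)
log_prior_o = math.log(0.1)                # assumed prior P(L_O)

# Lattice approximations of the numerators of (4) and (5). The shared P(X)
# cancels when the two are compared, so their difference is the log of the
# in-domain vs. out-of-domain ratio.
log_num_d = logsumexp(in_domain_paths) + log_prior_d
log_num_o = logsumexp(out_domain_paths) + log_prior_o
log_ratio = log_num_d - log_num_o
print(log_ratio > 0)  # True here: the in-domain class wins -> keep the invocation
```

The Bi-LRNN replaces this explicit scoring with a learned function of the same lattice evidence, as described in the text.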
By combining the information from both lattices, we may be able to achieve better discriminative capacity between the two classes. Once more, this measurement can be generalized by training two Bi-LRNNs from the lattices of the two decoders; the network can then learn a more complex relationship between the two lattices when the target cost function is set to minimize the classification errors.

P(L_D | X) / P(L_O | X) = [ Σ_i { P(X | W_i, L_D) P(W_i | L_D) } P(L_D) ] / [ Σ_j { P(X | W_j, L_O) P(W_j | L_O) } P(L_O) ]    (6)

In the error analysis (Section 3.4), we show that the base model is more accurate on some examples and the out-of-domain model is better on others, depending on the true label of the example. Thus, the language models likely represent complementary information, and a model comprising both LMs could outperform the individual models based on either LM alone. To achieve this, the outputs from the two Bi-LRNN models can be combined in different ways before being passed to the classifier. We explore the following ensembling techniques, and compare the FT rates achieved by each of them in Section 3.5:

• Combine scores from the pre-trained Bi-LRNNs: we take the prediction scores y_1 and y_2 from the two Bi-LRNNs trained separately, and pass them to a shallow classifier (only the classifier layers are trained).

• Combine the Bi-LRNN embeddings from the pre-trained Bi-LRNNs: we take the latent Bi-LRNN embeddings h_f1, h_b1, h_f2, h_b2 from the pre-trained Bi-LRNNs and pass them to a classifier (here again, only the classifier is trained).

• Train the Bi-LRNNs in parallel, by back-propagating the classifier loss: the setting is the same as the previous case, but we back-propagate the classifier loss to both Bi-LRNNs as well. Thus, the entire model is trained end-to-end (from scratch, or by loading the weights of the trained Bi-LRNNs and fine-tuning them). The schematic of the model is shown in Figure 1.

• Mixture of Experts: instead of concatenating the embeddings of the two Bi-LRNNs, we can pass their weighted sum to the classifier. A Mixture of Experts model [9] computes the relative importance of each “expert” (in this case, the two Bi-LRNNs are the “experts”), and weighs the outputs of the models by a parameter α. The weight α determines the reliability of each Bi-LRNN for an input sample, and we pass the weighted sum of the lattice embeddings, [α h_f1 + (1 − α) h_f2, α h_b1 + (1 − α) h_b2], to the classifier. The model is trained end-to-end. The schematic of the model is shown in Figure 2.

Figure 1: Schematic diagram of the parallel Bi-LRNN model
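The α-gated combination in the Mixture of Experts variant can be sketched as follows. The embedding dimensionality and the fixed gate value are illustrative assumptions; in the actual model, α is produced per sample by a trained gating component:

```python
def gate_combine(h_f1, h_b1, h_f2, h_b2, alpha):
    """Blend the two Bi-LRNN embedding pairs with a shared gate alpha,
    then concatenate the blended forward and backward parts:
    [alpha*h_f1 + (1-alpha)*h_f2, alpha*h_b1 + (1-alpha)*h_b2]."""
    fwd = [alpha * a + (1 - alpha) * b for a, b in zip(h_f1, h_f2)]
    bwd = [alpha * a + (1 - alpha) * b for a, b in zip(h_b1, h_b2)]
    return fwd + bwd  # input vector for the feed-forward classifier

# Toy 2-dimensional embeddings from the two Bi-LRNNs.
combined = gate_combine([1.0, 0.0], [0.0, 1.0],
                        [0.0, 2.0], [2.0, 0.0], alpha=0.75)
print(combined)  # [0.75, 0.5, 0.5, 0.75]
```

Note that, unlike plain concatenation, the gated sum keeps the classifier input at the dimensionality of a single Bi-LRNN's output.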
3. Experiment and results
All our experiments are performed on an FTM dataset [8], which is composed of far-field usage samples with manual labels for the “true trigger” (TT) and “false trigger” (FT) classes. The raw audio data are split into train, cv, dev, and eval sets for the purposes of training, cross-validation, development, and evaluation. The train and cv sets are augmented by adding gain, noise, and speed perturbations, which increases the amount of training data by 3x. Table 1 summarizes the amount of data in each set and condition.

We train FTM classifiers for multiple epochs on the train set. The training epoch which achieves the lowest FT rate on the cv set is evaluated on the dev and eval sets. We expect our voice assistant to have minimal false triggers and maximal true positives (minimal FS) for a good user experience. We thus focus on the low-FS regime in our DET curves; the lower the AUC (Area Under Curve) of the DET curve, the better the model. In our experiments, we arbitrarily choose a small FS rate of 0.4% to act as the operating point. The False Trigger (FT) rate at this FS rate is thus the key metric in evaluating the false trigger mitigation models, while the AUC gives us an estimate of how well the model performs overall, irrespective of the operating point. We set the threshold that achieves the target FS rate (0.4%) on the dev set. The performance metric of concern to us is the corresponding FT rate on the eval set.

In all experiments, we adopt an internal ASR decoder with various model configurations. The acoustic model has a Hidden Markov Model (HMM) and Convolutional Neural Network (CNN) hybrid structure [10], which is trained with filter-bank features from US English speech data using cross-entropy and subsequent BMMI objective functions [11]. The CNN comprises 50 layers and uses the scaled exponential linear unit (SELU) activation function to achieve self-normalization during training [12], which achieves state-of-the-art performance.
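The operating-point selection described above — pick the threshold that hits the target FS rate on dev, then report the FT rate on eval at that threshold — can be sketched as follows. The score arrays are hypothetical stand-ins for classifier outputs (label 1 = true trigger, 0 = false trigger; a sample is suppressed when its score falls below the threshold):

```python
def threshold_for_fs(dev_scores, dev_labels, target_fs):
    """Pick a threshold whose false-suppression rate on the dev set
    (fraction of true triggers scored below it) is at most target_fs."""
    tt = sorted(s for s, l in zip(dev_scores, dev_labels) if l == 1)
    k = int(target_fs * len(tt))  # allow at most k true triggers below threshold
    return tt[k]                  # threshold at the k-th smallest TT score

def ft_rate(scores, labels, thresh):
    """Fraction of false triggers that still pass (score >= threshold)."""
    ft = [s for s, l in zip(scores, labels) if l == 0]
    return sum(s >= thresh for s in ft) / len(ft)

dev_scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.4, 0.1]
dev_labels = [1,   1,   1,   1,   1,   0,   0]
t = threshold_for_fs(dev_scores, dev_labels, target_fs=0.2)  # 20% for this toy set

eval_scores = [0.95, 0.5, 0.3, 0.65]
eval_labels = [1,    0,   0,   0]
print(t, ft_rate(eval_scores, eval_labels, t))
```

At a realistic target of 0.4% the same logic applies, just with far more dev samples per class.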
Table 1: FTM dataset for model training and evaluation (true trigger and false trigger counts in the train, cv, dev, and eval sets; train and cv counts include the 3x augmentation)
Figure 2: Schematic diagram of the Mixture of Experts model

The baseline language model in the decoder is a 4-gram model interpolated from multiple sub-LMs trained from different data sources that are relevant to the far-field application (in-domain). The data sources include enumerations of various usage domains, the re-decoding transcripts of live usage, and accumulated error corrections from the users. All the sub-LMs share a word lexicon with a vocabulary size of around . K. The final interpolated LM is pruned to contain about . M trigrams and . M bigrams. We refer to this as the BaseLM.

In order to capture out-of-domain usage, we consider the following data sources to train a complementary language model called ChatterLM. The first data source is the automatic transcriptions of the dictation application; the second source is from the voice search application. The language usage styles of these two applications are different from that of the assistant application in our current study. The third source is artificial data generated from enumerations of extra use cases that are not relevant to the specific device under study. The ChatterLM is built in the same way as the BaseLM in production, then combined with the baseline AM for the ASR decoder to use.

With the two sets of ASR models, we generate decoding lattices on the FTM train and cv sets, then build two separate Bi-LRNN classifiers from the lattice features. To compare the accuracy of the two classifiers, we plot their DET curves on the eval set in Figure 3 (we plot only the region of interest of the DET curve, i.e., low FS rates). At operating points around the fixed FS rate of 0.4%, the ChatterLM-based classifier achieves a lower FT rate than the baseline classifier based on BaseLM. The relative reduction of the FT rate is . % (FT reduces from . % for the BaseLM-based Bi-LRNN to . % for the ChatterLM-based Bi-LRNN). Such a significant FT rate reduction clearly indicates that an LM trained from out-of-domain data sources is more capable of detecting false triggers than an LM trained from in-domain data sources.
Figure 3: DET curves of the BaseLM and ChatterLM Bi-LRNN models

We choose the two Bi-LRNN classifiers, which use lattices from the BaseLM and the ChatterLM respectively, to further analyze the error patterns. The idea is that if the models make mistakes on different samples, then ensembling them would provide complementary information and thus improve the overall performance. We compute the matrix showing the number of samples for which the BaseLM Bi-LRNN and the ChatterLM Bi-LRNN got the predictions correct and incorrect (see Table 2). Both models achieve very high accuracy on the True Trigger (TT) class, with BaseLM being the more accurate of the two. For the False Trigger (FT) class, both models are less accurate, and ChatterLM is more accurate than BaseLM (only . % of samples where the BaseLM Bi-LRNN is correct and the ChatterLM Bi-LRNN is wrong, cf. . % of samples where the BaseLM Bi-LRNN is wrong and the ChatterLM Bi-LRNN is correct). These results align with our expectations: the ChatterLM Bi-LRNN model is expected to be more accurate on unintended speech samples, since it uses an LM trained on out-of-domain data, while the BaseLM Bi-LRNN uses an LM trained primarily on in-domain data. Thus, the models are stronger in different sample spaces, and should be able to complement each other when used together in an ensemble model.
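The per-class agreement matrix used in this analysis can be computed as below. The prediction and label arrays are hypothetical stand-ins for the two classifiers' outputs, chosen so that each model is right on samples the other misses:

```python
from collections import Counter

def agreement_matrix(preds_a, preds_b, labels, target_label):
    """Within one class, count how often classifiers A and B are each
    correct/wrong; keys are pairs like ('A correct', 'B wrong')."""
    counts = Counter()
    for pa, pb, y in zip(preds_a, preds_b, labels):
        if y != target_label:
            continue
        ka = "A correct" if pa == y else "A wrong"
        kb = "B correct" if pb == y else "B wrong"
        counts[(ka, kb)] += 1
    return counts

# Toy predictions: A = BaseLM Bi-LRNN, B = ChatterLM Bi-LRNN; 0 = FT, 1 = TT.
labels  = [0, 0, 0, 0, 1, 1]
preds_a = [0, 1, 1, 0, 1, 1]   # A is right on two of the four FT samples
preds_b = [0, 0, 1, 1, 1, 1]   # B is right on two FT samples, different ones
m = agreement_matrix(preds_a, preds_b, labels, target_label=0)
print(m[("A correct", "B wrong")], m[("A wrong", "B correct")])  # 1 1
```

Large counts in the off-diagonal cells (one model right, the other wrong) are exactly the signal that an ensemble of the two can help.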
Assuming we have two ASR models available, one comprising the LM trained on in-domain data (BaseLM) and the other comprising the LM trained on out-of-domain data (ChatterLM), we can leverage the two complementary lattices by training parallel Bi-LRNN classifiers. We implement the different ensembling methods proposed in Section 2.3 to compare their performance. Figure 4 shows the DET curves (restricted to the region of interest) of the different parallel Bi-LRNN models, along with that of the single ChatterLM-based Bi-LRNN. Table 3 shows the FT rates of these classifiers at the fixed FS rate of 0.4%, and the area under the DET curve.

                      ChatterLM Correct   ChatterLM Wrong
TT   BaseLM Correct        . %                 . %
     BaseLM Wrong          . %                 . %
FT   BaseLM Correct        . %                 . %
     BaseLM Wrong          . %                 . %

Table 2: Error analysis of True (TT) and False (FT) Triggers

Classifier                                                             FT at FS = 0.4%   AUC
BaseLM-based Bi-LRNN                                                        . %           0.
ChatterLM-based Bi-LRNN                                                     . %           0.
Classifier on merged scores                                                 . %           0.
Classifier on merged embedding vectors                                      . %           0.
Fully trained parallel Bi-LRNN (random initialization)                      . %           0.
Fully trained parallel Bi-LRNN (initialized with pre-trained weights)       . %           0.
Mixture of Experts                                                          . %           0.

Table 3: False Trigger rates for different models

Figure 4: DET curves of ChatterLM and ensembles of parallel Bi-LRNNs trained on BaseLM and ChatterLM

Fully trained parallel Bi-LRNNs achieve a better FT rate than the ChatterLM-based single Bi-LRNN classifier, while the classifiers trained on merged scores or embeddings, and the Mixture of Experts model, perform better than the BaseLM-based Bi-LRNN classifier but worse than the ChatterLM-based classifier alone. The best performance is achieved by the classifier trained by fully back-propagating the loss to the parallel Bi-LRNNs: a . % relative reduction in FT rate over the ChatterLM-based Bi-LRNN baseline. Initializing the Bi-LRNNs with the individually pre-trained Bi-LRNNs gives almost identical results to random initialization (red and cyan curves in Figure 4); at the operating point (FS = 0.4%), fine-tuning the pre-trained Bi-LRNNs is slightly worse than training from random initialization, although the former has a marginally lower AUC. The improvement made by the parallel Bi-LRNN model over the single ChatterLM-based Bi-LRNN is consistently significant in our region of interest, i.e., for FS rates below . %.
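The “region of interest” comparison can be made precise with a partial AUC: integrate the DET curve (FT rate as a function of FS rate) only up to the FS bound. A minimal trapezoidal sketch over hypothetical (FS, FT) operating points, not the paper's measured curves:

```python
def partial_auc(points, fs_max):
    """Trapezoidal area under a DET curve (FT vs. FS) for FS <= fs_max.
    `points` is a list of (fs, ft) pairs sorted by increasing fs."""
    pts = [(fs, ft) for fs, ft in points if fs <= fs_max]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical operating points for two classifiers: (FS rate, FT rate).
curve_a = [(0.000, 0.30), (0.002, 0.20), (0.004, 0.12), (0.010, 0.05)]
curve_b = [(0.000, 0.25), (0.002, 0.15), (0.004, 0.10), (0.010, 0.04)]

# Lower partial AUC in the low-FS region means a better model there.
print(partial_auc(curve_a, 0.004) > partial_auc(curve_b, 0.004))  # True
```

Restricting the integral this way keeps the comparison focused on the operating region that matters for the user experience, rather than averaging over FS rates the product would never run at.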
4. Conclusions
We proposed a novel extension to the ASR-lattice-based false trigger mitigation approach by introducing a complementary LM into the decoding process. The LM is trained from out-of-domain data sources and provides information complementary to the original LM optimized for in-domain ASR accuracy. We demonstrated that a Bi-LRNN classifier built from the lattices generated by the complementary LM significantly outperforms the classifier built from the baseline ASR model set. With this single ChatterLM Bi-LRNN, we achieved a . % relative reduction of the FT rate at the fixed 0.4% FS level compared to the current production FTM model. Furthermore, we proposed a novel parallel Bi-LRNN approach, and examined multiple ways to implement and train the classifier. By back-propagating the training loss fully to the parallel Bi-LRNN network, we saw a further . % relative reduction of the FT rate. These results indicate that there is room for improving the traditional ASR decoder in the FTM task, and encourage us to reconsider architecture designs that can enable parallel LM decoding and parallel Bi-LRNN computation.
5. References

[1] S. Sigtia, R. Haynes, H. Richards, E. Marchi, and J. Bridle, “Efficient Voice Trigger Detection for Low Resource Hardware,” in Proc. Interspeech, Sept. 2018, pp. 2092–2096. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-2204
[2] M. Wu, S. Panchapagesan, M. Sun, J. Gu, R. Thomas, S. N. Prasad Vitaladevuni, B. Hoffmeister, and A. Mandal, “Monophone-Based Background Modeling for Two-Stage On-Device Wake Word Detection,” 2018, pp. 5494–5498.
[3] A. H. Michaely, X. Zhang, G. Simko, C. Parada, and P. Aleksic, “Keyword Spotting for Google Assistant Using Contextual Speech Recognition,” 2017, pp. 272–278.
[4] S. Mallidi, R. Maas, K. Goehner, A. Rastrow, S. Matsoukas, and B. Hoffmeister, “Device-directed Utterance Detection,” in Proc. Interspeech, Sept. 2018, pp. 1225–1228.
[5] C.-W. Huang, R. Maas, S. H. Mallidi, and B. Hoffmeister, “A Study for Improving Device-Directed Speech Detection Toward Frictionless Human-Machine Interaction,” in Proc. Interspeech, Sept. 2019, pp. 3342–3346. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-2840
[6] F. Ladhak, A. Gandhe, M. Dreyer, L. Mathias, A. Rastrow, and B. Hoffmeister, “LatticeRnn: Recurrent Neural Networks Over Lattices,” in Proc. Interspeech, Sept. 2016, pp. 695–699. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2016-1583
[7] W. Jeon, L. Liu, and H. Mason, “Voice Trigger Detection from LVCSR Hypothesis Lattices Using Bidirectional Lattice Recurrent Neural Networks,” May 2019, pp. 6356–6360.
[8] P. Dighe, S. Adya, N. Li, S. Vishnubhotla, D. Naik, A. Sagar, Y. Ma, S. Pulman, and J. Williams, “Lattice-Based Improvements for Voice Triggering Using Graph Neural Networks,” in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7459–7463.
[9] R. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive Mixtures of Local Experts,” Neural Computation, February 1991.
[10] Z. Huang, T. Ng, L. Liu, H. Mason, X. Zhuang, and D. Liu, “SNDCNN: Self-Normalizing Deep CNNs with Scaled Exponential Linear Units for Speech Recognition,” in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6854–6858.
[11] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence Discriminative Training of Deep Neural Networks,” in Proc. Interspeech, Aug. 2013.
[12] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-Normalizing Neural Networks,” in