Complementary Language Model and Parallel Bi-LRNN for False Trigger Mitigation
Rishika Agarwal, Xiaochuan Niu, Pranay Dighe, Srikanth Vishnubhotla, Sameer Badaskar, Devang Naik
Apple, One Apple Park Way, Cupertino, CA, USA
{rishika_agarwal, xniu, pdighe, svishnubhotla, badaskar, naik.d}@apple.com

Abstract
False triggers in voice assistants are unintended invocations of the assistant, which not only degrade the user experience but may also compromise privacy. False trigger mitigation (FTM) is the process of detecting false trigger events and responding appropriately to the user. In this paper, we propose a novel solution to the FTM problem by introducing a parallel ASR decoding process with a special language model trained from “out-of-domain” data sources. Such a language model is complementary to the existing language model optimized for the assistant task. A bidirectional lattice RNN (Bi-LRNN) classifier trained from the lattices generated by the complementary language model shows a . % relative reduction of the false trigger (FT) rate at the fixed false suppression (FS) rate of 0.4% of correct invocations, compared to the current Bi-LRNN model. In addition, we propose to train a parallel Bi-LRNN model based on the decoding lattices from both language models, and examine various ways of implementing it. The resulting model leads to a further reduction of the false trigger rate by . %.

Index Terms: Voice Trigger Detection, False Trigger Mitigation, Lattice RNN, Language Model
1. Introduction
(To appear in Proceedings of InterSpeech 2020.)

Voice trigger detection is a vital part of current voice assistant products. In such systems, one or multiple trigger phrases are defined for users to invoke the device to process voice requests. The design of a trigger detector is often constrained by the limited computation resources and power consumption of the hardware; therefore we often adopt simple DSP and acoustic models [1, 2]. In practice, a trigger detector is usually operated in a low-false-rejection mode in order to allow most acoustic samples to be passed to downstream processes. However, such a design may cause the assistant to (wrongly) respond to unintended acoustic inputs. There are also cases where users accidentally invoke the assistant through a UI element such as a button press or a particular gesture. Such unintended invocations of voice assistants are referred to as “false triggers”. To mitigate the false trigger cases, one can introduce an extra process to determine whether an acoustic sample is intended or not, which is in essence a binary classification problem.

The false trigger mitigation process can make use of both acoustic and linguistic clues from the input sample. When the errors are due to the voice trigger detector, an intuitive approach is to feed the acoustic sample into an ASR system and check for the existence of trigger phrases in the 1-best output [3]. In more general cases, the text output contains the intent information from the user, and can therefore be used as input to the classifier. In [4], the 1-best output is encoded as an LSTM embedding to represent the linguistic feature. It is combined with the LSTM embedding of the acoustic features, and with decoder features including trellis entropy, Viterbi cost, confidence, and average number of arcs, as the final input feature set to the classifier. Considering that the ASR results may contain errors, the decoder features are designed explicitly to capture the ambiguity of the decoding process. A recent follow-up work [5] focuses on improving the acoustic features by incorporating utterance-level representations. It also introduces dialog-type information to help the classifier make better decisions.

To build an intent classifier, the authors of [6] propose a condensed representation of lattices from the ASR decoder, called “Lattice RNN” (LRNN). By introducing a pooling operation over the incoming arcs of each node in the lattice, and a propagation operation over the outgoing arcs of the nodes, the authors are able to construct a neural network on a lattice and encode the whole lattice information as the vector output from the final node of the lattice. The LRNN embedding is used as the input vector of the intent classifier, which achieves better accuracy and faster run-time compared to the baseline model running on N-best results. A similar approach can be applied to the FTM task. Our previous work [7] redefined the feature set attached to each arc in the decoder lattice, and extended the network to a bidirectional one (Bi-LRNN). The decoding lattice is encoded as the concatenation of hidden layers from the start and end nodes of the lattice. A classifier built on top of the Bi-LRNN is thus able to mitigate the false trigger cases significantly. A recent work [8] explored the use of graph neural networks (GNN) to encode the decoding lattice, which achieves accuracy similar to the Bi-LRNN representation with more efficient training.

In this paper, we investigate the impact of the decoder’s language model (LM) on false trigger mitigation. Considering that the voice assistant’s LM is usually well trained with in-domain data, and that the LM also tends to see more usage data with the trigger phrase at the beginning, it is likely that the LM is biased towards the in-domain data, which thereby biases it towards detecting the trigger phrase. This bias may reduce the power of the decoding lattice in mitigating false triggers. In our study, we train a new LM that is not biased towards the trigger phrase and in-domain data. We compare the mitigation performance between the Bi-LRNN classifiers built from the lattice outputs of the different LMs. We further investigate how to make use of the complementary information in the two language models, and propose approaches to build a parallel Bi-LRNN, which leads to further improvement in false trigger mitigation.
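The simplest FTM heuristic mentioned above — checking the ASR 1-best hypothesis for the trigger phrase [3] — can be sketched in a few lines. This is a minimal illustration, not the production logic; the trigger phrase and the sample hypotheses are assumptions for the example:

```python
def likely_intended(one_best: str, trigger_phrase: str = "hey assistant") -> bool:
    """Crude FTM heuristic: accept the invocation only if the ASR 1-best
    hypothesis starts with the trigger phrase (after lowercasing)."""
    words = one_best.lower().split()
    trigger = trigger_phrase.lower().split()
    return words[: len(trigger)] == trigger

# A sample beginning with the trigger phrase is treated as intended,
# while background speech without it is suppressed.
print(likely_intended("hey assistant what time is it"))   # True
print(likely_intended("and then he said something funny"))  # False
```

The limitation, as the text notes, is that this uses only the single 1-best string; the lattice-based methods below exploit the full decoder ambiguity instead.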
2. Method
In our baseline Bi-LRNN system, we obtain the word hypothesis lattice L for an acoustic sample X from the ASR decoder. The lattice consists of a start node, an end node, and other intermediate nodes. The nodes are connected via arcs, and each arc has a feature vector associated with it. The Bi-LRNN computes a forward and a backward latent embedding for each node in the lattice (refer to [7] for more details). The final outputs of the Bi-LRNN are the forward latent embedding of the end node, h_f(s_end), and the backward latent embedding of the start node, h_b(s_start), where s_start and s_end denote the lattice’s start and end nodes. A feed-forward classifier then takes [h_f(s_end), h_b(s_start)] as input. The classifier gives a real-valued output y ∈ [0, 1], which is converted to a label l_pred ∈ {0, 1} by choosing a threshold t. The threshold can be kept fixed at a certain value, or can be evaluated empirically on the cross-validation set to achieve the desired False Suppression (FS) rate of invocations.

A typical ASR decoding process can be formulated as searching for the best word sequence W* that maximizes (1), where P(X | W) denotes the acoustic model (AM), representing the conditional probability of acoustic features X given a word sequence W, and P(W) denotes the language model (LM), representing the probability of any word sequence W. Ideally, the LM of an ASR system should approximate the distribution of all the word sequences that could reach the decoder. However, in practice, the voice assistant application is only designed to respond to a relevant set of user requests, so the LM is usually trained to maximize the likelihood of in-domain sentences. If we refer to the in-domain sentences as a class L_D, and out-of-domain sentences as a class L_O, the ASR LM trained from in-domain data can be explicitly represented as P(W | L_D) in (2), with P(L_D) denoting the prior probability of in-domain usage.
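The node-wise recursion underlying the Bi-LRNN can be illustrated on a toy lattice. This is only a sketch of the propagate-and-pool pattern: the scalar arc features, the mean pooling, and the trivial additive update are illustrative assumptions, not the actual learned architecture of [7]:

```python
# Toy bidirectional lattice recursion: each node pools contributions over
# its incoming arcs (forward pass) or outgoing arcs (backward pass).
# Node states and arc features are scalars purely for brevity.

def run_pass(order, incoming):
    """order: nodes in processing order; incoming[n]: (predecessor, arc_feature)
    pairs for node n in this direction. Returns a dict of node states."""
    h = {order[0]: 0.0}  # the first node in this direction starts at 0
    for node in order[1:]:
        # mean-pool the contributions of all arcs entering this node
        contribs = [h[src] + feat for (src, feat) in incoming[node]]
        h[node] = sum(contribs) / len(contribs)
    return h

# Lattice: node 0 is the start, node 3 is the end; arcs carry a feature value.
fwd_in = {1: [(0, 0.5)], 2: [(0, 0.1)], 3: [(1, 0.3), (2, 0.7)]}
bwd_in = {2: [(3, 0.7)], 1: [(3, 0.3)], 0: [(1, 0.5), (2, 0.1)]}

h_f = run_pass([0, 1, 2, 3], fwd_in)   # forward: start -> end
h_b = run_pass([3, 2, 1, 0], bwd_in)   # backward: end -> start

# The classifier input is the pair (forward state at the end node,
# backward state at the start node).
print(h_f[3], h_b[0])
```

A real Bi-LRNN replaces the additive update with learned transforms and vector states, but the traversal order and the pooling over arcs are the same idea.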
W* = argmax_W { P(X | W) P(W) }    (1)
   = argmax_W { P(X | W, L_D) P(W | L_D) P(L_D) }    (2)

To use the ASR decoding information to determine whether a sequence of acoustic features represents an unintended invocation, we can compute the probability of in-domain usage given the acoustic observation, P(L_D | X). This measurement can be expanded as in (4), in which the first factor is the summation of AM and LM probabilities over all sentence hypotheses. An approximation can be made by applying the summation over the resulting lattice paths during decoding, ignoring the low-likelihood word sequences pruned in the process. The Bi-LRNN embedding can be interpreted as an implicit representation of such a measurement with more flexibility and modeling capacity [7].

P(L_D | X) = P(X | L_D) P(L_D) / P(X)    (3)
           = Σ_i { P(X | W_i, L_D) P(W_i | L_D) } P(L_D) / P(X)    (4)

The drawback of the measurement in (4) is that P(W | L_D) only contains in-domain information, so its power to reject falsely triggered samples may be limited. If we have a good estimate of the distribution of out-of-domain sentences with an LM P(W | L_O), we can construct a complementary measurement, P(L_O | X), which in theory should have more power to reject false triggers. Equation (5) implies that we run ASR decoding with a different LM, P(W | L_O), to generate lattices different from the default ones. We can apply the same Bi-LRNN operation on the out-of-domain lattices for more modeling capacity.

P(L_O | X) = Σ_i { P(X | W_i, L_O) P(W_i | L_O) } P(L_O) / P(X)    (5)

Furthermore, we can derive a probability ratio measurement as shown in (6), which adopts the ratio between the in-domain and out-of-domain probabilities given the acoustic observation to balance the suppression/trigger decision. This measurement implies that two ASR decoders can be run in parallel to produce two different lattices from the same acoustic input.
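The lattice-restricted sums in (4) and (5) are typically computed in the log domain with log-sum-exp. A minimal sketch, where the path log-scores (joint AM+LM, log P(X | W_i, L) + log P(W_i | L)) and the class priors are hypothetical values, not real decoder outputs:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Hypothetical joint AM+LM log-scores of the paths kept in each lattice.
in_domain_paths  = [-12.3, -14.1, -15.8]   # paths in the in-domain lattice
out_domain_paths = [-11.9, -13.0]          # paths in the out-of-domain lattice
log_prior_d = math.log(0.9)                # assumed prior P(L_D)
log_prior_o = math.log(0.1)                # assumed prior P(L_O)

# Lattice approximations of the numerators of (4) and (5). The shared P(X)
# cancels when the two are compared, so their difference is the log of the
# in-domain vs. out-of-domain ratio.
log_num_d = logsumexp(in_domain_paths) + log_prior_d
log_num_o = logsumexp(out_domain_paths) + log_prior_o
log_ratio = log_num_d - log_num_o
print(log_ratio > 0)  # True here: the in-domain class wins -> keep the invocation
```

The Bi-LRNN replaces this explicit scoring with a learned function of the same lattice evidence, as described in the text.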
By combining the information from both lattices, we may be able to achieve better discriminative capacity between the two classes. Once more, this measurement can be generalized by training two Bi-LRNNs from the lattices of the two decoders; the network can then learn a more complex relationship between the two lattices when the target cost function is set to minimize the classification errors.

P(L_D | X) / P(L_O | X) = [ Σ_i { P(X | W_i, L_D) P(W_i | L_D) } P(L_D) ] / [ Σ_j { P(X | W_j, L_O) P(W_j | L_O) } P(L_O) ]    (6)

In the error analysis (Section 3.4), we show that the base model is more accurate on some examples and the out-of-domain model is better on others, depending on the true label of the example. Thus, the language models likely represent complementary information, and a model comprising both LMs could outperform the individual models based on either LM alone. To achieve this, the outputs from the two Bi-LRNN models can be combined in different ways before being passed to the classifier. We explore the following ensembling techniques, and compare the FT rates achieved by each of them in Section 3.5:

• Combine scores from the pre-trained Bi-LRNNs: we take the prediction scores y_1 and y_2 from the two Bi-LRNNs trained separately, and pass them to a shallow classifier (only the classifier layers are trained).

• Combine the Bi-LRNN embeddings from the pre-trained Bi-LRNNs: we take the latent Bi-LRNN embeddings h_f1, h_b1, h_f2, h_b2 from the pre-trained Bi-LRNNs and pass them to a classifier (here again, only the classifier is trained).

• Train the Bi-LRNNs in parallel, by back-propagating the classifier loss: the setting is the same as the previous case, but we back-propagate the classifier loss to both Bi-LRNNs as well. Thus, the entire model is trained end-to-end (from scratch, or by loading the weights of the trained Bi-LRNNs and fine-tuning them). The schematic of the model is shown in Figure 1.

• Mixture of Experts: instead of concatenating the embeddings of the two Bi-LRNNs, we can pass their weighted sum to the classifier. A Mixture of Experts model [9] computes the relative importance of each “expert” (in this case, the two Bi-LRNNs are the “experts”), and weighs the outputs of the models by a parameter α. The weight α determines the reliability of each Bi-LRNN for an input sample, and we pass the weighted sum of the lattice embeddings, [α h_f1 + (1 − α) h_f2, α h_b1 + (1 − α) h_b2], to the classifier. The model is trained end-to-end. The schematic of the model is shown in Figure 2.

Figure 1: Schematic diagram of the parallel Bi-LRNN model
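The α-gated combination in the Mixture of Experts variant can be sketched as follows. The embedding dimensionality and the fixed gate value are illustrative assumptions; in the actual model, α is produced per sample by a trained gating component:

```python
def gate_combine(h_f1, h_b1, h_f2, h_b2, alpha):
    """Blend the two Bi-LRNN embedding pairs with a shared gate alpha,
    then concatenate the blended forward and backward parts:
    [alpha*h_f1 + (1-alpha)*h_f2, alpha*h_b1 + (1-alpha)*h_b2]."""
    fwd = [alpha * a + (1 - alpha) * b for a, b in zip(h_f1, h_f2)]
    bwd = [alpha * a + (1 - alpha) * b for a, b in zip(h_b1, h_b2)]
    return fwd + bwd  # input vector for the feed-forward classifier

# Toy 2-dimensional embeddings from the two Bi-LRNNs.
combined = gate_combine([1.0, 0.0], [0.0, 1.0],
                        [0.0, 2.0], [2.0, 0.0], alpha=0.75)
print(combined)  # [0.75, 0.5, 0.5, 0.75]
```

Note that, unlike plain concatenation, the gated sum keeps the classifier input at the dimensionality of a single Bi-LRNN's output.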
3. Experiment and results
All our experiments are performed on an FTM dataset [8], which is composed of far-field usage samples with manual labels for the “true trigger” (TT) and “false trigger” (FT) classes. The raw audio data are split into train, cv, dev, and eval sets for the purposes of training, cross-validation, development, and evaluation. The train and cv sets are augmented by adding gain, noise, and speed perturbations, which increases the amount of training data by 3x. Table 1 summarizes the amount of data in each set and condition.

We train FTM classifiers for multiple epochs on the train set. The training epoch which achieves the lowest FT rate on the cv set is evaluated on the dev and eval sets. We expect our voice assistant to have minimal false triggers and maximal true positives (minimal FS) for a good user experience. We thus focus on the low-FS regime in our DET curves; the lower the AUC (Area Under Curve) of the DET curve, the better the model. In our experiments, we arbitrarily choose a small FS rate of 0.4% to act as the operating point. The False Trigger (FT) rate at this FS rate is thus the key metric in evaluating the false trigger mitigation models, while the AUC gives us an estimate of how well the model performs overall, irrespective of the operating point. We set the threshold that achieves the target FS rate (0.4%) on the dev set. The performance metric of concern to us is the corresponding FT rate on the eval set.

In all experiments, we adopt an internal ASR decoder with various model configurations. The acoustic model has a Hidden Markov Model (HMM) and Convolutional Neural Network (CNN) hybrid structure [10], which is trained with filter-bank features from US English speech data using cross-entropy and subsequent BMMI objective functions [11]. The CNN comprises 50 layers and uses the scaled exponential linear unit (SELU) activation function to achieve self-normalization during training [12], which achieves state-of-the-art performance.
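The operating-point selection described above — pick the threshold that hits the target FS rate on dev, then report the FT rate on eval at that threshold — can be sketched as follows. The score arrays are hypothetical stand-ins for classifier outputs (label 1 = true trigger, 0 = false trigger; a sample is suppressed when its score falls below the threshold):

```python
def threshold_for_fs(dev_scores, dev_labels, target_fs):
    """Pick a threshold whose false-suppression rate on the dev set
    (fraction of true triggers scored below it) is at most target_fs."""
    tt = sorted(s for s, l in zip(dev_scores, dev_labels) if l == 1)
    k = int(target_fs * len(tt))  # allow at most k true triggers below threshold
    return tt[k]                  # threshold at the k-th smallest TT score

def ft_rate(scores, labels, thresh):
    """Fraction of false triggers that still pass (score >= threshold)."""
    ft = [s for s, l in zip(scores, labels) if l == 0]
    return sum(s >= thresh for s in ft) / len(ft)

dev_scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.4, 0.1]
dev_labels = [1,   1,   1,   1,   1,   0,   0]
t = threshold_for_fs(dev_scores, dev_labels, target_fs=0.2)  # 20% for this toy set

eval_scores = [0.95, 0.5, 0.3, 0.65]
eval_labels = [1,    0,   0,   0]
print(t, ft_rate(eval_scores, eval_labels, t))
```

At a realistic target of 0.4% the same logic applies, just with far more dev samples per class.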
Table 1: FTM dataset for model training and evaluation (true trigger and false trigger counts in the train, cv, dev, and eval sets; train and cv counts include the 3x augmentation)
Figure 2: Schematic diagram of the Mixture of Experts model

The baseline language model in the decoder is a 4-gram model interpolated from multiple sub-LMs trained from different data sources that are relevant to the far-field application (in-domain). The data sources include enumerations of various usage domains, the re-decoding transcripts of live usage, and accumulated error corrections from the users. All the sub-LMs share a word lexicon with a vocabulary size of around . K. The final interpolated LM is pruned to contain about . M trigrams and . M bigrams. We refer to this as the BaseLM.

In order to capture out-of-domain usage, we consider the following data sources to train a complementary language model called ChatterLM. The first data source is the automatic transcriptions of the dictation application; the second source is from the voice search application. The language usage styles of these two applications are different from that of the assistant application in our current study. The third source is artificial data generated from enumerations of extra use cases that are not relevant to the specific device under study. The ChatterLM is built in the same way as the BaseLM in production, then combined with the baseline AM for the ASR decoder to use.

With the two sets of ASR models, we generate decoding lattices on the FTM train and cv sets, then build two separate Bi-LRNN classifiers from the lattice features. To compare the accuracy of the two classifiers, we plot their DET curves on the eval set in Figure 3 (we plot only the region of interest of the DET curve, i.e., low FS rates). At operating points around the fixed FS rate of 0.4%, the ChatterLM-based classifier achieves a lower FT rate than the baseline classifier based on BaseLM. The relative reduction of the FT rate is . % (FT reduces from . % for the BaseLM-based Bi-LRNN to . % for the ChatterLM-based Bi-LRNN). Such a significant FT rate reduction clearly indicates that an LM trained from out-of-domain data sources is more capable of detecting false triggers than an LM trained from in-domain data sources.
Figure 3: DET curves of the BaseLM and ChatterLM Bi-LRNN models

We choose the two Bi-LRNN classifiers, which use lattices from the BaseLM and the ChatterLM respectively, to further analyze the error patterns. The idea is that if the models make mistakes on different samples, then ensembling them would provide complementary information and thus improve the overall performance. We compute the matrix showing the number of samples for which the BaseLM Bi-LRNN and the ChatterLM Bi-LRNN got the predictions correct and incorrect (see Table 2). Both models achieve very high accuracy on the True Trigger (TT) class, with BaseLM being the more accurate of the two. For the False Trigger (FT) class, both models are less accurate, and ChatterLM is more accurate than BaseLM (only . % of samples where the BaseLM Bi-LRNN is correct and the ChatterLM Bi-LRNN is wrong, cf. . % of samples where the BaseLM Bi-LRNN is wrong and the ChatterLM Bi-LRNN is correct). These results align with our expectations: the ChatterLM Bi-LRNN model is expected to be more accurate on unintended speech samples, since it uses an LM trained on out-of-domain data, while the BaseLM Bi-LRNN uses an LM trained primarily on in-domain data. Thus, the models are stronger in different sample spaces, and should be able to complement each other when used together in an ensemble model.
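The per-class agreement matrix used in this analysis can be computed as below. The prediction and label arrays are hypothetical stand-ins for the two classifiers' outputs, chosen so that each model is right on samples the other misses:

```python
from collections import Counter

def agreement_matrix(preds_a, preds_b, labels, target_label):
    """Within one class, count how often classifiers A and B are each
    correct/wrong; keys are pairs like ('A correct', 'B wrong')."""
    counts = Counter()
    for pa, pb, y in zip(preds_a, preds_b, labels):
        if y != target_label:
            continue
        ka = "A correct" if pa == y else "A wrong"
        kb = "B correct" if pb == y else "B wrong"
        counts[(ka, kb)] += 1
    return counts

# Toy predictions: A = BaseLM Bi-LRNN, B = ChatterLM Bi-LRNN; 0 = FT, 1 = TT.
labels  = [0, 0, 0, 0, 1, 1]
preds_a = [0, 1, 1, 0, 1, 1]   # A is right on two of the four FT samples
preds_b = [0, 0, 1, 1, 1, 1]   # B is right on two FT samples, different ones
m = agreement_matrix(preds_a, preds_b, labels, target_label=0)
print(m[("A correct", "B wrong")], m[("A wrong", "B correct")])  # 1 1
```

Large counts in the off-diagonal cells (one model right, the other wrong) are exactly the signal that an ensemble of the two can help.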
Assuming we have two ASR models available, one comprising the LM trained on in-domain data (BaseLM) and the other comprising the LM trained on out-of-domain data (ChatterLM), we can leverage the two complementary lattices by training parallel Bi-LRNN classifiers. We implement the different ensembling methods proposed in Section 2.3 to compare their performance. Figure 4 shows the DET curves (restricted to the region of interest) of the different parallel Bi-LRNN models, along with that of the single ChatterLM-based Bi-LRNN. Table 3 shows the FT rates of these classifiers at the fixed FS rate of 0.4%, and the area under the DET curve.

                      ChatterLM Correct   ChatterLM Wrong
TT   BaseLM Correct        . %                 . %
     BaseLM Wrong          . %                 . %
FT   BaseLM Correct        . %                 . %
     BaseLM Wrong          . %                 . %

Table 2: Error analysis of True (TT) and False (FT) Triggers

Classifier                                                             FT at FS = 0.4%   AUC
BaseLM-based Bi-LRNN                                                        . %           0.
ChatterLM-based Bi-LRNN                                                     . %           0.
Classifier on merged scores                                                 . %           0.
Classifier on merged embedding vectors                                      . %           0.
Fully trained parallel Bi-LRNN (random initialization)                      . %           0.
Fully trained parallel Bi-LRNN (initialized with pre-trained weights)       . %           0.
Mixture of Experts                                                          . %           0.

Table 3: False Trigger rates for different models

Figure 4: DET curves of ChatterLM and ensembles of parallel Bi-LRNNs trained on BaseLM and ChatterLM

Fully trained parallel Bi-LRNNs achieve a better FT rate than the ChatterLM-based single Bi-LRNN classifier, while the classifiers trained on merged scores or embeddings, and the Mixture of Experts model, perform better than the BaseLM-based Bi-LRNN classifier but worse than the ChatterLM-based classifier alone. The best performance is achieved by the classifier trained by fully back-propagating the loss to the parallel Bi-LRNNs: a . % relative reduction in FT rate over the ChatterLM-based Bi-LRNN baseline. Initializing the Bi-LRNNs with the individually pre-trained Bi-LRNNs gives almost identical results to random initialization (red and cyan curves in Figure 4); at the operating point (FS = 0.4%), fine-tuning the pre-trained Bi-LRNNs is slightly worse than training from random initialization, although the former has a marginally lower AUC. The improvement made by the parallel Bi-LRNN model over the single ChatterLM-based Bi-LRNN is consistently significant in our region of interest, i.e., for FS rates below . %.
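The “region of interest” comparison can be made precise with a partial AUC: integrate the DET curve (FT rate as a function of FS rate) only up to the FS bound. A minimal trapezoidal sketch over hypothetical (FS, FT) operating points, not the paper's measured curves:

```python
def partial_auc(points, fs_max):
    """Trapezoidal area under a DET curve (FT vs. FS) for FS <= fs_max.
    `points` is a list of (fs, ft) pairs sorted by increasing fs."""
    pts = [(fs, ft) for fs, ft in points if fs <= fs_max]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical operating points for two classifiers: (FS rate, FT rate).
curve_a = [(0.000, 0.30), (0.002, 0.20), (0.004, 0.12), (0.010, 0.05)]
curve_b = [(0.000, 0.25), (0.002, 0.15), (0.004, 0.10), (0.010, 0.04)]

# Lower partial AUC in the low-FS region means a better model there.
print(partial_auc(curve_a, 0.004) > partial_auc(curve_b, 0.004))  # True
```

Restricting the integral this way keeps the comparison focused on the operating region that matters for the user experience, rather than averaging over FS rates the product would never run at.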
4. Conclusions
We proposed a novel extension to the ASR-lattice-based false trigger mitigation approach by introducing a complementary LM into the decoding process. The LM is trained from out-of-domain data sources and provides information complementary to the original LM optimized for in-domain ASR accuracy. We demonstrated that a Bi-LRNN classifier built from the lattices generated by the complementary LM significantly outperforms the classifier built from the baseline ASR model set. With this single ChatterLM Bi-LRNN, we achieved a . % relative reduction of the FT rate at the fixed 0.4% FS level compared to the current production FTM model. Furthermore, we proposed a novel parallel Bi-LRNN approach, and examined multiple ways to implement and train the classifier. By back-propagating the training loss fully to the parallel Bi-LRNN network, we saw a further . % relative reduction of the FT rate. These results indicate that there is room for improving the traditional ASR decoder in the FTM task, and encourage us to reconsider architecture designs that can enable parallel LM decoding and parallel Bi-LRNN computation.
5. References

[1] S. Sigtia, R. Haynes, H. Richards, E. Marchi, and J. Bridle, “Efficient Voice Trigger Detection for Low Resource Hardware,” in Proc. Interspeech, Sept. 2018, pp. 2092–2096. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-2204
[2] M. Wu, S. Panchapagesan, M. Sun, J. Gu, R. Thomas, S. N. Prasad Vitaladevuni, B. Hoffmeister, and A. Mandal, “Monophone-Based Background Modeling for Two-Stage On-Device Wake Word Detection,” 2018, pp. 5494–5498.
[3] A. H. Michaely, X. Zhang, G. Simko, C. Parada, and P. Aleksic, “Keyword Spotting for Google Assistant Using Contextual Speech Recognition,” 2017, pp. 272–278.
[4] S. Mallidi, R. Maas, K. Goehner, A. Rastrow, S. Matsoukas, and B. Hoffmeister, “Device-directed Utterance Detection,” in Proc. Interspeech, Sept. 2018, pp. 1225–1228.
[5] C.-W. Huang, R. Maas, S. H. Mallidi, and B. Hoffmeister, “A Study for Improving Device-Directed Speech Detection Toward Frictionless Human-Machine Interaction,” in Proc. Interspeech, Sept. 2019, pp. 3342–3346. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-2840
[6] F. Ladhak, A. Gandhe, M. Dreyer, L. Mathias, A. Rastrow, and B. Hoffmeister, “LatticeRnn: Recurrent Neural Networks Over Lattices,” in Proc. Interspeech, Sept. 2016, pp. 695–699. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2016-1583
[7] W. Jeon, L. Liu, and H. Mason, “Voice Trigger Detection from LVCSR Hypothesis Lattices Using Bidirectional Lattice Recurrent Neural Networks,” May 2019, pp. 6356–6360.
[8] P. Dighe, S. Adya, N. Li, S. Vishnubhotla, D. Naik, A. Sagar, Y. Ma, S. Pulman, and J. Williams, “Lattice-Based Improvements for Voice Triggering Using Graph Neural Networks,” in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7459–7463.
[9] R. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive Mixtures of Local Experts,” Neural Computation, February 1991.
[10] Z. Huang, T. Ng, L. Liu, H. Mason, X. Zhuang, and D. Liu, “SNDCNN: Self-Normalizing Deep CNNs with Scaled Exponential Linear Units for Speech Recognition,” in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6854–6858.
[11] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence Discriminative Training of Deep Neural Networks,” in Proc. Interspeech, Aug. 2013.
[12] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-Normalizing Neural Networks,” in