ERROR-DRIVEN PRUNING OF LANGUAGE MODELS FOR VIRTUAL ASSISTANTS

Sashank Gondala⋆, Lyan Verwimp⋆, Ernest Pusateri, Manos Tsagkias, Christophe Van Gysel

Georgia Institute of Technology, Apple
[email protected], {lverwimp, epusateri, etsagkias, cvangysel}@apple.com

⋆ Equal contribution. Work done while the first author was an intern at Apple.

ABSTRACT
Language models (LMs) for virtual assistants (VAs) are typically trained on large amounts of data, resulting in prohibitively large models which require excessive memory and/or cannot be used to serve user requests in real-time. Entropy pruning results in smaller models but with significant degradation of effectiveness in the tail of the user request distribution. We customize entropy pruning by allowing for a keep list of infrequent n-grams that require a more relaxed pruning threshold, and propose three methods to construct the keep list. Each method has its own advantages and disadvantages with respect to LM size, ASR accuracy and cost of constructing the keep list. Our best LM gives 8% average Word Error Rate (WER) reduction on a targeted test set, but is 3 times larger than the baseline. We also propose discriminative methods to reduce the size of the LM while retaining the majority of the WER gains achieved by the largest LM.
Index Terms — ASR, LM pruning, discriminative, data selection, error prediction
1. INTRODUCTION
VAs are popular services [1] that help users accomplish multiple tasks through voice queries. The Automatic Speech Recognition (ASR) engine, the VA component responsible for converting spoken queries into text, faces a challenge due to the many task domains VAs support. Task domains include performing actions on the device where the VA runs (e.g. placing a call on a cell phone) or querying information about real-world events such as the outcome of a sports competition. Accurately recognizing queries that concern contemporary events is a difficult problem due to the dynamic nature of the world. Hence, within VA systems, the language models (LMs) of the ASR system are typically trained on synthetic queries that are generated from knowledge bases [2], in addition to transcribed user queries. For example, when the artist Kanye West announced his album
Donda: With Child in July 2020, the artificial queries “play Donda With Child” and “what is Donda With Child” are included within LM training data. Likewise, artificial queries corresponding to entities that may occur infrequently within real usage data (i.e., tail entities), but should still be recognized accurately, are also included within LM training data.

As users expect low-latency responses from online services [3], n-gram backoff LMs [4] are frequently used, along with entropy-based pruning [5] to reduce speech recognition runtime and memory consumption. Entropy-based pruning removes n-grams from an LM that have the smallest impact on training set perplexity. This can be problematic if explicitly-observed n-grams from synthetic queries are removed, making them indistinguishable from unobserved n-grams in the training data. However, entropy pruning can be modified to apply different pruning thresholds to a subset of LM n-grams, and hence, we use a more relaxed pruning threshold for specific (tail) n-grams that enhance recognition.

In this paper we investigate how to determine, from textual features, the minimal set of n-grams that require a more relaxed pruning threshold. In particular, we are interested in improving recognition on a set of synthetic queries Q that we know we want our VA to recognize, but that are absent or underrepresented in our live usage data. This problem is challenging due to the large number of synthetic queries, the lack of generalization in n-gram LMs, the dynamicity of live usage data, and VA runtime constraints. Our research questions are:
(RQ1) Are there text-based signals that are a good predictor of speech recognition difficulty? (RQ2) Can we determine an optimal subset of synthetic query n-grams that need a more relaxed pruning threshold without degrading speech recognition effectiveness?

We contribute: (1) a formal framework for applying different entropy-based pruning thresholds on subsets of n-grams, (2) three methods for determining a subset of n-grams for which pruning needs to be relaxed to improve their recognition, and (3) insight into which signals are useful to predict speech recognition difficulty directly from text. We also tried a variety of different smoothing methods [6] to achieve the same goal, including Kneser-Ney smoothing [7, 8]; no significant improvements on our tail- and entity-rich test sets were observed.
2. CUSTOMIZED PRUNING FOR QUERIES
We describe three methods for pruning backoff LMs, all of which build upon entropy-based pruning [5], with the goal of improving recognition on a set of synthetic queries Q. In entropy-based pruning (EP), n-grams that are estimated to increase training data perplexity by less than a threshold, θ, are greedily removed from a backoff n-gram LM. We extend this approach by introducing a keep list of n-grams. The keep list defines a set of n-grams, extracted from the queries in Q, for which a more relaxed threshold, θ_keep, is applied. Note that even though we apply this extension to entropy pruning and a set of synthetic queries, it can in principle be applied to any type of n-gram LM pruning (e.g. [9]) and any dataset for which recognition should be improved. The three methods described below differ only in the way the keep list is generated. Each method provides a different space of trade-offs between keep list generation cost, LM size, and ASR accuracy.
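To make the keep-list extension concrete, the sketch below (our own construction, not the SRILM modification used in the paper) shows the two-threshold pruning decision, assuming that per-n-gram perplexity-impact scores have already been computed from the backoff LM; the function name, n-grams, and impact values are illustrative only.

```python
# Minimal sketch of two-threshold pruning over precomputed per-n-gram scores.
# `impact` is assumed to hold, for each n-gram, the estimated increase in
# training-data perplexity if that n-gram is removed; computing these scores
# requires the full backoff LM and is out of scope here.

def prune_with_keep_list(impact, keep_list, theta, theta_keep=0.0):
    """Return the set of n-grams to remove from the LM.

    impact:     dict mapping an n-gram (tuple of words) to its estimated
                perplexity increase when removed.
    keep_list:  set of n-grams extracted from the synthetic queries Q;
                these use the relaxed threshold theta_keep.
    theta:      default pruning threshold.
    theta_keep: relaxed threshold; 0.0 means keep-list n-grams are never pruned.
    """
    pruned = set()
    for ngram, delta in impact.items():
        threshold = theta_keep if ngram in keep_list else theta
        if delta < threshold:
            pruned.add(ngram)
    return pruned


if __name__ == "__main__":
    # Toy example with made-up impact values.
    impact = {
        ("play", "donda", "with", "child"): 1e-10,
        ("what", "time", "is", "it"): 5e-8,
        ("the", "of"): 2e-9,
    }
    keep = {("play", "donda", "with", "child")}
    print(prune_with_keep_list(impact, keep, theta=6e-9))
    # Only ("the", "of") is pruned: the keep-list n-gram survives despite its
    # tiny impact, and the frequent n-gram exceeds theta.
```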
2.1. Query-driven entropy pruning

In query-driven entropy pruning (QEP), we generate the keep list by extracting n-grams from every query q ∈ Q. Using the full set of synthetic queries does not require extra processing and gives the best ASR accuracy on tail- and entity-rich test sets, but can easily blow up the size of the LM. While effective, QEP is infeasible to apply when Q is large and contains many unique n-grams.

2.2. Error-driven entropy pruning

To reduce the number of n-grams in the keep list, we apply a discriminative approach (e.g., [10, 11, 12]), error-driven entropy pruning
(EEP). Here we only want to exclude from pruning those n-grams for which our baseline ASR, using an entropy-pruned LM, fails. However, large-scale manual transcription of audio is expensive, and we want to optimize ASR accuracy on data that is possibly heavily underrepresented in real user data. Therefore, for every phrase in Q, we generate audio with Text-To-Speech (TTS) and recognize it with our baseline ASR (see [13] for a similar approach). The decoding errors are then used as a source to extract n-grams for the keep list. This TTS–ASR loop is costly both in terms of time and computational resources and thus prohibitive for large datasets.

2.3. Approximate error-driven entropy pruning

The TTS–ASR loop can be avoided if we can predict, based on textual features alone, whether a query q will be recognized incorrectly by the baseline ASR. Approximate error-driven entropy pruning (AEEP) works as follows: we train a binary classifier (see §4.1) on a training set T_Q, which is sampled from the same distribution as Q, that predicts the outcome of the TTS–ASR loop, and more specifically, whether an n-gram will be recognized incorrectly. After training, we use that classifier to select a subset of Q as the keep list.

For every unique query q_i ∈ T_Q in our dataset, we apply the TTS–ASR loop to obtain e_{q_i}, the top hypothesis after recognizing the synthesized audio of phrase q_i. Subsequently, we obtain an alignment (edit distance) between q_i and e_{q_i} and extract n-grams from q_i. The input to our model is an n-gram extracted from q_i, and it is assigned a positive label if the target token in the n-gram (e.g. the target token in a 4-gram is the 4th token) differs from its aligned token in e_{q_i}. Every instance is represented as a real-valued feature vector. We consider five categories of features (a sketch of the alignment-based labeling follows the feature descriptions below).
Word-level features. We compute n-gram statistics, including n-gram count and frequency in T_Q, and whether the n-gram context words and/or target word are out-of-vocabulary.

Language model features. We compute features derived from the baseline entropy-pruned LM, such as the log probability, perplexity and entropy of the full n-gram and of the target word given the context.
Phoneme-level features.
Our VA lexicon is generated based on a list of pre-defined word-pronunciation mappings for frequent words and exceptions, and a grapheme-to-phoneme tool (G2P, see §3) that automatically generates pronunciations for words that are not in that list. For a given n-gram, we generate phonemes for each word individually with our G2P (even when the word occurs in the pre-defined list) and join them together to create the phoneme string and corresponding phoneme n-grams. From these, we extract the following features: the number of phonemes, whether the phoneme string contains infrequent phonemes, whether the word-pronunciation mapping occurs in the pre-defined list, and the edit distance between the phoneme string generated using G2P and the pre-defined pronunciation (if present).
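As a rough illustration of how these phoneme-level features could be assembled, here is a self-contained sketch; the toy G2P dictionary, lexicon, and infrequent-phoneme set are invented stand-ins for the proprietary resources described in Section 3.

```python
# Illustrative sketch of the phoneme-level features; not the production
# VA lexicon or G2P model.

TOY_G2P = {"play": "p l ey", "donda": "d aa n d ah"}   # G2P output per word
TOY_LEXICON = {"play": "p l ey"}                        # pre-defined mappings
INFREQUENT_PHONEMES = {"zh", "oy"}                      # assumed rare set


def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def phoneme_features(ngram):
    """Compute phoneme-level features for a word n-gram."""
    phones = [TOY_G2P.get(w, "").split() for w in ngram]
    flat = [p for ps in phones for p in ps]
    # Edit distance between the G2P pronunciation and the pre-defined one,
    # summed over the words that have a pre-defined pronunciation.
    dist = sum(
        edit_distance(TOY_G2P.get(w, "").split(), TOY_LEXICON[w].split())
        for w in ngram if w in TOY_LEXICON
    )
    return {
        "num_phonemes": len(flat),
        "has_infrequent_phoneme": any(p in INFREQUENT_PHONEMES for p in flat),
        "all_words_in_lexicon": all(w in TOY_LEXICON for w in ngram),
        "g2p_lexicon_edit_distance": dist,
    }


print(phoneme_features(("play", "donda")))
```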
Template features.
Our dataset of synthetic queries consists of a set of templates (e.g. “Who is the position-tag of team-tag?”) where the tags are replaced with entities. We tag every word of the utterance as either a template token or an entity and create two features: whether the target word is an entity and whether any word in the n-gram is an entity.
Phonetic confusion features. For each n-gram, we take the target word, extract its phoneme string and compute the set of phoneme strings with an edit distance of 1. For each phoneme string in that set, we check if it maps to any of the words in the vocabulary and, if it does, we compute the total log probability of the n-gram where the target word is replaced with the new mapped word. We store the highest of these log probabilities as a feature.
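The following sketch (our own construction, using Python's difflib as a stand-in for a proper edit-distance aligner) illustrates how the positive/negative labels described in §2.3 could be derived from a query and its TTS–ASR top hypothesis; the example query and hypothesis are invented, and the real system's aligner, TTS, and ASR components are not shown.

```python
# Minimal sketch of the AEEP label extraction: align a synthetic query with
# the top ASR hypothesis from the TTS-ASR loop and label each extracted
# n-gram positive if its target (last) token was misrecognized.

import difflib


def align_words(reference, hypothesis):
    """Return, for each reference position, the aligned hypothesis token
    (or None for a deletion), using difflib's longest-match alignment."""
    aligned = [None] * len(reference)
    matcher = difflib.SequenceMatcher(a=reference, b=hypothesis)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        for offset in range(i2 - i1):
            hyp_idx = j1 + offset
            aligned[i1 + offset] = hypothesis[hyp_idx] if hyp_idx < j2 else None
    return aligned


def labeled_ngrams(reference, hypothesis, order=4):
    """Yield (ngram, label) pairs; label is 1 if the target token (the last
    token of the n-gram) differs from its aligned hypothesis token."""
    aligned = align_words(reference, hypothesis)
    for end in range(len(reference)):
        start = max(0, end - order + 1)
        ngram = tuple(reference[start:end + 1])
        label = int(reference[end] != aligned[end])
        yield ngram, label


if __name__ == "__main__":
    ref = "who is phil bengtson".split()
    hyp = "who is bill bentson".split()   # hypothetical ASR top hypothesis
    for ngram, label in labeled_ngrams(ref, hyp):
        print(label, ngram)
```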
3. EXPERIMENTAL SETUP
Training LMs. We train our LMs on manually and automatically transcribed anonymized VA requests, and on the dataset of synthetic queries Q, which covers a large set of domains, e.g. sports, music and home automation. The queries consist of a set of templates with slots and a list of entities that can fill those slots (§2.3). Both templates and entities are typically derived from real user data and thus have prior probabilities that are used to sample the synthetic requests. In our experiments, query set Q consists of 2.8B queries (112M unique) spanning 27 domains. The total training data contains more than 10B utterances (we are unable to provide the exact number due to confidentiality; the manually transcribed dataset is a small random sample).
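To make the sampling procedure concrete, a toy sketch follows; the templates, entity lists, and prior probabilities are invented placeholders, and the real generation pipeline is considerably richer.

```python
# Hypothetical sketch of sampling synthetic queries from templates and entity
# lists with prior probabilities; all values below are made up.

import random

TEMPLATES = [
    ("who is the {position} of {team}", 0.7),
    ("play {album}", 0.3),
]
ENTITIES = {
    "position": [("quarterback", 0.6), ("head coach", 0.4)],
    "team": [("the packers", 0.5), ("the bears", 0.5)],
    "album": [("donda with child", 0.2), ("abbey road", 0.8)],
}


def weighted_choice(items):
    values, weights = zip(*items)
    return random.choices(values, weights=weights, k=1)[0]


def sample_query():
    """Sample one synthetic query: pick a template by its prior, then fill
    each slot with an entity drawn from that slot's prior distribution."""
    query = weighted_choice(TEMPLATES)
    while "{" in query:
        slot = query[query.index("{") + 1:query.index("}")]
        query = query.replace("{" + slot + "}", weighted_choice(ENTITIES[slot]), 1)
    return query


random.seed(0)
print([sample_query() for _ in range(3)])
```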
Pruning methods. Query set Q is also used for our three proposed pruning methods to extract the following: all possible n-grams (QEP), a subset of n-grams based on the errors collected through the TTS–ASR loop (EEP), and a subset of n-grams based on the classifier trained on the output of the TTS–ASR loop (AEEP). In addition, we compare against regular entropy pruning (EP), where Q is not taken into account.
Evaluating the AEEP classifier. To assess the utility of AEEP, we train on all domains except one and validate and test on the held-out domain. We consider only one held-out domain, sports, because it is the only domain for which we have domain-specific test sets with both manually transcribed user requests and synthesized requests.
ASR accuracy evaluation.
After applying the pruning methods outlined above, we evaluate WER on three main test sets: (1) General VA contains VA requests sampled from the actual usage distribution, and thus consists mostly of frequent (head) utterances. (2) Sports is a sample of less frequent VA requests from the sports domain. (3) TTS-All consists of synthesized requests; it is sampled from the same templates as Q but is a different instance. Using the prior probabilities, we can distinguish between subsets of TTS-All that contain utterances from the head (top 10%), torso (10–50%) and tail (50–100%) of the distribution. TTS-Sports is the subset of TTS-All containing all utterances related to sports. Our goal is to improve WER (with a minimum increase in LM size) on the Sports and TTS-All test sets, while not degrading recognition effectiveness on the General VA test set.
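One possible way to realize the head/torso/tail breakdown is to partition queries by cumulative prior probability mass; the paper does not specify the exact procedure, so the sketch below is only an illustration, and the query strings and priors are made up.

```python
# Hedged sketch: split queries into head/torso/tail by cumulative prior mass.

def split_head_torso_tail(queries_with_priors, head=0.10, torso=0.50):
    """queries_with_priors: list of (query, prior) pairs.
    Returns a dict mapping each query to 'head', 'torso' or 'tail' according
    to where it falls in the cumulative (normalized) prior distribution."""
    total = sum(p for _, p in queries_with_priors)
    buckets, cumulative = {}, 0.0
    for query, prior in sorted(queries_with_priors, key=lambda x: -x[1]):
        cumulative += prior / total
        if cumulative <= head:
            buckets[query] = "head"
        elif cumulative <= torso:
            buckets[query] = "torso"
        else:
            buckets[query] = "tail"
    return buckets


queries = [(f"query {i}", prior) for i, prior in
           enumerate([0.10, 0.09, 0.09, 0.08, 0.08, 0.08, 0.08, 0.08,
                      0.08, 0.08, 0.08, 0.08])]
print(split_head_torso_tail(queries))
```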
System description. Our ASR system consists of an acoustic model that is a deep convolutional neural network [14], a 4-gram LM with Good-Turing smoothing in the first pass (see [15, 16] for details), and the same LM interpolated with a Feed-Forward Neural Network (FFNN) LM [17] in the second pass. To build a scalable TTS–ASR loop, we use our previous-generation speech synthesizer, a unit selection system described in [18]. We use scikit-learn [19] to train binary classifiers for AEEP. For combining entropy pruning with a keep list we modified SRILM [20]. Our G2P is an LSTM encoder-decoder architecture with attention (similar to [21]). For all experiments we set θ = 6e− and θ_keep = 0, which means that all n-grams in the keep list are excluded from pruning. We experimented with different values of θ_keep as an alternative approach to reduce the size of the LM, but did not observe good improvements. All experiments reported here are for American English, although for QEP compared to EP we observe similar WER reductions for several other languages/regional variants, e.g., German and Mandarin Chinese.
Fig. 1. Recall@k for the sports domain, plotted against the percentage k of n-grams assigned to the positive class/keep list; the y = x line is shown for reference.
4. RESULTS

4.1. AEEP Classifier
In this section, we report the results of training a binary classifier on the output of the TTS–ASR loop to predict whether an n-gram will be recognized incorrectly. Comparing four different model types, Random Forests (RF), AdaBoost, linear support-vector machines and FFNNs, we found that RFs achieved the best results on the validation data. Hence, all experiments reported below use RFs.

We parameterize our experiments based on the number of n-grams to consider: we take the top-k% of n-grams as ranked by the classifier confidence score and assign them to the positive class that will form the keep list. In Fig. 1, we show recall@k for the sports domain: the fraction of n-grams correctly classified as positive if k% of the ranked data is assigned to the positive class/keep list. We observe good performance, with a recall of 0.65 at 40%. For the downstream task (§4.2), we select this 40% of n-grams as the keep list for AEEP (Sports). For our ablation study, we rank the features according to the importance assigned to them by the RF. We find that by using just the top-3 features to train a new RF, i.e. (1) log probability of the target word given its context, (2) log probability of the full n-gram and (3) largest log probability of the n-gram at an edit distance of 1, we still obtain a recall of 0.63 at 40%. We also use the data selected by this model in §4.2 (AEEP–Top-3).

These results provide an answer to RQ1: yes, there are text-based signals that can predict ASR difficulty. LM and phonetic confusion features are the most important ones.
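A compact sketch of this selection procedure, using scikit-learn as the paper does but with synthetic placeholder features and labels, might look as follows; the hyperparameters and array shapes are ours, not the authors'.

```python
# Hedged sketch of the top-k% keep-list selection: train a Random Forest on
# n-gram feature vectors, rank n-grams by the predicted probability of being
# misrecognized, and keep the top k%. Feature matrices are random placeholders.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 10))       # n-gram feature vectors
y_train = rng.integers(0, 2, size=1000)     # 1 = misrecognized by baseline
X_candidates = rng.normal(size=(500, 10))   # n-grams from the query set Q

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Rank candidate n-grams by classifier confidence and take the top k%.
k = 0.40
scores = clf.predict_proba(X_candidates)[:, 1]
n_keep = int(k * len(scores))
keep_indices = np.argsort(-scores)[:n_keep]   # indices into the keep list

# Feature importances drive the AEEP-Top-3 ablation (retrain on top features).
top3 = np.argsort(-clf.feature_importances_)[:3]
print("keep-list size:", len(keep_indices), "top-3 feature indices:", top3)
```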
4.2. ASR Results

We now evaluate our three customized pruning approaches in our ASR system. Our success criterion is to minimize the size of the LM while improving WER compared to a system using an entropy-pruned LM. In the last row of Table 1, we show the size of each LM in number of n-grams. QEP roughly triples the size of the entropy-pruned LM, while EEP doubles it. The unpruned LM contains 313M n-grams and is too large to use in ASR decoding. In our ASR system, using the larger LMs leads to increased, but still acceptable, memory usage, and we observe negligible impact on decoding speed.

Table 1. WER results (best result per test set in bold) and size (last row) for LMs with various pruning strategies. EP_base is the baseline, EP∼ LMs are entropy-pruned LMs of the same size as the proposed models. AEEP uses all features, and AEEP–Top-3 uses the top-3 features after feature selection. ‘All domains’ and ‘Sports domain’ in the top row refer to the query set used to generate the n-gram set. (Columns: EP_base; all domains: EP∼, QEP, EP∼, EEP; sports domain: EP∼, QEP, EP∼, EEP, AEEP, AEEP–Top-3. Number of n-grams: 22M, 71M, 71M, 45M, 45M, 27M, 27M, 25M, 25M, 25M, 25M. General VA: 49k; baseline WER 4.13. The remaining WER values are discussed in the text.)

Since all our proposed approaches increase the size of the LM, we created additional EP baselines of the same size for each pruning strategy for a fair comparison in terms of LM size (see the EP∼ columns in Table 1). The first row of Table 1 shows that none of the pruning approaches hurts the WER on our regular, head-heavy General VA test set. On the test sets representing the usage that we want to improve on (i.e., tail utterances with many named entities), we see consistent WER reductions for all settings. QEP leads to the largest LMs and the best WER results on TTS-All, with an 8% relative WER reduction compared to the baseline EP LM, while the EP LM of the same size (the EP∼ column to the left of QEP) has only a 4% reduction. On the Sports test set with user queries we also observe a WER reduction of 8%, compared to 5% for EP∼. Looking at the results qualitatively, we observe that the EP LMs do not contain many of the higher-order n-grams that make a difference, e.g. “who is Phil Bengtson” (in TTS-All) is recognized correctly by the QEP LM because it contains all relevant higher-order n-grams, while the EP LMs only contain the unigrams.

The EEP LM, which is significantly smaller, only leads to small WER degradations compared to the QEP LM. We observe a 7% relative WER reduction w.r.t. the baseline on TTS-All, compared to only 3% for the corresponding EP∼ LM.

Finally, approximating the TTS–ASR loop with AEEP gives us WER results that are close to those obtained with the real decoding errors (EEP). The AEEP model using all features described in §2.3 gives a 17% relative WER reduction with respect to the baseline EP LM on the TTS-Sports test set, and 10% on the Sports test set extracted from real user data. AEEP–Top-3, the model using only the top-3 features (§4.1), is on par with the variant using all features. On the Sports test set with real user requests, AEEP does even better than EEP. We hypothesize that EEP is overfitting on the data that is used for the TTS–ASR loop, while AEEP counteracts this problem because it approximates the results of the TTS–ASR loop.

For both the TTS-based error-driven pruning (EEP) and its approximation (AEEP), we select about 40% of the full Sports dataset for the keep list. One could argue that any addition of sports-related n-grams to the LM will improve the WER on sports test sets. Thus, as a sanity check, we randomly select 40% of the sports n-grams and use it as the keep list. The resulting LM gives a WER of 16.27 on the TTS-Sports test set, which is still 0.3 absolute worse than AEEP. We conclude that the 40% selected by EEP and AEEP is a more meaningful selection than a randomly selected 40%.
With respect to RQ2, we conclude that the optimal set of synthetic query n-grams that require a more relaxed pruning threshold can be the full set (§2.1) if there are no memory limits. However, we showed that it is possible to retain the majority of the WER gain with a much smaller LM by selecting n-grams with a model trained on textual features alone.
5. CONCLUSIONS
We explored three methods to customize LM pruning to improve ASR accuracy on infrequent and entity-rich utterances, by constructing a keep list of n-grams that require a more relaxed pruning threshold. QEP results in LMs that are three times larger than the baseline and gives relative WER reductions of 8% on average, both on a targeted synthetic test set and on a test set with user queries. EEP and the more efficient AEEP reduce the size of the keep list by selecting only decoding errors, resulting in LMs that are only twice as large as the baseline and still yield good WER improvements of 17% on the domain-specific synthetic test set and 10% on the user query test set. We also showed that we can predict ASR difficulty based on textual signals.

In future work, we would like to explore more text-only approaches to customize pruning, e.g. by modifying the pruning criterion itself or selecting data based on LM features and improved (FST-based) phonetic confusion features.
Acknowledgements.
We thank Youssef Oualil, Amr Mousa, Russ Webb, and Barry Theobald for their comments and feedback.

6. REFERENCES

[1] Juniper Research, “Digital Voice Assistants in Use to Triple to 8 Billion by 2023, Driven by Smart Home Devices,” Press Release, Feb. 2019.
[2] Ankur Gandhe, Ariya Rastrow, and Bjorn Hoffmeister, “Scalable language model adaptation for spoken dialogue systems,” in IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 907–912.
[3] Eric Schurman and Jake Brutlag, “Performance Related Changes and their User Impact,” Presentation at Velocity – Web Performance and Operations Conference, 2009.
[4] Slava Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 3, pp. 400–401, 1987.
[5] Andreas Stolcke, “Entropy-based Pruning of Backoff Language Models,” ArXiv, vol. cs.CL/0006025, 1998.
[6] Jianfeng Gao and Min Zhang, “Improving Language Model Size Reduction using Better Pruning Criteria,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002, pp. 176–182.
[7] Stanley F. Chen and Joshua Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech and Language, vol. 13, pp. 359–394, 1999.
[8] Ciprian Chelba, Thorsten Brants, Will Neveitt, and Peng Xu, “Study on Interaction between Entropy Pruning and Kneser-Ney Smoothing,” in Proceedings Interspeech, 2010, pp. 2242–2245.
[9] Kristie Seymore and Ronald Rosenfeld, “Scalable backoff language models,” in International Conference on Spoken Language Processing (ICSLP), 1996, pp. 232–235.
[10] Brian Roark, Murat Saraclar, and Michael Collins, “Corrective language modeling for large vocabulary ASR with the perceptron algorithm,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2004, pp. 749–752.
[11] Takanobu Oba, Takaaki Hori, Atsushi Nakamura, and Akinori Ito, “Round-Robin Duel Discriminative Language Models,” Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1244–1255, 2012.
[12] Yuuki Tachioka and Shinji Watanabe, “Discriminative method for recurrent neural network language models,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5386–5390.
[13] Jinxi Guo, Tara N. Sainath, and Ron J. Weiss, “A Spelling Correction Model for End-to-end Speech Recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5651–5655.
[14] Zhen Huang, Tim Ng, Leo Liu, Henry Mason, Xiaodan Zhuang, and Daben Liu, “SNDCNN: Self-Normalizing Deep CNNs with Scaled Exponential Linear Units for Speech Recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6854–6858.
[15] Ernest Pusateri, Christophe Van Gysel, Rami Botros, Sameer Badaskar, Mirko Hannemann, Youssef Oualil, and Ilya Oparin, “Connecting and Comparing Language Model Interpolation Techniques,” in Proceedings Interspeech, 2019, pp. 3500–3504.
[16] Christophe Van Gysel, Manos Tsagkias, Ernest Pusateri, and Ilya Oparin, “Predicting Entity Popularity to Improve Spoken Entity Recognition by Virtual Assistants,” in SIGIR, 2020, pp. 1613–1616.
[17] Shiliang Zhang, Hui Jiang, Mingbin Xu, Junfeng Hou, and Lirong Dai, “The Fixed-Size Ordinally-Forgetting Encoding Method for Neural Network Language Models,” in IJCNLP, 2015, pp. 495–500.
[18] Tim Capes, Paul Coles, Alistair Conkie, Ladan Golipour, Abie Hadjitarkhani, Qiong Hu, Nancy Huddleston, Melvyn Hunt, Jiangchuan Li, Matthias Neeracher, Kishore Prahallad, Tuomo Raitio, Ramya Rasipuram, Greg Townsend, Becci Williamson, David Winarsky, Zhizheng Wu, and Hepeng Zhang, “Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System,” in Proceedings Interspeech, 2017, pp. 4011–4015.
[19] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[20] Andreas Stolcke, “SRILM – An extensible language modeling toolkit,” in International Conference on Spoken Language Processing (ICSLP), 2002, pp. 901–904.
[21] Shubham Toshniwal and Karen Livescu, “Jointly learning to align and convert graphemes to phonemes with neural attention models,” in IEEE Spoken Language Technology Workshop (SLT), 2016, pp. 76–82.