Domain-aware Neural Language Models for Speech Recognition
Linda Liu, Yile Gu, Aditya Gourav, Ankur Gandhe, Shashank Kalmane, Denis Filimonov, Ariya Rastrow, Ivan Bulyko
Amazon Alexa
ABSTRACT
As voice assistants become more ubiquitous, they are increasingly expected to support and perform well on a wide variety of use-cases across different domains. We present a domain-aware rescoring framework suitable for achieving domain adaptation during second-pass rescoring in production settings. In our framework, we fine-tune a domain-general neural language model on several domains, and use an LSTM-based domain classification model to select the appropriate domain-adapted model to use for second-pass rescoring. This domain-aware rescoring improves the word error rate by up to 2.4% and slot word error rate by up to 4.1% on three individual domains – shopping, navigation, and music – compared to domain-general rescoring. These improvements are obtained while maintaining accuracy for the general use case.
Index Terms — language modeling, second-pass rescoring, domain adaptation, automatic speech recognition
1. INTRODUCTION
Voice assistants have become increasingly popular and are used for an ever-expanding set of use-cases. For example, users can currently ask voice assistants to play music (music domain), obtain a business address (navigation domain), and buy an item (shopping domain). Often, these automatic speech recognition (ASR) systems consist of a separate language model (LM) component in the first pass, used in conjunction with an acoustic model, and an optional LM in the second pass for rescoring. These LMs are trained to estimate the probability of a sequence of words P(w_1, ..., w_n). While n-gram LMs do this by estimating the probability of each word given the previous n-1 words, neural LMs (NLMs) learn a distributed representation for words as well as the probability function for the word sequence in context [1]. This allows them to generalize estimates to unseen word sequences, and to use longer word histories. Both types of LMs can be trained on a variety of textual data sources, and are typically optimized for the general use case.

However, different usage domains may differ significantly in their language statistics, and a general LM may not perform as well on new domains, or on domains otherwise not well represented by the general use case [2]. Training a single LM that performs well on many domains is challenging. Some recent approaches include using attention to acoustic embeddings during decoding [3] or a compositional neural language model that learns how to combine multiple component language models [4].

In this work, we describe a domain-aware rescoring framework that can be used to address this in a production setting in the second pass. We fine-tune a domain-general LM on data from three domains (music, navigation, shopping), and show that these models capture improvements in word error rate (WER) and slot WER (SlotWER) – which measures the performance on certain critical content words – beyond the general LM. We find this holds even when using neural models capable of capturing longer word histories and more complex relationships than traditional n-gram LMs. We demonstrate an effective way to use a classification model to determine which domain-adapted model to use in second-pass rescoring. With this framework, we are able to obtain 0.7%-4.1% improvement on domain WER and SlotWER. The experimental results that we report use an experimental ASR system that does not reflect the performance of the current Alexa ASR system in production.
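To make the contrast between the two model families concrete, the underlying factorization can be written as follows. This is the standard formulation rather than notation introduced by the paper; the parameters W, b and the hidden state h_{i-1} are our own shorthand for a generic neural LM.

```latex
% Both LM families estimate the chain-rule factorization
P(w_1,\ldots,w_n) = \prod_{i=1}^{n} P(w_i \mid w_1,\ldots,w_{i-1})

% n-gram LMs truncate the history to the previous n-1 words
P(w_i \mid w_1,\ldots,w_{i-1}) \approx P(w_i \mid w_{i-n+1},\ldots,w_{i-1})

% Neural LMs condition on a learned representation h_{i-1} of the full history
P(w_i \mid w_1,\ldots,w_{i-1}) = \mathrm{softmax}(W h_{i-1} + b)_{w_i}
```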
2. PREVIOUS WORK
Previous work on LM adaptation has shown that incorporating contextual or domain-adapted knowledge in the LM can improve the performance of ASR systems. One way to achieve this is by dynamically adjusting the weights of an interpolated n-gram LM based on the preceding text [5]. The interpolated LM can consist of different LMs that are optimized for different dialog states or applications [6], topic/domain [5, 7], or decomposition of topic factors [8, 9]. These LMs can be trained separately, and then the appropriate mixture selected at runtime. Additionally, a cache component, which maintains a representation of recently occurring words, can be combined with an interpolated n-gram model to adapt the model to a target domain, based on the recent history [10].

More recently, neural adaptation approaches have been used to adapt an LM to a target domain based on non-linguistic contextual signals, such as the application at the time of the request [11], or learned topic vectors [12, 13]. For example,
[12] used topic representations obtained from latent Dirichlet allocation to adapt an NLM for genres and shows in a multi-genre broadcast transcription task. Domain adaptation can also be achieved via shallow fusion, in which an external (contextually constrained) LM is integrated during beam search [14].

Given the limited search space in the second pass, we can use larger and more complex models in the second pass of ASR systems, resulting in additional WER improvements [1]. In this approach, some set of hypotheses (represented as an n-best list, lattice, or confusion network) generated by a domain-general first-pass model is then rescored by a domain-adapted model. The LM scores obtained from the first-pass model are updated using some combination of the first- and second-pass LM scores.

Recently, [15] demonstrated the feasibility of selecting adapted models to use in a production setting, using a pipeline with a topic classifier to select the appropriate domain-adapted LM. They show that this setup results in improvements beyond using the domain-general model alone. However, their second-pass LM is a 3-gram model. Recurrent neural LMs consistently outperform traditional n-gram models, particularly on less frequent tail words, and are capable of capturing more complex relationships between words [1, 16], so it is unclear whether domain-adaptation improvements might still hold when using a stronger LM.

In the next section, we present our domain-aware rescoring framework, in which we select a domain-aware NLM for second-pass rescoring based on the first-pass ASR output.
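As background for the interpolation-based approaches above (and for the offline interpolation baseline described later in Section 3.4), here is a minimal sketch of estimating linear-interpolation weights for a set of component LMs with EM. The component probabilities and the example utterance are illustrative placeholders, not values from the paper.

```python
import numpy as np

def em_interpolation_weights(word_probs, n_iters=20):
    """Estimate linear-interpolation weights for K component LMs.

    word_probs: array of shape (T, K); entry [t, k] is the probability
    component LM k assigns to word t of the utterance.
    Returns weights lambda (shape (K,)) for the mixture
    sum_k lambda_k * P_k(w_t | history_t).
    """
    T, K = word_probs.shape
    lam = np.full(K, 1.0 / K)            # start from uniform weights
    for _ in range(n_iters):
        # E-step: responsibility of each component for each word
        weighted = word_probs * lam       # (T, K)
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: new weights are the average responsibilities
        lam = resp.mean(axis=0)
    return lam

# Illustrative usage with made-up per-word probabilities from four LMs
# (general, nav, music, shop) over a three-word utterance:
probs = np.array([[0.010, 0.002, 0.030, 0.001],
                  [0.004, 0.001, 0.020, 0.002],
                  [0.008, 0.003, 0.050, 0.001]])
print(em_interpolation_weights(probs))   # weights shift toward the music LM
```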
3. DOMAIN-AWARE RESCORING FRAMEWORK
Our proposed domain-aware rescoring framework is illustrated in Fig. 1. After first-pass decoding, the one-best ASR output is fed into a classification model in order to determine which second-pass rescoring model to use. The second-pass rescoring model is selected if the posteriors for a given class meet a selected threshold. The first- and second-pass LM scores are interpolated based on an optimization for minimizing WER. In the following sections, we describe each of these components in turn: we first describe the first-pass ASR system that we use. Then we describe the classification model used to select the second-pass rescoring model, based on the first-pass output. Finally, we describe the domain-adapted NLMs used in second-pass rescoring.
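A minimal sketch of this control flow is shown below. The `classify` function, the per-domain rescoring functions, and the placement of the confidence threshold are our own illustration of the framework described above, not code from the paper; the 0.85 default mirrors the threshold reported in Section 4.

```python
def domain_aware_rescoring(nbest, classify, rescorers, threshold=0.85):
    """Select a second-pass rescoring LM from the first-pass 1-best output.

    nbest:     list of first-pass hypotheses, best first
    classify:  fn(text) -> dict {domain: posterior}
    rescorers: dict {domain: fn(nbest) -> best rescored hypothesis},
               including a "general" entry used as the fallback
    """
    one_best = nbest[0]
    posteriors = classify(one_best)
    # Pick the most probable domain, but only use a domain-adapted model
    # when the classifier is confident enough; otherwise fall back to the
    # domain-general rescorer.
    domain, score = max(posteriors.items(), key=lambda kv: kv[1])
    if domain == "general" or score < threshold:
        domain = "general"
    return rescorers[domain](nbest)
```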
Fig. 1. Second-pass rescoring framework. The classification model selects the appropriate rescoring model based on the first-pass decoding output. The n-best hypotheses are then rescored by the selected model (e.g., music-specific model).

3.1. First-pass ASR system

The first-pass LM is a Kneser-Ney [17] smoothed n-gram LM trained on a variety of data, including in-domain data (transcribed and semi-supervised data from real user-agent interactions; the user data is anonymized and consists of only a subset of the data used for our production system), and out-of-domain data (e.g., corpora available on the web). It is optimized to minimize perplexity over general traffic, which covers all domains [5]. The acoustic model is a low-frame-rate model with a 2-layer frequency LSTM [18] followed by a 5-layer time LSTM, trained with connectionist temporal classification loss [19].
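The paper does not name an n-gram toolkit; purely as an illustration of how a Kneser-Ney n-gram first-pass LM assigns scores to hypotheses, the sketch below uses the kenlm Python bindings with a hypothetical model file.

```python
import kenlm  # requires the kenlm Python bindings

# Hypothetical ARPA/binary LM file; the paper's first-pass LM is not released.
lm = kenlm.Model("first_pass_kn.arpa")

# Total log10 probability of a hypothesis, with sentence begin/end tokens.
print(lm.score("play jazz music", bos=True, eos=True))

# Per-n-gram breakdown: (log10 prob, matched n-gram length, is_oov) per word.
for log_prob, ngram_len, oov in lm.full_scores("play jazz music"):
    print(log_prob, ngram_len, oov)
```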
3.2. Domain classification model

Selecting the appropriate second-pass rescoring model can be thought of as a text classification task. We train a light-weight LSTM-based classifier. It consists of an input embedding layer with 100 units and a vocabulary of roughly 150,000, including a token for out-of-vocabulary words, followed by an LSTM layer with 64 hidden units, a fully connected layer, and an output layer with 4 output units. Each output unit corresponds to one of the second-pass models (one domain-general model, three domain-adapted models). We use cross-entropy loss, softmax activation, and the Adam optimizer with an initial learning rate of 0.001 and early stopping based on performance on a development set. At training time, the model is trained on annotated transcribed data. At test time, we feed in the one-best ASR hypothesis. Input utterances are padded or truncated to a length of 10.
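A minimal sketch of this classifier in Keras follows. The embedding size, LSTM width, class count, learning rate, and early stopping come from the description above; the fully-connected layer width, tokenization, batch size, and patience are our own assumptions, since the paper does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 150_000   # includes an out-of-vocabulary token
MAX_LEN = 10           # inputs are padded/truncated to 10 tokens upstream
NUM_CLASSES = 4        # general, navigation, music, shopping

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 100),         # 100-unit embedding layer
    layers.LSTM(64),                           # 64 hidden units
    layers.Dense(64, activation="relu"),       # fully connected layer (width assumed)
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Early stopping on a held-out development set, as described in the paper.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

# model.fit(x_train, y_train, validation_data=(x_dev, y_dev),
#           epochs=50, batch_size=256, callbacks=[early_stop])
```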
3.3. Domain-adapted NLMs

Following [20], we trained an NLM with two LSTM layers, each comprising 512 hidden units. We use noise contrastive estimation based training, which is self-normalized and results in faster computation during training and inference, as it does not require the computation of the full softmax during training [21]. To adapt the NLMs to individual domains, we initialized training from the converged domain-general model, using the subset of data from each of the respective domains in order to fine-tune the LM to the domain [22]. Following [23], who used a smaller learning rate for fine-tuning, we used a fraction of the initial learning rate, choosing the value that yielded the best improvement in perplexity (PPL) from the range of fractions we evaluated. In this way, each of our NLMs was trained on the same data, and also used the same vocabulary.

3.4. Second-pass rescoring and optimization

We generate a lattice from the 10-best ASR hypotheses, and rescore it using the second-pass model selected by the domain classifier. Rescoring itself uses the push-forward algorithm as described in [24, 25]. Additionally, we optimize the interpolation weights used to combine the first- and second-pass LMs based on (1) overall WER and (2) SlotWER. The optimization is carried out using the simulated annealing algorithm as described in [26].

We compare these results to an offline method where we use the one-best ASR results to estimate per-utterance interpolation weights for each of our rescoring models using the EM algorithm. We then use these weights to do a linear interpolation between the models in the second pass [5]. This approach is not practical in a production setting, but provides a point of comparison for our results.
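As a concrete sketch of the first/second-pass score combination (not of the lattice push-forward algorithm itself), the snippet below re-ranks an n-best list with an interpolated LM score. All names are illustrative; the interpolation weight would be tuned on a development set, for which scipy.optimize.dual_annealing could serve as a Python stand-in for the GenSA simulated-annealing package used in the paper.

```python
def rescore_nbest(nbest, second_pass_lm, weight=0.5):
    """Re-rank an n-best list by interpolating first- and second-pass LM scores.

    nbest:          list of (hypothesis_text, acoustic_score, first_pass_lm_score),
                    with scores as log-probabilities
    second_pass_lm: fn(text) -> log-probability under the selected NLM
    weight:         second-pass interpolation weight, tuned on a dev set
    """
    rescored = []
    for text, am_score, lm1_score in nbest:
        lm2_score = second_pass_lm(text)
        # Linear interpolation of the two LM scores, then combine with the
        # acoustic score to get the final hypothesis score.
        lm_score = (1.0 - weight) * lm1_score + weight * lm2_score
        rescored.append((text, am_score + lm_score))
    # The hypothesis with the highest combined score wins.
    return max(rescored, key=lambda x: x[1])[0]
```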
4. EXPERIMENTS

4.1. Second-pass models
Model      PPL Other   PPL Nav   PPL Music   PPL Shop
LM_Genrl   0%          0%        0%          0%
LM_Nav     –           -47.7%    –           –
LM_Music   –           –         -43.6%      –
LM_Shop    –           –         –           -45.8%

Table 1. Relative PPL on the evaluation set, by domain split, for each domain-adapted LM compared to the domain-general LM baseline.

The domain-general NLM was trained on 80 million utterances from anonymized user interactions with Alexa, constituting a portion of all live traffic. A portion of these utterances further included annotations for their domains. Among those utterances that were annotated, a portion were navigation, music, and shopping domain utterances (over 4 million each). We selected the top 238k most frequent words from this corpus as the vocabulary for the NLM, and mapped the out-of-vocabulary tokens to <unk>. The <unk> token was scaled by a constant factor in the rescoring experiments.

We adapted this domain-general NLM to three separate domains by fine-tuning on the annotated data for each domain: music, navigation, and shopping. Table 1 shows the perplexities for our test set, split out by domain, for each of the domain-adapted LMs and the domain-general LM. The Other partition does not contain music, navigation, or shopping domain utterances, and only includes other domains, such as factual questions and commands to control connected home appliances. We observe a reduction in PPL on the evaluation data for the corresponding domain when using the domain-adapted model as compared to the domain-general model. However, we also observe an increase in the PPL of the Other partition when using any domain-adapted model. This suggests that conditioning the second-pass LM on the domain of the utterance can improve recognition, and supports the use of a domain classifier.
4.2. Domain classifier

Domain      Music   Navigation   Shopping   Other
Precision   0.94    0.92         0.92       0.97
Recall      0.93    0.89         0.97       0.92

Table 2. Per-domain precision and recall for the classifier used to select the second-pass rescoring model.

The domain classification model is trained with roughly 9 million utterances from user interactions with Alexa that were human-annotated for domain, using an 8-1-1 split for train, dev, and eval (a subset of the total annotated data available to us). All utterances are first-turn interactions, in which the user directly prompts Alexa, and the data does not contain utterances that consist of only the wake-word (e.g., "Alexa").

Across all domains, we observe an overall classification accuracy of 95.9%, with unweighted precision of 93.9% and recall of 92.8%, based on the max class. The per-domain classification breakdown is shown in Table 2. In our rescoring framework, we select the domain-adapted model for rescoring only in cases where the posterior for that class meets a specified threshold (the higher the threshold, the more conservative the classification). Based on the precision and recall values per domain, we used a 0.85 threshold across domains, resulting in an overall classification accuracy of 91%.
4.3. Evaluation sets

All of our experiments are evaluated on an evaluation set consisting of anonymized live user interactions with Alexa. The evaluation set consists of annotated data across several different Alexa devices for a select time period. The final evaluation set consists of roughly 10-20k utterances for each of the shopping, music, and navigation domains. The remainder of the set consists of 115k utterances from other domains. Additionally, for each domain, we annotate the popular contentful slots, in order to evaluate improvements in the SlotWER. These slots are the SongName, ArtistName, and AlbumName slots for Music, the PlaceName and StreetName slots for Navigation, and the ItemName slot for Shopping. Improving recognition of slot content is especially important for voice interactions, where the WER is higher. Slots are often specific to certain domains, making them most likely to benefit from domain-aware rescoring.

We created a development set using the same criteria as our evaluation set. This development set was used to determine the interpolation weights used to combine the LM scores from the first- and second-pass models.
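For reference, a minimal sketch of how WER can be computed with word-level edit distance is shown below; as we read the paper, SlotWER is the same metric restricted to the annotated slot words. The helper is our own illustration, not the authors' scoring tool.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)]

def wer(references, hypotheses):
    """Corpus WER: total word edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words

print(wer(["play thriller by michael jackson"],
          ["play thrill her by michael jackson"]))  # 2 errors / 5 words = 0.4
```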
Model               Navigation        Music             Shopping          Other
                    WER     SlotWER   WER     SlotWER   WER     SlotWER   WER
LM_Genrl            -1.3%   -3.4%     -0.6%    1.0%     -2.2%   -0.3%     -0.8%
LM_Nav              -5.3%   -7.3%     –       –         –       –          1.8%
LM_Music            –       –         -2.6%   -4.1%     –       –          2.5%
LM_Shop             –       –         –       –         -4.0%   -4.1%      1.7%
DomainAware         -3.7%   -6.5%     -2.3%   -3.1%     -2.9%   -3.6%     -0.9%
AdaptationBaseline  -1.6%   -4.5%     -3.2%   -4.7%     -1.7%   -2.7%     -0.9%
Oracle              -31.2%  –         -25.2%  –         -32.8%  –         -18.4%

Table 3. WER improvements (shown as negative numbers) on domain-adapted test sets using the general LM (LM_Genrl), a domain-adapted LM (LM_Nav, LM_Music, LM_Shop), domain-aware rescoring with a domain classifier (DomainAware), or the dynamically interpolated model (AdaptationBaseline) as a second-pass rescorer. All relative comparisons are with respect to the first-pass baseline (not shown). The Oracle provides a lower bound for the minimum WER achievable, given a second-pass model that always prefers the hypothesis in the n-best list with the lowest WER.
4.4. Results

We present the WER results obtained using our domain-aware rescoring framework (DomainAware), in which we rescore each utterance based on the model selected by the domain classifier (Table 3). We show these results split by each domain based on the annotation in our test set (Navigation, Music, Shopping), as well as on the other domains (Other). The Other partition does not contain utterances from the aforementioned domains. Compared to LM_Genrl alone, we find that domain-aware rescoring shows improvements of 0.7%-2.4% on WER and 3.1%-4.1% on SlotWER across all domains. We also find a minor improvement (0.1% WERR) on the Other partition of the test set; these improvements are not obtained at the expense of general WER.

We also present the results that can be obtained on each evaluation set split when rescoring that split using only the corresponding rescoring LM (LM_Nav, LM_Music, LM_Shop). This shows that the domain-adapted LMs can improve WER beyond the domain-general LM, even when trained using the same data. Compared to using LM_Genrl, we observe improvements of 1.8%-4.0% on WER and 3.8%-5.1% on SlotWER for the corresponding evaluation set split. Taken together, this shows that domain-aware rescoring allows us to capture most of the improvements from using domain-adapted models, while avoiding degradation on other domains.

We conducted the above comparisons using a weighted combination of the first- and second-pass models, with interpolation weights obtained by optimizing for WER or SlotWER on our development set. We observed no significant impact based on whether we used WER or SlotWER; consequently, all of the aforementioned results are obtained from WER optimization.

Finally, we compare the results obtained using domain-aware rescoring to the standard approach of dynamically adapting the interpolation weights over LM_Genrl, LM_Nav, LM_Music, and LM_Shop. This is the AdaptationBaseline shown in Table 3. From this, we see that the DomainAware approach is competitive with the dynamically interpolated adaptation baseline – both improve over the general neural language model, but DomainAware shows larger WER improvements for the navigation and shopping domains, and smaller WER improvements in the music domain. Though the DomainAware approach uses a domain classifier to select one model for rescoring, the single model used is trained to minimize perplexity on the respective domain; this is similar to the AdaptationBaseline, in which the interpolation weights are determined to minimize the perplexity of the word sequence using a combination of all the models. However, unlike the DomainAware approach, the AdaptationBaseline approach is not practical for second-pass rescoring in production due to the latency incurred during weight optimization.
5. CONCLUSION
In this work, we present a framework for domain-aware rescoring that is suitable for use in a production environment. We show that using this framework enables us to capture WER improvements on domains beyond what is captured by a domain-general model trained on the same data. This framework has the benefit of allowing the individual domain-adapted rescoring models to be maintained and updated independently of the classification model, making it suitable for asynchronous model updates. We demonstrate that domain-aware rescoring is competitive with traditional model interpolation methods, which can only be run in an offline environment when latency is not an issue. Future work looks to combine the classification model and second-pass rescoring models into a single model, i.e., by incorporating an additional domain embedding during the training of the NLM. This model would allow the NLM to dynamically adapt to a specific domain, and could be further extended to other characteristics of the user or utterance beyond domain.

6. REFERENCES

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, no. Feb, pp. 1137–1155, 2003.
[2] John Blitzer, Domain Adaptation of Natural Language Processing Systems, Ph.D. thesis, 2008.

[3] Ankur Gandhe and Ariya Rastrow, "Audio-attention discriminative language model for ASR rescoring," in ICASSP. IEEE, 2020, pp. 7944–7948.

[4] Denis Filimonov, Ravi Teja Gadde, and Ariya Rastrow, "Neural composition: Learning to generate from multiple models," in Self-Supervised Learning for Speech and Audio Processing Workshop.

[5] Reinhard Kneser and Volker Steinbiss, "On the dynamic adaptation of stochastic language models," in ICASSP. IEEE, 1993, vol. 2, pp. 586–589.

[6] Frank Wessel, Andrea Baader, and Hermann Ney, "A comparison of dialogue-state dependent language models," in ESCA Tutorial and Research Workshop (ETRW) on Interactive Dialogue in Multi-Modal Systems, 1999.

[7] Anirudh Raju, Behnam Hedayatnia, Linda Liu, Ankur Gandhe, Chandra Khatri, et al., "Contextual language model adaptation for conversational agents," in Interspeech, 2018.

[8] Jerome R. Bellegarda, "Exploiting latent semantic information in statistical language modeling," Proceedings of the IEEE, vol. 88, no. 8, pp. 1279–1296, 2000.

[9] Daniel Gildea and Thomas Hofmann, "Topic-based language models using EM," in Sixth European Conference on Speech Communication and Technology, 1999.

[10] Roland Kuhn and Renato De Mori, "A cache-based natural language model for speech recognition," TPAMI, vol. 12, no. 6, pp. 570–583, 1990.

[11] Min Ma, Shankar Kumar, Fadi Biadsy, Michael Nirschl, Tomas Vykruta, and Pedro Moreno, "Modeling non-linguistic contextual signals in LSTM language models via domain adaptation," in ICASSP. IEEE, 2018, pp. 6094–6098.

[12] Xie Chen, Tian Tan, Xunying Liu, Pierre Lanchantin, Moquan Wan, et al., "Recurrent neural network language model adaptation for multi-genre broadcast speech recognition," in ISCA, 2015.

[13] Tomas Mikolov and Geoffrey Zweig, "Context dependent recurrent neural network language model," in SLT. IEEE, 2012, pp. 234–239.

[14] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, "End-to-end attention-based large vocabulary speech recognition," in ICASSP. IEEE, 2016, pp. 4945–4949.

[15] Dang Hoang Vu, Phuong Le-Hong, et al., "Online adaptation of language models for speech recognition," in International Conference on Computer Recognition Systems. Springer, 2019, pp. 164–173.

[16] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu, "Exploring the limits of language modeling," arXiv preprint arXiv:1602.02410, 2016.

[17] Reinhard Kneser and Hermann Ney, "Improved backing-off for m-gram language modeling," in ICASSP. IEEE, 1995, vol. 1, pp. 181–184.

[18] Bo Li, Tara N. Sainath, Arun Narayanan, Joe Caroselli, Michiel Bacchiani, et al., "Acoustic modeling for Google Home," in Interspeech, 2017, pp. 399–403.

[19] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006, pp. 369–376.

[20] Anirudh Raju, Denis Filimonov, Gautam Tiwari, Guitang Lan, and Ariya Rastrow, "Scalable multi corpora neural language models for ASR," in Interspeech, 2019.

[21] Andriy Mnih and Yee Whye Teh, "A fast and simple algorithm for training neural probabilistic language models," arXiv preprint arXiv:1206.6426, 2012.

[22] Fadi Biadsy, Michael Alexander Nirschl, Min Ma, and Shankar Kumar, "Approaches for neural-network language model adaptation," in Interspeech, Stockholm, Sweden, 2017.

[23] Craig Atkinson, Brendan McCane, Lech Szymanski, and Anthony Robins, "Pseudo-recursal: Solving the catastrophic forgetting problem in deep neural networks," arXiv preprint arXiv:1802.03875, 2018.

[24] Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig, "Joint language and translation modeling with recurrent neural networks," in EMNLP, 2013, pp. 1044–1054.

[25] Shankar Kumar, Michael Nirschl, Daniel Holtmann-Rice, Hank Liao, Ananda Theertha Suresh, and Felix Yu, "Lattice rescoring strategies for long short term memory language models in speech recognition," in ASRU. IEEE, 2017, pp. 165–172.

[26] Yang Xiang, Sylvain Gubian, Brian Suomela, and Julia Hoeng, "Generalized simulated annealing for efficient global optimization: the GenSA package for R."