Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus
Cal Peyser, Sepand Mavandadi, Tara N. Sainath, James Apfel, Ruoming Pang, Shankar Kumar
Google Inc.
{cpeyser,sepand,tsainath,japfel,rpang,shankarkumar}@google.com

Abstract
End-to-end (E2E) automatic speech recognition (ASR) systems lack the distinct language model (LM) component that characterizes traditional speech systems. While this simplifies the model architecture, it complicates the task of incorporating text-only data into training, which is important to the recognition of tail words that do not occur often in audio-text pairs. While shallow fusion has been proposed as a method for incorporating a pre-trained LM into an E2E model at inference time, it has not yet been explored for very large text corpora, and it has been shown to be very sensitive to hyperparameter settings in the beam search. In this work, we apply shallow fusion to incorporate a very large text corpus into a state-of-the-art E2E ASR model. We explore the impact of model size and show that intelligent pruning of the training set can be more effective than increasing the parameter count. Additionally, we show that incorporating the LM in minimum word error rate (MWER) fine tuning makes shallow fusion far less dependent on optimal hyperparameter settings, reducing the difficulty of that tuning problem.
1. Introduction
Rare words pose an ongoing problem for building high-quality speech recognition systems. Since rare words are likely to be named entities such as names and locations, these "tail" words are often critical to the meaning of the decoded transcript. Because they do not occur often in the audio-text pairs that comprise an ASR system's training set, they are difficult to predict correctly.

Conventional ASR systems contain separate acoustic, pronunciation, and language models which are run one after another. In such systems, the distinct language model provides an opportunity to train part of the model on text-only data, which is often far more plentiful than audio-text pairs and can contain many occurrences of words that are rare in the acoustic data. The independence of the LM from the rest of the ASR system allows its dataset or training procedure to be adapted to specific domains, including tail words [1, 2].

E2E ASR systems consist of a single neural network in which all components are jointly trained. These models offer the advantage of simplifying the alignment of audio to text [3], as well as decreased model size [4]. However, there is no explicit LM in an E2E architecture, complicating the task of integrating text-only data. Many "LM fusion" methods have been proposed, including "shallow fusion" [5], in which LM logits are interpolated with those of an E2E model during inference, as well as more sophisticated methods such as "deep" and "cold" fusion, in which the LM is incorporated into the neural architecture of the E2E system [5, 6]. In [7], shallow fusion was shown to be the most effective fusion method with a state-of-the-art E2E system, although the "density ratio method" has been shown to outperform shallow fusion in a domain transfer scenario [8].

Earlier works on shallow fusion such as [9] and [10] use LMs taken from Kaldi [11] recipes, which are trained on no more than a few hundred million words. Of course, language models can scale to far larger datasets: [12] trained an RNN-LM on the One Billion Word Benchmark [13], while [14] trained a transformer on 8 million web documents totaling 40GB of text. To our knowledge, the study that uses the most text data to date in training an RNN-LM for shallow fusion is [15], which uses about 4 billion words.

Prior work has also shown that shallow fusion is difficult to implement correctly. In [9], it is shown that without careful tuning of several hyperparameters, shallow fusion causes transcripts to be cut off after only a few words, massively degrading performance.

We have two goals in this work. First, we seek to reduce the difficulty of tuning fusion hyperparameters. We show that applying shallow fusion during minimum word error rate (MWER) training adapts the model to a particular setting of hyperparameters, and almost eliminates the impact of those parameters in inference. Second, we seek to scale shallow fusion to a text corpus of about 50 billion words, an order of magnitude larger than [15]. We show that tail performance can be improved by careful pruning of the dataset without resorting to extremely large model sizes.

In this study, we focus on the particularly difficult problem of tail words. We use shallow fusion to incorporate an LM into an E2E model trained in the recent deliberation framework [16], which already achieves state-of-the-art transcription quality on rare words. We show that shallow fusion with a large text corpus yields further improvements on the tail.

The rest of this paper is organized as follows. Section 2 outlines the architecture of our deliberation model and summarizes the techniques of MWER fine-tuning and shallow fusion. Section 3 describes the techniques we use to achieve the two goals given above. Section 4 gives details on our dataset and model architecture. Section 5 gives results and analysis, and we conclude in Section 6.
2. Background
In this section, we summarize our baseline model, fine-tuning procedure, and method for language model integration.
2.1. Deliberation

Two-pass ASR models combine a pre-trained recurrent neural network transducer (RNN-T) [17] with a second decoder that rescores top-n hypotheses [18]. In a deliberation model, on the other hand, the second decoder has the option to attend to the RNN-T hypotheses instead of rescoring them, allowing all parts of the model to be trained together.

Figure 1: Deliberation Architecture, adapted from [16].
More specifically, a deliberation model's encoder consumes acoustic features $x$ and maps them onto encoder features $e$. The RNN-T decoder attends to the encoder features, and an n-best list of hypotheses $y_{RNN\text{-}T}$ is extracted using a beam search. A second encoder adapts $e$ into modified encoder features $e'$ to be consumed by the deliberation decoder. The deliberation decoder attends to both features derived from $y_{RNN\text{-}T}$ and $e'$, and the final transcript is extracted with a second beam search.

2.2. MWER Training

MWER training [19] is a fine-tuning procedure designed to directly minimize the number of word errors instead of cross-entropy. In MWER training, we seek to optimize the expected number of word errors over all possible hypotheses. Since we cannot practically marginalize over all possible output sequences, we instead compute the expected word error rate from a sample of predictions:

$$L(x, y^*) = \sum_{y \in B} P(y|x)\,\hat{W}(y, y^*) \quad (1)$$

where $x$ is the input acoustic features, $y^*$ is the ground truth, and $\hat{W}$ gives a normalized word error count. Here, $y$ is a hypothesis from a beam $B$ that is sampled from the model using a beam search, and the posterior $P$ is normalized accordingly so that all probabilities sum to one. It was demonstrated in [20] that this method is effective for beam sizes as small as 4.

2.3. Shallow Fusion

In shallow fusion, a language model is incorporated into ASR decoding by interpolating the posteriors directly:

$$\hat{y} = \arg\max_{y} P_{AM}(y|x) + \alpha P_{LM}(y) + \beta C \quad (2)$$

where $\hat{y}$ is the selected hypothesis, $P_{AM}$ and $P_{LM}$ are posteriors from the acoustic and language models respectively, and $\alpha$ and $\beta$ are hyperparameters. $C$ is a coverage term as in [9], which seeks to discourage truncated transcripts by rewarding hypotheses that have been allocated weight above some threshold by the attention mechanism.
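To make Equations (1) and (2) concrete, the following is a minimal Python sketch of the two scoring rules. It rests on simplifying assumptions that are ours rather than the paper's: scores are interpreted in the log domain, as is standard for fusion, and $\hat{W}$ is taken to be a raw word-level edit distance rather than the paper's normalized error count. All function and variable names (fused_score, mwer_loss, word_errors) are illustrative.

```python
import math

def fused_score(am_logprob, lm_logprob, coverage, alpha, beta):
    # Shallow-fusion score of one hypothesis, as in Equation (2): the
    # E2E (acoustic) score interpolated with the external LM score,
    # plus a coverage bonus that discourages truncated transcripts.
    return am_logprob + alpha * lm_logprob + beta * coverage

def word_errors(hyp, ref):
    # Word-level Levenshtein distance, standing in for W-hat.
    h, r = hyp.split(), ref.split()
    dp = list(range(len(r) + 1))
    for i, wh in enumerate(h, 1):
        prev, dp[0] = dp[0], i
        for j, wr in enumerate(r, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (wh != wr))
    return dp[len(r)]

def mwer_loss(beam, ref):
    # Expected word-error count over a beam (Equation 1). `beam` holds
    # (hypothesis, log_prob) pairs from a beam search; the posterior is
    # renormalized so probabilities sum to one over the beam. Log-probs
    # are shifted by their max before exponentiating, for stability.
    m = max(lp for _, lp in beam)
    z = sum(math.exp(lp - m) for _, lp in beam)
    return sum(math.exp(lp - m) / z * word_errors(hyp, ref)
               for hyp, lp in beam)
```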
3. Methods
In this section, we describe our techniques for fusing an LM trained on our large text corpus into our deliberation model.
3.1. Reducing Hyperparameter Sensitivity

In [9], the authors identified a failure mode for shallow fusion in which the fused model predicts a shortened transcript consisting of only the first few words that were spoken. As we will see, this "truncation" problem turned out to be quite severe when incorporating our LMs, which were trained on a large text corpus.
There are several hyperparameters proposed in the literature for addressing the truncation problem. We experimented with tuning the following:

• The coverage penalty C, as in Equation (2) above.
• The beam size. In principle, increasing the size of the beam leaves room for longer, less errorful hypotheses on the beam even when truncated hypotheses are present.
• The maximum EOS logprob delta. This hyperparameter is proposed in [9]. When a hypothesis is ended with the EOS token during beam search, it must have a log probability no worse than that of the best hypothesis so far minus this value in order to be removed from the beam and marked as complete.

It was shown in [21] that using a particular hyperparameter setting (blank scale, in that work) during training of a conventional ASR system can sometimes adapt the model to that setting in inference. We attempt to reduce the difficulty of the hyperparameter tuning problem given above by showing that this can be done with beam search parameters in a deliberation model.

MWER fine tuning provides an opportunity to do this by running a beam search during training. Since this beam search will include an LM in inference, we would like to perform shallow fusion during MWER fine tuning. [22] develops a technique for fusion with RNN-T in which the LM's logit values are added to RNN-T's non-blank outputs, while leaving the logit for the blank output unchanged, and demonstrates small performance improvements. Unlike [22], this work seeks to use MWER training as a hyperparameter adaptation mechanism. So, we instead fine tune the beam search of our second decoder, which will be the site of shallow fusion during inference. Since this decoder does not emit a blank output, we can define a loss with direct logit interpolation:

$$L(x, y^*) = \sum_{y \in B_{LM}} \left( P_{AM}(y|x) + \alpha P_{LM}(y) + \beta C \right) \hat{W}(y, y^*) \quad (3)$$

where $B_{LM}$ are hypotheses drawn using a beam search with shallow fusion.

3.2. Data Pruning

We implement the following pruning scheme to eliminate noisy data and reduce overfitting to extremely common sentences (e.g. "facebook"); a code sketch of the full scheme appears at the end of this subsection:

1. For every sentence, each unigram is compared against a 1 million word vocabulary. Any unigram not in this list is considered to be misspelled, and the sentence is discarded.
2. If a sentence is duplicated n times in the remaining examples, all but log(n) examples are discarded.
3. The desired number of sentences is selected by random sampling.

Large text corpora have been exploited successfully for ASR in the past by sampling a training set that is relevant to a domain of interest [23]. These results, however, used a maximum-entropy LM, which presents a convex optimization problem that scales naturally to large amounts of data, while we seek to optimize a non-convex RNN-LM. Also, these results targeted geographical queries, which are plentiful in ASR training corpora, while we seek to improve performance on rare words. Nevertheless, we adapt this method to our problem by experimenting with an additional step between steps 2 and 3 above:

2*. For every example, each unigram is compared to a list of word counts from the deliberation model's training data. Any sentence not containing at least one "rare" word is discarded, where rareness is defined as occurring fewer times than some threshold.

Altogether, this scheme is designed to take advantage of the large size of our text corpus while still maintaining a manageable number of sentences for LM training. We weigh the impact of this data reduction against that of increasing the LM's size.
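As a rough illustration of the pruning scheme (Steps 1-3 and the optional Step 2*), here is a self-contained Python sketch. It assumes the corpus fits in memory as one sentence per string, and it interprets log(n) as the natural log floored at one kept copy so singletons survive; both are our assumptions, not details from the paper.

```python
import math
import random
from collections import Counter

def prune_corpus(sentences, vocab, am_counts,
                 rare_threshold=None, target_size=None, seed=0):
    # Step 1: vocabulary filter -- drop any sentence containing a
    # unigram outside the vocabulary (1 million words in the paper),
    # treating out-of-vocabulary unigrams as misspellings.
    kept = [s for s in sentences if all(w in vocab for w in s.split())]

    # Step 2: if a sentence is duplicated n times, keep only log(n)
    # copies (floored at one).
    counts = Counter(kept)
    kept = [s for s, n in counts.items()
            for _ in range(max(1, int(math.log(n))))]

    # Step 2* (optional): keep only sentences containing at least one
    # word that is rare in the deliberation model's training data.
    if rare_threshold is not None:
        kept = [s for s in kept
                if any(am_counts.get(w, 0) < rare_threshold
                       for w in s.split())]

    # Step 3: random sampling down to the desired corpus size.
    if target_size is not None and len(kept) > target_size:
        kept = random.Random(seed).sample(kept, target_size)
    return kept
```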
4. Experiments
In this section, we describe the parameters of our experiments. We also describe our methods for measuring the success of our LM integration and for evaluating performance on the tail.
4.1. Model Architecture

Our deliberation model is similar to that presented in [16]. We use 128-dimensional log-Mel audio features with a 32ms window and 10ms shift. The RNN-T component of the deliberation model contains eight LSTM layers in its encoder, each with 2,048 units and a 640-dimensional projection. The joint network contains 640 units, followed by a final softmax layer. Hypotheses from RNN-T are passed to a two-layer bidirectional LSTM which projects them into a 320-dimensional space. Our second decoder attends to both these features and the RNN-T encoder output, and emits context vectors which are passed to a final 2-layer LSTM. Our training set is described in [24]. Transcripts are lowercased and processed with a 4k word piece model.
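For reference, the stated dimensions can be collected into a single configuration sketch; the field names below are our own shorthand, not identifiers from the paper's implementation.

```python
# Dimensions as described above; names are illustrative only.
DELIBERATION_CONFIG = {
    "frontend":      {"log_mel_dim": 128, "window_ms": 32, "shift_ms": 10},
    "rnnt_encoder":  {"lstm_layers": 8, "units": 2048, "projection": 640},
    "joint_network": {"units": 640},   # followed by a final softmax layer
    "hyp_encoder":   {"bidi_lstm_layers": 2, "projection": 320},
    "delib_decoder": {"final_lstm_layers": 2},
    "wordpieces":    4096,             # 4k word piece model, lowercased text
}
```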
Our language models are similar to those in [7]. The models consist of LSTM layers with 512 nodes each, with a projection layer of 256 nodes. Our baseline model has two hidden layers.

The models are trained on a sample of anonymized production traffic to Google applications. We divide this data into domains that describe the origin of the queries. All examples are stripped of metadata, so that only the query text is visible to the model. Our training set is selected from this data using the pruning procedure outlined in Section 3.2. The total size of our data before pruning is about 230 billion examples. Vocabulary pruning (Step 1) reduces that size to about 218 billion, and log(n) pruning (Step 2) further reduces the size to about 25 billion examples.

For our baseline models, we omit rare word filtering (Step 2*) from our pruning procedure and sample down to a final size of 4.5 billion examples (about 50 billion words). When we include rare word filtering, we obtain a dataset of about 1 billion total examples (about 11 billion words), and omit Step 3.

4.2. Evaluation Sets

We would like to create evaluation sets that measure the degree to which our LM has been integrated into our model, and to determine performance on the tail. We create separate test sets for these two purposes, and split our test sets into those focused on geographical queries (Maps) and general queries (Search). The test sets are created by looking for utterances in the text data that have a very different perplexity distribution compared to the audio-text pair training data.

To measure LM integration, we build test sets consisting of words that are common in the LM training data but rare in the AM training data. To this end, we compute unigram statistics for both corpora and construct a list of unigrams that occur at most five times in the AM data (about three quarters of all words) and at least 150 times in the LM data (about 99% of all words).
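A sketch of how this word list might be computed from the two unigram tables, using the thresholds above (at most five AM occurrences, at least 150 LM occurrences); the function name and signature are ours.

```python
from collections import Counter

def lm_only_words(am_unigrams: Counter, lm_unigrams: Counter,
                  am_max: int = 5, lm_min: int = 150) -> set:
    # Words common in the text-only (LM) corpus but rare in the
    # audio-text (AM) training data.
    return {w for w, n in lm_unigrams.items()
            if n >= lm_min and am_unigrams.get(w, 0) <= am_max}
```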
Figure 2: A sweep of maximum EOS log-probability delta and beam size on our test sets, before (a-d) and after (e-h) MWER fine tuning. Panels (a, e): Maps, LM Only; (b, f): Search, LM Only; (c, g): Maps, Surprising Prons; (d, h): Search, Surprising Prons.

Table 1: WER results of the expanded models (E1, E2, E2-4, E2-6, E2-8, E3, E4) on the Maps and Search test sets, LM Only and Surprising Prons conditions.

To measure tail performance, we target words that have pronunciations that are surprising given their spelling. Unusual pronunciations have been shown to be difficult for ASR systems [25, 26, 27]. To select examples with surprising pronunciations, we manually assemble a map from grapheme sequences to corresponding phoneme sequences. Our mapping consists of 487 correspondences. For a given example, we process each unigram grapheme-by-grapheme, using the map to assemble a list of possible corresponding phoneme sequences. If none of the predicted pronunciations match the true pronunciation of the unigram, we consider the unigram to have a surprising pronunciation.

For each test set, we select 10,000 examples and synthesize audio for each transcript with a TTS system as in [28].
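The surprising-pronunciation test could be implemented roughly as below: enumerate every segmentation of a word licensed by the grapheme-to-phoneme map and check whether any of them yields the true pronunciation. The tiny map shown is a toy stand-in for the paper's 487-entry table, and joining phonemes with spaces is our own convention.

```python
def predicted_prons(word, g2p, prefix=()):
    # Recursively segment `word` using grapheme sequences from the
    # hand-built map, yielding every phoneme sequence it licenses.
    if not word:
        yield " ".join(prefix)
        return
    for graphemes, phoneme_options in g2p.items():
        if word.startswith(graphemes):
            for p in phoneme_options:
                yield from predicted_prons(word[len(graphemes):], g2p,
                                           prefix + ((p,) if p else ()))

def is_surprising(word, true_pron, g2p):
    # A unigram is "surprising" if no predicted pronunciation matches
    # its true pronunciation.
    return true_pron not in set(predicted_prons(word, g2p))

# Toy map: "ph" maps to /f/, "e" may be silent.
TOY_G2P = {"ph": ["f"], "o": ["oU", "A"], "n": ["n"], "e": ["i", ""]}
print(is_surprising("phone", "f oU n", TOY_G2P))  # False: pronunciation predicted
```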
5. Results
This section presents our experimental results and discussion.
5.1. Hyperparameter Adaptation

We find that optimizing only the LM interpolation weight α and coverage penalty weight β does not yield improvement over our baseline model. Improvement was only shown after tuning the beam size and the maximum EOS logprob delta, which relate directly to the beam search. To understand the problem, we compare our models' word error rate to "truncation word error rate", which is the word error rate on examples for which the prediction has at most half as many unigrams as the reference. Table 2 compares WER and truncation WER for our baseline deliberation model, a fusion model with fixed settings of α and β, and a second fusion model in which the beam size and maximum EOS logprob delta are additionally tuned. This data suggests that truncation errors are largely responsible for the degradation in WER in the initial fusion model, and that tuning the beam search parameters recovers those losses.

Table 2: Impact of the Truncation Problem

(a) Maps, LM Only
                      WER    Truncation WER
Baseline              19.9   2.0
Fusion                21.0   3.6
Fusion w/ BS Params   18.7   2.2

(b) Search, LM Only
                      WER    Truncation WER
Baseline              28.4   6.9
Fusion                31.5   7.6
Fusion w/ BS Params   27.1   7.1

Figure 2 (a-d) shows the results of a sweep of the two beam search parameters: beam size and maximum EOS logprob delta. Interestingly, while we find that increasing the beam size yields improvements, for a sufficiently small value of maximum EOS logprob delta the beam size does not make a difference. Nevertheless, it is clear that WER results are highly dependent on the correct setting of these hyperparameters.

We find that MWER fine-tuning dramatically diminishes the importance of beam search parameters in evaluation. Figure 2 (e-h) shows the results of training 25 MWER models, using the same combinations of maximum EOS logprob delta and beam size from above during the MWER beam search and then evaluating using shallow fusion with those same parameters. We find a significantly smaller range of WER than before MWER fine tuning. This suggests that MWER fine tuning serves to adapt a model to some choice of beam search parameters by using those parameters during training. This could make MWER useful as a tool to alleviate the difficulty of hyperparameter tuning in shallow fusion.
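The paper does not spell out the normalization of truncation WER, so the sketch below reflects our interpretation: errors on the truncated subset are normalized by the total number of reference words, which makes the metric a share of overall WER, consistent with its small magnitudes in Table 2. Function names are ours.

```python
def word_errors(hyp, ref):
    # Word-level Levenshtein distance between hypothesis and reference.
    h, r = hyp.split(), ref.split()
    dp = list(range(len(r) + 1))
    for i, wh in enumerate(h, 1):
        prev, dp[0] = dp[0], i
        for j, wr in enumerate(r, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (wh != wr))
    return dp[len(r)]

def truncation_wer(pairs):
    # Errors contributed by likely-truncated outputs (hypothesis at most
    # half as long as the reference), normalized by all reference words.
    total = sum(len(r.split()) for _, r in pairs)
    errs = sum(word_errors(h, r) for h, r in pairs
               if len(h.split()) <= len(r.split()) / 2)
    return 100.0 * errs / max(total, 1)
```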
5.2. Model Size and Data Selection

We compare the importance of model size to that of the data selection criterion in LM training. Table 1 gives results for shallow fusion with progressively larger LMs: our baseline (E1), an expanded model in which the LM's projection layer is removed, effectively doubling the parameter count (E2), and that same model with 4, 6, and 8 hidden layers (E2-4, E2-6, E2-8). Table 1 also gives results for the 4-layer variant in which Step 2* of the pruning procedure described in Section 3.2 is applied to the training data (E3), and finally one in which the training set is further sampled down to about 50 million examples (E4).

We see that filtering for rare words yields significantly larger gains than increasing model capacity. This suggests that it is easier to take advantage of a large text corpus by selecting a subset of relevant examples than it is to model the entire distribution. Interestingly, this benefit is strongest when we only prune to 1 billion examples, and weakens when we further prune down to 50 million. This further suggests that an RNN-LM used in fusion is capable of benefiting from a very large text corpus.
6. Conclusions
In this paper, we have explored shallow fusion using a very large text-only corpus. We quantified and explored solutions to the truncated utterances problem, and demonstrated that MWER fine tuning almost eliminates the need for hyperparameter tuning. Finally, we showed how a pruning strategy can outperform increased model size in taking advantage of large amounts of text data.

7. References

[1] H. Sak, F. Beaufays, K. Nakajima, and C. Allauzen, "Language model verbalization for automatic speech recognition," in IEEE ICASSP, 2013, pp. 8262-8266.
[2] S. Huang and S. Renals, "Hierarchical Bayesian language models for conversational speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 1941-1954, 2010.
[3] D. Wang, X. Wang, and S. Lv, "An overview of end-to-end automatic speech recognition," Symmetry, vol. 11, p. 1018, 2019.
[4] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S. Chang, K. Rao, and A. Gruenstein, "Streaming end-to-end speech recognition for mobile devices," in IEEE ICASSP, 2019, pp. 6381-6385.
[5] Ç. Gülçehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. Lin, F. Bougares, H. Schwenk, and Y. Bengio, "On using monolingual corpora in neural machine translation," CoRR, vol. abs/1503.03535, 2015. [Online]. Available: http://arxiv.org/abs/1503.03535
[6] A. Sriram, H. Jun, S. Satheesh, and A. Coates, "Cold fusion: Training seq2seq models together with language models," in INTERSPEECH, 2018.
[7] S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. Sainath, and K. Livescu, "A comparison of techniques for language model integration in encoder-decoder speech recognition," in IEEE SLT, 2018.
[8] E. McDermott, H. Sak, and E. Variani, "A density ratio approach to language model fusion in end-to-end automatic speech recognition," in IEEE ASRU, 2019, pp. 434-441.
[9] J. Chorowski and N. Jaitly, "Towards better decoding and language model integration in sequence to sequence models," in INTERSPEECH, 2016.
[10] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu, "Exploring neural transducers for end-to-end speech recognition," in IEEE ASRU, 2017, pp. 206-213.
[11] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.
[12] R. Józefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, "Exploring the limits of language modeling," CoRR, vol. abs/1602.02410, 2016. [Online]. Available: http://arxiv.org/abs/1602.02410
[13] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson, "One billion word benchmark for measuring progress in statistical language modeling," in INTERSPEECH, 2014.
[14] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," 2019. [Online]. Available: https://openai.com/blog/better-language-models/
[15] A. Kannan, Y. Wu, P. Nguyen, T. Sainath, Z. Chen, and R. Prabhavalkar, "An analysis of incorporating an external language model into a sequence-to-sequence model," in IEEE ICASSP, 2018, pp. 5824-5828.
[16] K. Hu, T. Sainath, R. Pang, and R. Prabhavalkar, "Deliberation model based two-pass end-to-end speech recognition," in IEEE ICASSP, 2020.
[17] A. Graves, "Sequence transduction with recurrent neural networks," in International Conference on Machine Learning (ICML) Workshop on Representation Learning, 2012.
[18] C.-C. Chiu, D. Rybach, I. McGraw, M. Visontai, Q. Liang, R. Prabhavalkar, R. Pang, T. Sainath, T. Strohman, W. Li, Y. R. He, and Y. Wu, "Two-pass end-to-end speech recognition," in INTERSPEECH, 2019.
[19] M. Shannon, "Optimizing expected word error rate via sampling for speech recognition," in INTERSPEECH, 2017.
[20] R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and A. Kannan, "Minimum word error rate training for attention-based sequence-to-sequence models," in IEEE ICASSP, 2018, pp. 4839-4843.
[21] E. McDermott, "A deep generative acoustic model for compositional automatic speech recognition," in Proceedings of Neural Information Processing Systems (NeurIPS) Workshop: Interpretability and Robustness in Audio, Speech, and Language, 2018. [Online]. Available: https://openreview.net/pdf?id=S1fbqB0noQ
[22] C. Weng, C. Yu, J. Cui, C. Zhang, and D. Yu, "Minimum Bayes risk training of RNN-transducer for end-to-end speech recognition," CoRR, vol. abs/1911.12487, 2019.
[23] F. Biadsy, M. Ghodsi, and D. Caseiro, "Effectively building tera scale maxent language models incorporating non-linguistic signals," in INTERSPEECH, 2017.
[24] A. Narayanan, A. Misra, K. C. Sim, G. Pundak, A. Tripathi, M. Elfeky, P. Haghani, T. Strohman, and M. Bacchiani, "Toward domain-invariant speech recognition via large scale training," in IEEE Spoken Language Technology Workshop (SLT), 2018.
[25] C. Peyser, T. N. Sainath, and G. Pundak, "Improving proper noun recognition in end-to-end ASR by customization of the MWER loss criterion," in IEEE ICASSP, 2020, pp. 7789-7793.
[26] F. Beaufays, A. Sankar, S. Williams, and M. Weintraub, "Learning name pronunciations in automatic speech recognition systems," in Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence, 2003, pp. 233-240.
[27] A. Laurent, S. Meignier, T. Merlin, and P. Deleglise, "Acoustics-based phonetic transcription method for proper nouns," in INTERSPEECH, 2010, pp. 2286-2289.
[28] X. Gonzalvo, S. Tazari, C. Chan, M. Becker, A. Gutkin, and H. Silén, "Recent advances in Google real-time HMM-driven unit selection synthesizer," in INTERSPEECH, 2016.