Neural Language Modeling With Implicit Cache Pointers
Ke Li, Daniel Povey, Sanjeev Khudanpur
Center for Language and Speech Processing & Human Language Technology Center of Excellence, The Johns Hopkins University, Baltimore, MD 21218, USA
Xiaomi Corp., Beijing, China
{kli26, khudanpur}@jhu.edu, [email protected]

This work was partially supported by unrestricted gifts from Facebook and Applications Technology (AppTek).

Abstract
A cache-inspired approach is proposed for neural language models (LMs) to improve long-range dependency and better predict rare words from long contexts. This approach is a simpler alternative to the attention-based pointer mechanism that enables neural LMs to reproduce words from recent history. Without using attention or a mixture structure, the method only involves appending extra tokens that represent words in history to the output layer of a neural LM and modifying the training supervision accordingly. A memory-augmentation unit is introduced to learn which words are particularly likely to repeat. We experiment with both recurrent neural network- and Transformer-based LMs. Perplexity evaluation on Penn Treebank and WikiText-2 shows that the proposed model outperforms both an LSTM and an LSTM with the attention-based pointer mechanism, and is more effective on rare words. N-best rescoring experiments on Switchboard indicate that it benefits both very rare and frequent words. However, it is challenging for the proposed model, as well as for two other models with attention-based pointer mechanisms, to obtain good overall WER reductions.

Index Terms: RNNLM, Transformer, cache model, pointer component, automatic speech recognition
1. Introduction
Neural language models (LMs) are an important module in automatic speech recognition (ASR) [1, 2, 3]. Standard recurrent neural network language models (RNNLMs) make predictions based on a fixed-size hidden vector, which makes modeling long-range dependency challenging. Although LSTMs outperform vanilla RNNs, it has been observed that they usually retain only a relatively short span of context [4, 5]. Memory-augmented models and attention mechanisms have been proposed to increase the hidden state's capacity to retrieve information from hidden states in the more distant past. Though improved performance has been reported, RNNLMs with the standard softmax output still struggle with rare or unknown words, even with attention.

Since the self-attention architecture was proposed [6], deep Transformers have demonstrated state-of-the-art performance on natural language processing tasks [7, 8, 9]. Transformer-based LMs have outperformed RNNLMs on large corpora and have been used in the rescoring stage of ASR systems [10, 11]. However, their ability to capture long-term dependency, e.g. self-trigger effects (word repetitions), remains unclear.

In real scenarios, especially in conversations, after a word or phrase is spoken, it is highly likely to be spoken again [12, 13]. These self-triggers or topic-word effects can be captured by cache models, which store the unigram distribution of recently seen words.
Cache models adapt pre-trained LMs to local contexts (decoded hypotheses) in ASR systems and hence can improve ASR performance [14, 15]. Usually, cache models are integrated into pre-trained models at test time. This is a lightweight approach, as no model retraining is required, but it may not be optimal. Effectively incorporating them in the training stage and enabling neural LMs to learn to adapt to recent history remain to be explored.

In this work, we propose a cache-inspired approach for neural LMs to improve the capability of modeling long-term dependency, especially for rare words. The output is extended by a predefined size L to represent the L preceding words in history. The pre-softmax activation of the L units, like the other pre-softmax units, is computed by a linear transformation of the hidden state or context vector, and is then appended to the output before the softmax layer, as shown in Figure 1. The training loss is still cross entropy. However, unlike standard training, where the supervision comes from a vocabulary-sized one-hot vector encoding the predicted word, the supervision vector is now L bits longer and contains additional ones in each history position where the word is the same as the predicted one.

The extended output and modified supervision implicitly enable learning where in history to copy from. However, it may still be difficult for the model to learn which words are particularly likely to be self-triggers, i.e. when to copy. To provide a mechanism for this, one additional unit is introduced in the pre-softmax layer (but not included in the softmax computation) to capture the probability that the current word may be a self-trigger. At each word position, activations from these additional units in the L previous positions are added to the L extended output units.

Though cache-inspired neural LMs for improving long-range dependency have been proposed and have demonstrated superior performance to LSTMs in terms of perplexity, to the best of our knowledge, their effect on ASR accuracy remains to be explored. In this study, we evaluate these neural LMs on ASR tasks. We also apply the proposed approach to the Transformer architecture to verify whether cache-based information is still beneficial.
2. Related Work
In this section, we briefly introduce related work on approaches to improve performance on rare words and long-term dependency for sequence modeling problems, including neural language modeling and machine translation [16, 17, 18, 19, 20, 21]. Vinyals et al. [16] introduce an attention-based pointer network that selects items from the input as output. It has been shown to help on geometric problems [16]. The pointer network can also improve performance on text summarization [17, 18] and alleviate issues with rare or unknown words in neural machine translation [18].

For neural LMs, similar ideas have been proposed to better model long-range dependency [19, 20]. The most relevant work is the pointer sentinel mixture model (PSMM), a mixture model of a standard LSTM and an auxiliary pointer network which captures the unigram distribution of history words via attention [19]. The mixture weight is jointly optimized. A similar mixture model, the neural cache model [20], differs from the PSMM in aspects such as the query vector for computing attention scores being the hidden state itself instead of a projected version, and in not requiring model retraining. The motivation of dynamic evaluation [21] is similar to that of the neural cache model, but the implementation is different: it adjusts model parameters via gradient updates based on partial predicted sequences during test time. It may be viewed as a modified version of the dynamic updating method proposed by Mikolov et al. [1].
3. Proposed Model
Language modeling can be framed as predicting the next word (target) given preceding words (history). It can often be observed that some words tend to be much more likely targets once they have occurred in the history. The PSMM learns to "reproduce" a word from recent history with an attention-based pointer network, and its mixture weight is computed by a specially designed gating mechanism. Though the PSMM achieves lower perplexity than a standard LSTM, the attention and gating mechanisms are relatively complex and not the only way to achieve this effect. We aim to achieve a similar effect with simpler models.
Let us denote the hidden state of an RNNLM at time step $t$ as $h_t$. Conventionally, the RNNLM output $y_t$ is determined as
\[ y_t = \mathrm{softmax}(W h_t + b), \quad (1) \]
where $W \in \mathbb{R}^{V \times H}$, $b \in \mathbb{R}^{V}$, and $y_t \in \mathbb{R}^{V}$, with $V$ and $H$ being the vocabulary size and hidden state dimension, respectively.

We extend the output dimension by a predefined size $L$. The extended part represents the $L$ immediately preceding words in history. The activation of these $L$ extended units, denoted as $p_t$ in Figure 1, is computed via a linear projection of $h_t$ from the last hidden layer of the RNNLM. We thus have
\[ p_t = W_p h_t, \quad (2) \]
\[ z_t = \mathrm{concat}(W h_t + b,\; p_t), \quad (3) \]
\[ y_t = \mathrm{softmax}(z_t), \quad (4) \]
where $W_p \in \mathbb{R}^{L \times H}$, $z_t \in \mathbb{R}^{V+L}$, and $p_t \in \mathbb{R}^{L}$. Applying the softmax to $z_t$ generates the extended output $y_t \in \mathbb{R}^{V+L}$. Since the $L$ extended outputs indicate where to copy from the history, we call $p_t$ the pointer component of our model. It only introduces $L \times H$ additional parameters.

Figure 1: Neural LMs with implicit cache pointers.

The objective for training a neural LM is to maximize the log likelihood of the training data. The loss function is written as
\[ \mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log(y_t \cdot s_t), \quad (5) \]
where $T$ is the total number of words in the training data, $s_t$ is the supervision vector, and $\cdot$ is the vector dot product. In conventional training, $s_t$ is a one-hot vector with a 1 at the index of the target word. To train the proposed neural LM with the pointer component, the supervision vector $s_t$ is set to have additional ones in the history positions where the target was previously seen, so $s_t$ is an at-least-one-hot vector.

The pointer component and the modified training supervision make an RNNLM aware of where to copy from the history. However, it may still be challenging for the model to memorize which words are particularly likely to recur, i.e. are "bursty". To learn the burstiness of words, one additional unit, denoted by $m_t$, is introduced alongside the pointer $p_t$, as shown in Figure 1. This additional unit is computed by a dot product of the hidden state $h_t$ and a parameter vector with the same dimension as $h_t$, but it is not used in computing the softmax. It influences the probability that a word may repeat through $p_t$. Specifically, these additional units from the $L$ immediately preceding word positions are concatenated to form
\[ \mathbf{m}_t = \mathrm{concat}(m_{t-(L-1)}, \ldots, m_t), \quad (6) \]
where $\mathbf{m}_t \in \mathbb{R}^{L}$, and $\mathbf{m}_t$ is added element-wise to the pointer component $p_t$, i.e.
\[ p_t := p_t + \mathbf{m}_t. \quad (7) \]
Thus $\mathbf{m}_t$ influences the output $y_t$ indirectly by modifying $p_t$ in (3), which in turn is part of $z_t$ in (4).

Compared with an RNNLM, the memory-augmented pointer component has only $(L+1) \times H$ additional parameters in total, while a PSMM has $H^2 + 2H$ extra parameters. Without attention and gating mechanisms, the proposed model is simpler than the PSMM and has fewer extra parameters when $L \leq H$.

The pointer mechanism described above is for RNNLMs, but it can also be easily incorporated into Transformer-based LMs. In the latter, the context vector from the last Transformer block is treated as the hidden state in RNNLMs.
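To make the construction concrete, the following PyTorch sketch implements Eqs. (1)-(7) and the at-least-one-hot loss under our reading of the equations; it is illustrative rather than an exact reproduction of our implementation, and the helper names (CachePointerHead, cache_pointer_loss) and the history bookkeeping are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CachePointerHead(nn.Module):
    """Output layer with an implicit cache pointer (sketch of Eqs. 1-4, 6-7).

    h: (T, B, H) hidden states from the last LSTM layer or Transformer block.
    Returns pre-softmax logits z of size V + L, where the last L entries point
    at the L immediately preceding positions within the same chunk.
    """
    def __init__(self, hidden_dim, vocab_size, history_len):
        super().__init__()
        self.L = history_len
        self.word = nn.Linear(hidden_dim, vocab_size)                   # W, b in Eq. (1)
        self.pointer = nn.Linear(hidden_dim, history_len, bias=False)   # W_p in Eq. (2)
        self.memory = nn.Linear(hidden_dim, 1, bias=False)              # burstiness unit m_t

    def forward(self, h):
        word_logits = self.word(h)                                      # (T, B, V)
        p = self.pointer(h)                                             # (T, B, L), Eq. (2)
        m = self.memory(h).squeeze(-1)                                  # (T, B)
        # Eq. (6): gather m from positions t-(L-1) ... t, zero-padded at the chunk start.
        m_hist = F.pad(m, (0, 0, self.L - 1, 0)).unfold(0, self.L, 1)   # (T, B, L)
        p = p + m_hist                                                  # Eq. (7)
        return torch.cat([word_logits, p], dim=-1)                      # Eq. (3)

def cache_pointer_loss(z, targets, history):
    """Cross entropy against an at-least-one-hot supervision vector (Eq. 5).

    targets: (T, B) target word ids.
    history: (T, B, L) word ids at the positions the pointer slots refer to
             (use an id such as -1 where no history word exists).
    """
    V = z.size(-1) - history.size(-1)
    sup = torch.zeros_like(z)
    sup.scatter_(-1, targets.unsqueeze(-1), 1.0)                 # usual one-hot part
    sup[..., V:] = (history == targets.unsqueeze(-1)).float()    # ones where history == target
    log_probs = F.log_softmax(z, dim=-1)
    # Eq. (5): -log of the total probability mass on all supervised outputs.
    masked = log_probs.masked_fill(sup == 0, float("-inf"))
    return -torch.logsumexp(masked, dim=-1).mean()
```

Only the pointer matrix ($L \times H$) and the memory vector ($H$) are added on top of the base LM, matching the $(L+1) \times H$ parameter count above.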
4. Experimental Setup
We conduct experiments on two text datasets, Penn Treebank (PTB) and WikiText-2 [19], and two ASR corpora, Switchboard (SWBD) and Wall Street Journal (WSJ). We use Kaldi RNNLM [3] for data preprocessing on SWBD (including the English Fisher corpus) and WSJ. Sentences in SWBD+Fisher interleave conversation turns, as derived from time information in the transcriptions. Statistics of the datasets are shown in Table 1 ("sent len" is the average sentence length).

We develop baselines with both LSTM- and Transformer-based LMs. Model details are presented in Table 2.
Table 1: Statistics of datasets used in experiments.

Dataset | Train words* | Vocab size | sent len | OOV (train / dev / test) | Style
PTB | 929K | 10K | 21 | 4.8% / 4.7% / 5.8% | written
WikiText-2 | 2M | 33K | 22 | 2.6% / 5.4% / 6.2% | written
SWBD+Fisher | 34M | 30K | 10 | 8.9% / 0.0% / 5.8% | spoken
WSJ | 39M | 123K | 23 | 4.6% / 4.6% / 5.6% | written
* The end-of-sentence token is included in the count of training words.

Plain LSTMs are baselines for each dataset except WSJ. We have a stronger baseline for PTB and WikiText-2: AWD-LSTM [22] with frequency-agnostic word embeddings [23], denoted Frage-AWD-LSTM. For SWBD+Fisher, the stronger baseline is a Transformer LM with self-attention [6]. We only experiment with the Transformer architecture on WSJ. Given our academic computational resources, we were unable to make comparisons with even stronger baselines, e.g. GPT and BERT, or with the optimized architectures of [24], which require industrial-strength resources.

All neural LMs are word-level, implemented with PyTorch, and optimized via SGD (for the Transformer LMs, we tried Adam with the learning rate schedule proposed in [6], but failed to get better performance than SGD). We tie the embedding and output matrices in all setups. The dropout rate for PTB and WikiText-2 is 0.5, while for SWBD+Fisher it is 0.1 for both LSTM and Transformer LMs. Parameters of the Frage-AWD-LSTMs not listed in Table 2 follow the settings in [23].

Table 2: Details of neural network dimensions for various LMs.
Model | Corpus | Layers | Units | Heads
Plain LSTM | All (except WSJ) | 2 | 650 | -
Frage-AWD-LSTM | PTB / WikiText-2 | 3 | 1150 | -
Transformer | SWBD+Fisher / WSJ | 6 | 512 / 768 | 8
For ASR experiments on SWBD and WSJ, we use the Kaldi toolkit [25] to train acoustic models and perform N-best rescoring. The acoustic models are factorized TDNNs [26], trained with the LF-MMI objective [27]. We do not include the Fisher audio when training acoustic models for SWBD. To rescore each of the N hypotheses for an utterance, we find it useful to initialize the LM state with the last LM state of the best hypothesis for the previous utterance.
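As a concrete illustration, the sketch below shows this "state-carry" rescoring loop; the lm.score interface, the hypothesis field names, and the interpolation weight are placeholders rather than the exact Kaldi/PyTorch plumbing.

```python
def rescore_with_state_carry(lm, utterances, lm_weight=0.8):
    """N-best rescoring where each utterance's hypotheses start from the final
    LM state of the best-scoring hypothesis of the previous utterance.

    utterances: list of N-best lists; each hypothesis is a dict holding token
    ids and its first-pass score.  lm.score(tokens, initial_state) is assumed
    to return (log probability, final hidden state).
    """
    carried_state = None            # the first utterance starts from the initial state
    best_hyps = []
    for nbest in utterances:
        scored = []
        for hyp in nbest:
            lm_logprob, final_state = lm.score(hyp["tokens"], initial_state=carried_state)
            total = hyp["first_pass_score"] + lm_weight * lm_logprob
            scored.append((total, hyp, final_state))
        _, best, best_state = max(scored, key=lambda x: x[0])
        carried_state = best_state  # carry the state across the utterance boundary
        best_hyps.append(best)
    return best_hyps
```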
5. Experiments
We first compare the proposed model with the PSMM and neural cache under the plain LSTM setup. Perplexities on PTB and WikiText-2 are shown in Table 3. The performance gap between the PSMM in the original paper [19] and ours is mainly caused by different implementations of truncated back-propagation through time (BPTT). They use an explicit truncated BPTT, while we follow the normal way discussed in [19] for efficiency and convenience of data preprocessing: we first concatenate all text words and then chunk them with a fixed size L. So, if the truncated BPTT length is L, each training word on average experiences L/2 instead of L time steps of back-propagation. This means each training word sees L/2 history words on average, as sketched below.
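The chunking just described can be summarized as follows (a minimal sketch; the helper name is ours):

```python
def make_bptt_chunks(token_ids, bptt_len):
    """Concatenate all training text and split it into fixed-size BPTT chunks.

    Within a chunk, the token at offset i can only back-propagate through the
    i tokens before it, so the average in-chunk history is
    (0 + 1 + ... + (bptt_len - 1)) / bptt_len = (bptt_len - 1) / 2,
    i.e. roughly bptt_len / 2 rather than bptt_len.
    """
    return [token_ids[i:i + bptt_len] for i in range(0, len(token_ids), bptt_len)]
```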
Table 3: Perplexities on PTB and WikiText-2 (plain LSTMs).
Model | PTB | WikiText-2

In Table 3, "Memory Aug" refers to the memory-augmented pointer. We set the history length to 100, equal to the truncated BPTT length. Setting L = 50 for the neural cache makes it a fair comparison with the others, since with chunked BPTT each training word sees about L/2 = 50 history words on average. Results on both datasets show that memory augmentation provides further improvement on top of the pointer component, and with memory augmentation the proposed model outperforms the rest on both datasets. In subsequent tables, "Proposed" refers to LMs with the memory-augmented pointer.

To verify whether the proposed approach is robust, we conduct experiments on the stronger Frage-AWD-LSTM baseline setup [23]. We reproduced their results and implemented the proposed approach on top of theirs, without tuning meta-parameters. The perplexity results in Table 4 show that the proposed model achieves better results than Frage-AWD-LSTM on both datasets. Further improvements are observed when increasing the history length from 50 to 100, as expected. We also observe complementary effects between the proposed model and the neural cache model.

Table 4: Perplexities on PTB and WikiText-2 (Frage-AWD-LSTM setup).
Model | PTB | WikiText-2
We experiment with both LSTM- and Transformer-based LMs on SWBD. The perplexity on the dev set from the Kaldi RNNLM is 50. The history length, as well as the BPTT length, is set to 100 for the proposed model and the PSMM. Results of the PyTorch-trained models are in Table 5.

Table 5: Perplexities on SWBD.

Model | # Params | Dev | Eval
LSTM + Proposed | 26.5M | 45.9 | 40.4
Transformer w/o positional embedding [6] | 25.0M | 51.6 | 44.4
Transformer with positional embedding | 25.1M | 46.8 | 41.5
Transformer + Proposed | 25.1M | 45.0 | 40.2

For the LSTM-based models, the proposed approach outperforms the baseline LSTM, but performs slightly worse than the PSMM and neural cache models. For the Transformer LMs, the proposed approach also achieves better perplexity than the two Transformer baselines.

We notice that the performance gains of the proposed model over both the LSTM and Transformer baselines on SWBD are smaller than those on PTB and WikiText-2. To check whether this may relate to style (only SWBD is spoken style) and average sentence length (sentences in Switchboard are the shortest on average), we experiment with Transformer-based LMs on WSJ. The proposed approach reduces perplexity on the test set (eval92) from 71.5 to 65.6, compared with a baseline Transformer LM. The improvement is 8.2% relative, in a similar range to the gains on WikiText-2, and it results in a 0.1 absolute WER reduction, from 1.5 to 1.4, on the same test set by N-best rescoring. While no conclusions can be drawn yet, these experiments suggest that the proposed model performs better on written-style text, which usually has a longer average sentence length.

One desired property of the proposed model is that it may predict rare words better than LSTMs. To verify this, we further examine LM performance on the test set of each dataset. We split the vocabulary of each corpus into 10 buckets based on word frequencies in the training data, such that we get a roughly equal number of test tokens in each bucket. We then compute the differences between the test cross entropy of the proposed model and the LSTM baseline on the words in each bucket for each dataset. Figure 2 shows the results on WikiText-2. As expected, larger reductions in cross entropy are observed from the proposed model on rare words. Similar trends are seen on PTB and SWBD, though the overall perplexity improvement on SWBD is marginal. Figures for them are omitted due to page limits.
Figure 2: Cross-entropy reduction from the proposed model w.r.t. an LSTM on WikiText-2 (word buckets of equal size, rare words on the left; higher is better).
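A sketch of the bucketing analysis behind Figure 2 is given below; the function names are ours, and the exact handling of ties between equally frequent words may differ from what we actually ran.

```python
from collections import Counter

def frequency_buckets(train_tokens, test_tokens, num_buckets=10):
    """Split word types into buckets by training frequency (rarest first) so
    that each bucket covers a roughly equal number of test tokens."""
    train_freq = Counter(train_tokens)
    test_freq = Counter(test_tokens)
    vocab = sorted(test_freq, key=lambda w: train_freq[w])   # rare-to-frequent in training
    target = len(test_tokens) / num_buckets
    buckets, current, covered = [], set(), 0
    for w in vocab:
        current.add(w)
        covered += test_freq[w]
        if covered >= target * (len(buckets) + 1) and len(buckets) < num_buckets - 1:
            buckets.append(current)
            current = set()
    buckets.append(current)
    return buckets

def bucket_ce_reduction(buckets, test_tokens, ce_baseline, ce_proposed):
    """Average per-bucket reduction in test cross-entropy (baseline minus
    proposed); ce_* hold one cross-entropy value per test position."""
    reductions = []
    for bucket in buckets:
        idx = [i for i, w in enumerate(test_tokens) if w in bucket]
        reductions.append(sum(ce_baseline[i] - ce_proposed[i] for i in idx) / max(len(idx), 1))
    return reductions
```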
As the Transformers show similar perplexity gains to the LSTMs on SWBD, we only experiment with LSTMs for N-best rescoring. WERs on the full HUB5'00 evaluation set (Eval'00), the Switchboard subset (SWB), and the Callhome subset (CH) are in Table 6. "State-carry" means that when scoring a hypothesis for the current utterance, the initial hidden state is copied from the last hidden state of the best hypothesis for the previous utterance, instead of being zero-initialized. The WER improvements by the LSTM with state-carry in Table 6 indicate that cross-sentence context is useful; similar observations are presented in [30]. If not specified with "w/o state-carry", models are evaluated in the state-carry way.

To investigate the effect on the WERs of rare words, we conduct an analysis on Eval'00 similar to that of Section 5.3.

Table 6: WERs by N-best rescoring with baselines and the proposed model on SWBD.
Model | Eval'00 | SWB | CH
Kaldi RNNLM (w/o state-carry) | 11.3 | 7.5 | 15.0
LSTM (w/o state-carry) | 11.2 | 7.3 | 15.1
LSTM | 10.9 | 7.1 | 14.5
PSMM | 10.9 | 7.1 | 14.6
LSTM + Neural Cache | 10.9 | 7.2 | 14.5
LSTM + Proposed | | |
On Eval'00, relatively rare words such as masters and offered are recognized correctly. The decoded output also shows that frequent words such as train, short, and were are correctly recognized. Though the overall WER improvement by the proposed model is marginal, correctly recognizing relatively rare words plays an important role in the user experience of ASR-based products or services.

Figure 3: Relative WER reduction by the proposed model w.r.t. an LSTM on SWBD (word buckets of equal size, rare words on the left).
We also notice that the proposed model sometimes introduces errors on words that are wrongly recognized in first-pass decoding. A possible reason is that the supervision vectors for the pointer component come from decoded hypotheses and hence may contain errors. We verify this by using the test transcriptions in rescoring and observe a further 0.1 absolute WER reduction on Eval'00 of SWBD. This mismatched condition between training and evaluation is a common issue for the proposed approach, the PSMM, and the neural cache model. To alleviate the mismatch, word-level confidence scores and error-adaptive training approaches could be considered.
6. Conclusion and Future Work
In this work, we propose a cache-inspired pointer mechanism for neural LMs to improve the capacity for modeling long-range dependency and to better predict rare words. It can be applied to both RNN- and Transformer-based models. Perplexity evaluations show that the proposed approach generally outperforms the LSTM and the PSMM and is more effective on rare words. Rescoring with the proposed model on SWBD and WSJ gives marginal WER improvements. Analysis shows that the mismatch between training and rescoring conditions (i.e., potentially incorrect histories) may make it challenging for both the proposed model and models with an attention-based pointer network to achieve large overall WER reductions. Future work will therefore focus on methods that can mitigate this mismatch.

7. References

[1] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Proc. of Interspeech, 2010.
[2] X. Chen, X. Liu, M. J. Gales, and P. C. Woodland, "Recurrent neural network language model training with noise contrastive estimation for speech recognition," in Proc. of ICASSP, 2015.
[3] H. Xu, K. Li, Y. Wang, J. Wang, S. Kang, X. Chen, D. Povey, and S. Khudanpur, "Neural network language modeling with letter-based features and importance sampling," in Proc. of ICASSP, 2018.
[4] U. Khandelwal, H. He, P. Qi, and D. Jurafsky, "Sharp nearby, fuzzy far away: How neural language models use context," in Proc. of ACL, 2018.
[5] C. Chelba, M. Norouzi, and S. Bengio, "N-gram language modeling using recurrent neural network estimation," arXiv preprint arXiv:1703.10724, 2017.
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. of NeurIPS, 2017.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. of NAACL, 2019.
[8] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.
[9] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," in Proc. of ACL, 2019.
[10] K. Irie, A. Zeyer, R. Schlüter, and H. Ney, "Language modeling with deep transformers," in Proc. of Interspeech, 2019.
[11] K. Li, Z. Liu, T. He, H. Huang, F. Peng, D. Povey, and S. Khudanpur, "An empirical study of transformer-based neural language model adaptation," in Proc. of ICASSP, 2020.
[12] R. Lau, R. Rosenfeld, and S. Roukos, "Trigger-based language models: A maximum entropy approach," in Proc. of ICASSP, 1993.
[13] K. W. Church, "Empirical estimates of adaptation: the chance of two noriegas is closer to p/2 than p^2," in Proc. of COLING, 2000.
[14] R. Kuhn and R. De Mori, "A cache-based natural language model for speech recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 6, pp. 570-583, 1990.
[15] K. Li, H. Xu, Y. Wang, D. Povey, and S. Khudanpur, "Recurrent neural network language model adaptation for conversational speech recognition," in Proc. of Interspeech, 2018.
[16] O. Vinyals, M. Fortunato, and N. Jaitly, "Pointer networks," in Proc. of NeurIPS, 2015.
[17] J. Gu, Z. Lu, H. Li, and V. O. Li, "Incorporating copying mechanism in sequence-to-sequence learning," in Proc. of ACL, 2016.
[18] C. Gulcehre, S. Ahn, R. Nallapati, B. Zhou, and Y. Bengio, "Pointing the unknown words," in Proc. of ACL, 2016.
[19] S. Merity, C. Xiong, J. Bradbury, and R. Socher, "Pointer sentinel mixture models," in Proc. of ICLR, 2017.
[20] E. Grave, A. Joulin, and N. Usunier, "Improving neural language models with a continuous cache," in Proc. of ICLR, 2017.
[21] B. Krause, E. Kahembwe, I. Murray, and S. Renals, "Dynamic evaluation of neural sequence models," in Proc. of ICML, 2018.
[22] S. Merity, N. S. Keskar, and R. Socher, "Regularizing and optimizing LSTM language models," in Proc. of ICLR, 2018.
[23] C. Gong, D. He, X. Tan, T. Qin, L. Wang, and T.-Y. Liu, "FRAGE: Frequency-agnostic word representation," in Proc. of NeurIPS, 2018.
[24] C. Wang, M. Li, and A. J. Smola, "Language models with transformers," arXiv preprint arXiv:1904.09408, 2019.
[25] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proc. of ASRU, 2011.
[26] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Proc. of Interspeech, 2018.
[27] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Proc. of Interspeech, 2016.
[28] T. Mikolov and G. Zweig, "Context dependent recurrent neural network language model," in Proc. of SLT, 2012.
[29] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
[30] K. Irie, A. Zeyer, R. Schlüter, and H. Ney, "Training language models for long-span cross-sentence evaluation," in Proc. of ASRU, 2019.