Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation
Yerbolat Khassanov, Zhiping Zeng, Van Tung Pham, Haihua Xu, Eng Siong Chng
School of Computer Science and Engineering, Nanyang Technological University, Singapore
Temasek Laboratories, Nanyang Technological University, Singapore
{yerbolat002, zengzp, vantung001, haihuaxu, aseschng}@ntu.edu.sg

Abstract
Neural language models (NLMs) achieve strong generalization capability by learning dense representations of words and using them to estimate the probability distribution function. However, learning the representations of rare words is a challenging problem, causing the NLM to produce unreliable probability estimates. To address this problem, we propose a method to enrich the representations of rare words in a pre-trained NLM and consequently improve its probability estimation performance. The proposed method augments the word embedding matrices of the pre-trained NLM while keeping other parameters unchanged. Specifically, our method updates the embedding vectors of rare words using the embedding vectors of other semantically and syntactically similar words. To evaluate the proposed method, we enrich the rare street names in a pre-trained NLM and use it to rescore the 100-best hypotheses output from the Singapore English speech recognition system. The enriched NLM reduces the word error rate by about 6% relative and markedly improves the recognition accuracy of the rare words compared to the baseline NLM.

Index Terms: rare words, word embeddings, neural language models, speech recognition
1. Introduction
Neural language models (NLMs) have achieved great success in many speech and language processing applications [1, 2, 3]. In particular, they are widely employed in automatic speech recognition (ASR) systems to rescore the n-best hypotheses list, where state-of-the-art results are attained. Different from traditional count-based N-gram models that suffer from the data sparsity problem [4], NLMs possess superior generalization capability. The generalization is mainly achieved by learning dense vector representations of words as part of the training process and using them to express the probability function [1]. As a result, the learned word representations capture differences and commonalities between words, and thus enable NLMs to model different combinations of words, including ones unseen during training.

However, this concept assumes that each word appears a sufficient number of times in the training data. For rare words with few or no training samples, the learned representations will be poor [5]. Consequently, the NLM will assign them unreliable probability estimates. Moreover, the representation of a rare word is used as context for the neighbouring words; as such, the entire word sequence containing the rare word will be underestimated. The problem is exacerbated when a rare word is a named entity, such as the name of a person, location or organization, which are important keywords for downstream tasks such as voice search [6].

Currently, a common practice in language modeling is to ignore the rare word problem, for example, by limiting the NLM's vocabulary to the most frequent words and treating the remaining words as out-of-vocabulary (OOV), i.e. mapping them to a special <unk> token [7], or by training the NLM with the full vocabulary as usual. The former approach conflates all the meanings of rare words into a single representation, losing the properties of individual words. The latter approach results in low-quality rare word representations. For both approaches, the probability estimates of hypotheses incorporating the rare words will be unreliable, leading to sub-optimal performance of the NLM.

In this work, we propose an efficient method to enrich the vector representations of rare words in pre-trained NLMs. The proposed method augments the word embedding matrices of the pre-trained NLM while keeping other parameters unchanged. Specifically, our method shifts the rare word representation towards its semantic landmark in the embedding space using representations of other semantically and syntactically similar words. This approach has been shown effective for the word similarity task [8] and for vocabulary expansion in NLMs [9]. We further extend its application to rare word representation enrichment. To evaluate the proposed method, we first enrich the representations of rare Singapore street names in a pre-trained NLM and then use it to rescore the 100-best hypotheses output from a state-of-the-art Singapore English ASR system. The enriched NLM reduces the word error rate by about 6% relative and markedly improves the recognition accuracy of the rare words compared to the strong baseline NLM.

The rest of the paper is organized as follows. Section 2 reviews related approaches designed to deal with the rare word problem. In Section 3, we briefly describe the architecture of the baseline NLM. Section 4 presents the proposed embedding matrix augmentation technique. In Section 5, we explain the experiment setup and discuss the obtained results. Lastly, Section 6 concludes the paper.
2. Related works
Continuous vector representations of words are typically derived from large unlabeled corpora using co-occurrence statistics [10, 11]. They have become the dominant feature for many natural language processing applications, achieving state-of-the-art results. To generalize well, however, these methods require many occurrences of each word and fall short if a word appears only a handful of times [12]. Several approaches have been proposed to deal with the rare word problem, and most of them can be classified under one of the three main categories described below.
1) Morphological word representations.
A number of proposed works resort to subword-level linguistic units by breaking words down into morphemes [12, 13, 14]. For example, [12] represented a word as a function of its morphemes, where a recursive neural network is applied over morpheme embeddings to obtain the embedding of the whole word. While such works have proven effective at dealing with infrequent word variations, they depend on a morphological analyzer such as Morfessor [15] and are unable to model words whose morphemes are unseen during the training stage.
2) Character-level representations.
To alleviate the rare word problem, finer-level linguistic units such as syllables and characters have also been studied [16, 17, 18, 19]. For example, [16] explored both word-syllable and word-character level hybrid NLMs, where the most frequent words are kept unchanged while rare words are split into syllables and characters, respectively. In a similar fashion, [17] and [18] examined character-aware NLM architectures which rely only on character-level inputs, but where predictions are still made at the word level.

Character-level models eliminate the need for morphological tagging or manual feature engineering, and they contain substantially fewer parameters than word-level models. Moreover, these approaches succeed at capturing properties of morphologically related words (e.g. 'run' vs 'running'), but may fail to capture distinctions between semantically unrelated words (e.g. 'run' vs 'rung') [19, 5].
3) Knowledge-powered word representations.
Another direction of work leverages external knowledge to enhance representations of rare words [5, 20, 21]. For example, [5] employed word definitions obtained from WordNet [22] to model rare words on a separate network. Alternatively, [20] proposed to incorporate external knowledge as a regularization term in the original model's objective function. Although these approaches have shown promising results, they depend heavily on the availability of external hand-engineered lexical resources.

Note that the aforementioned approaches can also be used jointly, for example, by using a factored NLM architecture [23] where different feature types can be combined.
3. Baseline NLM architecture
NLM architectures can be generally classified into two main categories: feedforward [1] and recurrent [2]. Our method can be applied to both of them, but in this paper we focus on the recurrent architecture with LSTM units, which has been shown to achieve state-of-the-art results [24].

The conventional recurrent LSTM architecture can be decoupled into three main components, as shown in Figure 1: 1) input projection layer, 2) middle layers, and 3) output projection layer. The input layer is parameterized by the input embedding matrix $S$, used to map the one-hot encoding representation of word $w_t \in \mathbb{R}^{|V|}$ at time $t$ into a continuous vector representation $s_t$, where $|V|$ is the vocabulary size:

$$ s_t = S w_t \qquad (1) $$

The embedding vector $s_t$ and the high-level context feature vector from the previous time step $h_{t-1}$ are then combined by non-linear middle layers, which can be represented as a function $f(\cdot)$, to produce a new context feature vector $h_t$:

$$ h_t = f(s_t, h_{t-1}) \qquad (2) $$

The non-linear function $f(\cdot)$ can employ simple activation units such as ReLU and the hyperbolic tangent, or more complex units such as LSTM and GRU. The middle layers can also be formed by composing several such functions.

Figure 1: NLM architecture decomposed into three components.
Lastly, the context vector $h_t$ is fed to the output layer, which is parameterized by the output embedding matrix $U$, to produce a high-dimensional vector $y_t \in \mathbb{R}^{|V|}$:

$$ y_t = U^T h_t \qquad (3) $$

The entries of the output vector $y_t$ represent the scores of words to follow the context $h_t$. These scores are then normalized by the softmax function to form a probability distribution.

Our method modifies the embedding matrices $S$ and $U$ while keeping the middle layers $f(\cdot)$ intact, as will be explained in the next section.
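To make the decomposition concrete, the following is a minimal PyTorch sketch of such a three-component model. The class name, layer sizes and single-layer configuration are illustrative assumptions, not the exact setup used in the experiments later.

```python
import torch
import torch.nn as nn

class RecurrentNLM(nn.Module):
    """Word-level LSTM LM decomposed into the three components of
    Figure 1: input embedding S, middle layers f(), output embedding U.
    All sizes here are illustrative."""

    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.S = nn.Embedding(vocab_size, emb_dim)          # input embedding matrix S
        self.f = nn.LSTM(emb_dim, hidden_dim,
                         num_layers=1, batch_first=True)    # middle layers f()
        self.U = nn.Linear(hidden_dim, vocab_size,
                           bias=False)                      # output embedding matrix U

    def forward(self, word_ids, state=None):
        s = self.S(word_ids)          # Eq. (1): s_t = S w_t
        h, state = self.f(s, state)   # Eq. (2): h_t = f(s_t, h_{t-1})
        y = self.U(h)                 # Eq. (3): y_t = U^T h_t (pre-softmax scores)
        return y, state

# The scores are normalized with softmax to form the next-word distribution:
# probs = torch.softmax(y, dim=-1)
```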
4. Embedding matrix augmentation
We start with the assumption that we are given a pre-trained NLM which models the full vocabulary, including both frequent and rare words. In such models, the rare words will be poorly represented, leading to sub-optimal performance. Therefore, our goal is to enrich the representations of rare words without collecting additional training data or incurring expensive post-processing procedures.

To achieve this goal, we exploit the structure of the NLM, where the input and output layers are parameterized by word embedding matrices (see Figure 1). In particular, we propose to modify both the input and output embedding vectors of the rare words, while keeping the parameters of the middle layers unchanged. The embedding vectors of the rare words are modified using the embedding vectors of other semantically and syntactically similar words. This approach retains the linguistic regularities encapsulated within the original pre-trained NLM, given that the embeddings of the rare words are properly modified. Our method can also be viewed as a language model adaptation task [25], where instead of topic or speaking style, the vocabulary is adapted to conform with the words used in the target domain.

The proposed method has three main steps: 1) identifying the rare words, 2) finding similar words, and 3) enriching the rare word representations.
1) Identifying the rare words.
To identify the rare words, we can simply count the frequency of words in the training data and set a frequency threshold below which all words are considered rare. This approach, however, might result in too many rare words. To reduce the computation time, we can limit the rare words to those which appear in the n-best hypotheses or word lattice output, as in the sketch below.
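A minimal Python sketch of this counting step, assuming plain-text sentences; the helper names and the default threshold are illustrative (Section 5 uses a threshold of 10 for street names):

```python
from collections import Counter

def find_rare_words(corpus_sentences, threshold=10):
    """Return words occurring fewer than `threshold` times in the corpus."""
    counts = Counter(w for sent in corpus_sentences for w in sent.split())
    return {w for w, c in counts.items() if c < threshold}

def restrict_to_nbest(rare_words, nbest_hypotheses):
    """Optionally keep only the rare words that actually appear in the
    n-best output, to cut computation time as described above."""
    seen = {w for hyp in nbest_hypotheses for w in hyp.split()}
    return rare_words & seen
```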
2) Finding similar words.
Given a subset of rare words $V_{rare} \subset V$, the next step is to select a list of similar candidate words $C$ for each rare word. The selected candidates will be used to enrich the representations of rare words; hence, they must be frequent and present in the vocabulary $V$ of the NLM. In addition, they should be similar to the target rare word both in semantic meaning and in syntactic behavior. Note that selecting inadequate candidates might deteriorate the performance of the NLM; thus, they should be carefully inspected.

Several effective methods exist to find appropriate candidate words: for example, using lexical resources that contain synonyms and related words, such as WordNet, or employing pre-trained word embeddings from skip-gram or CBOW models [10], which can also find similar words (an embedding-based variant is sketched below). In our experiments, we use a lexical resource containing a list of Singapore street names, where frequent street names are used to update the representations of rare street names.
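As an illustration of the embedding-based alternative, the sketch below ranks frequent in-vocabulary words by cosine similarity to a rare word in a pre-trained embedding space. The function and variable names are assumptions; our experiments use a curated street name list instead.

```python
import numpy as np

def top_candidates(rare_word, word_vectors, frequent_words, k=5):
    """Rank `frequent_words` by cosine similarity to `rare_word` in a
    pre-trained embedding space (e.g. skip-gram or CBOW vectors).
    `word_vectors` maps word -> np.ndarray."""
    target = word_vectors[rare_word]

    def cosine(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

    scored = [(w, cosine(target, word_vectors[w]))
              for w in frequent_words
              if w in word_vectors and w != rare_word]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]
```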
3) Enriching rare word representations.
Let $s_r$ be the embedding vector of some rare word $w_r$ in the space defined by the input embedding matrix $S$, and let $C_r$ be the corresponding set of similar words. We enrich the representation $s_r$ using the words in $C_r$ by the following formula:

$$ \hat{s}_r = \frac{s_r + \sum_{s_c \in C_r} m_c s_c}{|C_r| + 1} \qquad (4) $$

where $\hat{s}_r$ is the enriched representation of $s_r$, $s_c$ is the embedding vector of a similar candidate word, and $m_c$ is a metric used to weigh candidates by importance. The $m_c$ can be estimated using frequency counts or similarity scores, where the most frequent or most similar candidates are given higher weights. In our experiments, we weigh the candidates equally.

Eq. (4) typically shifts the embedding of a rare word towards the weighted centroid of its semantic landmark. The motivation is that highly correlated words, in terms of both semantic meaning and syntactic behavior, should be close to each other in the embedding space. We then use the same candidates and formula to update the corresponding rare word embedding $u_r$ in the output embedding matrix $U$. This procedure is repeated for the remaining words in the subset $V_{rare}$.
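A minimal sketch of this update, assuming the RecurrentNLM layout from the Section 3 sketch and equal candidate weights ($m_c = 1$); the function name and interface are illustrative:

```python
import torch

@torch.no_grad()
def enrich_rare_word(model, word2id, rare_word, candidates, weights=None):
    """Apply Eq. (4) to both embedding matrices of a pre-trained model
    (RecurrentNLM from the earlier sketch), leaving all other
    parameters untouched. With weights=None every candidate gets m_c = 1."""
    r = word2id[rare_word]
    cand_ids = [word2id[c] for c in candidates]
    if weights is None:
        weights = [1.0] * len(cand_ids)
    m = torch.tensor(weights).unsqueeze(1)           # shape (|C_r|, 1)

    # Input embedding S and output embedding U both store one row per word.
    for matrix in (model.S.weight, model.U.weight):
        cands = matrix[cand_ids]                     # embeddings of similar words
        matrix[r] = (matrix[r] + (m * cands).sum(0)) / (len(cand_ids) + 1)
```

The same call is repeated for every word in $V_{rare}$; no retraining or gradient step is involved, which is what keeps the procedure cheap.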
5. Experiments
In this section, we describe experiments conducted to evaluate the effectiveness of the proposed embedding matrix augmentation technique. In particular, we first enrich the rare Singapore street name representations in a pre-trained NLM and then use the enriched NLM to rescore the 100-best hypotheses output from a Singapore English ASR system. The ASR system is built with the Kaldi [26] speech recognition toolkit using a Singapore English speech corpus. To highlight the importance of enriching the rare word representations, we use a 1-hour recording of 9 read articles about Singapore streets as the evaluation set (7.3k words; see https://github.com/khassanoff/SG_streets).

We compare our enriched model against three state-of-the-art language models (LMs): a Kneser-Ney smoothed 4-gram (KN4), Kaldi-RNNLM [27] and a recurrent LSTM LM [3]. Our model is obtained by enriching the representations of rare Singapore street names in the recurrent LSTM LM, and we call it E-LSTM. The performance of these four LMs is evaluated on the 100-best rescoring task.

5.1. Setup

Acoustic model. The acoustic model (AM) is built using the 'nnet3+chain' setup of Kaldi and trained on 270 hours of transcribed Singapore English data, which mostly consists of speech taken from parliament, talk shows and interviews.
Lexicon.
The lexicon is constructed by assembling the unique words of the training data, which include the Singapore street names. To avoid ambiguity, street names consisting of more than one word are joined using the underscore symbol, e.g. 'Boon Lay' is changed to 'Boon_Lay'. This lexicon is also used as the vocabulary set for the LMs.
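A small sketch of this underscore normalization; the function and the sample names are illustrative:

```python
def join_multiword_names(text, street_names):
    """Replace multi-word street names with underscore-joined tokens,
    e.g. 'boon lay place' -> 'boon_lay_place'. Longer names are handled
    first so that nested names don't clobber longer matches."""
    for name in sorted(street_names, key=len, reverse=True):
        text = text.replace(name, name.replace(" ", "_"))
    return text

print(join_multiword_names("a market began operations at boon lay place",
                           ["boon lay place", "boon lay"]))
# -> a market began operations at boon_lay_place
```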
Language model. To train the LMs, we used the AM transcripts and web-crawled Singapore-related data, resulting in a total of 1M in-domain sentences (16M words). In addition, we used Google's 1 billion word (1BW) benchmark corpus [28] to account for generic English word sequence statistics.

The KN4 model is trained on the combined in-domain (AM transcripts + web-crawled) and generic 1BW data. It was built using the SRILM toolkit [29] with the lexicon above as its vocabulary. We used the KN4 model to rescore both the word lattice and the 100-best list; its pruned version, KN4 pruned, was used during the decoding stage. We also examined other N-gram orders and found the 4-gram to work best for our case.

The Kaldi-RNNLM is a word-character level hybrid model designed to overcome the rare word problem by decomposing rare words into character n-grams while keeping the most frequent words unchanged. It was trained as an LSTM using only the in-domain data, with the input and output embedding matrices tied. For the vocabulary, we tried keeping different numbers of most frequent words and chose the best-performing size; the remaining 31k words were decomposed.

The recurrent LSTM LM is a word-level model built using our own implementation in PyTorch [30]. It was trained as a single-layer LSTM on the in-domain data with the same vocabulary set. The parameters of the model were learned by truncated BPTT [31] and SGD with gradient clipping, and we applied dropout for regularization [32]. Changing the number of layers and their size didn't improve the WER.

Lastly, our E-LSTM model is obtained by enriching the rare word representations in the pre-trained recurrent LSTM LM. As a case study, we use Singapore street names, where frequent street names are used to enrich the rare street representations. In particular, we first count the frequency of each street name in the in-domain data and then divide the street names into frequent and rare subsets using a threshold value. Next, we randomly choose words from the subset of frequent streets and employ them to enrich the representations of all rare streets (allStreets) using Eq. (4). To reduce computation time, we also tried enriching only the rare streets present in the 100-best hypotheses output (fromNbest). The list of Singapore street names is obtained from https://geographic.org/streetview/singapore/; for consistency, we fix the chosen frequent streets to be the same across experiments.

5.2. Results

The experiment results are shown in Table 1. In these experiments, we divide the street names into frequent and rare subsets using a threshold value of 10, and to enrich the rare streets we use 5 randomly chosen frequent street names. The initial word error rate (WER) without any rescoring is 17.07%.

The obtained results show that the E-LSTM model outperforms the strong KN4 used to rescore the word lattice by 16% relative WER (from 16.52% to 13.87%), and it achieves around 6% relative WER improvement over the Kaldi-RNNLM and LSTM models (from 14.74% to 13.87%). We found that enriching only the rare streets present in the 100-best hypotheses (fromNbest) achieves a similar result to enriching all rare streets (allStreets), while being much faster.

Table 1: Perplexity and WER results on the evaluation set

LM                     Perplexity   Rescore    WER
KN4 pruned             436          -          17.07%
KN4                    351          Lattice    16.52%
KN4                    351          100-best   16.84%
Kaldi-RNNLM            -            100-best   14.73%
  +KN4                 -            100-best   14.10%
LSTM                   295          100-best   14.74%
  +KN4                 -            100-best   14.95%
E-LSTM (allStreets)    242          100-best   13.87%
  +KN4                 -            100-best   13.58%
E-LSTM (fromNbest)     234          100-best
  +KN4                 -            100-best

State-of-the-art WER results are usually achieved by interpolating NLMs with count-based N-gram models, which have been shown to complement each other [2, 3]. To this end, we interpolated the NLMs with KN4 and achieved further WER reductions. Interestingly, the baseline LSTM model doesn't benefit from KN4, while E-LSTM gains an additional 2% relative WER improvement (from 13.87% to 13.58%). The interpolation weight for KN4 is fixed; changing the interpolation weight didn't help.
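For illustration, a sketch of 100-best rescoring with linear interpolation of the two LMs is given below. The scoring callables, the interpolation weight and the acoustic scale are assumed placeholders, not the values used in our experiments.

```python
import math

def rescore_nbest(hypotheses, nlm_logprob, ngram_logprob,
                  lam=0.5, am_weight=1.0):
    """Pick the best hypothesis from an n-best list by linearly
    interpolating NLM and N-gram word probabilities, then combining
    with the acoustic score.

    `hypotheses`: list of (acoustic_logprob, word_list) pairs.
    `nlm_logprob(words, t)` / `ngram_logprob(words, t)`: callables
    returning log P(words[t] | words[:t]) under each model.
    `lam`, `am_weight`: illustrative tuning knobs."""
    best, best_score = None, -math.inf
    for am_score, words in hypotheses:
        lm_score = 0.0
        for t in range(len(words)):
            # Interpolate in the probability domain, then take the log.
            p = (lam * math.exp(nlm_logprob(words, t)) +
                 (1 - lam) * math.exp(ngram_logprob(words, t)))
            lm_score += math.log(p)
        total = am_weight * am_score + lm_score
        if total > best_score:
            best, best_score = words, total
    return best
```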
5.2.1. Changing the frequency threshold

To determine an effective range for the frequency threshold used to split the street names into frequent and rare subsets, we repeat the experiment with different threshold values, as shown in Figure 2. We observe that a broad middle range of threshold values is sufficient to achieve good results, whereas setting the threshold too low or too high deteriorates the WER, as can be seen from the left and right tails of the plot in Figure 2.

Figure 2: WER performance of E-LSTM (fromNbest) and E-LSTM (allStreets) at different frequency threshold values used to split frequent and rare street names.

5.2.2. Changing the number of frequent words
We also repeat the experiment to determine the optimal number of frequent words used to enrich the rare words. We observed that the WER results are similarly good for all the cases we tried; for fast computation, we recommend using only a handful of the most frequent words. This experiment is incomplete, as we didn't examine the quality of the selected frequent words, which requires more substantial analysis; due to space limitations, we leave further analysis for future work.
5.3. Recognition accuracy of rare street names

To ensure that the WER improvements are achieved as a result of correctly recognizing the enriched rare street names, we compute the recognition accuracy of the rare street names (see Table 2). The results show that after enriching the baseline recurrent LSTM LM, the recognition accuracy increases substantially over the 43.02% of the baseline, achieving the best result among all LMs. Furthermore, we observe that correctly recognizing the rare street names also helps to recover neighbouring words (see Table 3). These results confirm the effectiveness of the proposed method.

Table 2: Recognition accuracy of 265 rare street names

LM                    Rescore    Accuracy
KN4 pruned            -          37.36%
KN4                   Lattice    36.98%
Kaldi-RNNLM           100-best   43.40%
LSTM                  100-best   43.02%
E-LSTM (fromNbest)    100-best

Table 3: Examples of correctly recovered neighbouring words after rescoring with E-LSTM

KN4 pruned:
  1) a hawker centre and market began operations at "bully plays" in nineteen seventy six
  2) by nineteen ninety four when the "book" development guide plan was announced
LSTM:
  1) a hawker centre and market began operation at "bully police" in nineteen seventy six
  2) by nineteen ninety four when the "product" development "gap" plan was announced
E-LSTM:
  1) a hawker centre and market began operations at "boon lay place" in nineteen seventy six
  2) by nineteen ninety four when the "bedok" development guide plan was announced
6. Conclusions
In this work, we proposed an effective method to enrich the representations of rare words in a pre-trained NLM. The proposed method augments the embedding matrices of the pre-trained NLM while keeping other parameters unchanged. Importantly, it doesn't require additional in-domain data or expensive post-training procedures. We applied our method to enrich the rare Singapore street names in a pre-trained LSTM LM and used it to rescore the 100-best lists generated by a state-of-the-art Singapore English ASR system. The enriched LSTM LM achieved around 6% relative WER improvement over the baseline LSTM LM, and in comparison to the other strong baselines it achieves around 6% and 16% relative WER improvement over Kaldi-RNNLM and KN4, respectively. Moreover, the enriched LSTM substantially increased the recognition accuracy of rare street names. We believe that the proposed method can benefit other models with similar network architectures and can be easily adapted to other scenarios.
7. Acknowledgements
This work is supported by the project of the Alibaba-NTU Singapore Joint Research Institute.

8. References

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003.
[2] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Proc. INTERSPEECH, 2010.
[3] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in Proc. INTERSPEECH, 2012.
[4] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," Computer Speech & Language, vol. 13, no. 4, pp. 359-394, 1999.
[5] D. Bahdanau et al., "Learning to compute word embeddings on the fly," arXiv preprint arXiv:1706.00286, 2017.
[6] J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, and B. Strope, "Your word is my command: Google search by voice: A case study," in Advances in Speech Recognition. Springer, 2010, pp. 61-90.
[7] J. Park, X. Liu, M. J. Gales, and P. C. Woodland, "Improved neural network based language modelling and adaptation," in Proc. INTERSPEECH, 2010.
[8] M. T. Pilehvar and N. Collier, "Inducing embeddings for rare and unseen words by leveraging lexical resources," in Proc. EACL, 2017, pp. 388-393.
[9] Y. Khassanov and C. E. Siong, "Unsupervised and efficient vocabulary expansion for recurrent neural network language models in ASR," in Proc. INTERSPEECH, 2018, pp. 3343-3347.
[10] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[11] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proc. EMNLP, 2014, pp. 1532-1543.
[12] T. Luong, R. Socher, and C. Manning, "Better word representations with recursive neural networks for morphology," in Proc. CoNLL, 2013, pp. 104-113.
[13] A. Lazaridou et al., "Compositional-ly derived representations of morphologically complex words in distributional semantics," in Proc. ACL, 2013, pp. 1517-1526.
[14] S. Qiu, Q. Cui, J. Bian, B. Gao, and T.-Y. Liu, "Co-learning of word representations and morpheme representations," in Proc. COLING, 2014, pp. 141-150.
[15] M. Creutz and K. Lagus, "Unsupervised models for morpheme segmentation and morphology learning," ACM Transactions on Speech and Language Processing (TSLP), vol. 4, no. 1, p. 3, 2007.
[16] T. Mikolov, I. Sutskever, A. Deoras, H.-S. Le, and S. Kombrink, "Subword language modeling with neural networks," 2012.
[17] W. Ling et al., "Finding function in form: Compositional character models for open vocabulary word representation," in Proc. EMNLP, 2015, pp. 1520-1530.
[18] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in Proc. AAAI, 2016, pp. 2741-2749.
[19] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics (TACL), vol. 5, pp. 135-146, 2017.
[20] C. Xu et al., "RC-NET: A general framework for incorporating knowledge into word representations," in Proc. CIKM, 2014, pp. 1219-1228.
[21] M. Faruqui et al., "Retrofitting word vectors to semantic lexicons," in Proc. NAACL HLT, 2015, pp. 1606-1615.
[22] G. A. Miller, "WordNet: a lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39-41, 1995.
[23] A. Alexandrescu and K. Kirchhoff, "Factored neural language models," in Proc. HLT-NAACL. ACL, 2006.
[24] M. Sundermeyer et al., "Comparison of feedforward and recurrent neural network language models," in Proc. ICASSP, 2013, pp. 8430-8434.
[25] Y. Khassanov et al., "Unsupervised language model adaptation by data selection for speech recognition," in Proc. ACIIDS. Springer, 2017, pp. 508-517.
[26] D. Povey et al., "The Kaldi speech recognition toolkit," in Proc. ASRU. IEEE Signal Processing Society, 2011.
[27] H. Xu et al., "Neural network language modeling with letter-based features and importance sampling," in Proc. ICASSP, 2018, pp. 6109-6113.
[28] C. Chelba et al., "One billion word benchmark for measuring progress in statistical language modeling," in Proc. INTERSPEECH, 2014, pp. 2635-2639.
[29] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proc. ICSLP, 2002.
[30] A. Paszke et al., "Automatic differentiation in PyTorch," 2017.
[31] P. J. Werbos, "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550-1560, 1990.
[32] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.