Accelerating recurrent neural network language model based online speech recognition system
Kyungmin Lee, Chiyoun Park, Namhoon Kim, and Jaewon Lee
DMC R&D Center, Samsung Electronics, Seoul, Korea
{k.m.lee, chiyoun.park, namhoon.kim, jwonlee}@samsung.com

ABSTRACT
This paper presents methods to accelerate recurrent neural network based language models (RNNLMs) for online speech recognition systems. Firstly, a lossy compression of the past hidden layer outputs (history vector) with caching is introduced in order to reduce the number of LM queries. Next, RNNLM computations are deployed in a CPU-GPU hybrid manner, which computes each layer of the model on the more advantageous platform. The overhead added by data exchanges between CPU and GPU is compensated through a frame-wise batching strategy. The performance of the proposed methods evaluated on LibriSpeech test sets indicates that the reduction in history vector precision improves the average recognition speed by 1.23 times with minimal degradation in accuracy. On the other hand, the CPU-GPU hybrid parallelization enables RNNLM based real-time recognition with a four times improvement in speed.

Index Terms — Online speech recognition, language model, recurrent neural network, graphic processing unit
1. INTRODUCTION
A language model (LM) computes the likelihood of a given sentence and is used to improve the accuracy of an automatic speech recognition (ASR) system. Recent research has focused on neural network (NN) based LMs [1] because of their outstanding performance in generalizing from sparse data, which traditional n-gram based LMs could not do. In particular, recurrent neural network based LMs (RNNLMs) [2] do not even require Markov assumptions, as they can model word histories of variable length, and these virtues have helped improve the performance of many ASR systems [3, 4]. However, to our knowledge, they are not yet actively adopted in real-time ASR systems due to their high computational complexity.

Several attempts have been made to utilize RNNLMs for online decoding in real-time ASR systems [5, 6, 7]. However, they either simulate only some aspects of RNNLMs within traditional architectures [5, 6], or perform 2-pass decoding [7], which inherently cannot produce results before the end of the utterance is reached. There have also been attempts to apply RNNLMs directly to online ASR without approximation by eliminating redundant computations [8, 9, 10]. In our previous research [9], we successfully applied moderate-size RNNLMs directly to a CPU-GPU hybrid online ASR system with a cache strategy [10]. However, in order to apply it to a more complex task with bigger RNNLMs, we needed to accelerate it further.

Recent studies indicate that one can reduce the number of distinct RNN computations by treating similar past hidden layer outputs, also referred to as history vectors, as the same [11], and that RNNLMs can be accelerated with GPU parallelization [12]. In this paper, we attempt two different approaches in order to achieve real-time performance in a large RNNLM based ASR system. Firstly, a lossy compression is applied to the cache of the history vectors. The precision of the vectors can be controlled either by rounding to a smaller number of significant digits or, at an extreme, by storing only the sign of each element. Next, we propose GPU parallelization of the RNNLM computations, but only on selected layers. Instead of performing all RNNLM computations on the same platform, compute-intensive parts of the model are computed on GPUs, and the parts that need to utilize a large memory are calculated on CPUs. This method inherently increases the overhead of data transfer between CPUs and GPUs. This is handled by a batch transfer method that reduces both the number of communications and the size of the data blocks in the hybrid ASR system.

The paper is organized as follows. The architecture of our baseline ASR system is explained in Section 2. The lossy compression method for the history vectors is explained in Section 3. Section 4 explains how RNNLM rescoring is accelerated with CPU-GPU parallelization. Section 5 evaluates the performance improvements of the proposed methods, followed by the conclusion in Section 6.
2. ARCHITECTURE OF OUR BASELINE CPU-GPU HYBRID RNNLM RESCORING
In the CPU-GPU hybrid ASR system [13], the weighted finite state transducer (WFST) is composed of four layers, each representing an acoustic model (AM), a context model, a pronunciation model, and an LM. WFSTs output word hypotheses when they reach word boundaries during frame-synchronous Viterbi searches, and the hypotheses can be rescored by a separately stored RNNLM. However, in order to speed up on-the-fly rescoring based on RNNLMs, we needed to reduce redundant computations as much as possible. In this section, we briefly outline the architecture of our baseline CPU-GPU hybrid RNNLM rescoring proposed in [9]. The main highlights of our baseline architecture are the use of a gated recurrent unit (GRU) [14] based RNNLM, noise contrastive estimation (NCE) [15] at the output layer, an n-gram based maximum entropy (MaxEnt) bypass [16] from the input to the output layer, and cache based on-the-fly rescoring.
2.1. Gated Recurrent Unit (GRU)

We employed a GRU, which is a type of gated RNN [14]. The GRU is a mechanism designed to prevent the vanishing gradient problem related to long-term dependencies by using reset gates and update gates. To calculate the output vector of a GRU hidden layer, a total of six weight matrices and three bias vectors need to be loaded into memory, since two weight matrices and one bias vector are required for each of the two gates and the candidate activation (see the code sketch following Section 2.3). Thus the memory usage can go up to several megabytes even if the weights are stored in a single-precision floating-point format. The computational complexity of the GRU computations is O(H × H) for a hidden layer of size H. This is a highly compute-intensive task considering that the number of unique LM queries in decoding an utterance can reach several hundreds of thousands.

2.2. Noise Contrastive Estimation (NCE)

In order to guarantee that the scores calculated at the output layer of an RNNLM are valid probabilities, they need to be normalized over different word sequences. The normalization is a highly computationally intensive task considering that the vocabulary size V can reach millions. In order to address this, we employ NCE at the output layer [15]. NCE is a sampling-based approximation method that treats partition functions as separate parameters and learns them by non-linear logistic regression. The variances of the partition functions estimated by NCE are often limited to small values [17], allowing us to use the unnormalized scores without significant reduction in the recognition accuracy. Even though the only required computations are inner products between the GRU outputs and the NCE weights corresponding to the current word, the NCE weight matrix of size H × V needs to be loaded into memory.

2.3. Maximum Entropy (MaxEnt) bypass

The second strategy to reduce computation in our GRU based RNNLM is to use n-gram based MaxEnt bypass connections from the input to the output layer [16]. The MaxEnt scheme helps in maintaining a relatively small size for the hidden layer without significant reduction in recognition accuracy. The two parallel models, the main network consisting of the GRU and NCE, and the other consisting of the MaxEnt bypass connections, operate as an ensemble and can improve the overall recognition accuracy. In order to reduce the computational overhead of the bypass connections, we implemented a hash-based MaxEnt. This method requires loading a large hash table, proportional in size to the number of n-grams, in order to retrieve the probability for a given n-gram in constant time.
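To make the parameter and computation counts of Sections 2.1 and 2.2 concrete, the following is a minimal NumPy sketch of a single GRU step followed by an unnormalized NCE score. The variable names, weight scaling, and the exact gate formulation (the standard GRU of [14]) are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np

H, E = 256, 256       # hidden and embedding sizes
V = 10_000            # vocabulary size (200,000 in the paper; reduced here)

rng = np.random.default_rng(0)
def mat(*shape):      # small random weights, stand-ins for trained ones
    return (0.1 * rng.standard_normal(shape)).astype(np.float32)

# Six weight matrices and three bias vectors: an input and a recurrent
# matrix plus a bias for each of the reset gate, update gate, and
# candidate activation.
W_r, U_r, b_r = mat(H, E), mat(H, H), np.zeros(H, np.float32)
W_z, U_z, b_z = mat(H, E), mat(H, H), np.zeros(H, np.float32)
W_h, U_h, b_h = mat(H, E), mat(H, H), np.zeros(H, np.float32)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev):
    """One GRU step; the O(H*H) recurrent products dominate the cost."""
    r = sigmoid(W_r @ x + U_r @ h_prev + b_r)               # reset gate
    z = sigmoid(W_z @ x + U_z @ h_prev + b_z)               # update gate
    h_cand = np.tanh(W_h @ x + U_h @ (r * h_prev) + b_h)    # candidate
    return (1.0 - z) * h_prev + z * h_cand

# NCE-trained output layer: the unnormalized log-score of the queried
# word is a single inner product, with no softmax over all V words.
nce_w = mat(H, V)

def nce_score(h, word_id):
    return float(h @ nce_w[:, word_id])

h = gru_step(mat(E), np.zeros(H, np.float32))
print(nce_score(h, 42))
```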
Fig. 1. On-the-fly rescoring with the LM query cache (baseline).
2.4. Process flow of the baseline rescoring

The process flow diagram of our baseline CPU-GPU hybrid RNNLM rescoring is shown in Figure 1. LM queries with the same history and following word are deduplicated by applying a cache strategy at the start of the rescoring procedure [9]. After the deduplication, the embedding vectors corresponding to the word indices are retrieved by using an "Index Table". The RNNLM computations are then performed with the appropriate values in CPU memory. The results of the calculations are converted to indices, cached, and returned to the graph traversals.
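As a rough illustration of the deduplication step, the sketch below caches LM scores keyed by a (history id, word id) pair so that repeated queries skip the RNNLM computation entirely. The key layout and cache API are hypothetical, chosen only to mirror the flow in Figure 1.

```python
from typing import Callable, Dict, Tuple

# Hypothetical LM-query cache mirroring Figure 1: each unique
# (history id, next word id) pair triggers at most one RNNLM evaluation.
class LMQueryCache:
    def __init__(self, score_fn: Callable[[int, int], float]):
        self.score_fn = score_fn                 # the expensive RNNLM call
        self.cache: Dict[Tuple[int, int], float] = {}
        self.hits = 0
        self.misses = 0

    def score(self, history_id: int, word_id: int) -> float:
        key = (history_id, word_id)
        if key in self.cache:
            self.hits += 1                       # deduplicated query
            return self.cache[key]
        self.misses += 1
        value = self.score_fn(history_id, word_id)
        self.cache[key] = value
        return value

# Usage: wrap an expensive scorer; repeated hypotheses with the same
# history and word hit the cache (around 89% of queries in [9]).
cache = LMQueryCache(lambda h, w: 0.0)  # stand-in scorer
cache.score(1, 42); cache.score(1, 42)
print(cache.hits, cache.misses)  # -> 1 1
```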
3. QUANTIZATION OF HISTORY VECTORS
The cache-based strategy for deduplicating LM queries dramatically accelerated our baseline RNNLM rescoring, with a cache hit ratio of around 89% and more than a 10 times reduction in computation time [9]. However, there is still room for improvement by extending this caching strategy to the outputs of the GRU hidden layers.

The current GRU hidden layer outputs, computed from the previous GRU hidden layer outputs (history vectors), could be shared between similar LM queries. Therefore, in order to reuse the precomputed history vectors, we created another cache for those vectors just before computing the RNNLM, as shown in Figure 2. The key of the cache is the GRU input, which is a pair of a word embedding and a history vector, and the value of the cache is the GRU hidden layer output corresponding to that input. The number of unique computations is further reduced by assuming that close history vectors would result in similar GRU hidden layer outputs, with negligible effect on the overall ASR results. Euclidean distance would be an easy way to measure similarity [11], but it would still require many computations that could slow down the whole rescoring process. Instead, we propose to quantize the history vectors by controlling the precision of the history vector itself, rounding it to a specified decimal place. We also consider an extreme case, in which we store only the sign of each element, as it would still capture some of the latent meaning which the hidden layers represent.

Fig. 2. Proposed on-the-fly rescoring with the cache of quantized history vectors.

Table 1 shows the possible reduction of computations for a four-second utterance. (Note that each element of the history vector ranges from -1 to 1.) The term "Precision" refers to the quantization of the history vectors to a specified decimal place.

Table 1. Redundancy rates of quantized history vectors.

Precision    Count    Redundancy rate
(baseline)   103,904   0.0 %
round-2      102,776   1.09 %
round-1      102,776   1.09 %
sign          88,749  14.59 %

After the initial deduplication, our baseline system has 103,904 unique LM queries, as can be observed in the first row of Table 1. The "round-2" row shows that only 1.09% of the computations can be reduced by caching the history vectors rounded to the second decimal place. Rounding the history vectors to the first decimal place yields no further redundancy. However, as shown in the last row of Table 1, with an extreme quantization where only the sign of each element is stored, we were able to reduce 14.59% of the computations. This relatively large reduction may affect the accuracy of the RNNLM results to some extent, since even after the extreme sign quantization there are still 2^256 possible unique history vectors for a hidden layer of size H = 256, but it is worth evaluating its effect on the ASR system.
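A minimal sketch of the proposed key construction, assuming NumPy vectors: the history vector is either rounded to a fixed decimal place or reduced to its element-wise signs, and the quantized bytes serve as the cache key. The function names are ours, not from the paper's implementation.

```python
import numpy as np

def quantize_history(h: np.ndarray, mode: str) -> bytes:
    """Map a history vector (elements in [-1, 1]) to a hashable cache key."""
    if mode == "round-2":
        q = np.round(h, 2)              # keep two decimal places
    elif mode == "round-1":
        q = np.round(h, 1)              # keep one decimal place
    elif mode == "sign":
        q = np.signbit(h)               # one bit per element: sign only
    else:                               # "baseline": no quantization
        q = h
    return q.tobytes()

# GRU-output cache keyed by (quantized history, next word id): close
# histories collapse to the same key and reuse the precomputed output.
gru_cache: dict = {}

def cached_gru_step(h_prev, word_id, gru_step, mode="sign"):
    key = (quantize_history(h_prev, mode), word_id)
    if key not in gru_cache:
        gru_cache[key] = gru_step(h_prev, word_id)
    return gru_cache[key]
```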
4. CPU-GPU HYBRID DEPLOYMENT OF RNNLM COMPUTATION
As described in Section 2, the proposed RNNLM model cannot be readily deployed on a GPU processor due to its large memory requirement. The word embedding step at the input layer requires space proportional to the size of the vocabulary, and the MaxEnt step at the output layer needs to maintain a large hash table that stores the n-grams and the corresponding scores. Also, the NCE step at the output layer requires loading an NCE weight matrix proportional to the size of the vocabulary. On the other hand, the hidden layer occupies only a fixed amount of memory but requires a large number of computations instead.
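A back-of-the-envelope check of this layer-wise split, in Python; the embedding size, n-gram count, and hash-entry size are illustrative assumptions, while H = 256 and V = 200,000 follow the experimental setup in Section 5.

```python
BYTES = 4  # single-precision float

H = 256               # hidden layer size (Section 5)
V = 200_000           # vocabulary size (Section 5)
E = 256               # embedding size (assumed)
NGRAMS = 100_000_000  # number of MaxEnt n-grams (illustrative)

embedding = V * E * BYTES                      # input layer: grows with V
nce_weights = H * V * BYTES                    # output layer: grows with V
gru = (3 * H * E + 3 * H * H + 3 * H) * BYTES  # hidden layer: fixed, small
maxent_hash = NGRAMS * (8 + 4)                 # hash key + score per n-gram

for name, size in [("embedding", embedding), ("NCE", nce_weights),
                   ("GRU", gru), ("MaxEnt hash", maxent_hash)]:
    print(f"{name:12s} {size / 2**20:10.1f} MiB")
# The vocabulary- and n-gram-sized parts stay on the CPU; only the small,
# compute-heavy GRU layer is worth placing on the GPU.
```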
Table 2. Operation times for each RNNLM computation step in seconds.

Processor  Transfer unit  Transfer count  Transfer time  Hidden layer  Output layer
CPU        -              -               -              6.23          0.04
GPU        LM query       102,172         5.94           2.15          0.06
GPU        Frame          518             0.60           2.26          0.03

The first row of Table 2 shows a profiling result of an RNNLM computation with a single layer of 128 GRU nodes on a three-second utterance. As expected, the hidden layer takes 99% of the overall computation, which we aim to reduce in this section. The high computational load of neural networks is easily accelerated by GPUs, but the large memory requirements of the word embeddings and MaxEnt prevent us from deploying the whole model there. Therefore, we deploy only the hidden layer part of the computation on the GPUs and keep the input embedding and output layer computations on the CPU side, as shown in Figure 3. As can be observed in the second row of Table 2, the hybrid deployment reduces the computation time for the hidden layer to one-third of what it was on the CPU alone.
Fig. 3. Proposed GPU based RNNLM rescoring with frame-wise batch data transfer.

However, this method also introduces a setback. Because only the middle layer of the RNNLM computation is deployed on the GPU side while its surrounding layers are computed on the CPUs, information needs to be shared across the two heterogeneous processor units frequently. As the number of data exchanges increases, the decoding speed of the hybrid ASR system inevitably decreases. The second row of Table 2 shows that there were more than a hundred thousand data exchanges during an utterance, which delayed the overall computation by 5.94 seconds, twice as long as the original utterance.

The frequency of data transfers between CPUs and GPUs affects the decoding speed more critically than the data size of each transfer. Therefore, we propose a method that reduces the number of data copies between CPUs and GPUs by concatenating the needed information into one block per frame. During the batching step, the history vectors and their next word embeddings that are emitted for each frame are stored in a consecutive CPU memory block, and the whole data block is transferred to GPU memory at once. The GRU outputs from the GPU are also copied back to the output layer computation in one data block. This effect can be observed in the last row of Table 2, in which the data transfer time is reduced to 10% of the original. In addition, this approach works in multi-GPU environments without additional operations by evenly distributing the block across GPUs, since the hidden layer calculations for each segment of the CPU memory block are not sequentially related to each other.
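The sketch below illustrates the frame-wise batching idea, with CuPy standing in for the CUDA/CUBLAS code the paper describes; the array layout and function names are our assumptions. All (history, embedding) pairs of a frame are packed into one contiguous block, moved in a single host-to-device copy, processed as one batched matrix product, and copied back once.

```python
import numpy as np
import cupy as cp  # stand-in for the paper's CUDA/CUBLAS implementation

H, E = 256, 256

def process_frame(histories, embeddings, U, W):
    """One frame: a single H2D copy, one batched product, a single D2H copy.

    histories:  (N, H) float32, N history vectors emitted in this frame
    embeddings: (N, E) float32, the matching next-word embeddings
    U, W:       (H, H) and (H, E) recurrent/input weights, already on GPU
    """
    # Pack both inputs into one contiguous CPU block, then transfer once.
    block = np.concatenate([histories, embeddings], axis=1)  # (N, H + E)
    block_gpu = cp.asarray(block)                            # 1 copy per frame

    h_gpu = block_gpu[:, :H]
    x_gpu = block_gpu[:, H:]
    # Batched matrix products for the whole frame (simplified: a full GRU
    # step would also apply the gates of Section 2, batched the same way).
    out_gpu = cp.tanh(h_gpu @ U.T + x_gpu @ W.T)

    return cp.asnumpy(out_gpu)                               # 1 copy back

if __name__ == "__main__":
    U = cp.asarray(np.eye(H, dtype=np.float32))
    W = cp.asarray(np.zeros((H, E), dtype=np.float32))
    hist = np.zeros((4, H), dtype=np.float32)
    emb = np.zeros((4, E), dtype=np.float32)
    print(process_frame(hist, emb, U, W).shape)  # (4, 256)
```

Because the rows of the block are independent, splitting it evenly across several GPUs needs no extra synchronization, which is why the method scales in multi-GPU environments.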
5. EXPERIMENTS

5.1. Experimental setup
The LMs in our experiments were trained on the training corpus of LibriSpeech [18]. To compare the performance with n-grams, the "4-gram full LM" provided with LibriSpeech was used. Both the vanilla-RNNLMs and the GRU-RNNLMs consisted of a single hidden layer and 4-gram based MaxEnt connections. The vocabularies used for all RNNLMs were the same as that of the "4-gram full LM" (V = 200,000). A bi-directional recurrent deep neural network (RDNN) based AM with three hidden long short-term memory (LSTM) layers (500 nodes per layer) and a softmax output layer was trained on about 7,600 hours of fully transcribed in-house English speech data, mostly consisting of voice commands and dialogs. The WFSTs were compiled with 2-gram LMs, and all epsilon transitions were removed so that computations on GPUs could be optimized.

The hardware for the evaluations was an Intel Xeon E5-2680 with 12 physical CPU cores and four Nvidia Tesla K80 GPUs, each equipped with 12 GB of memory. We used CUDA for GPU parallelization. CUBLAS, the linear algebra library of CUDA, was used for matrix multiplications, and kernel functions were implemented for relatively simple operations such as element-wise operations. For the RNNLM computations on CPUs, such as the output layer computations, we used EIGEN, a C++ based linear algebra library.

5.2. Evaluation results

In our experiments, LibriSpeech's development and test sets were used for the evaluations. The performance of the different LMs measured in terms of perplexity is shown in Figure 4. The term "other" in the evaluation sets means the speech data were recorded in noisy environments.

Fig. 4. Perplexities depending on LM types.

As can be seen in Figure 4, the vanilla-RNNLM of size 128 showed the worst accuracy over all the sets, even worse than that of the 4-gram LM. The accuracy of the vanilla-RNNLM improved dramatically for a hidden layer size of 256, showing the lowest perplexities among the vanilla models, but it was still worse than a 128-size GRU-RNNLM. The perplexities of the GRU-RNNLMs dropped by 7.81, 10.10, and 9.75 absolute (averaged over all four sets) for model sizes of 128, 256, and 512, respectively, as compared to the perplexity of the 4-gram LM. In all sets except "dev-other," the GRU-RNNLM of size 256 showed the lowest LM perplexity.

Table 3 shows the word error rate (WER) and the real-time factor (RTF) of the proposed methods for accelerating the online RNNLM rescoring. Unless otherwise mentioned in Table 3, all decoding options are the same for all the methods being compared. The values in the column "Precision" have the same meanings as in Table 1. Regarding recognition accuracy, the average WER of the baseline system improved by 3.39% relative over that of the 4-gram LM based system, as can be observed from the first two rows of Table 3.

Table 3. Performances on LibriSpeech's test sets; all evaluations were performed with the same decoding options.

LM                 Processor  Rescoring  Precision   dev-clean     test-clean    dev-other     test-other
                              threads                WER    RTF    WER    RTF    WER    RTF    WER    RTF
4-gram full        CPU        4          -           4.28   0.18   4.95   0.33   11.92  0.54   11.87  0.26
GRU-RNNLM (H=256)  CPU        4          (baseline)  4.05   2.16   4.69   2.19   11.70  3.58   11.47  3.37
                                         round-2     4.06   1.85   4.69   1.89   11.69  2.91   11.49  2.85
                                         round-1     4.05   1.82   4.69   1.87   11.69  2.95   11.48  2.89
                                         sign        4.06   1.79   4.69   1.80   11.69  2.82   11.47  2.75
                   GPU        1          -           4.05   1.10   4.69   1.08   11.70  1.94   11.49  2.29
                              2                      4.06   0.71   4.69   0.70   11.70  1.24   11.47  1.20
                              3                      4.05   0.58   4.68   0.63   11.69  0.98   11.47  0.97
                              4                      4.05   0.52   4.69   0.52   11.70  0.88   11.47  0.94
As anticipated in Section 3, caching quantized history vectors rounded to the first or second decimal place did not show a noticeable improvement in recognition speed compared to the baseline system. However, the proposed strategy of caching only the signs of the history vectors was 1.23 times faster than the baseline system without any degradation in accuracy.

As shown in the GPU rows of Table 3, with the proposed GPU parallelization method, even a single GPU rescoring thread was on average 1.43 times faster than the fastest CPU based system (sign). The recognition speed improves further with the use of multiple GPUs. In particular, when the number of GPUs was increased to two, the speed improved significantly, 1.61 times faster than the single-GPU system. When three GPUs were utilized, we attained real-time speech recognition over all the test sets. Finally, the RNNLM based ASR system with four GPUs showed the fastest average recognition speed of 0.72 RTF. It was three times faster than the fastest CPU based system and four times faster than the baseline system.
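For clarity, the perplexity (Figure 4) and real-time factor (Table 3) used throughout this section follow their standard definitions; the formulation below is ours, not reproduced from the paper:

```latex
% Perplexity over a test set of N words w_1, ..., w_N:
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln P(w_i \mid w_1,\ldots,w_{i-1})\right)
% Real-time factor: processing time relative to audio duration,
% so RTF < 1 means faster than real time:
\mathrm{RTF} = \frac{T_{\mathrm{processing}}}{T_{\mathrm{audio}}}
```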
6. CONCLUSION
We devised a faster RNNLM based on-the-fly rescoring on both CPU and GPU platforms by introducing a lossy compression strategy for the history vectors and a novel hybrid parallelization method. As cache hit ratios rose when the decimal precision of the vectors was lowered, speech recognition was sped up by 1.23 times. Although this was not a dramatic improvement, the fact that recognition rates were unaffected even when each dimension of the history vectors was stored as a single bit representing its sign suggests an efficient way to compress embedding vectors while minimizing the loss of information. Finally, with the CPU-GPU hybrid parallelization method, the decoding speed over all the test sets fell within real time.

7. REFERENCES

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
[2] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, "Recurrent neural network based language model," in Proc. Interspeech, 2010, pp. 1045–1048.
[3] S. Kombrink, T. Mikolov, M. Karafiat, and L. Burget, "Recurrent neural network based language modeling in meeting recognition," in Proc. Interspeech, 2011, pp. 5528–5531.
[4] O. Tilk and T. Alumäe, "Multi-domain recurrent neural network language model for medical speech recognition," in Proc. Human Language Technologies, 2014, vol. 268, pp. 149–152.
[5] G. Lecorvé and P. Motlicek, "Conversion of recurrent neural network language models to weighted finite state transducers for automatic speech recognition," in Proc. Interspeech, 2012, pp. 131–134.
[6] E. Arisoy, S. Chen, B. Ramabhadran, and A. Sethy, "Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 184–192, 2014.
[7] Y. Si, Q. Zhang, T. Li, J. Pan, and Y. Yan, "Prefix tree based n-best list re-scoring for recurrent neural network language model used in speech recognition system," in Proc. Interspeech, 2013, pp. 3419–3423.
[8] T. Hori, Y. Kubo, and A. Nakamura, "Real-time one-pass decoding with recurrent neural network language model for speech recognition," in Proc. ICASSP, 2014, pp. 6364–6368.
[9] K. Lee, C. Park, I. Kim, N. Kim, and J. Lee, "Applying GPGPU to recurrent neural network language model based fast network search in the real-time LVCSR," in Proc. Interspeech, 2015, pp. 2102–2106.
[10] Z. Huang, G. Zweig, and B. Dumoulin, "Cache based recurrent neural network language model inference for first pass speech recognition," in Proc. ICASSP, 2014, pp. 6354–6358.
[11] X. Liu, X. Chen, Y. Wang, M. Gales, and P. Woodland, "Two efficient lattice rescoring methods using recurrent neural network language models," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1438–1449, 2016.
[12] X. Chen, Y. Wang, X. Liu, M. Gales, and P. Woodland, "Efficient GPU-based training of recurrent neural network language models using spliced sentence bunch," in Proc. Interspeech, 2014, pp. 641–645.
[13] J. Kim, J. Chong, and I. Lane, "Efficient on-the-fly hypothesis rescoring in a hybrid GPU/CPU-based large vocabulary continuous speech recognition engine," in Proc. Interspeech, 2012, pp. 1035–1038.
[14] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014.
[15] M. Gutmann and A. Hyvärinen, "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics," Journal of Machine Learning Research, vol. 13, pp. 307–361, 2012.
[16] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černocký, "Strategies for training large scale neural network language models," in Proc. ASRU, 2011, pp. 196–201.
[17] X. Chen, X. Liu, M. Gales, and P. Woodland, "Recurrent neural network language model training with noise contrastive estimation for speech recognition," in Proc. ICASSP, 2015, pp. 5411–5415.
[18] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. ICASSP, 2015, pp. 5206–5210.