Applying GPGPU to Recurrent Neural Network Language Model based Fast Network Search in the Real-Time LVCSR
Kyungmin Lee, Chiyoun Park, Ilhwan Kim, Namhoon Kim, Jaewon Lee
DMC R&D Center, Samsung Electronics, Suwon, Korea
{karaf.lee, chiyoun.park, ilhwans.kim, namhoon.kim, jwonlee}@samsung.com

Abstract
Recurrent Neural Network Language Models (RNNLMs) have started to be used in various fields of speech recognition due to their outstanding performance. However, the high computational complexity of RNNLMs has been a hurdle in applying the RNNLM to real-time Large Vocabulary Continuous Speech Recognition (LVCSR). In order to accelerate the speed of RNNLM-based network searches during decoding, we apply General Purpose Graphic Processing Units (GPGPUs). This paper proposes a novel method of applying GPGPUs to RNNLM-based graph traversals. We have achieved our goal by reducing redundant computations on CPUs and the amount of transfer between GPGPUs and CPUs. The proposed approach was evaluated on both the WSJ corpus and in-house data. Experiments show that the proposed approach achieves real-time speed in various circumstances while keeping the Word Error Rate (WER) relatively 10% lower than that of n-gram models.
Index Terms: Recurrent Neural Network, Language Model, General Purpose Graphics Processing Units, Large Vocabulary Continuous Speech Recognition
1. Introduction
Recently, the Recurrent Neural Network Language Model (RNNLM) has gained popularity in the field of Automatic Speech Recognition (ASR). Various academic studies have reported the effectiveness of RNNLMs, which can generalize to unseen contexts by sharing statistics between words with syntactically and semantically similar contexts [1, 2, 3, 4, 5]. However, the heavy computational load of RNNLMs over traditional n-gram based approaches has been a hurdle in applying the RNNLM to diverse areas of ASR applications. Especially when ASR systems are required to run under a real-time constraint (i.e., less than 1xRT), a real-time decoder is hardly attainable with direct application of RNNLMs in place of traditional n-grams. In order to overcome such computational issues, most RNNLM systems adopt a two-pass decoding strategy, which generates lattices or a set of n-best results based on an n-gram in the first pass, and then rescores the hypotheses with RNNLMs.

Prior studies have investigated the possibility of implementing a real-time decoder with an RNNLM [5]. That study reduced the computational complexity of the RNNLM by caching the conditional probabilities of the words and the results of the RNN computation and reusing the cached data. However, even though the computational load was minimized by introducing the cache strategy and reducing redundant computations, the result was still far from real-time performance with a large-vocabulary RNNLM.

Recent studies have applied General Purpose Graphic Processing Units (GPGPUs) in various fields of ASR [6, 7, 8, 9]. One of the studies applied GPGPUs to training RNNLMs, and showed that the outstanding parallelization capability of GPGPUs is well suited to minimizing the computational load of the probability normalization process [9].

In this paper, we investigate the possibility of implementing a GPGPU-based real-time Large Vocabulary Continuous Speech Recognition (LVCSR) system that utilizes an RNNLM. Even though GPGPUs have powerful parallelization capabilities, obstacles such as their insufficient memory size and the slow data transfer speed between GPGPUs and CPUs discourage the use of GPGPUs in RNNLM-based real-time decoders. Moreover, it is also important to balance the computation time between GPGPU and CPU, as the acceleration on the GPGPU may not have a prominent impact on the overall speed if the GPGPU needs to wait for the CPU computation to finish.

In order to achieve real-time decoding of RNNLM-based LVCSR, we apply on-the-fly rescoring of the RNNLM to the GPGPU-based network traversal technique proposed in [8]. We accelerate the data exchange between the two heterogeneous processors and reduce redundant computations on CPUs by applying cache strategies. The resulting recognition system runs almost twice as fast as real time under various conditions, while maintaining a relatively 10% lower Word Error Rate (WER) than that of conventional n-gram models.

This paper is organized as follows. In Section 2, the structure of RNNLMs is explained. Section 3 explains how we applied GPGPUs to RNNLM-based network search. Section 4 explains the RNNLM rescoring with caches. Section 5 evaluates the improvement of the proposed method, followed by the conclusion in Section 6.
2. Recurrent Neural Network
In order to speed up RNNLM computations, we apply an efficient RNNLM architecture described in this section, which consists of a hierarchically decomposed output layer [10] and a Maximum Entropy (MaxEnt) strategy [11].
In an RNNLM-based ASR system, directly computing the conditional probability of an input word for a given word sequence has high computational complexity, since the system needs to normalize probabilities over all the words in the vocabulary. In order to alleviate the computational burden, the hierarchical softmax method was applied in our RNNLM implementation [12]. The output layer is organized as a binary tree. We used the Huffman tree method to build up the binary tree because it assigns shorter codes to more frequently used words, which leads to faster training and decoding [14]. In the output layer, the softmax normalization for computing the likelihood is performed only over the nodes on the path of the binary tree from the root to the input word node, so that the computational cost is reduced from O(n) to O(log n), where n represents the vocabulary size.

In order to further reduce the computational cost, we interpolated hash-based MaxEnt models with the RNNLM itself. We have used n-gram based MaxEnt models, which have accuracy similar to traditional n-gram maximum likelihood models and are easy to integrate with a neural network configuration [13]. Because the RNN-based and MaxEnt-based Language Models (LMs) are complementary to each other [11], interpolating both LM scores enables us to reduce the number of nodes in the RNNLM without loss of accuracy. Because the size of the hidden layer is a dominant factor in the computational complexity of RNNLMs, the amount of computation can be reduced.
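To make the cost reduction above concrete, the following C++ sketch evaluates P(w | h) with a hierarchical softmax over a Huffman-coded output tree by multiplying the binary decisions along the word's path. It is a minimal illustration under our own assumptions; the names (HuffmanNode, HierarchicalSoftmaxProb) and the exact binary-decision formulation are not taken from the authors' implementation.

#include <cmath>
#include <cstddef>
#include <vector>

// One inner node of the Huffman-coded output tree; each node owns a
// weight vector against the hidden layer (hypothetical layout).
struct HuffmanNode {
    std::vector<float> weight;   // one weight per hidden unit
};

static float Sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// P(w | h): multiply the binary decisions along the O(log n) path from
// the root to the leaf of word w, instead of normalizing over all n
// words of the vocabulary.
float HierarchicalSoftmaxProb(const std::vector<float>& hidden,
                              const std::vector<const HuffmanNode*>& path,
                              const std::vector<int>& code /* 0/1 per node */) {
    float prob = 1.0f;
    for (std::size_t i = 0; i < path.size(); ++i) {
        float activation = 0.0f;
        for (std::size_t j = 0; j < hidden.size(); ++j)
            activation += path[i]->weight[j] * hidden[j];
        float p_one = Sigmoid(activation);                  // P(branch == 1)
        prob *= (code[i] == 1) ? p_one : (1.0f - p_one);    // follow the word's code
    }
    return prob;
}

For a vocabulary of about one million words, such a path involves on the order of 20 binary decisions, which is where the O(log n) saving over a full softmax comes from.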
3. GPGPU Acceleration in RNNLM-based Graph Search
GPGPUs have been successfully applied in various fields of ASR by virtue of their powerful parallelism. However, there are some obstacles to utilizing GPGPUs for accelerating RNNLM-based network search. First, GPGPUs have insufficient memory to load the whole content of large-vocabulary RNNLMs. Moreover, data transfers between GPGPUs and CPUs are very time-consuming. This section explains how we apply GPGPUs to RNNLM-based graph traversals.
We employ the GPGPU-CPU hybrid architecture described in [8]. The RNNLM-based rescoring is deployed to both GPGPUs and CPUs as follows. The Weighted Finite State Transducer (WFST), which is a composition of the H, C, L, and G networks, is searched on the GPGPU side. The G network is composed with a short-span n-gram so that it can reside in the available GPGPU memory. While the frame-synchronous Viterbi search is performed to generate lattices on the GPGPU side, the LM portion of the lattice score is simultaneously rescored with the RNNLM in an on-the-fly manner on the CPU side.
Whenever a new word hypothesis is output from the WFST graph traversal on the GPGPU, the word hypothesis and its prior RNNLM context are sent to the CPU side so that the RNNLM computation can be done on the CPU. After the RNNLM computation, the resulting rescored score and the updated RNNLM context are sent back to the GPGPU.

The exact size of each RNNLM context may depend on the structure of the RNNLM, but it is generally larger than a few hundred bytes. Moreover, the number of new word hypotheses per frame can be as high as a few thousand, so millions of RNNLM lookups may be requested per utterance. Considering the size of the RNNLM context and the number of data exchanges, the data transfer between GPGPU and CPU can slow down the RNNLM-based on-the-fly rescoring WFST traversals. We reduced the size of the data transfer by storing the RNNLM contexts on the CPU side and transferring only the indices of the stored contexts to the GPGPU.

Figure 1: The process of RNNLM-based graph traversals

Figure 1 depicts the proposed RNNLM-based graph traversal process. Each number and arrow represents the number of bytes and the flow of data, respectively. We have created an IndexTable for storing and retrieving RNNLM contexts and put the table in the CPU memory. The IndexTable is in charge of converting a large RNNLM context into an 8-byte index, and vice versa. Because both encoding and decoding of the contexts need to be performed in a short time, we have made the IndexTable bidirectional to handle both purposes. In addition to the RNNLM index, the index of the small LM, which corresponds to an n-gram of the LM used to build the WFST graph, also needs to be transferred in order to compute and replace the small LM score with the RNNLM score. Instead of exchanging the two index sequences separately, we concatenate the two indices into one numerical value in order to further reduce the data size per transaction. With this indexing and packing method, we could reduce the exchanged data size to a small fraction of what would be needed to transmit the whole context information including the recurrent layer.
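A minimal C++ sketch of this indexing and packing scheme is shown below, under our own assumptions: the bidirectional table is realized with a hash map plus a vector, and the packed value splits its 8 bytes into two 32-bit halves. The actual layout and field widths in the decoder may differ.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// The RNNLM context: previous hidden-layer values plus the recent word
// history needed for the MaxEnt features (a few hundred bytes in total).
struct RnnlmContext {
    std::vector<float> hidden;
    std::vector<int32_t> history;
    bool operator==(const RnnlmContext& o) const {
        return hidden == o.hidden && history == o.history;
    }
};

// Hash so that identical contexts map to the same stored index.
struct RnnlmContextHash {
    std::size_t operator()(const RnnlmContext& c) const {
        std::size_t h = 0;
        for (float v : c.hidden) h = h * 131u + std::hash<float>()(v);
        for (int32_t w : c.history) h = h * 131u + std::hash<int32_t>()(w);
        return h;
    }
};

// Bidirectional table: Encode() returns a small index for a context
// (reusing the index if the context was seen before), Decode() recovers
// the full context from its index. Only indices travel to the GPGPU.
class IndexTable {
 public:
    uint32_t Encode(const RnnlmContext& c) {
        auto it = to_index_.find(c);
        if (it != to_index_.end()) return it->second;
        uint32_t idx = static_cast<uint32_t>(contexts_.size());
        contexts_.push_back(c);
        to_index_.emplace(c, idx);
        return idx;
    }
    const RnnlmContext& Decode(uint32_t idx) const { return contexts_[idx]; }
 private:
    std::vector<RnnlmContext> contexts_;
    std::unordered_map<RnnlmContext, uint32_t, RnnlmContextHash> to_index_;
};

// Pack the RNNLM context index and the small-LM (WFST n-gram) index into
// one 8-byte value so that a single number is exchanged per hypothesis.
inline uint64_t PackIndices(uint32_t rnnlm_index, uint32_t small_lm_index) {
    return (static_cast<uint64_t>(rnnlm_index) << 32) | small_lm_index;
}
inline void UnpackIndices(uint64_t packed, uint32_t* rnnlm_index, uint32_t* small_lm_index) {
    *rnnlm_index = static_cast<uint32_t>(packed >> 32);
    *small_lm_index = static_cast<uint32_t>(packed & 0xffffffffu);
}

Storing the contexts once on the CPU and shipping only the packed 8-byte handle is what keeps the per-hypothesis transfer small.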
4. RNNLM Rescoring with Cache
In GPGPU-CPU hybrid architectures, the balance between the GPGPU and CPU speed is important: the relatively slower speed of the CPUs can degrade the overall speed of RNNLM-based graph traversals, since the GPGPUs have to wait until the computations on the CPUs are finished in order to work synchronously. Therefore, a fast RNNLM computation strategy on the CPU side is crucial for accelerating the overall on-the-fly rescoring time. This section explains an efficient method to speed up RNNLM-based graph search on the hybrid architecture.
We have optimized the computation on the CPU side by reducing the number of redundant computations during the RNNLM calculation. We use a cache strategy which stores the once-computed results and reuses them for the same input contexts. Each element in the cache consists of a key-value pair: the key consists of a prior RNNLM context and a new word hypothesis, and the value consists of the resulting RNNLM score for the given word and the updated RNNLM context. The RNNLM context generally consists of the values in the previous hidden layer, but we also use the previous word sequence for computing the MaxEnt portion of the RNNLM. Since we compress the hundreds of bytes of context data into an 8-byte index by using the IndexTable as explained in Section 3, we store the compressed indices instead of the whole context data.

RnnlmProb(w, c)
 1  I ← (w, c)
 2  if Cache[I] exists then
 3      O ← Cache[I]
 4  else
 5      C ← IndexTable[c]
 6      (p, C′) ← ComputeRnnlm(w, C)
 7      if IndexTable⁻¹[C′] exists then
 8          c′ ← IndexTable⁻¹[C′]
 9      else
10          c′ ← lengthOf(IndexTable) + 1
11          IndexTable[c′] ← C′
12      end if
13      O ← (p, c′)
14      Cache[I] ← O
15  end if
16  return O

Table 1: The process of RNNLM computation with caches

Table 1 shows the procedure for the probability computation of a word hypothesis in the proposed RNNLM-based network search. I represents the key structure of the Cache element; it consists of the current RNNLM context index c and the following word index w. O represents the value structure of the Cache element; it consists of the LM probability p and the updated RNNLM context index c′. IndexTable compresses the RNNLM context data C into an index variable c, and Cache stores the already computed pairs of inputs and outputs (I, O).

The context index c is associated with each path of the graph traversal on the GPGPU, and every time the WFST network outputs a new word index w, the parameters w and c are sent back to the CPU side and fed into the RNNLM likelihood computation process. At line 1 of Table 1, the two input parameters are stored in the input structure I. In lines 2–3, if I is already cached, then the retrieved value of Cache[I] is saved to O and is returned without any further computation. Otherwise, in lines 4–6, C is retrieved from the IndexTable with the index c, and the function ComputeRnnlm computes the conditional probability of w based on the prior context C, outputting the RNNLM probability value p and the updated context C′. On a side note, the hidden layer value is initialized to a zero vector at the beginning of each utterance. The length of the word sequence in the RNNLM context C is restricted to the order of the MaxEnt model, and the latest word sequence is maintained by removing the oldest word from the sequence whenever the number of previous words in the context exceeds the predefined size.

In lines 7–12, the updated context C′ is stored in the IndexTable and its index value c′ is retrieved. When the context has already been stored before, the corresponding index is retrieved without adding it to the table again. In lines 13–15, p and c′ are saved to O, and I and O are cached as a key-value pair for later use. Finally, at line 16, O is returned. The difference between the returned value and the score of the short-span n-gram is sent back to the GPGPU side and used to rescore the partially decoded WFST lattice.
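For illustration, the cache itself can be as small as a hash map from a (context index, word) key to a (probability, next context index) value. The C++ sketch below mirrors Table 1 with the RNNLM forward pass passed in as a callback; all names here (CacheKey, RnnlmProb, compute_rnnlm) are our own illustrative choices rather than the decoder's real API.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Key: prior RNNLM context index plus the new word hypothesis.
// Value: RNNLM probability plus the updated context index.
// Each cached element therefore stores only four small fields.
struct CacheKey   { uint32_t context_index; uint32_t word; };
struct CacheValue { float    prob;          uint32_t next_context_index; };

struct CacheKeyHash {
    std::size_t operator()(const CacheKey& k) const {
        uint64_t packed = (static_cast<uint64_t>(k.context_index) << 32) | k.word;
        return std::hash<uint64_t>()(packed);
    }
};
struct CacheKeyEq {
    bool operator()(const CacheKey& a, const CacheKey& b) const {
        return a.context_index == b.context_index && a.word == b.word;
    }
};

using RnnlmCache = std::unordered_map<CacheKey, CacheValue, CacheKeyHash, CacheKeyEq>;

// Memoized lookup following Table 1: reuse the cached result for a
// (context, word) pair if present; otherwise run the RNNLM computation
// (which also registers the updated context in the IndexTable) and cache it.
CacheValue RnnlmProb(RnnlmCache& cache, uint32_t context_index, uint32_t word,
                     const std::function<CacheValue(uint32_t, uint32_t)>& compute_rnnlm) {
    CacheKey key{context_index, word};
    auto hit = cache.find(key);
    if (hit != cache.end()) return hit->second;              // lines 2-3
    CacheValue value = compute_rnnlm(context_index, word);   // lines 5-13
    cache.emplace(key, value);                                // line 14
    return value;
}

Keeping only four small fields per entry is consistent with the observation in Section 5 that the Cache elements are much cheaper to store than the full contexts held in the IndexTable.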
5. Result
The experiments were performed both with the Wall Street Journal (WSJ) database and with a much larger set of data collected within the company (in-house).

The acoustic model for evaluating the in-house data was trained on 2,000 hours of fully transcribed Korean speech data. All the speech data were sampled at 16 kHz and coded with 40-dimensional mel-frequency filterbank features, plus an additional dimension for the log-energy. The frames were computed every 10 ms and windowed by a 25 ms Hamming window. Five frames to the left and right of the given frame were concatenated to the features to make a 451-dimensional acoustic feature vector in total. All the acoustic models were trained with a Deep Neural Network (DNN) which consisted of 5 hidden layers with 2,000 nodes each. The rectified linear unit (ReLU) was used as the activation function. The total number of output states was approximately 6,000.

The LM trained for the evaluation of the in-house data was based on a 4 GB text corpus, which amounts to approximately 74 million sentences with 475 million words, and the vocabulary size was about 1 million. The RNNLMs consist of one hidden layer with one hundred nodes, and the order of the MaxEnt LM features was 3. As for the n-gram models used for comparison with the RNNLM, 3-gram back-off models with Kneser-Ney smoothing were used. The WFST was compiled with a bigram LM, and all the epsilon transitions were removed from the graph so that the computation on the GPGPU side could be optimized.

The evaluation tasks were performed on Intel Xeon X5690 3.47 GHz processors with a total of 12 physical CPU cores and one Nvidia Tesla M2075 GPU equipped with 6 GB of memory.
Considering that the number of rescoring requests per utterance is as large as millions, naively applying RNNLM-based on-the-fly rescoring to the ASR system was expected to lead to high computational complexity. As a matter of fact, our experiments showed that the naive application of the RNNLM to the on-the-fly rescoring decoder resulted in a speed slower than 10xRT. However, the Real-Time Factor (RTF) dropped to much lower than 1xRT when the cache strategy explained in Section 4 was adopted. The hit ratio of the cache was at least 88.66% over all the test cases, showing that most of the RNNLM computations were highly redundant.
Figure 2 depicts the performance comparison between the RNNLM and the 3-gram model on the in-house 1-million-vocabulary evaluation set. Various beam widths were applied to each LM in order to investigate how the word error rate changes with respect to the decoding speed. The figure shows that while the lowest WER that the 3-gram model can achieve is around 8.49%, the RNNLM can reach as low as 6.83% WER at the speed of 0.33 xRT.

Figure 2: Comparison of decoding speed and WER for RNNLM and 3-gram (in-house)

Type            in-house        eval92          dev93
Pass / Model    WER    xRT      WER    xRT      WER    xRT
1 / 3-gram      8.49   0.19     5.74   0.16     11.73  0.20
1 / RNNLM       6.83   0.64     4.15   0.43     10.80  0.40
2 / hybrid      7.57   0.33     5.74   0.25     11.31  0.27
2 / RNNLM       8.34   0.33     5.81   0.25     11.46  0.27

Table 2: Performance comparison of different types of LM rescorings

Table 2 shows the performance comparison of four different rescoring methods, using the same decoding options that showed the best performance across all types of decoding. The one-pass types were computed by the proposed on-the-fly rescoring method, and the two-pass types were evaluated by rescoring the 1000-best hypotheses extracted from the lattices generated with 3-gram models. The two-pass hybrid type used a score interpolated between the 3-gram and the RNNLM. Although the speed of the 1-pass RNNLM type was generally slower than that of the other types, it was well within real-time speed. This improvement shows that applying more accurate LMs at the WFST traversal stage in the first pass leads to a better overall recognition result than relying on rescoring of the less accurate hypotheses generated from n-gram models.
Another measure to consider is the memory footprint of the IndexTable. Whereas the Cache elements consist of only four integer values, each element in the IndexTable is as large as a few hundred bytes and may grow rapidly as the decoding goes on. In our experimental settings, each IndexTable item consists of four types of information and is as big as 432 bytes, as shown in Table 3. The size of the whole table increases during the decoding process, and it is proportional to the number of unique elements.

Table 3: The composition and the size of an element in IndexTable (in bytes)

During the in-house evaluation task, the average memory usage for the IndexTable per utterance was 191.78 MB. More than 80% of the utterances over all evaluation data sets were shorter than 10 seconds, and the average utterance length was about 7.69 seconds. Considering that the size of the IndexTable is generally proportional to the length of the utterance, we can reasonably assume that the memory footprint for the IndexTable is contained within an acceptable size.
Table 4: Cache memory usage and decoding speed depending on the capacity of the cache (in-house)
We noticed that many speakers tend to repeat similar commands in different utterances, and so we hypothesized that maintaining the cache and the IndexTable over multiple utterances may increase the hit ratio of the cache and further improve the decoding speed. Therefore, instead of resetting the caches for each utterance, we set a boundary on the number of cache entries to be maintained and kept the information over multiple utterances. We expected that a larger cache size would lead to a faster decoding speed, due to an increased cache hit ratio.

Table 4 shows the decoding speed for different cache capacities. The memory usage of the caches and the number of cache entries are averaged over all the utterances in the test set. However, as can be seen from the table, the decoding speed did not get much faster than 1.12x even though the capacity of the cache was increased. We concluded that this result reflects the fact that even if users are prone to speak similar commands repeatedly, most of the word hypotheses are different for different utterances, and so the cache hit ratio does not get much higher. Although the multi-utterance cache strategy only showed a marginal improvement, it would still be meaningful to adopt a smart cache strategy that limits the size of the cache by removing the cached items that are least frequently used, so that the cache table does not overflow.
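As a rough illustration of that suggestion, the C++ sketch below bounds the cache to a fixed number of entries and evicts the least frequently used one when the bound is reached. The class and its linear-scan eviction are our own simplification for clarity, not the strategy used in the reported experiments.

#include <cstddef>
#include <functional>
#include <unordered_map>

// Capacity-bounded cache with least-frequently-used eviction.
// Eviction scans the table linearly, which is fine for a sketch; a real
// decoder would keep a frequency-ordered structure instead.
template <class Key, class Value, class Hash = std::hash<Key>>
class LfuCache {
 public:
    explicit LfuCache(std::size_t capacity) : capacity_(capacity) {}

    // Return the cached value if present (counting the use), else nullptr.
    const Value* Find(const Key& key) {
        auto it = table_.find(key);
        if (it == table_.end()) return nullptr;
        ++it->second.uses;
        return &it->second.value;
    }

    // Insert a new entry, evicting the least frequently used one when full.
    void Insert(const Key& key, const Value& value) {
        if (table_.size() >= capacity_) Evict();
        table_.insert_or_assign(key, Entry{value, 1});
    }

 private:
    struct Entry { Value value; std::size_t uses; };

    void Evict() {
        if (table_.empty()) return;
        auto victim = table_.begin();
        for (auto it = table_.begin(); it != table_.end(); ++it)
            if (it->second.uses < victim->second.uses) victim = it;
        table_.erase(victim);
    }

    std::size_t capacity_;
    std::unordered_map<Key, Entry, Hash> table_;
};

Because each cache entry holds only four small fields, bounding the number of entries in this way also bounds the cache memory directly.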
6. Conclusions
This paper explained how we applied RNNLMs to a real-time large vocabulary decoder by introducing the use of GPGPUs. We accelerated RNNLM-based WFST traversals in a GPGPU-CPU hybrid architecture by solving some practical issues in applying GPGPUs. Moreover, in order to minimize the computational burden on the CPUs, we applied a cache strategy. The decoding speed with RNNLMs was still slower than that with n-gram models, but the proposed method achieved real-time speed while maintaining a relatively 10% lower WER, as shown in Table 2, and so this approach can be applied to an on-line streaming speech recognition engine. The memory footprint of the cache method was small enough to perform the experiments on the large data set. However, it would be desirable to employ more efficient cache techniques to reduce the memory usage further.

7. References

[1] T. Mikolov et al., "Recurrent neural network based language model," in Proc. Interspeech, pp. 1045–1048, 2010.
[2] G. Lecorvé et al., "Conversion of Recurrent Neural Network Language Models to Weighted Finite State Transducers for Automatic Speech Recognition," in Proc. Interspeech, 2012.
[3] E. Arisoy et al., "Converting Neural Network Language Models into Back-off Language Models for Efficient Decoding in Automatic Speech Recognition," TASLP, vol. 22, no. 1, 2014.
[4] T. Hori et al., "Real-time one-pass decoding with recurrent neural network language model for speech recognition," in Proc. ICASSP, 2014.
[5] Z. Huang et al., "Cache based recurrent neural network language model inference for first pass speech recognition," in Proc. ICASSP, 2014.
[6] H. Kou et al., "Parallelized Feature Extraction and Acoustic Model Training," in Proc. Int'l Conf. on DSP, 2014.
[7] I. Kim et al., "Development of Highly Accurate Real-Time Large Scale Speech Recognition System," in Proc. ICCE, 2015.
[8] J. Kim et al., "Efficient On-The-Fly Hypothesis Rescoring in a Hybrid GPGPU/CPU-based Large Vocabulary Continuous Speech Recognition Engine," in Proc. ICASSP, 2014.
[9] X. Chen et al., "Efficient GPGPU-based Training of Recurrent Neural Network Language Models Using Spliced Sentence Bunch," in Proc. Interspeech, 2014.
[10] F. Morin et al., "Hierarchical probabilistic neural network language model," in Proc. Int'l Workshop on AIStats, pp. 246–252, 2005.
[11] R. Rosenfeld, "Strategies for training large scale neural network language models," in Proc. ASRU, 2011.
[12] A. Mnih et al., "A Scalable Hierarchical Distributed Language Model," in Proc. NIPS, pp. 1081–1088, 2008.
[13] T. Alumae et al., "Efficient estimation of maximum entropy language models with N-gram features: an SRILM extension," in Proc. Interspeech, 2010.
[14] T. Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality," in Proc. NIPS, 2013.