Unidirectional Memory-Self-Attention Transducer for Online Speech Recognition
Jian Luo, Jianzong Wang*, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
ABSTRACT
Self-attention models have been successfully applied in end-to-end speech recognition systems, and they greatly improve recognition accuracy. However, such attention-based models cannot be used in online speech recognition, because they usually have to take the whole acoustic sequence as input. A common remedy is to restrict the field of attention with a fixed left and right window, which makes the computation cost manageable yet also introduces performance degradation. In this paper, we propose Memory-Self-Attention (MSA), which adds history information into the Restricted-Self-Attention unit. MSA only needs local-time features as inputs, and it efficiently models long temporal contexts by attending to memory states. Meanwhile, the recurrent neural network transducer (RNN-T) has proved to be a good fit for online ASR tasks, because the alignments of RNN-T are local and monotonic. We propose a novel network structure, called the Memory-Self-Attention (MSA) Transducer. Both the encoder and the decoder of the MSA Transducer contain the proposed MSA unit. The experiments demonstrate that our proposed models improve WER results relative to Restricted-Self-Attention models on both the WSJ and SWBD datasets, without much increase in computation cost.

Index Terms — Speech Recognition, Self Attention, RNN Transducer
1. INTRODUCTION
In the past few years, models employing the transformer structure have achieved state-of-the-art results on many tasks, such as natural language understanding, machine translation, and speech recognition. Especially in speech recognition, many attention-based models have been shown to obtain substantial performance improvements [1][2][3][4]. For example, self-attention blocks have been successfully applied in CTC-based networks: SAN-CTC [5] showed that a self-attention encoder is competitive with existing end-to-end models. Speech-Transformer [6] proposed a 2D-Attention module, which computes attention weights on both the time and frequency axes, in order to extract more discriminative representations of speech features. Transformer-Transducer [7] used a VGGNet with causal convolution as the frontend of the encoder, and a self-attention transducer as the network architecture. However, such attention-based models cannot be used in online speech recognition, because they usually have to take the whole acoustic sequence as input. An additional challenge is that the computation complexity of these models increases quadratically with the input sequence length, which is unacceptable for online ASR tasks. A typical solution is restricting the field of attention with a fixed left and right window, which makes the computation cost manageable but also leads to performance degradation. To overcome the drawbacks of these restricted-attention models, we propose Memory-Self-Attention (MSA) in this paper. MSA only needs local-time features as inputs, and it efficiently models long temporal contexts by attending to memory states. These memory states help MSA achieve better performance than window-restricted-attention models. Moreover, the computation complexity of MSA is linear in the input sequence length, which is significant for online ASR tasks.

CTC [8], Transformer [9], and RNN-Transducer [10][11] are the most commonly used architectures in speech recognition [12]. CTC [13][14] was the first to be widely used in end-to-end models, but it has a fatal drawback: every timestep is output independently. Therefore, it has to be optimized jointly with an external language model in practice. The Transformer [15][16] is another choice, built on an encoder-decoder infrastructure. However, its attention mechanism allows the model to attend anywhere in the input sequence at each timestep, so the alignments of the Transformer are non-local and non-monotonic. The RNN-Transducer [17][18] was proposed as an extension to CTC, and it also marginalizes over all possible alignments between the input sequence and the output targets. An RNN-Transducer is typically composed of an encoder, which transforms the acoustic features into high-level representations, and a decoder, which produces linguistic outputs. Previous works employed GRU or LSTM as the encoders, giving the RNN-T its name. In this paper, we explore replacing the RNN-based encoders and decoders with our proposed MSA units, which yields the Memory-Self-Attention (MSA) Transducer. The MSA Transducer can learn an implicit language model and therefore removes the conditional independence assumption of CTC. More importantly, the alignments of RNN-T are local and monotonic, which makes the MSA Transducer suitable for online speech recognition tasks.

In previous works, some unidirectional neural network architectures were proposed for online ASR tasks, such as deep LCBLSTM [19] and TDNN-LSTM [20]. Daniel Povey proposed a time-restricted self-attention layer for ASR [21], and used it in LF-MMI [22] models, which are not end-to-end. Two unidirectional architectures, the time-delay LSTM (TDLSTM) and the parallel time-delayed LSTM (PTDLSTM), were presented in [23]. Other research has focused on local monotonic attention [24][25]. Google proposed transformer encoders with the RNN-T loss [26], and showed that limiting the left and right context of attention per layer can obtain reasonable accuracy, but there is still a gap to the performance of full-attention models.

*Corresponding author: Jianzong Wang, [email protected]
2. MEMORY-SELF-ATTENTION (MSA) TRANSDUCER

2.1. Model Architecture
Speech recognition can be defined as a sequence-to-sequence problem, which takes acoustic features X = [x_1, x_2, ..., x_T] as inputs and produces a predicted label sequence Y = [y_1, y_2, ..., y_U] as outputs, where T is the number of acoustic frames and U is the predicted label length. RNN-T consists of an acoustic encoder network, a separate language model named the prediction network, and a joint network. For online ASR, the encoder network takes acoustic features x_t as inputs and outputs encoder states e_t. Meanwhile, the decoder network takes the previous predicted label y_{u-1} recurrently, and outputs decoder states d_u. Finally, the joint network merges the encoder and decoder states together, and produces the label prediction y_{t,u} at each timestep t.

    e_t = EncoderNetwork(x_t)                          (1)
    d_u = DecoderNetwork(y_{u-1})                      (2)
    y_{t,u} = JointNetwork(e_t, d_u)                   (3)

Our proposed network is based on the RNN-T architecture. The encoder and decoder networks contain convolutional blocks and our proposed MSA blocks, as shown in Figure 1. The encoder feeds the acoustic features into a 2-D Conv block, in order to overcome the local variance of the features on both the time and frequency axes. The MSA blocks come after the 2-D Conv block and before the joint network. The decoder also has a convolutional block, but with grapheme inputs and a 1-D convolutional kernel on the time axis only. Finally, a joint network concatenates the encoder and decoder output hidden states. After linear layers and a Tanh activation, the probability distribution over output labels is produced with a softmax function. As shown in Figure 2a, the 2-D Conv and 1-D Conv blocks share a similar architecture, with LayerNorm and ReLU activation after the convolutional layer. Moreover, the 2-D convolution uses a stride along the time axis, in order to reduce the computation cost of the subsequent MSA blocks.

Fig. 1. Memory-Self-Attention Transducer

As shown in Figure 2b, our proposed MSA block is a variant of the restricted-attention unit. The first MSA block takes the output of the convolutional layer as input, and each subsequent MSA block is fed with the output of the MSA block below it. The MSA block processes its inputs along two paths. The first path is a standard self-attention unit, and the other is a memory recurrent unit. Similar to the restricted-attention unit, the MSA block feeds window-restricted states into a multi-head attention layer. With x_t the acoustic feature of each frame, we denote the window-restricted consecutive features as s_t:

    s_t = [x_{t-l} : x_{t+r}]                          (4)
    m_t = MultiHeadAttention(s_t)                      (5)
    h_t = LSTM(h_{t-1}, s_t)                           (6)
    f_t = LayerNorm(m_t + h_t + s_t)                   (7)
    e_t = LayerNorm(FFN(f_t) + f_t)                    (8)

Here, l and r are the left and right window sizes respectively. The multi-head attention layer can attend to every position within the restricted window, and its output is denoted m_t. The other path of the MSA block is a memory recurrent unit, whose memory state we denote h_t. We use an LSTM as the recurrent layer, which takes the previous hidden state h_{t-1} and the current input features s_t, and outputs the current memory state h_t. Then s_t, h_t, and m_t are added together and fed into a LayerNorm layer, producing f_t. Finally, a feedforward layer and another LayerNorm layer take f_t and produce the output state of the encoder network, e_t.

Fig. 2. (a) 1/2-D Conv Block (b) MSA Block
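To make equations (1)-(8) concrete, the following is a minimal PyTorch sketch of one MSA block and the joint network, written from the description above rather than from the authors' code. The layer sizes, head count, and label count are placeholder assumptions (only the window sizes l = 16 and r = 4 come from the experiments), and the restricted window is realized here as an attention mask over a full chunk for clarity rather than as a streaming buffer.

```python
import torch
import torch.nn as nn


class MSABlock(nn.Module):
    """Memory-Self-Attention block (Eq. 4-8): restricted self-attention plus an LSTM memory path."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024, left=16, right=4):
        super().__init__()
        self.left, self.right = left, right
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, s, memory=None):
        # s: (batch, T, d_model); memory: LSTM (h, c) state carried over from earlier frames.
        T = s.size(1)
        idx = torch.arange(T, device=s.device)
        dist = idx.unsqueeze(0) - idx.unsqueeze(1)        # dist[q, k] = k - q
        # Position t may only attend to frames in [t - left, t + right] (Eq. 4-5).
        mask = (dist < -self.left) | (dist > self.right)  # True = masked out
        m, _ = self.attn(s, s, s, attn_mask=mask)
        # Memory path: the LSTM summarizes all frames seen so far (Eq. 6).
        h, memory = self.lstm(s, memory)
        # Merge both paths with the input, then the feedforward layer, each followed by LayerNorm (Eq. 7-8).
        f = self.norm1(m + h + s)
        e = self.norm2(self.ffn(f) + f)
        return e, memory


class JointNetwork(nn.Module):
    """Joint network (Eq. 3): combine encoder states e_t and decoder states d_u."""

    def __init__(self, d_enc=256, d_dec=256, d_joint=512, n_labels=32):
        super().__init__()
        self.proj = nn.Linear(d_enc + d_dec, d_joint)
        self.out = nn.Linear(d_joint, n_labels)

    def forward(self, e, d):
        # e: (batch, T, d_enc), d: (batch, U, d_dec) -> label log-probs of shape (batch, T, U, n_labels).
        T, U = e.size(1), d.size(1)
        e = e.unsqueeze(2).expand(-1, -1, U, -1)
        d = d.unsqueeze(1).expand(-1, T, -1, -1)
        joint = torch.tanh(self.proj(torch.cat([e, d], dim=-1)))
        return self.out(joint).log_softmax(dim=-1)
```

For example, `e, mem = MSABlock()(torch.randn(2, 100, 256))` runs one block over a 100-frame chunk, and `mem` can be carried into the next chunk so that history beyond the attention window is not lost.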
Fig. 3. Visualization of Memory-Self-Attention

As shown in Figure 3, Memory-Self-Attention attends not only the current window-restricted states but also the history memory states, allowing the model to capture long-term dependencies. In this section, we compare MSA with other model structures with respect to operation complexity and receptive field. As is well known, BLSTM and unrestricted self-attention are the most popular structures in acoustic networks. However, such models cannot be used in online speech recognition, because they usually have to take the whole acoustic sequence as input: they cannot produce prediction labels until they see the last frame of the input. A common solution is making the network unidirectional, like an LSTM, or restricting the field of attention with a fixed left and right window, as in Restricted-Self-Attention. All of the above methods make the computation cost manageable yet also introduce performance degradation. As shown in Table 1, left-only Restricted-SA (left window = infinite) has been shown to perform better than a unidirectional LSTM. However, the obvious drawback is that its operation complexity grows to O(T^2 d), where T is the length of the input features and d is the model dimension of the hidden states. This means inference becomes slower and slower as speech frames accumulate, which is unacceptable for online ASR tasks. Restricted-SA with both left and right windows (left = l, right = r) overcomes this drawback and reduces the operation complexity to O(T(l + r)d), but it also restricts the receptive field to O(ml + mr), even when multiple layers are stacked to acquire a larger receptive field, where m is the number of attention layers. Our proposed Memory-Self-Attention not only keeps the operation complexity at O(T(l + r + d)d), but also extends the receptive field to O(T). Intuitively, the relatively heavy self-attention unit of MSA models the local-time features carefully, while the relatively light memory unit of MSA looks as far as possible into the history information.
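For reference, the total cost quoted above is just the sum of the two paths inside one MSA block. A short derivation, under our illustrative assumption that the LSTM hidden size equals the model dimension d, with C denoting per-layer operation counts:

```latex
\begin{align*}
C_{\text{attn}} &= O\big(T\,(l+r)\,d\big)
  &&\text{restricted attention: each of the $T$ queries sees at most $l+r+1$ keys}\\
C_{\text{mem}}  &= O\big(T\,d^{2}\big)
  &&\text{memory path: one $O(d^{2})$ LSTM step per frame}\\
C_{\text{MSA}}  &= C_{\text{attn}} + C_{\text{mem}} = O\big(T\,(l+r+d)\,d\big)
  &&\text{linear in $T$, vs.\ $O(T^{2}d)$ for unrestricted self-attention}
\end{align*}
```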
3. EXPERIMENTS
Our experimental work evaluates the performance of our models on two publicly available ASR corpora (WSJ and SWBD). As a comparison, we trained the proposed MSA encoder with the CTC loss, and trained the MSA Transducer with the RNN-T loss. Both Character Error Rate (CER) and Word Error Rate (WER) are evaluated, and the results are summarized in Table 2.
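Both metrics are standard edit-distance rates over words or characters. For reference, a minimal WER computation might look like the sketch below; this is our illustration, not the paper's scoring tool, and CER is obtained the same way over characters instead of words.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Levenshtein distance over words, divided by the reference length (standard WER)."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(r)][len(h)] / max(len(r), 1)


print(word_error_rate("the cat sat", "the bat sat"))  # 0.333..., one substitution out of three words
```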
As inputs to the system, the audio data is encoded with mean-variance normalized Fbank coefficients (plus energy), together with their first and second temporal derivatives. At each timestep, we concatenate the neighboring left and right frames into one feature vector. All of the speech text is capitalized, and grapheme labels (plus '-', the apostrophe, blank, and ∅) are used during training, decoding, and scoring. The whole system was set up on the PyTorch framework and trained on Nvidia V100 cards, with PyTorch DataParallel used to spread each mini-batch over multiple GPUs.

In the training stage of the MSA Transducer, training starts with warmup steps, and the learning rate is divided by a fixed factor every few epochs; the Adam optimizer is used. The standard MSA Transducer in most experiments has the following configuration. The encoder has (1) two 2-D convolutional blocks, each with two convolutional layers with 32 channels, where the second convolutional layer has a stride of 3 in the temporal dimension; and (2) twelve MSA blocks, each with a multi-head attention layer and a feedforward layer. The decoder is similar to the encoder, but with two 1-D convolutional blocks with kernel size 5 and stride 1, followed by another two MSA blocks. Finally, linear and projection layers with dropout are used to prevent overfitting. In the inference stage, we used the standard beam search algorithm, with separate beam widths for CTC and RNN-T decoding.

Table 1. Operation Complexity Analysis

Model Type     | Model Structure                    | Operations Per-layer | Seq-Operations | Max Path Length
Bidirectional  | BLSTM                              | O(T d^2)             | O(T)           | O(T)
Bidirectional  | Unrestricted-SA                    | O(T^2 d)             | O(1)           | O(T)
Unidirectional | LSTM                               | O(T d^2)             | O(T)           | O(T)
Unidirectional | Restricted-SA (left=inf, right=r)  | O(T^2 d)             | O(1)           | O(T)
Unidirectional | Restricted-SA (left=l, right=r)    | O(T (l+r) d)         | O(1)           | O(ml + mr)
Unidirectional | Memory-SA                          | O(T (l+r+d) d)       | O(T)           | O(T)

Table 2. Performance of Memory-Self-Attention Model

Train Loss | Encoder                      | Decoder                  | WSJ CER | WSJ WER | SWBD CER | SWBD WER
CTC        | LSTMx4                       | —                        | 6.07    | 16.76   | 17.99    | 35.43
CTC        | Restricted-SAx12 (l=16, r=4) | —                        | 5.21    | 14.22   | 16.42    | 33.70
CTC        | Memory-SAx12 (l=16, r=4)     | —                        |         |         |          |
RNN-T      | Memory-SAx12 (l=16, r=4)     | LSTMx2                   |         |         |          |
RNN-T      | Memory-SAx12 (l=16, r=4)     | Memory-SAx2 (l=10, r=0)  |         |         |          |

Like previous work in [27], we pre-trained the encoder network with the CTC loss, and pre-trained the decoder network with a cross-entropy loss. The joint network is fine-tuned with the RNN-T loss eventually. The pre-trained encoder is connected to a feed-forward layer and a softmax layer to output grapheme label probabilities, and the performance of this pre-trained CTC model is shown in Table 2 for comparison. The experimental results indicate that pre-training accelerates the convergence of the MSA Transducer, and it is an essential procedure for achieving better performance.

The performance of all the models is summarized in Table 2. Because our work focuses on online ASR tasks, we compare against forward-only networks. We take a 4-layer unidirectional LSTM and a 12-layer Restricted-Self-Attention network as our baseline models, where l = 16 and r = 4 are the left and right context windows that every position in self-attention can attend. In addition, both the CTC model and the RNN-T model are evaluated with a language model via shallow fusion for better performance. We used the standard Kaldi s5 recipe to train the language models. Specifically, a 3-gram model trained on the extended text data is incorporated for the CER and WER tests on WSJ, and a 4-gram model trained on the Switchboard and Fisher transcripts is used for the SWBD test.

The experiments on the encoder part of the model demonstrate that both Restricted-SA and our proposed Memory-SA achieve better results than the baseline forward-only LSTM network, and the MSA model with the CTC loss improves WER results over the Restricted-SA network on both the WSJ and SWBD test data. However, the models with the RNN-T loss do not beat the ones with the CTC loss. We suspect that the decoder part of the RNN-T models may not have enough data to learn a good implicit language model on these small datasets. In our experiments, we also found that an external language model trained on the same transcripts greatly improves the results of the CTC models but has little impact on the RNN-T models.
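The shallow fusion mentioned above simply interpolates the end-to-end model score with the external n-gram language model score for each beam-search hypothesis. A minimal sketch of the scoring rule follows; the function name, the LM weight, and the toy probabilities are our own illustrative assumptions rather than the paper's settings.

```python
import math


def fused_score(am_log_prob, lm_log_prob, lm_weight=0.3, length_penalty=0.0, length=1):
    """Shallow-fusion score for one beam hypothesis.

    am_log_prob: cumulative log-probability from the CTC or RNN-T model.
    lm_log_prob: cumulative log-probability of the same label sequence under the
                 external n-gram LM (e.g., the 3-gram/4-gram models above).
    lm_weight and length_penalty are hypothetical tuning values.
    """
    return am_log_prob + lm_weight * lm_log_prob + length_penalty * length


# Toy usage: pick the better of two partial hypotheses during beam search.
hyps = [
    {"text": "THE CAT", "am": math.log(0.020), "lm": math.log(0.010)},
    {"text": "THE KAT", "am": math.log(0.022), "lm": math.log(0.0001)},
]
best = max(hyps, key=lambda h: fused_score(h["am"], h["lm"], lm_weight=0.3, length=2))
print(best["text"])  # the LM steers the search toward the more plausible spelling
```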
4. CONCLUSION
In this paper, we present the Memory-Self-Attention (MSA) Transducer, which adds history information into the Restricted-Self-Attention unit. We trained the MSA models with the RNN-T loss, making them suitable for online ASR tasks, because the alignments of RNN-T are local and monotonic. Our experiments show that MSA obtains better results than the basic LSTM and window-restricted-attention networks. Moreover, the MSA Transducer achieves these results without much increase in computation cost, because it only needs local-time features as inputs, and it efficiently models long temporal contexts by attending to memory states. We are interested in investigating the performance of the MSA unit in various monotonic models, such as a truncated attention-based speech transformer. Exploring better architectures that add history information into self-attention models is another direction. We leave these ideas as future work.

5. ACKNOWLEDGEMENT
This paper is supported by the National Key Research and Development Program of China under grants No. 2017YFB1401202, No. 2018YFB1003500, and No. 2018YFB0204400. The corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd.

6. REFERENCES

[1] H. Miao, G. Cheng, P. Zhang, T. Li, and Y. Yan, "Online hybrid ctc/attention architecture for end-to-end speech recognition," in INTERSPEECH, 2019.
[2] Q. Pham, T.-S. Nguyen, J. Niehues, M. Müller, and A. Waibel, "Very deep self-attention networks for end-to-end speech recognition," in INTERSPEECH, 2019.
[3] M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel, "Self-attentional acoustic models," in INTERSPEECH, 2018.
[4] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in NIPS, 2015.
[5] J. Salazar, K. Kirchhoff, and Z. Huang, "Self-attention networks for connectionist temporal classification in speech recognition," in ICASSP, 2019.
[6] L. Dong, S. Xu, and B. Xu, "Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition," in ICASSP, 2018.
[7] C.-F. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, D. Le, M. Jain, K. Schubert, C. Fuegen, and M. L. Seltzer, "Transformer-transducer: End-to-end speech recognition with self-attention," arXiv:1910.12977, 2019.
[8] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017.
[10] Z. Tian, J. Yi, J. Tao, Y. Bai, and Z. Wen, "Self-attention transducers for end-to-end speech recognition," arXiv:1909.13037, 2019.
[11] S. Wang, P. Zhou, W. Chen, J. Jia, and L. Xie, "Exploring rnn-transducer for chinese speech recognition," in APSIPA ASC, 2019.
[12] E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu, "Exploring neural transducers for end-to-end speech recognition," in ASRU, 2017.
[13] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in ICASSP, 2015.
[14] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep speech 2: End-to-end speech recognition in english and mandarin," in ICML, 2016.
[15] A. Zeyer, P. Bahar, K. Irie, R. Schlüter, and H. Ney, "A comparison of transformer and lstm encoder decoder models for asr," in ASRU, 2019.
[16] J. Li, H. Hu, and Y. Gong, "Improving rnn transducer modeling for end-to-end speech recognition," in ASRU, 2019.
[17] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in ICASSP, 2013.
[18] E. Tsunoo, Y. Kashiwagi, T. Kumakura, and S. Watanabe, "Towards online end-to-end transformer automatic speech recognition," arXiv:1910.11871, 2019.
[19] S. Xue and Z. Yan, "Improving latency-controlled blstm acoustic models for online speech recognition," in ICASSP, 2017.
[20] V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur, "Low latency acoustic modeling using temporal convolution and lstms," IEEE Signal Processing Letters, vol. 25, no. 3, pp. 373–377, 2018.
[21] D. Povey, H. Hadian, P. Ghahremani, K. Li, and S. Khudanpur, "A time-restricted self-attention layer for asr," in ICASSP, 2018.
[22] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for asr based on lattice-free mmi," in INTERSPEECH, 2016.
[23] N. Moritz, T. Hori, and J. Le Roux, "Unidirectional neural network architectures for end-to-end automatic speech recognition," in INTERSPEECH, 2019.
[24] A. Merboldt, A. Zeyer, R. Schlüter, and H. Ney, "An analysis of local monotonic attention variants," in INTERSPEECH, 2019.
[25] L. Dong, F. Wang, and B. Xu, "Self-attention aligner: A latency-control end-to-end model for asr using self-attention network and chunk-hopping," in ICASSP, 2019.
[26] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, "Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss," in ICASSP, 2020.
[27] K. Rao, H. Sak, and R. Prabhavalkar, "Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer," in ASRU, 2017.